How To: Fetching Web Pages with HTTP
by Joe Mayo, 3/10/02, 9/19/04
Introduction
HTTP is the primary transport mechanism for communicating with resources over
the World Wide Web. Developers often need to fetch web pages programmatically,
for example to cache pages for a search engine, to extract information from a
particular page, or to implement browser-like capabilities. The .NET Framework
includes classes that make this task easy.
Getting an HTTP Page
The HTTP classes in the .NET Framework are HttpWebRequest and HttpWebResponse.
The steps are to specify the web page to get with an HttpWebRequest object,
perform the actual request, and receive the page through an HttpWebResponse
object. Thereafter, you use stream operations to extract the page content.
Listing 1 demonstrates how this process works.
Listing 1: Getting a Web Page: WebFetch.cs
using System;
using System.IO;
using System.Net;
using System.Text;

/// <summary>
/// Fetches a Web Page
/// </summary>
class WebFetch
{
    static void Main(string[] args)
    {
        // used to build entire input
        StringBuilder sb = new StringBuilder();

        // used on each read operation
        byte[] buf = new byte[8192];

        // prepare the web page we will be asking for
        HttpWebRequest request = (HttpWebRequest)
            WebRequest.Create("http://www.mayosoftware.com");

        // execute the request
        HttpWebResponse response = (HttpWebResponse)
            request.GetResponse();

        // we will read data via the response stream
        Stream resStream = response.GetResponseStream();

        string tempString = null;
        int count = 0;

        do
        {
            // fill the buffer with data
            count = resStream.Read(buf, 0, buf.Length);

            // make sure we read some data
            if (count != 0)
            {
                // translate from bytes to ASCII text
                tempString = Encoding.ASCII.GetString(buf, 0, count);

                // continue building the string
                sb.Append(tempString);
            }
        }
        while (count > 0); // any more data to read?

        // print out page source
        Console.WriteLine(sb.ToString());
    }
}
The program in Listing 1 requests the main page of a web site and
displays the HTML on the console. Because the page data is returned
as bytes, we set up a byte array, named buf, to hold the results of
each read. You'll see how it is used in a couple of paragraphs.
The first step in getting a web page is to instantiate an HttpWebRequest object.
This occurs when invoking the static Create() method of the WebRequest
class. The parameter to the Create() method is a string representing the URL
of the web page you want. A similar overload of the Create() method accepts a
single Uri instance. The Create() method is declared to return a WebRequest,
so we need to cast the result to HttpWebRequest before assigning it to the
request variable. Here's the line creating the request object:
// prepare the web page we will be asking for
HttpWebRequest request = (HttpWebRequest)
WebRequest.Create("http://www.mayosoftware.com");
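As a small sketch of the two Create() overloads described above, the following
builds the same request from a string and from a Uri. The RequestFactory class
and its method names are hypothetical, chosen only for illustration. Creating
a request does not contact the server, so its properties can be inspected
safely. On recent .NET versions WebRequest.Create() is marked obsolete in
favor of HttpClient, although it still compiles and runs.

```csharp
using System;
using System.Net;

// Hypothetical helper class, for illustration only.
static class RequestFactory
{
    // Builds a request from a URL string.
    public static HttpWebRequest FromString(string url)
    {
        return (HttpWebRequest)WebRequest.Create(url);
    }

    // Builds a request from a Uri instance.
    public static HttpWebRequest FromUri(Uri uri)
    {
        return (HttpWebRequest)WebRequest.Create(uri);
    }
}
```

Both methods assume an http or https address. For another scheme, such as
file, Create() returns a different WebRequest subclass and the cast throws an
InvalidCastException.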
Once you have the request object, use it to get a response object by calling
its GetResponse() method. GetResponse() does not accept parameters and
returns a WebResponse object, which must be cast to HttpWebResponse before we
can assign it to the response variable. The following line shows how to
obtain the HttpWebResponse object:
// execute the request
HttpWebResponse response = (HttpWebResponse)
request.GetResponse();
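GetResponse() is the call that actually contacts the server, so it is also
where network failures surface. The sketch below wraps the call in a try/catch
for WebException and disposes the response with a using block so the
connection is released. ResponseHelper and TryGetStatus are hypothetical
names; treat this as one reasonable way to structure the call rather than a
required pattern.

```csharp
using System;
using System.Net;

// Hypothetical helper class, for illustration only.
static class ResponseHelper
{
    // Returns the HTTP status code, or null if no response was received.
    public static HttpStatusCode? TryGetStatus(string url)
    {
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
        try
        {
            using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
            {
                return response.StatusCode;
            }
        }
        catch (WebException ex)
        {
            // For error statuses such as 404, the server's response
            // is attached to the exception.
            HttpWebResponse errorResponse = ex.Response as HttpWebResponse;
            if (errorResponse != null)
            {
                using (errorResponse)
                {
                    return errorResponse.StatusCode;
                }
            }

            // No response at all, e.g. the connection was refused.
            return null;
        }
    }
}
```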
The response object is used to obtain a Stream object; the Stream class is a
member of the System.IO namespace. The GetResponseStream() method
of the response instance is invoked to obtain this stream as follows:
// we will read data via the response stream
Stream resStream = response.GetResponseStream();
Remember the byte array we instantiated at the beginning of the program?
Now we'll pass it to the Read() method of the stream we just got to
retrieve the web page data. The Read() method accepts three arguments: the
first is the byte array to populate, the second is the position in the array
at which to begin populating it, and the third is the maximum number of bytes
to read. The method returns the actual number of bytes that were read, which
can be less than the maximum. Here's how the web page data is read:
// fill the buffer with data
count = resStream.Read(buf, 0, buf.Length);
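To see the three Read() arguments and the return value in isolation, here is
a small sketch that reads from an in-memory stream instead of a network
stream. The ReadDemo class and its ReadCounts method are hypothetical names,
and MemoryStream stands in for the response stream so the example runs
without a network connection.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical helper class, for illustration only.
static class ReadDemo
{
    // Reads the stream with a buffer of the given size and
    // records the count returned by each Read() call.
    public static List<int> ReadCounts(Stream stream, int bufferSize)
    {
        byte[] buf = new byte[bufferSize];
        List<int> counts = new List<int>();
        int count;
        do
        {
            // arguments: target array, start offset in the array,
            // maximum number of bytes to read
            count = stream.Read(buf, 0, buf.Length);
            counts.Add(count);
        }
        while (count > 0);
        return counts;
    }
}
```

For a 10-byte MemoryStream and a 4-byte buffer, the recorded counts are 4, 4,
2, and finally 0, which signals the end of the stream. A network stream may
return smaller counts than the buffer size, depending on how the data arrives.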
We now have an array of bytes with web page data in it. However, it is
a good idea to transform these bytes into a string. That way we can use
all the built-in string manipulation methods available with .NET. I chose
to use the static ASCII property of the Encoding class in the System.Text
namespace for this task. The encoding object it returns has a GetString()
method which accepts three arguments, similar to the Read() method we just
discussed. The first parameter is the byte array to read bytes from,
which we pass buf to. The second is the position in buf at which to begin
reading. The third is the number of bytes in buf to decode. I passed count,
which was the number of bytes returned from the Read() method, as the third
parameter, which ensures that only the bytes actually read are decoded.
Here's the code that translates the bytes in buf to a string and appends the
result to a StringBuilder object:
// translate from bytes to ASCII text
tempString = Encoding.ASCII.GetString(buf, 0, count);
// continue building the string
sb.Append(tempString);
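One consequence of choosing ASCII is worth illustrating: ASCII covers only
7-bit characters, so any byte above 127 decodes to a question mark. The
sketch below decodes the same bytes with Encoding.ASCII and Encoding.UTF8.
DecodeDemo and its members are hypothetical names. Which encoding is right
for a given page depends on what the server actually sent, which this example
does not try to detect.

```csharp
using System;
using System.Text;

// Hypothetical helper class, for illustration only.
static class DecodeDemo
{
    // The UTF-8 bytes for "café": 'c', 'a', 'f', then two bytes for 'é'.
    public static readonly byte[] Sample = { 0x63, 0x61, 0x66, 0xC3, 0xA9 };

    // Decodes the first 'count' bytes of 'buf' using the given encoding.
    public static string Decode(Encoding encoding, byte[] buf, int count)
    {
        return encoding.GetString(buf, 0, count);
    }
}
```

Calling Decode(Encoding.ASCII, DecodeDemo.Sample, 5) yields "caf??", while
Decode(Encoding.UTF8, DecodeDemo.Sample, 5) yields "café". Passing 3 as the
count decodes only the first three bytes and yields "caf" with either
encoding.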
The buffer size is set at 8192, but that is only large enough to hold a small
web page, and a single Read() call is not guaranteed to fill the buffer
anyway. To get around this, the code that reads the response stream must be
wrapped in a loop that keeps reading until there aren't any more bytes to
return. Listing 1 uses a do loop because we have to make at least one read.
Recall that every Read() call returns a count of the bytes that were actually
read. The while condition of the do loop checks count to see whether
something was actually read; a count of zero means the end of the stream has
been reached. Also, notice the if statement that makes sure we don't try to
translate bytes when nothing was read. Because we used a loop, we needed to
collect the results of each iteration, which is why we append the result of
each iteration to a StringBuilder.
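The read loop can also be packaged as a reusable method. The following is a
minimal sketch, not the article's listing: StreamText.ReadAll is a
hypothetical helper that uses a while loop, collects all the bytes first, and
decodes them once at the end. Decoding once is a deliberate variation on
Listing 1, because with a multi-byte encoding such as UTF-8 a character can
be split across two reads, which per-chunk decoding would garble. With ASCII,
the per-chunk approach in Listing 1 works fine.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical helper class, for illustration only.
static class StreamText
{
    // Reads the stream to its end and decodes the bytes as text.
    public static string ReadAll(Stream stream, Encoding encoding, int bufferSize)
    {
        byte[] buf = new byte[bufferSize];
        using (MemoryStream collected = new MemoryStream())
        {
            int count;
            // Read() returns 0 once the end of the stream is reached.
            while ((count = stream.Read(buf, 0, buf.Length)) > 0)
            {
                // keep only the bytes that were actually read
                collected.Write(buf, 0, count);
            }

            // Decode once so multi-byte characters are never split.
            return encoding.GetString(collected.ToArray());
        }
    }
}
```

With the response stream from Listing 1, a call might look like
StreamText.ReadAll(resStream, Encoding.UTF8, 8192), assuming the page is
encoded as UTF-8.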
Summary
The HttpWebRequest and HttpWebResponse classes from the .NET Base
Class Library make it easy to request web pages over the Internet. The
HttpWebRequest object identifies the web page to get and contains a
GetResponse() method for obtaining an HttpWebResponse object. With an
HttpWebResponse object, we retrieve a stream to read bytes from. Iterating
until all the bytes of a web page are read, translating the bytes to strings,
and accumulating the strings makes it possible to obtain the entire web page.
Your feedback is very important, and I appreciate any constructive
contributions you have. Please feel free to contact me with any questions or
comments you may have about this article.
Copyright © 2000-2004 C# Station, All Rights Reserved