Code for this article:
http://www.west-wind.com/presentations/dotnetWebRequest/dotnetWebRequest.zip
Sidebars:
http://www.west-wind.com/presentations/dotnetWebRequest/dotnetWebRequest_Sidebars.htm
HTTP content retrieval is an important component for applications these days. Although .NET reduces the need to explicitly retrieve content from the Web through built-in mechanisms in the Web Services framework, ADO.NET and the XML classes, there are still many scenarios that call for retrieving Web content directly and manipulating it as text or data downloaded into files. In this article Rick describes the functionality of the HttpWebRequest and HttpWebResponse classes and provides an easy-to-use wrapper class. The class simplifies HTTP access and provides most of the common features in a single easy-to-use interface while still providing full access to the base functionality of the HttpWebRequest class. In the process Rick describes some sticky issues like string encoding and Cookie handling, and some related topics like implementing events and running multiple threads to service Web requests.
Last week I decided I needed a good, useful project to throw at .NET to continue my learning curve while actually building something I have a use for. A few years back I'd written a Web Monitoring package that allows me to monitor a set of Web sites and send out alerts when the sites are down and not responding. The app has been showing its age and since it was developed using C++ it has a clunky user interface that's not very maintainable. I thought this would be a good 'training' application to re-build for .NET. This application exercises HTTP functionality built into the .NET Framework, requires setting up and running multiple threads, hooking up events and managing a small set of data without a database backend and finally providing a Windows Form UI. In the future converting this application to work as a Windows service would also be a nice feature. This application is one that lets me explore a wide variety of features of a programming environment. In this month's article I'll describe a few of the features that I needed to build focusing specifically on the HTTP retrieval mechanism.
The .NET Framework provides new tools for retrieving HTTP content that are powerful and scalable in a single package. If you've ever worked in pre-.NET applications and tried to retrieve HTTP content you probably know that there are a number of different tools available: WinInet (Win32 API), XMLHTTP (part of MSXML) and recently the new WinHTTP COM library. These tools all worked in some situations, but none of them really fit the bill in all instances. For example, WinInet can't scale on the server because it provides no multi-threading support. XMLHTTP was too simple and didn't support all aspects of the HTTP model. WinHTTP, which is Microsoft's latest COM tool, solves many of these problems, but it doesn't work at all on Win9x, which makes it a bad choice for a client tool integrated into broad distribution apps, at least for the moment until XP takes a strong hold.
The .NET Framework greatly simplifies HTTP access with a pair of classes, HttpWebRequest and HttpWebResponse. These classes provide just about all of the functionality available through the HTTP protocol in a straightforward manner. The basics of returning content from the Web require very little code (see Listing 1).
Listing 1: Simple retrieval of Web data over HTTP.
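A minimal version of this code looks something like the following sketch (the URL is a placeholder and variable names are mine):

```csharp
using System;
using System.IO;
using System.Net;
using System.Text;

// Create and configure the Web request
HttpWebRequest loHttp =
    (HttpWebRequest) WebRequest.Create("http://www.west-wind.com/");
loHttp.Timeout = 10000;                        // 10 second timeout
loHttp.UserAgent = "West Wind HTTP .NET Client";

// Send the request and retrieve the response headers
HttpWebResponse loWebResponse = (HttpWebResponse) loHttp.GetResponse();

// Read the response stream into a string using the Windows default code page
Encoding loEncoding = Encoding.GetEncoding(1252);
StreamReader loResponseStream =
    new StreamReader(loWebResponse.GetResponseStream(), loEncoding);
string lcHtml = loResponseStream.ReadToEnd();

loResponseStream.Close();
loWebResponse.Close();
```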
Pretty simple, right? But beneath this simplicity lies a lot of power too. Let's start with looking at how this works.
Start by creating the HttpWebRequest object which is the base object used to initiate a Web request. A call to the static WebRequest.Create() method is used to parse the URL and pass the resolved URL into the request object. This call will throw an exception if the URL passed has invalid URL syntax.
The request portion controls how the outbound HTTP request is structured. As such it handles configuration of the HTTP headers, the most common of which are expressed as properties of the HttpWebRequest object. A few examples are UserAgent, ContentType, Expires and even a Cookies collection, which map directly to header values that get set when the request is sent. Headers can also be set explicitly using the Headers string collection, to which you can add either a whole header string or a key/value pair. Generally the properties address all common headers, so you'll rarely need to resort to setting headers explicitly, most likely only to support special protocols (for example, SOAPAction for SOAP requests).
In the example, I do nothing much with the request other than setting a couple of the optional properties – the UserAgent (the client 'browser' which is blank otherwise) and the Timeout for the request. If you need to POST data to the server you'll need to do a little more work – I'll talk about this a little later.
Once the HTTP Request is configured for sending the data, a call to GetResponse() actually goes out and sends the HTTP request to the Web Server. At this point the request sends the headers and retrieves the first HTTP result buffer from the Web Server.
When the code above performs the GetResponse() call only a small chunk of data is returned from the Web server. The first chunk contains the HTTP header and the very first part of the data, which is simply buffered internally until read from the stream itself. The data from this initial request is used to set the properties of the HttpWebResponse object, so you can look at things like ContentType, ContentLength, StatusCode, Cookies and much more.
Next a stream is returned using the GetResponseStream() method. The stream points at the actual binary HTTP response from the Web server. Streams give you a lot of flexibility in handling how data is retrieved from the Web server (see Streams and StreamReader sidebar). As mentioned, the call to GetResponse() only returned an initial internal buffer; to retrieve the actual data and read the rest of the result document from the Web server you have to read the stream.
In the example above I use a StreamReader object to return a string from the data in a single operation. But realize that because a stream is returned I could access the stream directly and read smaller chunks to, say, provide status information on the progress of the HTTP download.
Notice also that when the StreamReader is created I had to explicitly provide an encoding type – in this case CodePage 1252, which is the Windows default codepage. This is important because the data is transferred as a byte stream, and without the encoding any extended characters would translate incorrectly. CodePage 1252 works fairly well for English or European language content, as well as binary content. Ideally, though, you should decide at runtime which encoding to use: a binary file should probably be written out to a file or other location rather than converted to a string, while a page from Japan should use the appropriate Unicode encoding for that language. (For more information see the Encoding sidebar).
I use the StreamReader object which provides an easy mechanism to retrieve the contents of a stream into strings or arrays of characters. It also provides the handy ReadToEnd() method which retrieves the entire stream in a single batch. The operation of reading the stream is what actually retrieves the data from the Web server (except for the initial block that was read to retrieve the headers). In this case a single read operation is called and retrieves the data with the request blocking until the data has been returned. If you wanted to provide feedback you can also read the data in chunks using the StreamReader's Read() method which lets you specify the size of the data to read. You'd run this in a loop and provide whatever status info you need on each read. With this mechanism you can retrieve the data and provide progress information.
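Continuing from the earlier snippet, a chunked read loop along those lines might look like this (the buffer size and the progress handling are up to you):

```csharp
StreamReader loResponseStream =
    new StreamReader(loWebResponse.GetResponseStream(),
                     Encoding.GetEncoding(1252));

StringBuilder loResult = new StringBuilder();
char[] lcBuffer = new char[512];

// each Read() pulls the next chunk of the response off the wire
int lnRead = loResponseStream.Read(lcBuffer, 0, lcBuffer.Length);
while (lnRead > 0)
{
    loResult.Append(lcBuffer, 0, lnRead);

    // update your progress display here ...

    lnRead = loResponseStream.Read(lcBuffer, 0, lcBuffer.Length);
}
string lcHtml = loResult.ToString();
```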
StreamReader also exposes the underlying raw stream using the BaseStream property, so StreamReader is a good object to use to pass streamed data around.
The example above only retrieves data which is essentially an HTTP GET request. If you want to send data to the server you can use an HTTP POST operation. POSTing data refers to the process of taking data and sending it to the Web server as part of the request payload. A POST operation both sends data to the server and retrieves a response.
Posting uses a stream to send the data to the server, so the process of posting data is pretty much the reverse of retrieving the data (see listing 2).
Listing 2: POSTing data to the Web Server
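The essential steps look something like this (the form variable names and values are placeholders):

```csharp
string lcPostData = "Name=Rick&Company=West+Wind";

// switch the request to a POST and describe the content
loHttp.Method = "POST";
loHttp.ContentType = "application/x-www-form-urlencoded";

// encode the POST buffer into a byte array
byte[] lbPostBuffer = Encoding.GetEncoding(1252).GetBytes(lcPostData);
loHttp.ContentLength = lbPostBuffer.Length;

// write the buffer to the request stream - this sends the data
Stream loPostData = loHttp.GetRequestStream();
loPostData.Write(lbPostBuffer, 0, lbPostBuffer.Length);
loPostData.Close();
```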
Make sure you run this POST code immediately before the HttpWebRequest.GetResponse() call. Once the POST buffer has been written the headers have been sent, so any further manipulation of the Request object has no effect. The rest of the code is identical to what was shown before: you retrieve the Response and then read the stream to grab the result data.
POST data needs to be properly encoded when sent to the server. If you're posting information to a Web page you'll have to make sure to properly encode your POST buffer into key/value pairs, using URL encoding for the values. You can utilize the static method System.Web.HttpUtility.UrlEncode() to encode the data; in this case make sure to include the System.Web namespace in your project. Note this is necessary only if you're posting to a typical HTML page. If you're posting XML or other application content you can just post the raw data as is. This is all much easier to do using a custom class like the one included with this article: it has an AddPostKey() method that, depending on the POST mode, takes any parameters and properly encodes them into an internally managed stream which is then POSTed to the server.
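For example (the values here are made up):

```csharp
using System.Web;   // requires a reference to System.Web.dll

string lcPostData =
    "Name=" + HttpUtility.UrlEncode("Rick Strahl") +
    "&Company=" + HttpUtility.UrlEncode("West Wind Technologies");
```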
To send the actual data in the POST buffer the data has to be converted to a byte array first, and again the string needs to be properly encoded: Encoding.GetEncoding(1252).GetBytes() returns a byte array using the Windows standard ANSI code page. You should then set the ContentLength property so the server knows the size of the data stream coming in. Finally you can write the POST data to the server using an output stream returned from HttpWebRequest.GetRequestStream(). Simply write the entire byte array out to the stream in one Write() method call with the appropriate size of the byte array. This writes the data and waits for completion. As with the retrieval operation, the stream operations are what actually cause data to be sent to the server, so if you want to provide progress information you can send smaller chunks and provide feedback to the user as needed.
The basic operation of using HttpWebRequest/HttpWebResponse is straightforward. But if you build typical applications that use HTTP access quite a bit you'll find that you'll have to set a number of additional properties, object properties in particular. One nice feature of the .NET Framework is the consistency with which common objects are reused in many areas of the framework.
If you use Authentication in your ASP.NET applications the objects used on the server have the same interface as they do on the client side. In this section I'll address the topics of Authentication, Proxy configuration and using Cookies which are a few things that you probably will have to do as part of installing HttpWebRequest as a client solution. All three of these objects used by HttpWebRequest/Response are standard objects that you'll find in other places of the framework.
Logging into Web content is very common for distributed applications as a security measure. Web authentication typically consists of either Basic Authentication (which is application driven and usually prompts for an operating system account) or NTLM (integrated file security).
To authenticate a user you use the Credentials property:
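For example (the account information shown is obviously a placeholder):

```csharp
// username, password and optional NTLM domain
loHttp.Credentials = new NetworkCredential("ricks", "password", "WESTWIND");
```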
If you're using Basic Authentication only the username and password are meaningful while with NTLM you can also pass a domain name. If you're authenticating against an NTLM resource (permissions set on the server's file system) from a Windows client application, you can also use the currently logged on user's credentials like this:
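That's a one-liner:

```csharp
// pass the logged on user's security context to the server
loHttp.Credentials = CredentialCache.DefaultCredentials;
```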
HttpWebRequest handles the HTTP authentication negotiation for you, so an authenticated request operates like any other as long as the credentials validate. If the request fails due to authentication an exception is thrown.
If you want to build a solid Web front end into a client application you will have to deal with clients that sit behind a firewall/proxy, and your application will have to handle these settings. Luckily HttpWebRequest makes this fairly painless with a WebProxy class member that takes the information. To configure a proxy you can use code like the following:
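Here's a sketch (proxy address, bypass list and credentials are placeholders):

```csharp
// proxy address and bypass-on-local flag
WebProxy loProxy = new WebProxy("http://proxy.mycompany.com:8080", true);
loProxy.BypassList = new string[] { "http://localserver" };
loProxy.Credentials = new NetworkCredential("proxyuser", "password");

loHttp.Proxy = loProxy;
```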
How much detail is provided to the Proxy object depends on the particular Proxy server. For example a bypass list is not required, and most proxies don't require a username and password in which case you don't need to provide the credentials.
WebProxy can also cram all the parameters into the constructor like this:
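Using the same placeholder values as above:

```csharp
loHttp.Proxy = new WebProxy(
    "http://proxy.mycompany.com:8080",                 // proxy address
    true,                                              // bypass on local addresses
    new string[] { "http://localserver" },             // bypass list
    new NetworkCredential("proxyuser", "password"));   // proxy credentials
```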
HTTP Cookies are a state management implementation of the HTTP protocol and many Web pages require them. If you're using remote HTTP functionality to drive a Web site (following URLs and the like) you will in many cases have to be able to support cookies.
Cookies work by storing tokens on the client side, so the client side is really responsible for managing any cookie created. Normally a browser manages all of this for you, but here there's no browser to help out in an application front end and we're responsible for tracking this state ourselves. This means when the server assigns a cookie for one request, the client must hang on to it and send it back to the server on the next request where it applies (based on the Web site and virtual directory). HttpWebRequest and HttpWebResponse provide the container to hold cookies both for the sending and receiving ends but it doesn't automatically persist them so that becomes your responsibility.
Because the Cookie collections are nicely abstracted in these objects it's fairly easy to save and restore them. The key to make this work is to have a persistent object reference to the cookie collection and then reuse the same cookie store each time.
To do this let's assume you are running the request on a form (or some other class – this in the example below). You'd create a property called Cookies:
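Something as simple as this will do:

```csharp
// persists cookies between individual HTTP requests
protected CookieCollection Cookies = null;
```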
On the Request end of the connection before the request is sent to the server you can then check whether there's a previously saved set of cookies and if so use them:
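A sketch of that check:

```csharp
loHttp.CookieContainer = new CookieContainer();

// reuse cookies saved from a previous request, if any
if (this.Cookies != null && this.Cookies.Count > 0)
    loHttp.CookieContainer.Add(this.Cookies);
```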
So, if you previously had retrieved cookies, they were stored in the Cookies property and then added back into the Request's CookieContainer property. CookieContainer is a collection of cookie collections – it's meant to be able to store cookies for multiple sites. Here I only deal with tracking a single set of cookies for a single set of requests.
On the receiving end once the request headers have been retrieved after the call to GetWebResponse(), you then use code like the following:
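A minimal version:

```csharp
// save the cookies the server sent so the next request can return them
if (loWebResponse.Cookies.Count > 0)
    this.Cookies = loWebResponse.Cookies;
```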
This saves the cookies collection until the next request when it is then reassigned to the Request which sends it to the server. Note, that this is a very simplistic cookie management approach that will work only if a single or a single set of cookies is set on a given Web site. If multiple cookies are set in multiple different places of the site you will actually have to retrieve the individual cookies and individually store them into the Cookie collection. Here's some code that demonstrates:
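A sketch of that per-cookie update logic:

```csharp
if (loWebResponse.Cookies.Count > 0)
{
    if (this.Cookies == null)
        this.Cookies = loWebResponse.Cookies;
    else
    {
        // update existing cookies and add any new ones
        foreach (Cookie loRespCookie in loWebResponse.Cookies)
        {
            bool lbMatch = false;
            foreach (Cookie loReqCookie in this.Cookies)
            {
                if (loReqCookie.Name == loRespCookie.Name)
                {
                    loReqCookie.Value = loRespCookie.Value;
                    lbMatch = true;
                    break;
                }
            }
            if (!lbMatch)
                this.Cookies.Add(loRespCookie);
        }
    }
}
```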
This should give you a good starting point. This code still doesn't deal with things like domains and virtual paths and also doesn't deal with saved cookies, but for most applications the above should be sufficient.
By now you're probably getting an idea of the power that is provided by the HttpWebRequest object. While using these objects is a straightforward process, it does require a fair amount of code and knowledge of a number of classes.
Since I use HTTP access in just about every application I create, I decided to create a wrapper class called wwHttp that simplifies the whole process quite a bit (the class is included in the code download for this article). Rather than using two separate Request and Response objects, the single class handles both ends of the conversation with simple string properties. The class handles setting up POST variables for you, creates any authentication and proxy settings from strings rather than objects, manages cookies, and provides an optional simplified error handler that sets properties instead of throwing exceptions. It does this while still allowing access to the base objects: you can pass in a WebRequest object and retrieve a reference to both the Request and Response objects, so you get the best of both worlds, simplicity without giving up any of the features of the framework classes. The class also provides several overloaded methods for returning strings, streams and running output to file. The class can be set up to fire events at buffer retrieval points when data is downloaded to provide feedback in a GUI application.
Start by adding the namespace:
Using the class you can retrieve content from a site as simply as this:
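Something along these lines. AddPostKey() and GetUrl() come straight from the text; the other property names shown here are assumptions, so check the downloaded source for the exact names:

```csharp
wwHttp loHttp = new wwHttp();

loHttp.Username = "ricks";        // assumed property names - see the download
loHttp.Password = "password";
loHttp.POSTMode = 1;              // URL-encoded form variables

loHttp.AddPostKey("Name", "Rick");
loHttp.AddPostKey("Company", "West Wind");

string lcHtml = loHttp.GetUrl("http://www.west-wind.com/testpage.wwd");
```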
Most of those property settings are optional but just about everything in the class is accessible with simple strings. AddPostKey() automates the creation of UrlEncoded strings. Several different POST modes are supported including UrlEncoded (0), Multi-Part (2) and raw XML (4) POSTs.
The corresponding GetUrl() method has several different signatures. You can pass in a preconfigured WebRequest object, so if you need to override properties that are not exposed by the class you can set them there. For example:
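A sketch of that usage (the exact overload signature is an assumption; check the download):

```csharp
// preconfigure the underlying WebRequest with a property wwHttp doesn't expose
HttpWebRequest loReq =
    (HttpWebRequest) WebRequest.Create("http://www.west-wind.com/testpage.wwd");
loReq.KeepAlive = false;

wwHttp loHttp = new wwHttp();
string lcHtml = loHttp.GetUrl(loReq);
```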
wwHttp also exposes Error and ErrorMsg properties that can be used to quickly check for error conditions:
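For example:

```csharp
string lcHtml = loHttp.GetUrl("http://www.west-wind.com/testpage.wwd");
if (loHttp.Error)
    MessageBox.Show(loHttp.ErrorMsg);   // no exception was thrown
```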
Explicit error retrieval is the default, but you can use the ThrowExceptions property to have the class pass exceptions on to your code.
As I mentioned earlier, POST data is important in Web request applications, and getting the data into the proper format for posting can be tricky, possibly requiring a fair amount of code. wwHttp abstracts the process with several overloads of the AddPostKey() method, which handles the different POST modes: 1 – URL-encoded form variables, 2 – multi-part form variables and files, 4 – XML or raw POST buffers. The base method looks as shown in Listing 3.
Listing 3: wwHttp::AddPostKey handles POST data
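A simplified sketch of what this method does, not the actual code from the download. The stream and writer fields come from the description below; the boundary string is an assumption:

```csharp
MemoryStream oPostStream;     // accumulates POST data across AddPostKey() calls
BinaryWriter oPostData;       // sequential writer over the POST stream
int nPostMode = 1;            // 1=URL-encoded, 2=multi-part, 4=raw

public void AddPostKey(string Key, byte[] Value)
{
    if (this.oPostStream == null)
    {
        this.oPostStream = new MemoryStream();
        this.oPostData = new BinaryWriter(this.oPostStream);
    }

    Encoding loEnc = Encoding.GetEncoding(1252);

    if (this.nPostMode == 2)
    {
        // multi-part form variable (the boundary string is an assumption)
        this.oPostData.Write(loEnc.GetBytes(
            "--HTTP_BOUNDARY\r\n" +
            "Content-Disposition: form-data; name=\"" + Key + "\"\r\n\r\n"));
        this.oPostData.Write(Value);
        this.oPostData.Write(loEnc.GetBytes("\r\n"));
    }
    else
    {
        // URL-encoded (mode 1): key=value& pairs, values pre-encoded by the caller
        this.oPostData.Write(loEnc.GetBytes(Key + "="));
        this.oPostData.Write(Value);
        this.oPostData.Write(loEnc.GetBytes("&"));
    }
}
```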
This method relies on a stream oPostStream to hold the cumulative POST data a user might send to the server. A BinaryWriter object is used to actually write to the stream sequentially without having to count bytes as you have to do with the raw stream.
Next the actual POST data is written into the stream using the Write() method of the BinaryWriter. Note that this version of AddPostKey() accepts a byte[] input parameter rather than a string so the data can be written in its raw format.
The BinaryWriter Write() method has overloads that allow for string parameters; however, this didn't work correctly for me as the encoding was mangled in the output. Instead the code above explicitly performs the translation from string to byte array for any strings (including static strings like the ones for the multipart form vars), using the proper encoding as discussed previously. Once again, this was tricky to figure out: you can set an encoding on the BinaryWriter, but it didn't appear to have any effect. The code shown above was the only solution that ran correctly.
There are several overloads to this method. Most important is a string version:
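A sketch of that overload:

```csharp
public void AddPostKey(string Key, string Value)
{
    this.AddPostKey(Key, Encoding.GetEncoding(1252).GetBytes(Value));
}
```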
which does little more than converting the string into a byte array with the proper encoding. Another version accepts a single POST buffer, which is typically used for XML or binary content.
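Sketched, it looks like this:

```csharp
public void AddPostKey(string FullPostBuffer)
{
    this.oPostData.Write(
        Encoding.GetEncoding(1252).GetBytes(FullPostBuffer));
}
```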
This one writes directly to the binary writer.
Finally there's an AddPostFile() method which allows you to POST a file to the server when running with multi-part forms (PostMode=2) to provide HTML file upload capabilities.
Listing 3.1: HTTP file upload method for multi-part forms
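A simplified sketch of such a method, not the actual code from the download (the boundary string is the same assumption as before):

```csharp
public void AddPostFile(string Key, string FileName)
{
    // read the file to send into a byte array
    FileStream loFile = new FileStream(FileName, FileMode.Open, FileAccess.Read);
    byte[] lcFile = new byte[loFile.Length];
    loFile.Read(lcFile, 0, (int) loFile.Length);
    loFile.Close();

    Encoding loEnc = Encoding.GetEncoding(1252);
    this.oPostData.Write(loEnc.GetBytes(
        "--HTTP_BOUNDARY\r\n" +
        "Content-Disposition: form-data; name=\"" + Key +
        "\"; filename=\"" + Path.GetFileName(FileName) + "\"\r\n\r\n"));
    this.oPostData.Write(lcFile);
    this.oPostData.Write(loEnc.GetBytes("\r\n"));
}
```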
The AddPost style methods handle collecting POST data before the request is sent. The actual sending occurs in the main GetUrlStream() method of the wwHttp class.
Listing 3.2: Sending the POST data to the server
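A simplified sketch of that step, based on the description that follows (content types for modes other than URL-encoding are assumptions):

```csharp
if (this.oPostStream != null && this.oPostStream.Length > 0)
{
    loHttp.Method = "POST";

    if (this.nPostMode == 2)
    {
        loHttp.ContentType = "multipart/form-data; boundary=HTTP_BOUNDARY";
        // terminate the multi-part content with the epilogue string
        this.oPostData.Write(Encoding.GetEncoding(1252)
                             .GetBytes("--HTTP_BOUNDARY--\r\n"));
    }
    else if (this.nPostMode == 4)
        loHttp.ContentType = "text/xml";
    else
        loHttp.ContentType = "application/x-www-form-urlencoded";

    loHttp.ContentLength = this.oPostStream.Length;

    // copy the accumulated buffer to the request stream - this sends the data
    Stream loPostData = loHttp.GetRequestStream();
    this.oPostStream.WriteTo(loPostData);
    loPostData.Close();
}
```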
This code finalizes the request for POST data by checking whether we've already written something into the POST buffer and if so configuring the POST request by specifying the content type. In the case of multi-part form POST an epilogue string is added to the end of the content.
Writing out the data entails taking the data from the memory stream that holds our accumulated POST data and writing it out to the request stream (loPostData) which actually sends the POST data to the server.
As you can see a lot of things are happening in order to properly POST data to the server and wwHttp takes care of the details for you with no additional code.
Other versions of the GetUrl() method return a StreamReader object, and yet another version, GetUrlEvents(), fires an OnReceiveData event whenever data arrives in the buffer. The event provides a current byte and total byte count (if available) as well as two flags, Done and Cancel. Done lets you know when the request is finished, while the Cancel flag lets your code stop downloading data.
To run with the event enabled you just hook up an event handler to the event:
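Assuming the delegate type is named after the event (the exact type names come from the class source in the download), the hookup looks like this:

```csharp
wwHttp loHttp = new wwHttp();

// hook up the handler method to the event
loHttp.OnReceiveData +=
    new wwHttp.OnReceiveDataDelegate(this.loHttp_OnReceiveData);
```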
Make sure to disconnect the handler at the end of your request, or set it up in a static location that runs only one time. The event handler method can then do some work with the data in the OnReceiveDataArgs object passed to it (Listing 4):
Listing 4: Implementing the wwHttp::OnReceiveData event
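A sketch of such a handler (the argument member names beyond Done and Cancel are assumptions, and lblStatus is a hypothetical form label):

```csharp
void loHttp_OnReceiveData(object sender, wwHttp.OnReceiveDataEventArgs e)
{
    if (e.Done)
        this.lblStatus.Text = "Download complete.";
    else
        this.lblStatus.Text = e.CurrentByteCount.ToString() + " bytes read";

    // setting e.Cancel = true here would abort the download
}
```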
Using the event is easy. Creating the event on the wwHttp class is a bit more involved and requires three steps:
First the actual event needs to be defined on the class:
Next the event's arguments need to be wrapped up into a class that contains the arguments as properties:
Finally, a public delegate is created which acts as the method signature for the event to be called:
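Put together, the three pieces look something like this sketch (the argument member names beyond Done and Cancel are assumptions):

```csharp
// 1. The event as seen by clients of wwHttp
public event OnReceiveDataDelegate OnReceiveData;

// 2. The event argument class
public class OnReceiveDataEventArgs
{
    public long CurrentByteCount = 0;
    public long TotalBytes = 0;       // 0 if the server sent no ContentLength
    public bool Done = false;         // set by wwHttp when the request finishes
    public bool Cancel = false;       // set by the client to abort the download
}

// 3. The delegate that defines the event handler signature
public delegate void OnReceiveDataDelegate(object sender,
                                           OnReceiveDataEventArgs e);
```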
You only need to define this delegate if you want to pass custom parameters. If no parameters are required you can just define your event using the standard System.EventHandler delegate. These three pieces make up the event interface.
To actually fire the event you simply call the function pointer that the user assigned to the event. Here's the relevant code from wwHttp that demonstrates how the Response loop is read and how the event is fired on each update cycle.
Listing 5: Reading the Response Stream and firing events
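A simplified sketch of that loop, reusing the event pieces described above (buffer size and member names are assumptions, not the actual code from the download):

```csharp
StreamReader loResponseStream =
    new StreamReader(loWebResponse.GetResponseStream(),
                     Encoding.GetEncoding(1252));

StringBuilder loWriter = new StringBuilder();
char[] lcBuffer = new char[512];
long lnCount = 0;

OnReceiveDataEventArgs loArgs = new OnReceiveDataEventArgs();
loArgs.TotalBytes = loWebResponse.ContentLength;

int lnRead = loResponseStream.Read(lcBuffer, 0, lcBuffer.Length);
while (lnRead > 0)
{
    loWriter.Append(lcBuffer, 0, lnRead);
    lnCount += lnRead;

    // fire the event if anybody is hooked up to it
    if (this.OnReceiveData != null)
    {
        loArgs.CurrentByteCount = lnCount;
        this.OnReceiveData(this, loArgs);
        if (loArgs.Cancel)          // client requested an abort
            break;
    }
    lnRead = loResponseStream.Read(lcBuffer, 0, lcBuffer.Length);
}

// fire one last event flagging completion
if (this.OnReceiveData != null)
{
    loArgs.Done = true;
    this.OnReceiveData(this, loArgs);
}
```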
The key to this code is the delegate OnReceiveData (see the Delegates sidebar for more on delegates). It acts as a function pointer which points at the assigned method, the method on the form in the example above. From within the stream reading loop this method is called every time a new buffer is retrieved.
Events are cool, but they're not all that useful if you're running in blocking mode as I've shown so far. HttpWebRequest/HttpWebResponse can also be run in asynchronous modes using the BeginGetResponse/EndGetResponse methods. Most of the stream classes provide this mechanism which allows you to specify a callback method to collect output retrieved from these requests (you can also send data asynchronously this way).
However, after playing with this for a while and then looking at the native thread support in the .NET Framework, it turned out to be easier to create a new thread of my own and encapsulate the thread operation in a class. The following example runs multiple wwHttp objects on a couple of threads simultaneously while also updating the form with information from the OnReceiveData event. Figure 1 shows what the form looks like. While the HTTP requests are retrieved the main form thread is still available to perform other tasks, so the form's UI remains active all the while.
Figure 1 – this example form runs two HTTP requests simultaneously firing events that update the form's status fields. While these requests run the form remains 'live'.
This process is surprisingly simple in .NET, partly because .NET makes it easy to route thread methods into classes. This makes it easy to package a thread's processing in a nicely encapsulated format, and it provides an easy mechanism for passing data into a thread and keeping that data isolated from the rest of the application.
Make sure you add the System.Threading namespace to your forms that use threads. The following code defines the thread handler class that fires the HTTP request with the FireUrls() method (Listing 6).
Listing 6: Implementing a Thread class
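A simplified sketch of such a class. ParentForm, Instance and FireUrls() come from the description below; the class name, label names and the GetUrlEvents() signature are assumptions:

```csharp
public class ThreadAction            // class name is an assumption
{
    public Form1 ParentForm = null;  // back reference to update the UI
    public int Instance = 1;         // identifies which request updates which label

    public void FireUrls()
    {
        wwHttp loHttp = new wwHttp();

        // route the event back into this class
        loHttp.OnReceiveData +=
            new wwHttp.OnReceiveDataDelegate(this.OnReceiveData);

        loHttp.GetUrlEvents("http://www.west-wind.com/testpage.wwd");
    }

    void OnReceiveData(object sender, wwHttp.OnReceiveDataEventArgs e)
    {
        // call back to the parent form and update the appropriate label
        string lcStatus = e.Done ? "Done" :
                          e.CurrentByteCount.ToString() + " bytes";
        if (this.Instance == 1)
            this.ParentForm.lblResult1.Text = lcStatus;
        else
            this.ParentForm.lblResult2.Text = lcStatus;
    }
}
```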
There's nothing special about this class; in fact, any class will do as a thread handler (as long as you write thread safe code). This simplified implementation includes a reference back to the ParentForm that makes it possible to access the status labels on the form. The Instance property is used to identify which request is updating which form control. The actual code here is very much like the code I've previously shown using the wwHttp object. Note that this code assigns the event handler to a method of the thread action class. This method then calls back to the ParentForm and updates the labels.
The calling code on the form creates two threads that call the FireUrls() method as follows (Listing 7):
Listing 7: Creating and running the Thread
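A sketch of that calling code (the thread handler class name is the same assumption as before):

```csharp
// first request: set up the handler instance, then spin up a thread on it
ThreadAction loAction = new ThreadAction();
loAction.ParentForm = this;
loAction.Instance = 1;
Thread loThread = new Thread(new ThreadStart(loAction.FireUrls));
loThread.Start();

// second request runs on its own thread with its own wwHttp instance
ThreadAction loAction2 = new ThreadAction();
loAction2.ParentForm = this;
loAction2.Instance = 2;
Thread loThread2 = new Thread(new ThreadStart(loAction2.FireUrls));
loThread2.Start();
```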
To start a thread a ThreadStart delegate is created, which takes a function pointer (basically a reference that points at a specific method in a class) as an argument. This delegate is then used to create a new thread and tell it to start running with this pointer. You can pass either an instance method or a static class method. In most cases you'll want to use a method of a dynamic object instance, because it gives you the ability to fully set up the instance by setting properties that you'll need as part of the processing. Think of the actual thread implementation class as a wrapper that is used as the high level calling mechanism and parameter packager for your actual processing code. If you need to pass data back to some other object you can make this instance a member of that object. For example, I could have made this object part of the form, which would then allow the form to access the members of the 'thread' class and both could share the data.
Creating threads and running them is very easy but make sure you manage your data cleanly to prevent access of shared data from different threads at the same time. Shared data needs to be protected with synchronization of some sort. In fact, you can see why this is an issue if you click the Go link on the sample form a few times while downloads are still running. You'll see the numbers jump back and forth as multiple threads update the same fields on the form. As multiple instances are writing to the labels at the same time the code will eventually blow up. The workaround for this is to use synchronized methods to handle the updates or to use separate forms to display the update information (new form for each request). The topic of synchronization is beyond the scope of this article and I’ll cover basic multi-threading concepts in a future article.
This cliché has been overused, but I can say .NET really delivers on this promise in a number of ways. Yes, you could do all of this and most other Internet related things before, but .NET brings a consistent model to the tools. In addition you get straightforward support for advanced features like easy to implement multi-threading and (albeit somewhat complex) event support that make it easy to create complex applications that utilize these tools in ways that were either not possible before or took a lot more work.
The tools are all there to access the Web very easily, whether it's through the high level tools in .NET or the lower level network protocol classes. The HttpWebRequest class is a fairly easy and powerful implementation that provides an excellent balance between flexibility and ease of use. For me the wwHttp class has been an exercise in understanding and utilizing the HttpWebRequest/Response classes. I hope you find the class useful, if not as is then as an example of a variety of things that need to be done with HTTP requests frequently. The class is not complete and some features will need expansion in the future, but it's a good starting point.