You need to get the HTML returned from a web server in order to examine it for items of interest. For example, you could examine the returned HTML for links to other pages or for headlines from a news site.
We can use the methods for web communication set up in Recipe 13.5 and Recipe 13.6 to make the HTTP request and verify the response; we can then get at the HTML via the GetResponseStream method of the HttpWebResponse object:
public static string GetHTMLFromURL(string url)
{
    if (url.Length == 0)
        throw new ArgumentException("Invalid URL", "url");

    string html = "";
    HttpWebRequest request = GenerateGetOrPostRequest(url, "GET", null);
    HttpWebResponse response = (HttpWebResponse)request.GetResponse();
    try
    {
        if (VerifyResponse(response) == ResponseCategories.Success)
        {
            // get the response stream.
            Stream responseStream = response.GetResponseStream();
            // use a stream reader that understands UTF8
            StreamReader reader = new StreamReader(responseStream, Encoding.UTF8);
            try
            {
                html = reader.ReadToEnd();
            }
            finally
            {
                // close the reader
                reader.Close();
            }
        }
    }
    finally
    {
        response.Close();
    }
    return html;
}
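Once the HTML is in a string, it can be examined for items of interest such as links. The following is a minimal sketch of one way to do that, not part of the recipe itself: ExtractLinks and LinkScanSketch are hypothetical names, the HTML is a literal string standing in for the value returned by GetHTMLFromURL, and the regular expression only handles quoted href attributes, so it is no substitute for a real HTML parser.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class LinkScanSketch
{
    // Hypothetical helper (not from the recipe): collects the values of
    // quoted href attributes found anywhere in the supplied HTML text.
    public static List<string> ExtractLinks(string html)
    {
        List<string> links = new List<string>();
        // Matches href="..." or href='...', ignoring case.
        Regex hrefPattern = new Regex(
            @"href\s*=\s*[""']([^""']+)[""']",
            RegexOptions.IgnoreCase);

        foreach (Match match in hrefPattern.Matches(html))
        {
            links.Add(match.Groups[1].Value);
        }
        return links;
    }

    public static void Main()
    {
        // Stand-in for the string that GetHTMLFromURL would return.
        string html =
            "<a href=\"http://example.com/a\">A</a> <A HREF='/b'>B</A>";

        foreach (string link in ExtractLinks(html))
        {
            Console.WriteLine(link);
        }
    }
}
```

A regular expression keeps the sketch short; for pages with unquoted attributes or malformed markup, a dedicated HTML parser would be a more robust choice.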
The GetHTMLFromURL method gets a web page using the GenerateGetOrPostRequest and GetResponse methods and verifies the response using the VerifyResponse method. Once we have a valid response, we start looking for the HTML that was returned.
The GetResponseStream method on the HttpWebResponse provides access to the body of the message that was returned in a System.IO.Stream object. To read the data, we instantiate a StreamReader with the response stream and the UTF8 property of the Encoding class so that UTF8-encoded text data is read correctly from the stream. We then call ReadToEnd on the StreamReader, which puts all of the content in the string variable called html, and return it. If the response is not verified as a success, html is never assigned and the method returns an empty string.
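The StreamReader step can be tried in isolation, without a web server. The sketch below is illustrative only: ReadAllText, ResolveEncoding, and StreamReadingSketch are hypothetical names, and a MemoryStream stands in for the response stream. ResolveEncoding shows one possible way to fall back to UTF8 when a charset name (for example, one reported by the server) is missing or unrecognized; the recipe itself simply assumes UTF8.

```csharp
using System;
using System.IO;
using System.Text;

public static class StreamReadingSketch
{
    // Hypothetical helper: maps a charset name to an Encoding,
    // falling back to UTF8 when the name is empty or unknown.
    public static Encoding ResolveEncoding(string charsetName)
    {
        if (string.IsNullOrEmpty(charsetName))
            return Encoding.UTF8;
        try
        {
            return Encoding.GetEncoding(charsetName);
        }
        catch (ArgumentException)
        {
            // Encoding.GetEncoding rejects names it does not recognize.
            return Encoding.UTF8;
        }
    }

    // Hypothetical helper: reads the whole stream as text and
    // disposes the reader (and the underlying stream) when done.
    public static string ReadAllText(Stream stream, Encoding encoding)
    {
        using (StreamReader reader = new StreamReader(stream, encoding))
        {
            return reader.ReadToEnd();
        }
    }

    public static void Main()
    {
        // A MemoryStream stands in for response.GetResponseStream().
        byte[] body = Encoding.UTF8.GetBytes("<html><body>caf\u00e9</body></html>");
        MemoryStream stream = new MemoryStream(body);

        string html = ReadAllText(stream, ResolveEncoding("utf-8"));
        Console.WriteLine(html);
    }
}
```

The using statement plays the same role as the try/finally around reader.Close in the recipe: the reader is released even if ReadToEnd throws.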