CHAPTER 32. The Hypertext Transfer Protocol

SOME OF THE MAIN TOPICS IN THIS CHAPTER ARE


The Beginning of HTTP

Defining HTTP

URLs, URIs, and URNs

The Internet is certainly familiar to readers of this book. Web pages are coded using the Hypertext Markup Language (HTML), as well as Java, ASP, and many other technologies. Underlying all of these technologies is an Application layer protocol: the Hypertext Transfer Protocol (HTTP), which is used to deliver Web pages to your browser. Like other application protocols (FTP, Telnet, and others), HTTP is usually transported by the TCP/IP suite of protocols, using TCP port 80 by default. HTTP can be sent over other network protocols, but that option is rarely used today. Port 80 is not a requirement for HTTP; other ports can be used, and often are. To specify a different port, place a colon character (:) and the port number immediately after the host name in the URL (for example, http://www.example.com:8080/). However, the original RFC specification for HTTP used port 80 as the default.
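As a minimal sketch of where the port number sits in a URL, the following Python fragment uses the standard urllib.parse module to pull out the host and port. The two example.com URLs and the helper name effective_port are assumptions made only for illustration.

```python
from urllib.parse import urlsplit

# Hypothetical URLs used only for illustration.
default_url = urlsplit("http://www.example.com/index.html")
custom_url = urlsplit("http://www.example.com:8080/index.html")


def effective_port(parts, default=80):
    """Return the explicit port if the URL has one, otherwise the HTTP default."""
    return parts.port if parts.port is not None else default


print(default_url.hostname, effective_port(default_url))
print(custom_url.hostname, effective_port(custom_url))
```

When no port is written after the host name, the parsed port is empty and the sketch falls back to port 80.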

To learn about the basics of TCP/IP, ports, and related protocols and applications, see Chapter 24, “Overview of the TCP/IP Protocol Suite.”

The TCP and IP protocols were developed long before HTTP was created. HTTP, like other application protocols, relies on TCP/IP as the underlying transport to deliver its messages reliably across the network.

HTTP, then, is the protocol that sits between HTML and TCP/IP. Keep in mind that HTML is just a language used for creating Web pages. HTTP is used to transfer these pages to end users, and HTTP is transported across the Internet using TCP/IP.

This chapter is not going to help you learn the many versions of HTML, or other programming languages (such as Java and C#). Instead, you will learn about HTTP.


Note

HTTP is currently defined by version 1.1. In this chapter you will learn about some of the history of HTTP, including the concepts presented by the original version as well as HTTP v.1.1.


The Beginning of HTTP

The history of HTTP is closely tied to a vendor-supported consortium called the World Wide Web Consortium (W3C). Although the W3C has been responsible for the creation of many Web standards, the Hypertext Transfer Protocol is among the most prominent. In 1989, Dr. Tim Berners-Lee, a scientist at CERN (the high-energy particle physics laboratory in Geneva, Switzerland), developed the first version of HTTP, which helped the World Wide Web gain popularity and grow dramatically. Compared with email, FTP, and the other utilities in use on the Internet at that time, the new HTTP offered an easier way to share information quickly.

Because of the time involved in ongoing development of HTTP, CERN partnered with INRIA (the French National Institute of Research for Computer Science and Control). Today, many other organizations are involved in continuing the development of HTTP, such as the Massachusetts Institute of Technology (MIT) Laboratory for Computer Science, and the Internet Engineering Task Force (IETF). Thus, you can find RFC (Request for Comments) documents on the Web about current and future development of the protocol.


Note

The W3C is not a government organization. It is an industry-supported consortium whose purpose is to promote standards for the Web, including interoperability among Web protocols and software. W3C does help to establish standards to achieve this goal.


Current proposed, informational, and standards RFCs include the following:

RFC 1945, “Hypertext Transfer Protocol – HTTP/1.0.” Written in 1996 by T. Berners-Lee, R. Fielding, and H. Frystyk, this informational RFC was the beginning of the standardization process within the Internet community.

RFC 2145, “Use and Interpretation of HTTP Version Numbers.” This is also an informational RFC that further specifies how HTTP version numbers should be used.

RFC 2518, “HTTP Extensions for Distributed Authoring – WEBDAV.” This is a proposed standard.

RFC 2831, “Using Digest Authentication as a SASL Mechanism.” This RFC is also a proposed standard, and it discusses using SASL (Simple Authentication and Security Layer) to provide support for connection-based protocols, such as HTTP.

RFC 2935, “Internet Open Trading Protocol (IOTP) HTTP Supplement.” IOTP messages are transported as XML (Extensible Markup Language) documents. The goal of this RFC is to ensure that XML documents are successfully exchanged between the parties involved in the communication.

RFC 3229, “Delta Encoding in HTTP.” This RFC proposes a method for conserving valuable bandwidth on the Net by downloading only changes to cached Web pages. Rather than sending the entire data transported by HTTP, only the changes (the deltas) are sent, a technique called delta encoding.

RFC 3230, “Instance Digests in HTTP.” This is another proposed standard for HTTP version 1.1 that describes the use of digests such as MD5 (Message Digest v. 5) to verify the integrity of data carried by HTTP. MD5, created by Ronald L. Rivest of MIT, is a message-digest (hashing) algorithm rather than an encryption technique. Its predecessors were MD2 and MD4.

RFC 3310, “Hypertext Transfer Protocol (HTTP) Digest Authentication Using Authentication and Key Agreement (AKA).” This is another informational RFC discussing authentication for use with HTTP.

The preceding RFCs (and others referenced in these RFCs) are recommended reading for those who want to pursue newer developments that may become part of the HTTP protocol in the near future.

Defining HTTP

HTTP was created to transport hypertext across the Internet. Hypertext technology was first developed by Ted Nelson, whose project was known as the Xanadu system. Xanadu was a method of creating linked documents written by one or more authors. One of its main features was the use of hyperlinks. Although Nelson’s original ideas never caught on, they were instrumental in the development of HTML as well as HTTP.

HTTP is basically a protocol that enables the transfer of text, images, and other data between computers on the Web. Although HTML might seem to be a protocol, it is not. Without HTTP, or a similar protocol, there would be no HTML pages on the Internet. HTTP relies on the underlying TCP/IP protocols for transport through the Internet, and thus HTTP can be considered an application protocol.

Although HTTP has been in use on the Net since 1990, it was first formally described in RFC 1945, published in 1996 by Berners-Lee and other authors, as the Hypertext Transfer Protocol version 1.0 (commonly written HTTP/1.0). HTTP is a stateless protocol, similar to IP. It is also an application protocol because it uses TCP/IP as a transport mechanism. The term stateless means that HTTP itself does not require a session of the kind TCP establishes, in which parameters are exchanged between the endpoints of a connection (the setup phase) before data exchanges can occur. Instead, each request is self-contained: a request is sent to a server via the Net, and provided that no errors occur, a response is sent back.

HTTP Mechanics

As previously indicated, HTTP is a client/server protocol. The client application (such as a browser) sends a request to the server that hosts the information the user needs (typically a Web page). The server sends back a response. The data object the client requests is identified by a Uniform Resource Identifier (URI), such as a Uniform Resource Locator (URL). Both of these are described later in this chapter.

The data object is encapsulated by HTTP and returned to the requestor. Although HTTP commands are terminated using the combination of <CR><LF> (carriage return/line feed), the object encapsulated in HTTP (the payload) does not have to adhere to this rule. Instead, the payload (referred to as the entity-body in HTTP terminology) is determined by the type of information being transferred. For example, plain ASCII text may use the <CR><LF> combination to mark the end of a record, whereas Unix/Linux systems use just the <LF> character. And graphics files can be composed in many different formats, from GIF to JPEG, among others. The important thing to remember is that the entity-body carried by HTTP is independent of the HTTP protocol.
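The framing described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a complete HTTP implementation: the helper names build_request and split_response, the User-Agent value, and the hand-written sample response are all hypothetical. The sketch shows that the HTTP lines end in <CR><LF>, that an empty line separates them from the entity-body, and that the entity-body keeps whatever line endings it already had.

```python
CRLF = "\r\n"


def build_request(method, path, headers):
    """Build a request: a request line, header lines, then an empty line."""
    lines = [f"{method} {path} HTTP/1.0"]
    lines += [f"{name}: {value}" for name, value in headers.items()]
    return (CRLF.join(lines) + CRLF + CRLF).encode("ascii")


def split_response(raw):
    """Split raw response bytes into (status line, header lines, entity-body)."""
    head, _, body = raw.partition(b"\r\n\r\n")
    status_line, *header_lines = head.decode("ascii").split(CRLF)
    return status_line, header_lines, body


request = build_request("GET", "/index.html", {"User-Agent": "example/0.1"})

# A hand-written sample response; the entity-body uses bare <LF> line endings.
sample = (
    b"HTTP/1.0 200 OK\r\n"
    b"Content-Type: text/plain\r\n"
    b"Content-Length: 12\r\n"
    b"\r\n"
    b"line1\nline2\n"
)
status, header_lines, body = split_response(sample)
print(request)
print(status, header_lines, body)
```

The entity-body is handled as raw bytes and never decoded, which reflects the point that its format is independent of HTTP.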

Almost all browsers today also maintain a cache, which stores recently requested pages. At the top of your browser, there should be a button you can use to refresh a page; that is, to send a request to the server for the most up-to-date version of the page instead of the one stored in the cache. Some pages are marked by the server so that they will not be stored in the requestor’s cache. These pages are refreshed from the HTTP server each time you reference the data source.

HTTP Header Fields

HTTP header fields (not to be confused with headers that may exist in the entity-body, or payload being carried by HTTP) can vary depending on the version of HTTP, as well as the content being carried. Each HTTP header field is made up of a name followed by the colon character (:), then a space, and finally the value for the particular header. Names for fields are case insensitive.

Some examples of HTTP headers include a content type field to identify the entity-body, as well as a field giving the length of the data. Another example, which can affect how long the content from a request is cached, is the Expires field. Browsers that honor this field will not display Web pages or data to the user from the cache after the information has expired. Instead, a new request will be sent to the HTTP server.
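The header syntax described above (a name, a colon, and a value, with case-insensitive names) can be sketched as follows. The function names parse_headers and is_expired and the sample field values are assumptions for illustration only; real clients apply additional caching rules that are not shown here.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def parse_headers(lines):
    """Parse 'Name: value' lines into a dict keyed by lowercased field name."""
    fields = {}
    for line in lines:
        name, _, value = line.partition(":")
        fields[name.strip().lower()] = value.strip()
    return fields


def is_expired(fields, now):
    """Return True if an Expires field is present and its date has passed."""
    expires = fields.get("expires")
    if expires is None:
        return False
    return parsedate_to_datetime(expires) <= now


# Field names are deliberately written in mixed case.
headers = parse_headers([
    "Content-Type: text/html",
    "content-length: 1024",
    "EXPIRES: Thu, 01 Dec 1994 16:00:00 GMT",
])
now = datetime(2000, 1, 1, tzinfo=timezone.utc)
print(headers["content-type"], headers["content-length"], is_expired(headers, now))
```

Lowercasing the names on the way in is one simple way to honor case insensitivity. Splitting only on the first colon keeps values that contain colons, such as the time in the Expires field, intact.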

Many other header fields are also defined, and you can find out about them by reading the RFCs listed at the beginning of this chapter. Here, just the basics are presented, as well as the syntax for forming header fields.

Although most users are familiar with using the Address field in a browser to enter a URL, most do not know what a URL is.

URLs, URIs, and URNs

Almost every Internet user understands that you put a URL (Uniform Resource Locator) in the Address field of a browser to send a request to a Web server. However, the URL is only one kind of URI (Uniform Resource Identifier; in RFC 1630, the original URI specification, the term was Universal Resource Identifier). You specify an HTTP URL by using the prefix http:// in the Address field of your browser. However, other URI prefixes can be used, such as ftp:// if you want to use a browser to download files from a remote server.

The important thing to remember here is that URLs are just a subset of URIs, and there are many URIs. However, URLs are probably the most widely used URIs.

RFC 1630, written by Berners-Lee, also discusses URNs (Uniform Resource Names), which are names drawn from a namespace intended to be more persistent than the locations that URLs refer to.

Although this definition is not considered to be a standard, Berners-Lee describes the URI syntax this way:

It should be extensible so that new naming schemes can be added later as determined by how the Web evolves.

The syntax should be complete so that any naming scheme can be encoded in a URI.

The URI should be “printable,” meaning that any URI should be able to be described using 7-bit ASCII characters.

To provide for the extensible characteristic of the syntax, this RFC assumed that new URI prefixes (http://, ftp://, and so on) can be an arbitrary string of characters, but also should be registered by some authority to ensure uniformity on the Web. The text that follows the prefixed URI designator is dependent on the prefix. For example, http:// would assume that a Web server address follows the prefix. For ftp://, the text following this prefix should be in conformance with FTP conventions, in order to specify an address and file to be downloaded.

This RFC also requires that a colon character (:) follow the scheme name at the start of the prefix (http, ftp, and so on). Slashes (/) are used to indicate a hierarchy of some sort, such as a path through a naming convention that leads to the eventual location of the information or object sought.


Note

The use of the slash character should not be confused with the character used in some operating systems as a directory hierarchy specification. There is no required relationship between the path that follows a URI prefix and a directory structure on the server, even if the path contains the slash character.


Because some characters (such as the space character) can cause conflicts (especially when URIs are used in email messages, and are so long that the text is wrapped), an escape character is used. The percent sign (%) is used as the escape character. This character should be used for only this purpose, and nothing else.

Other characters, such as the hash character (#) and the question mark (?), also serve a particular purpose. The # character separates the URI of an object from a fragment identifier that refers to a part of that object. The ? character separates the URI of an object that can be queried from the query itself. In other words, the text that follows the ? is passed as data to a query against the object referenced by the URI. You will see this character in many URLs, typically after you enter text (in a search engine, for example) and the site builds the final URL that applies your query to the object you originally referenced. You can try this by visiting just about any major Web site, such as Microsoft, or a search engine. Watch the Address field on your browser and you will see a longer URL containing what appears to be a meaningless string of characters. It is, however, the syntax that the search engine (or other Web site) uses to apply your query to find the information you are looking for.
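To see how the ? and # characters divide a URL, here is a short sketch using Python's standard urllib.parse module. The URL and its parameter names (q and lang) are made up for illustration and do not reflect any particular search engine's syntax.

```python
from urllib.parse import parse_qs, urlsplit

# Hypothetical URL; the parameter names are invented for this example.
url = "http://www.example.com/search?q=hypertext+transfer&lang=en#results"

parts = urlsplit(url)
query = parse_qs(parts.query)

print(parts.scheme)    # text before the colon
print(parts.netloc)    # server address
print(parts.path)      # hierarchical path
print(parts.query)     # text after the ? and before the #
print(parts.fragment)  # text after the #
print(query)
```

The query string is further split into name and value pairs, and the plus sign in the value of q is decoded as a space.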


Note

Although the use of spaces in a URL or URI is discouraged, the plus sign (+) is used to indicate a space. If you want to use + in the URI or URL, it must be escaped (in other words, the text that follows the escape character should be interpreted literally). The escape, as explained in the main text, is the percent character (%). To identify a specific character, you would first use the escape character followed by the ASCII hex value for the character. A literal plus sign (ASCII code 2B) would therefore be represented as %2B.
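The escaping rules in this note can be tried out with Python's standard urllib.parse functions. This is a small sketch and the sample string is arbitrary. It shows a literal plus sign becoming %2B, a space becoming either %20 or +, and the decoding step reversing both.

```python
from urllib.parse import quote, quote_plus, unquote, unquote_plus

text = "a b+c"  # arbitrary sample containing a space and a literal plus sign

# Build the escape for "+" by hand: "%" followed by the ASCII code in hex.
manual_escape = "%" + format(ord("+"), "02X")

percent_encoded = quote(text, safe="")  # space -> %20, plus -> %2B
plus_encoded = quote_plus(text)         # space -> +,   plus -> %2B

print(manual_escape)
print(percent_encoded)
print(plus_encoded)
print(unquote(percent_encoded))
print(unquote_plus(plus_encoded))
```

Both encodings decode back to the original string, provided the matching decoder is used.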


Other reserved characters, which can be used by any URI and which apply to the syntax of those URIs, are the asterisk (*) character and the exclamation mark (!). In other words, these characters do not mean the same thing for all URIs. Each URI can use these characters for a meaning specific to the particular URI.

If this sounds confusing, just go to a search engine and look at the string of characters that follows your query. In Figure 32.1 you can see that entering the URL www.google.com brings up the initial query page for this search engine.

Figure 32.1. You can enter a URL to bring up a particular Web page, such as a search engine.

Yet when you enter text into this search engine’s Search field, and click on the Search button, the URL in the Address field of your browser is translated to a query that the search engine uses to locate resources related to your query, as shown in Figure 32.2.

Figure 32.2. The URL can change after you enter text in a search engine.

In Figure 32.2, notice the long string that was created by the search engine to satisfy your search request. Also notice that the Web site for Yoko Ono, the premier site for information related to her, is the first result to show up.


Note

Because some characters are not allowed by the 7-bit URI scheme described in this chapter, you can escape them using the % character, followed by the hexadecimal equivalent of the character you want to use.



Tip

Whereas binary notation (base 2) uses just two digits, 0 and 1, and the octal numbering scheme (base 8) uses the digits 0–7, hexadecimal (base 16) is a numbering system that uses the digits 0–9 and then the alphabetic characters A–F (in decimal, the values 10–15). In decimal notation (base 10), 9 is the largest single digit, so the value after 9 is written with two digits, as 10. Binary uses just two characters, zero and one, so the equivalent of ten in binary is 1010. Because only zeros and ones are allowed in binary, a longer string of digits is needed to represent the same value that takes two digits in decimal. Hexadecimal goes the other way. Hexadecimal (hex) expands on base 10 by adding the alphabetic characters needed to reach base 16, so the decimal value 10 is represented in hexadecimal as the single letter A.
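The conversions in this tip can be checked with a few lines of Python. This is a small sketch using only built-in functions, and the variable names are arbitrary.

```python
value = 10

binary_form = format(value, "b")  # base 2
octal_form = format(value, "o")   # base 8
hex_form = format(value, "X")     # base 16, uppercase letters

# Going the other way: the hex value 2B used in URL escapes.
plus_code = int("2B", 16)
plus_char = chr(plus_code)

print(binary_form, octal_form, hex_form)
print(plus_code, plus_char)
```

The last two lines tie the tip back to URL escaping: hex 2B is decimal 43, the ASCII code for the plus sign.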


The RFC goes on to explain URIs for specific applications, such as gopher, news, and mail. Some of these have been superseded by other RFCs. However, RFC 1630 should serve as a beginning document for those readers who want to study the details of URIs and URLs. URIs and URLs are also discussed in RFCs other than those covered in this chapter. Of the various URIs, URLs are by far the most common, given the intense growth of the Web. There are even “hidden” URLs that don’t appear until you click on an embedded link in a Web page. Each link (or hyperlink in some RFCs) in a Web page that refers to another Web page simply provides, using HTML syntax, another URL request that will be sent to the server defined in that link.
