08 April 2011

The World Wide Web: HTTP

The World Wide Web (WWW) is a repository of information linked together from points all over the world. The WWW has a unique combination of flexibility, portability, and user-friendly features that distinguish it from other services provided by the Internet.
The WWW project was initiated by CERN (European Laboratory for Particle Physics) to create a system to handle distributed resources necessary for scientific research.

HTTP: A killer application-level protocol from a networking point of view

In the 1980s the Internet was used by researchers, academics and university students to:

log in to remote hosts,
transfer files from local hosts to remote hosts and vice versa,
receive and send news, and
receive and send electronic mail.

Although these applications were (and continue to be) extremely useful, the Internet was essentially unknown outside the academic and research communities.

Then, in the early 1990s, the Internet's killer application arrived on the scene: the World Wide Web.

The Web is the Internet application that caught the general public's eye.
It is dramatically changing how people interact inside and outside their work environments.
It has spawned thousands of start-up companies.
It has elevated the Internet from just one of many data networks (including online networks such as Prodigy, America Online and CompuServe, national data networks such as Minitel/Transpac in France, and private X.25 and frame relay networks) to essentially the one and only data network.

History is sprinkled with the arrival of electronic communication technologies that have had major societal impacts.

The first such technology was the telephone, invented in the 1870s. The telephone allowed two persons to communicate orally in real time without being in the same physical location. It had a major impact on society, both good and bad.
The next electronic communication technology was broadcast radio/television, which arrived in the 1920s and 1930s. Broadcast radio/television allowed people to receive vast quantities of audio and video information. It also had a major impact on society -- both good and bad.
The third major communication technology that has changed the way people live and work is the Web.

Perhaps what appeals the most to users about the Web is that it is on demand. Users receive what they want, when they want it. This is unlike broadcast radio and television, which force users to "tune in" when the content provider makes the content available. In addition to being on demand, the Web has many other wonderful features that people love and cherish.

It is enormously easy for any individual to make any content available over the Web; everyone can become a publisher at extremely low cost.
Hyperlinks and search engines help us navigate through an ocean of Web sites.
Graphics and animated graphics stimulate our senses.
Forms, Java applets, ActiveX components, as well as many other devices enable us to interact with pages and sites.
And more and more, the Web provides a menu interface to vast quantities of audio and video material stored in the Internet, audio and video that can be accessed on demand.

Overview of HTTP

The Hypertext Transfer Protocol (HTTP), the Web's application-layer protocol, is at the heart of the Web.

The Hypertext Transfer Protocol (HTTP) is a protocol used mainly to access data on the World Wide Web. HTTP functions as a combination of FTP and SMTP.

It is similar to FTP because it transfers files and uses the services of TCP.

However, it is much simpler than FTP because it uses only one TCP connection.

There is no separate control connection; only data are transferred between the client and the server.

HTTP is like SMTP because the data transferred between the client and the server look like SMTP messages. In addition, the format of the messages is controlled by MIME-like headers.

Unlike SMTP, the HTTP messages are not destined to be read by humans; they are read and interpreted by the HTTP server and HTTP client (browser).

SMTP messages are stored and forwarded, but HTTP messages are delivered immediately.
The commands from the client to the server are embedded in a request message.
The contents of the requested file or other information are embedded in a response message.

HTTP uses the services of TCP on well-known port 80.

HTTP is implemented in two programs:

a client program and
a server program.

The client and server programs, executing on different end systems, talk to each other by exchanging HTTP messages.

HTTP defines the structure of these messages and how the client and server exchange the messages.

Now, it is useful to review some Web terminology:

The WWW today is a distributed client/server service, in which a client using a browser can access a service using a server. However, the service provided is distributed over many locations called sites.

Each site holds one or more documents, referred to as Web pages. Each Web page can contain a link to other pages in the same site or at other sites. The pages can be retrieved and viewed by using browsers.

The client needs to see some information that it knows belongs to site A. It sends a request through its browser, a program that is designed to fetch Web documents. The request, among other information, includes the address of the site and the Web page, called the URL, which we will discuss shortly.
The server at site A finds the document and sends it to the client. When the user views the document, she finds some references to other documents, including a Web page at site B. The reference has the URL for the new site.
The user is also interested in seeing this document. The client sends another request to the new site, and the new page is retrieved.

In other words, a Web page (also called a document) consists of objects. An object is simply a file (such as an HTML file, a JPEG image, a GIF image, a Java applet, or an audio clip) that is addressable by a single URL.

Most Web pages consist of a base HTML file and several referenced objects.

For example, if a Web page contains HTML text and five JPEG images, then the Web page has six objects: the base HTML file plus the five images. The base HTML file references the other objects in the page with the objects' URLs.

A client that wants to access a Web page needs the address. To facilitate the access of documents distributed throughout the world, HTTP uses locators. The uniform resource locator (URL) is a standard for specifying any kind of information on the Internet. The URL defines four things (see Figure 27.3):

protocol,
host computer,
port (optional), and
path

So each URL has four components:

The protocol is the client/server program used to retrieve the document. Many different protocols can retrieve a document, among them FTP and HTTP. The most common today is HTTP.
The host is the computer (that houses the object) on which the information is located, although the name of the computer can be an alias. Web pages are usually stored in computers, and computers are given alias names that usually begin with the characters "www". This is not mandatory, however, as the host can be any name given to the computer that hosts the Web page.
The URL can optionally contain the port number of the server. If the port is included, it is inserted between the host and the path, and it is separated from the host by a colon.
The path is the pathname of the file (the object) where the information is located. Note that the path can itself contain slashes that, in the UNIX operating system, separate the directories from the subdirectories and files.

For example, the URL www.someSchool.edu/someDepartment/picture.gif (written here without the protocol and port) has:

www.someSchool.edu for a host name and
/someDepartment/picture.gif for a path name.
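As a rough sketch, the four URL components can be pulled apart with Python's standard urllib.parse module. The http:// prefix, the port 8080 in the second call, and the helper name url_components are assumptions added for illustration; the example URL above omits the protocol and the port.

```python
from urllib.parse import urlsplit

# Well-known default ports, used when the URL does not name one.
DEFAULT_PORTS = {"http": 80, "ftp": 21}


def url_components(url: str) -> dict:
    """Split a URL into protocol, host, port and path."""
    parts = urlsplit(url)
    port = parts.port if parts.port is not None else DEFAULT_PORTS.get(parts.scheme)
    return {
        "protocol": parts.scheme,
        "host": parts.hostname,  # note: urlsplit lowercases the host name
        "port": port,
        "path": parts.path,
    }


# Port omitted: falls back to the default for the protocol.
print(url_components("http://www.someSchool.edu/someDepartment/picture.gif"))
# Port given explicitly, separated from the host by a colon (8080 is hypothetical).
print(url_components("http://www.someSchool.edu:8080/someDepartment/picture.gif"))
```

Host names are case-insensitive, which is why the parser is free to lowercase the host; the path is left exactly as written.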

Browsers
A browser is a user agent for the Web; it displays to the user the requested Web page and provides numerous navigational and configuration features.

Web browsers also implement the client side of HTTP. Thus, in the context of the Web, we will interchangeably use the words "browser" and "client".

Popular Web browsers include Mozilla's Firefox, Microsoft's Internet Explorer, Google's Chrome (Chromium), Opera, etc.

Each browser usually consists of three parts:

controller: The controller receives input from the keyboard or the mouse and uses the client programs to access the document. After the document has been accessed, the controller uses one of the interpreters to display the document on the screen.
client protocol: The client protocol can be one of the protocols described previously, such as FTP or HTTP.
interpreters: The interpreter can be HTML, Java, or JavaScript, depending on the type of document. The use of these interpreters, based on the document type, is discussed toward the end of the post (see Figure 27.2).

As we said earlier, a Web server houses Web objects, each addressable by a URL. The Web page is stored at the server. Each time a client request arrives, the corresponding document is sent to the client.

To improve efficiency, servers normally store requested files in a cache in memory; memory is faster to access than disk.
A server can also become more efficient through multithreading or multiprocessing. In this case, a server can answer more than one request at a time.

Web servers
Web servers also implement the server side of HTTP. Popular Web servers (Netcraft provides a nice survey of Web server penetration) include Apache, Microsoft Internet Information Services (IIS), Nginx, lighttpd, etc.

HTTP defines how Web clients (i.e., browsers) request Web pages from servers (i.e., Web servers) and how servers transfer Web pages to clients.

The general idea is illustrated in Figure 2.2-1.

When a user requests a Web page (e.g., clicks on a hyperlink), the browser sends HTTP request messages for the objects in the page to the server.
The server receives the requests and responds with HTTP response messages that contain the objects.

Through 1997, essentially all browsers and Web servers implemented version HTTP/1.0, which is defined in [RFC 1945].

Beginning in 1998, Web servers and browsers began to implement version HTTP/1.1, which was first defined in [RFC 2068] and later revised in RFC 2616.
HTTP/1.1 is backward compatible with HTTP/1.0; a Web server running 1.1 can "talk" with a browser running 1.0, and a browser running 1.1 can "talk" with a server running 1.0.

Both HTTP/1.0 and HTTP/1.1 use TCP as their underlying transport protocol (rather than running on top of UDP).

The HTTP client first initiates a TCP connection with the server. Once the connection is established, the browser and the server processes access TCP through their socket interfaces.

As we know, on the client side the socket interface is the "door" between the client process and the TCP connection; on the server side it is the "door" between the server process and the TCP connection.

The client sends HTTP request messages into its socket interface and receives HTTP response messages from its socket interface.

Similarly, the HTTP server receives request messages from its socket interface and sends response messages into the socket interface.

Once the client sends a message into its socket interface, the message is "out of the client's hands" and is "in the hands of TCP".

Recall that TCP provides a reliable data transfer service to HTTP. This implies that each HTTP request message emitted by a client process eventually arrives intact at the server; similarly, each HTTP response message emitted by the server process eventually arrives intact at the client.

Here we see one of the great advantages of a layered architecture: HTTP need not worry about lost data, or the details of how TCP recovers from loss or reordering of data within the network. That is the job of TCP and the protocols in the lower layers of the protocol stack.
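To make the "door" picture concrete, here is a minimal sketch using Python's socket module. A toy server and a client run in the same process over the loopback interface; each side only writes into and reads from its socket, and TCP does the rest. The function names (serve_once, fetch, demo), the response body, and the use of an ephemeral port instead of port 80 are all assumptions made for this illustration, not a real Web server.

```python
import socket
import threading


def serve_once(listener: socket.socket) -> None:
    """Accept one connection, read one request, send one fixed response."""
    conn, _ = listener.accept()
    with conn:
        request = b""
        # Read from the server-side socket until the blank line that ends the headers.
        while b"\r\n\r\n" not in request:
            chunk = conn.recv(1024)
            if not chunk:
                break
            request += chunk
        body = b"<html>hello</html>"  # hypothetical document
        head = (
            "HTTP/1.0 200 OK\r\n"
            "Content-Type: text/html\r\n"
            f"Content-Length: {len(body)}\r\n"
            "\r\n"
        ).encode("ascii")
        conn.sendall(head + body)  # hand the response to TCP
    # Leaving the with-block closes the connection.


def fetch(host: str, port: int, path: str) -> bytes:
    """Send one request into the client socket and read until the server closes."""
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(f"GET {path} HTTP/1.0\r\n\r\n".encode("ascii"))
        response = b""
        while True:
            chunk = sock.recv(4096)
            if not chunk:  # empty read: the server closed the connection
                break
            response += chunk
    return response


def demo() -> bytes:
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    listener.listen(1)
    port = listener.getsockname()[1]
    worker = threading.Thread(target=serve_once, args=(listener,), daemon=True)
    worker.start()
    try:
        return fetch("127.0.0.1", port, "/someDepartment/home.index")
    finally:
        worker.join(timeout=5)
        listener.close()


if __name__ == "__main__":
    print(demo().decode("ascii"))
```

Neither function contains any retransmission or ordering logic; that is exactly the work the layered design leaves to TCP.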

TCP also employs a congestion control mechanism. We only mention here that this mechanism forces each new TCP connection to initially transmit data at a relatively slow rate, but then allows each connection to ramp up to a relatively high rate when the network is uncongested. The initial slow-transmission phase is referred to as slow start.

It is important to note that the server sends requested files to clients without storing any state information about the client. If a particular client asks for the same object twice in a period of a few seconds, the server does not respond by saying that it just served the object to the client; instead, the server resends the object, as it has completely forgotten what it did earlier. Because an HTTP server maintains no information about the clients, HTTP is said to be a stateless protocol.

Non-Persistent and Persistent Connections
HTTP can use both non-persistent connections and persistent connections.

Non-persistent connections are the default mode for HTTP/1.0. Conversely, persistent connections are the default mode for HTTP/1.1.

Non-Persistent Connections
Let us walk through the steps of transferring a Web page from server to client for the case of non-persistent connections. Suppose the page consists of a base HTML file and 10 JPEG images, and that all 11 of these objects reside on the same server. Suppose the URL for the base HTML file is www.someSchool.edu/someDepartment/home.index. Here is what happens:

1. The HTTP client initiates a TCP connection to the server www.someSchool.edu. Port number 80 is used as the default port number at which the HTTP server will be listening for HTTP clients that want to retrieve documents using HTTP.

2. The HTTP client sends an HTTP request message into the socket associated with the TCP connection that was established in step 1. The request message either includes the entire URL or simply the path name /someDepartment/home.index.

3. The HTTP server receives the request message via the socket associated with the connection that was established in step 1, retrieves the object /someDepartment/home.index from its storage (RAM or disk), encapsulates the object in an HTTP response message, and sends the response message into the TCP connection.

4. The HTTP server tells TCP to close the TCP connection. (But TCP doesn't actually terminate the connection until the client has received the response message intact.)

5. The HTTP client receives the response message. The TCP connection terminates. The message indicates that the encapsulated object is an HTML file. The client extracts the file from the response message, parses the HTML file and finds references to the ten JPEG objects.

6. The first four steps are then repeated for each of the referenced JPEG objects.
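The part of the walkthrough where the client parses the base HTML file and finds the referenced objects can be sketched with Python's standard html.parser module. This is only an illustration: the class name ImageReferenceFinder, the helper find_image_references, and the image file names are hypothetical, and a real browser looks at many more tags than img.

```python
from html.parser import HTMLParser


class ImageReferenceFinder(HTMLParser):
    """Collect the src attribute of every <img> tag in a document."""

    def __init__(self):
        super().__init__()
        self.references = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            for name, value in attrs:
                if name == "src" and value:
                    self.references.append(value)


def find_image_references(html: str) -> list:
    finder = ImageReferenceFinder()
    finder.feed(html)
    finder.close()
    return finder.references


# A made-up base HTML file that references ten JPEG images.
base_html = (
    "<html><body>"
    + "".join(f'<img src="/someDepartment/image{i}.jpg">' for i in range(1, 11))
    + "</body></html>"
)

references = find_image_references(base_html)
print(len(references), "objects still to request")
```

With non-persistent connections, each entry in references would then trigger its own TCP connection, request and response.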

As the browser receives the Web page, it displays the page to the user. Two different browsers may interpret (i.e., display to the user) a Web page in somewhat different ways.

HTTP has nothing to do with how a Web page is interpreted by a client. The HTTP specifications ([RFC 1945] and [RFC 2068]) only define the communication protocol between the client HTTP program and the server HTTP program.

The steps above use non-persistent connections because each TCP connection is closed after the server sends the object -- the connection does not persist for other objects. Note that each TCP connection transports exactly one request message and one response message. Thus, in this example, when a user requests the Web page, 11 TCP connections are generated.

In general, with this strategy, retrieving N different objects stored in N different files means the connection must be opened and closed N times.

The non-persistent strategy imposes high overhead on the server because the server needs N different buffers and requires a slow start procedure each time a connection is opened.

In the steps described above, we were intentionally vague about whether the client obtains the 10 JPEGs over ten serial TCP connections, or whether some of the JPEGs are obtained over parallel TCP connections.

Indeed, users can configure modern browsers to control the degree of parallelism. In their default modes, most browsers open five to ten parallel TCP connections, and each of these connections handles one request-response transaction. (For example, Mozilla Firefox 3.6.16 opens at most 6 persistent connections per server by default; this is the network.http.max-persistent-connections-per-server preference, accessible in about:config.)

If the user prefers, the maximum number of parallel connections can be set to one, in which case the ten connections are established serially.

As we shall see later, the use of parallel connections shortens the response time since it cuts out some of the RTT and slow-start delays. Parallel TCP connections can also allow the requesting browser to take more than its fair share of the end-to-end transmission bandwidth.

Before continuing, let's do a "back of the envelope calculation" to estimate the amount of time from when a client requests the base HTML file until the file is received by the client.

To this end we define the round-trip time RTT, which is the time it takes for a small packet to travel from client to server and then back to the client.

The RTT includes packet propagation delays, packet queuing delays in intermediate routers and switches, and packet processing delays.

Now consider what happens when a user clicks on a hyperlink.

This causes the browser to initiate a TCP connection between the browser and the Web server; this involves a "three-way handshake" -- the client sends a small TCP message to the server, the server acknowledges and responds with a small message, and finally the client acknowledges back to the server.
One RTT elapses after the first two parts of the three-way handshake.
After completing the first two parts of the handshake, the client sends the HTTP request message into the TCP connection, and TCP "piggybacks" the last acknowledgment (the third part of the three-way handshake) onto the request message.

Once the request message arrives at the server, the server sends the HTML file into the TCP connection. This HTTP request/response eats up another RTT.

Thus, roughly, the total response time is 2*RTT plus the transmission time at the server of the HTML file.
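The back-of-the-envelope estimate can be written as a one-line function. The parameter names and the sample numbers (a 125 ms RTT, a 500,000-bit file, a 1 Mbit/s link) are assumptions chosen only to make the arithmetic easy to follow; slow start and queuing at the server are ignored, as in the text.

```python
def response_time(rtt_s: float, file_bits: float, link_bps: float) -> float:
    """Rough time to fetch one file over a fresh TCP connection.

    One RTT for the TCP handshake, one RTT for the HTTP request/response,
    plus the time the server needs to push the file onto the link.
    """
    transmission_time = file_bits / link_bps
    return 2 * rtt_s + transmission_time


# 2 * 0.125 s of round trips + 0.5 s of transmission time
print(response_time(rtt_s=0.125, file_bits=500_000, link_bps=1_000_000))
```

For small files the two RTTs dominate, which is why the connection strategy matters so much.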

Persistent Connections
Non-persistent connections have some shortcomings.

First, a brand new connection must be established and maintained for each requested object. For each of these connections, TCP buffers must be allocated and TCP variables must be kept in both the client and server. This can place a serious burden on the Web server, which may be serving requests from hundreds of different clients simultaneously.
Second, as we just described, each object suffers two RTTs -- one RTT to establish the TCP connection and one RTT to request and receive an object.
Finally, each object suffers from TCP slow start because every TCP connection begins with a TCP slow-start phase.

However, the accumulation of RTT and slow start delays is partially alleviated by the use of parallel TCP connections.
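To see how much parallel connections help, here is a deliberately simplified model that counts only round trips. It assumes every object costs two RTTs on its own connection, that the referenced objects are fetched in idealized "rounds" of parallel connections, and that transmission time and slow start can be ignored. The function name nonpersistent_rtts and the round-based model are illustrative assumptions, not measurements.

```python
import math


def nonpersistent_rtts(num_referenced: int, parallel: int = 1) -> int:
    """Round trips needed to fetch a base file plus its referenced objects
    over non-persistent connections, `parallel` connections at a time."""
    if parallel < 1:
        raise ValueError("parallel must be at least 1")
    base = 2  # handshake + request/response for the base HTML file
    rounds = math.ceil(num_referenced / parallel)
    return base + 2 * rounds  # each round costs two RTTs


for connections in (1, 5, 10):
    print(connections, "parallel connection(s):", nonpersistent_rtts(10, connections), "RTTs")
```

For the ten-image page this gives 22 RTTs with serial connections, 6 with five parallel connections and 4 with ten, at the cost of the server holding more connections open at once.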

With persistent connections, the server leaves the TCP connection open after sending responses. Subsequent requests and responses between the same client and server can be sent over the same connection.

In particular, an entire Web page (in the example above, the base HTML file and the ten images) can be sent over a single persistent TCP connection;
moreover, multiple Web pages residing on the same server can be sent over one persistent TCP connection.

Typically, the server can close the connection at the request of a client or if a time-out has been reached (that is, when the connection has not been used for a certain time). The timeout interval is often configurable; in Firefox, for example, the network.http.keep-alive.timeout preference defaults to 115 seconds. The sender usually sends the length of the data with each response. However, there are some occasions when the sender does not know the length of the data. This is the case when a document is created dynamically or actively. In these cases, the server informs the client that the length is not known and closes the connection after sending the data so the client knows that the end of the data has been reached.
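The two ways a client can find the end of the data can be sketched as follows. The function reads from any binary file-like object, so the example uses in-memory streams; with a real socket one could pass sock.makefile("rb"). The name read_response and the sample messages are assumptions for illustration, and the sketch ignores the chunked transfer coding that HTTP/1.1 also offers.

```python
import io


def read_response(stream) -> tuple:
    """Read one HTTP response from a binary file-like object."""
    status_line = stream.readline().rstrip(b"\r\n").decode("ascii")
    headers = {}
    while True:
        line = stream.readline().rstrip(b"\r\n")
        if not line:  # blank line (or end of stream) ends the header section
            break
        name, _, value = line.decode("ascii").partition(":")
        headers[name.strip().lower()] = value.strip()
    if "content-length" in headers:
        # Length known: read exactly that many bytes; the connection can stay open.
        body = stream.read(int(headers["content-length"]))
    else:
        # Length unknown: the end of the data is signalled by the connection closing.
        body = stream.read()
    return status_line, headers, body


known = io.BytesIO(b"HTTP/1.1 200 OK\r\nContent-Length: 5\r\n\r\nhelloNEXT")
unknown = io.BytesIO(b"HTTP/1.0 200 OK\r\nContent-Type: text/html\r\n\r\n<p>dynamic</p>")

print(read_response(known)[2])    # stops after 5 bytes, leaving b"NEXT" unread
print(read_response(unknown)[2])  # reads everything up to end of stream
```

The first case is what makes persistent connections workable: the client knows where one response ends and the next may begin.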

There are two versions of persistent connections:

without pipelining: The client issues a new request only when the previous response has been received. In this case, each of the referenced objects (the ten images in the example above) experiences one RTT in order to request and receive the object. Although this is an improvement over non-persistent's two RTTs, the RTT delay can be further reduced with pipelining.
Another disadvantage of no pipelining is that after the server sends an object over the persistent TCP connection, the connection hangs -- does nothing -- while it waits for another request to arrive. This hanging wastes server resources.

with pipelining: The HTTP client issues a request as soon as it encounters a reference. Thus the HTTP client can make back-to-back requests for the referenced objects. HTTP/1.1 uses persistent connections by default and permits pipelining on them, although many browsers ship with pipelining disabled.
When the server receives the requests, it can send the objects back-to-back. If all the requests are sent back-to-back and all the responses are sent back-to-back, then only one RTT is expended for all the referenced objects (rather than one RTT per referenced object when pipelining isn't used).
Furthermore, the pipelined TCP connection hangs for a smaller fraction of time.

In addition to reducing RTT delays, persistent connections (with or without pipelining) have a smaller slow-start delay than non-persistent connections.

This is because, after sending the first object, the persistent server does not have to send the next object at the initial slow rate since it continues to use the same TCP connection. Instead, the server can pick up at the rate where the first object left off.
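The RTT counts for the three connection styles can be compared with a small model. It counts round trips only, assumes a single serial connection at a time, and ignores transmission time and slow start; the function name page_rtts and the mode labels are made up for this sketch.

```python
def page_rtts(num_referenced: int, mode: str) -> int:
    """Round trips to fetch a base HTML file plus its referenced objects."""
    base = 2  # TCP handshake + request/response for the base HTML file
    if mode == "non-persistent":
        return base + 2 * num_referenced  # new connection per object
    if mode == "persistent":
        return base + num_referenced      # one RTT per object, same connection
    if mode == "pipelined":
        return base + (1 if num_referenced else 0)  # all requests back-to-back
    raise ValueError(f"unknown mode: {mode}")


for mode in ("non-persistent", "persistent", "pipelined"):
    print(mode, page_rtts(10, mode))
```

For the page with ten images this works out to 22, 12 and 3 RTTs respectively, which matches the per-object figures given in the text.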

The interested reader is also encouraged to see [Heidemann 1997] and [Nielsen 1997].

HTTP Message Format (HTTP Transaction)
The HTTP specifications 1.0 [RFC 1945] and 1.1 [RFC 2068] define the HTTP message formats. There are two types of HTTP messages, request messages and response messages, both of which are discussed below.

Figure 27.12 illustrates the HTTP transaction between the client and server. Although HTTP uses the services of TCP, HTTP itself is a stateless protocol.

The client initializes the transaction by sending a request message.
The server replies by sending a response.

Messages
The formats of the request and response messages are similar; both are shown in Figure 27.13. A request message consists of:

a request line,
a header, and sometimes
a body.

A response message consists of:

a status line,
a header, and
sometimes a body.
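The three-part layout of a message (first line, header, optional body) can be sketched in a few lines. The exact layout used here, with a request line of the form "method, space, path, space, version", header lines of the form "Name: value", CRLF line endings and a blank line before the body, follows the HTTP specifications rather than anything shown in this post. The helper names build_request and split_message and the sample header values are assumptions.

```python
def build_request(method: str, path: str, headers: dict,
                  body: bytes = b"", version: str = "HTTP/1.1") -> bytes:
    """Assemble a request message: request line, header lines, blank line, body."""
    request_line = f"{method} {path} {version}\r\n"
    header_lines = "".join(f"{name}: {value}\r\n" for name, value in headers.items())
    return (request_line + header_lines + "\r\n").encode("ascii") + body


def split_message(message: bytes) -> tuple:
    """Split any HTTP message into (first line, headers, body)."""
    head, _, body = message.partition(b"\r\n\r\n")
    first_line, *header_lines = head.decode("ascii").split("\r\n")
    headers = {}
    for line in header_lines:
        name, _, value = line.partition(":")
        headers[name.strip()] = value.strip()
    return first_line, headers, body


request = build_request("GET", "/someDepartment/home.index",
                        {"Host": "www.someSchool.edu"})
print(request.decode("ascii"))

response = (b"HTTP/1.1 200 OK\r\n"
            b"Content-Type: text/html\r\n"
            b"Content-Length: 5\r\n"
            b"\r\n"
            b"hello")
print(split_message(response))
```

The same split_message helper works for both directions because requests and responses differ only in their first line.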

Request and Status Lines

Due to an error, the rest of this post was wiped out. If someone has a copy of the post from a few days ago, please let me know. Sorry about that.

Harrykar's Techies Blog
