
CHAPTER 1 How the Web Works

OBJECTIVES In this chapter you will learn ...

• The history of the Internet and World Wide Web
• Fundamental concepts and protocols that support the Internet
• About the hardware and software that supports the Internet
• How a web page is actually retrieved and interpreted

This chapter introduces the World Wide Web (WWW). The WWW relies on a number of systems, protocols, and technologies all working together in unison. Before learning about HTML markup, CSS styling, JavaScript, and PHP programming, you must understand how the Internet makes web applications possible. This chapter begins with a brief history of the Internet and provides an overview of key Internet and WWW technologies applicable to the web developer. To truly understand these concepts in depth, one would normally take courses in computer science or information technology (IT) covering networking principles. If you find some of these topics too in-depth or advanced, you may decide to skip over some of the details here and return to them later.

1.1 Definitions and History

The World Wide Web (WWW or simply the Web) is certainly what most people think of when they see the word "Internet." But the WWW is only a subset of the Internet, as illustrated in Figure 1.1.

1.1.1 A Short History of the Internet

The history of telecommunication and data transport is a long one. There is a strategic advantage in being able to send a message as quickly as possible (or at least, more quickly than your competition). The Internet is not alone in providing instantaneous digital communication. Earlier technologies like radio, telegraph, and the telephone provided the same speed of communication, albeit in an analog form. Telephone networks in particular provide a good starting place to learn about modern digital communications. In the telephone networks of old, calls were routed through operators who physically connected caller and receiver by connecting a wire to a switchboard to complete a circuit. These operators were around in some areas for almost a century before being replaced with automatic mechanical switches, which did the same job: physically connect caller and receiver.

One of the weaknesses of having a physical connection is that you must establish a link and maintain a dedicated circuit for the duration of the call. This type of network connection is sometimes referred to as circuit switching and is shown in Figure 1.2.

The problem with circuit switching is that it can be difficult to have multiple conversations simultaneously (which a computer might want to do). It also requires more bandwidth since even the silences are transmitted (that is, unused capacity in the network is not being used efficiently).

FIGURE 1.1 The web as a subset of the Internet

FIGURE 1.2 Telephone network as example of circuit switching

Bandwidth is a measurement of how much data can (maximally) be transmitted along an Internet connection. Normally measured in bits per second (bps), this measurement differs according to the type of Internet access technology you are using.

A dial-up 56-Kbps modem has far less bandwidth than a 10-Gbps fiber optic connection. In the 1960s, as researchers explored digital communications and began to construct the first networks, the research network ARPANET was created. ARPANET did not use circuit switching but instead used an alternative communications method called packet switching. A packet-switched network does not require a continuous connection. Instead it splits the messages into smaller chunks called packets and routes them to the appropriate place based on the destination address. The packets can take different routes to the destination, as shown in Figure 1.3. This may seem a more complicated and inefficient approach than circuit switching, but is in fact more robust (it is not reliant on a single pathway that may fail) and a more efficient use of network resources (since a circuit can communicate data from multiple connections).
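To make this bandwidth comparison concrete, here is a rough back-of-the-envelope calculation (a sketch only; the 5 MB page size is an arbitrary example, and real transfers add protocol overhead and latency):

```python
def transfer_time_seconds(size_bytes: int, bandwidth_bps: int) -> float:
    """Idealized transfer time: size in bits divided by bandwidth in bits/second.
    Ignores protocol overhead, latency, and congestion."""
    return (size_bytes * 8) / bandwidth_bps

# A 5 MB web page (an arbitrary example size)
size = 5 * 1024 * 1024

dialup = transfer_time_seconds(size, 56_000)          # 56-Kbps modem
fiber = transfer_time_seconds(size, 10_000_000_000)   # 10-Gbps fiber

print(f"Dial-up: {dialup:.0f} s")         # 749 s, roughly 12.5 minutes
print(f"Fiber: {fiber * 1000:.3f} ms")    # about 4 ms
```

The same page that takes minutes over dial-up arrives in milliseconds over fiber, which is why bandwidth dominates the user's perception of a "fast" connection.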

This early ARPANET network was funded and controlled by the United States government, and was used exclusively for academic and scientific purposes. The early network started small with just a handful of connected university campuses and research institutions and companies in 1969 and grew to a few hundred by the early 1980s.

At the same time, alternative networks were created like X.25 in 1974, which allowed (and encouraged) business use. USENET, built in 1979, had fewer restrictions still, and as a result grew quickly to 550 hosts by 1981. Although there was growth in these various networks, the inability for them to communicate with each other was a real limitation. To promote the growth and unification of the disparate networks, a suite of protocols was invented to unify the networks. A protocol is the name given to a formal set of publicly available rules that manage data exchange between two points. Communications protocols allow any two computers to talk to one another, so long as they implement the protocol.

FIGURE 1.3 Internet network as example of packet switching (the original message is broken into packets, which may take different routes, and is reassembled from the packets at the destination)

By 1981 protocols for the Internet were published and ready for use.1,2 New networks built in the United States began to adopt the TCP/IP (Transmission Control Protocol/Internet Protocol) communication model (discussed in the next section), while older networks were transitioned over to it.

Any organization, private or public, could potentially connect to this new network so long as they adopted the TCP/IP protocol. On January 1, 1983, TCP/IP was adopted across all of ARPANET, marking the end of the research network that spawned the Internet.3 Over the next two decades, TCP/IP networking was adopted across the globe.

1.1.2 The Birth of the Web

The next decade saw an explosion in the numbers of users, but the Internet of the late 1980s and the very early 1990s did not resemble the Internet we know today. During these early years, email and text-based systems were the extent of the Internet experience.

This transition from the old terminal and text-only Internet of the 1980s to the Internet of today is of course due to the invention and massive growth of the World Wide Web. This invention is usually attributed to the British Tim Berners-Lee (now Sir Tim Berners-Lee), who, along with the Belgian Robert Cailliau, published a proposal in 1990 for a hypertext system while both were working at CERN in Switzerland. Shortly thereafter Berners-Lee developed the main features of the web.4 This early web incorporated the following essential elements that are still the core features of the web today:

• A Uniform Resource Locator (URL) to uniquely identify a resource on the WWW.
• The Hypertext Transfer Protocol (HTTP) to describe how requests and responses operate.
• A software program (later called web server software) that can respond to HTTP requests.
• Hypertext Markup Language (HTML) to publish documents.
• A program (later called a browser) that can make HTTP requests from URLs and that can display the HTML it receives.

HTML will require several chapters to cover in this book. URLs and HTTP are covered in this chapter. This chapter will also provide a little bit of insight into the nature of web server software; Chapter 20 will examine the inner workings of server software in more detail.

So while the essential outline of today's web was in place in the early 1990s, the web as we know it did not really begin until Mosaic, the first popular graphical browser application, was developed at the National Center for Supercomputing Applications at the University of Illinois Urbana-Champaign and released in early 1993 by Eric Bina and Marc Andreessen (who was a computer science undergraduate student at the time). Andreessen later moved to California and cofounded Netscape Communications, which released Netscape Navigator in late 1994.

Navigator quickly became the principal web browser, a position it held until the end of the 1990s, when Microsoft's Internet Explorer (first released in 1995) became the market leader, a position it would hold for over a decade.

Also in late 1994, Berners-Lee helped found the World Wide Web Consortium (W3C), which would soon become the international standards organization that would oversee the growth of the web. This growth was very much facilitated by the decision of CERN to not patent the work and ideas done by its employee and instead leave the web protocols and code-base royalty free.

To illustrate the growth of the Internet, Figure 1.4 graphs the count of hosts connected to the Internet from 1990 until 2010. You can see that the last decade in particular has seen an enormous growth, during which social networks, web services, asynchronous applications, the semantic web, and more have all been created (and will be described fully in due course in this textbook).

FIGURE 1.4 Growth in Internet hosts/servers based on data from the Internet Systems Consortium.5

BACKGROUND

The Request for Comments (RFC) archive lists all of the Internet and WWW protocols, concepts, and standards. It started out as an unofficial repository for ARPANET information and eventually became the de facto official record. Even today new standards are published there.

1.1.3 Web Applications in Comparison to Desktop Applications

The user experience for a website is unlike the user experience for traditional desktop software. The location of data storage, limitations with the user interface, and limited access to operating system features are just some of the distinctions. However, as web applications have become more and more sophisticated, the differences in the user experience between desktop applications and web applications are becoming more and more blurred.

There are a variety of advantages and disadvantages to web-based applications in comparison to desktop applications. Some of the advantages of web applications include:

• Accessible from any Internet-enabled computer.
• Usable with different operating systems and browser applications.
• Easier to roll out program updates since only software on the server needs to be updated and not on every desktop in the organization.
• Centralized storage on the server means fewer security concerns about local storage (which is important for sensitive information such as health care data).

Unfortunately, in the world of IT, for every advantage, there is often a corresponding disadvantage; this is also true of web applications. Some of these disadvantages include:

• Requirement to have an active Internet connection (the Internet is not always available everywhere at all times).
• Security concerns about sensitive private data being transmitted over the Internet.
• Concerns over the storage, licensing, and use of uploaded data.
• Problems with certain websites on certain browsers not looking quite right.
• Restrictions on access to the operating system can prevent software and hardware from being installed or accessed (like Adobe Flash on iOS).

In addition, clients or their IT staff may have additional plugins added to their browsers, which provide added control over their browsing experience, but which might interfere with JavaScript, cookies, or advertisements. We will continually try to address these challenges throughout the book.

BACKGROUND

One of the more common terms you might encounter in web development is the term "intranet" (with an "a"), which refers to an Internet network that is local to an organization or business. Intranet resources are often private, meaning that only employees (or authorized external parties such as customers or suppliers) have access to those resources. Thus Internet (with an "e") is a broader term that encompasses both private (intranet) and public networked resources.

Intranets are typically protected from unauthorized external access via security features such as firewalls or private IP ranges, as shown in Figure 1.5.

Because intranets are private, search engines such as Google have limited or no access to content within them.

Due to this private nature, it is difficult to accurately gauge, for instance, how many web pages exist within intranets, and what technologies are more common in them. Some especially expansive estimates guess that almost half of all web resources are hidden in private intranets.

Being aware of intranets is also important when one considers the job market and market usage of different web technologies. If one focuses just on the public Internet, it will appear that PHP, MySQL, and WordPress are the most commonly used web development stack. But when one adds in the private world of corporate intranets, other technologies such as ASP.NET, JSP, SharePoint, Oracle, SAP, and IBM WebSphere are just as important.

FIGURE 1.5 Intranet versus Internet

1.1.4 Static Websites versus Dynamic Websites

In the earliest days of the web, a webmaster (the term popular in the 1990s for the person who was responsible for creating and supporting a website) would publish web pages and periodically update them. Users could read the pages but could not provide feedback. The early days of the web included many encyclopedic, collection-style sites with lots of content to read (and animated icons to watch).

In those early days, the skills needed to create a website were pretty basic: one needed knowledge of HTML and perhaps familiarity with editing and creating images. This type of website is commonly referred to as a static website, in that it consists only of HTML pages that look identical for all users at all times. Figure 1.6 illustrates a simplified representation of the interaction between a user and a static website.

FIGURE 1.6 Static website (the server retrieves files from its hard drive and "sends" the HTML and then the images to the browser, which displays them)

Within a few years of the invention of the web, sites began to get more complicated as more and more sites began to use programs running on web servers to generate content dynamically. These server-based programs would read content from databases, interface with existing enterprise computer systems, communicate with financial institutions, and then output HTML that would be sent back to the users' browsers. This type of website is called here in this text a dynamic website because the page content is being created at run time by a program created by a programmer; this page content can vary from user to user. Figure 1.7 illustrates a very simplified representation of the interaction between a user and a dynamic website.

So while knowledge of HTML was still necessary for the creation of these dynamic websites, it became necessary to have programming knowledge as well.

And by the late 1990s, other knowledge and skills were becoming necessary, such as CSS, usability, and security.
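As a minimal illustration of what "created at run time" means, the following sketch builds page content per request. It uses Python's standard library purely for brevity (the greeting and timestamp are invented examples, not this book's PHP approach):

```python
from datetime import datetime

def render_page(username: str) -> str:
    """Build HTML at request time, so the page can differ per user and per visit.
    A static site would instead serve the same stored .html file to everyone."""
    now = datetime.now().strftime("%H:%M")
    return (
        "<html><body>"
        f"<h1>Welcome back, {username}!</h1>"
        f"<p>Page generated at {now}.</p>"
        "</body></html>"
    )

print(render_page("alice"))
```

Two different users, or the same user at two different times, receive different HTML; that variability is precisely what distinguishes a dynamic site from a static one.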

FIGURE 1.7 Dynamic website (the browser requests vacation.php; the server runs or interprets the script, whose output is HTML; the server "sends" the generated HTML and the image file to the user's browser, which displays them)

1.1.5 Web 2.0 and Beyond

In the mid-2000s, a new buzzword entered the computer lexicon: Web 2.0. This term had two meanings, one for users and one for developers. For the users, Web 2.0 referred to an interactive experience where users could contribute and consume web content, thus creating a more user-driven web experience. Some of the most popular websites fall into this category: Facebook, YouTube, and Wikipedia. This shift to allow feedback from the user, such as comments on a story, threads in a message board, or a profile on a social networking site, has revolutionized what it means to use a web application.

For software developers, Web 2.0 also referred to a change in the paradigm of how dynamic websites are created. Programming logic, which previously existed only on the server, began to migrate to the browser. This required learning JavaScript, a rather tricky programming language that runs in the browser, as well as mastering the rather difficult programming techniques involved in asynchronous communication.

Web development in the Web 2.0 world is significantly more complicated today than it was even a decade ago. While this book attempts to cover all the main topics in web development, in practice, it is common for a certain division of labor to exist.

The skills to create a good-looking static web page are not the same skill set that is required to write software that facilitates user interactions. Many programmers are poor visual user interface designers, and most designers can't program. This separation of software system and visual user interface is essential to any Web 2.0 application. Chapters on HTML and CSS are essential for learning about layout and design best practices. Later chapters on server and client-side programming build on those design skills, but go far beyond them. To build modern applications you must have both sets of skills on your team.

BACKGROUND

When a system is known by a 1.0 and a 2.0, people invariably speculate on what the 3.0 version will look like. If there is to be a Web 3.0, its form is currently uncertain and still under construction. Some people have, however, argued that Web 3.0 will be something called the semantic web.

Semantic is a word from linguistics that means, quite literally, "meaning." The semantic web thus adds context and meaning to web pages in the form of special markup. These semantic elements would allow search engines and other data mining agents to make sense of the content.

Currently a block of text on the web could be anything: a poem, an article, or a copyright notice. Search engines at present mainly just match the text you are searching for with text in the page, and have to use sophisticated algorithms to try to figure out the meaning of the page. The goal of the semantic web is to make it easier to figure out those meanings, thereby dramatically improving the nature of search on the web. There are currently a number of semi-standardized approaches for adding semantic qualifiers to HTML; some examples include RDF (Resource Description Framework), OWL (Web Ontology Language), and SKOS (Simple Knowledge Organization System).

1.2 Internet Protocols

The Internet exists today because of a suite of interrelated communications protocols. A protocol is a set of rules that partners in communication use when they communicate. We have already mentioned one of these essential Internet protocols, namely TCP/IP.

These protocols have been implemented in every operating system, and make fast web development possible. If web developers had to keep track of packet routing, transmission details, domain resolution, checksums, and more, it would be hard to get around to the matter of actually building websites. Despite the fact that these protocols work behind the scenes for web developers, having some general awareness of what the suite of Internet protocols does for us can at times be helpful.

1.2.1 A Layered Architecture

The TCP/IP Internet protocols were originally abstracted as a four-layer stack.6,7 Later abstractions subdivide it further into five or seven layers.8 Since we are focused on the top layer anyhow, we will use the earliest and simplest four-layer network model shown in Figure 1.8.

Layers communicate information up or down one level, but needn't worry about layers far above or below. Lower layers handle the more fundamental aspects of transmitting signals through networks, allowing the higher layers to think about how a client and server interact. The web requires all layers to operate, although in web development we will focus on the highest layer, the application layer.
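The idea that each layer talks only to the layer directly above or below it can be sketched as nested wrapping (a toy model with made-up header labels, not real frame or segment formats):

```python
# A toy illustration of layering: each layer wraps the data handed down from
# the layer above with its own header, and unwrapping happens in reverse order
# at the receiving end.
def wrap(layer_name: str, payload: str) -> str:
    return f"[{layer_name}]{payload}"

message = "GET /index.html"        # application layer data
segment = wrap("TCP", message)     # transport layer adds sequencing info
packet = wrap("IP", segment)       # Internet layer adds addressing
frame = wrap("ETH", packet)        # link layer frames it for the wire

print(frame)   # [ETH][IP][TCP]GET /index.html
```

The application layer never sees the lower-level headers; each layer only inspects and removes its own, which is what lets web developers ignore everything below HTTP.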

FIGURE 1.8 Four-layer network model

1.2.2 Link Layer

The link layer is the lowest layer, responsible for both the physical transmission across media (wires, wireless) and establishing logical links. It handles issues like packet creation, transmission, reception, error detection, collisions, line sharing, and more. The one term here that is sometimes used in the Internet context is that of MAC (media access control) addresses. These are unique 48- or 64-bit identifiers assigned to network hardware and which are used at the physical networking level.

We will not focus on this layer, although you can learn more in a computer networking course or text.

1.2.3 Internet Layer

The Internet layer (sometimes also called the IP layer) routes packets between communication partners across networks. The Internet layer provides "best effort" communication. It sends out the message to the destination, but expects no reply, and provides no guarantee the message will arrive intact, or at all.

The Internet uses Internet Protocol (IP) addresses to identify destinations on the Internet. As can be seen in Figure 1.9, every device connected to the Internet has an IP address, which is a numeric code that is meant to uniquely identify it.

The details of the IP addresses can be important to a web developer. There are occasions when one needs to track, record, and compare the IP address of a given web request. Online polls, for instance, need to compare IP addresses to ensure the same address does not vote more than once.
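The poll scenario can be sketched with a simple lookup table keyed by IP address (a toy example with an invented address; real polls must also contend with shared and changing addresses):

```python
votes_by_ip: dict[str, str] = {}

def record_vote(ip: str, choice: str) -> bool:
    """Accept a vote only if this IP address has not voted before."""
    if ip in votes_by_ip:
        return False          # duplicate address: reject the vote
    votes_by_ip[ip] = choice
    return True

print(record_vote("142.108.149.36", "yes"))   # True  (first vote from this address)
print(record_vote("142.108.149.36", "no"))    # False (same address, vote rejected)
```

Note that this check is only a heuristic: as discussed later in this section, many computers on a local network can share one external IP address, so one rejected "duplicate" may in fact be a different person.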

FIGURE 1.9 IP addresses and the Internet (every connected device has an IP address)

FIGURE 1.10 IPv4 and IPv6 comparison (IPv4: 2^32 addresses, four 8-bit components, e.g., 192.168.123.254; IPv6: 2^128 addresses, eight 16-bit components, e.g., 3fae:7a10:4545:9:291:e8ff:fe21:37ca)

There are two types of IP addresses: IPv4 and IPv6. IPv4 addresses are the IP addresses from the original TCP/IP protocol. In IPv4, 32 bits are used (implemented as four 8-bit integers), written with a dot between each integer (Figure 1.10).

Since an unsigned 8-bit integer's maximum value is 255, four integers together can encode approximately 4.2 billion unique IP addresses.

Your IP address will generally be assigned to you by your Internet service provider (ISP). In organizations, large and small, purchasing extra IP addresses from the ISP is not cost effective. In a local network, computers can share a single external IP address between them. IP addresses in the range of 192.168.0.0 to 192.168.255.255, for example, are reserved for exactly this local area network use. Your connection therefore might have an internal IP of 192.168.0.15 known only to the internal network, and another public IP address that is your address to the world.

The decision to make IP addresses 32 bits limited the number of hosts to 4.2 billion. As more and more devices connected to the Internet, the supply was becoming exhausted, especially in some local areas that had already distributed their share.

To future-proof the Internet against the 4.2 billion limit, a new version of the IP protocol was created, IPv6. This newer version uses eight 16-bit integers for 2^128 unique addresses, over a billion billion times the number in IPv4. These 16-bit integers are normally written in hexadecimal, due to their longer length. This new addressing system is currently being rolled out with a number of transition mechanisms, making the rollout seamless to most users and even developers. Figure 1.10 compares the IPv4 and IPv6 address schemes.

BACKGROUND

You may be wondering who gives an ISP its IP addresses. The answer is ultimately the Internet Assigned Numbers Authority (IANA). This group is actually a department of ICANN, the Internet Corporation for Assigned Names and Numbers, which is an internationally organized nonprofit organization responsible for the global coordination of IP addresses, domains, and Internet protocols. IANA allocates IP addresses from pools of unallocated addresses to Regional Internet Registries such as AfriNIC (for Africa) or ARIN (for North America).
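The distinction between private and public addresses, and between IPv4 and IPv6, can be explored with Python's standard ipaddress module (a sketch; the addresses are the examples used above):

```python
import ipaddress

# The internal address from the example above falls in the reserved
# 192.168.0.0 to 192.168.255.255 block, so it is private.
internal = ipaddress.ip_address("192.168.0.15")
print(internal.is_private)    # True

# An IPv6 address parses through the same interface; note the 128-bit size.
v6 = ipaddress.ip_address("3fae:7a10:4545:9:291:e8ff:fe21:37ca")
print(v6.version, v6.max_prefixlen)   # 6 128

# The total IPv4 space: four 8-bit integers, i.e., 2**32 addresses.
print(2 ** 32)                # 4294967296, about 4.2 billion
```

Libraries like this are handy whenever a web application needs to classify or compare the request addresses mentioned earlier.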

1.2.4 Transport Layer

The transport layer ensures transmissions arrive in order and without error. This is accomplished through a few mechanisms. First, the data is broken into packets formatted according to the Transmission Control Protocol (TCP). The data in these packets can vary in size from 0 to 64K, though in practice typical packet data size is around 0.5 to 1K. Each data packet has a header that includes a sequence number, so the receiver can put the original message back in order, no matter when they arrive. Secondly, each packet is acknowledged back to the sender, so in the event of a lost packet the transmitter will realize a packet has been lost since no ACK arrived for that packet. That packet is retransmitted, and although out of order, is reordered at the destination, as shown in Figure 1.11. This means you have a guarantee that messages sent will arrive and in order. As a consequence, web developers don't have to worry about pages not getting to the users.

FIGURE 1.11 TCP packets (the message is broken into packets with sequence numbers; for each TCP packet sent, an ACK (acknowledgement) must be received back; the sender eventually resends any packets that didn't get an ACK; the message is reassembled from the packets and ordered according to their sequence numbers)

Sometimes we do not want guaranteed transmission of packets.
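The reassembly step shown in Figure 1.11 can be sketched in a few lines (a toy model of sequence numbering; the 16-character packet size is arbitrary, and real TCP tracks byte offsets rather than chunk indices):

```python
import random

def split_into_packets(message: str, size: int) -> list[tuple[int, str]]:
    """Tag each fixed-size chunk with a sequence number, as TCP headers do."""
    chunks = [message[i:i + size] for i in range(0, len(message), size)]
    return list(enumerate(chunks))

def reassemble(packets: list[tuple[int, str]]) -> str:
    """Packets may arrive in any order; sorting by sequence number
    restores the original message."""
    return "".join(chunk for _, chunk in sorted(packets))

message = "Thou map of woe, that thus dost talk in signs!"
packets = split_into_packets(message, 16)
random.shuffle(packets)                  # simulate packets taking different routes
print(reassemble(packets) == message)    # True
```

Real TCP adds the acknowledgement and retransmission machinery on top of this, but the sequence-number trick is the heart of in-order delivery.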

Consider a live multicast of a soccer game, for example. Millions of subscribers may be streaming the game, and we can't afford to track and retransmit every lost packet. A small loss of data in the feed is acceptable, and the customers will still see the game. An Internet protocol called User Datagram Protocol (UDP) is used in these scenarios in lieu of TCP. Other examples of UDP services include Voice Over IP, many online games, and Domain Name System (DNS).

1.2.5 Application Layer

With the application layer, we are at the level of protocols familiar to most web developers. Application layer protocols implement process-to-process communication and are at a higher level of abstraction in comparison to the low-level packet and IP address protocols in the layers below it.

There are many application layer protocols. A few that are useful to web developers include:

• HTTP. The Hypertext Transfer Protocol is used for web communication.
• SSH. The Secure Shell Protocol allows remote command-line connections to servers.
• FTP. The File Transfer Protocol is used for transferring files between computers.
• POP/IMAP/SMTP. Email-related protocols for transferring and storing email.
• DNS. The Domain Name System protocol is used for resolving domain names to IP addresses.

We will discuss the HTTP and the DNS protocols later in this chapter. SSH will be covered later in the book in the chapter on security.

1.3 The Client-Server Model

The web is sometimes referred to as a client-server model of communications. In the client-server model, there are two types of actors: clients and servers. The server is a computer agent that is normally active 24 hours a day, 7 days a week, listening for queries from any client who makes a request. A client is a computer agent that makes requests and receives responses from the server, in the form of response codes, images, text files, and other data.

1.3.1 The Client

Client machines are the desktops, laptops, smart phones, and tablets you see everywhere in daily life. These machines have a broad range of specifications regarding operating system, processing speed, screen size, available memory, and storage. In the most familiar scenario, client requests for web pages come through a web browser. But a client can be more than just a web browser. When your word processor's help system accesses online resources, it is a client, as is an iOS game that communicates with a game server using HTTP. Sometimes a server web program can even act as a client. For instance, later in Chapter 17, our sample PHP websites will consume web services from service providers such as Flickr and Microsoft; in those cases, our PHP application will be acting as a client.

The essential characteristic of a client is that it can make requests to particular servers for particular resources using URLs and then wait for the response. These requests are processed in some way by the server.

1.3.2 The Server

The server in this model is the central repository, the command center, and the central hub of the client-server model. It hosts web applications, stores user and program data, and performs security authorization tasks. Since one server may serve many thousands, or millions, of client requests, the demands on servers can be high.

A site that stores image or video data, for example, will require many terabytes of storage to accommodate the demands of users.

The essential characteristic of a server is that it is listening for requests, and upon getting one, responds with a message. The exchange of information between the client and server is summarized by the request-response loop.

1.3.3 The Request-Response Loop

Within the client-server model, the request-response loop is the most basic mechanism on the server for receiving requests and transmitting data in response. The client initiates a request to a server and gets a response that could include some resource like an HTML file, an image, or some other data, as shown in Figure 1.12. This response can also contain other information about the request, or the resource provided, such as response codes, cookies, and other data.

FIGURE 1.12 Request-response loop

1.3.4 The Peer-to-Peer Alternative

It may help your understanding to contrast the client-server model with a different network topology. In the peer-to-peer model, shown in Figure 1.13, each computer is functionally identical, and each node is able to send and receive data directly with any other. In such a model, each peer acts as both a client and a server, able to upload and download information. No node is required to be connected 24/7, and with each computer being functionally equal, there is less distinction between peers. The client-server model, in contrast, defines clear and distinct roles for the server. Video chat and BitTorrent protocols are examples of the peer-to-peer model.
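One full turn of the request-response loop can be exercised in a few lines with both halves running locally (a sketch using Python's standard library; real sites use dedicated web server software rather than this toy handler):

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import urlopen

class HelloHandler(BaseHTTPRequestHandler):
    """The server half: listen for a request, respond with a status code,
    headers, and an HTML body."""
    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.end_headers()
        self.wfile.write(b"<html><body><h1>Hello</h1></body></html>")

    def log_message(self, *args):   # silence per-request console logging
        pass

server = HTTPServer(("127.0.0.1", 0), HelloHandler)   # port 0: pick a free port
threading.Thread(target=server.serve_forever, daemon=True).start()

# The client half: make an HTTP request and wait for the response.
with urlopen(f"http://127.0.0.1:{server.server_port}/") as response:
    print(response.status)            # 200
    print(response.read().decode())   # <html><body><h1>Hello</h1></body></html>

server.shutdown()
```

Notice that the response carries both the resource (the HTML body) and metadata about it (the status code and headers), exactly as described above.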

1.3.5 Server Types

In Figure 1.12, the server was shown as a single machine, which is fine from a conceptual standpoint. Clients make requests for resources from a URL; to the client, the server is a single machine.

However, most real-world websites are typically not served from a single server machine, but by many servers. It is common to split the functionality of a website between several different types of server, as shown in Figure 1.14. These include:

Web servers. A web server is a computer servicing HTTP requests. This typically refers to a computer running web server software such as Apache or Microsoft IIS (Internet Information Services).

FIGURE 1.13 Peer-to-peer model

Application servers. An application server is a computer that hosts and executes web applications, which may be created in PHP, ASP.NET, Ruby on Rails, or some other web development technology.

Database servers. A database server is a computer that is devoted to running a Database Management System (DBMS), such as MySQL, Oracle, or SQL Server, that is being used by web applications.

Mail servers. A mail server is a computer creating and satisfying mail requests, typically using the Simple Mail Transfer Protocol (SMTP).

Media servers. A media server (also called a streaming server) is a special type of server dedicated to servicing requests for images and videos. It may run special software that allows video content to be streamed to clients.

Authentication servers. An authentication server handles the most common security needs of web applications. This may involve interacting with local networking resources such as LDAP (Lightweight Directory Access Protocol) or Active Directory.

In smaller sites, these specialty servers are often the same machine as the web server.

FIGURE 1.14 Different types of server

1.3.6 Real-World Server Installations The previous section briefly described the different types of server that one might find in a real-world website. In such a site, not only are there different types of server, but there is often replication of each of the different server types. A busy site can receive thousands or even tens of thousands of requests a second; globally popular sites such as Facebook receive millions of requests a second.

A single web server that is also acting as an application or database server will be hard-pressed to handle more than a few hundred requests a second, so the usual strategy for busier sites is to use a server farm. The goal behind server farms is to distribute incoming requests between clusters of machines so that any given web or data server is not excessively overloaded, as shown in Figure 1.15. Special devices called load balancers distribute incoming requests to available machines.
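The dispatch performed by a load balancer can be sketched as a simple round-robin rotation over a pool of servers. The server names are hypothetical, and real load balancers also weigh factors such as server health and current load; this sketch shows only the distribution idea.

```python
# Round-robin load balancing sketch. Server names are hypothetical.
import itertools

class LoadBalancer:
    """Distributes incoming requests across a pool of servers in turn."""

    def __init__(self, servers):
        self._pool = itertools.cycle(servers)  # endless rotation

    def route(self, request):
        server = next(self._pool)
        # A real balancer would forward `request` to `server` here.
        return server

lb = LoadBalancer(["web1", "web2", "web3"])
assignments = [lb.route(f"req{i}") for i in range(6)]
print(assignments)  # each server receives every third request
```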

Even if a site can handle its load via a single server, it is not uncommon to still use a server farm because it provides failover redundancy; that is, if the hardware fails in a single server, one of the replicated servers in the farm will maintain the site's availability.

In a server farm, the computers do not look like the ones in your house. Instead, these computers are more like the plates stacked in your kitchen cabinets. That is, a farm will have its servers and hard drives stacked on top of each other in server racks. A typical server farm will consist of many server racks, each containing many servers, as shown in Figure 1.16.

FIGURE 1.15 Server farm

Server farms are typically housed in special facilities called data centers. A data center will contain more than just computers and hard drives; sophisticated air conditioning systems, redundant power systems using batteries and generators, and security personnel are all part of a typical data center, as shown in Figure 1.17.

To reduce the potential for site downtime, most large websites are mirrored in data centers in different parts of the country, or even the world. As a consequence, the costs for multiple redundant data centers are quite high (not only due to the cost of the infrastructure but also due to the very large electrical power consumption of data centers), and only larger web companies can afford to create and manage their own. Most web companies will instead lease space from a third-party data center.

The scale of the web farms and data centers for large websites can be astonishingly large. While most companies do not publicize the size of their computing infrastructure, some educated guesses can be made based on publicly known IP address ranges and published records of a company's energy consumption and power usage effectiveness.

For instance, a 2012 estimate argued that Amazon Web Services was using almost half a million servers spread across seven different data centers.9 In 2012, an infrastructure engineer at Amazon using a much more conservative estimation algorithm concluded that Facebook was using about 200,000 servers while Google was using around a million servers.10

FIGURE 1.16 Sample server rack

FIGURE 1.17 Hypothetical data center

BACKGROUND It is also common for the reverse to be true; that is, a single server machine may host multiple sites. Large commercial web hosting companies such as GoDaddy, BlueHost, DreamHost, and others will typically host hundreds or even thousands of sites on a single machine (or mirrored on several servers).

This type of server is sometimes referred to as a virtual server (or virtual private server). In this approach, each virtual server runs its own copy of the operating system and web server software, and thus emulates the operations of a dedicated physical server.

1.4 Where Is the Internet?

It is quite common for the Internet to be visually represented as a cloud, which is perhaps an apt way to think about the Internet given the importance of light and magnetic pulses to its operation. To many people using it, the Internet does seem to lack a concrete physical manifestation beyond our computer and cell phone screens.

But it is important to recognize that our global network of networks does not work using magical water vapor, but is implemented via millions of miles of copper wires and fiber optic cables, as well as hundreds of thousands or even millions of server computers and probably an equal number of routers, switches, and other networked devices, along with many thousands of air conditioning units and specially constructed server rooms and buildings.

HANDS-ON EXERCISES: LAB 1 EXERCISE Tracing a Packet

The big picture of all the networking hardware involved in making the Internet work is far beyond the scope of this text. We should, however, try to provide at least some sense of the hardware that is involved in making the web possible.

1.4.1 From the Computer to the Local Provider Andrew Blum, in his eye-opening book, Tubes: A Journey to the Center of the Internet, tells the reader that he decided to investigate the question "Where is the Internet?" when a hungry squirrel gnawing on some outdoor cable wires disrupted his home connection, thereby making him aware of the real-world texture of the Internet.

While you may not have experienced a similar squirrel problem, for many of us, our main experience of the hardware component of the Internet is that which we experience in our homes. While there are many configuration possibilities, Figure 1.18 does provide an approximate simplification of a typical home to local provider setup.

The broadband modem (also called a cable modem or DSL modem) is a bridge between the network hardware outside the house (typically controlled by a phone or cable company) and the network hardware inside the house. These devices are often supplied by the ISP.

FIGURE 1.18 Internet hardware from the home computer to the local Internet provider

The wireless router is perhaps the most visible manifestation of the Internet in one's home, in that it is a device we typically must purchase and install. Routers are in fact among the most important and ubiquitous hardware devices that make the Internet work. At its simplest, a router is a hardware device that forwards data packets from one network to another. When the router receives a data packet, it examines the packet's destination address and then forwards it toward that destination along the best available path.

A router uses a routing table to help determine where a packet should be sent.

It is a table of connections between target addresses and the node (typically another router) to which the router can deliver the packet. In Figure 1.19, the different routing tables use next-hop routing, in which the router only knows the address of the next step of the path to the destination; it leaves it to that next step to continue routing the packet to the appropriate destination. The packet thus makes a series of successive hops until it reaches its destination.

FIGURE 1.19 Simplified routing tables

There are a lot of details that have been left out of this illustration. Routers will make use of subnet masks, timestamps, distance metrics, and routing algorithms to supplement or even replace routing tables; but those are all topics for a network architecture course.
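The next-hop lookup described above can be sketched with a toy routing table. The addresses loosely echo those in Figure 1.19 but are otherwise illustrative; real routers match the destination against many thousands of prefixes using longest-prefix matching, which this sketch imitates in miniature.

```python
# Toy next-hop routing table. Addresses are illustrative only.
import ipaddress

routing_table = {
    ipaddress.ip_network("142.109.0.0/16"): "65.47.242.9",
    ipaddress.ip_network("209.202.0.0/16"): "66.37.223.130",
}
DEFAULT_NEXT_HOP = "140.239.191.1"  # the 0.0.0.0 catch-all route

def next_hop(destination):
    """Return the next router to forward a packet toward `destination`."""
    addr = ipaddress.ip_address(destination)
    # Pick the most specific (longest-prefix) matching network.
    matches = [net for net in routing_table if addr in net]
    if not matches:
        return DEFAULT_NEXT_HOP
    best = max(matches, key=lambda net: net.prefixlen)
    return routing_table[best]

print(next_hop("142.109.149.46"))  # forwarded toward a known network
print(next_hop("8.8.8.8"))         # no match: default route
```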

Once we leave the confines of our own homes, the hardware of the Internet becomes much murkier. In Figure 1.18, the various neighborhood broadband cables (which typically use copper, aluminum, or other metals) are aggregated and connected to fiber optic cable via fiber connection boxes. Fiber optic cable (or simply optical fiber) is a glass-based wire that transmits light and has significantly greater bandwidth and speed than metal wire. In some cities (or large buildings), fiber optic cable may run directly into individual buildings; in such a case the fiber junction box will reside in the building.

These fiber optic cables eventually make their way to an ISP's head-end, a facility that contains a cable modem termination system (CMTS) or, in a DSL-based system, a digital subscriber line access multiplexer (DSLAM). This is a special type of very large router that connects and aggregates subscriber connections to the larger Internet. These different head-ends may connect directly to the wider Internet, or instead connect to a master head-end, which provides the connection to the rest of the Internet.

1.4.2 From the Local Provider to the Ocean's Edge Eventually your ISP has to pass on your requests for Internet packets to other networks. This intermediate step typically involves one or more regional network hubs.

Your ISP may have a large national network with optical fiber connecting most of the main cities in the country. Some countries have multiple national or regional networks, each with its own optical network. Canada, for instance, has three national networks that connect the major cities in the country and to a couple of the major Internet exchange points in the United States, along with several provincial networks that connect smaller cities within one or two provinces. Alternatively, a smaller regional ISP may have transit arrangements with a larger national network (that is, it leases the use of part of that network's optical fiber bandwidth).

A general principle in network design is that the fewer the router hops (and thus the more direct the path), the quicker the response. Figure 1.20 illustrates some hypothetical connections between several different networks spread across four countries. As you can see, just like in the real world, the countries in the illustration differ in their degree of internal and external interconnectedness.

The networks in Country A are all interconnected, but rely on Network A1 to connect them to the networks in Countries B and C. Network B1 has many connections to other countries' networks. The networks within Countries C and D are not interconnected, and thus rely on connections to international networks in order to transfer information between the two domestic networks. For instance, even though the actual distance between a node in Network C1 and a node in C2 might be only a few miles, those packets might have to travel many hundreds or even thousands of miles through networks A1 and/or B1.

FIGURE 1.20 Connecting different networks within and between countries

Clearly this is an inefficient system, but it is a reasonable approximation of the state of the Internet in the late 1990s (and in some regions of the world this is still the case), when almost all Internet traffic went through a few Network Access Points (NAPs), most of which were in the United States.

This type of network configuration began to change in the 2000s, as more and more networks began to interconnect with each other using an Internet exchange point (IX or IXP). These IXPs allow different ISPs to peer with one another (that is, interconnect) in a shared facility, thereby improving performance for each partner in the peer relationship.

Figure 1.21 illustrates how the configuration shown in Figure 1.20 changes with the use of IXPs.

As you can see, IXPs provide a way for networks within a country to interconnect. Now networks in Countries C and D no longer need to make hops out of their country for domestic communications. Notice as well that each of the IXPs connects not just with networks within its own country, but with other countries' networks as well. Multiple paths between IXPs provide a powerful way to handle outages and keep packets flowing. Another key strength of IXPs is that they provide an easy way for networks to connect to many other networks at a single location.11

FIGURE 1.21 National and regional networks using Internet exchange points

As you can see in Figure 1.22, different networks connect not only to other networks within an IXP, but now large websites such as Microsoft and Facebook also connect to multiple other networks simultaneously as a way of improving the performance of their sites. Real IXPs, such as those at Palo Alto (PAIX), Amsterdam (AMS-IX), Frankfurt (DE-CIX), and London (LINX), allow many hundreds of networks and companies to interconnect and have throughput of over 1000 gigabits per second. The scale of peering in these IXPs is far beyond that shown in Figure 1.22 (which shows peering with only five others); companies within these IXPs use large routers from Cisco and Brocade that have hundreds of ports, allowing hundreds of simultaneous peering relationships.

In recent years, major web companies have joined the network companies in making use of IXPs. As shown in Figure 1.23, this sometimes involves mirroring (duplicating) a site's infrastructure (i.e., web and data servers) in a data center located near the IXP. For instance, Equinix Ashburn IX in Ashburn, Virginia, is surrounded by several gigantic data centers just across the street from the IXP.

FIGURE 1.22 Hypothetical Internet exchange point

FIGURE 1.23 IXPs and data centers

This concrete geography of the digital world encapsulates an arrangement that benefits both the networks and the web companies. The website gains incremental speed enhancements (by reducing the travel distance for these sites) across all the networks it is peered with at the IXP, while the network gains improved performance for its customers when they visit the most popular websites.

1.4.3 Across the Oceans Eventually, international Internet communication will need to travel underwater.

The amount of undersea fiber optic cable is quite staggering and is growing yearly.

As can be seen in Figure 1.24, over 250 undersea fiber optic cable systems operated by a variety of different companies span the globe. For places not serviced by undersea cable (such as Antarctica, much of the Canadian Arctic islands, and other small islands throughout the world), Internet connectivity is provided by orbiting satellites. It should be noted that satellite links (which have smaller bandwidth in comparison to fiber optic) account for an exceptionally small percentage of overseas Internet communication.

FIGURE 1.24 Undersea fiber optic cables (courtesy TeleGeography/www.submarinecablemap.com)

HANDS-ON EXERCISES: LAB 1 EXERCISE Name Servers

1.5 Domain Name System

Back in Section 1.2, you learned about IP addresses and how they are an essential feature of how the Internet works. As elegant as IP addresses may be, human beings do not enjoy having to recall long strings of numbers. One can imagine how unpleasant the Internet would be if you had to remember IP addresses instead of domains. Rather than google.com, you'd have to type 173.194.33.32. If you had to type in 69.171.237.24 to visit Facebook, it is quite likely that social networking would be a less popular pastime.

Even as far back as the days of ARPANET, researchers assigned domain names to IP addresses. In those early days, the number of Internet hosts was small, so a list of a few hundred domain and IP addresses could be downloaded as needed from the Stanford Research Institute (now SRI International) as a hosts file (see Pro Tip).

Those key-value pairs of domain names and IP addresses allowed people to use the domain name rather than the IP address.12 As the number of computers on the Internet grew, this hosts file had to be replaced with a better, more scalable, and distributed system. This system is called the Domain Name System (DNS) and is shown in its most simplified form in Figure 1.25.

FIGURE 1.25 DNS overview

DNS is one of the core systems that make an easy-to-use Internet possible (DNS is used for email as well). The DNS system has another benefit besides ease of use.

By separating the domain name of a server from its IP location, a site can move to a different location without changing its name. This means that sites and email systems can move to larger and more powerful facilities without disrupting service.

Since the entire request-response cycle can take less than a second, it is easy to forget that DNS requests are happening in all your web and email applications.

Awareness and understanding of the DNS system is essential for success in developing, securing, deploying, troubleshooting, and maintaining web systems.
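Every networked program performs these lookups, usually implicitly. The sketch below asks the operating system's DNS resolver for an address directly; it resolves localhost so it works without network access, but any domain name could be substituted.

```python
# Ask the OS resolver (the same machinery browsers use) for an address.
import socket

ip = socket.gethostbyname("localhost")  # IPv4 lookup via the resolver
print(ip)  # typically 127.0.0.1, the loopback address
```

This single call hides the entire cache-and-query process described later in Section 1.5.3, which is exactly why DNS is so easy to forget about.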

A remnant of those earliest days still exists on most modern computers, namely the hosts file. Inside that file (on Unix systems typically at /etc/hosts) you will see domain name mappings in the following format:

127.0.0.1    localhost SomeLocalDomainName.com

This mechanism will be used in this book to help us develop websites on our own computers with real domain names in the address bar.
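The hosts file format is simple enough to parse in a few lines. This sketch (the sample entries are illustrative, not from any real hosts file) builds the same kind of domain-to-IP mapping the operating system consults before querying DNS:

```python
# Parse hosts-file text into a {domain: ip} mapping. Sample data only.
def parse_hosts(text):
    """Each non-comment line is: IP address, then one or more names."""
    mapping = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # strip comments/whitespace
        if not line:
            continue
        ip, *names = line.split()
        for name in names:
            mapping[name.lower()] = ip  # domain names are case-insensitive
    return mapping

sample = """
127.0.0.1   localhost somelocaldomainname.com
# development machine (illustrative)
192.168.1.10  dev.example.test
"""
table = parse_hosts(sample)
print(table["somelocaldomainname.com"])
```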

The same hosts file mechanism could also allow a malicious user to reroute traffic destined for a particular domain. If a malicious user ran a server at 203.0.113.1, they could modify a user's hosts file to make facebook.com point to that malicious server. The end client would then type facebook.com into the browser, and instead of routing that traffic to the legitimate facebook.com servers, it would be sent to the malicious site, where the programmer could phish, or steal data.

203.0.113.1    facebook.com

For this reason many system administrators and most modern operating systems do not allow access to this file without an administrator password.

1.5.1 Name Levels A domain name can be broken down into several parts, which represent a hierarchy, with the rightmost parts being closest to the root at the "top" of the Internet naming hierarchy. All domain names have at least a top-level domain (TLD) name and a second-level domain (SLD) name. Most websites also maintain a third-level www subdomain and perhaps others. Figure 1.26 illustrates a domain with four levels.

The rightmost portion of the domain name (to the right of the rightmost period) is called the top-level domain. For the top level of a domain, we are limited to two broad categories, plus a third reserved for other use. They are:

FIGURE 1.26 Domain levels (for server1.www.funwebdev.com, from most general to most specific: top-level domain com, second-level domain funwebdev, third-level domain www, fourth-level domain server1)

Generic top-level domain (gTLD)
o Unrestricted. TLDs include .com, .net, .org, and .info.
o Sponsored. TLDs include .gov, .mil, .edu, and others. These domains can have requirements for ownership, and thus new second-level domains must have permission from the sponsor before acquiring a new address.

o New. From January to May of 2012, companies and individuals could submit applications for new TLDs. TLD application results were announced in June 2012, and include a wide range of both contested and single-applicant domains. These include corporate ones like .apple, .google, and .macdonalds, and contested ones like .buy, .news, and .music.13

Country code top-level domain (ccTLD)
o TLDs include .us, .ca, .uk, and .au. At the time of writing, there were 252 codes registered.14 These codes are under the control of the countries they represent, which is why each is administered differently. In the United Kingdom, for example, commercial entities and businesses must register subdomains of co.uk rather than second-level domains directly.

In Canada .ca domains can be obtained by any person, company, or organization living or doing business in Canada. Other countries have peculiar extensions with commercial viability (such as .tv for Tuvalu) and have begun allowing unrestricted use to generate revenue.

o Since some nations use non-Western characters in their native languages, the concept of the internationalized top-level domain name (IDN) has also been tested with great success in recent years. Some IDNs include Greek, Japanese, and Arabic domains (among others), which have test domains at http://παράδειγμα.δοκιμή, http://例え.テスト, and http://مثال.إختبار, respectively.

arpa
o The domain .arpa was the first assigned top-level domain. It is still assigned and used for reverse DNS lookups (i.e., finding the domain name for an IP address).

In a domain like funwebdev.com, the ".com" is the top-level domain and funwebdev is called the second-level domain. Normally it is the second-level domain that one registers.

There are few restrictions on second-level domains aside from those imposed by the registrar (defined in the next section). Except for internationalized domain names, we are restricted to the characters A-Z, 0-9, and the "-" character. Since domain names are case-insensitive, the characters a-z can be used interchangeably.

The owner of a second-level domain can elect to have subdomains, in which case those subdomains are prepended to the base hostname. For example, we could create exam-answers.funwebdev.com as a domain name, where exam-answers is the subdomain (don't bother checking ... it doesn't exist).

We could go further, creating sub-subdomains if we wanted to. Each further level of subdomain is prepended to the front of the hostname, allowing third-level, fourth-level, and deeper subdomains. This can be used to identify individual computers on a network all within a domain.
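The level structure described above can be made concrete by splitting a hostname on its dots and reading the labels from right to left. This sketch reuses the chapter's server1.www.funwebdev.com example:

```python
# Split a hostname into its levels, most general (TLD) first.
def domain_levels(hostname):
    """Return labels ordered from the TLD down to the most specific."""
    labels = hostname.lower().rstrip(".").split(".")
    return list(reversed(labels))

levels = domain_levels("server1.www.funwebdev.com")
print(levels)  # TLD, then second-, third-, and fourth-level labels
```

Here levels[0] is the top-level domain, levels[1] the second-level domain one would register, and each further label is a subdomain prepended by its owner.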

1.5.2 Name Registration As we have seen, domain names provide a human-friendly way to identify computers on the Internet. How then are domain names assigned? Special organizations or companies called domain name registrars manage the registration of domain names.

These domain name registrars are given permission to do so by the appropriate generic top-level domain (gTLD) registry and/or a country code top-level domain (ccTLD) registry.

In the 1990s, a single company (Network Solutions Inc.) handled the .com, .net, and .org registries. By 1999, the name registration system changed to a market system in which multiple companies could compete in the domain name registration business. A single organization, the nonprofit Internet Corporation for Assigned Names and Numbers (ICANN), still oversees the management of top-level domains, accredits registrars, and coordinates other aspects of DNS. At the time of writing this chapter, there were almost 1000 different ICANN-accredited registrars worldwide.

Figure 1.27 illustrates the process involved in registering a domain name.

1.5.3 Address Resolution While domain names are certainly an easier way for users to reference a website, eventually your browser needs to know the IP address of the website in order to request any resources from it. DNS provides a mechanism for software to discover this numeric IP address. This process is referred to here as address resolution.

As shown back in Figure 1.25, when you request a domain name, a computer called a domain name server will return the IP address for that domain. With that IP address, the browser can then make a request for a resource from the web server for that domain.

FIGURE 1.27 Domain name registration process: (1) decide on a top-level domain (.com) and a second-level domain name; (2) choose a domain registrar or a reseller (a company, such as a web host, that works with a registrar); (3) complete the registration procedures, which include WHOIS contact information (including DNS information) and payment; (4) the registrar sends the DNS information for the domain to the TLD (.com) registry's name server; (5) enjoy the new domain, since you have now purchased the rights to use it.

While Figure 1.25 provides a clear overview of the address resolution process, it is quite simplified. What actually happens during address resolution is more complicated, as can be seen in Figure 1.28.

DNS is sometimes referred to as a distributed database system of name servers.

Each server in this system can answer, or look for the answer to, questions about domains, caching results along the way. From a client's perspective, this is like a phonebook, mapping a unique name to a number.

Figure 1.28 is one of the more complicated ones in this text, so let's examine the address resolution process in more detail.

1. The resolution process starts at the user's computer. When the domain www.funwebdev.com is requested (perhaps by clicking a link or typing in a URL), the browser will begin by seeing if it already has the IP address for the domain in its cache. If it does, it can jump straight to the final step and request the resource (step 10 in the diagram).

2. If the browser doesn't know the IP address for the requested site, it will delegate the task to the DNS resolver, a software agent that is part of the operating system. The DNS resolver also keeps a cache of frequently requested domains; if the requested domain is in its cache, then the process jumps to step 9, and the cached IP address is returned to the browser.

FIGURE 1.28 Domain name address resolution process

3. Otherwise, it must ask for outside help, which in this case is a nearby DNS server, a special server that processes DNS requests. This might be a computer at your Internet service provider (ISP) or at your university or corporate IT department. The address of this local DNS server is usually stored in the network settings of your computer's operating system, as can be seen in Figure 1.9. This server keeps a more substantial cache of domain name/IP address pairs. If the requested domain is in its cache, the cached address is returned to the resolver and the process jumps ahead.

4. If the local DNS server doesn't have the IP address for the domain in its cache, then it must ask other DNS servers for the answer. Thankfully, the domain system has a great deal of redundancy built into it. This means that in general there are many servers that have the answers for any given DNS request. This redundancy exists not only at the local level (for instance, in Figure 1.28, the ISP has a primary DNS server and an alternative one as well) but at the global level as well.

5. If the local DNS server cannot find the answer to the request from an alternate DNS server, then it must get it from the appropriate top-level domain (TLD) name server. For funwebdev.com this is the .com TLD name server. Our local DNS server might already have the addresses of the appropriate TLD name servers in its cache. In such a case, the process can jump to step 7.

6. If the local DNS server does not already know the address of the requested TLD server (for instance, when the local DNS server is first starting up it won't have this information), then it must ask a root name server for that information. The DNS root name servers store the addresses of TLD name servers. IANA (the Internet Assigned Numbers Authority) authorizes 13 root servers, so all root requests will go to one of these 13 roots. In practice, these 13 machines are mirrored and distributed around the world (see http://www.root-servers.org/ for an interactive illustration of the current root servers); at the time of writing there were a total of 350 root server machines.

With the creation of new commercial top-level domains in 2012, approximately 2000 new TLDs will be coming online; this will create a heavier load on these root name servers.

7. After receiving the address of the TLD name server for the requested domain, the local DNS server can now ask the TLD name server for the address of the requested domain. As part of the domain registration process (see Figure 1.27), the addresses of the domain's DNS servers are sent to the TLD name servers, so this is the information that is returned to the local DNS server in this step.

8. The user's local DNS server can now ask the DNS server (also called a second-level name server) for the requested domain (www.funwebdev.com); it should receive the correct IP address of the web server for that domain.

This address will be stored in its own cache so that future requests for this domain will be speedier. That IP address can finally be returned to the DNS resolver in the requesting computer, as shown in step 9.

9. The browser will eventually receive the correct IP address for the requested domain. Note: if the local DNS server was unable to find the IP address, it would return a failed response, which in turn would cause the browser to display an error message.

10. Now that it knows the desired IP address, the browser can finally send out the request to the web server, which should respond with the requested resource.

This process may seem overly complicated, but in practice it happens very quickly because DNS servers cache results. Once the server resolves funwebdev.com, subsequent requests for resources on funwebdev.com will be faster, since we can use the locally stored answer for the IP address rather than have to start over again at the root servers.

To facilitate system-wide caching, all DNS records contain a time to live (TTL) field, recommending how long to cache the result before requerying the name server. Although this mechanism improves the efficiency and response time of the DNS system, it has the consequence of delaying the propagation of changes throughout all servers. This is why administrators, after updating a DNS entry, must wait for the change to propagate to all client ISP caches.
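The TTL-based caching just described can be sketched in a few lines. The class below is a toy in-memory resolver cache, not any real resolver's implementation, and the IP address and TTL values are made-up placeholders:

```python
import time

class DnsCache:
    """Toy DNS cache: maps hostname -> (ip, expiry time)."""

    def __init__(self):
        self._store = {}

    def put(self, host, ip, ttl_seconds):
        # Record the answer along with the moment it stops being valid.
        self._store[host] = (ip, time.time() + ttl_seconds)

    def get(self, host):
        # Return the cached IP, or None if absent or past its TTL.
        entry = self._store.get(host)
        if entry is None:
            return None
        ip, expires = entry
        if time.time() >= expires:
            del self._store[host]  # stale record: caller must requery
            return None
        return ip

cache = DnsCache()
cache.put("funwebdev.com", "192.0.2.17", ttl_seconds=3600)   # placeholder IP
cache.put("stale.example", "192.0.2.99", ttl_seconds=-1)     # already expired
```

A real local DNS server works the same way in spirit: a hit inside the TTL is answered locally, while an expired record forces a fresh walk from the root servers down.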

For more hands-on practice with the Domain Name System, please refer to Chapter 19 on Deployment.

Every web developer should understand the practice of pointing a domain's name servers at the web server hosting the site. Quite often, domain registrars convince customers to purchase hosting together with their domain. Since most users are unaware of the distinction, they do not realize that the company from which you buy web space does not need to be the same place where you registered the domain. The name server records can be updated at the registrar to point to any name servers you use. Within 48 hours, the IP-to-domain-name mapping should have propagated throughout the DNS system, so that anyone typing the newly registered domain gets directed to your web server.

1.6 Uniform Resource Locators In order to allow clients to request particular resources from the server, a naming mechanism is required so that the client knows how to ask the server for the file.

For the web that naming mechanism is the Uniform Resource Locator (URL). As illustrated in Figure 1.29, it consists of two required components: the protocol used to connect, and the domain (or IP address) to connect to. Optional components of the URL are the path (which identifies a file or directory to access on that server), the port to connect to, a query string, and a fragment identifier.

1.6.1 Protocol The first part of the URL is the protocol that we are using. Recall that in Section 1.2 we listed several application layer protocols in the TCP/IP stack. Many of those protocols can appear in a URL, where they define which application protocol to use.

Requesting ftp://example.com/abc.txt sends out an FTP request on port 21, while http://example.com/abc.txt would transmit on port 80.
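The protocol-to-default-port rule can be sketched briefly; here `DEFAULT_PORTS` lists a handful of well-known IANA defaults, and `port_for` is a hypothetical helper written for illustration, not part of any library:

```python
from urllib.parse import urlsplit

# A few well-known default ports (assigned by IANA).
DEFAULT_PORTS = {"http": 80, "https": 443, "ftp": 21}

def port_for(url):
    """Return the port a request to this URL would use: an explicit
    :port in the URL wins; otherwise the scheme's default applies."""
    parts = urlsplit(url)
    return parts.port if parts.port is not None else DEFAULT_PORTS[parts.scheme]
```

So `port_for("ftp://example.com/abc.txt")` yields 21 and `port_for("http://example.com/abc.txt")` yields 80, matching the two requests described above; an explicit port (Section 1.6.3) overrides the default.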

FIGURE 1.29 URL components: http://www.funwebdev.com/index.php?page=17#article, broken into the protocol (http), domain (www.funwebdev.com), path (/index.php), query string (page=17), and fragment (article).

1.6.2 Domain The domain identifies the server from which we are requesting resources. Since the DNS system is case insensitive, this part of the URL is case insensitive. Alternatively, an IP address can be used for the domain.

1.6.3 Port The optional port attribute allows us to specify connections to ports other than the defaults defined by the IANA authority. A port is a type of software connection point used by the underlying TCP/IP protocol and the connecting computer. If the IP address is analogous to a building address, the port number is analogous to the door number for the building.

Although the port attribute is not commonly used in production sites, it can be used to route requests to a test server, to perform a stress test, or even to circumvent Internet filters. If no port is specified, the protocol component of a URL determines which port to use.

The syntax for the port is to add a colon after the domain, then specify an integer port number. Thus, for instance, to connect to our server on port 888 we would specify the URL as http://funwebdev.com:888/.

1.6.4 Path The path is a familiar concept to anyone who has ever used a computer file system.

The root of a web server corresponds to a folder somewhere on that server. On many Linux servers that path is /var/www/html/ or something similar (for Windows IIS machines it is often /inetpub/wwwroot/). The path is case sensitive, though on Windows servers it can be case insensitive.

The path is optional. However, when requesting a folder or the top-level page of a domain, the web server will decide which file to send you. On Apache servers it is generally index.html or index.php. Windows servers sometimes use Default.html or Default.aspx. The default names can always be configured and changed.

1.6.5 Query String Query strings will be covered in depth when we learn more about HTML forms and server-side programming. They are a way of passing information such as user form input from the client to the server. In URLs, they are encoded as key-value pairs delimited by "&" symbols and preceded by the "?" symbol. The components of a query string encoding a username and password are illustrated in Figure 1.30.
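The key-value encoding just described can be demonstrated with Python's standard library (shown here as a sketch; server-side languages like PHP perform the decoding step automatically):

```python
from urllib.parse import urlencode, parse_qs

# Encode two form fields (the same sample pair used in Figure 1.30)
# into key=value pairs joined by "&".
qs = urlencode({"username": "john", "password": "abcdefg"})

# A server-side program performs the reverse operation, recovering
# each key and its value(s).
decoded = parse_qs(qs)
```

Note that `parse_qs` maps each key to a *list* of values, since a key may legitimately repeat in a query string.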

FIGURE 1.30 Query string components: ?username=john&password=abcdefg, showing the keys (username, password), the values (john, abcdefg), and the "?", "=", and "&" delimiters.

HANDS-ON EXERCISES: LAB 1 EXERCISE, Seeing HTTP Headers

1.6.6 Fragment The last part of a URL is the optional fragment. This is used as a way of requesting a portion of a page. Browsers will see the fragment in the URL, seek out the fragment tag anchor in the HTML, and scroll the website down to it. Many early websites would have one page with links to content within that page using fragments and "back to top" links in each section.
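All five URL components from Figure 1.29 can be pulled apart with a standard URL parser; this Python sketch uses the chapter's sample URL (most languages ship an equivalent library):

```python
from urllib.parse import urlsplit

# Split the sample URL from Figure 1.29 into its components.
parts = urlsplit("http://www.funwebdev.com/index.php?page=17#article")
```

Each attribute of `parts` corresponds to one labeled component in the figure: `scheme` is the protocol, `netloc` the domain, then `path`, `query`, and `fragment`.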

1.7 Hypertext Transfer Protocol There are several layers of protocols in the TCP/IP model, each one building on the lower ones until we reach the highest level, the application layer, which allows for many different types of services, like Secure Shell (SSH), File Transfer Protocol (FTP), and the World Wide Web's protocol, the Hypertext Transfer Protocol (HTTP).

While the details of many of the application layer protocols are beyond the scope of this text, some, like HTTP, are an essential part of the web and hence require a deep understanding for a developer to build atop them successfully. We will come back to the HTTP protocol at various times in this book; each time we will focus on a different aspect of it. Here, however, we will just provide an overview of its main points.

HTTP establishes a TCP connection on port 80 (by default). The server waits for a request, and then responds with a response code, headers, and an optional message (which can include files), as shown in Figure 1.31.
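To make the request format concrete, here is a minimal sketch of how a client assembles the text of a GET request before writing it to the TCP connection. The host and path are illustrative values, and `build_get_request` is a helper invented for this example:

```python
CRLF = "\r\n"  # HTTP lines are terminated with carriage return + line feed

def build_get_request(host, path):
    lines = [
        f"GET {path} HTTP/1.1",  # request line: method, resource, version
        f"Host: {host}",         # required in HTTP/1.1 (see Section 1.7.1)
        "Connection: close",     # ask the server to close after responding
    ]
    # A blank line after the headers marks the end of the request.
    return CRLF.join(lines) + CRLF + CRLF

request = build_get_request("example.com", "/index.html")
```

The resulting string matches the shape of the sample request in Figure 1.31: a request line, a handful of headers, and a terminating blank line.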

The user experience for a website is unlike a user experience for traditional desktop software. Users do not download software; they visit a URL. While we as web users might be tempted to think of an entire page being returned in a single HTTP response, this is not in fact what happens.

In reality the experience of seeing a single web page is facilitated by the client's browser, which requests the initial HTML page, then parses the returned HTML to find all the resources referenced from within it, like images, style sheets, and scripts.

Only when all the files have been retrieved is the page fully loaded for the user, as shown in Figure 1.32. A single web page can reference dozens of files and requires many HTTP requests and responses.

The fact that a single web page requires multiple resources, possibly from different domains, is the reality we must work with and be aware of. Modern browsers provide the developer with tools that can help us understand the HTTP traffic for a given page. Figure 1.33 shows a screen from the Firefox plugin FireBug (an HTML/JavaScript debugger), which lists the resources requested for the current page and the breakdown of the load times for each component.

FIGURE 1.31 HTTP illustrated. A sample request:

GET /index.html HTTP/1.1
Host: example.com
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:15.0) Gecko/20100101 Firefox/15.0.1
Accept: text/html,application/xhtml+xml
Accept-Language: en-us,en;q=0.5
Accept-Encoding: gzip, deflate
Connection: keep-alive
Cache-Control: max-age=0

and the corresponding response:

HTTP/1.1 200 OK
Date: Mon, 22 Oct 2012 02:43:49 GMT
Server: Apache
Vary: Accept-Encoding
Content-Encoding: gzip
Content-Length: 4538
Connection: close
Content-Type: text/html; charset=UTF-8
...

FIGURE 1.32 Browser parsing HTML and making subsequent requests: (1) the browser requests GET /vacation.html; (2) for each resource referenced in the HTML, the browser makes additional requests; (3) when all resources have arrived, the browser can lay out and display the page to the user.

FIGURE 1.33 Distribution of load times (a FireBug Net panel screenshot listing each requested resource and the breakdown of its load time).

1.7.1 Headers Headers are sent in the request from the client and received in the response from the server. These encode the parameters for the HTTP transaction, meaning they define what kind of response the server will send. Headers are one of the most powerful aspects of HTTP, and unfortunately few developers spend any time learning about them. Although there are dozens of headers, we will cover a few of the essential ones to give you a sense of what type of information is sent with each and every request.

Request headers include data about the client machine (that is, your personal computer). Web developers can use this information for analytics and for site customization. Some of these include:

Host. The host header was introduced in HTTP 1.1, and it allows multiple websites to be hosted off the same IP address. Since requests for different domains can arrive at the same IP, the host header tells the server which domain at this IP address we are interested in.

User-Agent. The User-Agent string is the most referenced header in modern web development. It tells us what kind of operating system and browser the user is running. Figure 1.34 shows a sample string and the components encoded within. These strings can be used to switch between different style sheets and to record statistical data about the site's visitors.

FIGURE 1.34 User-Agent components: Mozilla/6.0 (Windows NT 6.2; WOW64; rv:16.0.1) Gecko/20121011 Firefox/16.0.1, encoding the browser, the operating system, additional details (32/64 bit, build versions), the Gecko build date, and the Firefox version.

Accept. The Accept header tells the server what kind of media types the client can receive in the response. The server must adhere to these constraints and not transmit data types that are not acceptable to the client. A text browser, for example, may not accept attachment binaries, whereas a graphical browser can do so.

Accept-Encoding. The Accept-Encoding headers specify what types of modifications can be done to the data before transmission. This is where a browser can specify that it can unzip or "deflate" files compressed with certain algorithms. Compressed transmission reduces bandwidth usage, but is only useful if the client can actually deflate and see the content.

Connection. This header specifies whether the server should keep the connection open or close it after the response. Although the server may abide by the request, a response Connection header can terminate a session even if the client requested that it stay open.

Cache-Control. The Cache-Control header allows the client to control caching mechanisms.

This header can specify, for example, to only download the data if it is newer than a certain age, to never redownload if cached, or to always redownload. Proper use of the Cache-Control header can greatly reduce bandwidth.
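Several of the request headers above (Accept, Accept-Language, Accept-Encoding) share a comma-separated format in which each item may carry a "q" quality value expressing preference, with a missing q defaulting to 1.0. A small sketch of a parser for this format (a simplified reading of the HTTP grammar, not a full implementation):

```python
def parse_accept(header_value):
    """Parse an Accept-style header into (item, q) pairs,
    most preferred first. A missing q defaults to 1.0."""
    items = []
    for part in header_value.split(","):
        fields = part.strip().split(";")
        name = fields[0].strip()
        q = 1.0
        for param in fields[1:]:
            key, _, value = param.strip().partition("=")
            if key == "q":
                q = float(value)
        items.append((name, q))
    # Stable sort: equal-q items keep their original order.
    return sorted(items, key=lambda pair: pair[1], reverse=True)

prefs = parse_accept("en-us,en;q=0.5")
```

Applied to the Accept-Language value from Figure 1.31, this says the client prefers en-us outright and will accept plain en at half strength.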

Response headers have information about the server answering the request and the data being sent. Some of these include:

Server. The Server header tells the client about the server. It can include what type of operating system the server is running as well as the web server software that it is using.

The Server header can provide additional information to hackers about your infrastructure. If, for example, you are running a vulnerable version of a plugin, and your Server header declares that information to any client that asks, you could be scanned, and subsequently attacked, based on that header alone. For this reason, many administrators limit this field to as little information as possible.

Last-Modified. Last-Modified contains information about when the requested resource last changed. A static file that does not change will always transmit the same last-modified timestamp associated with the file. This allows cache mechanisms (like the Cache-Control request header) to decide whether to download a fresh copy of the file or use a locally cached copy.

Content-Length. Content-Length specifies how large the response body (message) will be. The requesting browser can then allocate an appropriate amount of memory to receive the data. On dynamic websites where the Last-Modified header changes each request, this field can also be used to determine the "freshness" of a cached copy.

Content-Type. To accompany the request header Accept, the response header Content-Type tells the browser what type of data is attached in the body of the message. Some media-type values are text/html, image/jpeg, image/png, application/xml, and others. Since the body data could be binary, specifying what type of file is attached is essential.

Content-Encoding. Even though a client may be able to decompress gzipped files, and said so in its Accept-Encoding header, the server may or may not choose to encode the file. In any case, the server must specify to the client how the content was encoded so that it can be decompressed if need be.

Although compressing pages before transmission reduces bandwidth, it requires CPU cycles and memory to do so. On busy servers, it can sometimes be more efficient to transmit dynamic content uncompressed, saving those CPU cycles for responding to requests.
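The compress/decompress round trip behind Content-Encoding: gzip can be sketched in a few lines (the HTML body here is a made-up repetitive sample, chosen so the size reduction is visible):

```python
import gzip

# Simulate the server's "Content-Encoding: gzip" step on a response body.
body = b"<html>" + b"<p>Hello, web!</p>" * 200 + b"</html>"
compressed = gzip.compress(body)

# The client, having advertised "Accept-Encoding: gzip", reverses it.
restored = gzip.decompress(compressed)
```

For repetitive markup like this, the compressed payload is a small fraction of the original, which is exactly the bandwidth saving the Accept-Encoding negotiation exists to capture.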

1.7.2 Request Methods The HTTP protocol defines several different types of requests, each with a different intent and characteristics. The most common requests are the GET and POST requests, along with the HEAD request. Other requests, such as PUT, DELETE, CONNECT, TRACE, and OPTIONS, are seldom used, and are not covered here.

The most common type of HTTP request is the GET request. In this request one is asking for a resource located at a specified URL to be retrieved. Whenever you click on a link, type a URL into your browser, or click on a bookmark, you are usually making a GET request.

Data can also be transmitted through a GET request, something you will be learning about more in Chapter 4.

The other common request method is the POST request. This method is normally used to transmit data to the server using an HTML form (though as we will learn in Chapter 4, a data entry form could use the GET method instead). In a POST request, data is transmitted in the body of the request rather than in the URL, and as such is not subject to the length limitations of GET. Additionally, since the data is not visible in the URL, it is seen as a safer way of transmitting data (although in practice all POST data is transmitted unencrypted, and can be read nearly as easily as GET data). Figure 1.35 illustrates a GET and a POST request in action.

A HEAD request is similar to a GET except that the response includes only the header information, and not the body that would be retrieved in a full GET. Search engines, for example, use this request to determine if a page needs to be reindexed without making unneeded requests for the body of the resource, saving bandwidth.
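The difference between the three methods shows up clearly when constructing requests programmatically. This Python sketch builds (but does not send) one request of each kind; the URLs and form fields echo Figure 1.35 and are purely illustrative:

```python
from urllib.parse import urlencode
from urllib.request import Request

form = {"artist": "Picasso", "year": "1906"}

# GET: the data rides in the URL's query string.
get_req = Request("http://example.com/FormProcess.php?" + urlencode(form))

# POST: the same data travels in the request body instead.
post_req = Request("http://example.com/FormProcess.php",
                   data=urlencode(form).encode("ascii"))

# HEAD: ask for the headers only; the response will carry no body.
head_req = Request("http://example.com/index.html", method="HEAD")
```

Note that supplying a `data` argument is what flips `urllib` from GET to POST, mirroring the rule that POST data belongs in the request body.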

1.7.3 Response Codes Response codes are integer values returned by the server as part of the response header. These codes describe the state of the request, including whether it was successful, had errors, requires permission, and more. For a complete listing, please refer to the HTTP specification. Some commonly encountered codes are listed in Table 1.1 to provide a taste of what kind of response codes exist.

FIGURE 1.35 GET versus POST requests: a form (Artist: Picasso, Year: 1906, Nationality: Spain) whose Submit button issues POST /FormProcess.php HTTP/1.1 to the web server, while a hyperlink issues GET /SomePage.php HTTP/1.1.

Web server Hy perlin k GET /SomePage.php http/1.1 FIGURE 1 .35 GET versus POST requests 46 CHAPTER 1 How the Web Works Table 1.1 lists the most common response codes. The codes use the first digit to indicate the category of response. 2## codes are for successful responses, 3## are for redirection-related responses, 4## codes are client errors, while 5## codes are server errors.

TABLE 1.1 HTTP Response Codes

200 OK: The 200 response code means that the request was successful.

301 Moved Permanently: Tells the client that the requested resource has permanently moved. Codes like this allow search engines to update their databases to reflect the new location of the resource. Normally the new location for that resource is returned in the response.

304 Not Modified: If the client requested a resource with appropriate Cache-Control headers, the response might say that the resource on the server is no newer than the one in the client cache. A response like this is just a header, since we expect the client to use a cached copy of the resource.

307 Temporary Redirect: This code is similar to 301, except the redirection should be considered temporary.

400 Bad Request: If something about the headers or the HTTP request in general does not correctly adhere to the HTTP protocol, the 400 response code informs the client.

401 Unauthorized: Some web resources are protected and require the user to provide credentials to access the resource. If the client gets a 401 code, the request will have to be resent, and the user will need to provide those credentials.

404 Not Found: 404 codes are among the only ones known to web users. Many browsers will display an HTML page with the 404 code when the requested resource was not found.

414 Request URI Too Long: URLs have a length limitation, which varies depending on the server software in place. A 414 response code likely means too much data is being submitted via the URL.

500 Internal Server Error: This error provides almost no information to the client except to say that the server has encountered an error.

1.8 Web Servers A web server is, at a fundamental level, nothing more than a computer that responds to HTTP requests. The first web server was hosted on Tim Berners-Lee's desktop computer; later, when you begin PHP development in Chapter 8, you may find yourself turning your own computer into a web server.

Real-world web servers are often more powerful than your own desktop com­ puter, and typically come with additional software to make them more reliable and replaceable. And as we saw in Section 1.3.6, real-world websites typically have many web servers configured together in web farms.

Regardless of the physical characteristics of the server, one must choose an application stack to run a website. This stack will include an operating system, web server software, a database, and a scripting language to process dynamic requests.

Web practitioners often develop an affinity for a particular stack (often without rationale). Throughout this textbook we will rely on the LAMP software stack, which refers to the Linux operating system, Apache web server, MySQL database, and PHP scripting language. Since Apache and MySQL also run on Windows and Mac operating systems, variations of the LAMP stack can run on nearly any computer (which is great for students). The Apple OSX MAMP software stack is nearly identical to LAMP, since OSX is a Unix implementation and includes all the tools available in Linux. The WAMP software stack is another popular variation, in which the Windows operating system is used.

Despite the wide adoption of the LAMP stack, web developers need to be aware of alternate software that could be used to support their websites. Many corporations, for instance, make use of the Microsoft WISA software stack, which refers to the Windows operating system, IIS web server, SQL Server database, and the ASP.NET server-side development technologies.

1.8.1 Operating Systems The choice of operating system will constrain what other software can be installed and used on the server. The most common choice for a web server is a Linux-based OS, although there is a large business-focused market that uses Microsoft Windows and IIS.

Linux is the preferred choice for technical reasons such as higher average uptime, lower memory requirements, and the ability to easily administer the machine remotely from the command line, if required. Its zero cost also makes it an excellent tool for students and professionals alike looking to save on licensing costs.

Organizations that have already adopted Microsoft solutions across the organization are more likely to use a Windows server OS to host their websites, since they will have in-house Windows administrators familiar with the Microsoft suite of tools.

1.8.2 Web Server Software If running Linux, the most likely server software is Apache, which has been ported to run on Windows, Linux, and Mac, making it platform agnostic. Apache is also well suited to textbook discussion since all of its configuration options can be set through text files (although graphical interfaces exist).

IIS, the Windows server software, is preferred largely by those using Windows in their enterprises already or who prefer the .NET development framework. The most compelling reason to choose an IIS server is to get access to other Microsoft tools and products, including ASP.NET and SQL Server.

1.8.3 Database Software The moment you decide your website will be dynamic, and not just static HTML pages, you will likely need to make use of relational database software capable of running SQL queries.

The open-source DBMS of choice is usually MySQL (though some prefer PostgreSQL or SQLite), whereas proprietary choices for a web DBMS include Oracle, IBM DB2, and Microsoft SQL Server. All of these database servers are capable of managing large amounts of data, maintaining integrity, responding to many queries, creating indexes, creating triggers, and more. The differences between these servers are real, but are not relevant to the scope of the projects we will be developing in this text.

In this book you will be using MySQL Server, meaning that if you are developing against another DBMS, some queries may have to be altered.

1.8.4 Scripting Software Finally (or perhaps firstly, if you are starting a project from scratch) is the choice of server-side development language or platform. This development platform will be used to write software that responds to HTTP requests. The choice for a LAMP stack is usually PHP, Python, or Ruby on Rails. We have chosen PHP due to its access to low-level HTTP features, object-oriented support, C-like syntax, and its wide proliferation on the web.

Other technologies like ASP.NET are available to those interested in working entirely inside the Microsoft platform. Each technology does have real advantages and disadvantages, but we will not be addressing them here.

1.9 Chapter Summary This long chapter has been broad in its coverage of how the Internet and the web work.

It began with a short history of the Internet and how those early choices are still affecting the web today. From the design of the Internet suite of protocols you saw how IP addresses and a multilayer stack of protocols guarantee the transmission and receipt of data. The chapter also tried to provide a picture of the hardware component of the web and the Internet, from your home router, to gigantic web farms, to the many tentacles of undersea and overland fiber optic cable. The chapter then covered some of the key protocols that make the web work: DNS, URLs, and the HTTP protocol.