11
Web Security

When the words appeared, everyone said they were a miracle. But nobody pointed out that the web itself is a miracle.

—E. B. White (from Charlotte’s Web)

The enormous success of the World Wide Web is in no small part due to the remarkable fact (today, completely taken for granted) that countless millions of people use it routinely without having the slightest understanding of how it works. This singular achievement for such a complex amalgam of technology is at once a blessing and a curse. Undoubtedly, the web’s ease of use has sustained widespread growth. On the flip side, securing a global network of independent digital services, used by countless millions of oblivious humans at the endpoints, is indeed an extremely difficult task. Security is perhaps the hardest part of this big hard problem.

One complicating factor that makes security especially challenging is that the early web was rather naively designed, without much consideration to security. As a result, the modern web is the product of a long evolution of standards, muddled by the competitive “browser wars” and backward compatibility restrictions. In short, the web is the most extreme instance of after-the-fact, “bolt-on security” in history—though what we have, well over a quarter of a century after its invention, is getting respectable.

Yet while the modern web can be made secure, its tangled history means that it’s also quite fragile and filled with many “security and privacy infelicities,” as the authors of RFC 6265, a spec for web cookies, so colorfully put it. Software professionals need to understand all of this so as not to run afoul of these issues when building for the web. Tiny missteps easily create vulnerabilities. Given the “Wild West” nature of the internet, bad actors have the freedom to easily probe how websites work, as well as anonymously muck around looking for openings to attack.

This chapter focuses on the fundamentals of how the web security model evolved, and the right and wrong ways to use it. Vulnerabilities arise from the details, and there are so many things a secure website must get exactly right. We’ll cover all of the basics of web security, beginning with a plea to build on top of a secure framework that handles the intricacies for you. From there, we will see how secure communication (HTTPS), proper use of the HTTP protocol (including cookies), and the Same Origin Policy combine to keep websites safe. Finally, we’ll cover two of the major vulnerabilities specific to the web (XSS and CSRF) and discuss a number of other mitigations that, when combined, go a long way toward securing a modern web server. Nonetheless, this chapter is by no means a complete compendium of web security, the specifics of which are voluminous and evolve rapidly.

The goal here is to convey a broad-brush sense of the major common pitfalls so you will recognize and know how to deal with them. Web applications are also subject to the many other vulnerabilities covered elsewhere in this book: the focus in this chapter should not be interpreted to suggest that these are the only potential security concerns.

Build on a Framework

Use design as a framework to bring order out of chaos.

—Nita Leland

Thanks to modern web development tools, building a website has become nearly as easy as using one. My top recommendations for building a secure website are to rely on a high-quality framework, never override the safeguards it provides, and let competent experts handle all the messy details.

A reliance on a solid framework should insulate you from the kinds of vulnerabilities covered in the following sections, but it’s still valuable to understand exactly what frameworks do and don’t do so you can use them effectively. It’s also critical that you choose a secure framework from the start, because your code will heavily depend on it, making it painful to switch later if it lets you down. How do you know if a web framework is really secure? It boils down to trust—both in the good intentions and the expertise of its makers.

Web frameworks rise and fall in popularity and buzz almost as fast as Parisian fashion, and your choice will depend on many factors, so I won’t attempt to make recommendations. However, I can suggest general guidelines to consider for your own evaluation:

  • Choose a framework produced by a trustworthy organization or team that actively develops and maintains it in order to keep up with constantly changing web technologies and practices.
  • Look for an explicit security declaration in the documentation. If you don’t find one, I would disqualify the framework.
  • Research past performance: the framework doesn’t need a perfect record, but slow responses or ongoing patterns of problems are red flags.
  • Build a small prototype and check the resulting HTML for proper escaping and quoting (using inputs like the ones in this chapter’s examples).
  • Build a simple test bed to experiment with basic XSS and CSRF attacks, as explained later in this chapter.

The Web Security Model

I’m kind of glad the web is sort of totally anarchic. That’s fine with me.

—Roger Ebert

The web is a client/server technology, and understanding its security model requires considering both of those perspectives at once. Doing so gets interesting quickly, since the security interests of the two parties are often in contention, especially given the threat of potential attackers intruding via the internet.

Consider the typical online shopping website. The security principles at play here apply, more or less, to all kinds of web activity. In order to do business, the merchant and customers must trust each other to a certain degree, and in the vast majority of cases that does actually happen. Nonetheless, there are inevitably a few bad actors out there, so websites cannot fully trust every client, and vice versa. The following points highlight some of the nuances of the tentative mutual trust between the merchant and customer.

Here are some of the merchant’s basic requirements:

  • Other websites should be unable to interfere with my customer interactions.
  • I want to minimize my competitors’ ability to scrape my product and inventory details while still helpfully informing legitimate customers.
  • Customers shouldn’t be able to manipulate prices or order products not in stock.

Here are some of the customer’s:

  • I require assurance that the website I’m accessing is authentic.
  • I demand confidence that online payments are secure.
  • I expect the merchant to keep my shopping activities private.

Clearly, both parties must remain vigilant for the web to work well. That said, the customer expects many things from the merchant. Solving the hard problem of educating confused or gullible customers is out of scope here, if that’s even possible. Instead, in web security, we focus on securing a website from the merchant’s perspective. The web only works if servers do a good job of providing that security, making it possible for the honest end user to even have a chance at a secure web experience. Merchants must not only decide how much they can trust customers, but also intuit how much customers will likely trust them.

Another odd aspect of the web’s security model is the role of the client browser. Designing web services proves challenging because they need to interact with browsers that they have absolutely no control over. A malevolent client could easily use a modified browser capable of anything. Alternatively, a careless client could well be running an ancient browser full of security holes. Even if a web server attempts to limit the types of browsers clients use to certain versions, remember that the browser could easily misidentify itself to get around such restrictions. The saving grace is that honest clients want to use secure browsers and update them regularly, because it protects their own interests. Most importantly, so long as the server remains secure, one malicious client cannot interfere with the service that other clients receive.

Web servers overtrusting potentially untrustworthy client browsers is at the root of many web security vulnerabilities. I stress this point, at the risk of repetition, because it is so easily and often forgotten (as I will explain throughout the chapter).

The HTTP Protocol

Anyone who considers protocol unimportant has never dealt with a cat.

—Robert A. Heinlein

The HTTP protocol itself is at the heart of the web, so before we dig into web security, it’s worth briefly reviewing how it works. This hyper-simplified explanation serves as a conceptual framework for the rest of the security discussion, and we’ll focus on the parts where security enters the picture. For many, web browsing has become so commonplace in daily life that it’s worth stepping back and thinking through all the steps of the process—many of which we hardly notice, as modern processors and networks routinely provide blazing-fast responses.

Web browsing always begins with a uniform resource locator (URL). The following example shows its parts:

http://www.example.com/page.html?query=value#fragment

The scheme precedes the colon, and specifies the protocol (here, http) the browser must use to request the desired resource. IP-based protocols begin with // followed by the hostname, which for web pages is the domain name of the web server (in this case, www.example.com). The rest is all optional: the / followed by the path, the ? followed by the query, and the # followed by the fragment. The path specifies which web page the browser is requesting. The query allows the web page content to be parameterized. For example, when searching for something on the web, the URL path for results might be /search?q=something. The fragment names a secondary resource within the page, often an anchor as the destination of a link. In summary, the URL specifies how and where to request the content, the specific page on the site, query parameters to customize the page, and a way to name a particular part of the page.
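These parts map directly onto Python’s standard URL parser, so a quick check with urllib shows the decomposition of the example URL above:

```python
from urllib.parse import urlsplit

# Split the example URL into the parts described above
parts = urlsplit("http://www.example.com/page.html?query=value#fragment")

print(parts.scheme)    # http (the protocol to use)
print(parts.netloc)    # www.example.com (the web server's hostname)
print(parts.path)      # /page.html (which page is requested)
print(parts.query)     # query=value (parameters customizing the page)
print(parts.fragment)  # fragment (secondary resource within the page)
```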

Your web browser has a lot of work to do in order to display the web page when you give it a URL. First, it queries the Domain Name System (DNS) for the IP address of the hostname in order to know where to send the request. The request contains the URL path and other parameters encoded as request headers (including any cookies, the user’s preferred language, and so on) sent to the web server host. The server sends back a response containing a status code and response headers (which may set cookies, and many other things), followed by the content body that consists of the HTML for the web page. For all embedded resources, such as scripts, images, and so forth, this same request/response process repeats until the content is fully loaded and displayed.
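In wire form, one such request/response pair looks roughly like the following hand-written sketch (trimmed to a few representative headers; the cookie value is hypothetical). Note that the fragment never appears—the browser keeps it to itself:

```
GET /page.html?query=value HTTP/1.1
Host: www.example.com
Accept-Language: en-US
Cookie: session=abc123

HTTP/1.1 200 OK
Content-Type: text/html
Set-Cookie: pref=green

<html>... page content ...</html>
```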

Now let’s look at what web servers must do correctly in order to remain secure. One important detail not yet mentioned is that the request specifies the HTTP verb. For our purposes here, we will focus on just the two most common verbs. GET requests content from the server. By contrast, clients use the POST verb to send form submissions or file uploads. GET requests are explicitly not state-changing, whereas POST requests intend to change the state of the server. Respecting this semantic distinction is important, as will be seen when we cover CSRF attacks. For now, keep in mind that even though the client specifies the request verb to use, the server is the one that decides what to do with it. Additionally, by offering hyperlinks and forms on its pages, the server in effect guides the client to make subsequent GET or POST requests.
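As a toy illustration of respecting verb semantics, a server-side dispatcher might permit state changes only on POST. The handler and session shape here are invented for illustration, not any particular framework’s API:

```python
def handle_cart(verb, session, item=None):
    """Toy request handler: GET is read-only, POST changes server state."""
    if verb == "GET":
        return list(session["cart"])     # report contents; no side effects
    if verb == "POST":
        session["cart"].append(item)     # state change permitted only here
        return list(session["cart"])
    return None                          # other verbs unsupported here
```

A GET that quietly mutated session state would work mechanically, but it would forfeit the safety assumptions the rest of this chapter relies on.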

Sticklers will point out that one certainly can run a server that changes state in response to GET verb requests and, perversely, refuses to change state for form POST submissions. But if you strictly follow the standard rules, it is vastly easier to make your server secure. Think of it this way: yes, it is possible to climb over fences marked “Keep Out!” at a cliff and walk along the edge of the precipice without falling, but doing so needlessly puts your security in jeopardy.

A related security no-no is embedding sensitive data in a URL; instead, use form POST requests to send the data to the server. Otherwise, the REFERER header may disclose the URL of the web page that led to the request, exposing the data. For example, clicking a link on a web page with the URL https://example.com?param=SECRET navigates to the link destination using a GET request with a REFERER header containing the URL which includes SECRET, thereby leaking the secret data. In addition, logs or diagnostic messages risk disclosing the data contained in URLs. While servers can use the Referrer-Policy header to block this, they must depend on the client honoring it—hardly a perfect solution. (The REFERER header is indeed misspelled in the spec, so we’re stuck with that, but the policy name is correctly spelled.)

One easy mistake to make is including usernames in URLs. Even an opaque identifier, such as the hash of a username, leaks information, in that it potentially allows an eavesdropper to match two separately observed URLs and infer that they refer to the same user.
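To see why hashing doesn’t help, consider this sketch (the identifier scheme is hypothetical): because the digest is deterministic, the same user always yields the same token, so two observed URLs are trivially linkable.

```python
import hashlib

def opaque_id(username):
    # Deterministic digest: hides the name but not the linkage
    return hashlib.sha256(username.encode()).hexdigest()[:16]

url1 = "/profile?u=" + opaque_id("alice")
url2 = "/orders?u=" + opaque_id("alice")
# An eavesdropper who sees both URLs learns they belong to one user,
# without ever learning the username itself.
```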

Digital Certificates and HTTPS

If what is communicated is false, it can hardly be called communication.

—Benjamin Mays

The first challenge for secure web browsing is reliably communicating with the correct server. To do this, you must know the correct URL and query a DNS service that provides the right IP address. If the network routes and transmits the request correctly, it should reach the intended server. That’s a lot of factors to get right, and a large attack surface: bad actors could interfere with the DNS lookup, the routing, or the data on the wire at any point along the route. Should the request be diverted to a malicious server, the user might never realize it; it isn’t hard to put up a look-alike website that would easily fool just about anyone.

The HTTPS protocol (also called HTTP over TLS/SSL) is tailor-made to mitigate these threats. HTTPS secures the web using many of the techniques covered in Chapter 5. It provides a secure, end-to-end, tamper-evident encrypted channel, as well as assurance to the client that the intended server is really at the other end of that channel. Think of the secure channel as a tamper-evident pipeline for data that confirms the server’s identity. An eavesdropping attacker could possibly see encrypted data, but without the secret key, it’s indistinguishable from random bits. An attacker may be able to tamper with the data on an unprotected network, but if HTTPS is used, any tampering will always be detected. Attackers may be able to prevent communication, for example by physically cutting a cable, but you are assured that bogus data will never get through.

Nobody ever disputed the need for HTTPS to secure financial transactions on the web, but major sites delayed going fully HTTPS for far too long. (For example, Facebook only did so in 2013.) When first implemented, the protocol had subtle flaws, and the necessary computations were too heavyweight for the hardware at the time to justify widespread adoption. The good news is that, over time, developers fixed the bugs and optimized the protocol. Thanks to protocol optimizations, more efficient crypto algorithms, and faster processors, HTTPS is fast, robust, and rapidly approaching ubiquity today. It’s widely used to protect private data communications, but even for a website only serving public information, HTTPS is important to ensure authenticity and strong integrity. In other words, it provides assurance that the client is communicating with the bona fide server named in the request URL, and that data transmitted between them has not been snooped on or tampered with. Today, it’s difficult to think of any good reason not to configure a website to use HTTPS exclusively. That said, there are still plenty of non-secure HTTP websites out there. If you use them, keep in mind that the nice security properties of HTTPS do not apply, and take appropriate precautions.

Understanding precisely what HTTPS does (and does not do) to secure the client/server interaction is critical in order to grasp its value, how it helps, and what it can and cannot change. In addition to assuring server authenticity and the confidentiality and integrity of web requests and response content, the secure channel protects the URL path (in the first line of the request headers—for example, GET /path/page.html?query=secret#fragment), preventing anyone who’s snooping from seeing what page of the website the client requested. (HTTPS can optionally also authenticate the client to the server.) However, the HTTPS traffic itself is still observable over the network, and because the IP addresses of the endpoints are unprotected, eavesdroppers can often deduce the identity of the server.

Table 11-1 compares the security attributes of HTTP and HTTPS, in terms of the capabilities of an attacker lurking between the two endpoints of a client/server communication.

Table 11-1: HTTP vs. HTTPS Security Attributes

Can an attacker. . . HTTP HTTPS
See web traffic between client/server endpoints? Yes Yes
Identify the IP addresses of both client and server? Yes Yes
Deduce the web server’s identity? Yes Sometimes (see note below)
See what page within the site is requested? Yes No (in encrypted headers)
See the web page content and the body of POSTs? Yes No (encrypted)
See the headers (including cookies) and URL (including the query portion)? Yes No
Tamper with the URL, headers, or content? Yes No

As HTTPS and the technology environment matured, the last obstacle to broad adoption was the overhead of getting server certificates. Whereas larger companies could afford the fees that trusted certificate authorities charged and had staff to manage the renewal process, the owners of smaller websites balked at the extra cost and administrative overhead. By 2015, HTTPS was mature and most internet-connected hardware operated fast enough to handle it, and with awareness of the importance of web privacy growing quickly, the internet community was approaching a consensus that it needed to secure the majority of web traffic. Free, easily obtained server certificates were the one missing piece.

Thanks to strong promotion by the wonderful Electronic Frontier Foundation and sponsorship from a wide range of industry companies, Let’s Encrypt, a product of the nonprofit Internet Security Research Group, offers the world a free, automated, and open certificate authority. It provides Domain Validation (DV) certificates, free of charge, to any website owner. Here’s a simplified explanation of how Let’s Encrypt works. Keep in mind that the following process is automated in practice:

  1. Identify yourself to Let’s Encrypt by generating a key pair and sending the public key.
  2. Query Let’s Encrypt, asking what you need to do to prove that you control the domain.
  3. Let’s Encrypt issues a challenge, such as provisioning a specified DNS record for the domain.
  4. You satisfy the challenge by creating the requested DNS record and ask Let’s Encrypt to verify what you did.
  5. Once verified, the private key belonging to the generated key pair is authorized for the domain by Let’s Encrypt.
  6. Now you can request a new certificate by sending Let’s Encrypt a request signed by the authorized private key.

Let’s Encrypt issues 90-day DV certificates and provides a “certbot” to handle automatic renewals. With automatically renewable certificates available as a free service, secure web serving today has widely become a turnkey solution at no additional cost. HTTPS comprised more than 85 percent of web traffic in 2020, more than double the 40 percent level of 2016, when Let’s Encrypt launched.

A DV certificate is usually all you need to prove the identity of your website. DV certificates simply assert the authenticated web server’s domain name, and nothing more. That is, the example.com certificate is only ever issued to the owner of the example.com web server. By contrast, certificates offering higher levels of trust, such as Organization Validation (OV) and Extended Validation (EV) certificates, authenticate not only the identity of the website but also, to some extent, the owner’s identity and reputation. However, with the proliferation of free DV certificates, it’s increasingly unclear if the other kinds will remain viable. Few users care about such distinctions of trust, and the technical as well as legal nuances of OV and EV certificates are subtle. Their precise benefits are challenging to grasp unless (and even if) you are a lawyer.

Once you’ve set up your web server to use the HTTPS protocol with a certificate, you must make sure it always uses HTTPS. To ensure this, you must reject downgrade attacks, which attempt to force the communication to occur with weak encryption or without encryption. These attacks work in two ways. In the simplest case, the attacker tries changing an HTTPS request to HTTP (which can be snooped and tampered with), and a poorly configured web server might be tricked into complying. The other method exploits the HTTPS protocol options that let the two parties negotiate cipher suites for the encrypted channel. For example, the server may be able to “speak” one set of crypto “dialects,” and the client might “speak” a different set, so up front, they need to agree on one that’s in both their repertoires. This process opens the door to an attacker, who could trick both parties into selecting an insecure choice that compromises security.

The best defense is to ensure your HTTPS configuration only operates with secure modern cryptographic algorithms. Judging exactly which cipher suites are secure is highly technical and best left to cryptographers. You must also strike a balance to avoid excluding, or degrading the experience of, older and less powerful clients. If you don’t have access to reliable expert advice, you can look at what major trustworthy websites do and follow that. Simply assuming that the default configuration will be secure forever is a recipe for failure.
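In Python, for example, a server can lean on the library’s maintained defaults and simply raise the protocol floor. A minimal sketch:

```python
import ssl

# Start from Python's curated secure defaults for a server-side TLS
# context, then refuse anything older than TLS 1.2.
ctx = ssl.create_default_context(ssl.Purpose.CLIENT_AUTH)
ctx.minimum_version = ssl.TLSVersion.TLSv1_2
```

Delegating cipher suite selection to a well-maintained library (and keeping it updated) is exactly the kind of expert reliance recommended at the start of this chapter.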

Mitigate such attacks by always redirecting HTTP to HTTPS, as well as restricting web cookies to HTTPS only. Include the Strict-Transport-Security directive in your response HTTP headers so the browser knows that the website always uses HTTPS. For an HTTPS web page to be fully secure, it must be pure HTTPS. This means all content on the server should use HTTPS, as should all scripts, images, fonts, CSS, and other referenced resources. Failing to take all the necessary precautions weakens the security protection.
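A sketch of those first two measures, framed as a tiny response builder (the function and response shape are illustrative, not any particular framework’s API):

```python
def https_redirect(host, path):
    """Build a permanent redirect steering an HTTP request to HTTPS."""
    return {
        "status": 301,
        "headers": {
            "Location": "https://%s%s" % (host, path),
            # Tell the browser to insist on HTTPS for this site for a year
            "Strict-Transport-Security": "max-age=31536000; includeSubDomains",
        },
    }
```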

The Same Origin Policy

Doubt is the origin of wisdom.

—Rene Descartes

Browsers isolate resources—typically windows or tabs—from different websites so they can’t interfere with each other. Known as the Same Origin Policy (SOP), the rule allows interaction between resources only if their schemes, host domain names, and port numbers all match. The Same Origin Policy dates back to the early days of the web and became necessary with the advent of JavaScript. Web script interacts with web pages via the Document Object Model (DOM), a structured tree of objects that correspond to browser windows and their contents. It didn’t take a security expert to see that if any web page could use script to window.open any other site, and programmatically do anything it wanted with the content, countless problems would ensue. The first restrictions that were implemented—including fixes for a number of tricky ways people found of getting around them over the years—evolved into today’s Same Origin Policy.
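The origin comparison itself is simple to express; here’s a minimal sketch in Python of the check browsers implement internally:

```python
from urllib.parse import urlsplit

def origin(url):
    # An origin is the (scheme, host, port) triple; fill in default ports
    parts = urlsplit(url)
    port = parts.port or {"http": 80, "https": 443}.get(parts.scheme)
    return (parts.scheme, parts.hostname, port)

def same_origin(url_a, url_b):
    return origin(url_a) == origin(url_b)
```

Under this rule, https://example.com/a and https://example.com:443/b share an origin, while any change of scheme, host (including a subdomain), or port breaks the match.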

The Same Origin Policy applies to script and cookies (with a few extra twists), which both can potentially leak data between independent websites. However, web pages can include images and other content, such as web ads, from other websites. This is safely allowed, since these cannot access the content of the window they appear in.

Although the Same Origin Policy prevents script in pages from other websites from reaching in, web pages can always choose to reach out to different websites if they wish, pulling their content into the window. It’s quite common for a web page to include content from other websites, to display images, to load scripts or CSS, and so forth. Including any content from other websites is an important trust decision, however, because it makes the web page vulnerable to malicious content that may originate there.

Web Cookies

When the going gets tough, the tough make cookies.

—Erma Bombeck

Cookies are small data strings that the server asks the client to store on its behalf and then provide back to it with subsequent requests. This clever innovation allows developers to easily customize web pages for a particular client. The server response may set named cookies to some value. Then, until the cookies expire, the client browser sends the cookies applicable to a given page in subsequent requests. Since the client retains its own cookies, the server doesn’t necessarily need to identify the client to bind cookie values to it, so the mechanism is potentially privacy-preserving.

Here’s a simple analogy: if I run a store and want to count how many times each customer visits, an easy way would be for me to give each customer a slip of paper with “1” on it and ask them to bring it back the next time they come. Then, each time a customer returns, I take their paper, add one to the number on it, and give it back. So long as customers comply, I won’t have to do any bookkeeping or even remember their names to keep accurate tallies.

We use cookies for all manner of things on the web, tracking users being among the most controversial. Cookies often establish secure sessions so the server can reliably tell all of its clients apart. Generating a unique session cookie for each new client allows the server to identify the client from the cookie appearing in a request.

While any client could tamper with its own cookies and pretend to be a different session, if the session cookie is properly designed, the client shouldn’t be able to forge a valid session cookie. Additionally, clients could send copies of their cookies to another party, but in doing so they would only harm their own privacy. That behavior doesn’t threaten innocent users and is tantamount to sharing one’s password.
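“Properly designed” here means, at minimum, that session identifiers are drawn from a cryptographically strong random source, making them infeasible to guess or forge. A sketch using Python’s standard library:

```python
import secrets

def new_session_id():
    # 128 bits of CSPRNG output: far too large a space to guess or forge
    return secrets.token_urlsafe(16)
```

The server then stores the mapping from token to session state; the token itself carries no meaning for a client to tamper with.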

Consider a hypothetical online shopping website that stores the current contents of a customer’s shopping cart in cookies as a list of items and the total cost. There is nothing to stop a clever and unethical shopper from modifying the local cookie store. For instance, they could change the price of a valuable load of merchandise to a paltry sum. This does not mean that cookies are useless; cookies could be used to remember the customer’s preferences, favorite items, or other details, and tampering with these wouldn’t hurt the merchant. It just means that you should always use client storage on a “trust but verify” basis. Go ahead and store item costs and the cart total as cookies if that’s useful, but before accepting the transaction, be certain to validate the cost of each item on the server side, and reject any data that’s been tampered with. This example makes the problem plain as day. However, other forms of the same trust mistake are more subtle, and attackers frequently exploit this sort of vulnerability.
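A sketch of that “trust but verify” pattern for this hypothetical cart (the catalog and cart format are invented for illustration; prices are in integer cents to sidestep floating-point issues):

```python
CATALOG = {"widget": 1999, "gadget": 549}   # authoritative server-side prices

def validate_cart(items, claimed_total):
    """Recompute the total from server-side prices; reject any mismatch."""
    total = 0
    for name, claimed_price in items:
        real_price = CATALOG.get(name)
        if real_price is None or real_price != claimed_price:
            raise ValueError("unknown item or tampered price: " + name)
        total += real_price
    if total != claimed_total:
        raise ValueError("cart total does not match")
    return total
```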

Now let’s look at this same example from the client’s perspective. When two people use an online shopping website and browse to the same /mycart URL, they each see different shopping carts because they have distinct sessions. Usually, unique cookies establish independent anonymous sessions, or, for logged-in users, identify specific accounts.

Servers set session cookies with a time of expiration, but since they cannot always rely on the client to respect that wish, they must also enforce session cookie expiration on the server side. (From the user’s perspective, this expiration looks like being asked to log in again after a period of inactivity.)

Cookies are subject to the Same Origin Policy, with explicit provisions for sharing between subdomains. This means that cookies set by example.com are visible to the subdomains cat.example.com and dog.example.com, but cookies set on those respective subdomains are isolated from each other. Also, though subdomains can see cookies set by parent domains, they cannot modify them. By analogy, state governments rely on national-level credentials such as passports, but may not issue them. Within a domain, cookies may be further scoped by path as well (but this is not a strong security mechanism). Table 11-2 illustrates these rules in detail. In addition, cookies may specify a Domain attribute for explicit control.

Table 11-2: Cookie Sharing Under Same Origin Policy (SOP) with Subdomains

Can the web pages served by the hosts below. . . . . .see the cookies set for these hosts?

Host serving page    example.com          dog.example.com      cat.example.com      example.org
example.com          Yes (same domain)    No (subdomain)       No (subdomain)       No (SOP)
dog.example.com      Yes (parent domain)  Yes (same domain)    No (sibling domain)  No (SOP)
cat.example.com      Yes (parent domain)  No (sibling domain)  Yes (same domain)    No (SOP)
example.org          No (SOP)             No (SOP)             No (SOP)             Yes (same domain)

Script nominally has access to cookies via the DOM, but this convenience would give malicious script that manages to run in a web page an opening to steal the cookies, so it’s best to block script access by specifying the httponly cookie attribute. HTTPS websites should also apply the secure attribute to direct the client to only send cookies over secure channels. Unfortunately, due to legacy constraints too involved to cover here, integrity and availability issues remain even when you use both of these attributes (see RFC 6265 for the gory details). I mention this not only as a caveat, but also as a great example of a repeated pattern in web security; the tension between backward compatibility and modern secure usage results in compromise solutions that illustrate why, if security isn’t baked in from the start, it often proves to be elusive.
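Setting both attributes with Python’s standard http.cookies module looks like this (the cookie name and value are hypothetical):

```python
from http.cookies import SimpleCookie

cookie = SimpleCookie()
cookie["session"] = "abc123"          # hypothetical session token
cookie["session"]["httponly"] = True  # deny script access via the DOM
cookie["session"]["secure"] = True    # send only over HTTPS channels

# Value suitable for a Set-Cookie response header, e.g.:
# session=abc123; Secure; HttpOnly
header = cookie["session"].OutputString()
```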

HTML5 has added numerous extensions to the security model. A prime example is Cross-Origin Resource Sharing (CORS), which allows selective loosening of Same Origin Policy restrictions to enable data access by other trusted websites. Browsers additionally provide the Web Storage API, a more modern client-side storage capability for web apps that’s also subject to the Same Origin Policy. These newer features are much better designed from a security standpoint, but still are not a complete substitute for cookies.

Common Web Vulnerabilities

Websites should look good from the inside and out.

—Paul Cookson

Now that we’ve surveyed the major security highlights of website construction and use, it’s time to talk about specific vulnerabilities that commonly arise. Web servers are liable to all kinds of security vulnerabilities, including many of those covered elsewhere in this book, but in this chapter we’ll focus on security issues specific to the web. The preceding sections explained the web security model, including a lot of potential ways to avoid weakening security and useful features that help better secure your web presence. Even assuming you did all of that right, this section covers still more ways web servers can get it wrong and be vulnerable.

The first category of web vulnerability, and likely the most common, is cross-site scripting (XSS). The other vulnerability we’ll cover here is probably my favorite due to its subtlety: cross-site request forgery (CSRF).

Cross-Site Scripting

I don’t let myself “surf” on the Web, or I would probably drown.

—Aubrey Plaza

The isolation that the Same Origin Policy provides is fundamental to building secure websites, but this protection breaks easily if we don’t take necessary precautions. Cross-site scripting (XSS) is a web-specific injection attack where malicious input alters the behavior of a website, typically resulting in running unauthorized script.

Let’s consider a simple example to see how this works and why it’s essential to protect against. The attack usually begins with the innocent user already logged in to a trusted website. The user then opens another window or tab and goes surfing, or perhaps unwisely clicks a link in an email, browsing to an attacking site. The attacker typically aims to commandeer the user’s authenticated state with the target site. They can do so even without a tab open to the victim site, so long as the cookies are present (which is why it’s good practice to log out of your banking website when you’re done). Let’s look at what an XSS vulnerability in a victim site looks like, exactly how to exploit it, and finally, how to fix it.

Suppose that for some reason a certain page of the victim website (www.example.com) wants to render a line of text in several different colors. Instead of building separate pages, all identical except for the color of that line, the developer chooses to specify the desired color in the URL query parameter. For example, the URL for the version of the web page with a line of green text would be:

https://www.example.com/page?color=green

The server then inserts the highlighted query parameter into the following HTML fragment:

<h1 style="color:green">This is colorful text.</h1>

This works fine if used properly, which is exactly why these flaws are easily overlooked. Seeing the root of the problem requires looking at the server-side Python code responsible for handling this task (as well as some devious thinking):

vulnerable code

# requires: import urllib.parse
query_params = urllib.parse.parse_qs(self.parts.query)
color = query_params.get('color', ['black'])[0]
h = '<h1 style="color:%s">This is colorful text.</h1>' % color

The first line parses the URL query string (the part after the question mark). The next line extracts the color parameter, or defaults to black if it’s unspecified. The last line constructs the HTML fragment that displays text with the corresponding font color, using inline styling for the heading level 1 tag (<h1>). The variable h then forms part of the HTML response that comprises the web page.

You can find the XSS vulnerability in that last line. There, the programmer has created a path from the contents of the URL (which, on the internet, anyone can send to the server) that leads directly into the HTML content served to the client. This is the familiar pattern of injection attacks from Chapter 10, and constitutes an unprotected trust boundary crossing, because the parameter input string is now inside the web page HTML contents. This condition alone is enough to raise red flags, but to see the full dimensions of this XSS vulnerability, let’s try exploiting it.

An attack requires a little imagination. Refer back to the <h1> HTML tag and consider other possible substitutions for the highlighted color name. Think outside the box, or in this case, outside the double quoted string style="color:green". Or can you break out of the <h1> tag entirely? Here’s a URL that illustrates what I mean by “break out”:

https://www.example.com/page?color=orange"><SCRIPT>alert("Gotcha!")</SCRIPT><span%20id="dummy

All of that highlighted stuff gets dutifully inserted into the <h1> HTML tag as before, producing a vastly different result.

In the actual HTML, this code would appear as a single line, but for legibility I’ve indented it here to show how it’s parsed:

<h1 style="color:orange">
  <SCRIPT>alert("Gotcha!")</SCRIPT>
  <span id="dummy">This is colorful text.
</h1>

The new <h1> tag is syntactically valid, specifying an orange color. However, note that the attacker’s URL parameter value supplied the closing angle bracket. This wasn’t done just to be nice: the attacker needed to close the <h1> tag in order to make a well-formed <SCRIPT> tag and inject it into the HTML, ensuring that the script would run. In this case, the script opens an alert dialog—a harmless but unmistakable proof of the exploit. After the closing </SCRIPT> tag, the rest of the injection is just filler to obscure that tampering occurred. The new <span> tag has an id attribute merely so the following double quote and closing angle bracket will appear as part of the <span> tag. Browsers routinely supply closing </span> tags if missing, so the exploited page is well-formed HTML, making the modifications invisible to the user (unless they inspect the HTML source).

To actually attack victims remotely, the attacker has more work to do in order to get people to browse to the malicious URL. Attacks like this generally only work when the user is already authenticated to the target website—that is, when valid login session cookies exist. Otherwise, the attacker might as well type the URL into their own browser. What they’re after is your website session, which shows your bank balance or your private documents. A serious attacker-defined script would immediately load additional script, and then proceed to exfiltrate data, or make unauthorized transactions in the user’s context.

XSS vulnerabilities aren’t hard for attackers to discover, since they can easily view a web page’s content to see the inner workings of the HTML. (To be precise, they can’t see code on the server, but by trying URLs and observing the resulting web pages, it isn’t hard to make useful inferences about how it works.) Once they notice an injection from the URL into a web page, they can then perform a quick test, like the example shown here, to check if the server is vulnerable to XSS. Moreover, once they have confirmed that HTML metacharacters, such as angle brackets and quotes, flow through from the URL query parameter (or perhaps another attack surface) into the resultant web page, they can view the page’s source code and tweak their attempts until they hit the jackpot.

There are several kinds of XSS attacks. This chapter’s example is a reflected XSS attack, because it is initiated via an HTTP request and expressed in the immediate server response. A related form, the stored XSS attack, involves two requests. First, the attacker somehow manages to store malicious data, either on the server or in client-side storage. Once that’s set up, the web server is tricked into injecting the stored data into the response to a subsequent request, completing the attack. Stored XSS attacks can work across different clients. For example, on a blog, if the attacker can post a comment that causes XSS in the rendering of comments, then subsequent users viewing the web page will get the malicious script.

A third attack form, called DOM-based XSS, uses the HTML DOM as the source of the malicious injection, but otherwise works much the same. Categories aside, the bottom line is that all of these vulnerabilities derive from injecting untrusted data that the web server allows to flow into the web page, introducing malicious script or other harmful content.

A secure web framework should have XSS protection built in, in which case you should be safe so long as you stay within the framework. As with any injection vulnerability, the defense involves either avoiding any chance for untrusted input to flow into a web page and potentially break out, or performing input validation to ensure that inputs will be handled safely. In the colored text example, the former technique could be implemented by simply serving named web pages (/green-page and /blue-page, for example) without the tricky query parameter. Alternatively, with a color parameter in the URL, you could constrain the query parameter value to be in an allowlist.
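For the colored text example, the allowlist approach might look like the following sketch (the particular set of allowed colors is an assumption):

```python
# Assumed allowlist of colors the page actually supports.
ALLOWED_COLORS = {'black', 'green', 'blue', 'orange'}

def safe_color(query_params):
    """Constrain the color query parameter to the allowlist.

    query_params is the dict-of-lists that urllib.parse.parse_qs
    returns; anything off the allowlist falls back to black.
    """
    color = query_params.get('color', ['black'])[0]
    return color if color in ALLOWED_COLORS else 'black'
```

With this check in place, the exploit URL shown earlier would simply render the default black text, since the injected string is not on the allowlist.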

Cross-Site Request Forgery

One cannot separate the spider web’s form from the way in which it originated.

—Neri Oxman

Cross-site request forgery (CSRF, or sometimes XSRF) is an attack on a fundamental limitation in the Same Origin Policy. The vulnerability that these attacks exploit is conceptually simple but extremely subtle, so exactly where the problem lies, and how to fix it, can be hard to see at first. Web frameworks should provide CSRF protection, but a strong understanding of the underlying issue is still valuable so you can confirm that it works and be sure not to interfere with the mechanism.

Websites certainly can and often do include content, such as images from different websites, obtained via HTTP GET. The Same Origin Policy allows these requests while isolating the content, so the image data doesn’t leak between different websites from different domains. For example, site X can include on its page an image from site Y; the user sees the embedded image as part of the page, but site X itself cannot “see” the image, because the browser blocks script access to image data via the DOM.

But the Same Origin Policy works the same for POST as it does for GET, and POST requests can modify a site’s state. Here’s exactly what happens: the browser allows site X to submit a form to site Y, and includes the Y cookies, too. The browser ensures that the response from site Y is completely isolated from site X. The threat is that a POST can modify data on the Y server, which X shouldn’t be able to do, and by design, any website can POST to any other. Since browsers facilitate these unauthorized requests, web developers must explicitly defend against these attempts to modify data on the server.

A simple attack scenario will illustrate what CSRF vulnerabilities look like, how to exploit them, and in turn, how to defend against attack. Consider a social website Y, with many users who each have accounts. Site Y is running a poll, and each user gets one vote. The site drops a unique cookie for each authenticated user on the voting page, and then only accepts one vote per user.

A comment posted on the voting page says, “Check this out before you vote!” and links to a page on another website, X, that offers advice on how to vote. Many users click the link and read the page. With the Same Origin Policy protecting you, what could go wrong?

If you don’t see the problem yet, here’s a big hint: think about what might be going on in the site X window. Suppose site X is run by some dastardly and guileful cheaters trying to steal votes. Whenever a user browses to X, script on that page submits the site owner’s preferred vote to the social website in that user’s browser context (using their cookies from Y).

Since site X is allowed to submit forms using each user’s Y cookies, that’s enough to steal votes. The attackers just want to effect the state change on the server; they don’t need to see the response page confirming the user’s vote, which is all the Same Origin Policy blocks.

To prevent CSRF, ensure that valid state-changing requests are unguessable. In other words, treat each valid POST request as a special snowflake that only works once in the context of its intended use. An easy way to do this is by including a secret token as a hidden field in all forms, then checking that each request includes the secret corresponding to the given web session. There is a lot of nuance packed into the creation and checking of a secret token for CSRF protection. A decent web framework should handle this for you, but the details are worth digging into.

Here’s an example of the voting form with an anti-CSRF secret token highlighted:

<form action="/ballot" method="post">
  <label for="name">Voting for</label>
  <input type="text" id="name" name="name" value=""/>
  <input type="hidden" name="csrf_token"
         value="mGEyoi1wE6NBWCyhBN9IZdEmaJLQtrYxi0J23XuXR4o="/>
  <input type="submit" value="Vote"/>
</form>

The hidden csrf_token field doesn’t appear on the screen, but is included in the POST request. The field’s value is a base-64 encoding of a SHA-256 hash of the contents of the session cookie, but any per-client secret works. Here’s the Python code creating the anti-CSRF token for the session:

def csrf_token(self):
    # requires: import base64, hashlib
    digest = hashlib.sha256(self.session_id.encode('utf-8')).digest()
    return base64.b64encode(digest).decode('utf-8')

The code derives the token from the session cookie (the string value self.session_id), so it’s unique to each client. Since the Same Origin Policy prevents site X from knowing the victim’s site Y cookies, it’s impossible for X’s creators to concoct an authentic form that satisfies these conditions to POST and steal the vote.

The validation code on the Y server simply computes the expected token value and checks that the corresponding field in the incoming form matches it. The following code prevents CSRF attempts by returning an error message if the token doesn’t match, before actually processing the form:

    token = fields.get('csrf_token')
    if token != self.csrf_token():
        return 'Invalid request: Cross-site request forgery detected.'

There are many ways to mitigate CSRF attacks, but deriving the token from the session cookie is a nice solution because all the necessary information to do the check arrives in the POST request. Another possible mitigation is to use a nonce—an unguessable token for one-time use—but to fend off CSRF attacks, you still have to tie it to the intended client session. This solution involves generating the random nonce for the form’s CSRF token, storing the token in a table indexed by session, and then validating the form by looking up the nonce for the session and checking that it matches.
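Here’s a sketch of that nonce approach, assuming a simple in-memory table keyed by session ID (csrf_nonces and the function names are hypothetical):

```python
import secrets

# Hypothetical server-side table mapping session IDs to pending nonces.
csrf_nonces = {}

def issue_nonce(session_id):
    """Generate an unguessable one-time token for this session's form."""
    nonce = secrets.token_urlsafe(32)
    csrf_nonces[session_id] = nonce
    return nonce

def check_nonce(session_id, token):
    """Validate and consume the nonce; each one works at most once."""
    expected = csrf_nonces.pop(session_id, None)
    return expected is not None and secrets.compare_digest(expected, token)
```

The pop call consumes the nonce on first use, and keying the table by session ties the check to the intended client, as the text requires.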

Modern browsers support the SameSite attribute on cookies to mitigate CSRF attacks. SameSite=Strict blocks sending cookies for any third-party requests (to other domains) on a page, which would stop CSRF but can break some useful behavior when navigating to another site that expects its cookies. There are other settings available, but support may be inconsistent across browsers and older versions. Since this is a client-side CSRF defense, it’s risky for the server to depend on it completely, so consider it an additional mitigation rather than the sole defense.
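Extending the cookie attributes covered earlier in the chapter, a Set-Cookie value combining them with SameSite might look like this sketch (the cookie name is hypothetical):

```python
def strict_session_cookie(session_id):
    """Set-Cookie value combining Secure, HttpOnly, and SameSite=Strict.

    SameSite=Strict asks the browser not to send the cookie on
    cross-site requests, a client-side backstop against CSRF.
    """
    return 'session_id=%s; Secure; HttpOnly; SameSite=Strict' % session_id
```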

More Vulnerabilities and Mitigations

The only way you can know where the line is, is if you cross it.

—Dave Chappelle

To recap: to be secure you should build websites in pure HTTPS, using a quality framework. Don’t override protection features provided by the framework unless you really know what you are doing, which means understanding how vulnerabilities such as XSS and CSRF arise. Modern websites often incorporate external scripts, images, styling, and the like, and you should only depend on resources from sources that you can trust since you are letting them inject content into your web page.

Naturally, that isn’t the end of the story, as there are still plenty of ways to get in trouble when exposing a server to the web. Websites present a large attack surface to the public internet, and those untrusted inputs can easily trigger all manner of vulnerabilities in server code, such as SQL injection (web servers frequently use databases for storage) and all the rest.

There are a number of other web-specific pitfalls worth mentioning. Here are some of the more common additional issues to watch out for (though this list is hardly exhaustive):

  • Don’t let attackers inject untrusted inputs into HTTP headers (similar to XSS).
  • Specify accurate MIME content types to ensure that browsers process responses correctly.
  • Open redirects can be problematic: don’t allow redirects to arbitrary URLs.
  • Only embed websites you can trust with <IFRAME>. (Many browsers support the X-Frame-Options header mitigation.)
  • When working with untrusted XML data, beware of XML external entity (XXE) attacks.
  • The CSS :visited selector potentially discloses whether a given URL is in the browser history.

In addition, websites should use a great new feature, the HTTP Content-Security-Policy response header, to reduce exposure to XSS. It works by specifying authorized sources for scripts, images, and many other resource types, allowing the browser to block attempts to inject inline script or other malicious content from unauthorized domains. There are a lot of browsers out there, and browser compatibility for this feature is still inconsistent, so using this header isn’t sufficient to consider the vulnerability completely fixed. Think of this as an additional line of defense, but since it is client-side and out of your control, don’t consider it a free pass granting perfect immunity to XSS.
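As a hedged illustration, a policy like the following would restrict scripts and images to the site’s own origin plus one trusted CDN (static.example.com is a hypothetical host, and the exact directives depend on what your pages actually load):

```python
# Hypothetical policy; adjust the sources to what your pages actually load.
CSP_POLICY = (
    "default-src 'self'; "
    "script-src 'self' https://static.example.com; "
    "img-src 'self' https://static.example.com"
)

def csp_header():
    """Header name/value pair to attach to every HTML response."""
    return ('Content-Security-Policy', CSP_POLICY)
```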

Links to untrusted third-party websites can be risky because the browser may send a REFERER header, as mentioned earlier in this chapter, and provide a window.opener object in the DOM to the target page. The rel="noreferrer" and rel="noopener" attributes, respectively, should be used to block these unless they are useful and the target can be trusted.
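A small sketch of generating such a link server-side (the function name is hypothetical, and the inputs are assumed to be already validated and escaped, per the XSS discussion):

```python
def untrusted_link(url, text):
    """Anchor tag for a link to an untrusted target.

    rel="noopener noreferrer" suppresses the window.opener object and
    the REFERER header when the user follows the link. Assumes url and
    text have already been validated and HTML-escaped.
    """
    return '<a href="%s" rel="noopener noreferrer">%s</a>' % (url, text)
```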

Adding new security features after the fact may be daunting for large existing websites, but there is a relatively easy way of moving in the right direction. In a test environment, add restrictive security policies to all web pages, and then test the website and track down what gets blocked, issue by issue. If you prohibit script loading from a site that you know is safe and intended to use, then by incrementally loosening the script policy, you’ll quickly arrive at the correct policy exceptions. With automated in-browser testing to ensure the entire site gets exercised, you should be able to make great strides for security with a modest investment of effort.

There are a number of HTTP response headers that help you specify what the browser should or should not allow, including the Content-Security-Policy, Referrer-Policy, Strict-Transport-Security, X-Content-Type-Options, and X-Frame-Options headers. The specifications are still evolving, and support may vary from browser to browser, so this is a tricky, changing landscape. Ideally, make your website secure on the server side, and then use these security features as a second layer of defense, bearing in mind that reliance only on client-side mechanisms would be risky.
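These second-layer headers might be collected in one place and applied to every response, as in this sketch (the values shown are common conservative choices, not universal recommendations):

```python
# Conservative example values; tune each header to your site's needs.
SECURITY_HEADERS = {
    'Content-Security-Policy': "default-src 'self'",
    'Referrer-Policy': 'no-referrer',
    'Strict-Transport-Security': 'max-age=31536000; includeSubDomains',
    'X-Content-Type-Options': 'nosniff',
    'X-Frame-Options': 'DENY',
}

def apply_security_headers(response_headers):
    """Merge the defensive headers into an outgoing response's headers."""
    response_headers.update(SECURITY_HEADERS)
    return response_headers
```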

It’s amazing how secure the web actually is, considering all the ways that things can go wrong, what it evolved from, and the volume of critical data it carries. Perhaps, in hindsight, it’s best that security technologies have matured slowly over time as the web has seen widespread global adoption. Had the early innovators attempted to design a completely secure system back in the day, the task would have been extremely daunting, and had they failed, the entire endeavor might never have come to anything.
