Chapter 7. ML and AI Case Studies

The discussion in this book has been relatively abstract to this point. With new technologies, such as AI and ML, it can be difficult to picture how these tools will improve the efficiency and workflow of your security team. More important, it can be difficult to understand how using these tools can help you save money or at least get a better return on your investment. This chapter presents one case study that focuses on the use of AI and ML to detect and mitigate a sophisticated bot attack. This type of attack is a common problem and is an especially good fit for AI and ML technologies, as well as being easy to implement.

Case Study: Global Media Company Fights Scraping Bots

To better understand how AI and ML can thwart malicious bots, this case study examines a global media company with a large marketing presence: more than 50,000 websites and more than 20,000 pay-per-click campaigns running at any given time (Figures 7-1 and 7-2). The company took preventive action when it began experiencing a high volume of sophisticated scraping bots.

Figure 7-1. Typical traffic from the customer (number of HTTPS requests per 5 minutes, over 24 hours). Note that the figure also shows blocked attack traffic (traffic identified by the WAF, Access Control, and Bot Manager is blocked) and caching performance (in green).
Figure 7-2. Typical traffic from the customer (number of HTTPS Requests over 24 hours)

A scraping bot is one that visits a website and grabs content to repurpose it on another site. This is a common tactic used by disreputable “news” websites. Rather than create original content, they steal content from other sites and post it as their own. These deceitful news sites then use “shady” search engine optimization (SEO) techniques to elevate their sites in search engines. So, not only are they stealing content, they are also stealing clicks away from legitimate sites that produce original content. This type of activity is surprisingly common. Take a paragraph from almost any story on a news site and run it through a search engine, and there will often be dozens of clones of that story on sites with domains that are just slightly off.

The Problem

The onslaught of malicious bots was stealing data from the media company’s sites and providing that data to third-party competitors, negatively affecting revenue streams and diluting brand recognition. This wasn’t just a matter of copyright theft, however; there was also resource theft. The high volume of nonhuman traffic degraded the user experience and increased the number of servers needed to handle the onslaught of requests. This increased hardware spending on web services by 100% year over year, an expenditure that was necessary to ensure that sufficient resources were available to handle the significant load.

A traditional web application firewall was unable to solve the problem, because these web scraping bots, at least at first glance, look like legitimate traffic. When a normal user visits a website, the web browser makes what is called a “GET” request. The GET request does just what it suggests: it “gets” the requested content. For example, a visitor to the CNN home page will, temporarily, download all the content on the front page, and any article the visitor clicks will also be downloaded as part of the HTTP/HTTPS transaction. That is how the web browser can render the HTML content for the visitor.
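To make the mechanics concrete, a GET request is just structured text sent to the server; the server’s response carries the HTML that the browser then renders. The sketch below builds a simplified request by hand (the user-agent string and helper name are illustrative, not a complete browser request):

```python
def build_get_request(host: str, path: str = "/") -> str:
    # The raw text a browser sends when it "gets" a page. Real browsers
    # send many more headers, but the shape is the same.
    return (
        f"GET {path} HTTP/1.1\r\n"
        f"Host: {host}\r\n"
        "User-Agent: ExampleBrowser/1.0\r\n"
        "Accept: text/html\r\n"
        "\r\n"  # blank line terminates the header section
    )
```

Because a scraper’s GET requests are textually indistinguishable from a browser’s, a firewall inspecting individual requests has little to work with; the difference only shows up in behavior across requests.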

Human versus Bot Behavior

Human and bot behavior often differ in significant ways. A legitimate user stays on a rendered page for several seconds or longer while reading the content. The user might then click one or two links from the main page (more if the site is particularly good at retaining visitors) and stay on each of those pages for several seconds or longer to read the content.

Typically, bots do not behave that way. When a bot visits a targeted web page, it often immediately begins following the links on that page to grab as much content as possible, as quickly as possible. A bot might spend a fraction of a second on a page and visit hundreds or thousands of pages while on a site. Such a bot can be easily identified and blocked using real-time, behavior-based analytics, as illustrated in Figure 7-3. End-user behavior is evaluated by looking at time spent on each page, links clicked, scrolling actions, and pages visited historically, to name just a few signals.

Figure 7-3. Behavioral analysis setup
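The simplest form of this analysis is threshold logic on per-session metrics. The sketch below is a minimal, hedged illustration; the `classify_session` helper and its thresholds are hypothetical, and a production system would weigh many more signals (scrolling, click paths, visit history):

```python
def classify_session(page_views, max_pages=100, min_avg_dwell=1.0):
    """page_views: list of (path, seconds_on_page) for one visitor session.

    Humans dwell for seconds on a handful of pages; crude scrapers dwell
    for fractions of a second across hundreds of pages.
    """
    avg_dwell = sum(seconds for _, seconds in page_views) / len(page_views)
    if len(page_views) > max_pages and avg_dwell < min_avg_dwell:
        return "bot"
    return "human"
```

A session of 500 pages at a tenth of a second each trips both thresholds, while a reader who spends half a minute on each of three articles does not.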

Some more advanced bots have become better at hiding their tracks. They spend longer periods of time on each page to better mimic real visitor behavior, and they divide the scraping of subpages among hundreds of bots so that no single IP address appears to be doing all of the scraping. Even then, with a close examination of traffic, and knowing what to look for, it is possible to distinguish bot visitors from human visitors, as shown in Figure 7-4. But doing so “on the fly” requires AI and ML.

Figure 7-4. A sample of IPs from identified malicious bots over a period of a few hours.

Remember, no matter how much time and effort an attacker invests in building a botnet to mimic legitimate traffic to a website, the one thing that attacker doesn’t have is the metrics on how visitors to the site engage with the web applications. As long as you have that information available, it is possible to detect bot traffic—even advanced bot traffic, using AI and ML.
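One simple way to exploit those engagement metrics, sketched here under the assumption that the defender maintains a baseline of time-on-page for real visitors, is to score how far a new session deviates from that baseline in standard deviations (the function name and thresholds are illustrative):

```python
import statistics

def anomaly_score(session_avg_dwell, baseline_dwells):
    """How far a session's average time-on-page sits from the site's own
    visitor baseline, measured in standard deviations. The defender has
    this baseline; the attacker does not."""
    mu = statistics.mean(baseline_dwells)
    sigma = statistics.stdev(baseline_dwells)
    return abs(session_avg_dwell - mu) / sigma
```

A bot mimicking "a visitor" without knowing what real visitors do tends to land far from the baseline, which is exactly the asymmetry the paragraph above describes.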

Bot Management

In this case, the media company implemented a layered bot manager solution that incorporated analytics and controls, combined with an extra layer of ML, based on input from a data science team.

Human Interaction Analysis: 90% Were Bots

A human interaction bot mitigation (based on behavior analysis) was able to block about 90% of the suspicious activities, as depicted in Figure 7-5. The remaining 10% required a much more sophisticated ML approach.

Figure 7-5. Requests blocked by Human Interaction Analysis. Note that about 7% of the traffic is blocked by a Human Interaction Challenge.

The global media company’s client traffic pattern displayed a spike in malicious bot activity blocked by the Human Interaction Challenge (HIC). Note also that a very large number of blocked requests came from compromised hosts. Access rules block users based on a threat intelligence database that contains a very large number of known compromised hosts, identified by IP address.

JavaScript challenge

About 8% of identified malicious bots were blocked by a simple JavaScript Challenge, which identifies visitors whose browsers lack a full JavaScript engine (typical of crude bots). This bot challenge does not use behavior analysis and is typically useful for preventing large-scale DDoS attacks and small “background noise” attacks.
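A JavaScript challenge can be approximated as a signed token that injected JavaScript must compute and echo back; a client with no JS engine never returns it. The sketch below is illustrative only — the function names, token format, and key handling are assumptions, not any vendor’s API:

```python
import hashlib
import hmac
import secrets
import time

SECRET = secrets.token_bytes(32)  # hypothetical server-side signing key

def issue_challenge(client_id: str) -> str:
    # Token embedded in the page's JavaScript; a real browser executes
    # the script and sends the token back on its next request.
    ts = str(int(time.time()))
    sig = hmac.new(SECRET, f"{client_id}:{ts}".encode(), hashlib.sha256).hexdigest()
    return f"{ts}:{sig}"

def verify_challenge(client_id: str, token: str, max_age: int = 300) -> bool:
    # A crude bot without a JS engine never computes the token, so this
    # check fails and the request is blocked.
    try:
        ts, sig = token.split(":")
    except ValueError:
        return False
    expected = hmac.new(SECRET, f"{client_id}:{ts}".encode(), hashlib.sha256).hexdigest()
    return hmac.compare_digest(sig, expected) and time.time() - int(ts) <= max_age
```

Because the check is cheap and stateless, it scales well against volumetric attacks, which is why this class of challenge is a first line of defense rather than a behavioral one.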

The complex bots were identified (see Figures 7-6 and 7-7) when they failed the bot management solution’s HIC. This challenge established normal usage patterns for each web application, based on expected visitor behavior provided by analysts. Customized security postures were then deployed to stop bots that deviated from the standard usage behavior, using a combination of anomaly and behavioral analysis, as shown in Figure 7-6.

Figure 7-6. The Human Interaction Challenge (labeled HIC in the graph) blocked 92% of identified malicious bots. JavaScript Challenge blocked the remaining 8%.

As a result, the company was able to identify and isolate the highly advanced bots that were able to bypass traditional bot management and mitigation techniques.

It would not stop…

Advanced bot protection has now been in place for months, but the attackers still send daily bot traffic, even though the site is now well protected.

Limiting resource utilization

There was also an unexpected benefit to the enhanced bot protection. The company’s image store infrastructure had been strained by the massive amount of static content that its sites were expected to serve to a global user base.

As expected, most of the bot traffic consisted of uncacheable searches of the site, which placed a disproportionate workload on the web servers.

The caching functionality (included with the bot protections implemented) dramatically reduced the load on the infrastructure (see Figure 7-7), providing much-needed headroom while the organization continued to upgrade its infrastructure.

Figure 7-7. Caching improvements created by the bot management solution.
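Why caching relieves the origin can be shown with a toy cache in front of an origin fetch. The `EdgeCache` class and its counters are hypothetical, meant only to illustrate that repeated requests for the same static asset stop reaching the origin servers:

```python
class EdgeCache:
    """A dictionary-backed cache sitting in front of the origin."""

    def __init__(self):
        self._store = {}
        self.hits = 0            # requests served from cache
        self.origin_fetches = 0  # requests that reached the origin

    def get(self, url, fetch_from_origin):
        if url in self._store:
            self.hits += 1
        else:
            self.origin_fetches += 1
            self._store[url] = fetch_from_origin(url)
        return self._store[url]
```

For static content such as images, every request after the first is absorbed at the edge, which is where the headroom described above comes from.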

When Nothing Else Works: Using Very Sophisticated ML Engines with a Data Science Team

For this organization, there were about 20,000 suspect requests over a few days (out of a total of 56,000,000 legitimate requests over the same period). Those suspect requests came from many IP addresses, with a clear pattern: the same IP/user would systematically go through an entire subsection of the site’s content, never come back, and then be replaced by another IP looking for different subcontent.
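That rotation pattern — one IP exhaustively crawling one content subsection, then disappearing — lends itself to a simple heuristic. The sketch below, with an invented `flag_rotating_scrapers` helper and assumed thresholds, flags IPs that request many distinct pages all confined to a single section:

```python
from collections import defaultdict

def flag_rotating_scrapers(log, min_pages=20, max_sections=1):
    """log: list of (ip, path) tuples from the access log.

    Flags IPs that fetch many distinct pages confined to a single
    top-level content section -- the 'one IP per subsection' pattern.
    """
    by_ip = defaultdict(set)
    for ip, path in log:
        by_ip[ip].add(path)
    flagged = []
    for ip, paths in by_ip.items():
        # Top-level section of each path, e.g. "/sports/story-1" -> "sports"
        sections = {p.split("/")[1] for p in paths if p.count("/") >= 1}
        if len(paths) >= min_pages and len(sections) <= max_sections:
            flagged.append(ip)
    return flagged
```

A heuristic this crude would miss the sophisticated traffic described here, which is precisely why the case escalated to an ML approach; it only captures the shape of the pattern an analyst would look for first.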

The traffic came from apparently legitimate user agents.

A data science team applied an unsupervised ML algorithm to see whether a pattern in these suspect requests could be identified. This is what was found:

  • The attack traffic was interspersed within the legitimate traffic, spread over several hundred IPs. Graphical analysis of the attack was inconclusive (that is, human analysts were not able to see visual patterns), as demonstrated in Figure 7-8.

Figure 7-8. Example of attack traffic (in red) and legitimate traffic (in green), using a data visualization tool (x and y are irrelevant for this exercise, the goal is to identify patterns).
  • The ML platform finally identified a very strong pattern (with a correlation of almost 100%, the pattern represents almost all of the attack traffic), using vectors of more than 120 elements (that is, more than 120 different pieces of information define an attack request). The pattern could be identified only in a space of 120 dimensions, as shown in Figure 7-9.

Figure 7-9. Example of a vector that defines the pattern of the attack traffic.
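Clustering request vectors is one common unsupervised approach. The case study does not describe the platform’s algorithm, so the tiny k-means below is only a generic stand-in, using 2-element points instead of the 120-element request vectors mentioned above:

```python
import math

def kmeans(points, k=2, iters=20):
    """A minimal k-means sketch over tuples of floats.

    Real platforms cluster far richer vectors (120+ features per
    request, per the case study) and use more robust initialization.
    """
    centroids = points[:k]  # naive initialization: first k points
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # Assign each point to its nearest centroid.
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Move each centroid to the mean of its cluster.
        centroids = [
            tuple(sum(c) / len(cl) for c in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
    return centroids, clusters
```

The point of the exercise is the same as in the case study: requests that look indistinguishable to a human analyst in two or three dimensions can separate cleanly once enough features are considered together.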

Correlated elements

The most highly correlated elements of the vector were a particular group of headers and header content (certain sorts of HTTPS headers and values account for the main part of the attack traffic), as shown in Figure 7-10.

Figure 7-10. Main elements of the traffic vector, with correlation to the attack traffic.
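A crude stand-in for per-element correlation is to compare, for each header, how often it appears in attack traffic versus legitimate traffic. The `header_lift` helper and the `X-Scraper-Token` header below are invented for illustration, not drawn from the case study’s data:

```python
def header_lift(requests):
    """requests: list of (headers, is_attack), where headers is a set of
    header names seen on one request.

    For each header, returns the share of attack requests carrying it
    minus the share of legitimate requests carrying it. Values near +1
    mark headers strongly associated with the attack traffic.
    """
    attack = [h for h, is_a in requests if is_a]
    legit = [h for h, is_a in requests if not is_a]
    names = set().union(*(h for h, _ in requests))
    return {
        name: sum(name in h for h in attack) / len(attack)
              - sum(name in h for h in legit) / len(legit)
        for name in names
    }
```

Ranking headers by a score like this is roughly what a chart of "main elements of the traffic vector" summarizes: which request features carry the signal.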

After the ML engine was deployed in production, the platform was able to identify almost all malicious traffic with only a few false positives.

The Results

The bot management solution allowed the media company to mitigate attack traffic using several strategies: ML-based, automated HIC (a behavior-based analysis of the traffic) and sophisticated, supervised ML that blocks traffic corresponding to a pattern that is hard for humans to visualize, represented by vectors of more than 120 elements. The latter was applied by a specialized team of data scientists.

At the same time, the company was able to apply controls to restrict resources (bandwidth and CPU) allocated to illicit traffic. The company continues to work closely with the managed services provider that provided the cloud-based bot management solution in order to research and identify increasingly advanced—and, in many cases, custom—malicious bots and other targeted attacks.
