13
Applications of Machine Learning Techniques in the Realm of Cybersecurity

Koushal Kumar1 and Bhagwati Prasad Pande2*

1Sikh National College, Qadian, Guru Nanak Dev University, Punjab, India

2Department of Computer Applications, LSM Government PG College, Pithoragarh, Uttarakhand, India

*Corresponding author: [email protected]

Abstract

Machine learning (ML) is a rapidly growing field with massive potential in numerous domains. ML is a subset of Artificial Intelligence (AI) that empowers digital machines with the ability to learn without being explicitly programmed, i.e., the capability to learn from past experience. Over the last decade, ML has been used in various domains because of characteristics such as adaptability, robustness, learnability, and the ability to act quickly against unexpected challenges. Traditional cybersecurity systems are built on rules, attack signatures, and fixed algorithms. Such systems can act only upon the ‘knowledge’ fed to them, and human intervention is continually required for them to function properly. ML, on the other hand, can recognize patterns from past experience and can predict or detect future attacks based on seen or unseen data. ML is also capable of handling massive real-time network data, which allows various issues present in conventional cybersecurity systems to be overcome. This chapter discusses various issues related to the applications of ML in cybersecurity. It examines the effectiveness of applying ML to cybersecurity and identifies the contemporary challenges faced by researchers in this realm. It then presents available datasets and algorithms for implementing ML in the domain of cybersecurity, and compares the datasets across various parameters. Finally, applications of ML practices by three renowned businesses, Facebook, Microsoft, and Google, are explored.

Keywords: Machine Learning (ML), cybersecurity, malware, intrusion detection, adversarial attacks, deep learning, dataset

13.1 Introduction

The term cybersecurity refers to the act of securing electronic devices (computers, mobile phones, servers, network systems, etc.), programs, data, and the cloud against digital attacks. Sometimes referred to as electronic information security, cybersecurity aims to protect electronic devices and software from unauthorized access, malicious attacks, and damage, and to guard the services they offer against disruption or misdirection. Cyberattacks generally endeavor to access, modify, or destroy sensitive information; to deceive users and extort money from them; to disrupt automated business functions; etc. Cybersecurity has become substantially more crucial in recent years due to the advancement of computing technology; the availability of smart devices (phones, tablets, TVs, AI-enabled personal assistants, etc.); dependence on these devices; and easy access to the Internet. Constant advancement and research in cybersecurity programs to guard individuals, businesses, and critical public infrastructure (banks, hospitals, power plants, etc.) is an essential need in the modern digital era.

The basic principles of cybersecurity state that a successful cybersecurity approach must possess multiple layers of protection, and that the three main components of any cybersecurity system, namely people, process, and technology, must be in harmony with each other to form a strong shield against cyberattacks. Currently, two technology trends have greatly influenced cyberattacks and the corresponding remedies: (1) the explosion of digital data, i.e., data stored in devices and clouds, and (2) the Internet of Things (IoT), i.e., devices connected to the Internet or another network. The landscape of today’s cybercrime includes several security threats and breaches, such as malware, phishing, distributed denial of service (DDoS), ransomware, advanced persistent threats, money theft, intellectual property theft, man in the middle (MITM), drive-by downloads, unpatched software, wiper attacks, etc. The list is indeed long. Therefore, to deal with such modern-day threats, an advanced and smart threat intelligence management system is required.

Machine Learning (ML) is a branch of Artificial Intelligence (AI) that equips computer systems and programs with the capability to learn from experience (data) and improve automatically without being explicitly programmed. ML deals with developing algorithms that allow computer programs to access data and use it to learn and change their actions accordingly. Based on sample (training) data, ML algorithms build a set of rules, known as a model, to make predictions or decisions without the direct involvement of humans. Deep Learning (DL) is a subset of ML that works with a layered structure of algorithms known as Artificial Neural Networks (ANNs), whereas the rest of ML uses simpler concepts. ANNs are designed to imitate the way humans think and learn, and DL models constantly analyze data with such networks to draw conclusions. For many tasks, this process of learning can be far more effective than that of standard ML models.

The ability of ML algorithms to make assumptions about the behavior of a computer and to adjust the functions it performs is a boon for securing it against threats. Because ML algorithms can scan billions or even trillions of files and identify potentially malicious ones, their usage in cybersecurity has been increasing over the past few years. Today, it is almost impossible to imagine a sound, smart, and successful cybersecurity system without ML. ML gives cybersecurity systems the power to analyze threat patterns and to learn from this experience in order to detect and prevent future attacks with similar patterns or signatures. This enables cybersecurity systems to respond to changing behaviors of machines, to be proactive in preventing malicious activities, and to respond in real time to active attacks. With ML, businesses can use their resources more efficiently, as it reduces the time spent on routine tasks. In simple words, ML makes the practice of cybersecurity simpler, more proactive, more effective, and less expensive.

According to the International Data Corporation (IDC), the market for Artificial Intelligence (AI) and ML saw a rise of $37 in the past four years [1]. Google reports that around 50 to 70 percent of emails in Gmail are spam, which is filtered automatically using ML algorithms. Google also takes advantage of ML to analyze, identify, and remove threats on mobile devices that run Android. The cloud giant Amazon acquired a cybersecurity start-up named harvest.ai and launched Macie, a service that automates the process of cloud data protection with ML. Apple Inc. also applies ML to ensure the security of its users’ information. According to Drinkwater [2], ML can be applied in the following realms of cybersecurity: (1) to detect and stop malicious activities; (2) to secure mobile devices; (3) to enhance human analysis; (4) to automate repetitive security functions; (5) to rectify zero-day vulnerabilities.

The rest of the chapter is organized as follows: Section 13.2 presents a brief literature review; Section 13.3 discusses various issues related to ML and cybersecurity; Section 13.4 sheds light on the ML datasets and algorithms used in cybersecurity; Section 13.5 highlights applications of ML by three leading organizations in cybersecurity; and the final section concludes the present work.

13.2 A Brief Literature Review

Dua and Xian [3] provided a rich reference for particular ML solutions to the issues of cybersecurity. Their work serves as a state-of-the-art reference on the application of ML and data mining techniques in cybersecurity. The authors presented a categorization of ML methods for various tasks like signature, anomaly, hybrid, and scan detection; profiling network traffic; etc. They also discussed emerging challenges in cybersecurity in detail. Tesfahun and Bhaskari [4] used the NSL-KDD dataset to develop an intrusion detection system. The authors exploited a technique known as SMOTE (Synthetic Minority Oversampling TEchnique) and presented a feature selection method to reduce the number of features in the training dataset. The authors applied random forest classifiers to develop the intrusion detection system and, based on the empirical results, claimed that their approach reduced the time required to build the model and increased the detection rate as well. Ford and Siraj [5] highlighted various applications of ML, like phishing detection, intrusion detection, keystroke authentication, cryptography, spam detection in social networks, etc., in the realm of cybersecurity. The authors underlined that although ML tools keep systems safe, the ML classifiers are themselves vulnerable to malicious attacks. Das and Morris [6] presented applications of various ML techniques to cybersecurity. They discussed some important datasets like network packet data, NetFlow, DARPA 1998 and 1999, etc., and described the working of various ML techniques, like Bayesian networks for anomaly detection and comparison of data with known attack patterns, decision trees for comparison, clustering for real-time signature detection, Hidden Markov Models (HMMs) for intrusion detection, etc. The authors also evaluated four ML algorithms (Naïve Bayes, Random Forest, OneR and J48) on MODBUS data.

Apruzzese et al. [7] discussed the effectiveness of ML and DL methods for cybersecurity. The authors presented a rich review of the literature and performed experiments on real-world network traffic and enterprise systems. They analyzed the security aspects of ML algorithms for intrusion detection, malware analysis, and spam and phishing detection. They also presented a two-level classification of algorithms, first as Shallow Learning or Deep Learning, and then as Supervised or Unsupervised. The authors explored several issues that may affect the application of ML algorithms to cybersecurity and claimed, on an empirical basis, that current ML techniques suffer from many imperfections which may reduce their effectiveness for cybersecurity. Rege and Mbah [8] presented ML-based defensive and offensive techniques in the realm of cybersecurity. The authors discussed applications of ML in implementing cyberattacks and various related issues, like threat detection and classification, network risk scoring, automated routine security tasks, and optimized human analysis. The authors also discussed the applications of ML in cybercrime. Devakunchari et al. [9] presented a review of ML and DL methods in the domain of cybersecurity and performed a comparative survey of ML techniques for intrusion detection systems.

In their review article, Handa et al. [10] discussed various areas of cybersecurity where ML plays a vital role. The authors discussed several applications of ML in cybersecurity, like power system security, industrial control systems, detection of cyberattacks, malware analysis, etc. The authors also highlighted the fact that ML can be exploited for malicious activities. They discussed possible adversarial attacks through manipulation of the training and test data of ML classifiers. Dasgupta et al. [11] presented a thorough survey on the usage of ML technology in cybersecurity. The authors described cyberattacks, the related defense strategies, and the mechanisms of ML algorithms commonly employed for cybersecurity. The authors highlighted the fact that ML algorithms may prove to be vulnerable to attacks in both the training and testing phases. Iyer [12] discussed cybercrime and applications of ML in cybersecurity. The author highlighted the recent advancements and challenges in ML in the realm of cybersecurity and also shed light on future trends and directions in ML and cybersecurity.

Lakshmanarao and Shashi [13] presented a survey on ML in cybersecurity. The authors discussed various cybersecurity issues and proposed corresponding ML algorithms to deal with them. They commented that ML technology cannot completely automate a cybersecurity system, but that it helps to trace threats more effectively than other software-oriented techniques. The authors suggested that multi-layered models need to be developed to attain high detection rates and to provide resilience against malware attacks. Sagar et al. [14] discussed real-world security requirements and the ML applications that deal with them. They compared various ML models over some parameters and accuracy results. The authors also discussed several possible adversarial attacks from which ML models can suffer. Sarker et al. [15] discussed issues and future prospects related to cybersecurity data science. The authors developed a multilayered framework for cybersecurity modelling based on ML. They focused on applications of cybersecurity data science to protect systems against cyberattacks by developing data-based smart decision-making tools and services.

13.3 Machine Learning and Cybersecurity: Various Issues

Organizations around the world have shown significant interest in implementing ML models for various purposes, and security is one of them. The primary objective of ML applications in the cybersecurity framework is to make the security analysis process more effective and automated than in traditional cybersecurity systems. However, the ever-changing nature of cyber threats continually challenges security researchers to explore all the potential threats in cybersecurity systems using ML approaches. In this section, we discuss various issues related to the use of ML in cybersecurity.

13.3.1 Effectiveness of ML Technology in Cybersecurity Systems

These days, a variety of domains have incorporated ML models in their conventional security designs and the results reveal its superiority over traditional rule-based or signature-based algorithms. Various organizations around the world are generating a huge amount of data daily from various user activities, network data traffic, and many other electronic transactions. It is the job of a security analyst to observe various patterns in this data, so that suspicious or abnormal activity patterns can be identified. The real problem in this process is that it can be very challenging and time-consuming for security professionals to manually analyse such a massive amount of data for finding suspicious and abnormal patterns. On the other hand, machines are much more efficient than humans for recognizing patterns, and ML enables a digital device to learn and become more intelligent.

Anti-virus software (AVS) packages recognise malware and viruses based on their signatures. Therefore, AVS can only detect malware that matches a virus signature in its database [8]. On the other hand, ML-based cybersecurity systems learn from data and have the capability to recognise both known and unknown malware. ML algorithms are gaining popularity and adoption in cybersecurity domains, and their outcomes show strong performance in detecting and preventing numerous cyberattacks [16]. Recent studies suggest that it is hard to implement powerful first-level cybersecurity systems without relying on ML algorithms.

Many organizations across the world are now upgrading their security systems with ML technology, and research analysts estimate that the market for ML in cybersecurity will rise to $96 billion by 2021. In contrast to conventional cybersecurity systems, ML enables cybersecurity computing processes to be more actionable and intelligent even in unfamiliar circumstances. The effectiveness of ML in cybersecurity systems draws on its capability for pattern analysis, which helps in quick decision making and ultimately helps prevent both known and unknown attacks. ML technology has the potential to handle the massive data generated in diverse networks, which makes it a strong solution for various security issues such as user authentication, access control, firewall filtering, etc. [15]. ML-based security tools work by correlating incoming data traffic and organising it into patterns, scanning for potential threats, making a predictive analysis, and forecasting the next attack.

The applications of ML in cybersecurity save an organization a substantial amount of time and resources that cybersecurity analysts would otherwise have to invest. ML techniques have proved effective in the cybersecurity domain for automating repetitive tasks, which enables security experts to focus on more important strategic issues. Many researchers believe that ML is an ideal solution for handling zero-day cyberattacks, as it helps security professionals to close vulnerabilities and stop patch exploits before they result in a data breach [17]. Supervised and unsupervised ML algorithms can classify requests as normal or malicious using various statistical methods. Supervised algorithms are very effective in detecting denial of service (DoS) attacks, spoofing, and intrusions. Unsupervised algorithms, on the other hand, are mainly used for identifying anomalies, policy violations, etc. Various modern ML techniques such as DL, Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), etc. have been developed to strengthen attack detection capabilities as well as to reduce false positive and false negative rates. Various studies reveal that intrusion detection and prevention systems built using CNNs and RNNs show higher accuracy, produce fewer false alerts, and take less time to identify malicious traffic across the network [18].

The effectiveness of ML models depends directly on the quality of the data provided in the training and testing phases of the model. In general, ML can make cybersecurity more responsive, less expensive, and far more effective if the data provided to the models reflects a complete picture of the environment.

13.3.2 Machine Learning Problems and Challenges in Cybersecurity

Researchers have presented various emerging ML issues and challenges in the domain of cybersecurity, which require contemporary methodologies to handle them. We summarize some of these significant problems and challenges below, ranging from data collection to decision making.

13.3.2.1 Lack of Appropriate Datasets

For ML technology to play a major role in cybersecurity, the biggest challenge observed by many researchers is the inaccessibility of appropriate datasets. Many of the available cybersecurity datasets are obsolete and are not sufficient to understand the recent behaviour of different cyberattacks. Without comprehensive datasets, researchers cannot analyse all types of threats and vulnerabilities [19].

13.3.2.2 Reduction in False Positives and False Negatives

Since most ML algorithms in the domain of cybersecurity are based on supervised and unsupervised approaches, they have a strong tendency to generate false positives and false negatives. A false positive occurs when the ML model incorrectly flags a benign sample as an attack, and a false negative occurs when it incorrectly labels an actual attack as benign. Such misclassification of threats creates a serious problem for the security of any organization. In addition, an ML algorithm generates an alert about a malicious request or file, but it cannot explain what precisely was malicious about it [7].
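
The two error types described above can be made concrete with a small counting example. The following is a minimal sketch in Python; the function names, the label strings, and the toy predictions are illustrative assumptions and do not come from any of the cited studies.

```python
def confusion_counts(y_true, y_pred, positive="attack"):
    """Count true/false positives and negatives for a binary attack detector."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1          # attack correctly flagged
        elif pred == positive:
            fp += 1          # benign sample flagged as an attack
        elif truth == positive:
            fn += 1          # attack missed by the detector
        else:
            tn += 1          # benign sample correctly passed
    return tp, fp, tn, fn


def error_rates(tp, fp, tn, fn):
    """Return (false positive rate, false negative rate)."""
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    fnr = fn / (fn + tp) if (fn + tp) else 0.0
    return fpr, fnr


# Hypothetical ground truth and detector output for six requests.
y_true = ["normal", "attack", "normal", "attack", "normal", "attack"]
y_pred = ["attack", "attack", "normal", "normal", "normal", "attack"]

counts = confusion_counts(y_true, y_pred)
fpr, fnr = error_rates(*counts)
print(counts, fpr, fnr)
```

In this toy run, one of three benign requests is flagged and one of three attacks is missed, so both rates are one third. In practice the two rates usually trade off against each other when a detection threshold is moved.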

13.3.2.3 Adversarial Machine Learning

Another major challenge with ML-based applications in cybersecurity is the intelligent adversary. The performance of any ML model depends heavily on the quality of the data supplied to it during the training and testing phases. Therefore, the accuracy of ML models is reduced dramatically if they are trained on noisy datasets. To deceive an ML model, an attacker can include a long stretch of harmless code alongside the malicious code. This may keep the ML algorithm busy in classification and regression cycles as it reads all the information. Such noise code is included to distort the ML algorithm’s measurement of the correlation between malicious and normal packets. Recently, a few studies have shown that it is possible to fully bypass ML algorithms by embedding malicious code inside benign code and vice versa. Defending against adversarial attacks requires continuous retraining and careful parameter tuning that cannot be automated [20].
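
The padding idea can be illustrated with a deliberately naive detector. The sketch below is not an ML model and is not taken from the cited work: it assumes a hypothetical detector that flags a sample when the fraction of suspicious tokens exceeds a threshold, and the token list and threshold are invented for illustration. It only shows how adding harmless content can dilute a ratio-based feature.

```python
# Hypothetical list of tokens the toy detector treats as suspicious.
SUSPICIOUS = {"exec", "eval", "socket", "chmod"}


def suspicious_ratio(tokens):
    """Fraction of tokens that appear in the suspicious list."""
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token in SUSPICIOUS)
    return hits / len(tokens)


def is_flagged(tokens, threshold=0.3):
    """Flag the sample when the suspicious fraction reaches the threshold."""
    return suspicious_ratio(tokens) >= threshold


malicious = ["socket", "exec", "eval", "print"]
padding = ["print", "len", "str", "int"] * 5   # long run of harmless tokens
padded = malicious + padding                   # same payload, diluted

print(is_flagged(malicious), suspicious_ratio(malicious))
print(is_flagged(padded), suspicious_ratio(padded))
```

The malicious payload is unchanged in the padded sample, yet the suspicious fraction falls from 0.75 to 0.125 and the toy detector no longer flags it. Features that are robust to such dilution (for example, absolute counts or the presence of specific sequences) are one possible mitigation, alongside the retraining mentioned above.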

13.3.2.4 Lack of Feature Engineering Techniques

The effectiveness of ML algorithms in cybersecurity depends on the quality and amount of data used to build ML models. Determining the appropriate data sources for training and testing models, and assessing how much data is adequate for training them, is a challenging task that is related to the problem of feature engineering. Feature engineering is one of the most challenging phases in the development of ML models for cybersecurity tasks. In the context of datasets, features are the pieces of information that describe a given sample of data, and the process of pre-processing existing data to build new and more informative features is known as feature engineering. The quality of the features supplied to the model is more important than their number. Thus, it is crucial to use the correct features to train the model. However, feature engineering is generally guided by domain knowledge, and this approach is usually ineffective given the complicated nature of cyber data [21].

13.3.2.5 Context-Awareness in Cybersecurity

Most cybersecurity research is conducted on relevant real-time datasets that contain several low-level features. After applying various data pre-processing treatments, data mining, and ML techniques to these datasets, new relationships can be identified among the features that describe the datasets properly. However, contextual information, such as temporal and spatial relationships and correlations among events, can also be used to determine whether malicious activity exists in the network. Accordingly, a major drawback of ML-based cybersecurity is that such contextual information is rarely used for predicting risks or attacks [22].
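
As a small illustration of temporal context, the sketch below looks at failed logins per source within a sliding time window. It is a hedged example only: the event format `(timestamp_seconds, source, success)`, the window length, the failure limit, and the IP addresses are all assumptions made for illustration, not values from the cited literature.

```python
from collections import defaultdict, deque


def flag_bursts(events, window=60, max_failures=3):
    """Return sources with at least `max_failures` failed logins inside any
    `window`-second span. Events are assumed to be sorted by timestamp."""
    recent = defaultdict(deque)   # source -> timestamps of recent failures
    flagged = set()
    for timestamp, source, success in events:
        if success:
            continue
        failures = recent[source]
        failures.append(timestamp)
        # Drop failures that have fallen out of the time window.
        while failures and timestamp - failures[0] > window:
            failures.popleft()
        if len(failures) >= max_failures:
            flagged.add(source)
    return flagged


# Hypothetical log: both sources fail three times, but only one does so in a burst.
events = [
    (0, "10.0.0.5", False),
    (0, "10.0.0.9", False),
    (10, "10.0.0.5", False),
    (20, "10.0.0.5", False),
    (300, "10.0.0.9", False),
    (600, "10.0.0.9", False),
]

print(flag_bursts(events))
```

Both sources have the same low-level count of three failed logins, so a feature that ignores time cannot tell them apart. Only the temporal context separates the burst from the slow, scattered failures.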

13.3.3 Is Machine Learning Enough to Stop Cybercrime?

Over the last decade, ML and its accompanying technologies have brought a substantial revolution in the cybersecurity domain. The reason behind the success of ML in the realm of cybersecurity is its ability to rapidly scan a vast volume of data traffic and analyse it using statistical techniques. However, the brains behind cybercrime are continually finding innovative ways to wreak havoc, steal sensitive information, and cause all kinds of disruption. Although ML has been continuously changing the cybersecurity landscape and preventing cyberattacks, we cannot entirely rely on it. Hackers and cybercriminals have already leveraged AI and ML for their malicious purposes. Even though many private and public organizations are applying ML to identify and mitigate cyberattacks more effectively than ever before, we still have a long way to go in developing highly effective data security systems [23]. Over-reliance on ML in the domain of cybersecurity may create a false sense of complete security. ML developers assume that a machine can learn everything from the data, but that is not always true; human intervention cannot be ignored, and domain experts can play a crucial role in cybersecurity. ML is an emerging technology that needs to be significantly improved before it can be relied on exclusively. Many experts across the globe predict that AI and ML are surely going to be the future of cybersecurity, but with a bit of human supervision.

13.4 ML Datasets and Algorithms Used in Cybersecurity

With the modern-day smart threats of the cyberworld, researchers are now focusing on ML-based protection methods as one of the alternatives to traditional cybersecurity systems. Various studies have identified a vast set of ML tools, techniques, datasets and algorithms to prevent modern-day cyberattacks. This section presents various freely available datasets and popular ML algorithms used in cybersecurity functions.

13.4.1 Study of ML-Driven Datasets Available for Cybersecurity

The effectiveness of ML is largely driven by the quality, completeness, relevance, and availability of the datasets. In general, a dataset is a collection of information records, each comprising many attributes or characteristics relevant to cybersecurity. Various studies have been conducted in the cybersecurity domain on the available datasets. There exist several datasets for different scenarios such as intrusion analysis, anomaly detection, fraud, malware analysis, spam analysis, detecting DDoS attacks, HTTP attacks, etc. In this section, we briefly discuss some of the most widely used, publicly available security datasets for research purposes.

13.4.1.1 KDD Cup 1999 Dataset (DARPA1998)

DARPA 1998 was one of the first intrusion detection datasets to be made publicly accessible. The goal of this project was to develop a reliable and robust network intrusion detection system capable of distinguishing between intrusions and normal connections. The dataset includes a specific set of information to be audited, which encompasses a broad range of intrusions simulated in a military network environment. It contains four main types of attacks: Denial of Service (DoS), User to Root (U2R), Remote to Local (R2L), and Probing. The dataset consists of approximately 5,209,458 instances for training and testing purposes; each instance has 41 attributes and is classified as either normal or an attack [24]. The following link can be used to download the complete dataset: http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html.
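
Individual records in this dataset carry a specific attack name rather than one of the four families, so a common preprocessing step is to map each label to its family. The sketch below shows one such mapping in Python. The grouping reflects the one commonly used for the training labels, but it should be treated as an assumption and checked against the dataset documentation; the function name and the sample labels are hypothetical.

```python
from collections import Counter

# Commonly used grouping of KDD Cup 1999 training labels into the four
# attack families. This is an assumption to verify against the dataset
# documentation, not an authoritative list.
ATTACK_FAMILY = {
    "back": "DoS", "land": "DoS", "neptune": "DoS",
    "pod": "DoS", "smurf": "DoS", "teardrop": "DoS",
    "buffer_overflow": "U2R", "loadmodule": "U2R",
    "perl": "U2R", "rootkit": "U2R",
    "ftp_write": "R2L", "guess_passwd": "R2L", "imap": "R2L",
    "multihop": "R2L", "phf": "R2L", "spy": "R2L",
    "warezclient": "R2L", "warezmaster": "R2L",
    "ipsweep": "Probe", "nmap": "Probe", "portsweep": "Probe", "satan": "Probe",
}


def attack_family(label):
    """Map a raw label such as 'smurf.' to DoS, U2R, R2L, Probe, or normal."""
    name = label.strip().rstrip(".").lower()
    if name == "normal":
        return "normal"
    return ATTACK_FAMILY.get(name, "unknown")


# Hypothetical labels; some distributions of the data end each label with a dot.
sample_labels = ["normal.", "smurf.", "neptune.", "nmap.",
                 "guess_passwd.", "rootkit.", "smurf."]
family_counts = Counter(attack_family(label) for label in sample_labels)
print(family_counts)
```

Labels that are not in the mapping are returned as "unknown" rather than silently assigned to a family, which makes attack types outside the assumed grouping easy to spot.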

13.4.1.2 NSL-KDD Dataset

The NSL-KDD dataset is a refined version of its predecessor, the KDD 99 dataset, which includes a large number of redundant records that make it very difficult to process the data accurately. The number of records in the NSL-KDD dataset is sufficient to conduct experiments. The following link can be used to download the complete dataset: http://205.174.165.80/CICDataset/NSL-KDD/.
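
To see why redundant records matter, the following sketch removes exact duplicates from a toy record list and compares the class balance before and after. It is only an illustration: the records are hypothetical tuples rather than real dataset rows, and the code is not claimed to be the procedure used to build NSL-KDD.

```python
from collections import Counter


def drop_redundant(records):
    """Keep the first occurrence of each record, preserving the original order."""
    seen = set()
    unique = []
    for record in records:
        if record not in seen:
            seen.add(record)
            unique.append(record)
    return unique


# Hypothetical (protocol, service, flag, label) tuples, not real dataset rows.
records = (
    [("icmp", "ecr_i", "SF", "attack")] * 8
    + [("tcp", "http", "SF", "normal")] * 2
    + [("tcp", "ftp", "S0", "attack")]
)

unique = drop_redundant(records)
before = Counter(record[-1] for record in records)
after = Counter(record[-1] for record in unique)
print(len(records), before)
print(len(unique), after)
```

In the toy data, a single repeated attack record accounts for 8 of the 11 rows. A learner trained on the raw list would be dominated by that one pattern, whereas the deduplicated list gives each distinct record equal weight.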

13.4.1.3 ECML-PKDD 2007 Discovery Challenge Dataset

The web traffic dataset, which was built for the ECML/PKDD 2007 Discovery Challenge, is one of the widely used HTTP-labeled datasets. The key objectives of the challenge were identification of the attacks based on the context and isolation of the attack patterns from the requests. This dataset contains around 50,000 instances for training and testing ML models [25]. The dataset is available in XML format and contains the following types of attack information: Cross-Site Scripting, LDAP Injection, SQL Injection, XPATH Injection, Path Traversal, Command Execution, and Server-Side Include (SSI) attacks. This dataset is available at http://www.lirmm.fr/pkdd2007-challenge/index.html#dataset.

13.4.1.4 Malicious URLs Detection Dataset

This dataset contains samples of malicious URLs from various major webmail providers. It reports 6,000-7,500 instances of spam and phishing URLs per day. The data is collected by analysing and classifying URLs based on their lexical and host-based features. One of the ways used to collect malicious URLs is extraction from email messages: an email classifier labels emails as spam, and human users then verify them as malicious. The following link can be used to download the complete dataset: http://www.sysnet.ucsd.edu/projects/url/.
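
Lexical features are properties that can be computed from the URL string alone, without contacting the host. The sketch below derives a few such features with Python's standard `urllib.parse` module. The feature names and the example URL are illustrative assumptions and are not the feature set of the dataset itself; host-based features (such as registration or location data) would require external lookups and are not shown.

```python
from urllib.parse import urlparse


def lexical_features(url):
    """Compute a few simple string-level features of a URL."""
    parsed = urlparse(url)
    host = parsed.hostname or ""
    segments = [part for part in parsed.path.split("/") if part]
    return {
        "url_length": len(url),
        "host_length": len(host),
        "num_dots_in_host": host.count("."),
        "num_digits": sum(ch.isdigit() for ch in url),
        "num_path_segments": len(segments),
        "has_at_symbol": "@" in url,
        # Crude check: the host consists only of digits and dots.
        "host_is_ip": host.replace(".", "").isdigit(),
    }


# Hypothetical URL using a documentation-reserved IP address.
features = lexical_features("http://192.0.2.10/login/update/account.php?id=42")
for name, value in features.items():
    print(name, value)
```

A dictionary of this kind can be turned into a numeric vector and passed to any of the classifiers discussed later in the chapter.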

13.4.1.5 ISOT (Information Security and Object Technology) Botnet Dataset

This dataset is a mixture of various existing malicious and non-malicious datasets which are available publicly. This security dataset contains approximately 1,675,424 records from the Honeynet project, which consist of Storm and Waledac botnet traffic. Waledac is one of the most widespread P2P botnets at present and is usually recognised as the successor of the Storm botnet with a more decentralised communication protocol. The researchers who built this dataset combined two different datasets, one from the Traffic Lab at Ericsson Research in Hungary and the other from the Lawrence Berkeley National Lab (LBNL), to represent non-malicious traffic. The following link can be used to download the complete dataset: https://www.uvic.ca/engineering/ece/isot/datasets/botnet-ransomware/index.php.

13.4.1.6 CTU-13 Dataset

The CTU-13 dataset was developed at the Czech Technical University (CTU), Czech Republic, and it combines real botnet traffic with normal and background traffic. The dataset includes 13 different instances of various botnet samples, and for each instance a specific malware was executed with multiple protocols and actions. The information for each instance was collected in a ‘.pcap’ file that contains details about the incoming and outgoing packets. However, due to privacy concerns, the complete pcap files containing all the background, normal, and botnet data are not publicly available. The following link can be used to download this dataset: https://mega.nz/folder/vdRmBA6D#yMZXx74nnu8GjhdwSF54Sw.

13.4.1.7 MAWILab Anomaly Detection Dataset

MAWILab is a dataset used by researchers to evaluate their traffic anomaly detection methods. This dataset categorises network traffic anomalies of the MAWI archive into the following sets or labels: notice, benign, suspicious, and anomalous. To effectively analyse and identify various anomaly-based attacks, the MAWILab dataset uses two distinct anomaly classification techniques: (1) simple heuristic approach based on port numbers, ICMP codes and TCP flags; (2) backbone traffic anomalies detection based on protocol headers and connection patterns. The following link can be used to download the dataset: http://www.fukudalab.org/mawilab/data.html.

13.4.1.8 ADFA-LD and ADFA-WD Datasets

In 2013, the Australian Defence Force Academy (ADFA) built two new datasets, ADFA-LD and ADFA-WD, for detecting intrusions in Linux and Windows machines, respectively. Both datasets can be used to identify several types of modern intrusive attacks. The ADFA datasets were developed specifically for host-based intrusion detection systems (HIDS), and many studies show that they are suitable for identifying zero-day malware attacks. These datasets were created to determine the accuracy that a HIDS can achieve with the help of various ML algorithms [26]. The complete dataset can be downloaded from the following link: https://www.unsw.adfa.edu.au/unsw-canberra-cyber/cybersecurity/ADFA-IDS-Datasets/. Table 13.1 presented below compares some important parameters of the datasets discussed above.

13.4.2 Applications of ML Algorithms in Cybersecurity Affairs

In this section, a few popular ML algorithms with their specific applications to various tasks of cybersecurity are discussed. In their work, Thomas et al. presented a detailed study of various ML algorithms and their applications to cybersecurity [27].

Table 13.1 Comparison of key characteristics of various datasets.

| Data set | Data type | Application area | Real-time traffic | Zero-day attack detection | Labelled data | Size | Year |
|---|---|---|---|---|---|---|---|
| KDD Cup 1999 | Mixed | Intrusion detection | No | No | Yes | 743 MB | 1999 |
| NSL-KDD | Mixed | Intrusion detection and prevention | No | Yes | Yes | 6.3 MB | 2001 |
| ECML-PKDD discovery challenge | Mixed | Classification of HTTP attacks | Yes | Yes | Yes | 148 MB | 2007 |
| Malicious URLs detection | Unmixed | Classification of spam and ham; also used to identify phishing attacks | Yes | Yes | Yes | 470 MB | 2009 |
| CTU-13 | Mixed | Identification of malware and botnet attacks | Yes | Yes | Yes | 1.8 GB | 2011 |
| MAWILab anomaly detection | Mixed | Traffic anomaly detection | Yes | Yes | Yes | 350 MB | 2010 |
| ADFA-LD and ADFA-WD | Mixed | Host intrusion detection system | Yes | Yes | Yes | 88 MB | 2013 |

13.4.2.1 Clustering

Clustering is an unsupervised technique that groups similar signatures or patterns into sets. It partitions a given dataset into several sets called clusters, such that the signatures belonging to the same cluster share common characteristics. In a good clustering, the distance between two points in the same cluster is generally shorter than the distance between two points in different clusters. In cybersecurity, this ML technique is applied to analyze applications and group them so that the resulting clusters can be identified as malicious or benign. It works on an input dataset that is a mixture of malware and non-malware (goodware). The most commonly applied algorithms include k-means, fuzzy c-means, and hierarchical clustering [27].
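The partitioning idea can be illustrated with a minimal k-means sketch, k-means being one of the algorithms named above. The feature vectors (permissions requested, outbound connections per minute) are hypothetical values chosen for illustration, and the helper names are assumptions rather than part of any particular library.

```python
import random

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def k_means(points, k, iterations=20, seed=0):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct data points
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return assignments, centroids

# Hypothetical samples: (permissions requested, outbound connections per minute).
samples = [(2, 1), (3, 2), (2, 2), (18, 40), (20, 38), (19, 42)]
labels, centers = k_means(samples, k=2)
```

Because clustering is unsupervised, the output only says which samples belong together; deciding which cluster corresponds to malware still requires inspection or a few labelled examples.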

13.4.2.2 Support Vector Machine (SVM)

SVM is a supervised ML technique that can be applied to both classification and regression problems (regression models predict continuous values). In cybersecurity, SVM is mostly used to solve classification problems, i.e., to differentiate malware from benign software. For malware detection, features are first extracted from the dataset, and an optimal hyperplane is then constructed that separates malicious code from normal code. SVM draws on optimization techniques, linear algebra, and kernel methods to attain these goals.
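A minimal sketch of the separating-hyperplane idea is given below. It trains a linear SVM by sub-gradient descent on the regularized hinge loss, which is one simple way to approximate the optimal hyperplane; kernels are omitted. The two features (file entropy and fraction of suspicious API calls, both scaled to 0-1), the sample values, and the hyperparameters are hypothetical.

```python
def decision_value(w, b, x):
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def train_linear_svm(features, labels, learning_rate=0.05, reg=0.01, epochs=1000):
    w = [0.0] * len(features[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(features, labels):
            if y * decision_value(w, b, x) < 1:
                # Point is inside the margin or misclassified: hinge loss is active.
                w = [wi - learning_rate * (reg * wi - y * xi) for wi, xi in zip(w, x)]
                b += learning_rate * y
            else:
                # Point is safely classified: only the regularizer shrinks w.
                w = [wi - learning_rate * reg * wi for wi in w]
    return w, b

def predict(w, b, x):
    return 1 if decision_value(w, b, x) >= 0 else -1

# Hypothetical samples: (file entropy, fraction of suspicious API calls).
# Label +1 stands for malicious, -1 for benign.
features = [(0.9, 0.8), (0.8, 0.9), (0.85, 0.7), (0.2, 0.1), (0.1, 0.3), (0.3, 0.2)]
labels = [1, 1, 1, -1, -1, -1]
w, b = train_linear_svm(features, labels)
verdict = predict(w, b, (0.95, 0.9))
```

The learned pair (w, b) defines the hyperplane w·x + b = 0; samples on the positive side are flagged as malicious.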

13.4.2.3 Nearest Neighbor (NN)

NN is a supervised ML technique: given a data point, it finds the nearest neighbors of that point with the help of a distance function and assigns the point to a class based on those neighbors. NN can be used for both regression and classification problems. In cybersecurity, NN is mainly applied to malware detection, intrusion detection, fingerprint recognition, etc. The technique is commonly implemented with two algorithms: k-nearest neighbors (k-NN) and radius-based nearest neighbors. NN is by nature a non-parametric, lazy learning technique, i.e., it makes no assumption about the data distribution and defers computation until a query is made.
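The k-NN variant can be sketched in a few lines: the labelled examples are simply stored, and a query is classified by a majority vote among its k closest examples under Euclidean distance. The connection records (session duration in seconds, number of failed logins) and their labels are hypothetical.

```python
import math
from collections import Counter

def knn_predict(training_data, query, k=3):
    """training_data is a list of (feature_vector, label) pairs."""
    # Sort the stored examples by distance to the query and keep the k nearest.
    neighbors = sorted(training_data, key=lambda item: math.dist(item[0], query))[:k]
    # Majority vote among the labels of the nearest neighbors.
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Hypothetical records: (session duration in seconds, failed logins).
training = [
    ((1.0, 0), "normal"), ((2.0, 1), "normal"), ((1.5, 0), "normal"),
    ((0.2, 8), "intrusion"), ((0.1, 10), "intrusion"), ((0.3, 9), "intrusion"),
]
first = knn_predict(training, (0.2, 7))
second = knn_predict(training, (1.2, 0))
```

There is no training phase here, which is what "lazy learning" refers to; all the work happens at query time, so large datasets usually call for an index structure or feature scaling beforehand.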

13.4.2.4 Decision Tree

This ML technique can also be applied to both regression and classification problems. It works by constructing tree structures that capture the relationships among data points. Such structures can then be used to make predictions about unseen data. Businesses employ the decision tree technique to predict cyberattacks and thus protect themselves. This technique can also be used to track intrusion signatures automatically and to categorize processes in a computer network as normal or intrusive.
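A small sketch of how such a tree may be grown is shown below. It repeatedly picks the feature threshold that minimizes the weighted Gini impurity of the two resulting groups, which is one common splitting criterion; the chapter does not prescribe a specific one. The records (failed logins, kilobytes sent) and their labels are hypothetical.

```python
from collections import Counter

def gini(labels):
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())

def best_split(rows, labels):
    best = None  # (weighted impurity, feature index, threshold)
    for feature in range(len(rows[0])):
        for threshold in sorted({row[feature] for row in rows}):
            left = [l for r, l in zip(rows, labels) if r[feature] <= threshold]
            right = [l for r, l in zip(rows, labels) if r[feature] > threshold]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if best is None or score < best[0]:
                best = (score, feature, threshold)
    return best

def build_tree(rows, labels, depth=0, max_depth=3):
    split = None if len(set(labels)) == 1 or depth == max_depth else best_split(rows, labels)
    if split is None:
        return Counter(labels).most_common(1)[0][0]  # leaf: majority label
    _, feature, threshold = split
    left = [i for i, r in enumerate(rows) if r[feature] <= threshold]
    right = [i for i, r in enumerate(rows) if r[feature] > threshold]
    return {
        "feature": feature,
        "threshold": threshold,
        "left": build_tree([rows[i] for i in left], [labels[i] for i in left], depth + 1, max_depth),
        "right": build_tree([rows[i] for i in right], [labels[i] for i in right], depth + 1, max_depth),
    }

def classify(tree, row):
    # Internal nodes are dicts; leaves are plain labels.
    while isinstance(tree, dict):
        branch = "left" if row[tree["feature"]] <= tree["threshold"] else "right"
        tree = tree[branch]
    return tree

# Hypothetical records: (failed logins, kilobytes sent).
rows = [(0, 120), (1, 300), (0, 80), (7, 90), (9, 150), (1, 5000), (0, 7000)]
labels = ["normal", "normal", "normal", "intrusive", "intrusive", "intrusive", "intrusive"]
tree = build_tree(rows, labels)
```

The resulting tree reads as a set of nested rules (for example, a threshold on failed logins followed by a threshold on traffic volume), which is why decision trees are often favoured when analysts need to inspect the reasoning behind an alert.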

13.4.2.5 Dimensionality Reduction

This ML technique is used to reduce the number of traits (features) that describe each data object, where each trait corresponds to one dimension. Through feature selection and feature extraction, the technique reduces the sparseness of data. In cybersecurity, dimensionality reduction is used in anomaly and intrusion detection, face recognition, multi-modal biometrics, etc. Diffusion maps, random projection, and principal component analysis are some of the mathematical approaches used to implement it.
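As an illustration of principal component analysis, the sketch below finds the direction of greatest variance with power iteration on the covariance matrix and projects each record onto it, reducing three features to one. The data values are hypothetical, and a numerical library would normally be used for anything beyond a toy example.

```python
import math

def first_principal_component(rows, iterations=100):
    n, d = len(rows), len(rows[0])
    # Center the data so that every feature has zero mean.
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)] for i in range(d)]
    # Power iteration: repeated multiplication converges to the top eigenvector,
    # provided the starting vector is not orthogonal to it.
    v = [1.0] * d
    for _ in range(iterations):
        v = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
    # One-dimensional representation of every record.
    projected = [sum(c[j] * v[j] for j in range(d)) for c in centered]
    return v, projected

# Hypothetical records with three features; the second is twice the first,
# so most of the variance lies along a single direction.
records = [(1, 2, 0.5), (2, 4, 0.4), (3, 6, 0.6), (4, 8, 0.5)]
component, reduced = first_principal_component(records)
```

Each record is now described by a single coordinate in `reduced` instead of three features, while the ordering of the records along the dominant direction is preserved.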

13.5 Applications of Machine Learning in the Realm of Cybersecurity

The use of ML in cybersecurity is a fast-growing trend across the globe. Many globally reputed companies are adopting contemporary technologies such as the Internet of Things (IoT), Big Data, and AI, and as these technologies develop, the demand for ML-based security solutions is also increasing. This section presents case studies of three globally recognized organizations and describes how they have been employing ML to better protect their services and customers’ data.

13.5.1 Facebook Monitors and Identifies Cybersecurity Threats with ML

Following the Cambridge Analytica data scandal and the 2016 US presidential election controversy, the social media giant Facebook committed to strengthening its security measures, including user privacy permissions, limits on developer freedoms, and the handling of fake profiles. As part of a multi-pronged strategy for improving security, Facebook is now emphasizing the adoption of ML techniques. Facebook has been using ML as a critical tool to protect users’ private data from unauthorised access; it enables security experts to act proactively before a security breach occurs. According to Facebook, it employed ML to remove approximately 2 billion fake accounts in 2019 before they could harm real users. To characterize each account, Facebook uses over 20,000 deep features, and the data collected are used to train a neural network, which is then fine-tuned with a small batch of high-precision hand-labelled data. ML enables Facebook to classify fake profiles into the following categories: (a) fraudulent accounts that do not represent the person; (b) hacked accounts of real (authorised) users that hackers have taken over; (c) spammers who repeatedly send revenue-generating texts; and (d) scammers who manipulate users into disclosing private data.

Facebook recently announced the release of Opacus, a high-speed library for training deep learning models with differential privacy on the PyTorch platform. The ML sector has been witnessing considerable demand for differential privacy, a mathematically rigorous framework commonly used in analytics to quantify the anonymization of sensitive data. Facebook also uses pattern-matching ML algorithms to detect unusual login activities and to alert the user by email. ML can analyse and classify massive volumes of data at considerable speed, which makes it appropriate for the task of separating normal user behaviour from abnormal activities. Apart from these techniques, Facebook has also been using ML for many other tasks, such as automatic face recognition for tagging, fake news detection, friend suggestions, language translation, and identifying abusive posts.
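Differentially private training of the kind Opacus supports is usually based on differentially private stochastic gradient descent, whose core step can be sketched in plain Python as follows. This is not the Opacus API: the function names, gradient values, and parameter defaults are hypothetical, and the sketch only shows per-example gradient clipping followed by Gaussian noise.

```python
import math
import random

def clip(gradient, max_norm):
    """Scale a per-example gradient so that its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in gradient))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in gradient]

def private_gradient(per_example_gradients, max_norm=1.0, noise_multiplier=1.1, rng=None):
    rng = rng or random.Random()
    # 1. Bound the influence of every individual example.
    clipped = [clip(g, max_norm) for g in per_example_gradients]
    # 2. Sum the clipped gradients coordinate by coordinate.
    dims = len(clipped[0])
    summed = [sum(g[j] for g in clipped) for j in range(dims)]
    # 3. Add Gaussian noise calibrated to the clipping bound.
    noisy = [s + rng.gauss(0.0, noise_multiplier * max_norm) for s in summed]
    # 4. Average over the batch; this value would be fed to the optimizer.
    return [value / len(clipped) for value in noisy]

# Hypothetical per-example gradients for a two-parameter model.
gradients = [[3.0, 4.0], [0.3, 0.4]]
update = private_gradient(gradients, rng=random.Random(0))
```

The resulting privacy guarantee depends on the noise multiplier, the batch sampling rate, and the number of training steps; tracking that budget is the job of a privacy accountant, which this sketch does not include.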

13.5.2 Microsoft Employs ML for Security

Microsoft Corporation operates in various product and service segments, such as Devices and Consumer (D&C) licensing, commercial licensing, and D&C hardware. The company develops, manufactures, licenses, and sells many types of software (application and system) for PCs and server systems. To counter modern cybersecurity threats, Microsoft has been encouraging the incorporation of ML techniques as part of its comprehensive security strategy. For preventive protection, Microsoft uses its ML-based cybersecurity platform, Windows Defender Advanced Threat Protection (WDATP), for breach detection, automated analysis, and response. WDATP is built into the Windows 10 operating system; it updates automatically and uses multiple levels of ML algorithms to detect threats. Microsoft handles a large volume of diverse data, and its security solutions are trained on 8 trillion daily threat signals from a wide variety of products, services, etc. See Figure 13.1 [28].

Figure 13.1 Modules of Microsoft’s security solutions.

In its latest research, Microsoft developed a cyberattack threat matrix known as the Adversarial Machine Learning Threat Matrix (AMLTM). The AMLTM aims to help security analysts locate attacks on ML systems so that they can take preventive measures against current and upcoming attacks [28].

13.5.3 Applications of ML by Google

Google specialises in services and products related to the Internet, including digital advertising technology, search engines, cloud computing, software, and hardware. According to Google, 50-70% of all emails in Gmail are spam, which ML can classify with 99% accuracy. Google also employs ML to identify threats against mobile devices running the Android operating system and to remove malicious applications from infected handsets. Google has incorporated ML in the G-Suite platform to protect it from phishing attacks and to filter spam. The anti-phishing ML technique works by delaying a suspicious message so that the URLs it contains can be analysed in the backend. As soon as the ML model observes new, unknown patterns in the processed data, it responds immediately, which would not have been possible with conventional or manual systems. Google applies ML to understand potentially harmful applications that enable remote operations on mobile phones. To ensure comprehensive security of customers’ data, Google Cloud uses ML to give customers in-depth control over, and visibility of, their data.

13.6 Conclusions

ML techniques are being widely used for various real-world applications, and cybersecurity practitioners have adopted them to protect the cyber-facets of their organizations. Over the past decade, extensive research has been carried out on the cybersecurity applications of ML technologies. The present chapter discusses how ML approaches resolve different issues that exist in traditional cybersecurity systems. The adoption of ML in cybersecurity helps in developing automated security systems that compensate for the shortage of security experts. ML-based automated cybersecurity systems enable organizations to invest less in human resources and to analyze data more accurately than human analysts. The importance of data preparation in ML modelling is presented, and the impact of correctly trained ML models in reducing false-positive rates and detecting modern cyberattacks is discussed. Various problems and challenges faced by researchers while integrating cybersecurity and ML algorithms have been investigated. The present chapter also describes some of the popular cybersecurity datasets and ML algorithms, along with a comparison of the datasets. It is concluded that the automation potential of ML should not be overestimated, since human intervention cannot be neglected in critical security scenarios where domain experts play a key role.

References

1. Community, The applications of Machine Learning in cyber-security. NASSCOM Insights. https://community.nasscom.in/communities/emerging-tech/the-applications-of-machine-learning-in-cyber-security.html, 2020.

2. Drinkwater D., 5 top machine learning use cases for security. CSO Online. https://www.csoonline.com/article/3240925/5-top-machine-learning-use-cases-for-security.html, 2017.

3. Dua S., Xian D., Data Mining and Machine Learning in Cybersecurity. Auerbach Publications, 2011. ISBN:978-1-4398-3942-3

4. Tesfahun A., Bhaskari D. L., Intrusion Detection Using Random Forests Classifier with SMOTE and Feature Reduction. 2013 International Conference on Cloud & Ubiquitous Computing & Emerging Technologies, pp. 127-132, 2013. DOI: 10.1109/CUBE.2013.31

5. Ford V., Siraj A., Applications of Machine Learning in Cyber Security. 27th International Conference on Computer Applications in Industry and Engineering, CAINE 2014, 2014.

6. Das R., Morris T. H., Machine Learning and Cyber Security. International Conference on Computer, Electrical & Communication Engineering (ICCECE), Kolkata, pp. 1-7, 2017.

7. Apruzzese G., Colajanni M., Ferretti L., Guido A., Marchetti M., On the Effectiveness of Machine and Deep Learning for Cyber Security, in Minárik T., Jakschis R., Lindström L. (Eds.) 10th International Conference on Cyber Conflict (CyCon), Tallinn, pp. 371-390, 2018.

8. Rege M., Mbah R. B. K., Machine Learning for Cyber Defense and Attack. DATA ANALYTICS 2018: The Seventh International Conference on Data Analytics, pp. 73-78, 2018. ISBN: 978-1-61208-681-1

9. Devakunchari R., Sourabh, Malik P., A Study of Cyber Security using Machine Learning Techniques. International Journal of Innovative Technology and Exploring Engineering, 8(7), 183-186, 2019.

10. Handa A., Sharma A., Shukla S. K., Machine learning in cybersecurity: A review. WIREs Data Mining and Knowledge Discovery, 9(4), 1-7, 2019.

11. Dasgupta D., Akhtar Z., Sen S., Machine learning in cybersecurity: a comprehensive survey. The Journal of Defense Modeling and Simulation: Applications, Methodology, Technology, 2020. DOI: 10.1177/1548512920951275

12. Iyer S. S., Rajagopal S., Applications of Machine Learning in Cyber Security Domain. Handbook of Research on Machine and Deep Learning Applications for Cyber Security, 2020. DOI: 10.4018/978-1-5225-9611-0.ch004

13. Lakshmanarao A., Shashi M., A survey on Machine Learning for cyber security. International Journal of Scientific & Technology Research, 9(1), 499-503, 2020.

14. Sagar R., Jhaveri R., Borrego C., Applications in Security and Evasions in Machine Learning: A Survey. Electronics, 9, 97, 1-42, 2020.

15. Sarker I. H., Kayes A. S. M., Badsha S., Alqahtani H., Watters P., Ng A., Cybersecurity data science: an overview from machine learning perspective. Journal of Big Data, 7, 41, 2020.

16. Sommer R., Paxson V., Outside the closed world: On using machine learning for network intrusion detection, IEEE Symposium on Security and Privacy, USA, pp. 305-316, 2010.

17. Hindy H., Atkinson R., Utilising Deep Learning Techniques for Effective Zero-Day Attack Detection. Electronics, 9, 1684, 2020.

18. Gupta A., Suveer A., Lindblad J., Dragomir A., Sintorn I., Sladoje N., Convolutional Neural Networks for False Positive Reduction of Automatically Detected Cilia in Low Magnification TEM Images. Lecture Notes in Computer Science, 10269, pp. 407-418, 2017.

19. Ibrahim A., Thiruvady D., Schneider J., Abdelrazek M., The Challenges of Leveraging Threat Intelligence to Stop Data Breaches. Frontiers in Computer Science, 2, 1-36, 2020.

20. Tabassi E., Burns K. J., Hadjimichael M., Molina-Markham A. D., Sexton J. T., A Taxonomy and Terminology of Adversarial Machine Learning. NISTIR 8269, 2019. DOI: https://doi.org/10.6028/NIST.IR.8269-draft

21. Nathaniel B., Paul M., Elie A., Intelligent Feature Engineering for Cybersecurity, IEEE BigData 2019, Los Angeles, CA, 2019. DOI: 10.1109/BigData47090.2019.9006122.

22. Wan K., Alagar V., Context-Aware Security Solutions for Cyber Physical Systems, Lecture Notes of the Institute for Computer Sciences, 109. DOI: 10.1007/978-3-642-36642-0_3

23. Lima A. Q., Keegan B., Challenges of using machine learning algorithms for cybersecurity: a study of threat-classification models applied to social media communication data, Cyber Influence and Cognitive Threats, Academic Press, pp. 33-52, 2020.

24. Tavallaee M., Bagheri E., Lu W., Ghorbani A., A detailed analysis of the KDD CUP 99 data set. 2009 IEEE Symposium on Computational Intelligence for Security and Defense Applications, Ottawa, pp. 1-6, 2009.

25. Gallagher B., Eliassi-Rad T., Classification of HTTP Attacks: A Study on the ECML/PKDD 2007 Discovery Challenge.

26. Khraisat A., Gondal I., Vamplew P., Kamruzzaman J., Survey of intrusion detection systems: techniques, datasets and challenges. Cybersecurity, 2, 2019. DOI: 10.1186/s42400-019-0038-7

27. Thomas T., Vijayaraghavan A. P., Emmanuel S., Machine Learning Approaches in Cyber Security Analytics, Springer Singapore, 2020. ISBN: 978-981-15-1706-8

28. Johnson A., Microsoft Security: How to cultivate a diverse cybersecurity team, Microsoft Security Blog, 2020. https://www.microsoft.com/security/blog/2020/08/31/microsoft-security-cultivate-diverse-cybersecurity-team/
