Understanding the basics of data and machine learning

When we talk about data, we are generally dealing with tabular data, that is, data organized into rows and columns. Think of it as data that could be opened in a spreadsheet application such as Microsoft Excel. Each row, otherwise known as an observation, represents a single instance or example of a problem. If our data belongs to the domain of day trading in the stock market, an observation might represent an hour’s worth of changes in the overall market and price.

Similarly, in the domain of network security, an observation could represent a possible attack or a packet of data sent over a wireless system.

The following shows sample tabular data in the domain of cybersecurity and, more specifically, network intrusion:

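| DateTime       | Protocol | Urgent | Malicious |
|----------------|----------|--------|-----------|
| June 2nd, 2018 | TCP      | FALSE  | TRUE      |
| June 2nd, 2018 | HTTP     | TRUE   | TRUE      |
| June 2nd, 2018 | HTTP     | TRUE   | FALSE     |
| June 3rd, 2018 | HTTP     | FALSE  | TRUE      |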

We see that each row, or observation, represents a network connection, and that each observation has four attributes: DateTime, Protocol, Urgent, and Malicious. We will not dive into these specific attributes here; for now, simply notice the tabular structure of the data.
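To make the row/column vocabulary concrete, here is a minimal sketch that holds the same four observations as a list of Python dictionaries. The variable names and the `describe` helper are illustrative assumptions, not something the text prescribes, and real projects would often use a dedicated table library instead.

```python
# The four observations from the table above, one dictionary per row.
# Names such as `observations` and `describe` are hypothetical.
observations = [
    {"DateTime": "June 2nd, 2018", "Protocol": "TCP", "Urgent": False, "Malicious": True},
    {"DateTime": "June 2nd, 2018", "Protocol": "HTTP", "Urgent": True, "Malicious": True},
    {"DateTime": "June 2nd, 2018", "Protocol": "HTTP", "Urgent": True, "Malicious": False},
    {"DateTime": "June 3rd, 2018", "Protocol": "HTTP", "Urgent": False, "Malicious": True},
]


def describe(rows):
    """Return the number of observations and the list of attribute names."""
    attributes = list(rows[0].keys()) if rows else []
    return len(rows), attributes


n_rows, attributes = describe(observations)
print(n_rows, attributes)

# A column is one attribute read across every observation.
protocols = [row["Protocol"] for row in observations]
print(protocols)
```

Each dictionary plays the role of a row, and each key plays the role of a column header.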

Because we will, for the most part, consider our data to be tabular, we can also look at specific instances where the matrix of data has only one column/attribute. For example, suppose we are building a piece of software that takes in a single image of a room and outputs whether or not there is a human in that room. The input data might be represented as a matrix with a single column, where that column is simply a URL to a photo of a room and nothing else.

For example, consider the following table, which has only a single column titled Photo URL. The values in the table are URLs of photos that are relevant to the data scientist (these are fake, lead nowhere, and are purely for illustration):

| Photo URL                      |
|--------------------------------|
| http://photo-storage.io/room/1 |
| http://photo-storage.io/room/2 |
| http://photo-storage.io/room/3 |
| http://photo-storage.io/room/4 |

The data that is input into the system might be only a single column, as in this case. For a system that analyzes images, the input might simply be a URL to the image in question. It would be up to us as data scientists to engineer features from the URL.
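As a rough illustration of turning a single-column input into several columns, the sketch below parses each example URL with Python's standard `urllib.parse` module. This is only a stand-in under stated assumptions: the URLs are fake, so no image can be downloaded, and in a real system the useful features would most likely be derived from the image contents rather than from the URL string. The `url_features` function and its output keys are hypothetical.

```python
from urllib.parse import urlparse

# Single-column table: every observation has only a "Photo URL" attribute.
photos = [
    {"Photo URL": "http://photo-storage.io/room/1"},
    {"Photo URL": "http://photo-storage.io/room/2"},
    {"Photo URL": "http://photo-storage.io/room/3"},
    {"Photo URL": "http://photo-storage.io/room/4"},
]


def url_features(url):
    """Derive a few simple, purely illustrative attributes from a URL string."""
    parsed = urlparse(url)
    path_parts = [part for part in parsed.path.split("/") if part]
    return {
        "scheme": parsed.scheme,
        "host": parsed.netloc,
        "category": path_parts[0] if path_parts else None,
        "item_id": path_parts[-1] if path_parts else None,
    }


# One input column becomes four engineered columns.
engineered = [url_features(row["Photo URL"]) for row in photos]
print(engineered[0])
```

The point is the change in shape: a narrow, one-attribute input is expanded into a wider table that a learning algorithm could consume.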

As data scientists, we must be ready to ingest and handle data that might be large or small, wide or narrow (in terms of attributes), or sparse (there might be missing values), and be ready to use this data for the purposes of machine learning.

Now’s a good time to talk more about that. Machine learning algorithms belong to a class of algorithms that are defined by their ability to extract and exploit patterns in data to accomplish a task based on historical training data. Vague, right? Machine learning can handle many types of tasks, so we will leave the definition as is and dive a bit deeper.
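Returning briefly to the data-shape concerns mentioned above (large or small, wide or narrow, missing values), a quick summary of a dataset is often a useful first step. The following is a minimal sketch, assuming rows are stored as dictionaries and that a missing value is either `None` or an absent key; the `summarize` helper and the sample rows are made up for illustration.

```python
def summarize(rows):
    """Report row count, column count, and missing values per column."""
    # Collect column names in first-seen order across all rows.
    columns = []
    for row in rows:
        for name in row:
            if name not in columns:
                columns.append(name)

    # A value counts as missing if the key is absent or the value is None.
    missing = {
        name: sum(1 for row in rows if row.get(name) is None)
        for name in columns
    }
    return {"n_rows": len(rows), "n_columns": len(columns), "missing": missing}


# Hypothetical sample with gaps; not taken from any real dataset.
rows = [
    {"Protocol": "TCP", "Urgent": False},
    {"Protocol": None, "Urgent": True},
    {"Protocol": "HTTP", "Urgent": None},
    {"Protocol": "HTTP"},  # the "Urgent" key is absent entirely
]

summary = summarize(rows)
print(summary)
```

The row and column counts indicate how large and how wide the data is, and the per-column missing counts indicate how sparse it is.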

We generally separate machine learning into two main types: supervised and unsupervised learning. Each type of machine learning algorithm can benefit from feature engineering, and therefore it is important that we understand each type.
