Structured versus unstructured data

The distinction between structured and unstructured data is usually the first question to ask about a new dataset. The answer can mean the difference between needing three days or three weeks to perform a proper analysis.

The basic breakdown is as follows (this restates the definitions of organized and unorganized data from the first chapter):

  • Structured (organized) data: This is data that can be thought of as observations and characteristics. It is usually organized using a table method (rows and columns).
  • Unstructured (unorganized) data: This data exists as a free entity and does not follow any standard organization hierarchy.

Here are a few examples that could help you differentiate between the two:

  • Most data that exists in text form, including server logs and Facebook posts, is unstructured
  • Scientific observations, as recorded by careful scientists, are kept in a very neat and organized (structured) format
  • A genetic sequence of chemical nucleotides (for example, ACGTATTGCA) is unstructured even though the order of the nucleotides matters, because we cannot form descriptors of the sequence using a row/column format without further processing

Structured data is generally considered much easier to work with and analyze. Most statistical and machine learning models were built with structured data in mind and cannot operate directly on unstructured data. The natural row and column structure is easy to digest for human and machine eyes. So, why even talk about unstructured data? Because it is so common! Most estimates place unstructured data at 80-90% of the world's data. This data exists in many forms and, for the most part, goes unnoticed by humans as a potential source of insight. Tweets, emails, literature, and server logs are all generally unstructured forms of data.

While a data scientist likely prefers structured data, they must be able to deal with the world's massive amounts of unstructured data. If 90% of the world's data is unstructured, that implies that about 90% of the world's information is trapped in a difficult format.

So, with most of our data existing in this free-form format, we must turn to pre-analysis techniques, called pre-processing, in order to apply structure to at least a part of the data for further analysis. The next chapter will deal with pre-processing in great detail; for now, we will consider the part of pre-processing wherein we attempt to apply transformations to convert unstructured data into a structured counterpart.

Example of data pre-processing

When looking at text data (which is almost always considered unstructured), we have many options to transform the set into a structured format. We may do this by applying new characteristics that describe the data. A few such characteristics are as follows:

  • Word/phrase count
  • The existence of certain special characters
  • The relative length of text
  • Picking out topics

I will use the following tweet as a quick example of unstructured data, but you may use any unstructured free-form text that you like, including tweets and Facebook posts:

"This Wednesday morn, are you early to rise? Then look East. The Crescent Moon joins Venus & Saturn. Afloat in the dawn skies."

It is important to reiterate that pre-processing is necessary for this tweet because the vast majority of learning algorithms require numerical data (which we will get into after this example).

Beyond satisfying our algorithms' data-type requirements, pre-processing lets us create new features from existing ones. For example, we can extract features such as word count and special characters from the preceding tweet. Now, let's take a look at a few features that we can extract from the text.

Word/phrase counts

We may break down a tweet into its word/phrase count. The word this appears in the tweet once, as does nearly every other word (the appears twice). We can represent this tweet in a structured format as follows, thereby converting the unstructured set of words into a row/column format:

|            | this | wednesday | morn | are | you |
|------------|------|-----------|------|-----|-----|
| Word count | 1    | 1         | 1    | 1   | 1   |

Note that to obtain this format, we can utilize scikit-learn's CountVectorizer, which we saw in the previous chapter.

Presence of certain special characters

We may also look at the presence of special characters, such as the question mark and the exclamation mark. The appearance of these characters can imply things about the data that are otherwise difficult to know. For example, the fact that this tweet contains a question mark strongly implies that it poses a question to the reader. We might append a new column to the preceding table, as shown:

|            | this | wednesday | morn | are | you | ? |
|------------|------|-----------|------|-----|-----|---|
| Word count | 1    | 1         | 1    | 1   | 1   | 1 |
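A minimal check for this feature in plain Python (the 0/1 encoding is my own choice here):

```python
tweet = ("This Wednesday morn, are you early to rise? Then look East. "
         "The Crescent Moon joins Venus & Saturn. Afloat in the dawn skies.")

# 1 if the tweet contains a question mark, 0 otherwise
has_question_mark = int("?" in tweet)
print(has_question_mark)  # 1
```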

The relative length of text

This tweet is 125 characters long:

len("This Wednesday morn, are you early to rise? Then look East. The Crescent Moon joins Venus & Saturn. Afloat in the dawn skies.") 
# get the length of this text (number of characters for a string) 
 
# 125 

The average tweet, as discovered by analysts, is about 30 characters in length. So, we might impose a new characteristic, called relative length (the length of the tweet divided by the average length), telling us how long this tweet is compared with the average tweet. This tweet is actually about 4.17 times the length of the average tweet, as shown:

125 / 30 ≈ 4.17
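The same arithmetic as a quick Python check (the 30-character average is the assumption from above):

```python
tweet_length = 125   # characters in our tweet
average_length = 30  # assumed average tweet length
relative_length = tweet_length / average_length
print(round(relative_length, 2))  # 4.17
```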

We can add yet another column to our table using this method:

|            | this | wednesday | morn | are | you | ? | Relative length |
|------------|------|-----------|------|-----|-----|---|-----------------|
| Word count | 1    | 1         | 1    | 1   | 1   | 1 | 4.17            |

Picking out topics

We can pick out some topics of the tweet to add as columns. This tweet is about astronomy, so we can add another column, as illustrated:

|            | this | wednesday | morn | are | you | ? | Relative length | Topic     |
|------------|------|-----------|------|-----|-----|---|-----------------|-----------|
| Word count | 1    | 1         | 1    | 1   | 1   | 1 | 4.17            | astronomy |

And just like that, we can convert a piece of text into structured/organized data ready for use in our models and exploratory analysis.
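Putting it all together, the finished row can be sketched as a plain Python dictionary (the column names mirror the tables above; the 30-character average length and the hand-assigned topic are assumptions):

```python
tweet = ("This Wednesday morn, are you early to rise? Then look East. "
         "The Crescent Moon joins Venus & Saturn. Afloat in the dawn skies.")

row = {
    "this": 1, "wednesday": 1, "morn": 1, "are": 1, "you": 1,  # word counts
    "?": int("?" in tweet),                        # special-character flag
    "relative_length": round(len(tweet) / 30, 2),  # 30 = assumed average length
    "topic": "astronomy",                          # assigned by hand for now
}
print(row["relative_length"])  # 4.17
```

A row built this way for each tweet in a collection yields exactly the kind of row/column dataset our models expect.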

The topic is the only extracted feature we looked at that is not automatically derivable from the tweet. Looking at word count and tweet length in Python is easy. However, more advanced models (called topic models) are able to derive and predict topics of natural text as well.
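As a rough sketch of what a topic model does, here is Latent Dirichlet Allocation from scikit-learn run on a toy corpus of my own invention (not a model from this book):

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# A tiny toy corpus with two rough themes: sky-watching and cooking
docs = [
    "the crescent moon joins venus and saturn in the dawn skies",
    "look east to see the moon and venus before sunrise",
    "bake the bread and stir the soup before dinner",
    "stir flour into the soup and bake until golden",
]

# Count words, then fit a two-topic model on the counts
counts = CountVectorizer(stop_words="english").fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)

# Each document gets a probability distribution over the two topics
doc_topics = lda.transform(counts)
print(doc_topics.shape)  # (4, 2)
```

Each row of doc_topics is a probability distribution over the two topics, so documents about the same theme end up with similar distributions.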

Being able to quickly recognize whether your data is structured or unstructured can save hours or even days of work in the future. Once you are able to discern the organization of the data presented to you, the next question is aimed at the individual characteristics of the dataset.
