The distinction between structured and unstructured data is usually the first question you want to ask yourself about the entire dataset. The answer to this question can mean the difference between needing three days or three weeks of time to perform a proper analysis.
The basic breakdown is as follows (this restates the definitions of organized and unorganized data from the first chapter):

- Structured (organized) data: data that can be broken down into observations and characteristics, typically arranged in a table of rows and columns.
- Unstructured (unorganized) data: data that exists as a free-form entity and follows no standard organization, such as raw text or a server log.

Here are a few examples that could help you differentiate between the two: a spreadsheet of customer records is structured, while the raw text of an email is unstructured.
Structured data is generally considered much easier to work with and analyze. Most statistical and machine learning models were built with structured data in mind and cannot operate on free-form unstructured data. The natural row-and-column structure is easy to digest for human and machine eyes. So, why even talk about unstructured data? Because it is so common! Most estimates place unstructured data at 80-90% of the world's data. This data exists in many forms and, for the most part, goes unnoticed by humans as a potential source of insight. Tweets, emails, literature, and server logs are generally unstructured forms of data.
While a data scientist likely prefers structured data, they must be able to deal with the world's massive amounts of unstructured data. If 90% of the world's data is unstructured, that implies that about 90% of the world's information is trapped in a difficult format.
So, with most of our data existing in this free-form format, we must turn to pre-analysis techniques, called pre-processing, in order to apply structure to at least a part of the data for further analysis. The next chapter will deal with pre-processing in great detail; for now, we will consider the part of pre-processing wherein we attempt to apply transformations to convert unstructured data into a structured counterpart.
When looking at text data (which is almost always considered unstructured), we have many options for transforming the set into a structured format. We may do this by applying new characteristics that describe the data. A few such characteristics are as follows:

- Word/phrase count
- The presence of special characters
- The relative length of the text
- Picking out topics of the text
I will use the following tweet as a quick example of unstructured data, but you may use any unstructured free-form text that you like, such as other tweets or Facebook posts:
"This Wednesday morn, are you early to rise? Then look East. The Crescent Moon joins Venus & Saturn. Afloat in the dawn skies."
It is important to reiterate that pre-processing is necessary for this tweet because a vast majority of learning algorithms require numerical data (which we will get into after this example).
More than satisfying a data-type requirement, pre-processing allows us to engineer new features from the existing data. For example, we can extract features such as word count and special characters from the preceding tweet. Now, let's take a look at a few features that we can extract from the text.
We may break down a tweet into its word/phrase count. The word this appears in the tweet once, as does every other word. We can represent this tweet in a structured format as follows, thereby converting the unstructured set of words into a row/column format:
|            | this | wednesday | morn | are | you |
|------------|------|-----------|------|-----|-----|
| Word count | 1    | 1         | 1    | 1   | 1   |
Note that to obtain this format, we can utilize scikit-learn's CountVectorizer, which we saw in the previous chapter.
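As a quick sketch of the same transformation in plain Python (scikit-learn's CountVectorizer automates this and returns a full document-term matrix), the tokenization below is an assumption that mimics CountVectorizer's default behavior of lowercasing and keeping word tokens of two or more characters:

```python
import re
from collections import Counter

tweet = ("This Wednesday morn, are you early to rise? Then look East. "
         "The Crescent Moon joins Venus & Saturn. Afloat in the dawn skies.")

# Lowercase, then keep word tokens of two or more characters
# (mimicking CountVectorizer's default token pattern)
tokens = re.findall(r"\b\w\w+\b", tweet.lower())
word_counts = Counter(tokens)

print(word_counts["this"])  # 1
print(word_counts["the"])   # 2 ("The Crescent Moon", "the dawn skies")
```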
We may also look at the presence of special characters, such as the question mark and exclamation mark. The appearance of these characters might imply certain ideas about the data that are otherwise difficult to know. For example, the fact that this tweet contains a question mark might strongly imply that this tweet contains a question for the reader. We might append the preceding table with a new column, as shown:
|            | this | wednesday | morn | are | you | ? |
|------------|------|-----------|------|-----|-----|---|
| Word count | 1    | 1         | 1    | 1   | 1   | 1 |
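The special-character feature itself is a one-line check in Python:

```python
tweet = ("This Wednesday morn, are you early to rise? Then look East. "
         "The Crescent Moon joins Venus & Saturn. Afloat in the dawn skies.")

# 1 if the tweet contains a question mark (hinting it asks the reader a question), 0 otherwise
has_question_mark = int("?" in tweet)
print(has_question_mark)  # 1
```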
This tweet is 125 characters long:
len("This Wednesday morn, are you early to rise? Then look East. The Crescent Moon joins Venus & Saturn. Afloat in the dawn skies.")
# get the length of this text (number of characters for a string)
# 125
The average tweet, as discovered by analysts, is about 30 characters in length. So, we might impose a new characteristic, called relative length (the length of the tweet divided by the average length), telling us the length of this tweet compared with the average tweet. This tweet is actually about 4.17 times longer than the average tweet, as shown:

125 / 30 ≈ 4.17
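The same relative-length computation in Python (the 30-character figure is the cited average, not something the code derives):

```python
tweet = ("This Wednesday morn, are you early to rise? Then look East. "
         "The Crescent Moon joins Venus & Saturn. Afloat in the dawn skies.")

avg_tweet_length = 30  # approximate average tweet length cited above
relative_length = len(tweet) / avg_tweet_length
print(round(relative_length, 2))  # 4.17
```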
We can add yet another column to our table using this method:
|            | this | wednesday | morn | are | you | ? | Relative length |
|------------|------|-----------|------|-----|-----|---|-----------------|
| Word count | 1    | 1         | 1    | 1   | 1   | 1 | 4.17            |
We can pick out some topics of the tweet to add as columns. This tweet is about astronomy, so we can add another column, as illustrated:
|            | this | wednesday | morn | are | you | ? | Relative length | Topic     |
|------------|------|-----------|------|-----|-----|---|-----------------|-----------|
| Word count | 1    | 1         | 1    | 1   | 1   | 1 | 4.17            | astronomy |
And just like that, we can convert a piece of text into structured/organized data ready for use in our models and exploratory analysis.
The topic is the only extracted feature we looked at that is not automatically derivable from the tweet. Computing word count and tweet length in Python is easy; however, more advanced models (called topic models) are able to derive and predict topics of natural text as well.
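Putting all four features together, one structured row for this tweet might be assembled as follows (a sketch: the topic value is supplied by hand, since deriving it automatically would require a topic model):

```python
tweet = ("This Wednesday morn, are you early to rise? Then look East. "
         "The Crescent Moon joins Venus & Saturn. Afloat in the dawn skies.")

# Simple tokenization: split on whitespace and strip surrounding punctuation
tokens = [t.strip(",.?&").lower() for t in tweet.split()]

row = {word: tokens.count(word) for word in ["this", "wednesday", "morn", "are", "you"]}
row["?"] = int("?" in tweet)                     # special-character flag
row["relative length"] = round(len(tweet) / 30, 2)
row["topic"] = "astronomy"                       # supplied by hand, not derived
print(row)
```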
Being able to quickly recognize whether your data is structured or unstructured can save hours or even days of work in the future. Once you are able to discern the organization of the data presented to you, the next question is aimed at the individual characteristics of the dataset.