The four levels of data

It is generally understood that a specific characteristic (feature/column) of structured data can be broken down into one of four levels of data. The levels are as follows:

  • The nominal level
  • The ordinal level
  • The interval level
  • The ratio level

As we move down the list, we gain more structure and, therefore, more returns from our analysis. Each level comes with its own accepted practice in measuring the center of the data. We usually think of the mean/average as being an acceptable form of a center. However, this is only true for a specific type of data.

The nominal level

The first level of data, the nominal level, consists of data that is described purely by name or category. Basic examples include gender, nationality, species, or yeast strain in a beer. They are not described by numbers and are therefore qualitative. The following are some examples:

  • A type of animal is on the nominal level of data. We may also say that if you are a chimpanzee, then you belong to the mammalian class as well.
  • A part of speech is also considered to be on the nominal level of data. The word she is a pronoun, and, in many classification schemes, pronouns are themselves considered a type of noun.

Of course, being qualitative, we cannot perform any quantitative mathematical operations, such as addition or division. These would not make any sense.

Mathematical operations allowed

We cannot perform mathematics on the nominal level of data except the basic equality and set membership functions, as shown in the following two examples:

  • Being a tech entrepreneur implies being in the tech industry, but not the other way around
  • A figure described as a square falls under the description of being a rectangle, but not the other way around
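
As a minimal sketch (the job titles below are made up for illustration), equality and set membership are essentially the only meaningful checks we can run on nominal data in Python:

tech_industry = {"tech entrepreneur", "software engineer", "data scientist"}

job = "tech entrepreneur"

print(job == "software engineer")   # False -- an equality check
print(job in tech_industry)         # True  -- a set membership check

# Arithmetic such as job / 2 has no meaning for a category and would raise an error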

Measures of center

A measure of center is a number that describes the value that the data tends toward. It is sometimes referred to as the balance point of the data. Common examples include the mean, median, and mode.

In order to find the center of nominal data, we generally turn to the mode (the most common element) of the dataset. For example, look back at the WHO alcohol consumption data. The most common continent surveyed was Africa, making that a possible choice for the center of the continent column.
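
As a quick sketch (using a small, made-up slice of the continent column rather than the full WHO table), the mode is easy to find in Python:

from collections import Counter

continents = ["Africa", "Europe", "Africa", "Asia", "Africa", "Europe", "Asia"]

mode = Counter(continents).most_common(1)[0][0]
print(mode)   # == 'Africa', the most frequently occurring category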

Measures of center such as the mean and median do not make sense at this level, as we cannot order the observations, let alone add them together.

What data is like at the nominal level

Data at the nominal level is mostly categorical in nature. Because we generally can only use words to describe the data, it can be lost in translation between countries, or can even be misspelled.

While data at this level can certainly be useful, we must be careful about what insights we may draw from them. With only the mode as a basic measure of center, we are unable to draw conclusions about an average observation. This concept does not exist at this level. It is only at the next level that we may begin to perform true mathematics on our observations.

The ordinal level

The nominal level did not provide us with much flexibility in terms of mathematical operations due to one seemingly unimportant fact: we could not order the observations in any natural way. Data at the ordinal level provides us with a rank order, or the means to place one observation before another. However, it does not provide us with relative differences between observations, meaning that while we may order the observations from first to last, we cannot add or subtract them to get any real meaning.

Examples

The Likert scale is among the most common ordinal-level scales. Whenever you are given a survey asking you to rate your satisfaction on a scale from 1 to 10, you are providing data at the ordinal level. Your answer, which must fall between 1 and 10, can be ordered: an eight is better than a seven, while a three is worse than a nine.

However, differences between the numbers do not make much sense. The difference between a seven and a six might be different from the difference between a two and a one.

Mathematical operations allowed

We are allowed much more freedom with mathematical operations at this level. We inherit all of the mathematics allowed at the nominal level (equality and set membership), and we can also add the following operations to the list:

  • Ordering
  • Comparison

Ordering refers to the natural order provided to us by the data. However, this can be tricky to figure out sometimes. When speaking about the spectrum of visible light, we can refer to the names of colors—Red, Orange, Yellow, Green, Blue, Indigo, and Violet. Naturally, as we move from left to right, the light is gaining energy and other properties. We may refer to this as a natural order:

[Figure: The natural order of color]

However, if needed, an artist may impose another order on the data, such as sorting the colors based on the cost of the material to make said color. This could change the order of the data, but as long as we are consistent in what defines the order, it does not matter what defines it.
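
Here is a small sketch of both orderings in Python (the material costs are invented purely for illustration):

colors = ["Green", "Red", "Violet", "Blue", "Orange", "Yellow", "Indigo"]

# The natural order from the visible-light spectrum
energy_rank = {"Red": 1, "Orange": 2, "Yellow": 3, "Green": 4,
               "Blue": 5, "Indigo": 6, "Violet": 7}
print(sorted(colors, key=energy_rank.get))
# ['Red', 'Orange', 'Yellow', 'Green', 'Blue', 'Indigo', 'Violet']

# An artist's order, based on hypothetical material costs per tube of paint
cost_per_tube = {"Red": 3.0, "Orange": 2.5, "Yellow": 1.0, "Green": 4.0,
                 "Blue": 8.0, "Indigo": 9.5, "Violet": 6.0}
print(sorted(colors, key=cost_per_tube.get))
# ['Yellow', 'Orange', 'Red', 'Green', 'Violet', 'Blue', 'Indigo']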

Comparisons are another new operation allowed at this level. At the nominal level, it would not make sense to say that one country was naturally better than another or that one part of speech was worse than another. At the ordinal level, however, we can make these comparisons. For example, we can talk about how putting a "7" on a survey is worse than putting a "10."

Measures of center

At the ordinal level, the median is usually an appropriate way of defining the center of the data. The mean, however, would not be appropriate because addition, which the mean requires, is not meaningful at this level. We can also use the mode, as we could at the nominal level.

We will now look at an example of using the median.

Imagine you have conducted a survey among your employees asking "how happy are you to be working here on a scale from 1-5?," and your results are as follows:

5, 4, 3, 4, 5, 3, 2, 5, 3, 2, 1, 4, 5, 3, 4, 4, 5, 4, 2, 1, 4, 5, 4, 3, 2, 4, 4, 5, 4, 3, 2, 1 

Let's use Python to find the median of this data. It is worth noting that most people would argue that the mean of these scores would work just fine. The reason that the mean would not be as mathematically viable is that if we subtract/add two scores, say a score of four minus a score of two, the difference of two does not really mean anything. If addition/subtraction among the scores doesn't make sense, the mean won't make sense either:

import numpy 
 
results = [5, 4, 3, 4, 5, 3, 2, 5, 3, 2, 1, 4, 5, 3, 4, 4, 5, 4, 2, 1, 4, 5, 4, 3, 2, 4, 4, 5, 4, 3, 2, 1] 
 
 
sorted_results = sorted(results) 
 
 
print(sorted_results) 
''' 
[1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5] 
''' 
 
print(numpy.mean(results))    # == 3.4375
print(numpy.median(results))  # == 4.0

Note

The ''' (triple quote) denotes a multiline string, used here as a comment that spans more than one line. It acts in a way similar to #.

It turns out that the median is not only more sound but makes the survey results look much better.

Quick recap and check

So far, we have seen half of the levels of data:

  • The nominal level
  • The ordinal level

At the nominal level, we deal with data usually described using vocabulary (but sometimes with numbers), with no order and little use of mathematics.

At the ordinal level, we have data that can be described with numbers and we also have a "natural" order, allowing us to put one in front of the other.

Let's try to classify each of the following examples as either ordinal or nominal (answers are at the end of the chapter):

  • The origin of the beans in your cup of coffee
  • The place someone receives after completing a foot race
  • The metal used to make the medal that they receive after placing in said race
  • The telephone number of a client
  • How many cups of coffee you drink in a day

The interval level

Now, we are getting somewhere interesting. At the interval level, we are beginning to look at data that can be expressed through very quantifiable means, and where much more complicated mathematical formulas are allowed. The basic difference between the ordinal level and the interval level is, well, just that difference.

Data at the interval level allows meaningful subtraction between data points.

Example

Temperature is a great example of data at the interval level. If it is 100 degrees Fahrenheit in Texas and 80 degrees Fahrenheit in Istanbul, Turkey, then Texas is 20 degrees warmer than Istanbul. This simple example allows for so much more manipulation at this level than previous examples.

(Non) Example

It might seem as though the earlier ordinal example (the one-to-five survey) fits the bill for the interval level. However, remember that the difference between the scores (when you subtract them) does not carry real meaning; therefore, this data cannot be considered to be at the interval level.

Mathematical operations allowed

We can use all the operations allowed on the lower levels (ordering, comparisons, and so on), along with two other notable operations:

  • Addition
  • Subtraction

The allowance of these two operations allows us to talk about data at this level in a whole new way.

Measures of center

At this level, we can still use the median and mode to describe the data. However, usually the most accurate description of the center of the data is the arithmetic mean, more commonly referred to as simply the mean. Recall that the definition of the mean requires us to add together all of the measurements. At the previous levels, addition was meaningless, so the mean would have lost its value. It is only at the interval level and above that the arithmetic mean makes sense.

We will now look at an example of using the mean.

Suppose we look at the temperature of a fridge containing a pharmaceutical company's new vaccine. We measure the temperature every hour with the following data points (in Fahrenheit):

31, 32, 32, 31, 28, 29, 31, 38, 32, 31, 30, 29, 30, 31, 26 

Using Python again, let's find the mean and median of the data:

import numpy 
temps = [31, 32, 32, 31, 28, 29, 31, 38, 32, 31, 30, 29, 30, 31, 26]  
print(numpy.mean(temps))    # == 30.73  
print(numpy.median(temps))  # == 31.0

Note how the mean and median are quite close to each other and both are around 31 degrees. The question, "on average, how cold is the fridge?", has an answer of about 31. However, the vaccine comes with a warning:

"Do not keep this vaccine at a temperature under 29 degrees."

Note that the temperature dropped below 29 degrees at least twice, yet looking only at the center of the data might lead you to assume that this was not enough to be detrimental.

This is where the measure of variation can help us understand how bad the fridge situation can be.

Measures of variation

This is something new that we have not yet discussed. It is one thing to talk about the center of the data but, in data science, it is also very important to mention how "spread out" the data is. The measures that describe this phenomenon are called measures of variation. You have likely heard of "standard deviation" from your statistics classes. This idea is extremely important and I would like to address it briefly.

A measure of variation (such as the standard deviation) is a number that attempts to describe how spread out the data is.

Along with a measure of center, a measure of variation can almost entirely describe a dataset with only two numbers.

Standard deviation

Arguably, the standard deviation is the most common measure of variation for data at the interval level and beyond. The standard deviation can be thought of as the "average distance a data point is from the mean." While this description is technically (and mathematically) incorrect, it is a good way to think about it. The formula for standard deviation can be broken down into the following steps:

  1. Find the mean of the data
  2. For each number in the dataset, subtract it from the mean and then square it
  3. Find the average of the squared differences
  4. Take the square root of the number obtained in Step 3 and this is the standard deviation
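
Written as a single formula (the population form, dividing by $n$, which is exactly what these steps describe), the standard deviation of data points $x_1, \dots, x_n$ with mean $\bar{x}$ is:

$$\sigma = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2}$$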

Notice how one of these steps actually requires us to take an arithmetic mean.

For example, look back at the temperature dataset. Let's find the standard deviation of the dataset using Python:

import numpy

temps = [31, 32, 32, 31, 28, 29, 31, 38, 32, 31, 30, 29, 30, 31, 26]

mean = numpy.mean(temps)   # == 30.73

squared_differences = []   # an empty list of squared differences

for temperature in temps:
    difference = temperature - mean                 # how far the point is from the mean
    squared_difference = difference**2              # square the difference
    squared_differences.append(squared_difference)  # add it to our list

average_squared_difference = numpy.mean(squared_differences)
# This number is also called the "variance"

standard_deviation = numpy.sqrt(average_squared_difference)
# We did it!

print(standard_deviation)  # == 2.5157

All of this code led us to find out that the standard deviation of the dataset is around 2.5, meaning that, "on average," a data point is about 2.5 degrees away from the average temperature of around 31 degrees. This tells us that the temperature could easily dip below 29 degrees again in the near future.
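
For what it's worth, NumPy can compute this number directly; numpy.std divides by the number of data points by default, which matches the steps we followed by hand:

print(numpy.std(temps))  # == 2.5157, the same standard deviation in one call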

Note

The reason we want the "squared difference" between each point and the mean, and not the "actual difference," is that squaring the value puts extra emphasis on outliers, that is, data points that are abnormally far away.

Measures of variation give us a very clear picture of how spread out or dispersed our data is. This is especially important when we are concerned with ranges of data and how data can fluctuate (think percentage return on stocks).

The big difference between data at this level and at the next level lies in something that is not obvious.

Data at the interval level does not have a natural starting point or a natural zero. For example, being at zero degrees Celsius does not mean that you have "no temperature."

The ratio level

Finally, we will take a look at the ratio level. After moving through three levels, each allowing progressively more mathematical operations, the ratio level proves to be the strongest of the four.

Not only can we define order and difference, but the ratio level also allows us to multiply and divide as well. This might seem like not much to make a fuss over but it changes almost everything about the way we view data at this level.

Examples

While Fahrenheit and Celsius are stuck in the interval level, the Kelvin scale of temperature boasts a natural zero. A measurement of zero Kelvin literally means the absence of heat. It is a non-arbitrary starting zero. We can actually scientifically say that 200 Kelvin is twice as much heat as 100 Kelvin.
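
As a small sketch of why this natural zero matters (the Kelvin-to-Fahrenheit conversion is standard; the two temperatures are arbitrary choices for illustration):

def kelvin_to_fahrenheit(kelvin):
    return kelvin * 9.0 / 5.0 - 459.67

print(300.0 / 150.0)   # == 2.0, a meaningful "twice as hot" on the Kelvin scale

# The same two temperatures, converted to Fahrenheit, no longer have a ratio of 2
print(kelvin_to_fahrenheit(300.0) / kelvin_to_fahrenheit(150.0))   # == about -0.42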

Money in the bank is at the ratio level. You can have "no money in the bank" and it makes sense that $200,000 is "twice as much as" $100,000.

Note

Many people may argue that Celsius and Fahrenheit also have a starting point (mainly because we can convert from Kelvin to either of the two). The real difference here might seem silly, but because Celsius and Fahrenheit allow the measurements to go into the negative, they do not define a clear and "natural" zero.

Measures of center

The arithmetic mean still holds meaning at this level, as does a new type of mean called the geometric mean. This measure is generally not used as much, even at the ratio level, but is worth mentioning. It is the nth root of the product of all n values.

For example, in our fridge temperature data, we can calculate the geometric mean as shown here:

import numpy 
 
temps = [31, 32, 32, 31, 28, 29, 31, 38, 32, 31, 30, 29, 30, 31, 26] 
 
num_items = len(temps) 
product = 1. 
 
for temperature in temps: 
    product *= temperature 
     
geometric_mean = product**(1./num_items) 
 
print(geometric_mean)   # == 30.634 
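
An equivalent way to get the same number, which avoids multiplying many values together, is to exponentiate the arithmetic mean of the logarithms:

print(numpy.exp(numpy.mean(numpy.log(temps))))   # == 30.634, the same geometric mean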

Note again how it is close to the arithmetic mean and median as calculated before. This is not always the case and will be talked about at great length in the statistics chapter of this book.

Problems with the ratio level

Even with all of this added functionality at this level, we must generally also make a very large assumption that actually makes the ratio level a bit restrictive.

Note

Data at the ratio level is usually non-negative.

For this reason alone, many data scientists prefer the interval level to the ratio level. The reason for this restrictive property is that, if we allowed negative values, the ratio might not always make sense.

Suppose we allowed debt to occur in our money in the bank example. If one account held a balance of $50,000 and another was $50,000 in debt, the following ratio would not really make sense at all:

$50,000 / -$50,000 = -1
