“If we have data, let's look at data. If all we have are opinions, let's go with mine.”
—Jim Barksdale, former Netscape CEO
Many people work with data without having a shared vocabulary for it. We want to make sure we're all speaking the same language, which will make the rest of the book easier to follow. So, in this chapter, we'll give you a brief crash course on data and data types. If you've taken a basic statistics or analytics course, you'll already know the terms that follow, but parts of our discussion may not have been covered in your class.
The terms data and information are often used interchangeably. In this book, however, we make a distinction between the two.
Information is derived knowledge. You can derive knowledge from many activities: measuring a process, thinking about something new, looking at art, and debating a subject. From the sensors on satellites to the neurons firing in our brains, information is continually created. Communicating and capturing that information, however, is not always simple. Some things are easily measurable while others are not. But we endeavor to communicate knowledge for the benefit of others and to store what we've learned. And one way to communicate and store information is by encoding it. When we do this, we create data. As such, data is encoded information.
Table 2.1 tells the story of a company. Each month, they run a different marketing campaign online, on television, or in print media (newspapers and magazines). The process they run generates new information each month. The table they've created is an encoding of this information and thus it holds data.
A table of data, like Table 2.1, is called a dataset.
Notice that it has both rows and columns that serve specific functions in how we understand the table. Each row of the table (running horizontally, under the header row) is a measured instance of associated information. In this case, it's a measured instance of information for a marketing campaign. Each column of the table (running vertically) is a list of information we're interested in, organized into a common encoding so that we can compare each instance.
The rows of each table are commonly referred to as observations, records, tuples, or trials. Columns of datasets often go by the names features, fields, attributes, predictors, or variables.
A data point is the intersection of an observation and a feature. For example, 150 units sold on 2021-02-01 is a data point.
TABLE 2.1 Example Dataset on Advertisement Spending and Revenue
| Date | Ad Spending | Units Sold | Profit | Location |
|---|---|---|---|---|
| 2021-01-01 | 2000 | 100 | 10452 | |
| 2021-02-01 | 1000 | 150 | 15349 | Online |
| 2021-03-01 | 3000 | 200 | 25095 | Television |
| 2021-04-01 | 1000 | 175 | 12443 | Online |
Table 2.1 has a header row (a piece of non-numerical data) that helps us understand what each feature means. Note that not every dataset will have a header row. In such cases, the header is implied, and the person working with the dataset must know what each feature means.
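To make the vocabulary of observations, features, and data points concrete, here is a minimal Python sketch (our addition; the concepts themselves are language-agnostic) that encodes Table 2.1 as a list of dictionaries, where each dictionary is one observation and each key is a feature:

```python
# Table 2.1 encoded as a Python structure, using only the standard library.
# Each dict is one observation (row); each key is a feature (column).
dataset = [
    {"Date": "2021-01-01", "Ad Spending": 2000, "Units Sold": 100, "Profit": 10452, "Location": None},
    {"Date": "2021-02-01", "Ad Spending": 1000, "Units Sold": 150, "Profit": 15349, "Location": "Online"},
    {"Date": "2021-03-01", "Ad Spending": 3000, "Units Sold": 200, "Profit": 25095, "Location": "Television"},
    {"Date": "2021-04-01", "Ad Spending": 1000, "Units Sold": 175, "Profit": 12443, "Location": "Online"},
]

observation = dataset[1]                             # one row: the February campaign
units_sold = [row["Units Sold"] for row in dataset]  # one column: the Units Sold feature
data_point = dataset[1]["Units Sold"]                # intersection of row and column: 150
```

Tools like spreadsheets and data-analysis libraries use the same row-and-column mental model; the data point 150 here is exactly the "150 units sold on 2021-02-01" mentioned earlier.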
There are many ways to encode information, but data workers rely on a few specific types of encodings to store information and communicate results. The two most common data types are numeric and categorical.
Numeric data is mostly made up of numbers but might use additional symbols to identify units. Categorical data is made up of words, symbols, phrases, and (confusingly) sometimes numbers, like ZIP codes. Numeric and categorical data both split into further subcategories.
There are two main types of numeric data:

- Continuous data can take any value within an interval. Measurements like height, weight, and profit are continuous.
- Count (or discrete) data can take only certain values, typically whole numbers, like units sold or the number of emails you receive each day.
Categorical data also has two main types:

- Ordered (ordinal) data has categories with a natural order, like survey responses ranging from "strongly disagree" to "strongly agree."
- Unordered (nominal) data has categories with no inherent order, like the Location feature in Table 2.1 (Online, Television).
You'll notice Table 2.1 has a Date feature, an additional data type that is sequential and, like numeric data, can be used in arithmetic expressions.
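A quick sketch in Python (our illustration, using values from Table 2.1) shows how dates support arithmetic even though they aren't ordinary numbers:

```python
from datetime import date

ad_spending = 2000   # numeric: arithmetic on it is meaningful
location = "Online"  # categorical: a label, not a quantity

# Dates are sequential, and subtracting them yields an elapsed duration.
d1 = date(2021, 1, 1)
d2 = date(2021, 2, 1)
gap = d2 - d1
print(gap.days)  # 31 days between the two campaign dates
```

Note that while you can subtract two dates, multiplying them together would be meaningless, which is one reason dates are usually treated as their own data type rather than as plain numbers.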
The preceding section talked about data types within a dataset, but there are broader categories that describe data in terms of how it was collected and how it's structured.
Data can be described as observational or experimental, depending on how it's collected.
Most of the data in your company, and in the world, is observational. Examples of observational data include visits to a website, sales on a given date, and the number of emails you receive each day. Sometimes it's saved for a specific purpose; other times, for no purpose at all. We've also heard the phrase “found data” to reference this type of data; it's often created as byproducts from things like sales transactions, credit card payments, Twitter posts, or Facebook likes. In that sense, it's sitting in a database somewhere, waiting to be discovered and used for something. Sometimes observational data is collected because it's free and easy to collect. But it can be deliberately collected, as with customer surveys or political polls.
Experimental data, on the other hand, is not passively collected. It's collected deliberately and methodically to answer specific questions. For these reasons, experimental data represents the gold standard of data for statisticians and researchers. To collect experimental data, you must randomly assign a treatment to someone or something. Clinical drug trials are a common example that generates experimental data. Patients are randomly split into two groups—a treatment group and a control group—and the treatment group is given the drug while the control group is given a placebo. The random assignment of patients should balance out information not relevant to the study (age, socioeconomic status, weight, etc.) so that the two groups are as similar as possible in every way, except for the application of the treatment. This allows researchers to isolate and measure the effect of the treatment without having to worry about potential confounding features that might influence the outcome of the experiment.
This setup can span industries, from drug trials to marketing campaigns. In digital marketing, web designers frequently experiment on us by designing competing layouts or advertisements on web pages. When we shop online, a coin flip behind the scenes determines which of two advertisements we are shown, call them A and B. After several thousand unknowing guinea pigs visit the site, the web designers see which ad led to more "click-throughs." And because ads A and B were shown randomly, it's possible to determine which ad was better with respect to click-through rates, because all other potential confounding features (time of day, type of web surfer, etc.) have been balanced out through randomization. You might hear experiments like this called "A/B tests" or "A/B experiments."
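Here's a small simulation (our addition, with made-up click-through rates purely for illustration) of the A/B test described above. Each simulated visitor gets the behind-the-scenes coin flip, and we compare the observed click-through rates afterward:

```python
import random

random.seed(42)  # fixed seed so the sketch is reproducible

# Hypothetical "true" click-through rates for each ad, unknown to the designers.
TRUE_CTR = {"A": 0.05, "B": 0.08}

shown = {"A": 0, "B": 0}
clicks = {"A": 0, "B": 0}
for _ in range(10_000):
    ad = random.choice(["A", "B"])      # the random coin flip per visitor
    shown[ad] += 1
    if random.random() < TRUE_CTR[ad]:  # did this visitor click through?
        clicks[ad] += 1

observed = {ad: clicks[ad] / shown[ad] for ad in shown}
print(observed)  # observed rates land near the true rates of 0.05 and 0.08
```

Because assignment was random, the difference between the two observed rates can be attributed to the ads themselves, which is exactly the logic behind A/B testing.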
We will talk more about why this distinction matters in Chapter 4, "Argue with the Data."
Data is also said to be structured or unstructured. Structured data is like the data in your spreadsheets or in Table 2.1: it's presented with a sense of order and structure in the form of rows and columns.
Unstructured data refers to things like text from Amazon reviews, pictures on Facebook, YouTube videos, or audio files. Unstructured data requires clever techniques to convert it into the structured form that analysis methods require (see Part III of this book).
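As a toy illustration (ours, not one of the techniques covered in Part III), unstructured text can be turned into structured data by counting words, producing a small table of word frequencies:

```python
from collections import Counter

# A short, made-up product review: unstructured text.
review = "great product great price fast shipping"

# Counting words converts it into structured (feature, count) pairs.
word_counts = Counter(review.split())
print(word_counts["great"])  # 2
```

Real text-analysis methods are far more sophisticated, but the basic move is the same: impose rows and columns on data that didn't start with any.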
Data does not always look like a dataset or spreadsheet. It's often in the form of summary statistics. Summary statistics enable us to understand information about a set of data.
The three most common summary statistics are mean, median, and mode, and you're probably quite familiar with them. However, we wanted to spend a few minutes discussing these statistics because we frequently see the colloquial terms “normal,” “usual,” “typical,” or “average” used as synonyms for each of the terms. To avoid confusion, let's be clear on what each term means:
Mean, median, and mode are called measures of location or measures of central tendency. Variance, range, and standard deviation are measures of variation, or measures of spread. A measure of location tells you where on the number line a typical value falls; a measure of spread tells you how far the other numbers tend to fall from that value.
As a trivial example, the numbers 7, 5, 4, 8, 4, 2, 9, 4, and 100 have mean 15.89, median 5, and mode 4. Notice the mean (average), 15.89, is a number that doesn't appear in the data. This happens a lot: the average number of people in a household in the United States in 2018 was 2.63; basketball star LeBron James scores an average of 27.1 points per game.
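You can verify these numbers with Python's standard library (our sketch; any statistics tool gives the same answers), along with the measures of spread mentioned above:

```python
import statistics

data = [7, 5, 4, 8, 4, 2, 9, 4, 100]

mean = statistics.mean(data)      # 143 / 9 = 15.888..., a value not in the data
median = statistics.median(data)  # 5: the middle value once sorted
mode = statistics.mode(data)      # 4: the most frequent value

# Measures of spread for the same numbers:
spread_range = max(data) - min(data)  # 100 - 2 = 98
stdev = statistics.stdev(data)        # sample standard deviation

print(round(mean, 2), median, mode, spread_range)
```

Notice how the single outlier (100) drags the mean far above the median, a preview of why the two can tell very different stories about the same data.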
It's a common mistake to use the average (mean) as if it were the midpoint of the data, which is actually the median. People assume half the numbers must be above average and half below. This isn't true. In fact, it's common for most of the data to be below (or above) the average. For example, the vast majority of people have more than the average number of fingers (likely 9.something).
To avoid confusion and misconceptions, we recommend sticking with mean or average, median, and mode for full transparency. Try not to use words like usual, typical, or normal.
In this chapter, we gave you a common language to speak about your data in the workplace. Specifically, we described:

- The difference between information and data
- Datasets, observations, features, and data points
- Numeric and categorical data types
- Observational versus experimental data, and structured versus unstructured data
- Summary statistics: measures of central tendency (mean, median, mode) and measures of spread
With the correct terminologies in place, you're ready to start thinking statistically about the data you come across.