3

Collecting Data

In this chapter, we will discuss sources of data. Sometimes data is not already nicely and neatly stored for you, and you will have to reach out to collect your own. Here, we will discuss the most common sources of information and a couple of systems that will pull or store data automatically. You will also learn how to make your queries more efficient if you are accessing a database. These resources and techniques are crucial, not only to passing the CompTIA Data+ certification exam but also to the everyday life of a data analyst.

In this chapter, we’re going to cover the following main topics:

  • Utilizing public sources of data
  • Collecting your own data
  • Differentiating Extract, Transform, Load (ETL) and Extract, Load, Transform (ELT)
  • Understanding online transactional processing (OLTP) and online analytical processing (OLAP)
  • Optimizing query structure

Utilizing public sources of data

There are all kinds of datasets available for free, covering any number of topics. With these, collecting data is as simple as finding the right source and downloading it. Some of these free sources are even already clean and organized, just waiting for an analysis to be run.

Public databases

Public databases are databases that legally must be accessible. These fall into one of two categories: government or industry. The majority of public databases are run by government entities. These can be found around the world, at both local and national levels. Exactly what information is shared depends on the government entity, but it can include information about the following:

  • Population
  • Agriculture
  • Utilities
  • Public health concerns

The downside to government-run databases is that, while they are free, they are usually full of missing data and mistakes. This means that you will have to spend a lot of time cleaning the data.

The public data you have access to depends entirely on where you are in the world. Each country may make data accessible to its own citizens but may try to limit access to citizens of other countries. While there may be creative ways around these limitations, getting too “creative” is not considered the best practice. Alternatively, some international organizations allow access to their data in much the same way as a government agency. The World Health Organization (WHO) makes all kinds of data accessible. You can visit who.int/data/collections and give it a try.

Industry-run public databases come from specific industries that are legally mandated to release specific data. These are generally cleaner than government databases, but the industries tend not to share more than the legal minimum, and they do not make this data easy to find.

Open sources

Open source datasets are ones that have been posted by individuals or companies for free. This data is freely given and covers a very wide range of topics. There are several of these sources in existence, but it is generally accepted that www.kaggle.com is the most useful.

Kaggle is community-based, with datasets posted by individuals, companies, universities, and organizations. Over 50,000 datasets have been posted, and you can filter your search by file size, file type, license, or topic.

Important note

Let me be explicit here—being able to filter by license is important because not all of the datasets can be used commercially. If you are just creating a portfolio project to publish on Kaggle, it doesn’t matter, but if you want to try to use the data for your company, you will have to check the licenses carefully.

Kaggle also offers courses and competitions; it is a great place to get started in the data analytics community. To be clear, Kaggle is not the only resource like this. A lot of people like data.world, which offers free personal accounts, but to use the information commercially, you need to purchase an enterprise license. You can learn more about this at data.world/pricing. Because the data comes from so many sources, it is not uniformly clean and ready for use, but it tends to be cleaner than data in government-run public databases.

Application programming interfaces and web services

Some companies will allow limited access to their databases remotely using an application programming interface (API). This limited access is different from the public sources that allow you to just download an entire dataset outright. There are different types of APIs, and they work in different ways, but you can generally think of them as a piece of code that allows two unrelated computer systems to exchange information. For example, you go to a restaurant and sit at a table. You tell the waiter what you want, and he goes to the kitchen to tell the cook your order before bringing the food back to you. In this case, the waiter is the API, the cook is the database, and the food is the data you want. The API is simply a middleman that allows you to communicate with a database.
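
If you are curious what a request looks like in practice, here is a minimal sketch in Python using the requests library against a hypothetical endpoint; the URL, parameters, and API key are placeholders, not a real service.

import requests

API_URL = "https://api.example.com/v1/jokes"        # hypothetical endpoint
params = {"category": "puns", "limit": 10}          # query parameters the API might accept
headers = {"Authorization": "Bearer YOUR_API_KEY"}  # many APIs require a key or token

response = requests.get(API_URL, params=params, headers=headers)
response.raise_for_status()   # stop with an error if the request failed
data = response.json()        # most APIs return their data as JSON
print(data)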

Web services are a specific kind of API that operates over a hosted network, which means both computers must be connected to that network environment for them to work. That said, with small protocol differences, they still work like other APIs. For the purpose of the exam, you need to know that all web services are APIs, but not all APIs are web services.

All APIs pass information back and forth in one of two ways:

  • Synchronous
  • Asynchronous

When requesting information through a synchronous API call, the code makes the request and then waits for a response. An asynchronous API request means that the code continues to run while it waits for the API to return the data. The advantage is straightforward: asynchronous code can be faster and more efficient because you do not have to stop everything and wait. The downside is that if you don't wait, your code may try to use the data before the API has responded with it, which will cause an error.
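
To make the difference concrete, here is a small conceptual sketch in Python; the fetch functions simply simulate an API that takes two seconds to respond, so no real network calls are made.

import asyncio
import time

def fetch_sync(name):
    time.sleep(2)            # the code stops and waits for the "API" to respond
    return f"data for {name}"

async def fetch_async(name):
    await asyncio.sleep(2)   # other work can continue while this waits
    return f"data for {name}"

# Synchronous: three requests run one after the other, roughly 6 seconds total.
start = time.time()
sync_results = [fetch_sync(n) for n in ("a", "b", "c")]
print(f"synchronous took {time.time() - start:.1f}s")

# Asynchronous: the three waits overlap, roughly 2 seconds total, but nothing
# can safely use the results until every request has actually finished.
async def main():
    return await asyncio.gather(*(fetch_async(n) for n in ("a", "b", "c")))

start = time.time()
async_results = asyncio.run(main())
print(f"asynchronous took {time.time() - start:.1f}s")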

Important note

Because the exam is vendor-neutral, you do not have to know any specific programming language. This means that you will not be asked to create an API or to make a request to an API. Instead, you will simply have to have a firm understanding of what they are, why they are used, and the difference between synchronous and asynchronous.

That said, APIs are very useful resources. Most data analysts do not use them in their everyday job, but they are commonly used to acquire data for portfolio projects. While you may never be asked to build an API, it does not hurt to learn how to request data from one, just so you are comfortable with the process.

There are thousands of APIs available that connect to databases with all sorts of information, including entire databases of bad jokes. Many major tech companies also have APIs available, including Facebook, Amazon, Apple, Netflix, and Google. There is a lot of good information out there available through APIs. The downside of APIs is that they are not uniform. Every API is set up slightly differently; you will have to read the documentation and, in most cases, acquire authorization from the company in order to request information. Several APIs also have limits on the size or number of requests you can make to ensure that one person isn't bogging down the entire system for everyone.

If you would like to learn more, try one out. A popular one for people just getting started is the Twitter API, which you can check out at developer.twitter.com/en/docs/twitter-api. This API is popular because it is well documented, there are lots of guides out there on how to use it, and it gives access to a lot of data. You can even go to the website and try it by clicking the Try a live request button, without writing a bunch of code, though you do need an account. The specifics on how to connect to it, how to get authorization, how much data you can pull for each level of authorization, and everything else you may want to know are given in the documentation.

Public data sources are a great resource, and it never hurts to look through them first, but sometimes you just need to collect your own data.

Collecting your own data

You will not always be able to find the exact dataset you require nicely collected and cleaned for you with a little bow on top. Often, if you need specific data, you will need to go out and get it yourself. This can be easier said than done. However, there are several different ways for you to collect your own data. We will go over a few of the most common approaches you will need to know.

Web scraping

As the name implies, web scraping is the process of collecting data from the web. While it uses the web, this approach is different from public databases, open sources, or APIs because there is no database; the information is collected directly from the web page itself. This can include collecting price information on a product from several different sites, collecting posted stock information, or filtering through social media posts with specific tags. Anything that is posted on a website can be accessed and saved in your own dataset. To be clear, you are not visiting a site and writing down the information on a notepad. The code will automatically go to a specified location, collect specific information, and bring it back.

Web scraping is a useful tool and has many applications. A lot of information is posted on the web in one form or another, and web scraping allows you to collect it into something useful. The problem that a lot of people face is that doing it well requires you to understand how web pages are structured, well enough to target the specific information you want. More often than not, this means that the data analyst has to at least understand the basic concepts of HTML in addition to whichever scripting programming language they are using to run the web scraping code. For more information on website structure, look up guides on the Document Object Model (DOM).
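
As a rough illustration, here is a minimal web scraping sketch in Python using the requests and BeautifulSoup libraries; the URL and the CSS selectors are hypothetical, and a real page would need selectors that match its actual HTML.

import requests
from bs4 import BeautifulSoup

url = "https://www.example.com/products"   # hypothetical product listing page
html = requests.get(url, timeout=10).text

soup = BeautifulSoup(html, "html.parser")
rows = []
for item in soup.select("div.product"):    # each product block on the page
    name = item.select_one("h2.title").get_text(strip=True)
    price = item.select_one("span.price").get_text(strip=True)
    rows.append({"name": name, "price": price})

print(rows)   # from here, save the rows to a CSV file or a database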

Surveying

Surveys are one of the most common ways to collect information. At its heart, a survey is just a set of questions that you give to a sample of individuals. A sample is a small subset of a larger population.

Important note

A population represents every single individual in a group. If you want information on people who own pets, the population is everyone on the planet who has a pet. Getting information on everyone is not possible. Instead, you take information from a sample and generalize what you learn from that sample to the population as a whole. This is a fundamental principle of inferential statistics. If you want to learn more, look up a guide on sampling techniques.

Surveys can be done electronically, with paper forms, or by someone simply asking questions and writing down the answers. While you can get just about any kind of data from a survey if you do it properly, they are most often used to collect information on demographics and customer satisfaction.

Types of survey answers

There are many ways to ask the same question, each with its own set of pros and cons. For a moment, we will focus on the types of survey answers. The main types are as follows:

  • Text-based
  • Single-choice
  • Multiple-choice
  • Drop-down
  • Likert

Text-based answers are exactly what they sound like. You provide a blank field and let the person taking the survey write in whatever they want. From a data analytics point of view, these are a pain. There is no way to control how the person answers, so you never know what you will get. In the best-case scenario, you can use natural language processing (NLP) to automatically assign values to the responses. More likely, you will have to go through and look at every value individually, which can take time if you have a big sample. That said, there are certain types of data you can only collect through open-ended questions such as this:

Figure 3.1 – Text-based survey question

In Figure 3.1, we see an example of a text-based survey question. It is simply a question, with an empty box under it where someone can write a response.

Single-choice answers are where a list of answers is provided and the person taking the survey can only pick one. These are clean and the results are easy to analyze. An example is shown here:

Figure 3.2 – Single-choice survey question

In Figure 3.2, we see the same question as Figure 3.1 but formatted as a single-choice answer. The format is denoted by having circles next to the possible answers so that only one can be selected.

Multiple-choice answers are where you provide a list of possible answers and tell the person to select all that apply. They can click as many or as few as they like. These can take a little more time to analyze than a single-choice answer, but they are not complicated. Here’s an example:

Figure 3.3 – Multiple-choice survey question

In Figure 3.3, we see the question written as a multiple-choice question. This is denoted by the squares next to the answers, suggesting that more than one answer can be selected. Sometimes, this is even stated after the question with a “Select All that Apply” statement.

Drop-down menus are where the person taking the survey selects a value from a drop-down list. From an analyst’s point of view, the results of these are almost identical to a single-choice answer. Either way, every person who takes the survey will select one of a few possible outcomes. Here’s an example:

Figure 3.4 – Drop-down survey question

In Figure 3.4, we see the question formatted as a dropdown. This is denoted by the arrow symbols in the box. Here, the person taking the survey would click on the box and then select an answer from a drop-down menu.

Likert scales are popular, specifically when gauging the popularity or effectiveness of a specific part of a product. These are the questions that make a statement, and the person has to answer “Strongly agree”, “Agree”, “Neither agree nor disagree”, “Disagree”, or “Strongly disagree”. Again, the results are clean and easy to analyze. In addition, Likert scales are great for turning qualitative questions into quantitative results.

Important note

Qualitative data refers to specific qualities that describe something. Color is a quality, so saying the car is red is qualitative data. Quantitative refers to things that can be quantified, or things that can be counted, measured, or calculated. These are your numbers, so saying the car has four tires is quantitative data. There is a time and a place for both kinds of data, and there are ways to transform qualitative data into quantitative data and vice versa.

Now, let’s look at a quick example:

Figure 3.5 – Likert survey question

In Figure 3.5, we see the same question formatted as a Likert scale. Note that this is set up as a single-choice question, but the difference is that the question is written as a statement and the possible answers are written as a scale. There are multiple ways to set up a Likert scale, but this is one of the most common.
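
As a quick illustration of turning those qualitative answers into quantitative results, here is a small Python sketch that recodes Likert responses as numbers; the responses list is made-up sample data.

LIKERT_SCALE = {
    "Strongly disagree": 1,
    "Disagree": 2,
    "Neither agree nor disagree": 3,
    "Agree": 4,
    "Strongly agree": 5,
}

responses = ["Agree", "Strongly agree", "Neither agree nor disagree", "Agree"]
scores = [LIKERT_SCALE[answer] for answer in responses]

average = sum(scores) / len(scores)
print(f"Average agreement: {average:.2f} out of 5")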

Each of these types of survey answers is appropriate for different kinds of questions. It all depends on what you want to know.

Survey bias

Bias is the bane of analysts who don't pay attention to actively avoiding it. To put it simply, bias is a systematic difference between the expected value of what you measure and the true value. Bias means that what you learn about your sample cannot be generalized to your population. The more bias you have, the further your results will be from the truth.

Okay, you understand that bias is bad and invalidates your results, so how do you avoid it? That is tricky because there are hundreds of different kinds of biases, and there is no surefire way to avoid all of them. Randomizing who receives the survey as much as possible is a good start, but there are also a few types of bias common to surveys that are easy to avoid.

Important note

For more information on specific types of bias, check out www.thedecisionlab.com/biases. This is a great source that covers the most common kinds of bias and how to avoid them.

Order bias is common in surveys, and it happens when the order of the questions or the order of the answers impacts what people choose. This is easily avoided by randomizing the order of the questions and answers. Most survey sites will include this randomization as an option or automatically mix things up for you.

Also, try to avoid leading questions or answers. For example, you shouldn’t ask, “Which do you prefer: The incredible product A or the lackluster product B?” Leading questions, whether intentional or accidental, add bias. Instead, try to avoid using adjectives at all.

Recall bias happens when you ask someone to remember details of something that happened in the past that they may or may not remember. This is especially problematic if you are requesting a text answer to the question. Not able to clearly remember what happened, the person is likely to just guess, which, as you can imagine, does not always yield accurate results. Try to avoid asking questions about the distant past, especially with text answers. If you must ask about the past, add “I don’t know” as a possible answer so that they don’t have to make something up.

If you are interested in surveys, I highly suggest going to a service such as SurveyMonkey, at surveymonkey.com, signing up for a free account, and creating one. There are even guides and suggestions on the survey-making process.

Pass the survey around to friends and family and just see what kind of results you get. It will help get you familiar with the process and get you some practice writing useful questions. SurveyMonkey also has templates, suggestions, and people you can talk to about your survey.

Overall, making a survey is not hard, but making a survey that will deliver accurate and useful results is not easy. There are entire companies that specialize in creating and delivering good surveys if you aren’t comfortable doing it yourself. If you want to learn more, look up guides on survey design.

Observing

Observation can be very simple or incredibly complicated. At its heart, observation is simply witnessing something and recording it. The process of devoting time and effort to gather and record specific observations is called a study.

Important note

For more information on specific types of studies, look for guides on study design. If you are feeling bold, or particularly academic, look for guides on research design.

There are many kinds of studies. Some studies are passive, where you simply watch and see what happens. Other studies are active, where you change something in the environment and see how other elements react. The specifics of the study depend entirely on the type of data you want to get out of it.

More importantly, there are different ways to observe things. The classic approach is physical observation, which is the process of going to a location and recording what you see. For certain types of data, this is required. If a department store wanted to know how many people who go to the mall wear hats, you would need to physically go to the mall and record the number of hats you saw. This process takes time but is the best approach for studies that require decision-making. Does a visor count as a hat? Is that lady wearing a bonnet?

A common example is an A-B study for a website. The basic design is roughed out in Figure 3.6:

Figure 3.6 – A-B study for a website

A company is trying to decide which website design leads to more sales: design A or design B. Every customer who visits the website is randomly routed to either design A or design B. The website automatically observes and takes note of things such as how long a person stays on a page, how many links they click, or whether or not they buy something. Once enough data is collected for each design, you can run a statistical analysis to see which design performed better against a specific goal.
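
Here is a rough sketch in Python of the routing and comparison described above; the visitors and their behavior are simulated with made-up conversion rates, and a real study would finish with a proper statistical test.

import random

random.seed(42)
counts = {"A": {"visits": 0, "sales": 0}, "B": {"visits": 0, "sales": 0}}
true_rate = {"A": 0.05, "B": 0.07}   # hypothetical underlying conversion rates

for _ in range(10_000):                  # each loop iteration is one visitor
    design = random.choice(["A", "B"])   # random routing to design A or B
    counts[design]["visits"] += 1
    if random.random() < true_rate[design]:   # did this visitor buy something?
        counts[design]["sales"] += 1

for design, c in counts.items():
    rate = c["sales"] / c["visits"]
    print(f"Design {design}: {c['visits']} visits, conversion rate {rate:.2%}")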

More and more, automated observations are becoming popular, and software will generate metrics for you. A website will count how many times someone looks at a specific page, or how many times a link is clicked. The department store can set up a program to count the number of hats sold. These are also considered observations. While you may not physically see it happen, it is a record of something that has occurred that was observed by some mechanism that was looking for it. Needless to say, automated observations are convenient but limited in the types of information that they can collect.

Differentiating ETL and ELT

You don’t always have to collect your data manually. Some programs will automatically pull data from a source, prepare it for use, and move it to a new location, usually your local environment. The code that automates this process is called a data pipeline. There are many kinds of data pipelines, and each needs to be tuned to the data you are pulling and what you need to do with it. While more complicated pipelines can automate entire modeling and reporting processes, the exam focuses on two types, and both have the same three steps:

  1. Extraction
  2. Transformation
  3. Loading

Extraction is the step of pulling the data from the original source. The source can be a database you own, an outside database, or even an automated web scraping system; it doesn't matter. Extraction is picking up the information from wherever it was originally stored. This is similar to the process of collecting the data manually yourself, but it is automated.

Transformation can take on many forms, but the idea is that this step prepares the data for use. It cleans the data and organizes it nicely and neatly so that it is ready to be plugged into an algorithm as soon as it arrives. This step can include dealing with missing values, adjusting capitalizations, normalizing data, combining columns, recoding, removing data, or any number of other operations. The idea is that this automates the step of data manipulation to save you time and effort.

Loading is the process of placing the data into a new environment. More often than not, this is the environment where you will actually be working with the data. Often, this means you are bringing it to your local machine or programming interface, but there are times when it is more appropriate to move the data to a virtual machine (VM) or a cloud environment. Sometimes, the data is loaded directly into a data warehouse.
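
As a minimal sketch of the three steps together, the following Python example assumes a local CSV file named sales.csv as the source and a local SQLite file as the destination; the file, table, and column names are placeholders.

import sqlite3
import pandas as pd

# Extract: pull the data from its original source.
df = pd.read_csv("sales.csv")

# Transform: clean and reshape the data so it is ready to use.
df = df.dropna(subset=["order_id", "amount"])        # drop incomplete rows
df["region"] = df["region"].str.strip().str.title()  # fix inconsistent capitalization
df["amount"] = df["amount"].astype(float)

# Load: place the prepared data into its new environment.
conn = sqlite3.connect("warehouse.db")
df.to_sql("sales_clean", conn, if_exists="replace", index=False)
conn.close()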

For the purpose of this exam, you don’t need to be able to create or run a pipeline. Not only would this require knowledge of a programming language, but also, the creation of a pipeline is not generally expected of someone just starting their data career. However, you will be expected to know what a pipeline is and the difference between two specific kinds of pipelines. The main difference between these pipelines is the order in which they perform the three listed steps and the consequences of changing the order.

ETL

ETL is a pipeline that extracts the data, then transforms it, then loads it to the destination. While this may seem obvious, it means that the data is transformed after it is extracted and before it is fully loaded; the data is transformed in transit or while it is being loaded to the new location. In Figure 3.7, we can see how this plays out:

Figure 3.7 – ETL

This is quick, simple, and efficient, so long as your transformations are quick and simple. When you start to get into larger or more complicated transformations, the entire process slows down very quickly.

ELT

ELT is a pipeline that extracts the data, loads it to the destination, then transforms it. This means that the processing power to transform the data comes from the destination environment. We can see the flow of this process in Figure 3.8:

Figure 3.8 – ELT

Several companies have taken a liking to this approach because they can load the data to a VM in the cloud and then pay to rent the processing power they need to transform the data quickly. This negates the requirement to own high-powered hardware to work with large or complex data.

Generally speaking, ELT loaded into a rented environment is faster than ETL when working with complicated transformations, but it is more expensive. For individuals or small companies with simple transformations, ETL is still the preferred approach.

Delta load

It is important to mention that data can be loaded in different ways. These loading methods exist with or without a pipeline, but they do have an impact on the efficiency of an ETL or ELT pipeline. The most important approaches are as follows:

  • Full load
  • Delta load

To keep it simple, a full load means that every time the pipeline is run, the entire dataset is extracted, transformed, and loaded. Occasionally this is performed as a safety measure in the form of redundancy, but it is not efficient.

In scientific terminology, delta (Δ) represents change, so a delta load only loads things that have changed since the last load. As you can imagine, it is much faster and more efficient to only update things that have changed instead of trying to load the entire dataset every time.
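
Here is a self-contained sketch of the idea in Python with SQLite; the tables, columns, and timestamps are made up, and a real pipeline would read the time of the last successful run from a log or metadata table.

import sqlite3

# Tiny source and destination tables so the sketch runs on its own.
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE sales (id INTEGER PRIMARY KEY, amount REAL, last_modified TEXT)")
source.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [(1, 9.99, "2024-01-02 08:00:00"), (2, 5.00, "2023-12-30 12:00:00")],
)
dest = sqlite3.connect(":memory:")
dest.execute("CREATE TABLE sales (id INTEGER PRIMARY KEY, amount REAL, last_modified TEXT)")

last_run = "2024-01-01 00:00:00"   # when the pipeline last ran successfully

# Delta load: extract and load only the rows that changed since the last run.
changed = source.execute(
    "SELECT id, amount, last_modified FROM sales WHERE last_modified > ?",
    (last_run,),
).fetchall()
dest.executemany("INSERT OR REPLACE INTO sales VALUES (?, ?, ?)", changed)
dest.commit()

print(f"Delta load copied {len(changed)} row(s)")   # only the row changed after last_run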

Understanding OLTP and OLAP

OLTP and OLAP are relatively new, but important, fields. They involve financial transactions performed over a network, such as shopping or banking transactions. To be clear, these are a kind of automated observation and often include a pipeline such as ETL or ELT, but they are considered their own categories because financial data is protected. Some guidelines must be followed for OLTP and OLAP that are not required for other types of data. Specific guidelines on protected data will be covered later in this book.

OLTP

OLTP is the act of automatically storing and processing data from online transactions. To be clear, that means that every time you purchase something online, the record is automatically collected, stored, and processed through OLTP. More often than not, this process uses a snowflake schema, as discussed in Chapter 2, Data Structures, Types, and Formats. There are specific safeguards involved with OLTP, such as making sure that if some part of the transaction fails, the entire process stops. Another safeguard is that a specific piece of data cannot be changed by two users at the same time, which means everything happens one after the other. Both features are there to ensure data integrity.
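
To picture the "all or nothing" safeguard, here is a small Python and SQLite sketch of a transaction that rolls back when one step fails; the accounts and amounts are made up.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance REAL)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("buyer", 20.0), ("store", 0.0)])

price = 50.0   # more than the buyer has, so this purchase should fail

try:
    with conn:   # opens a transaction: commit on success, roll back on error
        conn.execute("UPDATE accounts SET balance = balance - ? WHERE name = 'buyer'", (price,))
        conn.execute("UPDATE accounts SET balance = balance + ? WHERE name = 'store'", (price,))
        (balance,) = conn.execute("SELECT balance FROM accounts WHERE name = 'buyer'").fetchone()
        if balance < 0:
            raise ValueError("insufficient funds")   # any failure stops the whole transaction
except ValueError:
    print("Transaction failed; every step was rolled back")

print(conn.execute("SELECT * FROM accounts").fetchall())   # balances are unchanged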

OLAP

OLAP is the next step. After the data has been collected and stored by OLTP, OLAP takes the information, often through a pipeline such as ETL or ELT, and moves it to a new database, or data warehouse. The new database is often in a star schema, as discussed in Chapter 2, Data Structures, Types, and Formats. This process includes simple analyses and aggregations so that the data is ready to help people make informed business decisions with minimal work.
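
As a tiny sketch of the OLAP side, the following Python and SQLite example aggregates raw transactions into a summary table that is ready for reporting; the data and table names are made up.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transactions (product TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?)",
    [("hat", 10.0), ("hat", 12.0), ("visor", 8.0)],
)

# Aggregate the raw transactions into a summary table for analysis.
conn.execute(
    """CREATE TABLE sales_summary AS
       SELECT product, COUNT(*) AS orders, SUM(amount) AS revenue
       FROM transactions
       GROUP BY product"""
)

print(conn.execute("SELECT * FROM sales_summary").fetchall())
# [('hat', 2, 22.0), ('visor', 1, 8.0)]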

OLTP and OLAP sound very similar. To keep them separate in your mind, just remember that A is for analytical, so OLAP is the analysis step. By process of elimination, OLTP must be the collection step.

Optimizing query structure

A query is simply a request for information, so in the field of data analytics, a query is a request for data. It is how we call data from a database to our local environment. Several different programs allow you to query a database, and some are more popular than others, but often the biggest difference in the performance of queries is decided by how optimized they are. If you are making a simple query to a small database, efficiency and performance may not mean very much, but as the amount of data you are pulling gets bigger and your queries get more and more complicated, performance becomes a bigger issue. When you get to the point where a query takes hours or even days to run, you might want to consider optimizing it.

Filtering and subsets

First and foremost, it should be obvious that the less information you pull, the faster the process will be. Filtering is the process of being selective about which data you are querying. There are many ways to filter data, and it depends on what program you are using. However, at its core, a filter uses conditional logic. You can say that you want data from a particular table if it meets certain conditions. Often, you are filtering for specific values or a range of values. Again, the more specific a query is, the less unnecessary data you are pulling and the less processing power you are wasting.

Subsets are exactly what they sound like: a smaller subsection of your dataset. Filtering effectively creates a subset. The important thing to remember is when you apply the filter to create the subset. It is vastly more efficient to create a subset before you join tables. Squishing tables together takes a fair amount of processing power. If you apply a filter after you create a join, then you have to join the full, much larger tables first, and that takes more time. Filtering data before joining tables means that you are only joining a subset of the whole table, which is much more efficient.

Have a look at the following screenshot:

Figure 3.9 – Filtering

In Figure 3.9, we see a simple dataset that has had a filter applied. In this case, the filter is simply selecting all employees with the Sales value in the Department column, but filters can be applied to any of the columns and can include ranges as well as specific values. As you can see, the filtered table containing only the required data, WHERE Department = Sales, is much smaller than the original table. Using filters and subsets so that you only pull the data you need is a great way to optimize a query.
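
Here is a small Python and SQLite sketch of the idea; the tables are tiny made-up examples, and a modern query planner may optimize both versions similarly, but filtering down to a subset before joining is the habit worth building.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Employees (Name TEXT, DeptID INTEGER)")
conn.execute("CREATE TABLE Departments (DeptID INTEGER, Department TEXT)")
conn.executemany("INSERT INTO Employees VALUES (?, ?)",
                 [("Avery", 1), ("Blake", 2), ("Casey", 1)])
conn.executemany("INSERT INTO Departments VALUES (?, ?)",
                 [(1, "Sales"), (2, "Finance")])

# Join the full tables first, then filter the joined result.
filter_after_join = """
    SELECT e.Name
    FROM Employees e
    JOIN Departments d ON e.DeptID = d.DeptID
    WHERE d.Department = 'Sales'"""

# Filter Departments down to a subset first, then join only against that subset.
filter_before_join = """
    SELECT e.Name
    FROM Employees e
    JOIN (SELECT DeptID FROM Departments WHERE Department = 'Sales') d
      ON e.DeptID = d.DeptID"""

print(conn.execute(filter_after_join).fetchall())
print(conn.execute(filter_before_join).fetchall())   # same result, smaller join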

Indexing and sorting

Indexing is assigning every entry in a table a unique ascending number, so you can count down the rows: 1, 2, 3, and so on. Why is this useful? Because the index acts as a placeholder. It means that you can temporarily sort the table and then put things back where they were afterward. You can sort without indexing first, but it is much less efficient.

Sorting is simply arranging the rows in a different order according to some logic. The idea is that, when working with a filter, sorting can save time. For example, if you only want to look at rows where the value of column X is greater than 5 and less than 10, you sort by column X, then filter it. This means that all the rows you want are right next to each other in the table. The values are easy to find and grab.

That said, there are debates about how useful sorting is. The problem is that when you are working with a large dataset, the act of sorting itself uses a lot of processing power, so you have the potential to make things less efficient instead of more. However, for the exam, you simply need to understand what sorting and indexing are and that they can play a role in query optimization.

Have a look at the following screenshot:

Figure 3.10 – Indexing and sorting

In Figure 3.10, we see that an index has been added to the dataset, and then the dataset was sorted by YearsWithCompany in descending order. Now, the names are in order of who has been with the company the longest, but it would be easy to return them to the original order using the index. If you were looking for people who have more than 10 years of experience, ordering the list with ORDER BY YearsWithCompany DESC puts those with the most experience together at the top. This makes finding and filtering them much faster.
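
Here is a plain Python sketch of the index-as-placeholder idea from Figure 3.10, using a small made-up employee list.

employees = [
    {"Name": "Avery", "YearsWithCompany": 3},
    {"Name": "Blake", "YearsWithCompany": 12},
    {"Name": "Casey", "YearsWithCompany": 7},
]

# Indexing: give every row a unique ascending number to act as a placeholder.
for i, row in enumerate(employees, start=1):
    row["Index"] = i

# Sorting: most experience on top, so rows with more than 10 years sit together.
by_experience = sorted(employees, key=lambda r: r["YearsWithCompany"], reverse=True)
print([r["Name"] for r in by_experience if r["YearsWithCompany"] > 10])

# The index makes it easy to put everything back in the original order afterward.
original_order = sorted(by_experience, key=lambda r: r["Index"])
print([r["Name"] for r in original_order])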

Parameterization

Parameterization is the use of a prewritten query that lets the user enter specific parameters to tell the query which data to target. This means that you don't need to create a query from scratch every time you want to pull data from the database. It also allows you to use a pre-optimized query without having to think about it.

The biggest reason for parameterization is not query optimization, but cybersecurity. By forcing people to use prewritten queries and limiting what they can write, you are protecting the data from injection attacks. Malicious code can come in many forms and do many things. Limiting what people can enter is one way to protect against it.
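
Here is a minimal sketch of a parameterized query in Python with SQLite; the table and data are made up. The query text is fixed and only the parameter value changes, which is exactly what protects against injection.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Employees (Name TEXT, Department TEXT)")
conn.executemany("INSERT INTO Employees VALUES (?, ?)",
                 [("Avery", "Sales"), ("Blake", "Finance")])

def employees_in_department(department):
    # The ? placeholder is filled in safely by the database driver;
    # the value is never pasted into the SQL text itself.
    query = "SELECT Name FROM Employees WHERE Department = ?"
    return conn.execute(query, (department,)).fetchall()

print(employees_in_department("Sales"))
# Even a malicious value is treated as plain text, not as SQL to execute.
print(employees_in_department("Sales'; DROP TABLE Employees; --"))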

Temporary tables and subqueries

Temporary tables are an incredibly useful tool. Effectively, you are taking the results of a query and saving them as their own table. That means that a big, complicated, time-intensive query is something you only need to run once. After that, you can run small, simple queries off this new table. If every query had to connect multiple tables, sort, filter, add aliases, and aggregate, you would waste a lot of processing power; storing the results in a temporary table saves all of that repeated work. That said, the tables are temporary, and most programs will automatically delete them after a set time. It should also be noted that they do not update when the source data updates, so they are often refreshed or recreated regularly.

An example of a temporary table might look something like this:

#Temporary Table
CREATE TEMPORARY TABLE EmployeeJones
SELECT * FROM Employees
WHERE LastName = 'Jones';

Again, you do not need to know a specific programming language for the exam; this is just to give you an idea. This particular example is for MySQL, and the syntax differs between versions of SQL. Effectively, this code creates a new table called EmployeeJones and fills it with every employee from the Employees table whose last name is Jones.

A subquery is simply a query embedded inside another query. This means that you can pull a smaller set of data from a larger set as you are pulling it. Some professionals swear by subqueries but, in general, they are considered less efficient than other options. For example, creating a temporary table and querying it is often considered a better approach than using a subquery for the same task. Let’s look at an example:

#Subquery
SELECT * FROM Employees
WHERE DeptID IN (SELECT DISTINCT DeptID FROM Departments WHERE City = 'Atlanta');

Here is a query with conditional logic. The conditional logic references another table, but instead of joining the tables, it uses a subquery to nest a whole other query inside the WHERE clause. Subqueries are often used as a shortcut to avoid joining entire tables. They seem complicated at first but become easier to understand with practice.

Once more, you do not need to understand how to perform a subquery for the exam; simply have a rough idea of what they are and why they are used.

Execution plan

Many programs that are used for querying will also have the option to show the execution plan. The execution plan is the program telling you how it will execute the query, sometimes accompanied by a graphical representation. An estimated execution plan will give you a rough idea of what the process should look like, while an actual execution plan is generated after a query is run and gives specifics about how long each step took. This breaks down the actual metrics for efficiency. By checking your execution plans, you will get tangible results that you can use to adjust and refine your queries. You can get a rough idea of what one looks like in Figure 3.11:

Figure 3.11 – Execution plan

The details of an execution plan will look different based on the language you are using, the tables you are using, and how you are querying them. They can be even simpler than this, or incredibly intricate and complicated. The idea is to use them to make your query as efficient as possible, while still getting the information you need.
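
If you want to try one yourself, here is a quick Python and SQLite sketch; other databases have similar commands (for example, EXPLAIN in MySQL or PostgreSQL) and their output looks different. The table here is made up.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Employees (Name TEXT, Department TEXT)")

plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT Name FROM Employees WHERE Department = ?",
    ("Sales",),
).fetchall()
for step in plan:
    print(step)   # each row describes one step, such as a table scan or an index search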

Summary

This chapter has covered a lot of information from some pretty diverse topics. First, we talked about public sources of data including public databases, open sources, APIs, and web services, as well as the pros and cons of using each. Then, we talked about the different ways to collect your own data, including web scraping, surveying—especially the different types of survey questions and survey bias—and observations. Then, we covered the difference between ETL and ELT, as well as a full load and a delta load, and why it is important. Next, we briefly covered OLTP and OLAP and how they are used to collect and process transactional data. Finally, we wrapped up the chapter by covering ways to optimize query structures, such as filtering, subsets, indexing, sorting, parameterization, temporary tables, subqueries, and execution plans. Whew! There sure are a lot of ways to collect data. In the next chapter, we will go over what to do with it once you have it!

Practice questions and their answers

Let’s try to practice the material in this chapter with a few example questions.

Questions

  1. If a web service is synchronous, it means…
    1. Your system will wait for a response before continuing
    2. Your system does not have to wait for a response
    3. You are syncing a web service to your system
    4. Web services cannot be synchronous
  2. When conducting a survey, it is best practice to ask about specific events that happened in the past and request a text-based answer. True or false?
    1. True
    2. False
  3. ETL stands for…
    1. Extract, Transmit, Load
    2. Estimate, Time, Load
    3. Extract, Transform, Load
    4. Exact, Time, Load
  4. The process of taking transactional data that has already been stored, aggregating it, and moving it to a data warehouse is called what?
    1. OLTP
    2. OLAP
    3. API
    4. Web service
  5. You are provided with a long and complicated query that takes a long time to run, but you need several pieces of information from it. What is the most efficient approach?
    1. Run the query as it was given to you
    2. Add subqueries to extract specific information from the query
    3. Make sure no filters are slowing the process down
    4. Save the results of the query to a temporary table

Answers

Now, we will briefly go over the answers to the questions. If you got one wrong, make sure to review the topic in this chapter before continuing:

  1. The answer is A. Your system will wait for a response before continuing

Web services are a type of API. Synchronous API calls mean that your code will make a request and then wait until it gets a response before continuing to the next step.

  2. The answer is B. False

Requesting specific information from the past introduces recall bias and is especially bad when combined with a text-based answer. It is best practice to avoid asking about specific incidents in the past, if possible, or at least make it clear that they can give a non-answer such as “I don’t know” instead of making something up.

  3. The answer is C. Extract, Transform, Load

ETL stands for Extract, Transform, Load because it is a data pipeline that performs these three steps in this specific order. It extracts the data (pulls it from its source), transforms it (manipulates it into something that can be used), and finally loads it to its new location.

  4. The answer is B. OLAP

OLAP is the process of taking information stored by OLTP, aggregating and analyzing it, and moving it to a new database or data warehouse.

  5. The answer is D. Save the results of the query to a temporary table

By creating a temporary table, you can quickly and easily access anything from the results of the original query that you require without having to rerun it.
