Chapter 24. Removing Data During Input

You can simplify a lot of your data preparation work by making changes to the initial data connection, which is created and shown in the Input step in Prep Builder. However, you often know that certain elements in your data set need to be changed or removed even before you input it into Prep Builder. This chapter will cover some considerations for removing data at that early stage and how you might go about doing so.

Changing Your Data Set Before Loading It

Data sets are proliferating and growing rapidly, so you have to think carefully about what is actually being loaded into the tool. Any input data will be loaded into your computer’s memory, so any effort to reduce the amount of data that has to be processed will be useful. Prep Builder will sample the data set on the initial load, processing the full data set only when you run the output.

For Prep Builder, the initial connection is the Input step, but Prep Builder doesn’t load in all of the data instantly. Prep will load the metadata—the data about the data—in the Input step first. This helps end users in two ways:

  • Providing a quick overview of the data

  • Preventing slow load times, since Prep Builder isn’t having to process all of the data

Deselecting the fields that you and the end users do not require will save that data from being processed by Prep Builder (Figure 24-1).

Deselecting some fields in the metadata pane of the Input step
Figure 24-1. Deselecting some fields in the metadata pane of the Input step

Making changes to your data set here can also save you time and reduce complications later in the data preparation process.

Slow Performance, Slow Build, Slow Output

As already mentioned, performance is a big reason to remove unnecessary data whenever you get the opportunity. But it isn’t just the faster processing speed that helps, it’s that the software can keep up with how you want to work, show you what you might want to do next, and then enable you to make those changes. Nothing is more frustrating than the wasted time of watching a load screen.

Reducing the number of rows and columns can aid performance and decrease the amount of data that Prep Builder has to process. The cleaning, reshaping, and merging processes are more efficient with less data, as it’s easier to spot what is required once there are fewer rows to assess, not to mention it’s faster to process that data once those changes are made. You can reduce the number of rows by applying filters on the Input step and in many other steps throughout the flow. You can reduce the number of columns by deselecting them within the Input step or excluding them in any of the steps throughout the flow.
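Prep Builder applies these reductions through its interface, but the underlying idea is the same as in any data tool: load only the columns you need, then keep only the rows you need. Here is a minimal pandas sketch of that idea; the file contents, column names, and the `Region` filter are all illustrative, not from the chapter's data.

```python
import io
import pandas as pd

# A hypothetical sales extract; names and values are made up for illustration.
raw = io.StringIO(
    "OrderID,Region,Product,InternalNote,Sales\n"
    "1,East,Pen,ignore,10\n"
    "2,West,Pencil,ignore,5\n"
    "3,East,Pen,ignore,7\n"
)

# Load only the columns we need -- analogous to deselecting
# fields in Prep Builder's Input step.
df = pd.read_csv(raw, usecols=["OrderID", "Region", "Sales"])

# Keep only the rows we need -- analogous to a filter
# applied on the Input step.
df = df[df["Region"] == "East"]
```

Doing both reductions as early as possible means every later step touches less data.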

Unnecessary data affects not only the input processing and build times but also the output processing time. Flows may be run very frequently through either Tableau Prep Builder or the server-based Prep Conductor tool. Prep Conductor enables you to schedule your flows to run on a regular basis (Figure 24-2).

View of Prep Conductor on Tableau Server
Figure 24-2. View of Prep Conductor on Tableau Server

After publishing a flow to Tableau Server, you can set a schedule through Prep Conductor if that schedule is available (Figure 24-3).

Setting up a refresh schedule on Prep Conductor
Figure 24-3. Setting up a refresh schedule on Prep Conductor

The schedules available to you are determined by your server administrator (Figure 24-4).

Schedules available on my Tableau Server
Figure 24-4. Schedules available on my Tableau Server

If you have a very frequent schedule, like every 15 minutes or each hour, any unnecessary data can add up and place an unnecessary load on the server. This can reduce performance for others and rapidly accrues when hundreds of data sources are being processed each day on large data sets.

Removing Columns

The majority of changes made during the Input step are to the data fields, or columns, of your data set. Simply deselecting the column for the data field in the Input step will prevent it from being loaded by Prep Builder. Do this only if you’re sure the column contains nothing that you need. Having messy data in a column is not a sufficient reason to remove it. If any of that data could be useful to you or your end users, then you can clean that data in subsequent steps.

Completely null columns are a likely candidate for removal at this stage. These commonly occur in Excel files where a user has spaced out the data for formatting reasons. The data will likely appear with a column header like F4 for the fourth column in the data set. If all of your data field names appear this way, you may need to use the Data Interpreter to find the column headers in the Excel worksheet. You can manually rename the data fields in the Input step, but this task can be laborious.
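The pattern is easy to see in a quick sketch. Below, a blank spacer column that Excel gave the auto-generated header `F4` is dropped because it contains nothing but nulls; the frame and its values are hypothetical.

```python
import numpy as np
import pandas as pd

# Illustrative frame mimicking an Excel sheet where a blank
# spacer column was read in with an auto-generated header.
df = pd.DataFrame({
    "Address": ["1 Main St", "2 High St"],
    "F4": [np.nan, np.nan],          # completely null spacer column
    "Price Paid": [250000, 310000],
})

# Remove only the columns that contain nothing but nulls.
cleaned = df.dropna(axis="columns", how="all")
```

Note the `how="all"` guard: a column with even one real value is kept, which matches the chapter's advice that messy data alone is not a reason to remove a column.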

The Data Interpreter is an option within Prep Builder and Desktop to pick up the data tables that you cannot easily access by just deleting or renaming fields. Take the file in Figure 24-5 as an example.

A formatted Excel spreadsheet
Figure 24-5. A formatted Excel spreadsheet

Loading this data set in Prep Builder would show the Input step in Figure 24-6.

Input step showing issues with formatted spreadsheet
Figure 24-6. Input step showing issues with formatted spreadsheet

Here you can see the data hasn’t loaded as expected. The address, price paid, and date are the columns we would need for analysis, but they are unobtainable with the current setup. Using the Data Interpreter allows Prep Builder to find the data table in the spreadsheet. Clicking the Data Interpreter option in the Connections pane displays the data table as another option (Figure 24-7).

Using the Data Interpreter to access additional data inputs
Figure 24-7. Using the Data Interpreter to access additional data inputs

To use the new tables available for input, you can remove the existing Input step and replace it with the table(s) you need. In this example, the data set is much cleaner when using the data from cells A6 to C38 (Figure 24-8).

The Input step using the data resulting from the Data Interpreter
Figure 24-8. The Input step using the data resulting from the Data Interpreter

Using a combination of the Data Interpreter and selecting the data fields in the Input step will enable you to return the data you need.

Removing Records

By changing the data type of a column, you can change the values in the data set. If a column contains any non-numeric characters and you change the data type of that column to a number, Tableau Prep will replace the non-numeric values with nulls. Let’s look at an example.

To change the data type, click on the Data Type icon in the metadata grid within the Input step. In Figure 24-9, we’re changing from “String - default” to “Number (whole).”

Changing data types in Prep’s metadata grid
Figure 24-9. Changing data types in Prep’s metadata grid

After the change, the sample values shown in the metadata grid have changed to null from their previous string values (Figure 24-10).

Result of the change in Figure 24-9
Figure 24-10. Result of the change in Figure 24-9
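This coercion behavior is not unique to Prep Builder; the same sketch in pandas makes the mechanics visible. The sample values here are invented for illustration, with `"Value"` standing in for a stray non-numeric entry.

```python
import pandas as pd

# A column of string values; "Value" is a stray non-numeric entry.
values = pd.Series(["10", "20", "Value", "30"])

# Converting to a numeric type turns anything non-numeric into a
# null (NaN), mirroring what Prep Builder does when you change a
# column's data type from string to number.
numeric = pd.to_numeric(values, errors="coerce")
```

The numeric values survive the conversion intact; only the non-numeric entry becomes null.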

If you do not need the records with a null result in the flow, you can filter them out too: simply click the null value in the Profile pane and choose to exclude it. Let's look at another example, this time the Value column from Preppin' Data's 2019: Week 2 challenge, where the data set has been split into two tables, creating a second set of headers. Figure 24-11 shows the data field before the data type conversion.

Sample of the values in the Value data field
Figure 24-11. Sample of the values in the Value data field

Changing the data type of the Value column from a string to a number changes the string value to a null (Figure 24-12).

Data field resulting from the data type conversion
Figure 24-12. Data field resulting from the data type conversion
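Excluding those nulls completes the pattern. A hedged sketch of the final step, using made-up numbers rather than the actual Preppin' Data values:

```python
import numpy as np
import pandas as pd

# A numeric column after the data type change; NaN marks the record
# that previously held the stray second header (values illustrative).
numeric = pd.Series([12.0, np.nan, 7.0, 19.0])

# Dropping the nulls removes that record, just as clicking the null
# value in the Profile pane and choosing Exclude would in Prep Builder.
filtered = numeric.dropna()
```

Changing the data type and then excluding the resulting nulls is a quick two-step way to remove a second header row that has crept into a data column.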

Summary

With most data sets you work with, you may notice only a slight difference in performance or efficiency after removing data, either by excluding fields or filtering out rows. However, the cumulative effects of these data removal actions over time will speed up your analysis, allow faster access to shared data sources, make the remaining data easier to prepare, and free up vital server resources.
