Chapter 46. Using Identifiers and Keys in Data

As mentioned in multiple previous chapters, databases are designed as powerful, performant, and secure locations to store and work with data. Two concepts at the heart of most databases you will use in your organization are keys and identifiers (IDs). Databases that utilize these concepts are known as relational databases, because keys and ID fields create relationships between the various data tables. In this chapter, we will look at ID fields, how they can be used as keys between tables, and how to create them in Prep Builder.

What Is an Identifier?

As in most computing software, numbers in databases are processed more efficiently than other characters. When working with data, you’ll often need to process large volumes of data, so any method you can use to make processing more efficient can make a significant difference in the time it takes to prepare your data or conduct your analysis. This is where identifier fields come into play for databases. Rather than repeatedly storing names, addresses, or other long string values, a relational database architect will use separate look-up tables to store those values just once and associate an ID to the value. This purchases table shown in Figure 46-1 has a number of ID fields that we can use to link together other reference tables with the names of customers or products. Tables containing lists of transactions or similar metrics are often referred to as fact tables.

The Purchases data set, an example of a fact table
Figure 46-1. The Purchases data set, an example of a fact table

Identifiers can also be used in a Boolean way, indicating whether something is true or not. For example, “yes” (or “true”) is often encoded as 1 and “no” or (“false”) as 0. This type of logic often does not have an associated reference table, but some data architects will insist on one for clarity’s sake.

What Is a Key in a Database?

Keys help us identify how to join tables together in databases. Traditionally, there are two types of keys in database terminology:

Primary key
A unique identifier that cannot be replicated in a table
Foreign key
An identifier for a unique row of data stored in a different table

Figure 46-2 shows a reference table that has a unique key for each customer.

The Customer data set, an example of a reference table
Figure 46-2. The Customer data set, an example of a reference table

We could use this table to add the customer’s name to the Purchases table shown in Figure 46-1. The Customer ID in this table must be unique, as we wouldn’t want to attribute orders to the wrong customer. In the Purchases table, the Customer ID could be classified as a foreign key because the values in the column are duplicated but are used to join the Customer table.

Using Keys and Identifiers in Prep

Keys in the context of Prep Builder are very similar to those in a database; they guide users on how to join tables together. Information about which data fields in a database table are keys is likely stored in separate database documentation. Prep Builder does not display keys any differently than other data fields, and you can use them like any other data field in join conditions. This is where the Join configuration pane comes in particularly handy; you can see what values are matched as well as the resulting number of rows being created by the Join step (Figure 46-3).

Setting up a join in Prep
Figure 46-3. Setting up a join in Prep

When joining fact tables like the Purchases table, you would expect to retain the original number of rows. (Remember, fact tables are database tables containing measurements or metrics.) If the resulting number of rows doesn’t match the original fact table, the possibilities are:

  • Increased number of rows. You have duplicated the purchases, and simple counts or aggregations of values may overrepresent reality.

  • Decreased number of rows. You have lost rows of data, meaning records of sales or the number of customers will be underrepresented.

The reason why this view in Prep Builder is so useful is because it clearly shows the number of rows both entering and resulting from the Join step. In this example, there are seven rows of data from the Purchases table and seven rows of data are returned. This is a likely sign that the correct key has been joined up, but verify the results for all joins even if the number of rows matches what you expect. The Join Clauses pane demonstrates that all Customer IDs have been matched, as they remain black text. Any unmatched values will turn red and show which source table they came from.

As mentioned earlier, identifiers don’t have to be used as keys in join conditions. In the Purchases table, the Order Dispatched column is very likely an identifier field, as the only values it contains are 1s and 0s. You can easily clarify these identifiers if you are preparing the data for end users, by aliasing the values in a data field (Figure 46-4).

Setting up an alias in the Profile pane
Figure 46-4. Setting up an alias in the Profile pane

As the data field is numeric, you’ll need to update the data type to string before changing the value. The resulting data field is shown in Figure 46-5.

Results of changing the data type and aliasing
Figure 46-5. Results of changing the data type and aliasing

Creating Identifier Data Fields in Prep Builder

All of the previous points rely on an identifier or key being available in your data source. However, if you are preparing your data source for storage in a database, you may need to create a key from the data you have. Let’s use Superstore, the sample data set that is downloaded as part of the Prep Builder or Desktop package, and create a category table that could act as a reference table.

As of Prep Builder version 2020.1, the tool includes the functionality to create a rank field automatically. In the example shown in Figure 46-6, you can remove some of the complexity by creating ranks based on the Category data field.

Creating a rank through the Profile pane data field menu
Figure 46-6. Creating a rank through the Profile pane data field menu

Selecting this option will launch the rank Visual Editor, which makes the process coding-free (Figure 46-7).

Setting up a rank in the Visual Editor
Figure 46-7. Setting up a rank in the Visual Editor

In the default rank setup, the value of the rank is ordered by the number of rows attributed to each category. To get past this and create a simple ID for category, change the rank option in “Order by” to Dense Rank (Figure 46-8).

Setting up Dense Rank in the Visual Editor
Figure 46-8. Setting up Dense Rank in the Visual Editor

To create the reference table to be able to look up these values, you can branch off an Aggregate step. Aggregate steps remove all data fields from your data set that are not used in the Aggregate step. In this step, group by category and average the rank to return a single ID per category. You can rename the Rank field to Category ID (Figure 46-9).

Changing the Rank field name to Category ID in an Aggregate step
Figure 46-9. Changing the Rank field name to Category ID in an Aggregate step

The resulting table from the Aggregate step is very simple and can be output separately (Figure 46-10).

Result of the Aggregate step creating the Category ID field
Figure 46-10. Result of the Aggregate step creating the Category ID field

The 1, 2, 3 result is perfect for creating the IDs we want in the data set. You can remove the Category field from the original data set now (Figure 46-11).

Changes pane showing the steps to remove the Category field
Figure 46-11. Changes pane showing the steps to remove the Category field

The resulting flow looks like Figure 46-12, although you may wish to add outputs to use the data or write it to where you need to use it. The full data set now has only the category ID and not the category name. This saves on space and can be processed faster. The reference table can be used when the names need to be added back into the data set if required.

Using the Aggregate step on a separate branch of the flow to create the reference table
Figure 46-12. Using the Aggregate step on a separate branch of the flow to create the reference table

If you need to update the data set and increase the number of potential values in the Category field, be sure to keep the original ID set and assign the next rank value to the new values.

Summary

Keys and identifiers might seem to present unnecessary challenges when you are new to data preparation. However, by computing faster and requiring less storage space, you can reduce the time it takes to complete your analysis and the cost of storing the data, especially when data sets get larger. Prep Builder can help you see the effects of joining your reference tables back into the data, which is a good way to verify that you are using the correct keys. You can also use Prep Builder to create your own reference tables and write them to databases.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset