Profiling using the Data model viewer

Another way to profile data is by using the Data model viewer. Follow these steps to do so:

  1. The Data model viewer can be accessed using the top-left menu, as follows:

This option will display all tables from your data model as rectangles that show every field that was loaded.

  1. At the bottom-right of the screen, you can see a Preview button that, once clicked, shows the details for each selected table of the field. Let's do this by discussing an example of when you select the Orders table: 

The Data model viewer will highlight the selected table in orange.

As we can see from the preceding screenshot, our table has been loaded with 830 rows, 18 fields and 1 keys (indicating that this table is associated with another).

  1. A preview with the first few rows of your table is also displayed. When you click on any field, you can retrieve more information about it:

In the preceding example, we can see that the EmployeeID field has the following:

  • 100% Density, meaning that for every row, there is a value stored in the table. In other words, there are no null values in that field.
  • A 100% Subset ratio. The subset ratio is the percentage of distinct values in the table for this field versus all values from this field in all tables (there will only be a subset ratio of 100% when you are profiling a key field, because if the field is referenced only in one table, all the values will be stored).
  • Has duplicates as true, meaning that some values are duplicated across the table.
  • Total distinct values, which shows how many distinct values were found in this table.
  • Present distinct values, which shows how many distinct values were found in all tables (this value will always be the same as Total distinct values from non-key fields).
  • Non-null values, which shows how many non-null values were found for this field.
  • Tags, which shows metadata associated with this field.

Now, let's take a closer look at a key field (CustomerID) by selecting the Orders table and then the CustomerID field. As you can see in the following screenshot, both tables were selected and the keys fields are highlighted:

As we can see, the description for this field is as follows:

  • 100% Density, meaning that for every row, there is a value stored in the table. In other words, there are no null values in that field.
  • A 97.8% Subset ratio, which means that for the Orders table, we have 97.8% of all possible (all tables included) values from that table. In other words, there are CustomerID values that are not reflected.
  • Has duplicates is true, meaning that some values are duplicated across the table.
  • Total distinct values, which shows how many distinct values were found in this table.
  • Present distinct values, which shows how many distinct values were found in all tables. As we can see, this table has 89 distinct values from a total of 91 distinct values.
  • Non-null values, which shows how many non-null values were found for this field.
  • Tags, which shows the metadata associated with this field.

Now let's take a closer look into a key field (CustomerID) by selecting the Customers table and then the CustomerID field:

Now, as we can see in this table, Customers have a subset ratio of 100%, which means that all values from CustomerID are stored in our Customers table. Comparing the subset ratio from the Customers (100%) table against Orders (97.8%), we can conclude that all customers are represented in the Customers table, but there are customers that don't have orders (2.2% of them) since they are not represented here.

If for some reason you find a situation in which there are two or more tables that share a key field, and none of them have a 100% Subset ratio, this may conclude that we have a data quality problem. For example, we may have orders that have no customer associated with them.
..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset