We've already seen how IBM Watson can extend the results of your data analytics efforts, and increase their value, through easy exploration and visualization. Watson does this by interpreting your typed questions and generating interactive visualizations based on your words. In the preceding section, we looked at ways in which you can customize your Watson experience. Now let's consider the prospects for extending Watson.
With all that Watson can do, it is still dependent on data quality. In other words, the better the data, the better the results.
When you hear the term "data quality" in conjunction with Watson Analytics, it refers to how suitable the data is for analysis. In other words, how reasonable would the results be if you performed an analysis with the data file?
In Chapter 2, Identifying Use Cases, we briefly mentioned that Watson assigns a data quality score to each dataset when you upload it. This score (from 0 to 100) is an assessment of the data's quality based on a computed average of the data quality of each field or column in the data file.
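To make the averaging idea concrete, here is a minimal Python sketch. It assumes the file-level score is a simple, unweighted mean of the per-column scores; the text says only that a computed average is used, so any weighting Watson may apply is not modeled. The function name `overall_quality` and the column scores are hypothetical and are not taken from a real Watson report.

```python
from statistics import mean


def overall_quality(column_scores):
    """Return the plain average of per-column scores (each from 0 to 100).

    Assumption: an unweighted mean. Watson's exact formula is not
    described in this chapter.
    """
    if not column_scores:
        raise ValueError("at least one column score is required")
    return mean(column_scores.values())


# Illustrative values only; these are not real Watson scores.
example_scores = {"Column A": 90, "Column B": 60, "Column C": 0}
print(overall_quality(example_scores))
```

Note how a single column with a score of zero pulls the average for the whole file down sharply.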
To compute the data quality score for a field in a data file, Watson considers the following factors:
In Chapter 2, Identifying Use Cases, we pointed out that Watson provides you with the ability to both explore and refine your data file. Recall that you can review and tune your data file to match the way you want to see or work with it, and any changes that you make are saved as a separate version of the original data file.
Let's look again at the Historic_Stadium_Sales file that we previously loaded into Watson:
Click on the file (as shown in the preceding screenshot) and then click on Refine, as shown here:
With our file displayed on the refine page (shown next), click on the Data Metrics icon:
Watson then displays the data quality for each column of data:
If you look closely at the Gameday Weather column, you'll see that the data quality score is zero:
This is because the column holds the same value for every record (Sunny is the only value found). As we discussed earlier in this section, this affects the data quality of this column and ultimately the data quality of the entire file.
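If you want to catch constant columns before uploading a file, a check along these lines may help. This is only a sketch: it works on rows already loaded as dictionaries (for example, with `csv.DictReader`), the helper name `constant_columns` is hypothetical, and the Payment Method values in the sample rows are made up for illustration. It does not reproduce Watson's scoring; it only flags columns that contain a single distinct value.

```python
def constant_columns(rows):
    """Return the names of columns that hold one distinct value in all rows.

    `rows` is a list of dictionaries keyed by column name.
    """
    if not rows:
        return []
    return [
        name
        for name in rows[0]
        if len({row.get(name) for row in rows}) == 1
    ]


# Made-up sample rows; only the column names come from the example file.
sample = [
    {"Gameday Weather": "Sunny", "Payment Method": "Cash"},
    {"Gameday Weather": "Sunny", "Payment Method": "Credit"},
    {"Gameday Weather": "Sunny", "Payment Method": "Cash"},
]
print(constant_columns(sample))
```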
Here is another example of data quality. In the same file, Historic_Stadium_Sales, the column named Payment Method has a data quality score of 97, as shown in this screenshot:
In a new version of this file that I uploaded, Historic_Stadium_Sales_MissingValues, the same column has a much lower data quality score of 71. If you look at the score that follows, you will see that Watson has found that 3% of the records in this file have no value (they are missing) for the column:
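A rough way to estimate the share of missing values in a column yourself is sketched below. It assumes that a value counts as missing when it is `None` or blank after trimming whitespace; Watson's own definition of a missing value may differ. The function name `missing_percentage` and the sample rows are hypothetical.

```python
def missing_percentage(rows, column):
    """Return the percentage of rows whose value for `column` is missing.

    Assumption: a value is missing when it is None or an empty or
    whitespace-only string.
    """
    if not rows:
        return 0.0
    missing = sum(
        1
        for row in rows
        if row.get(column) is None or str(row.get(column)).strip() == ""
    )
    return 100.0 * missing / len(rows)


# Made-up rows for illustration; one of the four has no payment method.
payments = [
    {"Payment Method": "Cash"},
    {"Payment Method": ""},
    {"Payment Method": "Credit"},
    {"Payment Method": "Cash"},
]
print(missing_percentage(payments, "Payment Method"))
```

Running a check like this on each column before uploading gives you an early indication of which fields are likely to lower the data quality score.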