Summary

Cleaning is both the most important and the least glamorous phase of data analysis. With Haskell and the power of regular expressions, we can quickly identify the areas of a large dataset that need our attention. We left our cleaning problem incomplete in this chapter, and there is still plenty of data left to clean. The Gender and State columns in particular need serious work. They are left as an exercise so that you can practice crafting regular expressions that quickly identify the fields requiring your attention.

We also discussed the unclear border between the terms structured data and unstructured data. I applied two criteria for structured data: the data is in a machine-readable format, and the data adheres to a metadata document standard. Our example dataset is still a long way from being structured. We assume that the person who aggregated this data had a metadata document in mind, but that did not spare us from performing a lot of cleaning.

Our next chapter puts cleaning aside. We will explore data visually, allowing the data to speak for itself. Visualization is also the key technique used to develop assumptions about our data. In data analysis, plotting data is a method of speculation, and you will see that this speculation allows ideas to flourish. It is in the subsequent chapters, however, that we will learn how to check whether our speculations are correct.
