Chapter 1. Tools of the Trade

Data analysis is the craft of sifting through data for the purpose of learning or decision making. To ease the difficulties of sifting through data, we rely on databases and our knowledge of programming. For nuts-and-bolts coding, this text uses Haskell. For storage, plotting, and computations on large datasets, we will use SQLite3, gnuplot, and LAPACK, respectively. These four pieces of software form a powerful combination that allows us to solve some difficult problems. In this chapter, we will discuss these tools of the trade and recommend a few more.

In this chapter, we will cover the following:

  • Why we should consider Haskell for our next data analysis project
  • Installing and configuring Haskell, the GHCi (short for Glasgow Haskell Compiler interactive) environment, and cabal
  • The software packages needed in addition to Haskell: SQLite3, gnuplot, and LAPACK
  • The nearly essential software packages that you should consider: Git and Tmux
  • Our first program: computing the median of a list of values (a minimal sketch follows this list)
  • An introduction to the command-line environment
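
As a preview, here is a minimal sketch of a median computation. The chapter develops its own version of this program later; the function name median and the use of Maybe to signal that an empty list has no median are choices made for this sketch:

    import Data.List (sort)

    -- Compute the median of a list of values. An empty list has no
    -- median, so that case is signalled with Maybe.
    median :: [Double] -> Maybe Double
    median [] = Nothing
    median xs
        | odd n     = Just (sorted !! mid)
        | otherwise = Just ((sorted !! (mid - 1) + sorted !! mid) / 2)
      where
        sorted = sort xs
        n      = length xs
        mid    = n `div` 2

Loading this file into GHCi and evaluating median [3, 1, 4, 1, 5] should print Just 3.0.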

Welcome to Haskell and data analysis!

This book is about solving problems related to data. In each chapter, we will present a problem or a question that needs answering. The only way to get this answer is through an understanding of the data. Data analysis is not only a practice that helps us glean insight from information, but also an academic pursuit that combines the basics of computer programming, statistics, machine learning, and linear algebra. The theory behind data analysis comes from statistics.

The concepts of summary statistics, sampling, and empirical testing are gifts from the statistical community. Computer science is a craft that helps us convert statistical procedures into formal algorithms that can be interpreted by a computer. Rarely will our questions about data be an end in themselves. Once the data has been analyzed, the analysis should inform better decision making. The field of machine learning attempts to create algorithms that can make decisions of their own based on the analysis of a dataset. Finally, we will sometimes need linear algebra for complicated datasets. Linear algebra is the study of vector spaces and matrices; a matrix can be understood by the data analyst as a dataset laid out in rows and columns. However, the most important skill of data analysts is their ability to communicate their findings through a combination of written descriptions and graphs. Data science is a challenging field that blends the disciplines of computer science, mathematics, and statistics.

In this first chapter, the real-world problem is getting our environment ready. Many languages are suitable for data analysis, but this book tackles data problems using Haskell and assumes that you have a background in the Haskell language from Chapter 2, Getting Our Feet Wet, onwards. If not, we encourage you to pick up a book on Haskell development. Learn You a Haskell for Great Good: A Beginner's Guide, Miran Lipovaca, No Starch Press, and Real World Haskell, Bryan O'Sullivan, John Goerzen, Donald Bruce Stewart, O'Reilly Media, are excellent texts if you want to learn programming in Haskell. The former can be read online at http://learnyouahaskell.com/ and the latter at http://book.realworldhaskell.org/. The former is an introduction to the language, while the latter is a text on professional Haskell programming. Once you have waded through these books (as well as Learning Haskell Data Analysis), we encourage you to read Haskell Data Analysis Cookbook, Nishant Shukla, Packt Publishing. This cookbook provides snippets of code for working with a wide variety of data formats, databases, visualization tools, data structures, and clustering algorithms. We also recommend Notes on Functional Programming with Haskell by Dr. Conrad Cunningham.

Besides Haskell, we will discuss open source data formats, databases, and graphing software, as follows:

  • We will limit ourselves to two data serialization file formats: JSON and CSV. CSV is perhaps the most common serialization format for uncompressed data, with the weakness that it lacks a strict standard. In a later chapter, we will examine data from the Twitter web service, which exports data in the JSON format. By limiting ourselves to two formats, we can focus our efforts on problem solving instead of prolonged discussions of data formats (a naive CSV-splitting sketch follows this list).
  • We will use SQLite3 for our database backend. SQLite3 is lightweight database software that can store large amounts of data. Using a wrapper module, we can pull data from a SQLite3 database directly into the Haskell command line for analysis (see the connection sketch after this list).
  • We will use the EasyPlot Haskell wrapper module for gnuplot, a popular open source tool used to create publication-ready graphics. The EasyPlot wrapper exposes only a subset of gnuplot's features, but we shall see that this subset is more than sufficient for creating compelling graphs (a small plotting sketch also follows this list).
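
To make the CSV discussion concrete, here is a deliberately naive sketch that splits CSV text into rows and fields. It ignores quoted fields and embedded commas, which is exactly the sort of ambiguity that the lack of a strict standard invites; the function names are illustrative:

    -- Split one CSV row on commas. This naive version does not handle
    -- quoted fields or commas embedded inside them.
    splitOnCommas :: String -> [String]
    splitOnCommas s = case break (== ',') s of
        (field, "")     -> [field]
        (field, _:rest) -> field : splitOnCommas rest

    -- Treat each line of input as one CSV row.
    parseCsv :: String -> [[String]]
    parseCsv = map splitOnCommas . lines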
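
The following sketch shows what pulling rows from SQLite3 into Haskell can look like, assuming the HDBC wrapper (the Database.HDBC and Database.HDBC.Sqlite3 modules from the HDBC and HDBC-sqlite3 packages). The database file and table names are hypothetical:

    import Database.HDBC (disconnect, fromSql, quickQuery')
    import Database.HDBC.Sqlite3 (connectSqlite3)

    -- Query a hypothetical "people" table in test.db and print the
    -- first column of every row.
    main :: IO ()
    main = do
        conn <- connectSqlite3 "test.db"
        rows <- quickQuery' conn "SELECT name FROM people" []
        mapM_ (putStrLn . fromSql . head) rows
        disconnect conn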
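
Finally, a small plotting sketch with the Graphics.EasyPlot module, which drives gnuplot behind the scenes. The output file name, title, and data points are illustrative:

    import Graphics.EasyPlot

    -- Render a handful of points to a PNG file through gnuplot.
    -- plot returns an IO Bool indicating whether gnuplot succeeded.
    main :: IO Bool
    main = plot (PNG "scatter.png") $
        Data2D [Title "Sample points", Style Points] []
               [(1, 2), (2, 4), (3, 5.5), (4, 8.1)]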