Book Description

Analyze, manipulate, and process datasets of varying sizes efficiently using Haskell

In Detail

Haskell is gaining traction in data science because it offers a powerful platform for robust data science practices. This book gives you the skills to handle large amounts of data, even when that data is in a less than perfect state. Each chapter builds a small library of code that is used to solve that chapter's problem. The book starts with creating databases out of existing datasets, cleaning that data, and interacting with databases from within Haskell in order to produce charts for publications. It then moves on to more theoretical concepts that are fundamental to introductory data analysis, presented in the context of a real-world problem with real-world data. As you progress through the book, you will rely on code from previous chapters to create new solutions quickly. By the end of the book, you will be able to manipulate, find, and analyze large and small sets of data using your own Haskell libraries.

What You Will Learn

  • Learn the essential tools of Haskell needed to handle large data
  • Migrate your data to a database and learn to interact with your data quickly
  • Clean data with the power of Regular Expressions
  • Plot data with the Gnuplot tool and the EasyPlot library
  • Formulate a hypothesis test to evaluate the significance of your data
  • Evaluate the relationship between columns of data using a correlation statistic, and perform regression analysis

Downloading the example code for this book

You can download the example code files for all Packt books you have purchased from your account at http://www.PacktPub.com. If you purchased this book elsewhere, you can visit http://www.PacktPub.com/support and register to have the files e-mailed directly to you.

Table of Contents

  1. Learning Haskell Data Analysis
    1. Table of Contents
    2. Learning Haskell Data Analysis
    3. Credits
    4. About the Author
    5. About the Reviewers
    6. www.PacktPub.com
      1. Support files, eBooks, discount offers, and more
        1. Why subscribe?
        2. Free access for Packt account holders
    7. Preface
      1. What this book covers
      2. What you need for this book
      3. Who this book is for
      4. Conventions
      5. Reader feedback
      6. Customer support
        1. Downloading the example code
        2. Downloading the color images of this book
        3. Errata
        4. Piracy
        5. Questions
    8. 1. Tools of the Trade
      1. Welcome to Haskell and data analysis!
      2. Why Haskell?
      3. Getting ready
        1. Installing the Haskell platform on Linux
        2. The software used in addition to Haskell
          1. SQLite3
          2. Gnuplot
          3. LAPACK
      4. Nearly essential tools of the trade
        1. Version control software – Git
        2. Tmux
      5. Our first Haskell program
      6. Interactive Haskell
        1. An introductory problem
      7. Summary
    9. 2. Getting Our Feet Wet
      1. Type is king – the implications of strict types in Haskell
        1. Computing the mean of a list
        2. Computing the sum of a list
        3. Computing the length of a list
        4. Attempting to compute the mean results in an error
        5. Introducing the Fractional class
        6. The fromIntegral and realToFrac functions
        7. Creating our average function
        8. The genericLength function
        9. Metadata is just as important as data
      2. Working with csv files
        1. Preparing our environment
        2. Describing our needs
        3. Crafting our solution
          1. Finding the column index of the specified column
          2. The Maybe and Either monads
        4. Applying a function to a specified column
      3. Converting csv files to the SQLite3 format
        1. Preparing our environment
        2. Describing our needs
        3. Inspecting column information
        4. Crafting our functions
      4. Summary
    10. 3. Cleaning Our Datasets
      1. Structured versus unstructured datasets
        1. How data analysis differs from pattern recognition
      2. Creating your own structured data
      3. Counting the number of fields in each record
      4. Filtering data using regular expressions
        1. Creating a simplified version of grep in Haskell
        2. Exhibit A – a horrible customer database
      5. Searching fields based on a regular expression
        1. Locating empty fields in a csv file based on a regular expression
        2. Crafting a regular expression to match dates
      6. Summary
    11. 4. Plotting
      1. Plotting data with EasyPlot
      2. Simplifying access to data in SQLite3
      3. Plotting data from a SQLite3 database
        1. Exploring the EasyPlot library
        2. Plotting a subset of a dataset
        3. Plotting data passed through a function
      4. Plotting multiple datasets
      5. Plotting a moving average
        1. Plotting a scatterplot
      6. Summary
    12. 5. Hypothesis Testing
      1. Data in a coin
        1. Hypothesis test
        2. Establishing the magic coin test
        3. Understanding data variance
        4. Probability mass function
        5. Determining our test interval
        6. Establishing the parameters of the experiment
        7. Introducing System.Random
        8. Performing the experiment
      2. Does a home-field advantage really exist?
        1. Converting the data to SQLite3
        2. Exploring the data
        3. Plotting what looks interesting
        4. Returning to our test
        5. The standard deviation
        6. The standard error
        7. The confidence interval
        8. An introduction to the Erf module
        9. Using Erf to test the claim
        10. A discussion of the test
      3. Summary
    13. 6. Correlation and Regression Analysis
      1. The terminology of correlation and regression
        1. The expectation of a variable
        2. The variance of a variable
        3. Normalizing a variable
        4. The covariance of two variables
        5. Finding the Pearson r correlation coefficient
        6. Finding the Pearson r² correlation coefficient
        7. Translating what we've learned to Haskell
      2. Study – is there a connection between scoring and winning?
        1. A consideration before we dive in – do any games end in a tie?
        2. Compiling the essential data
        3. Searching for outliers
        4. Plot – runs per game versus the win percentage of each team
        5. Performing correlation analysis
      3. Regression analysis
        1. The regression equation line
        2. Estimating the regression equation
        3. Translate the formulas to Haskell
        4. Returning to the baseball analysis
        5. Plotting the baseball analysis with the regression line
      4. The pitfalls of regression analysis
      5. Summary
    14. 7. Naive Bayes Classification of Twitter Data
      1. An introduction to Naive Bayes classification
        1. Prior knowledge
        2. Likelihood
        3. Evidence
        4. Putting the parts of the Bayes theorem together
      2. Creating a Twitter application
        1. Communicating with Twitter
        2. Creating a database to collect tweets
        3. A frequency study of tweets
        4. Cleaning our tweets
        5. Creating our feature vectors
        6. Writing the code for the Bayes theorem
        7. Creating a Naive Bayes classifier with multiple features
        8. Testing our classifier
      3. Summary
    15. 8. Building a Recommendation Engine
      1. Analyzing the frequency of words in tweets
        1. A note on the importance of removing stop words
      2. Working with multivariate data
        1. Describing bivariate and multivariate data
        2. Eigenvalues and eigenvectors
          1. The airplane analogy
      3. Preparing our environment
      4. Performing linear algebra in Haskell
        1. Computing the covariance matrix of a dataset
        2. Discovering eigenvalues and eigenvectors in Haskell
      5. Principal Component Analysis in Haskell
      6. Building a recommendation engine
        1. Finding the nearest neighbors
        2. Testing our recommendation engine
      7. Summary
    16. A. Regular Expressions in Haskell
      1. A crash course in regular expressions
        1. The three repetition modifiers
        2. Anchors
        3. The dot
        4. Character classes
        5. Groups
        6. Alternations
        7. A note on regular expressions
    17. Index