Introduction to pandas

Almost all of the features we've discussed so far are features of base Python; that is, no external packages or libraries were required. In practice, most of the code we write in this book will rely on one of several external Python packages commonly used for analytics. The pandas library (http://pandas.pydata.org) is an integral part of the later programming chapters. For machine learning, pandas serves three functions:

  • Import data from flat files into your Python session
  • Wrangle, manipulate, format, and cleanse data using the pandas DataFrame and its library of functions
  • Export data from your Python session to flat files

Let's review each of these functions in turn.

Flat files are a popular way of storing healthcare-related data (along with HL7 formats, which are not covered in this book). A flat file is a plain-text representation of data. In a flat file, data is laid out in rows and columns, much like a database table, except that punctuation or whitespace serves as the column delimiter and line breaks (carriage returns or newlines) serve as the row delimiter. We will see an example flat file in Chapter 7, Making Predictive Models in Healthcare.
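To make the row-and-column idea concrete, here is a minimal sketch that uses only base Python. The file contents, column names, and values are made up for illustration; a comma is assumed as the column delimiter and a newline as the row delimiter.

```python
# A tiny, hypothetical flat file held in a string instead of on disk.
flat_text = "patient_id,age,sex\n1,54,F\n2,61,M\n"

# Line breaks separate the rows; commas separate the columns within a row.
rows = [line.split(",") for line in flat_text.splitlines()]

# By convention, the first row often holds the column names.
header, records = rows[0], rows[1:]

print(header)
print(records)
```

Note that every value comes back as a string; inferring column types is one of the chores that pandas takes care of when it reads a flat file.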

pandas allows us to import data into a tabular Python data structure, called a DataFrame, from a variety of Python structures and external sources, including Python dictionaries, pickle objects, comma-separated values (CSV) files, fixed-width format (FWF) files, Microsoft Excel files, JSON files, HTML files, and even SQL database tables.
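As a rough sketch of two of these import routes, the following builds one DataFrame from a Python dictionary and another from CSV text. The column names and values are hypothetical, and io.StringIO stands in for a real file so that the example is self-contained; with a file on disk, you would pass its path to pd.read_csv() instead.

```python
import io

import pandas as pd

# Route 1: build a DataFrame from a Python dictionary (keys become columns).
df_from_dict = pd.DataFrame({"patient_id": [1, 2, 3], "age": [54, 61, 47]})

# Route 2: read comma-separated text. The data here is made up, and
# io.StringIO is used in place of a file path for illustration only.
csv_text = "patient_id,age,sex\n1,54,F\n2,61,M\n3,47,F\n"
df_from_csv = pd.read_csv(io.StringIO(csv_text))

print(df_from_dict)
print(df_from_csv)
```

pandas offers similarly named readers for several of the other formats listed, such as pd.read_excel(), pd.read_json(), and pd.read_sql().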

Once the data is in Python, there are additional functions that you can use to explore and transform the data. Need to perform a mathematical function on a column, such as finding its sum? Need to perform SQL-like operations, such as JOINs or adding columns (see Chapter 3, Machine Learning Foundations)? Need to filter rows by a condition? All of the functionality is there in pandas' API. We will make good use of some of pandas' functionality in Chapter 6, Measuring Healthcare Quality and in Chapter 7, Making Predictive Models in Healthcare.
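The operations just mentioned might look like the following. This is only a sketch: the patients and visits tables, their columns, and the age threshold are all invented for the example.

```python
import pandas as pd

# Two small, hypothetical tables that share a patient_id column.
patients = pd.DataFrame({"patient_id": [1, 2, 3], "age": [54, 61, 47]})
visits = pd.DataFrame({"patient_id": [1, 1, 3], "cost": [120.0, 80.0, 200.0]})

# A mathematical function on a column: the sum of all visit costs.
total_cost = visits["cost"].sum()

# A SQL-like LEFT JOIN: keep every patient, matched to visits where they exist.
merged = patients.merge(visits, on="patient_id", how="left")

# Adding a column derived from an existing one.
patients["is_senior"] = patients["age"] >= 60

# Filtering rows by a condition.
over_50 = patients[patients["age"] > 50]

print(total_cost)
print(merged)
print(over_50)
```

In the merged table, the patient with no visits still appears, with a missing value (NaN) in the cost column, which is the usual behavior of a left join.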

Finally, when we are done exploring, cleansing, and wrangling our data, we can choose to export it to most of the formats listed. Alternatively, we can convert it to a NumPy array and train machine learning models, as we will do later in this book.
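Both options can be sketched in a few lines. The DataFrame below is hypothetical, and an in-memory buffer is used instead of a file on disk so that the example leaves nothing behind; in practice you would pass a file path to to_csv().

```python
import io

import pandas as pd

# A small, made-up DataFrame standing in for our cleaned data.
df = pd.DataFrame({"age": [54, 61, 47], "bmi": [27.1, 31.4, 22.8]})

# Option 1: export to CSV. index=False leaves out the row labels.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
exported_text = buffer.getvalue()

# Option 2: convert to a NumPy array, the form many modeling libraries expect.
features = df.to_numpy()

print(exported_text)
print(features.shape)
```

Because the age column holds integers and the bmi column holds floats, the resulting array uses a single floating-point type for all values.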
