Data sources and pandas methods

The data sources for a data science project can be grouped into the following categories:

  • Databases: Most CRM, ERP, and other business operations tools store data in a database. Depending on the volume, velocity, and variety of the data, it can be a traditional relational database or a NoSQL database. To connect to most of the popular databases from Python, we need the appropriate database drivers (ODBC drivers or native Python connectors); fortunately, such drivers are available for all the popular databases. Working with data in these databases involves making a connection to them from Python, querying the data, and then manipulating it using pandas (a brief sketch follows this list). We will look at a fuller example later in this chapter.
  • Web services: Many business operations tools, especially Software as a Service (SaaS) tools, make their data accessible through Application Programming Interfaces (APIs) instead of a database. This reduces the infrastructure cost of hosting a database permanently; instead, data is made available as a service, as and when required. An API call can be made from Python, which returns packets of data in formats such as JSON or XML. This data is parsed and then manipulated using pandas for further use (see the sketch after this list).
  • Data files: A lot of data for prototyping data science models comes as data files. One example of data being stored as a physical file is the data from IoT sensors; more often than not, this data is stored in a flat file such as a .txt or .csv file. Another source of data files is sample data that has been extracted from a database and stored in such files. The output of many data science and machine learning algorithms is also stored in files such as CSV, Excel, and .txt files. Another example is that the trained weight matrices of a deep learning neural network can be stored as an HDF file. A sketch of reading and writing such files follows this list.
  • Web and document scraping: Two other sources of data are the tables and text present on web pages. This data is gleaned from these pages using Python packages such as BeautifulSoup and Scrapy and is put into a data file or database for further use. Tables and text present in non-data formats, such as PDF or Word documents, are also a major source of data. These are extracted using Python packages such as Tesseract and Tabula-py. A sketch of scraping an HTML table with pandas follows this list.

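As a minimal sketch of the database workflow, assume a local SQLite database file named sales.db containing an orders table (both names are hypothetical); the query results can be pulled straight into a DataFrame:

    import sqlite3

    import pandas as pd

    # Connect to a local SQLite database (the file name is hypothetical)
    conn = sqlite3.connect("sales.db")

    # Run a SQL query and load the result set directly into a DataFrame
    orders = pd.read_sql("SELECT * FROM orders WHERE amount > 100", conn)

    conn.close()
    print(orders.head())

For server-based databases such as PostgreSQL or MySQL, the same pd.read_sql call works with a SQLAlchemy engine in place of the sqlite3 connection.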
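For the web services case, a sketch using the requests library and a hypothetical REST endpoint that returns a JSON array of records might look like this:

    import pandas as pd
    import requests

    # The endpoint URL is a placeholder for a real SaaS API
    response = requests.get("https://api.example.com/v1/customers", timeout=30)
    response.raise_for_status()

    # Flatten the JSON payload (a list of dictionaries) into a tabular DataFrame
    customers = pd.json_normalize(response.json())
    print(customers.head())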
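Reading and writing flat files is a one-liner in pandas. The file and column names below are hypothetical, and pd.read_excel needs an engine such as openpyxl installed:

    import pandas as pd

    # Read a flat CSV file of sensor readings, parsing the timestamp column as dates
    sensors = pd.read_csv("sensor_readings.csv", parse_dates=["timestamp"])

    # Read a sample extract stored in an Excel workbook
    sample = pd.read_excel("sample_extract.xlsx", sheet_name="data")

    # Write a summary back to disk as a CSV file
    sensors.describe().to_csv("sensor_summary.csv")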
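For HTML tables, pandas can do the scraping itself via pd.read_html, which needs a parser such as lxml or BeautifulSoup installed; the URL below is a placeholder for a page that contains at least one table element:

    import pandas as pd

    # read_html returns a list of DataFrames, one per <table> found on the page
    tables = pd.read_html("https://example.com/statistics.html")
    df = tables[0]
    print(df.head())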
In this chapter, we will look at how to read and write data to and from these formats/sources using pandas and ancillary libraries. We will also discuss a little bit about these formats, their utilities, and various operations that can be performed on them.

The following is a summary of the reader and writer methods in pandas for some of the data formats we are going to discuss in this chapter:

[Table: Reader and writer methods in pandas for different types of data file formats and their sources]

Each section header in this chapter indicates the file type whose I/O operations are covered in that section.
