Chapter 5. Web Mining, Databases, and Big Data

On the menu for this chapter are the following recipes:

  • Simulating web browsing
  • Scraping the Web
  • Dealing with non-ASCII text and HTML entities
  • Implementing association tables
  • Setting up database migration scripts
  • Adding a table column to an existing table
  • Adding indices after table creation
  • Setting up a test web server
  • Implementing a star schema with fact and dimension tables
  • Using HDFS
  • Setting up Spark
  • Clustering data with Spark

Introduction

This chapter is light on math and focuses on technical topics instead. Technology has a lot to offer data analysts. Databases have been around for a long time, and the relational databases that most people are familiar with can be traced back to the 1970s. Edgar Codd came up with a number of ideas that later led to the creation of the relational model and SQL. Relational databases have been a dominant technology ever since. In the 1980s, object-oriented programming languages caused a paradigm shift and an unfortunate mismatch with relational databases.

Object-oriented programming languages support concepts such as inheritance, which relational databases and SQL do not support (with some exceptions, of course). The Python ecosystem has several object-relational mapping (ORM) frameworks that try to bridge this mismatch. It is neither possible nor necessary to cover them all, so I chose SQLAlchemy for the recipes here. We will also have a look at database schema migration, which is a common concern, especially for production systems.
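
To make the mismatch concrete, here is a minimal sketch of what an ORM has to do behind the scenes when a class hierarchy is stored in a relational table. It uses only the standard library sqlite3 module rather than SQLAlchemy, and the Employee and Manager classes, the person table, and the kind column are hypothetical names chosen for illustration. The approach shown (one table with a column that records the class of each row) is just one of several possible mappings.

```python
import sqlite3
from dataclasses import dataclass


@dataclass
class Employee:
    name: str


@dataclass
class Manager(Employee):
    # The subclass adds a field that plain employees do not have.
    team_size: int = 0


def save(conn, person):
    # The table has no notion of inheritance, so the class name is
    # stored in the 'kind' column and missing fields become NULL.
    conn.execute(
        "INSERT INTO person (kind, name, team_size) VALUES (?, ?, ?)",
        (type(person).__name__, person.name, getattr(person, "team_size", None)),
    )


def load_all(conn):
    # Rebuild the right Python class for each row from the 'kind' column.
    people = []
    rows = conn.execute("SELECT kind, name, team_size FROM person ORDER BY id")
    for kind, name, team_size in rows:
        if kind == "Manager":
            people.append(Manager(name, team_size))
        else:
            people.append(Employee(name))
    return people


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE person ("
    "id INTEGER PRIMARY KEY, kind TEXT NOT NULL, "
    "name TEXT NOT NULL, team_size INTEGER)"
)
save(conn, Employee("Ann"))
save(conn, Manager("Bob", team_size=3))
people = load_all(conn)
print(people)
```

An ORM framework generates this kind of bookkeeping from the class definitions, so the save and load functions do not have to be written by hand for every class.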

Big data is one of the buzzwords that you may have heard of. Hadoop and Spark probably sound familiar too. We will look at these frameworks in this chapter. If you use my Docker image, you will unfortunately not find Selenium, Hadoop, or Spark in it, because I left them out to save space.

Another important technological development is the World Wide Web, which runs on top of the Internet. The Web is the ultimate data source; however, getting this data into an easy-to-analyze form is sometimes quite a challenge. As a last resort, we may have to crawl and scrape web pages. Success is not guaranteed because the website owner can change the content without warning us. It is up to you to keep the code of the web scraping recipes up to date.
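
The fragility of scraping can be illustrated with a small sketch that uses only the standard library html.parser module. The two HTML snippets, the price class name, and the extract_price helper are all made up for this example and do not come from a real website. The point is that scraping code should detect a missing element instead of assuming that the page layout never changes.

```python
from html.parser import HTMLParser


class PriceParser(HTMLParser):
    """Collects the text of the first <span class="price"> element, if any."""

    def __init__(self):
        super().__init__()
        self._in_price = False
        self.price = None

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) tuples.
        if tag == "span" and ("class", "price") in attrs and self.price is None:
            self._in_price = True
            self.price = ""

    def handle_endtag(self, tag):
        if tag == "span":
            self._in_price = False

    def handle_data(self, data):
        if self._in_price:
            self.price += data


def extract_price(html):
    # Returns the price text, or None when the expected element is missing.
    parser = PriceParser()
    parser.feed(html)
    parser.close()
    return parser.price.strip() if parser.price is not None else None


# Hypothetical page before and after the site owner renames the CSS class.
old_page = '<html><body><span class="price">9.99</span></body></html>'
new_page = '<html><body><span class="amount">9.99</span></body></html>'

print(extract_price(old_page))
print(extract_price(new_page))
```

Returning None for the changed page gives the calling code a chance to log the problem or raise an error, which is usually a better signal that a recipe needs updating than silently wrong data.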
