Bigger Data

It's not easy to say what big data is. We will adopt an operational definition: when data is so large that it becomes cumbersome to work with, we refer to it as big data. In some cases, this might mean petabytes of data or trillions of transactions: data that will not fit on a single hard drive. In other cases, it may be one hundred times smaller, but still difficult to work with.

Why has data itself become an issue? While computers keep getting faster and gaining more memory, the size of the data has grown as well. In fact, data has grown faster than computational speed, and few algorithms scale linearly with the size of their input; taken together, this means that data has grown faster than our ability to process it.

We will first build on some of the experience of the previous chapters and work in what we can call a medium data setting (not quite big data, but not small either). For this, we will use a package called jug, which allows us to perform the following tasks (a short code sketch follows the list):

  • Break up your pipeline into tasks
  • Cache (memoize) intermediate results
  • Make use of multiple cores, including multiple computers on a grid
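
To make this concrete, here is a minimal sketch of a jug script; the function names and values are purely illustrative, not part of any real pipeline. Decorating a function with TaskGenerator turns each call into a task; jug records the dependency graph, caches each task's result on disk, and can run independent tasks in parallel:

    # file: jugfile.py -- a minimal jug sketch; names are illustrative
    from jug import TaskGenerator

    @TaskGenerator
    def double(x):
        # Stands in for an expensive computation
        return 2 * x

    @TaskGenerator
    def add(a, b):
        return a + b

    # Calling decorated functions builds tasks instead of running them;
    # jug tracks the dependencies between tasks
    doubled = [double(i) for i in range(10)]
    result = add(doubled[0], doubled[1])

Nothing is computed when this script is loaded; running jug execute jugfile.py performs the work (several such processes can run at once, even on different machines sharing a filesystem), and already completed results are simply reused on subsequent runs.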

The next step is to move on to true big data, and we will see how to use the cloud for computation. In particular, you will learn about the Amazon Web Services infrastructure. In this chapter, we will also introduce another Python package, called cfncluster, to manage clusters.
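
As a preview, cfncluster is driven from the command line. The transcript below is only a sketch, assuming the package is installed and your AWS credentials are set up; the cluster name mycluster is made up for illustration:

    $ pip install cfncluster
    $ cfncluster configure          # interactive setup (region, keys, and so on)
    $ cfncluster create mycluster   # start a cluster on AWS
    $ cfncluster status mycluster   # check its state
    $ cfncluster delete mycluster   # shut everything down to stop paying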
