This chapter covers in detail the DataFrame, Dataset, and Resilient Distributed Dataset (RDD) APIs for working with structured data, aiming to provide a basic understanding of machine learning problems through the available data. By the end of the chapter you will be able to apply basic to complex data manipulation with ease. Some comparisons of Spark's basic abstractions, using RDD-, DataFrame-, and Dataset-based data manipulation, will show the gains in terms of both programming ease and performance. In addition, we will guide you on the right track so that you will be able to use Spark to persist an RDD or data object in memory, allowing it to be reused efficiently across parallel operations at a later stage. In a nutshell, the following topics will be covered throughout this chapter:
In practice, several factors affect the success of machine learning (ML) applications on a given task. Therefore, the representation and quality of the experimental dataset should be treated as first-class concerns, and it is always advisable to start with better data. Irrelevant or redundant data, features with null values, and noisy data all result in an unreliable source of information. Such bad properties in a dataset make the knowledge discovery process during the model training phase more tedious and time-consuming.
As a result, data preprocessing accounts for a considerable share of the computational time across the total ML workflow. As we stated in the previous chapter, unless you know your available data, it is difficult to understand the problem itself. Moreover, knowing the data will help you formulate your problem. In parallel, and more importantly, before trying to apply an ML algorithm to a problem, you first have to identify whether the problem is really a machine learning problem and whether an ML algorithm can be applied directly to solve it. The next step is to identify the class of machine learning problem; more technically, you need to know whether the identified problem falls under classification, clustering, rule extraction, or regression.
For the sake of simplicity, we assume you have a machine learning problem. Now you need to do some data pre-processing, which includes steps such as data cleaning, normalization, transformation, feature extraction, and feature selection. The product of the data pre-processing workflow is the final training set, which is typically used to build/train the ML model.
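To make these steps concrete, here is a minimal sketch in the Spark shell (where the spark session is predefined). The column names and values are hypothetical, and normalization is done here with Spark ML's MinMaxScaler, one of several possible choices:

```scala
import org.apache.spark.ml.feature.{MinMaxScaler, VectorAssembler}
import spark.implicits._

// Hypothetical raw records; in practice these would come from a file or table.
val raw = Seq((25.0, 50000.0), (40.0, 72000.0), (31.0, 61000.0))
  .toDF("age", "income")

// Cleaning: drop rows with missing values and remove duplicate records.
val cleaned = raw.na.drop().dropDuplicates()

// Normalization: assemble the numeric columns into a single vector column,
// then rescale each feature into the [0, 1] range.
val assembled = new VectorAssembler()
  .setInputCols(Array("age", "income"))
  .setOutputCol("features")
  .transform(cleaned)

val trainingSet = new MinMaxScaler()
  .setInputCol("features")
  .setOutputCol("scaledFeatures")
  .fit(assembled)
  .transform(assembled)

trainingSet.select("scaledFeatures").show(false)
```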
In the previous chapter, we also argued that a machine learning algorithm learns from the data and from the feedback it receives during model building. It is critical that you feed your algorithm the right data for the problem you want to solve. Even if you have good data (or well-structured data, to be more precise), you need to make sure that the data is at an appropriate scale, is in a well-known format that can be parsed by your programming language, and, most importantly, that the most meaningful features are included.
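On the format point, a small sketch of parsing a raw CSV file into a DataFrame with an explicit schema follows; the file path and fields are hypothetical, and declaring the schema up front avoids costly inference and surfaces malformed records early:

```scala
import org.apache.spark.sql.types.{DoubleType, StringType, StructField, StructType}

// Hypothetical schema for the raw file; adjust names and types to your data.
val schema = StructType(Seq(
  StructField("id", StringType, nullable = false),
  StructField("age", DoubleType, nullable = true),
  StructField("income", DoubleType, nullable = true)
))

val df = spark.read
  .option("header", "true")
  .schema(schema)
  .csv("data/raw_dataset.csv")

df.printSchema()
```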
In this section, you will learn how to prepare your data so that your machine learning algorithm naturally performs at its best. Data processing as a whole is a huge topic; however, we will try to cover the techniques essential for building some large-scale machine learning applications in Chapter 6, Building Scalable Machine Learning Pipelines.
If you are focused and disciplined during the data handling and preparation steps, you are likely to get more consistent and better results in the first place. However, data preparation is a tedious process consisting of several steps. Nevertheless, the process of getting data ready for a machine learning algorithm can be summarized in three steps:
This step focuses on selecting the subset of all available data that you will use and work with during machine learning application development and deployment. There is always a strong urge to include all the available data, since more data provides more features; in other words, following the well-known aphorism, more is better. However, this might not be true in all cases. You need to consider what data you actually need before you can answer the question at hand. The ultimate goal is to provide a solution to a particular hypothesis. You might also make some assumptions about the data in the first place. Although it is difficult, if you are a domain expert on the problem, you can make assumptions that give you at least some insight before applying your ML algorithms. However, be careful to record those assumptions so that you can test them at a later stage when required. We present some common questions to help you think through the data selection process:
Moreover, in practice, for small problems, games, or toy competitions, the data will already have been selected for you; therefore, you don't need to worry at all!
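For larger applications, though, selection often comes down to choosing the relevant columns and rows. Here is a minimal sketch reusing the hypothetical df from the parsing example above; the column names and the age range are illustrative assumptions:

```scala
import spark.implicits._

// Keep only the attributes relevant to the hypothesis (column selection)
// and restrict the rows to the population of interest (row selection).
val selected = df
  .select("age", "income")                // attribute (column) selection
  .filter($"age" >= 18 && $"age" <= 65)   // instance (row) selection

selected.show(5)
```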
After you have selected the data you will be working with, you need to consider how the data can be used and what preparation it requires. This pre-processing step covers the techniques needed to get the selected data into a form you can work with during the model building and validation steps. The three most common data pre-processing steps are formatting, cleaning, and sampling the data:
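Formatting and cleaning were sketched above; sampling can be as simple as working with a random fraction of the cleaned data while prototyping. In the sketch below, which reuses the hypothetical cleaned DataFrame from earlier, the fraction and seed are arbitrary illustrative choices:

```scala
// Sampling: prototype on a 10% random sample (without replacement) of the
// cleaned data before running on the full dataset.
val sampled = cleaned.sample(withReplacement = false, fraction = 0.1, seed = 42L)
println(s"Kept ${sampled.count()} of ${cleaned.count()} records")
```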
After selecting appropriate data sources and pre-processing the data, the final step is to transform the processed data. Your specific ML algorithm and your knowledge of the problem domain will influence this step. Three common data transformation techniques are attribute scaling, decomposition, and attribute aggregation. This step is also commonly referred to as feature engineering, which will be discussed in more detail in the next chapter:
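As one example of attribute aggregation, the sketch below collapses hypothetical low-level purchase records into per-customer summary features; the DataFrame and its columns are invented for illustration:

```scala
import org.apache.spark.sql.functions.{avg, count, sum}
import spark.implicits._

// Hypothetical low-level purchase records.
val transactions = Seq(
  ("c1", 20.0), ("c1", 35.0), ("c2", 15.0)
).toDF("customerId", "amount")

// Aggregation: collapse many rows into one summary row (new features)
// per customer.
val customerFeatures = transactions
  .groupBy("customerId")
  .agg(
    count("amount").as("numPurchases"),
    sum("amount").as("totalSpent"),
    avg("amount").as("avgSpent")
  )

customerFeatures.show()
```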
Apache Spark provides distributed data structures, namely RDDs, DataFrames, and Datasets, with which you can perform data pre-processing efficiently. These data structures offer different advantages and performance characteristics for processing data. In the next sections, we will describe each of these data structures individually and show examples of how to process a large dataset using them.
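As a quick preview, the following sketch builds the same small, hypothetical dataset as an RDD, a DataFrame, and a Dataset, and caches the Dataset in memory so that subsequent parallel operations reuse it instead of recomputing it:

```scala
import spark.implicits._

// A toy record type for illustration.
case class Person(name: String, age: Int)

// RDD: a low-level distributed collection of objects.
val rdd = spark.sparkContext.parallelize(Seq(Person("Alice", 30), Person("Bob", 25)))

// DataFrame: untyped rows with a schema, optimized by the Catalyst engine.
val peopleDF = rdd.toDF()

// Dataset: a compile-time typed view over the same optimized engine.
val peopleDS = peopleDF.as[Person]

// Persist the Dataset in memory so later actions reuse the cached data
// rather than recomputing its lineage.
peopleDS.cache()
println(peopleDS.filter(_.age > 26).count())
```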