Summary

In this chapter, we applied basic data manipulations using the RDD, Dataset, and DataFrame APIs, and we learned how to perform some more complex data manipulations through these APIs. We focused on data manipulation in the context of a practical machine learning problem, spam filtering. In addition, we showed how to read data from different sources and how to analyze and prepare that data, using spam filtering as the running example.

However, we did not develop a complete machine learning application, since our goal was only to show basic data manipulation on the experimental datasets. We intend to develop a complete ML application in Chapter 6, Building Scalable Machine Learning Pipelines.

Deciding which features to use in a predictive model is both a vital and a difficult question, and answering it may require deep knowledge of the problem domain. It is also possible to automatically select the features in the data that are most useful or most relevant to the problem at hand. With these questions in mind, the next chapter covers feature engineering in detail, explaining why to apply it and presenting some best practices. Topics that are still unclear should become clearer in the next chapter.
