Summary

We have now examined many of the tasks needed to start building analytical applications. Using the IPython notebook, we have covered how to load data from a file into a Pandas DataFrame, rename columns in the dataset, filter unwanted rows, convert column data types, and create new columns. We have also joined data from different sources and performed some basic statistical analyses using aggregations and pivots. We have visualized the data using histograms, scatter plots, and density plots, as well as autocorrelation and log plots for time series. We also visualized geospatial data, using coordinate files to overlay data on maps. Finally, we processed the movies dataset using PySpark, creating both an RDD and a PySpark DataFrame, and performed some basic operations on these data types.
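As a compact reminder of the Pandas steps listed above, the following is a minimal sketch rather than the code used earlier in the chapter. The inline data, the column names (`Title`, `Yr`, `Rating`, `genre`), and the derived `decade` column are all hypothetical, chosen only so that the example runs without any files on disk.

```python
import io
import pandas as pd

# Hypothetical inline data standing in for files on disk.
movies_csv = io.StringIO(
    "Title,Yr,Rating\n"
    "Alpha,1999,7.5\n"
    "Beta,2005,not rated\n"
    "Gamma,2005,8.0\n"
    "Delta,1999,6.5\n"
)
genres_csv = io.StringIO(
    "title,genre\n"
    "Alpha,drama\n"
    "Gamma,comedy\n"
    "Delta,drama\n"
)

# Load the data into a DataFrame and rename the columns.
movies = pd.read_csv(movies_csv)
movies = movies.rename(columns={"Title": "title", "Yr": "year", "Rating": "rating"})

# Convert the column type: non-numeric entries become NaN.
movies["rating"] = pd.to_numeric(movies["rating"], errors="coerce")

# Filter unwanted rows (here, those without a usable rating).
movies = movies.dropna(subset=["rating"])

# Create a new column derived from an existing one.
movies["decade"] = (movies["year"] // 10) * 10

# Join data from a second source on a shared key.
genres = pd.read_csv(genres_csv)
joined = movies.merge(genres, on="title", how="inner")

# Aggregate and pivot.
mean_by_genre = joined.groupby("genre")["rating"].mean()
pivot = joined.pivot_table(
    index="decade", columns="genre", values="rating", aggfunc="mean"
)

print(mean_by_genre)
print(pivot)
```

With real data, the `io.StringIO` objects would be replaced by file paths, and the choice of an inner join would depend on whether rows without a match in the second source should be kept.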

We will build on these tools in future sections, manipulating the raw input to develop features for building predictive analytics pipelines. We will later use similar tools to visualize and understand the features and performance of the predictive models we develop, and to report the insights that they may deliver.
