So far, we have learned the fundamental workflow for building H2O models at scale, but we used only a bare minimum of H2O's functionality. In this chapter, we will survey the broad set of capabilities H2O offers for model building at scale. We will then take this knowledge into Part Two, Building State-of-the-Art Models on Large Data Volumes Using H2O, where we will get down to business and use advanced techniques to build and explain highly predictive models at scale.
To conduct this survey, we will break down the chapter into the following main topics:
Recall that H2O model building at scale is performed by using H2O 3 or its extension, Sparkling Water, which wraps H2O 3 with Spark capabilities. The H2O 3 API has extensive data capabilities used in the model building process, and the Sparkling Water API inherits these and adds additional capabilities from Spark. These capabilities are broken down into the following three broad categories:
As emphasized in previous chapters, the H2O cluster architecture (H2O 3 or Sparkling Water) enables model building at massive scale while remaining abstracted away from the data scientist, who builds models by writing H2O code in an IDE.
H2O data capabilities are overviewed in the following diagram and elaborated subsequently:
Let's start with data ingestion.
The following data sources are supported:
The supported file formats of source data are as follows:
Some important characteristics of data ingestion to H2O are as follows:
Let's see how we can manipulate data once it has been ingested into H2O and represented as an H2OFrame.
The H2O 3 API provides extensive data manipulation capabilities. As mentioned in the previous bullet list, in-memory datasets are distributed across the H2O cluster and, after data load and any subsequent manipulations, are represented in the IDE as an H2OFrame.
H2OFrames have an extensive list of methods to perform mathematical, logical, and introspection operations at the value, column, row, and full dataset levels. An H2OFrame is similar in experience to the pandas DataFrame or R data frame.
The following examples are just a few of the data manipulations that can be performed on H2OFrames:
For full details of H2O data manipulation possibilities, see the H2O Python documentation (http://docs.h2o.ai/h2o/latest-stable/h2o-py/docs/frame.html) or R documentation (http://docs.h2o.ai/h2o/latest-stable/h2o-r/docs/reference/index.html). Also, refer to the fourth section of Machine Learning with Python and H2O (http://h2o-release.s3.amazonaws.com/h2o/rel-wheeler/2/docs-website/h2o-docs/booklets/PythonBooklet.pdf) for examples of data manipulation.
Manipulating data is key for preparing it as an input for model building. We may also want to export our manipulated data for future use. The next section lists the H2O data export capabilities.
H2OFrames in memory can be exported to external targets. These target systems are as follows:
The volume of exported data must, of course, be considered. Large volumes of data will not, for example, fit into local client memory or the local filesystem.
Let's now see what additional data capabilities Sparkling Water adds.
Sparkling Water inherits all data capabilities from H2O 3. Importantly, Sparkling Water adds additional data capabilities by leveraging the Spark DataFrame and Spark SQL APIs, and thus can import, manipulate, and export data accordingly. See the following reference for full Spark DataFrame and Spark SQL capabilities: https://spark.apache.org/docs/latest/sql-programming-guide.html.
A key pattern in using Sparkling Water is to leverage Spark for advanced data munging, convert the resulting Spark DataFrame to an H2OFrame, and then build state-of-the-art models using H2O's machine learning algorithms, as covered in the next section. These algorithms can be used in either H2O 3 or Sparkling Water.
H2O has extensive unsupervised and supervised learning algorithms with similar reusable API constructs – for example, similar ways to set hyperparameters or invoke explainability capabilities. These algorithms are identical from an H2O 3 or Sparkling Water perspective and are overviewed in the following diagram:
Each algorithm has an extensive set of parameters and hyperparameters to set or leverage as defaults. The algorithms accept H2OFrames as data inputs. Remember that an H2OFrame is simply a handle on the IDE client to the distributed in-memory data on the remote H2O cluster where the algorithm processes it.
Let's take a look at H2O's distributed machine learning algorithms.
Unsupervised algorithms do not predict but rather attempt to find clusters and anomalies in data, or to reduce the dimensionality of a dataset. H2O has the following unsupervised learning algorithms to run at scale:
Supervised learning algorithms predict outcomes by learning from a training dataset labeled with those outcomes. H2O has the following supervised learning algorithms to run at scale:
Each algorithm has a deep set of parameters and hyperparameters for configuration and tuning. Specifying most parameters is optional; if a parameter is not specified, its default is used. Parameters include cross-validation settings, learning rates, tree depths, weights columns, ignored columns, early stopping criteria, the distribution of the response column (for example, Bernoulli), categorical encoding schemes, and many other specifications.
You can dive deeper into H2O's algorithms and their parameters in H2O's documentation at http://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science.html#algorithms. The H2O website also lists tutorials and booklets for its algorithms at http://docs.h2o.ai/#h2o. A full list of algorithm parameters, each with a description, its status as a hyperparameter or not, and a mapping to the algorithms that use it, is found in H2O's documentation Appendix at http://docs.h2o.ai/h2o/latest-stable/h2o-docs/parameters.html.
H2O extends its supervised learning algorithms by providing Automatic Machine Learning (AutoML) and Stacked Ensemble capabilities. We will take a closer look at these in the next section, where we will place H2O algorithms in the larger context of model capabilities.
H2O provides utilities to enhance work with its algorithms. Target encoding helps you handle categorical values and has many configurable parameters to make this easy. TF-IDF and Word2vec are commonly used for NLP problems, and they are also highly configurable. Finally, permutation variable importance is a method that helps you understand how strongly each feature contributes to the model and can guide which features to keep in your final training dataset.
H2O's supervised learning algorithms are used to train models on training data, tune them on validation data, and score or predict with them on test or live production data. H2O has extensive capabilities to train, evaluate, explain, score, and inspect models. These are summarized in the following diagram:
Let's take a closer look at the model training capabilities.
Algorithms are at the heart of model training, but there is a larger set of capabilities to consider beyond the algorithms themselves. H2O provides the following model training capabilities:
After training a model, we want to evaluate it to determine whether its predictive performance meets our needs. Let's see what H2O offers in this regard.
H2O exposes many model attributes to evaluate model performance. These are summarized as follows:
Let's now explore ways to explain H2O models.
H2O presents a simple and uniform interface to explain either single models or multiple models, which can be a list of separately built models or a reference to those generated from AutoML. On top of that, H2O allows you to generate global (that is, model-level) and local (row- or individual-level) explanations. H2O's explainability capabilities are configurable to your specifications. Output is tabular, graphical, or both, depending on the explanation.
We will dedicate all of Chapter 6, Advanced Model Building – Part II, to explore this important topic in greater detail, but for now, here is a quick list of capabilities:
Let's now complete our survey of H2O's capabilities for modeling at scale by seeing what we can do once our model is trained, evaluated, and explained.
Once a model is trained, it can be exported and saved as a scoring artifact. The larger topic of deploying your artifact for production scoring will be treated in Part 3: Deploying Your Model to Production Environments. Here are the fundamental capabilities of the exported scoring artifact:
In this chapter, we conducted a wide survey of H2O capabilities for model building at scale. We learned about the data sources we can ingest into our H2O clusters and the file formats that are supported. We learned how this data moves from the source to the H2O cluster, and how the H2OFrame API provides a single handle in the IDE to represent the distributed in-memory data on the H2O cluster as a single two-dimensional data structure. We then learned the many ways in which we can manipulate data through the H2OFrame API and how to export it to external systems if need be.
We then surveyed the core of H2O model building at scale – H2O's many state-of-the-art distributed unsupervised and supervised learning algorithms. Then, we put those into context by surveying the model capabilities around them, from training, evaluating, and explaining models, to using model artifacts to retrain, score, and inspect them.
With this map of the landscape firmly in hand, we can now roll up our sleeves and start building state-of-the-art H2O models at scale. In the next chapter, we will start by implementing the advanced model building topics one by one, before later putting it all together in a fully developed use case.