Chapter 4: H2O Model Building at Scale – Capability Articulation

So far, we have learned the fundamental workflow of how to build H2O models at scale, but that was done using H2O at its barest minimum. In this chapter, we will survey the extremely broad capability set of H2O model building at scale. We will then use our knowledge from this chapter and move on to part two, Building State-of-the-Art Models on Large Data Volumes Using H2O, where we will get down to business and use advanced techniques to build and explain highly predictive models at scale.

To conduct this survey, we will break down the chapter into the following main topics:

  • Articulating the H2O data capabilities during model building
  • Overviewing the H2O machine learning algorithms
  • Understanding the H2O modeling capabilities

H2O data capabilities during model building

Recall that H2O model building at scale is performed by using H2O 3 or its extension, Sparkling Water, which wraps H2O 3 with Spark capabilities. The H2O 3 API has extensive data capabilities used in the model building process, and the Sparkling Water API inherits these and adds additional capabilities from Spark. These capabilities are broken down into the following three broad categories:

  • Ingesting data from the source to the H2O cluster
  • Manipulating data on the H2O cluster
  • Exporting data from the H2O cluster to an external destination

As emphasized in previous chapters, the H2O cluster architecture (H2O 3 or Sparkling Water) enables model building at a virtually unlimited scale, while remaining abstracted away from the data scientist, who builds models by writing H2O code in an IDE.

H2O data capabilities are overviewed in the following diagram and elaborated subsequently:

Figure 4.1 – The H2O data capabilities

Let's start with data ingestion.

Ingesting data from the source to the H2O cluster

The following data sources are supported:

  • Local file
  • Remote file
  • AWS S3
  • MinIO cloud storage
  • Azure Blob and Data Lake
  • Google Cloud Storage
  • HDFS
  • HDFS-like: Alluxio FS and IBM HDFS
  • Hive (via Metastore/HDFS or JDBC)
  • JDBC

The supported file formats of source data are as follows:

  • CSV (a file with any delimiter, auto-detected or specified)
  • GZipped CSV
  • XLS or XLSX
  • ORC
  • Parquet
  • Avro
  • ARFF
  • SVMLight

Some important characteristics of data ingestion to H2O are as follows:

  • Data is ingested directly from the source to the H2O cluster memory and does not pass through the IDE client.
  • In all cases, data is partitioned in-memory across the H2O cluster.
  • Except for the local file, remote file, and JDBC sources, data is ingested in parallel into each partition.
  • Data on the H2O cluster is represented to the user in the IDE as a two-dimensional H2OFrame.
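
As a minimal illustration, the following Python sketch ingests a CSV file directly into the H2O cluster; the S3 path, the loans dataset, and its columns are hypothetical placeholders:

    import h2o

    h2o.init()  # connect to (or start) an H2O cluster

    # Data is pulled directly from the source into distributed cluster memory;
    # the returned H2OFrame is only a client-side handle to that data
    loans = h2o.import_file("s3://my-bucket/loans.csv")  # hypothetical path
    loans.describe()  # column names, types, and missing-value counts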

Now that the data is ingested into H2O and represented as an H2OFrame, let's see how we can manipulate it.

Manipulating data in the H2O cluster

The H2O 3 API provides extensive data manipulation capabilities. As mentioned in the previous bullet list, datasets are held in distributed memory across the H2O cluster and are represented in the IDE as an H2OFrame, both after the initial data load and after any subsequent data manipulations.

H2OFrames have an extensive list of methods to perform mathematical, logical, and introspection operations at the value, column, row, and full dataset levels. An H2OFrame is similar in experience to the pandas DataFrame or R data frame.

The following examples are just a few of the data manipulations that can be performed on H2OFrames (a short code sketch follows the list):

  • Operations on data columns:
    • Change the data type (for example, integers from 0 to 7 as categorical values).
    • Aggregate a column (group by) by applying mathematical functions.
    • Display column names and use as features in a model.
  • Operations on data rows:
    • Combine rows from one or more datasets.
    • Slice out (filter) rows of a dataset by specifying the row index, the range of rows, or the logical condition.
  • Operations on datasets:
    • Merge two datasets on common values of shared column names.
    • Transform a dataset by pivoting on a column.
    • Split a dataset into two or more datasets (for example, train, validate, and test).
  • Operations on data values:
    • Fill missing values forward or backward with adjacent row or column values.
    • Fill missing values by imputing with aggregate results (for example, the mean for the column).
    • Replace numerical values based on logical conditions.
    • Trim values, manipulate strings, return a numerical value sign, and test whether a value is N/A.
  • Feature engineering operations:
    • Date parsing, for example, parsing one date column into separate columns for year, month, day.
    • Derive a new column mathematically and conditionally from other columns, including the use of lambda expressions.
    • Perform target encoding (that is, replace a categorical value with the mean of the target variable).
    • For Natural Language Processing (NLP) problems, perform string tokenizing, Term Frequency-Inverse Document Frequency (TF-IDF) calculations, and convert a Word2vec model into an H2OFrame for data manipulations.
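
To make a few of the manipulations listed above concrete, here is a short Python sketch operating on the hypothetical loans H2OFrame from the earlier ingestion example (all column names are likewise hypothetical):

    loans["grade"] = loans["grade"].asfactor()          # change type to categorical

    means = loans.group_by("grade").mean().get_frame()  # aggregate (group by) a column

    high_rate = loans[loans["int_rate"] > 15, :]        # slice rows by a logical condition

    loans.impute("loan_amnt", method="mean")            # fill missing values with the column mean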

For full details of H2O data manipulation possibilities, see the H2O Python documentation (http://docs.h2o.ai/h2o/latest-stable/h2o-py/docs/frame.html) or R documentation (http://docs.h2o.ai/h2o/latest-stable/h2o-r/docs/reference/index.html). Also, refer to the fourth section of Machine Learning with Python and H2O (http://h2o-release.s3.amazonaws.com/h2o/rel-wheeler/2/docs-website/h2o-docs/booklets/PythonBooklet.pdf) for examples of data manipulation.

Manipulating data is key for preparing it as an input for model building. We may also want to export our manipulated data for future use. The next section lists the H2O data export capabilities.

Exporting data out of the H2O cluster

H2OFrames in memory can be exported to external targets. These target systems are as follows:

  • Local client memory
  • Local filesystem
  • AWS S3
  • MinIO cloud storage
  • Azure Blob and Data Lake
  • Google Cloud Storage
  • HDFS
  • HDFS-like: Alluxio FS and IBM HDFS
  • Hive tables (CSV or Parquet, via JDBC)

The volume of exported data must, of course, be considered: large volumes of data will not, for example, fit into local client memory or the local filesystem.
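
As a brief sketch, and reusing the hypothetical frames from earlier, a small result can be pulled into local client memory while a large one is written from the cluster directly to external storage (the HDFS path below is a placeholder):

    means_pd = means.as_data_frame()   # small frame into local client memory (pandas)

    h2o.export_file(means, path="hdfs://namenode/data/grade_means.csv", force=True)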

Let's now see what additional data capabilities Sparkling Water adds.

Additional data capabilities provided by Sparkling Water

Sparkling Water inherits all data capabilities from H2O 3. Importantly, Sparkling Water adds additional data capabilities by leveraging the Spark DataFrame and Spark SQL APIs, and thus can import, manipulate, and export data accordingly. See the following reference for full Spark DataFrame and Spark SQL capabilities: https://spark.apache.org/docs/latest/sql-programming-guide.html.

A key pattern in using Sparkling Water is to leverage Spark for advanced data munging, convert the resulting Spark DataFrame to an H2OFrame, and then build state-of-the-art models using H2O's machine learning algorithms, as covered in the next section. These algorithms can be used in either H2O 3 or Sparkling Water.
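
Here is a minimal PySparkling sketch of this pattern; it assumes a running Spark session named spark, and the table and column names are hypothetical:

    from pysparkling import H2OContext

    hc = H2OContext.getOrCreate()      # attach H2O to the running Spark session

    # Munge the data with Spark SQL / DataFrame APIs first ...
    spark_df = spark.sql("SELECT * FROM loans WHERE loan_amnt > 1000")

    # ... then convert the Spark DataFrame to an H2OFrame for H2O model building
    loans_hf = hc.asH2OFrame(spark_df)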

H2O machine learning algorithms

H2O has extensive unsupervised and supervised learning algorithms with similar reusable API constructs – for example, similar ways to set hyperparameters or invoke explainability capabilities. These algorithms are identical from an H2O 3 or Sparkling Water perspective and are overviewed in the following diagram:

Figure 4.2 – H2O algorithms

Each algorithm has an extensive set of parameters and hyperparameters to set or leverage as defaults. The algorithms accept H2OFrames as data inputs. Remember that an H2OFrame is simply a handle on the IDE client to the distributed in-memory data on the remote H2O cluster where the algorithm processes it.

Let's take a look at H2O's distributed machine learning algorithms.

H2O unsupervised learning algorithms

Unsupervised algorithms do not predict but rather attempt to find clusters and anomalies in data, or to reduce the dimensionality of a dataset. H2O has the following unsupervised learning algorithms to run at scale:

  • Aggregator
  • Generalized Low Rank Models (GLRM)
  • Isolation Forest
  • Extended Isolation Forest
  • K-Means Clustering
  • Principal Component Analysis (PCA)
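
As one minimal example, the following sketch clusters the hypothetical loans frame from the earlier sketches with K-Means (column names are hypothetical):

    from h2o.estimators import H2OKMeansEstimator

    kmeans = H2OKMeansEstimator(k=5, estimate_k=True, seed=42)
    kmeans.train(x=["loan_amnt", "int_rate", "annual_inc"], training_frame=loans)
    kmeans.centers()   # inspect the resulting cluster centers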

H2O supervised learning algorithms

Supervised learning algorithms predict outcomes by learning from a training dataset labeled with those outcomes. H2O has the following supervised learning algorithms to run at scale:

  • Cox Proportional Hazards (CoxPH)
  • Deep Learning (Artificial Neural Network, or ANN)
  • Distributed Random Forest (DRF)
  • Generalized Linear Model (GLM)
  • Maximum R Square Improvements (MAXR)
  • Generalized Additive Models (GAM)
  • ANOVA GLM
  • Gradient Boosting Machine (GBM)
  • Naïve Bayes Classifier
  • RuleFit
  • Support Vector Machine (SVM)
  • XGBoost
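
As a minimal example, the following sketch trains a GBM classifier on the hypothetical loans frame, predicting a hypothetical binary default column:

    from h2o.estimators import H2OGradientBoostingEstimator

    loans["default"] = loans["default"].asfactor()            # classification target
    train, valid = loans.split_frame(ratios=[0.8], seed=42)   # train/validate splits

    gbm = H2OGradientBoostingEstimator(seed=42)
    gbm.train(x=["loan_amnt", "int_rate", "grade"], y="default",
              training_frame=train, validation_frame=valid)
    predictions = gbm.predict(valid)   # class label plus per-class probabilities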

Parameters and hyperparameters

Each algorithm has a deep set of parameters and hyperparameters for configuration and tuning. Specifying most parameters is optional; if a parameter is not specified, its default is used. Parameters include cross-validation settings, learning rates, tree depths, weights columns, ignored columns, early stopping criteria, the distribution of the response column (for example, Bernoulli), categorical encoding schemes, and many other specifications, as sketched below.
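
For instance, a handful of these parameters might be set on the GBM from the previous sketch as follows; every parameter shown is optional and the values and column names are purely illustrative:

    gbm = H2OGradientBoostingEstimator(
        nfolds=5,                      # cross-validation folds
        learn_rate=0.05,               # learning rate
        max_depth=6,                   # tree depth
        weights_column="row_weight",   # per-row weights (hypothetical column)
        ignored_columns=["id"],        # columns to exclude from training
        stopping_rounds=3,             # early stopping
        stopping_metric="AUC",
        distribution="bernoulli",      # distribution of the response column
        categorical_encoding="enum",   # categorical encoding scheme
        seed=42)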

You can dive deeper into H2O's algorithms and their parameters in H2O's documentation at http://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science.html#algorithms. The H2O website also lists tutorials and booklets for its algorithms at http://docs.h2o.ai/#h2o. A full list of algorithm parameters, each with a description, its status as a hyperparameter or not, and a mapping to the algorithms that use it, can be found in H2O's documentation Appendix at http://docs.h2o.ai/h2o/latest-stable/h2o-docs/parameters.html.

H2O extensions of supervised learning

H2O extends its supervised learning algorithms by providing Automatic Machine Learning (AutoML) and Stacked Ensemble capabilities. We will take a closer look at these in the next section, where we will place H2O algorithms in the larger context of model capabilities.

Miscellaneous

H2O provides utilities to enhance work with its algorithms. Target encoding helps you handle categorical values and has many configurable parameters to make this easy. TF-IDF and Word2vec are commonly used in NLP problems, and they too are configurable. Finally, permutation variable importance is a method for understanding how strongly each feature contributes to the model, which can help you decide which features to keep in your final training dataset. A brief target encoding sketch follows.
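
As a brief sketch of one of these utilities, target encoding can be applied to the hypothetical grade and default columns from the earlier examples:

    from h2o.estimators import H2OTargetEncoderEstimator

    te = H2OTargetEncoderEstimator(blending=True, inflection_point=10, smoothing=20)
    te.train(x=["grade"], y="default", training_frame=train)
    train_encoded = te.transform(frame=train)   # adds a target-encoded version of 'grade'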

H2O modeling capabilities

H2O's supervised learning algorithms are used to train models on training data, tune them on validation data, and score or predict with them on test or live production data. H2O has extensive capabilities to train, evaluate, explain, score, and inspect models. These are summarized in the following diagram:

Figure 4.3 – The H2O supervised learning capabilities

Let's take a closer look at the model training capabilities.

H2O model training capabilities

Algorithms are at the heart of model training, but there are a larger set of capabilities to consider beyond the algorithms themselves. H2O provides the following model training capabilities:

  • AutoML: An easy-to-use interface and parameter set that automates the process of training and tuning many different models, using multiple algorithms, to create a large number of models in a short amount of time (see the sketch after this list).
  • Cross-validation: K-fold validation is used to generate performance metrics against folds of the validation split, and parameters such as the number of folds can be specified in the algorithm's training parameters.
  • Checkpointing: A new model is built as a continuation from a previously trained, checkpointed model as opposed to building the model from scratch; this is useful, for example, in retraining a model with new data.
  • Early stopping: Parameters that define when an algorithm stops model building early, based on whichever stopping metric is specified.
  • Grid search: Builds a model for each combination of the specified hyperparameter values and sorts the resulting models by a performance metric.
  • Regularization: Most algorithms have parameter settings to specify regularization techniques to prevent overfitting and increase explainability.
  • Segmented training: The training data is partitioned into segments based on the values of one or more specified columns, and a separate model is built for each segment.
  • Stacked ensembles: Combines the results of multiple base models that use the same or different algorithms into a better-performing single model.

After training a model, we want to evaluate it to determine whether its predictive performance meets our needs. Let's see what H2O offers in this regard.

H2O model evaluation capabilities

H2O exposes many model attributes to evaluate model performance. These are summarized as follows:

  • The leaderboard for AutoML: The AutoML model results ranked by configured performance metrics or other attributes, such as average prediction speed, with additional metrics shown.
  • Performance metrics for classification problems: For classification problems, H2O calculates GINI coefficient, Absolute Matthews Correlation Coefficient (MCC), F1, F0.5, F2, Accuracy, Logloss, Area Under the ROC Curve (AUC), Area Under the Precision-Recall Curve (AUCPR), and Kolmogorov-Smirnov (KS) metrics.
  • Performance metrics for regression problems: For regression problems, H2O calculates R Squared (R²), Mean Squared Error (MSE), Root Mean Squared Error (RMSE), Root Mean Squared Logarithmic Error (RMSLE), and Mean Absolute Error (MAE) metrics.
  • Prediction metrics: After a model is built, H2O allows you to predict a leaf node assignment (tree-based models), feature contributions, class probabilities for each stage (GBM models), and feature frequencies on a prediction path (GBM and DRF).
  • Learning curve plot: This shows a model performance metric as learning progresses to help diagnose overfitting or underfitting.
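
The following sketch retrieves a few of these evaluation attributes, assuming the gbm and aml models and the valid frame from the earlier sketches:

    perf = gbm.model_performance(valid)   # metrics on a held-out frame
    print(perf.auc(), perf.logloss())     # classification metrics
    gbm.learning_curve_plot()             # diagnose over- or underfitting
    aml.leaderboard.head()                # ranked AutoML results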

Let's now explore ways to explain H2O models.

H2O model explainability capabilities

H2O presents a simple and uniform interface to explain either single models or multiple models, which can be a list of separately built models or a reference to those generated from AutoML. On top of that, H2O allows you to generate global (that is, model-level) and local (row- or individual-level) explanations. H2O's explainability capabilities are configurable to your specifications. Output is tabular, graphical, or both, depending on the explanation.

We will dedicate all of Chapter 6, Advanced Model Building – Part II, to exploring this important topic in greater detail, but for now, here is a quick list of capabilities (a brief code sketch follows the list):

  • Residual analysis for regression
  • Confusion matrix for classification
  • Variable importance table and heatmap
  • Model correlation heatmap
  • Shapley values
  • Partial Dependency Plots (PDPs)
  • Individual Conditional Expectation (ICE)
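
Here is a minimal sketch of this uniform interface, again assuming the gbm and aml objects and the valid frame from the earlier sketches:

    gbm.explain(valid)                    # global explanations for a single model
    gbm.explain_row(valid, row_index=0)   # local explanation for one row
    aml.explain(valid)                    # compare the multiple models from AutoML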

Let's now complete our survey of H2O's capabilities for modeling at scale by seeing what we can do once our model is trained, evaluated, and explained.

H2O trained model artifacts

Once a model is trained, it can be exported and saved as a scoring artifact. The larger topic of deploying your artifact for production scoring will be treated in Part 3: Deploying Your Model to Production Environments. Here are the fundamental capabilities of the exported scoring artifact:

  • Predicting with a MOJO: Models can be saved as self-contained binary Java objects called MOJOs that can be flexibly deployed as low-latency production scoring artifacts on diverse systems (for example, a REST server, batch database scoring, and Hive UDFs). MOJOs can also be re-imported into H2O clusters for the purposes described in the next bullet point.
  • Inspecting the model with a MOJO: An exported MOJO can be re-imported into the H2O cluster and used to score against a dataset, inspect hyperparameters used to train the original model, see the scoring history, and show feature importances.
  • A MOJO compared to a POJO: The POJO is the precursor to the MOJO and is being deprecated by H2O but is still required for some algorithms.
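
Here is a brief sketch of exporting a MOJO and re-importing it, assuming the gbm model and valid frame from the earlier sketches (the output path is a placeholder):

    mojo_path = gbm.download_mojo(path="/tmp/models", get_genmodel_jar=True)

    # later, or on another H2O cluster, re-import the MOJO for scoring and inspection
    imported_model = h2o.import_mojo(mojo_path)
    scores = imported_model.predict(valid)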

Summary

In this chapter, we conducted a wide survey of H2O capabilities for model building at scale. We learned about the data sources we can ingest into our H2O clusters and the file formats that are supported. We learned how this data moves from the source to the H2O cluster, and how the H2OFrame API provides a single handle in the IDE to represent the distributed in-memory data on the H2O cluster as a single two-dimensional data structure. We then learned the many ways in which we can manipulate data through the H2OFrame API and how to export it to external systems if need be.

We then surveyed the core of H2O model building at scale – H2O's many state-of-the-art distributed unsupervised and supervised learning algorithms. Then, we put those into context by surveying model capabilities around them, from training, evaluating, and explaining the models, to using model artifacts to retrain, score, and inspect models.

With this map of the landscape firmly in hand, we can now roll up our sleeves and start building state-of-the-art H2O models at scale. In the next chapter, we will start by implementing the advanced model building topics one by one, before later putting it all together in a fully developed use case.
