In this chapter, we will gain a fundamental understanding of the components of H2O's machine learning at scale technology. We will view a simple code example of H2O machine learning, understand what it does, and identify any problems the example has with machine learning at an enterprise scale. This Hello World code example will serve as a simple representation on which to build our understanding further.
We will give an overview of each H2O component of machine learning at scale, identify how each component achieves scale, and relate each component to our simple code snippet. Then, we will tie these components together into a reference machine learning workflow. Finally, we will focus on the underlying key concepts that arise from these components. The understanding obtained in this chapter will be foundational to the rest of the book, where we will be implementing H2O technology to build and deploy state-of-the-art machine learning models at scale in an enterprise setting.
In this chapter, we're going to cover the following main topics:
For this chapter, you will need to install H2O-3 locally to run through a bare minimum Hello World workflow. To implement it, follow the instructions in the Appendix. Note that we will use the Python API throughout the book, so follow the instructions to install it in Python.
H2O Core is designed for machine learning at scale; however, it can also be used on small datasets on a user's laptop. In the following section, we will use a minimal code example of H2O-3 to build a machine learning model and export it as a deployable artifact. This example will serve as the most basic unit for understanding H2O machine learning code, much like viewing a human stick figure to begin learning about human biology.
Take a look at the code examples that follow. Here, we are writing in Python, which could be from Jupyter, PyCharm, or another Python client. We will learn that R and Java/Scala are alternative languages in which to write H2O code.
Let's start by importing the H2O library:
import h2o
Recall from the documentation that this has been downloaded from H2O and installed in the client or an IDE environment. This h2o package allows us to run H2O in-memory distributed machine learning from the IDE using the H2O API written in Python.
Next, we create an H2O cluster:
h2o.init(ip="localhost", port=54323)
The preceding line of code creates what is called an H2O cluster. This is a key concept underlying H2O's model building technology: a distributed in-memory architecture. In the Hello World case, the H2O cluster will be created on the laptop as localhost and will not be distributed. We will learn more about the H2O cluster in the H2O key concepts section of this chapter.
The ip and port configurations that are used to start the H2O cluster should provide sufficient clues that the H2O code will be sent via an API to the compute environment, which could be inside a data center or the cloud for an enterprise environment. However, here, it is on our localhost.
Then, we import a dataset:
loans = h2o.import_file("https://raw.githubusercontent.com/PacktPublishing/Machine-Learning-at-Scale-with-H2O/main/chapt2/loans-lite.csv")
Now we explore the dataset:
loans.describe()
This is a minimal amount of data exploration. It returns the number of rows and columns of the dataset, along with summary statistics (for example, the minimum, maximum, and mean) for each column.
Okay, now let's prepare the data for our model:
train, validation = loans.split_frame(ratios=[0.75])
label = "bad_loan"
predictors = loans.col_names
predictors.remove(label)
We have split the data into training and validation sets, with a 0.75 proportion for training. We are going to predict whether a loan will be bad or not (that is, whether it will default or not) and have identified this column as the label. Finally, we define the columns used to predict bad loans by using all columns in the dataset except the bad loan column.
Now, we build the model:
from h2o.estimators import H2OXGBoostEstimator
param = {"ntrees" : 25, "nfolds" : 10}
xgboost_model = H2OXGBoostEstimator(**param)
xgboost_model.train(x = predictors,
y = label,
training_frame = train,
validation_frame = validation)
We have imported H2O's XGBoost module and configured two hyperparameters for it. Then, we started model training by passing in references to the predictor columns, the label column, the training data, and the validation data.
XGBoost is one of many widely recognized and extensively used machine learning algorithms packaged in the h2o module. The H2O API exposed by this module will run the XGBoost model in H2O's architecture on the enterprise infrastructure, as we will learn later. Regarding hyperparameters, we will discover that H2O offers an extensive set of hyperparameters to configure for each model.
When the model finishes, we can export the model using one line of code:
xgboost_model.download_mojo(path="~/loans-model", get_genmodel_jar=True)
The exported scoring artifact is now ready to pass to DevOps to deploy. The get_genmodel_jar=True parameter triggers the download to include h2o-genmodel.jar. This is a library used by the model for scoring outside of an H2O cluster, that is, in a production environment. We will learn more about productionizing H2O models in Section 3 – Deploying Your Models to Production Environments.
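To make this concrete, here is a sketch of what batch scoring with the downloaded artifacts might look like from the command line. The h2o-genmodel.jar library includes a PredictCsv tool for scoring a MOJO outside of an H2O cluster; the file paths and the MOJO filename below are placeholders, and the snippet only assembles the command rather than running it (check the h2o-genmodel documentation for the exact flags in your version):

```python
# Sketch: batch scoring a downloaded MOJO outside of an H2O cluster with the
# PredictCsv tool shipped in h2o-genmodel.jar. All file paths and the MOJO
# filename are placeholders; this only assembles the command, it does not run it.

mojo_path = "loans-model/XGBoost_model.zip"      # hypothetical MOJO filename
genmodel_jar = "loans-model/h2o-genmodel.jar"    # downloaded alongside the MOJO

command = [
    "java", "-cp", genmodel_jar,
    "hex.genmodel.tools.PredictCsv",
    "--mojo", mojo_path,
    "--input", "new_loans.csv",
    "--output", "scored_loans.csv",
]

# A DevOps pipeline might execute this with subprocess.run(command)
print(" ".join(command))
```

The point of this sketch is that scoring needs no H2O cluster at all, just the MOJO and the genmodel library, which is exactly what makes the artifact easy to hand off to DevOps.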
We are done with model building, for now. So, we will shut down the cluster:
h2o.cluster().shutdown()
This frees up the resources that the H2O cluster has been using.
Bear in mind that this is a simple Hello World H2O model building example. It is meant to do both of the following: familiarize us with basic H2O model building code, and expose the problems such code runs into at enterprise scale.
In Section 2 – Building State-of-the-Art Models on Large Data Volumes Using H2O, we will explore extensive techniques to build highly predictive and explainable models at scale. Let's start our journey by discussing some issues of scale that our Hello World example exposes.
This Hello World code will not scale well in an enterprise setting. Let's revisit the code to better understand these scaling constraints.
We import the library in our IDE code:
import h2o
Most enterprises want to have some control over the versions of libraries that are used. Additionally, they usually want to provide a central platform to host and authenticate all users of a piece of technology and to have administrators manage that platform. We will discover that Enterprise Steam plays a key role in centrally managing users and H2O environments.
We initialize the H2O cluster:
h2o.init(ip="localhost", port=54323)
Machine learning at scale requires the distribution of compute resources across a server cluster to achieve horizontal scaling (that is, divide-and-conquer compute across many servers). Therefore, the IP address and port should point to a member of a server cluster, not to a single computer as in this example. We will see that H2O Core creates its own self-organized cluster that distributes and horizontally scales model building.
Since scaling occurs on the enterprise server cluster, which is typically used by many individuals and groups, enterprises want to control user access to this environment along with the amount of resources consumed by users. What, then, would prevent a user from launching multiple H2O clusters, using as many resources as possible on each, and thus blocking resource availability from other users? Enterprise Steam manages H2O users and H2O resource consumption on the enterprise server cluster.
We import the dataset:
loans = h2o.import_file("https://raw.githubusercontent.com/PacktPublishing/Machine-Learning-at-Scale-with-H2O/main/chapt2/loans-lite.csv")
Large data volumes take an exceedingly long time to move over the network; a transfer can take hours or days to complete, or time out before it finishes. Computation during model building at scale should occur where the data resides to prevent this data movement bottleneck. We will discover that H2O clusters launched on the enterprise system ingest data from the storage layer directly into server memory. Because data is partitioned across the servers that comprise an H2O cluster, data ingest occurs in parallel across those partitions.
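As an illustration of this idea (plain Python, not H2O code), the sketch below mimics how a dataset's rows might be partitioned across the nodes of a four-node H2O cluster so that each node can ingest its own chunk into memory in parallel:

```python
# Illustrative sketch only (plain Python, not H2O code): partition a dataset's
# rows across cluster nodes so that each node can ingest its chunk in parallel.

def partition_rows(rows, n_nodes):
    """Assign each row to a node in round-robin fashion."""
    partitions = [[] for _ in range(n_nodes)]
    for i, row in enumerate(rows):
        partitions[i % n_nodes].append(row)
    return partitions

# A toy stand-in for the loans dataset: 10 rows spread across 4 cluster nodes
rows = [{"loan_id": i, "bad_loan": i % 2} for i in range(10)]
partitions = partition_rows(rows, n_nodes=4)

for node_id, chunk in enumerate(partitions):
    print(f"node {node_id} holds {len(chunk)} rows")
```

H2O's actual partitioning strategy is more sophisticated than round-robin, but the principle is the same: each node holds and works on only its own slice of the data.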
We will see how Enterprise Steam centralizes user authentication and how the user's identity is passed to the enterprise system where its native authorization mechanism is honored.
We train the model:
xgboost_model.train(x = predictors,
y = label,
training_frame = train,
validation_frame = validation)
Of course, this is the heart of the model building process and, likewise, the focus of much of this book: how to build world-class machine learning models against large data volumes using H2O's extensive machine learning algorithms and model building capabilities.
We download the deployable model:
xgboost_model.download_mojo(path="~/loans-model", get_genmodel_jar=True)
Bear in mind that, from a business standpoint, value is not achieved until a model is exported and deployed into production. Doing so involves the complexities of multiple enterprise stakeholders. We will learn how the design and capabilities of the exported MOJO (Model Object, Optimized) facilitate the ease of deployment to diverse software systems involving these stakeholders.
We shut down the H2O cluster:
h2o.cluster().shutdown()
An H2O cluster uses resources and should be shut down when not in use. If this is not done, other users or jobs on the enterprise system could be competing for these resources and, consequently, become impacted. Additionally, fewer new users can be added to the system before the infrastructure must be expanded. We will see that Enterprise Steam governs how H2O users consume resources on the enterprise system. The resulting gain in resource efficiency allows H2O users and their work to scale more effectively on a given allocation of infrastructure.
Now that we have run our Hello World example and explored some of its issues regarding scale, let's move on to gain an understanding of H2O components for machine learning model building and deployment at scale.
As introduced in the previous chapter and emphasized throughout this book, H2O machine learning overcomes problems of scale. The following is a brief introduction of each component of H2O machine learning at scale and how each overcomes these challenges.
H2O Core allows a data scientist to write code to build models using well-known machine learning algorithms. The coding experience is through an H2O API expressed in Python, R, or Java/Scala and written in their favorite client or IDE, for example, Python in a Jupyter notebook. The actual computation of model building, however, takes place on an enterprise server cluster (not the IDE environment) and leverages the server cluster's vast pool of memory and CPUs needed to run machine learning algorithms against massive data volumes.
So, how does this work? First, data used for model building is partitioned and distributed in memory by H2O on the server cluster. The IDE sends H2O instructions to the server cluster. A server in the cluster receives these instructions and distributes them to the other servers in the cluster. The instructions run in parallel on the partitioned in-memory data. The server that received the instructions gathers and combines the results and sends them back to the IDE. This repeats as each piece of code is run from the IDE.
This divide-and-conquer approach is fundamental to H2O model building at scale. The unit of H2O's divide-and-conquer architecture is called an H2O cluster and is elaborated as a key concept later in this chapter. The result is rapid model building on large volumes of data.
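The divide-and-conquer pattern can be illustrated with a small plain-Python sketch (again, not H2O code). Here, a column mean is computed the way an H2O cluster might: each node reduces its in-memory partition to a partial (sum, count) result in parallel, and the leader node combines the partial results:

```python
# Illustrative sketch only: computing a column mean with a map/reduce pattern,
# mimicking how an H2O cluster runs an instruction in parallel on partitioned
# data and then combines partial results on the leader node.

def node_partial_mean(partition):
    """Each node reduces its in-memory partition to a (sum, count) pair."""
    return sum(partition), len(partition)

def leader_combine(partials):
    """The leader node combines partial results into the final answer."""
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count

# Data already partitioned across three nodes
partitions = [[1.0, 2.0], [3.0, 4.0], [5.0]]
partials = [node_partial_mean(p) for p in partitions]  # runs in parallel on each node
print(leader_combine(partials))  # 3.0
```

Every H2O instruction, from computing summary statistics to fitting tree splits, follows this same distribute, compute in parallel, and combine rhythm.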
Some of the key features of H2O Core are as follows:
H2O Core comes in two flavors: H2O-3 and H2O Sparkling Water.
H2O-3 is H2O Core, as described in the previous section. H2O Sparkling Water is H2O-3 wrapped by Spark integration. It is identical to H2O-3, with the following additional capabilities:
H2O-3 and Sparkling Water are the model building alternatives of the more general H2O Core. The concept of the H2O cluster launched on the larger enterprise server cluster is similar for both H2O Core flavors, though some implementation details differ; these differences are essentially invisible to the data scientist. As mentioned, Sparkling Water is particularly useful for integrating Spark data engineering and H2O model building workflows.
Enterprise Steam provides a centralized web UI and API for data scientists to initialize and terminate their H2O environments (called H2O clusters) and for administrators to manage H2O users and H2O integration with the enterprise server cluster.
The key features of Enterprise Steam are as follows:
The models built from H2O Core are exported as deployable scoring artifacts called H2O MOJOs. MOJOs can run in any JVM environment (except, perhaps, the very smallest edge devices).
In Section 3 – Deploying Your Models to Production Environments, we will learn that MOJOs are ready to deploy directly to H2O software as well as many third-party scoring solutions with no coding required. However, if you wish to directly embed MOJOs into your own software, there is a MOJO Java API to build Java helper classes to expose MOJO capabilities (for example, output reason codes in addition to a prediction) and to provide flexible integration with your scoring input and output.
MOJOs are identical in construct regardless of the machine learning algorithm used to build the model. Therefore, deployment from a DevOps perspective is repeatable and automatable.
The key features of the MOJO are as follows:
Note that there is an alternative to the H2O MOJO, called POJO, which is used for infrequent edge cases. This will be explored further in Chapter 8, Putting It All Together.
Now that we understand the roles and key features of H2O's machine learning at scale components, let's tie them together into a high-level workflow, as represented in the following diagram:
The workflow occurs in the following sequence:
In the following sections, we will identify and describe the key concepts of H2O that underlie the workflow steps of the previous section. These concepts are necessary to understand the rest of the book.
The data scientist has a familiar experience in building H2O models at scale while being abstracted from the complexities of the infrastructure and architecture on the enterprise server cluster. This is further detailed in the following diagram:
Data scientists use well-known unsupervised and supervised machine learning techniques that scale across the enterprise's distributed infrastructure and architecture. These techniques are written with the H2O model building API, which is written in familiar languages (such as Python, R, or Java) using familiar IDEs (for example, Jupyter or RStudio).
H2O Flow – A Convenient, Optional UI
H2O generates its own web UI called H2O Flow, which is optional to use during model building. H2O Flow's rich, UI-driven feature set can be used for a full model building workflow or leveraged for handy tricks, as we will demonstrate in Chapter 5, Advanced Model Building – Part 1.
Therefore, the data scientist works in a familiar world that connects to a complex architecture to scale model building to large or massive datasets. We will explore this architecture in the next section.
The H2O cluster is perhaps the most central concept for all stakeholders to understand. It is how H2O creates its unit of architecture for building machine learning models on the enterprise server cluster. We can understand this concept using the following diagram:
When a data scientist launches an H2O cluster, they specify the number of servers to distribute the work across (also known as the number of nodes), along with the amount of memory and the number of CPUs to use for each node. We will learn that these specifications can be configured manually or auto-computed by Enterprise Steam based on the volume of the training data.
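As an illustration of such auto-computation, the following sketch estimates a cluster specification from the training data volume. The 4x memory multiplier reflects a commonly cited H2O sizing rule of thumb (total cluster memory of roughly four times the uncompressed data size), but the function itself and its defaults are assumptions for illustration, not Enterprise Steam's actual algorithm:

```python
# Hypothetical sketch of auto-computing H2O cluster specifications from data
# volume. The 4x memory multiplier is a commonly cited H2O sizing rule of
# thumb; this is NOT Enterprise Steam's actual algorithm.

def estimate_cluster_spec(data_size_gb, memory_per_node_gb=32, memory_multiplier=4):
    """Estimate the number of nodes needed for a given training data volume."""
    total_memory_gb = data_size_gb * memory_multiplier
    n_nodes = max(1, -(-total_memory_gb // memory_per_node_gb))  # ceiling division
    return {"nodes": int(n_nodes), "memory_per_node_gb": memory_per_node_gb}

print(estimate_cluster_spec(100))  # 100 GB of data needs ~400 GB of cluster memory
```

Whatever the exact heuristic, the benefit is the same: data scientists do not need to reason about infrastructure sizing to launch a well-proportioned cluster.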
When the H2O cluster is launched, H2O software (a single JAR file) is pushed to the specified number of nodes in the enterprise server cluster, where each node allocates the specified memory and CPUs. Then, the H2O software organizes itself into a self-communicating cluster with one node elected as the leader, which communicates with the IDE and coordinates with the remainder of the H2O cluster.
The data scientist connects to the launched H2O cluster from the IDE. Then, the data scientist writes the model building code. Each part of the code is translated by the H2O library in the IDE into instructions to the H2O cluster. Each instruction is sent, in sequence, to the leader node on the H2O cluster, which distributes it to other H2O cluster members where the instructions are executed in parallel. The leader node gathers and combines the results and sends them back to the IDE.
Here are some important notes to bear in mind:
Let's look at the life cycle of an H2O cluster, as shown in the following diagram:
Let's look at each of the stages of the life cycle, one by one, to understand how they work:
Stop/Save Data & Restart: This is an alternative to Stop and is possible when the Enterprise Steam administrator configures this option for a user or user group. In this case, when the H2O cluster is stopped, it saves data from the model building steps (that is, it saves the model building state) to the storage layer. When the cluster is restarted (using the same name as when it was launched), the cluster is launched and returned to its previous state.
All H2O administration activities occur on Enterprise Steam, and users must launch H2O clusters through Steam. This "all roads lead to Enterprise Steam" approach means that Steam governs users and their H2O clusters before they are launched on the enterprise system. This is detailed in the following diagram:
Administrators configure settings to manage H2O users and integrate Enterprise Steam with the enterprise server cluster. Additionally, administrators store H2O software versions that will be pushed to the server cluster when H2O clusters are launched and removed when the cluster is stopped and the resources are released. Administrators also have access to user usage data. This is all done through an administration-only UI.
Administrators configure users and how users launch H2O clusters in the enterprise environment. These configurations define limits on the number of H2O clusters a user can run concurrently, the size (that is, the number of nodes) of each cluster, and the resources (for example, memory per node) allocated to each H2O cluster that is launched. Configurations also define when the cluster will be stopped or terminated if the user does not do so manually from the H2O model building code in the IDE. A set of such configurations is defined as a profile, and one or more profiles are assigned to users or user groups. Therefore, administrators can designate some users as power users and others as light users.
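The following plain-Python sketch shows how such a profile might be represented and enforced. The field names and limit values are hypothetical, chosen only to illustrate the concept of profile-governed cluster launches:

```python
# Hypothetical sketch of an Enterprise Steam-style profile that limits how a
# user may launch H2O clusters. Field names and limits are illustrative only.

POWER_USER_PROFILE = {
    "max_concurrent_clusters": 3,
    "max_nodes_per_cluster": 10,
    "max_memory_per_node_gb": 64,
}

def validate_launch_request(profile, running_clusters, nodes, memory_per_node_gb):
    """Check a cluster launch request against the user's assigned profile."""
    if running_clusters >= profile["max_concurrent_clusters"]:
        return "denied: too many concurrent clusters"
    if nodes > profile["max_nodes_per_cluster"]:
        return "denied: too many nodes requested"
    if memory_per_node_gb > profile["max_memory_per_node_gb"]:
        return "denied: too much memory per node requested"
    return "approved"

print(validate_launch_request(POWER_USER_PROFILE,
                              running_clusters=1, nodes=8,
                              memory_per_node_gb=32))  # approved
```

A light-user profile would simply carry smaller limits, which is how administrators keep a shared server cluster from being monopolized by any one user.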
Users authenticate to Enterprise Steam via the same identity provider (for example, LDAP) that was implemented to authorize access to resources on the enterprise server cluster environment (for example, S3 buckets). Enterprise Steam passes the user identity when the user launches a cluster, and this identity is used during authorization challenges on the enterprise system. Users in their IDEs must authenticate against the Enterprise Steam API to connect to the clusters they have launched.
Does H2O Core Require Enterprise Steam?
Note that H2O Core does not require Enterprise Steam. Enterprise administrators can configure their enterprise server cluster infrastructure to allow H2O clusters to be launched on it directly.
However, this approach is not a sound enterprise practice. It introduces a loss of control and governance that Enterprise Steam provides as a centralized H2O gateway to secure, manage, and log users, as elaborated in this section. Additionally, Enterprise Steam provides benefits to users by freeing them from the technical steps involved with integrating H2O Core with the enterprise cluster when launching H2O clusters, for example, Kerberos security requirements. The enterprise benefits of Enterprise Steam are explored in greater detail in Chapter 11, The Administrator and Operations Views, and in Chapter 12, The Enterprise Architect and Security Views.
Also, bear in mind that H2O Core is free and open source, whereas Enterprise Steam is not.
Now that we know how H2O clusters are formed and the role Enterprise Steam plays in administering H2O users and launching H2O clusters, let's understand Enterprise Steam and the H2O Core architecture from a high-level deployment perspective. The following diagram describes this deployment architecture:
Enterprise Steam runs on its own dedicated server that communicates with the enterprise server cluster via HTTP(S). As mentioned earlier, Enterprise Steam stores the H2O Core (H2O-3 or Sparkling Water) JAR file that is pushed to the server cluster, which then self-organizes into a coordinated but distributed H2O cluster. This H2O cluster can be a native YARN or Kubernetes job, depending on which backend is implemented. Note that H2O-3 is run on a Map-Reduce framework, and Sparkling Water is run on the Spark framework.
An H2O-3 or Sparkling Water API library is installed in the data science IDE (for example, a pip install of the H2O-3 package in the Jupyter environment). It must match the version used to launch the cluster from Enterprise Steam. As mentioned previously, data scientists use the IDE to authenticate to Enterprise Steam, connect to the H2O cluster, and write H2O model building code. The H2O model building code is translated by the H2O client library into a REST message that is sent to the H2O cluster's leader node. Then, the work is distributed across the H2O cluster, and the results are returned to the IDE.
Note that enterprise clusters can be on-premise, cloud infrastructure-as-a-service, or managed service implementations. They can be, for example, Kubernetes or Cloudera CDH on-premise or in the cloud, or Cloudera CDP or Amazon EMR in the cloud. The full deployment possibilities are discussed in more detail in Chapter 12, The Enterprise Architect and Security Views.
H2O Platform Choices
In this book, H2O At Scale technology refers to the combination of H2O Enterprise Steam, H2O Core (H2O-3 or H2O Sparkling Water), and the H2O MOJO. H2O At Scale integrates with an enterprise server cluster for model building and an enterprise scoring environment for model deployment.
H2O At Scale can be implemented with just these components. Alternatively, it can be implemented as a subset of the larger H2O machine learning platform and capability set called H2O AI Cloud. The H2O AI Cloud platform is described in greater detail in Section 5 – Broadening the View – Data to AI Applications with the H2O AI Cloud Platform.
The following code shows a simple example of Spark and H2O integrated in the same H2O code using H2O Sparkling Water:
# Import data as a Spark DataFrame
loans_spark = spark.read.load("loans.csv", format="csv", sep=",", inferSchema="true", header="true")
# Spark data engineering code, for example, removing rows with missing values
loans_spark = loans_spark.dropna()
# Convert the Spark DataFrame to an H2OFrame
loans = h2oContext.asH2OFrame(loans_spark)
# Continue with the H2O model building steps shown in the previous code example
loans.describe()
The code shows Spark importing data, which is held as a Spark DataFrame. Spark SQL or the Spark DataFrame API is used to engineer this data into a new DataFrame, and then this Spark DataFrame is converted into an H2OFrame, from which H2O model building is performed. Therefore, the user moves seamlessly between Spark and H2O code in the same API language and IDE.
The idea of the H2O cluster is still fundamentally true for Sparkling Water. It now expresses the H2O cluster architecture within the Spark framework. Details of this architecture are elaborated in Chapter 12, The Enterprise Architect and Security Views.
Data scientists build models, but the end goal is to put models into a production environment where predictions are made in a business context. MOJOs make this last mile of deployment easy. MOJOs are exported with a single line of code, and whether a model was built using Python or R, and whether it is a generalized linear model, an XGBoost model, or a stacked ensemble, all MOJOs are identical from a DevOps perspective. This makes model deployment repeatable and, thus, capable of fitting into the existing automated CI/CD pipelines used throughout the organization.
In this chapter, we laid the foundation for understanding H2O machine learning at scale. We started by reviewing a bare minimum Hello World code example and discussed the problems of scale around it. Then, we introduced the H2O Core, Enterprise Steam, and MOJO technology components and how these can overcome problems of scale. Finally, we extracted a set of key concepts from these technologies to deepen our understanding.
In the next chapter, we will use this understanding to begin our journey of learning how to build and deploy world-class models at scale. Let the coding begin!