Chapter 2: Platform Components and Key Concepts

In this chapter, we will gain a fundamental understanding of the components of H2O's machine learning at scale technology. We will view a simple code example of H2O machine learning, understand what it does, and identify the problems it poses for machine learning at enterprise scale. This Hello World code example will serve as a simple representation on which to build our understanding further.

We will overview each H2O component of machine learning at scale, identify how each component achieves scale, and identify how each component relates to our simple code snippet. Then, we will tie these components together into a reference machine learning workflow using these components. Finally, we will focus on the underlying key concepts that arise from these components. The understanding obtained in this chapter will be foundational to the rest of the book, where we will be implementing H2O technology to build and deploy state-of-the-art machine learning models at scale in an enterprise setting.

In this chapter, we're going to cover the following main topics:

  • Hello World – the H2O machine learning code
  • The components of H2O machine learning at scale
  • The machine learning workflow using these H2O components
  • H2O key concepts

Technical requirements

For this chapter, you will need to install H2O-3 locally to run through a bare minimum Hello World workflow. To implement it, follow the instructions in the Appendix. Note that we will use the Python API throughout the book, so follow the instructions to install it in Python.

Hello World – the H2O machine learning code

H2O Core is designed for machine learning at scale; however, it can also be used on small datasets on a user's laptop. In this section, we will use a minimal H2O-3 code example to build a machine learning model and export it as a deployable artifact. This example will serve as the most basic unit for understanding H2O machine learning code, much like viewing a human stick figure to begin learning about human biology.

Code example

Take a look at the code examples that follow. Here, we are writing in Python, which could be from Jupyter, PyCharm, or another Python client. We will learn that R and Java/Scala are alternative languages in which to write H2O code.

Let's start by importing the H2O library:

import h2o

Recall that this library has been downloaded from H2O and installed in the client or IDE environment, as described in the Appendix. This h2o package allows us to run H2O in-memory distributed machine learning from the IDE using the H2O API written in Python.

Next, we create an H2O cluster:

h2o.init(ip="localhost", port=54323)

The preceding line of code creates what is called an H2O cluster. This is a key concept underlying H2O's model building technology: a distributed in-memory architecture. In the Hello World case, the H2O cluster will be created on the laptop as localhost and will not be distributed. We will learn more about the H2O cluster in the H2O key concepts section of this chapter.

The ip and port configurations that are used to start the H2O cluster should provide sufficient clues that the H2O code will be sent via an API to the compute environment, which could be inside a data center or the cloud for an enterprise environment. However, here, it is on our localhost.

Then, we import a dataset:

loans = h2o.import_file("https://raw.githubusercontent.com/PacktPublishing/Machine-Learning-at-Scale-with-H2O/main/chapt2/loans-lite.csv")

Now we explore the dataset:

loans.describe()

This is a minimal amount of data exploration. It returns the number of rows and columns along with summary statistics for each column.
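If we want to dig a little deeper, a few more one-liners come in handy. The following is a minimal sketch; the bad_loan column comes from the loans-lite.csv dataset imported previously:

loans.dim                  # [number of rows, number of columns]
loans.head(5)              # the first five rows
loans["bad_loan"].table()  # frequency counts for the label column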

Okay, now let's prepare the data for our model:

train, validation = loans.split_frame(ratios=[0.75])
label = "bad_loan"
predictors = loans.col_names
predictors.remove(label)

We have split the data into training and validation sets, with a 0.75 proportion for training. We are going to predict whether a loan will be bad or not (that is, whether it will default or not) and have identified this column as the label. Finally, we define the columns used to predict bad loans by using all columns in the dataset except the bad loan column.

Now, we build the model:

from h2o.estimators import H2OXGBoostEstimator
param = {"ntrees" : 25, "nfolds" : 10}
xgboost_model = H2OXGBoostEstimator(**param)
xgboost_model.train(x = predictors,
                    y = label,
                    training_frame = train,
                    validation_frame = validation)

We have imported H2O's XGBoost module and configured two hyperparameters for it. Then, we started model training by passing references to the predictor columns, the label column, the training data, and the validation data.

XGBoost is one of many widely recognized and extensively used machine learning algorithms packaged in the h2o module. The H2O API exposed by this module will run the XGBoost model in H2O's architecture on the enterprise infrastructure, as we will learn later. Regarding hyperparameters, we will discover that H2O offers an extensive set of hyperparameters to configure for each model.
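As a taste of what is to come, here is a hedged sketch of a richer hyperparameter configuration for H2OXGBoostEstimator; the values shown are illustrative, not tuned:

param = {"ntrees": 100,             # number of boosted trees
         "max_depth": 6,            # maximum tree depth
         "learn_rate": 0.1,         # shrinkage per tree
         "sample_rate": 0.8,        # row sampling rate per tree
         "col_sample_rate": 0.8,    # column sampling rate per split
         "nfolds": 10,              # cross-validation folds
         "seed": 42,                # for reproducibility
         "stopping_rounds": 5,      # early stopping patience
         "stopping_metric": "AUC"}  # early stopping criterion
xgboost_tuned = H2OXGBoostEstimator(**param)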

When the model finishes, we can export the model using one line of code:

xgboost_model.download_mojo(path="~/loans-model", get_genmodel_jar=True)

The exported scoring artifact is now ready to pass to DevOps to deploy. The get_genmodel_jar=True parameter triggers the download to include h2o-genmodel.jar. This is a library used by the model for scoring outside of an H2O cluster, that is, in a production environment. We will learn more about productionizing H2O models in Section 3 – Deploying Your Models to Production Environments.
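To make this concrete, the h2o package includes a convenience function that scores a CSV file with a MOJO and h2o-genmodel.jar using only a local Java runtime, that is, without a running H2O cluster. The following is a minimal sketch; the file names are illustrative placeholders:

predictions = h2o.mojo_predict_csv(
    input_csv_path="new_loans.csv",
    mojo_zip_path="loans-model/XGBoost_model.zip",
    genmodel_jar_path="loans-model/h2o-genmodel.jar")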

We are done with model building, for now. So, we will shut down the cluster:

h2o.cluster().shutdown()

This frees up the resources that the H2O cluster has been using.

Bear in mind that this is a simple Hello World H2O model building example. It is meant to do both of the following:

  • Give a bare minimum introduction to H2O model building.
  • Serve as a basis to discuss issues of scale in the enterprise, which we will do in the next section.

In Section 2 – Building State-of-the-Art Models on Large Data Volumes Using H2O, we will explore extensive techniques to build highly predictive and explainable models at scale. Let's start our journey by discussing some issues of scale that our Hello World example exposes.

Some issues of scale

This Hello World code will not scale well in an enterprise setting. Let's revisit the code to better understand these scaling constraints.

We import the library in our IDE code:

import h2o

Most enterprises want to have some control over the versions of libraries that are used. Additionally, they usually want to provide a central platform to host and authenticate all users of a piece of technology and to have administrators manage that platform. We will discover that Enterprise Steam plays a key role in centrally managing users and H2O environments.

We initialize the H2O cluster:

h2o.init(ip="localhost", port=54323)

Machine learning at scale requires the distribution of compute resources across a server cluster to achieve horizontal scaling (that is, divide-and-conquer compute across many servers). Therefore, the IP address and port should point to a member of a server cluster, not to a single computer as in this example. We will see that H2O Core creates its own self-organized cluster that distributes and horizontally scales model building.
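For instance, connecting to an already running, multi-node H2O cluster from the IDE is a one-liner; the hostname below is a hypothetical placeholder:

h2o.connect(url="https://h2o-cluster.example.com:54321")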

Since scaling occurs on the enterprise server cluster, which is typically used by many individuals and groups, enterprises want to control user access to this environment along with the amount of resources consumed by users. But then what would prevent a user from launching multiple H2O clusters, using as many resources as possible on each, and thus blocking resource availability for other users? Enterprise Steam manages H2O users and their resource consumption on the enterprise server cluster.

We import the dataset:

loans = h2o.import_file("https://raw.githubusercontent.com/PacktPublishing/Machine-Learning-at-Scale-with-H2O/main/chapt2/loans-lite.csv")

Large data volumes take an exceedingly long time to move over a network: a transfer can take hours or days to complete, or time out before it finishes. Computation during model building at scale should occur where the data resides to prevent this data movement bottleneck. We will discover that H2O clusters launched on the enterprise system ingest data from the storage layer directly into server memory. Because data is partitioned across the servers that comprise an H2O cluster, data ingest occurs in parallel across those partitions.
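For example, pointing h2o.import_file at the storage layer rather than at an external URL lets the H2O cluster ingest in parallel; the paths below are hypothetical placeholders:

loans = h2o.import_file("hdfs://namenode/data/loans/loans-lite.csv")
loans = h2o.import_file("s3://my-bucket/data/loans-lite.csv")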

We will see how Enterprise Steam centralizes user authentication and how the user's identity is passed to the enterprise system where its native authorization mechanism is honored.

We train the model:

xgboost_model.train(x = predictors,
                    y = label,
                    training_frame = train,
                    validation_frame = validation)

Of course, this is the heart of the model building process and, likewise, the focus of much of this book: how to build world-class machine learning models against large data volumes using H2O's extensive machine learning algorithm and model building capabilities.

We download the deployable model:

xgboost_model.download_mojo(path="~/loans-model", get_genmodel_jar=True)

Bear in mind that, from a business standpoint, value is not achieved until a model is exported and deployed into production. Doing so involves the complexities of multiple enterprise stakeholders. We will learn how the design and capabilities of the exported MOJO (Model Object, Optimized) facilitate the ease of deployment to diverse software systems involving these stakeholders.

We shut down the H2O cluster:

h2o.cluster().shutdown()

An H2O cluster uses resources and should be shut down when not in use. If this is not done, other users or jobs on the enterprise system could be competing for these resources and, consequently, become impacted. Additionally, fewer new users can be added to the system before the infrastructure must be expanded. We will see that Enterprise Steam governs how H2O users consume resources on the enterprise system. The resulting gain in resource efficiency allows H2O users and their work to scale more effectively on a given allocation of infrastructure.

Now that we have run our Hello World example and explored some of its issues regarding scale, let's move on to gain an understanding of H2O components for machine learning model building and deployment at scale.

The components of H2O machine learning at scale

As introduced in the previous chapter and emphasized throughout this book, H2O machine learning overcomes problems of scale. The following is a brief introduction to each component of H2O machine learning at scale and how each overcomes these challenges.

H2O Core – in-memory distributed model building

H2O Core allows a data scientist to write code to build models using well-known machine learning algorithms. The coding experience is through an H2O API expressed in Python, R, or Java/Scala and written in the data scientist's favorite client or IDE, for example, Python in a Jupyter notebook. The actual computation of model building, however, takes place on an enterprise server cluster (not the IDE environment) and leverages the server cluster's vast pool of memory and CPUs needed to run machine learning algorithms against massive data volumes.

So, how does this work? First, data used for model building is partitioned and distributed in memory by H2O on the server cluster. The IDE sends H2O instructions to the server cluster. A server in the cluster receives these instructions and distributes them to the other servers in the cluster. The instructions are run in parallel on the partitioned in-memory data. The server that received the instructions gathers and combines the results and sends them back to the IDE. This is done repeatedly as code is sequenced through the IDE.

This divide-and-conquer approach is fundamental to H2O model building at scale. The unit of this divide-and-conquer architecture is called an H2O cluster, which is elaborated on as a key concept later in this chapter. The result is rapid model building on large volumes of data.

The key features of H2O Core

Some of the key features of H2O Core are as follows:

  • Horizontal scaling: Data operations and machine learning algorithms are distributed in parallel and in memory, with additional optimizations such as a distributed key/value store to rapidly access data and objects during model building.
  • Familiar experience: Data scientists use familiar languages and IDEs to write H2O API code, as we have just done.
  • Open source: H2O Core is open source.
  • Wide range of file formats: H2O supports a wide range of source data formats.
  • Data manipulation: The H2O API includes a wide range of tasks commonly performed to prepare data for machine learning. Sparkling Water (covered in the next section) extends data engineering techniques to Spark.
  • Well-recognized machine learning algorithms: H2O implements a wide range of well-recognized supervised and unsupervised machine learning algorithms.
  • Training, testing, and evaluation: Extensive techniques in cross-validation, grid search, variable importance, and performance metrics are used to train, test, and evaluate models; this also includes model checkpointing capabilities.
  • Automatic Machine Learning (AutoML): The H2O Core AutoML API provides a simple wrapper function to concisely automate the training and tuning of multiple models, including stacked ensembling, and to present the results in a leaderboard (see the sketch after this list).
  • Model explainability: It offers extensive local and global explainability methods and visualizations for single models or those produced by AutoML, all from a single wrapper function.
  • AutoDoc: It enables the automated generation of standardized Word documents that extensively describe model building and explainability; note that AutoDoc is not part of the free open source offering.
  • Exportable scoring artifact (MOJO): It uses a single line of code to export the model as a deployable scoring artifact (model deployment will be discussed in greater detail in Section 3 – Deploying Your Models to Production Environments).
  • H2O Flow Web UI: This is an optional web-based interactive UI to guide users through the model building workflow in an easy yet rich point-and-click experience, which is useful for the rapid experimentation and prototyping of H2O models.
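To illustrate the AutoML and model explainability features from the preceding list, here is a minimal sketch that reuses the train, validation, predictors, and label variables from the Hello World example:

from h2o.automl import H2OAutoML

aml = H2OAutoML(max_models=10, seed=42)
aml.train(x=predictors, y=label, training_frame=train)
print(aml.leaderboard)  # models ranked by cross-validated performance

# Global and local explanations for the AutoML models from one wrapper function
h2o.explain(aml, validation)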

H2O-3 and H2O Sparkling Water

H2O Core comes in two flavors: H2O-3 and H2O Sparkling Water.

H2O-3 is H2O Core, as described in the previous section. H2O Sparkling Water is H2O-3 wrapped by Spark integration. It is identical to H2O-3 along with the following additional capabilities:

  • Seamless integration of Spark and H2O API code: The user writes both Spark and H2O code in the same IDE; for example, using SparkSQL code to engineer data and H2O code to build world-class models.
  • Conversion between H2O and Spark DataFrames: H2O and Spark DataFrames interconvert as part of the seamless integration; therefore, the results of SparkSQL data munging can be used as input to H2O model building.
  • Spark engine: Sparkling Water runs as a native Spark application on the Spark framework.

H2O-3 and Sparkling Water are the model building flavors of the more general H2O Core. The concept of the H2O cluster launched on the larger enterprise server cluster is similar for both flavors, though some implementation details differ; these differences are essentially invisible to the data scientist. As mentioned, Sparkling Water is particularly useful for integrating Spark data engineering and H2O model building workflows.

H2O Enterprise Steam – a managed, self-provisioning portal

Enterprise Steam provides a centralized web UI and API for data scientists to initialize and terminate their H2O environments (called H2O clusters) and for administrators to manage H2O users and H2O integration with the enterprise server cluster.

The key features of Enterprise Steam

The key features of Enterprise Steam are as follows:

  • Data science self-provisioning: This is an easy, UI-based way for data scientists to manage their H2O environments.
  • Central access point for all H2O users: This creates the ease of H2O user management and a single entry point for H2O access to the enterprise server cluster.
  • Govern user resource consumption: Administrators build profiles of resource usage boundaries that are assigned to users or user groups. This places limits on the amount of resources a user can allocate on the enterprise server cluster.
  • Seamless security: User authentication to Enterprise Steam flows through to the authorization of resources on the enterprise server cluster. Enterprise Steam authenticates against the same identity provider (for example, LDAP) that is used by the enterprise server cluster.
  • Configure integration: The administrator configures the integration of H2O with the enterprise server cluster and identity provider.
  • Manage H2O Core versions: The administrator manages one or more H2O Core versions that data scientists use to create H2O clusters for model building.

The H2O MOJO – a flexible, low-latency scoring artifact

The models built from H2O Core are exported as deployable scoring artifacts called H2O MOJOs. MOJOs can run in any JVM environment (except, perhaps, the very smallest edge devices).

In Section 3 – Deploying Your Models to Production Environments, we will learn that MOJOs are ready to deploy directly to H2O software as well as many third-party scoring solutions with no coding required. However, if you wish to directly embed MOJOs into your own software, there is a MOJO Java API to build Java helper classes to expose MOJO capabilities (for example, output reason codes in addition to a prediction) and to provide flexible integration with your scoring input and output.

All MOJOs, regardless of the machine learning algorithm used to build the model, are identical in construct. Therefore, deployment from a DevOps perspective is repeatable and automatable.

The key features of MOJOs

The key features of the MOJO are as follows:

  • Low latency: Typically, this is less than 100 milliseconds per scoring request.
  • Flexible data speeds: MOJOs can make predictions on batch, real-time, and streaming data (for example, on entire database tables, via REST endpoints, and on Kafka topics, respectively).
  • Flexible target systems: This fits into JVM runtimes, including JDBC clients, REST servers, AWS Lambda, AWS SageMaker, Kafka queues, Flink streams, Spark pipelines including streaming, Hive UDF, Snowflake's external functions, and more. Target systems can be specialized H2O scoring software, third-party scoring software, or your own software. A common pattern is to deploy the MOJO to a REST server and consume its predictions via REST calls from a client application (for example, an Excel spreadsheet).
  • Explainability features: In addition to predictions, you can receive K-LIME or Shapley reason codes from the MOJO during live scoring, and you can load the MOJO into H2O Core to score and inspect MOJO attributes (see the sketch after this list).
  • Repeatable deployments: MOJOs are easy to integrate into existing deployment automation (CI/CD) pipelines used by the organization for software deployment.
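As a minimal sketch of the explainability point above, a MOJO can be loaded back into a running H2O cluster to score and to produce Shapley reason codes; the file name is an illustrative placeholder, and validation is the frame from the Hello World example:

mojo_model = h2o.import_mojo("loans-model/XGBoost_model.zip")
predictions = mojo_model.predict(validation)                # regular predictions
shapley = mojo_model.predict_contributions(validation)      # Shapley reason codes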

Note that there is an alternative to the H2O MOJO, called POJO, which is used for infrequent edge cases. This will be explored further in Chapter 8, Putting It All Together.

The workflow using H2O components

Now that we understand the roles and key features of H2O's machine learning at scale components, let's tie them together into a high-level workflow, as represented in the following diagram:

Figure 2.1 – A high-level machine learning at scale workflow with H2O

The workflow occurs in the following sequence:

  1. The administrator configures H2O Enterprise Steam.
  2. The data scientist logs into H2O Enterprise Steam and launches the H2O Core cluster (choosing either H2O-3 or H2O Sparkling Water).
  3. The data scientist uses their favorite client to write model building code in the Python, R, or Java/Scala flavor of the H2O API, authenticating to H2O Enterprise Steam from the IDE and connecting to the H2O cluster that was started via H2O Enterprise Steam (steps 2 and 3 are sketched in code after this list).
  4. The data scientist uses the IDE to iterate through the model building steps with H2O.
  5. After the data scientist decides on the model to be deployed, an H2O AutoDoc is generated, and the H2O MOJO is exported from the IDE.
  6. The data scientist either terminates the H2O cluster or waits for H2O Enterprise Steam to do so after the idle or absolute uptime duration has been exceeded. These durations are configured in a resource profile assigned to the user by the administrator. Note that a terminated cluster can checkpoint its work, and a new H2O cluster can always be launched to continue from the termination point.
  7. The exported H2O MOJO is deployed to any of a diverse set of hosting targets. The model is consumed in a business context, and the achievement of business value begins.
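The following is a hypothetical sketch of steps 2 and 3 using the h2osteam Python client. The URL, credentials, and parameter names are illustrative assumptions and vary by Enterprise Steam version and backend:

import h2osteam
from h2osteam.clients import H2oClient

# Authenticate to Enterprise Steam (placeholder URL and credentials)
h2osteam.login(url="https://steam.example.com:9555",
               username="jane.doe", password="api-token")

# Launch an H2O cluster within the limits of the assigned profile
cluster = H2oClient.launch_cluster(name="loans-cluster",
                                   version="3.36.1.2",
                                   node_count=4,
                                   memory_gb=16)
cluster.connect()  # attach the h2o library in the IDE to this cluster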

H2O key concepts

In the following sections, we will identify and describe the key concepts of H2O that underlie the workflow steps of the previous section. These concepts are necessary to understand the rest of the book.

The data scientist's experience

The data scientist has a familiar experience in building H2O models at scale while being abstracted from the complexities of the infrastructure and architecture on the enterprise server cluster. This is further detailed in the following diagram:

Figure 2.2 – Details of the data scientist's experience with H2O Core

Data scientists use well-known unsupervised and supervised machine learning techniques that scale across the enterprise's distributed infrastructure and architecture. These techniques are expressed through the H2O model building API in familiar languages (such as Python, R, or Java) using familiar IDEs (for example, Jupyter or RStudio).

H2O Flow – A Convenient, Optional UI

H2O generates its own web UI called H2O Flow, which is optional to use during model building. H2O Flow's rich, point-and-click feature set can support a full model building workflow or simply provide handy tricks, as we will demonstrate in Chapter 5, Advanced Model Building – Part 1.

Therefore, the data scientist works in a familiar world that connects to a complex architecture to scale model building to large or massive datasets. We will explore this architecture in the next section.

The H2O cluster

The H2O cluster is perhaps the most central concept for all stakeholders to understand. It is how H2O creates its unit of architecture for building machine learning models on the enterprise server cluster. We can understand this concept using the following diagram:

Figure 2.3 – The architecture of the H2O cluster

When a data scientist launches an H2O cluster, they specify the number of servers to distribute the work across (also known as the number of nodes), along with the amount of memory and the number of CPUs to use for each node. We will learn that these specifications can be configured manually or auto-computed by Enterprise Steam based on the volume of training data.

When the H2O cluster is launched, the H2O software (a single JAR file) is pushed to the specified number of nodes in the enterprise server cluster, where each node allocates the specified memory and CPUs. Then, the H2O software organizes itself into a self-communicating cluster with one node elected as the leader, which communicates with the IDE and coordinates with the remainder of the H2O cluster.

The data scientist connects to the launched H2O cluster from the IDE. Then, the data scientist writes the model building code. Each part of the code is translated by the H2O library in the IDE into instructions to the H2O cluster. Each instruction is sent, in sequence, to the leader node on the H2O cluster, which distributes it to other H2O cluster members where the instructions are executed in parallel. The leader node gathers and combines the results and sends them back to the IDE.

Here are some important notes to bear in mind:

  • Data is ingested directly from the data source into the memory of the H2O nodes. Source data is partitioned between the H2O nodes and not duplicated among them. Ingestion from the storage layer (for example, S3, HDFS, and more) occurs in parallel and, therefore, is fast. Ingestion from external sources (for example, a GitHub repository or JDBC database tables) is not parallelized. In all cases, data does not pass through the IDE or the client.
  • Each H2O cluster is independent and isolated from the others, including the data ingested into them. Thus, two users launching a cluster and using the same data source do not share data.
  • We will see that administrators of Enterprise Steam assign upper limits on the number of concurrent clusters that users can launch, along with the amount of memory, CPU, and other resources a user can specify when launching a cluster.
  • H2O clusters are static. Once launched, the number of nodes and the amount of resources per node do not change until the cluster is terminated and torn down. If one of the nodes goes down, the H2O cluster must be restarted, and the model building steps from the IDE must start from the beginning. For longer-running work, H2O's checkpointing feature helps you continue from a restore point (see the sketch after this list).
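Here is a minimal sketch of the model checkpointing feature mentioned in the last point, shown with H2O's GBM algorithm (which supports the checkpoint parameter) and reusing the predictors, label, and train variables from the Hello World example:

from h2o.estimators import H2OGradientBoostingEstimator

first = H2OGradientBoostingEstimator(ntrees=20, model_id="gbm_checkpoint")
first.train(x=predictors, y=label, training_frame=train)

# Continue training from the saved state by growing additional trees
more = H2OGradientBoostingEstimator(ntrees=50, checkpoint="gbm_checkpoint")
more.train(x=predictors, y=label, training_frame=train)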

Let's look at the life cycle of an H2O cluster, as shown in the following diagram:

Figure 2.4 – The life cycle of the H2O cluster

Let's look at each of the stages of the life cycle, one by one, to understand how they work:

  1. Launch: The data scientist launches an H2O cluster from the Enterprise Steam UI or API. H2O-3 or Sparkling Water is chosen. The H2O cluster size and resources (that is, the number of nodes, memory per node, and other configurations) are manually input, or they are automatically generated by Enterprise Steam based on the data volume input by the user. The H2O cluster is formed as described earlier.
  2. Connect to: The data scientist switches to their IDE and connects to the H2O cluster by specifying its name.
  3. Build models on: The data scientist builds models using H2O. The H2O library used in the IDE translates the H2O API code for each model building iteration into instructions. These are sent to the leader node and distributed across the H2O cluster.
  4. Stop: The H2O cluster is shut down. Resources are released, and the H2O software is removed from each node of the H2O cluster. This can be done by the user from the IDE or can occur automatically after a duration of idle time or when the absolute running time of the H2O cluster has been exceeded (these durations were specified in the H2O cluster launch during step 1 of the life cycle). Though not running, information regarding this cluster is still available to the user (for example, the name, the H2O version, and the size).

Stop/Save Data & Restart: This is an alternative to Stop that is available when the Enterprise Steam administrator configures this option for a user or user group. In this case, when the H2O cluster is stopped, it saves data from the model building steps (that is, it saves the model building state) to the storage layer. When the cluster is restarted (using the same name as when it was launched), the cluster returns to its previous state.

  5. Delete: This stops the cluster (if it is running) and permanently deletes all references to the H2O cluster. If the cluster was stopped with the model building state saved, this data is permanently deleted as well.

Enterprise Steam as an H2O gateway

All H2O administration activities occur on Enterprise Steam, and users must launch H2O clusters through Steam. This "all roads lead to Enterprise Steam" approach means that Steam governs users and their H2O clusters before they are launched on the enterprise system. This is detailed in the following diagram:

Figure 2.5 – Enterprise Steam viewed as an H2O gateway to the enterprise cluster

Administrators configure settings to manage H2O users and integrate Enterprise Steam with the enterprise server cluster. Additionally, administrators store the H2O software versions that are pushed to the server cluster when an H2O cluster is launched and removed when the cluster is stopped and its resources are released. Administrators also have access to usage data for each user. This is all done through an administration-only UI.

Administrators configure users and how users launch H2O clusters in the enterprise environment. These configurations define limits on the number of clusters a user can run concurrently, the size of each cluster (that is, the number of nodes), and the amount of resources (for example, memory per node) allocated to each cluster that is launched. Configurations also define when the cluster will stop or delete if the user does not do so manually from the H2O model building code in the IDE. A set of such configurations is defined as a profile, and one or more profiles are assigned to users or user groups. Therefore, administrators can designate some users as power users and others as light users.

Users authenticate to Enterprise Steam via the same identity provider (for example, LDAP) that was implemented to authorize access to resources on the enterprise server cluster environment (for example, S3 buckets). Enterprise Steam passes the user identity when the user launches a cluster, and this identity is used during authorization challenges on the enterprise system. Users in their IDEs must authenticate against the Enterprise Steam API to connect to the clusters they have launched.

Does H2O Core Require Enterprise Steam?

Note that H2O Core does not require Enterprise Steam. Enterprise administrators can configure their enterprise server cluster infrastructure to allow H2O clusters to be launched on this infrastructure directly.

However, this approach is not a sound enterprise practice. It introduces a loss of control and governance that Enterprise Steam provides as a centralized H2O gateway to secure, manage, and log users, as elaborated in this section. Additionally, Enterprise Steam provides benefits to users by freeing them from the technical steps involved with integrating H2O Core with the enterprise cluster when launching H2O clusters, for example, Kerberos security requirements. The enterprise benefits of Enterprise Steam are explored in greater detail in Chapter 11, The Administrator and Operations Views, and in Chapter 12, The Enterprise Architect and Security Views.

Also, bear in mind that H2O Core is free and open source, whereas Enterprise Steam is not.

Enterprise Steam and the H2O Core high-level architecture

Now that we know how H2O clusters are formed and the role Enterprise Steam plays in administering H2O users and launching H2O clusters, let's understand Enterprise Steam and the H2O Core architecture from a high-level deployment perspective. The following diagram describes this deployment architecture:

Figure 2.6 – Enterprise Steam and the H2O Core high-level deployment architecture

Enterprise Steam runs on its own dedicated server that communicates with the enterprise server cluster via HTTP(S). As mentioned earlier, Enterprise Steam stores the H2O Core (H2O-3 or Sparkling Water) JAR file that is pushed to the server cluster, which then self-organizes into a coordinated but distributed H2O cluster. This H2O cluster can be a native YARN or Kubernetes job, depending on which backend is implemented. Note that H2O-3 is run on a Map-Reduce framework, and Sparkling Water is run on the Spark framework.

An H2O-3 or Sparkling Water API library is installed in the data science IDE environment (for example, a pip install of the H2O-3 package in the Jupyter environment). It must match the version that is used to launch the cluster from Enterprise Steam. As mentioned previously, data scientists use the IDE to authenticate to Enterprise Steam, connect to the H2O cluster, and write H2O model building code. The H2O model building code is translated by the H2O client library into REST messages that are sent to the H2O cluster's leader node. Then, the work is distributed across the H2O cluster, and the results are returned to the IDE.

Note that enterprise clusters can be on-premise, cloud infrastructure-as-a-service, or managed service implementations. They can be, for example, Kubernetes or Cloudera CDH on-premise or in the cloud, or Cloudera CDP or Amazon EMR in the cloud. The full deployment possibilities are discussed in more detail in Chapter 12, The Enterprise Architect and Security Views.

H2O Platform Choices

In this book, H2O At Scale technology refers to the combination of H2O Enterprise Steam, H2O Core (H2O-3 or H2O Sparkling Water), and the H2O MOJO. H2O At Scale integrates with an enterprise server cluster for model building and an enterprise scoring environment for model deployment.

H2O At Scale can be implemented with these components alone. Alternatively, it can be implemented as a subset of the larger H2O machine learning platform and capability set called H2O AI Cloud. The H2O AI Cloud platform is described in greater detail in Section 5 – Broadening the View – Data to AI Applications with the H2O AI Cloud Platform.

Sparkling Water allows users to code in H2O and Spark seamlessly

The following code shows a simple example of Spark and H2O integrated in the same code using H2O Sparkling Water. It assumes a Sparkling Water session is already running, that is, a SparkSession named spark and an H2OContext named h2oContext have been created:

# Import data as a Spark DataFrame
loans_spark = spark.read.load("loans.csv", format="csv",
                              sep=",", inferSchema="true", header="true")
# Spark data engineering code (any Spark SQL or Spark DataFrame
# operations; the filter below is an illustrative placeholder)
loans_spark = loans_spark.filter(loans_spark["bad_loan"].isNotNull())
# Convert the Spark DataFrame to an H2OFrame
loans = h2oContext.asH2OFrame(loans_spark)
# Continue with the H2O model building steps from the previous code example
loans.describe()

The code shows Spark importing data, which is held as a Spark DataFrame. Spark SQL or the Spark DataFrame API is used to engineer this data into a new DataFrame; this Spark DataFrame is then converted into an H2OFrame, on which H2O model building is performed. Therefore, the user iterates seamlessly between Spark and H2O code in the same API language and IDE.

The idea of the H2O cluster is still fundamentally true for Sparkling Water. It now expresses the H2O cluster architecture within the Spark framework. Details of this architecture are elaborated in Chapter 12, The Enterprise Architect and Security Views.

MOJOs export as DevOps-friendly artifacts

Data scientists build models, but the end goal is to put models into a production environment where predictions are made in a business context. MOJOs make this last mile of deployment easy. MOJOs are exported with a single line of code. Whether the model was built using Python or R, and whether it is a generalized linear model, an XGBoost model, or a stacked ensemble, all MOJOs are identical from a DevOps perspective. This makes model deployment repeatable and, thus, capable of fitting into existing automated CI/CD pipelines that are used throughout the organization.

Summary

In this chapter, we laid the foundation for understanding H2O machine learning at scale. We started by reviewing a bare minimum Hello World code example and discussed the problems of scale around it. Then, we introduced the H2O Core, Enterprise Steam, and MOJO technology components and how these can overcome problems of scale. Finally, we extracted a set of key concepts from these technologies to deepen our understanding.

In the next chapter, we will use this understanding to begin our journey of learning how to build and deploy world-class models at scale. Let the coding begin!
