Integration overview
This chapter describes how to integrate the various products to build a scalable solution.
The following topics are described in this chapter:
Architecture overview
System configurations
Deployment options
IBM Watson Machine Learning Accelerator and Hortonworks Data Platform
IBM Watson Studio Local with Hortonworks Data Platform
IBM Watson Studio Local with IBM Watson Machine Learning Accelerator
IBM Spectrum Scale and Hadoop integration
Security
2.1 Architecture overview
This architecture overview describes the software components that are used to build the artificial intelligence (AI) environment as the platform. The platform includes the AI frameworks; the software that is used to build a multi-tenant data science environment; the distributed computing, training, and inference environment; and a tiered data management environment (Figure 2-1). This architecture addresses some of the bigger challenges that face developers and data scientists by cutting down the time that is required for AI system training and by simplifying the development experience.
Figure 2-1 How the components fit together
The products and versions that were used during the lab testing for this publication are listed in Table 2-1.
Table 2-1 Software levels
Product                                    Version
IBM PowerAI                                1.5.4
IBM Watson Machine Learning Accelerator    1.1.2
H2O Driverless AI                          1.4.2
IBM Watson Studio Local                    1.2.1.0
Hortonworks Data Platform (HDP)            2.6.5.0
Hortonworks DataFlow (HDF)                 3.1.2.0
IBM Elastic Storage Server (IBM ESS)       5.3.1.1
IBM Spectrum Scale                         5.0.1.2
IBM Power Systems servers                  POWER9 and POWER8 processors
2.1.1 Infrastructure stack
The infrastructure includes compute nodes, GPU accelerators, a storage solution, and networking.
Compute nodes
The resource of primary interest for AI is the GPUs, and they are on the IBM Power Systems server nodes:
IBM Power System AC922 servers feature POWER9 CPUs and support 2 - 6 NVIDIA Tesla V100 GPUs with NVLink providing CPU:GPU bandwidth speeds of 350 GBps. This system supports up to 2 TB total memory. For more information, see IBM Power System AC922.
IBM Power System S822LC for High-Performance Computing (Power S822LC for HPC) servers feature POWER8 CPUs and support 2 - 4 NVIDIA Tesla P100 GPUs with NVLink, providing CPU:GPU bandwidth speeds of 64 GBps. For more information, see IBM Power System S822LC for High Performance Computing.
Table 2-2 provides the configuration details for the compute systems that were provisioned in the lab environment for this book.
Table 2-2 System details
Server type                              GPU                     OS         CPU                         Memory   Drives
Power AC922 server                       Four NVIDIA V100 GPUs   RHEL 7.5   40 POWER9 cores @ 3.8 GHz   1 TB     Two 1.92 TB solid-state drives (SSDs)
Two Power LC922 servers                  N/A                     RHEL 7.5   40 POWER9 cores @ 3.8 GHz   512 GB   Two 128 GB SATA DOM drives; four 8 TB hard disk drives (HDDs)
Two Power S822LC for Big Data servers    N/A                     RHEL 7.5   20 POWER8 cores @ 3.5 GHz   512 GB   Two 128 GB SATA DOM drives; four 1.92 TB SSDs; two 1.6 TB NVMe SSDs
Power S822LC for HPC server              N/A                     RHEL 7.5   20 POWER8 cores @ 3.5 GHz   1 TB     Two 3.8 TB SSDs; one 2.9 TB NVMe SSD
GPU
Four NVIDIA Tesla V100 GPUs are integrated into the Power AC922 server. The Tesla V100 is the most advanced data center GPU ever built to accelerate AI, high-performance computing (HPC), and graphics. It is based on the NVIDIA Volta architecture, comes in 16 GB and 32 GB configurations, and offers the performance of up to 100 CPUs in a single GPU.
Storage
To support the variety and velocity of data that is used and produced by AI, the storage system must be intelligent, have enough capacity, and be highly performant. Key attributes include tiering and public cloud access, multiprotocol support, security, and extensible metadata to facilitate data classification. Performance is multi-dimensional: data acquisition, preparation, and manipulation; high-throughput model training on GPUs; and latency-sensitive inference. The type of storage that is used depends on the location of the data and the stage of processing for AI. Corporate and older data usually resides in an organizational data lake in the Hadoop Distributed File System (HDFS).
IBM ESS combines IBM Spectrum Scale software with IBM POWER8 processor-based I/O-intensive servers and dual-ported storage enclosures. IBM Spectrum Scale is the parallel file system at the heart of IBM ESS and scales system throughput as it grows while still providing a single namespace, which eliminates data silos, simplifies storage management, and delivers high performance.
For more information, see IBM Elastic Storage Server.
The storage configuration that is allocated for the integration that is used in this publication is shown in Table 2-3.
Table 2-3 IBM ESS GS2
Specification                    Details
Machine type and model number    5126-GS2
Serial number                    218E54G
Drive type                       Forty-eight 400 GB SSDs
File system capacity             11.17 TB
File system block size           16 MB
Networking
The exact architecture, vendors, and components that are needed to build the network subsystem depend upon the organization’s preference and skills. InfiniBand or high-speed Ethernet and a network topology that allows both north and south (server to storage) traffic and can also support east and west (server to server) traffic are required. Adopting a topology that extends to an InfiniBand Island structure enables the training environment to scale for large clusters. An adequate network subsystem with necessary throughput and bandwidth to connect the different tiers of storage is required.
IBM ESS offers network adapter options. Three PCI slots are reserved for SAS adapters and one PCI slot is configured by default with a 4-port 10/100/1000 Ethernet adapter for management. Three other PCIe3 slots are available to configure with any combination of Dual-Port 10 GbE, Dual-Port 40 GbE, or Dual-Port InfiniBand PCI adapters.
The following networking interfaces are allocated for the integration that is deployed in this publication:
10 Gb administration network
56 Gb EDR InfiniBand network
2.2 System configurations
A small starter configuration for IBM Watson Studio Local, IBM Watson Machine Learning Accelerator, and HDP depends on the needs of the organization. For example, choosing whether to use a 3-node or 9-node IBM Watson Studio Local cluster depends on the number of users that need access to the system. If an organization expects 10 - 30 users to use the system concurrently, then a 3-node configuration is sufficient. However, if more than 30 concurrent users are expected, then a 9-node cluster should be put in place.
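As a minimal illustration of this sizing rule (the thresholds are taken from the guidance above; the function name is ours and this is not an official sizing tool), the decision can be expressed in a few lines of Python:
def watson_studio_local_cluster_size(concurrent_users: int) -> str:
    # Up to 30 concurrent users fit a 3-node cluster; more call for a 9-node cluster.
    return "3-node" if concurrent_users <= 30 else "9-node"

print(watson_studio_local_cluster_size(25))   # 3-node
print(watson_studio_local_cluster_size(45))   # 9-node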
IBM Watson Machine Learning Accelerator is the cluster manager, and can set up service-level agreements (SLAs) between the instances and usage of resources. IBM Watson Machine Learning Accelerator provides many optimizations that accelerate performance; improve resource utilization; and reduce installation, configuration, and management complexities. HDP integration with IBM Watson Studio Local and IBM Watson Machine Learning Accelerator presents a unique mix of technologies that improve the work of a data scientist. This integration makes your work with big data more efficient, and it makes data science more accessible and scalable by bringing all enterprise data to a new level of accurate prediction.
2.2.1 IBM Watson Machine Learning Accelerator configuration
IBM Watson Machine Learning Accelerator is optimized for the Power AC922 and the Power S822LC for High-Performance Computing servers. IBM Watson Machine Learning Accelerator is optimized to use IBM Power Systems servers with NVLink and NVIDIA GPUs that are not available on any other platforms. IBM Watson Machine Learning Accelerator is supported on Power AC922 servers with NVIDIA Tesla V100 GPUs, and Power S822LC servers with NVIDIA Tesla P100 GPUs.
Powered by the revolutionary NVIDIA Volta architecture, the Tesla V100 GPU satisfies the most stringent demands of today’s next-generation AI, analytics, deep learning (DL), and HPC workloads. The Tesla V100 GPU accomplishes these tasks by delivering 112 teraflops (TFLOPS) of deep-learning performance to accelerate compute-intensive workloads. It delivers over 40% performance improvements for HPC and over 4X deep-learning performance improvements over the previous-generation NVIDIA Pascal architecture-based Tesla P100 GPU.
The base system configuration is as follows:
Two IBM POWER9 or POWER8 CPUs.
128 GB or more of memory.
NVIDIA Tesla P100 or V100 GPUs with NVLink are required.
An NVIDIA NVLink interface to the Tesla GPUs is preferred.
Here are the software requirements:
Red Hat Enterprise Linux 7.5
Third-party software from NVIDIA, such as CUDA, CUDA Deep Neural Network (cuDNN), and NCCL for CUDA.
Table 2-4 and Table 2-5 list the minimum hardware and software system requirements for running IBM Watson Machine Learning Accelerator in a production environment. You might have extra requirements (such as extra CPU and RAM) depending on the Spark Instance Groups (SIGs) that run on the hosts, especially for compute hosts that run workloads.
Table 2-4 Hardware requirements
RAM: 64 GB on management hosts and 32 GB on compute hosts. In general, the more memory that your hosts have, the better the performance.
Disk space that is required to extract the installation files from the IBM Watson Machine Learning Accelerator installation package: 16 GB (first management host only).
Disk space that is required to install IBM Spectrum Conductor: 12 GB on management hosts and 12 GB on compute hosts.
Disk space that is required to install IBM Spectrum Conductor Deep Learning Impact: 11 GB on management hosts and 11 GB on compute hosts.
Extra disk space (for SIG packages, logs, and other items): might be 30 GB for a larger cluster on management hosts; on compute hosts, plan for 1 GB * N slots + the sum of the service package sizes (including dependencies). The disk space requirements depend on the number of SIGs and the Spark applications that you run. Long-running applications such as notebooks and streaming applications can generate large amounts of data that is stored in Elasticsearch.
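The compute-host formula for extra disk space in Table 2-4 can be illustrated with a small calculation. The slot count and package sizes that follow are assumptions for the example only, not recommendations:
# Extra disk space for a compute host: 1 GB per slot plus the sum of the
# SIG service package sizes (including dependencies), per Table 2-4.
slots = 20                                   # assumed number of slots on the host
service_package_sizes_gb = [2.5, 1.2, 0.8]   # assumed service package sizes in GB
extra_disk_gb = 1 * slots + sum(service_package_sizes_gb)
print("Plan for at least %.1f GB of extra disk space" % extra_disk_gb)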
Table 2-5 Software requirements
POWER8 processor: RHEL 7.5 (ppc64le) with the cuDNN 7.3.1 library, NVIDIA CUDA 10.0.130, NVIDIA GPU driver 410.72, Anaconda 5.2, and NVIDIA NCCL 2.3.5.
POWER9 processor with the security fix RHSA-2018:1374 - Security Advisory: RHEL 7.5 (ppc64le) with the cuDNN 7.3.1 library, NVIDIA CUDA 10.0, NVIDIA GPU driver 410.72, Anaconda 5.2, and NVIDIA NCCL 2.3.5.
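A quick way to confirm that a host matches the GPU software levels in Table 2-5 is to query the GPU driver and the CUDA toolkit, for example from a notebook or a Python prompt. This is only an illustrative check and assumes that nvidia-smi and nvcc are on the PATH:
import subprocess

# Report the installed GPU driver level and the GPU model.
print(subprocess.check_output(
    ["nvidia-smi", "--query-gpu=driver_version,name", "--format=csv"]).decode())

# Report the installed CUDA toolkit level.
print(subprocess.check_output(["nvcc", "--version"]).decode())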
To get the preferred performance from integrating Hadoop with IBM Watson Studio Local and IBM Watson Machine Learning Accelerator, start by choosing the correct hardware and software stack. The planning stage is vital in determining the performance and the total cost of ownership (TCO) that is associated with the solution.
2.2.2 IBM Watson Studio Local configurations
This section describes what small starter configurations look like for IBM Watson Studio Local (formerly known as IBM Data Science Experience (IBM DSX)) deployed with an HDP datalake. It also describes advanced configurations.
IBM and Hortonworks worked together to integrate IBM Watson Studio Local with HDP.
The system requirements for IBM Watson Studio Local describe in detail the hardware and software requirements for 3-node, 7-node, and 9-node configurations. These configurations are recommended for production environments. For testing purposes, you can choose a smaller configuration.
This section describes the requirements for 3-node, 7-node, and 9-node configurations.
Requirements for a 3-node configuration
Here is the example three-node configuration that we deployed:
Number of servers used: Three virtual machines (VMs) (servers can be physical machines or VMs)
Installed operating system: Red Hat Enterprise Linux 7.5 (Maipo)
Table 2-6 Three-node configuration
Node             CPUs    RAM      Storage
Master node 1    Eight   126 GB   One 1.8 TB solid-state drive (SSD)
Master node 2    Eight   126 GB   One 1.8 TB SSD
Master node 3    Eight   126 GB   One 1.8 TB SSD
Figure 2-2 and Figure 2-3 show the storage space and CPU output from one of the VMs.
[root@p460a26 ~]# lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
vda 252:0 0 1.8T 0 disk
├─vda1 252:1 0 10M 0 part
├─vda2 252:2 0 1G 0 part /boot
├─vda3 252:3 0 1.2T 0 part
│ ├─rhel_p460a26-root 253:0 0 100G 0 lvm /
│ ├─rhel_p460a26-swap 253:1 0 4G 0 lvm [SWAP]
│ ├─rhel_p460a26-var 253:2 0 100G 0 lvm /var
│ ├─rhel_p460a26-home 253:3 0 50G 0 lvm /home
│ ├─rhel_p460a26-ibm 253:4 0 500G 0 lvm /ibm
│ └─rhel_p460a26-data 253:5 0 500G 0 lvm /data
└─vda4 252:4 0 533.5G 0 part
└─rhel_p460a26-docker 253:6 0 200G 0 lvm
Figure 2-2 Disk space of a VM
[root@p460a26 ~]# lscpu
Architecture: ppc64le
Byte Order: Little Endian
CPU(s): 8
On-line CPU(s) list: 0-7
Thread(s) per core: 1
Core(s) per socket: 1
Socket(s): 8
NUMA node(s): 1
Model: 2.0 (pvr 004d 0200)
Model name: POWER8 (architected), altivec supported
Hypervisor vendor: KVM
Virtualization type: para
L1d cache: 64K
L1i cache: 32K
NUMA node0 CPU(s): 0-7
Figure 2-3 CPU output of a VM
Requirements for a 7-node configuration
For a 7-node configuration, consider the following items:
The installation requires at least 10 GB on the root partition.
If you plan to place /var on its own partition, reserve at least 10 GB for that partition.
IBM Statistical Package for the Social Sciences (IBM SPSS®) Modeler add-on requirement: If you plan to install the SPSS Modeler add-on, add 0.5 CPU and 8 GB of memory for each stream that you plan to create.
All servers must be synchronized in time (ideally through NTP or Chrony).
SSH between nodes should be enabled.
YUM should not already be running. (A simple cross-node check for these items is sketched after this list.)
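One possible way to verify these items from the installation node is sketched below. The host names are placeholders, and the commands assume chrony or ntp for time synchronization and passwordless SSH between nodes:
import subprocess

nodes = ["wsl-node1", "wsl-node2", "wsl-node3"]   # placeholder host names

def ok(command: str) -> bool:
    # Return True when the (remote) command exits with status 0.
    return subprocess.call(command, shell=True,
                           stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL) == 0

for host in nodes:
    time_sync = ok("ssh %s 'chronyc tracking || ntpstat'" % host)
    ssh_ok = ok("ssh -o BatchMode=yes %s true" % host)
    yum_free = ok("ssh %s '! pgrep -x yum'" % host)
    print("%s: time sync=%s, ssh=%s, yum not running=%s" % (host, time_sync, ssh_ok, yum_free))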
Table 2-7 shows a sample configuration.
Table 2-7 Sample configuration of a 7-node cluster
Control or storage nodes: three servers (bare metal or VMs), eight cores, and 46 GB of RAM each. Disk space: minimum 300 GB with Extended File System (XFS) format for the installer files partition + minimum 500 GB with XFS format for the data storage partition + minimum 200 GB of extra raw disk space for Docker.
Compute nodes: three servers (bare metal or VMs), 16 cores, and 64 GB of RAM each. Disk space: minimum 300 GB with XFS format for the installer files partition + minimum 200 GB of extra raw disk space for Docker. If you add more cores, a total of 48 - 50 cores that are distributed across multiple nodes is recommended.
Deployment node: one server (bare metal or VM), 16 cores, and 64 GB of RAM. Disk space: minimum 300 GB with XFS format for the installer files partition + minimum 200 GB of extra raw disk space for Docker. If you add more cores, a total of 48 - 50 cores that are distributed across multiple nodes is recommended.
Requirements for a 9-node configuration
Table 2-8 shows the minimum requirements for a 9-node configuration.
Table 2-8 Nine-node configuration
Control nodes: three servers (bare metal or VMs), four cores, and 16 GB of RAM each. Disk space: minimum 500 GB with XFS format for the installer files partition.
Storage nodes: three servers (bare metal or VMs), eight cores, and 48 GB of RAM each. Disk space: minimum 500 GB with XFS format for the installer files partition + minimum 500 GB with XFS format for the data storage partition.
Compute nodes: three servers (bare metal or VMs), 16 cores, and 64 GB of RAM each. Disk space: minimum 500 GB with XFS format for the installer files partition. If you add more cores, a total of 48 - 50 cores that are distributed across multiple nodes is recommended.
2.2.3 Configuring an HDP system
Figure 2-4 shows some recommendations for HDP deployment with IBM Watson Studio Local and IBM Watson Machine Learning Accelerator for a production system.
Figure 2-4 HDP recommended configurations
2.2.4 Configuring a proof of concept
Table 2-9 shows a proof of concept (PoC) configuration for a minimally functioning environment with cost-sensitive variations.
Table 2-9 Proof of concept configuration
Item                      System management node          Master/edge node                Worker node
Cluster type              All                             All                             PoC
Server model              1U Power LC921 server           1U Power LC921 server           1U Power LC922 server
Servers                   One                             One                             Three
Sockets                   Two                             Two                             Two
Cores                     32                              40                              44
Memory                    32 GB                           256 GB                          256 GB
Storage                   Two 4 TB HDDs                   Four 4 TB HDDs                  Four 4 TB HDDs
Storage controller        MicroSemi PM8069 (internal)     MicroSemi PM8069 (internal)     MicroSemi PM8069 (internal)
Network - 1 GbE (1)       Internal (4-port OS)            Internal (4-port OS)            Internal (4-port OS)
Cables - 1 GbE (1)        Three (two OS + one BMC)        Three (two OS + one BMC)        Three (two OS + one BMC)
Network - 10 GbE (2)      One 2-port Intel (two ports)    Two 2-port Intel (four ports)   One 2-port Intel (two ports)
Cables - 10 GbE (2)       Two cables (DACs)               Four cables (DACs)              Two cables (DACs)
Operating system          RHEL 7.5 for POWER9 processor-based systems (all node types)

1 The 1 GbE network infrastructure hosts the following logical networks: campus, management, provisioning, and service. Each node uses two cables for the OS ports and one for the baseboard management controller (BMC).
2 The 10 GbE network infrastructure hosts the data network. The cables are direct-access cables (DACs).
2.2.5 Conclusion
IBM Watson Studio Local, IBM Watson Machine Learning Accelerator, and HDP deployed on IBM Power Systems servers, with their many hardware threads, high memory bandwidth, and tightly integrated NVIDIA GPUs, are suitable for running machine learning (ML) and DL computations that use open source frameworks on large data sets in enterprise environments.
2.3 Deployment options
In this section, we consider some of the deployment options for the IBM AI solutions and H2O Driverless AI.
2.3.1 Deploying IBM Watson Studio Local in stand-alone mode or with IBM Watson Machine Learning Accelerator
When IBM Watson Studio Local is installed as a stand-alone product, it provides a premier enterprise development and deployment environment for data scientists. Projects can be used to segment lines of business (LOBs) and use cases. Its integration with a datalake through the Hadoop Integration service is superb. You can use its environment for Jupyter with Python 3.6 and IBM PowerAI V1.5.3 for GPU to leverage the Power Systems nodes that have GPUs.
IBM Watson Machine Learning Accelerator is installed on POWER8 or POWER9 processor-based nodes with GPUs. Version 1.1.2 includes IBM PowerAI V1.5.4, which includes recent releases of DL frameworks like TensorFlow 1.12 with Keras.
IBM Watson Machine Learning Accelerator also supports multiple LOBs and use cases through IBM Spectrum Conductor consumers. Consumers can have different resources that are allocated to them and sharing policies can be defined statically or dynamically to give administrators full control over how the cluster is used. IBM Watson Machine Learning Accelerator Deep Learning Impact can help data scientists reach their goals quicker.
IBM Watson Studio Local can complement an IBM Watson Machine Learning Accelerator cluster by submitting remote jobs to run on it. The IBM Watson Studio Local notebooks can use Sparkmagic to connect to a Livy service application that is installed on IBM Watson Machine Learning Accelerator and submit Spark workloads on the SIG that is associated with the Livy application.
IBM PowerAI, IBM Watson Machine Learning Accelerator, and IBM Watson Studio Local can be installed in an IBM Cloud Private environment. That option enables resource sharing by all three products.
2.3.2 Using the Hadoop Integration service versus using an Apache Livy connector
IBM Watson Studio Local can connect to a Hadoop cluster by using an Apache Livy connection or the new Hadoop Integration service. The Hadoop Integration service has many advantages:
You can push customized environments that are running in IBM Watson Studio Local to HDP.
You can select the pushed environment in which notebooks connecting to HDP run.
On HDP servers that are secured with Kerberos, use the data connector so that jobs may run as the calling user without needing a keytab for each user, which is important for fine-grained authorization and the associated logging of resource usage.
You can define HDFS and Hive data sources.
You can shape and transform data in HDP by using the IBM Watson Studio Local Refinery.
Using the IBM Watson Studio Local Hadoop Integration service to connect to HDP and Cloudera datalakes is a best practice.
2.3.3 Deploying H2O Driverless AI in stand-alone mode or within IBM Watson Machine Learning Accelerator
H2O Driverless AI can be installed as a stand-alone product or as an IBM Watson Machine Learning Accelerator application. If you install H2O Driverless AI as an IBM Watson Machine Learning Accelerator application, you can manage it (start and stop) from the IBM Watson Machine Learning Accelerator console.
To install H2O Driverless AI as an application on IBM Watson Machine Learning Accelerator, download the Linux tar.sh file at the following website:
The H2O Driverless AI software is available for use in pure user-mode environments as a self-extracting tar.sh archive. This form of installation does not require a privileged user to install or to run.
For more information about the H2O.ai tar.sh package, see Using Driverless AI 1.5.4.
The IBM NVLink interface enables the H2O Driverless AI next-generation AI platform to get the maximum performance gains. Together, H2O Driverless AI and IBM PowerAI provide companies with a data science platform or an AI workbench that addresses a broad set of use cases for ML and DL in every industry.
These integrations happen mainly with H2O Driverless AI, but there is a possibility to integrate with H2O Sparkling Water as well.
H2O Driverless AI
H2O Driverless AI provides the following functions:
Automates data science and ML workflows.
H2O Driverless AI is started on a single host, which can either have GPUs or run with CPUs only.
Shared file system for data and logs.
If you use GPUs, the entire host is taken (with the current integration).
An application instance is created for each user of H2O Driverless AI.
Environment variables that are passed as parameters are used to configure H2O Driverless AI.
If H2O Driverless AI goes down, IBM Spectrum Conductor fails it over and starts it on another host.
Figure 2-5 shows H2O Driverless AI on the management console.
Figure 2-5 H2O Driverless AI as shown in the management console
H2O Sparkling Water
H2O Sparkling Water provides the following functions:
You can use it to combine the ML algorithms of H2O Driverless AI with the capabilities of Spark.
It runs as a notebook in a SIG (an initialization sketch follows this list).
When the notebook starts, it forms a mini-cluster of executors. 
These executors stay alive for the entire duration of the notebook. 
IBM Spectrum Conductor disables preemption to prevent reclamation of these hosts. 
Multiple users can share an H2O Sparkling Water notebook instance or have dedicated ones per user.
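Inside such a notebook, H2O Sparkling Water is typically initialized as shown in the following sketch. It assumes that the pysparkling and h2o Python packages that match the SIG's Spark version are available in the notebook environment, and the file path is a placeholder:
from pysparkling import H2OContext
import h2o

# Start (or attach to) an H2O cluster on the executors of the Spark Instance Group.
hc = H2OContext.getOrCreate(spark)

# Move a Spark DataFrame into H2O so that H2O algorithms can use it.
spark_df = spark.read.csv("hdfs:///user/user1/datasets/cars.csv", header=True, inferSchema=True)
h2o_frame = hc.asH2OFrame(spark_df)
h2o_frame.summary()    # basic statistics for each column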
If you want to work with H2O Driverless AI in IBM Watson Studio Local, install the H2O Flow add-on to use an H2O Flow session within IBM Watson Studio Local to create documents and work with models.
For more information, see H2O Flow add-on.
IBM Spectrum Conductor integration with H2O Driverless AI
At the time of writing, H2O Driverless AI does not support NVIDIA CUDA 10.0, which is a prerequisite for IBM Watson Machine Learning Accelerator V1.1.2 and IBM PowerAI V1.5.4. The workaround is to install the NVIDIA CUDA 9.0 libraries without the driver. The IBM Watson Machine Learning Accelerator cluster continues to run with the CUDA 10.0 driver, but CUDA 9.0 libraries are available for H2O Driverless AI.
To configure the workaround, complete the following steps:
1. Run the following command:
wget https://developer.nvidia.com/compute/cuda/9.0/Prod/local_installers/cuda-repo-rhel7-9-0-local-9.0.176-1.ppc64le-rpm
2. Run the following command:
rpm -i cuda-repo-rhel7-9-0-local-9.0.176-1.ppc64le.rpm
3. Run the following command:
yum install cuda-toolkit-9-0
4. Install H2O Driverless AI by completing the following steps:
a. Download the Linux tar.sh package from the H2O website.
b. Download conductor-h2o-driverlessai-master.zip from GitHub.
c. Install H2O Driverless AI on all the nodes by reviewing the following video at Box.
d. Run cd /var/dai.
e. Install dai.sh.
f. Extract the GitHub package to the same directory. Make sure that the directory has the same owner and group as the execution user, for example, egoadmin.
5. After installing H2O Driverless AI, edit or create the /etc/dai/EnvironmentFile.conf file to contain the following parameter definitions:
DRIVERLESS_AI_CUDA_VERSION=cuda-9.0
LD_LIBRARY_PATH=/usr/local/cuda-9.0/lib64:/usr/local/cuda-9.0/extras/CUPTI/lib64
6. Log in to the IBM Spectrum Conductor console to register H2O Driverless AI as an application instance that can be stopped and started by using the application management support. Starting at the console, complete the following steps:
a. Click Workload and select the application instance.
b. Click Add.
c. Register by selecting the YAML file and completing the form.
d. Specify the top-level consumer.
e. Specify the resource group (RG).
f. Specify the data directory, for example, /var/dai-data.
g. Specify the execution user, that is, the OS user that runs the application, which is the same as the SIG execution user.
h. Specify the installation directory, for example, /var/dai/dai....
i. Click Register.
7. Select the H2O Driverless AI application instance and start it.
8. Review the output section for dai_url.
9. Click dai_url to start H2O Driverless AI.
2.3.4 Running Spark jobs
In an environment with IBM Watson Machine Learning Accelerator, IBM Watson Studio Local, and HDP, the preferred place to run Spark jobs depends on several factors. When the data of interest is in the HDP datalake, running Spark jobs on the datalake takes advantage of the data locality and distributed nature of Spark.
If you want to take advantage of GPUs and the IBM PowerAI optimized DL frameworks, run the jobs on IBM Watson Machine Learning Accelerator or IBM Watson Studio Local (assuming IBM Watson Studio Local has Power Systems servers with GPUs).
2.4 IBM Watson Machine Learning Accelerator and Hortonworks Data Platform
The following main components are included with IBM Watson Machine Learning Accelerator: 
IBM Spectrum Conductor Deep Learning Impact: 
 – Data Manager and extract, transform, and load (ETL)
 – Training Visualization and Training
 – Hyper-parameter optimization
IBM Spectrum Conductor:
 – Multi-tenancy support and security
 – User reporting and charge back
 – Dynamic resource allocation
 – External data connectors
Officially, IBM Watson Machine Learning Accelerator with IBM Spectrum Conductor integrates with IBM Cloud Private and IBM Watson Studio. Through this integration, it runs notebooks from SIGs to submit Spark workloads.
However, IBM Watson Machine Learning Accelerator can easily integrate with HDP in two ways:
You can configure data connectors that enable HDFS data to be accessed by IBM Watson Machine Learning Accelerator models.
IBM Watson Machine Learning Accelerator can submit a Spark job to an HDP cluster that is configured with Apache Livy. Running the job directly on HDP takes advantage of the data locality.
For both approaches, you can issue the requests directly from a notebook by using the following sample code:
Livy
%load_ext sparkmagic.magics
To see all the available options for the sparkmagic extension inside the notebook, run the following command:
%spark?
%load_ext sparkmagic.magics
%spark add -s <session> -l python -u <hdp-livy_URL> -a u -k
The Livy UI provides full details about each interaction inside the session, including input and output returned.
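After the session is added, code cells can be pushed to the remote SIG with the %%spark cell magic, as in the following sketch (the session name, HDFS path, and column name are placeholders):
%%spark -s <session>
# This cell runs remotely in the SIG through the Livy session.
df = spark.read.csv("hdfs:///user/user1/datasets/cars.csv", header=True)
df.groupBy("origin").count().show()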
HDFS
df = spark.read.load("hdfs://<hdp-domain_url>:8020/user/user1/datasets/cars.csv", format="csv")
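The returned DataFrame can then be explored in the notebook as usual, for example:
df.printSchema()                      # inspect the schema that Spark inferred
print(df.count(), "rows read from HDFS")
df.show(5)                            # preview the first five rows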
2.5 IBM Watson Studio Local with Hortonworks Data Platform
There are two types of integration for IBM Watson Studio Local with Hadoop:
Hadoop is used as a data source. IBM Watson Studio Local supports both secure and non-secure connections to HDFS and Hive. You can configure a connection by using the UI or programmatically. If you configure a connection by using the UI, you can browse for files, preview them, and share data sources with project collaborators.
A Spark environment in the Hadoop cluster is used to run IBM Watson Studio Local notebooks and batch jobs. The main advantage of this approach is that it cuts down the performance impact of moving data from Hadoop to the Spark cluster in IBM Watson Studio Local.
Figure 2-6 shows the logical connections between IBM Watson Studio Local and Hortonworks.
Figure 2-6 Watson Studio connections with Hortonworks
IBM Watson Studio Local interacts with an HDP cluster through four services:
WebHDFS
WebHDFS is used to browse and preview HDFS data.
WebHCAT
WebHCAT is used to browse and preview Hive tables.
Livy for Spark
Livy for Spark2
Livy for Spark and Livy for Spark2 are used to submit jobs to Spark or Spark2 engines on the Hadoop cluster.
 
Note: WebHCAT is no longer supported as of HDP 3.0. As of HDP 3.0, IBM Watson Studio Local integrates with HDP through HDFS, and Spark integration is through Livy.
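As an illustration of the WebHDFS service, the following sketch lists an HDFS directory through the WebHDFS REST API. The host name, port, and path are placeholders, and a Kerberized cluster additionally requires SPNEGO authentication:
import requests

webhdfs = "http://hdp-namenode.example.com:50070/webhdfs/v1"   # placeholder endpoint
path = "/user/user1/datasets"

resp = requests.get(webhdfs + path, params={"op": "LISTSTATUS"})
for entry in resp.json()["FileStatuses"]["FileStatus"]:
    print(entry["pathSuffix"], entry["type"], entry["length"])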
The Hadoop registration service should be installed on an edge node of the HDP cluster. The gateway component authenticates all incoming requests and forwards them to the Hadoop services. In a Kerberized cluster, the keytab of the Hadoop registration service user and the Simple and Protected GSSAPI Negotiation Mechanism (SPNEGO) keytab for the edge node are used to acquire the ticket to communicate with the Hadoop services. All requests to the Hadoop service are submitted as the IBM Watson Studio Local user.
2.6 IBM Watson Studio Local with IBM Watson Machine Learning Accelerator
When IBM Watson Studio Local is installed on Power Systems servers with GPUs (such as the Power S822LC HPC or Power AC922 servers), the IBM PowerAI DL libraries are automatically included (including GPU-enabled libraries for TensorFlow and Caffe). Users of IBM Watson Studio Local can create notebooks and build models that use these frameworks and the Power Systems and GPU-optimized libraries.
These IBM PowerAI libraries that are included with IBM Watson Studio Local come without support, which is similar to how the IBM PowerAI base product is available for no charge without support. Clients who want support can purchase IBM Watson Machine Learning Accelerator, which also comes with capabilities such as IBM Spectrum Conductor Deep Learning Impact. IBM Spectrum Conductor Deep Learning Impact provides tuning for DL model training tasks, and IBM Spectrum Conductor has multi-tenant Spark cluster capabilities. IBM Watson Studio Local can be integrated with IBM Watson Machine Learning Accelerator by using the Apache Livy interface. In this scenario, IBM Watson Studio Local is the development and collaboration platform, and the IBM Watson Machine Learning Accelerator cluster is used to run model training.
In general terms, IBM PowerAI integrated with IBM Watson Studio Local provides optimized DL frameworks and libraries to leverage IBM Power Servers with GPUs (the Power AC922 server). As mentioned earlier, IBM Watson Studio Local is the platform where notebooks are created, data sources are identified and accessed, and collaboration between team members is enabled. IBM Watson Machine Learning Accelerator, if available, can be used for Spark-based ETL and DL model tuning and training. IBM Watson Studio Local is where the user creates the notebook (where the model training program is created). Without IBM Watson Machine Learning Accelerator, the training job is run on the compute cluster that is part of IBM Watson Studio Local. With IBM Watson Machine Learning Accelerator, the training job is scheduled on the IBM Watson Machine Learning Accelerator cluster.
Both IBM Watson Studio Local and IBM PowerAI are enterprise software offerings from IBM for data scientists that are built with open source components. IBM Watson Studio Local provides interactive interfaces such as notebooks and RStudio. IBM PowerAI provides DL frameworks such as TensorFlow and Caffe, among others, which are built and optimized for IBM Power Systems servers.
Although IBM PowerAI libraries are included with IBM Watson Studio Local and automatically deployed when IBM Watson Studio Local is installed, there is no customer support that is included for IBM PowerAI. To get support (in addition to extra capabilities), clients can purchase and deploy IBM Watson Machine Learning Accelerator instead. An example of how a client can leverage both of these offerings together is where data scientists collaborate, experiment, and build their models in IBM Watson Studio Local and run the production training (with multiple GPUs) on an IBM Watson Machine Learning Accelerator cluster.
The Power AC922 server is the best one for enterprise AI. POWER processor-based servers have a number of significant advantages over Intel-based servers in performance and price. Work with your colleagues in the Analytics organization to position Power Systems as the preferred platform for the client’s IBM Watson Studio Local deployment and their data science initiatives.
IBM Watson Studio Local is available as part of the IBM enterprise private cloud offering that is called IBM Cloud Private.
2.7 IBM Spectrum Scale and Hadoop Integration
The IBM Spectrum Scale HDFS Transparency Connector is a remote procedure call (RPC) API that connects IBM GPFS with HDFS. It offers a set of interfaces that enable applications to use the HDFS Client to access the IBM Spectrum Scale namespace by using HDFS traditional requests.
Figure 2-7 shows the flow of the data from big data applications to the IBM Spectrum Scale namespace.
Figure 2-7 IBM Spectrum Scale flow of data architecture
All data transmission and metadata operations in HDFS are performed through RPCs and processed by NameNode and DataNode services within HDFS. IBM Spectrum Scale HDFS Transparency Connector integrates both NameNode and DataNode services and responds to the request like HDFS does. IBM Spectrum Scale HDFS Transparency Connector ensures that any application that attempts to read, write, or run any operation on the HDFS receives the expected response.
Some advantages of HDFS transparency are:
HDFS-compliant APIs and shell-interface command-line interface (CLI).
Application client isolation from storage. The application client may access the IBM Spectrum Scale file system without the IBM Spectrum Scale Client installed.
Improved security management by using Kerberos authentication and encryption in RPC.
File system monitoring by Hadoop Metrics2 integration.
IBM Spectrum Scale services management directly on the Ambari GUI.
Avoiding unnecessary data moves reduces cost in a Hadoop environment like HDP. With IBM Spectrum Scale, HDP components may access a namespace (file system mount point) on the Hadoop cluster by using the same data that is accessible to any other software (SAS, IBM Db2, Oracle Database, SAP, and others) without using a command such as distcp, which is a common problem when you run HDFS and must access data from an external file system.
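For example, from a Spark session (such as a notebook), the same IBM Spectrum Scale file can be read both through the POSIX mount point and through the HDFS Transparency NameNode, with no copy or distcp in between. The paths and host name below are placeholders:
# Read through the IBM Spectrum Scale POSIX mount point ...
posix_df = spark.read.csv("file:///gpfs/fs1/datasets/cars.csv", header=True)

# ... and read the same data through HDFS Transparency.
hdfs_df = spark.read.csv("hdfs://namenode.example.com:8020/datasets/cars.csv", header=True)

print(posix_df.count(), hdfs_df.count())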
2.7.1 Information Lifecycle Management
IBM Spectrum Scale uses the policy-based Information Lifecycle Management (ILM) toolkit to automate the storage of data into tiered storage. With the ILM toolkit, you can manage file placement rules across different pools of storage, like flash drives for hot data, SATA disks for commonly accessed data, and even cold data on the cloud or a tape drive.
With IBM Spectrum Scale policy-based ILM tools, you can accomplish the following tasks:
Create storage pools to partition a file system's storage into collections of disks or a RAID with similar properties.
Create file sets to partition the file system namespace to allow administrative operations at a finer granularity than that of the entire file system.
Create policy rules based on data attributes to determine the initial file data placement and manage file data placement throughout the life of the file.
Some rule conditions that can be used are:
Date and time when the file was last accessed.
Date and time when the file was last modified.
File set name (directory).
File name or extension.
File size.
Owner of the file (user ID and group ID).
You can create and manage policies and policy rules by using the IBM Spectrum Scale CLI or the GUI. To create a policy rule by using the GUI, complete the following steps:
1. Log in to the IBM Spectrum Scale web interface and in the left pane select Files → Information Lifecycle. Click Add Rule. The window that is shown in Figure 2-8 opens.
Figure 2-8 IBM Spectrum Scale GUI: Information Lifecycle Add Rule window
2. Select the type of rule that you want to create, such as Placement, Migration, Compression, or Deletion. The window that is shown in Figure 2-9 opens.
Figure 2-9 IBM Spectrum Scale GUI: Information Lifecycle Rule Type window
3. Select the pool where the new files must be placed (when you are creating a placement policy), the Placement Criteria (for example, only files with the extension .xml), and more.
For a Migration Policy rule, you can select the source and target pools where the files must be moved, and then the Rules that trigger the migration, for example, when the source pool reaches a defined utilization threshold or when a file has not been accessed for a number of days.
Figure 2-10 shows some more configurations that can be selected for the Information Lifecycle rule.
Figure 2-10 IBM Spectrum Scale: Information lifecycle rules configuration
There are many types of ILM rules that can be created on IBM Spectrum Scale. The GUI makes creating a policy to meet your needs simple and fast.
2.8 Security
Security is at the forefront of most clients’ minds. Security incidents and data breaches often make significant news because of the potential impact they have on the business and their customers. Although some companies are making a move to the public cloud, others want to keep their data local and secure behind their firewall.
Protecting the datalake is paramount in an enterprise. The security mechanisms must include authentication, authorization, audit, encryption, and policy management:
Authentication is the process of proving who you are.
Authorization is the process of verifying that you have the authority to access a resource.
Records that indicate who did what provide the audit capability.
Encryption is the mechanism that is used to protect data-at-rest and in motion.
Policy management is done through the product administration consoles.
Ideally, these mechanisms can be managed in a consistent and cohesive fashion.
2.8.1 Datalake security
Hortonworks uses Apache Ranger for its centralized security management, although Ambari is used for some of the fundamental setup, such as enabling Kerberos. Apache Ranger is a framework and service to enable, monitor, and manage security for the Hadoop platform and services. The Administration Portal is the Ranger interface for security administration. Hadoop components like Knox, YARN, Hive, HBase, and HDFS have lightweight plug-ins that provide integration with Ranger.
From the Ambari administration console, administrators can enable and configure Kerberos for authentication and set perimeter security policies by using Knox.
From the Ranger administration console:
Administrators can set fine-grained access control for the authorization policies.
Administrators can view the centralized audit reports showing user activity.
Administrators can configure Ranger KMS for cryptographic key management for HDFS Transparent Encryption. Ranger KMS lets you create and manage the keys, in addition to providing audit and access control of the keys. You cannot use it to enable HDFS encryption directly.
Hadoop authentication
Hadoop uses Kerberos for authentication and the propagation of identities for users and services. Kerberos includes the client, server, and trusted party known as the Kerberos Key Distribution Center (KDC). The KDC is a separate server from the Hadoop cluster. It includes a database of users and services that are known as principals, and is composed of an Authentication Server and a Ticket Granting Service (TGS). The Authentication Server is used for the initial authentication. It issues a Ticket Granting Ticket (TGT) to the authenticated principal. The TGS is used to get service tickets by using the TGT. Host and service resources use a special file that is called a keytab that includes their principal and key. The keytab resolves the issue of providing a password when decrypting a TGT. The Hortonworks security documentation provides the detailed steps for enabling Kerberos, including the Kerberos server setup.
If you already have a Lightweight Directory Access Protocol (LDAP) server solution, you can configure various HDP components, such as Ambari, Knox, and Ranger, to use it. IBM Watson Studio Local also supports the configuration of an external LDAP server for its users. The two solutions can share an LDAP server and avoid the problem of defining users in multiple places.
The Apache Knox Gateway is a reverse proxy that provides the perimeter security for a Hadoop cluster. It exposes Hadoop REST and HTTP services for HDFS, Hive, HBase, and others without revealing the cluster internals. It provides SSL for over-the-wire encryption and encapsulates the Kerberos authentication.
Hadoop authorization
The Apache Ranger service can be installed through the Ambari Add Service wizard. Before adding Ranger, set up a database instance and an Apache Solr instance. The database is used for administration and logging. Ranger uses Apache Solr to store audit logs. After Ranger is installed, the Ranger plug-ins can be enabled for each of the services being administered, including HDFS, HBase, HiveServer2, Storm, Knox, YARN, Kafka, and NiFi.
Log in to the Ranger portal at http://ranger_host:6080 to open the Ranger Console. The Service Manager page opens and shows the Access Manager, Audit, and Settings tabs at the top of the page:
The Access Manager → Resource Based Policy option opens the Service Manager page so that you can add access policies for the services that are plugged into Ranger.
The Access Manager → Tag Based Policies option opens the Service Manager for Tag Based policies page so that you can add tag-based services that you use to control access to resources that access multiple Hadoop components.
The Access Manager → Reports option opens the Reports page where you can generate user access reports.
The Settings → User/Groups option shows a list of users and groups that can access the Ranger portal.
The Settings → Permissions option opens a Permissions page where you can edit the permissions for users and groups.
Hadoop auditing
Ranger offers the ability to store audit logs by using Apache Solr. Solr is an open source highly scalable search platform. Store your audit logs in HDFS so that they can be exported to a security information and event management (SIEM) system. Having the audit log data in HDFS also provides the ability to create an anomaly detection application by using ML and DL algorithms.
Selecting the Ranger console Audit tab opens an Audit page where you can create a search to monitor a user’s activity.
Hadoop data protection
Hadoop provides protection mechanisms for data-in-motion and data-at-rest.
Hadoop wire encryption
Data on the wire is encrypted to ensure its privacy and confidentiality. Hadoop can be configured to encrypt data as it moves into the cluster, through the cluster, and out of the cluster. There are a number of protocols that must be configured for encryption: RPC, data transfer protocol (DTP), HTTP, and Java Database Connectivity (JDBC).
Clients use RPC to interact directly with the Hadoop cluster. DTP is used when the client transfers data with a data node. Clients use HTTP to connect to the cluster through a browser or a REST API. HTTP is also used between mappers and reducers during a data shuffle. JDBC is used to communicate with the Hive server.
RPC encryption is enabled by setting the HDFS property hadoop.rpc.protection=privacy.
DTP encryption is enabled by setting the HDFS property dfs.encrypt.data.transfer=true.
The DTP encryption algorithm, “3des” or “rc4”, is set by using the dfs.encrypt.data.transfer.algorithm property.
For more information about enabling SSL encryption for the various Hadoop components, see Enabling SSL for HDP Components.
HDFS Transparent Data Encryption
HDFS encryption is an end-to-end encryption mechanism for stored data in HDFS. The data is encrypted by the HDFS client during the write operation and decrypted by an HDFS client during a read operation. HDFS sees the data as a byte stream during these operations. HDFS encryption is composed of encryption keys, encryption zones, and Ranger KMS. The keys are a new level of permission-based access beyond the standard HDFS access control. The encryption creates a special HDFS directory so that all data in it is encrypted. The Ranger KMS generates and manages the encryption zone keys, provides the access to the keys, and stores audit data that is associated with the keys.
The native HDFS encryption feature is supported since HDFS Transparency 3.0.0.
2.8.2 IBM Watson Machine Learning Accelerator security with Hadoop
IBM Spectrum Conductor supports Kerberos authentication, which can be extended to HDFS. You can authenticate by using a Kerberos TGT or by using a principal with a keytab. The keytab is preferred for long-running applications.
The SIG to which you submit applications must reference a path to the Hadoop configuration. Set the SIG environment variable HADOOP_CONF_DIR to the path of the Hadoop configuration. For example:
HADOOP_CONF_DIR=/opt/hadoop-2-6-5/etc/hadoop
When submitting a Spark batch job with a keytab for HDFS, specify the principal and keytab as options that are passed with the --conf flag. This spark-submit command shows the syntax for keytab authentication:
spark-submit --master spark://spark_master_url --conf spark.yarn.keytab=path_to_keytab --conf spark.yarn.principal=principal@REALM --class main-class application.jar hdfs://namenode:9000/path/to/input
For access through Livy, follow the HDP instructions to configure the Livy service to authenticate with Kerberos. Those instructions include creating a user, principal, and keytab for Livy.
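A quick way to verify that a Kerberized Livy endpoint accepts your ticket is to call its REST API with SPNEGO authentication, as in the following sketch. It assumes a valid TGT (obtained with kinit), the requests-kerberos Python package, and a placeholder Livy URL:
import requests
from requests_kerberos import HTTPKerberosAuth, REQUIRED

livy_url = "http://hdp-edge.example.com:8999"   # placeholder Livy endpoint

# List the active Livy sessions; an HTTP 200 response confirms that Kerberos/SPNEGO works.
resp = requests.get(livy_url + "/sessions",
                    auth=HTTPKerberosAuth(mutual_authentication=REQUIRED))
print(resp.status_code)
print(resp.json())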
2.8.3 IBM Watson Studio Local security
When setting up HDP to work with IBM Watson Studio Local by using the Hadoop registration service, more setup is required for the Hadoop cluster that is configured with Kerberos. A gateway with JSON Web Token (JWT)-based authentication is configured so that IBM Watson Studio Local users can securely authenticate with the registration service.
Figure 2-11 shows the various components of an IBM Watson Studio Local and Hadoop Integration.
Figure 2-11 IBM Watson Studio Local Hadoop registration service
Use a Kerberos keytab for the Hadoop registration service user and an SPNEGO keytab for the edge node to acquire the ticket to communicate with the Hadoop services. The request that is submitted to the Hadoop services is submitted as the IBM Watson Studio Local user.
For access through Livy, follow the HDP instructions to configure the Livy service to authenticate with Kerberos. Those instructions include creating a user, principal, and keytab for Livy’s use.
Here are some Watson Studio considerations to note when configuring your security:
To help with General Data Protection Regulation (GDPR) readiness, install IBM Watson Studio Local Version 1.2.0.3 or later.
Disk volumes should be encrypted. To encrypt data storage, use Linux Unified Key Setup-on-disk-format (LUKS). If you decide to use this approach, format the partition with the highly scalable Linux XFS before you install IBM Watson Studio Local.
IBM Watson Studio Local can be accessed only through SSL/TLS (HTTPS). For more information, see 2.8, “Security” on page 50. Use JDBC/SSL-based mechanisms to communicate with remote data sources.
Access failures are recorded in the logs.
The Hadoop Integration service uses HTTPS, which enables secure communication.
2.8.4 IBM Spectrum Scale security
The IBM Spectrum Conductor software component and IBM Spectrum Scale wrap security features around these frameworks for production deployments in many regulated organizations. These solutions employ the current security protocols and are subjected to extensive security scanning and penetration testing. The IBM Spectrum Conductor software implements end-to-end security, from data acquisition and preparation to training and inference.
The product security implementation has the following features: 
Authentication: Support for Kerberos, Active Directory (AD) and LDAP, and operating system (OS) authentication, including Kerberos authentication for HDFS.
Authorization: Fine-grained access control, access control list (ACL) or role-based access control (RBAC), Spark binary lifecycle, notebook updates, deployments, resource plan, reporting, monitoring, log retrieval, and execution.
Impersonation: Different tenants may define production execution users.
Encryption: SSL and authentication between all daemons, and storage encryption by using IBM Spectrum Scale.
 