IBM PowerAI components
This chapter describes IBM PowerAI.
This chapter contains the following topics:
3.1, "IBM PowerAI components"
3.2, "IBM PowerAI compatibility matrix"
3.1 IBM PowerAI components
This section covers the components of IBM PowerAI from the first public release (V1.3) to the current release (V1.5.0) as of December 2017.1
IBM PowerAI is an IBM Cognitive Systems offering for the rapidly growing and quickly evolving artificial intelligence (AI) tier of deep learning (DL). IBM PowerAI provides a suite of capabilities from the open source community and combines them into a single enterprise distribution of software. This distribution incorporates complete lifecycle management: installation and configuration; data ingestion and preparation; building, optimizing, and training the model; inference; testing; and moving the model into production. IBM PowerAI takes advantage of a distributed architecture to help your teams quickly iterate through the training cycle with more data, which helps continuously improve the model over time.
IBM PowerAI is designed for enterprise scale, with software that is optimized for both single-server and cluster DL training.
IBM PowerAI provides an end-to-end DL platform for data scientists. It offers many optimizations that can ease installation and management, and can accelerate performance:
Ready-to-use DL frameworks (TensorFlow and IBM Caffe).
Distributed as easy-to-install binary files.
Includes all dependencies and libraries.
Easy updates: Code updates arrive from a repository.
Validated DL platform with each release.
Integrated ingest interfaces, with optional parallel transformation capability to unlock larger DL data sets.
Dedicated support teams for DL.
Designed for enterprise scale with multisystem cluster performance and large memory support.
IBM PowerAI software distribution consists of a meta-package that includes all libraries, DL frameworks, and software customizations to enable a fast, reliable, and optimized way to deploy a DL solution on IBM Power Systems.
IBM PowerAI is distributed through electronic download from My Entitled System Support, or it can be preinstalled on the hardware. The following items are included:
IBM international license agreement for non-warranted programs
Required installation files
 
Note: No physical media is available. Due to license limitations, IBM PowerAI does not include the operating system or the NVIDIA graphics processing unit (GPU) drivers and libraries. It also does not include extra software such as Python, IBM Spectrum Conductor Deep Learning Impact (DLI), or IBM PowerAI Vision.
The following sections describe in greater detail all IBM PowerAI components and extra requirements (whether or not they are part of IBM PowerAI), including the versions that are supported in each release. For the components that are freely available but not included in the IBM PowerAI meta-package, the following sections provide a URL with more information and downloadable content.
For the components that are part of IBM PowerAI, we tabulate, at the time of writing, the dependencies between each component and the specific IBM PowerAI release.
3.1.1 IBM PowerAI support and extra services from IBM
IBM offers optional Level 3 enterprise support for IBM PowerAI, and also offers services for the deployment and optimization of IBM PowerAI. The following sections describe these services in more detail.
IBM Global Technology Services
The list of selected services that are available in your region, either as standard or customized offerings for the efficient installation, implementation, and integration of this program, can be obtained from your respective IBM Global Technology Services® Regional Offering Executive.
Sales support
Technical sales support is offered globally and includes professionals from IBM Systems Group and IBM Sales and Delivery.
Resources include Field Technical Support Specialists (FTSSs), who provide onsite support for sales opportunities through pilots, prototypes, and demonstrations. Extra technical resources include:
Advanced Technical Skills (ATS): Complex solution design, solutions assurance, proof of concept, early product introduction, and performance and benchmarking support.
Techline: Presales services include System Design and Configurations, Design Assessments, Middleware and ISV Sizings, Services Support, Intellectual Capital, Competitive Assessments, and Business Partner Enablement.
For IBM presales support, see Global Technical Sales.
IBM Systems Lab Services
IBM Systems Lab Services offers a wide array of services that are available for your enterprise. The team brings expertise on the current technologies from the IBM development community and can help with your most difficult technical challenges.
IBM Systems Lab Services exists to help you successfully implement emerging technologies that can accelerate your return on investment and improve your satisfaction with your IBM systems and solutions. Services examples include initial implementation, integration, migration, and skills transfer on IBM systems solution capabilities and preferred practices.
For more information about available services, contact your IBM representative or see IBM Systems Lab Services.
IBM Power Systems PowerAI basic startup services
IBM Systems Lab Services offers IBM Power Systems PowerAI basic startup services, which include a pre-implementation design workshop and onsite assistance with the installation, configuration, and implementation of the software and hardware infrastructure for the customer's IBM PowerAI solution.
 
Note: Contact your IBM representative for availability in your country.
3.1.2 IBM Power Systems for deep learning
IBM PowerAI relies on both Power Systems servers and NVIDIA GPUs to deliver maximum performance and a reliable solution for DL workloads. Because IBM PowerAI is a software offering, the hardware is not included, but it is required for the solution to run.
IBM Power System S822LC for High Performance Computing (8335-GTB) server
The IBM Power System S822LC for High Performance Computing server is the IBM chosen platform for running DL workloads. This server benefits from POWER8 CPUs plus dedicated P100 GPUs from NVIDIA to support demanding workloads, such as the typical DL workload during the training phase.
The IBM Power System S822LC for High Performance Computing server was designed and built in collaboration with the OpenPOWER Foundation partners NVIDIA, Mellanox, and Wistron to tackle high-performance and technical computing workloads. It is the first server to incorporate NVIDIA NVLink technology between the CPUs and the GPUs. For more information, see the NVIDIA NVLink website.
This system is designed to provide the highest performance and greatest efficiency for workloads that use GPU acceleration, including computational fluid dynamics (CFD) and molecular modeling applications. GPU acceleration is also used extensively in machine learning (ML), DL, and cognitive workloads.
This system supports up to four NVIDIA Tesla P100 GPUs, which are connected through the NVIDIA NVLink. Each node can include up to 1 TB of DDR4 memory, and supports Mellanox 100 Gbps (Enhanced Data Rate (EDR)) InfiniBand adapters for interconnect. Large clusters accommodate high-speed interconnect between nodes.
The system also includes the following items:
Choice of air-cooled or water-cooled models for greater thermal efficiency
Option of NVMe-based storage devices for greater performance
100 Gbps InfiniBand adapters that use Mellanox ConnectX-4 technology
For computationally rich high-performance or technical computing workloads that do not benefit from GPU acceleration, see IBM Power System S822LC Technical Overview and Introduction, REDP-5283.
IBM POWER9 servers: IBM Power System AC922 server
The IBM Power System AC922 server is the next generation of the IBM POWER processor-based systems, and is designed for DL and AI, high-performance analytics, and high-performance computing (HPC).
The IBM Power System AC922 (8335-GTW and 8335-GTG) is an OpenPOWER Linux scale-out server. In a 2U form factor, these servers provide two POWER9 single chip module (SCM) processors with up to 40 processor cores, coherently sharing 16 directly attached DDR4 dual inline memory module (DIMM) slots. The supported memory DIMMs range from 8 GB to 128 GB.
These servers support NVIDIA SXM2 form factor GPUs with an NVLink 2.0 interface. The system also contains four directly attached, Coherent Accelerator Processor Interface (CAPI)-enabled Gen4 Peripheral Component Interconnect Express (PCIe) slots.
Four or six GV100 GPUs can be installed on the system backplane.
Table 3-1 shows the NVIDIA GV100 GPU features.
Table 3-1 NVIDIA GV100 GPU features

GPU        | Peak double-precision floating-point performance | Memory bandwidth | GPU memory size
Tesla V100 | 7.8 teraflops (base)¹                             | 1.2 TBps         | 16 GB or 32 GB

1 This is a projection from NVIDIA as of November 2017.
The 6-GPU configuration is water-cooled. The 4-GPU configuration is air-cooled as a standard solution, and water-cooled as an optional solution. For the air-cooled 2-GPU configuration, a feature upgrade is required to upgrade the system to a 4-GPU configuration. For a water-cooled system, different system backplanes are required for 4-GPU and 6-GPU configurations.
The system includes several features to improve performance:
POWER9 processors:
 – Each POWER9 processor module has either 16 or 20 cores, and is based on a 64-bit architecture:
 • Clock speeds for a 16-core chip of 2.6 GHz (3.09 GHz turbo)
 • Clock speeds for a 20-core chip of 2.0 GHz (2.87 GHz turbo)
 – 512 KB of L2 cache per core, and up to 120 MB of L3 cache per chip.
 – Up to four threads per core.
 – 120 GBps memory bandwidth per chip.
 – 64 GBps SMP interconnect between POWER9 chips.
DDR4 memory:
 – Sixteen DIMM memory slots.
 – Maximum of 1024 GB DDR4 system memory.
 – Improved clock 1333 - 2666 MHz for reduced latency.
NVIDIA Tesla V100 GPUs:
 – Up to six NVIDIA Tesla V100 GPUs, based on the NVIDIA SXM2 form factor connectors.
 – 7.8 TFLOPs per GPU for double precision.
 – 15.7 TFLOPs per GPU for single precision.
 – 125 TFLOPs per GPU for DL, provided by 640 new Tensor Cores per GPU that are designed for DL.
 – 16 GB HBM2 internal memory with 900 GBps bandwidth, 1.5x the bandwidth compared to Pascal P100.
 – Liquid cooling for six GPUs configurations to improve compute density.
NVLink 2.0:
 – Twice the throughput, compared to the previous generation of NVLink.
 – Up to 200 GBps of bidirectional bandwidth between GPUs.
 – Up to 300 GBps of bidirectional bandwidth per POWER9 chip and GPUs, compared to 32 GBps of traditional PCIe Gen3.
IBM OpenCAPI™ 3.0:
 – An open protocol bus that connects the processor system bus, in a high-speed and cache-coherent manner, with OpenCAPI-compatible devices such as accelerators, network controllers, storage controllers, and advanced memory technologies.
 – Up to 100 GBps of bidirectional bandwidth between CPUs and OpenCAPI devices.
PCIe Gen4 slots: Four PCIe Gen4 slots with up to 64 GBps of bandwidth per slot (twice the throughput of PCIe Gen3), three of which are CAPI 2.0-capable.
Table 3-2 provides a summary of the IBM Power System AC922 server available models.
Table 3-2 Summary of the IBM Power System AC922 server available models

Server model | POWER9 chips | Max. memory | Max. GPU cards | Cooling
8335-GTG     | 2            | 1 TB        | 4              | Air-cooled
8335-GTW     | 2            | 1 TB        | 6              | Water-cooled
 
Note: For more information, see IBM Power System AC922 Introduction and Technical Overview, REDP-5472.
IBM Power System AC922 server model 8335-GTG
This summary describes the standard features of the IBM Power System AC922 model 8335-GTG:
19-inch rack-mount (2U) chassis
Two POWER9 processor modules:
 – 16-core 2.6 GHz processor module
 – 20-core 2.0 GHz processor module
Up to 1024 GB of 2666 MHz DDR4 error correction code (ECC) memory
Two small form factor (SFF) bays for hard disk drives (HDDs) or solid-state drives (SSDs) that support:
 – Two 1 TB 7200 RPM NL SATA disk drives (#ELD0)
 – Two 2 TB 7200 RPM NL SATA disk drives (#ES6A)
 – Two 960 GB SATA SSDs (#ELS6)
 – Two 1.92 TB SATA SSDs (#ELSZ)
 – Two 3.84 TB SATA SSDs (#ELU0)
Integrated SATA controller
Four PCIe Gen4 slots:
 – Two PCIe x16 Gen4 Low Profile slots, CAPI-enabled
 – One PCIe x8 Gen4 Low Profile slot, CAPI-enabled
 – One PCIe x4 Gen4 Low Profile slot
Two or four NVIDIA Tesla V100 GPUs (#EC4J), based on the NVIDIA SXM2 form factor connectors, air-cooled
Integrated features:
 – IBM EnergyScale™ technology
 – Hot-swap and redundant cooling
 – One front USB 3.0 port for general use
 – One rear USB 3.0 port for general use
 – One system port with RJ45 connector
Two power supplies (both are required)
IBM Power System AC922 server model 8335-GTW
This summary describes the standard features of the Power AC922 model 8335-GTW:
19-inch rack-mount (2U) chassis
Two POWER9 processor modules:
 – 16-core 2.6 GHz processor module
 – 20-core 2.0 GHz processor module
Up to 1024 GB of 2666 MHz DDR4 ECC memory
Two SFF bays for HDDs or SSDs that support:
 – Two 1 TB 7200 RPM NL SATA disk drives (#ELD0)
 – Two 2 TB 7200 RPM NL SATA disk drives (#ES6A)
 – Two 960 GB SATA SSDs (#ELS6)
 – Two 1.92 TB SATA SSDs (#ELSZ)
 – Two 3.84 TB SATA SSDs (#ELU0)
Integrated SATA controller
Four PCIe Gen4 slots:
 – Two PCIe x16 Gen4 Low Profile slots, CAPI-enabled
 – One PCIe x8 Gen4 Low Profile slot, CAPI-enabled
 – One PCIe x4 Gen4 Low Profile slot
Four or six NVIDIA Tesla V100 GPUs (#EC4J), based on the NVIDIA SXM2 form factor connectors, water-cooled
Integrated features:
 – IBM EnergyScale technology
 – Hot-swap and redundant cooling
 – One rear USB 3.0 port for general use
 – One system port with RJ45 connector
Two power supplies (both are required)
Table 3-3 shows a list of IBM POWER servers that support NVIDIA GPUs with NVLink or NVLink 2.0 buses. IBM PowerAI is supported on IBM Power Systems servers with NVLink-capable CPUs and NVIDIA GPUs.
Table 3-3 IBM Power Systems servers that support NVIDIA GPUs

Name                                                   | Type-model | CPU    | Bus     | NVIDIA GPUs
IBM Power System S822LC for High Performance Computing | 8335-GTB   | POWER8 | NVLink  | Standard configuration: 2 or 4 Tesla P100 GPUs
IBM Power System AC922                                 | 8335-GTW   | POWER9 | NVLink2 | Standard configuration: 6 Tesla V100 GPUs (water-cooled). Optional configuration: 4 Tesla V100 GPUs (water-cooled)
IBM Power System AC922                                 | 8335-GTG   | POWER9 | NVLink2 | Standard configuration: 4 Tesla V100 GPUs (air-cooled). Optional configuration: 2 Tesla V100 GPUs (air-cooled)
3.1.3 Linux on Power for deep learning
IBM PowerAI is built on Linux on Power. IBM PowerAI benefits from both the Power platform performance and reliability, and the Linux open source model. At the time of writing, two Linux distributions are supported for IBM PowerAI in its different releases, as shown in Table 3-4 (although the operating system is not part of IBM PowerAI).
Table 3-4 IBM PowerAI supported operating systems

PowerAI         | Linux                                     | Extra requirements
V1.3.0 - V1.4.0 | Ubuntu 16.04 on Power                     | Linux kernel: 4.4. Extra packages: libc6 >= 2.23-0ubuntu5¹. Extra repository: Updates.
V1.5.0          | RHEL 7.4 for POWER8 / RHEL 7.4 for POWER9 | Extra repositories: Optional, Extra, and EPEL.

1 After installing Ubuntu 16.04, update the libc6 package to version 2.23-0ubuntu5 or higher. That version fixes problems with Torch and TensorFlow. You might need to enable the updates repository to install this update.
Ubuntu 16.04 LTS for IBM POWER8 (ppc64le)2
Ubuntu for POWER8 brings the Ubuntu server and Ubuntu infrastructure to IBM POWER8. Ubuntu 16.04 continues to enable rapid innovation for POWER8, including support for new POWER8 models, memory and PCI Hotplug, Docker, and many performance and availability enhancements.
For more information, see Ubuntu Server Guide.
Red Hat Enterprise Linux for IBM POWER
Starting with Red Hat Enterprise Linux 7.1, Red Hat provides separate builds and licenses for big endian and little endian versions for IBM Power Systems servers.
Red Hat Enterprise Linux 7.4 for IBM Power LE (POWER9) introduces the Red Hat Enterprise Linux 7.4 user space with an updated kernel.
The Red Hat Enterprise Linux 7.4 kernel depends on the CPU POWER architecture, so it is different for POWER8 and POWER9 processors.
3.1.4 NVIDIA GPUs
GPUs are fundamental for the calculations that are related to DL. Unlike CPUs, GPUs are designed and optimized to perform mathematical operations over matrices and vectors, which are the core of DL, particularly during training tasks (the most compute-intensive phase).
Table 3-5 shows the main differences between CPUs and GPUs.
Table 3-5 GPUs versus CPUs

GPUs                                                   | CPUs
Hundreds or thousands of simpler cores                 | A few complex cores
Thousands of concurrent hardware threads               | Optimized for single-thread (or few-thread) performance
Maximize floating-point throughput                     | Not applicable
Most die surface devoted to integer and floating-point units | Transistor space dedicated to instruction-level parallelism (ILP)
IBM PowerAI relies on the IBM POWER architecture, which is known for its resilience and SMT capabilities, plus the NVIDIA GPUs and NVLink technologies, as shown in Table 3-6.
Table 3-6 NVIDIA GPU specifications

Component             | Tesla V100                | Tesla P100
GPU                   | GV100                     | GP100
Architecture          | Volta                     | Pascal
CUDA cores            | 5376                      | 3840
Tensor Cores          | 640                       | N/A
Core clock            | Not disclosed             | 1328 MHz
Boost clock           | 1455 MHz                  | 1480 MHz
SMs¹                  | 84                        | 60
Memory                | HBM2                      | HBM2
Memory bus width      | 4096-bit                  | 4096-bit
Memory bandwidth      | 900 GBps                  | 600 GBps
Shared memory         | 128 KB, configurable      | 24 KB L1, 64 KB shared
L2 cache              | 6 MB                      | 4 MB
Half precision        | 30 teraflops              | 21.2 teraflops
Single precision      | 15.7 teraflops            | 10.6 teraflops
Double precision      | 7.8 teraflops (1/2 rate)  | 5.3 teraflops (1/2 rate)
For DL                | 125 teraflops             | N/A
Die size              | 815 mm²                   | 610 mm²
Transistor count      | 21.1 × 10⁹                | 15.3 × 10⁹
TDP                   | 300 W                     | 300 W
Manufacturing process | TSMC 12 nm FFN            | TSMC 16 nm FinFET

1 Streaming multiprocessor
3.1.5 NVIDIA components
For proper operation, IBM PowerAI requires low-level access to all functions that are provided by the NVIDIA GPU and NVLink technologies. To achieve this access, the NVIDIA drivers, CUDA, and the CUDA Deep Neural Network (cuDNN) library are needed, but they are not provided within IBM PowerAI due to license constraints by NVIDIA. These components can be downloaded from the vendor webpage, as described in the rest of this section and in 3.1.6, “NVIDIA drivers” on page 44.
CUDA
CUDA is a parallel computing platform and programming model that enables the use of the GPU to accelerate computing performance.
CUDA broadly follows the data-parallel model of computation. Typically, each thread runs the same operation on different elements of the data in parallel.
The data is divided into a 1D, 2D, or 3D grid of blocks. Each block can be 1D, 2D, or 3D in shape, and can consist of more than 512 threads on current hardware. Threads within a thread block can cooperate by way of shared memory. Thread blocks are executed in smaller groups of threads that are known as warps.
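To make the decomposition concrete, here is a minimal Python sketch (illustrative only; the launch_config function is not part of any CUDA API) that computes how a 1D data set maps onto blocks and warps, assuming 512-thread blocks and 32-thread warps:

def launch_config(n_elements, threads_per_block=512, warp_size=32):
    """Return (blocks, warps_per_block) needed to cover n_elements."""
    # Ceiling division: enough blocks so that every element gets a thread.
    blocks = (n_elements + threads_per_block - 1) // threads_per_block
    # Each block is executed as groups of warp_size threads (warps).
    warps_per_block = threads_per_block // warp_size
    return blocks, warps_per_block

print(launch_config(1000000))  # (1954, 16): 1954 blocks of 512 threads, 16 warps per block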
NVIDIA CUDA Toolkit
The NVIDIA CUDA Toolkit provides a development environment for creating high-performance, GPU-accelerated applications. GPU-accelerated CUDA libraries enable drop-in acceleration across multiple domains, such as linear algebra, image and video processing, DL, and graph analytics. For developing custom algorithms, you can use available integrations with commonly used languages and numerical packages and well-published development APIs. Table 3-7 shows the supported NVIDIA CUDA Toolkit versions.
Table 3-7 Supported NVIDIA CUDA Toolkit versions

IBM PowerAI       | Minimal | Recommended
V1.3.0 - V1.3.2   | 8.0     | >= 8.0.44
V1.3.3 - V1.4.0   | 8.0     | 8.0.61
V1.5.0 for POWER8 | 9.0     | 9.0.176
V1.5.0 for POWER9 | 9.0     | 9.1.85
For more information, see the CUDA Toolkit.
NVIDIA cuDNN
The NVIDIA cuDNN library is a GPU-accelerated library of primitives for deep neural networks. cuDNN provides highly tuned implementations of standard routines, such as forward and backward convolution, pooling, normalization, and activation layers. Table 3-8 shows the supported NVIDIA cuDNN versions.
Table 3-8 Supported NVIDIA cuDNN versions

IBM PowerAI       | Minimal     | Recommended | Runtime libraries
V1.3.0 - V1.3.3   | Version 5.1 | N/A         | cuDNN v5.1 Runtime Library, Developer Library, and Code Samples and User Guide for Ubuntu 16.04 POWER8 (Deb)
V1.3.4            | Version 5.1 | 5.1.10      | cuDNN v5.1 Runtime Library, Developer Library, and Code Samples and User Guide for Ubuntu 16.04 POWER8 (Deb)
V1.4.0            | Version 6.0 | 6.0.20      | cuDNN v6.0 Runtime Library, Developer Library, and Code Samples and User Guide for Ubuntu 16.04 POWER8 (Deb)
V1.5.0 for POWER8 | Version 7.0 | 7.0.4       | N/A
V1.5.0 for POWER9 | Version 7.0 | 7.0.5       | N/A
cuDNN accelerates widely used DL frameworks, including Caffe2, TensorFlow, and Theano.
Here are some cuDNN key features:
Forward and backward paths for many common layer types, such as pooling, LRN, LCN, batch normalization, dropout, CTC, rectified linear unit (ReLU), Sigmoid, softmax and Tanh
Forward and backward convolutional routines, including cross-correlation, which is designed for convolutional neural nets
Long short-term memory (LSTM) and gated recurrent unit (GRU) recurrent neural networks (RNNs), and persistent RNNs
Arbitrary dimension ordering, striding, and subregions for 4D tensors means easy integration into any neural net implementation
Tensor transformation functions
Context-based API allows for easy multithreading
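Applications normally reach these primitives through a framework rather than by calling cuDNN directly. As a hedged illustration, the following sketch uses the TensorFlow 1.x API (the generation bundled with IBM PowerAI; shapes and values are arbitrary) to build a convolution plus ReLU, operations that dispatch to the cuDNN convolution and activation routines on a GPU build:

import numpy as np
import tensorflow as tf

# One batch of 224x224 RGB images (NHWC layout) and a 3x3 kernel with 64 filters.
images = tf.placeholder(tf.float32, shape=[None, 224, 224, 3])
kernel = tf.Variable(tf.truncated_normal([3, 3, 3, 64], stddev=0.1))

conv = tf.nn.conv2d(images, kernel, strides=[1, 1, 1, 1], padding="SAME")
act = tf.nn.relu(conv)  # ReLU is another cuDNN-provided primitive

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    out = sess.run(act, feed_dict={images: np.zeros((1, 224, 224, 3), np.float32)})
    print(out.shape)  # (1, 224, 224, 64)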
For more information, see NVIDIA cuDNN.
3.1.6 NVIDIA drivers
NVIDIA drivers are software that helps the operating system access all the capabilities of the NVIDIA GPUs. These drivers are provided by NVIDIA and are not part of the IBM PowerAI distribution because of NVIDIA proprietary software policies.
NVIDIA drivers download
To download the appropriate drivers (Table 3-9 on page 45), complete the following steps:
1. Go to NVIDIA.
2. Select DRIVERS.
3. From the drop-down menu, select ALL NVIDIA DRIVERS.
4. Select Manually find drivers for my NVIDIA products:
a. Product Type: Tesla
b. Product Series: P-Series
c. Product: Tesla P100
d. Operating system: Select Show all operating systems and then, depending on your operating system, select Linux POWER8 Ubuntu 16.04 or Linux POWER8 RHEL7.
5. Choose the CUDA version that is recommended for your IBM PowerAI release, select English as the language, and click Search. You then reach a web page where you can download the drivers and review the recommendations and release notes.
Table 3-9 Supported NVIDIA drivers

IBM PowerAI       | Minimal      | Recommended
V1.3.0 - V1.3.2   | >= 361.93.03 | N/A
V1.3.3            | >= 361.93.03 | 361.119
V1.3.4            | >= 361.93.03 | 361.121
V1.4.0            | >= 361.93.03 | 384.66
V1.5.0 for POWER8 | >= 361.93.03 | 384.81
V1.5.0 for POWER9 | >= 361.93.03 | 387.86
3.1.7 IBM PowerAI deep learning package
IBM provides a set of libraries, frameworks, and several customizations to get the best performance from IBM Power Systems servers, all in a convenient meta-package. This meta-package is the core of IBM PowerAI, and it facilitates a fast and easy deployment of the product.
Table 3-10 shows the IBM PowerAI DL packages.
Table 3-10 IBM PowerAI deep learning packages

PowerAI | IBM PowerAI DL packages
V1.3.0  | mldl-repo-local_1-3ibm2_ppc64el.deb
V1.3.1  | mldl-repo-local_1-3ibm5_ppc64el.deb
V1.3.2  | mldl-repo-local_1-3ibm7_ppc64el.deb
V1.3.3  | mldl-repo-local_3.3.0_ppc64el.deb
V1.3.4  | mldl-repo-local_3.4.1_ppc64el.deb and mldl-repo-local_3.4.2_ppc64el.deb
V1.4.0  | mldl-repo-local_4.0.0_ppc64el.deb
V1.5.0  | mldl-repo-local-cuda9.0-5.0.0-*.ppc64le.rpm
The main repository for IBM PowerAI DL packages is in an Ubuntu archive.
3.1.8 Libraries
Several libraries are included within the IBM PowerAI meta-package as part of the solution, as shown in Table 3-11. Many of these libraries are not used in all scenarios, although they are a prerequisite for the proper operation of IBM PowerAI. Every IBM PowerAI release contains the libraries and versions that are required by the frameworks of that release.
Table 3-11 Extra libraries included

IBM PowerAI     | OpenBLAS                                 | NVIDIA Collective Communications Library | Bazel                                    | OpenMPI
V1.3.0          | N/A                                      | N/A                                      | N/A                                      | N/A
V1.3.1          | Part of IBM PowerAI distributed packages | Version 1                                | N/A                                      | N/A
V1.3.2 - V1.3.4 | Part of IBM PowerAI distributed packages | Version 1                                | Part of IBM PowerAI distributed packages | N/A
V1.4.0          | Part of IBM PowerAI distributed packages | Version 1                                | Part of IBM PowerAI distributed packages | Part of IBM PowerAI¹
V1.5.0          | Part of IBM PowerAI distributed packages | Version 1                                | Part of IBM PowerAI distributed packages | Part of IBM PowerAI²

1 The caffe-ibm and ddl-tensorflow packages require the IBM PowerAI OpenMPI package, which is built with NVIDIA CUDA support. That OpenMPI package conflicts with Ubuntu non-CUDA-enabled OpenMPI packages.
2 IBM PowerAI V1.5 includes IBM Spectrum MPI as a prerequisite.
OpenBLAS
OpenBLAS is an open source implementation of the Basic Linear Algebra Subprograms (BLAS) API with many hand-crafted optimizations for specific processor types. OpenBLAS is developed at the Lab of Parallel Software and Computational Science, ISCAS.
OpenBLAS adds optimized implementations of linear algebra kernels for several processor architectures. OpenBLAS is a fork of GotoBLAS2, which was created by Kazushige Goto at the Texas Advanced Computing Center.
For more information, see OpenBLAS.
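One quick way to see a BLAS implementation at work is through NumPy, which delegates dense linear algebra to whatever BLAS library it was built against (OpenBLAS on many systems; this is a generic sketch, not specific to IBM PowerAI):

import time
import numpy as np

np.__config__.show()  # prints the BLAS/LAPACK libraries that NumPy was built with

a = np.random.rand(2048, 2048)
b = np.random.rand(2048, 2048)

start = time.time()
c = np.dot(a, b)  # dispatches to the BLAS dgemm kernel
print("2048x2048 matrix multiply: %.3f seconds" % (time.time() - start))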
NVIDIA Collective Communications Library
The NVIDIA Collective Communications Library (NCCL) implements multi-GPU and multi-node collective communication primitives that are performance-optimized for NVIDIA GPUs. NCCL provides routines such as all-gather, all-reduce, broadcast, reduce, and reduce-scatter that are optimized to achieve high bandwidth over PCIe and the NVLink high-speed interconnect.
 
Note: This library is usually pronounced as the word “nickel” [nik-uhl].
For more information, see NVIDIA Collective Communications Library.
Bazel
Bazel is a build tool that builds code quickly and reliably. Supported build tasks include running compilers and linkers to produce executable programs and libraries, and assembling deployable packages.
Bazel is based on the tool that Google uses to build its server software internally, expanded to build other software as well. For more information, see Bazel at GitHub and Bazel build.
OpenMPI
OpenMPI is a Message Passing Interface (MPI)3 library project combining technologies and resources from several other projects: FT-MPI, LA-MPI, LAM/MPI, and PACX-MPI.
Open MPI represents the merger between three well-known MPI implementations:
FT-MPI from the University of Tennessee
LA-MPI from Los Alamos National Laboratory
LAM/MPI from Indiana University
With contributions from the PACX-MPI team at the University of Stuttgart
These four institutions are the founding members of the OpenMPI development team.
The OpenMPI code has three major code modules:
OMPI: The MPI API and supporting logic
ORTE: The Open Run-Time Environment (support for different back-end runtime systems)
OPAL: The Open Portable Access Layer (utility and glue code that is used by OMPI and ORTE)
For more information, see OpenMPI.
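A minimal sketch of the MPI programming model follows, assuming that the mpi4py Python package (not part of IBM PowerAI) is installed against the system OpenMPI. The all-reduce collective shown here is the same pattern that distributed DL uses to sum gradients across learners:

# allreduce.py - run with: mpirun -np 4 python allreduce.py
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# Each rank contributes one value; allreduce returns the global sum to every rank.
local_value = rank + 1
total = comm.allreduce(local_value, op=MPI.SUM)
print("rank %d of %d: global sum = %d" % (rank, comm.Get_size(), total))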
 
Note: Uninstall any openmpi or libopenmpi packages before installing IBM Caffe or the Distributed Deep Learning (DDL) custom operator for TensorFlow. Purge any configuration files to avoid interference by running the following commands (replace the placeholder with the package names that the first command lists):
$ dpkg -l | grep openmpi
$ sudo apt-get purge <openmpi-packages>
3.1.9 Frameworks
IBM PowerAI provides, in a single package that is easy to deploy and set up, some of the most widely used DL frameworks and their dependencies, along with the required libraries (described in 3.1.8, “Libraries” on page 46). The following sections briefly describe each of the frameworks that are included in any IBM PowerAI release since V1.3.0.
Berkeley Vision and Learning Center upstream Caffe4
Caffe is a DL framework that is designed with expression, speed, and modularity in mind. It is developed by Berkeley AI Research (BAIR)5/Berkeley Vision and Learning Center (BVLC)6 and community contributors.
Yangqing Jia7 created the project during his PhD at the University of California, Berkeley. Caffe is released under the BSD 2-Clause license. Table 3-12 shows the BVLC Caffe versions for IBM PowerAI.
Table 3-12 Caffe Berkeley Vision and Learning Center versions

IBM PowerAI       | BVLC upstream Caffe
V1.3.0 - V1.3.3   | V1.0.0 rc3
V1.3.4            | V1.0.0 rc5
V1.4.0            | V1.0.0
V1.5.0 for POWER8 | V1.0.0
V1.5.0 for POWER9 | Not included in IBM PowerAI V1.5 for POWER9 ESP
Here are the main benefits:
Large user community with academic research projects and industrial applications in vision, speech, and multimedia
Extensible code fosters active development
Expressive architecture encourages application and innovation
Models and optimization are defined by configuration without hardcoding
Model Zoo8 of pre-trained networks
For more information, see Caffe Berkeley Vision.
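For orientation, here is a minimal inference sketch that uses Caffe's Python interface (pycaffe). The deploy.prototxt and weights.caffemodel file names are placeholders; matching pairs for pretrained networks are published in the Model Zoo:

import numpy as np
import caffe

caffe.set_mode_gpu()  # use caffe.set_mode_cpu() on a machine without GPUs

# A Caffe model is defined by configuration (prototxt) plus trained weights.
net = caffe.Net('deploy.prototxt', 'weights.caffemodel', caffe.TEST)

# Feed one input blob of the correct shape and run a forward pass.
net.blobs['data'].data[...] = np.random.rand(*net.blobs['data'].data.shape)
output = net.forward()
print({name: blob.shape for name, blob in output.items()})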
IBM optimized version of Berkeley Vision and Learning Center Caffe
IBM Caffe is a variant of BVLC/Caffe and it is optimized for NVLink-enabled IBM Power Systems servers. Table 3-13 shows the IBM Caffe versions that are used with IBM PowerAI.
Table 3-13 IBM Caffe versions

IBM PowerAI version | IBM optimized version of BVLC Caffe
V1.3.0 - V1.3.4     | V1.0.0 rc3
V1.4.0              | V1.0.0
V1.5.0 for POWER8   | V1.0.0
V1.5.0 for POWER9   | Not included in IBM PowerAI V1.5 for POWER9 ESP
Aside from all the features and capabilities that are provided by BVLC Caffe, the IBM optimized version of BVLC Caffe offers extra advantages:
This build is optimized by IBM Power Systems engineers to run on Power Systems servers, so it is highly optimized for POWER processors.
As part of the collaboration with NVIDIA, this build is also optimized for NVIDIA GPUs (including most of the optimizations that NVCaffe includes).
It includes Large Model Support (LMS).
NVIDIA fork of Caffe
NVIDIA Caffe is an NVIDIA-maintained fork of BVLC Caffe that is tuned for NVIDIA GPUs, particularly in multi-GPU configurations. Table 3-14 shows the NVIDIA Caffe version for IBM PowerAI.
Table 3-14 NVIDIA Caffe versions

IBM PowerAI version | NVIDIA Caffe (alternative) | NVIDIA Caffe (default)
V1.3.0              | N/A                        | V0.14.15
V1.3.1 - V1.3.3     | V0.14.15                   | V0.15.13
V1.3.4              | N/A                        | V0.15.14
V1.4.0              | N/A                        | V0.15.14
V1.5.0              | Deprecated, no longer part of IBM PowerAI
Here are the major features:
16-bit (half-precision) floating point training and inference support.
Mixed-precision support to store and compute data in either 64-, 32-, or 16-bit formats. Precision can be defined for every layer (forward and backward passes might be different too), or it can be set for the whole net.
Integration with cuDNN v6.
Automatic selection of the best cuDNN convolutional algorithm.
Integration with Version 1.3.4 of NCCL library for improved multi-GPU scaling.
Optimized GPU memory management for data and parameters storage, I/O buffers, and workspace for convolutional layers.
Parallel data parser and transformer for improved I/O performance.
Parallel back propagation and gradient reduction on multi-GPU systems.
Fast solvers implementation with fused CUDA kernels for weights and history update.
Multi-GPU test phase for even memory load across multiple GPUs.
Compatibility with earlier versions of BVLC Caffe and NVCaffe 0.15.
Extended set of optimized models (including 16-bit floating point examples).
Theano
Theano is a Python library that you can use to define, optimize, and efficiently evaluate mathematical expressions involving multi-dimensional arrays. Computations are expressed by using a NumPy-esque syntax, and are compiled to run efficiently on either CPU or GPU architectures.
Theano is an open source project that is primarily developed by an ML group at the Université de Montréal. On September 28, 2017, Pascal Lamblin announced that major development will cease after the 1.0 release (due before the end of 2017) because of competing offerings from strong industrial players.
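A minimal sketch of the Theano workflow that this description implies follows: declare symbolic variables, build an expression, and compile it into a callable function:

import theano
import theano.tensor as T

x = T.dmatrix('x')
y = T.dmatrix('y')

z = x + 2 * y                   # symbolic expression; nothing is computed yet
f = theano.function([x, y], z)  # compiled (and optimized) for CPU or GPU

print(f([[1, 2]], [[3, 4]]))    # [[ 7. 10.]]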
Table 3-15 shows the Theano versions that are supported by IBM PowerAI.
Table 3-15 Theano versions

IBM PowerAI version | Theano
V1.3.0 - V1.3.3     | V0.8.2
V1.3.4 - V1.4.0     | V0.9.0
V1.5.0              | Deprecated, no longer part of IBM PowerAI
For more information, see GitHub - Theano and Theano at Deep Learning.
Torch
Torch is an open source ML library, a scientific computing framework, and a script language that is based on the Lua9 programming language. Torch provides a wide range of algorithms for deep ML, and uses the LuaJIT scripting language with an underlying C implementation. Table 3-16 shows the Torch versions that are supported by IBM PowerAI.
Table 3-16 Torch versions

IBM PowerAI version | Torch
V1.3.0 - V1.4.0     | Version 7
V1.5.0              | Deprecated, no longer part of IBM PowerAI
Torch is the main package in Torch V7 where data structures for multi-dimensional tensors and mathematical operations over them are defined.
Here are the core features of Torch:
A powerful N-dimensional array
Many routines for indexing, slicing, and transposing
Interface to C through LuaJIT
Linear algebra routines
Neural network, and energy-based models
Numeric optimization routines
Fast and efficient GPU support
For more information, see Torch.
Deep Learning GPU Training System
Deep Learning GPU Training System (DIGITS) provides an interface for training and classification that can be used to train deep neural networks (DNNs) with a few clicks. DIGITS runs as a web application that is accessed through a web browser. Table 3-17 shows the DIGITS versions that are used with IBM PowerAI.
Table 3-17 Deep Learning GPU Training System versions

IBM PowerAI version | DIGITS
V1.3.0              | N/A
V1.3.1 - V1.3.3     | Version 5.0.0-rc.1¹
V1.3.4 - V1.4.0     | Version 5.0.0¹
V1.5.0              | Deprecated, no longer part of IBM PowerAI

1 The digits and python-socketio-server packages conflict with the Ubuntu python-socketio package. Uninstall the python-socketio package before installing DIGITS.
DIGITS makes it easy to visualize networks and quickly compare their accuracies. When you select a model, DIGITS shows the status of the training exercise and its accuracy, and provides the option to load and classify images during the time the network is training or after training completes.
Both Caffe and Torch are used by DIGITS for image classification. Table 3-18 shows the prerequisites for Torch + DIGITS.
Table 3-18 Prerequisites for Torch + Deep Learning GPU Training System

IBM PowerAI version | Prerequisites for Torch + DIGITS
V1.3.0              | N/A
V1.3.1 - V1.3.3     | libhdf5-serial-dev, liblmdb-dev (+ extra luarocks)¹
V1.3.4 - V1.4.0     | N/A
V1.5.0              | Deprecated, no longer part of IBM PowerAI

1 Using Torch with DIGITS requires extra packages that are not part of the IBM PowerAI V1.3.1 distribution. See steps 1 - 3 on page 52 to install the packages.
To make Torch work with DIGITS, complete the following steps:
1. Install IBM PowerAI Torch and DIGITS packages:
$ sudo apt-get install digits torch
2. Install prerequisite packages from Ubuntu:
$ sudo apt-get install libhdf5-serial-dev liblmdb-dev
3. Install extra luarocks that are needed for DIGITS Torch support:
$ source /opt/DL/torch/bin/torch-activate
$ luarocks install --local --dep-mode=order tds
$ luarocks install --local --dep-mode=order totem
$ luarocks install --local --dep-mode=order"https://raw.github.com/deepmind/torch-hdf5/master/hdf5-0-0.rockspec"
$ luarocks install --local --dep-mode=order"https://raw.github.com/Neopallium/lua-pb/master/lua-pb-scm-0.rockspec"
$ luarocks install --local --dep-mode=order lightningmdb 0.9.18.1-1 LMDB_INCDIR=/usr/include LMDB_LIBDIR=/usr/lib/powerpc64le-linux-gnu
$ luarocks install --local --dep-mode=order "https://raw.githubusercontent.com/ngimel/nccl.torch/master/nccl-scm-1.rockspec"
For more information, see DIGITS at NVIDIA.
Google TensorFlow
TensorFlow is an open source software library for numerical computation that uses data flow graphs. It is provided by Google.
Although new to the open source landscape, the Google TensorFlow DL framework has been in development for years as proprietary software. It was developed originally by the Google Brain Team for conducting research in ML and deep neural networks. The framework’s name is derived from the fact that it uses data flow graphs, where nodes represent a computation and edges represent the flow of information, in Tensor form, from one node to another.
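A minimal sketch of that data flow model, using the TensorFlow 1.x API that IBM PowerAI bundles, follows: operations are first added as nodes of a graph, and a session then runs the graph and returns the resulting tensors:

import tensorflow as tf

a = tf.constant(3.0, name='a')    # node: a constant tensor
b = tf.constant(4.0, name='b')    # node: a constant tensor
c = tf.multiply(a, b, name='c')   # node: a computation; edges carry tensors

with tf.Session() as sess:        # the session executes the graph
    print(sess.run(c))            # 12.0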
TensorFlow offers extensive documentation for installation, plus learning materials and tutorials that are aimed at helping beginners understand some of the theoretical aspects of neural networks, and at getting TensorFlow set up and running relatively painlessly. Table 3-19 on page 53 shows the TensorFlow versions for IBM PowerAI.
Table 3-19 TensorFlow versions

IBM PowerAI version | TensorFlow (alternative) | TensorFlow (recommended) | ddl-tensorflow
V1.3.0              | N/A                      | N/A                      | N/A
V1.3.1              | N/A                      | Version 0.9.0            | N/A
V1.3.2              | N/A                      | Version 0.12.0           | N/A
V1.3.3              | Version 0.12.0           | Version 1.0.0            | N/A
V1.3.4              | N/A                      | Version 1.0.1            | N/A
V1.4.0              | N/A                      | Version 1.1.0            | Technology preview¹
V1.5.0              | N/A                      | Version 1.4.0            | 0.4.0

1 DDL custom operator for TensorFlow.
This release of IBM PowerAI includes a technology preview of the IBM PowerAI DDL custom operator for TensorFlow. The DDL custom operator uses CUDA-aware OpenMPI and NCCL to provide high-speed communications for distributed TensorFlow.
The DDL custom operator can be found in the ddl-tensorflow package. For more information about DDL and about the TensorFlow operator, see the following files:
/opt/DL/ddl/doc/README.md
/opt/DL/ddl-tensorflow/doc/README.md
/opt/DL/ddl-tensorflow/doc/README-API.md
The DDL TensorFlow operator makes it easy to enable Slim-style models for distribution. The package includes examples of Slim models that are enabled with DDL, which you can access by running the following commands:
$ source /opt/DL/ddl-tensorflow/bin/ddl-tensorflow-activate
$ ddl-tensorflow-install-samples <somedir>
Those examples are based on a specific commit of the TensorFlow models repository with a small adjustment. If you prefer to work from an upstream clone, rather than the packaged examples, run the following commands:
$ git clone https://github.com/tensorflow/models.git
$ cd models
$ git checkout 11883ec6461afe961def44221486053a59f90a1b
$ git revert fc7342bf047ec5fc7a707202adaf108661bd373d
$ cp /opt/DL/ddl-tensorflow/examples/slim/train_image_classifier.py slim/
Unlike many other frameworks, TensorFlow can do partial subgraph computation, which involves taking a subsample of the total neural network and training it apart from the rest of the network. This capability is also called model parallelization, and it allows for distributed training.
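As a hedged illustration (not IBM PowerAI-specific), the following TensorFlow 1.x sketch places different parts of a graph on different devices, which is one simple form of model parallelism. It assumes two visible GPUs; allow_soft_placement lets TensorFlow fall back to the available devices:

import tensorflow as tf

with tf.device('/gpu:0'):
    w1 = tf.Variable(tf.random_normal([784, 256]))
    h = tf.nn.relu(tf.matmul(tf.zeros([32, 784]), w1))  # first part of the model

with tf.device('/gpu:1'):
    w2 = tf.Variable(tf.random_normal([256, 10]))
    logits = tf.matmul(h, w2)  # second part, fed by the output from gpu:0

config = tf.ConfigProto(allow_soft_placement=True)
with tf.Session(config=config) as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run(logits).shape)  # (32, 10)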
For more information, see TensorFlow and GitHub - TensorFlow.
Distributed Deep Learning TensorFlow
DDL TensorFlow is a library that enables TensorFlow to spread the workload over several nodes. As training models grow, they can reach a point where the data sets cannot fit into one or even multiple GPUs in a single server. In such cases, the distributed TensorFlow architecture offers a great advantage: several servers can support the workload, resulting in radically reduced processing times.
For more information, see Distributed Deep Learning TensorFlow.
Chainer
Chainer is a DL framework, primarily sponsored by Preferred Networks, that focuses on the flexibility to write complex architectures simply and intuitively.
Chainer adopts a Define-by-Run scheme, that is, the network is defined dynamically through the actual forward computation. More precisely, Chainer stores the history of computation instead of the programming logic. This strategy enables Chainer to take full advantage of the power of programming logic in Python. The Define-by-Run scheme is the core concept of Chainer.
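A minimal sketch of define-by-run follows (using Chainer's NumPy-based API): the computation history is recorded while ordinary Python control flow executes, and gradients are computed by walking that history backward:

import numpy as np
import chainer.functions as F
from chainer import Variable

x = Variable(np.array([[1.0, -2.0, 3.0]], dtype=np.float32))

# The network topology can depend on runtime values - here, a plain Python loop.
h = x
for _ in range(3):
    h = F.relu(h * 2)

h.grad = np.ones_like(h.data)  # seed the output gradient
h.backward()                   # backpropagate through the recorded history
print(x.grad)                  # gradients flow only where the ReLUs were active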
Table 3-20 shows the Chainer versions for IBM PowerAI.
Table 3-20 Chainer versions

IBM PowerAI version | Chainer
V1.3.0 - V1.3.1     | N/A
V1.3.2 - V1.3.3     | Version 1.18.0
V1.3.4              | Version 1.20.0.1
V1.4.0              | Version 1.23.0
V1.5.0              | Deprecated, no longer part of IBM PowerAI
For more information, see Chainer and Chainer documentation.
3.1.10 Other software and functions
This section presents other software and their functions.
Python
Python is one of the most widely used programming languages, and it is the most used in the DL field. Most of the frameworks that are included in IBM PowerAI support Python (and some additional programming languages, such as C and Java).
At the time of writing, the only Python version that is supported in IBM PowerAI is Version 2. IBM intends to include support for Python V3 in future releases of the product.
For more information, see Python.
Anaconda
Starting with IBM PowerAI V1.5, which uses Red Hat Enterprise Linux as the required operating system, Anaconda is required to distribute, update, and install some of the software.
Anaconda is an open source distribution of Python for large-scale data processing and scientific computing. It includes the conda package and environment manager, and it bundles many of the most commonly used data science packages, which simplifies installing, updating, and isolating the Python dependencies of the DL frameworks.
Starting from IBM PowerAI V1.5, Anaconda is a requirement when using the TensorFlow and Caffe frameworks.
Table 3-21 shows the Anaconda versions for IBM PowerAI.
Table 3-21 Anaconda versions

IBM PowerAI version | Anaconda 2
V1.3.0 - V1.4.0     | N/A
V1.5.0              | 5.0.0¹

1 Anaconda is required if TensorFlow or Caffe frameworks are used.
For more information, see Anaconda.
IBM PowerAI Distributed Deep Learning
Many times, the amount of data that must be processed exceeds the processing capacity of a single server. In other cases, the training time can be reduced by dividing the tasks among a cluster of servers.
To accelerate the time that is dedicated to training a model, the IBM PowerAI stack uses new technologies to deliver exceptional training performance by distributing a single training job across a cluster of servers.
IBM PowerAI DDL provides intelligence about the structure and layout of the underlying cluster (topology), which includes information about the location of the cluster’s different compute resources, such as GPUs and CPUs and the data on each node.
IBM PowerAI is unique in that this capability is incorporated into the DL frameworks as an integrated binary file, reducing complexity for clients as they bring in high-performance cluster capability.
As a result of this capability, IBM PowerAI with DDL can scale jobs across many cluster resources with little communication overhead.
For an in-depth discussion of IBM PowerAI and IBM PowerAI DDL scalability, read IBM PowerAI DDL (2017), written by a team of IBM scientists. It describes and demonstrates that DL workload performance with IBM PowerAI DDL scales close to linearly with the number of nodes.
 
Note: At the time of writing, DDL is available as a technology preview with IBM PowerAI V1.4 and V1.5.0, and is compatible with bundled TensorFlow and IBM Caffe frameworks.
For more information, see Deep Learning and PowerAI Development.
Large Model Support
Because models are becoming more complex and data sets are getting larger, DL workloads are growing. In many cases, the size of the data set outgrows the GPU memory space, which keeps these workloads from being run on these systems.
Figure 3-1 shows data fragmentation in a traditional approach that divides the problem into 16 GB chunks without LMS.
Figure 3-1 Large Model Support disabled
LMS uses the system memory together with the GPU memory to overcome GPU memory limitations in DL training.
Figure 3-2 shows the use of system memory with GPU memory in a system with LMS enabled.
Figure 3-2 Large Model Support enabled
IBM PowerAI V1.4.0 and later include support for LMS in IBM Caffe as a technology preview.
IBM Spectrum MPI
IBM Spectrum MPI is a high-performance, production-quality implementation of MPI that accelerates application performance in distributed computing environments. Based on OpenMPI, IBM Spectrum MPI provides a familiar interface that is easily portable. It incorporates advanced CPU affinity features, dynamic selection of interface libraries, superior workload manager integrations, improved performance, and improved RDMA networking with support for NVIDIA GPUs. IBM Spectrum MPI supports a broad range of industry-standard platforms, interconnects, and operating systems, ensuring that parallel applications can run on a multitude of platforms.
IBM Spectrum MPI delivers a number of features:
Improved RDMA networking, supporting NVIDIA GPUs based on the IBM PAMI back end
Reduce time to results:
 – Improved point-to-point performance by way of proprietary PAMI back end (for specific MOFED)
 – Enhanced collective library (blocking and non-blocking algorithms)
Ease of use for installations, starts, and debugging
Cluster test options, improving startup services
Debug and instrumentation libraries
IBM High Performance Computing Toolkit for analyzing performance of applications
A single MPI that is supported by IBM for both IBM POWER8 and x86
For more information, see IBM Spectrum MPI.
Table 3-22 recaps all additional software and functions that are described in this section.
Table 3-22 Extra software

IBM PowerAI version | Python    | IBM PowerAI DDL    | LMS                | IBM Spectrum MPI
V1.3.0 - V1.3.4     | Version 2 | N/A                | N/A                | N/A
V1.4.0              | Version 2 | Technology preview | Technology preview | N/A
V1.5.0 for POWER8   | Version 2 | Technology preview | Technology preview | 10.1
V1.5.0 for POWER9   | Version 2 | Technology preview | Technology preview | 10.2
3.2 IBM PowerAI compatibility matrix
Figure 3-3 shows the requirements and elements that are included in every version of IBM PowerAI as of December 2017.
Figure 3-3 Compatibility matrix

1 As of December 2017, IBM PowerAI V1.5 for POWER9 systems was released as part of an Early Ship Program. More frameworks are expected to be part of the final release of this version for POWER9 servers.
2 ppc64le is a pure little endian mode that was introduced with POWER8 as the prime target for technologies that are provided by the OpenPOWER Foundation, aiming at enabling the porting of the x86 Linux-based software with minimal effort.
3 MPI is a standardized and portable message-passing standard that is designed by a group of researchers from academia and industry to function on a wide variety of parallel computing architectures. The standard defines the syntax and semantics of a core of library routines that are useful to a wide range of users writing portable message-passing programs in C, C++, and Fortran. There are several well-tested and efficient implementations of MPI, many of which are open source or in the public domain.
4 Y. Jia, et al., “Caffe: Convolutional Architecture for Fast Feature Embedding”, arXiv preprint arXiv:1408.5093, 2014.
5 The BAIR Lab brings together University of California Berkeley researchers across the areas of computer vision, ML, natural language processing (NLP), planning, and robotics. For more information, see BAIR.
8 A standard format for packaging Caffe model information, plus tools to upload/download model information to/from GitHub Gists, and to download trained .caffemodel binary files.