High performance clusters
Today’s computing infrastructure is increasingly used as a cost-effective way to provide scalable, high-performance, highly available solutions for various workloads. For many, clustered systems are the answer. This chapter highlights the architectural and management aspects of high-performance computing (HPC) cluster design. It also includes some of the major features of IBM Platform Computing products and how these features can help address the challenges of HPC clusters.
This chapter includes the following sections:
2.1, “Cluster management”
2.2, “Workload management”
2.1 Cluster management
IBM Platform Computing products provide a focused technical computing management software portfolio for clients who are looking for simplified, high-performance, and agile systems workload and resource management. For cluster management, we can choose between two products, based on business requirements and functionality: IBM Platform HPC or IBM Platform Cluster Manager. There are several differences between these cluster management products and their editions.
The differences between xCAT (Extreme Cloud Administration Toolkit), Platform Cluster Manager Standard Edition (PCM-SE), and Platform HPC are shown in Table 2-1.
Table 2-1 Comparison of xCAT, Platform Cluster Manager Standard Edition, and Platform HPC
Features                                         | xCAT               | PCM-SE | Platform HPC
Linux x64 support                                | X                  | X      | X
AIX support                                      | X                  | -      | -
Windows support                                  | X                  | -      | -
Linux support                                    | X                  | X      | -
Scalability (nodes)                              | 10,000+            | 2,500  | 200 - 300
Hardware management                              | X                  | X      | X
Hardware and system monitoring                   | Third-party agents | X      | X
Management portal                                | -                  | X      | X
One-step installation                            | -                  | X      | X
Ease of use (provisioning templates and so on)   | -                  | X      | X
Third-party software kits (ICR, OFED, and so on) | -                  | X      | X
Workload management                              | -                  | -      | X
Platform MPI                                     | -                  | -      | X
Job management portal                            | -                  | -      | X
Open source                                      | X                  | -      | -
Commercial support                               | Optional           | X      | X
xCAT is an open source software package for cluster management and offers support for different operating systems with good scalability. xCAT can support clusters of more than 10,000 nodes. However, one drawback is that xCAT requires a third-party monitoring agent for hardware and system monitoring.
PCM-SE supports only Linux (x86 64-bit and IBM POWER). In terms of scalability, PCM-SE in its current version can scale up to 2,500 nodes via the GUI. The added functionality in PCM-SE is the centralized web interface, the Web Portal, which makes it easy to manage a complex cluster as a single system. PCM-SE can provision the operating system with the software components and allows administrators to define provisioning templates for ease of software package management for cluster nodes.
The kits framework allows third-party users to package multiple software components (such as InfiniBand drivers and GPU runtime software) with their configuration, and then deploy them onto the cluster nodes. The PCM-SE installation is quick and easy. It is a one-step installation that installs all of the required components (including embedded xCAT as a provisioning engine) and all dependent packages, and completes the post-installation steps, which include configuring services.
IBM Platform HPC (pHPC) is targeted at small and medium clusters. The scalability is around 200 - 300 compute nodes. pHPC has most of the PCM-SE functionality (pHPC supports only Linux x86). PCM-SE is a part of Platform HPC (providing the cluster management functionality) and works seamlessly with the intelligent workload scheduler (based on Platform LSF), Platform MPI, and Platform Application Center (PAC), which are bundled in Platform HPC. pHPC also provides a job management portal, which allows users to submit and manage their jobs. Together, they deliver a complete set of cluster management functions for technical computing users.
IBM Platform Cluster Manager has two editions: Standard Edition and Advanced Edition. From a functionality perspective, there are some differences between Platform Cluster Manager Standard Edition (PCM-SE), and the Platform Cluster Manager Advanced Edition (PCM-AE), which are listed in Table 2-2.
Table 2-2 IBM Platform Cluster Manager Standard Edition versus Advanced Edition
Features                                         | PCM-SE | PCM-AE
Physical provisioning                            | X      | X
Server monitoring                                | X      | X
Hardware monitoring                              | X      | -
IBM Platform HPC integration                     | X      | -
Kit framework for software deployment            | X      | -
VM provisioning                                  | -      | X
Multiple cluster support                         | -      | X
User self-service portal                         | -      | X
Storage management                               | -      | X
Network management                               | -      | X
Cluster definitions for whole cluster deployment | -      | X
Multiple tenants support                         | -      | X

Supported environments                           | PCM-SE | PCM-AE
IBM Platform LSF family                          | X      | X
IBM Platform Symphony family                     | X      | X
Other workload managers                          | X      | X
As you can see in Table 2-2 on page 9, PCM-SE manages a static computing cluster for a single group of users. PCM-AE differs in three significant ways: it manages dynamic clusters, it supports multi-tenant user groups, and it also manages virtualized environments.
PCM-SE supports non-server hardware monitoring, which is not supported in PCM-AE. PCM-SE provides a kit framework that is designed for software deployment, whereas PCM-AE has cluster definitions for cluster deployments. These two mechanisms are different, although they aim at the same goal.
PCM-SE is also integrated with pHPC, whereas PCM-AE has no direct integration; with PCM-AE, you must treat pHPC as a software layer in the cluster deployments. Both PCM-SE and PCM-AE can work with the Platform LSF family, the Platform Symphony family, and non-IBM workload management environments.
From the architecture point of view, PCM-SE should be the default choice for opportunities that require cluster management functions because it matches most use cases. Consider PCM-AE only when one of the following two cases applies:
The first case is that the cluster is dynamic: the size of the cluster constantly changes over time, the servers are constantly shared between clusters (a multiple-cluster situation), or clusters are provisioned and deprovisioned over time.
The second case is that there are virtual machines in the clustered environment, which means that a hypervisor must be deployed in the cluster.
2.1.1 IBM Platform HPC
Clusters that are based on open source software and the Linux operating system dominate HPC. This is due in part to their cost-effectiveness and flexibility and the rich set of open source applications available. Platform HPC (pHPC) provides a complete set of technical and high-performance computing management capabilities for Linux clusters in a single product. System administrators can use Platform HPC to manage a complex cluster as a single system by automating the deployment of the operating system and software components. Platform HPC provides provisioning and maintenance capabilities. It also includes centralized monitoring with alerts and customizable alert actions.
Platform HPC includes the following features:
Cluster management (embedded xCAT as the provisioning engine)
Workload management (based on IBM Platform LSF Express)
Workload monitoring and reporting
System monitoring and reporting
Robust commercial MPI Library (based on IBM Platform MPI Standard Edition)
Application support (integrated application scripts/templates)
Accelerator support, including GPU and Intel Xeon Phi coprocessor scheduling, management, and monitoring
High availability of the pHPC cluster environment
Unified Web Portal
Use cases for Platform HPC
IBM Platform HPC allows technical computing users in industries such as manufacturing, oil and gas, life sciences, and higher education to deploy, manage, and use their HPC cluster through an easy-to-use web-based interface. This minimizes the time that is required for setting up and managing the cluster and allows users to focus on running their applications rather than managing the infrastructure.
IBM Platform HPC comes complete with job submission templates for ANSYS Mechanical, ANSYS Fluent, ANSYS CFX, LS-DYNA, MSC Nastran, Schlumberger ECLIPSE, Simulia Abaqus, NCBI Blast, NWChem, ClustalW, and HMMER. By configuring these templates that are based on the application settings in your environment, users can start using the cluster without writing scripts. Cluster users who deploy home-grown or open source applications can use the Platform HPC scripting guidelines. These interfaces help minimize job submission errors and are self-documenting, which enables users to create their own job submission templates.
 
Platform Application Center (PAC) Integration: Platform LSF add-ons are not included in Platform HPC and are not installed with it. The add-ons must be downloaded and installed separately. Platform HPC contains some functions of PAC (job submission, job management, and application templates). If a customer purchases PAC Standard, they receive the entitlement. By applying the entitlement to existing Platform HPC, some other functions (remote 2D and 3D visualization) are enabled. However, the rest of the PAC Standard functions exist only in the PAC binary. Therefore, if the customer requires these functions (specifically, Role Based Access Control), they must install PAC separately.
Component model
Platform HPC software components support various computationally intensive applications that run on a cluster. To support such applications, the Platform HPC software components that are shown in Figure 2-1 provide several services.
Figure 2-1 Platform HPC software components diagram
Before any software applications are started, all of the nodes must be installed with the operating system and any application-specific software. This function is provided by the provisioning engine. The user creates or uses a predefined provisioning template that describes the desired characteristics of the compute node software. The provisioning engine listens for boot requests over a selected network and installs the system with the desired operating system and application software. After the installation is complete, the target systems are eligible to run applications.
Although the compute images can run application software, access to these images is normally controlled by the job scheduler (Platform LSF) that runs as the workload manager. This scheduler function ensures that computational resources on the compute nodes are not overused by serializing access to them. The properties of the job scheduler are normally defined during installation setup. The scheduler can be configured to dispatch different workloads to one of the job placement agents (Platform LSF agents). A job placement agent starts particular workloads at the request of the job scheduler. There are multiple job placement agents on the system, one on each of the operating system images.
The monitoring and resource agents on every operating system image report back to the provisioning agent and the job scheduler about the state of the system. This provides a mechanism to raise alerts when there is a problem, and to make sure that jobs are scheduled only on operating system images that are available and have resources.
The web portal provides an easy-to-use mechanism for administrators to control and monitor the overall cluster, and it gives users easy-to-use access to the system for job submission, management, and reporting.
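As a minimal illustration of this flow, the following sketch shows a user session with the Platform LSF command-line tools. The queue name, program, and output file are assumptions for illustration and must match the site configuration.
# Submit a job to the scheduler; "normal" is an assumed queue name
bsub -q normal -n 4 -o job_%J.out ./my_simulation
# Check the state of submitted jobs (PEND, RUN, DONE)
bjobs
# List the compute hosts (job placement agents) and their job slots
bhosts
# Review the load information that the monitoring agents report
lsload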
Operational model
The sample highly available environment that is shown in Figure 2-2 is used to show how to design a deployment of the Platform HPC cluster. This is only one of several possible configurations. In our sample, there are four networks (public, provisioning, management, and application) and one shared cluster storage that is supplemented with a two-node GPFS cluster.
Figure 2-2 Platform HPC cluster deployment on the physical hardware
Cluster nodes
Management nodes, compute nodes, and visualization nodes can be used in the Platform HPC cluster. Each node has its own role.
Management node
A management node is the first node that is installed in your cluster. Every cluster requires a management node. It controls the rest of the nodes in the cluster. In previous versions of pHPC, this node is also called the head node or master node. A management node acts as a deployment node at the user site and contains all of the software components that are required for running the application in the cluster. After the management node is connected to a cluster of nodes, it provisions and deploys the compute nodes with client software. The software that is installed on the management node provides the following functions:
Administration, management, and monitoring of the cluster
Installation of compute nodes
Stateless and stateful management
Repository management and updates
Cluster configuration management
HPC kit management
Provisioning templates management
Application templates management
Accelerated parallel application processing and application scaling by using the Platform MPI kit
Workload management, monitoring, and reporting by using the Platform LSF kit
User logon, compilation, and submission of jobs to the cluster
Acting as a firewall to shield the cluster from external nodes and networks
Acting as a server for many services such as DHCP, TFTP, HTTP, and optionally DNS, LDAP, NFS, and NTP
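The DHCP and TFTP roles are what allow compute nodes to boot over the provisioning network. The provisioning engine generates this configuration automatically, but conceptually it resembles the following minimal sketch; the subnet, addresses, and boot file name are assumptions for illustration only.
# /etc/dhcp/dhcpd.conf fragment on the management node (illustrative values)
subnet 172.20.0.0 netmask 255.255.0.0 {
  range 172.20.1.1 172.20.255.254;         # address pool for the provisioning network
  next-server 172.20.0.1;                  # management node acts as the TFTP server
  filename "pxelinux.0";                   # network boot loader served over TFTP
  option domain-name-servers 172.20.0.1;   # management node can also provide DNS
}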
Compute node
Compute nodes are designed for computationally intensive applications to satisfy the functional requirements of the planned use cases. The compute node is provisioned and updated by the management node and performs the computational work in a cluster. The workload management system (Platform LSF) sets the number of job slots on a compute node to the number of CPU cores. After the compute node is provisioned, it is installed with the operating system (OS) distribution, the Platform LSF kit (workload manager agent, monitoring, and resource management agent), the Platform MPI kit, and other custom software (as defined by the user). The compute node can have some local disk for the OS and temporary storage that is used by running applications. The OS might also be configured to boot as a diskless system to improve I/O performance (by using stateless provisioning).
The compute nodes also mount NFS or can be configured with GPFS for shared storage. These compute nodes can cooperate in solving a problem by using MPI, which is facilitated by the connectivity to a high-speed interconnect network. Some applications do not require large disk storage on each compute node during simulation. However, large models that do not fit in the available memory must be solved out-of-core and then benefit from robust local storage.
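The following line is a sketch of how such a parallel job might be submitted through the workload manager. The queue name, executable, and resource string are assumptions, and the exact MPI launch syntax depends on the Platform MPI integration that is configured at the site.
# Request 64 slots packed 16 per node on an assumed "parallel" queue, then launch with MPI
bsub -q parallel -n 64 -R "span[ptile=16]" -o cfd_%J.log mpirun ./cfd_solver input.dat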
Visualization node
The visualization node is the same as a compute node, except that it contains one or more graphics processing units (GPUs) for rendering 3D graphics, computer-aided engineering (CAE) design, validation of product parts with dynamic simulations, or stress analysis on individual components. Depending on the applications, each GPU can support several simultaneous interactive sessions. The pre- and post-processing applications are mostly serial; therefore, the processor resources in the node should be sufficient to handle their computation requirements. The visualization nodes often have some local disk space for the OS and temporary storage that is used by running applications. The visualization nodes also mount NFS or GPFS file systems for shared storage.
Login node
The login node functions as a gateway into the cluster. When users want to access the cluster from the public network, they must first log in to the login node before they can log in to other cluster nodes. In general, we recommend this as a best practice to prevent unauthorized access of the management node.
Cluster networks
There are several networks that are used in a pHPC cluster. Each network might be dedicated or might share a common physical network with others.
Public network
A public network connects the pHPC cluster to a corporate network.
Provisioning network
A provisioning network (private network) is an internal network to provision and manage the cluster nodes. The provisioning network cannot be accessed by nodes on the public network. The provisioning network often is a Gigabit Ethernet network. In general, the provisioning network serves the following purposes:
Cluster management and monitoring
Workload management and monitoring
Message passing
It is common practice to perform message passing over a much faster network by using a high-speed interconnect with low latency. For more information, see “Application network” on page 15.
Management network
The management network (BMC network) is a network that provides out-of-band access to cluster nodes for hardware management. The network provides access to the Chassis Management Module (CMM) and the Integrated Management Module (IMM) of each cluster node. The management network cannot be accessed by nodes on the public network. (If public access is needed, the switch for the public network can be configured to enable routing between the public and management networks.)
Application network
This network (compute network) is used mainly by applications (for example, MPI applications) to efficiently share data among different tasks within an application across multiple nodes. This network is often used as a data path for applications to access the shared storage. The application network uses a high-speed interconnect, such as 10 Gb/40 Gb Ethernet or QDR/FDR InfiniBand. If the pHPC cluster includes a visualization node, there must be a route to the compute network from the external network. This routing is not necessary (except to the management node) if the system is intended only for batch work. It is possible to combine these networks by using virtual local area networks (VLANs).
These cluster networks can be combined into one or two physical networks to minimize the network cost and cabling in some configurations.
A typical combined deployment can be one of the following examples:
Combined management network and provisioning network, plus a dedicated high-speed interconnect for applications. This is often the case if the high-speed interconnects are InfiniBand.
Combined provisioning network and application network by using 10-Gigabit Ethernet, plus a dedicated management network. This network architecture can be implemented when the management network has a dedicated switch on the chassis.
Both combined deployment options are available and supported by the pHPC cluster.
Cluster storage and file system
The following types of file systems are supported by the pHPC cluster:
NFS: This file system is recommended for applications that are not I/O-intensive. The storage is connected to the management node, which acts as an NFS server.
IBM General Parallel File Systems (GPFS): This file system is recommended for I/O-intensive applications. In this case, the management node acts as the GPFS server.
If there is an external storage system that contains data that is required by the application, the following supported access methods are available:
Connect the storage to the cluster private network (that is, the application network). This method allows applications that are running on compute nodes to access the storage. This method should be the only option if the application requires constant changes to the files and the application performance is heavily dependent on the performance of accessing those files. If the data is stored in a database, this option should also be used.
Connect the storage to the management node. When a job requires access to certain files, these files must be named explicitly during the job submission time. These required files are copied automatically to the compute nodes as part of the job scheduling process before the job starts. This is a viable option if the file size is small (less than 100 MB) and the data is not stored in a database. Similarly, the output files the job creates can be transferred automatically back to the management node after the job is completed.
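With the second access method, Platform LSF can stage the named files as part of the job submission. A hedged sketch follows; the file and program names are hypothetical.
# Copy input.dat to the execution host before the job starts (">") and
# copy results.out back to the submission host after the job completes ("<")
bsub -f "input.dat > input.dat" -f "results.out < results.out" ./solver input.dat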
NFS
If there is no external shared storage for the pHPC cluster, the local storage on the management server (including SAS-attached disk arrays) provides the shared file system. The management node is configured and sized with enough resources to simultaneously allow file serving and other management functions. For many use cases, NFS access is sufficient to support the workload. For better performance, external shared storage can be connected to the management node over Fibre Channel (SAN), depending on the system host connectivity options.
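When the management node acts as the NFS server, the shared directories are exported over the provisioning network. The following is a minimal sketch of /etc/exports; the paths and subnet are assumptions.
# /etc/exports on the management node (illustrative paths and subnet)
/home    172.20.0.0/16(rw,sync,no_root_squash)
/shared  172.20.0.0/16(rw,sync,no_root_squash)
# Re-export after editing the file
exportfs -ra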
GPFS
When the management nodes (MN01 and MN02) are configured as a highly available environment, shared storage is required to share user home directories and system working directories. We recommend building a two-node GPFS cluster with tiebreaker disks over these management nodes. We create a GPFS cluster by using two quorum nodes that are also the storage (NSD server) nodes. All the remaining compute nodes are NSD clients and do not participate in GPFS cluster quorum voting (non-quorum nodes).
 
Cluster NFS: In addition to the traditional exporting of GPFS file systems by using the Network File System (NFS) protocol, the use of GPFS on Linux allows you to configure a subset of the nodes in the cluster to provide a highly available solution for exporting GPFS file systems by using NFS. The participating nodes in this case act as GPFS clients and are designated as Cluster NFS (CNFS) member nodes, and the entire setup is frequently referred to as CNFS or a CNFS cluster.
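The commands below sketch how the two-node GPFS cluster with tiebreaker disks that is described above might be created. The node names, NSD names, file system name, and stanza file are assumptions, and the NSD stanza file for the shared disks must be prepared beforehand.
# Create the cluster with both management nodes as quorum/manager nodes
mmcrcluster -N "mn01:quorum-manager,mn02:quorum-manager" -p mn01 -s mn02 -r /usr/bin/ssh -R /usr/bin/scp
mmchlicense server --accept -N mn01,mn02
# Create the NSDs from the prepared stanza file and use three of them as tiebreakers
mmcrnsd -F /tmp/nsd.stanza
mmchconfig tiebreakerDisks="nsd1;nsd2;nsd3"
# Start GPFS, then create and mount the shared file system
mmstartup -a
mmcrfs gpfs1 -F /tmp/nsd.stanza -T /gpfs/shared
mmmount gpfs1 -a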
High availability
A high availability cluster minimizes downtime by providing one active management node (MN01) and one standby management node (MN02). Services run only on the active management node. If at any point a service stops or quits unexpectedly, the service is restarted on the same node. When a failover occurs, the standby management node takes over as the active management node and runs all of the services.
Virtual IP addresses
For the services to switch nodes, service access points are defined to enable the high availability process. A service access point defines a virtual IP address that is used to access HPC services on the active management node. In a failover, the node that becomes the active management node also takes over the virtual IP addresses. A virtual IP address for the public network and a virtual IP address for the provisioning network must be defined in the IP address ranges of your networks.
Shared file system
Shared file systems are required to set up a high availability environment on pHPC. Shared file systems are used to store user and system work data. In a high availability environment, all shared file systems must be accessible by the provisioning network for both management nodes and compute nodes.
2.1.2 IBM Platform Cluster Manager Standard Edition
IBM Platform Cluster Manager Standard Edition (PCM-SE) is easy-to-use, powerful cluster management software for technical computing users. PCM-SE delivers a comprehensive set of functions to help manage hardware and software from the infrastructure level. It automates the deployment of the operating system and software components, and complex activities, such as provisioning and maintenance of a cluster. It includes support for the Red Hat Enterprise Linux family of operating systems for x86 64-bit and IBM POWER®.
By using the centralized user interface, system administrators can manage complex clusters as a single system, with the flexibility to add customized features that are based on the specific requirements of their environment. It provides a Kit framework within an x86 ecosystem for easy software deployment, such as InfiniBand drivers and GPU runtime software.
PCM-SE provides monitoring capability for most components within a cluster so users can easily visualize the performance and condition of the cluster. The monitoring agent is the same technology that is used in IBM Platform LSF and IBM Platform Symphony and is easy to extend and customize. It also can monitor non-server components, such as chassis, network switches, IBM GPFS, GPU and co-processors, and customized devices for efficient usage of the overall infrastructure. It also adds management node automatic failover capability to ensure continuity of cluster operations.
By using xCAT technology, PCM-SE offers greater management scalability by scaling up to 2,500 nodes via the GUI. PCM-SE runs on various types of IBM servers that include the most recent iDataPlex® servers, NextScale servers, FlexSystem nodes, and System x rack-based servers. It is also supported on industry-standard non-IBM x86 hardware.
PCM-SE includes the following features:
Quick and easy installation
Cluster management (embedded xCAT as the provisioning engine)
Kit framework that is designed for software deployment and maintenance
Robust and scalable system monitoring and reporting
Centralized Web Portal
Cross-provisioning compute nodes
Use cases for PCM-SE
Typical use of PCM-SE is with HPC workload managers, such as IBM Platform LSF, IBM Platform Symphony, Oracle Grid Engine, PBS, Maui/Moab, and Hadoop. Because of its scalability, it is used as a commercially supported cluster manager to manage Big Data clusters, large HPC clusters, and scale-out application appliances.
Component model
The PCM-SE software components that are shown in Figure 2-3 are based on the Extreme Cloud Administration Toolkit (xCAT), which provides a unified interface for hardware control, discovery, and operating system and software component deployment. The back-end PCM-SE features are coded as xCAT plug-ins that complement xCAT as a provisioning and management foundation. The xCAT plug-ins store the cluster configuration and user settings for the cluster in a PostgreSQL database, the Platform Cluster Manager database (PCM DB). The cluster monitoring data is also stored in the PCM DB for reporting and analysis. The PERF service collects and aggregates performance data from the compute nodes. Other agentless monitoring data is loaded into the PCM DB by PERF data loaders.
Figure 2-3 PCM-SE software components diagram
The Web Portal is the front end of PCM-SE, which provides the following capabilities:
Resources dashboard for comprehensive view of cluster status (cluster health, cluster performance, and rack view)
Manage hosts (nodes and node groups), unmanaged devices, licenses, and networks
Manage provisioning templates through image profiles and network profiles (packages, kit-components, kernel modules, networks, post-install, and post-boot scripts)
Manage networks
Manage OS distributions
Manage kit library
View and manage resource reports and resource alerts
The high availability manager (HA manager) is a service that runs on both management nodes (active and standby). It monitors the heartbeat signal and controls all services by using the HA service agent. PCM-SE uses EGO service controller (EGOSC) as the HA manager. If any service that is controlled by the HA manager fails, the HA manager restarts that service. If the active management node fails, the HA manager detects that an error occurred and migrates all controlled services to the standby management node.
Operational model
From the architecture point of view, the operational model of a cluster that uses PCM-SE shares its topology with the pHPC cluster operational model. One of the differences is scalability. PCM-SE offers greater management scalability by scaling up to 2,500 nodes, and it can be used to deploy small and large clusters. A small cluster is considered to be a one-rack solution (maximum of 42 - 56 nodes), and a large cluster is considered to be more than a one-rack solution (up to 2,500 nodes). This affects the networking design: if you want to run a cluster with more nodes than a single rack can contain, a multiple-rack setup is required. As you can see in the sample operational model that is shown in Figure 2-4, spine switches are used to connect top-of-rack (TOR) switches to create a single cluster from multiple machine racks.
Figure 2-4 PCM-SE cluster deployment on the physical hardware
The operational model for PCM-SE is broken down into five areas: management nodes, login nodes, compute nodes, shared storage, and networking.
Management nodes
PCM-SE has built-in failover capability for the management node. A high availability environment includes two PCM-SE management nodes (MN01 as active and MN02 as standby) that are installed locally with the same software and network configuration (except the host name and IP address). All IP addresses (management node IP addresses and virtual IP address) are in the IP address range of your networks. The management node connects to the public and provisioning networks.
Login nodes
As a best practice to prevent unauthorized access of the management nodes, we recommend the use of login nodes (LN01 and LN02) as a gateway into the cluster. The login nodes connect to the public and provisioning networks.
Compute nodes
The compute nodes (CN01 to CN60) are provisioned and updated by the management node and perform the computational work in a cluster. In PCM-SE, node provisioning installs an operating system and applications on a node. To provision a node, associate it with a provisioning template. The provisioning template includes an image profile, a network profile, and a hardware profile. The operating system (OS) distribution that you want to use to provision your nodes can be added to the Web Portal. When the OS distribution is added, two default image profiles are automatically created: one stateful image profile and one stateless image profile.
Stateless provisioning loads the operating system image into memory. Changes that are made to the operating system image are not persistent across compute node reboots. You can use diskless provisioning by RAM-root or by compressed RAM-root.
Stateful provisioning loads the operating system image onto persistent storage. Changes that are made to the operating system image are persistent across compute node reboots. The persistent storage can be a local disk, SAN, or iSCSI device.
Users often install homogeneous clusters where management nodes and compute nodes use the same OS distribution. In some cases, the management node and compute nodes can be different. This is called a Cross-Distro cluster. This is an advanced feature that is supported by PCM-SE.
 
Note: A mix of OS distributions in the same cluster is supported. However, a mix of x86 and Power nodes in the same cluster is not supported.
Shared storage
To create a highly available environment, shared storage is required to share user home directories and system working directories. All shared file systems must be accessible by the provisioning network for both management nodes and compute nodes. For I/O-intensive applications, we recommend building a two-node GPFS cluster with tiebreaker disks over both shared storage nodes (MN01 and MN02).
We can improve GPFS performance when InfiniBand is used as a high-speed and low-latency interconnect. InfiniBand provides two modes that help increase performance:
GPFS cluster management can use IP over InfiniBand (IPoIB).
GPFS Network Shared Disk (NSD) communication can use Remote Direct Memory Access (RDMA) InfiniBand protocol.
GPFS can define a preferred network subnet topology; for example, designate separate IP subnets for intra-cluster communication and the public network for GPFS data. This provides for a clearly defined separation of communication traffic and allows you to increase the throughput and possibly the number of nodes in a GPFS cluster. Instead of separate IP subnets for intra-cluster communication (GPFS cluster management and heartbeat), we can use IPoIB.
GPFS on Linux supports an RDMA InfiniBand protocol to transfer data to NSD clients. GPFS has the verbsRdma and verbsPorts options for the RDMA function. The InfiniBand specification does not define an API for that; however, the OpenFabrics Enterprise Distribution (OFED) is a package for Linux that includes all of the needed software (the libibverbs package as a Verbs API) to work with RDMA. The verbsRdma parameter enables the use of RDMA, and the verbsPorts parameter sets the device that you want to use. These parameters are set by using the GPFS mmchconfig command. For more information about GPFS with RDMA, see Implementing the IBM General Parallel File System (GPFS) in a Cross Platform Environment, SG24-7844, and the General Parallel File System (GPFS) Wiki page, which is available at this website:
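The following sketch shows how these tunables might be set; the parameter names are the GPFS options that are described above, while the subnet, HCA device, and port are assumptions for illustration.
# Use the IPoIB subnet for intra-cluster (daemon) communication
mmchconfig subnets="10.10.0.0"
# Enable RDMA for NSD data transfer and name the HCA port to use
mmchconfig verbsRdma=enable
mmchconfig verbsPorts="mlx4_0/1"
# Restart GPFS on the affected nodes for the changes to take effect
mmshutdown -a && mmstartup -a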
Networking
The PCM-SE cluster uses the same networks as the pHPC cluster (public, provisioning, management, and application). Each network might be dedicated or might share a common network with others. In a multi-rack cluster, we recommend the use of top-of-rack switches and setting up VLANs for the different networks instead of using one top-of-rack switch per network.
When a server contains two network ports of the same speed, they can be tied together by using the Link Aggregation Control Protocol (LACP). Each rack has two top-of-rack switches with HA (by using inter-switch links at the top-of-rack level). This two-switch cluster with the Virtual Link Aggregation Group (vLAG) feature configured allows multi-switch link aggregation, which provides higher performance and optimizes parallel active-active forwarding. From each top-of-rack switch, aggregated uplinks can be configured to the up-level spine switches to build a redundant, two-tier, Layer 3 fat tree network.
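A minimal sketch of an LACP (802.3ad) bond on a RHEL 6 style node follows, assuming interfaces eth0 and eth1 and illustrative addressing; the matching port-channel/vLAG configuration must also exist on the two top-of-rack switches.
# /etc/sysconfig/network-scripts/ifcfg-bond0
DEVICE=bond0
BONDING_OPTS="mode=802.3ad miimon=100 xmit_hash_policy=layer3+4"
IPADDR=172.20.1.10
NETMASK=255.255.0.0
ONBOOT=yes
BOOTPROTO=none
# /etc/sysconfig/network-scripts/ifcfg-eth0 (ifcfg-eth1 is identical except for DEVICE)
DEVICE=eth0
MASTER=bond0
SLAVE=yes
ONBOOT=yes
BOOTPROTO=none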
2.1.3 IBM Platform Cluster Manager Advanced Edition
Platform Cluster Manager Advanced Edition (PCM-AE) manages the provisioning of multiple multi-tenant analytics and technical computing clusters in a self-service and flexible fashion. PCM-AE provides secure multi-tenancy with access controls, policies, and resource limits to enable sharing. Based on assigned user roles, it provides rapid self-service provisioning of heterogeneous HPC environments so that you get the clusters that you need, on physical and virtual resources.
You can deploy an HPC cluster with underlying physical servers composing your cluster infrastructure, you can provision on top of a virtualization layer that runs on the bare metal, or you can mix both approaches to create a hybrid cluster. You can maximize the consolidation level of your infrastructure as a whole with virtualization, or you can isolate workloads by engaging only physical servers.
PCM-AE helps decrease operating costs by increasing the usage of pooled resources and the operational efficiency (managing multiple separate clusters through a single point of administration). It provides elasticity: the size of a user’s logical cluster can be dynamically expanded and shrunk over time based on workload and the resource allocation policy. PCM-AE runs on various types of IBM servers that include the most recent iDataPlex servers, NextScale servers, FlexSystem nodes, and System x rack-based servers. It is also supported on non-IBM industry-standard x86_64 hardware.
The PCM-AE environment includes the following components:
User self-service and administration portal.
The management server, which is responsible for running the system services and managing provisioned clusters.
The xCAT provisioning engine, which provisions clusters with physical machines. The provisioning engine is responsible for managing the physical machines that make up provisioned clusters.
A database to store operational data. You can have PCM-AE install a new MySQL database, or you can use an existing MySQL or Oracle database.
Physical machines, which are the compute nodes within a cluster.
Optionally, PCM-AE includes the following components:
Hypervisor hosts, which run and manage virtual machines (VMs). When you are provisioning clusters with VMs, the hypervisor hosts provide the actual virtual resources that make up the clusters.
A Lightweight Directory Access Protocol (LDAP) server for user authentication in a multi-tenant environment.
An IBM General Parallel File System (GPFS) server for secure storage by using GPFS with PCM-AE.
A Mellanox Unified Manager (UFM) server for an InfiniBand secure network with PCM-AE.
Use cases for PCM-AE
PCM-AE is part of a family of cluster and grid workload management solutions. PCM-AE is an enabling technology that is used to provision the cluster and grid workload managers on a shared set of hardware resources. The customer can run multiple separate clusters that include almost any combination of IBM and third-party workload managers.
Typical use of PCM-AE is with HPC workload managers, such as IBM Platform LSF, IBM Platform Symphony, IBM Platform Symphony MapReduce, IBM InfoSphere® BigInsights™, IBM InfoSphere Streams, Oracle Grid Engine, Altair PBS Professional, Hadoop, and others. Because of these capabilities, it is used as a commercially supported cluster manager to manage Big Data clusters and multi-tenant HPC clouds.
Component model
PCM-AE has its own internal software component architecture, and it makes use of other software components to create a manageable PCM-AE cluster infrastructure environment. Figure 2-5 shows the software components of a PCM-AE environment. We can classify them into two distinct groups: PCM-AE internal software components and PCM-AE external software components.
Figure 2-5 PCM-AE software components diagram
Internal software components are based on Enterprise Grid Orchestrator (EGO), which provides the underlying system infrastructure to control and manage cluster resources. EGO manages logical and physical resources and supports other software components that are in the product. PCM-AE features the following components:
A component that controls the multi-tenancy characteristics of PCM-AE, based on accounts
A component that defines the rules for dynamic cluster growth or shrinking, based on required service level agreements
An allocation engine component that manages resource plans, prioritization, and how the hardware pieces are interconnected
A component that provides feedback on cluster utilization and the resources that are used by tenants
A component that handles overall operational resource management
A component with which you can define, deploy, and modify clusters
A component to visualize existing clusters and the servers within them
The resource integrations layer allows PCM-AE to use external software components to provide resource provisioning. The integrations layer manages machine or virtual machine allocation, network definition, and storage area definition. As shown in Figure 2-5, we can use xCAT to perform bare-metal provisioning of Linux servers with dynamic VLAN configuration.
KVM also is an option as a hypervisor host that is managed by PCM-AE. Another external software component, GPFS, can be used to provide shared storage that can be used, for example, to host virtual machines that are created within the environment.
PCM-AE also can manage and provision clusters by using the Unified Fabric Manager (UFM) platform with InfiniBand. PCM-AE can be used to provide a high-speed and low-latency private network among the servers while maintaining multi-tenant cluster isolation by using virtual lanes (VLs). PCM-AE also offers support for the integration of other provisioning software (IBM SmartCloud® Provisioning, VMware vSphere with ESXi), including custom adapters that you might already have or need in your existing environment.
Operational model
The PCM-AE architecture provides you with the benefit and flexibility of dynamically creating HPC clusters that can later be expanded or reduced based on workload demand, or even destroyed after temporary workloads are run. PCM-AE provides cluster management that allows the provisioning of multiple clusters (supported on physical or virtual machines), which feature self-service with minimal administrator intervention. Figure 2-6 shows how the technology infrastructure components might be set up for this use case.
Figure 2-6 PCM-AE cluster deployment on the physical hardware
The operational model for PCM-AE is broken down into six areas: management nodes, provisioning engine, hypervisor hosts, compute nodes, shared storage, and networking. Each of these cluster components is described next.
Management nodes
To improve performance and reduce the load on a single host, we recommend creating a multi-host environment and installing the management server and provisioning engine packages on separate hosts. Within the PCM-AE cluster, there is only one master management server (MS01). However, you can have more management servers for failover. This means that if the master management server fails, the system restarts on another management server that is called the master candidate (MS02).
For failover to work, install the management server on each management server candidate host and configure each host to access the same shared storage location as the master management server. In that case, the PCM-AE cluster administrator account must exist as the same user on all management server candidates. The management server host is responsible for running the system services and managing provisioned clusters. If you are managing RHEL KVM hosts, the management server communicates directly with the agent that is installed on each hypervisor host. The management node connects to public and provisioning networks.
You can choose to have the management server installation package automatically deploy Oracle Database XE. However, you can optionally use a remote database (where the Oracle database is on a separate server from the management server). A separate external database is ideal for a multi-host PCM-AE environment because it allows for the larger scalability of an enterprise-level database to store operational data. To support management server failover, your Oracle database cannot be on the same host as your management server.
Provisioning engine
The provisioning engine (PE01) provisions clusters with physical machines and is responsible for managing the physical machines that make up provisioned clusters. In a multi-host environment (typically, if you plan to work with many clusters in a larger environment, such as for a production environment), we recommend the use of a dedicated host for each type of management. The provisioning engine contains node groups that are templates that define how a group of machines (nodes) is configured and what software is installed. The PMTools node group is used for provisioning physical machines, while the KVM node group is used for provisioning hypervisor hosts on physical machines. The provisioning engine connects to public and provisioning networks.
Hypervisor hosts
Hypervisor hosts (HH01 to HH20) run and manage virtual machines (VMs) within a cluster. When clusters are provisioned with VMs, hypervisor hosts provide the actual virtual resources that make up the cluster. PCM-AE requires specific prerequisites for the physical machine that is used as the hypervisor host. In general, any sufficiently powerful physical x86_64 machine can be used that supports RHEL (64-bit) with the KVM kernel module installed and with mounted NFS or LVM-based storage with enough disk space for the template configuration files and VM images. The amount of disk space that is required depends on the size of the VM guest operating system (the operating system that runs on the VM) and how many VMs are on the hypervisor host. The hypervisor hosts connect to the provisioning network and to the application network (high-speed and low-latency interconnect), if required.
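The following is a quick sketch of checks that an administrator might run on a candidate RHEL KVM host before adding it; the service command style assumes a RHEL 6 host, and the image path is an assumption.
# Confirm hardware virtualization support and that the KVM kernel modules are loaded
egrep -c '(vmx|svm)' /proc/cpuinfo
lsmod | grep kvm
# Confirm that libvirt is running and that the image store is mounted with enough space
service libvirtd status
df -h /var/lib/libvirt/images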
Compute nodes
Compute nodes (CN01 to CN40), which are physical machines, are the physical resources of the shared compute infrastructure within the PCM-AE cluster. Compute nodes connect to the same private network as the provisioning engine and to the application network (high-speed and low-latency interconnect), if required.
Adding compute nodes into the xCAT adapter instance makes the nodes available for provisioning. PCM-AE is tightly integrated with Platform LSF to allow the provisioning of multiple Platform LSF clusters on demand. You can quickly deploy a complete Platform LSF cluster by creating a cluster instance from the sample Platform LSF cluster definition. You can customize the cluster definition, including rule-based policies to suit your own specific computing environment. Cluster policies help the system to deliver the required cluster environments and provide workload-intelligent allocations of clusters. Policies balance the supply and demand of resources according to your business requirements and allow you to change the available capacity (up or down), depending on workload requirements.
 
Note: In a secure multi-tenant environment, the tenant's users might not have access to the LAN and to their cluster machines. For example, they cannot use SSH to log in to a machine. To allow user access, an administrator must configure the network accordingly; for example, by adding another network configuration to the compute nodes for the public VLAN that users can connect through.
Shared storage
A shared storage repository is required to benefit from various features, such as high availability, load balancing, and migration. For failover to work, each management server host is configured to access the same shared storage location for failover logs as the master management server. To support the failover, the NFS or GPFS file system can be used as a shared file system that is connected to the private network.
If you want to share data between physical machines (for example, Platform LSF hosts), connect a shared file system to the provisioning network. By default, the system is configured to use NFS on the master management server. In the case of the PCM-AE multi-host environment, we recommend building a two-node GPFS cluster with tiebreaker disks (SS01 and SS02) as a shared storage.
If you are using hypervisor hosts to provision virtual machines (VMs), the PCM-AE supports VMs by using NFS-based file systems for storage. By default, template configuration files are stored on the master management server, VM configuration files are stored locally on the RHEL KVM hypervisor host, and VM images are stored on the NFS server. By using an NFS or GPFS as a shared storage repository, these files and images can be shared between multiple hypervisor hosts of the same type.
The PCM-AE integration with GPFS allows the provisioning of secure multi-tenant HPC clusters that are created with secure GPFS storage mounted on each server. After the storage is assigned to an account, only the users or groups of this account are given the permissions (read, write, and execute) to access the storage. To achieve this, PCM-AE communicates with the GPFS master node to add a user or group to the access control list (ACL) for the related directory in the GPFS file system.
When the cluster machines are provisioned with the GPFS file system that contains the storage, the user is not required to manually mount the file system. You also can use GPFS for PCM-AE as extended storage, which looks like one folder in the operating system for virtual machines or physical machines. You can also use GPFS for PCM-AE as an image storage repository for virtual machines. For more information about GPFS administration, see the chapter “Secure storage using a GPFS cluster file system” in Platform Cluster Manager Advanced Edition Version 4 Release 1: Administering, SC27-4760-01.
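As an illustration of the kind of ACL update this implies at the file system level, the following sketch uses the GPFS mmputacl command; the tenant group name, ACL file, and directory are assumptions.
# Contents of /tmp/acl_tenantA.txt (traditional GPFS ACL format):
#   user::rwxc
#   group::----
#   other::----
#   mask::rwxc
#   group:tenantA:rwxc
# Apply the ACL to the tenant's storage directory
mmputacl -i /tmp/acl_tenantA.txt /gpfs/shared/tenantA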
Networking
The PCM-AE cluster uses the public, provisioning, management, and application networks (the last if a high-speed and low-latency interconnect is required). In a multi-rack cluster, we recommend the use of top-of-rack switches and setting up VLANs for the different networks. When a server contains two network ports of the same speed, they can be tied together by using the Link Aggregation Control Protocol (LACP). For hypervisor hosts that run and manage VMs, we recommend the use of 10 Gbps network adapters. Each rack has two top-of-rack switches with HA (that use inter-switch links at the top-of-rack level). This two-switch cluster with the Virtual Link Aggregation Group (vLAG) feature configured allows multi-switch link aggregation, which provides higher performance and optimizes parallel active-active forwarding. From each top-of-rack switch, aggregated uplinks can be configured to the up-level spine switches to build a redundant, two-tier, Layer 3 fat tree network.
For a secure multi-tenant environment, PCM-AE should be configured to use VLAN secure networks and InfiniBand secure networks to create clusters on separate VLANs, on separately partitioned InfiniBand networks, or a combination of both. For more information about secure networks, see the “VLAN secure networks” and “InfiniBand secure networks” chapters in Platform Cluster Manager Advanced Edition Version 4 Release 1: Administering, SC27-4760-01.
2.2 Workload management
IBM Platform Computing offers a range of workload management capabilities to optimize the running of various applications that use HPC clusters and to ensure high resource usage with diverse workloads, business priorities, and application resource needs. Workload management uses computing resources efficiently to complete workloads as fast as possible. To enable efficient workload allocation, an intelligent scheduling policy is required. An intelligent scheduling policy is based on an understanding of the shared computing resources, the priority of the application, and user policies. Providing optimal service-level agreement (SLA) management and greater versatility, visibility, and control of job scheduling helps reduce the operational and infrastructure costs that are needed for maximum return on investment (ROI).
2.2.1 IBM Platform Load Sharing Facility
IBM Platform Load Sharing Facility (LSF) is a powerful workload management platform for demanding, distributed, and mission-critical HPC environments. IBM Platform LSF manages batch and highly parallel workloads. It provides flexible policy-driven scheduling features, which ensure that shared computing resources are automatically allocated to users, groups, and jobs in a fashion that is consistent with your service level agreements (SLAs), which improves resource usage and user productivity.
The advanced scheduling features make Platform LSF practical to operate at high usage, which translates to lower operating costs. Many features combine to reduce wait-times for users and deliver better service levels so that knowledge workers are more productive, which leads to faster, higher-quality results. Its robust administrative features make it more easily managed by a smaller set of administrators, which promotes efficiency and frees valuable staff time to work on other projects. For example, you can delegate control over a particular user community to a particular project or department manager. You also can reconfigure the cluster for one group without causing downtime for all other groups and use a new type of application that benefits from general-purpose GPUs. Having these features translates into flexibility.
Platform LSF functionality scales to meet your evolving requirements. In terms of scalability, Platform LSF is scalable in multiple dimensions. It scales to hundreds of thousands of nodes and millions of jobs. It also is scalable in other dimensions; for example, in the breadth of resources it supports. Whether you are managing Windows, Linux, GPU workloads, or floating application licenses, Platform LSF can provide flexible controls over vast numbers of users and resources across multiple data centers and geographies. It is also scalable to different workload types, whether you are managing single MPI parallel jobs that run for days across thousands of nodes, or millions of short-duration jobs that are measured in milliseconds. Platform LSF has scheduling features to meet these diverse needs and handle workloads at scale. Platform LSF is unique in its ability to solve a wide range of scheduling problems, which enables multiple policies to be active on a cluster at the same time.
The smart scheduling policies of Platform LSF include the following features (a few of these policies are illustrated with example job submissions after this list):
Fairshare scheduling
Topology and core-aware scheduling
Backfill and preemption
Resource reservations
Resizable jobs
Serial and parallel controls
Advanced reservation
Job starvation
License scheduling
SLA-based scheduling
Absolute priority scheduling
Checkpoint and resume
Job arrays
GPU-aware scheduling that supports NVIDIA GPU and Intel Xeon Phi accelerators
Tight integration with IBM Platform MPI and IBM Parallel Environment
Plug-in schedulers
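The following sketch shows how a few of these policies appear at job submission time; the job names, queue configuration, reservation name, service class, and programs are assumptions that must match the site configuration.
# Job array: 1,000 related tasks submitted and tracked as one entity
bsub -J "paramsweep[1-1000]" ./run_case
# Resource reservation: reserve 8 GB of memory per slot for the lifetime of the job
bsub -R "rusage[mem=8192]" ./mem_hungry_app
# Advance reservation: run only inside a reservation that was created earlier with brsvadd
bsub -U maintwindow ./benchmark
# SLA-based scheduling: attach the job to a configured service class
bsub -sla Velocity90 ./regression_suite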
Platform LSF is available in the following editions to ensure that users have the right set of capabilities to meet their needs:
Express Edition: Ideal for single-cluster environments and optimized for low throughput parallel jobs and simple user grouping structures.
Standard Edition: Ideal for multi-cluster or grid environments and optimized for high throughput serial jobs and complex user grouping structures.
Advanced Edition: Supports extreme scalability and throughput (100,000+ cores and concurrent jobs).
The performance of Platform LSF depends upon many factors, including the number of nodes in the cluster, the number of concurrently running jobs, the number of pending jobs, the number of users querying the system, and the frequency of queries. As these tasks increase, the scheduling cycle and user response time increases. For high-throughput workloads, the overall system performance is dependent upon the processing power, I/O capacity, and memory of the scheduling node. Table 2-3 on page 29 provides sizing guidelines that are based on tested cluster configurations. For large clusters, it is recommended that users seek configuration assistance from IBM.
Table 2-3 Platform LSF scalability and throughput
Scalability and performance limits | Express | Standard | Advanced
Nodes                              | 100     | 6,000    | 180,000
Cores                              | 200     | 48,000   | 160,000
Concurrent short jobs              | 200     | 48,000   | 160,000
Pending jobs                       | 10,000  | 500,000  | 2,000,000
The notion of Platform LSF heterogeneity is important because few organizations run only one operating system on only one hardware platform. Platform LSF runs on everything from Windows, UNIX, and Linux systems to Cray, NEC, and IBM supercomputers that employ the world’s most advanced architectures, which offers customers complete freedom of choice to run the best platform for each job with a fully supported software product.
Platform LSF is supported on any of the following operating environments and architectures:
IBM AIX 6.x and 7.x on IBM Power 6 and POWER7
HP UX B.11.31 on PA-RISC
HP UX B.11.31 on IA64
Solaris 10 and 11 on SPARC
Solaris 10 and 11 on x86-64
Linux on x86-64 Kernel 2.6 and 3.x
Linux on IBM Power 6 and IBM POWER7 Kernel 2.6 and 3.x
Windows 2003/2008/2012/XP/7/8 32-bit and 64-bit
Apple Mac OS 10.x
Cray XT3, XT4, XT5, XE6, and XC-30 on Linux Kernel 2.6, glibc 2.3
SGI Performance Suite on Linux Kernel 2.5, glibc 2.3
ARMv7 Kernel 3.6, glibc 2.15 (Platform LSF slave host only)
For information about Platform LSF system support (support varies by Platform LSF Edition), see this website:
IBM Platform LSF provides optional add-ons that can be installed to extend the set of workload management capabilities. The following add-ons are designed to work together to address your high performance computing needs:
IBM Platform Application Center (PAC): Portal management and application support that provides a rich environment for building easy-to-use, application-centric web interfaces, which simplify job submission, management, and remote 3D visualization.
IBM Platform Process Manager (PPM): A powerful visual interface for designing complex engineering computational processes and multi-step workflows, and capturing repeatable best practices that can be used by other users.
IBM Platform RTM: A flexible, real-time dashboard for monitoring global workloads and resources, including resource usage reporting. With better cluster visibility and cluster alerting tools, administrators can identify issues before the issues lead to outages, which helps avoid unnecessary service interruptions.
IBM Platform Analytics: An advanced tool for visualizing and analyzing massive amounts of workload data for improved decision-making, more accurate capacity planning, optimizing asset usage and identifying and removing bottlenecks.
IBM Platform License Scheduler: A license management tool that enables policy-driven allocation and tracking of commercial software licenses.
IBM Platform Session Scheduler: A high-throughput and low-latency scheduling solution that is ideal for running short jobs, whether they are a list of tasks or job arrays with parametric execution.
IBM Platform Dynamic Cluster: An innovative cloud management solution that transforms static, low-usage clusters into dynamic and shared cloud resources.
Use cases for Platform LSF
Platform LSF family products focus on the following technical computing markets:
Electronics: Electronics design automation (EDA), electronic circuit design, and software development/QA.
Manufacturing (automotive, aerospace, and defense): Computationally intensive simulations, crash and occupant safety, computational fluid dynamics, NVH, aerodynamics, durability, mechatronics design, engineering process and product data management, remote visualization, and materials engineering.
Life Sciences: Human genome sequencing, QCD simulations, and therapeutic drug design.
Energy/Oil & Gas: 3D visualization, reservoir simulation, seismic processing, and downstream chemical and mechanical engineering applications.
Higher education and research: Electromagnetic simulations, finite element analysis, micro-scale optics, simulation, QCD simulations, visualization and image analysis, climate modeling, and weather forecast.
Media and digital content creation: Animation, simulation, and rendering.
IBM Platform LSF is successfully deployed across many industries to manage batch and highly parallel workloads. Platform LSF use cases benefit from support for key industry-leading ISV applications. IBM Platform LSF with Platform Application Center comes complete with application templates for ANSYS Mechanical, ANSYS Fluent, ANSYS CFX, ClustalW, CMGL STARS, CMGL IMEX, CMGL GEM, HMMER, LS-DYNA, MATLAB, MSC Nastran, NCBI Blast, NWChem, Schlumberger ECLIPSE, Simulia Abaqus, STAR-CCM, and generic templates for in-house or open source applications. By standardizing access to applications, Platform Application Center makes it easier to enforce site policies and address security concerns through Role-Based Access Control (RBAC).
Within Platform LSF, the computing resources are available to users through dynamic and transparent load sharing. Through transparent remote job execution, Platform LSF gives applications access to powerful remote hosts, which improves application performance and enables users to access resources from anywhere in the system.
Platform LSF architecture
Platform LSF is a layer of software services on top of heterogeneous enterprise resources. As shown in the layered service model in Figure 2-7, it accepts and schedules workload for batch and non-batch applications, manages resources, and monitors all events.
Figure 2-7 Platform LSF Layered Service Model
The three core components of the workload and resource management layer as shown in Figure 2-7 are LSF Base, LSF Batch, and LSF Libraries. Together, they help create a shared, scalable, and fault-tolerant infrastructure that delivers faster and more reliable workload performance.
LSF Base provides basic load-sharing services for the distributed system, such as resource usage information, host selection, job placement decisions, transparent remote running of jobs, and remote file operations. These services are provided through the following components:
Load Information Manager (LIM). The LIM on each host monitors its host's load and reports load information to the LIM that is running on the master node. The master LIM collects information from all slave hosts in the cluster and provides the same information to applications.
Process Information Manager (PIM). The PIM is started by LIM and runs on each node in the cluster. It collects information about the job processes that are running on the host, such as the CPU time and memory that are used by the job, and reports the information to sbatchd.
Remote Execution Server (RES). The RES on each server host accepts remote run requests and provides fast, transparent, and secure remote task running.
A few utilities, such as the lstools commands, lstcsh, and lsmake, are available to work with these base services and manage workloads.
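As an illustration, the following command sequence is a minimal sketch of using the base services interactively from any LSF host; the resource requirement string and the application name are placeholders, and exact output and option availability vary by LSF version:
# Show the cluster name and the current LSF master host
lsid

# Display static host information and dynamic load indexes collected by LIM
lshosts
lsload

# Run a task remotely on the most suitable host; LIM selects the
# execution host that best matches the resource requirement
lsrun -R "order[cpu:mem]" ./pre_process_model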
LSF Batch extends Platform LSF base services to provide a batch job processing system with load balancing and policy-driven resource allocation control. To provide this functionality, LSF Batch uses the following Platform LSF base services:
Resource and load information from LIM to do load balancing
Cluster configuration information from LIM
The master LIM election service that is provided by LIM
RES for interactive batch job running
Remote file operation service that is provided by RES for file transfer
The core component of Platform LSF Batch is the scheduler framework that is based on the Master Batch Scheduler daemon (mbschd), which is combined with multiple plug-ins. Each scheduling policy is implemented in a plug-in. In each cycle, the framework triggers scheduling and control flows through each plug-in.
In different scheduling phases, a plug-in can intercept the scheduling flow and influence the final decision. This means that to make scheduling decisions, Platform LSF uses multiple scheduling approaches that can run concurrently and be used in any combination, including user-defined custom scheduling approaches. This unique modular architecture makes the scheduler framework extensible so that new policies, such as an affinity plug-in, can be added.
LSF Batch services are provided by two daemons. The Master Batch daemon (mbatchd) runs on the master host and is responsible for the overall state of the job in the system. It receives job submission and information query requests. The daemon manages jobs that are held in queues and dispatches jobs to hosts as determined by mbschd. The Slave Batch daemon (sbatchd) runs on each slave host. The daemon receives requests to run the job from mbatchd and manages local running of the job. It is responsible for enforcing local policies and maintaining the state of the jobs on the hosts. The daemon creates a child sbatchd to handle every job run. The child sbatchd sends the job to the RES, which creates the environment on which the job runs.
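The following job submission sketch illustrates this flow from the user's perspective; the queue name, job name, application, and <job_ID> value are placeholders, and the resource requirement syntax depends on the site configuration:
# Submit a batch job to mbatchd: 16 slots, 4 GB of memory per slot,
# with standard output and error written to files
bsub -q normal -J fluent_run -n 16 -R "rusage[mem=4096]" \
     -o job.%J.out -e job.%J.err ./run_solver.sh

# Query the job status (answered by mbatchd) and show the details
bjobs -l <job_ID>

# View the historical events for the job (read from lsb.events)
bhist -l <job_ID>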
LSF libraries provide APIs for distributed computing application developers to access job scheduling and resource management functions. The following Platform LSF libraries are available:
LSLIB: The LSF base library that provides Platform LSF base services to applications across a heterogeneous network of computers. The Platform LSF base API is the direct user interface to the Platform LSF base system and provides easy access to the services of Platform LSF servers. A Platform LSF server host runs load-shared jobs. A LIM and a RES run on every Platform LSF server host. They interface with the host’s operating system to give users a uniform, host-independent environment.
LSBLIB: The LSF batch library gives application programmers access to the job queuing processing services that are provided by the Platform LSF batch servers. All Platform LSF batch user interface utilities are built on top of LSBLIB. The services that are available through LSBLIB include Platform LSF batch system information service, job manipulation service, log file processing service, and Platform LSF batch administration service.
Component model
The component model consists of multiple Platform LSF daemon processes that are running on each host in the distributed system, a comprehensive set of utilities that are built on top of the Platform LSF API, and relevant Platform LSF add-ons components that complement the required features. The type and number of running Platform LSF daemon processes depends on whether the host is a master node, one of the master node candidates, or a compute (slave) node, as shown in Figure 2-8.
Figure 2-8 Platform LSF software components diagram
On each participating host in a Platform LSF cluster, an instance of LIM runs and collects host load and configuration information and forwards it to the master LIM that is running on the master host. The master LIM forwards load information to mbatchd, which forwards this information to mbschd to support scheduling decisions. If the master LIM becomes unavailable, a LIM on a master candidate automatically takes over. The External LIM (ELIM) is a site-definable executable file that collects and tracks custom dynamic load indexes (for example, information about GPUs). An ELIM can be a shell script or a compiled binary program that returns the values of the dynamic resources you define.
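The following sketch shows the general shape of such an ELIM written as a shell script. The resource name (scratch) and the reporting interval are hypothetical, and the custom resource must also be declared in lsf.shared and mapped to hosts in the cluster file before LIM accepts the reported values:
#!/bin/sh
# Hypothetical ELIM: reports the free space (in MB) of a local scratch
# file system as the custom dynamic resource "scratch".
while true
do
    scratch_mb=$(df -Pm /scratch | awk 'NR==2 {print $4}')
    # Expected output format: <number_of_indexes> <name1> <value1> ...
    echo "1 scratch $scratch_mb"
    sleep 60
done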
In addition to LIM, RES and PIM are other daemons that are running on each server host. RES accepts remote run requests to provide transparent and secure remote running of jobs and tasks. PIM collects CPU and memory usage information about job processes that are running on the host and reports the information to sbatchd.
Platform LSF can be accessed by users and administrators through the command-line interface (CLI), an API, or the PAC Web Portal. The submission host, which can be a server host or a client host, submits a job with commands by using the CLI or from an application by using the API.
Platform LSF base execution (non-batch) tasks are user requests that are sent between the submission, master, and execution hosts. From the submission host, lsrun submits a task into the Platform LSF base. The submitted task proceeds through the Platform LSF base API (LSLIB). The LIM communicates the task’s information to the cluster’s master LIM. Periodically, the LIM on individual machines gathers its 12 built-in load indexes and forwards this information to the master LIM. The master LIM determines the best host to run the task and sends this information back to the submission host’s LIM.
Information about the chosen execution host is passed through the Platform LSF base API back to lsrun, which creates the network input/output server (NIOS), the communication pipe that talks to the RES on the execution host. Task execution information is passed from the NIOS to the RES on the execution host. The RES creates a child RES and passes the task execution information to the child RES. The child RES creates the execution environment and runs the task. The child RES receives the completed task information and sends it to the RES. The output is sent from the RES to the NIOS. The child RES and the execution environment are destroyed by the RES. The NIOS sends the output to standard output (STDOUT).
In the case of Platform LSF batch execution of (batch) tasks, the submission host does not interact directly with the execution host. From the submission host, bsub or lsb_submit() submits a job to the Platform LSF batch system. The submitted job proceeds through the Platform LSF batch API (LSBLIB). The LIM communicates the job's information to the cluster's master LIM. Based on the gathered load indexes, the master LIM determines the best host to run the job and sends this information back to the submission host's LIM. Information about the chosen execution host is passed through the Platform LSF batch API back to bsub or lsb_submit().
To enter the batch system, bsub or lsb_submit() sends the job by using LSBLIB services to the mbatchd that is running on the cluster's master host. The mbatchd puts the job in an appropriate queue and waits for the appropriate time to dispatch the job. User jobs are held in batch queues by mbatchd, which periodically checks the load information about all candidate hosts. Then, mbatchd dispatches the job when an execution host with the necessary resources becomes available, where it is received by that host's sbatchd. When more than one host is available, the best host is chosen.
After a job is sent to an sbatchd, that sbatchd controls the execution of the job and reports the job's status to mbatchd. The sbatchd creates a child sbatchd to handle job execution. The child sbatchd sends the job to the RES, which creates the execution environment to run the job. The job is run, and the results of the job are sent to the user through the email system (SMTP service).
In the case of job submission through the web interface, the Platform Application Center (PAC) manages the Platform LSF library calls. PAC components include the web portal, reporting services, and a database (MySQL or Oracle). To support PAC failover, the configuration files and binaries are stored on a shared file system (NFS or GPFS), and failover services are provided by EGO. Two Platform LSF master candidate hosts are used for failover (for best performance, do not use the Platform LSF master host as the Platform Application Center host). When the primary candidate host on which PAC is running fails, EGO can start the PAC services and the database instance on the backup candidate host.
When Platform LSF is installed without EGO enabled, resource allocation is done by Platform LSF in its core. Part of the EGO functionality is embedded in Platform LSF, which enables the application to perform the parts of the job for which EGO is responsible. When EGO is enabled (required for PAC failover), it adds more fine-grained resource allocation capabilities, high availability services for sbatchd and RES, and faster cluster startup. If the cluster has PAC and PERF controlled by EGO, these services run as EGO services (each service uses one slot on a management host).
Platform LSF fault tolerance depends on the event log file, lsb.events, which is kept on the primary file server in the shared directory LSB_SHAREDIR. Platform LSF should be configured to maintain duplicate copies of these logs to use as a backup. If the host that contains the primary copy of the logs fails, Platform LSF continues to operate by using the synchronized duplicate logs. When the host recovers, Platform LSF uses the duplicate logs to update the primary copies. LSB_SHAREDIR, which is used for temporary work files, log files, transaction files, and spooling, must be accessible from all potential Platform LSF master hosts.
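As an illustration, the following lsf.conf entries are a minimal sketch of such a highly available setup; the host names and paths are placeholders, and the applicable parameters can vary by LSF version:
# Master host followed by the master candidate hosts, in failover order
LSF_MASTER_LIST="mn1 mcn1 mcn2"

# Shared working directory that holds lsb.events and other batch state
LSB_SHAREDIR=/gpfs/lsf/work

# Local directory for the primary copy when duplicate event logging is used
LSB_LOCALDIR=/var/lsf/work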
The component model can be extended with installed Platform LSF add-on components for Platform License Scheduler, Platform RTM, and Platform Analytics. Other add-on components for Platform Dynamic Cluster, Platform Process Manager, and Platform Session Manager with MultiCluster configuration can be used to support extreme scalability and throughput for multi-site or geographic cluster environments.
The component model also can be extended with integrated components to support running MPI jobs. Platform LSF supports Open MPI, Platform MPI, MVAPICH, Intel MPI, and MPICH2.
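For example, a parallel job can be submitted so that Platform LSF allocates the slots and the MPI launcher inherits the allocation. In this sketch, the application name and tile size are placeholders, and the exact launcher integration flags depend on the MPI library and version:
# Request 32 slots, packed 16 per host, and start the MPI job under LSF control
bsub -n 32 -R "span[ptile=16]" -o mpi.%J.out mpirun ./my_mpi_app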
Operational model
The operational model of the Platform LSF environment varies depending on the functional and non-functional requirements. One difference can be the number of supported users, which affects scalability and manageability. Another difference can be a requirement for multi-site deployment that is based on a customer's data center locations.
Also, the number and type of supported applications are significant factors when the appropriate technology platform is selected and when deciding whether to use cluster management to provision, manage, or dynamically change the Platform LSF cluster environment.
The sample highly available Platform LSF environment that is shown in Figure 2-9 is one of several possible configurations for the MSC and ANSYS Application Ready Solutions.
Figure 2-9 Platform LSF cluster deployment on the physical hardware
If the cluster is small and cost-effectiveness is important, the cluster manager for easy provisioning can be omitted. Also, if applications must run on an operating system other than Linux, pHPC and PCM-SE cannot be used. Otherwise, for clusters that are running Linux-supported applications, the cluster management solution is recommended.
Cluster management nodes
The sample solution includes one active PCM-SE management node (PCM1) and one standby node (PCM2). When a failover process occurs, the standby management node takes over as the management node with all running services. The management nodes connect to a public and provisioning network. Because the architecture is optimized for a specific application and most software deployments are fully automated, a cluster can be deployed in a short time. System administrators should use the PCM-SE web-based interface as the management console for performing daily cluster management and monitoring. This eliminates the need for a full-time system administrator with extensive technical or HPC expertise for managing the cluster. Device drivers, such as OFED for InfiniBand, VNC server, and DCV for 3D visualization, also are deployed as part of the cluster deployment. PCM-SE supports kits, which provide a framework that allows third-party packages to be configured with the system.
Shared storage
To support I/O-intensive applications, we recommend building a two-node GPFS cluster with tiebreaker disks over both shared storage nodes (SS1 and SS2). The shared storage nodes connect to the public, provisioning, and application networks. GPFS provides LSF users with secure storage for user profiles and shared compute model data files, which allows applications that are running on the cluster to access any required data over the high-speed, low-latency interconnect and significantly improves application performance.
To create a highly available PCM-SE environment, shared storage is required to share user home directories and system working directories. All shared file systems must be accessible over the provisioning network by the PCM-SE management nodes and all compute nodes. The Platform LSF working directory must be available through a shared file system on the master and master candidate hosts. The PAC database data files are stored on the shared drive so that the database on the secondary master candidate node uses the same data on the shared directory when it is started. The RTM database data files are also stored on the shared drive.
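A minimal sketch of creating such a two-node GPFS cluster with tiebreaker disks is shown below. Node names, NSD names, and mount points are placeholders, and the command options can vary between GPFS releases:
# Create the cluster with SS1 as primary and SS2 as secondary configuration server
mmcrcluster -N ss1:quorum,ss2:quorum -p ss1 -s ss2 -r /usr/bin/ssh -R /usr/bin/scp
mmchlicense server --accept -N ss1,ss2

# Create the NSDs from a stanza file, then define the tiebreaker disks
mmcrnsd -F /tmp/nsd.stanza
mmchconfig tiebreakerDisks="nsd01;nsd02;nsd03"

# Create and mount the shared file system
mmcrfs sharedfs -F /tmp/nsd.stanza -T /shared
mmstartup -a
mmmount sharedfs -a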
LSF master node
To achieve the highest degree of performance and scalability, we strongly recommend that you use a powerful master host. There is no minimum CPU requirement (we recommend the use of multi-core CPUs) and any host with sufficient physical memory can run LSF as master host or master candidate host. Active jobs use most of the memory that LSF requires (we recommend the use of 8 GB RAM per core). The LSF master node (MN1) connects to public and provisioning networks.
LSF master candidate nodes
Two LSF master candidate nodes (MCN1 and MCN2) are used for failover of the Platform Application Center (PAC), which provides a graphic user interface to Platform LSF. Ideally, these nodes should be placed on separate racks for resiliency. Both LSF master candidate nodes connect to public and provisioning networks. PAC is the primary interface to the ANSYS and MSC applications through the application templates. For ANSYS applications, the AR-Fluent and AR-Mechanical application templates are provided. For MSC applications, the AR-NASTRAN and AppDCVonLinux (to Patran through DCV) application templates are provided.
RTM node
The Platform RTM server (RTM1) is an operational dashboard for IBM Platform LSF environments that provides comprehensive workload monitoring and reporting. RTM displays resource-related information, such as the number of jobs that are submitted, the details of individual jobs (load average, CPU usage, and job owner), or the hosts on which the jobs ran. RTM also is used to monitor FlexNet Publisher License Servers and GPFS. By using the GPFS ELIM, we can monitor GPFS in the RTM GUI per LSF host, per LSF cluster as a whole, or per volume. The RTM node connects to the provisioning network.
FlexNet Publisher License nodes
The Platform License Scheduler works with the master FlexNet Publisher License Server to control and monitor ANSYS and MSC license usage. To support high availability of the FlexNet Publisher License Server, the three-server redundancy configuration is used (FNP1, FNP2, and FNP3). If the master fails, the secondary license server becomes the master and serves licenses to FLEX-enabled applications. The tertiary license server can never be the master. If the primary and secondary license servers go down, licenses are no longer served to FLEX-enabled applications. The master does not serve licenses unless at least two license servers in the triad are running and communicating. The FlexNet Publisher License Servers connect to the provisioning network.
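The three-server redundant configuration is expressed in the license file itself, as in the following sketch; the host names, host IDs, ports, and vendor daemon name are placeholders:
# First lines of a three-server redundant FlexNet license file
SERVER fnp1 <hostid_fnp1> 27000
SERVER fnp2 <hostid_fnp2> 27000
SERVER fnp3 <hostid_fnp3> 27000
VENDOR <vendor_daemon>
# FEATURE/INCREMENT lines for the licensed products follow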
MSC SimManager
MSC SimManager (SM1) with its mscdb database instance is installed on a separate node. We recommend setting the same values of the MSC variables in the Platform Application Center (PAC) templates for MSC. If this information is embedded in an MSC template in PAC, it is important that when changes are made to the installation of the MSC application suite, these changes are propagated to the MSC application templates in PAC. PAC, which is complementary to SimManager, provides a robust interface to such key simulation components as MSC Nastran and MSC Patran. MSC SimManager can be used to submit jobs to the back-end LSF cluster in addition to providing project and data management. The MSC SimManager node (SM1) connects to the public and provisioning networks.
LSF compute node
The LSF compute nodes (CN01 to CN20) are designed for computationally intensive applications, such as ANSYS Mechanical, ANSYS Fluent, and MSC Nastran. They are required to satisfy the computational requirements for solving the use cases. They have some local disk space for the operating system and for temporary storage that is used by the running applications. The compute nodes also mount the shared GPFS storage that is needed for cooperation in solving a single problem, which is facilitated by connectivity to a high-speed, low-latency InfiniBand interconnect. ANSYS parallel computing by using MPI also merits the inclusion of an InfiniBand interconnect. Large models in ANSYS might not fit in the available memory and must be solved out-of-core. In that case, these nodes can benefit from robust local storage that is attached to each compute server.
In addition to the operating system, the application and runtime libraries are installed on all LSF compute nodes. The monitoring and resource management agents are connected to the cluster management software and the workload management software is installed.
The memory requirements of the LSF compute nodes depend on the applications that are run. ANSYS Fluent does not require a large amount of memory per node when multiple nodes are used to run large problems; we recommend 8 GB of RAM per core on a node with two eight-core Intel Xeon E5-2600 v2 CPUs. ANSYS Fluent does not perform any I/O locally at each node in the cluster, so a robust shared file system satisfies the I/O requirements of most ANSYS Fluent applications. ANSYS Mechanical and MSC Nastran require a large amount of memory so that the simulations can run within memory most of the time; for these workloads, we recommend 16 GB of RAM per core on a node with two eight-core Intel Xeon E5-2600 v2 CPUs. The LSF compute nodes connect to the provisioning and application networks.
Visualization node
A visualization node is required to support remote 3D visualization. A pool of nodes (VN01 - VN20) is designated as visualization nodes, which are excluded from the computational work queues. These nodes are equipped with GPUs that are officially supported by ANSYS Workbench and MSC Patran. Each visualization node can be equipped with up to four GPUs. A single visualization node can support several simultaneous interactive sessions, and Platform LSF keeps track of the allocation of these interactive sessions. When a user requests an interactive login session through the Platform Application Center web portal, a free session is allocated to the user. With remote visualization, the rendering of a graphics application is done on the graphics adapter of the visualization node. The rendered graphics are compressed and transferred to the thin client. The remote visualization software engine that is supported on the visualization node is Desktop Cloud Visualization (DCV), which is produced by NICE Software.
When an interactive session that uses DCV is requested and allocated, PAC downloads session-related information of the DCV to the client workstation that is requesting the session. The client component of the visualization software, which is on the client workstation, uses this session information and connects with the visualization node on which Platform LSF allocated the session. The client part of the visualization prompts the user for credentials for authentication. If the login is successful, the interactive session is established to start graphics-oriented applications, such as ANSYS Workbench or MSC PATRAN.
In addition to the operating system, the application and runtime libraries are installed on all visualization nodes. The monitoring and resource management agents are connected to the cluster management software and the workload management software is installed.
The performance of the visualization node depends on the configuration because the pre- and post-processing of the applications require large amounts of memory. We recommend the use of 256 GB RAM and one eight-core Intel Xeon E5-2600 v2 product family CPU for two NVIDIA GPU adapters or 512 GB RAM and two eight-core Intel Xeon E5-2600 v2 product family CPUs for four NVIDIA GPU adapters per visualization node. The visualization nodes connect to provisioning and application networks.
Networking
The Platform LSF cluster uses the same networks as the PCM-SE cluster (public, provisioning, monitoring, and application networks). In a multi-rack cluster, we recommend the use of top-of-rack switches and setting up VLANs for the different networks. When a server contains two network ports of the same speed, they can be aggregated by using the Link Aggregation Control Protocol (LACP). Each rack has two top-of-rack switches in a high availability configuration (by using inter-switch links at the top-of-rack level). This two-switch cluster, with the Virtual Link Aggregation Group (vLAG) feature configured, allows multi-switch link aggregation, which provides higher performance and optimizes parallel active-active forwarding. From each top-of-rack switch, aggregated uplinks can be configured to the up-level spine switches to build a redundant two-tier Layer 3 fat tree network. The Equal-Cost Multi-Path Routing (ECMP) L3 implementation is used for scalability if the Virtual Router Redundancy Protocol (VRRP) is not applicable.
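On the server side, the two same-speed ports are typically aggregated with an LACP (802.3ad) bond. The following RHEL-style interface definition is an illustrative sketch only; device names, addresses, and options are placeholders, and the matching port channel must be configured on the vLAG switch pair:
# /etc/sysconfig/network-scripts/ifcfg-bond0
DEVICE=bond0
TYPE=Bond
BONDING_MASTER=yes
BONDING_OPTS="mode=802.3ad miimon=100 xmit_hash_policy=layer3+4"
IPADDR=10.1.1.101
PREFIX=24
ONBOOT=yes

# /etc/sysconfig/network-scripts/ifcfg-eth0 (repeat for the second port)
DEVICE=eth0
MASTER=bond0
SLAVE=yes
ONBOOT=yes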
2.2.2 IBM Platform Symphony
IBM Platform Symphony is powerful enterprise-class management software for running distributed applications and Big Data analytics on a scalable, shared grid. Platform Symphony's efficient, low-latency middleware and scheduling architecture is designed to provide the performance and agility that are required to predictably meet and exceed throughput goals for diverse workloads. It accelerates various compute-intensive and data-intensive applications for faster results and better use of all available resources.
Platform Symphony features the following characteristics:
Low latency and high throughput: Platform Symphony provides submillisecond responses. Task throughput was repeatedly benchmarked at over 17,000 tasks per second per application.
Large scale: An individual session manager (one per application) can schedule tasks across up to 10,000 cores. Platform Symphony manages up to 40,000 cores per grid and allows multiple grids to be linked. Platform Symphony Advanced Edition with the multi-cluster feature can manage up to 100,000 cores.
Performance enhancements: A low-latency push infrastructure reduces the wait time for task allocation, the service-oriented architecture framework that is written in C++ reduces the wait time for jobs to start running, a sophisticated scheduling engine reduces the wait time for pending jobs and tasks, and data management reduces the wait time for getting data to the jobs or tasks.
High availability and resiliency: High availability and resiliency reduce the wait time for recovery of jobs and tasks, sharing resources across application boundaries reduces the server wait time for new work when there are pending tasks, shared memory logic for MapReduce reduces data movement, a single service instance can serve multiple tasks (MTS), and EGO services start in parallel (30,000 services in under two minutes).
Dynamic resource management: Slot allocation changes dynamically based on job priority and server thresholds. Lending and borrowing resources from one application to another within advanced resource sharing ensures SLAs while encouraging resource sharing.
Application lifecycle: Support for rolling upgrades of the Platform Symphony software. Support for multiple versions of Hadoop co-existing on the same cluster.
Reliability: Platform Symphony makes all MapReduce and HDFS-related services highly available (name nodes, job trackers, task trackers, and so on).
Sophisticated scheduling engine: Platform Symphony provides fair-share scheduling with 10,000 levels of prioritization, and preemptive and resource threshold-based scheduling with runtime change management.
Open: Platform Symphony supports multiple APIs and languages and is fully compatible with Java, Pig, Hive, and other MapReduce applications. Platform Symphony also supports multiple data sources, including HDFS and GPFS.
Management tools: Platform Symphony provides a comprehensive management capability for troubleshooting, alerting, and tracking jobs and rich reporting capabilities.
Platform Symphony is available in the following editions, which gives users the best set of capabilities to meet their needs:
IBM Platform Symphony Developer Edition: Builds and tests applications without the need for a full-scale grid (available for download at no cost).
IBM Platform Symphony Express Edition: An ideal, cost-effective solution for departmental clusters.
IBM Platform Symphony Standard Edition: This version is for enterprise-class performance and scalability.
IBM Platform Symphony Advanced Edition: This is the best choice for distributed compute and data intensive applications, including optimized and low-latency MapReduce implementations.
Platform Symphony clients and services can be implemented on different operating environments, languages, and frameworks. Clusters also can consist of nodes that are running multiple operating systems. For example, 32- and 64-bit Linux hosts can be mixed that are running different Linux distributions and multiple Microsoft Windows operating systems can be deployed. Platform Symphony can manage all of these different types of hosts in the same cluster and control what application services run on each host.
For more information about system support (which varies by Platform Symphony edition), see IBM Platform Symphony 6.1.1: Supported System Configurations, SC27-5373-01, which is available at this website:
Table 2-4 on page 41 summarizes the features that are associated with each Platform Symphony edition and provides sizing guidelines that are based on tested cluster configurations. For advanced clusters, it is recommended that customers seek configuration assistance from IBM.
Table 2-4 Platform Symphony features and scalability
Features                                                 Developer   Express     Standard              Advanced
Low-latency HPC SOA                                      X           X           X                     X
Agile service and task scheduling                        X           X           X                     X
Dynamic resource orchestration                           -           X           X                     X
Standard and custom reporting                            -           -           X                     X
Desktop, server, and virtual server harvesting           -           -           X                     X
Data affinity                                            -           -           -                     X
MapReduce framework                                      X           -           -                     X
Multi-cluster management                                 -           -           -                     X
Max hosts/cores                                          2 hosts     240 cores   5k hosts, 40k cores   5k hosts, 40k cores
Application managers                                     -           5           300                   300
The following add-on resource harvesting tools can be used with Platform Symphony Standard and Advanced Editions:
IBM Platform Symphony Desktop Harvesting: This add-on harnesses the resources from available idle desktops and adds them to the pool of potential candidates to help complete tasks. Platform Symphony services do not interfere with other applications that are running on the desktops and harvested resources are managed directly through the integrated management interface.
IBM Platform Symphony Server/VM Harvesting: To use more of your enterprise's resources, this add-on allows you to tap idle or under-used servers and virtual machines (VMs). Instead of requiring new infrastructure investment, Platform Symphony locates and aggregates these server resources as part of the grid whenever more capacity is needed to handle larger workloads or when the speed of results is critical.
IBM Platform Symphony GPU Harvesting: To unleash the power of general-purpose graphics processing units (GPUs), this tool enables applications to share expensive GPU resources more effectively and to scale beyond the confines of a single GPU. Sharing GPUs more efficiently among multiple applications and detecting and addressing GPU-specific issues at run time helps improve service levels and reduce capital spending.
IBM Platform Analytics: An advanced analysis and visualization tool for analyzing massive amounts of workload and infrastructure usage data that is collected from IBM Platform Symphony clusters. It enables you to easily correlate job, resources, and license data from multiple Platform Symphony clusters for data driven decision making.
The following complementary products can be used with Platform Symphony:
IBM InfoSphere BigInsights (IBM Distribution of Apache Hadoop)
IBM Algorithmics® (Full-featured Enterprise Risk Management Solution)
IBM General Parallel File Systems (High-performance enterprise file management platform)
IBM Platform Process Manager (Design and automation of complex processes)
IBM Intelligent Cluster™ (Pre-configured, pre-integrated, optimized, and fully supported HPC solution)
Use cases for Platform Symphony
Platform Symphony is successfully deployed across many industries to manage compute-intensive and data-intensive workloads. Platform Symphony fits all time-critical use cases where service-oriented applications are running and services are called programmatically by using APIs. Whereas a batch scheduler can schedule jobs in seconds or minutes, Platform Symphony can schedule tasks in milliseconds. Because of this difference, Platform Symphony can be described as supporting online or near real-time requirements. Well-documented APIs enable fast integration for applications that are written in C, C++, C#, .NET, Visual Basic, Java, Excel COM, R, and various popular scripting languages. Platform Symphony provides a flexible distributed run time to support various Big Data and analytics applications that benefit from its best-in-class Hadoop MapReduce implementation (the enhanced Platform Symphony MapReduce framework).
Platform Symphony targets the following markets:
Financial Services: Market risk (VaR) calculations, credit risk including counterparty risk (CCR), Credit Value Adjustments (CVA), equity derivatives trading, stochastic volatility modeling, actuarial analysis and modeling, ETL process acceleration, fraud detection, and mining of unstructured data.
Manufacturing: Data warehouse optimization, predictive asset optimization, process simulation, Finite Elements Analysis, and failure analysis.
Health and Life Sciences: Health monitoring and intervention, Big Data biology emerging as a MapReduce workload, genome sequencing and analysis, drug discovery, protein folding, and medical imaging.
Government and Intelligence: Weather analysis, collaborative research, MapReduce logic implementation, data analysis in native formats, enhanced intelligence and surveillance insight, real-time cyber attack prediction and mitigation, and crime prediction and protection.
Energy, Oil, and Gas: Distribution load forecasting and scheduling, enable customer energy management, smart meter analytics, advanced condition monitoring, drilling surveillance and optimization, and production surveillance and optimization.
Media and Entertainment: Optimized promotions effectiveness, real-time demand forecast, micro-market campaign management, and digital rendering.
E-Gaming: Game- and player-related analytics.
Telco: Network analytics, location-based services, pro-active call center, and smarter campaigns.
Retail: Customer behavior and trend analysis driving large and complex analytics, merchandise optimization, and actionable customer insight.
Platform Symphony architecture
Platform Symphony is a layer of software services on top of the heterogeneous enterprise resources that provide workload and resource management. This layered service model (as shown in Figure 2-10 on page 43) presents a simplified Platform Symphony architecture.
A resource manager provides the underlying system infrastructure to enable multiple applications to operate within a shared resource infrastructure. A resource manager manages the computing resources for all types of workloads. The Enterprise Grid Orchestrator (EGO), as a resource manager, manages the supply and distribution of resources, which makes them available to applications. EGO provides resource provisioning, remote execution, high availability, business continuity, and cluster management tools.
Figure 2-10 Platform Symphony Layered Service Model
A workload manager interfaces directly with the application, receiving work, processing it, and returning the results. A workload manager provides a set of APIs or can interface with more runtime components to enable the application components to communicate and perform work. Platform Symphony works with the Service-Oriented Application (SOA) model. In a SOA environment, workload is expressed in terms of messages, sessions, and services. SOAM (SOA Middleware) works as a workload manager within the Platform Symphony.
When a client submits an application request, the request is received by SOAM. SOAM manages the scheduling of the workload to its assigned resources, requesting more resources as required to meet service-level agreements. SOAM transfers input from the client to the service, returns results to the client, and then releases excess resources to the resource manager.
Within Platform Symphony, the Enhanced Hadoop MapReduce Processing Framework supports data-intensive workload management by using a special implementation on SOAM for MapReduce workloads. The MapReduce Processing Framework is available only with the Advanced Edition. Compared with the open source Hadoop framework, the Symphony MapReduce framework attains significant performance improvements for most MapReduce jobs, especially short-running jobs. This is based mainly on the low latency and the immediate map allocation and job startup design features of SOAM (the JobTracker and TaskTracker components of Hadoop are replaced with Symphony SOAM components, which are much faster at allocating resources to MapReduce jobs).
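Assuming the mrsh wrapper that the MapReduce framework provides as a substitute for the hadoop command, a submission might look like the following sketch; the example JAR, application name, and paths are placeholders:
# Submit a standard MapReduce job to the Symphony MapReduce framework
mrsh jar /opt/hadoop/hadoop-examples.jar wordcount /input/books /output/wordcount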
For more information, see the following publications:
Platform Symphony Version 6 Release 1.1: Platform Symphony Foundations, SC27-5065-02
Platform Symphony Version 6 Release 1.1: User Guide for the MapReduce Framework, GC27-5072-02
These publications are available at this website:
Component model
The component model that is shown in Figure 2-11 consists of multiple Platform Symphony processes that are running on each host in a distributed system, a comprehensive set of utilities that are built on top of the Platform Symphony API, and relevant Platform Symphony add-on components that complement the required features. The type and number of running Platform Symphony processes depend on whether the host is a master node, a master node candidate, one of the management nodes, or a compute node. Other nodes to consider are Platform Symphony clients and the nodes (relational database servers and GPFS servers) that provide highly available services that are related to the production environment.
Figure 2-11 Platform Symphony software components diagram
Management nodes are designated to run the management components of the grid. By default, these nodes do not run the workload for users. The master node is the first node that is installed, and the resource manager of the grid runs on it. There is only one master node at a time. The master candidate node acts as the master if the master fails and usually is configured as one of the possible management nodes. The management node (or management nodes) run session managers (there is one session manager per available slot on a management host and one session manager per application) and provide an interface to the clients. Compute nodes are designated to run work and provide computing resources to users. Client nodes (Platform Symphony clients) are used for submitting work to the grid, and normally they are not members of the grid.
On the master node, the master lim starts vemkd and the process execution monitor (pem). There is one master lim per grid. The vemkd (EGO kernel) starts the service controller egosc, maintains security policies (allowing only authorized access), and maintains resource allocation policies (distributing resources accordingly).
There is one vemkd per cluster and it runs on the master node. The pem monitors vemkd, and notifies the master lim if vemkd fails. The EGO service controller (egosc) is the first service that runs on top of the EGO kernel. It functions as a bootstrap mechanism for starting the other services in the cluster. It also monitors and recovers the other services. After the kernel boots, it reads a configuration file to retrieve the list of services to be started. There is one egosc per cluster, and it runs on the master node.
On other management nodes, the session director (sd) acts as a liaison between the client application and the session manager (ssm). There is one session director process per cluster, and it can run on the master or other management node. The repository service (rs) provides a deployment mechanism for service packages to the compute nodes in the cluster. There is one repository service per cluster, and it can run on the master or other management node.
The load information manager (lim) monitors the load on the node, and starts pem. The pem starts Platform Symphony processes on the node. The ssm is the primary workload scheduler for an application. There is one session manager per application. The web service manager (wsm) runs the Platform Management Console. The PERF loader controller (plc) loads data into the reporting database. The PERF data purger (purger) purges reporting database records. If the cluster has wsm and plc controlled by EGO (required for high availability), the relevant services are run as EGO services.
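From a management node, the EGO command-line interface can be used to verify that these services and resources are in the expected state, as in the following sketch; the administrator account and its password are placeholders:
# Log on to EGO and inspect the cluster
egosh user logon -u Admin -x <password>
egosh resource list -l      # hosts managed by EGO and their attributes
egosh service list          # EGO services (wsm, plc, purger, and so on) and their states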
On the master candidate nodes, the lim monitors the load on the master candidate node and starts pem. The lim also monitors the status of the master lim and, if the master node fails, elects a new master node. The pem starts the relevant Platform Symphony processes on the node. At a minimum, three management nodes are needed to provide highly available management services. If the management node fails, the master candidate node (as one of the management nodes) must be configured to start all relevant EGO services.
On the compute nodes, the lim starts pem on the node, monitors the load on the node, and passes the configuration information and load information to the master lim on the master node. The pem monitors the lim process. The service instance manager (sim) is started on a compute node when workload is submitted to the node, if the application is preconfigured. The sim then starts a service instance (si). There is one sim per service instance.
A Platform Symphony client connects to Platform Symphony’s session director (sd) and ssm servers and to EGO’s kernel daemon (vemkd) and the service controller (egosc). This is because a Platform Symphony client indirectly uses the EGO API to communicate with EGO. More specifically, a Platform Symphony client is linked with the Platform Symphony SDK library that uses the Platform Symphony API (for sessions). The Platform Symphony API internally uses the EGO API to communicate with EGO, so the Platform Symphony SDK client internally is also an EGO client.
The Service-Oriented Architecture Middleware (SOAM) is responsible for the role of workload manager and manages service-oriented application workloads within the cluster, which creates a demand for cluster resources. The SOAM components consist of the sd, ssm, sim, and si.
To support failover for multi-node clusters, a shared file system is required to maintain the configuration files, binaries, deployment packages, and logs. To enable this shared file system, you must create a shared directory that is fully controlled by the cluster administrator and is accessible from all management nodes.
To eliminate a single point of failure (SPOF), the file system should be highly available, for example, by using HA-NFS or IBM GPFS. For the best performance, parallel multi-node access to shared data, and fast inter-node replication, we recommend the use of GPFS.
For the compute nodes, we recommend the use of the GPFS File Placement Optimizer (FPO) feature, which is designed for Big Data applications that process massive amounts of data. GPFS implements the POSIX specification natively, which means that multiple applications (MapReduce and non-MapReduce applications) can share the same file system, which improves flexibility. The use of GPFS eliminates the HDFS name node as a SPOF, which improves file system reliability and recoverability. Within GPFS, you can employ the right storage architecture for the application need by using GPFS FPO with n-way block replication for Hadoop workloads and traditional GPFS for non-Hadoop workloads, which improves flexibility and minimizes costs.
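The FPO-specific behavior is enabled per storage pool when the NSDs are created, as in the following stanza file sketch; pool names, block sizes, node names, and device names are placeholders, and the exact attribute set depends on the GPFS release:
# Data pool with write affinity enabled for FPO-style local placement
%pool:
  pool=datapool
  blockSize=2M
  layoutMap=cluster
  allowWriteAffinity=yes
  writeAffinityDepth=1
  blockGroupFactor=128

# One local disk on a compute/data node, assigned to the FPO data pool
%nsd:
  nsd=cn01_sdb
  device=/dev/sdb
  servers=cn01
  usage=dataOnly
  failureGroup=1,0,1
  pool=datapool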
The Platform Symphony Multi-Cluster (SMC) feature can be used to extend the scalability of the grid, to distribute workload execution among clusters, and to repurpose hosts between clusters to satisfy peak load. This feature is available within Platform Symphony Advanced Edition and allows spanning up to 20 silo clusters, scaling up to 100,000 hosts with 800,000 cores and up to 1,000 application managers in total.
The MultiCluster management system (SMC-Master) that is shown in Figure 2-12 is composed of software components that run outside of individual clusters to manage movement of resources between clusters and coordinate and track them. The SMC Master cluster is on a set of hosts to ensure failover and uses EGO technology for high availability and to manage services within the MultiCluster. In each cluster that you want to manage, you enable the MultiCluster proxy (SMC-Proxy) on a management host. The MultiCluster proxy is the SMCP service that discovers hosts in the cluster and triggers actions to repurpose them.
Figure 2-12 Platform Symphony Multi-Cluster software components diagram
There is one MultiCluster proxy per silo cluster. The MultiCluster agent (SMC-Agent) is a daemon that runs on demand on a host that is repurposed. It is responsible for stopping and starting the lim and starting move-in and move-out scripts and reporting their results to the MultiCluster proxy. By using MultiCluster, you can monitor all clusters from one central console (SMC-Webgui).
For more information, see Platform Symphony Version 6 Release 1.1: MultiCluster User Guide, SC27-5083-02, which is available at this website:
Operational model
A sample highly available Platform Symphony cluster environment is shown in Figure 2-13. The number and type of supported applications (compute-intensive versus data-intensive) are significant factors in selecting the appropriate technology platform and required scalability, and in deciding whether to use cluster management to provision the cluster. If the cluster is small and cost-effectiveness matters, the cluster manager for easy provisioning can be omitted.
Figure 2-13 Platform Symphony cluster deployment on the physical hardware
Also, if applications are required to run on an operating system other than Linux, PCM-SE cannot be used. Otherwise, for clusters that are running Linux supported applications, the cluster management solution is recommended. In our case (which is one of several possible configurations), we use PCM-SE for provisioning, including one shared cluster storage that is supplemented with two-node GPFS cluster.
Cluster management nodes
The sample solution includes one active PCM-SE management node (CMN1) and one standby node (CMN2). When a failover process occurs, the standby management node takes over as the management node with all running services. Because the software deployments are fully automated, the Platform Symphony cluster can be deployed in a short time. System administrators should use the PCM-SE web-based interface as the management console for performing daily management and monitoring of the cluster. PCM-SE supports kits, which provide a framework that allows third-party packages to be configured with the system. The management nodes connect to the public and the provisioning networks.
Shared storage for cluster management
To create a highly available PCM-SE environment, shared storage is required to share home and system working directories. All shared file systems must be accessible over the provisioning network by all provisioned Platform Symphony nodes. This requirement can be covered by a two-node GPFS cluster with tiebreaker disks on the storage subsystem over both storage (NSD server) nodes (CMN1 and CMN2).
Shared storage for Platform Symphony
To support Platform Symphony management failover, the ego.shared directory (which contains configuration files, binaries, deployment packages, and logs) must be accessible to all Platform Symphony management nodes. This requirement is covered by Platform Symphony’s GPFS cluster.
For data-intensive workloads, we recommend building a GPFS FPO cluster over all Platform Symphony nodes. GPFS FPO makes applications aware of where the data chunks are kept. It helps data-aware scheduling (data affinity) to intelligently schedule application tasks and improve performance by taking into account data location when dispatching the tasks (a GPFS API maps each data chunk to its node location). Data affinity is a feature that is available in the Advanced Edition version of Platform Symphony. Data chunks allow applications to define their own logical block size; therefore, GPFS FPO via optimized variable block-sizes provides good performance across diverse types of workloads.
GPFS is flexible in how its node roles might be assigned. In the context of GPFS FPO, the following recommendations are suggested:
Assign all Platform Symphony management nodes as GPFS nodes. All of these nodes require a GPFS server license.
Platform Symphony’s GPFS cluster is distributed over several physical racks with both management and compute nodes also distributed evenly across these racks.
The GPFS cluster primary and secondary cluster configuration nodes should be on separate racks for resiliency.
Ideally, there should be at least one GPFS quorum node per rack. There should be an odd number of quorum nodes to a maximum of seven.
Ideally, there should be at least one GPFS metadata disk node per rack.
Each GPFS metadata node should have at least two disks for metadata, and all metadata nodes should have the same number of metadata disks. For better performance, consider the use of solid-state drives (SSDs) for the metadata. For better fault tolerance, it is recommended to have at least four nodes with metadata disks when you are using metadata replication that is equal to three.
All Platform Symphony compute nodes should be GPFS FPO data disk nodes. All of these nodes require a GPFS FPO license.
Use hybrid allocation, treating metadata and data differently: metadata is suited to regular GPFS allocation, and data is suited to GPFS FPO allocation. Set the allocation type as a storage pool property accordingly: do not use write affinity for metadata pools, and use write affinity for data pools.
The Platform Symphony GPFS cluster architecture, which relies on local disks to store data, is shown in Figure 2-14. The use of local disks provides performance and cost improvements over other solutions. Because a local disk is available to one server only, data replication is also used for data redundancy and to enable scheduling tasks to where the data exists. For more information, see Best Practices: Configuring and Tuning IBM Platform Symphony and IBM GPFS FPO, which is available at this website:
Figure 2-14 Sample GPFS FPO cluster nodes and disks
Platform Symphony master node
The Platform Symphony master node (MN1) provides grid resource management. If it fails, this service fails over to other master candidate nodes. A similar failover strategy can be implemented for the SSM, which is running on the Platform Symphony management node. A shared file system among the management nodes facilitates the failover strategy. Within Platform Symphony’s GPFS cluster, the server acts as a GPFS quorum node and primary configuration manager node.
To achieve the highest degree of performance and scalability, we recommend a powerful master host with multi-core CPUs and sufficient physical memory (8 GB RAM per core is recommended). The Platform Symphony master node connects to the public, provisioning, and high-speed interconnect networks.
Platform Symphony management node
The Platform Symphony management node (MN2) runs session managers (there is one session manager per available slot on a management host and one session manager per application) and provides multiple ways to submit workload. From the administration perspective, management tasks can be performed from the web-based Platform Management Console GUI that is running on the Platform Symphony management node or from the command line.
The GUI is a modern and complete web-based portal for management, monitoring, reporting, and troubleshooting purposes. As a differentiating feature, it offers a high level of interactivity with running jobs (jobs and tasks can be suspended, resumed, and killed, and even the priority of a running job can be changed). If the Platform Symphony management node fails, the services fail over to a master candidate node. A shared file system among the management nodes facilitates the failover strategy.
Within Platform Symphony's GPFS cluster, the server acts as a GPFS quorum node and the secondary configuration manager node. To achieve the highest degree of performance and scalability, we recommend the use of a powerful host with multi-core CPUs and sufficient physical memory (8 GB RAM per core is recommended). The Platform Symphony management node connects to the public, provisioning, and high-speed interconnect networks.
Platform Symphony master candidate node
The Platform Symphony master candidate node (MN3) is used for failover of any of the management nodes (MN1 or MN2). The configuration of the master candidate node is the same as the Platform Symphony master node. Within Platform Symphony’s GPFS cluster, the server acts as a GPFS quorum node and file system manager node. The master candidate node connects to the public, provisioning, and high-speed interconnect networks.
Platform Symphony compute node and data node
The Platform Symphony compute and data nodes (CN01 to CN54) are designed to run Platform Symphony’s SOA execution management services to support running and managing compute-intensive or data-intensive applications that are scheduled by the upper-level SOA workload management services. Platform Symphony EGO services that are running on all data nodes collect computational resource information and help Platform Symphony SOA workload management service to schedule jobs more quickly and efficiently.
Within Platform Symphony's GPFS cluster, the GPFS NSD service runs on all compute and data nodes to consolidate all local disks and provides an alternative distributed file system (DFS) to HDFS. The GPFS FPO function and its specific license are enabled on all compute and data nodes. To achieve the highest degree of performance and scalability, we recommend the use of powerful hosts with multi-core CPUs (one processor core to one local data disk) and sufficient physical memory (a minimum of 8 GB RAM per core, especially for analytic applications, is recommended). The Platform Symphony compute and data nodes connect to the provisioning and high-speed interconnect networks.
Networking
The Platform Symphony cluster uses similar networks as the PCM-SE cluster (public, provisioning, monitoring, and application networks). The corporate network (public) represents the outside customer environment. The admin network (provisioning) is a 1 GbE network that is used for the management of all Platform Symphony nodes. The management network (monitoring) is a 1 GbE network that is used for out-of-band hardware management that uses Integrated Management Modules (IMM). Based on customer requirements, the service and management links can be separated into separate VLANs or subnets.
The management network often is connected directly into the client’s management network. The data network (application) is a private 10 GbE cluster data interconnect among compute and data nodes that is used for data access, moving data across nodes within the cluster, and importing data into GPFS. The Platform Symphony cluster often connects to the client’s corporate data network by using one or more management nodes that act as interface nodes between the internal cluster and the outside client environment (for example, for data that is imported from the corporate network into the cluster). The data network is important to the performance of data-intensive workloads. The use of a dual-port 10 GbE adapter in each compute and data node is recommended to provide higher bandwidth and higher availability.
In a multi-rack cluster, we recommend using top-of-rack switches and setting up VLANs for the different networks. Each cluster node has two aggregated 10 Gb links that use Link Aggregation Control Protocol (LACP) to the data network, one 1 Gb link to the admin network, and one 1 Gb link to the IMM network. Each rack has two 10 Gb top-of-rack switches in a high availability configuration (with inter-switch links at the top-of-rack level). These 10 Gb switches, with the Virtual Link Aggregation Group (vLAG) feature configured, allow multi-switch link aggregation, which provides higher performance and optimizes parallel active-active forwarding.
Each rack also has one 1 Gb top-of-rack switch that is dedicated to the administration and IMM networks, with two 10 Gb uplinks to the up-level switches. From each top-of-rack switch, aggregated uplinks can be configured to the up-level spine switches to build a redundant two-tier Layer 3 fat tree network. An Equal-Cost Multi-Path (ECMP) Layer 3 routing implementation is used for scalability when Virtual Router Redundancy Protocol (VRRP) is not applicable.
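As a minimal sketch of the aggregated data-network links that are described above, the following Python snippet writes Red Hat-style ifcfg files for an 802.3ad (LACP) bond over the two 10 Gb ports of a node. The interface names, IP addressing, and bonding options are assumptions and must be adapted to the actual Linux distribution and to the vLAG configuration on the top-of-rack switches.

import os

BOND_OPTS = "mode=802.3ad miimon=100 xmit_hash_policy=layer3+4"  # assumed LACP settings

def write_ifcfg(directory, name, lines):
    """Write one ifcfg-style file into the chosen directory."""
    with open(os.path.join(directory, f"ifcfg-{name}"), "w") as f:
        f.write("\n".join(lines) + "\n")

def lacp_bond(directory, bond="bond0", slaves=("eth2", "eth3"),
              ipaddr="10.20.0.11", prefix="24"):
    # Bond master that carries the data-network IP address (hypothetical values).
    write_ifcfg(directory, bond, [
        f"DEVICE={bond}", "TYPE=Bond", "BONDING_MASTER=yes",
        f'BONDING_OPTS="{BOND_OPTS}"', "BOOTPROTO=none", "ONBOOT=yes",
        f"IPADDR={ipaddr}", f"PREFIX={prefix}",
    ])
    # The two 10 Gb ports become slaves of the bond.
    for slave in slaves:
        write_ifcfg(directory, slave, [
            f"DEVICE={slave}", f"MASTER={bond}", "SLAVE=yes",
            "BOOTPROTO=none", "ONBOOT=yes",
        ])

if __name__ == "__main__":
    lacp_bond(directory=".")   # writes ifcfg-bond0, ifcfg-eth2, and ifcfg-eth3 locally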
 
Note: There is no definitive rule for choosing between a Layer 2 and a Layer 3 network configuration. However, a reasonable measure is that L3 should be considered if the cluster is expected to grow beyond 5 - 10 racks.
An important task is to determine how the Platform Symphony cluster fits into the customer environment. We recommend that you work with a network architect to collect the customer network requirements and to customize the network configurations. You need to know the following information:
The different methods of data movement in and out of the Platform Symphony cluster
Customer network bandwidth requirements
Customer corporate network standards
Whether the segment that is allocated in the corporate network has enough room for IP allocation growth (see the sketch after this list)
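For the last item in this list, a quick way to check whether the allocated corporate segment leaves room for growth is to compare the usable addresses in the subnet with the planned node count, as in the following Python sketch (the subnet, node count, and growth factor are hypothetical).

import ipaddress

def headroom(subnet, nodes_today, growth_factor=2.0, reserved=10):
    """Return how many usable addresses remain after planned growth.

    reserved covers gateways, switches, IMMs, and other infrastructure addresses.
    """
    net = ipaddress.ip_network(subnet)
    usable = net.num_addresses - 2          # exclude network and broadcast addresses
    needed = int(nodes_today * growth_factor) + reserved
    return usable - needed

# Hypothetical example: a /23 that is allocated for a 54-node cluster expected to double.
print(headroom("10.30.4.0/23", nodes_today=54))   # a positive value means there is room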
2.3 Reference architectures
IBM Application Ready Solutions for Technical Computing are based on IBM Platform Computing software and powerful IBM systems, which are integrated and optimized for leading applications and backed by reference architectures. IBM created Application Ready Solution reference architectures for target workloads and applications. Each of these reference architectures includes recommended small, medium, and large configurations that are designed to ensure optimal performance at entry-level prices. These reference architectures are based on powerful, predefined, and tested infrastructure with a choice of the following systems:
IBM Flex System™ provides the ability to combine leading-edge IBM POWER7®, IBM POWER7+™ and x86 compute nodes with integrated storage and networking in a highly dense, scalable blade system. The IBM Application Ready Solution supports IBM Flex System x240 (x86), IBM Flex System p260, and p460 (IBM Power) compute nodes.
IBM System x helps organizations address their most challenging and complex problems. The Application Ready Solution supports IBM NeXtScale System, a revolutionary new x86 high-performance system that is designed for modular flexibility and scalability; System x rack-mounted servers; and System x iDataPlex dx360 M4 systems, which are designed to optimize density, performance, and graphics acceleration for remote 3-D visualization.
IBM System Storage DS3524 is an entry-level disk system that delivers an ideal balance of price, performance, and scalability. You can also choose the optional IBM Storwize® V7000 Unified for enterprise-class, midrange storage that is designed to consolidate block and file workloads into a single system.
IBM Intelligent Cluster is a factory-integrated, fully tested solution that helps simplify and expedite deployment of x86-based Application Ready Solutions.
The solutions also include the following pre-integrated IBM Platform Computing software that is designed to address technical computing challenges:
IBM Platform HPC is a complete technical computing management solution in a single product, with a range of features that are designed to improve time-to-results and help researchers focus on their work rather than on managing workloads.
IBM Platform Cluster Manager Standard Edition provides easy-to-use yet powerful cluster management for technical computing clusters that simplifies the entire process, from initial deployment through provisioning to ongoing maintenance.
IBM Platform LSF provides a comprehensive set of tools for intelligently scheduling workloads and dynamically allocating resources to help ensure optimal job throughput.
IBM Platform Symphony delivers powerful enterprise class management for running Big Data, analytics, and compute-intensive applications.
IBM General Parallel File System (GPFS) is a high-performance enterprise file management platform for optimizing data management.
2.3.1 IBM Application Ready Solution for Abaqus
The IBM Application Ready Solution for Abaqus is a technical computing architecture that supports linear and nonlinear structural mechanics and multiphysics simulation capabilities in Abaqus, which is part of the SIMULIA realistic simulation software applications that are available from Dassault Systèmes. Abaqus provides powerful structural and multiphysics simulation capabilities that are based on the finite element method. It is sold globally by Dassault Systèmes and their reseller channel.
For more information, see the following resources:
IBM Application Ready Solution for Abaqus: An IBM Reference Architecture based on Flex System, NextScale and Platform Computing v1.0.1, which is available at this website:
2.3.2 IBM Application Ready Solution for Accelrys
Designed for healthcare and life sciences, the Application Ready Solution for Accelrys simplifies and accelerates mapping, variant calling, and annotation for the Accelrys Enterprise Platform (AEP) NGS Collection. It addresses file system performance (the biggest challenge for NGS workloads on AEP) by integrating IBM GPFS for scalable I/O performance. IBM systems provide the computational power and high-performance storage that is required, with simplified cluster management to speed deployment and provisioning.
For more information, see the following resources:
IBM Application Ready Solution for Accelrys: An IBM Reference Architecture based on Flex System, System x, and Platform Computing V1.0.3, which is available at this website:
IBM Platform Computing:
2.3.3 IBM Application Ready Solution for ANSYS
The IBM Application Ready Solution for ANSYS is a technical computing architecture that supports software products that are developed by ANSYS in the areas of computational fluid dynamics and structural mechanics. ANSYS computational fluid dynamics (CFD) software solutions (including ANSYS Fluent and ANSYS CFX) are used to predict fluid flow, heat and mass transfer, chemical reactions, and related phenomena by numerically solving a set of governing mathematical equations (conservation of mass, momentum, energy, and others). ANSYS Structural Mechanics software (including ANSYS Mechanical) offers a comprehensive solution for linear, nonlinear, and dynamic analysis. It provides a complete set of element behaviors, material models, and equation solvers for various engineering problems.
For more information, see the following resources:
IBM Application Ready Solution for ANSYS: An IBM Reference Architecture based on Flex System, NeXtScale, and Platform Computing v2.0.1, which is available at this website:
IBM Platform Computing:
2.3.4 IBM Application Ready Solution for CLC bio
This integrated solution is designed for clients who are involved in genomics research in areas ranging from personalized medicine to plant and food research. Combining CLC bio software with high-performance IBM systems and GPFS, the solution accelerates high-throughput sequencing and analysis of next-generation sequencing data while improving the efficiency of CLC Genomics Server and CLC Genomics Workbench environments.
For more information, see the following resources:
IBM Application Ready Solution for CLC bio: An IBM Reference Architecture based on Flex System, System x, and Platform Computing V1.0.2, which is available at this website:
IBM Platform Computing:
2.3.5 IBM Application Ready Solution for Gaussian
Gaussian software is widely used by chemists, chemical engineers, biochemists, physicists, and other scientists who are performing molecular electronic structure calculations in various market segments. The IBM Application Ready Solution is designed to help speed results by integrating the latest version of the Gaussian series of programs with powerful IBM Flex System POWER7+ blades and integrated storage. IBM Platform Computing provides simplified workload and resource management.
For more information, see the following resources:
IBM Application Ready Solution for Gaussian: An IBM Reference Architecture based on POWER Systems V1.0.1 is available at:
IBM Platform Computing:
2.3.6 IBM Application Ready Solution for InfoSphere BigInsights
The Application Ready Solution for IBM InfoSphere BigInsights provides a powerful Big Data MapReduce analytics environment and reference architecture that is based on IBM PowerLinux™ servers, IBM Platform Symphony, IBM GPFS, and integrated storage. The solution delivers balanced performance for data-intensive workloads, with tools and accelerators to simplify and speed application development. The solution is ideal for solving time-critical, data-intensive analytics problems in various industry sectors.
For more information, see the following resources:
IBM Application Ready Solution for InfoSphere BigInsights: An IBM Reference Architecture V1.0, which is available at this website:
IBM Platform Computing:
2.3.7 IBM Application Ready Solution for mpiBLAST
mpiBLAST is a freely available, open source, parallel implementation of the National Center for Biotechnology Information (NCBI) Basic Local Alignment Search Tool (BLAST). The IBM Application Ready Solution for mpiBLAST simplifies the deployment of an open source parallel BLAST environment for the life sciences. It provides an expertly designed, tightly integrated, and performance-optimized architecture based on Flex System, System x, and Platform Computing for simplified workload and resource management.
For more information, see the following resources:
IBM Application Ready Solution for mpiBLAST: An IBM Reference Architecture based on Flex System, System x, and Platform Computing Version 1.0, which is available at this website:
IBM Platform Computing:
2.3.8 IBM Application Ready Solution for MSC Software
The IBM Application Ready Solution for MSC Software features an optimized platform that is designed to help manufacturers rapidly deploy a high-performance simulation, modeling, and data management environment, complete with process workflow and other high-demand usability features. The platform features IBM systems (IBM Flex System and IBM NeXtScale System), Platform HPC workload management, and GPFS parallel file system that are seamlessly integrated with MSC Nastran, MSC Patran, and MSC SimManager to provide clients robust and agile engineering clusters for accelerated results and lower cost.
For more information, see the following resources:
IBM Application Ready Solution for MSC: An IBM Reference Architecture V1.0.1, which is available at this website:
IBM Platform Computing:
2.3.9 IBM Application Ready Solution for Schlumberger
Fine-tuned for accelerating reservoir simulations that use Schlumberger ECLIPSE and INTERSECT, this Application Ready Solution provides application templates to reduce set up time and simplify job submission. Designed specifically for Schlumberger applications, the solution enables users to perform more iterations of their simulations and analysis, which ultimately yields more accurate results. Easy access to Schlumberger job-related data and remote management improves user and administrator productivity.
For more information, see the following resources:
IBM Application Ready Solution for Schlumberger: An IBM Reference Architecture based on Flex System, System x, and Platform Computing V1.0.3 is available at:
IBM Platform Computing:
2.3.10 IBM Application Ready Solution for Technical Computing
The IBM Application Ready Solutions for Technical Computing architecture supports both compute and data intensive applications. Technical computing users are often technical within their respective field (engineers who are designing automobile parts, for example), but not experts in computer and software technology. Independent software vendors (ISVs) made significant investments to increase their application's performance and capability by enabling them to run in a distributed computing cluster environment. However, many users are unable to fully use the capabilities of these applications because they do not have the technical ability to efficiently deploy and manage a technical or HPC cluster.
For more information, see the following resources:
IBM Application Ready Solutions for Technical Computing: An IBM Reference Architecture based on Flex System, NextScale, System x, and Platform Computing V2.0, which is available at this website:
IBM Platform Computing:
2.3.11 IBM System x and Cluster Solutions configurator
The IBM System x and Cluster Solutions configurator (x-config) is the hardware configurator that supports the configuration of Cluster Solutions. The reference architectures for IBM Application Ready Solutions are provided as predefined templates within x-config. The configurator is a stand-alone, Java-based application that runs on a workstation after it is downloaded and does not require an internet connection. However, if an internet connection is available, the most recent version is installed automatically when the tool is started.
The configurator is available at this website:
Complete the following steps to access the predefined templates:
1. From the x-config starter window (as shown on Figure 2-15), select Express and Cluster Support ON.
Figure 2-15 x-config starter page
2. In the next window, select No-options solutions from the Solution drop-down menu.
3. Select IBM Application Ready Solutions from the Type drop-down menu.
4. Select the applicable Application Ready Solution from the Template drop-down menu.
5. After the template is loaded into configurator, click View details of this configuration to see a complete list of parts and list prices.
6. Click Configure to customize the predefined solution that is based on your requirements.
The x-config provides support for the configuration of Intelligent Clusters, iDataPlex systems, and stand-alone System x servers. Its intention is to translate rack-level architectural design into orderable configurations. It is also a single tool to support the design and configuration of multiple product lines (Blade, iDPx, Power, Flex, System x Rack Mount servers, racks, switches, cables, PDUs, and so on).
It also provides rack-level diagramming so that you can visualize component locations within a rack, which is important for data center design; for example, where items are placed in the rack, how air flows, and how much power you put in the rack. It also expands to floor layouts, which is critical in high-performance computing environments where specialized networks are expensive; the ability to model the actual data center floor yields specific cable lengths. You can also view list prices as components are configured, which helps you estimate costs against list prices more accurately.
The x-config should not be considered a design tool. Instead, it allows architects to translate designs into priceable, orderable, and buildable solutions and aids the user by providing the following benefits:
Performing system-level checks and validation (SOVA)
Aiding in calculating cable lengths that are based on the floor locations of the racks (see the sketch after this list)
Enforcing high-level rules that are based on best practices
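x-config performs the cable-length calculation automatically from the rack positions on the floor plan. The following simplified Python sketch shows the underlying idea: estimate the length of a cable between two racks as a Manhattan-style run across floor tiles plus vertical rises and service slack (the tile size, rise, and slack values are assumptions).

def estimated_cable_length(rack_a, rack_b, tile_m=0.6, rise_m=2.0, slack_m=1.5):
    """Estimate a point-to-point cable length in meters.

    rack_a and rack_b are (row, column) floor-tile coordinates of the racks.
    The run goes up one rack, across the floor (or overhead tray), and down
    the other rack, with extra slack for cable dressing and service loops.
    """
    horizontal = (abs(rack_a[0] - rack_b[0]) + abs(rack_a[1] - rack_b[1])) * tile_m
    return horizontal + 2 * rise_m + slack_m

# Example: racks three rows and two columns apart on 0.6 m floor tiles.
print(round(estimated_cable_length((0, 0), (3, 2)), 1), "m")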
2.4 Workload optimized systems
Workload optimized hardware and software nodes are key building blocks for every technical computing environment. To review the potential of the use of workload optimized systems with IBM Platform Computing products, you can use the reference architectures as part of an overall assessment process with a customer. While you are working on a proposal with a client, you can discover and analyze the client’s technical requirements and expected usage (hardware, software, data center, workload, current environment, user data, and high availability).
The following hardware evaluations must be considered:
Determine data storage requirements, including user data size and compression ratio (a sizing sketch follows this list).
Determine shared storage requirements.
Determine whether data node OS disks require mirroring.
Determine memory requirements.
Determine throughput requirements and likely bottlenecks.
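The sizing sketch that is referenced in the first item of this list is shown below. It combines user data size, compression ratio, replication, growth, and headroom into a raw-capacity estimate; all of the input values are placeholders that must be replaced with the customer's figures.

def raw_capacity_tb(user_data_tb, compression_ratio=2.0, replicas=3,
                    annual_growth=0.3, years=3, headroom=0.25):
    """Estimate the raw disk capacity (TB) that is needed for the planning horizon.

    compression_ratio > 1 means the data shrinks on disk; replicas covers
    file system replication (for example, GPFS FPO or HDFS-style copies);
    headroom keeps the file system from running near full.
    """
    grown = user_data_tb * (1 + annual_growth) ** years
    on_disk = grown / compression_ratio * replicas
    return on_disk * (1 + headroom)

# Placeholder example: 100 TB of user data today with three-way replication.
print(f"{raw_capacity_tb(100):.0f} TB raw")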
The following software aspects must be considered:
Identify cluster management strategy, such as node firmware and OS updates.
Identify a cluster rollout strategy, such as node hardware and software deployment.
Determine the use of GPFS for performance.
The following data center aspects must be evaluated and considered:
Determine cooling requirements, such as airflow and BTU requirements.
Determine server spacing, racking, networking and electrical cabling, and cooling.
The following workload aspects must be considered:
Determine workload characteristics, such as performance sensitive, compute-intensive, data-intensive, or a combination.
Identify workload management strategy.
Determine business-driven scheduling policies.
The following current environment aspects must be considered:
Determine customer corporate networking requirements, such as networking infrastructure and IP addressing.
Determine existing data storage and memory shortfalls.
Identify system usage inefficiencies.
The following user data aspects must be considered:
Determine the current and future total data to be managed.
Determine the size of a typical data set.
If data is imported, specify the volume of data to be imported and the import patterns.
Identify the data access and processing characteristics of common jobs and whether they use query-like frameworks.
The following high availability aspects must be considered:
Determine high availability requirements.
Determine multi-site deployments, if required.
Determine disaster recovery requirements, including backup and recovery and multi-site disaster recovery requirements.
 
Recommendation: To design an HPC cluster infrastructure, conduct the necessary testing and proof of concepts against representative data and workloads to ensure that the proposed design achieves the necessary success criteria.
2.4.1 NeXtScale System
NeXtScale System is the next-generation dense system from System x for clients that require flexible and scale-out infrastructure. The building blocks include a dense 6U chassis (NeXtScale n1200) that contains 12 bays for half-wide compute nodes (NeXtScale nx360 M4), storage (Storage NeX), and planned acceleration through graphics processing units (GPUs) or Intel Xeon Phi coprocessors (PCI NeX). NeXtScale System uses industry-standard components, including I/O cards and top-of-rack networking switches, for flexibility of choice and ease of adoption.
From the performance point of view, the NeXtScale server solution benefits from next-generation, top-bin Intel Xeon E5-2600 v2 (Ivy Bridge EP) processors (up to 24 cores and 48 threads per server), faster memory that runs at 1866 MHz (up to 256 GB per server), a choice of SATA, SAS, or SSD drives on board (with SSDs providing the highest I/O throughput), and an open ecosystem of high-speed I/O interconnects (10 Gb Ethernet, or QDR, FDR10, or FDR14 InfiniBand through a slotless mezzanine adapter).
From the operation point of view, the NeXtScale solution benefits from the use of S3, which allows systems to come back into full production from a low-power state much quicker (only 45 seconds) than a traditional power-on (270 seconds). When you know that a system will not be used because of time of day or state of job flow, it can be sent into a low-power state to save power and bring it back online quickly when needed.
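Whether S3 is worth using depends on how long a node is expected to stay idle. The following Python sketch compares the energy that is used by leaving a node powered on against the overhead of suspending and resuming it; the 45-second resume time comes from the text, while the power-draw figures are assumptions.

def worth_suspending(idle_seconds, active_watts=350.0, s3_watts=15.0,
                     resume_seconds=45.0, resume_watts=400.0):
    """Return True if suspending to S3 saves energy over the idle window.

    active_watts, s3_watts, and resume_watts are assumed node power draws;
    resume_seconds is the S3 resume time that is quoted for NeXtScale.
    """
    stay_on_joules = active_watts * idle_seconds
    suspend_joules = (s3_watts * max(idle_seconds - resume_seconds, 0)
                      + resume_watts * resume_seconds)
    return suspend_joules < stay_on_joules

# A node that is idle for 10 minutes between jobs is usually worth suspending.
print(worth_suspending(idle_seconds=600))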
By using 80 PLUS Platinum power supplies that operate at 94% efficiency and 80 mm fans that are shared in the chassis, power and cooling are provided for all installed nodes. NeXtScale reduces the total number of parts that are needed for the power and cooling solution, which saves part cost, and reduces the number of PSUs and fans, which reduces power draw.
Servicing NeXtScale from the front of the rack is an easier task. Everything is right there in front of you, including the power button, cabling, alert LEDs, and node naming and tagging. This reduces the chance of miscabling or pulling the wrong server. Having the cables arranged in the back of the rack also is good for airflow and energy efficiency.
The hyper-scale server NeXtScale nx360 M4 provides a dense, flexible solution with a low total cost of ownership. The half-wide, dual-socket NeXtScale nx360 M4 server is designed for data centers that require high performance but are constrained by floor space. By taking up less physical space in the data center, the NeXtScale server significantly enhances density.
NeXtScale System can provide up to 84 servers (or 72 servers with space for six 1U switches) that are installed in a standard 42U rack. Supporting Intel Xeon E5-2600 v2 series up to 130 W and 12-core processors provides more performance per server. The nx360 M4 compute node contains only essential components in the base architecture to provide a cost-optimized platform.
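The 84-server and 72-server figures follow directly from the chassis geometry (a 42U rack holds seven 6U chassis with 12 half-wide bays each). The following sketch reproduces that arithmetic so that the effect of reserving rack space for switches is easy to see.

def servers_per_rack(rack_u=42, chassis_u=6, bays_per_chassis=12, switch_u=0):
    """Number of half-wide servers that fit in a rack after reserving switch space."""
    chassis = (rack_u - switch_u) // chassis_u
    return chassis * bays_per_chassis

print(servers_per_rack())             # 7 chassis x 12 bays = 84 servers
print(servers_per_rack(switch_u=6))   # reserving 6U for six 1U switches leaves 72 servers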
Native expansion means that functions and capabilities can be added seamlessly to the basic node. There is no need for exotic connectors, unique components, or high-speed back or mid planes. The NeXtScale Native Expansion capability adds hard disk drives (HDDs) to the node with a simple Storage NeX (tray) plus a standard RAID card, SAS cable, and HDDs. GPUs are added to a node through a PCI NeX supplemental PCIe riser and a passively cooled GPU from NVIDIA or Intel. You also have a powerful acceleration solution for HPC, virtual desktop, or remote graphics (two GPUs per server in 1U of effective space).
For more information about how to implement NeXtScale System and positioning within other System x platforms (iDataPlex, Flex System, rack-mounted System x servers), see IBM NeXtScale System Planning and Implementation Guide, SG24-8152, which is available at this website:
For more information about the NeXtScale System product, see this website:
2.4.2 iDataPlex
IBM System x iDataPlex is an innovative data center solution that maximizes performance and optimizes energy and space efficiency. The building blocks include the iDataPlex rack cabinet, which offers 100 rack units of space (up to 84 servers per iDataPlex rack, eight top-of-rack switches, and eight PDUs), and the 2U FlexNode chassis, which supports up to two half-depth 1U dx360 M4 compute nodes and can be extended with PCIe trays that support I/O-intensive and GPGPU-intensive applications.
The iDataPlex rack has 84U of slots for server chassis and 16 vertical slots for network switches, PDUs, and other appliances. The rack is oriented so that servers fit side-by-side on the widest dimension. For ease of serviceability, all hard disk, planar, and I/O access is from the front of the rack. In addition, the optional liquid-cooled Rear Door Heat eXchanger that is mounted to the back of the rack can remove 100% of the heat that is generated within the rack before it exits into the data center. It can also help to cool the data center and reduce the need for Computer Room Air Conditioning (CRAC) units. This allows racks to be positioned much closer together, which eliminates the need for hot aisles between rows of fully populated racks.
 
Note: IBM also supports the installation of iDataPlex servers in standard 19-inch racks.
The highly dense iDataPlex dx360 M4 server is a modular solution with a low total cost of ownership. The unique half-depth, dual-socket iDataPlex dx360 M4 server is designed for data centers that need energy efficiency, optimized cooling, extreme scalability, high density at the data center level, and high performance at an affordable price. Support for Intel Xeon E5-2600 v2 series processors up to 130 W provides more performance per server (up to 24 cores and 48 threads) and maximizes the concurrent execution of multi-threaded applications.
Each 2U chassis is independently configurable to mix compute, I/O, and storage density for a tailored dense solution. You can use faster memory that runs at 1866 MHz (up to 512 GB per server), a choice of SATA, SAS, or SSD drives on board, and selected high-speed I/O interconnects (two 10 Gb Ethernet ports, or two FDR14 InfiniBand ports through a slotless mezzanine adapter). The chassis uses a shared fan pack with four 80 mm fans and redundant, highly efficient (80 PLUS Platinum) power supplies. With the iDataPlex chassis design, air needs to travel only 20 inches front to back (a shorter distance means better airflow). This shallow depth is part of the reason that the cooling efficiency of an iDataPlex server is high.
For more information about the iDataPlex System and liquid-cooling, see these resources:
Implementing an IBM System x iDataPlex Solution, SG24-7629-04, which is available at this website:
2.4.3 Intelligent Cluster
The IBM Intelligent Cluster is a factory-integrated, interoperability-tested system with compute, storage, networking, and cluster management that is tailored to your requirements and supported by IBM as a solution. With an optimized solution design that uses interoperability-tested, best-of-industry technologies, it simplifies complex solutions and reduces the time and risk of deployment.
The Intelligent Cluster solution includes the following building blocks:
IBM System x:
 – IBM NeXtScale System: nx360 M4
 – IBM Flex System: x220, x240, x440 compute nodes
 – Blade servers: HX5, HS23
 – Enterprise servers: x3850 X5, x3690 X5
 – iDataPlex servers: dx360 M4
 – Rack servers: x3550 M4, x3630 M4, x3650 M4, x3750 M4, and x3650 M4 HD
Interconnects:
 – Ethernet Switches: IBM System Networking, Brocade, Cisco, Mellanox, Edgecore
 – Ethernet Adapters: Chelsio, Mellanox, Emulex, Intel
 – InfiniBand Switches and Adapters: Mellanox, Intel
 – Fibre Channel: Brocade, Emulex, and Intel
Storage systems (System Storage): DS5020, DS5100, DS5300, DS3950, DS3500, DS3512, DS3524, and IBM Storwize V3700
Storage expansions: EXP5000 Storage Expansion Unit, EXP 2512 Storage Expansion Unit, EXP 2524 Storage Expansion Unit, EXP 520 Storage Expansion Unit, and EXP 395 Storage Expansion Unit
OEM storage solution: DDN SFA 12000 InfiniBand (60 and 84 drive enclosures)
Graphic Processing Units (GPUs): NVIDIA: Quadro 5000, Tesla K10, Tesla M2070Q, Tesla M2090, Tesla K20, and Tesla K20X
Operating systems: Red Hat Enterprise Linux (RHEL) and SUSE Linux Enterprise Server (SLES)
Cluster management software:
 – IBM Platform HPC
 – IBM Platform Cluster Manager
 – IBM Platform LSF, xCAT (Extreme Cloud Administration Toolkit)
 – Moab Adaptive HPC Suite, Moab Adaptive Computing Suite
 – IBM General Parallel File System (GPFS) for Linux
 – IBM LoadLeveler®
 – IBM Parallel Environment
For more information about Intelligent Cluster, see this website:
2.4.4 Enterprise servers
The new IBM X6 enterprise servers are high-end servers that are designed for heavy vertical workloads, virtualization, and legacy system replacements. IBM invested in its Enterprise X-Architecture to deliver industry-leading performance, scalability, and reliability on industry-standard x86-based systems. IBM X6 rack-mount servers are available in four-socket (x3850 X6) and eight-socket (x3950 X6) models and incorporate a new book design.
The IBM X6 servers offer pay-as-you-grow scalability. The System x3850 X6 server features a modular design that includes so-called books for each of the three subsystems (I/O, storage, and compute). Front and rear access means that you can easily add and remove the various components without removing the server from the rack, which is a revolutionary concept in rack servers. You add components as you need them, and some components, such as storage and I/O adapters, are hot-swappable, so you do not need to power off the server to add them. With a design that helps prevent component failures from bringing down the entire machine, you can feel confident that an X6 server is an ideal platform for any mission-critical application.
The following building blocks for new X6 servers are available:
Compute books
Each compute book contains one processor (Intel Xeon E7-4800v2 or E7-8800v2) and 24 DIMM slots. It is accessible from the front of the server. The x3850 X6 has up to four compute books. The x3950 X6 has up to eight compute books (with support for 64 GB LRDIMMs, you can have up to 6 TB of memory in the x3850 X6 or 12 TB in the x3950 X6).
Storage books
The storage book contains standard 2.5-inch drives or IBM eXFlash 1.8-inch SSDs (up to 12.8 TB of SAS 2.5-inch disk or up to 6.4 TB of eXFlash 1.8-inch SSDs). It also provides front USB and video ports, and has two PCIe slots for internal storage adapters. The storage book is accessible from the front of the server.
I/O books
The I/O book is a container that provides PCIe expansion capabilities. I/O books are accessible from the rear of the server. The rear contains the primary I/O book, optional I/O books, and up to four 1400 W/900 W AC or 750 W DC power supplies.
The following types of I/O books are available:
Primary I/O book
This book provides core I/O connectivity, including a dedicated mezzanine LOM (ML) slot for an onboard network, three PCIe slots, an Integrated Management Module II, and four rear ports (USB, video, serial, and management).
Full-length I/O book
This hot-swappable book provides three full-length PCIe slots. This book supports a co-processor or GPU adapter up to 300 W if needed.
Half-length I/O book
This hot-swappable book provides three half-length PCIe slots.
The X6 offering also includes new IBM eXFlash memory-channel storage and IBM FlashCache Storage Accelerator, two key innovations that eliminate storage bottlenecks. This new IBM eXFlash memory-channel storage brings storage closer to the processor subsystem, which improves performance considerably.
These storage devices have the same form factor as regular memory DIMMs, are installed in the same slots, and are directly connected to the memory controller of the processor. IBM eXFlash DIMMs are available in 200 GB and 400 GB capacities and you can install 32 of them in a server.
New eXFlash DIMMs have latency values lower than any SSD or PCIe High IOPS adapter. This represents a significant advantage for customers who need the fastest access to data. FlashCache Storage Accelerator is intelligent caching software and uses intelligent write-through caching capabilities to achieve better IOPS performance, reduced I/O load, and ultimately increased performance in primary storage.
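The memory and eXFlash capacities that are quoted above follow directly from the book and DIMM counts. The following sketch reproduces that arithmetic (the 64 GB LRDIMM and 400 GB eXFlash DIMM sizes come from the text).

def max_memory_tb(compute_books, dimms_per_book=24, dimm_gb=64):
    """Maximum memory per server when every slot holds the largest LRDIMM."""
    return compute_books * dimms_per_book * dimm_gb / 1024

def exflash_tb(dimm_count=32, dimm_gb=400):
    """Maximum eXFlash memory-channel storage per server."""
    return dimm_count * dimm_gb / 1000

print(max_memory_tb(4))   # x3850 X6: 4 books x 24 DIMMs x 64 GB = 6 TB
print(max_memory_tb(8))   # x3950 X6: 12 TB
print(exflash_tb())       # 32 x 400 GB eXFlash DIMMs = 12.8 TB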
All these features mean that the new X6 enterprise servers offer excellent upgradability and investment protection.
For more information, see these websites:
2.4.5 High volume systems
The volume space represents over half of the total x86 server market, and IBM has a broad portfolio of rack servers to meet various client needs, from infrastructure to technical computing. This is especially true for analytics (compute-intensive workloads) and Big Data (data-intensive workloads), which are the focus of the new System x3650 M4 HD and BD server types.
The new System x3650 M4 HD (High Density) two-socket 2U server is optimized for high-performance, storage-intensive applications, including data analytics or business-critical workloads. Supporting Intel Xeon E5-2600 v2 series up to 130 W processors provides more performance per server (up to 24 cores, and 48 threads) and maximizes the concurrent execution of multi-threaded applications.
You can use faster memory that runs at 1866 MHz (up to 768 GB per server with 30 MB of cache), which yields low latency for data access and faster response times. It has 12 Gb RAID on board, which doubles the bandwidth of the x3650 M4 for optimized data protection performance. It also allows up to four RAID adapters, which provides flexible storage controller configurations and up to a 4x performance increase for demanding storage-intensive workloads compared with a single RAID adapter design.
It provides flexible internal storage options, including 26 x 2.5-inch HDDs or SSDs, or 16 x 2.5-inch HDDs or SSDs plus 16 x 1.8-inch SSDs, up to 41 TB. You can select up to 16 HDDs plus 16 SSDs for optimum storage performance through storage tiering and use the IBM FlashCache Storage Accelerator option to deliver high I/O performance without the need for tuning.
It also provides the ability to boot from the rear HDDs with a separate optional RAID adapter to keep the OS and business data separate, which means easy setup, management, and configuration. You can use standard and high-speed I/O interconnects (4 x 1 Gb Ethernet ports plus a 1 Gb IMM port, with an optional 2 x 10 Gb Ethernet NIC design that requires no PCIe slot).
Extensions are provided through up to six PCIe 3.0 slots or, optionally, four PCI-X slots. The server optionally supports up to two GPUs (NVIDIA adapters). On the rear, it can use up to four hot-swap redundant fans and two hot-swap redundant, efficient 80 PLUS Platinum power supply units.
For more information about the System x3650 M4 HD product, see this website:
http://www.ibm.com/systems/x/hardware/rack/x3650m4hd/index.html
For more technical information about the System x3650 M4 HD, see this website:
The new System x3650 M4 BD (Big Data) two-socket 2U server is optimized for the capacity, performance, and efficiency that you need for Big Data workloads. Support for Intel Xeon E5-2600 v2 series processors up to 115 W provides more performance per server (up to 24 cores and 48 threads) for fast response times and business outcomes for Big Data workloads. You can use faster memory that runs at 1866 MHz (up to 512 GB per server with 30 MB of cache). It also supports 1+1 RAID on the rear drives, which enables booting from the separate rear drives and keeps the OS and business data separate.
It also offers flexible options, including a choice of JBOD for maximum capacity or up to 12 Gb RAID for optimal data protection, supported by 1 GB, 2 GB, or 4 GB of flash-backed cache. Another option is 6 Gb RAID with 200 GB or 800 GB of flash for caching, data, or a boot volume.
It provides flexible internal storage options by choosing 12+2 x 3.5-inch hot-swap SATA HDDs and 2.5-inch SSDs, up to 56 TB. You can use standard and high-speed I/O interconnects (4 x 1 Gb Ethernet ports plus a 1 Gb IMM port, with an optional 2 x 10 Gb Ethernet NIC design that requires no PCIe slot). Extensions are provided through up to five PCIe 3.0 slots. On the rear, it can use up to four hot-swap redundant fans and two hot-swap redundant, efficient 80 PLUS Platinum PSUs.
For more information about the System x3650 M4 BD product, see this website:
http://www.ibm.com/systems/x/hardware/rack/x3650m4bd/index.html
For more technical information about System x3650 M4 BD, see this website: