Compute Clusters

Many industry sectors harness the power of multiple computer systems to perform tasks that could not otherwise be accomplished on a single server. Moreover, some tasks can be accomplished more cost-effectively by a collection of smaller, lower-cost systems. Typical of this class of user tasks are computer animation in the entertainment industry and circuit simulation to verify chip designs in the Electronic Design Automation (EDA) industry.

Scalability rather than availability is usually the foremost concern for applications in this sector. Task complexity rather than criticality warrants a cluster approach. If an arbitrary cluster node fails, unfinished jobs can simply be resubmitted to the system for processing. Although there may be a chain of dependency within the jobs submitted, it is unusual for external systems to depend directly on the results of the computation. A few hybrid situations exist. These situations often involve business processes that batch process, data mine, or otherwise transform data held in corporate repositories through a series of stages.

Distributed Clusters

Distributed clusters are more loosely coupled than the tightly coupled solution that the Sun Cluster 3.0 software offers. A cluster of distributed servers brings together the CPU, memory, and I/O resources of multiple systems without the accompanying membership monitor or extensive suite of fault probes found in Sun Cluster. Instead, distributed clusters often add a high-performance parallel file system and batch-queuing systems to schedule jobs across the available resources. One common component that both architectures rely on is a high-throughput, low-latency system area interconnect.

Parallel Processing

Applications that use parallel processing can be written in two ways: with function partitioning or with data partitioning. Function partitioning uses different threads of execution to process data in different ways, concurrently. Data partitioning simultaneously processes multiple, independent portions of the data set in the same way. The amenability of the algorithm to parallel techniques and the volume of data that must be transferred between nodes are among the factors that govern the scalability of such clusters.

High-Performance Computing

High-performance computing (HPC) application implementation can use one of two basic approaches: single process, which limits the computation to the resources of a single server; or multiprocess, which utilizes the resources of multiple servers. Both approaches enhance performance if the application can be threaded in one of the ways described previously. Single-process programs can be made multithreaded in two ways—with parallelization directives (for example, OpenMP and the Sun parallelizing compilers) or through explicit use of Solaris operating environment threads or POSIX threads in the program. Similarly, multiprocess applications can be written with the message passing interface (MPI) API or the rival parallel virtual machine (PVM) API to enable execution both within and across servers. You can also combine both approaches and use threading and MPI in a single implementation.

Often HPC clusters use a job scheduling system to distribute the workload across the available resources within a cluster. Using job scheduling maximizes utilization of the underlying hardware so that you do not waste precious CPU cycles. Combining the job scheduler, a homogeneous view of file space, and uniform access to application binaries creates the illusion of a single, seamless computing resource.

Sun HPC Clusters

Typically, HPC environments are computation intensive. Often they run technical applications that require powerful, high-performance systems to complete computations in a timely fashion. The Sun HPC ClusterTools™ package provides a software development environment to aid the creation, debugging, and tuning of MPI applications. These tools can be deployed subsequently, either on a single, large SMP server or on a cluster of SMP servers, depending on the resources available at the point of execution. This package also provides tools to manage the workload across the cluster. The majority of the Sun HPC ClusterTools toolkit is also available through the Sun Community Source License (SCSL) mechanism.

The Sun HPC ClusterTools software has the following components:

  • Sun™ Cluster Runtime Environment

  • Sun MPI Communications Library

  • Sun™ Parallel File System

  • Sun™ Scalable Scientific Subroutine Library (Sun S3L)

  • Prism™ Parallel Development Environment

Sun Cluster Runtime Environment

The Sun Cluster Runtime Environment software enables you to specify the resource requirements of an MPI application and then launch, monitor, and control it throughout its execution. You can also start applications under the control of the Prism debugger. Additionally, the Sun Cluster runtime environment interfaces with distributed resource managers (DRMs), such as the Sun™ Grid Engine software (see “Sun Grid Engine Software”) and the Platform Computing load-sharing facility (LSF), which enable you to schedule jobs across multiple systems.

Sun MPI Communications Library

The highly optimized, native MPI implementation conforms to the majority of the MPI-2 standard. It also includes full support for MPI I/O, which gives applications direct access to the Sun Parallel File System (PFS). Using the Sun MPI library, applications can scale to 64 nodes and up to 1,024 processors in a single cluster. As with Sun Cluster 3.0 systems, communication between HPC cluster nodes can use standard TCP/IP interfaces, such as Gigabit Ethernet, but you can achieve higher performance by using the lower latency and higher bandwidth available through the RSM™ protocol over PCI-SCI. Alternative high-performance interconnects, such as Myricom Myrinet2000, are supported through loadable protocol modules developed under the SCSL initiative.

Sun Parallel File System

HPC applications often make substantial demands on the I/O subsystem of a platform. The Sun PFS, coupled with the MPI I/O routines, provides a mechanism for achieving scalable, parallel I/O. A clear distinction exists between the goals of the Sun HPC ClusterTools software PFS and the Sun Cluster 3.0 software global file system (GFS). PFS trades off availability for performance, whereas GFS makes availability paramount. Striping data across the I/O systems of multiple nodes increases throughput but correspondingly decreases the mean time between failures (MTBF). FIGURE 2-6 shows a high-level overview of the Sun PFS.

Figure 2-6. Sun Parallel File System, High-Level View


Sun Scalable Scientific Subroutine Library

S3L is a set of parallel, scalable routines of the kind commonly used in science and engineering. S3L enables you to take advantage of multiprocessor and cluster technology without writing your own parallel code. For example, parallel, multiprocessor-optimized matrix operations can be used in an otherwise single-threaded program with a function call to the S3L library.

Prism Parallel Development Environment

The Prism™ debugging tools provide a wide range of facilities for debugging multithreaded, distributed programs and visualizing their data structures. Debugging multithreaded programs has been possible for many years using products such as the Forte™ development tools and debugger. The synchronization required between the threads being debugged and the debugger is accomplished with the Solaris kernel tracing facilities. However, debugging distributed programs requires a higher-level distributed debugging environment. Prism provides this environment and manages the synchronization between distributed programs and the debugger. The Prism debugging tools are invaluable for developing distributed HPC programs.

Sun Grid Engine Software

The Sun Grid Engine 5.2 software is a full-featured distributed resource manager (DRM). Its features include load-balancing and batch scheduling facilities that enable you to schedule batch, parallel, and interactive jobs across a diverse range of distributed computer resources. Using the Sun Grid Engine software can raise hardware utilization levels from a typical 20 to 30 percent to as high as 98 percent. Primary markets include EDA, MCAE, geosciences, and software development.

Installing the Sun Grid Engine software on a collection of workstations or servers turns the collection into a single, virtual pool of computational resources. Within this virtual entity, work queues can be defined, given owners, and subjected to resource limits; you can subsequently enable and disable these items as required. Similarly, users can be created with priorities and job limits to constrain their resource consumption. All of these features can be configured through the command-line and GUI management interfaces.

The Sun Grid Engine software consists of four daemon processes—execd, commd, schedd, and qmaster. These daemon processes coordinate the scheduling, dispatch, and execution of batch jobs, monitor job and machine status, report on the systems, and manage communications among components.

The Sun Grid Engine software is ported to a variety of operating environments, including the Solaris operating environment and Linux. The source code is available under the Sun Industry Standards Source License. For details, see http://www.sun.com/software/gridware.
