6
Monitoring and Controlling Power and Performance of Servers and Data Centers

This chapter will present some techniques to monitor and control the power and performance of IT devices (we will focus on servers), and the data center infrastructure itself with its pumps, chillers and so on.

When changes have to be made on a system, it is important to understand the potential impact of those changes on the system’s behavior, and therefore to have models or tools to predict that impact. These tools can predict the impact of a frequency change on a server’s performance or energy, or the impact of a cooling change on the data center PUE. We will first present the low-level components and application programming interfaces (APIs) used to measure the power and performance of servers equipped with Xeon processors and NVIDIA accelerators, then some modeling techniques to predict the power and performance of servers, and finally high-level software to manage and control the power and performance of servers in the data center.

6.1. Monitoring power and performance of servers

Measuring and monitoring accurately is a mandatory step before controlling a system’s behavior. We will first discuss the sensors and related APIs to measure power and temperature, and then how to monitor performance.

6.1.1. Sensors and APIs for power and thermal monitoring on servers

Power and thermal measurement is error prone, since the accuracy of measurements can vary a lot depending on the granularity and accuracy of the sensor and API used. By granularity, we mean the sampling rate at which the sensor reads the data, and the reporting rate at which the readings are reported to the user through the API or high-level software. That is why we will describe in detail the accuracy and granularity of power measurements.

6.1.1.1. Power and thermal monitoring on Intel platforms

On Intel platforms, node manager (NM) and running average power limit (RAPL) are the fundamental interfaces for reporting power. Figure 6.1 presents the different components for reporting AC and DC node power on a Lenovo dense ThinkSystem server.

Figure 6.1. Node power management on Lenovo ThinkSystem. For a color version of this figure, see www.iste.co.uk/brochard/energy.zip

PCH is the platform controller hub (i.e. south bridge), ME is the management engine (embedded in PCH) that runs Intel NM firmware and HSC is the hot swap controller that provides power readings.

On Lenovo ThinkSystem servers, XCC (xClarity Controller) runs the node-level system management. XCC is based on a baseboard management controller (BMC) using a dual-core ARM Cortex A9 service processor. It monitors the DC power consumed by the node (as a whole and by the CPU and memory subsystems), monitors the inlet air temperature of the node and caps the DC power consumed by the node as a whole. It also monitors the CPU and memory subsystem throttling caused by node-level throttling, and enables or disables power savings for the node.

FPC is the fan/power controller for chassis-level systems management. It monitors the AC and DC power consumed by individual power supplies and aggregates it to the chassis level. It also monitors the DC power consumed by individual fans and again aggregates it to the chassis level.

As shown in Figure 6.1, the information reported by the node-level system management can be reported in-band, where the measurement is read on the node itself, or out-of-band, where the measurement is requested by another node such as the cluster manager.

Figure 6.2 presents the reporting frequencies at the different levels from sensors to the application for NM and RAPL power management flows.

Figure 6.2. Node manager and RAPL reporting frequencies

For NM, the sensor-level sampling rate is 1 kHz and the node power reported at the application level has a 1 Hz frequency. For RAPL, the sampling of the energy Model Specific Registers (MSRs) for CPU and memory is 500 Hz and the reporting frequency is 200 Hz.

As RAPL reports only CPU and memory power, DC node power is therefore available only at a 1 Hz sampling frequency. This limits DC node power measurements to coarse granularity, such as the job level, and is clearly not fine-grained enough to measure DC node power at the subroutine or loop level. To achieve such fine-grained measurements, other solutions have been implemented on top of NM (Hackenberg et al. 2014; Benini 2018; Libri et al. 2018).

An example of such a solution has been developed by Lenovo for the ThinkSystem SD650, with a new circuit and data flow. Figure 6.3 presents the new data flow and the reporting frequency at each level, leading to a 100 Hz DC node power frequency.

Figure 6.3. New data flow to get higher resolution DC node power on SD650

Figure 6.4 shows the circuits that provide both the usual NM interface and this new high accuracy DC node power measurement through an Intelligent Platform Management Interface (IPMI) raw command (see section 6.1.1.3).

Figure 6.4. New circuit to get higher resolution DC node power on SD650. For a color version of this figure, see www.iste.co.uk/brochard/energy.zip

Figure 6.5 presents the percentage error of the power readings of the different circuits we have presented, as a function of the power load on the node. In the light-yellow rectangle on the left side of Figure 6.5, the server is idle or lightly used, while in the light-yellow rectangle on the right side the server is highly loaded. When the node is idle or lightly used, the error is dominated by the offset error; when the node is highly loaded, the error is dominated by the gain error. The green, blue and orange curves present the error of three different circuits: the green curve represents the error for NM readings on an industry standard solution based on circuits such as the TI LM25066, Analog Devices ADM1278 or Maxim VT505, and the blue curve the error on the Lenovo NeXtScale nx360 m5. The orange curve represents the error on the Lenovo SD650 with the high accuracy circuits presented in Table 4.15. This shows that these new circuits provide not only a higher reporting frequency, but also much better accuracy.

Figure 6.5. Power accuracy of different circuits. For a color version of this figure, see www.iste.co.uk/brochard/energy.zip

RAPL stands for running average power limit. It is the low-level interface used by Intel to collect energy information from the CPU and memory subsystems. NM consolidates these raw energy data and converts them to power by subtracting two energy readings and dividing by the time between the two readings.
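To illustrate this energy-to-power conversion, here is a minimal sketch that derives package power from two successive RAPL energy readings. It assumes Linux with the msr driver loaded (modprobe msr) and root access, and uses the documented register addresses MSR_RAPL_POWER_UNIT (0x606) and MSR_PKG_ENERGY_STATUS (0x611) of recent Intel Xeons:

```c
// Minimal sketch: derive package power from two RAPL energy readings.
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

static uint64_t rdmsr(int fd, uint32_t reg) {
    uint64_t v = 0;
    pread(fd, &v, sizeof(v), reg);  // the msr driver maps the register address to the file offset
    return v;
}

int main(void) {
    int fd = open("/dev/cpu/0/msr", O_RDONLY);  // package 0, requires root
    if (fd < 0) { perror("open msr"); return 1; }

    // Bits 12:8 of MSR_RAPL_POWER_UNIT give the energy unit as 1/2^ESU joules.
    uint64_t units = rdmsr(fd, 0x606);
    double joules_per_count = 1.0 / (1 << ((units >> 8) & 0x1f));

    // Two readings of the 32-bit wrapping energy counter, 100 ms apart.
    uint32_t e1 = (uint32_t)rdmsr(fd, 0x611);
    usleep(100000);
    uint32_t e2 = (uint32_t)rdmsr(fd, 0x611);

    // Unsigned subtraction handles a single counter wraparound correctly.
    double joules = (uint32_t)(e2 - e1) * joules_per_count;
    printf("package power: %.2f W\n", joules / 0.1);
    close(fd);
    return 0;
}
```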

6.1.1.2. Circuits and interface for monitoring power and thermal on NVIDIA GPUs

The basic tool for power and thermal monitoring on NVIDIA graphics processing units (GPUs) is the NVIDIA Management Library (NVML 2019). It is a C-based API for monitoring and managing states of NVIDIA GPU devices, such as GPU utilization rate, running processes, clocks and performance state, temperature and fan speed, power consumption and power management. Its sampling rate is ~500 Hz and the power accuracy for the entire board is ±5 W. The NVIDIA System Management Interface (NVSMI) is a command line utility based on NVML that aids in the management and monitoring of NVIDIA GPU devices (NVSMI n.d.). NVSMI is a high-level command line interface and has a low sampling rate (~1 Hz) (Ferro et al. 2017).

NVML has query commands to query the state of the different GPU devices such as:

  • – ECC error counts: both correctable single bit and detectable double bit errors are reported. Error counts are provided for both the current boot cycle and for the lifetime of the GPU;
  • – GPU utilization: current utilization rates are reported for both the compute resources of the GPU and the memory interface;
  • – active compute process: the list of active processes running on the GPU is reported, along with the corresponding process name/id and allocated GPU memory;
  • – clocks and P-State: max and current clock rates are reported for several important clock domains, as well as the current GPU performance state;
  • – temperature and fan speed: the current core GPU temperature is reported, along with fan speeds for non-passive products;
  • – power management: for supported products, the current board power draw and power limits are reported.

NVML also has device commands, like setting the power management limit (nvmlDeviceSetPowerManagementLimit), also called power cap elsewhere in this book, or setting the memory and graphics clock frequencies (nvmlDeviceSetApplicationsClocks) at which the compute and graphics applications will run; the sketch below illustrates both the query and the set paths.
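As an illustration, a minimal NVML sketch for the first GPU in the system; the 200 W cap is an arbitrary value, and setting limits requires root privileges and a supported product:

```c
#include <stdio.h>
#include <nvml.h>

int main(void) {
    nvmlReturn_t rc = nvmlInit();
    if (rc != NVML_SUCCESS) { fprintf(stderr, "init: %s\n", nvmlErrorString(rc)); return 1; }

    nvmlDevice_t dev;
    if (nvmlDeviceGetHandleByIndex(0, &dev) == NVML_SUCCESS) {  // first GPU
        unsigned int mw, temp;
        if (nvmlDeviceGetPowerUsage(dev, &mw) == NVML_SUCCESS)
            printf("board power: %.1f W\n", mw / 1000.0);   // NVML reports milliwatts
        if (nvmlDeviceGetTemperature(dev, NVML_TEMPERATURE_GPU, &temp) == NVML_SUCCESS)
            printf("core temperature: %u C\n", temp);
        // Cap the board at 200 W (value in milliwatts); requires root and a
        // product that supports power management.
        nvmlDeviceSetPowerManagementLimit(dev, 200000);
    }
    nvmlShutdown();
    return 0;
}
```

The program links against the NVML library (e.g. gcc nvml_power.c -lnvidia-ml).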

6.1.1.3. IPMI

IPMI is the major standard to request and collect information from IT devices (IPMI 2009). IPMI commands are a convenient low-level method to get sensor readings from monitored devices.

Out-of-band commands collect information through an access server or card (such as a BMC), which is connected to a management port of the monitored devices. In-band commands collect information on the device itself. IPMI raw commands are specified in hexadecimal values and are specific to the vendor.

Examples of IPMI commands like “IPMI-DCMI power set limit” to set a power limit in watts or “IPMI-sensors” to get CPU temperature, voltage and fan speed are provided by freeIPMI (2018)1.

The following IPMI raw command reads the energy from the NM on SD650: “ipmitool raw 0x2e 0x81 0x66 0x4a 0x00 0x20 0x01 0x82 0x0 0x08”.

Such a command has a reporting rate of ~1 Hz with NM on Intel platforms. A similar reporting rate is achieved with NVSMI on NVIDIA GPUs.

The following IPMI raw command will retrieve energy from the high sampling rate sensors on SD650 with a reporting rate of ~100 Hz: “ipmitool raw 0x3a 0x32 4 2 0 0 0”.

For higher sampling and reporting rates (300–500 Hz), RAPL for Intel platforms and NVML for NVIDIA GPUs should be used.

6.1.1.4. Power API

IPMI is a low-level interface to get information, but it does not abstract the system and the relations between all its components; that is the goal of Power API.

Power API (Grant et al. 2016) is the result of collaboration among national laboratories, universities and major HPC vendors to provide a range of standardized power management functions, from application-level control and measurement to facility-level accounting, including real-time and historical statistics gathering.

Power API describes a system by a hierarchical representation of objects (cabinet, chassis, node, board, power plane, core). It defines roles that interact with the system. Each object has certain attributes (e.g. power cap, voltage) that can be accessed depending on the role of the requester. Get/set functions enable basic measurement and control of the exposed object attributes.
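To give a flavor of the interface, here is a sketch of the get/set pattern; the function, attribute and return-code names follow the Power API specification, pwr.h is the header used by the reference implementation, and the 300 W cap value is arbitrary:

```c
// Minimal sketch of the Power API get/set pattern: obtain a context for a
// given role, get an object and read/set its attributes.
#include <pwr.h>
#include <stdio.h>

int main(void) {
    PWR_Cntxt ctx;
    PWR_Obj self;
    double power = 0.0, cap = 300.0;  // watts; cap value is illustrative
    PWR_Time ts;

    // Application role: measurement and control scoped to this application.
    if (PWR_CntxtInit(PWR_CNTXT_DEFAULT, PWR_ROLE_APP, "monitor", &ctx) != PWR_RET_SUCCESS)
        return 1;
    PWR_CntxtGetEntryPoint(ctx, &self);

    // Read the current power of the entry-point object.
    if (PWR_ObjAttrGetValue(self, PWR_ATTR_POWER, &power, &ts) == PWR_RET_SUCCESS)
        printf("power: %.1f W\n", power);

    // Set a power cap, if the requester's role and the object allow it.
    PWR_ObjAttrSetValue(self, PWR_ATTR_POWER_LIMIT_MAX, &cap);

    PWR_CntxtDestroy(ctx);
    return 0;
}
```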

Implementations are already available for Intel and AMD CPUs, with support from hardware vendors such as Cray, HPE and IBM, and from software vendors such as Adaptive Computing.

6.1.1.5. Redfish

Redfish is a standard led by the Distributed Management Task Force (DMTF n.d.). The goal of Redfish is to replace and extend IPMI to deliver simple and secure management for converged, hybrid IT and the Software Defined Data Center, both human readable and machine capable. The main advantage over IPMI is its RESTful API. Redfish v1.0 is supported by all the major IT vendors.

It allows the user to retrieve basic server information and sensor data (like temperatures, fans, power supply) and facilitates remote management tasks such as reset, power cycle and remote console. Until now, the Redfish specification has covered only IT equipment. However, work is in progress to extend its scope to cover power with the Data Center Equipment Schema Bundle2.
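As an illustration of the RESTful approach, a minimal libcurl sketch that reads the standard Redfish Power resource of a BMC; the host name, chassis id ("1") and credentials are placeholders that vary by vendor and installation:

```c
// Minimal sketch: GET the Redfish power resource of a BMC over HTTPS.
#include <curl/curl.h>

int main(void) {
    curl_global_init(CURL_GLOBAL_DEFAULT);
    CURL *h = curl_easy_init();
    if (!h) return 1;
    // Standard Redfish power resource path; the chassis id varies by vendor.
    curl_easy_setopt(h, CURLOPT_URL, "https://bmc.example.com/redfish/v1/Chassis/1/Power");
    curl_easy_setopt(h, CURLOPT_USERPWD, "admin:password");   // placeholder credentials
    curl_easy_setopt(h, CURLOPT_SSL_VERIFYPEER, 0L);          // lab setting; verify certificates in production
    CURLcode rc = curl_easy_perform(h);  // the JSON body is written to stdout by default
    curl_easy_cleanup(h);
    curl_global_cleanup();
    return rc == CURLE_OK ? 0 : 1;
}
```

The returned JSON contains, among other things, the PowerControl and PowerSupplies collections with instantaneous power readings and limits.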

6.1.2. Monitoring performance on servers

6.1.2.1. Performance monitoring on Intel platforms

Hardware performance counters are specific registers built in the processor to store the count of hardware events like number of instructions executed, number of floating point/integer instructions executed, number of cycles, number of L1/L2 cache misses and so on. As the number of hardware counters is limited, prior to using the counters, each counter has to be programmed with the index of the event type to be monitored (Intel 2017).

The Performance Application Programming Interface (PAPI, see Terpstra et al. 2010) is an open source machine independent set of callable routines that provides access to the hardware performance counters on most modern processors like x86, ARM, Power and GPUs. It also provides access to RAPL functions for Intel x86.

In Chapter 5, we used cycles per instruction (CPI) [5.1] and gigabytes of memory read or written per second (GBS) [5.2] to characterize application performance. CPI was computed with the PAPI core events as:

[6.1] $CPI = \dfrac{\mathrm{PAPI\_TOT\_CYC}}{\mathrm{PAPI\_TOT\_INS}}$

GBS was computed with the uncore events (Intel 2012) of the Integrated Memory Controller (IMC) as:

[6.2] $GBS = \dfrac{64 \times (\mathrm{CAS\_COUNT.RD} + \mathrm{CAS\_COUNT.WR})}{T \times 10^9}$

where the CAS counts are summed over all IMC channels, 64 is the cache line size in bytes and $T$ is the elapsed time in seconds.
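As a sketch of how [6.1] is obtained in practice with the PAPI preset events (the dummy loop merely stands in for the region being characterized):

```c
// Minimal sketch: computing CPI for a region of code with PAPI preset events.
#include <papi.h>
#include <stdio.h>

int main(void) {
    int evset = PAPI_NULL;
    long long counts[2];

    if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT) return 1;
    PAPI_create_eventset(&evset);
    PAPI_add_event(evset, PAPI_TOT_INS);   // instructions retired
    PAPI_add_event(evset, PAPI_TOT_CYC);   // total cycles
    PAPI_start(evset);

    volatile double x = 0.0;               // region of interest
    for (long i = 0; i < 100000000L; i++) x += 1e-9;

    PAPI_stop(evset, counts);
    printf("CPI = %.3f\n", (double)counts[1] / (double)counts[0]);
    return 0;
}
```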

On Intel platforms, the Intel Performance Tuning Utility (ptu or ptumon) is an easy way to collect the performance, temperature and power metrics reported in Chapter 5, although it is no longer officially supported by Intel.

High-level tools are also available to visualize and help the developer analyze the performance of a workload running on a system. Some are proprietary, like Intel Vtune3 and Intel Advisor4, which run on Intel processors, and ARM Forge and ARM Performance Reports5, which run on Intel, AMD, ARM and OpenPower processors and NVIDIA GPU accelerators. There are also many open source tools such as gprof, TAU, Vampir, Paraver, Scalasca, Periscope and so on.

6.1.2.2. Performance monitoring on NVIDIA platforms

NVIDIA GPUs also have hardware performance counters to monitor the activity of the multiple functional units (NVIDIA 2015).

The NVIDIA CUDA Profiling Tools Interface (CUPTI) provides performance analysis tools with detailed information about GPU usage in a system. CUPTI is used by performance analysis tools such as nvprof, NVIDIA Visual Profiler, NVIDIA NSight, TAU and Vampir Trace.

PAPI CUDA is a PAPI version available on NVIDIA CUDA platform, which provides access to the hardware counters inside the GPU. PAPI CUDA is based on CUPTI and provides detailed performance counter information regarding the execution of GPU kernels.

NVIDIA Nsight Systems is a system-wide performance analysis tool designed to visualize an application’s algorithms. It helps to identify the largest opportunities to optimize, and tune to scale efficiently across any number of CPUs and GPUs.

Open source high-level tools such as TAU, Vampir and Scalasca are also available on NVIDIA GPUs.

6.2. Modeling power and performance of servers

Through the measurements we presented in Chapter 5, we saw that power consumption depends on the workload and the system on which it is executing. In this section, we present a few models to quantify this relation.

6.2.1. Cycle-accurate performance models

Cycle-accurate simulators have been used for years by microprocessor designers to help them in their work (Hu et al. 2003). These tools are supposed to be the most accurate, since they simulate the execution of every instruction at each cycle based on a register-transfer level (RTL) description of the circuit to be simulated. RTL abstraction is used in hardware description languages to create high-level representations of a circuit, from which lower level representations and ultimately actual wiring can be derived.

Unfortunately, these tools take a huge amount of time to simulate a small piece of code, which is why they cannot be used to make near real-time decisions to control the power or performance of a code while it is executing.

6.2.2. Descriptive models

Descriptive models are used when the events to simulate can be described quantitatively in an analytical way. A major drawback is that the descriptive approach does not work for complex system problems, because the system is too complex to model completely or accurately. That is why these models are more educational, or useful as a first-level analysis, than production oriented. A descriptive model is clearly at the opposite end of the spectrum from a cycle-accurate simulator: a macroscopic model versus a microscopic one.

6.2.2.1. Application-specific models

Predicting the performance of a given algorithm on a specific system has been done for many years by splitting the application execution time into computation, communication and I/O time. Each of these is then described by a simple equation related to its algorithmic complexity (number of operations to be performed) and some hardware characteristics of the system (time to compute a DP operation, to send a DP or SP word, or to read/write one byte or one record to storage, etc.). Many examples of such techniques for specific numerical algorithms exist, from early examples (Keyes and Gropp 1986; Brochard 1989) up to more recent ones (Bonfa et al. 2018).

For example, the elapsed time is split as:

$T_{total} = T_{comp} + T_{comm} + T_{io}$

where $T_{total}$ is the total elapsed time, $T_{comp}$ is the computation time, $T_{comm}$ the communication time and $T_{io}$ the I/O time, with no overlap assumed between the different tasks.

For a matrix multiplication of rank n, the computation time in ms would be:

$T_{comp} = \dfrac{2n^3}{GFlops \times 10^6}$

where GFlops is evaluated as in Table 5.2 for the Intel Xeon 6148, taking into account the type of instructions executed. For a matrix multiplication using a tuned BLAS library (NVIDIA Developer n.d.; Gennady and Shaojuan 2018), GFlops should be that of AVX-512 FMA, while if the code is written by hand with no compiler optimization, GFlops should be taken as that of SSE2 ADD or MULT. This simple example shows clearly how difficult it is to select the right parameters, since they depend highly on the type of instructions executed.
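As a numerical illustration (the sustained rates here are assumed for the sake of the example, not taken from Table 5.2): for n = 8192 and a tuned BLAS sustaining 2,000 GFlops, $T_{comp} = 2 \times 8192^3/(2000 \times 10^6) \approx 550$ ms, while the same multiplication at 100 GFlops of unoptimized scalar code would take 20 times longer, about 11 s.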

Similar work has been done regarding power consumption by splitting the total power of the device across its different components (CPU, memory, storage, etc.) and eventually splitting each component power into static and dynamic power. But the coefficients cannot be expressed by simple arithmetic formulas. This drawback is corrected by the predictive methods as described in section 6.2.3.

6.2.2.2. Application signature/surrogate-based performance models

Other performance prediction methods have introduced some level of abstraction to characterize the applications (application signature) and the system hardware (system signature) at a higher level than using the raw application and system characteristics. Examples of such an approach are Todi et al. (2007) and Snavely et al. (2013). Another approach is to decompose an application behavior as a linear equation of surrogates, which are small enough to have been projected by cycle-accurate simulators and which represent the workloads to be projected (Sameh et al. 2012).

6.2.3. Predictive models

A predictive model refers to a mathematical model that can accurately predict future outcomes based on historical data using statistical or neural network (NN) methods.

6.2.3.1. Statistical power models

Several authors have worked to build linear power models whose coefficients are calculated by best fit (or linear multiregression) statistical methods. With such models, accuracy is limited (typical errors between 5% and 10%), but given their simplicity they can be used to make real-time decisions.

An example of such a model is presented below; Economou et al. (2006) is another example.

As in every statistical method, the events and data used are critical. Looking at the measurements presented in Chapter 5, we saw that CPU power varies with CPI, core frequency and processor temperature, while DIMM power varies with GBS.

The impact of CPI (CPI = 1/IPC) on CPU power and of GBS on DIMM power is visible when we compare the CPU power of the various workloads with Turbo OFF. HPL in Figures 5.3–5.6 has a very low CPI (0.45), the highest possible CPU power of 150 W, and a DIMM power of 32 W with a GBS of 70 for the plateau PL1 phase. For STREAM in Figures 5.11–5.16, we note an average CPU power of 125 W with a CPI of ~10 and a DIMM power of 43 W with a GBS of 140. We see the same behavior on BQCD, where BQCD128 has an average CPU power of 118 W with a CPI of 1.10 and BQCD1K has an average CPU power of 121 W with a CPI of 0.70. Similarly, BQCD128 has a DIMM power of 33 W and BQCD1K of 21 W, with respective average GBS values of 74 and 27.

This correlation of CPU power with 1/CPI and of DIMM power with GBS leads to a simple CPI/GBS model, first introduced by Brochard et al. (2010), to predict the power and performance of a workload at any possible frequency fn on a given system, when this workload has already run once at nominal frequency f0 on the same system with no temperature variation.

Such a simple model enables real time decisions on which frequency to set while the workload is running to minimize power or energy (see section 6.3).

In this model, DC power at frequency fn is given by:

[6.3] $P(f_n) = A(f_n) \times GIPS + B(f_n) \times GBS + C(f_n)$

where GIPS is the number of giga instructions per second and GBS is the number of giga bytes of memory read or written per second with:

[6.4] $GIPS = \dfrac{f_n}{CPI(f_n)}$, with $f_n$ expressed in GHz

The elapsed time of a code at frequency fn is given by:

[6.5] $T(f_n) = \dfrac{N_{inst} \times CPI(f_n)}{f_n}$

where $N_{inst}$ is the total number of instructions executed.

As the number of instructions in a code is independent of the frequency at which it is executed, we can write [6.5] at frequency f0 and from it derive:

[6.6] $T(f_n) = T(f_0) \times \dfrac{CPI(f_n)}{CPI(f_0)} \times \dfrac{f_0}{f_n}$

where the CPI at frequency fn is given by:

[6.7] $CPI(f_n) = D(f_n) \times CPI(f_0) + E(f_n) \times GBS(f_0) + F(f_n)$

The A(fn), B(fn), C(fn), D(fn), E(fn) and F(fn) are the hardware coefficients of the server, computed through least squares fitting of [6.3] and [6.5] by measuring the power and elapsed time of a suite of small kernels run at all possible frequencies from f0 to fn (Brochard et al. 2010).

Therefore, with [6.3], [6.6] and [6.7], the elapsed time and power of a code can be computed at frequency fn given its power, time, CPI and GBS at frequency f0 and the hardware coefficients that have been computed and stored once and for all unless the hardware configuration changes.
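To make the mechanics concrete, here is a minimal sketch of how [6.3], [6.6] and [6.7] combine to predict time and power at a target frequency. The coefficient values are placeholders rather than real calibration data, the linear form of predict_cpi follows the reconstruction of [6.7] above, and GBS is assumed unchanged across frequencies for simplicity:

```c
// Minimal sketch of the CPI/GBS model: predict time and power at frequency fn
// from measurements at nominal frequency f0, using per-frequency hardware
// coefficients obtained in the least-squares learning phase.
#include <stdio.h>

typedef struct { double A, B, C, D, E, F; } HwCoeffs;  // one set per target frequency

// Predicted CPI at fn from CPI and GBS measured at f0 (assumed form of [6.7]).
static double predict_cpi(const HwCoeffs *h, double cpi0, double gbs0) {
    return h->D * cpi0 + h->E * gbs0 + h->F;
}

// Predicted elapsed time at fn (equation [6.6]).
static double predict_time(double t0, double cpi0, double cpin, double f0, double fn) {
    return t0 * (cpin / cpi0) * (f0 / fn);
}

// Predicted DC power at fn (equation [6.3]) with GIPS = fn / CPI ([6.4]).
static double predict_power(const HwCoeffs *h, double fn, double cpin, double gbsn) {
    return h->A * (fn / cpin) + h->B * gbsn + h->C;
}

int main(void) {
    double f0 = 2.4, fn = 2.0;                    // GHz
    double cpi0 = 0.9, gbs0 = 70.0, t0 = 100.0;   // measured at f0 (GB/s, s)
    HwCoeffs h = { 40.0, 0.5, 60.0, 1.05, 0.001, 0.02 };  // placeholder values

    double cpin = predict_cpi(&h, cpi0, gbs0);
    printf("T(fn) = %.1f s, P(fn) = %.1f W\n",
           predict_time(t0, cpi0, cpin, f0, fn),
           predict_power(&h, fn, cpin, gbs0));    // GBS assumed unchanged
    return 0;
}
```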

This model was later modified to better predict the DC power since [6.3] has no information on the actual power measured at f0 and [6.3] was replaced by

[6.8]

where TPI is the number of memory transactions per instruction.

This latter model, based on [5.12] and [4.12], was used by IBM LoadLeveler Energy Aware Scheduling (Brochard et al. 2011; IBM Knowledge Center n.d.), which will be described in section 6.3.1. The LRZ data center, which we will present in section 7.2, used the LoadLeveler Energy Aware Scheduler (EAS) from 2012 until 2018 and measured the accuracy of this model on a large selection of real workloads, demonstrating an average error of less than 5% (Auweter et al. 2014).

An extension of this model where power and elapsed time can be predicted from any frequency fi instead of only f0 was introduced and used by Energy Aware Runtime (BSC n.d.).

These models have been developed such that the predicted performance and power can be computed in near real time while the application is executing, so that decisions can be made to change its performance, power and energy either at the job level or at the iteration level (see section 6.3).

6.2.3.2. NN-based models for power prediction

Neural network (NN) methods are a new class of predictive methods that rely not on classic statistical methods, but on NNs trained by ingesting real data gathered from the data center sensors to predict some behavior.

Figure 6.6 presents an example of an NN with xi inputs, hidden layers and one output P.

Figure 6.6. Neural network example

Puzovic et al. (2018) use an NN model to predict the power consumption of workloads on three different servers, equipped with Intel Xeon E5 v4, IBM Power8 (System S822LC) and Cavium ThunderX ARMv8 processors, and compare it to the linear regression model based on CPI and GBS described above. As in every statistical method, the selection of the data is critical, and so, as we will see, is their raw number. In this study, they use the following hardware performance counters as inputs of the NN:

  1) total number of instructions (INST) – the number of instructions retired;
  2) cycles (CYC) – the number of cycles during the execution time. Note that INST/CYC = IPC, which we saw has a major impact on processor power;
  3) dispatched/fetched instructions (IFETCH) – the previous metric (IPC) only accounts for instructions that have been retired, but it does not take into account instructions that have been speculatively executed. These instructions still consume power;
  4) stalls (STALL) – due to multiple issue and out-of-order execution, contemporary processors stall on dependencies such as data and resource conflicts. These conflicts draw power and are not accounted for by any of the previous counters;
  5) branch hit ratio (BR) – in order to find the contribution to power consumed by speculatively executed instructions due to branch misprediction, the percentage of correctly predicted branches during the application execution is measured;
  6) floating point instructions (FLOPS) – for HPC applications, the largest contributors toward power consumption are the instructions that utilize the floating-point unit, as they represent the majority of executed instructions;
  7) L1 cache hits – because the previous counters only measure power that is consumed within the processor, the number of hits in the local cache (L1) is measured;
  8) last level cache misses – the number of misses in the shared last level cache (LLCM) is also measured to account for the power drawn by memory.

The above hardware performance counters were extracted from 43 benchmarks to train the model, which was built with the MATLAB Neural Network Toolbox.

The accuracy of the NN model increases with the number of hardware performance counters, with the number of different benchmarks in the training set, as well as with running multiple copies of the same benchmark. The prediction error of the NN model is also shown to be less than 3% across the three servers, better than the statistical linear regression model based on CPI and GBS.
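For illustration, a minimal inference sketch for such a counter-to-power model; the topology (one hidden layer with tanh activations) and all weights are placeholders, not the trained MATLAB model from the study:

```c
// Minimal sketch: inference through a small feedforward NN mapping the eight
// hardware counter inputs listed above to a power prediction in watts.
// Compile with -lm for tanh().
#include <math.h>
#include <stdio.h>

#define NIN 8   /* INST, CYC, IFETCH, STALL, BR, FLOPS, L1 hits, LLC misses */
#define NHID 4

static const double W1[NHID][NIN] = {{0.1}};  /* input-to-hidden weights (placeholders) */
static const double b1[NHID] = {0.0};         /* hidden biases */
static const double W2[NHID] = {0.2};         /* hidden-to-output weights */
static const double b2 = 50.0;                /* output bias */

static double predict_power(const double x[NIN]) {
    double p = b2;
    for (int j = 0; j < NHID; j++) {
        double a = b1[j];
        for (int i = 0; i < NIN; i++)
            a += W1[j][i] * x[i];
        p += W2[j] * tanh(a);  /* hidden layer activation */
    }
    return p;
}

int main(void) {
    /* Normalized counter values for one sample (placeholder data). */
    double x[NIN] = {0.8, 0.9, 0.7, 0.2, 0.95, 0.6, 0.5, 0.1};
    printf("predicted power: %.1f W\n", predict_power(x));
    return 0;
}
```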

Another example of an NN model (Gao 2014) predicts the PUE of a data center. This NN utilizes five hidden layers and 50 nodes per hidden layer. The training dataset contains 19 normalized input variables (listed below) and one normalized output variable, the data center PUE:

  1) total server IT load [kW];
  2) total Campus Core Network Room (CCNR) IT load [kW];
  3) total number of process water pumps (PWP) running;
  4) mean PWP variable frequency drive (VFD) speed [%];
  5) total number of condenser water pumps (CWP) running;
  6) mean CWP variable frequency drive (VFD) speed [%];
  7) total number of cooling towers running;
  8) mean cooling tower leaving water temperature (LWT) set point [F];
  9) total number of chillers running;
  10) total number of dry coolers running;
  11) total number of chilled water injection pumps running;
  12) mean chilled water injection pump set point temperature [F];
  13) mean heat exchanger approach temperature [F];
  14) outside air wet bulb (WB) temperature [F];
  15) outside air dry bulb (DB) temperature [F];
  16) outside air enthalpy [kJ/kg];
  17) outside air relative humidity (RH) [%];
  18) outdoor wind speed [mph];
  19) outdoor wind direction [deg].

Between predicted and actual PUE measured on a major data center, the model achieved a mean absolute error of 0.004 and standard deviation of 0.005 on the test dataset or 0.4% error for a PUE of 1.1, which is quite amazing and better than what was achieved by descriptive models. Note that the model error generally increases for PUE values greater than 1.14 due to the scarcity of training data corresponding to those values. The model accuracy for those PUE ranges is expected to increase over time as Google collects additional data on its DC operations. After this calibration phase, the model was used for identifying optimization opportunities.

6.3. Software to optimize power and energy of servers

One way to manage and optimize the power and energy of servers is at the job scheduler level, when jobs are submitted. A more dynamic approach is to do it at run time, when workloads are executing, making changes on the fly. The system power is then obtained by aggregating bottom-up the information of all jobs running on the system. Another approach is to start from the system power or energy budget and split this total budget across the nodes in a top-down approach. This section gives examples of these different approaches.

6.3.1. LoadLeveler job scheduler with energy aware feature

The EAS feature of the LoadLeveler job scheduler and resource manager (IBM Knowledge Center n.d.) was introduced in 2012.

The goal of LL EAS is to determine and apply, at the job level, the optimal frequency for all cores and nodes running the job to match the selected energy policy. This frequency is set through the appropriate P-state using dynamic voltage and frequency scaling (DVFS).

This frequency is determined and applied when the job is submitted, through the use of an Energy Tag added to the LL submit command. If this Energy Tag is new, meaning the job is submitted for the first time, LL EAS collects all the metrics (CPI, TPI and GBS) described in section 6.2.3.1 and computes the power and elapsed time of the job as if it were run at every possible frequency, based on the hardware coefficients computed in the learning phase described above. When the job is submitted again with the same Energy Tag and a selected energy policy, EAS determines the optimal frequency to match the policy. The energy policies proposed are minimize time to solution and minimize energy to solution.

With minimize time to solution, the goal is to accelerate the jobs that make efficient use of a higher frequency. For this policy, a default frequency fd, lower than the nominal frequency f0, has to be defined by the system administrator, such that all jobs run by default at fd unless their predicted performance variation is higher than a threshold, also defined by the system administrator. With such a policy, memory-bound jobs will run at the default frequency and their performance will not be hurt, while only applications whose performance benefits from a higher frequency will run at a higher frequency. It is this minimize time to solution policy that was used in production on SuperMUC; its efficiency is reported in Auweter et al. (2014).

Minimize energy to solution is the opposite of minimize time to solution. With minimize energy to solution, jobs by default run at a nominal frequency and their frequency is decreased from f0 to fn if it would reduce the energy of the job without hurting its performance more than a given threshold, defined by the system administrator.

On top of optimizing the power of active nodes, LoadLeveler also minimized the power of idle nodes by putting nodes into the S3 state (Figure 4.11) when they were idle and had no work waiting to be executed in the batch queue.

LL EAS was first released at LRZ in 2012 with the SuperMUC Phase 1 system (see section 7.2).

6.3.2. Energy Aware Runtime (EAR)

EAR (2018) is an energy management runtime framework with different components. We describe here the EAR library and the EAR Global Manager (EARGM). The EAR library uses power and performance models and energy policies similar to those of LL EAS, which we described above. However, the two differ widely, since EAR is open source and dynamic. To achieve dynamicity and transparency, EAR automatically detects at run time the iterative structures of a code and controls its frequency at the outer loop level. With such an approach, no code modification or energy tag is required, and the frequency applied can change during the execution of the code if the performance and power profiles (called the application signature) have changed over the iteration space. The EAR library targets applications written with the Message Passing Interface (MPI) or hybrid MPI + OpenMP applications. To detect the iterative structures of a code without modifying it, EAR relies on the Profiling MPI (PMPI) interface plus the LD_PRELOAD mechanism to be automatically loaded with MPI jobs. Figure 6.7 shows the EAR library software stack.

Figure 6.7. EAR software stack

Once MPI calls are intercepted, EAR passes the sequence of MPI calls, together with their arguments, to Dynamic Application Iterative Structure detection (DynAIS) to identify repetitive regions (outer loops) in parallel applications. DynAIS is an innovative multilevel algorithm with very low overhead. It receives as input a sequence of events and reports different values that represent the role of each event in the outer repetitive sequence detected. By using DynAIS, EAR can detect MPI calls corresponding to the beginning of a new loop or to a new iteration of an already detected loop.
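A minimal sketch of the interception mechanism itself (the detector below is a stub standing in for DynAIS, whose interface is not reproduced here; MPI-3 prototypes are assumed):

```c
// Minimal sketch: intercepting MPI calls through the PMPI profiling interface,
// the mechanism EAR relies on (together with LD_PRELOAD) to observe an
// application's MPI call sequence without modifying it.
#include <mpi.h>
#include <stdio.h>

// Stub for the structure detector: records each event id it is fed.
static void detect(int event_id) {
    // A real implementation would feed DynAIS to find repetitive regions.
    fprintf(stderr, "event %d\n", event_id);
}

// Override MPI_Send: note the event, then forward to the real implementation.
int MPI_Send(const void *buf, int count, MPI_Datatype dt, int dest, int tag,
             MPI_Comm comm) {
    detect(1);  // arbitrary id chosen for MPI_Send in this sketch
    return PMPI_Send(buf, count, dt, dest, tag, comm);
}
```

Compiled into a shared library with mpicc -shared -fPIC and activated through LD_PRELOAD, such a wrapper observes every MPI_Send of an unmodified application while PMPI_Send performs the actual communication.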

Since EAR is aware of application loop iterations, it is able to self-evaluate its policy decisions, one of the key differences between EAR and other solutions. Thanks to DynAIS, EAR computes the Application Signature at runtime for one loop iteration. The Application Signature is a very reduced set of metrics that characterizes application behavior. The Application Signature, together with the hardware characterization (called the System Signature), is the input for the power and performance models used by EAR.

Given the iterative behavior of many scientific applications, the Application Signature of one iteration in the main loop is representative of the application behavior, allowing EAR to do a frequency selection at runtime, avoiding the necessity of using historic information guided by user hints. EAR proposes a totally distributed frequency selection design avoiding interferences and additional noise in the network or the file system.

The Application Signature is a set of metrics that uniquely identifies and characterizes the application for a given architecture and one particular execution (that could differ depending on many factors such as input data and node list). The Application Signature includes CPI, TPI, time and average DC node power.

The System Signature is a set of coefficients that characterize the hardware with performance and power consumption of each node in the system. It is computed during a learning phase at EAR installation time or every time a configuration change is happening to the nodes. The EAR learning phase is similar to the LL EAS learning phase except that the hardware coefficients of each node are kept and used in a distributed way as explained above for the frequency selection.

Regarding the energy policies, EAR uses criteria different from those of EAS. For minimize time to solution, EAR uses a performance gain threshold, called the EAR threshold, such that a new frequency is set from $f_i$ to $f_{i+1}$ if:

[6.9] $PerfGain \geq threshold \times FreqGain$

with:

[6.10] $PerfGain = \dfrac{T(f_i) - T(f_{i+1})}{T(f_i)}$

[6.11] $FreqGain = \dfrac{f_{i+1} - f_i}{f_i}$
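For instance, assuming an illustrative threshold of 0.75 (not EAR’s default): moving from $f_i$ = 2.0 GHz to $f_{i+1}$ = 2.2 GHz gives FreqGain = 0.10, so the move is accepted only if the predicted iteration time drops by at least 7.5%; a memory-bound loop predicted to go from 100 ms to 96 ms (PerfGain = 0.04) stays at 2.0 GHz, while a compute-bound loop predicted at 91 ms (PerfGain = 0.09) is switched up.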

As well as controlling the frequency at the outer loop level in an MPI application, EAR uses EARGM to control the total power consumption and energy of the system: if a power or energy capping value has been defined and is exceeded, EARGM asks the EAR library to be more restrictive in the energy policies. For example, with minimize time to solution, it will decrease the default and/or maximum frequency by 1 bin or increase the performance gain threshold.

EAR’s overhead and the accuracy of its predictions and decisions have been analyzed with nine application tests run on a cluster of up to 1,040 cores (Corbalan and Brochard 2019). EAR is installed on SuperMUC-NG at LRZ (see section 7.2) to optimize its energy.

6.3.3. Other run time systems to manage power

Many other tools have been developed for power management using DVFS.

The main goal of Adagio (Rountree et al. 2009) is power balancing across nodes in order to stay within a given power cap. Adagio is a runtime library targeted at saving as much energy as possible while minimizing the performance penalty. It makes its frequency decisions at runtime and does not need user intervention. Conductor (Marathe et al. 2015) is a run-time system that intelligently distributes available power to nodes and cores to improve performance. It is also based on detecting critical paths and spending more power on those parts; however, Conductor requires the user to mark the end of the iterative time step. Conductor targets MPI and OpenMP applications and can be seen as an extension of Adagio, since it exploits similar concepts in a different context. Conductor exploits the thread level by reconfiguring the concurrency level and later redistributing the per-job power to speed up the critical path. After this power reallocation, Conductor checks its impact by doing global synchronization for a few time steps. Although global synchronization can have a major overhead on system performance, applications are expected to run many time steps, so this overhead is negligible.

GEOPM (Eastep et al. 2017) is an open source framework for power management. GEOPM implements a hierarchical design to support Exascale systems. It is an extensible framework where new policies can be added at the node level or at the application (MPI) level. Some of the policies require application modifications, but others do not. Energy control is not included as part of the GEOPM framework itself; rather, GEOPM offers an API to be used by the resource manager.

DVFS has been also used in other works to reduce the power consumed by applications. In the context of MPI applications, DVFS has been used inside the MPI library to reduce the power consumed during communication periods (Lim et al. 2006; Etinski et al. 2009; Venkatesh et al. 2015). The goal of these proposals is to reduce the power consumed inside the MPI library without introducing a significant performance degradation in the application execution.

Power capping has also been targeted at the scheduler and resource manager level, for instance in SLURM (2018) and PBS (PBS Works n.d.), to control the total power consumption. Although effective, such power capping implementations apply the same policy to all applications, without taking into account their performance or energy characteristics.

6.4. Monitoring, controlling and optimizing the data center

6.4.1. Monitoring the data center

Data Center Infrastructure Management (DCIM) is a wide topic with much software readily available. Meyer et al. (2016) gives a good overview of the different DCIM solutions used in high-performance computing data centers in Europe.

The following major functions of a DCIM are extracted from this report:

  • – Asset management

Most popular DCIM solutions have the ability to catalogue IT assets such as network equipment and servers, which are installed in rack cabinets in the data center. Also, there are some packages that map all connections between devices and from where a particular device is supplied with electricity. In addition, a DCIM solution provides tools to generate QR codes or barcode labels. Labels can then be scanned by an administrator using a smartphone or tablet.

  • – Capacity management

Most DCIM software available on the market is able to inform the user whether there is enough space, available network ports or electric sockets in rack cabinets for new equipment. Furthermore, some DCIM software can detect that equipment is too heavy for a particular rack cabinet, or that the power supply or the cooling system has reached its limits and new equipment cannot be added.

  • – Real-time monitoring and alerting

Complex DCIM applications gather data and can inform the administrators if equipment parameters exceed thresholds so that they can react to the situation immediately.

  • – Trends and reporting

DCIM systems that monitor equipment parameters in real time are often able to save that data to visualize changes of parameters in the form of graphs or reports.

  • – Data center floor visualization

In most cases, DCIM systems provide 2D or 3D visualization of server rooms. Some solutions are even able to calculate heat dissipation and display it as a thermal-vision view of the server room.

  • – Simulations

Most complex DCIM solutions have functions to calculate “what if” scenarios, such as a power failure, a cooling system malfunction or the deployment of new equipment, to better understand consequences of such events and take appropriate actions.

  • – Workflow

Some DCIM software implements ticket systems to help automate and document change requests for assets in a data center. Thus, only qualified personnel are responsible for performing changes to IT assets.

  • – Remote control

Some DCIM solutions allow storing login credentials in an encrypted database to help administrators cope with the huge number of IT assets that require remote control.

  • – Application programming interface

Some DCIM applications are able to communicate with third party software by providing an API.

  • – Mobile and web-based interface

More and more often, developers add the possibility to manage a data center through mobile or web-based applications, for ease of use from anywhere.

6.4.2. Integration of the data center infrastructure with the IT devices

Dynamic optimization of the data center operations based on the data gathered by the DCIM and the data gathered from the IT devices (power and energy consumption of servers, storage and network switches) is the challenge ahead.

Google’s PUE Neural Network model (section 6.2.3.2) is a good example of a tool to optimize the data center energy efficiency based on data collected by the infrastructure and building sensors. However, it still lacks the integration of the data collected from the IT devices (servers, storage and network switches).

LRZ started such an attempt with a toolset called Power Data Aggregation Monitor (PowerDAM) (Shoukourian et al. 2013), which collects and evaluates data from all aspects of the HPC data center (e.g. environmental information, site infrastructure, information technology systems, resource management systems and applications). The aim of PowerDAM was not to improve the HPC data center’s energy efficiency, but to collect energy relevant data in a single place for analysis, typically by cloning them from already existing databases. Without this, energy efficiency improvements are non-trivial as data have typically to be pulled from multiple databases with different access restrictions, protocols, formats and sampling intervals. As such, PowerDAM was a first step toward a truly unified energy efficiency evaluation toolset needed for improving the overall energy efficiency of HPC data centers.

Recently, LRZ started development of a fully integrated and scalable monitoring solution called Data Center Data Base (DCDB n.d.). The goal of DCDB is to collect all sensor data relevant for HPC operations in a single database from the beginning, instead of cloning it from other databases. For this purpose, DCDB provides a rich set of data collection plugins to talk to a multitude of devices via their native protocols (such as IPMI and SNMP for IT devices, BACnet and Modbus for building infrastructure, or JSON/XML for Internet of Things (IoT) devices like smart meters, sensors and thermostats). Further stressing the comprehensiveness of DCDB’s approach, it can also collect information relevant for application performance, such as CPU performance counters, parallel file system performance metrics, and interconnect transfer rates and error counters. The main philosophy of DCDB is to collect as much data as possible at the highest sampling rate possible, without aggregating data during collection. To facilitate such extensive data collection, DCDB employs Apache Cassandra as a distributed non-relational (NoSQL) database backend, which provides the necessary scalability to ingest hundreds of thousands of sensors with sub-second sampling periods. The collected data can then easily be analyzed in a Grafana-based dashboard accessible from any web browser.

  1. Available at: https://www.gnu.org/software/freeipmi/.
  2. Available at: https://www.dmtf.org/dsp/DSP-IS0005 [Accessed April 30, 2019].
  3. Available at: https://software.intel.com/en-us/vtune [Accessed April 29, 2019].
  4. Available at: https://software.intel.com/en-us/advisor [Accessed April 29, 2019].
  5. ARM Cross-platform tools. Available at: https://developer.arm.com/tools-and-software/server-and-hpc/cross-platform-tools [Accessed April 29, 2019].