5
Power and Performance of Workloads

5.1. Power and performance of workloads

5.1.1. SKU power and performance variations

The power consumption in a server depends very much on its configuration and the workloads it is executing. As our goal is also to compare the impact of different cooling technologies on power, thermal characteristics and performance, we want to reduce to a minimum the impact of other “variables” such as the power, thermal and performance variations across the same processor chips.

The performance/frequency variations across processors of the same SKU have been measured by Aucun et al. (2016). They were also measured on the SuperMUC phase 2 system (see section 7.2.3) on nodes with the same configuration based on the Intel Xeon Haswell 2697v3, the same “batch” of boards and the same cooling/thermal conditions, since the system uses water-cooled nodes. Figure 5.1 shows the distribution of the node AVX2 frequency (x-axis) measured when running single-node HPL on each node, with the number of nodes on the y-axis.

As we can see, while the AVX2 base frequency of the 2697v3 processor is 2.2 GHz, the distribution of AVX2 measured frequency is a “bell” curve varying between 2.2 GHz and 2.4 GHz, which shows all processors are above Intel’s 2.2GHz minimum specification but some are 8% faster.


Figure 5.1. AVX2 frequency across different 2697v3 processor SKU

Therefore, we were careful to use the same system with exactly the same board and the same processors, dual in-line memory modules (DIMMs) and hard disk drive (HDD). In order to do so, we took a Lenovo ThinkSystem SD650, which is a water-cooled server, removed the water-cooling equipment and put the server on an air bench where we set a static volumetric flow rate. The node therefore did not handle any of its own cooling and actually had no system fans present during this test. This prototype server will be referred to in the following as “SD650 air-cooled”. Then we took the board, the processors, the DIMMs and the HDD and put them in a water-cooled SD650 server, which we will call “SD650 water-cooled”. This detailed work was conducted in the Lenovo Thermal Lab in Morrisville, NC. One side effect of the “SD650 air-cooled” prototype server, where the system does not dynamically control its fan speed, will be seen in the CPU temperature plots, which show an increasing CPU temperature during the workload execution. In section 5.2.5, we will see a different behavior when comparing the measurements of this prototype air-cooled SD650 with those of the air-cooled SD530 product, where the cooling algorithm is designed to maintain cooling margin on the CPUs and where the system firmware reacts to changes in system behavior by increasing the system fan speed accordingly. The SD530 system fans are initially at their lower idle state before the exerciser is initiated. That is why we will see an initial peak in the CPU temperature, and even a second one for HPL due to the PL2 phase, before the temperature settles to its steady-state behavior. Our decision to use such an air-cooled prototype was driven by the desire to measure only the cooling effect between an air-cooled and a water-cooled server.

In section 5.2, we present the power, thermal and performance behavior of this air-cooled server running four different types of workloads. The measurements on the “water-cooled node” will be presented in section 5.3. The server is equipped with 2 Intel Xeon 6148 processors, 12 8GB dual rank DIMMs @ 2666 MHz and one HDD 1TB 2.5" 7.2K. The Intel Xeon 6148 is a 150 W TDP SKU as presented in Table 4.4.

5.1.2. System parameters

The power consumption of a server when running a workload will depend also on system parameters, which can be defined at boot time or dynamically.

We will discuss here the UEFI, the Turbo mode and the use of governors.

The Unified Extensible Firmware Interface (UEFI) is a specification that defines a software interface between an operating system and platform firmware.

The UEFI operating mode choices that will influence the power consumption and performance of a server are as follows:

  1) minimal power mode strives to minimize the absolute power consumption of the system while it is operating. The tradeoff is that performance may be reduced in this mode depending on the application that is running;
  2) efficiency-favor power mode maximizes the performance per watt efficiency with a bias toward power savings. It provides the best features for reducing power and increasing performance in applications where maximum bus speeds are not critical;
  3) efficiency-favor performance mode optimizes the performance per watt efficiency with a bias toward performance;
  4) maximum performance mode will maximize the absolute performance of the system without regard for power. In this mode, power consumption is not taken into consideration. Attributes like fan speed and heat output of the system may increase in addition to power consumption. Efficiency of the system may go down in this mode, but the absolute performance may increase depending on the workload that is running.

Turbo Technology as explained in section 4.1.3 can also influence the performance and power consumption of a workload.

UEFI settings already configure the utilization of TURBO, setting it to ON or OFF. This attribute limits the maximum frequency to be used by the system. On top of the UEFI configuration, the CPU frequency can be dynamically modified by the CPU frequency driver and the CPU frequency policies supported by the loaded driver. The CPU frequency driver can be automatically compiled with the kernel or loaded as a kernel module. CPU frequency drivers implement CPU power schemes called governors (it can happen that a given driver does not support a specific governor).

CPU frequency is statically or dynamically selected based on the governor and considering the system load, in response to ACPI events, or manually by user programs. More than one governor can be installed but only one can be active at a time (and only one CPU frequency driver can be loaded). All the governors are limited by UEFI settings and policy limits, such as the minimum and maximum frequency, that can be modified by the sysadmin using tools such as cpupower.

The list of governors with main power scheme characteristics is as follows:

  1) performance: this governor sets the CPU statically to the highest frequency given the max_freq and min_freq limits;
  2) powersave: this governor sets the CPU statically to the lowest frequency given the max_freq and min_freq limits;
  3) userspace: this governor allows privileged users to set the CPU to a specific static frequency;
  4) ondemand, conservative and schedutil: these governors set the CPU frequency depending on the current system load but differ in the way they compute the load and in the degree of frequency variation, being more or less conservative. The CPUfreq governor “ondemand” sets the CPU frequency depending on the current system load. The system load is computed by the scheduler using the update_util_data->func hook in the kernel. When executed, the driver checks the CPU usage statistics for the last period and, based on that, the governor selects the CPU frequency. The “conservative” governor uses the same approach to compute the system load but it increases and decreases the CPU frequency progressively, rather than going directly to the maximum. This governor is more targeted at battery-powered systems. Finally, the “schedutil” governor differs from the previous ones in the way it is integrated with the Linux kernel scheduler. This governor uses the per-entity load tracking mechanism to estimate the system load. Moreover, dynamic voltage and frequency scaling (DVFS) is only applied to completely fair scheduler (CFS) tasks, while real-time (RT) and deadline (DL) tasks always run at the highest frequency.

When TURBO is set to ON by UEFI settings, the utilization of TURBO frequencies for non-AVX instructions can be avoided by the sysadmin by setting the maximum frequency to the nominal frequency.
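For example (a minimal sketch, assuming a nominal frequency of 2.4 GHz as on the Xeon 6148; the exact command used for our runs is not reproduced in the text):

  # Cap the policy maximum at the nominal (non-Turbo) frequency: 2,400,000 kHz = 2.4 GHz
  cpupower frequency-set --max 2400000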

With the “userspace” governor, the sysadmin can explicitly set the CPU frequency by selecting one frequency available in the list of P-states. When TURBO is ON in the UEFI settings, the frequency at P-state 0 is a generic value representing the activation of turbo frequencies.

For instance, in a system with a nominal frequency of 2.4 GHz and TURBO set to ON in UEFI settings, the following commands will set CPU frequency to a fixed 2.4 GHz, not allowing the activation of TURBO. The first command selects the userspace governor and the second one sets the frequency to 2.40 GHz.
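The commands themselves are not reproduced here; a sketch consistent with the description (frequencies given in kHz, the default cpupower unit) would be:

  # Select the userspace governor, then pin the frequency at 2,400,000 kHz = 2.4 GHz (Turbo not activated)
  cpupower frequency-set --governor userspace
  cpupower frequency-set --freq 2400000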

The following example allows the activation of TURBO frequencies assuming that “userspace” governor is selected and TURBO is set to ON in UEFI settings.
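A corresponding sketch, again in kHz, selects the P-state 0 value of 2,401,000 kHz, which stands for “Turbo frequencies allowed”:

  # 2,401,000 kHz = P-state 0, i.e. Turbo frequencies may be activated
  cpupower frequency-set --freq 2401000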

The “cpupower” command provides information about the CPU frequency driver, the governor and the list of available frequencies. The current limits can be obtained with the “frequency-info” subcommand.
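For instance, run without further options it prints, among other things, the driver, the current governor, the available governors and the hardware and policy frequency limits:

  cpupower frequency-info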

5.1.3. Workloads used

The four workloads we will be using have been selected to exhibit different power and performance characteristics.

The first, called single instruction, multiple data (SIMD), is a code executing different basic SIMD instructions. We use it to understand the power and performance behavior of different types of SIMD instructions. This test and its results are presented in section 5.2.1.

The second, called High Performance Linpack (HPL), is a well-known benchmark that has been written to exhibit the highest floating-point performance on one or multiple servers (Petitet et al. 2018). This test and its results are presented in sections 5.2.2 and 5.3.

The third, called STREAM, is a well-known benchmark that has been written to exhibit the highest memory bandwidth on a server when data is streamed from memory. This test and its results are presented in section 5.2.3.

The fourth, called Berlin Quantum Chromo Dynamics (BQCD), is a user application using hybrid Monte-Carlo methods to solve the lattice Quantum Chromo Dynamics (QCD) theory of quarks and gluons (Haar et al. 2017). We selected it because QCD methods are frequently used in high-performance computing and because its power and performance behavior vary when changing its domain decomposition. This test and its results are presented in sections 5.2.4 and 5.3.

But before presenting the power, thermal and performance behavior of these workloads, we introduce first some qualitative and quantitative metrics to classify workloads to better explain their behavior, and then we will present the different measurements for these workloads.

5.1.4. CPU-bound and memory-bound workloads

We say an application is CPU bound when it makes intensive use of the processor, measured by a CPI < 1 value, where CPI (cycles per instruction) is defined by:

[5.1] \( \mathrm{CPI} = \dfrac{\text{number of processor cycles}}{\text{number of instructions executed}} \)

We can also deduce CPI from IPC where IPC is the number of instructions per cycle by:

[5.2] \( \mathrm{CPI} = \dfrac{1}{\mathrm{IPC}} \)

For such a CPU-bound application, the processor being highly utilized means the processor is not waiting for data from memory as they are already in cache or registers.

We say an application is memory bound when CPI > 1, leading to a “more idle” processor waiting for data to be fetched from memory. This memory traffic activity is measured by GBS (gigabytes per second) and defined by:

[5.3] \( \mathrm{GBS} = \dfrac{\text{bytes read from and written to memory}}{\text{elapsed time (s)} \times 10^{9}} \)

Section 6.1.2.1 provides the explicit formula to compute CPI, IPC and GBS.
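As a quick illustration of [5.1]–[5.2], using values that will be measured later in this chapter: an IPC of 2.5 gives \( \mathrm{CPI} = 1/2.5 = 0.4 < 1 \), the CPU-bound signature of HPL (section 5.2.2.4), while \( \mathrm{CPI} = 9.6 \) gives \( \mathrm{IPC} = 1/9.6 \approx 0.1 \), the memory-bound signature of STREAM (section 5.2.3).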

The CPI and GBS values measured for each workload will be presented as part of the performance, power and thermal analysis in sections 5.2 and 5.3.

5.1.5. DC node power versus components power

For each workload, we will be reporting the following metrics measured by IPMI for node power or ptumon commands for CPU and DIMM power, CPU core frequency and temperature (see section 6.1.1):

  – DC_NODE_POWER (or simply node power) is the DC power measured at the node level, including all components of the server such as the processors, the memory DIMMs, the board, the voltage regulators (VRs), PCI slots, PCI adapters, disks and so on;
  – CPU0 Power is the DC power consumed by CPU0 (in our case, Intel Xeon 6148);
  – CPU1 Power is the DC power consumed by CPU1 (in our case, Intel Xeon 6148);
  – PCK_POWER is the sum of CPU0 Power and CPU1 Power;
  – MEM0 Power is the power consumed by the DIMMs on channel 0 (in our case, six DIMMs);
  – MEM1 Power is the power consumed by the DIMMs on channel 1 (in our case, six DIMMs);
  – DRAM_POWER is the sum of MEM0 Power and MEM1 Power;
  – CPU0 core frequency is the average frequency of all cores of CPU0;
  – CPU1 core frequency is the average frequency of all cores of CPU1;
  – node frequency is the average frequency of CPU0 and CPU1;
  – CPU0 temperature is the highest temperature of all cores on CPU0;
  – CPU1 temperature is the highest temperature of all cores on CPU1.

For each workload, we will present either the instantaneous metric measured every 1 s or the average value over the execution time of the workloads.

Taking the example of the SD650 air-cooled server with two Xeon 6148 sockets, 12 DIMMs of 8 GB and one 2.5″ 1 TB 7.2K HDD, and adding the measured PCK_POWER and DRAM_POWER (Figure 5.3) to the idle power of the HDD (Table 1.5) and the board power (see Table 1.6), we note that the sum of all individual components (359 W) is less than the DC node power (390 W) measured in section 5.2.2. This is caused by the efficiency of the DC voltage conversions performed on the board by the VRs, from 12 V (for DC_POWER) down to the specific and lower voltage of each component. In our case, this conversion efficiency is equal to 0.92:

[5.4] \( \eta_{\mathrm{VR}} = \dfrac{P_{\mathrm{PCK}} + P_{\mathrm{DRAM}} + P_{\mathrm{HDD}} + P_{\mathrm{board}}}{P_{\mathrm{DC\_NODE}}} = \dfrac{359\ \mathrm{W}}{390\ \mathrm{W}} \approx 0.92 \)

Sometimes the node power is also reported as the AC node power or, more precisely, the AC chassis power. As shown in Figure 6.1, the AC chassis power is the DC node power [5.4] plus the fan power, divided by the AC to DC conversion efficiency of the PSU (see section 1.4.9).
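A minimal sketch of this relation, assuming the fans are fed from the PSU's DC output and writing \( \eta_{\mathrm{PSU}} \) for the AC to DC conversion efficiency (the exact decomposition is the one shown in Figure 6.1):

\[ P_{\mathrm{AC\ chassis}} \approx \frac{P_{\mathrm{DC\ node}} + P_{\mathrm{fan}}}{\eta_{\mathrm{PSU}}} , \]

with \( \eta_{\mathrm{PSU}} < 1 \), so that the AC chassis power is always larger than the sum of the DC loads.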

5.2. Power, thermal and performance on air-cooled servers with Intel Xeon

In this section, the UEFI “efficiency-favor performance” was selected.

5.2.1. Frequency, power and performance of simple SIMD instructions

As we have seen in Table 4.3, Intel Xeon has different types of SIMD instructions that have been introduced over the Intel microarchitecture generations to increase performance. SIMD is a simple program that executes each of the instructions shown in Table 5.1 a number of times.

Table 5.1. List and type of instructions executed by the SIMD test

Instruction name Instruction type DP Flops per instruction
SSE2 DP ADD128 SSE2 4
SSE2 DP MUL128 SSE2 4
FMA DP FMADD128 SSE2 8
AVX DP ADD256 AVX2 8
AVX DP MUL256 AVX2 8
FMA DP FMADD256 AVX2 16
AVX-512F DP ADD512 AVX512 16
AVX-512F DP MUL512 AVX512 16
AVX-512F DP FMADD512 AVX512 32

Table 5.1 also presents the number of DP Flops theoretically produced by each instruction. Figure 5.2 plots the core frequency at which each instruction was executed on the “air-cooled node” with Turbo OFF and ON. As explained in section 5.1.2, Turbo was set to ON in the UEFI settings and the “userspace” governor was selected. Therefore, to run the code with Turbo OFF we used the command “cpupower frequency-set --freq 2400000”. To run the code with Turbo ON, we used the command “cpupower frequency-set --freq 2401000”.


Figure 5.2. Frequency of each instruction type with Turbo OFF and ON

Figure 5.2 shows the processor frequency at which each instruction of the SIMD test executes with Turbo OFF and ON. SSE2 instructions run at 2.4 GHz, the frequency we set, and at 3.1 GHz when we let the processor run at its maximum frequency. This is in line with Table 4.4 for non-AVX instructions on the 6148 Xeon processor.

AVX2 instructions run at 2.4 GHz, the frequency we set, and 2.6 GHz when we let the processor run at maximum frequency. This is according to Table 4.4 for AVX2 instructions on the 6148 Xeon processor.

AVX-512 instructions run at 2.2 GHz with both settings. Although we set the frequency to 2.4 GHz in the first case and 2.401 GHz in the second case, as 2.4 GHz is higher than the max Turbo frequency for AVX-512 instructions of 2.2 GHz (see Table 4.4), the processors run effectively at 2.2 GHz and not 2.4 GHz or higher.

Tables 5.2 and 5.3 present the performance measured in GFlops and the performance per watt measured in GFlops per watt of these different instructions measured by the SIMD program on the 6148 Intel Xeon processor.

From a performance perspective, we see nearly a doubling of performance from SSE2 to AVX2 and again from AVX2 to AVX-512, and we also see nearly a doubling of performance from DP Add to DP FMA, while DP Mult. is a bit lower than DP Add. We note also that Turbo improves the performance of all instructions except AVX-512, as we have already seen in Figure 5.2.
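The measured GFlops in Table 5.2 are close to a simple peak estimate; the sketch below assumes the test runs on all 2 × 20 = 40 cores of the node and that each core retires one SIMD instruction of the measured type per cycle (an inference from the numbers, not a statement from the text):

\[ R_{\mathrm{peak}} \approx N_{\mathrm{cores}} \times \text{Flops per instruction} \times f , \]

e.g. for AVX-512 FMA: \( 40 \times 32 \times 2.2\ \mathrm{GHz} = 2{,}816\ \mathrm{GFlops} \) versus 2,791 GFlops measured, and for SSE2 ADD at 2.4 GHz: \( 40 \times 4 \times 2.4 = 384\ \mathrm{GFlops} \) versus 382 GFlops measured.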

Table 5.2. DP GFlops at 2.4 GHz (Turbo OFF) and 2.401 GHz (Turbo ON)

Xeon 6148; 2.4 GHz Instruction set DP Add DP Mult. DP FMA
GFlops SSE2 382 305 763
GFlops AVX2 762 763 1525
GFlops AVX-512 1396 1400 2791
Xeon 6148; 2.401 GHz Instruction set DP Add DP Mult. DP FMA
GFlops SSE2 492 407 984
GFlops AVX2 828 828 1652
GFlops AVX-512 1399 1397 2797

Table 5.3. DP GFlops per watt at 2.4 GHz (Turbo OFF) and 2.401 GHz (Turbo ON)

Xeon 6148; 2.4 GHz Instruction Set DP Add DP Multiply DP FMA
DP GFlops/W SSE2 1.54 1.33 3.01
DP GFlops/W AVX2 2.86 2.93 5.67
DP GFlops/W AVX-512 5.28 5.30 10.22
Xeon 6148; 2.401 GHz Instruction Set DP Add DP Multiply DP FMA
DP GFlops/W SSE2 1.53 1.37 3.00
DP GFlops/W AVX2 2.90 2.93 5.65
DP GFlops/W AVX-512 5.25 5.28 10.22

From a performance per watt perspective, we note that the performance per watt nearly doubles from DP Add to DP FMA but not from SSE2 to AVX2, and even less from AVX2 to AVX-512, pointing to the higher power consumption of these more complex AVX2 and AVX-512 instructions. Overall, comparing the GFlops per watt of AVX-512 FMA to SSE2 Add, we note a 6.6× improvement, which is quite significant. We note also that Turbo does not provide a clear benefit versus nominal frequency, since the increased frequency is compensated by an increased power consumption. This conclusion is valid for these simple SIMD instructions and will be reevaluated for the other workloads studied in the following sections.

Figures 5.3 and 5.4 plot the average DC_NODE_POWER, PCK_POWER and DRAM_POWER consumed by each of the nine instructions when executed on the “air-cooled node” with Turbo OFF and ON.


Figure 5.3. Node, CPU and DIMM DC power of SIMD instructions Turbo OFF. For a color version of this figure, see www.iste.co.uk/brochard/energy.zip


Figure 5.4. Node, CPU and DIMM DC power of SIMD instructions with Turbo ON

In Figure 5.3, we note that the average DC_NODE_POWER varies from 228 W (SSE2 ADD) to 273 W (AVX-512 FMA) and PCK_POWER varies from 178 W (SSE2 ADD) to 220 W (AVX-512 FMA). DRAM_POWER is flat at 13 W across all instructions, leading to about 1.1 W per 8 GB DIMM (1 DPC). According to Table 1.3, this power per DIMM is pretty close to the idle power (0.9 W), which is expected because all these instructions have a pretty low memory bandwidth (GBS), as data sets are accessed sequentially and mostly from cache. The power consumed by one CPU is also moderate (from 89 to 110 W for a 150 W TDP). We note also that the AVX2 and AVX-512 instructions lead to the highest CPU power consumption (from 106 to 110 W), while Figure 5.2 shows that AVX-512 instructions run only at 2.2 GHz vs 2.4 GHz for the other instructions. For this reason, we can say that AVX-512 instructions are power-hungry instructions.

In Figure 5.4, with Turbo ON, we see that the DRAM_POWER is unchanged while the DC node power has increased significantly, since all instructions except AVX-512 have been executed at a higher frequency. For example, PCK_POWER for the SSE ADD instruction increased from 195 to 263 W (35%), while the core frequency increased from 2.4 to 3.1 GHz (29%), showing that power increases more than frequency. We note also that for the FMA DP instructions, which run at 3.1 GHz, the average power consumed per socket is 134 W, getting close to the 150 W TDP.
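The comparison can be written explicitly using the values quoted above:

\[ \frac{\Delta P}{P} = \frac{263 - 195}{195} \approx 35\% \;>\; \frac{\Delta f}{f} = \frac{3.1 - 2.4}{2.4} \approx 29\% , \]

which is consistent with \( P_{\mathrm{dyn}} \propto V^{2} f \) [4.2], since the voltage also rises with the frequency.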

5.2.2. Power, thermal and performance behavior of HPL

HPL is a well-known benchmark based on a dense linear algebra solver called LINPACK, which has been used for many years to rank the world’s fastest servers and systems, published twice a year by the Top500¹ list. Because of its simple structure, HPL has been tuned to produce the highest sustained double precision floating-point performance. The version we will use is HPL 2.2 (Petitet et al. 2018), which on Intel Xeon calls the MKL library (Gennady et al. 2018) and whose DGEMM operations use AVX-512 instructions for most of the HPL execution time.

5.2.2.1. Power consumption

Figures 5.5 and 5.6 show the DC node power, the CPU power of each socket (CPU0 and CPU1) and the DIMM power attached to each memory channel (MEM0 and MEM1) with Turbo OFF and ON when running HPL.


Figure 5.5. Node, CPU and DIMM DC power running HPL Turbo OFF. For a color version of this figure, see www.iste.co.uk/brochard/energy.zip


Figure 5.6. Node, CPU and DIMM DC power running HPL Turbo ON. For a color version of this figure, see www.iste.co.uk/brochard/energy.zip

5.2.2.2. The different phases of HPL and the Turbo impact

At first glance, the plots in Figures 5.5 and 5.6 are very similar, with four different phases in both the Turbo OFF and ON runs. The first and fourth phases are the start-up and ending phases, which execute non-AVX instructions. The second and third phases make up the plateau phase, which executes the AVX-512 instructions introduced with the Skylake architecture. This four-phase behavior is due to the fact that HPL spends the major part of its execution running a matrix–matrix multiplication called DGEMM (LAPACK n.d.), which every vendor has carefully tuned and which Intel provides as a highly tuned AVX-512 implementation in the MKL library for Xeon Skylake (Gennady and Shaojuan 2018).

Therefore, phases 1 and 4 execute non-AVX instructions before and after HPL executes the AVX-512 instructions in phases 2 and 3. Phase 2 is a short phase where node and CPU power are higher than during phase 3. The DC node power and the CPU power per socket are 390 W and 149 W during phase 3, while they are 440 W and 169 W during phase 2. As phase 3 executes 100% AVX-512 instructions, each processor runs very close to TDP, which is 150 W for the 6148 SKU. The phase 2 behavior is due to another feature Intel introduced with Sandy Bridge processors as part of Intel Turbo Boost Technology 2.0 and the two RAPL (running average power limit) power limits (Rotem et al. 2011). PL1 is the long-term power limit and PL2 is the short-term power limit. By default, PL1 is set to TDP and its time constant is infinite (i.e. the CPU can stay there forever); PL1 is what we observe during phase 3. The default for PL2 is 1.2×TDP with a time constant of ~10 s, PL2 being typically limited by the thermal conditions of the server. PL2 is what we observe during phase 2, and we note that PL2 is only 1.13×TDP due to the thermal conditions of the server.
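As a side note, not part of the measurement setup used here: on a Linux system exposing the intel_rapl powercap driver, the two package limits can be inspected as sketched below (paths may vary across kernels and systems; values are reported in microwatts and microseconds).

  # PL1 (long-term, constraint_0) and PL2 (short-term, constraint_1) limits of package 0
  cat /sys/class/powercap/intel-rapl:0/constraint_0_power_limit_uw
  cat /sys/class/powercap/intel-rapl:0/constraint_0_time_window_us
  cat /sys/class/powercap/intel-rapl:0/constraint_1_power_limit_uw
  cat /sys/class/powercap/intel-rapl:0/constraint_1_time_window_us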

Since during phases 2 and 3 HPL executes AVX-512 instructions, which run at the maximum possible frequency whether Turbo is OFF or ON (see section 5.1.2), the HPL power curves are identical with Turbo OFF and ON during phases 2 and 3. This is not the case during the first and fourth phases, which execute non-AVX instructions and where CPU power is much higher with Turbo ON versus Turbo OFF. This is due to the fact that the frequency of non-AVX instructions gets a significant boost with Turbo ON, as we have seen for SIMD and as shown in Table 4.4.

During the PL1 plateau phase, HPL has a much higher node power consumption than the SIMD test. The power per CPU socket is about constant and close to the TDP of 150 W for HPL, while it varies from 106 to 134 W for SIMD. Similarly, the memory power for HPL is more than twice the SIMD memory power (16.5 W for MEM0 and 15.5 W for MEM1, leading to a total of 32 W for HPL vs. 13 W for SIMD). This difference will be explained when we analyze CPI and GBS in Figures 5.9 and 5.10.

5.2.2.3. Frequency and temperature

Figures 5.7 and 5.8 present the frequency and temperature of each CPU while running HPL with Turbo OFF and ON.


Figure 5.7. CPU frequency and temperature running HPL with Turbo OFF. For a color version of this figure, see www.iste.co.uk/brochard/energy.zip


Figure 5.8. CPU frequency and temperature running HPL with Turbo ON. For a color version of this figure, see www.iste.co.uk/brochard/energy.zip

As we have seen for the power, the only difference between Turbo OFF and ON occurs during the first and fourth phases, which execute non-AVX instructions. The CPU0 and CPU1 frequencies (gray and blue curves) are 2.4 GHz for the first and fourth phases with Turbo OFF and 3.1 GHz with Turbo ON, which match perfectly the non-AVX base and max Turbo frequencies for the 6148 in Table 4.4.

With Turbo OFF, during the PL1 plateau phase, the average CPU frequency is 1.91 GHz, while the CPU0 frequency is 1.85 GHz and the CPU1 frequency is 1.98 GHz. The average CPU frequency lies between the AVX-512 base and max Turbo frequencies of the 6148 (respectively, 1.6 GHz and 2.2 GHz), as expected. The frequency difference between CPU0 and CPU1 will be discussed in section 5.2.5.

With Turbo OFF, during the PL2 plateau phase, average CPU frequency is 2.1 GHz while CPU 0 frequency is 2.0 GHz and CPU 1 is 2.18 GHz. This higher CPU frequency is normal for the PL2 phase and still within the limit of the max Turbo AVX-512 frequency (2.2 GHz).

The yellow and orange curves represent the CPU0 and CPU1 temperatures. During the PL2 and PL1 plateau phases that execute AVX-512 instruction, CPU0 has a higher temperature (88°C max) than CPU1 (81°C max). Both are well below the max junction temperature of 95°C for the 6148 (Xeon Scalable Thermal Guide, 2018, Table 5.1). The temperature difference between CPU0 and CPU1 will be discussed in section 5.1.7.

5.2.2.4. CPI and GBS

Figures 5.9 and 5.10 present HPL CPI and GBS during execution time with Turbo OFF and Turbo ON. During PL1 phase, Turbo OFF (and ON) average CPI is 0.4 and average GBS is 68.

This shows clearly that HPL is highly CPU bound, making intensive use of AVX-512 instructions and leading to the maximum power being consumed by the two sockets (around 150 W each), while MEM0 and MEM1 power is about 16 W each. We note that the memory bandwidth is quite high while the CPI is low. This is due to the BLAS3 dense matrix operations, where data are blocked into cache and prefetched from memory without stalling the processor. This behavior is quite normal for AVX-512 instructions, which cannot execute efficiently unless data are already in cache and well prefetched from memory.


Figure 5.9. CPU0 and CPU1 CPI and node bandwidth running HPL with Turbo OFF. For a color version of this figure, see www.iste.co.uk/brochard/energy.zip


Figure 5.10. CPU0&1 CPI and node bandwidth running HPL with Turbo ON. For a color version of this figure, see www.iste.co.uk/brochard/energy.zip

5.2.3. Power, thermal and performance behavior of STREAM

STREAM (McAlpin 2017) is a highly tuned kernel to measure the highest possible sustained memory bandwidth of a server. To exhibit the highest memory bandwidth on Xeon 6148, the STREAM 5.10 test has been compiled with AVX-512 instructions.

5.2.3.1. Power consumption

Figures 5.11 and 5.12 present the DC node power, the CPU and memory DC power while running STREAM with Turbo OFF and ON.

Both plots are very similar since STREAM is compiled with AVX-512 instructions. The frequency plots (Figures 5.13 and 5.14) will provide more detail on this topic.

During the plateau phase, node power is 330 W for both Turbo OFF and ON, while CPU0 power is 116 W and CPU1 power is 103 W. Both CPU power values are much less than HPL (150 W) as STREAM is not as CPU intensive, which is confirmed by the higher CPI value of 9.6 (see Figures 5.15 and 5.16).

MEM0 and MEM1 power is 21.5 W each, leading to 43 W for the whole memory, which is higher than the HPL memory power (32 W). This is explained by the very high bandwidth of 140 GBS (see Figures 5.15 and 5.16) versus 68 for HPL. This corresponds to 3.6 W per DIMM, which is pretty close to the maximum value of 4.5 W in Table 1.3. The power difference between CPU0 and CPU1 will be discussed in section 5.2.5.


Figure 5.11. Node, CPU and DIMM power running STREAM with Turbo OFF. For a color version of this figure, see www.iste.co.uk/brochard/energy.zip


Figure 5.12. Node, CPU and DIMM power running STREAM with Turbo ON. For a color version of this figure, see www.iste.co.uk/brochard/energy.zip

5.2.3.2. Frequency and temperature

Figures 5.13 and 5.14 present the frequency and temperature of both CPUs while running STREAM with Turbo OFF and ON. During the plateau phase, both CPUs run at ~2.4 GHz with Turbo OFF and ON. A constant frequency with Turbo OFF and ON indicates the code is not running non-AVX instructions, which is expected as STREAM is compiled with the AVX-512 option. But this frequency is not in the range of AVX-512 frequencies for the 6148 shown in Table 4.4, which could indicate that STREAM is executing AVX2 and not AVX-512 instructions. A detailed analysis of the binary showed 0% AVX-512 instructions and 67% AVX2 instructions.


Figure 5.13. CPU temperatures and frequencies running STREAM with Turbo OFF. For a color version of this figure, see www.iste.co.uk/brochard/energy.zip


Figure 5.14. CPU temperatures and frequencies running STREAM with Turbo ON. For a color version of this figure, see www.iste.co.uk/brochard/energy.zip

5.2.3.3. CPI and GBS

Figures 5.15 and 5.16 present CPI of both CPUs and node bandwidth (GBS) when running STREAM with Turbo OFF and ON.


Figure 5.15. CPU CPIs and node bandwidth running STREAM with Turbo OFF. For a color version of this figure, see www.iste.co.uk/brochard/energy.zip


Figure 5.16. CPU CPIs and node bandwidth running STREAM with Turbo ON. For a color version of this figure, see www.iste.co.uk/brochard/energy.zip

During the plateau phase, GBS is around 140 GBS while CPI for CPU0 is 9.8 and 9.4 for CPU 1. As discussed earlier, the high GBS value causes the high memory power of 43 W compared to 32 W for HPL.

5.2.4. Power, thermal and performance behavior of real workloads

HPL is a highly tuned kernel to demonstrate the highest possible sustained performance with an extremely low CPI of 0.4, since most of its execution time is spent in the DGEMM routine from Intel MKL library, which makes intensive use of AVX-512 instructions (Gennady et al. 2018). STREAM is another extreme benchmark, which has been created to measure the highest possible memory bandwidth on a server with an extremely high GBS value of 140.

Workloads used by scientists and engineers in their daily job have different characteristics. They have neither extremely low CPI (with some not even using AVX-512 instructions) like HPL nor a very high GBS like STREAM.

To highlight how real workloads behave, we chose an application that is widely used by the HPC community and can exhibit both a CPU-bound and a memory-bound behavior with the same code pattern. BQCD (Haar et al. 2017) is a hybrid Monte-Carlo program for simulating lattice QCD with dynamical Wilson fermions. It has a distributed memory version implemented with MPI, and the same code running on a single server with different processor grid decompositions exhibits either a CPU-bound behavior, when the data set computed per core is small and fits reasonably into the processor’s cache, or a memory-bound behavior, when the data set computed per core is large enough that it does not fit into the processor’s cache. BQCD has been compiled with no AVX option so that it executes only SSE instructions. In the following, we present the same measurements as for HPL and STREAM, except that we will use two BQCD use cases: BQCD128, the memory-bound test case, uses the parameters LATTICE = 48 6 24 48 and PROC = 1 1 2 2; BQCD1K, the CPU-bound test case, uses the parameters LATTICE = 48 6 12 12 and PROC = 1 1 2 2.
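A rough way to see why the two decompositions behave differently, assuming the four LATTICE values are the global lattice dimensions (an assumption about the input format): for the same 1 × 1 × 2 × 2 process grid, the BQCD128 lattice contains

\[ \frac{48 \times 6 \times 24 \times 48}{48 \times 6 \times 12 \times 12} = \frac{331{,}776}{41{,}472} = 8 \]

times more sites than the BQCD1K lattice, hence a working set per core roughly eight times larger, which no longer fits in the processors’ caches.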

5.2.4.1. Power consumption

Figures 5.17 and 5.18 show the DC node, CPU0/1 and MEM0/1 power of the air-cooled SD650 prototype while running the two BQCD test cases with Turbo OFF.


Figure 5.17. Node, CPU and DIMM power running BQCD1K with Turbo OFF. For a color version of this figure, see www.iste.co.uk/brochard/energy.zip


Figure 5.18. Node, CPU and DIMM power running BQCD128 with Turbo OFF. For a color version of this figure, see www.iste.co.uk/brochard/energy.zip

During the plateau phase of BQCD1K and BQCD128 with Turbo OFF, the average DC node power is, respectively, 320 W and 344 W; CPU0 and CPU1 power is 136 W and 121 W for BQCD1K, and 133 W and 118 W for BQCD128; MEM0 and MEM1 power is 11 W and 10 W for BQCD1K, and 18 W and 18 W for BQCD128. The power difference between CPU0 and CPU1 will be addressed in section 5.2.5. The slightly higher CPU power of BQCD1K versus BQCD128 can be explained by its lower CPI (0.64 vs. 1.24), and its lower memory power by a lower GBS (27 vs. 73), as presented in section 5.2.4.3.

Figures 5.19 and 5.20 show the DC node, CPU0/CPU1 and MEM0/MEM1 power of the air-cooled SD650 prototype while running the two BQCD test cases with Turbo ON.

With Turbo ON, we note that the average CPU0 power for both test cases is 149 W, very close to the TDP limit of 150 W, with peaks slightly over 150 W. This is consistent with the definition of Turbo, which runs instructions at the maximum frequency until the TDP or the thermal limit of the processor is reached. In this case, the limiting factor is CPU power, since the CPU0 temperature is still below the max junction temperature of the 6148 (95°C), as we will see in the next section on frequency and temperature. The average CPU1 power for the two test cases is 133 W, leading to a difference of about 16 W between CPU0 and CPU1, as already noted. This will be addressed in section 5.2.5.


Figure 5.19. Node, CPU and DIMM power running BQCD1K with Turbo ON. For a color version of this figure, see www.iste.co.uk/brochard/energy.zip


Figure 5.20. Node, CPU and DIMM power running BQCD128 with Turbo ON. For a color version of this figure, see www.iste.co.uk/brochard/energy.zip

5.2.4.2. Frequency and temperature

Figures 5.21 and 5.22 present the frequency and temperature for BQCD1K and BQCD128 with Turbo OFF and Figures 5.23 and 5.24 present the frequency and temperature for BQCD1K and BQCD128 with Turbo ON.


Figure 5.21. CPU temperatures and frequencies running BQCD1K with Turbo OFF. For a color version of this figure, see www.iste.co.uk/brochard/energy.zip


Figure 5.22. CPU temperatures and frequencies running BQCD128 with Turbo OFF. For a color version of this figure, see www.iste.co.uk/brochard/energy.zip


Figure 5.23. CPU temperatures and frequencies running BQCD1K with Turbo ON. For a color version of this figure, see www.iste.co.uk/brochard/energy.zip


Figure 5.24. CPU temperatures and frequencies running BQCD128 with Turbo ON. For a color version of this figure, see www.iste.co.uk/brochard/energy.zip

In Figures 5.21 and 5.22 for BQCD1K and BQCD128 with Turbo OFF, core frequency of both CPUs is 2.4 GHz for both test cases, which is, according to Table 4.4, the 6148 base non-AVX frequency, since the code has been compiled with no AVX. In Figures 5.23 and 5.24 with Turbo ON, core frequency of both CPUs is around 2.7 GHz, which is a significant boost but still below the max possible frequency of 3.1 GHz. This is due to the fact that CPU0 is reaching the TDP limit of 150 W.

We note also that with BQCD 1K CPU0 core frequency is slightly higher than CPU1 frequency (2.7 GHz vs. 2.6 GHz) while with BQCD 128 both CPUs have the same frequency (2.8 GHz). These differences will be addressed in section 5.2.5.

5.2.4.3. CPI and GBS

Figures 5.25 and 5.26 present CPI and GBS for the two BQCD test cases with Turbo OFF.


Figure 5.25. CPU CPIs and node bandwidth running BQCD1K with Turbo OFF. For a color version of this figure, see www.iste.co.uk/brochard/energy.zip


Figure 5.26. CPU CPIs and node bandwidth running BQCD128 with Turbo OFF. For a color version of this figure, see www.iste.co.uk/brochard/energy.zip

During the plateau phase, BQCD1K and BQCD128 with Turbo OFF have, respectively, an average node bandwidth of 27 and 73 GBS, while the CPU0/CPU1 CPIs are, respectively, 0.63 and 0.64, and 1.19 and 1.29. BQCD1K shows stable values around 0.6 for CPI and 27 for GBS, while BQCD128 shows large oscillations, with CPI between 0.4 and 2.0 and GBS between 47 and 97. This behavior is due to the fact that BQCD uses a conjugate-gradient iterative solver: some routines work on local data, are not impacted by the domain decomposition and always have low CPI and GBS values, while other routines work on data in memory, are impacted by the domain decomposition and have high CPI and GBS values for BQCD128 but a low CPI and a medium GBS for BQCD1K. This explains the BQCD128 CPI and GBS oscillations and the BQCD1K stability. In other words, BQCD1K has a stable CPU-bound behavior with a low CPI (0.6) and a medium node bandwidth (27 GBS), while BQCD128 oscillates between CPU-bound and memory-bound routines, leading to an average CPI of 1.2 and a high average node bandwidth of 73 GBS, which is why we can say that BQCD128 is a memory-bound use case.

Figures 5.27 and 5.28 present CPI and GBS for the two BQCD test cases with Turbo ON.


Figure 5.27. CPU CPIs and node bandwidth running BQCD1K with Turbo ON. For a color version of this figure, see www.iste.co.uk/brochard/energy.zip


Figure 5.28. CPU CPIs and node bandwidth running BQCD128 with Turbo ON. For a color version of this figure, see www.iste.co.uk/brochard/energy.zip

We note CPI and GBS have similar behavior and values with Turbo ON or OFF.

5.2.5. Power, thermal and frequency differences between CPUs

In the measurements we reported above, CPU0 and CPU1 displayed different temperatures, frequencies or powers.

Although we were careful to make these measurements under the same conditions, there are two differences we could not eliminate, which are the cooling and SKU differences.

By cooling differences, we mean the impact of the air flow in a shadow processor configuration (Figure 2.9), like the air-cooled SD650 (Figure 3.14), where air flows first over CPU1 and then over CPU0, such that the inlet air temperature at CPU0 is about 5 to 7°C hotter than at CPU1. In our case, that could explain why the CPU1 temperature is higher than the CPU0 temperature. To verify this assumption, we made the same set of measurements as above with the same configuration (SKU, DIMM, HDD) but with another air-cooled shadow configuration server, the Lenovo SD530 (Figure 2.8), where the air flows first over CPU0 and then over CPU1. We label these measurements as “SD530”.

By SKU differences, we mean the performance variations from chip to chip within the same SKU, as shown in section 5.1.1 and in Figures 5.1 and 7.11. In our case, it could explain why one CPU’s power is higher than the other’s. To verify this assumption, we made the same set of measurements as above but swapped the CPUs on the SD650 air-cooled system. We label these measurements as “SD650 swapped”.

Figures 5.29–5.31 present six plots each for BQCD1K Turbo OFF (Figure 5.29), BQCD1K Turbo ON (Figure 5.30) and HPL Turbo ON (Figure 5.31): the first three plots present the CPU temperature and frequency and the last three plots present the node, CPU and DIMM power consumption on the three platforms “SD530”, “SD650 air-cooled” and “SD650 swapped”, to highlight the cooling and SKU differences across these three platforms.

When running BQCD1K with Turbo OFF in Figure 5.29, the core frequency is fixed at the nominal 2.4 GHz on all three servers. On “SD530”, we note that CPU0 temp > CPU1 temp and CPU0 power > CPU1 power, while on “SD650 air-cooled” we have CPU1 temp > CPU0 temp and CPU0 power > CPU1 power, and on “SD650 swapped” we have CPU1 temp > CPU0 temp and CPU1 power > CPU0 power. Knowing that the only difference between “SD650 air-cooled” and “SD650 swapped” is the swapping of the CPUs, we can conclude that the power consumption difference is caused by a CPU performance variation, as described in section 5.1.1, such that the best CPU consumes less power at the same core frequency. Regarding the CPU temperature difference, knowing that the only difference between “SD530” and “SD650 air-cooled” is the air flowing in the opposite direction, we can conclude that the CPU temperature difference is due to the air flow, which cools the first processor in the flow path with cooler air than the second one in a shadow processor design (Figure 2.9).


Figure 5.29. Comparison of BQCD1K with Turbo OFF on three servers. For a color version of this figure, see www.iste.co.uk/brochard/energy.zip


Figure 5.30. Comparison of BQCD1K with Turbo ON on three servers. For a color version of this figure, see www.iste.co.uk/brochard/energy.zip

Figure 5.30 presents BQCD1K with Turbo ON, where the core frequency is not locked at the nominal frequency and will increase until thermal or power limits are hit. From Figure 5.29, we understood that the best CPU for SD530 and SD650 is in slot 1, whereas the best CPU for “SD650 swapped” is in slot 0. For BQCD1K with Turbo ON, we note that only one CPU reaches TDP (CPU0 for SD530 and SD650, and CPU1 for “SD650 swapped”), which happens because the CPU in slot 0 is the “worst” SKU on SD650, while the “worst” SKU is in slot 1 for “SD650 swapped”. We note that the CPU temperatures are still below the junction temperature (95°C). Regarding the CPU frequency and temperature differences, we note that on “SD650 swapped”, the CPU0 temperature is about 10°C lower than on SD650 due to the fact that the best CPU is in slot 0. Finally, we note that for SD650 and “SD650 swapped”, the CPU0 frequency is slightly higher than CPU1’s, which can be explained by the fact that CPU0 always has a lower temperature as it is cooled first on both servers.

Figure 5.31 presents HPL with Turbo ON. In this case, we see that both CPUs are able to reach the TDP limit. Therefore, the best SKU reaches the highest frequency on all three servers (CPU1 for SD530 and SD650, and CPU0 for “SD650 swapped”). The cooling difference has the same effect as in Figure 5.30, where the SKU that is cooled first has a lower temperature than the other one.

As explained in section 5.1.1, we note that for all three workloads, the CPU temperature rise curves on “SD650 air-cooled” and “SD650 swapped” are logarithmic because the CPU is initially over-cooled at the start of the run, and the temperature gradually rises to its steady-state value for the given airflow. The temperature curve has a very different shape on the SD530, which is a server designed for air cooling where the system firmware dynamically controls the fan speed.


Figure 5.31. Comparison of HPL with Turbo ON on three servers. For a color version of this figure, see www.iste.co.uk/brochard/energy.zip

5.3. Power, thermal and performance on water-cooled servers with Intel Xeon

In section 5.2, we presented the power, thermal and performance behavior of the prototype air-cooled version of the ThinkSystem SD650, which we referred to as “SD650 air-cooled”, when running specific workloads. This section presents the impact of cooling on the same SD650 server, which is either air cooled or water cooled, while running HPL and BQCD. We will refer to these servers as “SD650 air-cooled” and “SD650 water-cooled”. The inlet water temperature of the “SD650 water-cooled” will be set at 30°C, 45°C and 55°C, while the room temperature is constant at 21°C.

We will also present measurements for the “SD650 water-cooled” with two UEFI settings (“efficiency-favor performance” and “maximum performance”), while the “SD650 air-cooled” runs are performed with the “maximum performance” UEFI setting only. The goal of these runs with two UEFI settings is to determine how water cooling can improve frequency, and potentially performance, even at the expense of power consumption.

The following sections will present how the CPU temperature, voltage and frequency, power and performance vary depending on the cooling and the inlet water temperature when running HPL and BQCD on “SD650 air-cooled” and “SD650 water-cooled” servers.

5.3.1. Impact on CPU temperature

Tables 5.4–5.6 present the average CPU temperature measured during the PL1 phase when running HPL and during the plateau phase when running BQCD1K and BQCD128, with Turbo OFF and ON, on the “SD650 air-cooled” and “SD650 water-cooled” servers with inlet water temperatures of 30°C, 45°C and 55°C, with the UEFI “efficiency-favor performance” setting referred to as “eff.” and the UEFI “maximum performance” setting referred to as “perf.”. The water flow rate is 0.25 L/min (lpm) per node.

Table 5.4. CPU temperature running HPL


Table 5.5. CPU temperature running BQCD1K


Table 5.6. CPU temperature running BQCD128


We observe that in all measurements the CPU temperatures with BQCD128 and BQCD1K are always lower than with HPL, which is due to the higher CPI (i.e. lower CPU intensity) of BQCD versus HPL. We also observe that the CPU temperatures with BQCD are always lower with Turbo OFF than with Turbo ON, since BQCD executes non-AVX instructions, which get a significant frequency boost with Turbo ON, as we have seen in the previous section.

We note also that the CPU temperatures measured with inlet water temperatures of 30°C and 45°C are always lower than the CPU temperatures measured on the air-cooled server. The difference between the inlet water temperature and the CPU temperature is about 12°C for BQCD and 15°C for HPL on the water-cooled server, while the difference between the room temperature and the CPU temperature is about 47°C for BQCD and 55°C for HPL on the air-cooled server.

With an inlet water temperature of 55°C, the CPU temperatures of the air-cooled server and the water-cooled server are about the same.

5.3.2. Impact on voltage and frequency

Tables 5.7–5.9 present the impact of a lower CPU temperature on the core voltage and frequency of the CPU.

Table 5.7. CPU temperature, voltage and frequency running HPL


Table 5.8. CPU temperature, voltage and frequency running BQCD1K


Table 5.9. CPU temperature, voltage and frequency running BQCD128


In the three tables, we observe that the processor voltage increases as the processor temperature decreases. The processor voltage is set by the VR. VRs are programmed to work in the 70–80°C range. The CPU temperature is in this range with the air-cooled node and with the water-cooled node when the inlet water temperature is 55°C. But as the water temperature gets lower (45°C and 30°C), the CPU temperature gets lower and the VR increases the voltage to keep the processor stable. As the voltage increases, in order to keep the dynamic power stable, the CPU frequency does not increase as much as it would if the voltage were constant, as we will see below.

In Table 5.7, for all temperatures, we note that the frequency is about constant at about 1,900 MHz with the UEFI “efficiency-favor performance”, while it increases to ~2,100 MHz with the UEFI “maximum performance”, just one bin less than the max Turbo frequency for AVX-512 instructions according to Table 4.4, leading to a 10% frequency increase. We also note that the highest frequency is reached with a 55°C inlet water temperature, which corresponds to the processor temperature of the air-cooled node. Similarly, as the inlet water temperature gets colder (45°C and 30°C), the processor voltage increases and the processor frequency decreases compared to 55°C.

In Tables 5.8 and 5.9, we note the impact of Turbo on the voltage and frequency due to the non-AVX instructions of BQCD. We note also the same impact of a colder CPU on the processor voltage as we have seen with HPL in Table 5.7. But the frequencies with Turbo ON and UEFI “maximum performance” are constant on the air-cooled and water-cooled servers, at 2,670 MHz for BQCD1K and 2,770 MHz for BQCD128. This shows that water cooling does not enable frequencies above 2,770 MHz although, according to Table 4.4, the maximum Turbo frequency with all cores loaded is 3.1 GHz for non-AVX instructions, while it did enable a frequency increase from 1,900 to 2,100 MHz for HPL and AVX-512 instructions. Whether a different VR programming as a function of CPU temperature could enable frequencies higher than 2,770 MHz, closer to the maximum Turbo frequency for non-AVX instructions, is something we were not able to determine.

We also note that with Turbo ON, BQCD1K runs at a lower frequency than BQCD128 (~2,670 MHz vs. ~2,780 MHz on the air-cooled node). This is due to the fact that the BQCD1K CPI is 0.65 while the BQCD128 CPI is 1.37, and therefore CPU0 reaches the TDP limit at a lower frequency with BQCD1K than with BQCD128.

5.3.3. Impact on power consumption and performance

Table 5.10 presents the CPU temperature, frequency, voltage and the node power and performance of HPL on the SD650 air-cooled node and water-cooled node at different inlet water temperature, Turbo OFF and ON and with the two different UEFI settings.

Table 5.10. CPU temperature, voltage, frequency, node power and performance running HPL


We note that the node power increases significantly with the UEFI “maximum performance” (about 13% compared to the air-cooled node with Turbo OFF), which is the price to pay for a higher frequency (about 11%) and performance (about 10%). This confirms that HPL is a CPU-bound application, since performance increases with frequency. Power increases more than performance since Pdyn varies with the square of the voltage times the frequency [4.2], while performance varies only with frequency. We can also verify the impact of Pleak on the node power by comparing the power of the SD650 water-cooled node at inlet water temperatures of 45°C and 55°C. At 45°C, the node power is the same as at 55°C (440 W), while the voltage is higher (0.774 V vs. 0.756 V) and the frequency is approximately the same (2,125 MHz vs. 2,129 MHz). As Pdyn varies with the square of the voltage, Pdyn at 45°C is higher than at 55°C, leading to the conclusion that Pleak is lower at 45°C than at 55°C, which confirms that Pleak decreases with the CPU temperature.
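A sketch of this reasoning, splitting the node power into dynamic, leakage and non-CPU contributions and using [4.2]:

\[ P_{\mathrm{node}} \approx P_{\mathrm{dyn}} + P_{\mathrm{leak}} + P_{\mathrm{other}}, \qquad P_{\mathrm{dyn}} \propto V^{2} f . \]

At 45°C and 55°C inlet water, \( P_{\mathrm{node}} \) and \( f \) are essentially equal, but the voltage is higher at 45°C, so \( P_{\mathrm{dyn}} \) is about \( (0.774/0.756)^{2} \approx 1.05 \) times larger; the balance must therefore come from a lower \( P_{\mathrm{leak}} \) at the colder CPU temperature.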

Tables 5.11 and 5.12 present the CPU temperature, frequency, voltage and the node power and performance of BQCD1K and BQCD128 on the SD650 air-cooled node and SD650 water-cooled node at different inlet water temperature, with Turbo OFF and ON and with the two different UEFI settings.

Table 5.11. CPU temperature, voltage, frequency, node power and performance running BQCD1K


Table 5.12. CPU temperature, voltage, frequency, node power and performance running BQCD128


As noted earlier, we see that the BQCD128 performance is constant at 60 GFlops, while the BQCD1K performance increases from 112 to 119 GFlops depending on the UEFI setting, the Turbo mode and the inlet water temperature. This confirms that BQCD128 is memory bound and BQCD1K is CPU bound. Therefore, for memory-bound applications using non-AVX instructions, the optimal energy corresponds to the power minimization. Due to the impact of the inlet water temperature on the VRs, the minimum power is reached at 35°C water temperature for both BQCD128 and BQCD1K. Although the best BQCD1K performance is reached at 55°C water temperature with Turbo ON and UEFI “maximum performance”, the optimal energy is still reached at 35°C water temperature with Turbo OFF and UEFI “efficiency-favor performance”, since the performance improvement is quite small for the reasons explained earlier.

Let us now present measurements done on the air-cooled and water-cooled NeXtScale nodes equipped with Intel Xeon Haswell 2697v3 processors, running HPL at different inlet water temperatures with the UEFI “efficiency-favor performance” setting.

These measurements were done while running single-node HPL on 12 different nodes at different inlet water temperatures. All 12 nodes have the same configuration: Lenovo NeXtScale with 2697v3 processors (14 cores, 2.6 GHz, 145 W) and the same memory configuration. The base frequency of the 2697v3 is 2.6 GHz and its AVX2 base frequency is 2.2 GHz, with a possible max Turbo AVX2 frequency of 2.9 GHz when all cores are loaded.

Figure 5.32 presents the effect of various cooling technologies on the 2697v3 processor (Haswell architecture) temperature and the corresponding HPL performance.


Figure 5.32. Cooling impact on 2697v3 temperature and performance. For a color version of this figure, see www.iste.co.uk/brochard/energy.zip

The y-axis of the graph presents the HPL performance measured on each node, which varies from 920 GFlops to 1,005 GFlops. Using an HPL efficiency of 92% and the theoretical 16 DP Flops per cycle provided by AVX2 on Xeon (Table 4.3), the sustained AVX2 frequency when running HPL can be estimated at 2.23 GHz (for 920 GFlops) and 2.44 GHz (for 1,005 GFlops). These frequency variations are due to processor temperature variations, which come from several factors such as processor quality and cooling, as explained below.
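The estimate can be reproduced as follows, with 2 × 14 = 28 cores per node and the 16 DP Flops per cycle and per core quoted above:

\[ f \approx \frac{R_{\mathrm{HPL}}}{28 \times 16 \times 0.92}, \qquad \frac{920}{28 \times 16 \times 0.92} \approx 2.23\ \mathrm{GHz}, \quad \frac{1005}{28 \times 16 \times 0.92} \approx 2.44\ \mathrm{GHz}. \]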

Each of the 12 curves represents the HPL performance of one node (n101, ..., n112) for different processor junction temperatures. The blue (left), green (middle) and orange (right) areas represent the processor junction temperature for a server equipped with DWC with 18°C inlet water, DWC with 45°C inlet water and air cooling, respectively.

The performance difference between the 12 curves is due to the performance variations of processors with the same SKU when running with Turbo ON or running HPL as discussed in section 5.1, Figure 5.1 and Aucun et al. (2016).

Looking at a single curve across the different cooling zones, for DWC with cold water (blue area, 18°C inlet) and hot water (green area, 45°C inlet) the performance (and frequency) remains mostly flat, while HPL performance (and frequency) drops when the junction temperature increases with air cooling, due to the increased processor power leakage.

Table 5.13 presents detailed data about nodes n101 and n102.

Table 5.13. Inlet water temperature impact on processor temperature and HPL

T_water_inlet (°C)   n101 Tj (°C)   n101 Linpack (GFlops)   n102 Tj (°C)   n102 Linpack (GFlops)
8                    28             934.5                   27             948.4
18                   35             935.2                   33             948.3
24                   41             934.7                   42             948.1
35                   54             931.9                   52             946.3
45                   60             926.7                   60             944.7
55                   68             921.3                   73             938.5
55*                  75             918.9                   79             936.1

The typical flow rate per node is 0.5 L/min (lpm); it is 0.25 lpm for the data reported as “55*”. At these low flow rates, the processor temperature is about 20°C higher than the inlet water temperature, while with air cooling it would be about 60°C higher, due to the lower thermal resistance of water versus air (see section 3.5.1). DWC with inlet water close to 50°C still leads to a lower junction temperature than air cooling at 20°C, and therefore the lower processor leakage power either reduces the processor power consumption at fixed frequency or improves the performance with Turbo ON or AVX instructions, as with HPL. We also note that water cooling with an 18°C inlet temperature improves performance by about 2% compared to the same node with air cooling. This differs from the Skylake measurements, where the performance of the water-cooled node was about 3% lower. We suspect this performance difference is due to the different VR programming on Haswell and Skylake.

5.4. Conclusions on the impact of cooling on power and performance

We have seen that the impact of cooling on voltage, frequency, power and performance depends mainly on four parameters: the type of instructions executed by the workload, the UEFI setting, the processor temperature and the processor type.

For AVX-512-intensive applications (like HPL), we see that the UEFI “maximum performance” setting and water cooling deliver an increased performance (about 10%) at the expense of a higher power consumption (about 13%) versus air cooling. In this case, the 55°C inlet water temperature delivers the best performance due to the VRs, which increase the processor voltage as the inlet water temperature and the CPU temperature decrease. With the UEFI “efficiency-favor performance” setting, we note a performance/power difference between Haswell and Skylake. With Haswell, water cooling can provide up to 2% of power saving at equal performance, while on Skylake water cooling delivers 1% less power but with 3% less performance. We explain this difference by the different reactions of the Haswell and Skylake VRs to a colder CPU temperature.

For non-AVX applications (like BQCD), the UEFI “efficiency-favor performance” setting is best for both performance and power consumption. Water cooling delivers a power saving of about 2% with no performance degradation versus air cooling. With the UEFI “maximum performance” setting, water cooling does not enable a higher frequency than the UEFI “efficiency-favor performance” setting, due to an increased processor voltage in reaction to CPU temperatures below the VRs’ normal operating range of 70–80°C.

For all instruction types with the UEFI “efficiency-favor performance”, the lowest inlet water temperature delivers the lowest power consumption due to the reduction in processor power leakage.

Regarding Turbo, Turbo ON delivers the best performance with AVX-512 instructions and the UEFI “maximum performance” setting, while Turbo OFF delivers the best performance and power consumption with the UEFI “efficiency-favor performance” setting. With non-AVX instructions, Turbo ON delivers the highest frequency whatever the UEFI setting, and delivers the best performance for CPU-bound applications only.

  1. Available at: https://www.top500.org/project/linpack [Accessed April 30, 2019].