4
Power Consumption of Servers and Workloads

This chapter covers the theoretical power and performance characteristics of Intel Xeon processors and NVIDIA GPUs across several generations. It also presents power, thermal and performance measurements of air-cooled platforms equipped with an Intel Xeon processor and NVIDIA GPU running a selection of workloads.

While we take Intel processors and NVIDIA GPUs as examples, CPUs and GPUs developed by other vendors such as AMD (Mutjaba 2018; Wikipedia n.d.) or ARM (ARM n.d.) follow the same trends, as the laws of physics are the same for all.

4.1. Trends in power consumption for processors

Table 4.1 presents the Thermal Design Power (TDP), cores per chip, lithography size, core frequency, number of transistors per chip and peak floating-point performance in single precision (SP) and double precision (DP) gigaflops (GFlops) for Intel Xeon processors from Woodcrest in 2006, which was the first Intel Core microarchitecture1, up to Skylake in 2017. The TDP of a processor is the maximum power it can dissipate without exceeding the maximum junction temperature this CPU can sustain. SP and DP GFlops are the theoretical peak performance of all cores on the socket. This peak performance is obtained by multiplying the number of cores on the socket by the core frequency and by the number of SP or DP operations the processor is theoretically capable of executing in one processor cycle. SP and DP are also sometimes called FP32 and FP64 since SP and DP floating-point numbers are represented with 32 and 64 bits. The information in this table is extracted from the Intel Xeon product specifications2, with the exception of the number of transistors for Skylake, which we estimated since Intel no longer publishes this information. It should be noted that SP and DP peak performance are computed using the nominal frequency, not the Turbo frequency, which is addressed in section 4.1.3.
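
As a quick check of this formula, here is a minimal Python sketch recomputing the Skylake 8180 entry of Table 4.1, taking the per-cycle operation counts from Table 4.3:

```python
# Peak GFlops = number of cores x core frequency (GHz) x floating-point operations per cycle.
def peak_gflops(cores, freq_ghz, flops_per_cycle):
    return cores * freq_ghz * flops_per_cycle

# Intel Xeon 8180 (Skylake): 28 cores at 2.5 GHz, 64 SP / 32 DP flops per cycle (AVX-512 + FMA).
print(peak_gflops(28, 2.5, 64))  # 4480.0 SP GFlops, as in Table 4.1
print(peak_gflops(28, 2.5, 32))  # 2240.0 DP GFlops
```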

Table 4.1. Intel Xeon processors characteristics from 2006 to 2017

Release date | Microarchitecture | Processor model | Cores/chip | TDP (W) | Lithography (nm) | Transistors (billions) | Core frequency (GHz) | Peak SP (GFlops) | Peak DP (GFlops)
June 26, 2006 | Woodcrest | Intel Xeon 5160 | 2 | 80 | 65 | 0.291 | 3.0 | 48 | 24
November 12, 2007 | Harpertown | Intel Xeon 5460 | 4 | 120 | 45 | 0.820 | 3.16 | 101 | 51
March 30, 2009 | Nehalem | Intel Xeon 5580 | 4 | 130 | 45 | 0.731 | 3.20 | 102 | 51
March 16, 2010 | Westmere | Intel Xeon 5690 | 6 | 130 | 32 | 1.170 | 3.46 | 166 | 83
May 1, 2012 | Sandy Bridge | Intel Xeon E5-2690 | 8 | 135 | 32 | 2.270 | 2.90 | 371 | 186
October 1, 2013 | Ivy Bridge | Intel Xeon E5-2697v2 | 12 | 130 | 22 | 4.310 | 2.70 | 518 | 259
September 9, 2014 | Haswell | Intel Xeon E5-2699v3 | 18 | 145 | 22 | 5.560 | 2.30 | 1325 | 662
March 1, 2016 | Broadwell | Intel Xeon E5-2699v4 | 22 | 145 | 14 | 7.200 | 2.20 | 1549 | 774
July 11, 2017 | Skylake | Intel Xeon 8180 | 28 | 205 | 14 | 13.086 | 2.50 | 4480 | 2240

Figures 4.1 and 4.2 plot how lithography, peak performance measured in SP GFlops, and TDP have evolved over time. Given that the ratio of SP to DP Flops is constant and equal to 2 for these processors, the DP GFlops graph would be similar to the SP GFlops graph.


Figure 4.1. Lithography and peak performance (SP GFlops) for Xeon architectures


Figure 4.2. Lithography and TDP for Xeon architectures

In Figure 4.1, the lithography curve in blue shows the well-known Intel “tick-tock” model, where a new microarchitecture is introduced at the same lithography size (“tock”), while a new, finer lithography (“tick”) is introduced on the same microarchitecture. This model reduces the risk of introducing two innovations at the same time, knowing that a new microarchitecture will improve chip performance at the same lithography size, and a new lithography will improve performance with the same microarchitecture. For example, “ticks” introduce finer lithography from Woodcrest to Harpertown, Nehalem to Westmere, Sandy Bridge to Ivy Bridge and Haswell to Broadwell, while “tocks” introduce new microarchitectures from Harpertown to Nehalem, Westmere to Sandy Bridge, Ivy Bridge to Haswell and Broadwell to Skylake. Figure 4.1 shows the exponential improvement of peak performance over time, known as “Moore’s law”, which we explain in more detail in section 4.1.1. Figure 4.1 also shows the major impact of microarchitecture changes, which we explain in sections 4.1.2 and 4.1.3.

Figure 4.2 shows the evolution of lithography size and TDP over the same period.

We note that TDP was fairly flat from Harpertown to Broadwell, while it increased sharply from Woodcrest to Harpertown and from Broadwell to Skylake. The increase from Woodcrest to Harpertown comes from the following: the Harpertown chip has many more transistors (about 2.8×, including a much larger cache, from 4 to 12 MB) running at about the same frequency as Woodcrest, and the lithography improvement (from 65 to 45 nm) allows these transistors to be packed into the same package size. At the same time, the voltage reduction (from 1.0 to 0.85 V) was not enough to compensate for the increased transistor density and the slightly higher frequency (from 3.0 to 3.16 GHz), leading to an increased TDP (120 W vs. 80 W) and a violation of Dennard’s scaling law (see next section). The increased TDP from Broadwell to Skylake is due to the new microarchitecture introducing very powerful instructions (the AVX-512 instructions covered in section 4.1.2) with longer registers, leading to a larger chip at the same lithography and a similar frequency. A similar trend toward higher TDP will also be seen for GPUs in section 4.2.

As theoretical peak performance does not reflect the actual performance of a processor, Table 4.2 presents the same information as in Table 4.1 based on the SPEC_fp benchmark instead of theoretical peak performance. SPEC_fp is the floating-point performance of a core measured by the SPEC CPU 2006 benchmark (SPEC 2006), whereas SPEC_fp rate is the floating-point performance measured using all cores on the socket.

Table 4.2. Intel processors characteristics from 2006 to 2017 with SPEC_fp

Release date | Microarchitecture | Processor model | Cores/chip | TDP (W) | Lithography (nm) | Transistors (billions) | Core frequency (GHz) | SPEC_fp | SPEC_fp rate
June 26, 2006 | Woodcrest | Intel Xeon 5160 | 2 | 80 | 65 | 0.291 | 3.0 | 17.7 | 45.5
November 12, 2007 | Harpertown | Intel Xeon 5460 | 4 | 120 | 45 | 0.820 | 3.16 | 25.4 | 79.6
March 30, 2009 | Nehalem | Intel Xeon 5580 | 4 | 130 | 45 | 0.731 | 3.20 | 41.1 | 202
March 16, 2010 | Westmere-EP | Intel Xeon 5690 | 6 | 130 | 32 | 1.170 | 3.46 | 63.7 | 273
May 1, 2012 | Sandy Bridge | Intel Xeon E5-2690 | 8 | 135 | 32 | 2.270 | 2.90 | 94.8 | 507
October 1, 2013 | Ivy Bridge | Intel Xeon E5-2697v2 | 12 | 130 | 22 | 4.310 | 2.70 | 104 | 696
September 9, 2014 | Haswell | Intel Xeon E5-2699v3 | 18 | 145 | 22 | 5.560 | 2.30 | 116 | 949
March 1, 2016 | Broadwell | Intel Xeon E5-2699v4 | 22 | 145 | 14 | 7.200 | 2.20 | 128 | 1160
July 11, 2017 | Skylake | Intel Xeon 8180 | 28 | 205 | 14 | 13.086 | 2.50 | 155 | 1820

4.1.1. Moore’s and Dennard’s laws

Moore’s law was introduced by Moore (1965). It postulates that reducing the size of transistors leads to more and more transistors per chip at the cost-effective optimum, resulting in a doubling of the number of transistors every 2 years. This period was later changed to 18 months.

Dennard’s scaling, also known as MOSFET scaling (Dennard et al. 1974), was introduced in 1974. It states that as transistors get smaller, their power density stays constant, so that power use stays in proportion to area. Combined with Moore’s law, it implies that performance per watt grows exponentially at roughly the same rate as Moore’s law.

As Moore’s and Dennard’s scaling laws are empirical, let us check how Intel Xeon processors followed these laws from 2006 to 2017.

Figure 4.3 plots the number of transistors in each chip from 2006 to 2017 as extracted from Table 4.1.


Figure 4.3. Number of transistors and TDP for Xeon architectures

In Figure 4.3 and Table 4.1, we note that the number of transistors has steadily increased at every chip generation except from Harpertown to Nehalem. Over 11 years, Moore’s law predicts that the number of transistors should grow by a factor of 45.2 with a doubling every 2 years, and by a factor of 155 with a doubling every 18 months. As the number of transistors grew by a factor of 44.9, we see that Moore’s law has been followed with a doubling every 2 years. The decrease in the number of transistors from Harpertown to Nehalem is due to a change of microarchitecture: Nehalem introduced an on-chip memory controller in place of the front-side bus (FSB), which greatly increased the sustained performance of the processor and allowed the cache size to be reduced from 12 to 8 MB.
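
The doubling periods quoted in this section can be recomputed from the observed growth factors; a small Python sketch using the factors given in the text:

```python
import math

def doubling_period_months(growth_factor, span_months):
    # Doubling period implied by an observed growth factor over a given time span.
    return span_months / math.log2(growth_factor)

span = 11 * 12  # Woodcrest (2006) to Skylake (2017)
print(doubling_period_months(44.9, span))  # ~24 months for the transistor count
print(doubling_period_months(93, span))    # ~20 months for SP peak performance (factor of 93, Figure 4.4)
print(doubling_period_months(40, span))    # ~25 months for SPEC_fp rate (factor of 40, Figure 4.4)
```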

Looking ahead, while a new lithography size was introduced about every 30 months from 2006 to 2016, going from 65 to 14 nm, Intel recently announced (Shenoy 2018) that the next Xeon processor, Cascade Lake, would still use 14 nm and that only Ice Lake in 2020 will use 10 nm, showing a deceleration of Moore’s law. Other manufacturers like AMD will introduce smaller lithography sooner, such as 7 nm (AMD Zen2 2018). While we see a deceleration in lithography improvements, there is no deceleration in the growth of the number of cores.

As Moore’s law is also sometimes interpreted as the doubling of performance, let us check how this law has been followed using different performance metrics.

Figure 4.4 plots the SP peak performance and SPEC_fp rate growth of Intel Xeon processors from 2006 to 2017.


Figure 4.4. SP and SPEC_fp rate performance for Xeon architectures

Taking SP peak GFlops as the performance metric, we note that SP peak performance has grown by a factor of 93, showing a doubling of performance every 20 months, leveraging the effect of new microarchitectures and new instructions such as AVX2 and AVX-512 (see Table 4.3). Taking SPEC_fp rate as the performance metric, SPEC_fp rate has grown by a factor of 40, following Moore’s law with a doubling of performance every 24 months. This lower improvement for SPEC_fp rate versus peak performance happens because only a few of the codes in the SPEC_fp benchmark make effective use of the new AVX2 or AVX-512 instructions. We will see this again when we look at real applications in section 5.1.6.

Figure 4.5 represents the performance per watt curve over the same period, taking both SP peak performance and SPEC_fp rate performance as metrics.


Figure 4.5. Performance per watt for Xeon architectures

It shows that SP peak performance per watt increased by a factor of 36.4 over 11 years, while SPEC_fp rate per watt increased by a factor of 15.6.

From a Dennard’s law perspective, SP per watt doubled every 21 months while SPEC_fp rate per watt doubled every 33 months. Therefore, depending on the performance metric used, we see again a different evolution if we look at peak performance or sustained performance measured by SPEC_fp rate. And if we look only at the evolution of SPEC_fp rate per watt from Nehalem to Skylake, we see it has been doubling only every 40 months, showing a deceleration of Dennard’s law that is causing serious power and energy issues.

In conclusion, to keep up with Moore’s law, Intel Xeon processors are increasing TDP, and this trend will not stop with the next processor generation, leading to a major power problem. From a Dennard’s law perspective, taking SPEC_fp as reference, performance per watt growth rate is slower than Moore’s law and slowing down significantly.

4.1.2. Floating point instructions on Xeon processors

Table 4.3 presents the different instruction sets Intel introduced from Woodcrest to Skylake, with their floating-point characteristics and the number of SP or DP operations they can theoretically perform per cycle.

Table 4.3. Intel Xeon microarchitecture characteristics

Microarchitecture | Instruction set | Register length (bits) | FP execution units | SP Flops/cycle | DP Flops/cycle
Woodcrest | SSE3 | 128 | 2 FP 128 | 8 | 4
Harpertown | SSE4 | 128 | 2 FP 128 | 8 | 4
Nehalem | SSE4 | 128 | 2 FP 128 | 8 | 4
Sandy Bridge | AVX | 256 | 2 FP 256 | 16 | 8
Haswell/Broadwell | AVX2 & FMA | 256 | 2 FP FMA 256 | 32 | 16
Skylake | AVX-512 & FMA | 512 | 2 FP FMA 512 | 64 | 32

Over time and across the different Xeon microarchitectures, Intel processors have supported more complex and powerful instructions able to produce more Flops per cycle, which is one reason processors have been able to keep up with Moore’s law from a performance perspective, but at the expense of increased TDP, as discussed in the previous section.

Fused multiply-add (FMA) and Advanced Vector Extensions (AVX) instructions are very good examples of such powerful instructions. SSE and AVX instructions are SIMD (single instruction, multiple data) instructions introduced to operate on larger numbers of bits (128 bits for SSE, 256 and 512 bits for AVX) and are therefore capable of producing more operations in one cycle, as shown in Table 4.3. The FMA instruction set is an extension to the Streaming SIMD Extensions (SSE) and Advanced Vector Extensions (AVX) to perform FMA operations.
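
The Flops-per-cycle figures in Table 4.3 follow directly from the register length and the number of execution units; a rough Python sketch, assuming two floating-point units per core and counting an FMA as two operations:

```python
def flops_per_cycle(register_bits, fma=False, fp_units=2, precision_bits=32):
    # Operations per cycle = execution units x SIMD lanes x (2 if each instruction is an FMA).
    lanes = register_bits // precision_bits
    return fp_units * lanes * (2 if fma else 1)

print(flops_per_cycle(128))                               # 8 SP flops/cycle (SSE, Woodcrest)
print(flops_per_cycle(256))                               # 16 SP flops/cycle (AVX, Sandy Bridge)
print(flops_per_cycle(512, fma=True))                     # 64 SP flops/cycle (AVX-512 + FMA, Skylake)
print(flops_per_cycle(512, fma=True, precision_bits=64))  # 32 DP flops/cycle (Skylake)
```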

FMA is a floating-point multiply-add operation a+b×c performed in one step, with a single rounding. Where an unfused multiply-add would compute the product b×c, round it to N significant bits, add the result to a and round again to N significant bits, a fused multiply-add computes the entire expression a+b×c to full precision before rounding the final result to N significant bits. How effectively applications can make use of these complex instructions depends on the type of operations each application uses and on what the compiler is able to generate. For example, if a code only uses additions, subtractions or multiplications but no FMA, then the potential Flops per cycle is divided by two, unless the compiler can generate FMA operations by itself by fusing adjacent multiplications and additions.
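
The effect of the single rounding can be made visible numerically; the sketch below emulates a float32 FMA by computing the product exactly in float64 and rounding only once (the values are chosen so that the intermediate rounding matters):

```python
import numpy as np

b = np.float32(1 + 2**-12)
c = np.float32(1 + 2**-12)
a = np.float32(-(1 + 2**-11))

# Unfused: b*c is rounded to float32 first, which drops the low-order 2**-24 term of the product.
unfused = b * c + a

# Fused (emulated): the product is exact in float64, a is added, and the result is rounded once.
fused = np.float32(np.float64(b) * np.float64(c) + np.float64(a))

print(unfused)  # 0.0
print(fused)    # 5.9604645e-08, i.e. 2**-24
```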

Examples showing the effective use of these instructions on different workloads will be presented in section 5.1.

4.1.3. CPU frequency of instructions on Intel Xeon processors

Before Nehalem, Intel processors operated at a reference frequency called the nominal frequency, meaning all instructions were executed at this base nominal frequency. With Nehalem, Intel introduced Turbo Boost Technology (TBT). TBT is a microprocessor technology developed by Intel that attempts to enable temporarily higher performance by opportunistically and automatically changing the P-state (frequency/voltage pair, according to the ACPI terminology presented in section 4.3). This feature automatically kicks in on TBT-enabled processors when there is sufficient headroom, subject to power rating, temperature rating and current limits.

With Intel TBT, instructions can be executed at a frequency higher than nominal as long as the processor, running at nominal frequency, stays below its TDP. Therefore, under TBT, instructions can be boosted to a frequency higher than nominal until the processor reaches its TDP. Later, with Sandy Bridge, Intel introduced Intel Turbo Boost Technology 2.0 and then Intel Turbo Boost Max 3.0 in 2016 with Broadwell (WikiChip n.d.), which we will present in section 5.1.

With the introduction of AVX 2.0 instructions (and later AVX-512 instructions), as these more powerful instructions use more power than SSE instructions at the same frequency, Intel introduced an AVX 2.0 base frequency and an AVX-512 base frequency lower than the non-AVX base frequency, to take into account these differences in power consumption depending on the type of instructions executed.

Table 4.4 shows the base frequency for non-AVX, AVX 2.0 and AVX-512 instructions for different Xeon Skylake SKUs (Intel Xeon Scalable Specifications 2018). Columns show the maximum core frequency in Turbo mode for each number of active cores.


Table 4.4. Base and Turbo frequency for non-AVX, AVX 2.0 and AVX-512 instructions on 6xxx Skylake processors

As we can see, the AVX-512 base frequencies are lower than AVX 2.0 base frequencies, which themselves are lower than non-AVX frequencies.

4.2. Trends in power consumption for GPUs

The trend of increasing power consumption, number of transistors and performance per chip we described in section 4.1 is not unique to processors. Accelerators and GPUs have been following the same evolution. Table 4.5 presents, for NVIDIA GPUs from 2009 to 2017, the same data as Table 4.1 does for Intel Xeon processors. The content is gathered from GPU Database (n.d.) and NVIDIA (2017). This table does not report the Volta Tensor cores, which we will discuss in section 4.4.

Table 4.5. NVIDIA GPUs characteristics from 2009 to 2017

Microarchitecture and GPU model | Chip | # of CUDA cores | TDP (W) | Lithography (nm) | Transistors (billions) | CUDA core frequency (GHz) | Peak SP (TFlops) | Peak DP (TFlops) | Ratio SP to DP
Tesla T10 | GT200 | 240 | 188 | 55 | 1.4 | 1.29 | 0.62 | 0.08 | 8
Fermi M20 | GF100 | 448 | 247 | 40 | 3.2 | 1.15 | 1.03 | 0.52 | 2
Kepler K10 | 2 x GK104 | 3072 | 225 | 28 | 7.1 | 0.74 | 4.55 | 0.19 | 24
Kepler K80 | 2 x GK210 | 4992 | 300 | 28 | 14.2 | 0.56 | 5.59 | 1.86 | 3
Pascal P100 SXM2 | GP100 | 3584 | 300 | 16 | 15.3 | 1.48 | 10.61 | 5.30 | 2
Pascal P100 PCIe | GP100 | 3584 | 250 | 16 | 15.3 | 1.30 | 9.32 | 4.66 | 2
Volta V100 SXM2 | GV100 | 5120 | 300 | 12 | 21.1 | 1.45 | 14.89 | 7.44 | 2
Volta V100 PCIe | GV100 | 5120 | 250 | 12 | 21.1 | 1.37 | 14.03 | 7.01 | 2

The theoretical SP peak Flops is computed from the base clock frequency of the CUDA cores and not their Max Boost clock frequency, to be comparable with the way the theoretical peak is reported for Intel Xeon processors in Table 4.1. The theoretical DP peak Flops varies depending on the NVIDIA GPU generation, with a varying ratio of SP to DP Flops3 as shown in the last column of Table 4.5, while this ratio was constant and equal to 2 for Intel Xeon processors.
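
For reference, the GPU peak numbers can be recomputed the same way as for CPUs, assuming each CUDA core performs one FMA (two flops) per cycle; a minimal sketch (small differences with Table 4.5 come from the rounding of the clock values):

```python
# Peak SP TFlops = CUDA cores x 2 flops per cycle (one FMA) x base clock (GHz) / 1000.
def gpu_peak_sp_tflops(cuda_cores, base_clock_ghz):
    return cuda_cores * 2 * base_clock_ghz / 1000

print(gpu_peak_sp_tflops(3584, 1.48))  # ~10.6 TFlops, Pascal P100 SXM2
print(gpu_peak_sp_tflops(5120, 1.45))  # ~14.8 TFlops, Volta V100 SXM2 (Table 4.5 lists 14.89)
```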

While lithography size has followed a similar evolution, we note several major differences between Tables 4.1 and 4.5. The first difference is the number of transistors, which is higher for NVIDIA GPUs than for Intel Xeon processors at equivalent lithography size, due to the fact that GPU chips are larger and have more transistors than Intel Xeon chips. The second difference is a higher TDP for NVIDIA GPUs versus Intel Xeon processors at equivalent lithography size and release date. The third is the performance difference, which, if we take SP Flops, shows a threefold delta (14 TFlops vs. 4.5 TFlops). This large difference comes from the fact that GPUs are not general-purpose processors and therefore use more specialized units, which take less die area and allow a much higher number of processing cores. This advantage is balanced by the fact that GPUs are coprocessors requiring specific instructions, adding programming complexity and leading to a wide variance in performance depending on the percentage of instructions running on the GPU versus the CPU.

Figures 4.6 and 4.7 plot the lithography size versus SP peak performance and lithography size versus TDP for NVIDIA GPUs similar to Figures 4.1 and 4.2 for Intel Xeon.


Figure 4.6. Lithography and peak performance (SP GFlops) for NVIDIA GPUs


Figure 4.7. Lithography and TDP (Watt) for NVIDIA GPUs

4.2.1. Moore’s and Dennard’s laws

Figure 4.8 plots the number of transistors versus TDP.


Figure 4.8. TDP and transistors over NVIDIA architectures

Comparing Figure 4.8 with Figure 4.3 shows that GPUs have about twice as many transistors as CPUs, as explained before, while the number of transistors per watt is about the same (70 million transistors per watt for the V100 vs. 64 million transistors per watt for Skylake).
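
The transistors-per-watt figures quoted here follow directly from Tables 4.1 and 4.5:

```python
# Transistors per watt = transistor count / TDP.
print(21.1e9 / 300 / 1e6)    # ~70 million transistors per watt (NVIDIA V100, Table 4.5)
print(13.086e9 / 205 / 1e6)  # ~64 million transistors per watt (Intel Xeon 8180, Table 4.1)
```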

From a Moore’s law perspective for transistors, the increase has been 15× over 98 months, leading to a doubling every 25 months, which is quite well aligned with Intel Xeon, as are the TDP increase of 1.6× and the lithography progress of 4.6×.

From a Moore’s law perspective for peak performance, SP peak Flops increased 24×, leading to a doubling every 21.5 months, while DP peak Flops increased 96×, which is due to the microarchitecture improvements for DP instructions. A similar evolution is leading to the introduction of Tensor cores for AI operations, as will be discussed in section 4.4.


Figure 4.9. SP and DP peak GFlops per watt for NVIDIA GPUs. For a color version of this figure, see www.iste.co.uk/brochard/energy.zip

As shown in Figure 4.9, from a Dennard’s law perspective, SP per watt has increased by 15×, doubling every 25 months, which is similar to Intel Xeon doubling every 21 months (Figure 4.5). DP per watt increased 60×, again due to microarchitecture improvements.

4.3. ACPI states

Advanced Configuration and Power Interface (ACPI) is an open industry specification co-developed by different vendors. ACPI establishes industry-standard interfaces enabling operating system (OS)-directed configuration, power management and thermal management of mobile, desktop and server platforms. Since 2013, the UEFI Forum has taken over the ACPI specification, but it is still referred to as ACPI (ACPI 2014).

Figure 4.10 gives an overall description of the ACPI states.


Figure 4.10. ACPI states. For a color version of this figure, see www.iste.co.uk/brochard/energy.zip

ACPI states define the activity state of all major components in a system: the overall system, the devices, the memory and the CPU. P0 and D0, for example, are the states in which the processor or device is fully active, down to the states of minimum power consumption Pn and Dn. The number of states for each component is defined by the designer of these components, and it has increased a lot since 1999 when ACPI was first published.

G-states, in Figure 4.11, are global system states that define what operational state the entire server is in.


Figure 4.11. ACPI G-states


Figure 4.12. ACPI S-states

S-states are system sleep states that define what sleep state the entire server is in.

When servers are idle, S3 state can save large amounts of power while keeping the latency at an acceptable level. This feature was used by IBM LoadLeveler Energy Aware Scheduler (IBM Knowledge Center n.d.) to minimize idle node power consumption.

Figure 4.13 presents the typical power saving and latency associated with the different S-states.


Figure 4.13. Power saving and latency of S-states

The most familiar states users are aware of, and the ones of most interest for us in this section, are the CPU P-states and C-states.

The OS places the CPU in different P-states depending on the amount of power needed to complete the current task. This is the difference from the CPU C-states: C-states are idle power saving states, while P-states are execution power saving states.

C-states are CPU power saving states. The CPU transitions to C-states higher than C0 when it is idle after a period of time. All C-states occur when the server is in S0 state and G0 state.

Figure 4.14 presents a summary of each C-state with their implementation for Intel Nehalem architecture.


Figure 4.14. Example of C-states on Intel Nehalem

CPU package refers to all the hardware contained in a CPU chip. “Uncore” refers to all the hardware except for the CPU cores. C-states can operate on each core separately or the entire CPU package. Core C-state transitions are controlled by the OS. Package C-state transitions are controlled by the hardware. The number of C-states and the savings associated with each is dependent on the specific type and SKU of CPU used.

Figure 4.15 presents a summary of each P-state. P-states are CPU performance states. The OS places the CPU in different P-states depending on the amount of power needed to complete the current task.


Figure 4.15. Example of P-states

For example, if a 2-GHz CPU core only needs to run at 1 GHz to complete a task, the OS may place the CPU core into a higher-numbered P-state. P-states operate on each core separately. The OS controls the transitions among the P-states. The number of P-states and the frequency and voltage associated with each are dependent on the specific type and SKU of CPU used. The BIOS can restrict the total number of P-states revealed to the OS and can change this on the fly.
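
On a Linux server, the P-state and C-state information exposed to the OS can be inspected from sysfs through the cpufreq and cpuidle subsystems; a minimal sketch (it assumes a standard Linux sysfs layout and reads core 0 only):

```python
from pathlib import Path

cpu0 = Path("/sys/devices/system/cpu/cpu0")

# P-state side: current frequency (kHz), allowed range and the governor chosen by the OS.
for name in ("scaling_cur_freq", "scaling_min_freq", "scaling_max_freq", "scaling_governor"):
    print(name, (cpu0 / "cpufreq" / name).read_text().strip())

# C-state side: the idle states the kernel can request for this core.
for state in sorted((cpu0 / "cpuidle").glob("state*")):
    print(state.name, (state / "name").read_text().strip())
```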

Figure 4.16 shows how frequency and voltage vary across the different P-states. At P0, frequency and voltage are at their maximum. Voltage is then reduced down to Vmin (the minimum CPU voltage); from Vmax to Vmin, voltage and frequency vary linearly. Beyond this point, lower P-states are reached by reducing frequency only, until the frequency reaches Fmin at Pn.


Figure 4.16. P-states, voltage and frequency

It should be noted that the ranking of P-states we described above, with P0 being the nominal frequency, is valid when Turbo is not activated. When Turbo is selected (WikiChip n.d.), state P0 is the Turbo frequency, state P1 is the nominal frequency, and the frequency that corresponded to state Pi when Turbo is not activated corresponds to state Pi+1 when Turbo is activated.

4.4. The power equation

In Table 4.1, we saw that the number of cores on an Intel Xeon processor socket increased by a factor of 14× while its frequency decreased by 20% and its TDP increased by 2.5×. On the other hand, for NVIDIA GPUs, we saw in Table 4.5 that the number of cores increased by a factor of 14×, while the frequency has been about flat and the TDP increased by a factor of 1.6×.

Why such an evolution?

TDP of a processor is the maximum power it can dissipate without exceeding the maximum junction temperature the CPU can sustain.

There are two major factors contributing to the CPU power consumption, the dynamic power consumption and the power loss due to transistor leakage currents:

P = Pdyn + Pleak [4.1]

By Ohm’s law, the dynamic power Pdyn consumed by a processor is given by:

Pdyn = C × V² × f [4.2]

where C is capacitance, f is frequency and V is voltage, which means the dynamic power increases quadratically with the voltage and linearly with the frequency.

While the dynamic power consumption is dependent on the clock frequency, the leakage power Pleak is dependent on the CPU supply voltage. We will come back to the power leakage in section 5.3.

If we look at how voltage and frequency vary with ACPI P-states (Figure 4.16), we see that between the P0 state and the P-state corresponding to the minimum voltage (Pm with m < n, where n is the highest possible P-state), voltage and frequency vary linearly.

Therefore, between P0 and Pm, which is the range where a processor is executing workloads, [4.2] can be approximated with:

Pdyn ≈ C′ × f³ [4.3]

where C′ is a constant absorbing the capacitance and the voltage-frequency proportionality. It shows that dynamic power increases as the cube of frequency, and how reducing the frequency while an application is running can significantly reduce the power consumption of a server.

Let us take an example. According to [4.3], at the same lithography, decreasing the frequency of a processor by 20% reduces its dynamic power to (0.80)³, or about 50%, of its original value. Therefore, as power is halved, twice as many transistors would consume the same power and provide a theoretical performance of 1.6 (2 × 0.80) times the theoretical performance of the original processor. On the other hand, if we were to increase the frequency of the original processor to reach the same peak performance, its power consumption would be multiplied by a factor of 1.6³ ≈ 4.
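
A minimal sketch of this trade-off, using the cubic approximation of [4.3] (the constant C′ cancels out in the ratios):

```python
# Relative dynamic power under the cubic approximation Pdyn ~ f^3.
def relative_power(freq_ratio):
    return freq_ratio ** 3

# Reducing frequency by 20% drops dynamic power to about half of the original.
print(relative_power(0.80))      # ~0.51

# Two such chips fit in the original power budget and deliver 2 x 0.80 = 1.6x the performance.
print(2 * relative_power(0.80))  # ~1.02, i.e. roughly the original power

# Reaching the same 1.6x performance by raising the frequency of the original chip instead:
print(relative_power(1.6))       # ~4.1x the original power
```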

This power equation is the logic driving the multicore and many-core eras, which all CPU and GPU manufacturers have been following for about 10 years, steadily increasing the number of cores at roughly constant frequency. The number of cores differs significantly between CPUs (in the tens) and GPUs (in the thousands) since a GPU core is much simpler, requiring fewer transistors, less die space and less power than a more complex CPU core. This is shown in Table 4.1 for the Xeon CPUs, where the number of cores has steadily increased from the dual-core Xeon Woodcrest to the Skylake 8180 with 28 cores per socket. For the same reasons, the number of cores will keep increasing, as with the second-generation Intel Xeon Cascade Lake with up to 56 cores per socket (Cascade Lake 9282 2019) and the AMD EPYC Zen2 (Mutjaba 2018) with up to 64 cores per socket using 7 nm lithography and a core frequency around 2.3 GHz. A similar evolution is shown in Table 4.5 for NVIDIA GPUs, which, like Blue Gene systems (IBM Icons n.d.), are playing an increasingly important role with thousands of simple cores running at low frequency, leading to greater performance per watt.

It is important to remark that this trend of increasing the number of cores has recently taken a new turn with the introduction of more specialized cores, such as the Tensor cores in the NVIDIA Volta GPU4 and in the Google Tensor Processing Unit (TPU)5. Like the CUDA cores, which were designed by NVIDIA to optimally perform vector SP (FP32) and DP (FP64) operations heavily used by graphics and HPC applications, Tensor cores were designed to optimally perform AI operations. On NVIDIA Volta, Tensor cores implement a mixed-precision FMA operation with FP16 multiply and FP32 accumulation to perform the 4×4 matrix multiply-and-accumulate operations heavily used by AI training and inference. It should be noted that inference sometimes uses only INT8 numbers (integers represented on 8 bits), as on the Google TPU 1.0, while TPU 2.0 and TPU 3.0 have been extended to also perform FP16 operations on top of INT8.
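
As an illustration of the mixed-precision operation, here is a short NumPy sketch emulating it in software (this is not how Tensor cores are actually programmed): the inputs are FP16 matrices and the products are accumulated in FP32.

```python
import numpy as np

# 4x4 FP16 inputs, the tile size consumed by a Volta Tensor core.
A = np.random.rand(4, 4).astype(np.float16)
B = np.random.rand(4, 4).astype(np.float16)
C = np.random.rand(4, 4).astype(np.float32)

# D = A x B + C: FP16 values are multiplied and accumulated in FP32 (emulated here in software).
D = np.matmul(A.astype(np.float32), B.astype(np.float32)) + C
print(D.dtype, D.shape)  # float32 (4, 4)
```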

Table 4.6 compares peak SP (FP32) TFlops and Tensor (mixed FP16/FP32 precision) TFlops performance, as well as performance per watt (GFlops/W) where only the power of the CPU or GPU is taken into consideration, for the Xeon Skylake 8180 and the NVIDIA V100, to illustrate the trade-off between operation specialization, performance and performance per watt, as we did for SP Flops in Tables 4.1 and 4.5 and Figures 4.5 and 4.9.

Table 4.6. CPU and GPU performance and performance per watt

Processor | TDP (W) | SP TFlops | SP GFlops/W | Tensor TFlops | Tensor GFlops/W
Intel Skylake 8180 | 205 | 4.5 | 21.9 | NA | NA
NVIDIA Volta V100 | 300 | 14.9 | 49.6 | 125 | 416.7
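
The ratios discussed below can be checked directly from this table:

```python
# Performance per watt in GFlops/W = TFlops x 1000 / TDP.
print(4.5 * 1000 / 205)   # ~22 GFlops/W for the Skylake 8180 (Table 4.6 rounds to 21.9)
print(14.9 * 1000 / 300)  # ~50 GFlops/W for the V100 (Table 4.6 rounds to 49.6)

print(14.9 / 4.5)         # ~3.3x SP performance, GPU versus CPU
print(125 / 14.9)         # ~8.4x Tensor versus SP (FP32) performance on the V100
print(416.7 / 49.6)       # same ~8.4x ratio in performance per watt (same 300 W TDP)
```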

While the SP performance ratio of GPU versus CPU is ~3.3× and the SP performance per watt ratio is ~2.3×, the Tensor versus FP32 core comparison shows a factor of 8.4 both in performance and in performance per watt. The performance per watt comparison is a bit unfair to the CPU since the power used for the GPU does not include the CPU power of the host. We note also that all SP Flops reported in this table and in previous tables do not include Intel Turbo or NVIDIA GPU Boost, while NVIDIA publishes Tensor Flops at the GPU Boost clock only (NVIDIA V100 2018). Therefore, the Tensor performance is a bit optimistic (~5%), although this does not change the comparison much. A similar comparison has been reported by Jouppi et al. (2017), comparing an Intel Haswell CPU, an NVIDIA K80 GPU and a Google TPU 1.0 tensor processor on real AI workloads. In that paper, a much higher performance and performance per watt for the TPU 1.0 versus the CPU or GPU is reported, as the measurements are done with inference AI workloads using INT8 operations, where TPU 1.0 is best of breed.

  1. Available at: https://en.wikipedia.org/wiki/Intel_Core_(microarchitecture) [Accessed April 30, 2019].
  2. Available at: https://ark.intel.com/content/www/us/en/ark.html#@Processors [Accessed May 7, 2019].
  3. Fermi microarchitecture, available at: https://en.wikipedia.org/wiki/Fermi_(microarchitecture)#Performance; Kepler microarchitecture, available at: https://en.wikipedia.org/wiki/Kepler_(microarchitecture)#Performance; Maxwell microarchitecture, available at: https://en.wikipedia.org/wiki/Maxwell_(microarchitecture)#Performance; Pascal microarchitecture, available at: https://en.wikipedia.org/wiki/Pascal_(microarchitecture)#Performance; Tesla microarchitecture, available at: https://en.wikipedia.org/wiki/Tesla_(microarchitecture)#Performance; Volta microarchitecture, available at: https://en.wikipedia.org/wiki/Volta_(microarchitecture)#Performance. [All URLs accessed April 30, 2019].
  4. Available at: https://www.nvidia.com/en-us/data-center/tensorcore [Accessed April 30, 2019].
  5. Available at: https://cloud.google.com/tpu/docs/system-architecture#system_architecture [Accessed April 30, 2019].