Chapter 6

HPC Benchmarks for CFD

Abstract

This chapter is about the use of high-performance computing in computational fluid dynamics, a real devourer of computing resources. Various types of comparative studies are analyzed in detail and commented on in terms of interconnect, storage, and memory requirements. The ANSYS® Fluent®, CFX®, and OpenFOAM benchmarks provide an overview of the nature of problems run on various types of hardware.

Keywords

ANSYS CFX; ANSYS FLUENT; Benchmarks; CFD; HPC; OpenFOAM

1. How Big Should the Problem Be?

Computational fluid dynamics (CFD) problems are resource-hungry applications: they require a large amount of computing power to solve. Let us start with the smallest problem. Consider a one-dimensional (1D) partial differential equation. To solve it, you would need at least three points in space and, for a first-order accurate solution, a certain number of time steps (for example, 300). A single-core processor such as a Pentium IV with 512 MB of RAM will easily solve it in a few minutes. Now add some complexity: make it 2D and use more grid points. This needs more computational power, such as 1 GB of memory. Now make it 3D and solve the Navier–Stokes equations with 0.5 million grid points. This adds more complexity, and you would need at least a dual-core processor and 2 GB of RAM. For Reynolds-averaged Navier–Stokes (RANS) simulations this is a reasonably good machine. Even this may not be enough, because capturing the real physics usually means capturing turbulence, and for that it becomes mandatory to add a turbulence model to the simulation.
For wall-bounded flows, whether internal or external, the mesh is kept fine near the wall to capture near-wall viscous effects, whether turbulent or laminar. To limit computational expense, the mesh is usually stretched in a geometric progression away from the wall, which can make the cells very large at the far-field boundaries. Ideally, one would keep all the points equally spaced, but this would make the mesh extremely large. Even with stretching, once the mesh has grown, a dual-core PC is no longer adequate. The user may require a quad-core machine with 64-bit support so that the full 4 GB of RAM (or more) can be addressed, which a 32-bit operating system cannot do. To get maximum power out of the computer, all four cores can be used. In summary, it can be said that high power is required when:
1. There is a 3D problem to solve
2. There is a need to resolve turbulence
3. Flows involve recirculation, reversed flow, and strong gradients
4. There is unsteady flow
5. There is a need for direct numerical simulation, large eddy simulation (LES), or detached eddy simulation (DES).

Table 6.1

Number of cores suitable for a particular mesh size for Fluent simulation

Cluster size (no. of cores) | Fluent case size (no. of cells/mesh size) | No. of simultaneous Fluent simulations
8 | Up to 2–3 million | 1
16 | Up to 2–3 million | 2
16 | Up to 4–5 million | 1
32 | Up to 8–10 million | 1
32 | Up to 4–5 million | 2
64 | Up to 16–20 million | 1
64 | Up to 8–10 million | 2
64 | Up to 4–5 million | 4
128 | Up to 30–40 million | 1
256 | Up to 70–100 million | 1
256 | Up to 30–40 million | 2
256 | Up to 8–10 million | 4
256 | Up to 4–5 million | 16

Thus, an obvious question is how big the problem must be to run on a cluster for HPC. Performance benchmarks are done for this purpose. Guidelines based on the number of cores to be used with respect to the problem size are given in Table 6.1. The benchmark was performed on Sun machines [1]. It is common to see performance benchmarks for CFD with mesh size limits crossing 100 million grid points.
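As a rough guide, the Table 6.1 recommendations can be encoded as a simple lookup. The sketch below is based only on the Sun benchmark guidance above; the thresholds are read off the table and are an assumption, not a general rule.

```python
# A rough lookup encoding the Table 6.1 guidance (Sun benchmark [1]).
# The thresholds are read off the table; treat them as guidance only.

# (max case size in millions of cells, cores recommended for one simulation)
TABLE_6_1 = [(3, 8), (5, 16), (10, 32), (20, 64), (40, 128), (100, 256)]

def recommend_cores(cells_millions, simultaneous_jobs=1):
    """Suggest a core count for a Fluent case of the given size."""
    for max_cells, cores in TABLE_6_1:
        if cells_millions <= max_cells:
            # Table 6.1 roughly scales the cluster with the number of
            # simultaneous simulations of the same size.
            return cores * simultaneous_jobs
    raise ValueError("case larger than the table covers")

print(recommend_cores(8))      # 32 cores for an 8-million-cell case
print(recommend_cores(4, 2))   # 32 cores for two 4-5 million-cell jobs
```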

2. Maximum Capacity of the Critical Components of a Cluster

2.1. Interconnect

The interconnect sometimes becomes a bottleneck in cluster performance. This is especially frustrating when the cause is not obvious, because everything appears to work correctly. A fast interconnect usually solves the problem. The term "low latency, high bandwidth" is used in the networking field: the communication time between nodes must be as low as possible and the data transfer rate must be high, so that large data packets can be transferred quickly. Currently, the largest Infiniband vendor in the world is Mellanox.
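A simple first-order model is often used to reason about these two quantities: the time to move a message is roughly the latency plus the message size divided by the bandwidth. The sketch below illustrates this; the latency and bandwidth numbers are assumed, order-of-magnitude values, not measured or vendor-quoted figures.

```python
# First-order model of one message: time = latency + size / bandwidth.
# The latency and bandwidth figures below are assumed, order-of-magnitude
# values for illustration only, not measured or vendor-quoted numbers.

def transfer_time(size_bytes, latency_s, bandwidth_bytes_per_s):
    """Estimated time to move one message across the interconnect."""
    return latency_s + size_bytes / bandwidth_bytes_per_s

one_mb = 1024 * 1024
# Assumed ~1 us latency, ~5 GB/s bandwidth (Infiniband-class link)
print(transfer_time(one_mb, 1e-6, 5e9))
# Assumed ~50 us latency, ~0.125 GB/s bandwidth (gigabit Ethernet)
print(transfer_time(one_mb, 50e-6, 0.125e9))
```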
Figure 6.1 Infiniband versus gigabit ethernet.
Figure 6.1 shows a performance histogram of the ratio of Infiniband to gigabit Ethernet performance. Again, the reference benchmark is the Sun machine [1] running ANSYS® Fluent v. 12 (courtesy of Sun, Inc.). The data have two parts: one for four nodes with eight cores each and the other for four nodes with 16 cores each. For a small number of cores on a small mesh the two interconnects perform comparably, and the ratio does not reach high values; this indicates that for a low to medium-size mesh (4–10 million cells) on a small number of cores, gigabit Ethernet is sufficient. As the number of cores increases, Infiniband begins to pay off: the first bar peaks at 8.7, showing its advantage when more cores are used. The bars then shorten as the problem size grows, because with a larger mesh each node spends relatively more time computing than transferring data between nodes.

2.2. Memory

Memory requirements for the test cases span from a few hundred megabytes to about 25 GB. As the job is distributed over multiple nodes, the memory requirement per node is reduced correspondingly. As a starting point, 2 GB per core (e.g., 8 GB per dual-processor, dual-core node) is recommended. The total memory requirement for one Fluent 12 simulation on the cluster (distributed across multiple nodes) scales roughly linearly with the Fluent model size (measured in number of cells) and is on the order of the estimates listed in Table 6.2. The author has personally noticed that for steady simulations in almost all versions from 6 to 14, Fluent consumes significant memory mainly while reading the case and data files and while distributing the mesh over the compute nodes. Figure 6.2 shows the performance of DDR Infiniband relative to SDR (single data rate) Infiniband for different meshes on Sun clusters running Fluent 12. The best performance is obtained with a larger number of cores on a heavy mesh or with an intermediate number of cores on a medium mesh.

Table 6.2

Estimation of memory requirement for a particular problem size

Fluent case size (no. of cells/mesh size) | RAM requirements
2 million | 4 GB
5 million | 10 GB
50 million | 100 GB
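As a rough sketch, the figures in Table 6.2 work out to roughly 2 GB of RAM per million cells. The helper below encodes that rule of thumb and splits it across nodes; the 2 GB per million factor is an assumption read off the table, not an ANSYS formula, and actual usage depends on the models and solver settings.

```python
# Rule-of-thumb memory estimate inferred from Table 6.2: roughly 2 GB of
# RAM per million cells. This is an assumption based on the table, not an
# ANSYS formula; actual usage depends on models and solver settings.

GB_PER_MILLION_CELLS = 2.0

def required_ram_gb(cells_millions):
    """Total RAM for one Fluent case of the given size."""
    return GB_PER_MILLION_CELLS * cells_millions

def ram_per_node_gb(cells_millions, nodes):
    """Per-node memory when the case is distributed over several nodes."""
    return required_ram_gb(cells_millions) / nodes

print(required_ram_gb(50))       # ~100 GB, matching Table 6.2
print(ram_per_node_gb(50, 16))   # ~6.25 GB per node on 16 nodes
```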

2.3. Storage

Adequate storage capacity is also required to run ANSYS Fluent. Data file sizes created by ANSYS Fluent differ with the CFD simulation model size, which is usually measured by the number of cells. With unsteady simulations, the typical data file size increases because of the increasing amount of data to store at each time step. Typical file sizes for a steady case are shown in Table 6.3.

3. Commercial Software Benchmarks

3.1. ANSYS Fluent Benchmarks

ANSYS Fluent has several benchmarks available [2]. These benchmarks are versatile in that they contain problems of different scales and have been run on a number of different platforms. They are included here so that the reader can get an idea of how performance for a given problem depends on the type of hardware used. ANSYS reports benchmark results in terms of the performance rating, speedup, and efficiency. Each term is defined below:
1. Performance Rating: The performance rating is the basic measure used to report performance results of ANSYS Fluent benchmarks. It is defined as the number of benchmark runs that could be completed on a given machine (in sequence) in a 24-h period. It is computed by dividing the number of seconds in a day (86,400 s) by the number of seconds required to run the benchmark. A higher rating means faster performance.
Figure 6.2 Single data rate (SDR) and double data rate (DDR) Infiniband comparison.

Table 6.3

Storage needs for a particular problem setup in ANSYS Fluent

Fluent case size (no. of cells/mesh size) | Space requirements
2 million | 200 MB
5 million | 1 GB
50 million | 5 GB
2. Speedup: Speedup is the ratio of the wall-clock time required to complete a given calculation using a single processor to the time for the equivalent calculation performed on a parallel machine. Its value ranges from 0 to the number of processors used for the parallel run. When speedup is equal to the number of processors used, it is called perfect or linear. Sometimes speedup exceeds the number of processors. This is referred to as super-linear speedup and is often caused by the availability and use of larger amounts of fast memory (e.g., cache or local memory) compared with a single-processor run.
3. Efficiency: Efficiency is speedup normalized by the number of processors used, presented as a percentage. It indicates the overall use of the central processing units (CPUs) during a parallel calculation. An efficiency of 100% indicates that each CPU is completely occupied by computation during the run period and corresponds to linear speedup. An efficiency of 60% indicates that each CPU performs useful computation only 60% of the time; the remaining time is spent waiting for other operations, such as parallel communication or work on other processors, to complete. In what follows, the benchmark curves are shown in terms of the performance rating rather than the standard speedup. The reason is that speedup must be measured against a serial run, which is only possible if the benchmark is executed on a single core. Because each processor chip now contains at least four cores and some vendors report results starting from a full node rather than a single core, a single-core reference is not always available, as with the IBM machines below. The performance rating therefore gives an equivalent measure of scalability with almost the same trend as the usual speedup curves.
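As a small illustration of these three definitions, the metrics can be computed as follows; the wall-clock times used here are invented for the example, not benchmark data.

```python
# The three metrics defined above, applied to invented wall-clock times
# (the timings below are made up for illustration, not benchmark data).

SECONDS_PER_DAY = 86_400

def performance_rating(wall_time_s):
    """Number of benchmark runs that would fit in a 24-h period."""
    return SECONDS_PER_DAY / wall_time_s

def speedup(serial_time_s, parallel_time_s):
    """Single-processor time divided by parallel time."""
    return serial_time_s / parallel_time_s

def efficiency(serial_time_s, parallel_time_s, n_procs):
    """Speedup normalized by the number of processes, in percent."""
    return 100.0 * speedup(serial_time_s, parallel_time_s) / n_procs

wall_times = {1: 354.0, 20: 31.2, 40: 18.2}  # seconds per run (assumed)
for n, t in wall_times.items():
    print(n, round(performance_rating(t), 1),
          round(speedup(wall_times[1], t), 1),
          round(efficiency(wall_times[1], t, n), 1))
```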

3.1.1. Flow of Eddy Dissipation

The eddy case models a reacting flow using an eddy-dissipation combustion model together with k-ε turbulence; ANSYS Fluent used the k-ε model along with an implicit solver. It is reported in [2] that this simulation was run with 417,000 cells on a fully structured mesh. The benchmark results are shown in Figure 6.3. Three machines and several of their configurations were used, including well-known hardware from Bull, Fujitsu, and IBM. The Bull machine was a B710 with an Intel E5-2680 processor running at 2.8 GHz with turbo boost on; the operating system was Redhat 6 with FDR Infiniband connectivity by Mellanox. Two Fujitsu models were tested, with CPU speeds of 2.7 and 3 GHz, respectively. The IBM machines also differ in CPU rating: one is an IBM DX360 M3 with a 2.6 GHz processor and the other an IBM DX360 M4 with a 2.7 GHz processor. All used the same operating system, Redhat Linux 6, and FDR Infiniband from Mellanox for connectivity.
If you look carefully at Figure 6.3, you will see that the IBM machine with the 2.6 GHz processor takes the lead. Note, however, that this machine started at a minimum of 16 cores. The problem is not large enough to require a very high number of cores: around 256 cores all of the curves begin to fade, although the IBM 2.6 GHz curve is still fairly straight at 1024 cores compared with the others. Table 6.4 lists the other parameters described previously; the core solver speedup and core solver efficiency are also given. ANSYS Fluent defines solver efficiency on one core as 100% because there are no communication bottlenecks and no shared-memory contention. Because solver speedup and efficiency are tabulated with respect to a single core, these columns are shown as not applicable (N/A) for the IBM machines, whose minimum run used 16 cores.
Figure 6.3 Benchmarking curve for the problem of eddy dissipation on ANSYS Fluent 14.5 software.

Table 6.4

Core solver rating, core solver speedup, and efficiency details for the problem of eddy dissipation

Processes | Machines | Core solver rating | Core solver speedup | Core solver efficiency
Bull with 2.8 GHz turbo
1 | 1 | 243.9 | 1 | 100%
20 | 1 | 2773.7 | 11.4 | 57%
40 | 2 | 4753.8 | 19.5 | 49%
80 | 4 | 7125.8 | 29.2 | 37%
160 | 8 | 9846.2 | 40.4 | 25%
320 | 16 | 11,443.7 | 46.9 | 15%
Fujitsu with 2.7 GHz processor
1 | 1 | 185.4 | 1 | 100%
2 | 1 | 361.1 | 1.9 | 97%
4 | 1 | 661.3 | 3.6 | 89%
8 | 1 | 1185.6 | 6.4 | 80%
10 | 1 | 1470.6 | 7.9 | 79%
12 | 1 | 1489.7 | 8 | 67%
24 | 1 | 2421.9 | 13.1 | 54%
48 | 2 | 4670.3 | 25.2 | 52%
96 | 4 | 7697.1 | 41.5 | 43%
192 | 8 | 9573.4 | 51.6 | 27%
Fujitsu with 3 GHz processor
1 | 1 | 206.4 | 1 | 100%
2 | 1 | 402 | 1.9 | 97%
4 | 1 | 756.7 | 3.7 | 92%
8 | 1 | 1318.1 | 6.4 | 80%
10 | 1 | 1644.1 | 8 | 80%
20 | 1 | 2769.2 | 13.4 | 67%
40 | 2 | 5112.4 | 24.8 | 62%
80 | 4 | 8093.7 | 39.2 | 49%
160 | 8 | 7819 | 37.9 | 24%
320 | 16 | 8037.2 | 38.9 | 12%
IBM with 2.6 GHz processor
16 | 1 | 2168.1 | N/A | N/A
24 | 2 | 3083 | N/A | N/A
32 | 2 | 3945.2 | N/A | N/A
48 | 3 | 5228.4 | N/A | N/A
64 | 4 | 6376.4 | N/A | N/A
96 | 6 | 7783.8 | N/A | N/A
128 | 8 | 9118.7 | N/A | N/A
384 | 24 | 9959.7 | N/A | N/A
1024 | 64 | 11,220.8 | N/A | N/A
IBM with 2.7 GHz processor
16 | 1 | 2341.5 | N/A | N/A
32 | 2 | 4148.9 | N/A | N/A
48 | 2 | 5693.6 | N/A | N/A
64 | 3 | 6996 | N/A | N/A
352 | 15 | 9265.4 | N/A | N/A

3.1.2. Flow Over Airfoil

Figure 6.4 shows benchmarking for the case of flow over an airfoil with about 2 million cells. All of the cells used in the simulation were hexahedral. The turbulence model was realizable k-ε with a density-based implicit solver. Based on the solver rating, Fluent scales very well to a higher number of cores. For this 2-million-cell problem, however, almost all of the curves start to flatten after 256 cores. This means that on all of the machines tested, 256 cores are enough for this problem; no significant increase in performance is expected if the core count is extended beyond this number. In comparison, IBM has slightly better performance than Fujitsu and Bull. Table 6.5 shows the results for core solver rating, speedup, and efficiency.
Figure 6.4 Benchmarking curve for the problem of airfoil on ANSYS Fluent 14.5 software.

Table 6.5

Core solver rating, core solver speedup, and efficiency details for the problem of flow over an airfoil

Processes | Machines | Core solver rating | Core solver speedup | Core solver efficiency
Bull with 2.8 GHz turbo
1 | 1 | 210.3 | 1 | 100%
20 | 1 | 2583 | 12.3 | 61%
40 | 2 | 4958.4 | 23.6 | 59%
80 | 4 | 9340.5 | 44.4 | 56%
160 | 8 | 15,853.2 | 75.4 | 47%
320 | 16 | 22,012.7 | 104.7 | 33%
Fujitsu with 2.7 GHz processor
1 | 1 | 164.9 | 1 | 100%
2 | 1 | 329.9 | 2 | 100%
4 | 1 | 640.4 | 3.9 | 97%
8 | 1 | 1092.3 | 6.6 | 83%
10 | 1 | 1442.4 | 8.7 | 87%
12 | 1 | 1757 | 10.7 | 89%
24 | 1 | 2921.4 | 17.7 | 74%
48 | 2 | 5374.8 | 32.6 | 68%
96 | 4 | 9735.2 | 59 | 61%
192 | 8 | 14,521 | 88.1 | 46%
384 | 16 | 25,985 | 157.6 | 41%
Fujitsu with 3 GHz processor
1 | 1 | 184.2 | 1 | 100%
2 | 1 | 368.5 | 2 | 100%
4 | 1 | 719.1 | 3.9 | 98%
8 | 1 | 1327.2 | 7.2 | 90%
10 | 1 | 1575.2 | 8.6 | 86%
20 | 1 | 2721.3 | 14.8 | 74%
40 | 2 | 4979.8 | 27 | 68%
80 | 4 | 7500.3 | 43 | 50%
160 | 8 | 16,225.4 | 88.1 | 55%
320 | 16 | 25,600 | 139 | 43%
IBM with 2.6 GHz processor
16 | 1 | 2076.9 | N/A | N/A
24 | 2 | 3110.7 | N/A | N/A
32 | 2 | 3958.8 | N/A | N/A
48 | 3 | 5877.6 | N/A | N/A
64 | 4 | 7697.1 | N/A | N/A
96 | 6 | 11,041.5 | N/A | N/A
128 | 8 | 14,163.9 | N/A | N/A
256 | 16 | 22,887.4 | N/A | N/A
384 | 24 | 27,212.6 | N/A | N/A
512 | 32 | 30,315.8 | N/A | N/A
IBM with 2.7 GHz processor
16 | 1 | 2192.9 | N/A | N/A
24 | 1 | 3005.2 | N/A | N/A
48 | 2 | 5798.7 | N/A | N/A
64 | 3 | 7731.5 | N/A | N/A
96 | 4 | 11,006.4 | N/A | N/A
128 | 6 | 13,991.9 | N/A | N/A
192 | 8 | 18,782.6 | N/A | N/A
256 | 11 | 22,736.8 | N/A | N/A
360 | 15 | 26,584.6 | N/A | N/A

3.1.3. Flow Over Sedan Car

The sedan car problem was simulated with about 4 million cells; the actual number was 3.6 million. A hybrid grid was used, owing to the complexity of the geometry, especially in the regions around the wheels. k-ε was used as the turbulence model with a pressure-based implicit solver. Figure 6.5 compares benchmarking for this problem. Almost all of the machines behave similarly, but the Bull machine remains linear until the end, and we can expect performance to continue improving (if not perfectly) if the core count is extended. The behavior does not drift far from a linear trend, apart from IBM, whose two curves start to come down at 1024 cores. However, it is expected that after about one more x-axis unit all of the curves will start to come down, because the problem size is not extravagant. For this problem we can say that Bull performed the best. Table 6.6 tabulates the data of Figure 6.5.
Figure 6.5 Benchmarking for the problem of a sedan car, using ANSYS Fluent.

3.1.4. Flow Over Truck Body with 14 Million Cells

A truck is an interesting problem from an aerodynamics point of view. Because of their bulky mass, trucks do not usually attain high velocity on highways, so their drag is reduced by certain geometric modifications; for example, a fairing is mounted on the roof to reduce drag and thereby increase speed. Base drag also reduces the speed considerably. CFD testing reveals the contribution of drag to performance, and a faithful simulation can be performed only with techniques such as DES; HPC is therefore the best option for running this kind of simulation. The benchmark was performed by ANSYS Fluent on a truck body with a hybrid grid of about 14 million cells (Figure 6.6). A pressure-based implicit solver was used for the simulations. Details of the machines used and the efficiencies are shown in Table 6.7. Figure 6.6 also shows that the Bull machine performed better than the others; its curve was almost linear until the end.
Figure 6.6 Benchmarking of a truck body with 14 million cells using ANSYS Fluent software.

Table 6.6

Core solver rating, core solver speedup, and efficiency details for the problem of flow over a sedan car

Processes | Machines | Core solver rating | Core solver speedup | Core solver efficiency
Bull with 2.8 GHz turbo
1 | 1 | 157.4 | 1 | 100%
20 | 1 | 1882.4 | 12 | 60%
40 | 2 | 3818.8 | 24.3 | 61%
80 | 4 | 7663 | 48.7 | 61%
160 | 8 | 15,926.3 | 101.2 | 63%
320 | 16 | 30,315.8 | 192.6 | 60%
640 | 32 | 55,741.9 | 354.1 | 55%
Fujitsu with 2.7 GHz processor
1 | 1 | 125.9 | 1 | 100%
2 | 1 | 222 | 1.8 | 88%
4 | 1 | 502.8 | 4 | 100%
8 | 1 | 882.3 | 7 | 88%
10 | 1 | 1175.9 | 9.3 | 93%
12 | 1 | 1125.4 | 8.9 | 74%
24 | 1 | 2053.5 | 16.3 | 68%
48 | 2 | 4085.1 | 32.4 | 68%
96 | 4 | 7944.8 | 63.1 | 66%
192 | 8 | 16,149.5 | 128.3 | 67%
384 | 16 | 29,793.1 | 236.6 | 62%
Fujitsu with 3 GHz processor
1 | 1 | 141.2 | 1 | 100%
2 | 1 | 284 | 2 | 101%
4 | 1 | 562.1 | 4 | 100%
8 | 1 | 1066.7 | 7.6 | 94%
10 | 1 | 1269.2 | 9 | 90%
20 | 1 | 1916.8 | 13.6 | 68%
40 | 2 | 3831.5 | 27.1 | 68%
80 | 4 | 7464.4 | 52.9 | 66%
160 | 8 | 14,961 | 106 | 66%
320 | 16 | 27,648 | 195.8 | 61%
IBM with 2.6 GHz processor
16 | 1 | 1508.5 | N/A | N/A
24 | 2 | 2285.7 | N/A | N/A
32 | 2 | 3091.2 | N/A | N/A
48 | 3 | 4632.7 | N/A | N/A
64 | 4 | 6149.5 | N/A | N/A
96 | 6 | 9118.7 | N/A | N/A
256 | 16 | 23,351.4 | N/A | N/A
384 | 24 | 32,000 | N/A | N/A
512 | 32 | 38,831.5 | N/A | N/A
1024 | 64 | 54,857.1 | N/A | N/A
IBM with 2.7 GHz processor
16 | 1 | 1519.8 | N/A | N/A
24 | 1 | 2068.2 | N/A | N/A
48 | 2 | 4199.3 | N/A | N/A
64 | 3 | 5610.4 | N/A | N/A
96 | 4 | 8307.7 | N/A | N/A
128 | 6 | 11,184.5 | N/A | N/A
192 | 8 | 16,776.7 | N/A | N/A
256 | 11 | 21,735.8 | N/A | N/A
360 | 15 | 29,042 | N/A | N/A

Table 6.7

Core solver rating, core solver speedup, and efficiency details for the problem of flow over a truck body with 14 million cells

Processes | Machines | Core solver rating | Core solver speedup | Core solver efficiency
Bull with 2.8 GHz turbo
20 | 1 | 180.4 | 14.8 | 74%
40 | 2 | 360.3 | 29.5 | 74%
80 | 4 | 723.6 | 59.3 | 74%
160 | 8 | 1373.6 | 112.6 | 70%
320 | 16 | 2958.9 | 242.5 | 76%
640 | 32 | 5366.5 | 439.9 | 69%
1280 | 64 | 8727.3 | 715.4 | 56%
Fujitsu with 2.7 GHz processor
1 | 1 | 8.8 | 1 | 100%
2 | 1 | 19.8 | 2.2 | 112%
4 | 1 | 41.1 | 4.7 | 117%
8 | 1 | 74.6 | 8.5 | 106%
10 | 1 | 93.5 | 10.6 | 106%
12 | 1 | 108.4 | 12.3 | 103%
24 | 1 | 206.4 | 23.5 | 98%
48 | 2 | 371.9 | 42.3 | 88%
96 | 4 | 811.3 | 92.2 | 96%
192 | 8 | 1384.6 | 157.3 | 82%
384 | 16 | 2814.3 | 319.8 | 83%
Fujitsu with 3 GHz processor
1 | 1 | 9.3 | 1 | 100%
2 | 1 | 20.4 | 2.2 | 110%
4 | 1 | 45 | 4.8 | 121%
8 | 1 | 79 | 8.5 | 106%
10 | 1 | 99.3 | 10.7 | 107%
20 | 1 | 185.9 | 20 | 100%
40 | 2 | 343 | 36.9 | 92%
80 | 4 | 483 | 51.9 | 65%
160 | 8 | 1259.5 | 135.4 | 85%
320 | 16 | 1624.1 | 174.6 | 55%
IBM with 2.6 GHz processor
16 | 1 | 147.2 | N/A | N/A
24 | 2 | 223.8 | N/A | N/A
32 | 2 | 292 | N/A | N/A
48 | 3 | 404.3 | N/A | N/A
64 | 4 | 548.6 | N/A | N/A
96 | 6 | 817.4 | N/A | N/A
128 | 8 | 1082.7 | N/A | N/A
256 | 16 | 2112.5 | N/A | N/A
384 | 24 | 3130.4 | N/A | N/A
512 | 32 | 4056.3 | N/A | N/A
1024 | 64 | 6912 | N/A | N/A
IBM with 2.7 GHz processor
16 | 1 | 157 | N/A | N/A
24 | 1 | 203.4 | N/A | N/A
48 | 2 | 400 | N/A | N/A
64 | 3 | 539.3 | N/A | N/A
96 | 4 | 799.3 | N/A | N/A
128 | 6 | 1057.5 | N/A | N/A
192 | 8 | 1627.1 | N/A | N/A
256 | 11 | 2138.6 | N/A | N/A
360 | 15 | 2851.5 | N/A | N/A

3.1.5. Truck with 111 Million Cells

This was the largest benchmark performed by ANSYS Fluent. It consisted of the same problem as discussed before but with 111 million cells. With such a huge grid it is difficult to manage a single structured grid, so mixed-type cells were used. The turbulence model was DES and a pressure-based solver was used to solve the governing equations. A nearly linear curve was obtained for the Bull machine, whereas the worst performance was shown by Fujitsu with the 2.7 GHz processor (see Figure 6.7). Table 6.8 lists the corresponding core solver ratings.

3.1.6. Performance of Different Problem Sizes with a Single Machine

We have seen that the best machine so far is Bull. We now compare the performance of each problem tested on the Bull machine, shown in Figure 6.8.
Figure 6.8 shows that scalability is good for larger mesh sizes. The smallest mesh, the 417,000-cell eddy-dissipation problem, turns downward after 64 cores, whereas the 111-million-cell truck case still scales because of its larger mesh. We can conclude that if one wants to set up a cluster in a laboratory, the first thing is to know the size of the problem and how large it may become in the future. Will you run larger meshes later, or unsteady simulations? If your mesh may reach several million cells, you need to know how many cores would be sufficient. Keeping the interconnect (Infiniband), operating system, and MPI software constant, these benchmark curves will guide you as to how many cores will be sufficient for your case. The last step is to select the machine. Keeping the budget in mind, you will select a vendor and then compare prices in the market. If a vendor is the best for your problem but is expensive, go to the next best one, and so on. A flowchart to help you select an appropriate machine is shown in Figure 6.9.
Figure 6.7 Benchmarking of a truck problem with 111 million cells with ANSYS Fluent software.
Figure 6.8 Benchmarking for a number of problems on the Bull machine with a 2.8 GHz turbo processor.

Table 6.8

Core solver rating, core solver speedup, and efficiency details for the problem of flow over a truck body with 111 million cells

Processes | Machines | Core solver rating | Core solver speedup | Core solver efficiency
Bull with 2.8 GHz turbo
80 | 4 | 74 | N/A | N/A
160 | 8 | 154.2 | N/A | N/A
320 | 16 | 322.7 | N/A | N/A
640 | 32 | 654 | N/A | N/A
1280 | 64 | 1303.2 | N/A | N/A
Fujitsu with 2.7 GHz processor
96 | 4 | 56.3 | N/A | N/A
192 | 8 | 110 | N/A | N/A
384 | 16 | 318.5 | N/A | N/A
Fujitsu with 3 GHz processor
80 | 4 | 56.3 | N/A | N/A
160 | 8 | 128 | N/A | N/A
320 | 16 | 208.2 | N/A | N/A
IBM with 2.6 GHz processor
64 | 4 | 56.7 | N/A | N/A
96 | 6 | 79.6 | N/A | N/A
128 | 8 | 107 | N/A | N/A
256 | 16 | 249.9 | N/A | N/A
384 | 24 | 378.9 | N/A | N/A
512 | 32 | 501.7 | N/A | N/A
1024 | 64 | 998.8 | N/A | N/A
IBM with 2.7 GHz processor
64 | 3 | 61.4 | N/A | N/A
96 | 4 | 93 | N/A | N/A
128 | 6 | 121.6 | N/A | N/A
192 | 8 | 185.3 | N/A | N/A
256 | 11 | 246.7 | N/A | N/A
360 | 15 | 348.8 | N/A | N/A

Figure 6.9 Flowchart for selecting an HPC machine.
Alternatively, you may obtain quotations from all of the vendors, compare them, and select the best one; you may need to compromise between performance and your budget. Vendors sell their machines mostly on the basis of teraflops, which comes from the LINPACK benchmark, but for software such as ANSYS Fluent, CFX, or OpenFOAM you need to look at the number of cores required for your problem and choose a machine on that basis. Second, Infiniband is offered by Mellanox, which is the dominant vendor in the HPC market, much as NVIDIA is in GPU technology. Infiniband is expensive equipment, and its price is largely fixed whatever machine you buy, so determine whether the vendor is offering it as part of the package; otherwise, you will have to make room in the budget for it as well.
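The selection steps described above and in Figure 6.9 can be summarized in a small decision sketch. The vendor names, ratings, and prices below are placeholders (assumptions), not real quotations; in practice the ratings would come from benchmark curves for your own case size and the prices from vendor quotes.

```python
# A minimal sketch of the selection logic behind Figure 6.9. The vendor
# names, ratings, and prices below are placeholders (assumptions), not
# real quotations.

from dataclasses import dataclass

@dataclass
class Quote:
    vendor: str
    cores: int
    rating: float   # core solver rating for your case at this core count
    price: float    # quoted price

def pick_machine(quotes, budget):
    """Return the best-rated machine that fits the budget, else None."""
    affordable = [q for q in quotes if q.price <= budget]
    if not affordable:
        return None
    return max(affordable, key=lambda q: q.rating)

quotes = [
    Quote("Vendor A", 256, 22000.0, 900000.0),   # placeholder figures
    Quote("Vendor B", 256, 21500.0, 700000.0),
]
print(pick_machine(quotes, budget=800000.0))     # falls back to Vendor B
```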

3.2. Benchmarks for CFX

3.2.1. Automotive Pump Simulation

This CFX benchmark was run for an automotive pump problem consisting of 596,252 cells. There were mixed element types, including tetrahedra and prisms. The models used were k-ε and a moving reference frame to incorporate the motion of the rotor while keeping the stator stationary. A density-based solver was employed in the simulations.
CFX benchmarks differ from ANSYS Fluent benchmarks in that ANSYS lists the different processor architectures on which the benchmarks were run rather than different vendors. The pump problem was likewise run with different types of processors, Infiniband architectures, operating systems, and so on, mainly on various Intel processors. The first was an Intel E5-2670 with a clock speed of 2.6 GHz, which had the best performance overall; it had 64 GB RAM per machine and the operating system was Redhat Linux. Next was an Intel E5-2680 at 2.7 GHz, an Intel Sandy Bridge dual-CPU, 16-core configuration with 28 GB RAM and CentOS as the operating system. The last (but not least) was an Intel X5650 at 2.67 GHz with 39 GB RAM and SLES as the operating system, as shown in Figure 6.10. SLES is SUSE Linux Enterprise Server, a stable, secure, and user-friendly version of Linux. Figure 6.10 shows the values only for the solver rating; core solver speedup and efficiency are listed in Table 6.9.

3.2.2. Le Mans Car Simulation

Millions of dollars are spent improving the design of sports cars; Le Mans is an example. The CFX team performed an analysis of this car using a mesh of 1,864,025 cells. All of the elements were tetrahedral. The turbulence model was k-ε with a density-based implicit solver. Figure 6.11 shows the benchmark curve and Table 6.10 lists the detailed parameters. Figure 6.11 also shows that the Intel E5-2670 maintained a nearly linear trend even at 64 cores and was the best among the three processors tested.
Figure 6.10 Benchmarking for the problem of an automotive pump simulation with ANSYS CFX software.

Table 6.9

Benchmarks performed with CFX software for the pump simulation problem

Processes | Machines | Core solver rating | Core solver speedup | Core solver efficiency
HP SL230sG8 with Intel Sandy Bridge (16-core dual CPU) with 64 GB RAM per machine, RHEL 6.2 using FDR Infiniband without turbo mode
1 | 1 | 88 | 1 | 100%
2 | 1 | 162 | 1.84 | 92%
4 | 1 | 285 | 3.24 | 81%
6 | 1 | 381 | 4.33 | 72%
8 | 1 | 483 | 5.49 | 69%
10 | 1 | 580 | 6.59 | 66%
12 | 1 | 691 | 7.85 | 65%
16 | 1 | 847 | 9.63 | 60%
32 | 2 | 1490 | 16.93 | 53%
64 | 4 | 2618 | 29.75 | 46%
Intel Sandy Bridge (16-core dual CPU) with 28 GB RAM
1 | 1 | 96.9 | 1 | 100%
2 | 1 | 180.8 | 1.87 | 93%
3 | 1 | 262.6 | 2.71 | 90%
4 | 1 | 317.6 | 3.28 | 82%
6 | 1 | 421.5 | 4.35 | 73%
8 | 1 | 533.3 | 5.51 | 69%
12 | 1 | 732.2 | 7.56 | 63%
16 | 1 | 881.6 | 9.1 | 57%
Intel Gulftown/Westmere (12-core dual CPU) with 39 GB RAM
1 | 1 | 83.4 | 1 | 100%
2 | 1 | 149.2 | 1.79 | 89%
3 | 1 | 214.9 | 2.58 | 86%
4 | 1 | 265 | 3.18 | 79%
6 | 1 | 334.9 | 4.02 | 67%
8 | 1 | 407.5 | 4.89 | 61%
12 | 1 | 496.6 | 5.95 | 50%

Figure 6.11 Benchmarking for problem of Le Mans car with ANSYS CFX software.

Table 6.10

Core solver rating, core solver speedup, and efficiency details for the problem of flow over a Le Mans car body with ANSYS CFX software

Processes | Machines | Core solver rating | Core solver speedup | Core solver efficiency
HP SL230sG8 with Intel Sandy Bridge (16-core dual CPU) with 64 GB RAM per machine, RHEL 6.2 using FDR Infiniband without turbo mode
1 | 1 | 53 | 1 | 100%
2 | 1 | 111 | 2.09 | 105%
4 | 1 | 211 | 3.98 | 100%
6 | 1 | 294 | 5.55 | 92%
8 | 1 | 374 | 7.06 | 88%
10 | 1 | 441 | 8.32 | 83%
12 | 1 | 514 | 9.7 | 81%
14 | 1 | 580 | 10.94 | 78%
16 | 1 | 617 | 11.64 | 73%
32 | 2 | 1168 | 22.04 | 69%
64 | 4 | 1920 | 36.23 | 57%
Intel Sandy Bridge (16-core dual CPU) with 28 GB RAM
1 | 1 | 57.9 | 1 | 100%
2 | 1 | 120.8 | 2.09 | 104%
3 | 1 | 169.7 | 2.93 | 98%
4 | 1 | 230.4 | 3.98 | 100%
6 | 1 | 322.4 | 5.57 | 93%
8 | 1 | 413.4 | 7.14 | 89%
12 | 1 | 543.4 | 9.39 | 78%
16 | 1 | 640 | 11.06 | 69%
Intel Gulftown/Westmere (12-core dual CPU) with 39 GB RAM
1 | 1 | 50.4 | 1 | 100%
2 | 1 | 103.1 | 2.04 | 102%
3 | 1 | 139.1 | 2.76 | 92%
4 | 1 | 187 | 3.71 | 93%
6 | 1 | 244.8 | 4.85 | 81%
8 | 1 | 296.9 | 5.89 | 74%
12 | 1 | 347 | 6.88 | 57%

3.2.3. Airfoil Simulation

The airfoil simulation was conducted with 9,933,000 cells. All of the elements were hexahedral. Shear stress transport was used as the turbulence model, and a coupled implicit solver was used for the flow equations. Figure 6.12 shows the curves and Table 6.11 lists the values of the performance parameters.
Figure 6.12 Airfoil simulation benchmarking with ANSYS CFX software.

Table 6.11

Core solver rating, core solver speedup, and efficiency details for the problem of flow over an airfoil

Processes | Machines | Core solver rating | Core solver speedup | Core solver efficiency
HP SL230sG8 with Intel Sandy Bridge (16-core dual CPU) with 64 GB RAM per machine, RHEL 6.2 using FDR Infiniband without turbo mode
1 | 1 | 16 | 1 | 100%
2 | 1 | 33 | 2.06 | 103%
4 | 1 | 67 | 4.19 | 105%
6 | 1 | 96 | 6 | 100%
8 | 1 | 121 | 7.56 | 95%
10 | 1 | 143 | 8.94 | 89%
12 | 1 | 159 | 9.94 | 83%
14 | 1 | 175 | 10.94 | 78%
16 | 1 | 193 | 12.06 | 75%
Intel Sandy Bridge (16-core dual CPU) with 28 GB RAM
1 | 1 | 14.9 | 1 | 100%
2 | 1 | 30.6 | 2.06 | 103%
3 | 1 | 46.1 | 3.09 | 103%
4 | 1 | 62 | 4.17 | 104%
6 | 1 | 92.9 | 6.24 | 104%
8 | 1 | 116.6 | 7.83 | 98%
12 | 1 | 151.8 | 10.2 | 85%
16 | 1 | 174.9 | 11.75 | 73%
Intel Gulftown/Westmere (12-core dual CPU) with 39 GB RAM
1 | 1 | 14.8 | 1 | 100%
2 | 1 | 29.2 | 1.97 | 98%
3 | 1 | 42.4 | 2.86 | 95%
4 | 1 | 56.7 | 3.82 | 96%
6 | 1 | 76.1 | 5.13 | 86%
8 | 1 | 87.3 | 5.88 | 74%
12 | 1 | 96.9 | 6.53 | 54%

4. OpenFOAM® Benchmarking

Throughout the text we have been discussing commercial codes, mainly ANSYS Fluent. We now turn to open source codes as well. Open source means that you can modify the code according to your needs: you may add subroutines, programs, functions, and so on. The best-known open source code for CFD simulations is OpenFOAM, in which "Open" refers to open source and "FOAM" stands for Field Operation And Manipulation. The benchmark presented here is for the problem of cavity flow, a famous problem in the CFD community. The case was simulated with OpenFOAM version 2.2.
The following table (Table 6.12) illustrates the conditions under which the problem was simulated.
An Altix system contains processors connected by NUMAlink in a fat-tree topology. The term fat-tree comes from the observation that in a natural tree the branches become thinner from bottom to top, whereas here the network links become thicker (carry more bandwidth) as they approach the root, or master, node. Thus it is like a tree, but inverted.
Like conventional clusters, and in this case as well, each node is fitted into a blade that later fits into an enclosure or chassis; it is also called the individual rack unit (IRU). The IRU is a 10-unit enclosure that contains the necessary components to support the blades, such as power supplies, two router boards (one for every five blades), and an L1 controller. Each IRU can support 10 single-width blades or two double-width blades and eight single-width blades. The IRUs are mounted in a 42-U-high rack; thus, each rack supports up to four IRUs. The Altix ICE X blade enclosure features two 4x DDR Infiniband switch blades.
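A small arithmetic sketch of the enclosure hierarchy just described: the blades per IRU and IRUs per rack are taken from the text, while the 16 cores per blade is an assumption based on the dual eight-core Xeon E5-2670 nodes described later in this section.

```python
# Arithmetic sketch of the enclosure hierarchy described above.
# Blades per IRU and IRUs per rack are taken from the text; the
# 16 cores per blade figure is an assumption based on the two
# 8-core Xeon E5-2670 CPUs per node mentioned below.

BLADES_PER_IRU = 10
IRUS_PER_RACK = 4
CORES_PER_BLADE = 2 * 8   # two 8-core CPUs per node (assumed)

blades_per_rack = BLADES_PER_IRU * IRUS_PER_RACK
cores_per_rack = blades_per_rack * CORES_PER_BLADE
print(blades_per_rack, "blades per rack")   # 40
print(cores_per_rack, "cores per rack")     # 640
```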

Table 6.12

Flow conditions for the cavity flow problem

Reynolds number | 1000
Kinematic viscosity | 0.0001 m2/s
Cube dimension | 0.1 × 0.1 × 0.1 m
Lid velocity | 1 m/s
deltaT | 0.0001 s
Number of time steps | 200
Solutions written to disk | 8
Solver for pressure equation | Preconditioned conjugate gradient (PCG) with diagonal incomplete Cholesky (DIC) preconditioner
Decomposition method | Simple
The benchmark was run on an SGI machine from Silicon Graphics, Inc.® The model was an Altix ICE X computer with 1404 nodes, each carrying two eight-core Intel Xeon E5-2670 CPUs and 32 GB of memory per node. The interconnect was FDR and FDR-10 Infiniband. For the larger mesh sizes, the case was split across nodes in multiples of nine (the corresponding core counts are listed in Table 6.13). Here, the nodes are grouped in IRUs of 18 nodes each, where each IRU has two switches connecting 9 × 9 nodes together. It is beneficial to fill these IRUs, with respect to both communication and fragmentation of the job queue.
The simulated cases are shown in Table 6.13. The 27-million-cell mesh was not run on one node because the RAM was not adequate. Conversely, the smaller meshes were not run on a large number of nodes, not because of memory, but because the extra communication latency causes a drop in parallel performance.
The results of this scaling study are presented as plots indicating speedup and parallel efficiency. All results are based on total analysis time, including all startup overhead.
Speedup and parallel efficiency are calculated with the lowest number of nodes as a reference; i.e., speedup is computed relative to one node for all meshes except the 27-million-cell mesh, for which the speedup is relative to two nodes. In Figure 6.13 a clear trend is evident: the highest performance is achieved by the 27-million-cell mesh. The smallest mesh does not show linear behavior because efficiency drops at a higher number of processes as a result of inter-process communication, as shown in Figure 6.13. Figure 6.14 plots parallel efficiency. The ideal behavior is obviously the one corresponding to 100% efficiency. It is normal to obtain efficient results and then to see efficiency drop after a certain number of cores (depending on the problem size and other overheads), just as the speedup flattens after a certain number of cores.
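The sketch below shows how these scaling metrics are computed relative to the smallest node count that ran a given case; the total analysis times in the example are invented for illustration, not taken from the study.

```python
# Speedup and parallel efficiency relative to the smallest node count
# that ran the case (1 node here; 2 nodes for the 27M mesh). The total
# analysis times below are invented for illustration.

def scaling(times_by_nodes):
    base_nodes = min(times_by_nodes)
    base_time = times_by_nodes[base_nodes]
    for nodes, t in sorted(times_by_nodes.items()):
        spd = base_time / t
        # Ideal speedup equals the increase in node count
        eff = 100.0 * spd / (nodes / base_nodes)
        print(f"{nodes:4d} nodes  speedup {spd:6.2f}  efficiency {eff:6.1f}%")

scaling({1: 4200.0, 2: 2150.0, 4: 1100.0, 9: 520.0, 18: 300.0})
```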
Figure 6.13 Speedup curve for the simulation of flow in a cavity using OpenFOAM.

Table 6.13

Problem scale span and number of cores

Nodes \ Mesh size | 1M | 3.4M | 8M | 15.6M | 27M
1N (16 cores) | Yes | Yes | Yes | Yes | No
2N (32 cores) | Yes | Yes | Yes | Yes | Yes
4N (64 cores) | Yes | Yes | Yes | Yes | Yes
9N (144 cores) | Yes | Yes | Yes | Yes | Yes
18N (288 cores) | Yes | Yes | Yes | Yes | Yes
27N (432 cores) | Yes | Yes | Yes | Yes | Yes
36N (576 cores) | Yes | Yes | Yes | Yes | Yes
72N (1152 cores) | Yes | Yes | Yes | Yes | Yes
144N (2304 cores) | Yes | Yes | Yes | Yes | Yes
288N (4608 cores) | No | No | Yes | Yes | Yes

Figure 6.14 Parallel efficiency curves for different problem sizes. Notice that the ideal curve has 100% efficiency.
Figure 6.15 Milestones of supercomputing in Russia.
Figure 6.16 Places in Russia where clusters have been deployed.

5. Case Studies of Some Renowned Clusters

5.1. t-Platforms' Lomonosov HPC

t-Platforms is a leading Russian company in the field of HPC. Germany, the United States, and China are the main competitors in the field, but the Russians are also on their way to taking the lead. This became clear to the world in November 2009, when Lomonosov ranked 12th in the Top500 list of the world's largest supercomputers. Figure 6.15 shows the history of supercomputing in Russia. The graph shows the increase in floating-point performance achieved over the past decade of supercomputing at Moscow State University (MSU). The last three years show that HPC machines in Russia have broken the teraflops barrier by entering the petaflops range. A number of supercomputing centers operating in Russia are shown in Figure 6.16 [3].

5.2. Lomonosov

Lomonosov is the largest supercomputer in Moscow, Russia, located at MSU, and was named after the renowned Russian scientist M.V. Lomonosov [3]. Its peak performance is 420 teraflops and its LINPACK measured performance is 350 teraflops. System efficiency is 83%, which is considered the best in the world in terms of supercomputer performance. Lomonosov is based on the T-Blade 1.1i, T-Blade 2T, and TB2-TL platforms, the last of which is equipped with GP-GPU nodes. Everything in the system was developed in-house except the processors (Intel or NVIDIA Tesla), power supply units, and cooling fans. Figure 6.17 shows the hall in which the Lomonosov cluster was installed.
Figure 6.17 Lomonosov cluster with inset of M.V. Lomonosov.
Figure 6.18 Water tanks installed in the basement to cool the gigantic cluster.

5.2.1. Engineering Infrastructure

The Lomonosov cluster consumes around 1.36 MW of power and has redundant power supply units in case of failure. The uninterruptible power supply (UPS) system is guaranteed to provide sufficient power and cool down the system for the time required to shut down running tasks gracefully and shut the system down appropriately. Two UPS units provide separate power to the two segments of Lomonosov, each with a performance of 200 teraflops.
In addition, in case of power loss, one compute segment can be powered down to allocate more power to the critical computational segment. The UPS system has an efficiency of 97%, above the conventional 92% used at the industrial level. High efficiency is a must for such huge computational systems.
Another distinguishing feature of Lomonosov is its high computational density, which draws 65 kW of power per 42-U (approximately 73.5 in.) rack. A separate cooling system (Figure 6.18) occupying an 800 m2 room provides cooling for this massive structure. Because of the long Russian winters, the system can also be cooled by free outside air, bypassing the compressors of the water chillers; this helps reduce power consumption for about half the year. The system is also equipped with a fire-security system: within half a second the automatic fire system fills the entire room with a gas, extinguishing the fire without damaging any equipment. The gas suppresses the fire without lowering the oxygen concentration in the room, and thus it is relatively safe for personnel.

5.2.2. Infrastructure Facts

Preparation for this gigantic structure involved reinforcing floors to accommodate rack cabinets weighing more than 1100 kg each, as well as insulating the data center walls to keep a nominal 50% humidity. Six water tanks carrying over 31 tons of water are used for the water circulation system that provides the necessary cooling. The UPS, cooling, and management subsystems are tightly coupled. The system uses a two-stage scenario in which it analyzes the first 3 min of an event to determine whether the power loss was temporary, in which case normal operation can be restored, or permanent, in which case the system starts a proper shutdown procedure, creating backups for all running jobs. In the event of external power loss, the entire system takes 10 min to shut down completely. The cooling system consists of an innovative hot–cold air containment design; high-velocity air outlets provide efficient, even air mixing with minimal temperature deviation in the hot-aisle zones.

Table 6.14

Key features of the Lomonosov cluster

Features | Values
Peak performance | 420 teraflops (heterogeneous nodes)
Real performance | 350 teraflops
LINPACK efficiency | 83%
Number of compute nodes | 4446
Number of processors | 8892
Number of processor cores | 35,776
Primary compute nodes | T-Blade 2
Secondary compute nodes | T-Blade 1.1 Peak Cell S
Processor type of primary compute node | Intel Xeon X5570
Processor type of secondary compute node | Intel Xeon X5570, PowerXCell 8i
Total RAM installed | 56 TB
Primary interconnect | QDR Infiniband
Secondary interconnect | 10G ethernet, gigabit ethernet
External storage | Up to 1350 TB, t-Platforms ready storage SAN 7998 (Lustre)
Operating system | ClusterX t-Platforms edition
Total covered area occupied by system | 252 m2
Power consumption | 1.36 MW

5.2.3. Key Features

Key features of the Lomonosov cluster are given in Table 6.14. The details consider only the Intel Xeon Westmere series and not NVIDIA GPUs.

5.3. Benchmarking on TB-1.1 Blades

This benchmarking was performed on TB-1.1 blades. These blades were made by the MSU HPC team and contained 264 cores in 16 blade enclosures; each blade contained two processors. The tests were performed in two stages. In the first stage the processors were AMD Opteron 6174 (Magny-Cours), each with 12 cores, giving 24 cores per blade. In the second stage the processors were Intel Westmere X5670 hex-cores, giving 12 cores per blade. The performance curves were obtained for a CFD problem of 3 million and 8 million cells, consisting of flow over an NACA 0012 aerofoil, for up to 256 AMD Magny-Cours cores and 128 Intel Xeon Westmere cores. Figure 6.19 shows the curve for the 3-million-cell problem in ANSYS Fluent and Figure 6.21 shows the 8-million-cell problem, performed with the AMD Opteron 6174 (Magny-Cours) processors. Figures 6.20 and 6.22 show the performance curves of the Intel Westmere X5670 and a comparison with the AMD cores.
Figure 6.19 Performance results of a 3-million mesh size problem run on an Intel Westmere hex-core processor.
Figure 6.20 Comparison of the two processors' performance for a 3-million mesh size problem.
Figure 6.21 Performance results of an 8-million mesh size problem run on an Intel Westmere hex-core processor.
This shows that Intel performs much faster than the AMD processor. An expert from t-Platforms attributed this to the difference in clock speed: the AMD Magny-Cours runs at 2.2 GHz whereas the Intel Xeon runs at 2.93 GHz. Overall, for both AMD and Intel it is clear that for a single problem, as the number of cores increases, the time to complete 10,000 iterations decreases until it shows no significant further decrease with additional cores. The IST team is thankful to t-Platforms for its support in this benchmarking.

5.4. Other t-Platform Clusters

5.4.1. Chebyshev Cluster

The MSU Chebyshev supercomputer (Figure 6.23), with 60 teraflops of peak performance, was the most powerful computing system in Russia and eastern Europe before Lomonosov. It was named after the Russian mathematician P.L. Chebyshev. Chebyshev was based on 625 blades designed by t-Platforms, incorporating 1250 quad-core Intel Xeon E5472 processors. The LINPACK result was 47.17 teraflops, or 78.6% of peak performance. The machine incorporated the most recent technological findings from the industry and used several in-house developed technologies. Its computing core used the first Russian-developed blade systems, which incorporate 20 quad-core Intel Xeon 3.0 GHz 45 nm processors in a 5-U chassis, providing the highest computing density among all Intel-based blade solutions on the market.
Figure 6.22 Comparison of the two processors' performance for an 8-million mesh size problem.
Figure 6.23 Chebyshev cluster.

5.4.2. SKIF Ural Supercomputer

The SKIF Ural supercomputer incorporates advanced technical solutions, including over 300 up-to-date 45 nm Intel Harpertown quad-core processors. Along with the SKIF MSU supercomputer, SKIF Ural was one of the first Russian supercomputers built using the Intel Xeon E5472 processors. This supercomputer is also equipped with advanced engineering software for research involving analysis and modeling; FlowVision, an indigenous Russian software product, is an example. There are many application areas, the foremost of which is CFD; others include nanotechnology, optics, deformation and fracture mechanics, 3D modeling and design, and large database processing. In March 2008, SKIF Ural occupied fourth position in the eighth edition of the Top50 list of the fastest computers in the Commonwealth of Independent States.

5.4.3. Shaheen Cluster

The Shaheen cluster consists of a 16-rack IBM Blue Gene/P supercomputer owned and operated by King Abdullah University of Science and Technology (KAUST). Built in partnership with IBM, Shaheen is intended to enable KAUST faculty and partners to research both large- and small-scale projects, from intuition to realization. Shaheen is the largest and most powerful supercomputer in the Middle East. Originally built at IBM’s Thomas J. Watson Research Center in Yorktown Heights, New York, Shaheen was moved to KAUST in mid-2009 [4]. The creator of Shaheen is Majid Alghaslan, KAUST's founding interim chief information officer and the university’s leader in the acquisition, design, and development of the Shaheen supercomputer. Majid was part of the executive founding team for the university and the person who named the machine.
5.4.3.1. System Configuration
Shaheen includes 16 racks of Blue Gene/P with a peak performance of 222 teraflops. It also contains 128 IBM System X3550 Xeon nodes with a peak performance of 12 teraflops. The supercomputer contains 65,536 independent processing cores. Blue Gene/P is the technology that evolved from Blue Gene/L. Each Blue Gene/P chip contains four PowerPC 450 processor cores running at 850 MHz. The cores are cache coherent and the chip can operate as a four-way symmetric multiprocessor. The memory subsystem on the chip consists of small private L2 caches, a central shared 8 MB L3 cache, and dual DDR2 memory controllers. The chip also integrates the logic for node-to-node communication, using the same network topologies as Blue Gene/L but at more than twice the bandwidth. A compute card contains a Blue Gene/P chip with 2 or 4 GB of DRAM, comprising a compute node. A single compute node has a peak performance of 13.6 gigaflops. Thirty-two compute cards are plugged into an air-cooled node board, and a rack contains 32 node boards (thus 1024 nodes and 4096 processor cores).
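As a quick arithmetic check of the figures quoted above: the 4 flops per cycle per core used below is an assumption consistent with the stated 13.6 gigaflops per node; the other numbers are taken from the text.

```python
# Quick check of the Blue Gene/P arithmetic quoted above: peak flops per
# node and per 16-rack system. 4 flops/cycle per core is an assumption
# consistent with the 13.6 gigaflop node figure given in the text.

CORES_PER_NODE = 4
CLOCK_HZ = 850e6
FLOPS_PER_CYCLE = 4          # assumed: dual-pipe FPU with fused multiply-add
NODES_PER_RACK = 1024
RACKS = 16

node_peak = CORES_PER_NODE * CLOCK_HZ * FLOPS_PER_CYCLE
system_peak = node_peak * NODES_PER_RACK * RACKS
print(node_peak / 1e9, "gigaflops per node")         # 13.6
print(system_peak / 1e12, "teraflops for 16 racks")  # ~222.8
```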

6. Conclusion

This discussion shows how various supercomputer manufacturers market their products through benchmarking. It also shows that users who want to build their own cluster, or engage a vendor, can arrive at a suitable machine by analyzing standard benchmark curves; however, they must know their mesh requirements beforehand. Not only the speedup curves are important, but also the efficiency, shown in tabular form for ANSYS Fluent and CFX and in graphical form for OpenFOAM. Budgeting also matters: it is not advisable to buy an expensive machine (such as an IBM) if your problems usually run on fewer than 250 cores; the IBM machines are most useful for big problems of more than 10 million cells. There is no rocket science behind this chapter; it is an information guide for establishing an economical and productive HPC cluster.

References

[1] W. Aiken, Sun Business Ready HPC for ANSYS FLUENT: Configuration Guidelines for Optimizing ANSYS FLUENT Performance, ISV Engineering, Sun BluePrints™ Online, Part No. 821-0696-10, Revision 1.0, September 4, 2009.

[2] ANSYS HPC benchmarks, http://www.ansys.com/benchmarks, accessed September 2014.

[3] V. Voevodin, Moscow State University and High Performance Computing, presentation by the Deputy Director, Research Computing Center, Moscow State University, Helsinki, Finland, April 13, 2011.

[4] M. Feldman, Saudi Arabia Buys Some Big Iron, HPCwire, October 1, 2008. Accessed March 28, 2015.
