Index
Note: Page numbers followed by “f” and “t” refer to figures and tables, respectively.
A
Aligned version, of loop,
160f
allocate_huge_pages(),
289
Architecture, of coprocessor,
8f,
243
cache organization and memory access consideration,
251–252
individual core architecture,
247–249
instruction and multithread processing,
249–251
PCIe system interface and DMA,
257–260
power management capabilities,
260–262
reliability, availability, and serviceability (RAS),
263–265
machine check architecture (MCA),
264–265
system management controller (SMC),
265–266
potential application impact,
266
thermal design power monitoring and control,
266
vector processing unit architecture,
253–257
Array section operator,
153
prefetch instructions,
162
unaligned loads and stores,
158–159
Asynchronous data transfer,
229–234
from coprocessor to processor,
233–234
automatic and compiler-assisted use,
335
compiler-assisted offload, using,
337–338
disable and re-enable,
335
Intel MKL, effective use of,
336–337
data alignment and leading dimensions,
336
favor LAPACK unpacked routines,
336–337
OpenMP and threading settings,
336
oversubscription, avoiding,
336
using control work division,
333
B
Baseboard management controller (BMC),
246,
265
Baseline 9-point stencil implementation,
61–68
Bash shell, environment variables in,
331f
Basic diffusion calculation,
84
Basic Linear Algebra Subprograms (BLAS),
326
Boundary effects accounting,
84–91
Built-in operations,
155t
C
C/C++ arrays, allocating memory for,
212–213
C Shell, environment variables in,
331f
Cache line reuse, in diffusion calculation,
100,
101f
Cache optimizations,
20–21
Card System Management Agent,
280
Cycles per instruction (CPI),
365–369
formulas and thresholds,
366
and elemental functions,
187
Cilk Plus array sections and elemental functions,
152–155
writing target-specific code with,
223–224
C-language interfaces for the BLAS (CBLAS),
326–327
clevicts, compiler generation of,
122–123
Clock gate L2 cache and inter-processor network,
262f
Clocksources
Communicating with the coprocessor,
26–27
Coprocessor Offload Infrastructure,
278
Compiler and programming models,
19–20
Compiler generation of clevicts,
122–123
Compiler-assisted offload,
337f
Compute to data access ratio
formulas and thresholds,
369
Console kernel command line parameter,
299
Coprocessor offload interface (COI) layer,
240
Coprocessor support overview,
327–330
automatic offload, control functions for,
328–329
environment variables,
330
Cores
D
compiler of alignment,
144
diffusion_baseline.c listing,
90f
diffusion_openmp() code,
94
diffusion_openmpv function,
95f
diffusion_peel function,
98f
diffusion_tiled() function,
101,
103f
dir filelist directive,
312
Direct Memory Access (DMA)
Double precision floating point (DP),
24
Drand48 Vectorization, example of,
135
E
cycles per instruction (CPI),
365–369
compute to data access ratio,
369–370
erand48 vectorization,
138f
Event monitoring registers,
364
External Bridge Topology, in Linux,
304
F
File location parameters,
301
First Intel Xeon Phi coprocessor
Floating point operations per second (FLOP/s),
24,
27–28,
375
Floating-point arithmetic variations,
339–342
floating-point exceptions,
340
-fp-model precise option,
340
fused multiply-add (FMA) instruction,
340
math functions, precision of,
341
Xeon Phi coprocessors and Intel Xeon processors,
341–342
Fortran
allocating memory for parts,
213–214
asynchronous data transfer
from coprocessor to processor,
233–234
example of using pointers for various allocation types,
208f
example showing static vs. dynamic memory declarations,
207f
target-specific code using a directive in,
208f,
209,
210f
vector addition using Cilk Plus array notation,
109f
with “short vector” syntax,
110f
vector addition using standard C,
109f
Fortran prefetch intrinsics,
121f
Fused multiply and add (FMA),
24,
28
G
Guided Auto-parallelization (GAP) report,
112
H
Hardware prefetcher (HWP),
253
MPI implementation specific settings,
356
mpirun argument sets,
355
Fortran listing of, with OpenMP directives for scaling,
43f
in Intel Xeon Phi coprocessor,
40f
High Performance Computing (HPC),
243
High Speed Driver Certificate,
105
Host Channel Adapter (HCA),
282
Huge 2-MB memory pages,
79–80
Hyper-threading vs. multithreading,
17–18
I
IB core coprocessor modifications,
286
InfiniBand Host Channel Adapter (HCA),
282
Inlining, importance of,
126
Intel Cluster Studio XE 2013,
269–270
Intel Coprocessor Offload Infrastructure (Intel COI),
278
Intel Many Integrated Core (MIC) Architecture,
10,
28
Coprocessor Offload Infrastructure,
278
coprocessor components for MPI applications,
282–287
coprocessor communication link (Intel CCL),
282–285
IB core coprocessor modifications,
286
vendor HCA proxy driver,
287
coprocessor system management,
279–282
board tools, control panel, MicAccess SDK,
280–282
MYO (mine yours ours),
277
SCIF (symmetric communications interface),
278
sysfs and ganglia support,
281f
virtual networking (NetDev), TCP/IP, and sockets,
278
Intel Math Kernel Library (Intel MKL),
15,
107,
276
differences on coprocessor,
327
support function examples
Intel Math Kernel Library Reference Manual,
327
Intel Parallel Studio XE 2013,
269–270
loaning components to,
184
Intel Trace Analyzer and Collector (ITAC),
378–380
coprocessor only application,
379
processor+coprocessor application,
379–380
Intel Xeon Phi coprocessor,
59–60,
107,
165,
244–245,
270,
271–272,
341–342,
354,
372,
373
cache optimizations,
20–21
compiler and programming models,
19–20
maximum performance, achieving,
16–17
floating point operations per second (FLOP/s),
27–28
High Speed Driver Certificate,
105
highly parallel execution, measuring readiness for,
15–16
hyper-threading vs. multithreading,
17–18
memory architecture of,
50f
MPI vs. offload model,
18–19
multiple cores, effect of using,
56f
ninja gap, controlling,
operational software,
269
parallel program performance, maximizing,
15
performance, maximizing,
11–12
platform specifications,
62t
software architecture,
270f
transformation, for performance,
17
transforming-and-tuning double advantage,
10
vector floating point formats on,
25f
Intelligent Platform Management Interface (IPMI),
246
Intel Composer XE 2013,
26
Intel Manycore Platform Software Stack (MPSS),
26
Intel Many Integrated Core (MIC),
28
Internal Bridge Topology, in Linux,
303
J
jrand48 vectorization,
140f
K
L
“Least recently used” (LRU),
100
Linux on coprocessor,
293
bootstrap and configuration,
294–295
changing coprocessor configuration,
297–305
configuring boot parameters,
298–300
coprocessor root file system,
300–305
coprocessor Linux baseline,
293–294
default coprocessor Linux configuration,
295–297
external bridge topology,
304
Linux cluster, coprocessors in,
318–322
booting coprocessors,
306
configuration initialization and propagation,
308–309
configuration parameters, helper functions for,
309–311
coprocessor state control,
306
rebooting the coprocessors,
306–307
resetting coprocessors,
307
shutting down coprocessors,
306
dir filelist directive,
312
nod filelist directive,
313
pipe filelist directive,
313
slink filelist directive,
313
sock filelist directive,
313
support for Intel Xeon Phi coprocessors,
287
Linux operating system (OS),
269
Linux Standard Base (LSB),
294,
294t
Linux support, for Intel Xeon Phi coprocessors,
287
Loadable kernel modules (LKMs),
294
Logging
stdout and stderr from offloaded code,
240
Loop
lrand48 vectorization,
138f
M
Machine check architecture (MCA),
264–265
Manycore Platform Software Stack (MPSS),
26
Many-one array section,
151
precision choices and variations,
339–342
fast transcendentals and mathematics,
339
for floating-point arithmetic variations,
339–342
Memory allocation for pointer variables
C/C++ offload pragma examples,
203–204
coprocessor memory management for input pointer variables,
201–202
Fortran offload pragma examples,
204–206
length parameter in Fortran,
201
pointer variables, alignment of,
203
target memory management for output pointer variables,
202
transferring data into pre-allocated memory on target,
202
Memory architecture on Intel Xeon Phi coprocessor,
50f
formulas and thresholds,
376
Memory disambiguation inside vector-loops,
127–128
Message Passing Interface (MPI),
276,
343
on Intel Xeon Phi coprocessors,
345–348
using natively on coprocessor,
354–361
coprocessor-only model,
275
on the host processor platform using offload to coprocessors,
274f
running on coprocessors only,
275f
symmetric communications,
276f
MIC Access Software Developers Kit (MicAccess SDK),
280,
280–282,
282
MIC elapsed time counter (micetc),
380
Microarchitecture of Intel Xeon Phi coprocessor,
9f
Microsoft’s Parallel Patterns Library (PPL),
176
MPI application using Intel Coprocessor Communications Link,
271,
285f
MPI vs. offload model,
18–19
mpirun argument sets,
355
mrand48 vectorization,
139f
Multiple cores, effect of using
on Intel Xeon Phi coprocessor,
56f
Multiple declarations, target attribute to,
234–238
measuring timing and data in offload regions,
236
Vec-report option used with offloads,
235–236
MYO (Mine Yours Ours),
277
N
Need for Intel Xeon Phi coprocessor,
2–5
Ninja gap,
nod filelist directive,
313
Non-shared memory model,
190,
191
Non-uniform memory access (NUMA),
345
-no-opt-prefetch option,
121
nrand48 vectorization,
139f
O
OFED (Open Fabrics Enterprise Distribution),
282
asynchronous data transfer,
229–234
from coprocessor to processor,
233–234
C/C++ arrays, allocating memory for,
212–213
choosing vs. native execution,
191–192
writing target-specific code with,
223–224
compiler options and environment variables for,
193–195
creating offload libraries,
237–238
explicit copy modifiers,
201t
Fortran arrays, allocating memory for,
213–214
Intel Math Kernel Library (Intel MKL) automatic,
192
libraries in offloaded code,
237
logging stdout and stderr from offloaded code,
240
memory allocation for pointer variables,
198–199
C/C++ offload pragma examples,
203–204
coprocessor memory management for input pointer variables,
201–202
Fortran offload pragma examples,
204–206
length parameter in Fortran,
201
pointer variables, alignment of,
203
target memory management for output pointer variables,
202
transferring data into pre-allocated memory on target,
202,
202–203
to multiple coprocessors,
195
multiple declarations, target attribute to,
234–238
measuring timing and data in offload regions,
236
Vec-report option used with offloads,
235–236
non-shared memory model,
191
performing file I/O on the coprocessor,
238–240
placing variables and functions on coprocessor,
198–199
pragma in C/C++, target-specific code using,
206–209
predefined macros for Intel MIC architecture Fortran arrays,
211
restrictions, using pragma,
215–216
shared memory management functions,
219
C++ declarations of persistent data with,
227–228
sharing environment variables for,
195
sharing variables and functions,
220–221
synchronous and asynchronous function execution,
219–220
target-specific code using directive in Fortran,
209
Offload programming model,
41–44
On-Die Interconnect (ODI),
247f
Open Fabrics Enterprise Distribution (OFED),
274,
282,
287
OpenMP Parallel vs. DO CONCURRENT,
173–174
OpenMP Target (TR1) models,
194t
-opt-prefetch=50 option,
121
P
Parallel 9-Point 2D stencil code,
75f
Parallelization on processor and coprocessor,
59
“alignment,” adjusting,
76–77
baseline 9-point stencil implementation,
61–68
baseline stencil code, running,
68–70
huge 2-MB memory pages,
79–80
9-point stencil algorithm,
60
vectors plus scaling,
72–75
Parallel Patterns Library (PPL),
176
maximizing performance of,
15
Partitioned Global Address Space (PGAS) model,
171
“Peel” unneeded code from inner loop,
97–100
Performance application programming interface (PAPI),
378
pipe filelist directive,
313
Platform specifications of Intel Xeon Phi coprocessor,
62t
Platforms with coprocessors,
5–6
9-point stencil algorithm,
60
over two-dimensional array,
61f
Portable High Performance Programming,
Potential performance issues,
370–377
Power management capabilities,
260–262
PowerManagement kernel command line parameter,
299–300
Pragma in C/C++, target-specific code using,
206–209
#pragma offload target,
44
#pragma omp parallel,
130
#pragma vector novecremainder,
140
Preproduction Intel Xeon Phi coprocessor,
7f
Processor and coprocessor
core/thread parallelism,
3f
compute to data access ratio,
369–370
event monitoring registers,
364
Intel Trace Analyzer and Collector,
378–380
coprocessor only application,
379
processor+coprocessor application,
379–380
Intel VTune Amplifier XE product,
16,
377–378
performance application programming interface,
378
potential performance issues,
370–377
clocksources on coprocessor,
380
MIC elapsed time counter,
380
setting the clocksource,
381
Programming options, in coprocessor,
273f
Q
R
Random number function vectorization,
137–138
RDMA Peer-to-Peer Transfer with Intel CCL,
284f
Real-world code example,
83
basic diffusion calculation,
84
boundary effects accounting,
84–91
peeling code from inner loop,
97–100
vectorization ensuring,
93–96
Reliability, availability, and serviceability (RAS),
263–265
Remote direct memory access (RDMA),
282
RootDevice kernel command line parameter,
300
Runtime Type Information (RTTI),
225
S
Scatter/gather instructions,
161f
Share virtual memory offload,
192
Shared secure shell (SSH) keys,
348
C++ declarations of persistent data with,
227–228
Simple profiling, importance of,
126
Single Dynamic Library (SDL),
327
Single instruction, multiple data (SIMD),
24
vectorize, requirements to,
130–131
Single precision floating point (SP),
24
Single-program, multiple-data (SPMD),
343
slink filelist directive,
313
sock filelist directive,
313
Software architecture and components,
269
development tools and application layer,
276–277
coprocessor programming models and options,
271–275
coprocessor-only model,
275
Intel Manycore platform software stack,
277–287
board tools, control panel, MicAccess SDK,
280–282
Coprocessor Offload Infrastructure,
278
coprocessor communication link (Intel CCL),
282–285
coprocessor system management,
279–282
IB core coprocessor modifications,
286
MPI applications, coprocessor components for,
282–287
MYO (mine yours ours),
277
SCIF (symmetric communications interface),
278
vendor HCA proxy driver,
287
virtual networking (NetDev), TCP/IP, and sockets,
278
Software architecture of Intel Xeon Phi coprocessor,
270f
Software Developers Kit (SDK),
280
Source Code, for loop,
160f
Specifications, of coprocessor,
24–26
std::condition_variable,
181
Stencil Diffusion Algorithm Code,
85f
sten2d9pt_base.c Listing,
67f
sten2d9pt_omp Function,
73f
sten2d9pt_vect Function,
71f
Structured Parallel Programming,
385
Symmetric communications interface (SCIF),
278,
282,
295
Symmetric multiprocessor (SMP),
243,
247
System Management Bus (SMBus),
246
T
Thread pools, importance of,
168
Threads
increasing the number of,
38
performance improvements from,
104f
Timer hardware devices,
380
clocksources on coprocessor,
380
MIC elapsed time counter,
380
setting the clocksource,
381
Transforming-and-tuning double advantage,
10
description and usage,
374
formulas and thresholds,
373
hybrid workloads, manual load balance of,
357–358
Tuning memory allocation performance,
288–290
number of 2-MB pages
sample method for allocating 2-MB pages,
289–290
U
Unaligned version, of loop,
160f
Uniform Memory Access (UMA),
247
User-level application code,
271
V
Vec-report option with offloads,
235–236
Vector floating point formats on Intel Xeon Phi coprocessor,
25f
description and usage,
375
directives/pragmas to assist,
109
numerical result variations with,
163
six step vectorization methodology,
110–112
techniques to achieve,
108t
VerboseLogging kernel command line parameter,
299
Virtual networking (NetDev), TCP/IP, and sockets,
278
Virtual-shared memory model,
190
W
X