Index

Note: Page numbers followed by “f” and “t” refer to figures and tables, respectively.

A

affinity_partitioner, 178–179, 179f
-align rec64byte, 143
-align recnbyte, 143
ALIGNED clauses, 134
Aligned version, of loop, 160f
allocate_huge_pages(), 289
Architecture, of coprocessor, 8f, 243
benchmarks, 267
cache organization and memory access consideration, 251–252
card design, 245–246
individual core architecture, 247–249
instruction and multithread processing, 249–251
PCIe system interface and DMA, 257–260
DMA capabilities, 258–260
power management capabilities, 260–262
prefetching, 252–253
reliability, availability, and serviceability (RAS), 263–265
machine check architecture (MCA), 264–265
system management controller (SMC), 265–266
fan speed control, 266
potential application impact, 266
sensors, 265–266
thermal design power monitoring and control, 266
vector processing unit architecture, 253–257
vector instructions, 254–257
Array notations, 109–110
Array section operator, 153
Assembly code, 157
inspection, 156–163
prefetch instructions, 162
quick inspection of, 158–163
scatters or gathers, 161–162
unaligned loads and stores, 158–159
ASSERT keyword, 134
Asynchronous computation, 228–229
Asynchronous data transfer, 229–234
from coprocessor to processor, 233–234
from processor to coprocessor, 229–234
Auto partitioner, 178
Auto vectorization, 109
Automatic offload mode, 332–337
automatic and compiler-assisted use, 335
compiler-assisted offload, using, 337–338
using tips, 338
debug/test, 335
disable and re-enable, 335
effective use of, 333–336
enabling, 333
Intel MKL, effective use of, 336–337
data alignment and leading dimensions, 336
favor LAPACK unpacked routines, 336–337
OpenMP and threading settings, 336
vs. matrix size, 334
oversubscription, avoiding, 336
situations for, 335
using control work division, 333
auto_partitioner, 178

B

Baseboard management controller (BMC), 246, 265
Baseline 9-point stencil implementation, 61–68
Bash shell, environment variables in, 331f
Basic diffusion calculation, 84
Basic Linear Algebra Subprograms (BLAS), 326
blocked_range, 177–178
Blocking, 100
Blocking factor, 101
Board tools, 280–282
Boundary conditions, 99
Boundary effects accounting, 84–91
Built-in operations, 155t

C

C/C++ arrays, allocating memory for, 212–213
C Shell, environment variables in, 331f
Cache architectures, 100
Cache line reuse, in diffusion calculation, 100, 101f
Cache optimizations, 20–21
Card System Management Agent, 280
Carry-propagate instructions (CPI), 365–369
description and usage, 366–368
events used, 365
formulas and thresholds, 366
tuning suggestions, 368–369
Cilk Plus, 181–187
array notation, 110
and elemental functions, 187
cilk_for, 184–185
cilk_spawn and cilk_sync, 185–187
history of, 183
Cilk Plus array sections and elemental functions, 152–155
cilk_for, 184–185
_cilk_offload, 219–220
rules for using, 222
writing target-specific code with, 223–224
cilk_spawn and cilk_sync, 185–187
Clamshell, 245
C-language interfaces for the BLAS (CBLAS), 326–327
Clause, 134
SIMD directive, 131–133
clevicts, compiler generation of, 122–123
Clock gate core, 261f
Clock gate L2 cache and inter-processor network, 262f
Clock generators, 380
clock_gettime, 364–365
Clocksources
on coprocessor, 380
setting, 381
Code, running, 27–32
Communicating with the coprocessor, 26–27
Coprocessor Offload Infrastructure, 278
Compiler and programming models, 19–20
Compiler directives, 128–150
Compiler generation of clevicts, 122–123
Compiler options, 126–128
Compiler prefetches, 118–119
Compiler tips, 123–126
Compiler-assisted offload, 337f
Coprocessor silicon chip, 243, 245, 246–247
Compute to data access ratio
description and usage, 369–370
events used, 369
formulas and thresholds, 369
tuning suggestions, 370
Console kernel command line parameter, 299
Control panel, 280–282
Coprocessor Communication Link (CCL), 282–285, 285f, 345
Coprocessor offload interface (COI) layer, 240
Coprocessor support overview, 327–330
automatic offload, control functions for, 328–329
environment variables, 330
in native mode, 330–332
Coprocessor-only model, 275, 345, 346t
Cores
using all cores, 38–48

D

3D stencil, 85f
Data alignment, 114–116
to assist vectorization, 142–146
aligned STATIC arrays, 143–144
aligning data, 143
compiler of alignment, 144
Data layout, 113–114
Data locality, 100
Data reuse, 114
Data-type, 132
Diffusion algorithm, 84
diffusion_baseline(), 85
diffusion_baseline.c listing, 90f
diffusion_openmp() code, 94
diffusion_openmpv function, 95f
diffusion_peel function, 98f
diffusion_peel.c, 99
diffusion_tiled() function, 101, 103f
dir filelist directive, 312
Direct Memory Access (DMA)
capabilities, 258–260
channel arbitration, 259
channel descriptor ring, 259–260
DO CONCURRENT, 171
definition, 172–173
and DATA RACES, 171–172
definition, 172–173
vs. FORALL, 173
vs. OpenMP Parallel, 173
Double precision floating point (DP), 24
Drand48 Vectorization, example of, 135
Dynamic load balance, 358–361

E

Efficiency metrics, 364–370
carry-propagate instructions (CPI), 365–369
compute to data access ratio, 369–370
Elemental functions, 155
Erand48 vectorization, 138f
Error Correcting Codes (ECC), 263–264, 280–282
Event monitoring registers, 364
Extended math unit (EMU), 243, 253, 254
External Bridge Topology, in Linux, 304

F

file filelist directive, 312–313
File location parameters, 301
First Intel Xeon Phi coprocessor
key facts about, 7–9
preproduction, 7f
Floating point operations per second (FLOP/s), 24, 27–28, 375
Floating-point arithmetic variations, 339–342
basics, 339
floating-point exceptions, 340
-fp-model precise option, 340
-fp-model switch, 340
fused multiply-add (FMA) instruction, 340
math functions, precision of, 341
Xeon Phi coprocessors and Intel Xeon processors, 341–342
Fortran
arrays, 211
allocating memory for parts, 213–214
asynchronous data transfer
from coprocessor to processor, 233–234
from processor to coprocessor, 229–234
example of using pointers for various allocation types, 208f
example showing static vs. dynamic memory declarations, 207f
length parameter in, 201
offload pragma examples, 204–206
target-specific code using a directive in, 208f, 209, 210f
vector addition using Cilk Plus array notation, 109f
with “short vector” syntax, 110f
vector addition using standard C, 109f
Fortran 2008, 167t
DATA RACES, 171–172
DO CONCURRENT, 171
Fortran array sections, 150–152
subscript triplets, 151
vector subscripts, 151
Fortran compiler, 115, 269–270
Fortran prefetch intrinsics, 121f
Full vectors, 138–141, 161f
remainder loop, 139–140
Fused multiply and add (FMA), 24, 28

G

Ganglia, 280
gettimeofday, 364–365
GPUs, 16
Guided Auto-parallelization (GAP) report, 112

H

Hardware prefetcher (HWP), 253
Hello World, MPI version of, 344, 344f, 350, 354–356
MPI implementation specific settings, 356
mpirun argument sets, 355
wrapper script, 355–356
helloflops1.c code, 27–28, 31f
helloflops2.c code, 37f
helloflops3.c code, 39f
building of, 38
Fortran listing of, with OpenMP directives for scaling, 43f
in Intel Xeon Phi coprocessor, 40f
hellomem.c code, 50, 55f
code listing of, 53f
Heterogeneity, 345–348
High Performance Computing (HPC), 243
High Speed Driver Certificate, 105
Hint value, 119
Hit Rate metrics, 372
Host Channel Adapter (HCA), 282
Host-only model, 345, 346t
Huge 2-MB memory pages, 79–80
HUGE_TLB flag, 288
Hyper-threading vs. multithreading, 17–18

I

InfiniBand (IB), 275
core coprocessor modifications, 286
Proxy Client, 286–287
Proxy Daemon, 286
Proxy Server, 286
InfiniBand Host Channel Adapter (HCA), 282
Inlining, importance of, 126
Intel C/C++ compiler, 269–270
Intel Cilk, 107
Intel Cilk Plus, 19–20, 165, 167t
Intel Cluster Studio XE 2013, 269–270
Intel Compiler, 124–126, 327
Intel Coprocessor Communication Link (Intel CCL), 275, 282–285
Intel Coprocessor Offload Infrastructure (Intel COI), 278
Intel Many Integrated Core (MIC) Architecture, 10, 28
Intel Manycore Platform Software Stack (Intel MPSS), 270, 276, 277–287
Coprocessor Offload Infrastructure, 278
coprocessor components for MPI applications, 282–287
coprocessor communication link (Intel CCL), 282–285
IB core coprocessor modifications, 286
IB Proxy Client, 286–287
IB Proxy Daemon, 286
IB Proxy Server, 286
OFED/SCIF, 287
vendor HCA proxy driver, 287
coprocessor system management, 279–282
board tools, control panel, MicAccess SDK, 280–282
ganglia, 280
sysfs, 279–280, 280t
MYO (mine yours ours), 277
SCIF (symmetric communications interface), 278
sysfs and ganglia support, 281f
virtual networking (NetDev), TCP/IP, and sockets, 278
Intel Math Kernel Library (Intel MKL), 15, 107, 276
automatic offload, 192
differences on coprocessor, 327
and Intel compiler, 327
overview, 326–327
support function examples
for C/C++, 330f
for Fortran, 330f
Intel Math Kernel Library Reference Manual, 327
Intel MPI library, 269–270
Intel MPSS service, 296–297
Intel Parallel Studio XE 2013, 269–270
Intel Threading Building Blocks (TBB), 165, 167t, 174–181
blocked_range, 177–178
borrowing components from, 183–184
history of, 175–177
loaning components to, 184
notes on C++11, 180–181
overview of, 175f
parallel_for, 177
parallel_invoke, 180
parallel_reduce, 179–180
partitioners, 178–179
use of, 177
Intel Trace Analyzer and Collector (ITAC), 378–380
coprocessor only application, 379
processor+coprocessor application, 379–380
Intel VTune Amplifier XE, 16, 93, 104, 370–371, 377–378
Intel Xeon Phi coprocessor, 1, 59–60, 107, 165, 244–245, 270, 271–272, 341–342, 354, 372, 373
architecture of, 8f
cache optimizations, 20–21
compiler and programming models, 19–20
maximum performance, achieving, 16–17
examples, 21
first coprocessor, 6–9
floating point operations per second (FLOP/s), 27–28
GPUs, 16
High Speed Driver Certificate, 105
highly parallel execution, measuring readiness for, 15–16
hyper-threading vs. multithreading, 17–18
Linux commands, 27
Linux support for, 287
memory architecture of, 50f
microarchitecture of, 9f
MPI vs. offload model, 18–19
multiple cores, effect of using, 56f
need for, 2–5
ninja gap, controlling, 9
operational software, 269
parallel program performance, maximizing, 15
performance, maximizing, 11–12
platform specifications, 62t
platforms with, 5–6
preproduction of, 7f
scaling, need for, 12–14
silicon chip of, 246–247
software architecture, 270f
sysfs nodes, 281t
test card specs, 25t
time of use, 11
transformation, for performance, 17
transforming-and-tuning double advantage, 10
using MPI on, 345–348
vector floating point formats on, 25f
Intelligent Platform Management Interface (IPMI), 246
Intel Composer XE 2013, 26
Intel Manycore Platform Software Stack (MPSS), 26
Intel Many Integrated Core (MIC), 28
Internal Bridge Topology, in Linux, 303
IVDEP directive, 135–137
examples in C, 136–137
Fortran example, 136

J

jrand48 vectorization, 140f

K

Kernel, 271
Keyword spelling, 184
Knights Corner, 6–7

L

L1 Cache, 100
L2 Cache, 100
Language extensions for offload (LEO), 192–195, 193t, 194t
“Least recently used” (LRU), 100
Linux on coprocessor, 293
boot process, 315–318
kernel command line, 315–316
Linux kernel image, 317
bootstrap and configuration, 294–295
changing coprocessor configuration, 297–305
configurable components, 297–298
configuration files, 298
configuring boot parameters, 298–300
coprocessor root file system, 300–305
coprocessor Linux baseline, 293–294
default coprocessor Linux configuration, 295–297
external bridge topology, 304
Linux cluster, coprocessors in, 318–322
Intel Cluster Ready, 319
micctrl utility, 305–312
booting coprocessors, 306
configuration initialization and propagation, 308–309
configuration parameters, helper functions for, 309–311
coprocessor state control, 306
rebooting the coprocessors, 306–307
resetting coprocessors, 307
shutting down coprocessors, 306
software adding, 312–315
dir filelist directive, 312
download image file, 313
file filelist directive, 312–313
new global file set, 314–315
nod filelist directive, 313
pipe filelist directive, 313
root file system, 313–314
slink filelist directive, 313
sock filelist directive, 313
support for Intel Xeon Phi coprocessors, 287
Linux operating system (OS), 269
Linux Standard Base (LSB), 294, 294t
Linux support, for Intel Xeon Phi coprocessors, 287
Loadable kernel modules (LKMs), 294
Logging stdout and stderr from offloaded code, 240
Loop
aligned version, 160f
source code for, 160f
unaligned version, 160f
unrolling, 123–124
Low trip count, 131
lrand48 vectorization, 138f

M

Machine check architecture (MCA), 264–265
mallocs, 288
Manual prefetching, 120f, 121f
Manual unrolling, 123
Manycore Platform Software Stack (MPSS), 26
Many Integrated Core (MIC) Architecture, See Intel Many Integrated Core (MIC) Architecture
Many-one array section, 151
Math Library, 109, 325
precision choices and variations, 339–342
fast transcendentals and mathematics, 339
for floating-point arithmetic variations, 339–342
McCool, Michael, 385
memcpy, 196
Memory allocation for pointer variables
C/C++ offload pragma examples, 203–204
coprocessor memory management for input pointer variables, 201–202
Fortran offload pragma examples, 204–206
length parameter in Fortran, 201
managing, 198–199
pointer variables, alignment of, 203
target memory management for output pointer variables, 202
transferring data into pre-allocated memory on target, 202
Memory architecture on Intel Xeon Phi coprocessor, 50f
Memory bandwidth, 376–377
accessing, 49–54
description and usage, 376–377
events used, 376
formulas and threshold, 376
maximizing, 54–57
tuning suggestions, 377
Memory disambiguation inside vector-loops, 127–128
Message Passing Interface (MPI), 6, 276, 343
on Intel Xeon Phi coprocessors, 345–348
heterogeneity, 345–348
offload from, 349–354
Hello World, 350
trapezoidal rule, 350–354
prerequisites, 348
using natively on coprocessor, 354–361
Hello World, 354–356
trapezoidal rule, 356–361
programming models, 274–275
coprocessor-only model, 275
on the host processor platform using offload to coprocessors, 274f
offload model, 274
running on coprocessors only, 275f
symmetric communications, 276f
symmetric model, 275
MIC Access Software Developers Kit (MicAccess SDK), 280–282
MIC elapsed time counter (micetc), 380
miccheck, 280
micctrl utility, 305–312
MIC_ENV_PREFIX=MIC, 48
micflash, 280
micinfo, 280
Microarchitecture of Intel Xeon Phi coprocessor, 9f
Microsoft’s Parallel Patterns Library (PPL), 176
MKL, 325
Moore’s Law, 4f
MPI, 343
MPI application using Intel Coprocessor Communications Link, 271, 285f
MPI Library, 378
MPI rank, 343, 356
MPI vs. offload model, 18–19
mpirun argument sets, 355
mrand48 vectorization, 139f
Multiple cores, effect of using
on Intel Xeon Phi coprocessor, 56f
Multiple declarations, target attribute to, 234–238
measuring timing and data in offload regions, 236
Vec-report option used with offloads, 235–236
Multithreading, 17–18
MYO (Mine Yours Ours), 277

N

Native model, 18
Need for Intel Xeon Phi coprocessor, 2–5
Nesting, 170
NetDev drivers, 278, 279f
Network access, 302–305
Ninja gap, 9
nod filelist directive, 313
Non-shared memory model, 190, 191
NONTEMPORAL clause, 134
Non-uniform memory access (NUMA), 345
-no-opt-prefetch option, 121
NOVECTOR directives, 134–135
nrand48 vectorization, 139f

O

OFED (Open Fabrics Enterprise Distribution), 282
Offload, 189
asynchronous computation, 228–229
asynchronous data transfer, 229–234
from coprocessor to processor, 233–234
from processor to coprocessor, 229–234
C/C++ arrays, allocating memory for, 212–213
choosing vs. native execution, 191–192
_cilk_offload, 219–220
rules for using, 222
writing target-specific code with, 223–224
_cilk_shared, 220–221
rules for using, 222
compiler options and environment variables for, 193–195
creating offload libraries, 237–238
explicit copy modifiers, 201t
Fortran arrays, allocating memory for, 213–214
Intel Math Kernel Library (Intel MKL) automatic, 192
language extensions for, 192–195
libraries in offloaded code, 237
logging stdout and stderr from offloaded code, 240
memory allocation for pointer variables, 198–199
C/C++ offload pragma examples, 203–204
coprocessor memory management for input pointer variables, 201–202
Fortran offload pragma examples, 204–206
length parameter in Fortran, 201
pointer variables, alignment of, 203
target memory management for output pointer variables, 202
transferring data into pre-allocated memory on target, 202–203
to multiple coprocessors, 195
multiple declarations, target attribute to, 234–238
measuring timing and data in offload regions, 236
Vec-report option used with offloads, 235–236
non-shared memory model, 191
_Offload_report, 236
and OpenMP, 198f
performing file I/O on the coprocessor, 238–240
placing variables and functions on coprocessor, 198–199
pragma in C/C++, target-specific code using, 206–209
pragma offload, 190t
predefined macros for Intel MIC architecture Fortran arrays, 211
processor-only execution, 209–211
restrictions, using pragma, 215–216
shared functions, 219
shared memory management functions, 219
shared virtual memory model, 190t, 191–192, 217–228
C++ declarations of persistent data with, 227–228
persistent data using, 225–227
restrictions on, 224–225
and shared variables, 217–218
sharing environment variables for, 195
sharing variables and functions, 220–221
support, 197t
synchronization, 222
synchronous and asynchronous function execution, 219–220
target-specific code using
directive in Fortran, 209
time optimization, 206
two models, 190
using pragma/directive, 195–216
Offload model, 18, 274, 345, 346t, 349–354
Offload pragmas, 192
Offload programming model, 41–44
On-Die Interconnect (ODI), 247f
Open Fabrics Enterprise Distribution (OFED), 274, 282, 287
OpenACC models, 194t
OpenMP
clause, 91
code, 33
controls, 170t
directives, 169
nesting, 170
significant controls over, 169–170
OpenMP Parallel vs. DO CONCURRENT, 173–174
OpenMP Target (TR1) models, 194t
Oper, 132
-opt-assume-safe-padding, 138–141, 142
-opt-prefetch=50 option, 121

P

Padding, 76
Parallel 9-Point 2D stencil code, 75f
Parallelization on processor and coprocessor, 59
“alignment,” adjusting, 76–77
baseline 9-point stencil implementation, 61–68
baseline stencil code, running, 68–70
huge 2-MB memory pages, 79–80
9-point stencil algorithm, 60
streaming stores, 77–79
tuning, 75–80
vectorization, 70–72
vectors plus scaling, 72–75
Parallel Patterns Library (PPL), 176
Parallel processing model, 168–169
Parallel programming, 10
maximizing performance of, 15
parallel_for, 177
parallel_invoke, 180
parallel_reduce, 179–180
Partitioned Global Address Space (PGAS) model, 171
Partitioners, 178–179
Peel loop, 140–141
“Peel” unneeded code from inner loop, 97–100
Performance application programming interface (PAPI), 378
pipe filelist directive, 313
Platform specifications of Intel Xeon Phi coprocessor, 62t
Platforms with coprocessors, 5–6
9-point stencil algorithm, 60
over two-dimensional array, 61f
Portable High Performance Programming, 1
Potential performance issues, 370–377
general cache usage, 371–373
TLB misses, 373–374
Power gate core, 261f
Power management capabilities, 260–262
PowerManagement kernel command line parameter, 299–300
Pragma in C/C++, target-specific code using, 206–209
#pragma ivdep, 130
#pragma nounroll, 140
Pragma offload, 190t, 192
#pragma offload target, 44
#pragma omp parallel, 130
#pragma simd, 19–20, 94, 95, 129, 178–179
#pragma vector novecremainder, 140
Prefetching, 20, 112–123, 116–121
compiler prefetches, 118–119
controls, 119
manual prefetches, 119–121
Preproduction Intel Xeon Phi coprocessor, 7f
Privileged verbs, 284–285
Processor and coprocessor
core/thread parallelism, 3f
speed era, 2f
transistor count, 4f
vector parallelism, 3f
Profiling, 363
efficiency metrics, 364–370
compute to data access ratio, 369–370
event monitoring registers, 364
Intel Trace Analyzer and Collector, 378–380
coprocessor only application, 379
processor+coprocessor application, 379–380
Intel VTune Amplifier XE product, 16, 377–378
performance application programming interface, 378
potential performance issues, 370–377
general cache usage, 371–373
memory bandwidth, 376–377
TLB misses, 373–374
VPU usage, 374–375
and timing, 380–383
clocksources on coprocessor, 380
MIC elapsed time counter, 380
setting the clocksource, 381
time penalty, 382–383
time stamp counter (tsc), 380–381
time structures, 381–382
Programming options, in coprocessor, 273f

Q

qrestrict, 127

R

Random number function vectorization, 137–138
RDMA Peer-to-Peer Transfer with Intel CCL, 284f
Real-world code example, 83
basic diffusion calculation, 84
boundary effects accounting, 84–91
code scaling, 91–93
data locality, 100–104
peeling code from inner loop, 97–100
tiling, 100–104
vectorization ensuring, 93–96
Reducers, 187
Reinders, James, 385
Reliability, availability, and serviceability (RAS), 263–265
Remote direct memory access (RDMA), 282
restrict keyword, 127
restrict pointer, 128
Robison, Arch, 385
Root file system, 300–305
RootDevice kernel command line parameter, 300
Runtime Type Information (RTTI), 225

S

Scaling code, 91–93
Scatter/gather instructions, 161f
sect-subscript-list, 150
Shared virtual memory offload, 192
Shared secure shell (SSH) keys, 348
Shared virtual memory model, 190t, 191–192
C++ declarations of persistent data with, 227–228
persistent data using, 225–227
restrictions on, 224–225
Silicon chip, 243, 245, 246–247
Simple partitioner, 178
Simple profiling, importance of, 126
Single Dynamic Library (SDL), 327
Single instruction, multiple data (SIMD), 24
clauses, 131–133
directives, 129–134, 133–134
vectorize, requirements to, 130–131
Single precision floating point (SP), 24
Single-program, multiple-data (SPMD), 343
slink filelist directive, 313
sock filelist directive, 313
Software architecture and components, 269
architecture, 269–271
ring levels, 271
symmetry, 271
components, 276–277
development tools and application layer, 276–277
coprocessor programming models and options, 271–275
breadth and depth, 273
coprocessor-only model, 275
offload model, 274
symmetric model, 275
Intel Manycore platform software stack, 277–287
board tools, control panel, MicAccess SDK, 280–282
Coprocessor Offload Infrastructure, 278
coprocessor communication link (Intel CCL), 282–285
coprocessor system management, 279–282
ganglia, 280
IB core coprocessor modifications, 286
IB Proxy Client, 286–287
IB Proxy Daemon, 286
IB Proxy Server, 286
MPI applications, coprocessor components for, 282–287
MYO (mine yours ours), 277
OFED/SCIF, 287
SCIF (symmetric communications interface), 278
sysfs, 279–280, 280t
vendor HCA proxy driver, 287
virtual networking (NetDev), TCP/IP, and sockets, 278
Software architecture of Intel Xeon Phi coprocessor, 270f
Software Developers Kit (SDK), 280
Source Code, for loop, 160f
Spawning block, 185
Specifications, of coprocessor, 24–26
std::condition_variable, 181
stderr, 240
std::lock_guard, 180
std::mutex, 180
std::ostream, 187
stdout, 240
std::thread, 180–181
Stencil, 84, 85, 99, 100
Stencil Diffusion Algorithm Code, 85f
stencil9pt_base(), 63
sten2d9pt_base.c Listing, 67f
sten2d9pt_omp Function, 73f
sten2d9pt_vect Function, 71f
Streaming stores, 77–79, 114, 121–123
Structured Parallel Programming, 385
Subrange reduction, 180
Subscript triplet, 151
Swizzle modifier, 256
Symmetric communications interface (SCIF), 278, 282, 295
Symmetric model, 275, 345, 346t
Symmetric multiprocessor (SMP), 5, 243, 247
Symmetry, 271
Sync, 185–186
Sysfs, 279–280, 280t
System Management Bus (SMBus), 246
System Management Controller (SMC), 246, 265–266

T

Tasks, 165
Thread pools, importance of, 168
Threads
increasing the number of, 38
running, 32–38
Threading Building Blocks (TBB), See Intel Threading Building Blocks (TBB)
Thunder road, 93–96
Tiling, 100–101
performance improvements from, 104f
Time stamp counter (tsc), 380–381
Timer hardware devices, 380
Timing, 380–383
clocksources on coprocessor, 380
MIC elapsed time counter, 380
setting the clocksource, 381
time penalty, 382–383
time stamp counter (tsc), 380–381
time structures, 381–382
Transforming-and-tuning double advantage, 10
Translation Lookaside Buffer (TLB), 79, 251–252, 288
description and usage, 374
events used, 373
formulas and thresholds, 373
misses, 373–374
tuning suggestions, 374
Trapezoidal rule, 350–354, 356–361
cluster example, 361
dynamic load balance, 358–361
hybrid workloads, manual load balance of, 357–358
Tuning memory allocation performance, 288–290
number of 2-MB pages
controlling, 288
monitoring, 288–289
sample method for allocating 2-MB pages, 289–290

U

UNALIGNED clauses, 134
Unaligned version, of loop, 160f
Uniform Memory Access (UMA), 247
Unrolling, 123–124
User-level application code, 271

V

Var, 132, 133
Vec-report option with offloads, 235–236
Vector, 24
VECTOR directives, 134–135
Vector floating point formats on Intel Xeon Phi coprocessor, 25f
Vector processing unit (VPU), 15, 24, 243, 374–375
description and usage, 375
events used, 374
formula and threshold, 374–375
tuning suggestions, 375
Vectorization, 19
directives/pragmas to assist, 109
ensuring, 93–96
numerical result variations with, 163
toolkit, 110
Vectorization Intensity, 156, 368–369, 375
VECTORLENGTH, 130, 131
Vectors, 107
alignment, 112–123, 114–116
approaches to achieving, 108–110
data layout, 112–123, 113–114
need for, 107–108
numerical result variations with, 163
prefetching, 112–123, 116–121
process of, 108
six step vectorization methodology, 110–112
streaming through caches, 112–123
techniques to achieve, 108t
Vendor HCA proxy driver, 286–287
VerboseLogging kernel command line parameter, 299
Verbs, 284–285
Virtual networking (NetDev), TCP/IP, and sockets, 278
Virtual-shared memory model, 190
Volume, 84, 85
VTune Amplifier, 370–371, 377–378

W

Whole Package C6, 263f
Wrapper script, 355–356

X

Xeon Phi coprocessor, 372, 373, 385
xiar, 237–238
xild, 237–238