Index

Note: Page numbers followed by “f” and “t” refer to figures and tables, respectively.

affinity_partitioner, 178–179, 178, 179f

-align rec64byte, 143

-align recnbyte, 143

ALIGNED clauses, 134

Aligned version, of loop, 160f

allocate_huge_pages(), 289

Architecture, of coprocessor, 8f, 243

benchmarks, 267

cache organization and memory access consideration, 251–252

card design, 245–246

individual core architecture, 247–249

instruction and multithread processing, 249–251

PCIe system interface and DMA, 257–260

DMA capabilities, 258–260

power management capabilities, 260–262

prefetching, 252–253

reliability, availability, and serviceability (RAS), 263–265

machine check architecture (MCA), 264–265

system management controller (SMC), 265–266

fan speed control, 266

potential application impact, 266

sensors, 265–266

thermal design power monitoring and control, 266

vector processing unit architecture, 253–257

vector instructions, 254–257

Array notations, 109–110

Array section operator, 153

Assembly code, 157

inspection, 156–163

prefetch instructions, 162

quick inspection of, 158–163, 162–163

scatters or gathers, 161–162

unaligned loads and stores, 158–159

ASSERT keyword, 134

Asynchronous computation, 228–229

Asynchronous data transfer, 229–234

from coprocessor to processor, 233–234

from processor to coprocessor, 229–234, 232–233, 233–234

Auto partitioner, 178

Auto vectorization, 109

Automatic offload mode, 332–337

automatic and compiler-assisted use, 335

compiler-assisted offload, using, 337–338

using tips, 338

debug/test, 335

disable and re-enable, 335

effective use of, 333–336

enabling, 333

Intel MKL, effective use of, 336–337

data alignment and leading dimensions, 336

favor LAPACK unpacked routines, 336–337

OpenMP and threading settings, 336

vs. matrix size, 334

oversubscription, avoiding, 336

situations for, 335

using control work division, 333

auto_partitioner, 178

Baseboard management controller (BMC), 246, 265

Baseline 9-point stencil implementation, 61–68

Bash shell, environment variables in, 331f

Basic diffusion calculation, 84

Basic Linear Algebra Subprograms (BLAS), 326

blocked_range, 177–178

Blocking, 100

Blocking factor, 101

Board tools, 280–282

Boundary conditions, 99

Boundary effects accounting, 84–91

Built-in operations, 155t

C/C++ arrays, allocating memory for, 212–213

C Shell, environment variables in, 331f

Cache architectures, 100

Cache line reuse, in diffusion calculation, 100, 101f

Cache optimizations, 20–21

Card System Management Agent, 280

Cycles per instruction (CPI), 365–369

description and usage, 366–368

events used, 365

formulas and thresholds, 366

tuning suggestions, 368–369

Cilk Plus, 181–187

array notation, 110

and elemental functions, 187

cilk_for, 184–185

cilk_spawn and cilk_sync, 185–187

history of, 183

Cilk Plus array sections and elemental functions, 152–155

cilk_for, 184–185

_cilk_offload, 219–220

rules for using, 222

writing target-specific code with, 223–224

cilk_spawn and cilk_sync, 185–187

Clamshell, 245

C-language interfaces for the BLAS (CBLAS), 326–327

Clause, 134

SIMD directive, 131–133

clevicts, compiler generation of, 122–123

Clock gate core, 261f

Clock gate L2 cache and inter-processor network, 262f

Clock generators, 380

clock_gettime, 364–365

Clocksources

on coprocessor, 380

setting, 381

Code, running, 27–32

Communicating with the coprocessor, 26–27

Coprocessor Offload Infrastructure, 278

Compiler and programming models, 19–20

Compiler directives, 128–150

Compiler generation of clevicts, 122–123

Compiler options, 126–128

Compiler prefetches, 118–119, 119

Compiler tips, 123–126

Compiler-assisted offload, 337f

Coprocessor silicon chip, 243, 245, 246–247

Compute to data access ratio

description and usage, 369–370

events used, 369

formulas and thresholds, 369

tuning suggestions, 370

Console kernel command line parameter, 299

Control panel, 280–282

Coprocessor Communication Link (CCL), 282, 282–285, 285f, 345

Coprocessor offload interface (COI) layer, 240

Coprocessor support overview, 327–330

automatic offload, control functions for, 328–329

environment variables, 330

in native mode, 330–332

Coprocessor-only model, 275, 345, 346t

Cores

using all cores, 38–48

3D stencil, 85f

Data alignment, 114–116

to assist vectorization, 142–146

aligned STATIC arrays, 143–144

aligning data, 143

compiler of alignment, 144

Data layout, 113–114

Data locality, 100

Data reuse, 114

Data-type, 132

Diffusion algorithm, 84

diffusion_baseline(), 85

diffusion_baseline.c listing, 90f

diffusion_openmp() code, 94

diffusion_openmpv function, 95f

diffusion_peel function, 98f

diffusion_peel.c, 99

diffusion_tiled() function, 101, 103f

dir filelist directive, 312

Direct Memory Access (DMA)

capabilities, 258–260

channel arbitration, 259

channel descriptor ring, 259–260

DO CONCURRENT, 171

definition, 172–173

and DATA RACES, 171–172

definition, 172–173

vs. FORALL, 173

vs. OpenMP Parallel, 173

Double precision floating point (DP), 24

drand48 vectorization, example of, 135

Dynamic load balance, 358–361

Efficiency metrics, 364–370

cycles per instruction (CPI), 365–369

compute to data access ratio, 369–370

Elemental functions, 155

erand48 vectorization, 138f

Error Correcting Codes (ECC), 263–264, 280–282

Event monitoring registers, 364

Extended math unit (EMU), 243, 253, 254

External Bridge Topology, in Linux, 304

file filelist directive, 312–313

File location parameters, 301

First Intel Xeon Phi coprocessor

key facts about, 7–9

preproduction, 7f

Floating point operations per second (FLOP/s), 24, 27–28, 375

Floating-point arithmetic variations, 339–342

basics, 339

floating-point exceptions, 340

-fp-model precise option, 340

-fp-model switch, 340

fused multiply-add (FMA) instruction, 340

math functions, precision of, 341

Xeon Phi coprocessors and Intel Xeon processors, 341–342

Fortran

arrays, 211

allocating memory for parts, 213–214

asynchronous data transfer

from coprocessor to processor, 233–234

from processor to coprocessor, 229–233, 233–234

example of using pointers for various allocation types, 208f

example showing static vs. dynamic memory declarations, 207f

length parameter in, 201

offload pragma examples, 204–206

target-specific code using a directive in, 208f, 209, 210f

vector addition using Cilk Plus array notation, 109f

with “short vector” syntax, 110f

vector addition using standard C, 109f

Fortran 2008, 167t

DATA RACES, 171–172

DO CONCURRENT, 171

Fortran array sections, 150–152

subscript triplets, 151

vector subscripts, 151

Fortran compiler, 115, 269–270

Fortran prefetch intrinsics, 121f

Full vectors, 138–141, 161f

remainder loop, 139–140

Fused multiply and add (FMA), 24, 28

Ganglia, 280

gettimeofday, 364–365

GPUs, 16

Guided Auto-parallelization (GAP) report, 112

Hardware prefetcher (HWP), 253

Hello World, MPI version of, 344, 344f, 350, 354–356

MPI implementation specific settings, 356

mpirun argument sets, 355

wrapper script, 355–356

helloflops1.c code, 27–28, 31f

helloflops2.c code, 37f

helloflops3.c code, 39f

building of, 38

Fortran listing of, with OpenMP directives for scaling, 43f

in Intel Xeon Phi coprocessor, 40f

hellomem.c code, 50, 55f

code listing of, 53f

Heterogeneity, 345–348

High Performance Computing (HPC), 243

High Speed Driver Certificate, 105

Hint value, 119

Hit Rate metrics, 372

Host Channel Adapter (HCA), 282

Host-only model, 345, 346t

Huge 2-MB memory pages, 79–80

HUGE_TLB flag, 288

Hyper-threading vs. multithreading, 17–18

InfiniBand (IB), 275

core coprocessor modifications, 286

Proxy Client, 286–287

Proxy Daemon, 286

Proxy Server, 286

InfiniBand Host Channel Adapter (HCA), 282

Inlining, importance of, 126

Intel C/C++ compiler, 269–270

Intel Cilk, 107

Intel Cilk Plus, 19–20, 165, 167t

Intel Cluster Studio XE 2013, 269–270

Intel Compiler, 124–126, 327

Intel Coprocessor Communication Link (Intel CCL), 275, 282, 282–285, 284

Intel Coprocessor Offload Infrastructure (Intel COI), 278

Intel Many Integrated Core (MIC) Architecture, 10, 28

Intel Manycore Platform Software Stack (Intel MPSS), 270, 276, 277–287, 278, 280

Coprocessor Offload Infrastructure, 278

coprocessor components for MPI applications, 282–287

coprocessor communication link (Intel CCL), 282–285

IB core coprocessor modifications, 286

IB Proxy Client, 286–287

IB Proxy Daemon, 286

IB Proxy Server, 286

OFED/SCIF, 287

vendor HCA proxy driver, 287

coprocessor system management, 279–282

board tools, control panel, MicAccess SDK, 280–282

ganglia, 280

sysfs, 279–280, 280t

MYO (mine yours ours), 277

SCIF (symmetric communications interface), 278

sysfs and ganglia support, 281f

virtual networking (NetDev), TCP/IP, and sockets, 278

Intel Math Kernel Library (Intel MKL), 15, 107, 276

automatic offload, 192

differences on coprocessor, 327

and Intel compiler, 327

overview, 326–327

support function examples

for C/C++, 330f

for Fortran, 330f

Intel Math Kernel Library Reference Manual, 327

Intel MPI library, 269–270

Intel MPSS service, 296–297

Intel Parallel Studio XE 2013, 269–270

Intel Threading Building Blocks (TBB), 165, 167t, 174–181, 181

blocked_range, 177–178

borrowing components from, 183–184

history of, 175–177

loaning components to, 184

notes on C++11, 180–181

overview of, 175f

parallel_for, 177

parallel_invoke, 180

parallel_reduce, 179–180

partitioners, 178–179

use of, 177

Intel Trace Analyzer and Collector (ITAC), 378–380

coprocessor only application, 379

processor+coprocessor application, 379–380

Intel VTune Amplifier XE, 16, 93, 104, 370–371, 377–378

Intel Xeon Phi coprocessor, 1, 59–60, 107, 165, 244–245, 270, 271–272, 341–342, 354, 372, 373

architecture of, 8f

cache optimizations, 20–21

compiler and programming models, 19–20

maximum performance, achieving, 16–17

examples, 21

first coprocessor, 6–9

floating point operations per second (FLOP/s), 27–28

GPUs, 16

High Speed Driver Certificate, 105

highly parallel execution, measuring readiness for, 15–16

hyper-threading vs. multithreading, 17–18

Linux commands, 27

Linux support for, 287

memory architecture of, 50f

microarchitecture of, 9f

MPI vs. offload model, 18–19

multiple cores, effect of using, 56f

need for, 2–5

ninja gap, controlling, 9

operational software, 269

parallel program performance, maximizing, 15

performance, maximizing, 11–12

platform specifications, 62t

platforms with, 5–6

preproduction of, 7f

scaling, need for, 12–14

silicon chip of, 246–247

software architecture, 270f

sysfs nodes, 281t

test card specs, 25t

time of use, 11

transformation, for performance, 17

transforming-and-tuning double advantage, 10

using MPI on, 345–348

vector floating point formats on, 25f

Intelligent Platform Management Interface (IPMI), 246

Intel Composer XE 2013, 26

Intel Manycore Platform Software Stack (MPSS), 26

Intel Many Integrated Core (MIC), 28

Internal Bridge Topology, in Linux, 303

IVDEP directive, 135–137

examples in C, 136–137

Fortran example, 136

jrand48 vectorization, 140f

Kernel, 271

Keyword spelling, 184

Knights Corner, 6–7

L1 Cache, 100

L2 Cache, 100

Language extensions for offload (LEO), 192–195, 193t, 194t

“Least recently used” (LRU), 100

Linux on coprocessor, 293

boot process, 315–318

kernel command line, 315–316

Linux kernel image, 317

bootstrap and configuration, 294–295

changing coprocessor configuration, 297–305

configurable components, 297–298

configuration files, 298

configuring boot parameters, 298–300

coprocessor root file system, 300–305

coprocessor Linux baseline, 293–294

default coprocessor Linux configuration, 295–297

external bridge topology, 304

Linux cluster, coprocessors in, 318–322

Intel Cluster Ready, 319

micctrl utility, 305–312

booting coprocessors, 306

configuration initialization and propagation, 308–309

configuration parameters, helper functions for, 309–311

coprocessor state control, 306

rebooting the coprocessors, 306–307

resetting coprocessors, 307

shutting down coprocessors, 306

software adding, 312–315

dir filelist directive, 312

download image file, 313

file filelist directive, 312–313

new global file set, 314–315

nod filelist directive, 313

pipe filelist directive, 313

root file system, 313–314

slink filelist directive, 313

sock filelist directive, 313

support for Intel Xeon Phi coprocessors, 287

Linux operating system (OS), 269

Linux Standard Base (LSB), 294, 294t

Linux support, for Intel Xeon Phi coprocessors, 287

Loadable kernel modules (LKMs), 294

Logging stdout and stderr from offloaded code, 240

Loop

aligned version, 160f

source code for, 160f

unaligned version, 160f

unrolling, 123–124

Low trip count, 131

lrand48 vectorization, 138f

Machine check architecture (MCA), 264–265

mallocs, 288

Manual prefetching, 120f, 121f

Manual unrolling, 123

Manycore Platform Software Stack (MPSS), 26

Many Integrated Core (MIC) Architecture, See Intel Many Integrated Core (MIC) Architecture

Many-one array section, 151

Math Library, 109, 325

precision choices and variations, 339–342

fast transcendentals and mathematics, 339

for floating-point arithmetic variations, 339–342

McCool, Michael, 385

memcpy, 196

Memory allocation for pointer variables

C/C++ offload pragma examples, 203–204

coprocessor memory management for input pointer variables, 201–202

Fortran offload pragma examples, 204–206

length parameter in Fortran, 201

managing, 198–199

pointer variables, alignment of, 203

target memory management for output pointer variables, 202

transferring data into pre-allocated memory on target, 202

Memory architecture on Intel Xeon Phi coprocessor, 50f

Memory bandwidth, 376–377

accessing, 49–54

description and usage, 376–377

events used, 376

formulas and threshold, 376

maximizing, 54–57

tuning suggestions, 377

Memory disambiguation inside vector-loops, 127–128

Message Passing Interface (MPI), 6, 276, 343

on Intel Xeon Phi coprocessors, 345–348

heterogeneity, 345–348

offload from, 349–354

Hello World, 350

trapezoidal rule, 350–354

prerequisites, 348

using natively on coprocessor, 354–361

Hello World, 354–356

trapezoidal rule, 356–361

programming models, 274–275

coprocessor-only model, 275

on the host processor platform using offload to coprocessors, 274f

offload model, 274

running on coprocessors only, 275f

symmetric communications, 276f

symmetric model, 275

MIC Access Software Developers Kit (MicAccess SDK), 280, 280–282, 282

MIC elapsed time counter (micetc), 380

miccheck, 280

micctrl utility, 305–312

MIC_ENV_PREFIX=MIC, 48

micflash, 280

micinfo, 280

Microarchitecture of Intel Xeon Phi coprocessor, 9f

Microsoft’s Parallel Patterns Library (PPL), 176

MKL, 325

Moore’s Law, 4f

MPI, 343

MPI application using Intel Coprocessor Communications Link, 271, 285f

MPI Library, 378

MPI rank, 343, 356

MPI vs. offload model, 18–19

mpirun argument sets, 355

mrand48 vectorization, 139f

Multiple cores, effect of using

on Intel Xeon Phi coprocessor, 56f

Multiple declarations, target attribute to, 234–238

measuring timing and data in offload regions, 236

Vec-report option used with offloads, 235–236

Multithreading, 17–18

MYO (Mine Yours Ours), 277

Native model, 18

Need for Intel Xeon Phi coprocessor, 2–5

Nesting, 170

NetDev drivers, 278, 279f

Network access, 302–305

Ninja gap, 9

nod filelist directive, 313

Non-shared memory model, 190, 191

NONTEMPORAL clause, 134

Non-uniform memory access (NUMA), 345

-no-opt-prefetch option, 121

NOVECTOR directives, 134–135

nrand48 vectorization, 139f

OFED (Open Fabrics Enterprise Distribution), 282

Offload, 189

asynchronous computation, 228–229

asynchronous data transfer, 229–234

from coprocessor to processor, 233–234

from processor to coprocessor, 229–234, 232–233, 233–234

C/C++ arrays, allocating memory for, 212–213

choosing vs. native execution, 191–192

_cilk_offload, 219–220

rules for using, 222

writing target-specific code with, 223–224

_cilk_shared, 220–221

rules for using, 222

compiler options and environment variables for, 193–195

creating offload libraries, 237–238

explicit copy modifiers, 201t

Fortran arrays, allocating memory for, 213–214

Intel Math Kernel Library (Intel MKL) automatic, 192

language extensions for, 192–195

libraries in offloaded code, 237

logging stdout and stderr from offloaded code, 240

memory allocation for pointer variables, 198–199

C/C++ offload pragma examples, 203–204

coprocessor memory management for input pointer variables, 201–202

Fortran offload pragma examples, 204–206

length parameter in Fortran, 201

pointer variables, alignment of, 203

target memory management for output pointer variables, 202

transferring data into pre-allocated memory on target, 202, 202–203

to multiple coprocessors, 195

multiple declarations, target attribute to, 234–238

measuring timing and data in offload regions, 236

Vec-report option used with offloads, 235–236

non-shared memory model, 191

_Offload_report, 236

and OpenMP, 198f

performing file I/O on the coprocessor, 238–240

placing variables and functions on coprocessor, 198–199

pragma in C/C++, target-specific code using, 206–209

pragma offload, 190t

predefined macros for Intel MIC architecture Fortran arrays, 211

processor-only execution, 209–211

restrictions, using pragma, 215–216

shared functions, 219

shared memory management functions, 219

shared virtual memory model, 190t, 191–192, 217–228

C++ declarations of persistent data with, 227–228

persistent data using, 225–227

restrictions on, 224–225

and shared variables, 217–218

sharing environment variables for, 195

sharing variables and functions, 220–221

support, 197t

synchronization, 222

synchronous and asynchronous function execution, 219–220

target-specific code using

directive in Fortran, 209

time optimization, 206

two models, 190

using pragma/directive, 195–216

Offload model, 18, 274, 345, 346t, 349–354

Offload pragmas, 192

Offload programming model, 41–44

On-Die Interconnect (ODI), 247f

Open Fabrics Enterprise Distribution (OFED), 274, 282, 287

OpenACC models, 194t

OpenMP, 167t, 168–170, 346–348

clause, 91

code, 33

controls, 170t

directives, 169

nesting, 170

significant controls over, 169–170

OpenMP Parallel vs. DO CONCURRENT, 173–174

OpenMP Target (TR1) models, 194t

Oper, 132

-opt-assume-safe-padding, 138–141, 142

-opt-prefetch=50 option, 121

Padding, 76

Parallel 9-Point 2D stencil code, 75f

Parallelization on processor and coprocessor, 59

“alignment,” adjusting, 76–77

baseline 9-point stencil implementation, 61–68

baseline stencil code, running, 68–70

huge 2-MB memory pages, 79–80

9-point stencil algorithm, 60

streaming stores, 77–79

tuning, 75–80

vectorization, 70–72

vectors plus scaling, 72–75

Parallel Patterns Library (PPL), 176

Parallel processing model, 168–169

Parallel programming, 10

maximizing performance of, 15

parallel_for, 177

parallel_invoke, 180

parallel_reduce, 179–180

Partitioned Global Address Space (PGAS) model, 171

Partitioners, 178–179

Peel loop, 140–141

“Peel” unneeded code from inner loop, 97–100

Performance application programming interface (PAPI), 378

pipe filelist directive, 313

Platform specifications of Intel Xeon Phi coprocessor, 62t

Platforms with coprocessors, 5–6

9-point stencil algorithm, 60

over two-dimensional array, 61f

Portable High Performance Programming, 1

Potential performance issues, 370–377

general cache usage, 371–373

TLB misses, 373–374

Power gate core, 261f

Power management capabilities, 260–262

PowerManagement kernel command line parameter, 299–300

Pragma in C/C++, target-specific code using, 206–209

#pragma ivdep, 130

#pragma nounroll, 140

Pragma offload, 190t, 192

#pragma offload target, 44

#pragma omp parallel, 130

#pragma simd, 19–20, 94, 95, 129, 178–179

#pragma vector novecremainder, 140

Prefetching, 20, 112–123, 116–121

compiler prefetches, 118–119

controls, 119

manual prefetches, 119–121

Preproduction Intel Xeon Phi coprocessor, 7f

Privileged verbs, 284–285

Processor and coprocessor

core/thread parallelism, 3f

speed era, 2f

transistor count, 4f

vector parallelism, 3f

Profiling, 363

efficiency metrics, 364–370

compute to data access ratio, 369–370

CPI, 365–369

event monitoring registers, 364

Intel Trace Analyzer and Collector, 378–380

coprocessor only application, 379

processor+coprocessor application, 379–380

Intel VTune Amplifier XE product, 16, 377–378

performance application programming interface, 378

potential performance issues, 370–377

general cache usage, 371–373

memory bandwidth, 376–377

TLB misses, 373–374

VPU usage, 374–375

and timing, 380–383

clocksources on coprocessor, 380

MIC elapsed time counter, 380

setting the clocksource, 381

time penalty, 382–383

time stamp counter (tsc), 380–381

time structures, 381–382

Programming options, in coprocessor, 273f

qrestrict, 127

Random number function vectorization, 137–138

RDMA Peer-to-Peer Transfer with Intel CCL, 284f

Real-world code example, 83

basic diffusion calculation, 84

boundary effects accounting, 84–91

code scaling, 91–93

data locality, 100–104

peeling code from inner loop, 97–100

tiling, 100–104

vectorization ensuring, 93–96

Reducers, 187

Reinders, James, 385

Reliability, availability, and serviceability (RAS), 263–265

Remote direct memory access (RDMA), 282

restrict keyword, 127

restrict pointer, 128

Robison, Arch, 385

Root file system, 300–305

RootDevice kernel command line parameter, 300

Runtime Type Information (RTTI), 225

Scaling code, 91–93

Scatter/gather instructions, 161f

sect-subscript-list, 150

Shared virtual memory offload, 192

Shared secure shell (SSH) keys, 348

Shared virtual memory model, 190t, 191–192

C++ declarations of persistent data with, 227–228

persistent data using, 225–227

restrictions on, 224–225

Silicon chip, 243, 245, 246–247

Simple partitioner, 178

Simple profiling, importance of, 126

Single Dynamic Library (SDL), 327

Single instruction, multiple data (SIMD), 24

clauses, 131–133

directives, 129–134, 133–134

vectorize, requirements to, 130–131

Single precision floating point (SP), 24

Single-program, multiple-data (SPMD), 343

slink filelist directive, 313

sock filelist directive, 313

Software architecture and components, 269

architecture, 269–271

ring levels, 271

symmetry, 271

components, 276–277

development tools and application layer, 276–277

coprocessor programming models and options, 271–275

breadth and depth, 273

coprocessor-only model, 275

offload model, 274

symmetric model, 275

Intel Manycore platform software stack, 277–287

board tools, control panel, MicAccess SDK, 280–282

Coprocessor Offload Infrastructure, 278

coprocessor communication link (Intel CCL), 282–285

coprocessor system management, 279–282

ganglia, 280

IB core coprocessor modifications, 286

IB Proxy Client, 286–287

IB Proxy Daemon, 286

IB Proxy Server, 286

MPI applications, coprocessor components for, 282–287

MYO (mine yours ours), 277

OFED/SCIF, 287

SCIF (symmetric communications interface), 278

sysfs, 279–280, 280t

vendor HCA proxy driver, 287

virtual networking (NetDev), TCP/IP, and sockets, 278

Software architecture of Intel Xeon Phi coprocessor, 270f

Software Developers Kit (SDK), 280

Source Code, for loop, 160f

Spawning block, 185

Specifications, of coprocessor, 24–26

std::condition_variable, 181

stderr, 240

std::lock_guard, 180

std::mutex, 180

std::ostream, 187

stdout, 240

std::thread, 180–181

Stencil, 84, 85, 99, 100

Stencil Diffusion Algorithm Code, 85f

stencil9pt_base(), 63

sten2d9pt_base.c Listing, 67f

sten2d9pt_omp Function, 73f

sten2d9pt_vect Function, 71f

Streaming stores, 77–79, 114, 121–123

Structured Parallel Programming, 385

Subrange reduction, 180

Subscript triplet, 151

Swizzle modifier, 256

Symmetric communications interface (SCIF), 278, 282, 295

Symmetric model, 275, 345, 346t

Symmetric multiprocessor (SMP), 5, 243, 247

Symmetry, 271

Sync, 185–186

Sysfs, 279–280, 280t

System Management Bus (SMBus), 246

System Management Controller (SMC), 246, 265, 265–266

Tasks, 165

TBB, See Intel Threading Building Blocks (TBB)

Thread pools, importance of, 168

Threads

increasing the number of, 38

running, 32–38

Threading Building Blocks (TBB), See Intel Threading Building Blocks (TBB)

Thunder road, 93–96

Tiling, 100, 100–101

performance improvements from, 104f

Time stamp counter (tsc), 380–381

Timer hardware devices, 380

Timing, 380–383

clocksources on coprocessor, 380

MIC elapsed time counter, 380

setting the clocksource, 381

time penalty, 382–383

time stamp counter (tsc), 380–381

time structures, 381–382

Transforming-and-tuning double advantage, 10

Translation Lookaside Buffer (TLB), 79, 251–252, 288

description and usage, 374

events used, 373

formulas and thresholds, 373

misses, 373–374

tuning suggestions, 374

Trapezoidal rule, 350–354, 356–361

cluster example, 361

dynamic load balance, 358–361

hybrid workloads, manual load balance of, 357–358

Tuning memory allocation performance, 288–290

number of 2-MB pages

controlling, 288

monitoring, 288–289

sample method for allocating 2-MB pages, 289–290

UNALIGNED clauses, 134

Unaligned version, of loop, 160f

Uniform Memory Access (UMA), 247

Unrolling, 123–124

User-level application code, 271

Var, 132, 133

Vec-report option with offloads, 235–236

Vector, 24

VECTOR directives, 134–135, 135

Vector floating point formats on Intel Xeon Phi coprocessor, 25f

Vector processing unit (VPU), 15, 24, 243, 374–375

description and usage, 375

events used, 374

formula and threshold, 374–375

tuning suggestions, 375

Vectorization, 19

directives/pragmas to assist, 109

ensuring, 93–96, 94

numerical result variations with, 163

toolkit, 110

Vectorization Intensity, 156, 368–369, 375

VECTORLENGTH, 130, 131

Vectors, 107

alignment, 112–123, 114–116

approaches to achieving, 108–110

data layout, 112–123, 113–114

need for, 107–108

numerical result variations with, 163

prefetching, 112–123, 116–121

process of, 108

six step vectorization methodology, 110–112

streaming through caches, 112–123

techniques to achieve, 108t

Vendor HCA proxy driver, 286–287, 287

VerboseLogging kernel command line parameter, 299

Verbs, 284–285

Virtual networking (NetDev), TCP/IP, and sockets, 278

Virtual-shared memory model, 190

Volume, 84, 85

VTune Amplifier, 370–371, 377–378

Whole Package C6, 263f

Wrapper script, 355–356

Xeon Phi coprocessor, 372, 373, 385

xiar, 237–238

xild, 237–238
