Performance
This chapter describes the performance considerations for IBM z14 ZR1 servers.
This chapter includes the following topics:

12.1, “IBM z14 ZR1 performance characteristics”
12.2, “LSPR workload suite”
12.3, “Fundamental components of workload performance”
12.4, “Relative Nest Intensity”
12.5, “LSPR workload categories based on relative nest intensity”
12.6, “Relating production workloads to LSPR workloads”
12.7, “Workload performance variation”
12.1 IBM z14 ZR1 performance characteristics
The z14 ZR1 model Z06 is designed to offer up to 12% more capacity and twice the memory of the z13s model Z06 system.
Uniprocessor performance also increased. On average, a z14 ZR1 model Z01 offers performance improvements of up to 9% over the z13s model Z01 (see Figure 12-1).
Figure 12-1 IBM Z generations capacity comparison
The Large System Performance Reference (LSPR) provides capacity ratios among various processor families that are based on various measured workloads. It is a common practice to assign a capacity scaling value to processors as a high-level approximation of their capacities. The numbers for z14 ZR1 were obtained with z/OS V2R2.
For z/OS V2R2 studies, the capacity scaling factor that is commonly associated with the reference processor is set to a 2094-701 with a Processor Capacity Index (PCI) value of 593. This value is unchanged since the z/OS V1R11 LSPR. The use of the same scaling factor across LSPR releases minimizes the changes in capacity results for an older study and provides a more accurate capacity view for a new study.
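To make the scaling arithmetic concrete, the following minimal Python sketch divides a machine’s PCI by the reference PCI of 593 to obtain its capacity scaling factor. The target PCI value in the example is hypothetical and is used only for illustration:

    # Minimal sketch of LSPR-style capacity scaling.
    # The reference PCI (593 for the 2094-701) comes from the text;
    # the target PCI below is a hypothetical illustration value.

    REFERENCE_PCI = 593.0  # 2094-701 under the z/OS V2R2 LSPR

    def scaling_factor(target_pci: float) -> float:
        """Capacity of a target processor relative to the reference."""
        return target_pci / REFERENCE_PCI

    # A hypothetical machine with PCI 1186 scales to 2.00x the reference:
    print(f"Capacity scaling factor: {scaling_factor(1186.0):.2f}")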
On average, z14 ZR1 servers can deliver up to 12% more performance in a 6-way configuration than a z13s 6-way. However, the observed performance increase varies depending on the workload type.
Consult the LSPR when you consider performance on the z14 ZR1. The range of performance ratings across the individual LSPR workloads is likely to include a large spread. Performance of the individual logical partitions (LPARs) varies depending on the fluctuating resource requirements of other partitions and the availability of processor units (PUs). For more information, see 12.7, “Workload performance variation” on page 404.
For more information about performance, see the Large Systems Performance Reference for IBM Z page of the Resource Link website.
For more information about millions of service units (MSU) ratings, see the IBM z Systems Software Contracts page of the IBM IT infrastructure website.
12.2 LSPR workload suite
Historically, LSPR capacity tables, including pure workloads and mixes, were identified with application names or a software characteristic; for example, CICS, IMS, OLTP-T [1], CB-L [2], LoIO-mix [3], and TI-mix [4]. However, capacity performance is more closely associated with how a workload uses and interacts with a particular processor hardware design.
The CPU Measurement Facility (CPU MF) data that was introduced on the z10 provides insight into the interaction of workload and hardware design in production workloads. CPU MF data helps LSPR to adjust workload capacity curves that are based on the underlying hardware sensitivities; in particular, the processor access to caches and memory, which is called the nest. By using this data, LSPR introduces three workload capacity categories that replace all older primitives and mixes.
LSPR contains the internal throughput rate ratios (ITRRs) for the z14 ZR1 and the previous generation processor families. These ratios are based on measurements and projections that use standard IBM benchmarks in a controlled environment.
The throughput that any user experiences can vary depending on the amount of multiprogramming in the user’s job stream, the I/O configuration, and the workload processed. Therefore, no assurance can be given that an individual user can achieve throughput improvements that are equivalent to the performance ratios that are stated.
12.3 Fundamental components of workload performance
Workload performance is sensitive to the following major factors:
Instruction path length
Instruction complexity
Memory hierarchy and memory nest
These factors are described in this section.
12.3.1 Instruction path length
A transaction or job runs a set of instructions to complete its task. These instructions are composed of various paths through the operating system, subsystems, and application. The total count of instructions that are run across these software components is referred to as the transaction or job path length.
The path length varies for each transaction or job, and depends on the complexity of the tasks that must be run. For a particular transaction or job, the application path length tends to stay the same, assuming that the transaction or job is asked to run the same task each time.
However, the path length that is associated with the operating system or subsystem can vary based on the following factors:
Competition with other tasks in the system for shared resources. As the total number of tasks grows, more instructions are needed to manage the resources.
The number of logical processors (n-way) of the image or LPAR. As the number of logical processors grows, more instructions are needed to manage resources that are serialized by latches and locks.
12.3.2 Instruction complexity
The type of instructions and the sequence in which they are run interact with the design of a microprocessor to affect a performance component. This factor is defined as instruction complexity. The following design alternatives affect this component:
Cycle time (GHz)
Instruction architecture
Pipeline
Superscalar
Out-of-order execution
Branch prediction
Translation Lookaside Buffer (TLB)
Transactional Execution (TX)
Single instruction multiple data instruction set (SIMD)
Simultaneous multithreading (SMT) [5]
Performance varies as workloads are moved between microprocessors with various designs. However, when on a processor, this component tends to be similar across all models of that processor.
12.3.3 Memory hierarchy and memory nest
The memory hierarchy of a processor generally refers to the caches, data buses, and memory arrays that stage the instructions and data that must be run on the microprocessor to complete a transaction or job.
The following design choices affect this component:
Cache size.
Latencies (sensitive to distance from the microprocessor).
Number of levels, the Modified, Exclusive, Shared, Invalid (MESI) protocol, controllers, switches, the number and bandwidth of data buses, and so on.
Certain caches are private to the microprocessor core, which means that only that microprocessor core can access them. Other caches are shared by multiple microprocessor cores. The term memory nest for an IBM Z processor refers to the shared caches and memory along with the data buses that interconnect them.
A memory nest in a fully populated z14 ZR1 CPC drawer is shown in Figure 12-2.
Figure 12-2 Memory hierarchy in a fully populated z14 ZR1 CPC drawer
Workload performance is sensitive to how deep into the memory hierarchy the processor must go to retrieve the workload instructions and data for running. The best performance occurs when the instructions and data are in the caches nearest the processor because little time is spent waiting before running. If the instructions and data must be retrieved from farther out in the hierarchy, the processor spends more time waiting for their arrival.
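This sensitivity can be illustrated with a simple weighted-latency model. The following Python sketch computes an average access latency from assumed hit latencies and the fraction of accesses resolved at each level; the cycle counts and fractions are invented for the example and are not z14 ZR1 specifications:

    # Illustrative weighted-latency model of a cache hierarchy.
    # All latencies and fractions are invented for this example;
    # they are NOT z14 ZR1 hardware specifications.

    levels = [
        # (level, hit latency in cycles, fraction of accesses resolved here)
        ("L1", 4, 0.92),
        ("L2", 12, 0.05),
        ("L3", 45, 0.02),
        ("L4/memory", 200, 0.01),
    ]

    avg_cycles = sum(latency * fraction for _, latency, fraction in levels)
    print(f"Average access latency: {avg_cycles:.1f} cycles")

    # Shifting even 1% of accesses from L1 to L4/memory adds roughly
    # 2 cycles to the average, which is why sourcing depth matters.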
As workloads are moved between processors with various memory hierarchy designs, performance varies because the average time to retrieve instructions and data from within the memory hierarchy varies. Also, when on a processor, this component continues to vary because the location of a workload’s instructions and data within the memory hierarchy is affected by several factors that include, but are not limited to, the following factors:
Locality of reference
I/O rate
Competition from other applications and LPARs
12.4 Relative Nest Intensity
The most performance-sensitive area of the memory hierarchy is the activity to the memory nest; that is, the distribution of activity to the shared caches and memory.
The term Relative Nest Intensity (RNI) indicates the level of activity to this part of the memory hierarchy. By using data from CPU MF, the RNI of the workload that is running in an LPAR can be calculated. The higher the RNI, the deeper into the memory hierarchy the processor must go to retrieve the instructions and data for that workload.
RNI reflects the distribution and latency of sourcing data from shared caches and memory, as shown in Figure 12-3.
Figure 12-3 Relative Nest Intensity
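IBM publishes machine-generation-specific RNI formulas that weight the CPU MF cache-sourcing percentages. The following Python sketch shows the shape of such a calculation; the weights, scale, and sourcing profile are illustrative placeholders, not the published z14 ZR1 coefficients:

    # Sketch of an RNI-style calculation from CPU MF sourcing data.
    # The weights and scale are illustrative placeholders; the real
    # machine-specific coefficients are published with the LSPR.

    def relative_nest_intensity(l3p: float, l4lp: float,
                                l4rp: float, memp: float) -> float:
        """l3p, l4lp, l4rp, memp: percentages of L1 misses sourced from
        shared L3, local L4, remote L4, and memory, respectively."""
        w_l3, w_l4l, w_l4r, w_mem, scale = 0.4, 1.6, 3.5, 7.5, 2.3
        return scale * (w_l3 * l3p + w_l4l * l4lp +
                        w_l4r * l4rp + w_mem * memp) / 100

    # Hypothetical profile: mostly L3 sourcing, little memory traffic.
    print(f"RNI = {relative_nest_intensity(20.0, 5.0, 1.0, 3.0):.2f}")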
Many factors influence the performance of a workload. However, what these factors ultimately influence is the RNI of the workload. The interaction of all these factors results in a net RNI for the workload, which in turn directly relates to the performance of the workload.
These factors are tendencies, not absolutes. For example, a workload might have a low I/O rate, intensive processor use, and a high locality of reference, which all suggest a low RNI. But, it might be competing with many other applications within the same LPAR and many other LPARs on the processor, which tends to create a higher RNI. It is the net effect of the interaction of all these factors that determines the RNI.
The traditional factors that were used to categorize workloads in the past are shown with their RNI tendency in Figure 12-4.
Figure 12-4 Traditional factors that were used to categorize workloads
Little can be done to affect most of these factors. An application type is whatever is necessary to do the job. The data reference pattern and processor usage tend to be inherent to the nature of the application. The LPAR configuration and application mix are mostly a function of what must be supported on a system. The I/O rate can be influenced somewhat through buffer pool tuning.
However, one factor, software configuration tuning, is often overlooked but can have a direct effect on RNI. This term refers to the number of address spaces (such as CICS application-owning regions [AORs] or batch initiators) that are needed to support a workload. This factor always existed, but its sensitivity is higher with the current high-frequency microprocessors. Spreading the same workload over more address spaces than necessary can raise a workload’s RNI. This increase occurs because the working set of instructions and data from each address space increases the competition for the processor caches.
Tuning to reduce the number of simultaneously active address spaces to the optimum number that is needed to support a workload can reduce RNI and improve performance. In the LSPR, the number of address spaces for each processor type and n-way configuration is tuned to be consistent with what is needed to support the workload. Therefore, the LSPR workload capacity ratios reflect a presumed level of software configuration tuning. Retuning the software configuration of a production workload as it moves to a larger or faster processor might be needed to achieve the published LSPR ratios.
12.5 LSPR workload categories based on relative nest intensity
A workload’s RNI is the most influential factor in determining workload performance. Other more traditional factors, such as application type or I/O rate, have RNI tendencies. However, it is the net RNI of the workload that is the underlying factor in determining the workload’s performance. The LSPR now runs various combinations of former workload primitives, such as CICS, Db2, IMS, OSAM, VSAM, WebSphere, COBOL, and utilities, to produce capacity curves that span the typical range of RNI.
The following workload categories are represented in the LSPR tables:
LOW (relative nest intensity)
A workload category that represents light use of the memory hierarchy.
AVERAGE (relative nest intensity)
A workload category that represents average use of the memory hierarchy. This category is expected to represent most production workloads.
HIGH (relative nest intensity)
A workload category that represents a heavy use of the memory hierarchy.
These categories are based on the RNI. The RNI is influenced by many variables, such as application type, I/O rate, application mix, processor usage, data reference patterns, LPAR configuration, and the software configuration that is running. CPU MF data can be collected by z/OS System Measurement Facility on SMF 113 records or z/VM Monitor starting with z/VM V5R4.
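As a sketch of what can be derived from those records, the following Python fragment computes L1MP, the percentage of references that miss the L1 cache, which is used together with RNI in 12.6 to classify a workload. The variable names are descriptive stand-ins, not actual SMF 113 field identifiers:

    # Sketch: deriving L1MP from CPU MF basic counter totals, as
    # recorded over an SMF interval. Variable names are descriptive
    # stand-ins, not real SMF 113 field identifiers.

    def l1_miss_percent(instructions: int,
                        l1i_misses: int,
                        l1d_misses: int) -> float:
        """Percentage of instructions that miss the L1 caches."""
        return 100.0 * (l1i_misses + l1d_misses) / instructions

    # Hypothetical interval totals for illustration:
    l1mp = l1_miss_percent(10_000_000_000, 150_000_000, 250_000_000)
    print(f"L1MP = {l1mp:.1f}%")  # -> 4.0%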
12.6 Relating production workloads to LSPR workloads
Historically, the following techniques were used to match production workloads to LSPR workloads:
Application name (a client that is running CICS can use the CICS LSPR workload)
Application type (create a mix of the LSPR online and batch workloads)
I/O rate (low I/O rate workloads used a mix of low I/O rate LSPR workloads)
The IBM Processor Capacity Reference for IBM Z (zPCR) tool supports the following workload categories:
Low
Low-Average
Average
Average-High
High
For more information about the no-charge IBM zPCR tool (which reflects the latest IBM LSPR measurements), see the Getting Started with zPCR (IBM's Processor Capacity Reference) page of the IBM Techdocs Library website.
As described in 12.5, “LSPR workload categories based on relative nest intensity” on page 403, the underlying performance sensitive factor is how a workload interacts with the processor hardware.
Beginning with the z10 processor, the hardware characteristics can be measured by using CPU MF (SMF 113) counters data. A production workload can be matched to an LSPR workload category through these hardware characteristics. For more information about RNI, see 12.5, “LSPR workload categories based on relative nest intensity” on page 403.
The AVERAGE RNI LSPR workload is intended to match most client workloads. When no other data is available, use the AVERAGE RNI LSPR workload for capacity analysis.
Low-Average and Average-High categories allow better granularity for workload characterization.
For z10 and newer processors, the CPU MF data can be used to provide an extra hint as to workload selection. When available, this data allows the RNI for a production workload to be calculated.
By using the RNI and another factor from CPU MF, the L1MP (the percentage of data and instruction references that miss the L1 cache), a workload can be classified as LOW, AVERAGE, or HIGH RNI. This classification and the resulting hint are automated in the zPCR tool. It is preferable to use zPCR for capacity sizing.
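The decision logic amounts to a small threshold table over L1MP and RNI. The following Python sketch shows the shape of that classification; the thresholds are illustrative placeholders only, because zPCR implements the published rules:

    # Illustrative LOW/AVERAGE/HIGH classification from L1MP and RNI.
    # Thresholds are placeholders; zPCR applies the published rules.

    def classify(l1mp: float, rni: float) -> str:
        if l1mp < 3.0:                  # light L1 miss traffic
            return "LOW"
        if l1mp <= 6.0:                 # moderate L1 miss traffic
            if rni < 0.75:
                return "LOW"
            return "AVERAGE" if rni <= 1.0 else "HIGH"
        return "AVERAGE" if rni < 0.75 else "HIGH"  # heavy traffic

    print(classify(l1mp=4.2, rni=0.9))  # -> AVERAGE (hypothetical input)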
12.7 Workload performance variation
As transistor sizes approach the size of atoms, a fundamental physical barrier, a processor chip’s performance can no longer double every two years (known as Moore’s Law [6]).
Because the performance gains from frequency increases are reduced, a holistic performance approach is required, and hardware and software synergy becomes an absolute requirement.
Starting with z13, Instructions Per Cycle (IPC) improvements in the core and cache designs became the driving factor for performance gains. As these microarchitectural features increase instruction parallelism, overall workload performance variability also increases because not all workloads react the same way to these enhancements. Also, the memory and cache designs affect various workloads in many ways. All workloads are improved, with cache-intensive loads expected to benefit the most.
The workload variability for moving from z13s to z14 ZR1 is expected to be stable. Workloads that are migrating from zBC12 and previous generations to z14 ZR1 can expect to see similar results with slightly less variability than the typical z13s experience.
The effect of this variability is increased deviations of workloads from single-number metric-based factors, such as millions of instructions per second (MIPS), MSUs, and CPU time charge-back algorithms.
Experience demonstrates that IBM Z servers can be run at up to 100% utilization levels, sustained, although most clients prefer to leave some room and run at 90% or slightly under. Any capacity comparison exercise that uses a single metric, such as MIPS or MSU, is not a valid method. When deciding the number of processors and the uniprocessor capacity, consider the workload characteristics and LPAR configuration. For these reasons, the use of zPCR and the involvement of IBM technical support are recommended when you plan capacity.
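The following Python sketch illustrates why a single number misleads: the same upgrade shows a different capacity ratio for each LSPR workload category, so one averaged MIPS figure hides the spread. All ratios and the MIPS figure are invented for the example:

    # Why single-number sizing misleads: capacity ratios differ by
    # LSPR workload category. All values are invented for the example.

    ratios = {"LOW": 1.14, "AVERAGE": 1.10, "HIGH": 1.05}  # hypothetical
    current_mips = 1500.0

    for category, ratio in ratios.items():
        print(f"{category:8s}: ~{current_mips * ratio:,.0f} MIPS after upgrade")

    # The ~135 MIPS spread between LOW and HIGH is invisible in a
    # single averaged metric, hence the zPCR recommendation above.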
12.7.1 Main performance improvement drivers with z14 ZR1 servers
z14 ZR1 servers deliver new levels of performance and capacity for large-scale consolidation and growth. The attributes and design points of z14 ZR1 servers contribute to overall performance and throughput improvements as compared to the z13s.
The z/Architecture implementation includes the following enhancements:
Guarded Storage Facility: An enhancement for garbage collected languages.
Vector Packed Decimal Facility: An enhancement of packed decimal operations.
Vector Enhancements Facility: Includes several vector enhancements, such as adding support for single precision floating point and VMSL for cryptographic computations.
Order Preserving Compression and Entropy Encoding for CMPSC: Allows comparisons against compressed data. Entropy Encoding increases compression ratio.
Miscellaneous New General Instructions: An enhancement of 64-bit halfword operations and new multiply instructions.
Removal of Runtime Instrumentation External Interruption: Avoids RI Buffer Full interrupt.
Semaphore Assist Facility and enhanced NIAI (next instruction access intent) code points.
Message-Security-Assist Extensions (MSA 6, 7, and 8), adding SHA-3 hashing, true random number generation, and AES-GCM mode.
The z14 ZR1 microprocessor includes the following design enhancements:
14 nm FinFET SOI technology with IBM embedded dynamic random access memory (eDRAM) technology
Up to nine active processor cores per chip
Clock frequency at 4.5 GHz
A new translation/TLB2 design with four hardware-implemented translation engines, which reduces latency compared with the single pico-coded engine on z13s
Branch prediction improvements:
 – 33% growth in Branch Target Buffer levels 1 and 2 (BTB1 and BTB2)
 – New perceptron predictor
 – Simple call-return stack
Pipeline optimization:
 – Improved instruction delivery
 – Faster branch wake-up
 – Reduced execution latency
 – Improved OSC prediction
Second generation of Simultaneous multithreading (SMT):
 – Includes SAPs, zIIPs, and IFLs
 – Improved thread balancing
 – Multiple outstanding translations
 – Optimized hang avoidance mechanisms
Improved Hot Cache Line handling; dynamic throttling
Cache improvements:
 – New power efficient logical directory design
 – L1 I-cache increased from 96 KB to 128 KB per core (33%)
 – L2 D-cache increased from 2 MB to 4 MB per core (2x)
 – L3 cache (shared) increased from 64 MB to 128 MB per PU SCM (2x)
 – New L4 cache design with 672 MB (shared) per drawer and L4 sequential prefetch
Enhanced binary coded decimal architecture (full set of register-to-register BCD instructions)
New instructions for Single-instruction multiple-data (SIMD) operations
One cryptographic/compression co-processor per core, redesigned:
 – CP Assist for Cryptographic Functions (CPACF) (hardware) runs more operations, such as SHA-3, SHAKE hashing algorithms, and True Random-number generation (TRNG)
 – Improved Galois Counter Mode (GCM) performance
 – Entropy-Encoding Compression Enhancement with Huffman encoding
 – Order-Preserving compression
Adjusted HiperDispatch to use the new chip configuration
The z14 ZR1 design features the following enhancements as compared with the z13s:
Increased number of characterizable cores, from 20 to 30
Hardware system area (HSA) increased from 40 GB to 64 GB
A total of 8 TB of addressable memory (configurable to LPARs)
PR/SM enhancements:
 – Improved memory affinity
 – Optimized LPAR placement algorithms
Dynamic Partition Manager Version 3.2:
 – FC (with z14 hardware) and FCP storage support
 – Storage Groups management enhancements
SMT enablement for system assist processors (SAPs)
New Coupling Facility Control Code (CFCC) level with improved performance and the following enhancements:
 – Asynchronous Cross-Invalidate (XI) of CF cache structures
 – Coupling Facility (CF) Processor Scalability
 – CF List Notification Enhancements
 – CF Request Diagnostics
 – Coupling Link Constraint Relief
 – CF Encryption
 – Asynchronous duplexing of CF lock structures
The following new features are available on z14 ZR1 servers:
 – Dynamic I/O for Standalone Coupling Facility CPCs
 – Coupling Express Long Reach
 – zHyperlink Express
 – OSA-Express7S [7] 25GbE SR
 – FICON Express16S+
 – 25GbE RoCE Express2
 – 10GbE RoCE Express2
 – Crypto Express6S with up to 40 domains
 

[1] Traditional online transaction processing workload (formerly known as IMS).
[2] Commercial batch with long-running jobs.
[3] Low I/O Content Mix Workload.
[4] Transaction Intensive Mix Workload.
[5] Available for IFL, zIIP, and SAP processors only.
[6] For more information, see the Moore’s Law website.
[7] Check the IBM United States Hardware Announcement 118-075 and Driver Exception Letter for feature availability.