The POWER Hypervisor
This chapter introduces the POWER7 Hypervisor and describes some of the technical details of this product. It covers the following topics:
3.1, Introduction to the POWER7 Hypervisor
3.2, POWER7 virtualization
3.3, Related publications
3.1 Introduction to the POWER7 Hypervisor
Power Virtualization was introduced in POWER5 systems, so many reference materials are available that cover virtualization of all three resources (CPU, memory, and I/O), capacity planning, and virtualization management. Some of these documents are listed in the reference section at the end of this chapter; this section focuses mainly on POWER7 virtualization. As with any workload deployment, capacity planning, selecting the correct set of technologies, and appropriate tuning are critical to deploying high-performing workloads. However, when deploying workloads in virtualized environments, there are additional aspects to consider, such as the consolidation ratio, workload resource usage patterns, and the suitability of a workload to run in a shared resource environment (including its latency requirements).
The first step in the virtualization deployment process is to understand whether the performance of a workload in a shared resource environment meets customer requirements. If a workload requires consistent performance with stringent latency requirements, it should be deployed on a dedicated partition rather than on a shared LPAR. The exception is when the shared processor pool is not heavily overcommitted or overutilized; in that case, such workloads can meet stringent requirements in a shared LPAR configuration as well.
It is a preferred practice to understand the resource usage of all workloads that are planned for consolidation on a single system, especially when you plan to use a shared resource model, such as shared LPARs, IBM Active Memory™ Sharing, and VIO server technologies. The next step is to use a capacity planning tool that takes virtualization impacts into consideration, such as the IBM Workload Estimator, to estimate capacity for each partition. One of the goals of virtualization is to maximize usage. This can be achieved by consolidating workloads that peak at different times (that is, in a non-overlapping manner), so that each workload (or partition) does not have to be sized for peak usage but rather for average usage. At the same time, each workload can grow to consume free resources from the shared pool that belong to other partitions on the system. This situation allows more partitions (workloads) to be packed on a single system, producing a higher consolidation ratio or higher density on the deployed system. A higher consolidation ratio is a key metric to achieve in the data center, as it helps to reduce the total cost of ownership (TCO).
The following key attributes require consideration when you deploy workloads on a shared resource model (virtualization):
Levels of variation between average and peak usage of workloads:
 – A large difference between average and peak usage
 – A small difference between average and peak usage
Workloads and their peak duration, frequency, and an estimate of when they potentially peak:
Select workloads that peak at different times (non-overlapping).
Workload service level agreement (SLA) requirements (latency requirements and their tolerance levels).
Ratio of active to inactive (mostly idle) partitions on a system.
Provisioning and de-provisioning frequency.
IBM PowerVM has a richer set of technology options than virtualization on other platforms. It supports dedicated, shared, and a mix of dedicated and shared resource models for each of the system resources, such as processor cores, memory, and I/O:
 – Shared LPAR: Capped versus uncapped.
 – Shared LPAR: Resources overcommit levels to meet the peak usage (the ratio of virtual processors to physical processor entitled capacity).
 – Shared LPAR: Weight selection to assign a level of priority to get uncapped capacity (excess cycles to address the peak usage).
 – Shared LPAR: Multiple shared pools to address software licensing costs, which prevents a set of partitions from exceeding its capacity consumption.
 – Active Memory Sharing: The size of a shared pool is based on active workload memory consumption:
 • Inactive workload memory is used for active workloads, which reduces the memory capacity of the pool.
 • The Active Memory De-duplication option can reduce memory capacity further.
 • AIX file system cache memory is loaned to address memory demands, which leads to memory savings.
 • Workload load variation changes active memory consumption, which creates opportunities for sharing.
 – Active Memory Sharing: The shared pool size determines the level of memory overcommitment. The pool can start without overcommitment, and its size can then be reduced based on workload consumption.
 – Active Memory Expansion: AIX working set memory is compressed.
 – Active Memory Sharing and Active Memory Expansion can be deployed on the same workload.
 – Active Memory Sharing: VIO server sizing is critical for CPU and memory.
 – Virtual Ethernet: An inter-partition communication VLANs option that is used for higher network performance.
 – Shared Ethernet versus host Ethernet.
 – Virtual disk I/O: Virtual small computer system interface (vSCSI), N_Port ID Virtualization (NPIV), file-backed storage, and storage pool.
 – Dynamic resource movement (DLPAR) to adapt to growth.
3.2 POWER7 virtualization
PowerVM hypervisor and the AIX operating system (AIX V6.1 TL 5 and later versions) on POWER7 implement enhanced affinity in a number of areas to achieve optimized performance for workloads that are running in a virtualized shared processor logical partition (SPLPAR) environment. By using the preferred practices that are described in this guide, customers can attain optimum application performance in a shared resource environment. This guide covers preferred practices in the context of POWER7 Systems, so this section can be used as an addendum to other PowerVM preferred practice documents.
3.2.1 Virtual processors
A virtual processor is a unit of virtual processor resource that is allocated to a partition or virtual machine. The PowerVM hypervisor can map a virtual processor to a whole physical processor core, or it can create a time slice of a physical processor core.
The PowerVM hypervisor creates time slices (Micro-Partitioning) on the physical CPUs by dispatching and undispatching the various virtual processors for the partitions that are running in the shared pool.
If a partition has multiple virtual processors, they might or might not be scheduled to run simultaneously on the physical processor cores.
Partition entitlement is the guaranteed resource available to a partition. A partition that is defined as capped can consume only the processor units that are explicitly assigned as its entitled capacity. An uncapped partition can consume more than its entitlement, but is limited by a number of factors:
Uncapped partitions can exceed their entitlement only if there is unused capacity in the shared pool, in dedicated partitions that share their physical processor cores while active or inactive, in unassigned physical processors, or in Capacity on Demand (CoD) utility processors.
If the partition is assigned to a virtual shared processor pool, the capacity for all of the partitions in the virtual shared processor pool might be limited.
The number of virtual processors in an uncapped partition limits how much CPU it can consume. For example:
 – An uncapped partition with one virtual CPU can consume only one physical processor core of CPU resources under any circumstances.
 – An uncapped partition with four virtual CPUs can consume only four physical processor cores of CPU.
Virtual processors can be added to or removed from a partition by using HMC actions.
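For example, the following HMC command (the managed system name and partition ID are hypothetical) adds one virtual processor to a running shared partition; the -o r operation removes capacity in the same way:
chhwres -r proc -m p750_sys -o a --id 3 --procs 1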
Sizing and configuring virtual processors
The number of virtual processors in each LPAR in the system ought not to exceed the number of cores available in the system (central electronic complex (CEC)/framework), or, if the partition is defined to run in a specific virtual shared processor pool, the number of virtual processors ought not to exceed the maximum that is defined for that pool. Having more virtual processors configured than can be running at a single point in time does not provide any additional performance benefit and can actually cause more context switches of the virtual processors, which reduces performance.
If there are sustained periods during which there is sufficient demand for all the shared processing resources in the system or a virtual shared processor pool, it is prudent to configure the number of virtual processors to match the capacity of the system or virtual shared processor pool.
A single virtual processor can consume a whole physical core under two conditions:
1. SPLPAR has an entitlement of 1.0 or more processors.
2. The partition is uncapped and there is idle capacity in the system.
Therefore, there is no need to configure more than one virtual processor to get one physical core.
For example, consider a shared pool that is configured with 16 physical cores. Four SPLPARs are configured, each with an entitlement of 4.0 cores. To configure virtual processors, consider the sustained peak demand capacity of the workload. If two of the four SPLPARs peak to use 16 cores (the maximum available in the pool), then those two SPLPARs need 16 virtual CPUs each. If the other two SPLPARs peak at only eight cores, they are configured with eight virtual CPUs each.
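As a quick way to check how virtual processors and entitlement are currently configured across the partitions in a pool, an HMC query such as the following can be used (the managed system name is hypothetical):
lshwres -r proc -m p795_sys --level lpar -F lpar_name,curr_proc_mode,curr_procs,curr_proc_units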
Entitlement versus virtual processors
Entitlement is the capacity that an SPLPAR is ensured to get as its share from the shared pool. Uncapped mode allows a partition to receive excess cycles when there are free (unused) cycles in the system.
Entitlement also determines the number of SPLPARs that can be configured for a shared processor pool. The sum of the entitlement of all the SPLPARs cannot exceed the number of physical cores that are configured in a shared pool.
For example, a shared pool has eight cores and 16 SPLPARs are created, each with 0.1 core entitlement and one virtual CPU. We configured the partitions with 0.1 core entitlement because these partitions are not running that frequently. In this example, the sum of the entitlement of all the 16 SPLPARs comes to 1.6 cores. The rest of the 6.4 cores and any unused cycles from the 1.6 entitlement can be dispatched as uncapped cycles.
At the same time, keeping entitlement low when there is capacity in the shared pool is not always a preferred practice. Unless the partitions are frequently idle or there is a plan to add more partitions, the preferred practice is that the sum of the entitlement of all the SPLPARs configured should be close to the capacity in the shared pool. Entitlement cycles are guaranteed, so while a partition is using its entitlement cycles, the partition is not preempted; however, a partition can be preempted when it is dispatched to use excess cycles. Following this preferred practice allows the hypervisor to optimize the affinity of the partition's memory and processor cores and also reduces unnecessary preemptions of the virtual processors.
Matching the entitlement of an LPAR to its average usage for better performance
The aggregate entitlement (minimum or wanted processor) capacity of all LPARs in a system is a factor in the number of LPARs that can be allocated. The minimum entitlement is what is needed to boot the LPARs, but the wanted entitlement is what an LPAR gets if there are enough resources available in the system. The preferred practice for LPAR entitlement is to match the entitlement capacity to average usage and let the peak be addressed by more uncapped capacity.
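Average and peak usage can be measured from within an AIX partition by running lparstat over a representative period; the physc column reports the physical cores that are consumed, and %entc reports consumption as a percentage of entitlement. For example, the following command (the interval and count are arbitrary) takes a sample every 60 seconds for one hour:
lparstat 60 60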
When to add more virtual processors
When there is sustained need for a shared LPAR to use more resources in the system in uncapped mode, increase the virtual processors.
How to estimate the number of virtual processors per uncapped shared LPAR
The first step is to monitor the usage of each partition. For any partition where the average utilization is about 100%, add one virtual processor; that is, use the capacity of the configured virtual processors before you add more. Additional virtual processors run concurrently only if there are enough free processor cores available in the shared pool.
If the peak usage is below the 50% mark, then there is no need for more virtual processors. In this case, look at the ratio of virtual processors to configured entitlement and if the ratio is greater than 1, then consider reducing the ratio. If there are too many virtual processors that are configured, AIX can “fold” those virtual processors so that the workload would run on fewer virtual processors to optimize virtual processor performance.
For example, if an SPLPAR is given a CPU entitlement of 2.0 cores and four virtual processors in an uncapped mode, then the hypervisor can dispatch the virtual processors to four physical cores concurrently if there are free cores available in the system. The SPLPAR uses the unused cores, and the applications can scale up to four cores. However, if the system does not have free cores, then the hypervisor dispatches the four virtual processors on two cores, so the concurrency is limited to two cores. In this situation, each virtual processor is dispatched for a reduced time slice because two cores are shared across four virtual processors. This situation can impact performance, so the AIX operating system processor folding support might reduce the number of virtual processors that are dispatched so that only two or three virtual processors are dispatched across the two physical cores.
Virtual processor management: Processor folding
The AIX operating system monitors the usage of each virtual processor and the aggregate usage of an SPLPAR. If the aggregate usage goes below 49%, AIX starts folding down the virtual CPUs so that fewer virtual CPUs are dispatched. This action has the benefit of virtual CPUs running longer before they are preempted, which helps improve performance. If a virtual CPU gets a shorter dispatch time slice, more workloads are time sliced onto the processor core, which can cause higher cache misses.
If the aggregate usage of an SPLPAR goes above 49%, AIX starts unfolding virtual CPUs so that additional processor capacity can be given to the SPLPAR. Virtual processor management dynamically adapts the number of virtual processors to match the load on an SPLPAR. Starting with AIX V6.1 TL6, this 49% threshold (vpm_fold_threshold) represents SMT thread usage; before that version, vpm_fold_threshold (which was set to 70%) represented core utilization.
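The folding thresholds are restricted schedo tunables. Their current values can be displayed (they should not be changed under normal conditions) with a command such as the following, where the -F flag forces restricted tunables such as vpm_fold_threshold to be included in the listing:
schedo -F -a | grep vpm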
With a vpm_fold_threshold value of 49%, the primary thread of a core is used before another virtual processor is unfolded to consume another core from the shared pool on POWER7 Systems. If free cores are available in the shared processor pool, then unfolding another virtual processor results in the LPAR getting another core along with its associated caches. Now the SPLPAR can run on two primary threads of two cores instead of two threads (primary and secondary) of the same core. A workload that is running on two primary threads of two cores can achieve higher performance if there is less sharing of data than the workload that is running on the primary and secondary threads of the same core. The AIX virtual processor management default policy aims at using the primary thread of each virtual processor first; therefore, it unfolds the next virtual processor without using the SMT threads of the first virtual processor. After it unfolds all the virtual processors and consumes the primary thread of all the virtual processors, it starts using the secondary and tertiary threads of the virtual processors.
If the system is highly used and there are no free cycles in the shared pool, and all the SPLPARs in the system try to get more cores by unfolding more virtual processors while using only the primary thread of each core, the hypervisor time slices the physical cores across multiple virtual processors. This action impacts the performance of all the SPLPARs, as time slicing increases cache misses and context switch costs.
However, an alternative policy of making each virtual processor use all four threads (SMT4 mode) of a physical core can be achieved by changing the values of a number of restricted tunables. Do not make this change under normal conditions, as most systems do not consistently run at high usage. Decide whether such a change is needed based on the workloads and system usage levels. For example, a critical database SPLPAR might need more cores even in a highly contended situation to achieve the best performance, while less critical SPLPARs, such as test SPLPARs, can be sacrificed by running on fewer virtual processors and using all the SMT4 threads of a core.
Processor bindings in a shared LPAR
In AIX V6.1 TL5 and AIX V7.1, virtual processor binding is available to applications that run in a shared LPAR: an application process can be bound to a virtual processor in a shared LPAR. In a shared LPAR, a virtual processor is dispatched by the PowerVM hypervisor, which maintains three levels of affinity for dispatching (core, chip, and node level affinity) in eFW7.3 and later firmware versions. By maintaining affinity at the hypervisor level and in AIX, applications can achieve a higher level of affinity through virtual processor bindings.
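For example, the AIX bindprocessor command attaches a process to a specific logical processor, that is, to a hardware thread of a virtual processor (the process ID and logical CPU number shown here are hypothetical), and bindprocessor -q lists the logical processors that are available for binding:
bindprocessor 123456 4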
3.2.2 Page table sizes for LPARs
The hardware page table of an LPAR is sized based on the maximum memory size of the LPAR, not on the memory that is assigned to (or wanted by) the LPAR. There are some performance considerations if the maximum size is set higher than the wanted memory:
A larger page table tends to help the performance of the workload because the hardware page table can hold more pages, which reduces translation page faults. Therefore, if there is enough memory in the system and you want to reduce translation page faults, set the maximum memory to a higher value than the LPAR wanted memory (an example HMC query follows this list).
On the downside, more memory is used for the hardware page table, which not only wastes memory, but also makes the table sparse. In contrast, a smaller maximum memory setting keeps the page table small and dense, which results in the following situations:
 – A dense page table tends to help with better cache affinity because of reloads.
 – Less memory consumed by the hypervisor for the hardware page table means that more memory is made available to the applications.
 – There is less page walk time because the page tables are small.
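The current and maximum memory settings of the partitions can be compared from the HMC with a query such as the following (the managed system name is hypothetical); the curr_mem and curr_max_mem values are reported in MB, and a maximum value that is much larger than the current value means that the hardware page table is sized for memory the partition does not have:
lshwres -r mem -m p795_sys --level lpar -F lpar_name,curr_mem,curr_max_mem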
3.2.3 Placing LPAR resources to attain higher memory affinity
POWER7 PowerVM optimizes the allocation of resources for both dedicated and shared partitions as each LPAR is activated. Correct planning of the LPAR configuration enhances the possibility of getting both CPU and memory in the same domain in relation to the topology of a system.
The PowerVM hypervisor selects the required processor cores and memory that are configured for an LPAR from the system free resource pool. During this selection process, the hypervisor takes the topology of the system into consideration and allocates processor cores and memory where both resources are close to each other. This ensures that the workload on an LPAR has lower latency when accessing its memory.
When you power on partitions for the first time, power on the partitions of highest importance first. By doing so, those partitions have first access to the unallocated memory and processing resources.
Partition powering on: Even though a partition is dependent on a VIOS, it is safe to power on the partition before the VIOS. The partition does not fully power on because of its dependency on the VIOS, but it claims its memory and processing resources.
What the SPPL option does on a Power 795 system
A new option named Shared Partition Processor Limit (SPPL) is added to give hints to the hypervisor about whether to contain partitions within a minimum number of domains or to spread partitions across multiple domains. On the Power 795, a book can host four chips that total up to 32 cores. If SPPL is set to 32, then the maximum size of an LPAR that can be supported is 32 cores. This hint enables the hypervisor to allocate both the physical cores and the memory of an LPAR within a single domain as much as possible. For example, in a three-book configuration where the wanted configuration is four LPARs, each with 24 cores, three of those LPARs are each contained within one of the three books, and the fourth LPAR is spread across the three books.
If SPPL is set to MAX, then a partition size can exceed 32 cores. This hint helps the hypervisor maximize the interconnect bandwidth allocation by spreading larger LPARs across more domains.
 
SPPL value: The SPPL value can be set only on Power 795 systems that contain four or more processor books. If there are three or fewer processor books, the SPPL setting is controlled by the system and is set to 32 or 24, based on the number of processors per book.
On a Power 795 system where the SPPL value is set to MAX, there is a way to configure individual partitions so that they are still packed into a minimum number of books. This setup is achieved by using the HMC command-line interface (CLI) through the lpar_placement profile attribute on the chsyscfg command. Specifying lpar_placement=1 indicates that the hypervisor attempts to minimize the number of domains that are assigned to the LPAR. The default setting, lpar_placement=0, follows the existing placement rules when SPPL is set to MAX.
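For example, based on the profile attribute usage that is described above, a command along the following lines (the system, profile, and partition names are hypothetical) requests minimal domain spread for a single partition:
chsyscfg -r prof -m p795_sys -i "name=normal,lpar_name=db_prod,lpar_placement=1"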
How to determine if an LPAR is contained within a domain
From an AIX LPAR, run lssrad to display the number of domains across which an LPAR is spread.
The lssrad syntax is:
lssrad -av
If all the cores and memory are in a single domain, you should receive the following output with only one entry under REF1:
REF1 SRAD MEM CPU
0 0 31806.31 0-31
1 31553.75 32-63
REF1 represents a domain, and domains vary by platform. SRAD always references a chip. However, lssrad does not report the actual physical domain or chip location of the partition; it reports relative values whose purpose is to indicate whether the resources of the partition are within the same domain or chip. The output of this lssrad example indicates that the LPAR is allocated 16 cores from two chips within the same domain. Note that this lssrad command output was taken from an SMT4 platform, and thus CPUs 0-31 actually represent eight cores.
When all the resources are free (an initial machine state or after a reboot of the CEC), PowerVM allocates memory and cores as optimally as possible. At partition boot time, PowerVM is aware of all of the LPAR configurations, so the placement of processors and memory is made regardless of the order of activation of the LPARs.
However, after the initial configuration, the setup might not stay static. Numerous operations take place, such as:
Reconfiguration of existing LPARs with new profiles
Reactivating existing LPARs and replacing them with new LPARs
Adding and removing resources to LPARs dynamically (DLPAR operations)
Any of these changes might result in memory fragmentation, causing LPARs to be spread across multiple domains. There are ways to minimize or even eliminate the spread. For the first two operations, the spread can be minimized by releasing the resources that are currently assigned to the deactivated LPARs.
Resources of an LPAR can be released by running the following commands:
chhwres -r mem -m <system_name> -o r -q <num_of_Mbytes> --id <lp_id>
chhwres -r proc -m <system_name> -o r --procunits <number> --id <lp_id>
The first command frees the memory and the second command frees cores.
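For example, to release 8192 MB of memory and 2.0 processing units from a deactivated partition with ID 5 on a hypothetical managed system named p795_sys, run:
chhwres -r mem -m p795_sys -o r -q 8192 --id 5
chhwres -r proc -m p795_sys -o r --procunits 2.0 --id 5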
When all of the partitions are inactive, there is another way to clear the resources of all of the existing configurations before you create a configuration. In this situation, all the resources of all the partitions can be cleared from the HMC by completing the following steps:
1. Shut down all partitions.
2. Create the all-resources partition.
3. Activate the all-resources partition.
4. Shut down the all-resources partition.
5. Delete the all-resources partition.
Fragmentation because of frequent movement of memory or processor cores between partitions is avoidable with correct planning. DLPAR actions can be done in a controlled way so that the performance impact of resource addition or deletion is minimal. Planning for growth helps alleviate the fragmentation that is caused by DLPAR operations. Knowing which LPARs must grow or shrink dynamically, and placing them with LPARs that can tolerate nodal crossing latency (less critical LPARs), is one approach to handling the changes of critical LPARs dynamically. In such a configuration, when growth is needed for the critical LPAR, the resources that are assigned to the non-critical LPAR can be reduced so that the critical LPAR can grow.
Affinity groups (introduced in the 730 firmware level)
PowerVM firmware 730 and later supports affinity groups, which can be used to group multiple LPARs so that their resources are placed (allocated) within a single domain or a few domains. On the Power 795, with up to 32 cores in a book, the total physical core resources of an affinity group should not exceed the cores in a book (24 or 32, depending on the configuration) or the physical memory that is contained within a book.
This affinity group feature can be used in multiple situations:
LPARs that are dependent or related, such as server and client, and application server and database server, can be grouped so they are in the same book.
Affinity groups can be created that are large enough to force the assignment of LPARs to different books. For example, on a two-book system, if the total resources (memory and processor cores) assigned to two groups exceed the capacity of a single book, the two groups are forced into separate books. A simple example is a group of partitions that totals 14 cores and a second group that totals 20 cores. Because together these groups exceed the 32 cores in a 795 book, the groups are placed in different books.
If a pair of LPARs is created with the intent of one being a failover for the other, and one partition fails, the other partition (which is placed in the same node if both are in the same affinity group) can use all of the resources that were freed up by the failed LPAR.
The following HMC CLI command adds or removes a partition from an affinity group:
chsyscfg -r prof -m <system_name> -i name=<profile_name>,lpar_name=<partition_name>,affinity_group_id=<group_id>
The group_id is a number from 1 to 255 (255 groups can be defined); affinity_group_id=none removes a partition from a group.
When the hypervisor places resources at frame reboot, it first places all the LPARs in group 255, then the LPARs in group 254, and so on. Place the partitions for which affinity matters most in the highest numbered groups.
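For example, the following command (the system, profile, and partition names are hypothetical) places a partition in the highest-priority affinity group, 255:
chsyscfg -r prof -m p795_sys -i "name=normal,lpar_name=db_prod,affinity_group_id=255"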
PowerVM resource consumption for capacity planning considerations
The PowerVM hypervisor consumes a portion of the memory resources in the system; during your planning stage, consider the layout of LPARs. The amount of memory that is consumed depends on factors such as the size of the hardware page tables in the partitions, Host Ethernet Adapter (HEA) resources, Host Channel Adapter (HCA) resources, the number of I/O devices, and hypervisor memory mirroring. Use the IBM System Planning Tool (available at http://www.ibm.com/systems/support/tools/systemplanningtool/) to estimate the amount of memory that is reserved by the hypervisor.
Licensing resources and Capacity Upgrade on Demand (CUoD)
Power Systems support capacity on demand so that customers can license capacity on demand as their business needs for compute capacity grow. Therefore, a Power System might not have all of its resources licensed, which poses a challenge for allocating both cores and memory from a local domain. PowerVM (eFW 7.3 level firmware) correlates customer configurations and licensed resources to allocate cores and memory from the local domain to each of the LPARs. During a CEC reboot, the hypervisor places all defined partitions as optimally as possible and then unlicenses the unused resources.
For more information about this topic, see 3.3, “Related publications” on page 65.
3.2.4 Active memory expansion
Active memory expansion (AME) is a capability that is supported on POWER7 and later servers that employs memory compression technology to expand the effective memory capacity of an LPAR. The operating system identifies the least frequently used memory pages and compresses them. The result is that more memory capacity is available within the LPAR to sustain more load, or memory can be removed from the LPAR and used to deploy more LPARs. The POWER7+ processor provides enhanced support of AME with the inclusion of on-chip accelerators onto which the work of compression and decompression is offloaded.
AME is deployed by first using the amepat tool to model the projected expansion factor and CPU usage of a workload. This modeling looks at the compressibility of the data, the memory reference patterns, and current CPU usage of the workload. AME can then be enabled for the LPAR by setting the expansion factor. The operating system then reports the physical memory available to applications as actual memory times the expansion factor. Then, transparently, the operating system locates and compresses cold pages to maintain the appearance of expanded memory.
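For example, running the following command inside the LPAR (the 60-minute duration is arbitrary) monitors the workload and produces a report of candidate expansion factors together with their estimated CPU cost:
amepat 60
The expansion factor that amepat recommends can then be set in the partition profile so that the operating system reports the expanded memory size after the partition is reactivated.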
Applications do not need to change, and they are not aware that AME is active. However, not all applications or workloads have suitable characteristics for AME. Here is a partial list of guidelines for the workload characteristics that can be a good fit for AME:
The memory footprint is dominated by application working storage (such as heap, stack, and shared memory).
Workload data is compressible.
Memory access patterns are concentrated to a subset of the overall memory footprint.
Workload performance is acceptable without the use of larger page sizes, such as 64 KB pages. AME disables the usage of large pages and uses only 4 KB pages.
The average CPU usage of the workload is below 60%.
Users of the application and workload are relatively insensitive to response time increases.
For more information about AME usage, see Active Memory Expansion: Overview and Usage Guide, available at:
3.2.5 Optimizing resource placement: Dynamic Platform Optimizer
In firmware level 760 and later, on select Power Systems servers, a feature is available that is called the Dynamic Platform Optimizer. This optimizer automates the manual steps that are described in this section to improve resource placement. For more information, visit the following website and select the Doc-type Word document P7 Virtualization Best Practice.
 
Note: The P7 Virtualization Best Practice document addresses POWER7 processor technology based PowerVM preferred practices for attaining the best LPAR performance, and it should be used in conjunction with other PowerVM documents.
3.3 Related publications
The publications that are listed in this section are considered suitable for a more detailed discussion of the topics that are covered in this chapter:
Active Memory Expansion: Overview and Usage Guide, found at:
IBM PowerVM Active Memory Sharing Performance, found at:
IBM PowerVM Virtualization Introduction and Configuration, SG24-7940
IBM PowerVM Virtualization Managing and Monitoring, SG24-7590
POWER7 Virtualization Best Practice Guide, found at:
PowerVM Migration from Physical to Virtual Storage, SG24-7825
Virtual I/O (VIO) and Virtualization, found at:
Virtualization Best Practice, found at:
 