Best practices and guidelines
This chapter describes the preferred practices and guidelines for using IBM Spectrum Virtualize Data Reduction Pool (DRP) technology. The suggestions are based on the newly revised architecture of the product.
This chapter provides use cases and preferred practices for the DRP technology and its performance and efficiency gains.
This chapter is intended for experienced IBM SAN Volume Controller (SVC) and IBM Storwize administrators and consultants, and it assumes advanced knowledge of the SVC and Storwize products.
 
Recommended software level: The minimum recommended software level is V8.1.3.2, which is available from IBM Fix Central and contains a number of critical fixes.
5.1 RACE and DRP overview
Flash and solid-state drive (SSD) technology is constantly improving in performance, and it provides low latency for application workloads.
Flash technology is cheaper than it used to be, and it also reduces total cost of ownership (TCO) because it requires less cooling and rack space. However, it is still more expensive than traditional spinning disks. For this reason, storage administrators optimize the amount of data stored on Flash storage to drive the TCO even lower.
IBM Data Reduction Pool technology was developed to optimize Flash workloads, providing cost savings by storing less data while at the same time delivering stable and predictable performance.
5.1.1 DRP performance benefits
The first RACE implementation on SVC and Storwize was software-only. Later, dedicated hardware compression was introduced. RACE was created at a time when spinning disks made up the majority of storage infrastructures. Today, SSD/Flash technology is more widespread, and data storage and capacity optimization are among the key goals in the industry.
The original RACE implementation used a software-only approach on the Storwize platform. As the product matured, hardware accelerators and Flash/SSD back-end technology were used to increase performance.
Data Reduction Pool technology is designed to leverage hardware accelerators and Flash technology. The DRP compression architecture does not dedicate any hardware to the compression software; significantly improved load balancing and Flash-optimized I/O paths provide optimal performance for compressed workloads.
5.1.2 DRP performance improvements
These are some of the major DRP improvements:
A multithreaded approach that improves latency and throughput
All memory and cores available to all functions, with no dedicated cores or memory
Up to 3x throughput for systems with hardware accelerators
A new compression and deduplication implementation that leverages new hardware
 
Note: Storwize V5030 does not have a dedicated compression accelerator card, so it uses software compression only.
Data Reduction Pool I/O handling
Data Reduction Pool technology uses a fixed block size (also known as grain size) of 8 KB. This block size was chosen specifically to optimize the workload for Flash storage where small block I/O leverages low latency flash performance. Host I/O is acknowledged by the upper cache layer and then processed in 8 KB chunks.
Figure 5-1 shows DRP compression.
Figure 5-1 DRP compression
DRP technology introduces a compression implementation that provides stable performance irrespective of I/O size and workload pattern, unlike the RACE implementation, where certain workload types might have reduced performance. Compressed workloads deliver predictable performance no matter what the workload type is. This consistent performance encourages users to compress workloads on Flash, and it provides efficiency, cost savings, and better performance.
DRP technology is ideal for compression on Flash storage. Host I/O is “sliced” into equal 8 KB chunks, and with a compression ratio of 2:1 or higher, each 8 KB chunk is compressed to 4 KB or less, which is the optimal block size for leveraging Flash performance.
The new compression implementation can produce up to 4x throughput for compressed workloads, with consistent performance.
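To illustrate the concept, the following Python sketch slices a host write into 8 KB grains and compresses each grain independently. It uses zlib purely as a stand-in for the DRP compression algorithm (which this example does not reproduce), and the sample data is hypothetical. Grains that compress to 4 KB or less correspond to the 2:1 or better ratio described above.
import zlib

GRAIN_SIZE = 8 * 1024    # DRP fixed grain (block) size: 8 KB
TARGET_SIZE = 4 * 1024   # 2:1 ratio target: 4 KB or less per grain

def slice_into_grains(host_write: bytes):
    """Split a host write into fixed 8 KB grains (the last grain may be partial)."""
    return [host_write[i:i + GRAIN_SIZE] for i in range(0, len(host_write), GRAIN_SIZE)]

def compress_grains(host_write: bytes):
    """Compress each grain independently and report whether it meets the 2:1 target."""
    results = []
    for grain in slice_into_grains(host_write):
        compressed = zlib.compress(grain)   # stand-in for the real DRP algorithm
        results.append((len(grain), len(compressed), len(compressed) <= TARGET_SIZE))
    return results

# Hypothetical usage: a roughly 64 KB host write of highly compressible data
sample = b"database page header " * 3200
for original, compressed, meets_target in compress_grains(sample):
    print(f"grain {original} B -> {compressed} B, 2:1 target met: {meets_target}")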
DRP deduplication with compression provides the best storage efficiency. This combination deduplicates and then compresses the data, reducing storage capacity usage. It is advised to use this option when the workload is compressible and has a high duplicate ratio, as identified by the DRET tool. For more information, see 4.2, “Evaluating workload using Data Reduction Estimator Tool” on page 41.
DRP Deduplication I/O handling
Host I/O is processed in 8 KB chunks, and then an in-memory calculation is performed for deduplication. Host write I/O for a deduplicated volume is acknowledged back to the host by the upper cache layer. Deduplication is performed after the write is acknowledged to the host, so host performance is not impacted by deduplication.
The host read may be a cache hit or a read from the back end, depending upon the workload type.
Figure 5-2 shows DRP deduplication.
Figure 5-2 DRP deduplication
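The following Python sketch illustrates the general idea of fingerprint-based deduplication on 8 KB grains: a hash of each grain is looked up in an in-memory table, and duplicates are stored as references rather than as new data. The hash choice, data structures, and class name are illustrative assumptions, not the internal DRP implementation.
import hashlib

GRAIN_SIZE = 8 * 1024   # DRP fixed grain size: 8 KB

class DedupStore:
    """Toy deduplicating store: one copy per unique grain, duplicates become references."""

    def __init__(self):
        self.fingerprints = {}   # fingerprint -> location of the stored grain
        self.stored = []         # unique grains actually written to the back end

    def write(self, grain: bytes) -> int:
        """Store a grain, returning the back-end location it references."""
        fingerprint = hashlib.sha256(grain).hexdigest()   # illustrative hash choice
        if fingerprint in self.fingerprints:
            return self.fingerprints[fingerprint]         # duplicate: reference existing grain
        self.stored.append(grain)                         # new data: write it to the back end
        self.fingerprints[fingerprint] = len(self.stored) - 1
        return self.fingerprints[fingerprint]

# Hypothetical usage: three grains, two of which are identical
store = DedupStore()
grain_a = b"A" * GRAIN_SIZE
grain_b = b"B" * GRAIN_SIZE
for grain in (grain_a, grain_b, grain_a):
    print("stored at back-end location", store.write(grain))
print("unique grains written to the back end:", len(store.stored))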
DRP volume types
As a reminder, DRP technology enables you to create five types of volumes:
Fully allocated
This type provides no storage efficiency but the best performance, and is available for migration.
Thin
This type provides storage efficiency but no compression or deduplication.
Thin and Compressed
This type provides storage efficiency with compression; of the space-efficient volume types, this combination provides the best performance.
Thin and Deduplication
This type provides storage efficiency but without compression.
Thin, Compressed, and Deduplication
This type provides storage efficiency with maximum capacity savings.
Of the space-efficient volume types, DRP thin and compressed volumes provide the best performance numbers. This is due to the new compression implementation, which provides better load balancing and more consistent performance than the RACE implementation. Thin and compressed volumes are the second best performer after fully allocated volumes, followed by thin, compressed, and deduplicated volumes, which provide the greatest storage efficiency.
Figure 5-3 shows the types of volumes in the DRP pools.
Figure 5-3 Volume types
We will discuss these volume combinations from a performance perspective.
Users should select the volume type that meets their business objectives and leverages good performance from DRP technology. Each of these volume combinations provides certain benefits for storage efficiency and has different performance characteristics.
DRP technology allows the user to create a fully allocated volume. Fully allocated volumes provide the best performance but no storage efficiency. This option is intended for migrations, and also for applications that require maximum performance with the lowest possible latency.
The current suggestion is to use DRP compressed volumes. There are two exceptions to this:
If the workload has higher than a 30% deduplication ratio, use DRP deduplicated volumes instead.
If the hardware platform is Storwize V5030, size for compressibility and performance. For more details, see Chapter 4, “Estimator and sizing tools” on page 37.
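This guidance can be summarized as a simple decision rule. The following Python sketch encodes it; the 30% threshold and the Storwize V5030 exception come from this chapter, combining deduplication with compression follows 5.2.1, and the function name and inputs are illustrative assumptions.
def suggest_volume_type(dedup_saving_pct: float, platform: str = "SVC") -> str:
    """Suggest a DRP volume type based on the guidance in this chapter.

    dedup_saving_pct: deduplication saving in percent, as reported by the DRET tool
    platform:         hardware platform name, for example "SVC" or "Storwize V5030"
    """
    if platform == "Storwize V5030":
        # Software-only compression: size for compressibility and performance first
        return "Size for compressibility and performance (see Chapter 4)"
    if dedup_saving_pct > 30.0:
        # High duplicate ratio: combine deduplication with compression
        return "Thin, compressed, and deduplicated volume"
    return "Thin and compressed volume"

# Hypothetical usage
print(suggest_volume_type(dedup_saving_pct=10.0))   # thin and compressed
print(suggest_volume_type(dedup_saving_pct=45.0))   # thin, compressed, and deduplicated
print(suggest_volume_type(dedup_saving_pct=45.0, platform="Storwize V5030"))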
Configuring your system for best DRP performance
SVC/Storwize storage pools are designed to separate workloads or to mitigate against hardware failures by using failure domains. Standard pools achieve similar performance whether a single pool or multiple pools are used. However, DRP performance gets a boost from increasing the number of pools in the system. The storage infrastructure should be designed so that the workload is separated by workload type into different failure domains to achieve maximum DRP performance.
SVC/Storwize systems are multi-core machines that load balance the workload across all available resources to provide better performance. With DRP technology, the best performance numbers can be achieved using four Data Reduction Pools, which enables the optimal use of node canister hardware resources, such as CPU cores.
With standard pools (non-DRP), the storage administrator has to cater for the performance of each volume at an application level. With DRP technology, storage administration is simplified because the storage administrator only monitors the storage pool for capacity and performance, and the volumes no longer require monitoring for capacity. More DRP pools enable the system to load balance resources more evenly, which provides good performance.
Size your workload and capacity using the tools described in Chapter 4, “Estimator and sizing tools” on page 37. An application workload with a 2:1 or higher compression ratio leverages the performance benefits of DRP compression, and using up to four pools enhances performance further and provides the best performance numbers.
5.1.3 Garbage collection
DRP technology is built to leverage SSD/Flash technology. DRP technology is based on log structured array (LSA), where nothing is ever overwritten. Storage is allocated on demand and if a host or application rewrites a region of a disk, the new write is stored in a new location on the back end. DRP technology invalidates the old data on the back-end disk to be garbage collected. For more information, see Chapter 1, “Architecture” on page 1.
Garbage collection keeps track of extent usage in pools, and extents with the most unused space are garbage collected. Under normal circumstances, garbage collection works in a trickle mode, clearing up space and optimizing the back-end storage allocation. When the pool is over-provisioned and close to full (85 - 100% of physical capacity), garbage collection has to work harder, and it chooses the least-used storage pool extents to clear space.
The entire environment should be sized for performance and capacity before implementing DRP technology. DRP compression enables the user to compress data, which lowers storage costs. It can also lead to over-provisioning and, in some cases, potentially an out-of-space condition. If the system runs out of space on the back-end disks, the DRP pool goes offline.
An over-provisioned DRP storage pool can create a performance issue when the physical space in the pool is 85 - 100% used. In this scenario, host I/O and garbage collection can lead to disk I/O contention, which could potentially overload the back end or lead to running out of space completely.
Another use case where garbage collection may have to work harder is when a user deletes a large amount of data from the host, or one or more volumes are deleted. If this happens, the reclaimable capacity in the DRP pool goes up, and garbage collection needs to run and reclaim space. If the storage system is not back-end disk limited, garbage collection should not impact the performance of the system.
Users can check and monitor the amount of data that needs garbage collection. Monitor a Data Reduction Pool using the lsmdiskgrp CLI command or the GUI, and look for the new DRP attribute called reclaimable_capacity.
When using the CLI, issue the following command:
svcinfo lsmdiskgrp <pool_id>
In this case, <pool_id> is the numeric ID of the Data Reduction Pool. The amount of data that needs garbage collection is shown in the reclaimable_capacity field of the output.
Example 5-1 shows the output from the svcinfo lsmdiskgrp 0 command, with the reclaimable capacity in bold.
Example 5-1 svcinfo lsmdiskgrp 0
svcinfo lsmdiskgrp 0
id 0
name Group0
status online
mdisk_count 16
vdisk_count 64
capacity 208.00TB
extent_size 1024
free_capacity 203.81TB
virtual_capacity 5.00TB
used_capacity 3.91TB
real_capacity 3.91TB
overallocation 2
warning 0
easy_tier auto
easy_tier_status balanced
tier tier0_flash
tier_mdisk_count 16
tier_capacity 208.00TB
tier_free_capacity 204.06TB
tier tier1_flash
tier_mdisk_count 0
tier_capacity 0.00MB
tier_free_capacity 0.00MB
tier tier_enterprise
tier_mdisk_count 0
tier_capacity 0.00MB
tier_free_capacity 0.00MB
tier tier_nearline
tier_mdisk_count 0
tier_capacity 0.00MB
tier_free_capacity 0.00MB
compression_active no
compression_virtual_capacity 0.00MB
compression_compressed_capacity 0.00MB
compression_uncompressed_capacity 0.00MB
site_id
site_name
parent_mdisk_grp_id 0
parent_mdisk_grp_name Group0
child_mdisk_grp_count 0
child_mdisk_grp_capacity 0.00MB
type parent
encrypt no
owner_type none
owner_id
owner_name
data_reduction yes
used_capacity_before_reduction 5.00TB
used_capacity_after_reduction 1.83TB
overhead_capacity 2.08TB
deduplication_capacity_saving 0.00MB
reclaimable_capacity 1.04GB
physical_capacity 93.13TB
physical_free_capacity 85.33TB
shared_resources no
This can also be achieved using the GUI by selecting Pool View → Select DRP Pool → Action → Properties → View More Details. The reclaimable capacity is shown in Figure 5-4.
Figure 5-4 Reclaimable capacity
The reclaimable capacity is the amount of data that will be garbage collected. As long as there is more than 15% of the total capacity free, the user should not experience any performance degradation.
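As a monitoring aid, the following Python sketch parses the key/value output of svcinfo lsmdiskgrp (as shown in Example 5-1) and flags a pool whose physical capacity usage has reached the 85 - 100% range, or whose free physical capacity has dropped below the 15% guideline. The field names and capacity units reflect the example output above; how the command output is retrieved (for example, over SSH) is left to the reader, and this script is an illustration rather than a supported tool.
UNITS = {"MB": 1.0 / (1024 * 1024), "GB": 1.0 / 1024, "TB": 1.0}   # normalize to TB

def to_tb(value: str) -> float:
    """Convert a capacity string such as '93.13TB' or '1.04GB' to TB."""
    for unit, factor in UNITS.items():
        if value.endswith(unit):
            return float(value[:-len(unit)]) * factor
    return float(value)

def parse_lsmdiskgrp(text: str) -> dict:
    """Parse the 'key value' lines of svcinfo lsmdiskgrp output into a dictionary."""
    fields = {}
    for line in text.splitlines():
        parts = line.strip().split(None, 1)
        if len(parts) == 2:
            fields.setdefault(parts[0], parts[1])   # keep the first (pool-level) value
    return fields

def check_pool(fields: dict) -> None:
    """Apply the 85% used / 15% free guidelines from this chapter."""
    physical = to_tb(fields["physical_capacity"])
    free = to_tb(fields["physical_free_capacity"])
    reclaimable = to_tb(fields["reclaimable_capacity"])
    used_pct = 100.0 * (physical - free) / physical
    print(f"physical used: {used_pct:.1f}%, reclaimable capacity: {reclaimable:.2f} TB")
    if used_pct >= 85.0:
        print("WARNING: pool is 85 - 100% full; garbage collection must work harder")
    if free / physical < 0.15:
        print("WARNING: less than 15% capacity free; performance degradation is possible")

# Hypothetical usage, where output holds the captured text of 'svcinfo lsmdiskgrp 0':
# check_pool(parse_lsmdiskgrp(output))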
5.1.4 Inter-node ports
IBM Spectrum Virtualize SVC/Storwize hardware platforms consist of at least two and up to a maximum of eight nodes in a clustered system. The performance of the system relies on timely communication between the nodes in the cluster.
The Storwize family uses a dedicated interconnect to communicate with the partner node canister in the same enclosure. In a single I/O group configuration, the Storwize family code uses only the dedicated link unless the link (hardware) is broken, and FC ports are not used for inter-node connectivity. For a cluster with two or more control enclosures, inter-node communication between control enclosures is done over FC ports, as it is with the SVC.
SVC nodes use FC ports for connectivity. Prior to the DRP/Deduplication release, the advice was to use at least two FC ports for inter-node communication.
The two FC port inter-node suggestion remains the same for DRP users unless the host or application using DRP has a high write throughput (above 2 GBps of host write workload) or a high write IOPS workload, in which case we suggest dedicating four FC ports to inter-node traffic. Specifically, for deduplication users with a high write workload (above 1 GBps of host workload), we also suggest using four FC ports dedicated to inter-node traffic.
For more information about the original two FC port recommendation, see the following publication:
IBM System Storage SAN Volume Controller and Storwize V7000 Best Practices and Performance Guidelines, SG24-7521
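The port-count guidance above can be expressed as a small rule of thumb. The following Python sketch is illustrative only: the 2 GBps and 1 GBps thresholds and the two- and four-port values come from this section, while the function name and inputs are assumptions.
def suggested_internode_fc_ports(write_gbps: float, uses_deduplication: bool) -> int:
    """Suggest how many FC ports to dedicate to inter-node traffic.

    write_gbps:          sustained host write throughput in GBps
    uses_deduplication:  True if DRP deduplicated volumes are in use
    """
    if uses_deduplication and write_gbps > 1.0:
        return 4    # deduplication with a high write workload
    if write_gbps > 2.0:
        return 4    # high DRP write throughput (a high write IOPS workload also qualifies)
    return 2        # default guidance

# Hypothetical usage
print(suggested_internode_fc_ports(write_gbps=1.5, uses_deduplication=True))    # 4
print(suggested_internode_fc_ports(write_gbps=0.8, uses_deduplication=False))   # 2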
5.1.5 SVC with Storwize family back-end controllers
Because both systems will be running the same code, there is no performance or functionality gain available by using DRP compression or deduplication on both systems together. Therefore, it is suggested to use DRP technology at the SVC layer.
5.1.6 SVC with external back-end controllers
If external back-end controllers are used and virtualized behind SVC, the advice again is to use DRP compression at the front-end SVC layer. Careful planning is required when the back-end controller can also compress or deduplicate, and you should plan for a 1:1 compression and deduplication ratio on the back end to avoid running out of physical capacity.
5.1.7 Storwize family with external back-end controllers
If Storwize is used to virtualize any other back-end controller behind it, use DRP at the Storwize layer. This combination provides storage efficiency and good, stable performance.
5.2 When to use DRP Compression
DRP pools use a new compression algorithm implementation. The suggestion for optimal performance is to use compression on all platforms and on all workloads (except Storwize V5030, as noted previously).
There is no performance penalty for writing non-compressible data, and for application workloads with a 2:1 or higher compression ratio there is significant capacity saving with no performance overhead. This eliminates a lot of planning work, and simplifies capacity savings for any storage solution.
Storwize V5030 does not have hardware accelerator cards, and it uses only software compression on its six-core hardware platform. To use compression on Storwize V5030, follow the same RACE guidance for compressibility and node CPU usage. See the following publications for guidelines on using compression on Storwize V5030:
IBM Real-time Compression in IBM SAN Volume Controller and IBM Storwize V7000, REDP-4859
Implementing the IBM Storwize V5000 Gen2 (including the Storwize V5010, V5020, and V5030) with IBM Spectrum Virtualize V8.1, SG24-8162
5.2.1 When to use DRP deduplication
For application workloads with more than a 30% deduplication ratio, consider using the DRP deduplication feature. A workload with a 2:1 or higher deduplication ratio benefits from capacity savings. In addition, the deduplication feature should be used with compression, which provides the best capacity savings.
To identify the deduplication ratio, use the DRET tool. For more details, see 4.2, “Evaluating workload using Data Reduction Estimator Tool” on page 41.
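As a rough illustration of combining the two features, the following Python sketch estimates the physical capacity that remains after deduplication and then compression (the order in which DRP applies them, as described earlier in this chapter). The input ratios are assumed to come from the DRET tool, and the formula is a simplification that ignores metadata overhead.
def estimate_physical_capacity_tb(written_tb: float,
                                  dedup_saving_pct: float,
                                  compression_ratio: float) -> float:
    """Estimate physical capacity after deduplication and then compression.

    written_tb:        capacity written by the hosts, in TB
    dedup_saving_pct:  deduplication saving in percent (from the DRET tool)
    compression_ratio: for example 2.0 for a 2:1 compression ratio
    """
    after_dedup = written_tb * (1.0 - dedup_saving_pct / 100.0)
    return after_dedup / compression_ratio

# Hypothetical example: 100 TB written, 30% deduplication saving, 2:1 compression
print(estimate_physical_capacity_tb(100.0, 30.0, 2.0))   # 35.0 TB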
5.3 Performance monitoring
IBM advises performance monitoring of the system, which enables the system administrator to monitor the production workload and the performance of the storage solution. IBM provides IBM Spectrum Control, which is an enterprise-class performance monitoring solution for SVC, Storwize, and other storage products.
There is an alternative IBM Cloud™ offering known as IBM Storage Insights, which can also be used for performance monitoring of SVC and Storwize solutions. For more details, contact your IBM sales representative or Business Partner.
5.3.1 Total port data rate by port
Users should monitor the port data rates using the following guidelines:
Although FC ports are full duplex, setting a warning level near the single-direction line speed is advised to prevent ports from becoming saturated. Specifically, we suggest a warning level of 1200 MBps for 8 Gb ports, and 2100 MBps for 16 Gb ports.
Exceeding this limit requires further review. Workload can be balanced or moved to other available resources (ports, nodes, or I/O Groups), or additional resources might be needed to accommodate additional growth.
5.3.2 Total SVC node data rate
The suggested monitoring limit for the front-end data rate is 1800 MBps per node. This number is based on 75% utilization of 8 Gb ports shared for host and back-end traffic.
CPU utilization percentage: maximum per SVC node
Users should monitor CPU usage using the following guidelines:
While a weighted average CPU utilization for the cluster (subsystem) is available, it’s best to monitor utilization per node.
Exceeding a node average of 70% might impact I/O performance for some workloads.
CPU utilization per core should also be monitored to ensure that no single core exceeds 90%.
Port to local node send response time: maximum per node
Users should monitor port to local node send response time using the following guidelines:
The port to local node response time is critical to ensuring stable performance of the cluster. Values greater than 0.5 ms impact the overall performance of the cluster to varying degrees.
Poor port to local node send response time is generally only an issue when the ports utilized for local node traffic are shared with other traffic (host/storage or replication). But even with dedicated ports, elevated port to local node send response time can still be an issue for other conditions, such as port saturation or degraded link quality due to hardware conditions.
Buffer credit depletion: maximum per SVC port
Users should monitor buffer credit depletion using the following guidelines:
Buffer credit depletion exceeding 20% that is sustained for longer than 20 minutes can negatively impact performance for all workloads that are sharing that port.
Short duration spikes in credit depletion that occur because of spikes in workload are normal FC flow control events, and are not indicative of a problem.
16 Gbps adapters do not support buffer credit counters. A replacement metric called Port Send Delay ms/op is available in SVC V7.8.1 and later, and supported in IBM Spectrum Control V5.2.14 and later.
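The monitoring guidelines in 5.3.1, 5.3.2, and the preceding subsections can be collected into a simple set of threshold checks. The following Python sketch encodes the values stated above; the metric names, and how the measurements are obtained (for example, from IBM Spectrum Control), are assumptions for illustration.
# Threshold values taken from the monitoring guidelines in this chapter
THRESHOLDS = {
    "port_data_rate_8gb_mbps": 1200,        # warning level per 8 Gb port
    "port_data_rate_16gb_mbps": 2100,       # warning level per 16 Gb port
    "node_front_end_data_rate_mbps": 1800,  # per SVC node
    "node_cpu_average_pct": 70,             # per-node average CPU utilization
    "core_cpu_pct": 90,                     # per-core CPU utilization
    "port_to_local_node_send_ms": 0.5,      # per-node maximum
    "buffer_credit_depletion_pct": 20,      # sustained for longer than 20 minutes
}

def check_metrics(measurements: dict) -> list:
    """Return a warning for each measurement that exceeds its guideline."""
    warnings = []
    for metric, limit in THRESHOLDS.items():
        value = measurements.get(metric)
        if value is not None and value > limit:
            warnings.append(f"{metric}: {value} exceeds the guideline of {limit}")
    return warnings

# Hypothetical usage with made-up measurements
sample = {"node_cpu_average_pct": 78, "port_to_local_node_send_ms": 0.3}
for warning in check_metrics(sample):
    print(warning)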
Volume and back-end response times
Each environment can have a unique workload pattern and require a certain level of performance from that workload. For example, a video streaming storage solution might only require a 10 ms response time from volumes, so 5 - 10 ms of latency is not an issue. Another environment that hosts financial software, for example an OLTP database, might require <1 ms response times. In these sorts of environments, we recommend that clients engage with IBM Professional Services.