SCSI reservations
This appendix describes SCSI reservations, and how they can be used to provide faster disk fallover times when the underlying storage supports this feature. For example, SCSI 3 Persistent Reservation allows the stripe group manager (also known as the file system manager) to "fence" disks during node fallover by removing the reservation keys for that node. In contrast, non-PR disk fallover causes the system to wait until the disk lease expires.
 
Attention: Do not run these commands on your systems. They are shown in this section only to illustrate how disk reservations work, especially in a clustered environment, which demands more care when managing disk reservations.
This appendix discusses SCSI reservations, and contains the following:
SCSI reservations
SCSI 2 reservations give us a mechanism to reserve and control access to a SCSI device from a node. An initiator obtains ownership of the device by using the reserve system call, which works as a lock against any I/O attempt from other initiators. Another initiator that tries to access the reserved disk gets a reservation conflict error code. Only the original initiator can release this reservation by issuing a release or reset system call.
SCSI 3 Persistent Reservations provide a mechanism to control access to a shared device from multiple nodes. The reservation persists even if the bus is reset for error recovery. This is not the case with SCSI 2 reservations, which do not survive node reboots. Also, SCSI 3 PR supports multiple paths to a host, whereas SCSI 2 works only with one path from a host to a disk. The scope of a persistent reservation is the entire logical unit.
SCSI 3 Persistent Reservations use the concept of register and reserve. Multiple nodes can register their reservation keys (also known as PR_Key) with the shared device and establish a reservation in any of the modes shown in Table A-1.
Table A-1 Types of SCSI reservations
Type                                  Code
Write Exclusive                       1h
Exclusive Access                      3h
Write Exclusive - Registrants Only    5h
Exclusive Access - Registrants Only   6h
Write Exclusive - All Registrants     7h
Exclusive Access - All Registrants    8h
In All Registrants type of reservations (WEAR and EAAR), each registered node is a Persistent Reservation (PR) Holder. The PR Holder value would be set to zero. The All registrants type is an optimization that makes all cluster members equal, so if any member fails, the others continue.
In all other types of reservation, there is a single reservation holder, which is one of the following I_T nexus examples:
The nexus for which the reservation was established with a PERSISTENT RESERVE OUT command with the RESERVE service action, the PREEMPT service action, the PREEMPT AND ABORT service action, or the REPLACE LOST RESERVATION service action
The nexus to which the reservation was moved by a PERSISTENT RESERVE OUT command with the REGISTER AND MOVE service action
An I_T nexus refers to the combination of the initiator port on the host with the target port on the server. The reservation types are described as follows:
1h Write Exclusive (WE)
Only the persistent reservation holder shall be permitted to perform write operations to the device. There is only one persistent reservation holder at a time.
3h Exclusive Access (EA)
Only the persistent reservation holder shall be permitted to access (includes read/write operations) the device. There is only one persistent reservation holder at a time.
5h Write Exclusive Registrants Only (WERO)
Write access commands are permitted only to registered nodes. A cluster designed around this type must declare one cluster owner (the persistent reservation holder) at a time. If the owner fails, another must be elected. The PR_Holder_Key value points to the PR_Key of the I_T nexus that holds the reservation of the disk. There is only one persistent reservation holder at a time, but all registered I_T nexuses are allowed to do write operations on the disk.
6h Exclusive Access Registrants Only (EARO)
Access to the device is limited to the registered nodes. As with WERO, if the current owner fails, the reservation must be established again to gain access to the device. There is only one persistent reservation holder at a time, but all registered I_T nexuses are allowed to do read/write operations on the disk.
7h Write Exclusive All Registrants (WEAR)
While this reservation is active, only the registered initiators shall be permitted write operations to the indicated extent. This reservation shall not inhibit read operations from any initiator or conflict with a read exclusive reservation from any initiator. Each registered I_T nexus is a reservation holder, and is allowed to write to the disk.
8h Exclusive Access All Registrants (EAAR)
While this reservation is active, no other initiator shall be permitted any access to the indicated extent apart from registered nodes. Each registered I_T nexus is a reservation holder, and is allowed to read/write to the disk.
Table A-2 shows the read/write operations with the type of All Registrants.
Table A-2 Read and write operations with All Registrants type
Type     WEAR (7h)/WERO (5h)             EAAR (8h)/EARO (6h)
         Not registered   Registered     Not registered   Registered
WRITE    Not allowed      Allowed        Not allowed      Allowed
READ     Allowed          Allowed        Not allowed      Allowed
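The access rules in Table A-2 can be expressed as a small lookup function. The following Python sketch is only an illustration of the table; the function and constant names are hypothetical and are not part of AIX or of any SCSI library.

```python
# Illustrative model of Table A-2. All names here are hypothetical.
WRITE_EXCLUSIVE = {0x5, 0x7}    # WERO, WEAR
EXCLUSIVE_ACCESS = {0x6, 0x8}   # EARO, EAAR


def is_allowed(pr_type: int, registered: bool, operation: str) -> bool:
    """Return True if the initiator may perform the operation on the disk."""
    if pr_type not in WRITE_EXCLUSIVE | EXCLUSIVE_ACCESS:
        raise ValueError(f"type {pr_type:#x} is not covered by Table A-2")
    if operation not in ("READ", "WRITE"):
        raise ValueError(f"unknown operation: {operation}")
    if registered:
        # Registered initiators can read and write under all four types.
        return True
    # Unregistered initiators can only read, and only when the
    # reservation is one of the write-exclusive types.
    return operation == "READ" and pr_type in WRITE_EXCLUSIVE


if __name__ == "__main__":
    for pr_type in (0x7, 0x8):
        for registered in (False, True):
            for operation in ("WRITE", "READ"):
                verdict = "Allowed" if is_allowed(pr_type, registered, operation) else "Not allowed"
                print(f"{pr_type:#x} registered={registered} {operation}: {verdict}")
```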
In the Registrants Only (RO) type, the reservation is exclusive to one of the registrants. The reservation of the device is lost if the current PR holder removes its PR Key from the device. To avoid losing the reservation, any other registrant can install itself as the Persistent Reservation Holder (an action known as preempt). In contrast, in the All Registrants (AR) type, the reservation is shared among all registrants.
ODM reserve policy
Accordingly, the AIX ODM device reserve_policy attribute needs to be set to open the device in any of the previous reservation types. The following values are the current valid values of the reserve_policy attribute, which can be seen using lsattr with the -R option, as shown in Example A-1.
Example A-1 Current valid values of the reserve_policy attribute
# lsattr -Rl <hdisk#> -a reserve_policy
no_reserve
single_path
PR_exclusive
PR_shared
 
Note: The values shown in Example A-1 can change according to the ODM definitions or host attachment scripts provided by the disk or storage vendors.
The following attribute values are valid:
no_reserve does not apply a reservation methodology for the device. The device can be accessed by any initiator.
single_path applies a SCSI 2 reserve methodology.
PR_exclusive applies a SCSI 3 persistent reserve, exclusive host methodology. The Write Exclusive Registrants Only type of reservation requires the reserve_policy attribute to be set to PR_exclusive.
PR_shared applies a SCSI 3 persistent reserve, shared host methodology. The Write Exclusive All Registrants type of reservation requires the reserve_policy attribute to be set to PR_shared.
This attribute can be set and read as shown in Example A-2.
Example A-2 Setting the disk attribute to PR_shared
# chdev -l hdisk1 -a reserve_policy=PR_shared
hdisk1 changed
 
# lsattr -El hdisk1 -a reserve_policy
reserve_policy PR_shared Reserve Policy True+
The lsattr command with the -E option displays the effective policy for the disk in the AIX ODM. The -P option displays the policy when the device was last configured. This is the reservation information in the AIX kernel that is used to enforce the reservation during disk opens.
Setting these attributes using the chdev command can fail if the resource is busy, as shown in Example A-3.
Example A-3 Setting the disk attribute with the chdev command
# chdev -l hdisk1 -a reserve_policy=PR_shared
Method error (/usr/lib/methods/chgdisk):
0514-062 Cannot perform the requested function because the specified device is busy.
When the device is in use, we can use the -P flag of chdev to change the effective policy only. The change is made to the database and is applied to the device when the system is restarted. Another method is to use the -U flag, which updates the reservation information in both the AIX ODM and the AIX kernel. However, not all devices support the -U flag. One way to determine this support is to look for the True+ value in the lsattr output, as shown in Example A-4.
Example A-4 Checking if the device supports the -U flag using the lsattr command output
# lsattr -Pl hdisk1 -a reserve_policy
reserve_policy PR_shared Reserve Policy True+
Persistent Reserve IN (PRIN)
 
Attention: Do not run these commands on your systems. They are shown in this section only to illustrate how disk reservations work, especially in a clustered environment, which demands more care when managing disk reservations.
PRIN commands are used to obtain information about active reservations and registrations on a device. The following PRIN service actions are commonly used:
Read keys: Reads the PR Keys of all registrants of the device.
Read reservation: Obtains information about the Persistent Reservation Holder. The PR Holder value is zero if an All Registrants type of reservation exists on the device. Otherwise, it is the PR Key of the node that holds the reservation of the device exclusively.
Report capabilities: Reads the capability information of the device. The capability bits indicate whether the device supports persistent reservations and the types of reservation supported by the device. A devrsrv implementation of this service action is shown in Example A-5.
Example A-5 Output of the devrsrv implementation
# devrsrv -c prin -s 2 -l hdisk1
PR Capabilities Byte[2] : 0x1 PTPL_C
PR Capabilities Byte[3] : 0x81 PTPL_A
PR Types Supported : PR_WE_AR PR_EA_RO PR_WE_RO PR_EA PR_WE PR_EA_AR
Persistent Reserve OUT (PROUT)
 
Attention: Do not run these commands on your systems. They are shown in this section only to illustrate how disk reservations work, especially in a clustered environment, which demands more care when managing disk reservations.
PROUT commands are used to reserve, register and remove the reservations and reservation keys. The following PROUT service actions are commonly used:
Register: To register and unregister a PR key with the device.
Reserve: To create a persistent reservation for the device.
Release: To release the selected persistent reservation without removing any registrations.
Clear: To release any persistent reservation and remove all registrations on the device.
Preempt: To replace the persistent reservation or remove registrations.
Preempt and abort: Along with preempting, to abort all tasks for one or more preempted nodes.
The value of the service action key and the reservation type matters when Preempt or Preempt and Abort actions are performed. Therefore, a little more detail about these service actions is necessary.
A PROUT command with PREEMPT or PREEMPT AND ABORT is used to perform one of the following actions:
Preempt (that is, replace) the persistent reservation and remove registrations
Remove registrations
The response to the PREEMPT AND ABORT service action is identical to the response to a PREEMPT service action, except that all tasks from the device associated with the persistent reservations or registrations being preempted (but not the task containing the PROUT command itself) shall be aborted. See Table A-3.
Table A-3 Effects of preempt and abort under different reservation types
Reservation type   Service action reservation key         Action
All registrants    Zero                                   Preempt the persistent reservation and remove registrations.
                   Not zero                               Remove registrations.
All other types    Zero                                   Illegal request.
                   Reservation holder's reservation key   Preempt the persistent reservation and remove registrations.
                   Any other, non-zero reservation key    Remove registrations.
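The decision logic of Table A-3 can be sketched as a single function. This Python example is only an illustration of the table; the function name, the enum, and the parameter names are hypothetical.

```python
from enum import Enum

# Illustrative model of Table A-3. All names here are hypothetical.
ALL_REGISTRANTS_TYPES = {0x7, 0x8}  # WEAR, EAAR


class PreemptEffect(Enum):
    PREEMPT_AND_REMOVE = "Preempt the persistent reservation and remove registrations"
    REMOVE_ONLY = "Remove registrations"
    ILLEGAL_REQUEST = "Illegal request"


def preempt_effect(pr_type: int, holder_key: int, sa_key: int) -> PreemptEffect:
    """Map a reservation type and service action reservation key to the action taken."""
    if pr_type in ALL_REGISTRANTS_TYPES:
        # Under All Registrants, only a zero key replaces the reservation.
        if sa_key == 0:
            return PreemptEffect.PREEMPT_AND_REMOVE
        return PreemptEffect.REMOVE_ONLY
    # All other types have a single reservation holder.
    if sa_key == 0:
        return PreemptEffect.ILLEGAL_REQUEST
    if sa_key == holder_key:
        return PreemptEffect.PREEMPT_AND_REMOVE
    return PreemptEffect.REMOVE_ONLY


if __name__ == "__main__":
    print(preempt_effect(0x7, holder_key=0, sa_key=0).value)
    print(preempt_effect(0x5, holder_key=0x1, sa_key=0x1).value)
    print(preempt_effect(0x5, holder_key=0x1, sa_key=0x3).value)
```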
Understanding register, reserve, and preempt
We have a cluster of four systems with shared access to a disk, as shown in Figure A-1. Assign a PR_key_value on each node, and set the reserve_policy of the target disk to PR_shared or PR_exclusive. The unique PR_key of each node is registered with the disk, and the disk is reserved with a SCSI PR reservation, which gives access to registered nodes only.
Figure A-1 Four node cluster setup with shared disk
We performed the register action from each system (1 - 4) to register its reservation key with the disk, and the reserve action to establish the reservation. The PR_Holder_Key value represents the current reservation holder of the disk. As shown in Table A-4, in the RO type only one system can hold the reservation of the disk at a time (key 0x1 in our example). However, all four registrant systems hold the reservation of the disk under the AR type, so the PR_Holder_Key value is zero.
Table A-4 Differences with RO and AR
Type            All registrants (Types 7h/8h)   Registrants only (Types 5h/6h)
Registrants     0x1 0x2 0x3 0x4                 0x1 0x2 0x3 0x4
PR_Holder_Key   0                               0x1
A read key command displays all of the reservation keys that are registered with the disk (0x1, 0x2, 0x3, and 0x4). The read reservation command gives the value of PR_Holder_Key, which varies per reservation type. If a network or any other failure leaves system 1 and the rest of the systems unable to communicate with each other for a certain period, the result is a split brain or split cluster situation, as shown in Figure A-2.
Figure A-2 Split cluster situation
Suppose that your cluster manager decides that system 2 (or the sub cluster with system 2) takes ownership. System 2 can then issue a PROUT command with the preempt or preempt and abort service action to remove the PR_Key 0x1 registration from the disk. The result is that the reservation is moved away from system 1, as shown in Table A-5, and system 1 is denied access to the shared disk.
Table A-5 Differences with RO and AR
Type            All registrants (Types 7h/8h)   Registrants only (Types 5h/6h)
PR_Holder_Key   0                               0x2
Preempt or preempt_and_abort functions can take the following arguments:
Current_key: The PR_key of the node issuing the command, for example 0x2.
Disk: The shared disk under discussion.
Action_key: The PR_key on which the action needs to be taken.
The action_key is 0x1 with the RO type of reservation. The action_key can be either 0 or 0x1 with the AR type of reservation. The two methods of preempting in case of an AR type are explained as follows:
Method 1: Zero action key
If the action key is zero, the following actions take place:
 – The registrations of systems 1, 3, and 4 are removed.
 – The persistent reservation is released.
 – A new reservation is created from system 2.
This results in access only to system 2, as shown in Figure A-3.
Figure A-3 Result of preempt with action key zero
If access needs to be regained for the rest of the systems in the active sub cluster, we need to drive an event to re-register the keys of those systems (systems 3 and 4).
Method 2: Non-zero action key
If the action key is non-zero (the key of system 1 in our case), the persistent reservation is not released, but the registration of PR_Key 0x1 is removed. This achieves fencing, as shown in Figure A-4.
Figure A-4 Disk fencing
Table A-6 shows the result of prin commands after preempting system 1.
Table A-6 Difference with RO and AR
scsipr command     All registrants (Types 7h/8h)   Registrants only (Types 5h/6h)
                   Method 1      Method 2
Read key           0x2           0x2 0x3 0x4       0x2 0x3 0x4
Read reservation   0             0                 0x2
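To make the preempt walkthrough concrete, the following Python sketch simulates the four-node example and reproduces the values of Table A-6. It is a toy in-memory model only: the SharedDisk class and its methods are hypothetical, and no real SCSI command is issued.

```python
from dataclasses import dataclass, field

ALL_REGISTRANTS = {0x7, 0x8}  # WEAR, EAAR


@dataclass
class SharedDisk:
    """Toy model of the PR state kept by one shared disk (hypothetical class)."""
    pr_type: int
    registrants: set = field(default_factory=set)
    holder_key: int = 0  # stays 0 under the All Registrants types

    def read_keys(self):
        return sorted(self.registrants)

    def read_reservation(self):
        return self.holder_key

    def preempt(self, current_key, action_key):
        # A node whose key is not registered gets a reservation conflict.
        if current_key not in self.registrants:
            raise PermissionError("reservation conflict: key is not registered")
        if self.pr_type in ALL_REGISTRANTS:
            if action_key == 0:
                # Method 1: all other registrations are removed and the
                # caller establishes the new reservation.
                self.registrants = {current_key}
            else:
                # Method 2: only the named registration is removed.
                self.registrants.discard(action_key)
            return
        if action_key == 0:
            raise ValueError("illegal request: zero action key with a single holder")
        self.registrants.discard(action_key)
        if action_key == self.holder_key:
            # The preempting node becomes the new reservation holder.
            self.holder_key = current_key


def four_node_disk(pr_type):
    holder = 0 if pr_type in ALL_REGISTRANTS else 0x1
    return SharedDisk(pr_type, {0x1, 0x2, 0x3, 0x4}, holder)


method1 = four_node_disk(0x7)
method1.preempt(current_key=0x2, action_key=0)

method2 = four_node_disk(0x7)
method2.preempt(current_key=0x2, action_key=0x1)

registrants_only = four_node_disk(0x5)
registrants_only.preempt(current_key=0x2, action_key=0x1)

for label, disk in (("AR method 1", method1), ("AR method 2", method2), ("RO", registrants_only)):
    keys = " ".join(hex(k) for k in disk.read_keys())
    print(f"{label}: read key = {keys}; read reservation = {hex(disk.read_reservation())}")
```

After the preempt, a further preempt attempt with key 0x1 raises PermissionError in this model, because that key is no longer registered with the disk.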
Unregister
A registered PR_key can be removed by issuing a register or register and ignore command through that node. The service action key needs to be set to zero to unregister a reservation key. The list of registrants and the PR_Holder_Key are shown in Table A-7.
Table A-7 Differences with RO and AR
Type            All registrants (Types 7h/8h)   Registrants only (Types 5h/6h)
Registrants     0x2 0x3 0x4                     0x2 0x3 0x4
PR_Holder_Key   0                               0x2
If the unregistered key is the PR_Holder_key (0x2) in RO type of reservation, along with the PR_key, the reservation to the disk is also lost. Removing Key 0x2 has no impact on reservation in the case of AR reservation type. The same is true when other keys are removed.
Any preempt attempt by system 1 fails with a conflict because its key is not registered with the disk.
Release
A release request from the persistent reservation holder node releases only the reservation of the disk; the pr_keys remain registered. Referring to Table A-7, with the AR type of reservation, a release command from any of the registrants (0x2 0x3 0x4) results in the reservation being removed. In the case of the RO type, a release command from a non-holder (0x3 0x4) returns good status but has no impact on the reservation or registration. The release request must come from the PR holder (0x2) in this case.
Clear
Referring again to Table A-7, if a clear request is made to the target device from any of the nodes, the persistent reservation of the disk is lost, and all of the pr_keys registered with the disk (0x2 0x3 0x4) are removed. Note that, as the T10 document suggests, the clear action must be restricted to recovery operations, because it defeats the persistent reservation feature that protects data integrity.
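The unregister, release, and clear behaviors described above can be compared side by side with a small state model. The following Python sketch starts from the state of Table A-7; the dictionary layout and function names are hypothetical, and corner cases that the text does not describe (for example, requests from unregistered nodes) are deliberately left without effect.

```python
def table_a7_state(kind):
    """Return the Table A-7 state; kind is "AR" (7h/8h) or "RO" (5h/6h)."""
    return {
        "kind": kind,
        "keys": {0x2, 0x3, 0x4},
        "reserved": True,
        "holder": 0 if kind == "AR" else 0x2,
    }


def _drop_reservation(state):
    state["reserved"] = False
    state["holder"] = None


def unregister(state, key):
    """Remove one key; under RO the reservation is lost with the holder's key."""
    state["keys"].discard(key)
    if state["kind"] == "RO" and state["reserved"] and key == state["holder"]:
        _drop_reservation(state)


def release(state, key):
    """Release the reservation only; all registered keys stay in place."""
    if key not in state["keys"]:
        return  # not described in the text, so this model does nothing
    if state["kind"] == "AR" or key == state["holder"]:
        _drop_reservation(state)
    # A non-holder under RO gets good status, but nothing changes.


def clear(state):
    """Drop the reservation and every registration (recovery use only)."""
    state["keys"].clear()
    _drop_reservation(state)


if __name__ == "__main__":
    ro = table_a7_state("RO")
    release(ro, 0x3)   # non-holder: no effect
    print("RO after release by 0x3:", ro["reserved"], sorted(ro["keys"]))
    release(ro, 0x2)   # holder: reservation released, keys remain
    print("RO after release by 0x2:", ro["reserved"], sorted(ro["keys"]))

    ar = table_a7_state("AR")
    unregister(ar, 0x2)  # no impact on the shared reservation
    print("AR after unregister 0x2:", ar["reserved"], sorted(ar["keys"]))
    clear(ar)
    print("AR after clear:", ar["reserved"], sorted(ar["keys"]))
```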
 
Note: When a node opens the disk or a register action is performed, it would register with the PR_key value through each path to the disk. Therefore, we can see multiple registrations (I_T nexuses) with the same key. The number of registrations would be equal to the number of active paths from the host to the target, because each path represents an I_T nexus.
Storage
Contact your storage vendor to understand whether your device or multipathing driver is capable of SCSI Persistent Reservation, and which types of reservations it supports. Your storage vendor can also provide the minimum firmware level and driver version that are needed, and the flags required to enable support for persistent reservations.
The following configurations provide examples of support for persistent reservations:
IBM XIV®, DS8K, and SVC storage with native AIX MPIO supports [1] SCSI PR Exclusive and Shared reservations by default, as shown in Example A-6.
Example A-6 IBM storage support with native AIX MPIO of the SCSI PR Exclusive
# lsattr -Rl hdiskx -a reserve_policy | grep PR
PR_exclusive
PR_shared
The devrsrv utility enables you to verify the capability of your disks.
Hitachi disks with native AIX MPIO support [2] all SCSI PR reservation types, provided that Host Mode Options (HMOs) 2 and 72 are set. The minimum code to support HMO72 is 70-04-31-00/00.
EMC disks support [3] PR Shared reservations, but not Exclusive reservations, with PowerPath v6.0, as shown in Example A-7.
Example A-7 EMC disk reservation support with powerpath v6.0
# lsattr -Rl hdiskpowerX -a reserve_policy | grep PR
PR_shared
Director bits SCSI3 Interface (SC3) and SCSI Primary Commands (SC2) must be enabled. Flag SCSI3_persist_reserv must also be enabled in order to use persistent reservation on powerpath devices.
More about PR reservations
During the reset sequence of the disk through a path, we send a PR IN command with service action READ RESERVATION(01h). This returns the current reserved key on the disk, if any Persistent reservation exists. If an All Registrant type reservation is on the disk, the reserved key would be zero.
In the case of a PR_exclusive type of reservation, the following actions occur:
If the current reservation key is same as the node’s key as in ODM, we register the key using PR OUT command with Register and Ignore Existing Key service action.
If the current reservation key is zero and the TYPE field (persistent reservation type as shown in Table A-1) is also 0, which means that there is no persistent reservation on the disk, we complete the following steps:
a. Register the key on to the disk using a PR OUT command Register and Ignore Existing Key service action.
b. If not reserved already by this host, we reserve it using a PR OUT command with Reserve service action and a type of Write Exclusive Registrants Only (5h).
If the current reservation key is different from the current host’s key, then it means that some other host holds the reservation. If we are not trying to open the disk with the -force flag, the open call fails. If we are trying to open the disk with the -force flag, complete the following steps:
a. Register the disk with our key using a PR OUT command with Register and Ignore Existing Key service action.
b. Preempt the current reservation with a PR OUT command with Preempt and Abort service action to remove the registration and reservation of the current reservation holder. The key of the current reservation holder is given in the Service Action Reservation Key field.
In the case of a PR_shared reservation, the following actions occur:
If the current reservation key is zero and the TYPE field (persistent reservation type as shown in Table A-1) is also 0, this means that there is no persistent reservation on the disk. If the TYPE field is Write Exclusive All Registrants (7h), then some other host is already registered for shared access. In either case, complete the following actions:
a. Register our key on to the disk using a PR OUT command with the Register and Ignore Existing Key service action.
b. Reserve the disk using a PR OUT command with the RESERVE service action and the type of Write Exclusive All Registrants (7h).
While closing the disk, for PR_exclusive reservations alone, we send a PR OUT command with the Clear service action to the disk to clear all of the existing reservations and registration. This command is sent through any one of the good paths of the disk (the I_T nexus where registration has been done successfully).
While changing the reserve_policy using chdev from PR_shared to PR_exclusive, from PR_shared or PR_exclusive to single_path (or no_reserve if the key in ODM is one of the registered keys on the disk), we send a PR OUT command with Clear service action to the disk to clear all of the existing reservations and registration.
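The open-time decisions described above for PR_exclusive and PR_shared can be summarized as a planning function that returns the PR OUT commands to send. This Python sketch only restates the steps in the text; the function name, parameters, and command strings are hypothetical and do not reflect the actual AIX disk driver code. Cases that the text does not describe raise NotImplementedError.

```python
REGISTER = "PR OUT: REGISTER AND IGNORE EXISTING KEY"


def plan_open(policy, my_key, reserved_key, pr_type, force=False):
    """Return the list of PR OUT steps for opening the disk (illustrative only).

    reserved_key and pr_type stand for the values returned by a
    PR IN READ RESERVATION command; both are 0 when nothing is reserved.
    """
    if policy == "PR_exclusive":
        if reserved_key == my_key:
            # This host already holds the reservation: re-register the key.
            return [REGISTER]
        if reserved_key == 0 and pr_type == 0:
            # No reservation on the disk: register, then reserve as WERO.
            return [REGISTER, "PR OUT: RESERVE type 5h (WERO)"]
        # Another host holds the reservation.
        if not force:
            raise PermissionError("open fails: another host holds the reservation")
        return [REGISTER, f"PR OUT: PREEMPT AND ABORT sa_key={reserved_key:#x}"]
    if policy == "PR_shared":
        if (reserved_key == 0 and pr_type == 0) or pr_type == 0x7:
            # Either nothing is reserved or hosts already share the disk.
            return [REGISTER, "PR OUT: RESERVE type 7h (WEAR)"]
        raise NotImplementedError("case not described in the text")
    raise ValueError(f"policy not covered by this sketch: {policy}")


if __name__ == "__main__":
    print(plan_open("PR_exclusive", my_key=0x2, reserved_key=0, pr_type=0))
    print(plan_open("PR_exclusive", my_key=0x2, reserved_key=0x1, pr_type=0x5, force=True))
    print(plan_open("PR_shared", my_key=0x2, reserved_key=0, pr_type=0x7))
```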
Persistent reservation commands
The devrsrv command of AIX queries, and can even break, persistent reservations on the device. The IBM Knowledge Center explains the usage of the devrsrv command.
Use the following syntax for the devrsrv command:
devrsrv -c query | release | prin -s sa | (prout -s sa -r rkey -k sa_key -t prtype) -l devicename
The clrsrvmgr command of PowerHA 7.2 lists and clears the reservation of a disk or a group of disks in a Volume Group.
Use the following syntax for the clrsrvmgr command:
clrsrvmgr -r {[-l DiskName]|[-g VGname]} [-v]
clrsrvmgr -c {[-l DiskName]|[-g VGname]} [-v]
clrsrvmgr -h
This command lists or clears the reservation status of a disk or a volume group. The command displays the following key attributes related to disk reservations:
Configured Reserve Policy. This is the reservation information in the AIX kernel used to enforce the reservation during disk opens etc.
Effective Reserve Policy. Reservation policy for the disk in the AIX ODM.
Reservation Status. This is the status of the actual reservation on the storage disk itself.
The options are mostly self-explanatory:
-r read
-c clear
-h help
-v verbose
-l expects diskname
-g expects a volume group name
The manager does not guarantee the operation because disk operations depend on the accessibility of the device. However, it tries to show the reason for failure when used with the -v option. The utility does not support operation at both the disk and volume group levels together. Therefore, the -l and -g options cannot co-exist. At the volume group level, the number of disks in the VG, and each target disk name, are displayed as shown in the following code:
# clrsrvmgr -rg PRABVG
Number of disks in PRABVG: 2
hdisk1011
Configured Reserve Policy : no_reserve
Effective Reserve Policy : no_reserve
Reservation Status : No reservation
hdisk1012
Configured Reserve Policy : no_reserve
Effective Reserve Policy : no_reserve
Reservation Status : No reservation
At the disk level, the disk name is not displayed because the target device is already known:
# clrsrvmgr -rl hdisk1015 -v
Configured Reserve Policy : PR_shared
Effective Reserve Policy : PR_shared
Reservation Status : No reservation
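When the listing needs to be processed by a script, the text output can be turned into a dictionary. The following Python sketch assumes that the output has exactly the layout shown in the two listings above (a disk name on its own line, followed by "attribute : value" lines); the parse_clrsrvmgr function is hypothetical and is not part of PowerHA.

```python
def parse_clrsrvmgr(text, default_disk=None):
    """Parse 'clrsrvmgr -r' style output into {disk: {attribute: value}}.

    default_disk names the disk for disk-level output, where the
    listing does not repeat the disk name.
    """
    disks = {}
    current = default_disk
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("Number of disks"):
            continue
        if ":" in line:
            # Attribute line, for example "Reservation Status : No reservation".
            key, _, value = line.partition(":")
            disks.setdefault(current, {})[key.strip()] = value.strip()
        else:
            # Any other non-empty line is treated as a disk name.
            current = line
    return disks


SAMPLE = """\
Number of disks in PRABVG: 2
hdisk1011
Configured Reserve Policy : no_reserve
Effective Reserve Policy : no_reserve
Reservation Status : No reservation
hdisk1012
Configured Reserve Policy : no_reserve
Effective Reserve Policy : no_reserve
Reservation Status : No reservation
"""

if __name__ == "__main__":
    for disk, attributes in parse_clrsrvmgr(SAMPLE).items():
        print(disk, attributes)
```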
 

1 Confirm with the storage and driver vendors.
2 Confirm with the storage and driver vendors.
3 Confirm with the storage and driver vendors.