Availability considerations
This chapter describes availability considerations in each of the JES environments and compares how availability is maintained in each.
17.1 Standard practices for availability
It is assumed that for either JES configuration, IBM standard practices are followed with regard to availability. All configurations must maintain redundant processes that are documented and known to operations support staff. For example, neither JES configuration should require an all-systems sysplex IPL for routine maintenance; a rolling IPL should be the standard for change management.
Also, it is assumed that in either environment, security protection is in place against the more destructive operator commands.
17.2 Tailoring JES for best availability
In normal situations, there are parameters in both JES3 and JES2 that help build a more reliable, and hence more available, environment.
17.2.1 JES3 global-local communication
Implementing XCF signaling through coupling facility list structures provides significant advantages for systems management and recovery, and thus enhances availability for sysplex systems as far as global-local communication is concerned.
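As a sketch, XCF signaling structures are defined in the CFRM policy and referenced in the COUPLExx parmlib member. The structure name, size, and coupling facility names below are illustrative only:

```
  /* CFRM policy: define a signaling list structure (names are  */
  /* hypothetical) with a preference list spanning two CFs      */
  STRUCTURE NAME(IXC_SIG01)
            SIZE(12M)
            PREFLIST(CF01,CF02)

  /* COUPLExx: use the structure for inbound and outbound paths */
  PATHIN  STRNAME(IXC_SIG01)
  PATHOUT STRNAME(IXC_SIG01)
```

With signaling structures defined in two coupling facilities, the loss of one CF does not stop global-local communication, and there are no CTC path definitions to manage.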
17.2.2 JES3 spool partitions
A spool partition is a logical grouping of spool data sets. You control five factors:
The number of spool partitions used.
The number of spool data sets that are in each spool partition.
The work load distribution across spool partitions.
The type of spool data to be included in each spool partition.
The size of a track group for each partition.
These factors influence the reliability, availability, and serviceability (RAS) of spool data sets and the performance impact of accessing a spool data set.
If a spool data set fails, the failure affects only a subset of the jobs in the JES3 complex: only those jobs that have data in the spool partition containing the failed spool data set, not jobs that have data in other spool partitions. (Even then, the failure might not affect all jobs in that partition, because some jobs might not have had any data on the failed data set.) Thus, spool partitioning improves spool RAS.
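For illustration, partitions and their spool data sets are defined in the JES3 initialization stream with SPART and TRACK statements. The partition names and DD names below are hypothetical; verify the syntax against the JES3 initialization documentation:

```
  *  PARTA IS THE DEFAULT PARTITION; PARTB ISOLATES CRITICAL WORK
  SPART,NAME=PARTA,DEF=YES
  SPART,NAME=PARTB
  TRACK,DDNAME=SPOOL1,SPART=PARTA
  TRACK,DDNAME=SPOOL2,SPART=PARTA
  TRACK,DDNAME=SPOOL3,SPART=PARTB
```

In this sketch, a failure of the data set behind SPOOL3 would affect only jobs with data in PARTB.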
17.2.3 JES3 spool volume recovery
Spool recovery procedures and operator commands are an important part of maintaining the JES3 system. When spool volume errors occur, you must have procedures in place to provide recovery from any type of error.
The *F Q command allows you to control activity on a spool data set. For spool volume recovery, you can do the following:
You can stop JES3 from allocating additional space on a specific spool data set and then restart space allocation processing at a later time. This action does not affect the jobs that already have data on this data set; the jobs continue to run in the normal manner.
If necessary, you can place a spool data set and all jobs with spool data on the data set in hold status and release both the data set and the jobs at a later time.
Another parameter allows you to place the data set in hold status and cancel all jobs with spool data on the data set. You then can release the data set from hold status and resume allocating space on the data set.
All these changes remain in effect when you restart JES3 with a hot or warm start. In summary, the spool volume recovery facilities:
Avoid cold starts due to spool failures, whether single-track errors or large numbers of errors.
Provide a method for suspending use of a volume.
Prevent new allocations to a volume.
Allow spool volumes to be replaced.
Note that no equivalent RAS facilities exist for the JCT data set and the checkpoint data set. Other techniques are also possible.
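For example, assuming a failing spool data set with DD name SPOOL3, the actions described above might look as follows (the operand names shown should be verified against z/OS JES3 Commands):

```
  *F Q,DD=SPOOL3,STOP        stop allocating new space on the data set
  *F Q,DD=SPOOL3,HOLD        hold the data set and jobs with data on it
  *F Q,DD=SPOOL3,RELEASE     release the data set and the held jobs
  *F Q,DD=SPOOL3,CANCEL      hold the data set and cancel affected jobs
```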
17.2.4 JES3 system select phase
The system select phase of MDS is performed when a job requires one or more resources managed by the storage management subsystem (SMS). If the job does not require any SMS-managed resources, the job proceeds directly to MDS allocation. JES3 is not aware of the availability or connectivity of SMS-managed resources. If a job requires SMS-managed resources, JES3 requests SMS to determine the availability of those resources and to determine which mains have access to those resources. If JES3 determines that one or more mains have access to all of the required resources, the job proceeds into the allocation phase.
JES3/DFSMS communication
JES3 SMS support provides complex-wide data set awareness for DFSMS-managed data sets through subsystem interface communication with DFSMS. Main processor and DFSMS resource availability are determined through these interface calls when scheduling jobs into execution.
JES3 and DFSMS communication is required to:
Make sure that catalog locates are done on a processor with access to the required catalogs.
Make sure that jobs requiring DFSMS resources execute on processors where those resources are accessible and available.
Provide complex-wide data set awareness for all DFSMS managed requests (even for new, non-specific requests).
Remove JES3 awareness of units and volumes for DFSMS managed data sets. (One of the DFSMS objectives is to remove user awareness of the physical storage.)
17.2.5 JES2 checkpoint data set
We described checkpoint data sets in 1.2.5, “Checkpoint data set” on page 14. When you review the allocation of checkpoint data sets, your interest is not solely performance; the prime objective is availability. You need to adopt either dual or duplex mode for your checkpoint. Duplex mode must always be used when the checkpoint resides on a coupling facility structure.
Dual mode can only be established when coupling facilities are not being used. In dual mode, processor overhead is traded to reduce the amount of data transferred to the device and thus shorten the total time for the I/O to complete. Dual mode can add an additional 10% in JES2 processor cycles.
Generally, with modern DASD, the I/O time saved by adopting dual mode is often offset by the additional JES2 processor cycles, so dual mode does not provide a significant advantage.
When configuring checkpoint, we do not recommend placing both checkpoints on coupling facility structures or placing the primary checkpoint on DASD and the secondary checkpoint on a coupling facility structure.
If both checkpoints reside on coupling facilities that become volatile (a condition where, if power to the coupling facility device is lost, the data is lost), your data is more susceptible than when a checkpoint data set resides on DASD. If no other active MAS member exists, you can lose all checkpoint data and require a JES2 cold start.
Placing the primary checkpoint on DASD while the secondary checkpoint resides on a coupling facility provides no benefit to an installation.
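A hedged sketch of the recommended arrangement, with the primary checkpoint in a coupling facility structure and the alternate on DASD in duplex mode, might look like the following CKPTDEF statement. The structure, data set, and volume names are placeholders; verify the exact operands for your JES2 release:

```
  CKPTDEF  CKPT1=(STRNAME=JES2CKPT1,INUSE=YES),
           CKPT2=(DSN=SYS1.JES2.CKPT2,VOLSER=CKPT02,INUSE=YES),
           MODE=DUPLEX,DUPLEX=ON
```

This keeps one copy of the checkpoint on DASD, so a disruption of CF access does not force a JES2 cold start.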
17.2.6 Coupling facility structure duplexing
The introduction of coupling facility structure duplexing, with CFLEVEL 12 on System z processors and z/OS V1R4 and above, provides worthwhile availability benefits. There is no significant performance overhead to JES2 itself, and duplexing avoids the operational use of the JES2 checkpoint reconfiguration dialog.
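In the CFRM policy, duplexing is requested on the structure definition itself. A sketch, with hypothetical structure and CF names:

```
  STRUCTURE NAME(JES2CKPT1)
            SIZE(20M)
            DUPLEX(ENABLED)
            PREFLIST(CF01,CF02)
```

With DUPLEX(ENABLED), the system maintains a duplexed copy of the structure in a second coupling facility automatically.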
17.2.7 JES2 SPOOL partitioning
SPOOL partitioning (or fencing) restricts track group allocation to explicitly specified volumes. Standard JES2 processing allows track groups to be allocated across any available spool volume. By fencing volumes, you can improve JES2 performance: frequently run system jobs can be isolated on separate volumes, leaving the remaining volumes for user-defined work. This can yield performance improvements for sysout-intensive batch jobs. In general terms, however, performance is better with FENCE=NO, because jobs can access all volumes, increasing pathing capability and reducing the impact of IOSQ and PEND time.
The main benefit of fencing is availability. Without it, the loss of a spool volume can result in the loss of all jobs; with it, jobs can be limited to one volume, so a failure is isolated. With the common use of RAID DASD, spool fencing might seem less beneficial because data is striped across multiple physical volumes. However, RAID addresses only the physical residence of the data; potential failure scenarios at the UCB level remain, and they continue to justify the availability benefits of fenced volumes.
Spool fencing is enabled by the FENCE parameter on the SPOOLDEF initialization statement. Exit 11 and Exit 12 allow masks that limit access to the volumes based on job name or job class. Alternatively, fencing can be used in association with spool affinity (see 17.2.8, “JES2 SPOOL affinity” on page 221) to control the member selection criteria for fenced volumes.
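As a minimal illustration, fencing is turned on through SPOOLDEF. The exact FENCE operands vary by JES2 release, so verify them against your initialization documentation:

```
  SPOOLDEF FENCE=YES    /* limit each job's track groups to one volume */
```

Exit 11 and Exit 12 would then be coded to steer selected job names or job classes to the fenced volumes.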
17.2.8 JES2 SPOOL affinity
Spool affinity is an alternative (or addition) to fencing and provides a mechanism to logically partition the spool. This gives availability benefits. It facilitates the split of a large DASD spool pool into smaller pools for critical systems. It also provides a mechanism to have spool space online and ready to use if spool volumes start to fill up.
17.3 Updating the configuration
The following section provides a brief overview of what is required to update the configuration of the respective JES systems.
17.3.1 In a JES3 environment
There are three types of configuration changes to consider: JES3 initialization parm changes, system changes, and JES3 Exit changes.
JES3 initialization changes
Changes to most JES3 parameters do not require JES3plex IPLs and can be made dynamically by performing a JES3 hot start with refresh. This recycles the JES3 address space on the global system and restarts it using an updated initialization (inish) deck or parameter member.
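For example, a hot start with refresh is requested by replying to the JES3 start-type prompt when the address space is restarted on the global (the reply number is illustrative):

```
  S JES3                restart the JES3 global address space
  R 10,HR               reply 'HR' (hot start with refresh) to the
                        JES3 start-type prompt
```

Subsequent prompts in the start dialog let you identify the updated inish deck to use.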
Spool volumes can be added dynamically in JES3 with SPOOL ADD (using FORMAT, SPART, and TRACK). However, SPOOL DELETE requires a rolling IPL because a “spool migrate” function does not exist.
The following changes also require a JES3plex IPL:
BADTRACK (bypass defective tracks)
CONSTD (console service standards)
System changes
In a JES3 environment, system changes requiring an IPL can be implemented with a rolling IPL for all JES3 local systems, but the JES3 global system needs special consideration. When implementing a change on the global, you must either tolerate an outage of the global (which suspends processing on all locals for that time) or perform a DSI to move the global function to another system in the sysplex before IPLing.
JES3 exits implementation
To implement changes to JES3 exits, the dynamic exit facility can be used. However, ensuring that all systems in the JES3plex pick up the change, and testing it before production implementation, requires a separate JES3plex configuration. Be aware that the default exits can set a flag that prevents any additional calls being made to them.
17.3.2 In a JES2 environment
A normal recycle of JES2 is more closely associated with a shutdown and IPL of a system, because all jobs running on the system are submitted under the JES2 subsystem; this can be seen in the response to a $D JES2 command, where all started tasks and jobs are shown. However, most changes can be made dynamically and then added to the initparm so that they take effect permanently on subsequent IPLs, so the need for a full recycle arises less often.
Adding or replacing spool volumes
JES2 spool volumes in a MAS can be added by using the $S SPOOL command. Spool volumes can be removed by using the $Z SPOOL command to halt or the $P SPOOL command to drain. The procedure can be found in z/OS JES2 Initialization and Tuning Guide, SA22-7532.
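For illustration (the volume serials are hypothetical):

```
  $S SPOOL(SPOOL9),FORMAT    add and format a new spool volume
  $Z SPOOL(SPOOL4)           halt the volume: no new allocation, but
                             existing jobs keep their data on it
  $P SPOOL(SPOOL4)           drain the volume; it is removed when the
                             last job with data on it has purged
```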
Checkpoint reconfiguration dialog
Maintenance and changes to the checkpoint data sets are done through the checkpoint reconfiguration dialog. Operational procedures need to be created for this highly critical process. See 9.6.3, “JES2 Checkpoint Reconfiguration Dialog” on page 127 for a sample display of the dialog. The procedure is explained in further detail in z/OS JES2 Initialization and Tuning Guide, SA22-7532.
JES2 exits implementation
Since each system in a JES2 configuration is independent, each system might have its own set of JES2 exits. JES2 provides more dynamic capability than JES3. Implementation of changes to new exits either dynamically or with a system IPL can be implemented using a rolling IPL method.
17.3.3 Secondary JES
JES2 can be started as a secondary subsystem when the primary subsystem is JES2. This is known as Poly-JES and is primarily used for testing new configurations, releases, or initparms. It is not recommended for production use for any length of time because of known limitations in other system products and their support. For example, WLM does not support a secondary JES2 subsystem in the same MAS as the primary; WLM-managed initiators only select jobs from the primary JES.
 
Note: JES2 can run as a secondary subsystem when the primary subsystem is JES3, but this is not supported.
17.4 Unplanned outage
This section describes the impact of an unplanned outage in each of the JES configurations with a brief review of the recovery process.
17.4.1 JES3 unplanned outages
The following sections describe the two types of unplanned outage in a JES3 configuration: a JES3 local outage and a JES3 global outage.
JES3 local outage
Apart from a normal shutdown of a JES3 local with the *RETURN command, an outage of a JES3 local system only requires a restart of the local JES3 address space. If a JES3 local fails to restart, an IPL of that system (or of the JES3plex) might be required, and the failure could be a symptom of a larger problem.
JES3 global outage
If the JES3 global system goes down, or the JES3 address space on the current global goes down, job processing quiesces on each system in the JES3plex as soon as an element of work on that local needs a JES service. Thus IMS, DB2, and CICS (for example) might not be immediately impacted. Recovery must be attempted immediately; the approach depends on the scenario. If the JES3 global address space cannot be restarted with a hot start on the currently defined global system, consider these options:
Restart the JES3 address space with a hot start with analysis (HA) and remove any jobs that might be causing the problem.
Re-IPL the global system and attempt to restart JES3.
Perform a Dynamic System Interchange (DSI) and move the global function to another system in the JES3plex.
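A simplified DSI sketch, assuming the old global is already down (the full operator dialog depends on the scenario; see z/OS JES3 Commands):

```
  *X DSI      invoke DSI on the local that will become the new global
  *S DSI      confirm and complete the interchange
```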
17.4.2 JES2 unplanned outage
The configuration of a JES2 JESplex is set up so that each system acts independently of the others, and an outage of any one system does not affect any other JES2 system in the complex. A sample JES2 configuration is shown in Figure 17-1. Note that in this configuration, member SYSA has AUTOEMEM=YES defined in the MASDEF statement of the initparm, and SYSB and SYSD have RESTART=YES defined.
Figure 17-1 Four system JES2plex
If one of those members, system SYSA in the example, is removed from the JESplex, the work scheduled to that system can either wait for the system to be restarted into the JESplex or be automatically restarted on another eligible system, as shown in Figure 17-2 on page 224.
Figure 17-2 Four system JES2plex with one system removed
As Figure 17-2 shows, such work is warm started on another eligible member in the JESplex that has RESTART=YES.
 
Note: The use of AUTOEMEM might not be what is intended for all installations. Consider any applications with an affinity to a particular system or set of systems. For example, if some production DB2 applications are available only on SYSA and SYSC and cannot run on SYSB and SYSD, they would not be candidates for this automatic switch if SYSA came down. In that case, automatic restart manager (ARM) or automation would be preferred.
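Mirroring the configuration in Figure 17-1, the relevant MASDEF specifications might look as follows. The keywords shown follow this chapter's description; verify the exact syntax for your JES2 release:

```
  /* SYSA initparm: SYSA's jobs may be restarted elsewhere */
  MASDEF AUTOEMEM=YES

  /* SYSB and SYSD initparms: willing to restart such jobs */
  MASDEF RESTART=YES
```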
17.4.3 JES2 abend
If the JES2 address space itself abends or a system that contains a JES2 subsystem terminates abnormally, the behavior is the same as the system coming down and would depend on the recovery options in the configuration.
An operator initiated abend can be displayed as shown in Example 17-1.
Example 17-1 Sample response after $P JES2,ABEND is entered
*$HASP198 REPLY TO $HASP098 WITH ONE OF THE FOLLOWING: 101
END - STANDARD ABNORMAL END
END,DUMP - END JES2 WITH A DUMP (WITH AN OPTIONAL TITLE)
END,NOHOTSTART - ABBREVIATED ABNORMAL END (HOT-START IS AT RISK)
SNAP - RE-DISPLAY $HASP088
DUMP - REQUEST SYSTEM DUMP (WITH AN OPTIONAL TITLE)
*062 $HASP098 ENTER TERMINATION OPTION
Jobs can also be restarted on other eligible members when a member is removed from the MAS with the $E MEM(sysname) command. Note that this command is ignored for an active member.
17.4.4 JES2 checkpoint data set errors
It is recommended to define an alternate JES2 checkpoint data set so that any errors encountered are automatically detected and processing switches to the alternate, or the checkpoint reconfiguration dialog is displayed.
JES2 checkpoint data set configuration
It is recommended that the JES2 configuration include a primary checkpoint data set, defined either as a coupling facility structure or on DASD, and an alternate checkpoint data set on DASD.
 
Note: If you only have one Coupling Facility (CF), do not define both the CKPT1 and CKPT2 data sets as structures to that CF. Make sure that at least one checkpoint data set is defined on DASD. Otherwise, you might unintentionally lose your checkpoint data sets if access to your CF is disrupted, which will potentially require a JES2 cold start to recover.
Also, for checkpoint data sets defined on DASD, it is recommended to add an entry to the GRSRNLxx PARMLIB member to minimize contention on the data set.
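A hedged sketch of such a GRSRNLxx entry follows. The QNAME and RNAME values are placeholders; take the actual resource names from z/OS JES2 Initialization and Tuning Guide:

```
  RNLDEF RNL(EXCL) TYPE(SPECIFIC)
         QNAME(qname)              /* placeholder */
         RNAME(SYS1.JES2.CKPT1)    /* placeholder */
```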