Advanced alert management
With IBM Spectrum Control (Spectrum Control) Standard and Advanced Edition, the ability to define alerts and be notified of capacity changes, configuration changes, and performance conditions on resources is enhanced.
This chapter describes and provides practical scenarios that use this extensive set of new alert conditions to better monitor and identify more of the critical situations in your environment. These scenarios are focused on adding value from a storage administrator perspective. Top-level entities (storage system, switch, and server) are covered.
For more information about alerting in Spectrum Control, see the Spectrum Control IBM Knowledge Center, found at:
6.1 Alerting overview
Determining when and how you are alerted to conditions and violations of performance thresholds within a storage environment is important to helping you maintain and administer storage resources.
With Spectrum Control, you can now apply an extensive set of alert conditions to detect problems. You can define alerts for almost all attributes of a resource, including attributes for status, configuration, capacity, and performance. Use this extensive set of alert conditions to better monitor your environment without needing to look at all the metrics manually. Instead, look once and then define thresholds according to your environment. From time to time, you should review your alerts and thresholds because business requirements change.
In Spectrum Control, there are over 50 conditions on which you can receive alerts, including conditions such as Used Capacity, Total Volume Capacity, and Available Pool Space. For performance alerts, you can now receive alerts on any of the hundreds of metrics that are collected by Spectrum Control.
You can now configure multiple alerts for the same condition with different severity, suppression, and notification settings.
6.2 Alerts
The Spectrum Control Alert Server expands the ability of Spectrum Control to alert at the storage component level, providing added flexibility for capacity and performance alerts and attribute-based configuration alerts. The alert server is an independent component server that is built on WebSphere Liberty Core.
For more information about WebSphere Liberty Core, see the WebSphere Application Server IBM Knowledge Center, found at:
The Alert Server is packaged with all Spectrum Control offerings and is automatically installed. It coexists with both the Device and Data Servers with the same level of service controls and logging.
If you use a custom start/stop script, you must include the Alert Server as well.
On the Spectrum Control Dashboard, you see how many alerts of which category occurred in the last hour, last day, or last week.
Figure 6-1 shows the Alert Overview on the Dashboard.
Figure 6-1 Alert Overview on the Dashboard
6.2.1 Features
The following list shows the main features that are available starting with Spectrum Control V5.2.8:
Expands the total number of storage resources that can be tracked and alerted.
Classifies alerts in three categories for simplified configuration: Informational, warning, and error.
Centralizes alerts and threshold violations under a single tab, where you can acknowledge them.
Three different Alert or Threshold Violation Suppression settings are available:
 – Only alert once until problem clears.
 – Only generate alerts every x minutes, hours, or days.
 – Do not alert until condition has been violated for more than x minutes, hours, or days.
Alert notification supports email, Omnibus, SNMP, and Windows/UNIX Event Logs.
Multiple thresholds can be set for the same metric for different severities, which can be combined with specifying different email recipients or different scripts.
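The three suppression settings in the list above differ in when a notification is actually sent. The following Python sketch is one way to reason about that difference. It is not Spectrum Control code: the function name, the enum, and the sample data are hypothetical, time is measured in minutes, and the assumption that the duration-based setting notifies once per uninterrupted violation is an illustrative simplification.

```python
from enum import Enum


class Suppression(Enum):
    ONCE_UNTIL_CLEARED = "once"   # Only alert once until problem clears
    EVERY_INTERVAL = "every"      # Only generate alerts every x minutes, hours, or days
    AFTER_DURATION = "after"      # Do not alert until violated for more than x


def alert_times(samples, mode, window=0):
    """Return the timestamps at which a notification would be sent.

    samples: list of (timestamp_in_minutes, violated) pairs in time order.
    window:  x, in minutes, for the two time-based modes.
    """
    sent = []
    violation_start = None   # when the current run of violations began
    notified = False         # whether the current run already produced an alert
    last_sent = None         # time of the most recent notification

    for t, violated in samples:
        if not violated:
            # The condition cleared, so the next violation starts a new run.
            violation_start = None
            notified = False
            continue
        if violation_start is None:
            violation_start = t

        if mode is Suppression.ONCE_UNTIL_CLEARED:
            if not notified:
                sent.append(t)
                notified = True
        elif mode is Suppression.EVERY_INTERVAL:
            # Fixed window: stay quiet until `window` minutes have passed.
            if last_sent is None or t - last_sent >= window:
                sent.append(t)
                last_sent = t
        elif mode is Suppression.AFTER_DURATION:
            # Assumption: notify once per run, after it exceeds `window`.
            if not notified and t - violation_start > window:
                sent.append(t)
                notified = True
    return sent


# Hypothetical hourly samples: violated for 2 hours, cleared, violated again.
samples = [(0, True), (60, True), (120, True), (180, False), (240, True), (300, True)]
```

With these samples, the first setting notifies at the start of each violation, the interval setting notifies at most once per window, and the duration setting stays silent until a violation has lasted longer than the window.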
6.2.2 Triggering conditions for alerts
You can set up Spectrum Control so that it examines the data of your resources for the conditions and performance thresholds that you specify.
The conditions that trigger alert notifications depend on the type of resource that you are monitoring. Some attribute and performance conditions require that you enter values for triggering alerts. In general, the following types of conditions can trigger alerts:
An attribute of a resource changed.
A performance metric for a resource fell outside a specified range.
The storage infrastructure changed.
Data cannot be collected for a resource.
6.2.3 Reporting
You can use reporting functions to view overview and detailed information about your storage, for example:
Spectrum Control web-based GUI
Native SQL
Cognos BI
6.3 Alerting and event processing
With Spectrum Control, you can define alerts for almost all attributes of a resource, including attributes for status, configuration, capacity, and performance. For performance alerts, you can now alert on any of the hundreds of metrics that are collected by Spectrum Control.
Alerts and Threshold violations are defined on the device level, either from the resource list page or from the resource page.
Figure 6-2 shows the Edit Alert Definitions menu on the resource list page.
Figure 6-2 Edit Alert Definitions menu on the resource list page
Figure 6-3 shows the Edit Alert Definitions menu on the resource page.
Figure 6-3 Edit Alert Definitions menu on the resource page
 
Reference: For more information about alert and event processing, see the Spectrum Control IBM Knowledge Center:
For improvements of the alerting function in future versions of Spectrum Control, see the What’s new topic in the Spectrum Control IBM Knowledge Center.
The following sections describe practical user scenarios from a storage administrator perspective that can be implemented in your business environments.
6.3.1 Scenario 1: Monitoring and being notified about storage utilization
In this scenario, a storage administrator wants to be notified if the file systems, which can be on a server, NAS, or hypervisor, are being underutilized for an extended period so that storage can be reclaimed to minimize storage costs. (This scenario can be expanded to go into detail about capacity threshold levels, the level of suppression, and who to notify.)
To delve deeper into this scenario, assume that the storage administrator recently was notified that the request to purchase more storage in the coming year was denied. Given that the budget remains flat, the administrator wants to put a plan in place to identify whether any of the server storage is being underutilized.
Before making any major changes, the administrator wants to be notified if any of the file systems have extra capacity so the administrator can then take a deeper look into what storage might be a target for storage reclamation. To accomplish this task, complete the following steps:
1. From Spectrum Control, go to Server’s Overview. Select Alerts in the left pane (as shown in Figure 6-4) and select the Definitions tab in the Definition pane.
Figure 6-4 Alert definition window
2. Select the File Systems and Logical Volumes tab. On this tab, you notice that the alerts are categorized into General for Configuration alerts and Capacity for all Capacity-related alerts. Select the Capacity tab.
3. Enable Used Space and set it to be less than or equal to 25%, as shown in Figure 6-5, which shows an example of how to set a Used Space alert at the file system or logical volume level. Now, you can track all file systems that are using 25% or less of the available capacity and might be good candidates for storage reclamation.
Figure 6-5 Example of how to set a Used Space alert on file systems or logical volume level
4. To notify the correct people, click the envelope icon to set alert notification email addresses for this specific alert (see Figure 6-6, which shows how to override notification settings for a special alert). Then, click Override Notification settings, select the Email check box, insert the email address, and click Done.
Figure 6-6 How to override notification settings for a special alert
5. If you know that the file system is being monitored daily and want to avoid a flood of emails, you can change the default suppression setting from Only alert once until the problem clears to Only generate alerts every 5 days. Click the struck-through exclamation mark next to the envelope icon, as shown in Figure 6-6. Then, click Only generate alerts every, select 5, and select days, as shown in Figure 6-7, which shows an example of alert suppression. With this suppression setting, you apply a fixed window of time in which suppression occurs regardless of violation, so it does not matter whether the utilization within the five days is good or bad.
Figure 6-7 Example of alert suppression
Alternatively, you can choose Do not alert until condition has been violated for more than 5 days, as shown in Figure 6-8, which shows the Suppression Settings dialog box. In this case, the violation must occur consistently for more than 5 days, without any clearing of the condition, to trigger an alert. With this option, you do not get an initial alert when the problem first occurs.
Figure 6-8 Suppression Settings dialog box
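The Used Space condition in step 3 amounts to a simple percentage check. The following Python sketch applies the same 25% limit to a small inventory to show which file systems would match. The function name and the inventory values are hypothetical and are not taken from Spectrum Control.

```python
def reclamation_candidates(file_systems, max_used_pct=25.0):
    """Return (name, used %) for file systems at or below max_used_pct."""
    candidates = []
    for name, used_gib, capacity_gib in file_systems:
        if capacity_gib <= 0:
            continue  # skip entries without a usable capacity value
        used_pct = 100.0 * used_gib / capacity_gib
        if used_pct <= max_used_pct:
            candidates.append((name, round(used_pct, 1)))
    return candidates


# Hypothetical inventory: (name, used GiB, capacity GiB)
inventory = [
    ("/data01", 40, 400),    # 10% used
    ("/data02", 300, 400),   # 75% used
    ("/logs", 25, 100),      # 25% used, exactly at the limit
]
print(reclamation_candidates(inventory))
```

Because the condition is less than or equal to, a file system that sits exactly at 25% is also reported.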
6.3.2 Scenario 2: Performance troubleshooting with advanced alert management
In this scenario, a storage administrator wants to track the performance of an IBM Storwize V7000 Unified storage system. Some of the clients are complaining about general performance degradation, so it is beneficial to set a few metrics on Spectrum Control to send notifications that are related to device performance.
To delve deeper into this scenario, the storage administrator wants to make sure all of the processors are being used to balance the subsystem's workload and that none of the central processing units (CPUs) are taking on too much of the workload.
Also, the administrator wants to know whether the port bandwidth might be the problem, and specifically whether it is the sending or receiving of data that is causing a bottleneck on any ports. As with the CPUs, the administrator wants to know whether any ports are being underutilized so that the administrator can rebalance workload if necessary. The last thing that the administrator must consider is the backup window each night. For example, performance always spikes each night for about 3 hours while clients initiate their backups, so the administrator does not want to be notified of this performance slowdown.
To resolve this scenario, complete the following steps:
1. From the Spectrum Control Dashboard, go to the Overview window for the IBM V7000 Unified storage system (Figure 6-9, which shows how to define the metrics on which you want to define thresholds). In the V7000 Overview window, open the Alerts Definitions pane, click Ports, and click the Performance tab. Click Add Metric, and select the performance alerts check boxes for Overall Port Bandwidth Percentage, Port Send Bandwidth Percentage, and Port Receive Bandwidth Percentage. Click OK.
Figure 6-9 Define the metrics on which you want to define thresholds
2. After enabling all metrics, set Overall Port Bandwidth Percentage to less than 15% (Figure 6-10, which shows how to set thresholds for individual performance metrics) to track whether any Subsystem Ports are being underutilized. The Overall Port Bandwidth Percentage alert can be defined with Informational severity level because this is not an immediate performance impact. To do so, click the drop-down arrow next to the red cross and select the yellow exclamation mark. For Port Send Bandwidth Percentage and Port Receive Bandwidth Percentage, define a Warning severity threshold at greater than 75%.
Figure 6-10 Set thresholds for individual performance metrics
3. Set the Critical severity threshold at greater than 85%. To do this task, define the Warning threshold alert at greater than 75% as shown in Figure 6-11, which shows how to set a warning and a critical threshold on one metric, then click the (+) icon for this alert to define the Critical severity threshold of greater than 85%.
Figure 6-11 Set a warning and a critical threshold on the same metric
4. Now, from the Port Performance tab, click the Nodes Performance tab (see Figure 6-12, which shows how to set performance thresholds for nodes, on the Alerts Definitions window). Then, use the same methods to define Node Performance Alerts for the subsystem. To identify idle CPU usage, add the System CPU Utilization metric from the Add Metrics dialog box and define the alert with less than or equal to 10%. Define this alert with Informational severity because it is not an immediate performance impact. To track excessive CPU utilization, define the System CPU Utilization with a Warning threshold of greater than 75%, then click the (+) icon on this alert to define the Critical threshold of greater than 90%.
Figure 6-12 Set Performance Thresholds for Nodes
5. To account for the known backup window, you must set suppression on each alert you want to avoid during this window. To do this task, click the suppression icon on each alert. The default value is to alert only once until the problem is cleared. Because the backup window lasts for 3 hours each night, you must do the following steps:
a. Select the suppression radio button Do not alert until condition has been violated more than 3 hours, as shown in Figure 6-13, which shows an example for alert suppression that can be used for a backup window. This suppression setting is okay during the backup window, but it also means that during the rest of the day you get alerts only if the corresponding thresholds also are continuously violated for 3 hours.
Figure 6-13 Example for alert suppression that can be used for a backup window
b. What you need is a blackout window for the backup time frame, but this is not available yet. The alternative suppression setting is Only generate alerts every three hours. With this option, you receive the alert notification only once.
Figure 6-14 shows another example for alert suppression that can be used for a backup window.
Figure 6-14 Another example for alert suppression that can be used for a backup window
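The thresholds from steps 2 and 3 can be summarized as an ordered list of severity tiers per metric. The following Python sketch uses the values from this scenario to show which severity a given sample would fall under. The names are hypothetical, and returning only the highest matching severity is an assumption made for illustration; it does not describe how the Alert Server evaluates thresholds internally.

```python
import operator

# (severity, comparison, threshold), checked from most to least severe.
# Threshold values mirror the scenario; the names are illustrative only.
PORT_SEND_BANDWIDTH = [
    ("critical", operator.gt, 85.0),
    ("warning", operator.gt, 75.0),
]
OVERALL_PORT_BANDWIDTH = [
    ("informational", operator.lt, 15.0),
]


def highest_severity(value, thresholds):
    """Return the first (most severe) tier that the value violates, or None."""
    for severity, compare, limit in thresholds:
        if compare(value, limit):
            return severity
    return None


print(highest_severity(90.0, PORT_SEND_BANDWIDTH))     # above both limits
print(highest_severity(80.0, PORT_SEND_BANDWIDTH))     # above the warning limit only
print(highest_severity(10.0, OVERALL_PORT_BANDWIDTH))  # underutilized port
```

Because the conditions are strict comparisons, a value that sits exactly on a limit, such as 75%, does not violate it.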
6.3.3 Scenario 3: Closely monitoring new application usage with advanced alert management
In this scenario, a storage administrator recently received an email that a department that the administrator supports is upgrading one of their medical imaging software applications. Clients can use this new application to scan medical images with a greater level of detail, but it also requires the size of the application request to almost triple. Previously, the application requests were about 4 KB and they had about 5000 I/O requests per second. The department is using an IBM Storwize V7000 Unified storage system for storage and managed to stay under the maximum transfer rate.
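To see why the larger request size matters, consider the resulting data rate. The following Python sketch works through the arithmetic under two stated assumptions: the I/O rate stays at about 5000 requests per second, and "almost triple" is rounded to a factor of 3. The function name is hypothetical.

```python
def throughput_mb_per_s(io_per_s, transfer_kb):
    """Data rate in MB/s, using decimal units (1 MB = 1000 KB)."""
    return io_per_s * transfer_kb / 1000


before = throughput_mb_per_s(5000, 4)       # current application
after = throughput_mb_per_s(5000, 4 * 3)    # assumed tripled request size
print(before, after)
```

Under these assumptions, the data rate grows from about 20 MB/s to about 60 MB/s for the same number of requests, which is why the transfer size is worth watching alongside the I/O rate.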
To delve deeper into this scenario, as the department implements its new application, the storage administrator wants to closely monitor the I/O transfer size on the back-end storage that is assigned to make sure that it is sufficient. The existing performance monitoring of the Storwize V7000 Unified storage system experienced some intermittent gaps, so the administrator wants to make sure that they can quickly gather diagnostic tests on the Spectrum Control server so that support can be quickly engaged if needed.
To track the performance of the Storwize V7000 Unified storage system, you must run the performance monitors. If there is a problem with the performance monitors, a script is initiated to open a ticket in the company's ticket interface.
Complete the following steps:
1. Go to the Block Storage for the IBM V7000 Unified Overview window and click the Alert Definitions tab. Click the Managed Disks Performance Alerts tab. Click Add Metrics and select the check boxes for Total Back-end I/O Rate, Overall Back-end Transfer Size, and Back-end Write Transfer Size.
Figure 6-15 shows an example of how to select performance thresholds for MDisks.
Figure 6-15 Example of how to select performance thresholds for MDisks
2. After enabling all metrics, set the Overall Back-end Transfer Size alert to greater than 8 KiB/op, as shown in Figure 6-16, which shows an example of setting performance thresholds for MDisks. Then, define the Back-end Write Transfer Size and Back-end Read Transfer Size to greater than 6 KiB/op. To track the overall transfer rate, define Total Back-end I/O Rate by setting the alert to greater than 3,000 ops/s.
Figure 6-16 Example of setting performance thresholds for MDisks
3. To ensure that you can quickly engage support if Performance Monitors fail, you can define attribute alerts on the storage system by clicking the General tab, as shown in Figure 6-17, which shows how to enable an alert on Performance Monitoring. Enable the Performance Monitor Status alert by setting the attribute to is not normal.
Figure 6-17 Enable an alert on Performance Monitoring
4. Then, from the Notification Settings window, which is shown in Figure 6-6, set Run Script to upload the Spectrum Control service.bat/sh script to run upon failure. Click the envelope icon, click Run Script, click Select File, and select Upload Script, as shown in Figure 6-18, which shows how to run a script upon alert notification. Click Browse and select Script. Define the storage resource agent by using the drop-down arrow next to Run script on Storage Resource agent and select the Storage Resource Agent. Click Done.
Figure 6-18 Run a script upon alert notification
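The scenario also mentions a script that opens a ticket in the company's ticket interface when the performance monitors fail. The following Python sketch shows the general shape that such a wrapper might take. Everything in it is hypothetical: the arguments that Spectrum Control passes to a triggered script are not described in this section, so the sketch passes through whatever it receives unchanged, and the ticket interface is replaced by a placeholder that only prints the payload.

```python
import json
import sys
from datetime import datetime, timezone


def build_ticket(args, now=None):
    """Build a ticket payload from whatever arguments the script receives."""
    now = now or datetime.now(timezone.utc)
    return {
        "opened_at": now.isoformat(),
        "summary": "Spectrum Control alert: Performance Monitor Status is not normal",
        "alert_arguments": list(args),  # passed through unchanged (format is an assumption)
    }


def main(argv):
    ticket = build_ticket(argv[1:])
    # Placeholder for the site-specific ticket interface: a real script
    # would submit the payload there instead of printing it.
    print(json.dumps(ticket))
    return ticket


if __name__ == "__main__":
    main(sys.argv)
```

Keeping the payload construction in its own function makes it possible to test the script without triggering a real alert.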
 
Reference: For more information about running scripts as a triggered action for an alert condition, see the Spectrum Control IBM Knowledge Center, found at:
 