Table of Contents for
Effective Monitoring and Alerting
Effective Monitoring and Alerting
by Slawek Ligus
Preface
Who Should Read This Book
Conventions Used in This Book
Using Code Examples
Safari® Books Online
How to Contact Us
Acknowledgements
1. Introduction
Monitoring, Alerting, and What They Can Do for You
Early Problem Detection
Availability
Performance
Decision Making
Baselining
Predictions
Automation
Admission Control
Autonomic Computing
Monitoring and Alerting in a Nutshell
Metrics and Timeseries
Alarms, Alerts, and Monitors
Monitoring System
The Process of Alerting
Issue Tracking
Tickets and queues
The Challenges
Important Terms
2. Monitoring
The Building Blocks
Data Collection
Coverage
Resources
Network
Computational resources
Solution stack
Operating system
Middleware
Application
User experience
Metrics
Summary statistics
Frequency distribution and percentiles
Rate of change
Time granularity
Metric aggregation
Example: Inputs, Metrics, and Timeseries
Understanding Metrics
Type of unit
Data Collection Mode
Data Source
Number of Inputs per Data Point
Type of Quantity
Timeseries Patterns
Drawing Conclusions from Timeseries Plots
Interpretation of Anomalies
Flow
Stock
Availability
Throughput
Applications of quantities
Frequently Encountered Anomalies
Flattening Effect
Warm-Up Effect
Regular Anomalies
Spikes During Troughs
Determining Causality
Capturing the Daily Cycle, Trends, and Seasonal Changes
3. Alerting
The Challenge
Prerequisites
Monitoring and Alerting Platform
Audit Trail
Issue Tracking
Understanding Failure and Its Impact
Establishing Significance
Identifying Causes
Anatomy of an Alarm
Boolean Function
Metric Monitor
Upper Limit
Lower Limit
Outside Range
Data Points Not Recorded
Time Evaluation
Another Alarm as Input Source
Suppression
Aggregation
Case Study: A Data Pipeline
Types of Alerts
Setting Up Alarms
Identifying Impact
Establishing Severity
Picking the Right Timeseries
Configuring Monitors
Coming Up with a Threshold
Static thresholds
Data-driven thresholds
Breach and Clear Delay
Setting Up Alarms
Testing Alerting Configurations
Alerting Suggestions
4. At Scale
Implications of Scale
Composition of Large-Scale Systems
Commonalities of Large-Scale Alerting Configurations
Monitoring Coverage
Reflecting Dimensions in Metrics
Managing Large Alerting Configurations
Addressing the Problems
Organize alarms and monitors in a namespace
Calculate threshold values from metric data
Periodically refresh and clean up the configuration
Suggested Solution
Refresh intervals
Running the engine
Naming
Alarm creation and threshold calculation
Cleanup procedures
Writing Modules
Suppression
Extra Features
Result
5. Monitoring in System Automation
Choosing Appropriate Maintenance Times Automatically
Controlling the Rate of Upgrade
Recovery-Oriented Admission Control
Automated Deployment and Rollback
6. The Work Environment
Keeping an Audit Trail
Working with Tickets
Root Cause Analysis
The Five Whys
Extracting Categories
Dealing with Anomalies
Learning from Outages
Using Checklists
Creating Dashboards
Service-Level Agreements
Preventing the Ironies of Automation
Culture
7. Measuring Success
The Feedback Loop
Root Cause Classification
A Short Story of a Long Classifier List
Timing
Ticket Reporting
Frequency of Incidence
Incidence Times
Time to Respond and Time to Resolution
Measuring Detectability
False Positives and False Negatives
Precision and Recall
The F-Measure
Transition to Automated Alarms
Maintenance Overhead
How (Not) to Measure
8. The Principles
Get in the Habit of Measuring
Draw Conclusions Reliably
Monitor Extensively
Alarm Selectively
Work Smart, Not Hard
Learn from the Experience of Others
Have a Tactic
Run a Bank of Cases
Enjoy the Process
A. Setting Up OpenTSDB
The Software
Architecture
Getting OpenTSDB
First Steps
Starting TSD
Pushing Data
Input Tagging
Tag Wildcards
Temporal Aggregation
Summary Statistics
Rate of Change
Gathering Data System-Wide
Running tcollector
Writing a Custom Collector
Timeseries Plots
Plotting Tips
Get Involved
About the Author
Copyright
Effective Monitoring and Alerting
Slawek Ligus
Published by O’Reilly Media
Beijing ⋅ Cambridge ⋅ Farnham ⋅ Köln ⋅ Sebastopol ⋅ Tokyo