Monitoring, compliance, and optimization through automation

Cloud native architectures that use complex services and span multiple geographic locations require the ability to change often based on usage patterns and other factors. Using automation to monitor the full span of the solution, ensure compliance with company or regulatory standards, and continuously optimize resource usage shows an increasing level of maturity. As with any evolution, building on the previous stages of maturity enables the use of advanced techniques that allow increased scale to be introduced to a solution.

One of the most important data points that can be collected is the monitoring data that cloud vendors have built into their offerings. Mature cloud vendors have monitoring services natively integrated with their other services that can capture metrics, events, and logs of those services that would otherwise be unobtainable. Using these monitoring services to trigger actions in response to events is a type of automation that helps ensure overall system health. For example, a system using a fleet of compute virtual machines as the logic tier of a services component normally expects a certain number of requests, but at periodic times a spike in requests causes the CPU and network traffic on these instances to quickly increase. If properly configured, the cloud monitoring service will detect this increase and launch additional instances to even out the load to more acceptable levels and ensure proper system performance. The process of launching additional resources is a design configuration of the system that requires Infrastructure as Code automation, to ensure that the new instances are deployed using the exact same configuration and code as all the others in the cluster. This type of activity is often called auto scaling, and it also works in reverse, removing instances once the spike in requests has subsided.
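The scale-out/scale-in decision described above can be sketched as a simple control loop. The thresholds, fleet bounds, and the `desired_capacity` helper below are illustrative assumptions; in practice the cloud vendor's monitoring and auto scaling services implement this logic for you.

```python
# Minimal sketch of an auto-scaling decision, assuming hypothetical
# CPU thresholds and fleet-size bounds. A real deployment delegates
# this loop to the cloud vendor's monitoring and scaling services.

SCALE_UP_CPU = 75.0    # add capacity above this average CPU %
SCALE_DOWN_CPU = 25.0  # remove capacity below this average CPU %
MIN_INSTANCES = 2      # never shrink below the minimum fleet size
MAX_INSTANCES = 10     # never grow beyond the maximum fleet size

def desired_capacity(current_instances: int, avg_cpu_percent: float) -> int:
    """Return the target fleet size for the observed average CPU load."""
    if avg_cpu_percent > SCALE_UP_CPU:
        return min(current_instances + 1, MAX_INSTANCES)
    if avg_cpu_percent < SCALE_DOWN_CPU:
        return max(current_instances - 1, MIN_INSTANCES)
    return current_instances

# A request spike drives CPU up, so the fleet grows...
assert desired_capacity(3, 90.0) == 4
# ...and once the spike subsides, instances are removed again.
assert desired_capacity(4, 10.0) == 3
```

Because every new instance is launched from the same Infrastructure as Code template, scaling out never introduces configuration differences within the cluster.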

Automated compliance of environment and system configurations is becoming increasingly critical for large enterprise customers. Incorporating automation to perform constant compliance audit checks across system components shows a high level of maturity on the automation axis. Configuration snapshot services capture a point-in-time picture of the full environmental makeup, stored as text for long-term analysis. Using automation, these snapshots can be compared against previous views of the environment to verify that configuration drift has not occurred. In addition to previous views of the environment, the current snapshot can be compared against desired and compliant configurations that will support audit requirements in regulated industries.
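A drift check of this kind reduces to comparing two snapshots key by key. The sketch below assumes snapshots have already been parsed into dictionaries; the setting names are made up for illustration.

```python
def detect_drift(baseline: dict, snapshot: dict) -> dict:
    """Compare a current configuration snapshot against a compliant baseline.

    Returns a mapping of setting -> (expected, actual) for every value
    that has drifted, including settings that appeared or disappeared.
    """
    drift = {}
    for key in baseline.keys() | snapshot.keys():
        expected = baseline.get(key, "<missing>")
        actual = snapshot.get(key, "<missing>")
        if expected != actual:
            drift[key] = (expected, actual)
    return drift

# Hypothetical baseline vs. the latest point-in-time snapshot.
baseline = {"encryption": "enabled", "public_access": "blocked", "tls": "1.2"}
snapshot = {"encryption": "enabled", "public_access": "open", "tls": "1.2"}

assert detect_drift(baseline, snapshot) == {"public_access": ("blocked", "open")}
assert detect_drift(baseline, dict(baseline)) == {}  # identical snapshot: no drift
```

The same function serves both comparisons the text describes: against a previous snapshot (drift over time) and against a desired compliant configuration (audit check).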

Optimization of cloud resources is an area that can easily be overlooked. Before the cloud, a system was designed and an estimate was used to determine the capacity required for that system to run at peak requirements. This led to the procurement of expensive and complex hardware and software before the system had even been created. As a result, it was common for a significant amount of over-provisioned capacity to sit idle, waiting for an increase in requests to happen. With the cloud, those challenges all but disappear; however, system designers still run into situations where they don't know what capacity is needed. To resolve this, automated optimization can be used to constantly check all system components and, using historical trends, determine whether those resources are over- or under-utilized. Auto scaling is one way to achieve this; however, there are more sophisticated approaches that provide additional optimization if implemented correctly. For example, using an automated process to check running instances across all environments for under-used capacity and turning them off, or performing a similar check to shut down all development environments on nights and weekends, can save a company a significant amount of money.
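The nights-and-weekends shutdown check mentioned above can be sketched as a small scheduling rule. The business-hours window and environment names here are assumptions for illustration; an automated process would run this against every non-production instance.

```python
from datetime import datetime

def should_stop(environment: str, now: datetime) -> bool:
    """Return True when a non-production instance should be shut down:
    on weekends, or on weeknights outside assumed 07:00-19:00 business hours."""
    if environment == "production":
        return False                 # never touch production
    if now.weekday() >= 5:           # Saturday (5) or Sunday (6)
        return True
    return not (7 <= now.hour < 19)  # weeknight outside business hours

# A development box on a Saturday afternoon gets stopped...
assert should_stop("development", datetime(2023, 7, 1, 14, 0)) is True
# ...but production is never touched by the scheduler.
assert should_stop("production", datetime(2023, 7, 1, 14, 0)) is False
# Weekday during business hours: leave it running.
assert should_stop("development", datetime(2023, 7, 3, 10, 0)) is False
```

Paired with Infrastructure as Code, stopped environments can be recreated identically on the next business morning, so the savings carry no configuration risk.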

One of the key ways to achieve maturity for a cloud native architecture with regard to monitoring, compliance, and optimization is to leverage an extensive logging framework. Loading data into this framework and analyzing that data to make decisions is a complex task, and requires the design team to fully understand the various components and make sure that all required data is being captured at all times. These types of frameworks help to remove the thinking that logs are files to be gathered and stored, and instead focus on logs as streams of events that should be analyzed in real time for anomalies of any kind. For example, a relatively quick way to implement a logging framework would be to use Elasticsearch, Logstash, and Kibana, often referred to as an ELK stack, to capture system log events, cloud vendor service log events, and other third-party log events.
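Treating logs as streams rather than files means anomaly checks run over a moving window of recent events. The sliding-window error-rate check below is a deliberately simplified stand-in for the kind of real-time analysis an ELK stack enables; the window size and threshold are assumptions.

```python
from collections import deque

def error_rate_anomalies(events, window=5, threshold=0.5):
    """Scan a stream of log levels and flag every position where the
    fraction of ERROR events in the sliding window reaches the threshold."""
    recent = deque(maxlen=window)  # only the last `window` events are kept
    alerts = []
    for i, level in enumerate(events):
        recent.append(level)
        if len(recent) == window:
            rate = sum(1 for lv in recent if lv == "ERROR") / window
            if rate >= threshold:
                alerts.append(i)
    return alerts

stream = ["INFO", "INFO", "ERROR", "INFO", "ERROR", "ERROR", "ERROR", "INFO"]
# From index 5 onward, 3 or more of the last 5 events are ERRORs.
assert error_rate_anomalies(stream) == [5, 6, 7]
```

In a production setup the stream would be fed by Logstash, indexed in Elasticsearch, and visualized in Kibana, with alerting rules replacing this hand-rolled check.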
