In this chapter, I am going to try really, really hard not to get up on my soapbox.
Really, really hard…
But seriously! How can you use Google Cloud and not know how logging and monitoring work? I mean, how can you find what’s broken, know when you might be running short on capacity, or tell when something untoward has occurred if you don’t have good data? And where do you get said data? Yes, Google Cloud Logging and Cloud Monitoring.
Look, I’m old enough and comfortable enough in my own skin to know the things I’m good at and the things I’m not. Don’t ask me to draw you a picture or to spell fungible, because drawing and spelling are definatly definitely (thank you, spell checker!) not in my skillset. But I got my first computer and started writing code in 1981, and it took me about two weeks to figure out that me and computers are simpático.
Figuring out what’s wrong takes information. Ensuring capacity in Google Cloud, spotting bad behavior, fixing broken stuff, and, generally, knowing what’s happening are all made possible by monitoring and logging. So, without further ado, let’s get to building our logging and monitoring foundation.
In this chapter, we are going to get to know the various tools in Google Cloud’s operations suite. Then, we are going to apply them to our foundation:
In the initial stages of Google Cloud, there was App Engine, some storage offerings, and only the most basic support for logging and monitoring. As GCP grew, Google added more capabilities to Cloud Logging and Cloud Monitoring, but the available services were still lacking. So, in 2014, Google, being Google, bought something: Stackdriver. If you’ve been working with Google Cloud for a while, you might remember when the heading at the top of the instrumentation section in the Google Cloud menu was titled “Stackdriver,” and you will still see references to the name in examples and documentation online to this day. In 2020, Google finally decided that the name Stackdriver should go away, and so they went to work on a replacement. The interim name they came up with was “Operations.” However, while they were working on coming up with a new name with more pizazz, COVID-19 hit, and the importance of getting a good name moved to the back burner.
As I sit here writing this, I wonder whether Google has just decided to stick with Operations, or whether I’m going to wake up one day soon (likely the day after this book drops) to a new group heading. Regardless, at this point, the name is Operations, and that’s the name I’m using.
When you look at the current Google Cloud Console navigation menu, the subsection related to our topic at hand looks like the following:
Inside the OPERATIONS product grouping, you will see six products. The most generally applicable instrumentation offerings are Monitoring and Logging. The other four products, Error Reporting, Debugger, Trace, and Profiler, are all designed to varying degrees to help troubleshoot misbehaving applications and to help developers.
To summarize the importance of each product, let’s take a quick operations suite tour.
Let’s start with the focus of this chapter, that is, the central two products in Google Cloud’s operations suite:
Monitoring is your central tool for getting information on how things are performing now and how they perform over time. Additionally, it will feed you the data you need for various types of troubleshooting, capacity planning, and performance evaluation, just to name a few.
We are going to spend the majority of the rest of this chapter discussing logging and monitoring as they relate to building a solid GCP foundation. However, before we do, let’s do a quick tour of the four remaining instrumentation products:
Error Reporting builds on Cloud Logging to help spot unhandled errors and report them to you in a centralized way. You modify your code to send out a specifically formatted log entry, or you loop in the Error Reporting API, and when things go bad, Error Reporting notices, and can send you an alert. For more information on Error Reporting, please refer to https://cloud.google.com/error-reporting/docs.
The second thing to know about Cloud Debugger is that if you can’t replicate the issue in your dev environment, or if debugging on-premises doesn’t really work for this issue because it needs to be running in Google Cloud first, then that’s when you debug with Cloud Debugger. Unlike a typical debugger, Cloud Debugger doesn’t interrupt executing code, allowing you to debug in production without any negative side effects. As with a typical debugger, Cloud Debugger augments the data you have by giving you a glimpse into what’s happening inside the running code. Not only will it allow you to see the internal state when a line of code executes, such as exactly what is in that x variable when line 27 passes, but it can also help you add extra logging to the code without having to edit and redeploy it. Honestly, I’m old school, and that extra-logging-without-having-to-edit-and-redeploy thing is my favorite Cloud Debugger feature. For more information, you can start here: https://cloud.google.com/debugger/docs.
Note
At the time of writing, Google announced that Cloud Debugger was officially being deprecated and that sometime in 2023, they would be replacing it with an open source tool, so keep an eye on that.
Good. Now that you’ve met everyone, let’s take a deeper dive into monitoring and logging, and see how our foundation leverages them. Let’s start with logging.
One year, two days before Christmas, I was as sick as a dog. My son and I were the only ones home as the rest of the family was visiting Mickey Mouse down in Florida. I decided to take myself into the local doc-in-the-box where a terribly young new doctor was on duty. He walked in and shook my hand, looked at my son, and said, “I’m going to wash my hands now because your dad is running a fever.”
There, that fever? It was a piece of information about the inner state of my body. The doctor used that information to start his search for exactly what was making me feel bad.
For the most part, systems running inside Google Cloud operate as black boxes. You see them, know they are there, but they tend to hide a lot of their inner workings. To a certain degree, all of Google Cloud is a black box. You might have an account, perhaps quite a large one, but that doesn’t mean Google is going to let you take a tour of a data center. We’ve all seen that movie where James Bond or someone sneaks off the tour to hack some server, right? Google isn’t having it, and since you can’t just walk over and poke at a server in the data center, you are going to have to rely on the data Google does provide to get answers.
Generally, we have two choices when it comes to instrumenting systems. First, if we have full control over all aspects of a system, perhaps because we wrote the code ourselves, there might be various ways to make the box clearer and see exactly what’s happening inside it. For systems we don’t have control over, our only choice is to rely on the telemetry being passed out of the black box into Cloud Logging and Cloud Monitoring.
Sometimes when I teach, I have a short example application that I use as a demo: https://github.com/haggman/HelloWorldNodeJs. It’s written in Node.js (JavaScript), and it’s easy to run in Google Cloud’s App Engine, Kubernetes Engine, or Cloud Run. It’s a basic Hello World-level application running as a web application. You load it up into GCP, visit its home page, and it returns Hello World!. The source code is as follows:
const express = require('express');
const app = express();
const port = process.env.PORT || 8080;
const target = process.env.TARGET || 'World';

app.get('/', (req, res) => {
  console.log('Hello world received a request.');
  res.send(`Hello ${target}!`);
});

app.listen(port, () => {
  console.log('Hello world listening on port', port);
});
The two lines I want to draw your attention to are the ones that start with console.log. In JavaScript/Node.js applications, console.log is a way to log a message to standard-out. That’s programming speak for the default console location. Normally, that would be the command line, but with code running on some server, it typically ends up in some log file. The second console.log line prints a message to let you know the Express web server is up and ready to work. When you visit the application’s home (/) route, the first console.log line logs a message just before the app prints Hello World! to the browser.
If I were to run the preceding application from the Google Cloud Shell terminal window, then the log messages would appear on the command line itself, since under those conditions, standard-out would be the command line. However, if I deploy the application to Cloud Run and run it there, the log messages will get redirected to Cloud Logging.
Best Practice: Log to Standard-Out
Traditionally, applications manage their own log files. You create a my-app/logs folder or something, and manually write your logs into it. In Google Cloud, that’s not a good design, even for applications designed to run on their own VMs. As a best practice, code your application so that it logs to standard-out, and then let the environment determine how best to handle the messages. If you need a little more control, for example, you want Cloud Logging to create a special log just for your application logs, then use the Cloud Logging API or a logging utility that understands how to work with Google Cloud. For example, in Node.js, I can easily use Winston or Bunyan since they both have nice integrations for Google Cloud Logging. For other examples, please refer to https://cloud.google.com/logging/docs/setup.
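If you want a little structure without pulling in a full logging library, there is a nice middle ground: write each log entry as a single-line JSON object on standard-out. Cloud Logging (when fed by Cloud Run, GKE, or the Ops Agent) parses such lines and promotes fields such as severity and message onto the resulting log entry. Here is a minimal sketch of the idea; the logStructured helper is my own invention, not part of any Google library:

```javascript
// A tiny helper that emits one JSON object per line on standard-out.
// Environments that feed Cloud Logging will parse the JSON and promote
// fields such as "severity" onto the log entry.
function logStructured(severity, message, extra = {}) {
  const entry = { severity, message, ...extra };
  const line = JSON.stringify(entry);
  console.log(line); // still just standard-out, as per the best practice
  return line;
}

logStructured('INFO', 'Hello world received a request.', { route: '/' });
logStructured('ERROR', 'Upstream call failed', { statusCode: 500 });
```

The win here is that you keep the log-to-standard-out best practice while still getting searchable severities and fields in Logs Explorer, all without taking a dependency on the Cloud Logging API.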
For demonstration purposes, I just cloned my Hello World code into Cloud Shell. I built the container and loaded it into Google Container Registry using Google Cloud Build. Yes, that’s the same Cloud Build I’m using to handle my infrastructure CI/CD pipeline. Then, I deployed the container into Cloud Run. When I visit the Cloud Run service test URL, I see my Hello World! message.
Before I discuss specific logs, let’s take a quick detour and check out Google Cloud Logging’s Logs Explorer. The Logs Explorer option can be found in the Logging section of the navigation menu. It’s Google’s central tool for viewing logs. As of the time of writing, it looks similar to the following screenshot:
In the Logs Explorer window, you can see the following:
Speaking of my demo application, what kinds of log messages do you think were generated when I tested it? The answer is a bunch of them.
Google breaks logs into categories. A good discussion of the available logs in Google Cloud can be found at https://cloud.google.com/logging/docs/view/available-logs. I want to focus on the main three: platform logs, user-written logs, and security logs.
First, there are the platform logs. Platform logs are generated by specific Google Cloud products. More information on platform logs can be found at https://cloud.google.com/logging/docs/api/platform-logs. In the preceding screenshot, you are seeing logs coming out of the Cloud Build job to build my application’s container. If I wanted, I could use the log fields filter (#6 in the preceding screenshot) and filter out everything but the Cloud Build logs. Instead, I’m going to use the log name filter (#4) and filter the entries to the Cloud Run platform logs. The results look like this:
The bottom two entries in the preceding screenshot show my test HTTP GET request, which was successfully handled (200), along with my browser automatically asking for the page’s favicon.ico. I haven’t set this up, hence the unhappy 404-not-found response. Above those messages, you see the logs I’m generating in my code. There are two Hello world listening on port 8080 messages indicating that Cloud Run must have fired off two containers, each one indicating that it’s ready to receive traffic. Also, I can see a single message when I requested the application’s home page: Hello world received a request. Since I’m logging to standard-out in my code, Cloud Run captures these messages on my behalf and loads them into Cloud Logging.
Another log message type would be user-written logs. What I did in the preceding Hello World example isn’t technically considered user-written (pedantic much?) because I didn’t create my own log file to do it. If I had instead used the Cloud Logging API, such as in the example at https://github.com/googleapis/nodejs-logging/blob/main/samples/quickstart.js, to create a log named my-log and written to that, then it would be classified as user-written. Yes, it’s hair-splitting, but for now, let’s stick to Google’s definitions.
The last log type I’d like to mention, and arguably the most important to laying a good Google Cloud foundation, is security logs. The majority of your security logs are contained within what Google calls Audit Logs (https://cloud.google.com/logging/docs/audit). They include the following:
Note
This log will not generate entries if Google happens to touch something of yours while troubleshooting a lower-level problem. So, if Cloud Storage is broken for multiple clients, and they look at your files while troubleshooting, there will be no log entry.
To see an example of a security log at work, let’s look at the admin activity log. In this example, I’ve gone into Logs Explorer and used the log name filter to restrict to the Cloud Audit | Activity log. Additionally, I’ve used the log fields filter to only show the Cloud Run Revision entries. So, I’m asking Logs Explorer to just show me the Cloud Run-related admin activity log entries. The results are displayed as follows:
The two entries are where I first created the service for my Hello World demo application, and where I set its permissions to allow for anonymous access. Nice.
Before I wrap up the discussion of logging, for now, I’d like to talk a little about log exports.
By default, all your logging information is collected at the project level and stored in log storage buckets. The flow of the logs resembles the following diagram:
So, the steps logged messages run through are similar to the following:
Log sinks can be configured with exclusion and inclusion filters. Exclusions are processed first, discarding any log entries that match a pattern. Then, the inclusion filter decides which of the remaining log entries should be captured and routed to the configured destination.
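To make that concrete, here is a hedged Terraform sketch of a project-level sink. The project, dataset, and filter values are placeholders of my own, so adapt them to your environment:

```hcl
# Hypothetical project-level sink: route audit log entries to a BigQuery
# dataset. Entries matching an exclusion are dropped and never reach the
# destination.
resource "google_logging_project_sink" "audit_to_bq" {
  name        = "audit-to-bq"
  destination = "bigquery.googleapis.com/projects/my-logging-project/datasets/audit_logs"

  # Inclusion filter: only audit log entries pass.
  filter = "logName:\"cloudaudit.googleapis.com\""

  exclusions {
    name   = "drop-health-checks"
    filter = "httpRequest.requestUrl:\"/healthz\""
  }

  # Create a dedicated service account for the sink; grant it write
  # access on the destination dataset.
  unique_writer_identity = true
}
```

Remember that the sink only routes entries; you still need to grant the sink’s writer identity permission to write into the destination.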
Log entries can be routed into the current project or a remote one, where they will be stored in one of the following resources:
The two initial log storage buckets are _Default and _Required. _Required is a non-configurable storage bucket fed by a non-configurable log sink. The admin activity, system event, and access transparency (if enabled) logs are routed here and stored for 400 days. These are your core security audit logs, so when you are trying to figure out how that resource got removed or who modified that permission, you can find the details here.
The _Default log bucket and corresponding sink capture all the other log entries and store them for 30 days. Both are configurable.
Note: Not All Logs Are Enabled by Default
One of the key decisions you will need to make when configuring your logging foundation is which optional logs to enable. Generally, more information is better when you are trying to troubleshoot something. Having said that, logs, as with most things in Google Cloud, can cost extra. As such, you will need to give some serious thought to the logs you have enabled all the time, and those that you only enable when troubleshooting. We’ll go into more detail on that soon.
Now that we’ve had a taste of Google Cloud Logging, let’s head over and see what Cloud Monitoring has to offer. Then, we can get to laying that foundation.
Have you ever read Google’s books on building reliable systems? If not, head over to https://sre.google/books/ and take a look. At the time of writing, Google has three Site Reliability Engineering (SRE)-related books, all of which you can read, in full, from the preceding link.
In Google’s original SRE book, they have a diagram that resembles the following:
At the top of the pyramid is the application you are building, the product. To get the product, first, we must do the work and write the code. However, if our application is going to work well, then we need a capacity plan, and we need to follow good testing and CI/CD release procedures. At some point, something is going to happen, and we are going to need to be able to find and fix problems in response to a formalized incident response. All of that will depend on us having good information, and that information will come from monitoring and logging.
We’ve discussed logging, so how does monitoring fit in?
Monitoring helps us to understand how well our application is doing. Do you remember my story about being sick and missing Disney World? When the doctor takes a measurement of your temperature, that’s a form of monitoring.
Have you ever heard of the four golden signals? In the same original SRE text that I mentioned earlier, Google lays out what they call the four golden signals. The gist is KISS: keep it simple. When you start monitoring, begin with the four most commonly important pieces of performance information, and then expand from there.
The four golden signals are listed as follows:
The four golden signals might be conceptually obvious, but how do you collect the data you need for each? Ah, that’s where monitoring comes in.
To monitor products in Google Cloud, my advice is for you to work your way through the following three steps until you find what you’re looking for:
Let’s run through the monitoring checklist one at a time using my Hello World application. In the folder containing my app, there’s a sub-folder that has a Python load generator that can run out of a Kubernetes cluster. I’ve spun up a GKE cluster and deployed the load generator into it. I have it configured to throw 30 RPS of load onto my Hello World application, which I’ve deployed to Cloud Run.
So, if I wanted to monitor an application running in Cloud Run, the first option says that I should start by checking to see whether Cloud Run has any monitoring. And yes, it does! Here, you can see a screenshot of the METRICS tab of the Cloud Run page for my application:
Look at all that nice data. There’s an RPS count, which shows that I’m pretty close to my 30 RPS target. There are request latencies for the 50th, 95th, and 99th percentiles, which show me that 99% of all requests are getting an answer back in under 12 ms. The container instance count shows that there were several containers for a while, but Google has figured out that the 30 RPS load can be handled by a single active container, and that’s all that’s running now. On the second row, I see billable container instance time because Cloud Run bills for every 100 ms of execution time. Then, there’s container CPU and memory utilization. In Cloud Run, you can configure the CPU and memory resources provided to a running container, and the setting I’m running under now, 1 CPU and 512 MiB memory, looks to be more than enough. As of the time of writing, Google has just previewed partial CPU Cloud Run capabilities, and if I could drop the CPU my service specifies, I could likely save money.
See? Monitoring.
Not bad. Usually, with Cloud Run, I don’t need to look any further. If I did, I’d find that option 2’s pre-created dashboards, Monitoring | Dashboards, don’t give me anything for Cloud Run. However, there is a nice dashboard for GKE, so you might want to check that out at some point.
Moving on to option 3, Metrics Explorer, requires some prep work. Metrics Explorer has a lot of options on exactly how it can display a selected metric (https://cloud.google.com/monitoring/charts/metrics-explorer), so make sure you spend some time learning the interface. Before you can even start though, you need to do a bit of homework to figure out the specific metric that you’d like to examine.
Researching metrics starts at https://cloud.google.com/monitoring/api/metrics. For my example, I’m going to follow the link to Google Cloud metrics, and then scroll down to the Cloud Run section. At the top of this section is the base metric name. In this case, it’s run.googleapis.com/, so all the Cloud Run metrics’ full technical names will start with that.
Next, scroll through the metrics until you find the one you need. For our example, let’s go with the following:
So, this is a metric that will show the amount of data downloaded from your Cloud Run application, with several details. Specifically, it shows the following:
Heading over to Metrics Explorer and searching for the metric using either the technical or human-readable name (double-check the full technical name to ensure you’re using the right metric) will allow you to see the visualization. In this case, it would look similar to the following:
Nice. So, it’s an easy view displaying the amount of data being downloaded by your application. With a little research, Metrics Explorer can be a powerful tool, but again, using options one or two first will generally save you time.
Now that you understand how monitoring works for many of the core Google Cloud resources, let’s take a moment to discuss the VMs running Google Compute Engine.
Imagine you have a VM running in Google Compute Engine. From Google’s perspective, your VM is probably one of several running on the same physical server in a Google Cloud compute zone. Google can see your VM as it runs and tell how much CPU you are using, how big your disk is, and how much data you’re sending across the network because all of that is happening at the physical server level. What Google can’t do, by default, is see what’s happening inside of your VM. Google can’t see the logs being generated by that application you have running, or how much of the disk and memory is being used by said application.
For that, you need to get an inside view of the VM’s black box. The traditional option would be to log into the VM using RDP or SSH and look at the logs and metrics from the OS. That’s fine, and not only will it work in GCP, but it also won’t add anything to your Google Cloud bill. The problem is that the VM is likely a very small part of what you are doing in Google Cloud. Since the VM isn’t pushing its logs out to Cloud Logging, you can’t really go to a single place and see the VM logs side by side with what’s happening in that GCP data storage service that your VM is accessing. To make that work, you need to open a window into the VM that Google can see through. That window is created by Google’s Ops Agent: https://cloud.google.com/stackdriver/docs/solutions/agents/ops-agent.
The Google VM Ops Agent can be installed in a number of ways, including manually via SSH, from the Observability tab in the VM’s details page in Compute Engine, using the VM Instances dashboard, or you can create an Agent Policy (https://cloud.google.com/stackdriver/docs/solutions/agents/ops-agent/managing-agent-policies) and automate the Ops Agent installation across entire sections of your organizational resource hierarchy. Honestly, the best practice is to install the Ops Agent into the VM image itself, and then use nothing but custom images in production. If you need a little help to get that working, take a look at HashiCorp Packer.
The Ops Agent is configured using YAML and can stream system logs, application logs, and resource utilization metrics out of the VM and into Google Cloud Logging and Monitoring. Nice.
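As a taste of what that YAML looks like, here is a small, hypothetical snippet following the Ops Agent’s documented receiver/pipeline structure. The log name and file paths are placeholders of my own:

```yaml
# Hypothetical /etc/google-cloud-ops-agent/config.yaml snippet: in
# addition to the default system logs, tail an application's log files
# and ship them to Cloud Logging as a custom log.
logging:
  receivers:
    my_app_log:
      type: files
      include_paths:
        - /var/log/my-app/*.log
  service:
    pipelines:
      my_app_pipeline:
        receivers: [my_app_log]
```

After editing the config, restart the agent service on the VM and the new log should start appearing in Logs Explorer.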
Now, before we move on to laying our logging and monitoring foundation, let’s talk a little about how metric scopes work.
If you visit the Monitoring | Overview page in Cloud Console for one of your projects, in the upper-left corner you’ll see a Metrics Scope link, similar to the one shown in the following screenshot:
Opening your Metrics Scope link will display a page resembling the following:
A metrics scope is a group of one or more projects that can be monitored from the currently viewed project. Let’s imagine you have a microservice-based application with services spread out over five different projects. Since the different projects are different parts of the same application, sometimes, you just want to go into project 3 and monitor the resources you have there. That’s the default, and each project comes preconfigured with a metrics scope containing itself. So, if you want to monitor project 3’s resources, then just go to project 3 and do so.
But metrics scopes can be edited, and you can add other projects to what’s monitorable from a given project. So, in my five-project microservice story, you could create a project 6 specifically to help monitor the application as a whole. Add projects 1–5 into the metrics scope for project 6, and now you can use project 6 to monitor resources across the entire application.
Perfect. So, we’ve seen how logging works and how monitoring can help. Now, let’s put them together and get them into our foundation.
Every resource you use in Google Cloud will impact how and what you collect in terms of logging and monitoring information. At this point, our goal is to get a few logging and monitoring fundamentals in place as part of our foundation, and then we can extend that foundation as needed. You will need to decide which of the following recommendations makes sense in your environment, and then adapt them as needed.
Let’s start with logging.
Let’s make this clear. Logging is good because it contains information needed to spot security issues, troubleshoot applications, and more, but logging also costs money. I’d love to simply say that you should enable every type of log you can, but the reality is that the cost would likely outweigh the possible benefits. The following recommendations try to balance information needs and cost. Regardless of whether you go with my recommendations or not, make sure that you spend some time in your billing reports, so you know exactly how much you are spending as it relates to logging and monitoring. Do not get surprised by instrumentation spending.
Here is my list of logging baseline recommendations:
Here, you are going to have to do some planning. If you like the idea of data access logs, then you can enable them for all projects, but you need to spot-check how much you are paying. Typically, this is where I start, by enabling them across the entire organization.
If my data access logs are getting too big and expensive, then one possible cost-savings measure is to enable data access logs for production projects across the entire production organizational folder, but disable them in dev and non-production. If you need to troubleshoot something in dev, you can enable them temporarily. Perhaps enable them for selected resources rather than all resources. Additionally, you can use log exclusions to support enabling data access logs, but then only sample the log entries.
For some of the preceding optional logs, enabling them with Terraform will be part of setting up that particular resource. Firewall logs will be enabled in the Terraform resource for creating the firewall, and the VPC Flow Logs will be enabled when creating a subnet. If you’d like to install the Ops Agent in your VMs, you can do that when configuring the VM itself, or you might look at Google’s blueprint resource to create Ops Agent policies at https://github.com/terraform-google-modules/terraform-google-cloud-operations/tree/master/modules/agent-policy.
Of course, for the data access logs, the easiest way to enable them is using Terraform. Enable them for the org using google_organization_iam_audit_config and/or for folders using google_folder_iam_audit_config resources. For example, to enable data access logs for all the services within a folder, you could use the following:
resource "google_folder_iam_audit_config" "config" {
  folder  = "folders/{folder_id}"
  service = "allServices"

  dynamic "audit_log_config" {
    for_each = ["DATA_WRITE", "DATA_READ", "ADMIN_READ"]
    content {
      log_type = audit_log_config.value
    }
  }
}
If you are using Google’s example foundation scripts that I used in earlier chapters, then you can enable the data access logs in the 1-org step by setting the data_access_logs_enabled variable to true. To see how the script works, examine the envs/shared/iam.tf file.
Once you’ve made some decisions regarding the logs to enable, give some thought to log aggregation.
Log aggregation gives you the ability to export logs from multiple projects into a single searchable location. Usually, this will either be a BigQuery dataset or a Cloud Storage bucket. BigQuery is easier to query, but Cloud Storage could be less expensive if you are storing the logs long-term for possible auditability reasons. I prefer BigQuery with a partitioned storage table. I mean, what’s the use of copying all the logs to a single location if you aren’t going to be searching through them from time to time? Partitioning the table will make date-filtered queries faster, and the storage will be less expensive for partitions that are 90 days old or older.
For log aggregation, first, create a centralized project where you can store the logs. If you are using the Google example org Terraform scripts, then the project created for this is in the fldr-common folder and is named prj-c-logging.
Log aggregation can be configured with the gcloud logging sinks command or by using Terraform. Google has a log export module that you can find at https://github.com/terraform-google-modules/terraform-google-log-export. It provides facilities to create a BigQuery, Cloud Storage, or Pub/Sub destination, which is then used to sink the specified logs. Once again, if you are using the example foundation scripts, log sinks are created and configured in the 1-org step. To see how the script works, examine the envs/shared/log_sinks.tf file.
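To give you a feel for the module, here is a hedged sketch of an org-level aggregation setup. The input names are based on the module’s documentation, so verify them against the version you pin; the org ID, project, and dataset names are placeholders of my own:

```hcl
# Hypothetical org-wide audit log aggregation using the log-export module:
# a BigQuery destination in the central logging project, fed by an
# org-level sink that includes all child folders and projects.
module "bigquery_destination" {
  source                   = "terraform-google-modules/log-export/google//modules/bigquery"
  project_id               = "prj-c-logging"
  dataset_name             = "org_audit_logs"
  log_sink_writer_identity = module.org_audit_sink.writer_identity
}

module "org_audit_sink" {
  source               = "terraform-google-modules/log-export/google"
  destination_uri      = module.bigquery_destination.destination_uri
  log_sink_name        = "org-audit-to-bq"
  parent_resource_id   = "123456789012" # your organization ID
  parent_resource_type = "organization"
  include_children     = true
  filter               = "logName:\"cloudaudit.googleapis.com\""
}
```

Notice how the destination module grants the sink’s writer identity access to the dataset, which saves you the manual permission step.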
With the key decisions made on how to store logs, let’s talk a bit about monitoring.
On the monitoring end, exactly what you need to view will be a project-by-project decision. In other words, if you need to monitor a particular service that you’ve created, do that in the service’s host project. Foundationally speaking, it might be a nice idea to create a central project within each environmental folder for when you need a single pane of glass. So, in the dev folder, create a project with the metrics scope configured that can see all the other projects in the dev environment. When you need to monitor some services in dev-project-1 and dev-project-2 from a single location, this project will be set up for that.
If you are using the example foundations scripts from Google, the central monitoring project is created in the 2-environments step of the modules/env_baseline/monitoring.tf script. Essentially, it’s just a project created using Google’s project factory blueprint with a name such as prj-d-monitoring. However, the script doesn’t actually configure the metrics scope. You can either configure that manually using the UI or via the monitoring API. In the UI you would do the following:
There’s also a (currently beta) Terraform resource that you can use: google_monitoring_monitored_project.
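As a sketch of what that looks like, here is a hedged example linking a single service project into the monitoring project’s metrics scope. It requires the google-beta provider, and the project names are placeholders of my own:

```hcl
# Hypothetical (beta) metrics scope link: make dev-project-1 monitorable
# from the prj-d-monitoring project. Repeat per service project.
resource "google_monitoring_monitored_project" "dev_project_1" {
  provider      = google-beta
  metrics_scope = "locations/global/metricsScopes/prj-d-monitoring"
  name          = "dev-project-1"
}
```

One resource per linked project keeps the configuration explicit, and Terraform’s plan output will show you exactly which projects are in the scope.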
Note: A Metrics Scope Can Contain No More Than 375 Linked Projects
So, if you have more than 375 projects in your dev/non-prod/prod environment folders, then you would need to create multiple “single-pane-of-glass” monitoring projects. Perhaps you have 1,000 different service projects, but they make up 25 major application types. In that case, perhaps you will create 25 different monitoring projects, one per major application type.
Excellent! With an idea of the things that should be in your monitoring foundation, let’s take a moment to point out a few other considerations.
I could seriously write a book on logging and monitoring in Google Cloud. In this chapter, I’ve tried to focus on the things that should be part of most Google Cloud foundations, without getting into all the variations you’ll need to implement depending on exactly what you are doing in Google Cloud. Now, let’s talk about some stuff you should at least consider.
Security Information and Event Management (SIEM) tools are third-party tools designed to help you scan your logs for untoward activities. Splunk, Elasticsearch, Sumo Logic, ArcSight, and QRadar, just to name a few, could all be configured to connect to Cloud Logging using Pub/Sub. Then, they can analyze what’s happening in your logs and spot naughty people trying to do naughty things. Additionally, Google now has Chronicle, which can scan logs alongside information coming out of Security Command Center for various threats.
Homegrown log analysis could be constructed in a few different ways. You could centrally aggregate your key logs into a Pub/Sub topic, read the topic with Apache Beam running in Dataflow, and then dump your logs into BigQuery for longer-term storage. Then, the Dataflow/Beam code could scan the log entries for anything you liked and alert you of findings in near real time. It would require custom coding, but you could grow the solution how and where you need it. I guess it all depends on whether you’re part of a build-it or buy-it sort of organization.
Another way you could do homegrown log analysis would be using BigQuery. Use Cloud Scheduler to run a job in Cloud Run or to call a cloud function once every 15 minutes. The function code executes a query in BigQuery looking for events since the last run. You could scan for admin activity log entries, logins using high-privilege accounts, IAM permission changes, log setting changes, permissions changes on important resources, and the like. If anything is found, send out a notification.
If your issues are hybrid-cloud related, and you are attempting to integrate systems running outside GCP into Cloud Logging and Monitoring, then check out Blue Medora BindPlane. It supports integration with Azure and AWS, along with on-premises VMware, and various application and database types.
Last but not least, please give some thought to how you are going to build reliable systems in Google Cloud. Read those Google SRE books I mentioned earlier (https://sre.google/books/) so that you can put the SRE teams in place, have good plans for your formalized incident response, and understand more about how to build your SLOs.
Cool beans. And with that, it’s time to wrap up our foundational monitoring and logging discussion and move on to our final two Google Cloud foundation steps.
In this chapter, we continued to build our secure and extensible foundation in Google Cloud by completing step 8, where we implemented key parts of our logging and monitoring foundations. We overviewed the tools that Google has for us to help with instrumentation, and then we focused on logging and monitoring. We explored how Google ingests and routes logs, and we looked at the various log types. Then, we moved on to monitoring and saw the various ways Google helps us monitor the services we have running inside of Google Cloud. Lastly, we discussed the additional elements our foundation needed as it relates to logging and monitoring.
Great going, y’all! Keep on hanging in there, and we are going to get this foundation built.
If you want to keep on moving through Google’s 10-step checklist with me, your personal tutor, by your side, flip the page to the next chapter where we are going to discuss our final two foundational steps: security and support.