Chapter 10. Logging

In his book In Defense of Food (Penguin Press), Michael Pollan explores the evolution of food chains in the United States, from garden-grown vegetables to processed foods created by food scientists. Encouraging readers to return to a focus on nutrition, not nutrients, Pollan distills the book's 205 pages into the short phrase “Eat food, not too much, mostly plants.”

This phrase came to mind when I was thinking about this chapter. Logging is a distillation of software execution state, ideally striking a balance in which you log just enough information, but not too much, especially in large-scale environments. When it comes to cost-effective logging, I think of a similar phrase: “Write logs, not too much, never sensitive information.”

Done well, logging can provide you with invaluable information about the state and operation of complex systems such as data pipelines. It’s a relief to come across a well-considered log message telling you exactly what went wrong when debugging a problem.

In addition to debugging issues and observing execution, the ability to export log data to query tools like Google BigQuery can turn well-formatted logs into a database. Analyzing logs at scale can give you further insight into performance and system health, in addition to generating metrics.

Done not so well, logging can be a pit of despair. Excessive logging drags down performance and racks up cloud costs. Noisy or poorly considered logging can ruin the ability to use logs for debug and analysis. Accidentally logging secrets and other sensitive information can create security incidents and, in some cases, violate the law.

In this short chapter, you’ll see advice for logging that is effective across debug, analysis, and cost. To start, you’ll see how cloud data pipelines can lead to a lot of log data that not only is costly in terms of cloud billing but also impacts performance and is difficult to navigate.

From here, the focus turns to cost mitigation, with techniques spanning cloud service settings and code design to rein in logging costs. Finally, I’ll share some tips for creating informative logs for debugging and analysis in distributed environments.

Logging Costs

The scale of data processing coupled with the elasticity of cloud storage can create a perfect storm for data pipelines to generate massive amounts of log data.

I encountered this in a pipeline where logging doubled execution time and racked up thousands of dollars per month in cloud costs. The logging approach was developed for debugging single processes over a small dataset in a local environment. In this setting, there was no need to provide identifying information with the log messages, as there was only a single process running in debug. At scale, the logs were incomprehensible, having lost the context of which process was generating a given log message.

This was a worst-of-all-worlds situation; the abundance of logs severely impacted performance, driving up cloud costs due to both log data volume and increased execution time. On top of that, the logs weren’t created with scale in mind, so they not only were useless for debugging but also crowded out important log messages with all the noise.

Impact of Scale

One of the issues is that logging practices designed for smaller scales can become problematic at larger scales. This is especially true for more verbose log settings, such as those set for debugging.

Sometimes this occurs when code designed for a smaller scale finds its way into a large-scale data processing pipeline. A volume of logging that provides helpful information when running in a centralized environment can turn into huge amounts of contextless messages when deployed in parallel. Code used for development or added for debugging can cause these issues as well. You’ll see advice on crafting a logging approach that scales in “Effective Logging”.

Besides the number of log messages, it’s also possible that the log messages themselves increase in size. A modest amount of data logged for debug purposes can balloon in size as the data grows. In the worst case, a combination of log volume and data size increases can rapidly multiply the size of log data.

Impact of Cloud Storage Elasticity

If you’ve worked on laptops with smaller hard drives or in containers that have a limited amount of disk space, you may have experienced overenthusiastic logging causing an out-of-space error of some sort. Consider yourself lucky that you were warned about log size because that doesn’t happen when using cloud logging.

When working in the cloud, oftentimes logs will be forwarded to cloud storage. As you saw in Chapter 3, there are frequently no limits for sending data to cloud storage, unless you impose them yourself.

Sometimes log forwarding happens automatically, such as AWS EMR automatically forwarding logs to S3. Other times you may have explicitly requested log forwarding, as with setting up cloud logging for Google Kubernetes Engine (GKE). Frankly, cloud logging is a great thing, giving you the ability to go back and introspect logs after a pod or cluster spins down. The drawback is that you’re unlikely to realize just how much log data is being generated until the cloud bill shows up.

In “Managing Retention in CloudWatch”, the FinOps Foundation reported that over $1,400 per month was attributed to unnecessary debug logs.1 An interesting thing to note about this situation is that in the cloud bill, the costs were presented as the cost of a specific object action, PutLogEvents. Recall from Chapter 3 that when objects are saved to cloud storage, you incur multiple costs: the cost of storing data at rest and the cost of operations performed on the object. When using cloud logging, there can also be costs for ingesting the logs in addition to storing the logs, as you can see in the Google Cloud Logging pricing summary.

Another thing to note about this example is that the log message contained a few hundred kilobytes of data. Thinking back to the preceding section on the impact of scale, this turns into hundreds of gigabytes at the scale of millions of events.

Reducing Logging Costs

There are a variety of things you can do to reap the benefits of cloud logging while keeping costs in check, ranging from manipulating cloud service settings to utilizing logging techniques and tools for handling logs at scale.

Let’s start on the cloud services side. When you use cloud logging, typically the logs will be kept forever unless you configure a log retention period, such as described in the AWS article “Altering CloudWatch log retention”. Establishing a retention period will result in logs being removed after a specific duration, ensuring that you aren’t holding on to logs that are no longer of use.

Similar to the lifecycle policies you saw in Chapter 3, retention periods will vary across services and environments. For instance, retaining production logs for a week could be worthwhile for debug and monitoring. In a test environment, you’ll likely debug failures over a smaller window, making it cost-effective to use a shorter retention period.
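As an example, AWS CloudWatch log retention can be set programmatically. Here is a minimal sketch using boto3, where the log group names and retention values are assumptions for illustration:

import boto3

# Hypothetical log group names; substitute your own.
RETENTION_DAYS = {
    "/pipelines/species-extraction/prod": 7,  # a week for debug and monitoring
    "/pipelines/species-extraction/test": 1,  # test failures are debugged quickly
}

logs = boto3.client("logs")
for log_group, days in RETENTION_DAYS.items():
    # CloudWatch deletes log events older than the retention period.
    logs.put_retention_policy(logGroupName=log_group, retentionInDays=days)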

In addition to the retention period, you can configure which logs to forward to the logging service and which to ignore. This can further reduce your logging footprint by homing in on just the logs of interest. For example, Google Cloud Logging gives you the ability to set exclusion and inclusion filters to this end.

In terms of logging techniques, making use of different log levels helps you tune log verbosity. The Python logging documentation includes an overview of when to use each log level depending on the types of messages you want to convey. You can then select the level of logging for different environments. In a production environment, you may use the default level of WARNING to surface warnings and errors while suppressing informational and debug messages. In a development or test environment, you can lower the log level to DEBUG to get additional information.

For especially noisy modules, consider more granular control over logging. For example, in a test environment with a log level of DEBUG, you can introduce an additional flag to set the log level for a module that has verbose DEBUG logging. By default, you could set the module log level flag to INFO to reduce log messages. If circumstances warrant the additional logging, you can change the level, rather than constantly dealing with a flood of log messages that may not be useful.
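A minimal sketch of this kind of module-level control, assuming a hypothetical chatty module named species_parser and a flag that defaults to INFO:

import logging

# Environment-wide level for the test environment.
logging.basicConfig(level=logging.DEBUG)

# Hypothetical flag for a module with especially verbose DEBUG output; leave
# it at INFO by default and raise it to DEBUG only when circumstances warrant.
SPECIES_PARSER_LOG_LEVEL = logging.INFO
logging.getLogger("species_parser").setLevel(SPECIES_PARSER_LOG_LEVEL)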

Using configurations lets you change the log level without code changes. This approach also enables different logging levels across various environments. I’ve seen this done with environment variables, where a DEBUG flag was turned on for the test environment but disabled for production. If an issue came up in production, our team could enable debug logs with just a flip of an environment variable.
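A minimal sketch of that pattern, assuming a DEBUG environment variable that is set to "true" in the test environment and left unset in production:

import logging
import os

# DEBUG flag read from the environment: "true" in test, unset in production.
debug_enabled = os.environ.get("DEBUG", "false").lower() == "true"

logging.basicConfig(level=logging.DEBUG if debug_enabled else logging.WARNING)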

For Python in particular, you can reduce the performance impact of logging by delaying the creation of log messages. You can see in the Python logging flow that a logging record is created only if the logger is enabled for the log level of the logging call. If not, the log record doesn’t get created.

For example, here are two different approaches for logging a message with variables, one using % formatting and the other using f-strings:

logger.debug("Finished extracting species for user %s. Species extracted: %s", 
             user_id, species)

logger.debug(f"Finished extracting species for user {user_id}. 
              Species extracted: {species}")

Let’s consider what happens in an environment where the log level is set above DEBUG. In the f-strings case, the log message is interpolated before the call is made to logger.debug. Regardless of whether the logger actually creates the log record, this string will be created.

In the %s case, you’re passing the user_id and species as arguments to the logger.debug method. Instead of the string creation happening before the logger method is called, the interpolation is deferred until the log record is formatted, which happens only if the log level warrants it.

This may seem like small potatoes, but remember that these little things can add up at scale.

Another optimization from the Python logging cookbook is conditional logging, in which logs are emitted only if a certain condition is met. This is a great way to limit logs that are primarily intended to help debug failures; if the failure doesn’t occur, the would-be log messages are dumped rather than written to the log. In the scenario I shared earlier where logging doubled pipeline runtime, conditional logging that emitted the diagnostic messages only if an error occurred would have eliminated most of the logs.

As with the example in the Python documentation, setting the capacity for the log buffer is a good practice to use memory judiciously. This will prevent wasting memory on buffered log messages.
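Here is a minimal sketch of the idea using the standard library’s MemoryHandler (not the cookbook’s exact recipe); the logger name and capacity are assumptions for illustration:

import logging
from logging.handlers import MemoryHandler

# DEBUG records are held in an in-memory buffer and written to the real
# handler only when an ERROR-level record arrives.
logger = logging.getLogger("species_extraction")
logger.setLevel(logging.DEBUG)

target = logging.StreamHandler()  # the real destination (console, file, ...)
buffered = MemoryHandler(
    capacity=200,              # bounds memory; a full buffer also flushes
    flushLevel=logging.ERROR,  # an ERROR record flushes the buffered context
    target=target,
    flushOnClose=False,        # discard anything still buffered at shutdown
)
logger.addHandler(buffered)

logger.debug("Diagnostic detail, emitted only if an error follows")
# logger.error("Extraction failed")  # uncommenting this flushes the buffer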

The techniques you’ve seen in this section will help reduce log volume and help you surface only those log messages that are relevant in a given environment and situation. This both reduces costs and cuts down on log noise.

The next section switches the focus from log costs to creating logs that are effective at scale and can be leveraged by aggregation tools for analysis.

Effective Logging

When it comes to building for parallel execution, whether in a distributed system or in a parallel processing environment, it’s important to reconsider how you approach logging compared to a centralized environment.

For example, when you have multiple processes or workers executing in parallel, your log messages need to contain some identifying information to help you determine what a log event refers to. Returning to the earlier Python logging example, notice that I included the user_id in this log message about species extraction:

logger.debug("Finished extracting species for user %s. Species extracted: %s", 
              user_id, species)

This log message would produce something like the following:

[DEBUG] Finished extracting species for user abc. Species extracted: night heron

Contrast this with the following messages, which lack this context. You can imagine how unhelpful a string of log messages such as these would be for root-causing an issue:

[INFO] Start extracting species
[INFO] Start extracting species
[ERROR] Failed extracting species
[INFO] Finish extracting species
[INFO] Finish extracting species

This was the state of the logs in the situation I shared in “Logging Costs”; without identifying information, the log messages couldn’t be correlated to the relevant process.

When crafting log messages, think about information you would want when debugging an issue, such as:

  • The environment where the job is running

  • The service or process executing the code

  • Data identifiers, including IDs for the related batch, row, or event

Some of this information might be available as a consequence of how your infrastructure is set up. For example, a data platform I worked on that ran in Kubernetes used separate projects for each environment, so it wasn’t necessary to log this information. I could also get the service information from the name of the pods. If I wanted to look at logs in the Spark deployment on production, I opened the Logs Explorer in the production project and filtered by pod names matching “spark.”
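For context that your infrastructure doesn’t capture, one option in Python is to attach identifiers to every message with a LoggerAdapter. A minimal sketch, where the environment, worker, and batch values are hypothetical stand-ins:

import logging

# The format assumes every record carries the extra keys supplied below.
logging.basicConfig(
    format="[%(levelname)s] env=%(env)s worker=%(worker)s batch=%(batch_id)s %(message)s",
    level=logging.INFO,
)

# Hypothetical identifiers; in practice these might come from environment
# variables, the orchestrator, or the batch being processed.
context = {"env": "prod", "worker": "worker-3", "batch_id": "1"}
logger = logging.LoggerAdapter(logging.getLogger(__name__), context)

logger.info("Start extracting species")
# [INFO] env=prod worker=worker-3 batch=1 Start extracting species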

Tip

If your source data does not include batch, event, or row identifiers, it’s advisable to create them—the earlier in the pipeline, the better. I worked on a streaming pipeline that created IDs for each row of data processed. As data passed through various parts of the pipeline, the ID was propagated as part of the Kafka payload. If malformed or missing data was identified, our team could inspect the Kafka messages to trace where the error occurred.

Returning to the species extraction example, what if there were multiple rows of data for user abc?

[DEBUG] Finished extracting species for user abc. Species extracted: night heron
[DEBUG] Finished extracting species for user abc. Species extracted: None

In this case, you don’t know what the description field looked like, so it’s hard to say whether the extraction is correct. Including row, batch, and event IDs in your logs can help disambiguate issues like this:

[DEBUG] Finished extracting species for user abc, batch 1, record 1-1.
        Species extracted: night heron
[DEBUG] Finished extracting species for user abc, batch 1, record 1-2.
        Species extracted: None

Warning

Keeping sensitive information out of logs can be tricky when working with data. While best practices like “Don’t log the password to the production Postgres server” are straightforward, variations in data privacy laws alongside changes in data sources can make it difficult to be sure you aren’t logging sensitive information.

There are approaches for masking sensitive data, but they mostly rely on your ability to model what sensitive data looks like. My advice if you are working with sensitive data is to not log any data, and to work with your company’s security or legal teams when in doubt.
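To illustrate why masking depends on modeling the data, here is a sketch of a logging.Filter that redacts anything matching an email address pattern; values the pattern doesn’t anticipate pass through untouched, which is why not logging the data at all is the safer route:

import logging
import re

# A deliberately simple pattern; real sensitive data rarely fits one regex.
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

class RedactEmails(logging.Filter):
    """Replace anything that looks like an email address in the log message."""

    def filter(self, record):
        record.msg = EMAIL_PATTERN.sub("[REDACTED]", record.getMessage())
        record.args = None  # the message is already interpolated above
        return True         # keep the record, just with the redacted text

logger = logging.getLogger("species_extraction")
logger.addFilter(RedactEmails())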

If you need data for debugging, consider writing it to a secure environment, such as a database or private cloud storage. To keep costs in check, couple this technique with automated expiration to remove old data. You’ll also want to check database log settings to ensure that sensitive information passed in transactions isn’t getting logged.

In addition to what you log, how you structure log information is important. Logging data in JSON format lets you query logs with tools such as Google BigQuery, giving you another window into system behavior and an opportunity to create metrics and alerts from log information. Schemas can be a great help to this end: a schema for log messages will keep your log format consistent across different teams and services. This makes analysis a lot simpler than having to accommodate different log format styles.

One of the pipelines I’ve worked on used logs as a cheap way to record and analyze profiling data, something you’ll see in Chapter 11, which covers monitoring. When execution of the profiled tasks finished, the results were written to the log in JSON. Our team could later query the logs to acquire and analyze the profiling data.

For example, you could capture the earlier debug messages in JSON to see which species were extracted for a given batch of bird survey data:

extract_info = {"user": user_id, "species": species}
logger.debug('Finished extracting species: %s', extract_info)

The preceding code would produce log messages like the following:

[DEBUG] Finished extracting species: {'user': 'user1', 'species': 'night heron'}
[DEBUG] Finished extracting species: {'user': 'user2', 
                                      'species': 'great blue heron'}
[DEBUG] Finished extracting species: {'user': 'user3', 'species': 
                                      'great blue heron'}
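Note that logging a dict through %s writes Python’s repr, with single quotes. If you plan to parse the payload downstream, serializing it with json.dumps keeps the logged text strictly valid JSON:

import json

# Same logger and extract_info as above, serialized as JSON rather than a repr.
logger.debug("Finished extracting species: %s", json.dumps(extract_info))
# [DEBUG] Finished extracting species: {"user": "user1", "species": "night heron"}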

With the log data in JSON format, you can extract attributes:

SELECT
    JSON_VALUE(log_data_json, '$.user') AS user,
    JSON_VALUE(log_data_json, '$.species') AS species
    . . .

In the preceding code, log_data_json can be extracted from the log messages using REGEXP_EXTRACT to capture the data following "Finished extracting species:", and PARSE_JSON can be used to convert the log string to JSON.

Summary

Much like refocusing your diet on eating real food, logging in cloud data pipelines benefits from taking a step back from the constituent components and considering the picture on a larger scale.

In this chapter, you saw how logging practices designed without scalability in mind can negatively impact performance and cost and, in the worst cases, hinder your ability to use logs for debug and analysis. Be judicious when logging data, and think about how many times a log message may be emitted when running at scale.

When it comes to reducing the costs of logging, setting retention periods and using filters in your cloud logging settings can reduce the amount of data you collect and store. On the code development side, splitting log messages into different levels keeps verbose debug information out of environments where it isn’t needed. For especially verbose modules, selectively enabling debug logging can give you more control.

You can minimize logging overhead in Python by leveraging the string interpolation provided by logging methods, rather than interpolating strings ahead of time. Setting up conditional logging will limit logs to circumstances where the additional information is helpful, such as if an exception occurs. This not only lowers log costs and performance impacts but tidies up your logs as well.

Finally, when code can be split across multiple processes, workers, and deployments, including identifying information in log messages is indispensable for tracing the data or processes that produced the message. For further introspection and analysis, logging diagnostic data in JSON format enables you to leverage tools for querying logs.

In this chapter, you saw how to cost-effectively use logging to leave clues for data pipeline debug and observability. The next chapter illustrates how to monitor data pipelines for debug, design, and cost efficiency. Oftentimes, monitoring lets you know a problem occurred, and logs provide more detail to help you root-cause the issue.
