CHAPTER 6


Hadoop Logs: Relating and Interpretation

The other day, a very annoyed director of business intelligence (at a client site) stormed into my office and complained about one of the contractors deleting some ledger records from a production server. She had received a daily summary audit log report that showed 300 ledger records (financial transaction entries) had been deleted! To start with, the contractor in question shouldn’t have had access to them. So I investigated, and it turned out that the ERP (enterprise resource planning) software the client was using had a bug that provided access through the “Public” role. I wouldn’t have discovered the bug if I hadn’t had audit logging enabled, which shows how important audit logging can be from a security perspective.

The purpose of HDFS audit logging is to record all HDFS access activity within Hadoop. A MapReduce audit log has entries for all jobs submitted. In addition, the Hadoop daemon log files contain startup messages, internal diagnostic information, errors, informational or warning messages, configuration logs, and so forth. You can filter out information that’s not required later, but it’s helpful to log all access, including authorized access. Even authorized users sometimes perform operations they are not approved to perform; for example, a police officer might make an unauthorized update to his girlfriend’s ticket record without the appropriate approvals. Besides, for audited applications or any SOX-compliant applications, it is mandatory to audit all access to data objects (e.g., tables) within an application, as well as to audit all job activity that changes any data within an audited application.

In this chapter, I will discuss how to enable auditing for Hadoop and how to capture auditing data. Log4j is at the heart of Hadoop logging, be it audit logs or Hadoop daemon logs. I will begin with a high-level discussion of the Log4j API and how to use it for audit logging, and then discuss the Log4j logging levels and their purpose. After an overview of daemon logs and the information they capture, you will learn how to correlate auditing with Hadoop daemon logs to implement security effectively.

Using Log4j API

A Java-based utility or framework, Apache Log4j was created by Ceki Gülcü and has since become a project of the Apache Software Foundation. Logging is an essential part of any development cycle, and in the absence of a debugger (which is usually the case), it is the only tool for troubleshooting application code. It’s very important to use the correct type of logging—one that’s reliable, fast, and flexible. Log4j fulfills these requirements:

  • Reliability is the expectation that relevant error or status messages are displayed without any exceptions. Custom logging routines can be prone to bugs, with some messages not displayed due to faulty logic. Log4j doesn’t have that problem: it is well tested and has been in wide use for a long time, so you can count on its logging output logic.
  • Speed refers to the response time of the logging routine used. With Log4j, the Logger class is instantiated (an instance is created) as opposed to interacting with an interface, resulting in a very fast response. Deciding what to log (based on logging level) involves only a decision based on the Logger hierarchy, which is fast. Outputting a log message is fast due to preformatting with Layouts and Appenders; typically, actual logging takes about 100 to 300 microseconds. With SimpleLayout (the simplest Layout option for Log4j, explained in the “Layout” section), Log4j can log as quickly as a print statement (which simply prints input text to a console or a file)!
  • Flexibility refers to the ease of changing a logging system (without modifying the application binaries that use it) and the ease of use for the application using the modified logging. For example, Log4j can direct output to multiple destinations, called Appenders, such as the console and a log file. Simply modify the log4j.properties configuration file to make such a change; no code changes are needed.

The easiest way to include status or error messages is, of course, to insert print statements directly in your code. So, what’s the advantage of using Log4j for logging as opposed to hard-coding such statements in your application code or using a custom logging module? Well, inserting debug statements and removing them is a tedious and time-consuming process that relies on the expertise of the programmer—who might just forget to remove them after testing. Getting the amount of output right (sometimes too much, other times too little) is difficult, and selectively displaying those messages is impossible. Also, any change to them involves recompiling the code. Last, a custom logging module may have bugs or may not have functionality as extensive as the Log4j API.

Via a configuration file, Log4j allows you to set logging behavior at runtime without modifying application binaries. A major concern with logging is its impact on performance. Any logging by nature slows down an application, but with Log4j the impact on performance is minuscule. For example, an independent test of the latest release of Log4j 2 (Version 2) showed that it can output up to 18 million messages per second (for full results, see Christian Grobmeier, “Log4j 2: Performance Close to Insane,” www.javacodegeeks.com/2013/07/Log4j-2-performance-close-to-insane.html). With Log4j the impact is limited to a range from nanoseconds to microseconds, depending on your Log4j configuration, logging level, and Appenders.
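For example, a minimal log4j.properties sketch along the following lines sends the same output to both the console and a log file (the Appender names and the file path are only illustrative); redirecting the output elsewhere later requires only editing this file, not recompiling the application:

# send root Logger output to the console and to a file
log4j.rootLogger=INFO, CONSOLE, FILE

log4j.appender.CONSOLE=org.apache.log4j.ConsoleAppender
log4j.appender.CONSOLE.layout=org.apache.log4j.PatternLayout
log4j.appender.CONSOLE.layout.ConversionPattern=%d{ISO8601} %p %c{2}: %m%n

log4j.appender.FILE=org.apache.log4j.FileAppender
log4j.appender.FILE.File=/var/log/hadoop/example.log
log4j.appender.FILE.layout=org.apache.log4j.PatternLayout
log4j.appender.FILE.layout.ConversionPattern=%d{ISO8601} %p %c{2}: %m%n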

The main components of Log4j logging framework are the Logger, Appender, Layout, and Filters. So you can better understand how they work together, Figure 6-1 illustrates where they fit within the framework.


Figure 6-1. Log4j framework and its main components

The sections that follow will discuss each of these components in detail and provide information about what they do and what their exact role is within the framework.

Loggers

A Logger is a named entity that is associated with a configuration (LoggerConfig) and subsequently with a logging level. For Log4j logging to function, you need to have a root Logger with a related configuration defined. The root Logger defines the default configuration (Appender, Layout, etc.). So what are these logging levels and how do they correlate?

Logging Levels for Log4j

There are seven logging levels for Log4j API. They log information in order of severity and each of the levels is inclusive of all higher levels. For example, log level INFO includes informational messages, warnings (higher-level WARN included), nonfatal errors (higher-level ERROR included) and fatal errors (higher-level FATAL included). Similarly, log level WARN includes warnings, nonfatal errors, and fatal errors. Figure 6-2 summarizes these inclusions.


Figure 6-2. Log4j logging levels and inclusions

The seven log levels are as follows:

  • ALL: This is the lowest possible logging level, and it logs all messages including the higher levels (e.g., fatal errors, nonfatal errors, informational messages, etc.)
  • TRACE: As the name suggests, this level logs finer-grained informational events than the DEBUG level.
  • DEBUG: Logs fine-grained informational events that are most useful to debug an application.
  • INFO: Logs informational messages that highlight the progress of the application at a more coarse-grained level.
  • WARN: Logs potentially harmful situations.
  • ERROR: Logs error events that might still allow the application to continue running.
  • FATAL: Logs very severe error events that will presumably lead the application to abort.

Please note that leaving the TRACE and DEBUG levels enabled in production systems can be considered a serious security flaw, and vulnerability scanners may report it as such. So, use these log levels only when troubleshooting issues, and make sure they are disabled immediately afterward.

Logger Inheritance

Logger names are case-sensitive and named hierarchically. A Logger is said to be an ancestor of another Logger if its name followed by a dot is a prefix of the descendant Logger name. A Logger is said to be a parent of a child Logger if there are no ancestors between it and the descendant Logger. So, for example, the Logger named L1.L2 is parent of the Logger named L1.L2.L3. Also, L1 is parent of L1.L2 and ancestor (think grandparent) of L1.L2.L3. The root Logger is at the top of the Logger hierarchy.

A Logger can be assigned a default log level. If a level is not assigned to a Logger, then it inherits one from its closest ancestor with an assigned level. The inherited level for a given Logger L1 is equal to the first non-null level in the Logger hierarchy, starting at L1 and proceeding upward in the hierarchy toward the root Logger. To make sure that all Loggers inherit a level, the root Logger always has an assigned level. Figure 6-3 contains an example of level inheritance.


Figure 6-3. Logger level inheritance

As you can see, the root Logger, L1, and L1.L2.L3 have assigned logging levels. The Logger L1.L2 has no logging level assigned to it and inherits its logging level from its parent L1. A logging request is said to be enabled if its level is higher than or equal to the level of its Logger. Otherwise, the request is disabled.
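To make this concrete, level inheritance can be expressed in log4j.properties with entries like the following sketch (the Logger names and levels here are illustrative, not taken from the figure):

# the root Logger always has a level (and here an Appender named CONSOLE)
log4j.rootLogger=INFO, CONSOLE
# L1 and L1.L2.L3 have explicitly assigned levels
log4j.logger.L1=DEBUG
log4j.logger.L1.L2.L3=ERROR
# L1.L2 is not listed, so it inherits DEBUG from its closest ancestor, L1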

Most Hadoop distributions have five standard Loggers defined in log4j.properties in the /etc/hadoop/conf or $HADOOP_INSTALL/hadoop/conf directory (Figure 6-4). For Log4j logging to function, a root Logger (with related configuration) must be defined. The security Logger logs the security audit information. Audit Loggers log HDFS and MapReduce auditing information, while a job summary Logger logs summarized information about MapReduce jobs. Some distributions also have Loggers defined for Hadoop metrics, JobTracker, or TaskTracker.


Figure 6-4. Loggers and default log levels

Figure 6-5 is a sample entry for HDFS audit Logger from log4j.properties.


Figure 6-5. HDFS Audit Logger

The maxfilesize setting is the critical size (here 256MB) after which the log file will “roll” and create a new log file; maxbackupindex (20 in this case) is the number of backup copies of the log file to be created. In this example, when the log file rolls over 21 times, the oldest file will be erased. Properties of other Loggers are specified in a similar manner in the log4j.properties file.
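For reference, the log4j.properties entries behind such a Logger typically look something like the following sketch; the exact property and Appender names (here hdfs.audit.logger and RFAAUDIT) vary slightly across distributions and Hadoop versions:

# level and Appender for the HDFS audit Logger
hdfs.audit.logger=INFO,RFAAUDIT
log4j.logger.org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit=${hdfs.audit.logger}
# roll the audit file at 256MB and keep 20 backups
log4j.appender.RFAAUDIT.MaxFileSize=256MB
log4j.appender.RFAAUDIT.MaxBackupIndex=20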

Appenders

For the Log4j framework, an output destination is called an Appender. Currently, Appenders exist for the console, files, GUI components, remote socket servers, JMS, NT Event Loggers, and remote UNIX Syslog daemons. In other words, you can define any of these as your output destinations for logging. As of Log4j Version 2, you can also log asynchronously: control passes back from the Logger to the application while I/O operations are performed in the background by a separate thread or process. Asynchronous logging can improve your application’s performance.

Appender Additivity

Multiple Appenders can be attached to a Logger. Each enabled logging request for a given Logger will be forwarded to all the Appenders in that Logger as well as the Appenders higher in the hierarchy. This default behavior, known as Appender additivity, can easily be disabled by setting the additivity flag for a Logger to false in the log4j.properties configuration file.
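In log4j.properties this is a one-line setting per Logger. For example, Hadoop distributions typically turn additivity off for the HDFS audit Logger so that audit events go only to the audit Appender and are not duplicated in the NameNode daemon log:

log4j.additivity.org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit=false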

Consider the example in Figure 6-6. If a console Appender is added to the root Logger, then all enabled logging requests will display on the console. In addition, if a file Appender is added to the Loggers L1, L1.L2, and L1.L2.L3, then logging requests for L1, L1.L2, and L1.L2.L3 will be written to the appropriate files and displayed on the console. Now suppose you set Logger L4’s Additivity flag to false. This effectively disconnects L4 and its children from the upward propagation of log output. Because the parent of Logger L4.L5 (which is L4 in the example) has its Additivity flag set to false, L4.L5’s output will be directed only to the Appenders in L4.L5 (in this case none) and its ancestors up to and including L4 (File4), but will not propagate to L1, L2, or L3. Figure 6-6 tabulates the results.


Figure 6-6. Appender additivity for Log4j framework

The Appenders frequently used by the major Hadoop distributions are:

  • Console Appender: Displays log messages on the console
  • File Appender: Writes log messages to a specific file, which you define in log4j.properties
  • Rolling file Appender: Writes log messages to files and rolls them based on size
  • Daily rolling file Appender: Writes log messages to files and rolls them on a daily basis

Using the same entry as for the HDFS Audit Logger (Figure 6-5), consider the Appender section presented in Figure 6-7.


Figure 6-7. Rolling file Appender for HDFS Audit Logger

In Figure 6-7, I used the RollingFileAppender with HDFS audit Logger. The output is formatted as per the Layout (PatternLayout) and the defined conversion pattern (I will discuss Layout and conversion patterns shortly), and looks like this:

2014-02-09 16:00:00,683 INFO FSNamesystem.audit: allowed=true   ugi=hdfs (auth:SIMPLE)  ip=/127.0.0.1   cmd=getfileinfo src=/user/sqoop2/.Trash/Current dst=null        perm=null

Note  HDFS audit output may result in a large file. Therefore, it is a good idea to have it roll off to a new file on a daily basis or by size.
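If you prefer rolling daily rather than by size, a DailyRollingFileAppender section along the following lines can be used instead of the RollingFileAppender shown in Figure 6-7 (a sketch; the Appender name DRFAAUDIT and the ${hadoop.log.dir} variable follow common Hadoop conventions, but check your own log4j.properties):

log4j.appender.DRFAAUDIT=org.apache.log4j.DailyRollingFileAppender
log4j.appender.DRFAAUDIT.File=${hadoop.log.dir}/hdfs-audit.log
# start a new file at midnight, suffixing the old one with the date
log4j.appender.DRFAAUDIT.DatePattern=.yyyy-MM-dd
log4j.appender.DRFAAUDIT.layout=org.apache.log4j.PatternLayout
log4j.appender.DRFAAUDIT.layout.ConversionPattern=%d{ISO8601} %p %c{2}: %m%n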

Layout

A Layout is an output format for a log entry. It can be associated with an Appender and can format the logging request as per your specifications before that request is delivered via an Appender.

It’s important to structure and present information in a way that makes reading and interpretation easy. Often it is necessary to pass logging information to another error-processing program running on a remote machine. So, it is important to decide on a structure for logging information. This is what the Layout objects provide.

Layouts use conversion patterns to format and output the log message. A conversion pattern consists of conversion characters and optional format modifiers. For example, the conversion character t outputs the name of the thread that generated the logging event, while %5p displays (or writes) the log level right-justified in a field of five characters, with space padding on the left. So, log level INFO is displayed (or written) as " INFO", with a leading space.

A Layout can be specified for an Appender in the log4j.properties file. For example, I specified PatternLayout as a layout (for our HDFS audit log Appender) in Figure 6-8.


Figure 6-8. PatternLayout for HDFS Audit Logger

The conversion pattern %d{ISO8601} %p %c{2}: %m%n from Figure 6-8 outputs as:

2014-01-27 20:34:55,508 INFO FSNamesystem.audit: allowed=true   ugi=mapred (auth:SIMPLE)        ip=/127.0.0.1   cmd=setPermission      src=/tmp/mapred/system/jobtracker.info  dst=null        perm=mapred:supergroup:rw-------

The first field is the date/time in ISO8601 (YYYY-MM-DD HH:mm:ss,SSS) format. The second field is the level or priority of the log statement. The third is the category (the Logger name), the fourth field is the message itself, and the fifth field is the line separator (newline, or \n).

Apache Log4j offers several Layout objects:

  • Simple Layout: org.apache.log4j.SimpleLayout provides a very basic structure for the logging message. It includes only the level of the logging information and the logging message itself. This is how the log message for HDFS Audit Logger (from Figure 6-8) will be output if Simple Layout is used instead of PatternLayout:
    INFO allowed=true ugi=hdfs (auth:SIMPLE) ip=/127.0.0.1 cmd=getfileinfo src=/user/sqoop2/.Trash/Current dst=null perm=null
  • Thread-Time-Category-Context Layout (TTCCLayout): This Layout outputs the invoking thread, time (in milliseconds since application started), the category or Logger used to create this logging event, and nested diagnostic context. All these properties are optional and if they are all disabled, the Layout will still write out the logging level and the message itself, just like Simple Layout. If you specify the following options in log4j.properties:
    #configuring the Appender CONSOLE
    log4j.appender.CONSOLE=org.apache.log4j.ConsoleAppender
    log4j.appender.CONSOLE.layout=org.apache.log4j.TTCCLayout
    #configuring the Layout TTCCLayout
    log4j.appender.CONSOLE.layout.ThreadPrinting=false
    log4j.appender.CONSOLE.layout.ContextPrinting=false
    log4j.appender.CONSOLE.layout.CategoryPrefixing=false
    log4j.appender.CONSOLE.layout.DateFormat=ISO8601

    You get the following output:

    INFO allowed=true ugi=hdfs (auth:SIMPLE) ip=/127.0.0.1 cmd=getfileinfo src=/user/sqoop2/.Trash/Current dst=null perm=null
  • DateLayout: As the name suggests, this Layout provides date formats such as NULL (no date/time displayed), RELATIVE (displays the time elapsed since application start), DATE (dd MMM yyyy HH:mm:ss,SSS pattern), ABSOLUTE (HH:mm:ss,SSS pattern), and ISO8601 (yyyy-MM-dd HH:mm:ss,SSS pattern).
  • HTMLLayout: Your application might need to present log information in a nice, visually appealing HTML-formatted file. org.apache.log4j.HTMLLayout is the relevant object. A big advantage of having the log file in HTML format is that it can be published as a web page for remote viewing.
  • XMLLayout: To render logging information in a portable (across multiple application modules) format, Log4j provides the org.apache.log4j.xml.XMLLayout object. It is important to note that the final output is not a well-formed XML file. This Layout object produces logging information as a number of <log4j:event> elements.
  • PatternLayout: You can use this Layout to “format” or output log messages using a consistent pattern to facilitate their use by an external entity. The relevant Layout object is org.apache.log4j.PatternLayout. The formatting is specified by a conversion pattern such as %d{ISO8601} %p %c{2}: %m%n, which combines conversion characters (e.g., m writes the log message, p writes the log level) with optional format modifiers. For example, %10c instructs that the Logger name occupy a minimum of 10 characters, with space padding on the left if it is shorter; %-10c pads with spaces on the right instead (see the short example after this list). For more details on the PatternLayout class and conversion characters, see: http://logging.apache.org/log4j/1.2/apidocs/org/apache/log4j/PatternLayout.html.
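As a quick illustration of format modifiers (the pattern, Logger name, and output here are only a sketch, not taken from a Hadoop configuration), here is a pattern that left-justifies the level in a five-character field:

log4j.appender.CONSOLE.layout.ConversionPattern=%d{ABSOLUTE} %-5p %c{1}: %m%n

A WARN event logged through this pattern by a Logger named com.example.MyLogger would come out along the lines of 22:26:08,280 WARN  MyLogger: disk usage at 85 percent, with the level padded to five characters so that messages line up in columns regardless of the level name's length.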

Filters

Filters evaluate log events and either allow them to be published or not. There are several types of Filters, and they screen out events based on such criteria as number of events (BurstFilter); a log-event message matching a regular expression (RegexFilter); or the event ID, type, and message (StructuredDataFilter). The type of filter determines where you need to specify it:

  • Context-wide Filters are configured as a part of the configuration (LoggerConfig) and evaluate events before passing them to Loggers for further processing.
  • Logger Filters are configured for a Logger and are evaluated after the Context-wide Filters and the log level for the Logger.
  • Appender Filters are configured for an Appender and determine if a specific Appender should publish the event.
  • Appender Reference Filters are configured for a Logger and determine if a Logger should route the event to an Appender.

Please note that all of these Filters need to be specified in the appropriate section (for a Logger or an Appender) in your log4j.properties file. For example, Figure 6-9 shows a section from log4j.properties that defines a RegexFilter to capture HDFS auditing events for login root only:


Figure 6-9. RegexFilter for HDFS Audit Logger

You can similarly use other types of Filters to prevent capture of unwanted events, which helps keep the audit log small and makes focusing on specific issues easier.

Reviewing Hadoop Audit Logs and Daemon Logs

As you’ve learned, you can use the Log4j component to generate log output for many purposes (e.g., debugging, operational stats, auditing). The logging data Log4j outputs is, in turn, generated by system daemon processes, and a particular type of data may exist in multiple places. How do you connect and analyze data from disjoint sources to get the total view of system operations, history, and state? The key is Hadoop’s audit logs. This section will discuss which daemon processes generate which data, what kind of data is captured by auditing, and how you can use Hadoop audit logs for security purposes.

To get a complete system picture, you need to understand what kind of data is logged by Hadoop daemons or processes (that generate logs) and where these log files reside. You also need to understand how the captured data differs with configured logging level. The auditing data from HDFS, for example, doesn’t have details of jobs executed. That data exists elsewhere, so connecting a job with HDFS access audits requires some work. You have to know where logs for JobTracker, TaskTracker (MapReduce V1), and ResourceManager (MapReduce V2) are or where log data for Task attempts is stored. You will need it for a complete audit of data access (who/what/where), and you certainly may need it in case of a security breach.

A major issue with Hadoop auditing is that there is no direct or easy way to relate audit data with job data. For example, JobTracker and TaskTracker logs (along with task attempt log data) can provide details of jobs executed and all the statistics related to jobs. But how can you relate this data with audit data that only has details of HDFS access? You will learn a couple of possible ways later in this chapter.

Audit Logs

Auditing in Hadoop is implemented using the Log4j API, but is not enabled by default. Hadoop provides an HDFS audit log that captures all access to HDFS and the MapReduce audit log, which captures information about all submitted jobs for a Hadoop cluster. The location of audit logs is specified using the environment variable HADOOP_LOG_DIR defined in the hadoop-env.sh configuration file located in $HADOOP_INSTALL/hadoop/conf directory ($HADOOP_INSTALL is the directory where Hadoop is installed). The audit log file names are defined in the log4j.properties file, and the defaults are hdfs-audit.log (for the HDFS audit log) and mapred-audit.log (for the MapReduce audit log). You can’t define audit logging for YARN using log4j.properties yet; this is still being worked on (see “Add YARN Audit Logging to log4j.properties,” https://issues.apache.org/jira/browse/HADOOP-8392).

To enable auditing, you need to modify the log4j.properties configuration file by changing the logging level of the appropriate Logger from WARN to INFO. You’ll find the file in the /etc/hadoop/conf directory or the $HADOOP_INSTALL/hadoop/conf directory, where $HADOOP_INSTALL is the Hadoop installation directory. log4j.properties defines the logging configuration for NameNode and the other Hadoop daemons (JobTracker, TaskTracker, NodeManager, and ResourceManager). For example, to enable HDFS auditing, look for this line in the log4j.properties file:

log4j.logger.org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit=WARN

Replace WARN with INFO to enable HDFS auditing; a log line will then be written to the HDFS audit log for every HDFS event.

Likewise, to enable MapReduce auditing, set its Logger to the INFO level:

log4j.logger.org.apache.hadoop.mapred.AuditLogger=INFO

Figure 6-10 shows a section from log4j.properties defining the HDFS auditing configuration.


Figure 6-10. HDFS audit logging configuration

Hadoop Daemon Logs

Hadoop daemon logs are logs generated by Hadoop daemons (NameNode, DataNode, JobTracker, etc.) and located under /var/log/hadoop; the actual directories may vary as per the Hadoop distribution used. The available logs are as follows:

  • NameNode logs (hadoop-hdfs-namenode-xxx.log) containing information about file opens and creates, metadata operations such as renames, mkdir, and so forth.
  • DataNode logs (hadoop-hdfs-datanode-xxx.log) containing information about DataNode access and modifications to data blocks.
  • Secondary NameNode logs (hadoop-hdfs-secondarynamenode-xxx.log) containing information about application of edits to FSimage, new FSimage generation, and transfer to NameNode.
  • JobTracker logs (hadoop-xxx-mapreduce1-jobtracker-xxx.log), containing information about jobs executed. JobTracker creates an xml file (job_xxx_conf.xml) for every job that runs on the cluster. The XML file contains the job configuration. In addition, JobTracker creates runtime statistics for jobs. The statistics include task attempts, start times of tasks attempts, and other information.
  • TaskTracker logs (hadoop-xxx-mapreduce1-tasktracker-xxx.log), containing information about tasks executed. TaskTracker creates logs for task attempts that include standard error logs, standard out logs, and Log4j logs.
  • ResourceManager (yarn-xxx-resourcemanager-xxx.log) and Job History server logs (mapred-xxx-historyserver-xxx.log), containing information about job submissions, views, or modifications. These are available only if you use MapReduce V2 or YARN.

As with audit logs, you can specify the logging level of the Hadoop daemons in the configuration file log4j.properties, and each daemon can have a different level of logging if required. For example, you could set the Audit Logger for HDFS to the INFO level and instruct TaskTracker to log at level TRACE:

log4j.logger.org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit=INFO
log4j.logger.org.apache.hadoop.mapred.TaskTracker=TRACE

Please note that other components (e.g., Hive, HBase, Pig, Oozie, etc.) have corresponding log4j.properties files in their own configuration directories.

Any operational Hadoop cluster has a number of scheduled (and unscheduled or ad hoc) jobs executing at various times, submitted by any of the approved users. As mentioned, it is challenging to correlate job logs with the HDFS access logs captured via auditing. For example, consider this typical row found in audit records:

2013-10-07 08:17:53,438 INFO FSNamesystem.audit: allowed=true ugi=hdfs (auth:SIMPLE) ip=/127.0.0.1 cmd=setOwner src=/var/lib/hadoop-hdfs/cache/mapred/mapred/staging dst=null perm=mapred:supergroup:rwxrwxrwt

All this row says is that a command (setOwner in this case) was executed on a source file, but it doesn’t indicate if it was executed as part of a job.

You would need to refer to the corresponding JobTracker or TaskTracker logs to see if there were any jobs executing at that time, or else assume that it was an ad hoc operation performed using a Hadoop client. Therefore, you need to maintain logs of other Hadoop daemons or processes in addition to audit logs and correlate them for effective troubleshooting.

Correlating and Interpreting Log Files

Hadoop generates a lot of logs. There are audit logs and daemon logs that separately provide a lot of information about the processing done at the sources from which they are gathered. However, they don’t form a cohesive, complete picture of all the processing performed at your Hadoop cluster. That’s the reason you need to correlate these logs while troubleshooting an issue or investigating a security breach.

Correlating Hadoop audit data with logs generated by Hadoop daemons is not straightforward and does require a little effort, but the results are well worth it. Using a username or job number as well as Linux filters (e.g., the sed stream editor utility), you can relate the data and identify security breaches.

What to Correlate?

Hadoop daemons log a lot of useful information, and you can also enable and gather audit logs. Assuming you have all these logs available, what should you correlate? Well, that depends on the event you are trying to investigate.

Consider a possible security breach in Chapter 3’s ticketing system example. As you remember, all the police stations send their ticketing data nightly to the central repository at police headquarters. The central repository holds the ticketing data in a Hive table that has partitions for each day. Every day, an IT professional runs an automated process to add a new partition using the data received.

One day, one of the IT professionals decided to help out his girlfriend by removing her speeding ticket entry. He was caught due to analysis conducted using correlated logs. He removed the ticket entry from the ticketing table, but forgot to remove the corresponding entries from judiciary-related tables, and the system flagged errors when the case was due for a hearing. Subsequently, a thorough investigation was conducted. Let’s follow the trail as it unfolded; the unprofessional IT professional goes by the username RogueITGuy.

When the error was detected, the system administrator checked access to HDFS using the following:

  • HDFS audit log: This provided details of all commands users executed on a cluster. Because Ticket_details was the table that was missing a record, investigators focused on it and filtered out access by user root and HDFS superuser hdfs (since both are system users with controlled passwords) to get a list of users who accessed Ticket_details. To filter, investigators (the team including the system administrator) used the following shell command:
    grep Ticket_details hdfs-audit.log | grep -v 'ugi=root' | grep -v 'ugi=hdfs'
    (The -v option of grep excludes lines that match the specified pattern.) The results included normal user activity plus the following suspicious activity by a user named RogueITGuy:
    2014-03-06 22:26:08,280 INFO FSNamesystem.audit: allowed=true ugi=RogueITGuy (auth:SIMPLE) ip=/127.0.0.1   cmd=getfileinfo
    src=/Ticketing/Ticket_details_20140220
          dst=null        perm=null

    2014-03-06 22:26:08,296 INFO FSNamesystem.audit: allowed=true ugi=RogueITGuy (auth:SIMPLE) ip=/127.0.0.1   cmd=rename
          src=/Ticketing/Ticket_details_20140220
          dst=/Ticketing/Ticket_stg/Ticket_details_20140220
          perm=RogueITGuy:supergroup:rw-r--r--

    2014-03-06 22:27:02,666 INFO FSNamesystem.audit: allowed=true ugi=RogueITGuy
    (auth:SIMPLE) ip=/127.0.0.1   cmd=open
          src=/Ticketing/Ticket_stg/Ticket_details_20140220       dst=null
          perm=null

    Investigators concluded the following:

    • User RogueITGuy (ugi=RogueITGuy) loaded a new version of daily staging file Ticket_details_20140220 (cmd=rename src=/Ticketing/Ticket_details_20140220 dst=/Ticketing/Ticket_stg/Ticket_details_20140220).
    • File was loaded to HDFS location that points to external staging table Ticket_details_stg, which is used to load data to the Ticket_details table by creating and overwriting the partition for a particular day.
    • The first entry (cmd=getfileinfo src=/Ticketing/Ticket_details_20140220) was to make sure he had the correct (modified with his girlfriend’s ticket entry removed) file uploaded from his PC.
    • The third entry was to make sure that the modified file was uploaded to the staging location correctly.
  • Hive log: If this user overwrote a partition with the modified file, he would have done that using Hive. So, investigators looked at the Hive logs next (in /var/log/hive for Cloudera CDH4; may vary as per your distribution and configuration):
    grep 'ugi=RogueITGuy' hadoop-cmf-hive1-HIVEMETASTORE-localhost.localdomain.log.out | grep 'ticket_details' | grep -v 'get_partition'

    They searched for activity by RogueITGuy in the table Ticket_details and, after reviewing the output, filtered out 'get_partition' entries, since that command does not modify a partition. Here’s what they saw:

    2014-03-06 22:42:36,948 INFO
    org.apache.hadoop.hive.metastore.HiveMetaStore.audit: ugi=RogueITGuy
    ip=/127.0.0.1   cmd=source:/127.0.0.1 get_table : db=default tbl=ticket_details

    2014-03-06 22:42:37,184 INFO
    org.apache.hadoop.hive.metastore.HiveMetaStore.audit: ugi=RogueITGuy
    ip=/127.0.0.1   cmd=source:/127.0.0.1 append_partition: db=default
    tbl=ticket_details[2014,2,20]

    Investigators drew the following conclusions:

  • The partition for 2/20/14 was overwritten (ugi=RogueITGuy ip=/127.0.0.1 cmd=source:/127.0.0.1 append_partition: db=default tbl=ticket_details[2014,2,20]) for table Ticket_details by RogueITGuy.
  • The file Ticket_details_20140220 was uploaded on 3/6/14 22:26 and the Hive partition was overwritten on 3/6/14 22:42 by the same user—RogueITGuy. Case closed!

Last, investigators checked the jobs submitted by RogueITGuy. Several job-related logs provided details of jobs users executed. Investigators started by reviewing the MapReduce audit logs, which contain all the user, date/time, and result details of submitted jobs. For Cloudera their location is /var/log/hadoop-0.20-mapreduce/mapred-audit.log. Investigators next issued the following command:

grep 'RogueITGuy' mapred-audit.log

It yielded a couple of jobs:

2014-03-06 22:28:01,590 INFO mapred.AuditLogger: USER=RogueITGuy       IP=127.0.0.1
        OPERATION=SUBMIT_JOB    TARGET=job_201403042158_0008   RESULT=SUCCESS
2014-03-06 22:42:07,415 INFO mapred.AuditLogger: USER=RogueITGuy       IP=127.0.0.1
        OPERATION=SUBMIT_JOB    TARGET=job_201403042158_0009   RESULT=SUCCESS
2014-03-06 22:45:55,399 INFO mapred.AuditLogger: USER=RogueITGuy       IP=127.0.0.1
        OPERATION=SUBMIT_JOB    TARGET=job_201403042158_0010   RESULT=SUCCESS
2014-03-06 22:47:39,380 INFO mapred.AuditLogger: USER=RogueITGuy       IP=127.0.0.1
        OPERATION=SUBMIT_JOB    TARGET=job_201403042158_0011   RESULT=SUCCESS
2014-03-06 22:48:46,991 INFO mapred.AuditLogger: USER=RogueITGuy       IP=127.0.0.1
        OPERATION=SUBMIT_JOB    TARGET=job_201403042158_0012   RESULT=SUCCESS

Investigators checked the JobTracker and TaskTracker logs using the web interface for JobTracker at http://JobTrackerHost:50030/JobTracker.jsp. Jobs job_201403042158_0010, job_201403042158_0011, and job_201403042158_0012 were Select statements that didn’t modify any data, but jobs job_201403042158_0008 and job_201403042158_0009 led to conclusive proof! Investigators reviewed the hive.query.string property in the job.xml file for these jobs and retrieved the query that was executed, which was:

FROM Ticket_details_stg INSERT OVERWRITE TABLE Ticket_details PARTITION (Yr=2014,Mo=2,Dy=20) SELECT TicketId,DriverSSN,Offense,IssuingOfficer

The query used the data from the Ticket_details_stg table (a daily staging table) to overwrite a partition for date 2/20/14 for table Ticket_details. The HDFS audit logs already established that RogueITGuy had loaded a temporary data file to staging table.

Together, the logs made clear that RogueITGuy edited the daily temporary data file and removed the record that contained ticket entry for his girlfriend. Then he uploaded the new file to the staging table and used the staging table to overwrite a partition for the Ticket_details table to make sure that the ticket entry was removed. Using the HDFS audit log, Hive log, MapReduce audit log and job.xml files, investigators obtained conclusive evidence of unauthorized activities performed by RogueITGuy and were able to successfully conclude the investigation.

As a result, RogueITGuy lost his job, and his girlfriend had to pay the ticket. She was so touched by his devotion, however, that she agreed to marry him. So, in the end, even RogueITGuy thanked correlated logs!

How to Correlate Using Job Name?

There are several ways you can correlate the logs. The easiest is using login name or job name, because log messages contain this information. You saw how the RogueITGuy username led to correlating the various log files to investigate unauthorized activities. Relating the logs using job names was an important step, as well. To track down the security breach, investigators had to extract relevant information from the logs and use job name to relate multiple logs to get details of what activities were performed for a particular job.

I will walk you through that process now, starting with the MapReduce audit log (mapred-audit.log), which has entries as shown in Figure 6-11.


Figure 6-11. MapReduce audit log

Notice the highlighted entry with the job name job_201403042158_0008. The HDFS audit log has multiple entries for this job. How do you filter them out?

If you look at the first occurrence of an entry for this job (in hdfs-audit.log), you will observe that it has the pattern cmd=getfileinfo along with the job name job_201403042158_0008. This holds true for Cloudera’s Hadoop distribution (CDH4); if you use a different distribution, you will need to identify a unique pattern for the first and last occurrences of a particular job. The good news is that you only have to perform this exercise once per Hadoop distribution. You simply have to establish a unique pattern for the first and last occurrence of the job name that separates it from subsequent occurrences; then you can use it for all your searches.

Subsequently, you can use the Linux utility awk to get the line number for first occurrence of this pattern:

awk '/cmd=getfileinfo/ && /job_201403042158_0008	/ { print NR }' hdfs-audit.log

The awk utility looks for the first line that matches the patterns cmd=getfileinfo and job_201403042158_0008 and uses the built-in variable NR to output line number.

Also, you can get the line number for the last occurrence of a job name by using the patterns cmd=delete and src=/tmp/mapred/system/job_201403042158_0008, like so:

awk '/cmd=delete/ && /src=\/tmp\/mapred\/system\/job_201403042158_0008/ { print NR }' hdfs-audit.log

After that, you can just use a stream editor, such as sed, to print the lines starting at the first line number and ending at the second. For example, sed -n 1,20p hdfs-audit.log will display lines 1 to 20 from the file hdfs-audit.log on the screen.

sed -n `awk '/cmd=getfileinfo/ && /job_201403042158_0008/ { print NR }' hdfs-audit.log`,`awk '/cmd=delete/ && /src=\/tmp\/mapred\/system\/job_201403042158_0008/ { print NR }' hdfs-audit.log`p hdfs-audit.log

The sed command uses the line numbers obtained in the earlier steps (produced by the two embedded awk commands) as the start and end, and prints all the lines in between. You can redirect the output of sed to a file and review the HDFS audit records there, instead of watching them scroll by on the screen. You can use this sed command to extract job details from hdfs-audit.log for any job (for CDH4)—just substitute the job name!
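If you extract job details frequently, you can wrap these steps in a small shell function. The sketch below (the function name is mine) assumes the CDH4 patterns discussed above, that is, getfileinfo for the first occurrence and the delete of the job’s file under /tmp/mapred/system for the last; adjust the patterns for your distribution:

# usage: extract_job_audit job_201403042158_0008 hdfs-audit.log > job_0008_audit.txt
extract_job_audit() {
  local job="$1" log="$2"
  # line number of the first occurrence of the job (the getfileinfo entry)
  local first=$(awk -v j="$job" '/cmd=getfileinfo/ && $0 ~ j { print NR; exit }' "$log")
  # line number of the last occurrence (delete of the job file under /tmp/mapred/system)
  local last=$(awk -v j="src=/tmp/mapred/system/$job" 'index($0, j) && /cmd=delete/ { print NR }' "$log" | tail -1)
  sed -n "${first},${last}p" "$log"
}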

Now, in this case, you didn’t get much information from hdfs-audit.log entries, except that this job did Hive-related processing and also showed the location of job.xml:

2014-03-06 22:27:59,817 INFO FSNamesystem.audit: allowed=true   ugi=RogueITGuy (auth:SIMPLE)    ip=/127.0.0.1   cmd=create      src=/user/RogueITGuy/.staging/job_201403042158_0008/libjars/hive-builtins-0.10.0-cdh4.4.0.jar   dst=null        perm=RogueITGuy:supergroup:rw-r--r--

2014-03-06 22:28:02,184 INFO FSNamesystem.audit: allowed=true   ugi=RogueITGuy (auth:SIMPLE)    ip=/127.0.0.1   cmd=getfileinfo src=/user/RogueITGuy/.staging/job_201403042158_0008/job.xml     dst=null        perm=null

2014-03-06 22:28:02,324 INFO FSNamesystem.audit: allowed=true   ugi=RogueITGuy (auth:SIMPLE)    ip=/127.0.0.1   cmd=getfileinfo src=/tmp/hive-RogueITGuy/hive_2014-03-06_22-27-55_562_981696949097457901-1/-mr-10004/164c8515-a032-4b6f-a551-9bc285ce37c4       dst=null        perm=null

Why not just use the grep command to retrieve the details for job job_201403042158_0008 from hdfs-audit.log? The reason is that not all the lines pertaining to job job_201403042158_0008 necessarily contain the job name pattern, and you want to make sure you don’t miss any relevant lines from the log file hdfs-audit.log.

Using Job Name to Retrieve Job Details

You can use the same technique of finding a unique pattern for the first occurrence to retrieve records relevant to a job from the JobTracker or TaskTracker logs. For example, to look for a pattern in the JobTracker log file and get the line number of the first occurrence of a job, such as job_201403042158_0008, use:

awk '/job_201403042158_0008/ && /nMaps/ && /nReduces/ { print NR }' hadoop-cmf-mapreduce1-JOBTRACKER-localhost.localdomain.log.out

To retrieve the line number for last occurrence for ‘job_201403042158_0008’, use:

awk '/job_201403042158_0008/ && /completed successfully/ { print NR }' hadoop-cmf-mapreduce1-JOBTRACKER-localhost.localdomain.log.out

You can use command sed to get details from the JobTracker log file for CDH4 by specifying the job name. For example, the sed command to print out all records for job_201403042158_0008 is:

sed -n `awk '/job_201403042158_0008/ && /nMaps/ && /nReduces/ { print NR }' hadoop-cmf-mapreduce1-JOBTRACKER-localhost.localdomain.log.out`,`awk '/job_201403042158_0008/ && /completed successfully/ { print NR }' hadoop-cmf-mapreduce1-JOBTRACKER-localhost.localdomain.log.out`p hadoop-cmf-mapreduce1-JOBTRACKER-localhost.localdomain.log.out

The command’s output provides valuable details such as the nodes tasks were executed on or where the task output is located:

2014-03-06 22:28:01,394 INFO org.apache.hadoop.mapred.JobInProgress: job_201403042158_0008: nMaps=1 nReduces=0 max=-1

2014-03-06 22:28:01,764 INFO org.apache.hadoop.mapred.JobInProgress: Input size for job job_201403042158_0008 = 74. Number of splits = 1

2014-03-06 22:28:01,765 INFO org.apache.hadoop.mapred.JobInProgress: tip:task_201403042158_0008_m_000000 has split on node:/default/localhost.localdomain

2014-03-06 22:28:01,765 INFO org.apache.hadoop.mapred.JobInProgress: Job job_201403042158_0008 initialized successfully with 1 map tasks and 0 reduce tasks.

2014-03-06 22:28:02,089 INFO org.apache.hadoop.mapred.JobTracker: Adding task (JOB_SETUP) 'attempt_201403042158_0008_m_000002_0' to tip task_201403042158_0008_m_000002, for tracker 'tracker_localhost.localdomain:localhost.localdomain/127.0.0.1:47799'

Using Web Browser to Retrieve Job Details

You can also review the JobTracker and TaskTracker log records easily using the browser interface. The runtime statistics for a job or XML file for a job are best reviewed using the browser interface. The URL for the records is composed of the tracker’s name and web access port. If your JobTracker host is called 'MyJobHost' and uses port 50030 for web access, for example, then the JobTracker logs can be reviewed at http://MyJobHost:50030/logs/. Likewise, logs for a TaskTracker running on host 'MyTaskHost' and using port 50060 can be reviewed at http://MyTaskHost:50060/logs/. Check your configuration file (mapred-site.xml) for particulars of hosts running specific daemons and ports. Filenames may vary by distributions, but log files will have TaskTracker or JobTracker in their names, making them easy to identify.
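You can fetch the same pages from the command line as well, which is handy on a headless cluster. For example (using the same illustrative host names and ports as above):

curl -s http://MyJobHost:50030/logs/ | grep -i jobtracker
curl -s http://MyTaskHost:50060/logs/ | grep -i tasktracker

The /logs/ page is a simple directory listing of the daemon’s log directory, so piping it through grep quickly shows the JobTracker or TaskTracker log file names available on that host.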

Figure 6-12 shows a logs directory and various MapReduce log files for a cluster using MapReduce 1.0.


Figure 6-12. MapReduce log files for MapReduce Version 1

If you are using YARN, then the corresponding daemons are ResourceManager (instead of JobTracker) and NodeManager (instead of TaskTracker). Please check the YARN and MapReduce configuration files (yarn-site.xml and mapred-site.xml) for the web access ports (the values of yarn.resourcemanager.webapp.address and mapreduce.jobhistory.webapp.address). For example, in Figure 6-13, the ResourceManager uses port 8088.


Figure 6-13. ResourceManager web interface for YARN

The NodeManager uses port 8042, as shown in Figure 6-14.


Figure 6-14. NodeManager web interface for YARN

Last, the HistoryServer uses port 19888 (Figure 6-15).


Figure 6-15. HistoryServer web interface for YARN

The YARN logs for NodeManager and ResourceManager should be used to get job details when YARN is used. The HistoryServer holds logs for archived or “retired” jobs, so if you need to access older job details, that’s where to look. The patterns to locate first and last lines may change slightly and might need to be adjusted, but you can easily browse through the log files to make those adjustments. An easy way to find out the location of the YARN log files is to refer to the log4j.properties file located in /etc/hadoop/conf and see where the appropriate Appenders are pointing.
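For example, the following quick checks (assuming the standard /etc/hadoop/conf configuration directory) show where the file Appenders write and which ResourceManager web port is configured; the second command prints something only if the property is set explicitly in yarn-site.xml:

grep -i 'appender.*\.File=' /etc/hadoop/conf/log4j.properties
grep -A1 'yarn.resourcemanager.webapp.address' /etc/hadoop/conf/yarn-site.xml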

A thought before I conclude the chapter. You have seen how to relate logs for a job, but what if you want to trace all the activity for a user, or trace activity for a whole day? Defining and using awk patterns for that would be cumbersome and error-prone. Instead, try defining Log4j Filters and additional Appenders to direct the relevant output to separate files, and then consolidate those files for the issue at hand. You can either use Flume for that purpose or simply have your shell scripts do the consolidation for you.

Important Considerations for Logging

Some additional factors will help you make effective use of logging. Although they are not directly relevant to security, I will mention them briefly in this section and you can decide how relevant they are for your individual environments.

Time Synchronization

Hadoop is a distributed system with multiple nodes—often a large number of them. Therefore, Hadoop logs are also distributed across the various nodes within your cluster. Individual log messages are timestamped, and while you are troubleshooting, you need to be sure that 12:00 PM on one node is the same moment of time as specified by 12:00 PM on another node.

For a network, clock skew is the time difference in the clocks for different nodes on the network. Usually, a time difference in milliseconds is acceptable; but a larger clock skew needs to be avoided. A number of protocols (e.g., Network Time Protocol, http://en.wikipedia.org/wiki/Network_Time_Protocol) can be used to make sure that there is negligible time skew. It is certainly important to make sure that the generated logs for your cluster are time synchronized.
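For example, on a Linux node running ntpd you can check synchronization status with the standard NTP utilities:

# show the peers the node synchronizes with, along with the offset (in milliseconds)
ntpq -p
# one-line summary of the synchronization state
ntpstat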

Hadoop Analytics

In the section “Correlating and Interpreting Log Files,” I discussed how a combination of the Linux stream editor sed and the powerful text processor awk can be used to search for a pattern and print the appropriate lines. You can easily extend this method to count the lines that match a pattern, or make multiple passes on the log files and aggregate the matches to analyze usage patterns. Analytics generated this way might not be useful for security investigations, but they can certainly provide useful statistics for your Hadoop cluster.

For example, the following command can tell you how many times the user RogueITGuy accessed your Hadoop cluster since it was started (you can of course easily extract the date range for access as well):

grep 'ugi=RogueITGuy' hdfs-audit.log | wc -l

The following command tells you how many jobs were executed by RogueITGuy since your cluster restarted:

grep 'USER=RogueITGuy' mapred-audit.log | wc -l

The following script extracts the start and end date/time for job job_201403042158_0008 (you can then compute the job duration):

awk -F ',' '/cmd=getfileinfo/ && /job_201403042158_0008/ { print $1 }' hdfs-audit.log
awk -F ',' '/cmd=delete/ && /src=\/tmp\/mapred\/system\/job_201403042158_0008/ { print $1 }' hdfs-audit.log

You can develop automated scripts that write all the daily job analysis or HDFS access analysis to files and add them as partitions for appropriate Hive tables. You can then perform aggregations or use other statistical functions on this data for your own analytical system.
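A minimal sketch of such a script follows. The summary table access_summary, its directory under /analytics, and the per-user count format are purely illustrative; adapt them to your own schema:

#!/bin/bash
# summarize HDFS audit events per user for one day and expose the result as a Hive partition
DAY=2014-03-06
grep "^$DAY" hdfs-audit.log | sed 's/.*ugi=\([^ ]*\).*/\1/' | sort | uniq -c > /tmp/access_$DAY.txt
hadoop fs -mkdir -p /analytics/access_summary/dt=$DAY
hadoop fs -put /tmp/access_$DAY.txt /analytics/access_summary/dt=$DAY/
hive -e "ALTER TABLE access_summary ADD IF NOT EXISTS PARTITION (dt='$DAY') LOCATION '/analytics/access_summary/dt=$DAY'"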

Of course, the analytics that are more relevant to you may vary, but I am sure you understand the method behind them.

This historical data (stored as Hive tables) can also be used for generating security alerts by defining variation thresholds. For example, you can write a Hive query to generate an alert (through Nagios) if a user executes twice (or more) the number of jobs as compared to his monthly average. The use of historical data for security alerts will always rely on sudden change in usage, and you can use the concept as applicable to your environment.

Splunk

Splunk is a very powerful tool for analyzing Hadoop data. Using the Hadoop Connect module of Splunk, you can import any HDFS data and use the indexing capability of Splunk for further searching, reporting, analysis, and visualization for your data. You can also import Hadoop logs, index them, and analyze them.

Splunk provides a powerful search processing language (SPL) for searching and analyzing real-time and historical data. It can also monitor your log data in real time (for patterns or thresholds) and generate alerts when specific patterns occur within your data. For example, if you are using Hive, you may want to know when a partition (for one of your production tables) is overwritten or added. You might also want to alert your system administrator when one of the users connects to your cluster.

Splunk’s most important capability (from a security logging perspective) is its ability to correlate data. Splunk supports the following ways to correlate data:

  • Time and geolocation based: You can correlate data for events that took place over a specific date or time duration and at specific locations. So, if I had used Splunk to conduct the investigation of RogueITGuy, I could have asked Splunk for all the log data for 3/6/14 (the date the issue occurred).
  • Transaction based: You can correlate all the data for a business process (or series of business processes) and identify it as a single event. Although this isn’t aimed at security, it can provide analytics for a job or a business process (such as duration, CPU and RAM resources consumed, etc.).
  • Sub-searches: Allow you to use the results of one search in another. For the RogueITGuy investigation, I could have defined sub-searches for HDFS, MapReduce, or Hive access for easier analysis.
  • Lookups: Allow you to correlate data from external sources. For instance, I could have checked all Hive alerts from Nagios to see if RogueITGuy was involved in any other issues.
  • Joins: Allow you to link two completely different data sets together based on a username or event ID field. Using Splunk, I could link monitoring data from Ganglia and Hadoop log data using the username RogueITGuy and investigate what else he accessed while performing his known illegal activities.

Last, Splunk offers Hunk, which is an analytics tool specifically designed for Hadoop and NoSQL Data. It lets you explore, analyze, and visualize raw, unstructured data. Hunk also offers role-based access to limit access to sensitive data (more information at www.splunk.com/hunk). Take a look and see if it is more useful for your needs!

Summary

In this chapter, I discussed how Hadoop logging can be effectively used for security purposes. The high-level approach is to use Linux utilities and stream editors to process the text in log files and derive the necessary information, but this is, of course, very old-fashioned and hard work. There are easier ways of achieving similar results by using third-party solutions such as Splunk.

A large number of third-party products are available to reduce the work involved in troubleshooting or investigating security breaches. The disadvantage is that you won’t have as much control or flexibility while correlating or analyzing the logs. The preference is yours, and most of the time it’s dictated by your environment and your requirements. With either approach, be sure to synchronize time on all the nodes involved before you rely on the generated logs.

Last, it is worthwhile to explore the use of Hadoop logs for analytics—be it security related or otherwise. You can either buy expensive software to perform the analytics or develop your own scripts if you are sure of your requirements—and if they are small in number!
