
CHAPTER 8

Big Data

Big Data. You have heard this term a lot over the past few years, but what does it actually mean? No, it does not refer to very large VARCHARs, and it is not about CLOBs and BLOBs. A good starting definition of Big Data is as follows: it is potentially a large volume of data that is structured, semi-structured, or unstructured. The definition has been very fluid and has slightly changed over time. Some have said it heralds the end of the RDBMS systems that are so prevalent. But like most trends, the reality is different from the hype. What we have seen is more of a blending of the two technologies being used together rather than the Big Data revolution replacing RDBMS systems.

The Internet of Things (IoT) is another new term that fits into the Big Data world. You may have seen the TV commercials where a repairman shows up on the doorstep of a house and the owner states that they didn’t order a repairman. The repairman says, “You didn’t call us, but your washing machine did.” This is part of the IoT, where machines are now connected to the Internet, communicating their status and error messages and even ordering replacements for parts they know are about to fail. This is a world where machines can talk to each other. We are not quite at Skynet level yet, but we are moving closer. With machines talking to machines, more data is sent across the network, which often means lots of data in small increments. However, if we are talking about cameras or audio, it could be lots of data in large increments. Whatever form the data is in, it has to go someplace. Oftentimes that data ends up in a database, and many data scientists want to use that data. What good is data if it is not stored and then mined for useful information?

This is where Big Data comes into play. It takes that data and stores it. Much of the data may be formatted and would be perfect for a relational database. Other data, such as machine data or audio/video data, may only be semi-structured or not structured at all. This data is perfect for a “Big Data” database. Big Data is often defined by the letter V. The three V’s of Big Data are volume, variety, and velocity. The volume of data produced by people and machines has grown rapidly over the past few years and continues to grow. The variety of data has also exploded. Most data produced in prior years was structured data that fit easily into the relational database model. Sensor data, video, and other semi-structured and unstructured data need new formats and tools before they can be analyzed. Velocity is the speed at which data is created, and it, along with the volume of data, has only increased as time has moved forward. These three V’s are often combined with some other V’s, such as veracity, visualization, and value. Veracity is important because if the data itself is not valid, it is often worthless. Visualization involves making all that data usable, taking the raw data and forming it into graphs and other representations that are meaningful to humans. Value, of course, might be the V companies care about the most: making sure they derive value from all that data.

You may also have heard the term “NoSQL” database used for these Big Data stores; we will look at NoSQL later in this chapter. Other information might simply be stored as files on an operating system. When people talk about Big Data, it is important to know that they are not referring just to the data itself. They are talking about the data as well as the software tools involved, the infrastructure, and in some cases the hardware, too. It is similar to the way people talk about data warehouses.

The Big Data field is new, exciting, and dynamic. It is an ever-changing target. Some of the first tools used for Big Data are already out of favor, and others keep getting enhanced. Oracle, its partners, and its competitors are constantly coming out with new features and products in this fast-moving environment. Enhancements will keep coming in this field for the foreseeable future. This chapter is intended as a quick introduction to some of the new tools from Oracle that are incorporating databases and Big Data.

But before we get started, let’s talk a bit about how this data is used. Some may use a Big Data database on its own. Others may use it in place of a data warehouse. The recent trend seems to be that it is not an either/or solution, but a partnership of both methods: using Big Data as needed and using an RDBMS as needed. Indeed, the new trend is that of a data lake, data pool, or data reservoir that sits outside the data warehouse, with data moved into or out of it as needed (see Figure 8-1). Of course, tools are needed to move the data, and this is where the data integration tools come into play.


FIGURE 8-1. Data reservoir

We will start off our discussion with Oracle’s product known as the Oracle Big Data Appliance. Yes, Oracle is continuing in the tradition of purpose-built machines. After we discuss the Big Data Appliance, we will talk about the aforementioned NoSQL databases. We’ll discuss what the term means in general as well as Oracle’s application of NoSQL. We will also take a look at Hadoop and how it plays a role in the Big Data space. In previous chapters, we talked about Oracle Data Integrator (ODI) and Oracle GoldenGate (OGG). They both have a role to play in Big Data, and we will delve into how these two tools work with Big Data and what the future has in store. Finally, we will look at the Big Data Connectors, which move data between an Oracle database and Big Data platforms.

Oracle Big Data Appliance

At Oracle Open World, in October 2011, Oracle announced the purpose-built machine or engineered system, the Oracle Big Data Appliance. Like its big brother Exadata, the Oracle Big Data Appliance is built with hardware designed and configured for a specific purpose. Combine that with software designed around Big Data, and you have the full package ready to deploy. Because Oracle has put together such a nice package of Big Data–related hardware and software, it makes a great starting point when talking about all things Big Data. Let’s look at some of the topics that we will discuss in more detail further in the chapter, as well as at what the Big Data Appliance is all about.

Many different software bundles come with the Big Data Appliance; the bundled licenses cover use on this machine only. The first thing you should know is that the Oracle Big Data Appliance runs on Oracle Linux. There is also a large Cloudera bundle; Cloudera has been a partner of Oracle’s for quite some time in the Big Data space. A MySQL database is also included. Finally, Oracle NoSQL Database and Oracle R Distribution round out the package. Some other products come with the Big Data Appliance but must be licensed separately, including Oracle Big Data SQL, Oracle Loader for Hadoop, Oracle Data Integrator (ODI) Application Adapter for Hadoop, Oracle SQL Connector for Hadoop, and others. As you can see, quite a bit of software is packed into the Big Data Appliance, and most of these products can also be licensed and run independently of it. Let’s look at some of these tools.

Cloudera

Cloudera is one of Oracle’s key partners in the Big Data space, and it has a whole portfolio of tools included in the Big Data Appliance. The version of Hadoop bundled with the Big Data Appliance comes from Cloudera. Cloudera Manager, Cloudera’s cluster administration tool, is also included, along with a bundle of Apache projects (Pig, Hive, Sqoop, HBase, Spark, and others) and the Cloudera Data Hub Edition, which includes Impala, Search, and Navigator. As you can see, Cloudera and Oracle have bundled a number of Cloudera products in the machine.

Oracle NoSQL

Oracle has come out with its own version of NoSQL. Before we get into Oracle’s version, however, it would be good to talk about what NoSQL is. The term can be quite confusing for DBAs: they spend lots of time learning SQL, and now they are being told about NoSQL. NoSQL does not mean the absence of SQL but is more commonly defined as “not only SQL.” As you’ll recall, SQL is the language of RDBMS systems, which deal mostly with objects in table format with relationships to other tables. NoSQL extends that to objects that are not tables but can be columnar structures, graphs, documents, or other formats. We will delve into NoSQL in a later section of this chapter.

Oracle R Distribution

Oracle has repackaged the open-source R distribution, and it is 100 percent supported and maintained by Oracle. The R programming language is primarily used by data scientists and by developers performing statistical analysis on large volumes of data, which is why it is often used in the Big Data space. They can continue programming in R and do not have to learn MapReduce or Hadoop; they can focus on the pieces they already know.

Oracle XQuery for Hadoop

Oracle XQuery for Hadoop allows you to write queries in XQuery and have the connector translate them into MapReduce jobs. The input data must be located in a file system that is accessible via the API, such as HDFS or Oracle NoSQL Database. Oracle XQuery for Hadoop can write the transformation results to HDFS, Oracle NoSQL Database, or an Oracle database.

Oracle Loader for Hadoop

Oracle Loader for Hadoop allows the loading of files from Hadoop into the Oracle database. (We will get into Hadoop in more detail a bit further in the chapter.) Although many have talked about pulling data from Oracle into Hadoop, business needs have shown that coexistence is required and that data must be able to move both ways, which is why Oracle Loader for Hadoop was developed. This tool allows files that live on Hadoop to be moved into Oracle. In Hadoop, the file contents are just strings; because Oracle wants the data to be structured (with proper data types and so on), the loader takes those strings and converts them to the appropriate data types. The work is done on the Hadoop system (or the Big Data Appliance, in this case), which takes the load off the target Oracle database. The other great feature is that Hadoop does all this processing in parallel streams for faster loading. Oracle Loader for Hadoop can also handle many different file types (Avro, JSON, text, Parquet, and so on). The variety of inputs, the speed, and the fact that the work is done on the Hadoop side make Oracle Loader for Hadoop a powerful tool.
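As a rough sketch of how this looks in practice, Oracle Loader for Hadoop is launched as a Hadoop job from the Hadoop side. The configuration file name below is a placeholder; the properties it contains (input format, target table, database connection details) come from the Oracle Loader for Hadoop documentation for your release:

    # Launch the loader as a Hadoop job; my_sales_load.xml is an illustrative
    # configuration file holding the input format, target table, and connection settings.
    hadoop jar ${OLH_HOME}/jlib/oraloader.jar oracle.hadoop.loader.OraLoader \
        -conf my_sales_load.xml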

Oracle SQL Connector for Hadoop

The Oracle SQL Connector for Hadoop is somewhat the opposite of the Oracle Loader for Hadoop. This connector allows you to create external tables in the database and query data on the Hadoop cluster. It uses the same infrastructure as Oracle external tables, but points to Hadoop files. Remember from Chapter 2 when we talked about external tables pointing to flat files on the operating system? The syntax is very similar:

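(The following is a minimal sketch; the table, column, directory, and location file names are illustrative, and the exact preprocessor directory object and script names come from the SQL Connector installation.)

    CREATE TABLE sales_hdfs_ext (
      sale_id    NUMBER,
      sale_date  DATE,
      amount     NUMBER
    )
    ORGANIZATION EXTERNAL (
      TYPE ORACLE_LOADER
      DEFAULT DIRECTORY ext_data_dir
      ACCESS PARAMETERS (
        RECORDS DELIMITED BY NEWLINE
        PREPROCESSOR "OSCH_BIN_PATH":'hdfs_stream'
        FIELDS TERMINATED BY ','
      )
      LOCATION ('osch-location-1.xml')
    )
    REJECT LIMIT UNLIMITED;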

Notice the line that is different: PREPROCESSOR. This tells the table definition that the data is going to come from Hadoop. HDFS, mentioned previously, stands for Hadoop Distributed File System; we will discuss Hadoop a bit further in this chapter. However, there is another step that still needs to be taken. Like “regular” external tables, the syntax just shown only creates the metadata in the Oracle database. The connector tool then populates the location with the Uniform Resource Identifiers (URIs) of the data files in HDFS. When a query is made against the table, the connector uses that information to find the data and send it to the database, which then returns it to the users. The downside of this method is that it results in full scans of the Hadoop data, with any filtering occurring on the database side and not on the HDFS side. This should be taken into consideration before doing complex joins with this method.

The SQL Connector can also be used for loading files from HDFS into Oracle. The tool generates external tables over the HDFS files and then performs a SQL UNION ALL at the end to combine all the data for loading into Oracle. This method may not be the best one, and comparisons should be made with Oracle Loader for Hadoop before deciding which tool to use. The SQL Connector is not primarily about loading data into the database; it is intended for analyzing the data on the Hadoop cluster by means of SQL while physically keeping all the data on the Hadoop cluster.

Oracle Big Data SQL

Big Data SQL is a tool that allows you to have multiple repositories of data across disparate systems but one method to query the data. This may be the most important piece of software Oracle has produced so far concerning Hadoop clusters. Being able to use regular SQL and let the tool figure out the appropriate way to access the underlying data is a big help. Big Data SQL also brings Smart Scan, based on the same technology used on Exadata, to Hadoop. This smart scanning technique is unique to Oracle and provides dramatic performance improvements. The data is scanned, read, and processed on the local nodes where the data is stored. WHERE clauses, filtering, and other work are also done on the local node, and only the needed data is shipped off the Hadoop machine to the Oracle database. This uses the massive computing power of the Hadoop cluster, reduces network traffic, and allows the work to be done off of the Oracle database. All of the security and redaction mechanisms in effect in the Oracle database are now applied to the Hadoop cluster as well. Oracle Big Data SQL makes use of external tables, much like Oracle SQL Connector for Hadoop.
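As a hedged illustration, once an external table has been defined over Hive or HDFS data (Big Data SQL provides the ORACLE_HIVE and ORACLE_HDFS access drivers for this), the Hadoop-resident data can be joined to ordinary Oracle tables with plain SQL. The table and column names here are made up:

    -- web_logs_ext is an external table defined over a Hive table
    -- (TYPE ORACLE_HIVE); customers is an ordinary Oracle table.
    SELECT c.cust_name, COUNT(*) AS page_views
    FROM   customers c
           JOIN web_logs_ext w ON w.cust_id = c.cust_id
    WHERE  w.log_date >= DATE '2015-01-01'
    GROUP BY c.cust_name;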

The Big Data Appliance has all the software (and hardware) you need to get started developing a Big Data project. Of course, as mentioned earlier, you don’t have to buy the BDA to use the software. All of the projects can be used independently of the BDA, and many of the products are open source. By looking at what is included with the BDA, because it is a complete solution, you can determine what software and tools will be required. Let’s now look into what Hadoop is all about.

Hadoop

You may have heard of Hadoop used in conjunction with Big Data. Although a relatively new technology, Hadoop has exploded onto the scene, and its adoption rate has been quite fast. Hadoop is an open-source project officially called Apache Hadoop, from the Apache community. The name comes from the lead programmer, Doug Cutting, who named Hadoop after his son’s stuffed elephant. Like most things in the computer science field, Hadoop is built on the shoulders of those that came before it. Google struggled with large, complex, and expensive file systems and built its own solutions, publishing two important papers describing the Google File System and MapReduce. Those ideas were incorporated into Nutch, an open-source web search project led by Doug Cutting, and Hadoop grew out of that work. Yahoo! then invested heavily in the framework, and in January 2008 Hadoop became a top-level open-source project at Apache.

Hadoop typically consists of two main parts: the storage portion, called the Hadoop Distributed File System (HDFS), and the processing portion, MapReduce. The file system is a Java-based solution that requires a Java Runtime Environment (JRE). MapReduce does the computational processing. Let’s look at each of these in turn.

Hadoop Distributed File System

The Hadoop Distributed File System (HDFS), or sometimes just Hadoop, stores files across multiple machines. Hadoop is designed to run on commodity hardware. It was also intended for batch-oriented use cases and for very large data sets. Files can only be appended to; there is no updating or removing of rows. This allows data to be input quickly, because appends are typically much faster than searching for data to update. It is quite easy to scale out by just adding more nodes. Adding files is very quick and easy, but querying can be very slow compared to relational databases. Because these are just files, some complex queries are better suited to relational databases or NoSQL databases.
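To give a feel for how this file-oriented, append-only storage is used in practice, here are a few basic HDFS commands; the paths and file names are placeholders:

    hdfs dfs -mkdir -p /data/sensors                   # create a directory in HDFS
    hdfs dfs -put readings_2015.csv /data/sensors      # copy a local file into HDFS
    hdfs dfs -ls /data/sensors                         # list the directory
    hdfs dfs -cat /data/sensors/readings_2015.csv      # read the file contents back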

HDFS consists of a NameNode, often referred to as the master node, and a number of DataNodes. The NameNode manages the file system and access to the files by client processes, and it determines which blocks get mapped to which DataNodes. The DataNodes perform the read/write requests from clients as well as the block creation, deletion, and replication commands from the NameNode. Data blocks for the files are replicated for a couple of reasons. One is fault tolerance, and the second is all about access. Because Hadoop was designed to run on commodity hardware, replicating the data blocks to other machines/nodes means the loss of a machine/node is not as critical. The NameNode will also try to fulfill requests for blocks from the node closest to the reader, for a faster return of data.

MapReduce

MapReduce is a framework that allows programmers to take advantage of the distributed data across parallel nodes. As the name implies, two distinct operators are involved with MapReduce. The first is the Map part, the job of which is to take the data as input and resolve it into key/value pairs. The Reduce part then takes those key/value pairs and aggregates them to provide the end result. Think of it as a production assembly line where some production lines do the low-level work and then another production line takes the output of the previous lines and assembles the final result.
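For readers coming from the SQL world, a useful (if loose) analogy is an aggregate query: the Map phase emits a key for each input record, and the Reduce phase plays the role of the GROUP BY, aggregating all the values that share a key. The classic word-count example expressed in SQL terms would look something like this (the table and column names are hypothetical):

    -- The Map phase would emit (word, 1) pairs; the Reduce phase,
    -- like this GROUP BY, sums the values for each distinct key.
    SELECT word, COUNT(*) AS occurrences
    FROM   words
    GROUP BY word;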

Because Hadoop systems are meant to be large, MapReduce can help perform these calculations on hundreds of nodes. The computations are done on the local nodes, using their processing power for an extremely efficient method: parallel processing. Programmers can write MapReduce programs without needing to worry about how to parallelize the work or which nodes are down; MapReduce handles that for them. Using pure MapReduce requires programming in Java. Many people want the benefits of MapReduce without learning Java, and for this reason they can use tools that are MapReduce-aware and stay one step removed from MapReduce itself. In fact, many of the newer tools are a further step removed, using higher-level interfaces such as Hive, Pig, and other Big Data APIs.

NoSQL

As mentioned earlier, NoSQL stands for “not only SQL.” There are many different flavors of NoSQL databases out there right now. With relational databases, we are used to data being stored in an orderly fashion in tables. With the massive amount of data now being generated in a non-orderly fashion, NoSQL databases are becoming more prevalent. There are many different models of NoSQL databases: graphs, key/value pairs, columnar, and others. Dozens of companies are trying to take advantage of this seismic shift in databases. Each of these models caters to different needs and to the performance characteristics of the problems it is trying to solve. Many of these new models also do not conform to the ACID properties for transactions. ACID stands for Atomicity, Consistency, Isolation, and Durability:

-   Atomicity   Transactions are committed as “all or nothing.” If one part of the transaction fails, it all fails. The transaction is a whole unit.

-   Consistency   Data must be valid following all rules (constraints of the database).

-   Isolation   Transactions are independent of each other. A transaction in process is isolated from other transactions.

-   Durability   Once a transaction has been committed, it stays committed.

Some have called the NoSQL database “BASE.” This clever play on words does have a meaning behind it. Here’s what BASE stands for:

-   Basically Available   Using basic replication across nodes or sharding to reduce the chance of data not being available.

-   Soft state   Data consistency is not guaranteed.

-   Eventually consistent   Unlike ACID databases, where the data must be consistent upon commit, BASE assumes that the data will eventually be consistent.

BASE consistency is fine in areas where it is acceptable for the data to become consistent eventually, but in certain areas it can be a problem. Think of a ticketing system for a concert. Suppose there are only 300 seats in the venue. If sold seats are not immediately confirmed as taken on every node, you could end up with an oversold theater and many upset patrons.

Oracle NoSQL Database is based on the popular Oracle Berkeley DB. One great feature is that the ACID/BASE behavior can be configured: the Oracle NoSQL Database provides several different consistency policies. This approach allows for the best of both worlds, letting the application developers determine what fits their needs best. The Oracle NoSQL Database is a key/value pair system. Similar to what you learned earlier with HDFS, Oracle distributes the key/value pairs across storage nodes; it does this by hashing the value of the primary key. Replication occurs among storage nodes so that there is no single point of failure and to allow for fast retrieval of data.

Using Berkeley DB as the underlying system for the NoSQL Database provides a large level of comfort for developers and DBAs alike. It also means that many of Oracle’s tools are already compatible with the Oracle NoSQL Database.

HBase

HBase is an open-source, column-oriented database management system that runs on top of HDFS. HBase is based on Google’s BigTable model. HBase does not support SQL queries; it is typically accessed through Java and other APIs such as Avro and Thrift. HBase is typically used when you are dealing with tables with a very large number of rows.
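A quick, hedged example using the HBase shell shows the column-family flavor of the data model; the table name, column family, and values here are made up:

    create 'weblogs', 'cf'                            # table with one column family named cf
    put 'weblogs', 'row1', 'cf:url', '/index.html'    # write one cell
    put 'weblogs', 'row1', 'cf:status', '200'
    get 'weblogs', 'row1'                             # fetch a row by its key
    scan 'weblogs'                                    # scan the whole table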

Big Data Connectors

Now that we have seen some places that can store the data (HDFS, a NoSQL database, or both), let’s look a bit further into how to move data around. As mentioned, data can be stored in a relational Oracle database, in Hadoop, or in an Oracle NoSQL database. Of course, enterprises would like to use the platform that suits that data best. Data architects can choose the correct platform, and with the help of Big Data Connectors, they can use all the data sources at the same time. A variety of tools have been developed for the Hadoop ecosystem. Many of the names seem funny—Hive, Pig, Spark, Parquet, Kafka, and Impala. We will look at how a few of these are connected with and being used by some of the Oracle products.

Oracle Data Integrator and Big Data

In an earlier chapter you learned how Oracle Data Integrator (ODI) was an essential tool for moving data from other databases into Oracle as well as from Oracle into other databases. Some large advances have been made in the last several years that have extended the flexibility of ODI to make it an essential part of a Big Data strategy. Some might say that ODI is the lynchpin in moving data around the Big Data space. What makes the tool great is that existing users of ODI don’t have to learn anything special: it is the same ODI tool that they are used to dealing with. Also, the use of special Knowledge Modules greatly extends the reach of ODI.

In April 2015, Oracle released ODI 12.1.3.0.1. This release of ODI greatly enhanced its features to extend the Big Data capabilities of ODI. Let’s look at some of these new enhancements.

First, ODI has added dozens of Knowledge Modules (KMs). Many of the Big Data Knowledge Modules fall under the Loading Knowledge Module (LKM) category and are labeled as such in ODI. Many of these LKMs have also been enhanced as “direct load” LKMs, loading straight into the target tables without intermediate staging tables. Among the many new KMs are some new Hive KMs. The Hive KMs have been upgraded to use the new fully compliant JDBC drivers, which has improved performance. Hive is yet another tool in the Big Data space. Hive allows developers to write SQL-like commands using Hive Query Language (HQL), which Hive then translates into MapReduce jobs. This allows developers to focus on the tools and languages they already know (such as SQL) and have other tools do the translation into MapReduce.
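For example, a HiveQL statement such as the following (the table and column names are hypothetical) looks almost identical to ordinary SQL, yet Hive compiles it into one or more MapReduce jobs behind the scenes:

    SELECT customer_id, SUM(order_total) AS total_spent
    FROM   orders
    WHERE  order_date >= '2015-01-01'
    GROUP BY customer_id;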

Another Big Data enhancement is the introduction of Spark. ODI now allows for mapping in Spark. Spark is typically used as a transformation engine for large data sets. ODI mappings can now generate PySpark, which allows for custom Spark programming in Python. Custom PySpark code can be defined by the user or via table components. Pig is also now available. Apache Pig is used for analyzing large data sets in Hadoop. The language of Pig is, of course, Pig Latin. Like Spark, Pig code can be defined by the user or via table components.

Yet another great addition is integration with Apache Oozie. Typically, jobs are run via the ODI agent. With this integration, Oozie can now be the orchestration engine for jobs such as mappings, scenarios, and procedures. This means that the ODI agent does not need to be installed on any of the Hadoop clusters; Oozie runs natively on Hadoop.

Sqoop (and, yes, that’s how it is spelled) is an application that can move data from relational databases to Hadoop, and vice versa. ODI has added more KMs that feature Sqoop integrations.
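For a sense of what Sqoop does under the covers, a typical import pulls a table from a relational database into HDFS as a parallel job. The connection string, credentials, table name, and paths below are placeholders:

    sqoop import \
      --connect jdbc:oracle:thin:@//dbhost:1521/orcl \
      --username scott \
      --password-file /user/scott/.oracle_pw \
      --table SALES \
      --target-dir /data/sales \
      --num-mappers 4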

Like most tools and databases, Hadoop has audit logs that let users track error, warning, and informational messages. ODI can now integrate with the output of the Hadoop audit logs. This allows ODI users to utilize the MapReduce statistics as well as track the execution of many of the tasks mentioned earlier (Oozie, Pig, and so on).

As you can see, as the Big Data world changes via the addition of new languages, tools, and methods, ODI is right there changing with it. As you saw in a previous chapter, ODI helps load data from a variety of sources into a variety of targets. All these new features just extend the breadth of adding more sources and targets and making Big Data easier to use. Using a tool you are already familiar with can take much of the complexity out of Big Data.

Oracle GoldenGate and Big Data

Oracle GoldenGate, as mentioned earlier, offers great capabilities for the real-time capture of transactional data from the Oracle database as well as many other relational databases, such as MySQL, DB2, and Microsoft SQL Server. Of course, we know that Oracle GoldenGate can deliver data into an Oracle database as well as a host of other relational databases. GoldenGate also has the capability to send data to flat files and to JMS queues, and it has a unique Java Adaptor piece. In February 2015, Oracle announced that it had extended the Java Adaptor and introduced the Oracle GoldenGate Adaptor for Big Data.

Four different adaptors have been introduced:

-   Oracle GoldenGate Adaptor for Apache Flume

-   Oracle GoldenGate Adaptor for HDFS

-   Oracle GoldenGate Adaptor for Hive

-   Oracle GoldenGate Adaptor for HBase

These four adaptors are based on the Java Adaptor piece for GoldenGate. Having four different adaptors allows for the best tool to be used for the particular job at hand. Indeed, customers may end up trying different methods to see which one is the most efficient.

Let’s take a deeper look to see how this is done. The capture or extract process is configured as normal. You then set up a GoldenGate pump process that reads the output trail from the primary extract. The pump parameter file would look something like this:

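(This is a minimal sketch. The EXTRACT, SOURCEDEFS, and TABLE lines are standard GoldenGate parameters; the user-exit line that hands trail data to the Java adaptor, including the library and callback names, is illustrative and should be checked against the GoldenGate Adaptor documentation for your release.)

    EXTRACT pumphdfs
    -- Hand each trail record to the Java adaptor; the library and
    -- callback names below are illustrative.
    CUSEREXIT libggjava_ue.so CUSEREXIT PASSTHRU INCLUDEUPDATEBEFORES
    SOURCEDEFS dirdef/source.def
    TABLE HR.*;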

Now, one more file needs to be configured, pumphdfs.properties, in the dirprm directory of GoldenGate. This properties file, much like the Java Adaptor properties file, looks quite complex at first but is really just a list of parameters that need to be set to match your operating system and setup.
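While the full property list depends on the adaptor and release, the general shape is a handler list plus handler-specific settings, along the lines of this heavily abbreviated sketch (the values are placeholders; consult the adaptor documentation for the complete set):

    # dirprm/pumphdfs.properties (heavily abbreviated sketch)
    gg.handlerlist=hdfs
    gg.handler.hdfs.type=hdfs
    # Handler-specific settings (target paths, file formats, classpath, and so on)
    # go here and vary by adaptor version; see the adaptor documentation.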

Once those properties are set, you can start the pump process as normal. The pump process reads the trail file; then, rather than sending a trail across the network to be read by a replicat process, it transforms the data into the format configured for the appropriate Big Data target, whether that is HDFS, HBase, or something else.

These Big Data extensions to GoldenGate are very exciting because data can now be streamed in real time from a source Oracle database (or other relational database) into a Big Data platform. From there, the Big Data platform can do what it does best, and possibly combined with other Big Data tools, the data can be massaged and analyzed immediately after landing on the Big Data platform. No more waiting hours or even days to load the data via batch—the data loads can now be performed in real time.

Oracle is currently working on these GoldenGate adapters, so we’re sure to see some evolving architectures in this space.

Summary

As you can see, a wide variety of tools are available when it comes to the Big Data space. Although the Big Data space is relatively new in the computing world, it is fast growing, and new tools seem to come out all the time. This is because companies struggle to make use of the huge volumes of data their enterprises are generating. Businesses can derive more business value from their data by utilizing the power of the open-source products that were designed to handle these large volumes.

The Big Data Appliance is Oracle’s solution, with all the hardware and software needed for an effective Big Data platform. However, as mentioned earlier, you don’t need to purchase the appliance to use any or all of the tools it contains. (Of course, the BDA does support that amazing Smart Scan feature.) You can build your own platform based on your Big Data strategy. So, if you are looking for an efficient method to store large volumes of data, you might look to Hadoop. As demand grows, you can look to the Big Data Connectors to move data from your Oracle database into Hadoop or from Hadoop into your Oracle database. Or you may need to stream real-time data with Oracle GoldenGate and Oracle Data Integrator, whether into Hadoop or into a third-party data warehouse.

The Big Data area is rapidly evolving, so paying attention to Oracle, its partners, and third-party vendors such as Hortonworks and Cloudera can help you make the most of your Big Data strategy.
