Chapter 5

BIG DATA PLATFORMS AND OPERATING TOOLS

LEARNING OBJECTIVES

After completing this chapter, you should be able to do the following:

    Recognize which Big Data software tools are available for use.

    Identify the open-source software known as Hadoop.

    Recall the roles of MapReduce and R software.

INTRODUCTION

This chapter identifies a variety of Big Data platforms as well as the operating tools that can be used on those platforms. Chief among these tools is Hadoop, an open-source framework that many organizations have chosen to support their Big Data efforts. This chapter concentrates on the information technology terms accountants need for a foundational understanding of Big Data applications.

BIG DATA CAPABILITIES

The first step in any Big Data initiative is understanding what the organization hopes to achieve. Two discussions should occur.

First, the organization should conduct a strategic planning retreat. The main question that should be asked is: What is the long-term vision for the company as it relates to Big Data?

Next, the organization should conduct an information planning retreat. This discussion should focus on how the organization can achieve the strategy from the first step with existing resources (hardware, software, staff, and future budget).

Both of these conversations are necessary and should take place in two different planning meetings. One way to approach the needs of the organization is by examining some of the capabilities of Big Data and then determining if any of them complement the corporate or IT strategy.

Table 5-1

image

Source: Adapted from "The Real-World Use of Big Data," IBM Institute for Business Value, accessed March 18, 2016, www-935.ibm.com/services/us/gbs/thoughtleadership/ibv-big-data-at-work.html

Data analytics (DA) is the process of examining raw data with the goal of drawing inferences from it. DA is used across numerous industries to help organizations make better business decisions and to confirm or refute existing models or theories. DA differs from "data mining" in that it includes an evaluation process that data mining does not necessarily have. Data mining involves searching large information sets to discover patterns and relationships. DA focuses on deducing an answer based on what the analyst knows.

Data analysis involves inspecting, cleaning, transforming, and modeling data with the objective of discovering useful information, proposing conclusions, and supporting sound decisions. Data analysis has many facets and methodologies, with techniques applied across business, science, and the social sciences.

Data mining is a technique that focuses on modeling and discovery for predictive purposes. Business intelligence focuses on aggregating enterprise data. In statistical applications, there are descriptive statistics and the following main types of data analysis:

     Exploratory: Finds new characteristics in data.

     Confirmatory: Affirms or denies existing beliefs.

     Predictive: Concentrates on statistical models for forecasting purposes.

     Text: Extracts and classifies information from unstructured data (such as email) using statistical, structural, and linguistic techniques.

Predictive analytics focuses on predicting future results or patterns based on data extracted from existing data sets. It does not guarantee results; it only forecasts what might happen with some degree of reliability, and it incorporates "what-if" scenarios and risk or sensitivity analysis. Predictive analytics can include practices such as data mining, statistical modeling, and machine learning.
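To make the distinction between exploratory and predictive work concrete, the following is a minimal R sketch. The monthly_sales data frame is invented purely for illustration; any small data set with a numeric driver and an outcome would work the same way.

```r
# Minimal sketch of exploratory versus predictive analysis in R.
# The monthly_sales data frame is invented purely for illustration.
set.seed(42)
monthly_sales <- data.frame(ad_spend = runif(36, 10, 50))   # advertising spend ($000)
monthly_sales$sales <- 200 + 4 * monthly_sales$ad_spend + rnorm(36, sd = 15)

# Exploratory: look for basic characteristics and relationships in the data
summary(monthly_sales)
plot(sales ~ ad_spend, data = monthly_sales)

# Predictive: fit a simple linear model and forecast a "what-if" scenario
fit <- lm(sales ~ ad_spend, data = monthly_sales)
predict(fit, newdata = data.frame(ad_spend = 60), interval = "prediction")
```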

The following table depicts the business applications that are most aligned with predictive analytics.

Table 5-2

image

Source: Adapted from http://tdwi.org/articles/2007/05/10/predictive-analytics.aspx (Accessed March 18, 2016)

How do these various preceding concepts relate to business intelligence?

Figure 5-1

image

Note that the outside, white area beyond "Prediction" is the area where prescriptive analytics or "How can we make this happen?" occurs.

KNOWLEDGE CHECK

1.     What is exploratory data analysis?

a.     Using statistical models for forecasting purposes.

b.     Affirming existing beliefs.

c.     Finding new characteristics in data.

d.     Prescribing actions to take.

WHAT PLATFORMS CAN BE USED FOR BIG DATA?

Figure 5-2: Big Data Architecture

image

Let’s begin with a review of the overall architecture of a Big Data system. This section will outline general concepts so that readers can relate their organization’s systems to the generic Big Data model.

Although figure 5-2 does not begin with strategy, strategy is at the forefront of all IT vision, objectives, hardware, and applications. It is foolish to acquire hardware or software, or to discuss Big Data, without addressing the company’s strategic goals for information technology. Therefore, strategy is an essential component of a Big Data system.

Hardware and OS Selection

Hardware selection is at the core of a Big Data system. Most organizations will have an established IT architecture. A Big Data plan will assess what is available, define where the company would like to end up, and then lay out how to acquire the necessary hardware. One of the major tenets of Big Data is to use commodity computers (which regularly fail) that are connected to create distributed files and distributed applications. Based on your organization’s Big Data approach, the hardware will incorporate many commodity-type computers. These will work with Hadoop (or the selected system software).

Once you have selected hardware, you must select the operating system that will run on the hardware. The operating system is the main software that supports the computer’s primary functions. Examples of operating systems are Windows, Linux, Unix, and iOS.

Software Selection

The next steps involve selecting the software programs that run on the operating system. System programs have direct control of the computer and perform I/O and memory operations. Examples of system programs are device drivers, BIOS software, boot-sector software, and assemblers and compilers. Hadoop, the framework that enables Big Data, will be described in the following section.

Application programs are traditional end-user applications such as accounting packages, CRM, ERP, MS Office, iTunes, Adobe Photoshop, and the like. Big Data application programs run in conjunction with Hadoop and include programs that reduce, curate, store, analyze, predict, and report on data.

Application data can come from many programs and can be structured, semi-structured, or unstructured. Data from traditional programs are usually structured. Data from outside sources (government, industry, science) or from other media sources (pictures, images, video, audio) are usually semi-structured. Data from social media or from streaming sources such as machines, appliances, or sensors are usually unstructured. Adding to the confusion, some software can serve more than one of the preceding roles. Figure 5-3 attempts to align software applications with the various segments of Big Data. The figure is not meant to be all-encompassing but to show that the selection of software depends on the functions an organization would like to perform.
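The three data shapes just described can be illustrated with a small R sketch. The records are invented, and the example assumes the jsonlite package is installed; it is only meant to show how the shapes differ in practice.

```r
# Rough illustration of structured versus semi-structured data in R.
# The records are invented; the jsonlite package is assumed to be installed.
library(jsonlite)

# Structured: a fixed schema, where every row has the same columns
csv_text <- "invoice,customer,amount\n1001,Acme,250.00\n1002,Beta,475.50"
structured <- read.csv(text = csv_text)

# Semi-structured: tagged fields, but records may differ in shape
json_text <- '[{"invoice":1001,"customer":"Acme","notes":["rush order"]},
               {"invoice":1002,"customer":"Beta"}]'
semi_structured <- fromJSON(json_text)

str(structured)        # three consistent columns, one row per invoice
str(semi_structured)   # same idea, but with an optional list-valued field

# Unstructured data (the free text of an email, an image, an audio clip)
# has no such tagged fields at all.
```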

Figure 5-3: Software Vendor Capabilities

image

Businesses are confronted with growing quantities of data and increased expectations for analysis. In response, vendors are providing highly distributed architectures and new levels of memory and processing power. New entrants into the market are capitalizing on the open-source licensing model, which is becoming an essential component of Big Data software or architecture.

Apache Hadoop, an established open-source data processing platform, was first used by Internet giants such as Yahoo and Facebook in 2006. Cloudera introduced commercial support for enterprises in 2008, and MapR and Hortonworks entered the market in 2009 and 2011, respectively. Among data-management veterans, IBM and EMC-Pivotal introduced their own Hadoop distributions. Microsoft and Teradata offer complementary software and support for Hortonworks’ platform. Oracle resells and supports Cloudera, while HP, SAP, and others work with multiple Hadoop providers.

Real-time stream processing and stream analysis are more achievable with Hadoop because of advances in bandwidth, memory, and processing power. However, this technology has yet to see broad adoption. Several vendors offer complex event processing (complex event processing, or CEP for short, is not as complex as the name might suggest; fundamentally, CEP is about applying business rules to streaming event data),1 but outside of the financial trading, national intelligence, and security communities, it has rarely been deployed. There may be movement in applications such as ad delivery, content personalization, and logistics as Big Data sees broader adoption.2
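Because CEP amounts to applying business rules to events as they arrive, a toy R sketch can convey the idea. The transaction data and the dollar threshold below are invented; a production CEP engine would evaluate such rules continuously against a live stream rather than a stored data frame.

```r
# Toy sketch of the CEP idea: apply a business rule to streaming event data.
# The events and the threshold are invented for illustration.
set.seed(1)
events <- data.frame(
  id      = 1:50,
  account = sample(c("A-101", "B-202", "C-303"), 50, replace = TRUE),
  amount  = round(rlnorm(50, meanlog = 4, sdlog = 1), 2)
)

# Business rule: flag any transaction over $100 as it arrives
alerts <- subset(events, amount > 100)
head(alerts)

# Summarize the flagged events per account as the stream accumulates
table(alerts$account)
```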

KNOWLEDGE CHECK

2.     Data from outside sources such as government or industry was listed as

a.     Structured.

b.     Unstructured.

c.     Semi-structured.

d.     Non-relational.

3.     Which of the following was not listed as an infrastructure resource for Big Data?

a.     SAP.

b.     Oracle.

c.     IBM.

d.     HP.

Vendor Selection

Next, let’s look at some well-known vendor choices for Big Data software along with brief comments about the software.

1.     1010data Facts is a hosted suite of data sets that allows access to disparate Big Data information sources. It also integrates seamlessly with company data. 1010data has access to a wide variety of external data, including consumer spending, e-commerce, weather, econometrics, transportation, and demographics. Data are granular and current and can be manipulated very quickly.3

2.     Actian Vortex provides capabilities for realizing business value from Hadoop.4 Its best-in-class data preparation and broad analytics support the following:

     Elastic data preparation: Bring in all data quickly with the fastest analytic engines, data ingestion technology, and Konstanz Information Miner (KNIME) user interface

     SQL analytics: Use SQL skills, applications, and tools with Hadoop for fully industrialized SQL support.

     Predictive analytics: Uncover trends and patterns with hyper-parallelized Hadoop analytic operators powered by KNIME.

3.     Amazon Web Services (AWS) began offering IT infrastructure services to businesses via cloud computing in 2006.5 Cloud computing allows businesses to replace up-front infrastructure expenses (with no need to order servers and other infrastructure weeks or months in advance) with lower, more variable costs that scale with the business within minutes.

      For example, AWS is a cloud service that is scalable and low-cost. Hundreds of thousands of businesses in 190 countries around the world use AWS. AWS has data center locations in the United States, Europe, Brazil, Singapore, Japan, and Australia, and delivers benefits such as low cost, agility, flexibility, and security to its customers.

4.     Cloudera offers a unified platform for Big Data—the Enterprise Data Hub. Enterprises now have one place to store, process, and analyze all their data, empowering them to extend the value of existing investments while enabling fundamental new ways to derive value from their data.

      Founded in 2008, Cloudera was the first and is currently the leading provider and supporter of Apache Hadoop for the enterprise. Cloudera also offers software for business-critical data challenges including storage, access, management, analysis, security, and search.6

5.     HP Big Data Services can help an IT infrastructure process increasing volumes of data (from emails, social media, and website downloads) and convert it into beneficial information. HP Big Data solutions encompass strategy, design, implementation, protection, and compliance as follows:

     Big Data Architecture Strategy: Define the functionalities and capabilities needed to align IT with Big Data initiatives. Through transformation workshops and roadmap services, learn to capture, consolidate, manage and protect business-aligned information, including structured, semi-structured and unstructured data.

     Big Data System Infrastructure: HP will design and implement a high-performance, integrated platform to support a strategic architecture for Big Data. Services include design and implementation, reference architecture implementations, and integration. A flexible, scalable infrastructure will support Big Data variety, consolidation, analysis, sharing, and search on HP platforms.

     Big Data Protection: Ensure availability, security, and compliance of Big Data systems. HP can help safeguard data, achieve regulatory compliance and lifecycle protection across the Big Data landscape, as well as improve backup and continuity measures.

6.     Hortonworks Data Platform (HDP) is the only completely open Hadoop data platform available.7 All solutions in HDP are developed as projects through the Apache Software Foundation (ASF). There are no proprietary extensions in HDP. HDP offers linear-scale storage and computing across a wide range of access methods, from batch to interactive, real-time, search, and streaming. It includes a comprehensive set of capabilities across governance, integration, security, and operations. HDP integrates with existing applications and systems so that organizations can take advantage of Hadoop with only minimal changes to existing data architectures and skill sets. HDP can be deployed in the cloud, on premises, or from an appliance, on both Linux and Windows.

7.     IBM includes the following types of information management data and analytics capabilities:8

     Data management and warehouse: Provide effective database performance across multiple workloads with lower administration, storage, development, and server costs; realize extreme speed with capabilities optimized for analytics workloads such as deep analytics; and benefit from workload-optimized systems that can be up and running in hours.

     Hadoop system: Bring the power of Apache Hadoop to the enterprise with application accelerators, analytics, visualization, development tools, performance and security features.

     Stream computing: Efficiently deliver real-time analytic processing on constantly changing data in motion and enable descriptive and predictive analytics to support real-time decisions. Capture and analyze all data, all the time, just in time; and, with stream computing, store less, analyze more, and make better decisions faster.

     Content management: Enable comprehensive content lifecycle and document management with cost-effective control of existing and new types of content with scale, security, and stability.

     Information integration and governance: Build confidence in big data with the ability to integrate, understand, manage, and govern data appropriately across its lifecycle.

8.     Infobright is an analytic database platform for storing and analyzing machine-generated data.9

     Data compression ratios of 20:1 to 40:1

     Fast, consistent query performance even when data volumes increase dramatically.

     Scale to hold terabytes and petabytes of historical data needed for long-term analytics.

     Load speeds of terabytes per hour to provide for real-time query processing or alerting

9.     Kognitio software interoperates seamlessly with existing business integration and analytics reporting tools,10 "data lakes" (large object-based storage repositories that hold data in native formats until needed),11 and Hadoop storage. It complements the pre-existing technology stack, bridging the usability gap to the new large-volume data stores and helping achieve timely value from Big Data. The Kognitio Analytical Platform is a scale-out, in-memory, massively parallel processing (MPP), not-only-SQL software technology optimized for low-latency, large-volume data loads and high-throughput, complex analytical workloads.

10.   MapR is the only distribution system that is built from the ground up for business-critical production applications.12

MapR is a complete distribution for Apache Hadoop that includes more than a dozen projects from the Hadoop ecosystem to provide a broad set of Big Data capabilities. The MapR platform includes high availability, disaster recovery, security, and full data protection. In addition, MapR allows Hadoop to be easily accessed as traditional network attached storage with read-write capabilities.

11.   Microsoft’s vision is to enable all users to gain actionable insights from virtually any data, including insights previously hidden in unstructured data.13 To achieve this, Microsoft offers a comprehensive Big Data solution.

     A modern data management layer supports all data types (structured, semi-structured, and unstructured). It makes it easier to integrate, manage, and present real-time data streams, providing a more holistic view of the business and fostering rapid decisions.

     The software also has an enrichment layer that enhances data discovery by combining industry data with advanced analytics. The software can connect and import data, create visualizations, and run reports on a regular basis, even on the go.

     The software has an insights layer that uses tools such as MS Office, with rich 3D visualizations and storytelling built into Excel, making it easier to visualize multiple data sources and modify them on the fly while presenting in PowerPoint.

     HDInsight is Microsoft’s Hadoop-based service built on the Hortonworks Data Platform, offering 100 percent compatibility with Apache Hadoop.

12.   Oracle offers a complete suite of infrastructure and software tools to address an organization’s Big Data needs.

13.   Pivotal Big Data Suite provides a broad foundation for agile data.14 It can be deployed as part of the Pivotal Cloud Foundry or as PaaS (platform as a service) technologies, on-premise and in public clouds, in virtualized environments, on commodity hardware, or delivered as an appliance. The suite offers the following:

     SQL analytics-optimized Hadoop based on ODP core

     Leading analytical massively-parallel processing database

     Massively-parallel processing, ANSI-compliant SQL on Hadoop query engine

14.   Software such as SAP HANA can simplify IT architecture.15 It combines in-memory processing with an enterprise data warehouse (EDW) and Hadoop to help harness Big Data. It purports to do the following:

     Run business processes 10,000 to 100,000 times faster.

     Use Big Data analytics with SAP IQ, a data warehouse solution.

     Virtualize data across a logical Big Data warehouse and gain insight without moving data.

15.   Teradata Aster includes a native graph processing engine for graph analysis across Big Data sets.16 Using this next-generation analytic engine, organizations can solve complex business problems such as social network or influencer analysis, fraud detection, supply chain management, network analysis and threat detection, and money laundering detection.

16.   A new generation of data analysts has made R the most popular analytic software in today’s market. Teradata Aster R lifts the limitations of open-source R with pre-built parallel R functions, parallel constructors, and integration of open-source R in the Aster SNAP Framework.

What is R? R is a language and environment for statistical computing and graphics. It is a GNU project which is similar to the S language and environment developed at Bell Laboratories (formerly AT&T, now Lucent Technologies). R can be considered as a different implementation of S. There are some important differences, but much code written for S runs unaltered under R.

R provides a wide variety of statistical (linear and nonlinear modeling, classical statistical tests, time-series analysis, classification, clustering …) and graphical techniques, and is highly extensible. The S language is often the vehicle of choice for research in statistical methodology, and R provides an open-source route to participation in that activity.

One of R’s strengths is the ease with which well-designed publication-quality plots can be produced, including mathematical symbols and formulae where needed. Great care has been taken over the defaults for the minor design choices in graphics, but the user retains full control.
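As a small taste of the language, the following sketch fits a model and produces a labeled plot with a mathematical expression in the title. It uses the cars data set that ships with base R, so it runs as written.

```r
# A small taste of R: fit a linear model and produce a labeled plot.
# The cars data set ships with base R, so no extra packages are needed.
fit <- lm(dist ~ speed, data = cars)
summary(fit)                       # classical statistical output for the model

plot(dist ~ speed, data = cars,
     xlab = "Speed (mph)", ylab = "Stopping distance (ft)",
     main = expression(hat(y) == beta[0] + beta[1] * x))
abline(fit, lwd = 2)               # overlay the fitted regression line
```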

R is available as free software under the terms of the Free Software Foundation’s GNU General Public License in source code form. It compiles and runs on a wide variety of UNIX platforms and similar systems (including FreeBSD and Linux), Windows and MacOS.17

Aster Discovery Platform also provides SQL and SQL-MapReduce analytic engines that enable a variety of analyses best suited to those engines, such as SQL analysis, path or pattern analysis, statistical analysis, and text analysis.

KNOWLEDGE CHECK

4.     What is MapR?

a.     A program to reduce the size of Big Data analyzed.

b.     Incomplete distribution of Apache Hadoop.

c.     Complete distribution for Apache Hadoop that packages more than a dozen projects.

d.     A relational database for Big Data.

5.     What is Teradata?

a.     Agriculture applications of Big Data.

b.     Data analytics software.

c.     A native graph processing engine for graph analysis.

d.     A relational database.

TOP DATA ANALYSIS TOOLS FOR BUSINESS

The following list of top data analysis tools for business is based on guidelines set by Alex Jones of KDnuggets, a business analytics site. He suggests tools based on their "free availability (for personal use), ease of use (no coding and intuitively designed), powerful capabilities (beyond basic Excel), and well-documented resources" (such as simple Google searches to support business needs).18

1.     Tableau, according to its website, is "business intelligence software that allows anyone to connect easily to data, and then visualize and create interactive, shareable dashboards. It’s easy enough that any Excel user can learn it, but powerful enough to satisfy even the most complex analytical problems. Securely sharing your findings with others only takes seconds." The tool is simple and intuitive, and the public software version has a million-row limit that allows for extensive data analytics.19

2.     OpenRefine (formerly Google Refine), according to its website, cleans messy data, transforms it from one format into another, extends it with web services, and links it to databases like Freebase. With the software, a user can do the following:

     Import various data formats.

     Explore datasets in a matter of seconds.

     Apply basic and advanced cell transformations.

     Deal with cells that contain multiple values.

     Create instantaneous links between datasets.

     Filter and partition data easily with regular expressions.

     Use named-entity extraction on full-text fields to automatically identify topics.

     Perform advanced data operations with the General Refine Expression Language (GREL).20
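OpenRefine itself is point-and-click, but the spirit of its cell transformations and regular-expression filters can be sketched in a few lines of R. The vendor records below are invented for illustration; the point is only to show what this kind of cleanup accomplishes.

```r
# A rough R analogue of two common OpenRefine cleanup steps:
# a cell transformation and a regular-expression filter. The data are invented.
vendors <- data.frame(
  name  = c(" ACME Corp ", "acme corp", "Beta LLC", "Gamma-Industries "),
  phone = c("555-0101", "(555) 0102", "555.0103", "n/a"),
  stringsAsFactors = FALSE
)

# Cell transformation: trim whitespace and normalize case
vendors$name <- toupper(trimws(vendors$name))

# Regular-expression filter: keep only rows whose phone contains digits
cleaned <- vendors[grepl("[0-9]{3}", vendors$phone), ]
print(cleaned)
```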

3.     KNIME can manipulate, analyze, and model data with visual programming. The user drags connection points between activities instead of writing blocks of code. The software can be extended to run R, Python, text mining, chemistry data, and the like, which provides the option to work in more advanced code-driven analysis.

4.     RapidMiner operates through visual programming and can manipulate, analyze, and model data.

5.     Google Fusion Tables is a versatile tool for data analysis, large data-set visualization, and mapping. Google has the leading mapping software. (See the Google Maps discussion later in this section.) Table 5-3 illustrates 2013 crime statistics from the FBI.21 The image shows an example of the data that was available.

Table 5-3

image

      Using Google Fusion Tables, the data was uploaded, and a map was created for the violent crime by city.

image

      The author then clicked on the map to show the violent crime in the city of Green Bay for 2013.

6.     NodeXL is visualization and analysis software for networks and relationships. Chapter 6 illustrates an example of LinkedIn connections that is similar in concept. NodeXL takes that a step further by providing exact calculations. A simpler tool is also available: see the node graph in Google Fusion Tables, or (for a little more visualization) try out Gephi.
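NodeXL works through a point-and-click interface; for readers who prefer code, a loose analogue of a small relationship graph can be sketched in R with the igraph package (assumed here to be installed). The connection list is invented for illustration.

```r
# A loose, code-based analogue of a NodeXL-style relationship graph.
# Requires the igraph package; the connections are invented for illustration.
library(igraph)

edges <- data.frame(
  from = c("Pat", "Pat", "Lee", "Lee", "Ana"),
  to   = c("Lee", "Ana", "Ana", "Sam", "Sam")
)
g <- graph_from_data_frame(edges, directed = FALSE)

degree(g)    # how many connections each person has
plot(g)      # draw the network
```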

7.     Import.io provides quick access to web data. The software highlights relevant data and, in a matter of minutes, "learns" what the user is looking for. From there, Import.io will pull data for the user to analyze or export.

8.     Google search operators are often an underused research tool. Operators allow the quick filtering of Google results to get to the most useful and relevant information. For instance, it is possible to obtain CFO survey information by accessing any of the major CPA firms’ websites as follows:

image

      Do not forget to leverage the power of the search by using additional tools such as the time search feature.

9.     Solver is an optimization and linear programming tool in Excel that allows users to set constraints. It is not the strongest of optimization packages but will be most helpful if the company has never explored optimization analysis. For advanced optimization, consider a tool such as R’s optim function.
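For comparison with Excel’s Solver, the sketch below minimizes an invented cost function with R’s optim function; the objective and starting values are purely illustrative. Box-constrained problems can be handled by the same function with method = "L-BFGS-B" and lower and upper bounds.

```r
# Tiny optimization sketch with base R's optim(); the cost function is invented.
cost <- function(x) {
  units_a <- x[1]
  units_b <- x[2]
  (units_a - 40)^2 + (units_b - 25)^2 + 0.5 * units_a * units_b
}

result <- optim(par = c(10, 10), fn = cost)   # Nelder-Mead by default
result$par     # the cost-minimizing production mix
result$value   # the minimum cost
```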

10.   WolframAlpha’s search engine is one of the web’s hidden gems and helps to power Apple’s Siri. WolframAlpha provides detailed responses to technical searches and makes quick work of calculus homework. WolframAlpha has been referred to as the "nerdy Google" for business users, as it presents information in charts and graphs and is excellent for high-level pricing history, commodity information, and topic overviews.22

11.   Google Maps can be accessed at https://www.google.com/maps/d/, where the user selects "Create a new map"; then clicks "Import."

image

      There will be an option to select a file from Google Drive or your computer. The author selected a recent CSV file that showed his speaking engagement cities over the last five years. Google Maps asked the author to identify which columns should be chosen for placemarks in the graph.

image

      Next, the application asked the user to identify a column to use as the title for the place-markers and "Course" was checked. The exported PDF looked like this:

image

      There was an option to customize the labels so that they could be highlighted differently by any of the columns listed previously. In the next image, the clients are labeled with different color place-markers.

image

      In the next image, the Midwest is enlarged.

image

      In the next image, another option was activated. The option overlaid the place-markers with the course acronym taught. Some of the information is hard to read because of multiple dates and multiple courses delivered in a particular city.

image

      The tool has many options including driving paths, adding place-markers, adding more layers, sharing with others, posting for public consumption, and the like.

KNOWLEDGE CHECK

6.     What is Wolfram Alpha?

a.     Data analytics software.

b.     The nerdy Google.

c.     Predictive analytics software.

d.     A subprogram within the MapR framework.

7.     Google Maps was illustrated using

a.     Consulting engagements.

b.     Crime statistics.

c.     Vendor dispersion.

d.     Post offices across the U.S.

HADOOP—WHAT IS IT ALL ABOUT?

The Apache Hadoop project develops open-source software for reliable, scalable, distributed computing. Hadoop is a framework with the ability to store large data sets. The data sets are distributed across clusters of computers using simple programming models; the framework is written in Java and runs on anything from a single computer to large clusters of commodity hardware. The software derives from papers published by Google and incorporates the features of the Google File System (GFS) and the MapReduce paradigm, reflected in the names Hadoop Distributed File System (HDFS) and Hadoop MapReduce.

Hadoop technology was designed to eliminate the data handling problems that companies face with Big Data, and it has achieved great success. It can quickly process huge amounts of data from a variety of sources, such as Facebook, Twitter, and automated sensors.
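The MapReduce part of that name is easiest to grasp through the classic word-count example. The sketch below simulates the two steps locally in R on a few invented documents; a real Hadoop job runs the same map and reduce logic in parallel across the cluster.

```r
# MapReduce in miniature: a local R simulation of the classic word-count job.
# The documents are invented; Hadoop would run these steps in parallel.
docs <- c("big data tools", "big data platforms", "hadoop stores big data")

# Map step: emit a (key, value) pair of (word, 1) for every word in every document
pairs <- data.frame(key = unlist(strsplit(docs, "\\s+")), value = 1)

# Shuffle and reduce step: group the pairs by key and sum the values per word
word_counts <- tapply(pairs$value, pairs$key, sum)
print(sort(word_counts, decreasing = TRUE))
```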

Hadoop Terminology

     Open-source software: Software created and managed by an open network of developers.

     Framework: Everything needed to develop and run software applications, provided through programs, tool sets, connections, and the like.

     Distributed data: Divided and stored on multiple computers, and computations can run simultaneously on multiple connected machines.

     Massive storage: The Hadoop framework can store enormous amounts of data in blocks for storage on clusters of low-cost commodity hardware.

     Faster processing: Processes large amounts of data in parallel across clusters of tightly connected low-cost computers for quick results.

KNOWLEDGE CHECK

8.     Which of these describes Hadoop?

a.     It is proprietary.

b.     It is open source.

c.     It is proprietary but available for reduced costs to nonprofit organizations.

d.     It is proprietary and must run in a Unix environment.

History of Hadoop

Larger data demands have resulted in users wanting quicker searches and faster processing times. Doug Cutting and Mike Cafarella worked on these issues with an open-source web search engine project called Nutch. They distributed data and calculations across low-cost computers so that multiple tasks could run simultaneously. During the same period, Google was working on a similar project to store and process data in a distributed fashion and return faster, more relevant searches. In 2006, Cutting moved to Yahoo and continued with the Nutch project, which was divided into two projects: the web crawler portion and the distributed processing portion (which became known as Hadoop).

Hadoop was released in 2008 as an open-source project that is managed and maintained by the non-profit Apache Software Foundation (ASF). The project is developed by a global community of software developers and contributors.

Hadoop Core Components

image

Source: "Big Data Basics," MSSQLTIPS, www.mssqltips.com/sqlservertip/3262/big-data-basics–part-6–related-apache-projects-in-hadoop-ecosystem/

These are the core components and related Apache projects of the Hadoop ecosystem.

     HDFS: A Java-based distributed file system that stores all types of data (structured, unstructured, and the like) without prior organization.

     MapReduce: Software model that allows large sets of data to be processed in parallel.

     YARN: Resource management framework to schedule and handle resource requests from distributed apps.

     Pig: A platform for manipulating data stored in HDFS. It provides a high-level language, Pig Latin, and a compiler that translates Pig Latin scripts into MapReduce programs, so the user can perform data extractions, transformations, loading, and basic analysis without writing MapReduce programs.

     Hive: A data warehousing facility with a SQL-like query language; similar to database programming, it presents data in the form of tables.

     HBase: Runs on top of Hadoop and serves as input and output for MapReduce jobs. It is a nonrelational distributed database.

     Zookeeper: Application meant for coordination of distributed processes.

     Ambari: Web interface meant for managing, configuring, and testing in the Hadoop environment.

     Flume: Software that collects, aggregates, and streams data into HDFS.

     Sqoop: A transfer mechanism for moving data between Hadoop and relational databases.

     Oozie: Hadoop job scheduler.

KNOWLEDGE CHECK

9.     What is Hive?

a.     Data cleansing program.

b.     Data analytics tool.

c.     Data warehousing and a query language.

d.     An application meant for coordination of distributed processes.

Practice Questions

1.     Based on the IBM survey, list several capabilities of Big Data.

2.     What is the purpose of Hadoop?

3.     What is the purpose of MapReduce?

4.     What is "R"?

Notes
