CHAPTER 4

Big Data Infrastructure—A Technical Architecture Overview

From our perspective, the basic infrastructure of Big Data comprises four elements: processing capability, storage capacity, data transport bandwidth, and visualization capability, all provided by systems and analytical software. We address each of these components in this chapter.

First, we view data processing as having two basic paradigms: batch and stream processing. Batch processing has high latency, whereas stream processing analyzes small amounts of data as they arrive. Stream processing has low latency, although, depending on the arrival rate, the volume can mount up very quickly. If you try to process a terabyte or more of data all at once, batch processing will not complete in less than a second. On the other hand, smaller amounts of data can be processed very fast—even on the fly.

As Big Data scales upward to exabyte and larger volumes, it is clear that single processors, and even small multiprocessors, cannot provide the computational power to process all the data. Large multiprocessor systems have evolved, as grid architectures and cloud-based systems, to handle the large volumes of data. Having powerful computers providing trillions of instructions per second is not enough. The computer system must therefore be balanced across processing capability and the second and third components, storage capacity and data transport bandwidth, to ensure that Big Data can be processed in a time interval consonant with the time to decision and the utilization of the extracted and derived information.

Additionally, a Big Data processing system must incorporate visualization techniques to provide the user with the ability to understand and navigate through the data and the resulting information derived from the data by analytics. These four elements, along with the systems and analytical software suites, constitute the basic infrastructure of a Big Data computing capability.

Data and Information Processing

Data processing infrastructure has evolved through several generations since the first mainframes were developed in the 1950s. The most recent manifestations have been threefold: (1) cluster computing, (2) cloud computing, and (3) processing stacks. Cluster computing and cloud computing are focused on scaling the computational infrastructure as an organization’s needs evolve. Processing stacks provide open source software (OSS) frameworks for developing applications to support a business data analytics capability. In addition, an organization must decide on a suite of programming systems, tools, and languages in order to develop custom applications compatible with the analytic suites that it may purchase or obtain through OSS.

Big Data success will ultimately depend on a scalable and extensible architecture and foundation for data, information, and analytics. This foundation must support the acquisition, storage, computational processing, and visualization of Big Data and the delivery of results to the clients and decision-makers.

Service-Oriented Architecture

Service-oriented architecture (SOA) is a paradigm for designing distributed, usually interactive, systems. An SOA is essentially a collection of services running on one or more hardware-software platforms. These services communicate with each other through established protocols, which is what we mean when we say the services are interoperable. The communication can involve simple data passing, or it can involve multiple services coordinating some activity. The services are connected to each other through software mechanisms supported by the software infrastructure.

SOAs evolved from transaction processing systems into a general software architecture. A service is a self-contained software unit that performs one or a few functions. Here, by service, we mean the software module that implements a service as defined earlier in the “Defining Services” section. It is designed and implemented to ensure that the service can exchange information with any other service in the network without human interaction and without the need to make changes to the underlying program itself. Thus, services are usually autonomous, platform-independent software modules that can be described, published, discovered, and loosely coupled within a single platform or across multiple platforms. Services adhere to well-defined protocols for constructing and parsing messages using description metadata.

Web Services

One implementation of SOA is known as web services because the services are delivered through the web. The advantages of web services are interoperability, functional encapsulation and abstraction, loose coupling, reusability, and composability. Because communication between two web service modules is carried over HTTP using self-describing formats such as the eXtensible Markup Language (XML), the communication is independent of any particular messaging system. The message definition is embedded in the message itself, so any receiver that can parse XML can readily understand the message contents. This allows any service to interoperate with any other service without human intervention and thus provides a capability to compose multiple services to implement complex business operations. Services can be reused by many other different services without having to implement a variation for each pair of business transaction interactions. Functional encapsulation and abstraction mean that functions performed on the client and server sides are independent of each other. Loose coupling, in which the client sends a message and the server receives it sometime later, allows the client and server to operate independently of each other and, more importantly, to reside on geographically and physically separate platforms. Web services are built on a number of components, as described in Table 4.1; a minimal client-side example follows the table.

Table 4.1 Typical web service components

Component

Brief description

Web browser

A software module that displays web pages encoded in HTML and provides interactive functionality through the modules below.

Javascript

https://angularjs.org/

https://nodejs.org/

A scripting language for web pages that, despite the name and similar syntax, is distinct from Java. Many frameworks and runtimes build on JavaScript, including AngularJS and Node.js.

Ajax

https://netbeans.org/kb/docs/web/ajax-quickstart.html

Ajax (asynchronous JavaScript and XML) is a method for building interactive web applications that process user requests immediately. Ajax combines several programming tools, including JavaScript, dynamic HTML, XML, cascading style sheets, the Document Object Model, and the XMLHttpRequest object originally introduced by Microsoft. Ajax functionality is often accessed through the jQuery library.

jQuery

http://jQuery.com

jQuery is a small, fast JavaScript library for developing client-side applications within browsers that send data to and retrieve data from back-end applications. It is designed to navigate a document, select DOM elements, create animations, handle events, and develop Ajax applications.

Jersey

https://jersey.java.net/

Jersey is a Java framework for developing RESTful web services that connect a browser-based client to applications code on a server. Jersey implements the JAX-RS API and extends it to provide programmers with additional features and utilities to simplify RESTful service development.

Glassfish

https://glassfish.java.net/

An open-source application server for the Java EE platform that supports Enterprise JavaBeans, JPA, JavaServer Faces, Java Message Service, Remote Method Invocation, JavaServer Pages, servlets, and so on. It is built on the Apache Felix implementation of the OSGI framework.
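From a programmer's point of view, most of the components in Table 4.1 ultimately issue or serve HTTP requests. The following is a minimal sketch of a client calling a RESTful web service from Python using the third-party requests library; the endpoint URL and the JSON fields are hypothetical.

    # Minimal sketch of a RESTful web service client (hypothetical endpoint).
    import requests

    # Ask the (assumed) service for the orders associated with a customer ID.
    response = requests.get(
        "https://example.com/api/customers/1234/orders",   # hypothetical URL
        headers={"Accept": "application/json"},
        timeout=10,
    )
    response.raise_for_status()      # fail loudly on HTTP errors

    orders = response.json()         # the self-describing message parses directly
    for order in orders:
        print(order.get("id"), order.get("total"))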

Cluster Computing

Cluster computing is an outgrowth of the distributed processing architectures of the 1980s but achieved its major impetus from the high performance computing community as it was applied to very large-scale scientific processing requiring trillions of computational cycles. Cluster machines can be connected through fast local area networks, but are sometimes geographically distributed. Each node runs its own instance of an operating system. The machines in a cluster may be homogeneous or heterogeneous.

As with parallel processors and cloud computing, effective and efficient use of cluster systems requires careful attention to software architecture and the distribution of work across multiple machines. Middleware is software that sits atop the operating systems and allows users to “see” the cluster of machines as essentially a single, multi-node machine. One common approach is to use Beowulf clusters built from commodity hardware and OSS modules.

Cloud Computing

Cloud computing is a maturing technology in which an IT user does not have to physically access, control (operate), or own any computing infrastructure other than, perhaps, workstations, routers, and switches, and, more recently, mobile client devices. Rather, the user “rents or leases” computational resources (time, bandwidth, storage, etc.) in part or whole from some external entity. The resources are accessed and managed through logical and electronic means. A cloud architecture can be visualized physically as the arrangement of large to massive numbers of computers in distributed data centers to deliver applications and services via a utility model. In a physical sense, many virtual servers may actually run on a single high-capacity blade in one data center.

Rather than providing the user with a permanent server to connect to when application execution is required, cloud computing provides “virtualized servers” chosen from a pool of servers at one of the available data centers. A user’s request for execution of a web application is directed to one of the available servers that has the required operating environment, tools, and application locally installed. Within a data center, almost any application can be run on any server. The user knows neither the physical server nor, in many cases, where it is physically located; that is, the location is irrelevant to the user.

Confusion still exists about the nature of cloud computing. Gartner asserts that a key characteristic is that it is “massively scalable” (Desisto, Plummer, and Smith 2008). Originally, cloud computing was proposed as a solution to deliver large-scale computing resources to the scientific community, for individual users who could not afford to make the huge investments in permanent infrastructure or specialized tools, or could not lease the needed infrastructure and computing services. It evolved rapidly into a medium of storage and computation for Internet users that offers economies of scale in several areas. Within the past 10 years, a plethora of applications based on cloud computing have emerged, including e-mail services (Hotmail, Gmail, etc.), personal photo storage (Flickr), social networking sites (Facebook, MySpace), and instant communication (Skype Chat, Twitter).

While there are public cloud service providers (Amazon, IBM, Microsoft, Apple, Google, to name a few) that have received the majority of attention, large corporations are beginning to develop “private” clouds to host their own applications in order to protect their corporate data and proprietary applications while still capturing significant economies of scale in hardware, software, or support services.

Types of Cloud Computing

Clouds can be classified as public, private, or hybrid. A public cloud is a set of services provided by a vendor to any customer generally without restrictions. Public clouds rely on the service provider for security services, depending on the type of implementation. A private cloud is provided by an organization solely for the use of its employees, and sometimes for its suppliers. Private clouds are protected behind an organization’s firewalls and security mechanisms. A hybrid cloud is distributed across both public and private cloud services.

Many individuals are using cloud computing without realizing that social media sites such as Facebook, Pinterest, Tumblr, and Gmail all use cloud computing infrastructure to support performance and scalability. Many organizations use a hybrid approach where publicly available information is stored in the public cloud while proprietary and protected information is stored in a private cloud.

Implementations of Cloud Computing

The original perspective on cloud computing was defined by the National Institute of Standards and Technology (NIST) as software-as-a-service (SaaS), platform-as-a-service (PaaS), or infrastructure-as-a-service (IaaS) (Mell and Grance 2011). As cloud computing concepts have evolved, additional perspectives have emerged. Linthicum (2009) identified those presented in Table 4.2.

Table 4.2 Linthicum’s perspectives on cloud computing

Each row lists the category, the degree of administrative control, the degree of client control, and an example offering.

Storage-as-a-service (SaaS). Admin: limited control; client: access only. Example: Amazon S3.

Database-as-a-service (DBaaS). Admin: DB management; client: access only. Example: Microsoft SSDS.

Information-as-a-service (INaaS). Admin: DB management; client: access only. Examples: many.

Process-as-a-service (PRaaS). Admin: limited control; client: access only. Example: Appian Anywhere.

Application-as-a-service (AaaS), also called software-as-a-service. Admin: total control; client: limited tailoring. Examples: SalesForce.com, Google Docs, Gmail.

Platform-as-a-service (PaaS). Admin: total control; client: limited programmability. Example: Google App Engine.

Integration-as-a-service (IaaS). Admin: no control (except VM); client: total control (except VM). Example: Amazon SQS.

Security-as-a-service (SECaaS). Admin: limited control; client: access only. Example: Ping Identity.

Management/governance-as-a-service (MGaaS). Admin: limited control; client: access only. Examples: Xen, Elastra.

Testing-as-a-service (TaaS). Admin: limited control; client: access only. Example: SOASTA.

Access to Data in a Cloud

There are three generally accepted methods for accessing data stored in a cloud: file-based, block-based, and web-based. A file-based method allows users to treat cloud storage as essentially an almost infinite-capacity disk. Users store and retrieve files from the cloud as units. File-based systems may use a network file system, common Internet file system, or other file management system layered on top of a standard operating system.

A block-based method allows the user a finer granularity for managing very large files where the time to retrieve a whole file might be very long. For example, retrieving a multi-petabyte file might take hours. But, if only a subset of data is required, partitioning the file into blocks and storing and retrieving the relevant blocks yields substantially better performance. These two methods just treat the data as a group of bytes without any concern for the content of the file.

A different approach is to use web-based methods to access data through REST services or web servers. Here, data may be stored as web pages, as semantically annotated data via XML, or as linked data via the Resource Description Framework (RDF). Using these methods, a user can store or retrieve explicit data items or sets of related data.

The granularity of storage and retrieval runs from coarse to fine across file-based, block-based, and web-based methods. Performance, however, runs from high to low in the same order, because as the requested data units become smaller, more computation is required within the cloud infrastructure to locate them for storage or retrieval.
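The contrast between file-based and block-style access can be sketched with plain HTTP against an object store. This is a minimal illustration only: the bucket URL is hypothetical, it uses the third-party requests library, and it assumes the server honors HTTP Range requests, as most cloud object stores do.

    # Sketch: whole-file versus block-style retrieval from a cloud object store.
    import requests

    url = "https://storage.example.com/bucket/sensor-archive.bin"   # hypothetical

    # File-based style: fetch the whole object as a single unit.
    whole = requests.get(url, timeout=60).content

    # Block-based style: fetch only the first 1 MiB of the object.
    block = requests.get(
        url,
        headers={"Range": "bytes=0-1048575"},   # request a single block
        timeout=60,
    ).content

    print(len(whole), len(block))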

Moving to the Cloud

Cloud computing is an enabling technology for creating target platforms on which to host services. It is not a mechanism for specifying the explicit technologies to be used in implementing a cloud to support a particular business environment.

Today, given the popularity of cloud computing, the emergence of robust system software and application frameworks, and the significant pressure that IT infrastructure costs place on business operations, many organizations are deciding whether or not to move to the cloud. A common first step is to experiment by implementing a small, low-risk application to assess cloud computing. The availability of web-accessible cloud computing environments makes it easy to try cloud computing (Kaisler, Money, and Cohen 2012).

Processing Stacks

The advent of OSS has introduced a new approach to designing and developing software for commercial and business use—the so-called stack approach. A stack is a vertical hierarchy of different software modules that provide a suite of services for a comprehensive computing infrastructure. Usually, no additional software is required to provide an environment for developing and supporting applications. Several stacks have come into widespread use—originating in the scientific community, but now adopted by the business community as well.

There are many extant stacks developed by and for specific communities of developers and users. Several of the most popular are described in the following sections.

The LAMP Stack

The LAMP stack (Linux-Apache-MySQL-Perl/PHP/Python) consists of four components:

  • Linux operating system: Linux, an open source operating system, provides the framework in which the other components operate. The major distributions are Debian, Ubuntu, CentOS, Red Hat, and Fedora. These distributions generally provide a package system that includes a complete LAMP stack.
  • Apache web server: While Apache remains the primary web server, other web servers such as Tomcat and Nginx have also been incorporated into the stack.
  • MySQL or MariaDB database management system: Other DBMSs, including the MongoDB NoSQL DBMS, have been integrated into the LAMP stack.
  • PHP, Perl, or Python programming systems: Scripting systems that allow relative nonprogrammers to write small server-side programs to manage data and perform simple analysis (a minimal server-side sketch follows this list).
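As a sketch of what the scripting layer does, the fragment below queries a MySQL/MariaDB database and renders a small HTML page for the Apache layer to serve. The table name, credentials, and the third-party pymysql driver are assumptions for illustration; in a typical LAMP deployment the equivalent script would more often be written in PHP.

    # Sketch of a server-side script in the "P" layer of a LAMP stack.
    # Assumes a local MySQL/MariaDB server and the third-party pymysql driver.
    import pymysql

    def orders_page():
        conn = pymysql.connect(host="localhost", user="app",
                               password="secret", database="shop")   # assumed credentials
        try:
            with conn.cursor() as cur:
                cur.execute("SELECT id, total FROM orders ORDER BY id LIMIT 10")
                rows = cur.fetchall()
        finally:
            conn.close()
        # Render the result as HTML for the web server to return to the browser.
        items = "".join("<li>Order %s: %s</li>" % (oid, total) for oid, total in rows)
        return "<html><body><ul>%s</ul></body></html>" % items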

The structure of the LAMP stack is depicted in Figure 4.1.

The Leap Stack

The LEAP stack is a stack for cloud-based solution infrastructure that consists of four elements:

  • The Linux operating system
  • Eucalyptus, free and open-source software for building Amazon Web Services-compatible private and hybrid cloud computing environments
  • AppScale, a platform that automatically deploys and scales unmodified Google App Engine applications over public and private cloud systems
  • Python programming language

Figure 4.1 The LAMP stack

MAMP/WAMP Stacks

The MAMP/WAMP stacks are composed of four elements, of which three (the web server, the database, and the scripting languages) are common to both stacks. The difference is the operating system: the Macintosh operating system for MAMP and the Windows operating system for WAMP.

  • MacOS or Windows operating system
  • Apache Web Server
  • MySQL database management system
  • PHP, Perl, or Python programming systems

These two stacks operate in a manner similar to the LAMP stack but are built from different components.

Real-Time Big Data Analytics Stack

The real-time big data analytics (RTBDA) stack is an emerging stack that focuses on predictive analytics (Barlow 2013). Figure 4.2 depicts a modified adaptation from Barlow.

In the data layer, data is provided from multiple sources, including both streaming and persistent sources. The analytic layer contains the different analytics for processing the data. It also contains the local data mart (a subset of a data warehouse, typically focused on a small portion of the total data stored in the warehouse) to which data that is about to be processed is forward deployed. Depending on the context and the problems to be worked, not all data may be forward deployed, but only the data necessary for immediate processing.

Figure 4.2 RTBDA stack

Source: Adapted from Barlow (2013).

The integration layer merges the different processed data streams into a single useful block of data or information. Finally, in the delivery layer, processed data and information for decision making and control are presented. The types of high-level analytics most closely associated with the actual business decisions to be made are located in this layer. In addition, applications or modules that transmit data to other locations for use may be located here.

Fedora Repository Stack

The Fedora repository stack, initially developed at the University of Virginia but now an open community project, provides a structure for storing and archiving data sets. Figure 4.3 depicts the structure of the Fedora repository stack.

Fedora is a robust, modular, open source repository system for the management and dissemination of digital content. It is especially suited for digital libraries and archives, both for access and preservation. It is also used to provide specialized access to very large and complex digital collections of historic and cultural materials as well as scientific data. Fedora has a worldwide installed user base that includes academic and cultural heritage organizations, universities, research institutions, university libraries, national libraries, and government agencies. The Fedora community is supported by the stewardship of the DuraSpace organization. (https://wiki.duraspace.org/display/FF/Fedora+Repository+Home)

Figure 4.3 Fedora repository stack

Source: Adapted from the Fedora website.

The REST framework provides access to the repository services through web services that can be invoked from a number of different web browsers. Currently, it uses the Drupal content management system running within the user’s web browser. Fedora services provide the essential set of services for archiving, managing, and disseminating data sets. Repository services manage the individual data sets and documents stored in the repository. The caching, clustering, and storage services interact with the underlying operating system to manage the physical disks within the repository computer system.

Analytical Frameworks

An analytical framework is a software application framework that provides a structure for hosting applications and providing runtime services to them. Application frameworks allow the application developers to focus on the problem to be solved rather than the challenges of delivering a robust infrastructure to support web applications or distributed processing/parallel processing (Kaisler 2005). Several analytical frameworks are briefly described in Table 4.3.

The allure of hardware replication and system expandability represented by cloud computing, along with the MapReduce and MPI parallel programming systems, offers one solution to these infrastructure challenges through a distributed approach. Even with this creative approach, significant performance degradation can still occur because of the need for communication between the nodes and movement of data between the nodes (Eaton et al. 2012).
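The MapReduce programming model just mentioned can be illustrated without a cluster. The following is a minimal, single-process sketch of a word count expressed as map, shuffle, and reduce phases; a real Hadoop job would express essentially the same two functions and let the framework distribute the work and the data across nodes.

    # Word count expressed in the MapReduce style (single-process sketch).
    from collections import defaultdict

    def map_phase(document):
        # Emit a (word, 1) pair for every word, as a mapper would.
        for word in document.split():
            yield (word.lower(), 1)

    def reduce_phase(word, counts):
        # Sum the counts for one key, as a reducer would.
        return (word, sum(counts))

    documents = ["big data needs big infrastructure", "data about data"]

    # "Shuffle": group intermediate pairs by key before reducing.
    grouped = defaultdict(list)
    for doc in documents:
        for word, count in map_phase(doc):
            grouped[word].append(count)

    results = [reduce_phase(word, counts) for word, counts in grouped.items()]
    print(sorted(results))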

Table 4.3 Selected analytical frameworks

System/website

Brief description

Hadoop MapReduce

http://hadoop.apache.org/

MapReduce, an idea first originated by Google, is both a programming model and a runtime environment for distributing data and supporting parallel programming on commodity hardware to overcome the performance challenges of Big Data processing.

Pig

http://pig.apache.org/

Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. It offers a subset of the functionality contained in SQL, but adapted to NoSQL databases. Pig queries are amenable to parallelization, which makes Pig an easy-to-learn and easy-to-use query language for large data sets.

Open Service Gateway Initiative (OSGI)

http://www.osgi.org/Main/HomePage

OSGI is a specification for a distributed framework composed of many different reusable components. It enables components to hide their implementations from other components while communicating through services, which are objects that are specifically shared between components. It is designed and specified by the OSGI Alliance, a worldwide consortium of technology innovators dedicated to open source implementations that enable the modular assembly of software built with Java technology.

Message Passing Interface (MPI)

http://www.open-mpi.org/

Developed at Argonne National Laboratory, MPI is a software platform and set of libraries for enabling communication and coordination among a distributed set of processes via a message-passing paradigm. It supports multiple 32- and 64-bit operating systems and several programming languages.

KNIME

https://www.knime.org/

KNIME is an open source platform for data-driven analytics that includes multiple components for data access, data transformation, analysis and data mining, visualization, and deployment.

Programming Systems

Aside from the standard suite of programming and scripting languages that are used today (such as Java, Python/Perl, PHP, C/C++/C#, VSBASIC, etc.), numerous programming systems have been developed and used that provide support for the development of analytic applications.

One class of programming systems is integrated development environments (IDEs) for supporting software development. The two leading IDEs are Eclipse and Netbeans. BlueJ, developed by the University of Kent and supported by Sun Microsystems, is an IDE primarily used in academic and small-team environments. Table 4.4 presents some popular open source systems that have been used by many organizations.

Table 4.4 Selected development systems

System/website

Brief description

Python

https://www.python.org/

An interpreted software language, which has become very popular for analytics because it is object oriented and simple, and facilitates rapid programming. It has an extensive library of contributed packages that are applicable to many different types of analytic problems.

Eclipse

https://www.eclipse.org/

Eclipse is an integrated development environment (IDE), originally developed by IBM, but then converted to an open source community project. It supports a wide variety of programming languages including Ada, C/C++, PHP, Python, Ruby, FORTRAN, and Javascript among others.

Netbeans

https://netbeans.org/

Netbeans is an IDE originally developed for Java by Sun Microsystems, but then converted to an open source community project. It recently has been extended to support PHP, Ruby, C, Javascript, and HTML. Versions of Netbeans run on the major operating systems.

BlueJ

http://www.bluej.org

A simple IDE geared toward academic environments. However, if your staff has no previous experience with IDEs, this tool is easy to use, provides very good graphical support for understanding the structure of an application, and offers good support for debugging applications. It supports Java only.

Martin Fowler (2011) has observed that many large-scale Big Data analytics (BDA) systems exhibit polyglot programming. Polyglot programming was defined by Neal Ford (2006) as the development of applications in a mix of programming languages. While as recently as a decade ago most applications were written in a single programming language, the influence of the model-view-controller paradigm that underlies web development has led to the use of different languages for different components of the system. Each component has different performance requirements, structural representations, and functional processing needs. Thus, it makes sense to select a programming language most suitable for each of these needs.

Data Operations Flow

In order to put the data to use, that is, to find the value in it, it is not enough to have Big Data that one can query. A business owner needs to be able to extract meaningful, useful, and usable information for strategic, tactical, and operational decision making in the business environment. An organization cannot begin analyzing raw Big Data right away for several reasons: the data needs to be cleansed, perhaps formatted for specific analytics, and even transformed or translated to different scales to yield useful results. This problem has been characterized as the ETL (extract, transform, load) problem.

Data operations flow is the sequence of actions from acquisition through storage and analysis, eventually to visualization and delivery. Table 4.5 briefly describes some of the operations to be performed on Big Data.

Data Acquisition

With Big Data, it is often easier to get data into the system than it is to get it out (Jacobs 2009). Data entry and storage can be handled with processes currently used for relational databases up to the terabyte range, albeit with some performance degradation. However, for petabytes and beyond new techniques are required, especially when the data is streaming into the system.

Many projects demand “real-time” acquisition of data. What is often not understood is that real-time online algorithms are constrained by time and space limitations; if the amount of data is allowed to be unbounded, these algorithms no longer function as real-time algorithms. Before specifying that data should be acquired in “real time,” we need to determine which processing steps are time-bound and which are not. This implies that designing a real-time system does not mean every stage must be real-time, but only selected stages. In general, this approach can work well, but we must be careful about scaling. All systems will break down when the scale or volume exceeds the maximum velocity and volume (plus a fudge factor) for which the system was designed. Over-engineering a real-time system is a good idea because bursts are often unpredictable and create queuing problems with unpredictable queue lengths and waiting times for processing or transport of data. Table 4.6 presents a sample of the many challenges in data acquisition. Table 4.7 presents some data acquisition tools that address many of these issues.

Table 4.5 Big Data operations

Operation

Description

Acquiring/capturing

Data acquisition is the first step in utilizing Big Data. Because Big Data exists in many different forms and is available from many different sources, you will need many different acquisition methods. Many of these may already be built into your business operations systems.

Management

Data curation is the active management of data throughout its life cycle. It includes extracting, moving, cleaning, and preparing the data.

Cleansing

Data cleansing includes verifying that the data are valid within the problem parameters and removing outliers from the data set. However, there is a risk that the phenomena may not be fully understood when outliers are removed.

Storing

Data storing is a key function, but involves critical decisions about whether to move or not move data from the point of origin to the point of storage or the point of processing; how much data to keep (all or some); whether any data is perishable and must be analyzed in near real-time; and how much data must be online versus off-line—among other aspects.

Movement

Data movement is a critical problem when the volume of data exceeds the size of the applications that will be processing it. Moving a gigabyte of data across a 100 Mbit/s Ethernet takes more than 80 seconds, depending on the network protocols and the handling of data at either end of the transfer route (a back-of-the-envelope calculation appears after this table). Data movement should be minimized as much as possible. Moving programs to the data locations makes more efficient use of network bandwidth, but the tradeoff is the need for more powerful servers at the data storage sites.

Analyzing/processing

In data analysis, there are basically two paradigms: batch and stream processing. Batch processing has high latency, whereas stream processing analyzes small amounts of data as they arrive. Stream processing has low latency and, depending on the arrival rate, the volume can mount up very quickly. The section “Data and Information Processing” describes some of the analytics and processing approaches and challenges for Big Data.
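The movement estimate in Table 4.5 is simple arithmetic worth rechecking against your own network before committing to a design. The sketch below uses only the raw link rate and ignores protocol overhead; the example sizes and link speeds are assumptions.

    # Back-of-the-envelope data movement time (raw link rate, no protocol overhead).
    def transfer_seconds(data_bytes, link_bits_per_second):
        return data_bytes * 8 / link_bits_per_second

    GIGABYTE = 10**9
    TERABYTE = 10**12
    print(transfer_seconds(GIGABYTE, 100e6))   # 1 GB over 100 Mbit/s -> 80.0 seconds
    print(transfer_seconds(TERABYTE, 10e9))    # 1 TB over 10 Gbit/s  -> 800.0 seconds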

Table 4.6 Some challenges for Big Data acquisition and storage

Has quality control been maintained throughout the acquisition of the data? If not, are quality control parameters clearly identified for different data subsets?

What validation and verification procedures were used during data acquisition?

How to handle data lost during the data acquisition process? What if it is due to acceptance and storage problems?

Guaranteeing reliability and accessibility of data given the axiom of maintaining one copy of data, not many (Feinleb 2012).

How does one drive the acquisition of high-value data?

A corollary to the above: How does one predict what data is needed to plan, design, and engineer systems to acquire it?

How does one decide what data to keep, e.g., what data is relevant, if it cannot all be kept?

How does one access and manage highly distributed data sources?

Table 4.7 Selected data acquisition tools

Tool/website

Brief description

Storm

http://storm.apache.org/

Storm is an open-source event processing system developed by BackType, but then transferred to Apache after acquisition by Twitter. It has low latency and has been used by many organizations to perform stream processing.

IBM InfoSphere Streams

http://www-01.ibm.com/software/data/infosphere/

InfoSphere Streams is a platform and execution environment for stream processing analytics that supports a diversity of streams and stream processing techniques.

Apache Spark Streaming

https://spark.apache.org/streaming/

Apache Spark is a fast and general-purpose cluster computing system overlaid on commodity hardware. It provides high-level APIs in Java, Scala, Python, and R and an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming.

You also need to determine what is “real-time.” It can depend on the context in which it is used: options trading, missiles flying, driving down the highway at 75 mph. Typically, what we mean by “real-time” is that the system can respond to data as it is received without necessarily storing it in a database first. This means that you can process the data as it arrives as opposed to storing it and processing it later. In effect, you can respond to events as they are happening and situations as they are unfolding.

Input Processes

Data can be acquired in batches or in streams. Batches may be small, medium, or large, but as they aggregate, and the frequency with which they are acquired increases, they can amount to Big Data very easily. On the other hand, streaming data arrives continuously and, often, at high speed.

Streaming is the “continuous” arrival of data from a data source. Arrival may be periodic, at fixed intervals, with fixed or variable amounts of data. More often, streaming is assumed to be a continuous flow of data that is sampled to acquire data. In the worst case, sampling acquires every piece of data, but often it involves selecting only certain types of data or sampling on a periodic basis. A good example is social media, where posts are made at varying intervals depending on the poster. In general, the flow of information appears continuous, but is really aperiodic.

Unlike a data set, a data source has no beginning and no end. One begins collecting data and continues to do so until one has enough data or runs out of patience or money or both. Stream processing is the act of running analytics against the data as it first becomes available to the organization. The data streams in with varied speed, frequency, volume, and complexity. The data stream may dynamically change in two ways: (1) the data formats change, necessitating changes in the way the analytics process the data, or (2) the data itself changes, necessitating different analytics to process it. A complicating factor is the implicit assumption that data streams are well-behaved and that the data arrive more or less in order. In reality, data streams are not so well-behaved and often contain disruptions and mixed-in data, possibly unrelated to the primary data of interest.
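In code, a stream source is typically treated as an unbounded iterator that the application samples or filters rather than reads to completion. The following is a minimal sketch only; the simulated social media feed and the keep-every-tenth-relevant-post sampling rule are assumptions chosen for illustration.

    # Sketch: sampling an unbounded data stream instead of reading it to completion.
    import itertools
    import random

    def social_media_stream():
        # Stand-in for a real feed: posts arrive indefinitely, at irregular intervals.
        post_id = 0
        while True:
            post_id += 1
            yield {"id": post_id, "text": "post %d" % post_id,
                   "relevant": random.random() < 0.3}

    def sample(stream, every_nth=10):
        # Keep every n-th record that matches the filter; discard the rest.
        kept = 0
        for record in stream:
            if record["relevant"]:
                kept += 1
                if kept % every_nth == 0:
                    yield record

    # Consume only a finite slice of the infinite stream for this illustration.
    for record in itertools.islice(sample(social_media_stream()), 5):
        print(record["id"])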

Data Growth versus Data Expansion

Most organizations expect their data to grow over their lifetime as the organization increases its services, its business and business partners and clients, its projects and facilities, and its employees. Few businesses adequately consider data expansion, which occurs when the data records grow in richness, when they evolve over time with additional information as new techniques, processes, and information demands evolve. Most data is time-varying—the same data items can be collected over and over with different values based on a timestamp. Much of this time-stamped data is required for retrospective analysis—particularly that which is used in estimative and predictive analytics.

As the volume of data grows, the “big” may morph from the scale of the data warehouse to the amount of data that can be processed in a given interval, say 24 hours, using the current computational infrastructure. Gaining insight into the problem being analyzed is often more important than processing all of the data. Time-to-information is critical when one considers (near) real-time processes that generate near-continuous data, such as radio frequency identifiers (RFIDs—used to read electronic data wirelessly, such as with EZPass tags) and other types of sensors. An organization must determine how much data is enough in setting its processing interval because this will drive the processing system architecture, the characteristics of the computational engines, and the algorithm structure and implementation.

Output Processes

A major issue in Big Data system design is the output processes. Jacobs (2009) summarized the issue very succinctly—“… it’s easier to get the data in than out.” However, the tools designed for transaction processing, which add, update, search for, and retrieve small to large amounts of data, are not capable of extracting the huge volumes of Big Data, and such extractions cannot be executed in seconds or even in a few minutes.

How to access very large quantities of semi- or unstructured data, and how to convey this data to visualization systems for assessment and managerial or executive decision making is a continuing problem. It is clear the problem may neither be solved by dimensional modeling and online analytical processing, which may be slow or have limited functionality, nor by simply reading all the data into memory and then reading it out again, although recent in-memory data analysis systems have demonstrated an ability to analyze terabytes of data in a large memory.

Technical considerations that must be factored into the design include the ratio of the speed of sequential disk reads to the speed of random disk access: with current technology, random reads from disk are roughly 150,000 times slower than sequential reads. Joined tables, an assumed requirement for associating large volumes of disparate but somehow related data, perhaps related only by observations over time, will come at further huge performance costs (Jacobs 2009).
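The scale of that ratio is easier to appreciate with a rough calculation. This is a sketch only: the 10 TB table size and the 200 MB/s sequential transfer rate are assumed values chosen to show the order of magnitude, and the 150,000x slowdown is the ratio cited above.

    # Rough effect of the sequential/random read ratio (assumed drive parameters).
    TERABYTE = 10**12
    data_bytes = 10 * TERABYTE                 # a 10 TB table (assumed)
    sequential_rate = 200e6                    # 200 MB/s sequential transfer (assumed)
    slowdown = 150_000                         # random vs. sequential disk reads

    seq_hours = data_bytes / sequential_rate / 3600
    rand_years = seq_hours * slowdown / (24 * 365)

    print("sequential scan:     about %.0f hours" % seq_hours)   # ~14 hours
    print("fully random access: about %.0f years" % rand_years)  # ~240 years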

Data Curation

Once acquired, most data is not ready for immediate use. It must be cleansed to remove inaccuracies, transformed to canonical forms, and even restructured to make it easier to manage and access. A critical aspect is identifying and associating metadata to the data sets in order to make them searchable and to understand what is contained therein.

Data curation includes extracting, moving, cleaning, and preparing the data, which includes the process of data enrichment. The University of Illinois Graduate School of Library and Information Science defined data curation as the active management of data throughout its useful life cycle (CLIR 2012)—a more robust handling of data. It is all about improving the quality of the data in hand. Curation is an important process because data is often acquired with varying degrees of reliability, accuracy, precision, and veracity. One aspect of curation is to canonicalize the data to a common set of units or to base concepts. Poor data quality can skew or invalidate results produced by the best algorithms. Data curation tools cannot replace human analysis and understanding; however, with Big Data, human beings cannot do it all (or even a small part of it), and so we must trust the tools to assist in assessing and improving data quality. Table 4.8 presents some elements of the digital curation process.

Data Cleansing

Data cleansing is the process of detecting and correcting or removing inaccurate data from a data set. It includes verifying that the data are valid within the problem parameters and removing outlying data, while risking that the phenomena may not be fully understood. The objective of data cleansing is to ensure a consistent data set that will allow some degree or measure of trust in the results generated from it. Table 4.9 presents some criteria for data cleansing.

Table 4.8 Selected elements of digital curation

Know your data—both what you acquire and what you generate. Develop an acquisition plan, a storage plan, a management plan, and a retirement plan. Authenticate the data with metadata so its provenance is established.

Creating data—for data you generate, associate the appropriate metadata with it.

Determine accessibility and use. Identify the classes of users of the data, what they can and cannot use, and what the accessibility constraints on the data are and how they will be enforced. Check periodically to ensure that the data is accessible and uncorrupted.

Establish selection criteria—you may not be able to keep all of the data that you acquire. Establish selection criteria that determine what data you will keep based on its relevancy to the organization’s mission and operation. One criterion to consider is perishability—how useful the data is or will be after a given time.

Disposal plan and operations—it is not simply a matter of erasing digital data or throwing paper or other media in a recycling bin. Some data is sensitive and must be properly discarded—shredding of paper and other media, overwriting data stores with zeroes, and so on. Periodically appraise your data to determine whether it is still relevant and useful for your organization’s mission and operations.

Ingest the data—including data curation, cleaning, and transforming prior to storing it. It may be desirable or necessary to transform the data into a different digital format before storing it—based on its intended use or efficiency of storage.

Preserving the data—ensuring the data is available, protected, and backed up so if the primary store is lost, the data will be recoverable.

Retention of data—are there legal requirements for retaining the data? For a fixed period? For a long time?

Access to the data—is access to the data limited by law? For example, HIPAA and the Privacy Act (amended) restrict access to personal data through personal identifiable information (PII). (Data that in itself or combined with other information can identify, contact, or locate a specific person, or an individual in a context.)

Data correction may include correcting typographical errors created during data collection. Or, it may be correcting values against a list of known entities or a range of valid values. For example, checking postal zip codes against the names of towns and ensuring the correct zip code is associated with the address.
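As a concrete, if simplified, version of the zip-code and range checks described above, the sketch below corrects values against a reference list and drops outliers. The reference table, field names, and the 100,000 threshold are hypothetical; a production cleansing step would typically log and retain rejected records rather than silently discarding them.

    # Sketch of two cleansing steps: value correction against a reference list,
    # and removal of out-of-range outliers. Reference data and thresholds are assumed.
    ZIP_BY_TOWN = {"Springfield": "62701", "Centerville": "52544"}   # hypothetical lookup

    records = [
        {"town": "Springfield", "zip": "62701", "order_total": 120.0},
        {"town": "Springfield", "zip": "97201", "order_total": 95.0},    # wrong zip
        {"town": "Centerville", "zip": "52544", "order_total": 1.2e6},   # outlier
    ]

    cleansed = []
    for rec in records:
        expected = ZIP_BY_TOWN.get(rec["town"])
        if expected and rec["zip"] != expected:
            rec = dict(rec, zip=expected)      # correct against the reference list
        if rec["order_total"] > 100_000:       # crude validity threshold (assumed)
            continue                           # drop the outlier (and risk losing signal)
        cleansed.append(rec)

    print(cleansed)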

Data Storage

As data science has become more important in the business analytics community, the data can often be aggregated and linked in different ways. Sharing and archiving have become important aspects of being able to use this data efficiently. Whether Big Data exists as a very large number of small data files, a few very large data sets, or something in between, a key challenge is organizing this data for access, analysis, updating, and visualization.

Table 4.9 Some criteria for data cleansing

Validity: A measure of the conformance of data to the criteria set established for useful data within the organization. Each data set should have a set of validity criteria and a mechanism for establishing the validity of each element. If the data set is refreshed—either periodically or continually—then the validity criteria should be checked on the same schedule.

Retention: Although data has been removed from the data set, should it be retained? For example, there may be legal requirements to retain the original data set even though the cleansed data set is used in further processing.

Duplicate removal: Removal of duplicates is often performed through data correction or outright removal. However, there may not be enough information available to determine whether duplicate removal should occur. The risk is loss of information.

Streaming data: Cleansing of streaming data is problematic for at least two reasons. First, decisions about correction, transformation, and duplicate removal have to be made within a short time frame and, often, with insufficient information. Second, if the data recurs, a question arises as to whether the earlier cleansing was the right decision.

User perspective: Different users may have different perspectives on how data should be cleansed for a particular problem. For example, the question of “how many employees do we have” should result in different answers depending on the person asking the question. This implies that different versions of the data set must be maintained to satisfy different user perspectives.

Data homogenization: Sometimes, data must be homogenized to eliminate bias due to data collection methodologies and variability in instruments—whether physical or textual, such as opinion survey instruments. Homogenization is a transformation of the data to a base value, with scaling to ensure that all data represent the same phenomena.

This is a fundamental change in the way many businesses manage their data—a cultural change, where silos still often prevail among divisions. This problem is not unique to businesses, however, as government and science also experience the same challenges. Although the Federal government and a few state governments have moved toward increased sharing of data, including the Federal government’s www.data.gov website, it is a slow process to overcome institutional and individual barriers.

To archive data for future use, one needs to develop a management plan, organize it, and physically store it in both an accessible repository as well as a backup repository. Data management is briefly described in the next section. Here, we will be concerned about technology for storing and archiving data.

The capacity of disk drives seems to be doubling about every 18 months due to new techniques, such as helical recording, that lead to higher-density platters. However, disk rotational speed has changed little over the past 20 years, so the bottleneck has shifted—even with large disk caches—from disk drive capacity to getting data on and off the disk. In particular, as the National Academy of Sciences (2013) noted, if access to a 100-TByte disk is mostly random, the data cannot be read in any reasonable time.

Data storage is usually divided into three structures: flat files, relational databases, and NoSQL databases. The first two are well-known structures that have been discussed extensively elsewhere.

Flat Files

Flat files are the oldest type of data storage structure. Basically, they are a set of records whose structure is determined by an external program. One of the most common types of flat file is the “csv” (comma-separated values) file, in which an arbitrary sequence of tokens (e.g., words, numbers, or punctuation other than commas) is separated by commas. The interpretation of the meaning of the sequence of tokens is provided by one or more application programs that read and process one record at a time. CSV files are often used to exchange data between applications, including across distributed systems, because they are text files.
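Because the structure of a flat file lives in the program rather than in the file, even reading one requires the application to know what each column means. The following is a minimal sketch using Python's standard csv module; the sales data and column names are invented for illustration.

    # Sketch: the application, not the file, supplies the meaning of each CSV column.
    import csv
    import io

    # Stand-in for a file on disk, e.g., sales.csv.
    raw = io.StringIO("date,store,amount\n2015-03-01,14,199.95\n2015-03-01,7,42.50\n")

    total = 0.0
    for row in csv.DictReader(raw):       # the header row names the fields
        total += float(row["amount"])     # the program decides this column is a number
    print(total)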

Relational Database Management Systems

Relational database management systems (RDBMSs) have been available both commercially and as OSS since the early 1980s. As data sets have grown in volume, relational DBMSs have provided enhanced functionality to deal with the increased volumes of data.

RDBMSs implement a table model consisting of rows with many fields. Each table has a primary key, consisting of one or more fields, that identifies each row. One way to think of relational databases is to view a table as a spreadsheet where the column headers are the field names and the rows are the records. A database with multiple tables is then like a set of spreadsheets, where each spreadsheet represents a different table. This model can implement complex relationships, because values in a field of one table can appear in a field of another table, thus linking the tables together.
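The table model and the linking of tables can be demonstrated with the SQLite engine bundled with Python. The schema below is invented for illustration; the point is the primary keys and the join that associates the two tables.

    # Sketch of the relational model: two tables linked by a key, queried with a join.
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE orders    (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL);
        INSERT INTO customers VALUES (1, 'Acme'), (2, 'Globex');
        INSERT INTO orders    VALUES (10, 1, 250.0), (11, 1, 75.0), (12, 2, 40.0);
    """)

    rows = conn.execute("""
        SELECT c.name, SUM(o.total)
        FROM customers c JOIN orders o ON o.customer_id = c.id
        GROUP BY c.name
    """).fetchall()
    print(rows)    # e.g., [('Acme', 325.0), ('Globex', 40.0)]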

Much has been written about relational databases, and hence we will not address them further here. The Further Reading section contains a number of references on relational databases.

NoSQL Databases

NoSQL is a database model that differs from the traditional relational database and flat file models. It implements a simplified data model that gives up some ability to represent and manage detailed structural relationships in exchange for speed of access and large-volume scalability. Typical implementations use key-value, graph, or document representations. The term “NoSQL” is often interpreted to mean “Not Only SQL,” and NoSQL databases can support a subset of standard SQL-like queries.

NoSQL Properties. NoSQL databases provide three properties, known by the acronym BASE:

  • Basically available: Since a NoSQL database is often distributed across multiple servers, parts of it may not always be consistent. However, the goal is to ensure that most of it is available all of the time.
  • Soft state: The state of the system may change over time, even without input. This is because of the eventual consistency model.
  • Eventual consistency: Given a reasonably long duration over which no changes are sent, all updates will propagate throughout the system. This means that some copies of the data distributed across multiple servers may be inconsistent.

Types of NoSQL Databases. NoSQL databases utilize different structures to provide different levels of storage flexibility and performance. You should select a database technology that fits both the natural representation of the data and the way that you primarily expect to access the data. Table 4.10 describes the four major types of NoSQL database structures.

Table 4.10 NoSQL database types

Type

Brief description

Column

Column-family databases store data in column families as rows that have many columns associated with a row key. Each column can be thought of as a set of rows in a relational database. The difference is that various rows do not have to have the same columns, and columns can be added to any row at any time without having to add it to other rows. Column stores do not need indexes for fast access to the database contents.

Key-value

Key-value stores are the simplest NoSQL data stores to use from an access perspective. A client can either get the value for the key, put a value for a key, or delete a key from the data store. The value is a blob of data that the data store just stores, without caring or knowing what is inside; it is the responsibility of the application to understand what was stored. Key-value stores generally have very good performance and are easily scaled.

Document

A document is the unit of data stored in a document store such as MongoDB. The document can be represented in many ways, but XML, JSON, and BSON are typical representations. Documents are self-describing tree structures that can consist of maps, collections, and scalar values. The documents stored are similar to each other but do not have to be exactly the same; in effect, the documents are stored in the value part of a key-value store.

Graph

Graph databases allow you to store entities and the relationships between them. Entities are also known as nodes, which have properties. Relationships are known as edges, which can also have properties. Edges have directional significance; nodes are organized by relationships, which allows you to find interesting patterns among the nodes. The graph model can also represent data as RDF triples of the form <subject, predicate, object>. Complex relationships can be represented in this format, including network and mesh data structures that feature multiple named edges between named nodes.
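To make the differences concrete, the sketch below lays out the same hypothetical fact under each of these models using plain Python structures. It illustrates the shape of the data only, not any particular product's API.

    # One fact ("Acme placed order 10 for 250.0") under different NoSQL data models.

    # Key-value: the store sees only an opaque blob behind a key.
    key_value = {"order:10": b'{"customer": "Acme", "total": 250.0}'}

    # Document: a self-describing, nestable record (JSON/BSON-like).
    document = {"_id": 10, "customer": {"name": "Acme"}, "total": 250.0,
                "lines": [{"sku": "X-1", "qty": 2}]}

    # Column-family: a row key mapping to an open-ended set of named columns.
    column_family = {"order#10": {"customer": "Acme", "total": "250.0"}}

    # Graph: RDF-style <subject, predicate, object> triples.
    triples = [("order:10", "placedBy", "customer:Acme"),
               ("order:10", "hasTotal", "250.0")]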

Selected Popular NoSQL Databases

The ease of use and popularity of the NoSQL model has spawned a rich competitive environment in developing NoSQL database systems. Table 4.11 presents selected NoSQL database systems that are widely used across the Big Data community.

Table 4.11 Selected NoSQL database systems

Database/website

Brief description

HBase

http://hbase.apache.org/

HBase is an open-source, nonrelational database that stores data in columns. It runs on top of the Hadoop distributed file system (HDFS), which provides fault-tolerance in storing large volumes of data. Typically, it is used with Zookeeper, a system for coordinating distributed applications, and Hive, a data warehouse infrastructure (George 2011).

MongoDB

https://www.mongodb.org/

MongoDB is a distributed document-oriented database that stores JSON-like documents with dynamic schemas.

SCIDB

http://www.scidb.org/

SCIDB is an array database management system. Arrays are a natural way to organize, store, and retrieve ordered or multifaceted data. Data extraction is relatively easy: by selecting any two dimensions you extract a matrix. Array processing scales well across parallel processors and can yield substantial performance gains over other types of databases.

Accumulo

http://accumulo.apache.org/

Accumulo is a sorted, distributed key-value store which embeds security labels at the column level. Data with varying security requirements can be stored in the same table, and access to individual columns depends on the user’s permissions. It was developed by a government agency, but then released as open source.

Cassandra

http://cassandra.apache.org/

Cassandra is an open-source distributed database system that was designed to handle large volumes of data distributed across multiple servers with no single point of failure. Created at Facebook, Cassandra has emerged as a useful hybrid of a column-oriented database and a key-value store. Rows in the database are partitioned into tables, each of which has a primary key as its first component (Hewitt 2010).

Neo4J

http://neo4j.com/

Neo4J is an open-source graph database which implements a data model of nodes and edges—each described by attributes.

A larger listing of NoSQL databases of various types may be found at http://nosql-database.org/. An example of accessing one of these systems programmatically follows.
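A document store such as MongoDB, for instance, is usually accessed through a language driver rather than SQL. The following is a minimal sketch, assuming a MongoDB server on localhost and the third-party pymongo driver; the database, collection, and field names are invented.

    # Sketch: storing and retrieving a document in MongoDB
    # (assumes pymongo and a local MongoDB server).
    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017/")
    orders = client["shop"]["orders"]             # database and collection created lazily

    orders.insert_one({"_id": 10, "customer": "Acme", "total": 250.0,
                       "lines": [{"sku": "X-1", "qty": 2}]})

    doc = orders.find_one({"customer": "Acme"})   # query by any field, no schema required
    print(doc["total"])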

Advantages and Disadvantages of NoSQL Databases

As with RDBMSs, NoSQL databases have both advantages and disadvantages. The primary advantage of NoSQL databases is claimed to be scalability into the petabyte range. Although several RDBMSs claim this ability, it remains to be seen whether performance scales with data volume. A few of the advantages and disadvantages are described in Tables 4.12 and 4.13 (Harrison 2010).

Table 4.12 Advantages of NoSQL databases

Advantage

Brief description

Elastic scaling

NoSQL databases allow organizations to scale out (across distributed servers) rather than scaling up (with more expensive and faster hardware). Scaling out in cloud environments is relatively easy, often transparent, and allows organizations to adjust their data storage to fit their current requirements.

Big Data

As databases grow to volumes of petabytes and beyond, transaction processing times increase exponentially. By distributing transactions across multiple servers, processing time can be kept to near-linear increases.

Reducing the DBA role

DBA expertise is often expensive. The simple models used by NoSQL databases usually require only a basic understanding of data structures. Moreover, their inherent reliability, maintainability, and performance reduce the need for many highly trained database administrators.

Economies of scale

NoSQL databases tend to use clusters of low-cost commodity hardware coupled with mechanisms to ensure high reliability and fault tolerance. Thus, the cost per terabyte or per transaction per second is often significantly less than for a conventional RDBMS.

Flexible data models

The simple and flexible data models allow data to be easily described and changed. And, databases can be expanded through new fields and columns without affecting the rest of the database. NoSQL databases are often said to be “schema-free” in that data is represented as simple JSON structures that may be rapidly searched through the application programming interface (API).

Table 4.13 Disadvantages of NoSQL databases

Disadvantage

Brief description

Maturity

Some NoSQL DBMSs are relatively mature, although all are less than 10 years old. Some data models are still evolving, as are the reliability and functionality of the software.

Support

Most NoSQL DBMSs are open-source projects, meaning that the source code is maintained by many, sometimes anonymous, contributors and continues to evolve at varying rates.

Analytics support

Most NoSQL DBMSs provide a Java or other programmatic API, which may be difficult to use without significant programming experience. The emergence of higher-level query languages such as Pig Latin and Hive’s HiveQL can provide easier access to data, but at the loss of some rich SQL functionality.

Administration

NoSQL databases are not zero-administration databases. Organizations must still retain expertise to describe the data within the data model supported by the DBMS, and most NoSQL DBMSs require considerable installation, operational, and infrastructure expertise to run in a distributed server environment.

Data model expertise

Even with the simple models provided by NoSQL DBMSs, expertise is still required to map external data structures into the internal data model representation and to tune those structures to achieve high performance.

Data Ownership

Data ownership presents a critical and ongoing challenge, particularly in the social media arena. While petabytes of social media data reside on the servers of Facebook, MySpace, and Twitter, that data is not really owned by them (although they may contend so because of residency). Certainly, the “owners” of the pages or accounts believe they own the data. This dispute will have to be resolved in court. If your organization uses a lot of social media data extracted from these sites, you should be concerned about the eventual outcome of these court cases. Kaisler, Money, and Cohen (2012) addressed this issue with respect to cloud computing, as well as other legal aspects that we will not delve into here.

With ownership comes a modicum of responsibility for ensuring the data’s accuracy. This may not be required of individuals, but it almost certainly is of businesses and public organizations. However, enforcing such an assumption (much less a policy) is extremely difficult. Simple user agreements will not suffice, since no social media purveyor has the resources to check every data item on its servers. Table 4.14 presents some ownership challenges.

Table 4.14 Some Big Data ownership challenges

When does the validity of (publicly available) data expire?

If the data’s validity has expired, should the data be removed from public-facing websites or data sets?

Where and how do we archive expired data? Should we archive it?

Who has responsibility for the fidelity and accuracy of the data? Or, is it a case of user beware?

With the advent of numerous social media sites, there is a trend in BDA toward mixing first-party, reasonably verified data with public and third-party external data that has largely not been validated and verified by any formal methodology. The addition of unverified data compromises the fidelity of the data set, may introduce irrelevant entities, and may lead to erroneous linkages among entities. As a result, the accuracy of conclusions drawn from processing this mixed data varies widely.

Unlike the collection of data by manual methods, where rigorous protocols were often followed to ensure accuracy and validity, digital data collection is much more relaxed. The richness and variety of digital data representations preclude a single, bespoke methodology for data collection. Data qualification often focuses more on missing data or outliers than on trying to validate every item. Data is often very fine-grained, such as clickstream or metering data. Given the volume, it is impractical to validate every data item: new approaches to data qualification and validation are needed.
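
As a small illustration of this coarser style of qualification, the following Python/pandas sketch flags records with missing keys and crude statistical outliers rather than validating every item; the column names, values, and z-score threshold are hypothetical.

# A sketch of coarse data qualification with pandas: flag missing keys and
# simple statistical outliers instead of validating every item.
import pandas as pd

df = pd.DataFrame({
    "user_id":   ["u1", "u2", None, "u4", "u5", "u6", "u7", "u8"],
    "dwell_sec": [3.2, 4.1, 2.8, 3.7, 3.0, 3.9, 2.6, 950.0],  # one implausible value
})

missing = df[df["user_id"].isna()]                                  # rows with missing keys
z = (df["dwell_sec"] - df["dwell_sec"].mean()) / df["dwell_sec"].std()
outliers = df[z.abs() > 2]                                          # crude z-score screen

print(len(missing), "rows with missing keys;",
      len(outliers), "outlier rows out of", len(df))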

Going forward, data and information provenance will become a critical issue. JASON has noted (2008) that “there is no universally accepted way to store raw data, reduced data, and … the code and parameter choices that produced the data.” Further, they note: “We are unaware of any robust, open source, platform-independent solution to this problem.” As far as we know, this remains true today. To summarize, there is no perfect Big Data management solution yet. This represents an important gap in the research literature on Big Data that needs to be filled.

Data Management

Data management will, perhaps, be the most difficult problem to address with Big Data. This problem first surfaced a decade ago in the UK eScience initiatives, where data was distributed geographically and “owned” and “managed” by multiple entities. Resolving issues of access, metadata, utilization, updating, governance, and reference (in publications) has proven to be a major stumbling block. Within a large organization, data is often distributed among different business operations and “owned” and “managed” by those business operations. Without a centralized data dictionary and uniform management practices, conflicts, including inconsistencies, representational differences, and data duplication, will arise between business operations.

If we consider that data and information are both sources of business strategies, the raw material of business operations, and the basis for metrics assessing how well the organization is doing, then we begin to understand how critical it is that they be managed well. “Managed well” is best described simply as delivering the right information to the right place (and people) at the right time in the right form at the right cost. To do so requires a strong, sound architecture, good processes, and effective project management. It is hard to get all three elements correct.

A good analogue is to think of an information supply chain. We need to ensure that information flows smoothly through the business from multiple sources through multiple points of processing and analysis to multiple storage facilities and, then, out to multiple points of consumption. As with other supply chains, an effective information supply chain can enhance customer satisfaction, better support analytics to plan and manage business operations, and facilitate regulatory compliance as required. It can also reduce the need for redundant data and leverage infrastructure and personnel resources to improve the cost–benefit ratio.

Data management is different from database management. It is concerned more with organizing the data in a logical, coherent fashion with appropriate naming conventions and managing the metadata associated with the data. However, data management also has the function of distributing large data sets across multiple, distributed processors such as cloud-based or geographically distributed systems.

The reason for distributing data is that single systems often cannot process the large volumes of data represented by Big Data, for example, terabytes or petabytes of data. Thus, to achieve reasonable performance, the data must be distributed across multiple processors, each of which can process a portion of the data in a reasonable amount of time. There are two approaches to distributing data:

  • Sharding: Sharding distributes subsets of the data set across multiple servers, so each server acts as the single source for a subset of data.
  • Replication: Replication copies data across multiple servers, so each bit of data can be found in multiple places. Replication comes in two forms:
    • Master-slave replication makes one node the authoritative copy that handles writes while slave nodes synchronize with the master and handle reads. Thus, multiple concurrent reads can occur across a set of slave nodes. Since reads are more frequent than writes, substantial performance gains can be achieved.
    • Peer-to-peer replication allows writes to any node. The nodes coordinate to synchronize their copies of the data, usually as background tasks. The disadvantage is that, for short periods of time, different nodes may hold different versions of the data.

Master-slave replication reduces the chance of conflicts occurring on updates. Peer-to-peer replication avoids loading all writes onto a single server, thus attempting to eliminate a single point of failure. A system may use either or both techniques. Some databases shard the data and also replicate it based on a user-specified replication factor.
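
The following Python sketch illustrates the two ideas together, using a generic placement rule (hash the key to pick a primary shard, then copy the record to the next few servers in order); it is not the placement algorithm of any particular database, and the server names and replication factor are illustrative.

# A minimal sketch of hash-based sharding with replication. The placement rule
# is a generic illustration, not any particular database's algorithm.
import hashlib

SERVERS = ["node-0", "node-1", "node-2", "node-3", "node-4"]
REPLICATION_FACTOR = 3

def placement(key):
    # Shard: hash the key to choose one primary home for the record.
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    primary = h % len(SERVERS)
    # Replicate: copy the record to the next REPLICATION_FACTOR - 1 servers.
    return [SERVERS[(primary + i) % len(SERVERS)] for i in range(REPLICATION_FACTOR)]

for key in ("customer:1001", "customer:1002", "order:77"):
    print(key, "->", placement(key))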

Some data management issues that you should consider are presented in Table 4.15 (Kaisler et al. 2013).

Data Enrichment

Data enrichment is the process of augmenting collected raw or processed data with existing data or domain knowledge to enhance the analytic process. Few, if any, business problems represent entirely new domains or memoryless processes. Thus, extant domain knowledge can be used to initialize the problem-solving process and to guide and enhance the analysis. Domain knowledge can also be used to enrich the data representation during and after analysis. Transformation through enrichment can add the domain knowledge necessary to enable analytics to describe and predict patterns and, possibly, prescribe one or more courses of action.
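
A small Python/pandas sketch of the idea: raw transactions are joined to a hypothetical reference table of domain knowledge (store to region) so that later analytics can aggregate by an attribute that is absent from the raw feed.

# A sketch of data enrichment: join raw records to a curated reference table.
# The tables, columns, and values are hypothetical.
import pandas as pd

raw = pd.DataFrame({
    "store_id": ["s1", "s2", "s1", "s3"],
    "amount":   [19.99, 5.00, 7.50, 42.00],
})
reference = pd.DataFrame({            # curated domain knowledge
    "store_id": ["s1", "s2", "s3"],
    "region":   ["Northeast", "Midwest", "Northeast"],
})

enriched = raw.merge(reference, on="store_id", how="left")
print(enriched.groupby("region")["amount"].sum())   # analysis by the enriched attribute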

Table 4.15 Some management challenges for Big Data

Are the data characteristics sufficiently documented in the metadata? If not, how difficult is it to find them?

If all the data cannot be stored, how does one filter/censor/process data to store only the most relevant data?

Can ETL (extract, transform, load) be performed on all data without resorting to external mass storage?

How much data enrichment must be performed before the data can be analyzed for a specific problem?

How much enrichment can be performed on acquired data without skewing or perturbing the results derived from the original data?

How does one handle outliers and uncorrelated data?

What is the tradeoff between integrating diverse data (multimedia, text, and web) and running more complex analytics on multiple data sets?

What visualization techniques will help to understand the extent and diversity of the data?

Data enrichment is directly influenced by another of the Vs: data veracity. About 80 percent of the work of analyzing data goes into preparing the data for analysis, although this percentage varies substantially with the quality of the data. According to Davenport (2006), the most important factor in using sophisticated analytics is the availability of high-quality data. This requires that data be precise (within context), accurate, and reliable. Accuracy of the data will influence the quality of the analytic output, per the old axiom “garbage in, garbage out.” Precision will influence the exactness of the analysis and, in numerical analysis, minimize the error bounds. Reliability (how well we trust the data to reflect the true situation) will influence the believability and value of the results. Ensuring veracity is a result of good data acquisition and curation practices.

An example of the value of domain knowledge that resonates with us is an AT&T television commercial from the early 1980s about selling telephone service to a trucking company. The AT&T representative approaches the general manager of the trucking company, who says, “Let’s talk about telephones.” The AT&T representative replies, “No, let’s talk about trucking; then we will talk about telephones.” This vignette captures the essence of the problem: understanding the domain is essential to being able to solve problems in the domain! The AT&T representative knew about telephones, but needed to understand the trucking company manager’s perspective in order to understand how AT&T’s services could impact and enhance his business.

Data Movement

The velocity dimension of Big Data often refers just to the acceptance of incoming data and the ability to store it without losing any of it. However, velocity also refers to the ability to retrieve data from a data store and to transmit it from one place to another, such as from where it is stored to where it is to be processed. Google, for example, wants its data to be transmitted as fast as possible. Indeed, it establishes internal constraints on its systems to ensure that they give up quickly on approaches that will take more time than a given threshold.

Network communications speed and bandwidth have not kept up with either disk capacity or processor performance. Of the 3Es—exabytes, exaflops, and exabits—only the first two seem attainable within the next 10 years, and only with very large multiprocessor systems. The National Academy of Science (2013) has noted that data volumes on the order of petabytes mean that the data cannot be moved to where the computing is; instead, the analytical processes must be brought to the data.

Data Retrieval

In order to use Big Data to assist in solving business operations problems, we need to find the relevant data that applies to the business problem at hand. Identifying the relevant data is a search problem where we generally wish to extract a subset of data that satisfy some criteria associated with the problem at hand.

Quality versus Quantity

An emerging challenge for Big Data users is “quantity versus quality.” As users acquire and have access to more data (quantity), they often want even more. For some users, the acquisition of data has become an addiction, perhaps because they believe that with enough data they will be able to perfectly explain whatever phenomenon they are interested in.

Conversely, a Big Data user may focus on quality, which means not having all the data available, but rather having a (very) large quantity of high-quality data from which precise and high-value conclusions can be drawn. Table 4.16 identifies a few issues that must be resolved.

Another way of looking at this problem is: what level of precision do you require to solve business problems? For example, trend analysis may not require the precision that traditional database systems provide, precision that would demand massive processing in a Big Data environment.

Data Retrieval Tools

The tools designed for transaction processing, which add, update, search for, and retrieve small to large amounts of data, are not capable of extracting the huge volumes typically associated with Big Data, and certainly not within seconds or a few minutes.

Some tools for data retrieval are briefly described in Table 4.17.

Table 4.16 Some quantity and quality challenges

How do we decide which data is irrelevant versus selecting the most relevant data?

How do we ensure that all data of a given type is reliable and accurate? Or, maybe just approximately accurate?

How much data is enough to make an estimate or prediction of the specific probability and accuracy of a given event?

How do we assess the “value” of data in decision making? Is more necessarily better?

Table 4.17 Selected data retrieval tools

Tool

Brief description

Drill

http://drill.apache.org/

Apache Drill is a low-latency distributed query engine for large-scale data sets, including structured and semi-structured/nested data. It is an open-source counterpart of Google’s Dremel. A version of Drill comes pre-installed in the MapR Hadoop sandbox to facilitate experimentation with drillbits, the components that receive and execute users’ queries.

Flume

http://flume.apache.org/

Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. It implements a streaming data flow model and is robust and fault tolerant, with tunable reliability and many failover and recovery mechanisms.

Table 4.18 Some data value challenges

For a given problem domain, what is the minimum data volume required for descriptive, estimative, predictive, and prescriptive analytics and decision modeling with a specified accuracy?

For a given data velocity, how do we update our data volume to ensure continued accuracy and support (near) real-time processing?

For a given problem domain, what constitutes an analytic science for non-numerical data?

“What if we know everything?”—What do we do next?

The Value of “Some Data” versus “All Data”

Not all data are created equal; some data are more valuable than other data—temporally, spatially, contextually, and so on. Previously, storage limitations required data filtering and deciding what data to keep. Historically, we converted what we could and threw the rest away (figuratively, and often, literally).

With Big Data and our enhanced analytical capabilities, the trend is toward keeping everything with the assumption that analytical significance will emerge over time. However, at any point in time, the amount of data we need to analyze for specific decisions represents only a very small fraction of all the data available in a data source, and most data will go unanalyzed. Table 4.18 presents some data value challenges.

New techniques are required for converting latent, unstructured text, image, or audio information into numerical indicators that make it computationally tractable, in order to improve the efficiency of large-scale processing. However, such transformations must retain the diversity of values associated with words and phrases in text or with features in image or audio files.

Data Visualization

Data visualization involves the creation and study of the visual representation of data. It is a set of techniques used to communicate data or information by encoding it as visual objects (e.g., points, lines, or bars) contained in graphics.

Table 4.19 describes the common types of data visualization methods, adapted from http://vadl.cc.gatech.edu/taxonomy/. A comprehensive set of visualization techniques is available at http://www.visualliteracy.org/periodic_table/periodic_table.html. A lack of space prohibits detailed descriptions, but further reading contains a number of references that will allow you to explore the different techniques.

Table 4.19 Common data visualization types

Type

Description

1D/linear

A list of data items, usually organized by a single feature. Useful for textual interfaces and frequently used in simple pages within web browsers.

2D/planar

A set of data items organized into 2D representations such as matrices, maps, charts, and plots.

3D/volumetric

A set of data displayed in a 3D space, such as a 3D plot, 3D models (wireframe, surface, and volume models), and animated computer simulations of different phenomena.

Temporal

A set of data is organized and displayed according to its temporal attributes using techniques such as timelines, time series, Gantt charts, stream graphs, rose charts, and scatter plots where time is one of the axes. Tools include MS Excel, the R time series libraries, and Google Charts.

Multidimensional

A set of data is organized according to multiple attributes, taken one at a time, by which the data set is partitioned into groups. Techniques include pie charts, histograms, tag clouds, bubble charts, and tree maps, among others.

Tree/hierarchy charts

A set of data is partitioned into sets and subsets, where each subset is described more specifically by the attributes of its items. Techniques include trees, dendrograms, tree maps, and partition charts.

Networks

A set of items is displayed as a partially connected graph where the nodes are connected by edges labeled by attributes shared by the nodes. Techniques include subway/tube maps, node-link diagrams, and alluvial diagrams.
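
To make two of these types concrete, the following Python sketch uses matplotlib to draw a temporal view (a time series line chart) and a simple distributional view (a histogram) of the same synthetic daily-sales data; the numbers are generated only for illustration.

# A minimal matplotlib sketch of two visualization types from Table 4.19:
# a temporal view (time series) and a distributional view (histogram).
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
days = np.arange(1, 91)                           # 90 days of synthetic data
sales = 100 + days * 0.8 + rng.normal(0, 5, 90)   # upward trend plus noise

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.plot(days, sales)                             # temporal view
ax1.set(title="Daily sales (temporal)", xlabel="Day", ylabel="Units")
ax2.hist(sales, bins=15)                          # distribution of one attribute
ax2.set(title="Distribution of daily sales", xlabel="Units", ylabel="Frequency")
plt.tight_layout()
plt.show()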

Communications and Networking

As little as three decades ago, our means of communication between individuals, between individuals and computers, and between computers were very limited. The advent of Ethernet as a simple-to-implement network and of the TCP/IP stack as a communication protocol unleashed a diversity of technologies and applications. Continued development of networking technology led to wireless communications, Wi-Fi, cellphones, tablets, and a plethora of mobile device applications.

As noted earlier, network communications speed and bandwidth have not kept up with either disk capacity or processor performance, and data volumes on the order of petabytes generally cannot be moved to where the computing is in a reasonable amount of time; instead, the analytical processes must be brought to the data. There also appear to be physical limitations on moving data at an exabit per second over current physical transmission media, although if data is compressed prior to transmission, it may be possible to achieve an apparent exabit-per-second bandwidth.

Over the past two decades, as wireless technology has evolved, transmission rates have continuously increased. These increases, accompanied by the shrinking of the cell phone and the emergence of other wireless devices (tablets, watches, Google Glass, and so on), have resulted in a new discipline, mobility science, which focuses on research and development in mobile devices and applications. Indeed, some of the fastest-growing conferences have to do with mobile technology, devices, and applications.

Computer and Information Security

Today’s threat environment is extremely challenging, nowhere more so than in the online world. The rapid growth in mobile applications and the increasing web-hopping by Internet users expose them to a myriad of threats. As the number of websites increases, so do the opportunities for computer and information security threats.

The threat environment has evolved rapidly. In the 1990s very few spam messages were sent. By 2014, Spam Laws (http://www.spamlaws.com/) estimated that over 14.5 billion spam messages were sent per day, comprising over 45 percent of the total e-mail transmitted across the Internet. As IPv6, the expanded Internet addressing protocol, is implemented across the Internet, vastly more unique IP addresses than IPv4’s roughly 4 billion will become available. This creates an enormous opportunity for cybercriminals to attack and exploit user sites. Defensive cyber technology applies to all sites, not just Big Data sites.

Cybercriminals have become extremely sophisticated in their methods of attack, exploitation, and exfiltration. They use modern software engineering techniques and methodologies to design, develop, and deploy their software. Polymorphic software can evolve and propagate in thousands of ways, many of which are undetectable by traditional methods. The newest types of complex and sophisticated threats, advanced persistent threats (APTs), are beyond the scope of this book. The attackers of today are not limited in their attack platforms or targets; in today’s environment, multiplatform means that mobile devices are also highly susceptible to attack. The volume of data, the increasing frequency of attacks, the diversity of targets, and the variety of threat software pose a Big Data challenge to organizations and individuals alike.

To defend themselves, organizations must gather, analyze, and assess more data than ever before at more locations within the organizational infrastructure. This means dedicating significantly greater resources to ensure the viability of the organization in the face of this aggressive and evolving threat environment. It is becoming blatantly clear that off-the-shelf solutions have not addressed and cannot address these problems. Scaling up resources for defense must be managed intelligently, because identifying and fixing each threat as it is found is not economically viable, and often not technically possible. The bad guys are too creative and highly skilled!

Most organizations in academia, government, and industry are expected to do much more to protect themselves against the exponentially increasing deluge of cyberattacks. Today’s organizations are adding security operations centers to meet these growing threats. In medieval times, one could notice when the Huns were attacking the city gates or someone had stolen half the grain supply. Today, cybercriminals can steal data while leaving the original data in place. In many cases, an organization does not know it has been attacked until its own data is used against it.

Security software organizations must not only collect, analyze, detect, and develop responses, but they must also develop tools to predict how malicious software will morph, where attacks might occur and originate, and how frequently they might occur.

Firewalls

A primary component of a security architecture is the firewall, a network security device that inspects and controls incoming and outgoing network traffic. Inspection and control are based on a set of rules defined by the organization; for example, many organizations reject traffic from known phishing sources and prohibit access to known websites that host malware. In effect, the firewall is intended to establish a barrier between a trusted network, such as a company’s internal network, and an untrusted external network, such as the Internet. Firewalls may be implemented in hardware, software, or a combination of the two, although it is generally accepted that hardware-based firewalls may be more secure. Most personal computer operating systems offer software-based firewalls.

Firewalls can operate at many levels: as packet filters for raw network traffic, as stateful systems that operate at the connection or message level, and as application-level systems that handle protocol-based applications such as browsers and file transfer programs.
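
The rule-driven behavior of a packet filter can be sketched conceptually in a few lines of Python; the rules below are evaluated first-match-wins and mimic the idea of rule-based filtering rather than any real firewall’s rule syntax.

# Conceptual sketch of a packet-filter rule table, first match wins;
# this mimics the idea, not any real firewall's rule syntax.
from ipaddress import ip_address, ip_network

RULES = [
    # (protocol, destination port, source prefix, action)
    ("tcp", 443,  "0.0.0.0/0",  "ACCEPT"),   # allow HTTPS from anywhere
    ("tcp", 22,   "10.0.0.0/8", "ACCEPT"),   # allow SSH only from the internal network
    ("any", None, "0.0.0.0/0",  "DROP"),     # default deny
]

def decide(protocol, dst_port, src_ip):
    for proto, port, prefix, action in RULES:
        if proto not in ("any", protocol):
            continue
        if port is not None and port != dst_port:
            continue
        if ip_address(src_ip) not in ip_network(prefix):
            continue
        return action
    return "DROP"

print(decide("tcp", 22, "10.1.2.3"))     # ACCEPT: internal SSH
print(decide("tcp", 22, "203.0.113.5"))  # DROP: external SSH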

Intrusion Detection System

An intrusion detection system (IDS) is hardware or software that monitors network traffic for malicious activities, policy violations, and attempts to bypass security mechanisms to gain access to an organization’s internal computer systems. Some IDSs also attempt to prevent intrusions through various mechanisms. Generally, an IDS will identify possible incidents of intrusion, log the information, and possibly send alerts to system administrators, ensuring the organizational security policies are heeded.

Passive systems usually log the incident and send alerts. Reactive systems will automatically attempt to prevent an intrusion by rejecting traffic or even shutting down the host system to prevent attack and corruption. IDSs may, like firewalls, look outward, but they also look inward at the operation of organizational systems for anomalous behavior, based on known models of operation and of communications between those systems.

It is impossible to detect every possible intrusion, and it is even harder to prevent intrusions that have taken or are taking place without first analyzing them. Often, prevention is difficult because of the time required to identify and analyze the attack. Detection becomes harder as system usage grows, because more computational effort must be focused on identifying anomalous behavior by comparison with normal operation; prevention further depends on understanding the nature and target of the attack and having adequate methods for preventing or defeating it.

A good reference for intrusion detection is NIST Special Publication 800-94, Guide to Intrusion Detection and Prevention Systems (2007), which is available at http://csrc.nist.gov/publications/nistpubs/800-94/SP800-94.pdf.

Virtual Private Network

A virtual private network (VPN) extends a private network across a public network, such as the Internet; in effect, the VPN rides on top of the public network. A VPN is implemented as a point-to-point connection between two nodes in which communications are encrypted before transmission and decrypted at the receiving end. Two commonly used VPN technologies are OpenVPN (https://openvpn.net/) and IPsec (http://ipsec-tools.sourceforge.net/).

Security Management Policies and Practices

Security management is the identification, classification, and protection of an organization’s information and IT assets. Both computer systems and data must be managed and protected. The objective of security management is to prevent the loss of information, the compromise of IT assets, and financial impact on an organization due to corruption or exfiltration of critical data or denial of service to internal and external clients. To do so, an organization must identify security threats, categorize them as to impact and likelihood, and develop methods for their prevention and mitigation.

Security management is implemented through a set of security policies developed and mandated by the organization. A security policy is a specification for achieving some level of security in an IT system, for example, a password policy. Security policies may be machine- or human-focused; in either case, they prescribe the behavior of the entity to which they are addressed. Policies are typically decomposed into subpolicies that specify operational constraints on the use and operation of IT systems and on access to organizational data.
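
As an illustration of a machine-enforceable policy, the following Python sketch checks candidate passwords against a simple password policy of the kind mentioned above; the specific rules are illustrative assumptions, not a recommendation.

# A sketch of a machine-enforceable password policy; the rules are illustrative.
import re

POLICY = {
    "min_length": 12,
    "require_classes": [r"[a-z]", r"[A-Z]", r"[0-9]", r"[^A-Za-z0-9]"],
    "forbidden_substrings": ["password", "123456"],
}

def violations(candidate):
    # Return a list of the policy rules that the candidate password violates.
    problems = []
    if len(candidate) < POLICY["min_length"]:
        problems.append("too short")
    for pattern in POLICY["require_classes"]:
        if not re.search(pattern, candidate):
            problems.append("missing character class " + pattern)
    for bad in POLICY["forbidden_substrings"]:
        if bad in candidate.lower():
            problems.append("contains forbidden substring " + bad)
    return problems

print(violations("password123"))         # several violations
print(violations("T4ke-the-L0ng-Way!"))  # empty list: compliant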

Compliance with Security and Privacy Laws and Regulations

Security and privacy laws regulate how we deal with personal data in the United States. Organizations must ensure that they comply with such laws and regulations or face stiff penalties. In some domains (e.g., health), the laws and regulations are much stricter than in others. Moreover, there is an emerging concern that, as data accumulates about an individual or group of individuals, the aggregation of such data will allow properties of the individual that were not explicitly represented in the original data to be inferred.

In certain domains, such as social media and health information, as more data is accumulated about individuals, there is a fear that certain organizations will know too much about individuals. For example, data collected in electronic health record systems in accordance with HIPAA/HITECH provisions is already raising concerns about violations of privacy. International Data Corporation (IDC) coined the term “digital shadow” to reflect the amount of data concerning an individual that has been collected, organized, and perhaps analyzed to form an aggregate “picture” of the individual; it is information about you that is much greater than the information you create and release about yourself. A key question is how much of this information, either original or derived, we want to remain private.

Perhaps the biggest threat to personal security is the unregulated accumulation of data by numerous social media companies. This data represents a severe security concern, especially when many individuals so willingly surrender such information. Questions of accuracy, dissemination, expiration, and access abound. For example, the State of Maryland became the first state to prohibit, by law, employers from asking for Facebook and other social media passwords during employment interviews and afterward.

Clearly, some Big Data must be secured with respect to privacy and security laws and regulations. IDC suggested five levels of increasing security (Gantz and Reinsel 2011): privacy, compliance-driven, custodial, confidential, and lockdown. The recent spate of international hackings and exfiltrations of customer data from multiple large corporations in the United States demonstrates the critical nature of the problem and highlights the risk facing all businesses and organizations. Your corporation should take the initiative to review its Big Data assets and their structure and properties to determine the risk to the corporation if they are exposed. The IDC classifications are a good start, absent a formal standard developed by the IT community. Table 4.20 presents some of the data compliance challenges.

Table 4.20 Some Big Data compliance challenges

What rules and regulations should exist regarding combining data from multiple sources about individuals into a single repository?

Do compliance laws (such as HIPAA) apply to the entire data warehouse or just to those parts containing relevant data?

What rules and regulations should exist for prohibiting the collection and storage of data about individuals—either centralized or distributed?

Should an aggregation of data be secured at a higher level than its constituent elements?

Given IDC’s security categorization, what percentage of data should reside in each category? What mechanisms will allow data to move between categories?

The Buy or Build Dilemma

The primary issue facing any organization considering cloud computing as an infrastructure for providing computing services can be categorized as the “buy or build dilemma”: does an organization buy its cloud computing services from a public vendor, or does it develop its own in-house cloud computing services? These are obviously the extremes; many organizations have opted for a hybrid approach.

From a buy perspective, there are many issues that an organization should consider before deciding to wholly commit to an external, third-party supplier of cloud computing services. Kaisler and Money (2010, 2011) and Kaisler, Money, and Cohen (2012) have described some of the issues that should be considered. Table 4.21 presents a summary of some of these issues.

Table 4.21 Selected issues in buying cloud computing services

If the organization moves to a competing service provider, can you take your data with you?

Do you lose access (and control and ownership) of your data if you fail to pay your bill?

What level of control over your data do you retain, for example, the ability to delete data that you no longer want?

If your data is subpoenaed by a government agency, who surrenders the data (e.g., who is the target of the subpoena)?

If a customer’s information/data resides in the cloud, does this violate privacy law?

How does an organization determine that a cloud computing service provider is meeting the security standards it espouses?

What legal and financial provisions are made for violations of security and privacy laws on the part of the cloud computing service provider?

Will users be able to access their data and applications without hindrance from the provider, third parties, or the government?
