Introduction

Data has existed since the early Mesopotamians began recording their goods, trades, and money flow, more than seven thousand years ago. Data is quite simply the representation of facts, with a single datum being a single fact. The first data analytics – the process by which data can be translated into information, knowledge, and actions – was most likely the same ancient people determining whether they had a surplus of animals or grains at the end of a season, and using that to decide whether to sell or buy.

The first general-purpose programmable computer designed to work with data was the Electronic Numerical Integrator and Computer, or ENIAC, which powered on in 1945 and was controlled by switches and dials with data fed into it via punch cards. It was used for diverse tasks such as helping to develop the hydrogen bomb, design wind tunnels, and predict weather. It didn’t, however, manage or store data. It wasn’t until the 1960s that a true data management and processing system, or database, would be created.

Although computers had previously been used to automate manual accounting tasks and to run complex control systems, the Semi-Automated Business Research Environment (SABRE) airline reservation system of the 1960s was the first true transactional database system. It ensured that each seat could be booked only once and was handling more than 80,000 calls per day by 1964.

At that time, data was mostly stored in a hierarchical (document-like) structure. In 1970, Edgar Codd of IBM wrote a paper describing a relational system for storing data, and showed how it could not only handle creating, updating, and deleting data, but also be used to query it. Codd’s system consisted of tables that represented entities, such as organizations and people, and the relationships between those entities. IBM started a research project called System R to implement Codd’s vision, and created Structured Query Language, or SQL, as the language for working with the data.
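Codd’s ideas are easiest to see in miniature. The following is a minimal sketch of the relational approach using Python’s built-in sqlite3 module; the tables, columns, and sample rows are hypothetical, and the syntax is modern SQL rather than anything from System R itself.

    import sqlite3

    # Illustrative sketch only: table and column names here are hypothetical.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE organization (org_id INTEGER PRIMARY KEY, name TEXT)")
    conn.execute(
        "CREATE TABLE person (person_id INTEGER PRIMARY KEY, name TEXT, "
        "org_id INTEGER REFERENCES organization(org_id))"  # relationship via a foreign key
    )

    # Create data, then query it declaratively -- no navigating a hierarchy.
    conn.execute("INSERT INTO organization VALUES (1, 'Example Corp')")
    conn.execute("INSERT INTO person VALUES (1, 'Ada', 1)")
    for row in conn.execute(
        "SELECT p.name, o.name FROM person p "
        "JOIN organization o ON p.org_id = o.org_id"
    ):
        print(row)  # ('Ada', 'Example Corp')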

Inspired by Codd’s vision, Eugene Wong and Michael Stonebraker of the University of California, Berkeley, created the INteractive Graphics REtrieval System (INGRES), one of the first relational database management systems (RDBMSs), which was distributed as source code to many universities at a nominal cost.

During the 1980s, RDBMSs became increasingly popular. INGRES spawned multiple commercial offerings, including Sybase, Microsoft SQL Server, and NonStop SQL, while System R led to IBM SQL/DS (later Db2) and Oracle. These databases became the storage and retrieval systems for operational business applications for supply chain, inventory management, customer relationship management, and more, which came to be packaged together as Enterprise Resource Planning (ERP) systems. These Online Transaction Processing (OLTP) systems became the backbone of industry.

However, analyzing data and providing businesses with insights, in what were known as decision-support systems, required a different solution. The data warehouse was created, starting with Teradata in 1983, to be the home of all enterprise data drawn from multiple systems, with an architecture and data structure that facilitated fast and complex queries. Additional software was created to feed data warehouses from operational systems in batches, through a process of extracting, transforming, and loading (ETL) data.
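As a minimal sketch of that batch ETL pattern, assuming a hypothetical orders.csv export from an operational system (the file name, columns, and target schema are all invented for illustration):

    import csv

    def extract(path):
        # Pull raw records exported from an operational system.
        with open(path, newline="") as f:
            yield from csv.DictReader(f)

    def transform(rows):
        # Cleanse and reshape records into the warehouse schema.
        for row in rows:
            yield {"order_id": row["id"], "amount_usd": round(float(row["amount"]), 2)}

    def load(rows, out_path):
        # Write the transformed batch to the warehouse staging area.
        with open(out_path, "w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=["order_id", "amount_usd"])
            writer.writeheader()
            writer.writerows(rows)

    # Run once per batch window (for example, nightly), not continuously.
    load(transform(extract("orders.csv")), "warehouse_orders.csv")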

Further software emerged that could analyze, visualize, and produce reports on this data, and in 1989 the term business intelligence (BI) was used to describe packages from Business Objects, Actuate, Crystal Reports, and MicroStrategy.

The rise of the World Wide Web in the 1990s changed all of this. The interaction of millions, and soon billions, of people across millions of websites generated exponentially more data, and in more varied forms, than the structured, limited-user, operational business systems. In 2001, the notion of the “Three Vs” of data – volume, velocity, and variety – was coined to express how the web had changed the nature of data. New technology was required to deal with this, and Hadoop was created in 2006 as a way to scale data storage and analytics for this new big data paradigm.

A Batch of Problems

Databases have thus been the predominant source of enterprise data for decades. The majority of this data came from manual human entry in applications and web pages, with some automation. Data warehouses, fed by batch-oriented ETL systems, provided businesses with analytics. In the past 10 years or so, however, businesses have realized that machine data, such as the logs produced by web servers, networking equipment, and other systems, can also provide value. This new unstructured data, generated by an ever-increasing variety of sources, needed newer big data systems to handle it, as well as different kinds of analytics.

Both of these waves were driven by the notion that storage was cheap and, with big data, almost infinite, whereas CPU and memory were expensive. As a result, the movement and processing of data from sources to analytics was done in batches, predominantly by ETL systems. Outside of specific industries that required real-time action, such as equipment automation and algorithmic trading, truly real-time processing was seen as expensive, complicated, and unnecessary for ordinary business operations. However, batch processing is crumbling under the strain of competing modern business objectives: batch windows are shrinking in a 24/7 world, while businesses hunger for up-to-the-second information.

Under Pressure

Business leaders around the world must balance a number of competing pressures to identify the most appropriate technologies, architectures, and processes for their business. Although cost is always an issue, it needs to be measured against the rewards of innovation, and the risk of failure must be weighed against the risk of maintaining the status quo.

This leads to cycles for technology, with early adopters potentially leapfrogging their more conservative counterparts, who might not then be able to catch up if they wait for full technological maturity. In recent years, the length of these cycles has been dramatically reduced, and formerly solid business models have been disrupted by insightful competitors – or outright newcomers.

Data management and analytics are not immune to this trend, and the increasing importance of relevant, accurate, and timely data has added to the risk of maintaining the status quo.

Businesses look to data modernization to answer questions such as the following:

  • How do we move to scalable, cost-efficient infrastructures such as the cloud without disrupting our business processes?

  • How do we manage the expected or actual increase in data volume and velocity?

  • How do we work in an environment with changing regulatory requirements?

  • What will be the impact and use cases for potentially disruptive technologies like artificial intelligence (AI), blockchain, digital labor, and the Internet of Things (IoT), and how do we incorporate them?

  • How can we reduce the latency of our analytics to provide business insights faster and drive real-time decision making?

It is clear that the prevalent legacy, predominantly batch-based way of doing things might not be up to the task of solving these problems, and that a new direction is needed to move businesses forward. The reality, however, is that many existing systems cannot simply be ripped out and replaced with shiny new things without severely affecting operations.

Time Value of Data

Much has been written about the “time value of data” – the notion that the worth of data drops quickly after it is created. We can also presume from this notion that if the process of capturing, analyzing, and acting on that information can be accelerated, the value to the business will increase. Although this is often the case and the move to real-time analysis is a growing trend, this high-level view misses many nuances that are essential to planning an overall data strategy.

A single piece of data can provide invaluable insight in the first few seconds of its life, indicating that it should be processed rapidly in a streaming fashion. However, that same data, when stored and aggregated over time alongside millions of other data points, can also provide essential models and enable historical analysis. Even more subtly, in certain cases, the raw streaming data has little value without historical or reference context – real-time data is worthless unless older data is also available.

There are also cases for which the data value effectively drops to zero over a very short period of time. For these perishable insights, if you don’t act upon them immediately, you have lost the opportunity to do so. The most dramatic examples are detecting faults in, say, power plants or airplanes to avoid catastrophic failure. However, many modern use cases – prevention, real-time offers, real-time resource allocation, and geo-tracking, to name a few – are also dependent on up-to-the-second data.

Historically, the cost to businesses to move to real-time analytics has been prohibitive, so only the truly extreme cases (such as preventing explosions) were handled in this way. However, the recent introduction of streaming integration platforms (as explained in detail throughout this book) has made such processing more accessible.

Data variety and completeness also play a big part in this landscape. To have a truly complete view of your enterprise, you need to be able to analyze data from all of your sources, at different timescales, in a single place. Data warehouses have been the traditional repository of database data for long-term analytics, and data lakes (powered by Hadoop) have matured to perform a similar function for semi-structured log and device data. But to analyze the same data in real time, you need additional systems, given that both the warehouse and the lake are typically batch fed, with latencies measured in hours or days.

The ideal solution would collect data from all the sources (including the databases), move it into a data lake or scalable cloud data warehouse (for historical analysis and modeling), and also provide the capabilities for real-time analysis of the data as it’s moving. This would maximize the time value of the data from both immediate and historical perspectives.
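In outline, such a pipeline sends every event to long-term storage while also analyzing it in flight. The following is a minimal, framework-free sketch of that dual path; the event shape, the rolling window size, and the alert threshold are all hypothetical.

    from collections import deque

    archive = []                # stand-in for a data lake or cloud data warehouse
    window = deque(maxlen=100)  # rolling window for in-flight, real-time analysis

    def handle(event):
        archive.append(event)   # keep everything for historical analysis and modeling
        window.append(event["value"])
        rolling_avg = sum(window) / len(window)
        if event["value"] > 3 * rolling_avg:  # hypothetical anomaly threshold
            print("real-time alert:", event)

    # Stand-in for a continuous source, with one injected spike to illustrate an alert.
    events = [{"source": "sensor-1", "value": 10.0} for _ in range(200)]
    events[150]["value"] = 95.0
    for event in events:
        handle(event)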

The Rise of Real-Time Processing

Fortunately, CPU and memory have become much more affordable, and what was unthinkable 10 years ago is now possible. Streaming integration makes real-time in-memory stream processing of all data a reality, and it should be part of any data modernization plans. This does not need to happen overnight, but can be applied on a use-case-by-use-case basis without necessitating ripping and replacing existing systems.

The most important first step enterprises can take today is to use streaming integration to move toward a streaming-first architecture, in which all data is collected in a real-time, continuous fashion. Of course, companies can’t modernize overnight. But continuous, real-time data collection allows organizations to keep integrating with legacy technologies while reaping the benefits of a modern data infrastructure capable of meeting the ever-growing business and technology demands within the enterprise.

When data is streamed, the solutions to the problems mentioned earlier become more manageable. Database change streams can keep cloud databases synchronized with their on-premises counterparts during a move to a hybrid cloud architecture. In-memory edge processing and analytics can scale to huge data volumes and extract the information content from data, massively reducing its volume before it is stored. Streaming systems with self-service analytics can help companies stay agile and nimble, and continuous monitoring can help ensure regulatory compliance. And new technologies become much easier to integrate if, instead of separate silos and data stores, you have a flexible streaming data distribution mechanism that provides low-latency capabilities for real-time insights.
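As an illustration of that volume-reduction idea, the sketch below rolls up a stream of change events into per-minute totals before anything is persisted. The event format and the rollup rule are hypothetical and not tied to any particular change data capture product.

    from collections import defaultdict

    # Hypothetical change events, roughly the shape a database change stream might emit.
    changes = [
        {"op": "INSERT", "table": "orders", "ts": 60,  "row": {"amount": 25.0}},
        {"op": "INSERT", "table": "orders", "ts": 75,  "row": {"amount": 40.0}},
        {"op": "UPDATE", "table": "orders", "ts": 130, "row": {"amount": 55.0}},
    ]

    # In-memory rollup: keep per-minute totals per table rather than every raw event,
    # so far less data has to be moved and stored downstream.
    rollup = defaultdict(lambda: {"events": 0, "amount": 0.0})
    for change in changes:
        minute = change["ts"] // 60
        bucket = rollup[(change["table"], minute)]
        bucket["events"] += 1
        bucket["amount"] += change["row"]["amount"]

    for (table, minute), totals in sorted(rollup.items()):
        print(table, minute, totals)  # only these summaries would be persisted or forwarded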

In summary, data modernization is becoming essential for businesses focused on operational efficiency, customer experience, and gaining a competitive edge. This book explains data modernization through streaming integration in detail, to help you understand how to apply it to solve real-world business problems.
