CHAPTER 15

Data Architectures

Why Architecture?

The challenge of this approach is our ability to handle all the data we are now generating. In 2008, I spoke with a division of Autonomy (now HP) that had a product aimed at broadcasters. This system could ingest a video, strip out the audio track, convert it into text, and put it all back together so that you could search the video for any word spoken. It also used OCR (optical character recognition) to convert any words it could find on screen into text, so a sign, a hoarding, or a billboard could be searched in the same way. The system also ran a face recognition tool, so if the video included Bruce Springsteen, it could recognize Springsteen’s face and make this a searchable field as well. In the nascent world of Web video, this could be really powerful, as it would allow end users as well as production and edit teams to search for any content by person, word, or text on screen. The product ran into one problem: the cost of storing and managing that data set almost doubled the cost of storing the video itself. At a time when broadcasters were still figuring out the cost of video storage, they simply didn’t have the tools or budget for such a voluminous data set.

This is the problem that I lightheartedly call the unbearable bigness of data! We have so much data today that it hurts. We have terabyte drives sitting by our laptops sucking up our thousands of digital photographs and videos. We consume gigabytes of videos every day on Facebook or YouTube, on our mobile devices, without even thinking. Every act of ours leaves a digital data trail—a digital exhaust, which can be gathered and used by our service providers. We are plugged into global networks of social connections, smart energy, financial transactions, and in the foreseeable future, autonomous transport—zipping petabytes of data across the world. Even our bodies—now being understood at a subcellular and genome level—are generating data that is harvestable. Not only is this data insanely huge, it is also relatively unstructured. The challenge of dealing with this ever-growing and unstructured data is what the industry calls Big Data.

Database Structures

For the longest time, the dominant way of storing data was in rows and columns in relational databases. Relational database management systems (or RDBMSs, as they were called) stacked data in rows with an index. For example, a customer list in a CRM (customer relationship management) database would have the customer name as a key, and could have address, contacts, and status as columns. A separate table could hold information on the 10 different kinds of statuses and what they meant; a third table could carry links between customers, such as family members. When these were indexed, searching became easy. To find a customer called John Smith or Raj Sen, you just had to index on the last name (i.e., sort by that column) and you wouldn’t have to check each row; you could go straight to S, and the name would either be there or not in the database at all. However, if you were searching for John or Raj, you would need to re-sort by first name. You can see why this would get tricky if there were too many sort fields. Other reasons why RDBMSs struggled to keep up with digital data were the size of the data, the speed at which it was being added, and the lack of structure. If the data didn’t have a clear rows-and-columns structure, or the inter-relationships were too complex, this became a problem. In the world of store transactions or clearly defined product catalogs, this wasn’t an issue. But if you’re trying to build a comprehensive view of the customer, as MetLife was looking to do in 2013, the relational database is no longer the right tool. When that view of the customer requires information about 100 products, sourced from 70 different systems, and you want to do that for 100 million customers, it needs a different kind of approach.
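
To make the indexing point concrete, here is a minimal sketch using Python's built-in sqlite3 module; the table and column names are made up for illustration, not taken from any real CRM system.

```python
import sqlite3

# A tiny in-memory relational table: rows and columns, plus an index on last_name.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE customers (
        customer_id INTEGER PRIMARY KEY,
        first_name  TEXT,
        last_name   TEXT,
        status      TEXT
    )
""")
conn.executemany(
    "INSERT INTO customers (first_name, last_name, status) VALUES (?, ?, ?)",
    [("John", "Smith", "active"), ("Raj", "Sen", "active"), ("Ana", "Silva", "lapsed")],
)

# An index on last_name lets the engine jump straight to the 'S...' rows
# instead of scanning every row in the table.
conn.execute("CREATE INDEX idx_last_name ON customers (last_name)")

for row in conn.execute(
    "SELECT first_name, last_name, status FROM customers WHERE last_name = ?", ("Sen",)
):
    print(row)

# Searching by first_name instead would need its own index (or a full scan),
# which is the re-sorting problem described above.
```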

Hadoop/Spark

Today’s applications solve these problems of volume, velocity, and variety of data using a number of new tools. You will definitely have heard of Hadoop if you’ve worked on any digital project with a lot of data over the past few years. At its core, Hadoop is file management software, analogous to your Windows file system but designed to handle huge amounts of data that may not have a clear structure. Hadoop is built for scale: the data can be distributed across multiple clusters of servers and therefore scale more easily. Each piece of data is also stored on more than one server, which makes it more resilient; if one server goes down, the data is still accessible. Retrieval is done through a process called MapReduce. This is a more efficient way of storing very large volumes of data that may not be structured or may include documents, tweets, or multimedia files. In recent years, Hadoop has been superseded by Apache Spark, which, like Hadoop, is an open-source project. But Spark runs in memory, so it is faster and much better suited to real-time search and retrieval, as well as real-time interactions, although the cost of RAM is higher. Another tool you might come across is Kafka, which is typically used alongside Hadoop or Spark to handle streams and distributed messaging, say from a social media environment. Other tools from the Apache family include Hive (data warehousing), Flink (real-time event processing), and Storm (similar to Hadoop but works in real time rather than in batches).
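
The map and reduce steps are easier to see in miniature. The sketch below is plain Python, not actual Hadoop or Spark code; it simply mimics the pattern: a map phase that emits key-value pairs from each document, and a reduce phase that combines pairs sharing the same key.

```python
from collections import defaultdict
from itertools import chain

# Toy map-reduce word count: each "document" could live on a different server.
documents = [
    "big data needs new tools",
    "spark runs in memory",
    "hadoop stores data across clusters",
]

# Map step: each node independently emits (key, value) pairs for its documents.
def map_phase(doc):
    return [(word, 1) for word in doc.split()]

mapped = chain.from_iterable(map_phase(doc) for doc in documents)

# Shuffle + reduce step: pairs with the same key are brought together and combined.
counts = defaultdict(int)
for word, n in mapped:
    counts[word] += n

print(counts["data"])  # -> 2
```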

NoSQL

NoSQL is a database structure that uses key-value pairs rather than rows and columns. This is a useful structure when not every item has the same descriptors. For example, in a database of zoo animals, a key may be wings and the value may be red, but not every animal will have them. Other keys could be scales or horns. This is more efficient than a row-and-column format with lots of blanks. MongoDB is one of the most popular NoSQL databases, and it was the tool MetLife used in the preceding example to build its single view of the customer.1 Bear in mind that this approach isn’t necessarily better in every way: you would probably still use RDBMS tools for handling transactions, or for a traditional inventory management application. Hadoop/NoSQL isn’t the best system for the kind of real-time retrieval and real-time processing that transactions need.
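
The zoo example looks like this as key-value documents. The sketch below uses plain Python dictionaries to show the shape of the data; in a document store such as MongoDB, each of these would be one document in a collection.

```python
# Each animal is a document: only the keys that apply to it are stored,
# instead of a wide table full of blank columns.
animals = [
    {"name": "macaw",   "wings": "red",  "can_fly": True},
    {"name": "iguana",  "scales": "green"},
    {"name": "markhor", "horns": "spiral"},
]

# Query: which animals have wings at all, and what color are they?
winged = [(a["name"], a["wings"]) for a in animals if "wings" in a]
print(winged)  # -> [('macaw', 'red')]
```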

Graph Databases

Another evolution, stemming from the exponential growth of the inter-relationships between data, is graph databases, which are a specific type of NoSQL database. Graph databases are able to capture this complexity of inter-relationships, which would be incredibly difficult to map, manage, and maintain in a traditional RDBMS. For example, when a bank tries to evaluate a customer’s credit risk based on their social network, or when you want to analyze the risk of fraud, you need to model a very high number of interconnected variables. As you can imagine, the social network of an individual is a complex network that is frequently changing: growing, shrinking, or shape-shifting, with clusters forming and dissipating over time. This is the home turf of graph databases, which capture nodes and relationships in the data. In the case of fraud analysis, you may regularly encounter new information that doesn’t fit your current data model. This is another difference, as pointed out by Neo4J, a well-known provider of graph databases: in a traditional RDBMS, you have to build the model before adding data, that is, construct a row/column structure with column labels and specify its relationship to the other rows and columns, before you can put in a row of new data. Yet the specific challenge you face may well be a discovery-driven one, as is likely with fraud analysis, so your model may evolve as your analysis progresses.
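
To get a rough feel for the model, here is a sketch in plain Python rather than real Neo4J code, with made-up names: nodes, named relationships between them, and a traversal outwards from a flagged account of the kind a fraud analysis might run.

```python
from collections import defaultdict

# Nodes are people or accounts; edges are named relationships between them.
edges = defaultdict(list)

def relate(source, rel, target):
    edges[source].append((rel, target))
    edges[target].append((rel, source))

relate("alice", "SHARES_ADDRESS_WITH", "bob")
relate("bob",   "SHARES_PHONE_WITH",   "carol")
relate("carol", "SHARES_ADDRESS_WITH", "dave")

# Walk outwards from a flagged account: anything reachable within two hops
# is worth a closer look in a fraud-ring analysis.
def neighborhood(start, depth=2):
    seen, frontier = {start}, {start}
    for _ in range(depth):
        frontier = {t for node in frontier for _, t in edges[node]} - seen
        seen |= frontier
    return seen - {start}

print(neighborhood("alice"))  # -> {'bob', 'carol'}

# New kinds of relationships (say, SHARES_DEVICE_WITH) can be added on the fly,
# with no table schema to redesign first.
```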

It’s not just these obvious scenarios, though; you can often reimagine current processes such as sales and customer service using a network model, as Telenor did with its approval process for new customers, reducing response times from minutes to milliseconds. This network approach, with its inherently encoded nodes and relationships, is supported by graph databases better than by other formats.

The bottom line is that today, you don’t need a one-size-fits-all approach to manage your data challenges. Instead, you should have a portfolio of options, each with specific advantages for the kinds of problems it solves most efficiently. Also, any data strategy has to focus not only on the efficiency and effectiveness of storing data but also on its retrieval. Data is useless if it can’t be retrieved. So, while designing any data solution, you also need to be clear about your search and retrieval tools and strategy. One example is Elasticsearch, a tool designed for distributed document and data searches, often used in conjunction with Hadoop/NoSQL data.
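
As a sketch of what the retrieval side might look like, assuming the official Elasticsearch Python client (8.x-style API), a local cluster, and made-up index and field names:

```python
from elasticsearch import Elasticsearch  # assumes the official Python client, 8.x-style API

# Connect to a (hypothetical) local cluster.
es = Elasticsearch("http://localhost:9200")

# Index a document: storage is only half the job...
es.index(index="customer-notes", id="42", document={
    "customer": "Raj Sen",
    "channel": "call-centre",
    "note": "asked about mortgage refinancing options",
})

# ...retrieval is the other half: full-text search across everything stored.
results = es.search(index="customer-notes", query={"match": {"note": "mortgage"}})
for hit in results["hits"]["hits"]:
    print(hit["_source"]["customer"], "->", hit["_source"]["note"])
```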

Data Lakes

A lot of companies have moved to a model of data lakes because of the complexity of their data processing requirements. The data lake model holds data in flat files rather than in a rows-and-columns structure. The data lake stores the data in its native or original format until it is needed. So rather than applying a lot of resource-intensive cleaning, modeling, and processing up front, this is done in near-real time when the data is actually needed. Note that creating a data lake is not an outcome in itself. It’s the equivalent of laying the foundation of a building, or perhaps organizing the shelves in a retail store. The value is only delivered when the data is consumed.
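
Here is a minimal sketch of that schema-on-read idea, assuming a hypothetical lake directory of raw JSON event files; structure is applied only when an analysis actually reads the data.

```python
import json
from pathlib import Path

# A (hypothetical) corner of a data lake: raw event files kept in their original form.
lake = Path("/data/lake/clickstream/2024/06/")

def load_events(day_dir):
    """Schema-on-read: structure is imposed only when the data is consumed."""
    for path in day_dir.glob("*.json"):
        raw = json.loads(path.read_text())
        # Pick out and normalize only the fields this particular analysis needs.
        yield {
            "user_id": raw.get("user") or raw.get("user_id"),
            "page":    raw.get("page", "unknown"),
            "ts":      raw.get("timestamp"),
        }

# Nothing was cleaned or modeled at ingestion time; that work happens here, on demand.
# for event in load_events(lake):
#     ...
```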

Data Processes

For all this data to be usable, it needs to be prepared, processed, and modeled. Traditionally, businesses split their data to keep transaction and reporting requirements separate. Databases designed for transactions were not best suited for reporting, and vice versa. This hasn’t really changed, as you can see. What has changed is that analytical databases have become much more specialized, and more innovative technologies have been used to handle speed, volume, and inter-relationships. On the other hand, you still have the problem of data cleansing, as a lot of data (especially from sources such as social media) may be incomplete or may contain errors and duplication. This needs to be cleaned before it can be modeled.
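
A toy cleansing pass might look like the sketch below (plain Python, with made-up records): drop rows that are missing required fields and collapse exact duplicates before the data goes into a model.

```python
# Toy cleansing pass over raw records before they are modeled:
# drop rows missing required fields, and collapse duplicates.
raw_records = [
    {"id": "1", "name": "John Smith", "email": "john@example.com"},
    {"id": "1", "name": "John Smith", "email": "john@example.com"},  # duplicate
    {"id": "2", "name": "",           "email": "raj@example.com"},   # incomplete
    {"id": "3", "name": "Ana Silva",  "email": "ana@example.com"},
]

REQUIRED = ("id", "name", "email")

def clean(records):
    seen = set()
    for rec in records:
        if not all(rec.get(field) for field in REQUIRED):
            continue                      # incomplete: skip until it can be repaired
        key = (rec["id"], rec["email"].lower())
        if key in seen:
            continue                      # exact duplicate: keep only the first copy
        seen.add(key)
        yield rec

print(list(clean(raw_records)))  # two records survive: ids 1 and 3
```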
