As mentioned, big data can be unstructured or semi-structured, with a high level of heterogeneity. The information contained in these datasets is an essential input to decision-making processes. For this reason, heterogeneous data must be integrated and analyzed to present a single, unified view of information to many kinds of applications. This section addresses the problem of modeling and integrating heterogeneous data that originates from multiple sources in the context of cyber-infrastructure systems and big data platforms.
The growth of big data shifts planning strategies from long-term to short-term thinking, as the management of cities can be made more efficient. Healthcare systems are also being reshaped by the big data paradigm: data is generated from sources such as electronic medical record systems, mobile health records, personal health records, mobile healthcare monitors, genetic sequencing, and predictive analytics, as well as a vast array of biomedical sensors and smart devices, amounting to as much as 1,000 petabytes. The motivation for this section arises from the need for a unified approach to data processing in large-scale cyber-infrastructure systems, since cyber-physical systems of nontrivial scale exhibit significant heterogeneity. Here are some of the important features of a unified approach to big data modeling and data management:
- The creation of data models, data model analysis, and the development of new applications and services based on the new models. As discussed in Chapter 1, Introduction to Big Data and Data Management, the most important characteristics of big data are volume, variety, velocity, value, volatility, and veracity.
- Big data analytics is the process of examining large and varied datasets to discover hidden patterns, market trends, and customer preferences that companies can use for business intelligence.
- Big data models represent the building blocks of big data applications. Chapter 4, Categorizing Data Models, categorizes different types of data models.
- When it comes to big data representation and aggregation, the most important aspect is how to represent and aggregate relational and non-relational data in the storage engines. In relation to uniform data management, a context-aware approach requires an appropriate model to aggregate, semantically organize, and access large amounts of data in various formats, collected from sensors or users.
- Apache Kafka (https://kafka.apache.org/) proposes a unified approach to offline and online processing by providing a mechanism for parallel load into Hadoop systems, as well as the ability to partition real-time consumption over a cluster of machines. In addition, Apache Kafka provides a real-time publish-subscribe solution.
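Kafka's core idea from the last point, a topic split into partitions so that a cluster of consumers can read in parallel while per-key ordering is preserved, can be sketched in plain Python. This is an in-memory illustration of the concept only, not the Kafka API; the topic name, keys, and class are invented for the example:

```python
class PartitionedTopic:
    """In-memory sketch of a Kafka-style topic: one append-only log per
    partition. Records with the same key always land in the same partition,
    which is what lets consumption be split across a cluster of machines
    while keeping per-key order."""

    def __init__(self, name, num_partitions=3):
        self.name = name
        self.partitions = [[] for _ in range(num_partitions)]

    def publish(self, key, value):
        # Deterministic key -> partition mapping (Kafka hashes keys similarly).
        p = hash(key) % len(self.partitions)
        self.partitions[p].append((key, value))
        return p

    def consume(self, partition, offset=0):
        # Pull model: each consumer tracks its own offset per partition.
        return self.partitions[partition][offset:]


# Hypothetical sensor topic: readings keyed by sensor ID.
topic = PartitionedTopic("sensor-readings")
for i in range(6):
    topic.publish(f"sensor-{i % 2}", {"reading": i})

# All records for sensor-0 sit in one partition, in publish order.
p = topic.publish("sensor-0", {"reading": 99})
sensor0 = [v for k, v in topic.consume(p) if k == "sensor-0"]
```

A real deployment would use a Kafka client library and a running broker; the point here is only the partitioning and offset-based consumption model.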
To overcome data heterogeneity in big data platforms and to provide a unified and unique view of heterogeneous data, a layer with aggregation and integration functions must be created on top of the different data management systems.
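Such an integration layer can be sketched minimally: two heterogeneous backing stores, one relational-style (fixed schema, rows as tuples) and one document-style (schemaless records), are mapped onto a single common record format. The class, store layout, and field names below are assumptions made for illustration, not a prescribed design:

```python
class UnifiedDataLayer:
    """Sketch of an aggregation/integration layer that sits on top of
    heterogeneous data management systems and exposes one unified view."""

    def __init__(self):
        # Relational-style store: fixed schema, rows as tuples.
        self.relational = {"schema": ("id", "temp"), "rows": []}
        # Document-style store: schemaless JSON-like records.
        self.documents = []

    def ingest_row(self, row):
        self.relational["rows"].append(row)

    def ingest_document(self, doc):
        self.documents.append(doc)

    def unified_view(self):
        """Map both representations onto one homogeneous list of dicts."""
        schema = self.relational["schema"]
        view = [dict(zip(schema, row)) for row in self.relational["rows"]]
        view += [dict(doc) for doc in self.documents]
        return view


layer = UnifiedDataLayer()
layer.ingest_row((1, 21.5))                              # relational source
layer.ingest_document({"id": 2, "temp": 19.0, "unit": "C"})  # document source
records = layer.unified_view()  # applications see one record format
```

In a real platform the two stores would be separate systems (for example, a relational database and a document store), and the layer would also handle schema mapping, semantic organization, and access control; the sketch shows only the unification step.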