Chapter 3. Pragmatic Challenges in Building Data Cleaning Systems

Acquiring and collecting data often introduces errors: missing values, typos, mixed formats, replicated entries of the same real-world entity, and even violations of business rules. As a result, “dirty data” has become the norm rather than the exception, and most solutions that deal with real-world enterprise data run into pragmatic problems that hinder deployment in practical industry and business settings.

In the field of big data, we need new technologies that provide solutions for quality data analytics and retrieval on large-scale databases that contain inconsistent and dirty data. Not surprisingly, developing pragmatic data quality solutions is a challenging task, rich with deep theoretical and engineering problems. In this chapter, we discuss several of the pragmatic challenges caused by dirty data, and a series of principles that will help you develop and deploy data cleaning solutions.

Data Cleaning Challenges

In the process of building data cleaning software, there are many challenges to consider. In this section, we’ll explore seven characteristics of real-world applications, and the often-overlooked challenges they pose to the data cleaning process.

1. Scale

One of the building blocks of data quality is record linkage and consistency checking. For example, detecting functional dependency violations involves algorithms of (at least) quadratic complexity, such as those that enumerate all pairs of records to check for a violation (e.g., Figure 3-1 illustrates checking the rule that if two employee records agree on the zip code, they have to be in the same city). Even more expensive activities, such as clustering and finding minimum vertex covers, are used to consolidate duplicate records or to accumulate evidence of data errors. Given the complexity of these activities, cleaning large-scale data sets is prohibitively expensive, both computationally and in terms of cost. (In fact, scale renders most academic proposals inapplicable to real-world settings.) Large-scale blocking and hashing techniques are often used to trade off complexity against the recall of detected anomalies, and sampling is heavily used both in assessing the quality of the data and in producing clean data samples for analytics.

Figure 3-1. Expensive operations in record deduplication
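
To make the scale issue concrete, the sketch below checks the zip → city dependency by comparing only records that share a blocking key (the zip code itself), instead of enumerating all pairs. The records and field names are illustrative, not taken from the chapter.

    from collections import defaultdict
    from itertools import combinations

    # Illustrative employee records; a real table would hold millions of rows.
    records = [
        {"id": 1, "zip": "02139", "city": "Cambridge"},
        {"id": 2, "zip": "02139", "city": "Boston"},     # violates zip -> city
        {"id": 3, "zip": "10001", "city": "New York"},
    ]

    # Block on the left-hand side of the dependency so that only records sharing
    # a zip code are ever compared, instead of all O(n^2) pairs.
    blocks = defaultdict(list)
    for r in records:
        blocks[r["zip"]].append(r)

    violations = []
    for group in blocks.values():
        for r1, r2 in combinations(group, 2):
            if r1["city"] != r2["city"]:
                violations.append((r1["id"], r2["id"]))

    print(violations)  # [(1, 2)]

For an exact rule like this one, blocking on the rule's left-hand side loses no violations; for fuzzy tasks such as deduplication, the blocking key is a heuristic that can miss true matches, which is exactly the complexity/recall trade-off mentioned above.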

2. Human in the Loop

Data is not born an orphan, and enterprise data is often treated as an asset guarded by “data owners” and “custodians.” Automatic changes are usually based on heuristic objectives, such as introducing minimal changes to the data, or trusting a specific data source over others. Unfortunately, these objectives cannot lead to viable deployable solutions, since oftentimes human-verified or trusted updates are necessary to actually change the underlying data.

A major challenge in developing an enterprise-adoptable solution is allowing only trusted fixes to data errors, where “trusted” refers to expert interventions or verification by master data or knowledge bases. The high cost involved in engaging data experts and the heterogeneity and limited coverage of reference master data make trusted fixes a challenging task. We need to judiciously involve experts and knowledge bases (reference sources) to repair erroneous data sets.

Effective user engagement in data curation necessarily involves different human roles in the loop: data scientists are usually aware of the final questions that need to be answered from the input data, and of the tools that will be used to analyze it; business owners are best positioned to articulate the value of the analytics, and hence to control the cost/accuracy trade-off; and domain experts are uniquely qualified to answer data-centric questions, such as whether or not two instances of a product are the same (Figure 3-2).

Figure 3-2. Humans in the loop

What makes things even more interesting is that enterprise data is often protected by layers of access control and policies to guide who can see what. Solutions that involve humans or experts have to adhere to these access control policies during the cleaning process. While that would be straightforward if these policies were explicitly and succinctly represented to allow porting to the data curation stack, the reality is that most of these access controls are embedded and hardwired in various applications and data access points. To develop a viable and effective human-in-the-loop solution, full awareness of these access constraints is a must.
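
As a rough sketch of what "trusted fixes only" under access control might look like in code, the function below applies a candidate repair only when it is confirmed by reference master data, and otherwise routes it to an expert who is actually allowed to see the record. All names here (master_data, expert_queue, owner_group, and the can_view placeholder) are hypothetical.

    def can_view(user, record):
        # Placeholder for the enterprise access-control check guarding this record.
        return record.get("owner_group") in user.get("groups", ())

    def apply_repair(record, attribute, proposed_value, master_data, expert_queue, user):
        """Apply a candidate fix only if it is trusted; otherwise route it for review."""
        key = (record["id"], attribute)
        # A fix confirmed by reference master data is applied automatically.
        if master_data.get(key) == proposed_value:
            record[attribute] = proposed_value
            return "applied (verified by master data)"
        # Otherwise the fix must be verified by a domain expert who is allowed
        # to see the record under the access-control policy.
        if can_view(user, record):
            expert_queue.append((key, proposed_value, user["name"]))
            return "queued for expert verification"
        return "held (no authorized reviewer)"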

3. Expressing and Discovering Quality Constraints

While data repairing is well studied for closed-form integrity constraint formulae (such as functional dependencies or denial constraints), real-world business rules are rarely expressed in these rather limited languages. Quality engineers often need to run scripts written in imperative languages to encode the various business rules (Figure 3-3). Having an extensible cleaning platform that allows rules to be expressed in these powerful languages, yet limits the interface to rules that are interpretable and practical to enforce, is a hard challenge. What is even more challenging is discovering these high-level business rules from the data itself (and ultimately verifying them via domain experts). Automatic discovery and enforcement of business and quality constraints can play a key role in continually monitoring the health of the source data and pushing data cleaning activities upstream, closer to data generation and acquisition.

Figure 3-3. Sample business rules expressed as denial constraints
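
As an illustration of encoding a business rule as an imperative check, consider a denial constraint of a common textbook form: no two employees in the same role may be such that the more senior one earns less. The schema and values below are illustrative.

    from itertools import combinations

    # Denial constraint (illustrative): reject any pair t1, t2 where
    #   t1.role == t2.role AND t1.years > t2.years AND t1.salary < t2.salary
    def violates(t1, t2):
        return (t1["role"] == t2["role"]
                and t1["years"] > t2["years"]
                and t1["salary"] < t2["salary"])

    employees = [
        {"id": 1, "role": "engineer", "years": 8, "salary": 95000},
        {"id": 2, "role": "engineer", "years": 3, "salary": 110000},
    ]

    violations = [(t1["id"], t2["id"])
                  for t1, t2 in combinations(employees, 2)
                  if violates(t1, t2) or violates(t2, t1)]
    print(violations)  # [(1, 2)]

The challenge described above is to accept rules written in this more general, imperative form while keeping them interpretable and practical to enforce at scale.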

4. Heterogeneity and Interaction of Quality Rules

Data anomalies are rarely due to one type of error; dirty data often includes a mix of duplicates, business rule violations, missing values, misaligned attributes, and unnormalized values. Most available solutions focus on a single type of error, either to allow for sound theoretical results or to keep the solution practical and scalable. These single-error solutions cannot simply be applied independently, because their repairs usually conflict on the same data. We have to develop “holistic” cleaning solutions that compile heterogeneous constraints on the data, and identify the most problematic data portions by accumulating “evidence of errors” (Figure 3-4).

Figure 3-4. Data cleaning is holistic
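
A minimal sketch of accumulating "evidence of errors" across heterogeneous rules: each rule reports the cells it implicates, and the cells implicated most often are treated as the most problematic. The two rules and the records below are illustrative.

    from collections import Counter

    def fd_zip_city(records):
        # Implicates cells of records whose city conflicts with the first city
        # seen for their zip code (a deliberate simplification).
        seen, cells = {}, []
        for r in records:
            if r["zip"] in seen and seen[r["zip"]] != r["city"]:
                cells += [(r["id"], "city"), (r["id"], "zip")]
            seen.setdefault(r["zip"], r["city"])
        return cells

    def not_null_city(records):
        return [(r["id"], "city") for r in records if not r.get("city")]

    records = [
        {"id": 1, "zip": "02139", "city": "Cambridge"},
        {"id": 2, "zip": "02139", "city": ""},
    ]

    evidence = Counter()
    for rule in (fd_zip_city, not_null_city):
        evidence.update(rule(records))

    # Cells flagged by several rules rise to the top of the repair queue.
    print(evidence.most_common())  # [((2, 'city'), 2), ((2, 'zip'), 1)]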

5. Data and Constraints Decoupling and Interplay

Data and integrity constraints often interplay and are usually decoupled in space and time, in three different ways. First, while errors are born with the data, they are often discovered much later in applications, where more business semantics are available; hence, constraints are often declared and applied much later, and in multiple stages in the data processing life cycle. Second, detecting and fixing errors at the source, rather than at the application level, is important in order to avoid updatability restrictions and to prevent future errors. Finally, data cleaning rules themselves are often inaccurate; hence, a cleaning solution has to consider “relaxing” the rules to avoid overfitting and to respond to business logic evolution. Cleaning solutions need to build on causality and responsibility results, in order to reason about the errors in data sources. This allows for identifying the most problematic data, and logically summarizing data anomalies using predicates on the data schema and accompanying provenance information.
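
One simple way to "relax" a rule rather than enforce it absolutely (an illustrative heuristic, not the chapter's prescription) is to treat a constraint as suspect when its violation rate on the data is high, since a widely violated rule is more likely to be outdated business logic than evidence of widespread errors.

    def assess(records, check, tolerance=0.02):
        """Decide whether to repair the data or relax the rule (illustrative).

        `check` returns True when a record satisfies the rule. A few violators
        suggest data errors; a high violation rate suggests the rule itself has
        drifted from the business logic and should be relaxed and reviewed.
        """
        violators = [r for r in records if not check(r)]
        rate = len(violators) / len(records) if records else 0.0
        if rate <= tolerance:
            return {"action": "repair data", "violators": violators}
        return {"action": "relax and review rule", "violation_rate": rate}

    orders = [{"discount": 5}] * 49 + [{"discount": 80}]
    print(assess(orders, lambda r: r["discount"] <= 50))
    # {'action': 'repair data', 'violators': [{'discount': 80}]}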

6. Data Variety

Considering only structured data limits the complexity of detecting and repairing data errors. Most current solutions are designed to work with one type of structured data—tables—yet businesses and modern applications process a large variety of data sources, most of which are unstructured. Oftentimes, businesses will extract the important information and store it in structured data warehouse tables. Delaying quality assessment until after this information has been extracted and loaded into data warehouses is inefficient and inadequate. More effective solutions are likely to push data quality constraints into the information extraction subsystem, both to limit the amount of dirty data pumped into the business intelligence stack and to get closer to the sources of errors, where more context is available for trusted and high-fidelity fixes (Figure 3-5).

Figure 3-5. Iterative by design
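
A sketch of what pushing quality constraints into the information extraction subsystem could look like: extracted fields are validated at extraction time, while the raw source text is still available as context for trusted fixes, rather than after loading into the warehouse. The extractor and checks here are toy examples.

    import re

    # Illustrative quality checks applied at extraction time.
    CHECKS = {
        "zip":   lambda v: bool(re.fullmatch(r"\d{5}", v or "")),
        "email": lambda v: bool(v) and "@" in v,
    }

    def extract_record(raw_text):
        # Toy extractor: pulls a zip code and an email address out of free text.
        zip_match = re.search(r"\b\d{5}\b", raw_text)
        email_match = re.search(r"\S+@\S+", raw_text)
        return {"zip": zip_match.group() if zip_match else None,
                "email": email_match.group() if email_match else None,
                "_source": raw_text}

    def extract_and_validate(raw_text):
        record = extract_record(raw_text)
        # Flag problems here, where the source text can still inform a fix,
        # instead of after the record lands in the warehouse.
        record["_issues"] = [f for f, ok in CHECKS.items() if not ok(record[f])]
        return record

    print(extract_and_validate("Contact jane.doe[at]example.com, Cambridge MA 02139"))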

7. Iterative by Nature, Not Design

While most cleaning solutions insist on “one-shot cleaning,” data typically arrives and is handled incrementally, and quality rules and schema are continuously evolving. One-shot cleaning solutions cannot sustain large-scale data in a continuously changing enterprise environment, and are destined to be abandoned. The cleaning process is iterative by nature, and has to have incremental algorithms at its heart. This usually entails heavy collection and maintenance of data provenance (e.g., metadata that describes the sources and the types of changes the data is going through), in order to keep track of data “states.” Keeping track of data states allows algorithms and human experts to add knowledge, to change previous beliefs, and even to roll back previous actions.
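
A minimal illustration of keeping track of data "states": every change to a value is recorded together with its provenance (source and reason), and a rejected fix is reversed by appending a new entry rather than erasing history. This is a sketch of the idea, not a prescribed design.

    from datetime import datetime, timezone

    class CuratedCell:
        """A cell value plus the provenance of every change applied to it."""

        def __init__(self, value, source):
            self.value = value
            self.history = [(value, source, "initial load", datetime.now(timezone.utc))]

        def update(self, new_value, source, reason):
            self.history.append((new_value, source, reason, datetime.now(timezone.utc)))
            self.value = new_value

        def rollback(self, source, reason):
            # Revert to the previous value, recording the reversal itself as provenance.
            previous = self.history[-2][0] if len(self.history) > 1 else self.value
            self.history.append((previous, source, reason, datetime.now(timezone.utc)))
            self.value = previous

    city = CuratedCell("Cambrdge", source="crm_export")
    city.update("Cambridge", source="spellcheck_rule", reason="dictionary match")
    city.rollback(source="domain_expert", reason="expert rejected automatic fix")
    print(city.value)  # "Cambrdge" -- the automatic fix has been rolled back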

Building Adoptable Data Cleaning Solutions

Despite hundreds of research papers on the topic, data cleaning efforts in industry are still largely limited to one-off solutions that mix consulting work, rule-based systems, and ETL scripts. The data cleaning challenges we’ve reviewed in this chapter present real obstacles to building cleaning platforms. Tackling all of these challenges in one platform is likely to be a very expensive software engineering exercise. On the other hand, ignoring them is likely to produce throwaway system prototypes.

Adoptable data cleaning solutions can tackle at least a few of these pragmatic problems by:

  1. Having humans or experts in the loop as a first-class part of the cleaning process, for training models and for verification

  2. Focusing on scale from the start, and not as an afterthought (which will exclude most naïve brute-force techniques currently used in problems like deduplication and schema mapping)

  3. Realizing that curation is a continuous incremental process that requires a mix of incremental algorithms and a full-fledged provenance management system in the backend, to allow for controlling and revising decisions long into the curation life cycle

  4. Coupling data cleaning activities to data consumption end-points (e.g., data warehouses and analytics stacks) for more effective feedback

Building practical, deployable data cleaning solutions for big data is a hard problem that is full of both engineering and algorithmic challenges; however, being pragmatic does not mean being unprincipled.
