Search in book...
Toggle Font Controls
Create new playlist

Name your new playlist

Playlist description (optional)
Sign In

Email address

Password

Forgot Password?

or

Continue with Facebook

Continue with Google
Sign Up

Full Name

Email address

Confirm Email Address

Password

or

Continue with Facebook

Continue with Google

Chapter 7
Localizing Errors in Data Records

7.1 Error Localization

Data can be invalidated with validation rules. If data is marked as invalid, what are the subsequent actions one should take? One possible action is to remove all invalid observations from the data. Although this may seem a sound advice, often this has severe consequences: removing all observations that contain invalid values may lead to the following:

remove many valid and useful values,
leave too little data to be analyzed,
introduce selection bias in the resulting data.

Removing observations with invalid data may result in a dataset that is not representative. In many cases, the erroneous records are very relevant for the research question at hand. In statistical practice, such observations often describe difficult-to-measure subpopulations that differ from the rest of the the population and are of interest for policy makers or from a scientific point of view. Examples of interesting populations in social statistics where survey data often suffer from incomplete and erroneous observations are elderly, migrants, foreign workers, students, and homeless people. Examples in economic data are startups, flex workers, and customer-to-customer sales. Removing such records can heavily bias the statistical estimates that follow. As an extreme example, suppose that every case contains a single invalid or missing value. Removing each invalid case would then result in no data at all, thereby wasting all the observed values that are valid.

An alternative for removing invalid data is to “correct” the invalid data through data editing so that the edited data can be used in the data analysis. Changing (scientific) data for further analysis may seem wrong but often is the lesser of the two evils: having no or heavily biased data versus minimally edited data. Moreover, in some cases, the measurement error causing violation of the validation rules can be reconstructed or properly estimated. In this sense, the whole editing process can be viewed as part of the estimation procedure (van der Loo et al., 2017).

The overall data-editing strategy we follow is to detect invalid data, determine which values are considered faulty, and remove those values so that they can be imputed in later steps. This procedure thus generates missing values, and they should be chosen so that these can be imputed in a way so that no rule violations remain. As a guiding principle, we therefore demand that the original data is altered as little as possible while ensuring that the data is no longer invalidated by validation rules.

The ability to correct values requires that data errors can be identified. The procedure to find errors is referred to as error localization: which values of the data are causing invalidation and are therefore considered to be errors?

An error localization procedure must thus search the space of field combinations to find those combinations of fields that can be altered to satisfy a set of validation rules. The space of possibilities is large, and in principle, the solution need not be unique. Most error localization procedures are restricted to classes of rules that can be evaluated for each case (record) separately. That is, to check whether a case contains valid data, one only needs access to the data in that particular record. No reference to other records or datasets is necessary. Since there are less fields to consider, this restriction on the ruleset makes the problem of error localization more manageable. Moreover, error localization procedures are aimed at minimizing the (weighted) number of fields to adapt, to stay as close to the original data as possible. To clarify this, let us have a look at some examples.

For univariate rules, error localization is trivial: if an observation is invalidated by an univariate validation rule, the value of the corresponding variable is invalid.

Univariate rules are variable domain definitions: they define what values are allowed. For numeric variables, univariate rules are defined as one or more ranges, and for categorical variables they are defined as the set of allowed discrete values.

Error localization in a multivariate validation rule is more complex: in principle, any combination of values can be invalid.

This example illustrates that solutions to error localization problems are in general not unique. We can make the data satisfy the validation rule by changing the value of either Age or Married (or both). It is sometimes possible to leverage extra information. For example, if a person's date of birth is known from another reliable source, then this can be used to point out the field that most likely contains the error.

Things become more complicated still when multiple rules are involved that pertain to an overlapping set of variables. In that case, changing a value to satisfy one rule may cause the violation of another.

This example shows that an error localization procedure has to take all validation rules into account, including the rules that are not violated.

These examples motivate the following definition of the error localization procedure in terms of an optimization problem.

This minimization problem is named after (Fellegi and Holt, 1976), who first formulated and solved the problem for the case of categorical data under logical restrictions. Assigning a weight to each variable allows domain experts to influence which variables are more likely to be part of a solution. Since variables with a higher assigned weight are less likely to end up in the optimal solution, the weights are often referred to as ‘reliability weights’. Domain experts can assign higher reliability weights to variables they deem to have higher quality.

Over time, several algorithms have been developed for solving the error localization problem. An extensive overview of the theory is given in de Waal et al. (2011, Chapters 3–5). On one hand, there are specialized approaches that explicitly manipulate the rule set such as the branch-and-bound algorithm of de Waal and Quere (2003) and the original method of Fellegi and Holt (1976). On the other hand, there are approaches that translate the problem to a well-known general optimization problem (mixed-integer programming (MIP)). Some specialized approaches that focus on a single data type and a single type of rules exist as well.

In this chapter, we focus on the MIP approach. The advantage of a MIP approach is that it allows one to leverage existing, proven implementations of MIP solvers, either commercial or open source. The main downside of the MIP approach compared to the branch-and-bound algorithm is that users have less fine-grained influence on the chosen solution. Indeed, MIP solvers often return just one of the optimal solutions when multiple equivalent solutions exist, while implementations of the branch-and-bound method often return all solutions.

In the rest of this chapter, we first describe the usage of the errorlocate package. This section is then followed by a description of the translation of the error localization problem into a MIP-problem for the linear, categorical, and conditional validation rules. We end the chapter with describing the numerical stability of the implemented algorithm and give suggestions on handling potential issues.

7.2 Error Localization with R

7.2.1 The Errorlocate Package

The errorlocate package offers a small set of functions that allows users to localize errors in records that are to be validated by a set of linear, categorical, or conditional validation rules using a fast implementation of the Fellegi–Holt paradigm. Erroneous values can be automatically replaced with NA or another chosen value. The package is extensible by allowing users to implement their own error location routines. The package integrates with the validate package, which is used for rule definition and manipulation.

In errorlocate, three types of rules can be treated. First, linear rules express linear (in)equality conditions on numerical variables. Examples include (account) balance equations or nonnegativity restrictions. Second, categorical rules express conditional relations (logical implications) between two or more categorical variables. They are expressed in if-then syntax, for example, if ( age < 16) marital_status == "unmarried". Third, there are rules of mixed type. These are logical conditions where each clause consists of either a linear restriction or a restriction on categorical values. The following example illustrates how such rules can be defined in the validate package.

  rules <- validator(
    employees >= 0,                             # numerical
    profit == turnover - cost,                  # numerical
    sector   %in% c("nonprofit", "industry"),   # categorical
    bankrupt %in% c(TRUE, FALSE),               # categorical
    if (bankrupt == TRUE) sector == "industry", # categorical
    if ( sector    == "nonprofit"
       | bankrupt  == TRUE
       ) profit    <= 0,     # conditional
    if (turnover > 0) employees>= 1            # conditional
  )

The variables used in the rules are taken from the observation to be checked. It is also possible to supply one or more reference datasets that can be used in the error localization. The variables of these reference datasets are then excluded from the error localization procedure.

A Simple Example

Finding error locations with errorlocate is done with the locate_errors function. This function needs at least two parameters: a data.frame and a validator object.

The function locate_errors returns an errorlocation object that contains not only the location of errors but also some metadata and data on the process of finding the error. The location of the errors can be retrieved using the values method of errorlocation.

  library(errorlocate)
  rules <- validator(
    age >= 0
  )
  raw_data <- data.frame(age = -1, married = TRUE)
  le <- locate_errors(raw_data, rules)
  values(le)
  ##       age married
  ## [1,] TRUE   FALSE

In the example, the variable age is erroneous, since an age cannot be negative. Let us expand the example a little bit:

  library(errorlocate)
  rules <- validator(
    age >= 0,
    if (age <= 16) married == FALSE
  )
  raw_data <- data.frame(age = -1, married = TRUE)
  le <- locate_errors(raw_data, rules)
  values(le)
  ##       age married
  ## [1,] TRUE   FALSE

We have added a conditional restriction describing that a person cannot be married at the age of 16 years or younger. error_locate correctly identifies age to be a value that is causing the invalidation of both rules.

We can automatically remove errors using the replace_errors function.

  replace_errors(raw_data, rules)
  ##   age married
  ## 1  NA    TRUE

replace_errors internally calls the error_locate function. Any extra argument supplied to replace_errors will be passed through to error_locate.

What happens if age is 4?

  library(errorlocate)
  v <- validator(
    age >= 0,
    if (married==TRUE) age >= 16
  )
  raw_data <- data.frame( age     = 4
                        , married = TRUE)
  le <- locate_errors(raw_data, v)
  values(le)
  ##        age married
  ## [1,] FALSE    TRUE

Note that both age and married can be wrong. error_locate decides at random which of the variables is considered to be wrong. This is sensible since error_locate has no further information whether age or married is to be preferred. However, it is possible to supply a weight vector, matrix, or data.frame to error_locate to indicate in which value error_locate should put more trust.

  raw_data <- data.frame(age = 4, married = TRUE)
  weight <- c(age = 2, married = 1) # assuming we have a
  registration of marriages
  le <- locate_errors(raw_data, v, weight=weight)
  values(le)
  ##        age married
  ## [1,] FALSE    TRUE

Some Words on the Implementation of `errorlocate`

The implementation of the Fellegi–Holt algorithm is located in an R class FHErrorLocator, which is derived from the base class ErrorLocator. Each ErrorLocator object has to implement two methods.

The editrules package, the predecessor of errorlocate, implemented error localization with the branch-and-bound algorithm described by de Waal (2003) and a MIP formulation.

The main disadvantage of the branch-and-bound algorithm is that it has $c07-math-004$ worst-case time and memory complexity, where $c07-math-005$ is the number of variables occurring in a connected set of rules. Moreover, the branch-and-bound solver was written in pure R, making it intrinsically slower than a compiled language implementation. The main advantages of this approach were the ease of implementation and the opportunity for users to exert fine-grained control over the algorithm.

7.3 Error Localization as MIP-Problem

The error localization problem can be translated to a MIP-problem. This allows us to reuse well-established results from the field of linear and MIP. Indeed, many advanced algorithms for solving such problems have been developed, and in many cases, implementations in a compiled language are available under a permissive license. In errorlocate, the solver of the lp_solve library (Berkelaar et al., 2010) is used through R's lpSolveAPI package (Konis, 2011). The lp_solve library is written in ANSI C and has been tried and tested extensively.

The strategy to solve error localization problems through this library from R therefore consists of translating the problem to a suitable MIP-problem, feeding this problem to lpSolveAPI, and translating the results back to an error location. It is necessary to distinguish among the following:

linear restrictions on purely numerical data,
restrictions on purely categorical data, and
conditional restrictions on mixed-type data,

since restriction for each data type calls for a different translation to a MIP problem.

We will focus on how to translate these types of error localization problems to a mixed-integer formulation, paying attention to both theoretical and practical details. In Section 7.4, attention is paid to numerical stability issues and Section 7.2 is devoted to examples in R code.

7.3.1 Error Localization and Mixed-Integer Programming

A MIP-problem is an optimization problem that can be written in the form

7.1

where $c07-math-007$ is a constant vector and $c07-math-008$ is a vector consisting of real and integer coefficients. One usually refers to $c07-math-009$ as the decision vector and the inner product $c07-math-010$ as the objective function. Furthermore, $c07-math-011$ is a coefficient matrix and $c07-math-012$ a vector of upper bounds. Formally, the elements of $c07-math-013$ , $c07-math-014$ , and $c07-math-015$ are limited to the rational numbers (Schrijver, 1998). This is never a problem in practice since we are always working with a computer representation of numbers.

The name MIP stems from the fact that $c07-math-016$ contains continuous as well as integer variables. When $c07-math-017$ consists solely of continuous or integer variables, Problem (7.1) reduces, respectively, to a linear or an integer programming problem. An important special case occurs when the integer coefficients of $c07-math-018$ may only take values from {0,1}. Such variables are often called binary variables. It occurs as a special case since defining $c07-math-019$ to be integer and applying the appropriate upper bounds yield the same problem.

MIP is well understood, and several software packages are available that implement efficient solvers. Most MIP software support a broader, but equivalent, formulation of the MIP problem, allowing the set of restrictions to include inequalities as well as equalities. As a side note we mention that under equality restrictions, solutions for the integer part of $c07-math-020$ are only guaranteed to exist when the equality restrictions pertaining to the integer part of $c07-math-021$ are totally unimodular.¹ However, as we will see below, restrictions on $c07-math-025$ are always inequalities in our case, so this is of no particular concern to us.

We reformulate Fellegi–Holt error localization (Fellegi and Holt, 1976) for numerical, categorical, and mixed-type restrictions in terms of MIP problems. The precise reformulations of the error localization problem for the three types of rules are different, but in each case the objective function is of the form

7.2

where $c07-math-027$ is a vector of positive weights and $c07-math-028$ a vector of binary variables, one for each variable in the original record, that indicates whether its value should be replaced. More precisely, for a record $c07-math-029$ of $c07-math-030$ variables, we define

7.3

This objective function obviously meets the requirement that the minimal (weighted) number of variables should be replaced. In general, a record may contain numeric, categorical, or both types of data, and restrictions may pertain to either one or both data types. To distinguish between the data types below, we shall write $c07-math-032$ , where $c07-math-033$ represents the categorical and $c07-math-034$ the numerical part of $c07-math-035$ .

For an error localization problem, the restrictions of Problem (7.1) consist of two parts, which we denote

7.4

Here, the restrictions indicated with $c07-math-037$ represent a matrix representation of the user-defined (hard) restrictions that the original record $c07-math-038$ must obey. The vector $c07-math-039$ contains at least a numerical representation of the values in a record $c07-math-040$ and the binary variables $c07-math-041$ . An algorithmic MIP solver will iteratively alter the values of $c07-math-042$ until a solution satisfying (7.4) is reached. To make sure that the objective function reflects the (weighted) number of variables altered in the process, the restrictions in $c07-math-043$ serve to make sure that the values in $c07-math-044$ that represent values in $c07-math-045$ cannot be altered without setting the corresponding value in $c07-math-046$ to 1.

Summarizing, in order to translate the error localization problem for the special cases of linear, categorical, or conditional mixed-type restrictions to a general MIP-problem, for each case we need to properly define $c07-math-047$ , the restriction set $c07-math-048$ , and the restriction set $c07-math-049$ .

7.3.2 Linear Restrictions

For a numerical record $c07-math-050$ taking values in $c07-math-051$ , a set of linear restrictions can be written as

7.5

where in errorlocate, we allow the set of restrictions to contain equalities, inequalities ( $c07-math-053$ ), and strict inequalities ( $c07-math-054$ ). The formulation of these rules is very close to the formulation of the original MIP problem of Eq. (7.1). The vector to minimize over is defined as follows:

7.6

with the $c07-math-056$ as in Eq. (7.3). The set of restrictions $c07-math-057$ is equal to the set of restrictions of Eq. (7.5), except in the case of strict inequalities. The reason is that while errorlocate allows the user to define strict inequalities ( $c07-math-058$ ), the lpsolve library used by errorlocate only allows for inclusive inequalities ( $c07-math-059$ ). For this reason, strict inequalities of the form $c07-math-060$ are rewritten as $c07-math-061$ , with $c07-math-062$ a suitably small positive constant.

In the case of linear rules, the set of constraints $c07-math-063$ consists of pairs of the form

7.7

for $c07-math-065$ . Here, $c07-math-066$ are the actual observed values in the record and $c07-math-067$ a suitably large positive constant allowing $c07-math-068$ to vary between $c07-math-069$ and $c07-math-070$ . It is not difficult to see that if $c07-math-071$ is different from $c07-math-072$ , then $c07-math-073$ must equal 1. For, if we choose $c07-math-074$ , we obtain the set of restrictions

7.8

which states that $c07-math-076$ equals $c07-math-077$ .

7.3.3 Categorical Restrictions

Categorical records $c07-math-088$ take values in a Cartesian product domain

7.9

where each $c07-math-090$ is a finite set of categories for the $c07-math-091$ categorical variable. The category names are unimportant, so we write

7.10

The total number of possible value combinations $c07-math-093$ is equal to the product of the $c07-math-094$ .

A categorical edit is a subset $c07-math-095$ of $c07-math-096$ where records are considered invalid, and we may write

7.11

where each $c07-math-098$ is a subset of $c07-math-099$ . It is understood that if a record $c07-math-100$ , then the record violates the edit. Hence, categorical rules are negatively formulated (they specify the region of $c07-math-101$ where $c07-math-102$ may not be) in contrast to linear rules that are positively formulated (they specify the region of $c07-math-103$ where $c07-math-104$ must be). To be able to translate categorical rules to a MIP problem, we need to specify $c07-math-105$ such that if $c07-math-106$ , then $c07-math-107$ satisfies $c07-math-108$ . Here, $c07-math-109$ is the complement of $c07-math-110$ in $c07-math-111$ , which can be written as

7.12

where for each variable $c07-math-113$ , $c07-math-114$ is the complement of $c07-math-115$ in $c07-math-116$ . Observe that Eq. (7.12) states that if at least one $c07-math-117$ , then $c07-math-118$ satisfies $c07-math-119$ . Below, we will use this property and construct a linear relation that counts the number of $c07-math-120$ over all variables.

To be able to formulate the Fellegi-Holt problem in terms of a MIP problem, we first associate with each categorical variable $c07-math-121$ a binary vector $c07-math-122$ of which the coefficients are defined as follows (see also Eq. (7.9)):

7.13

where $c07-math-124$ . Thus, each element of $c07-math-125$ corresponds to one category in $c07-math-126$ . It is zero everywhere except at the value of $c07-math-127$ . We will write $c07-math-128$ to indicate the concatenated vector $c07-math-129$ , which represents a complete record. Similarly, each edit can be represented by a binary vector $c07-math-130$ given by

7.14

where we interpret 1 and 0 as TRUE and FALSE, respectively, and the logical 'or' ( $c07-math-132$ ) is applied element-wise to the coefficients of $c07-math-133$ . The above relation can be interpreted as stating that $c07-math-134$ represents the valid value combinations of variables contained in the edit.

To set up the hard restriction matrix $c07-math-135$ of Eq. (7.4), we first impose the obvious restriction that each variable can take but a single value:

7.15

for $c07-math-137$ . It is now not difficult to see that the demand (Eq. (7.12)) that at least one of the $c07-math-138$ may be written as

7.16

Eqs. (7.15) and (7.16) constitute the hard restrictions, stored in $c07-math-140$ .

Using the binary vector notation for $c07-math-141$ , and adding the $c07-math-142$ variables that indicate variable change, the vector to minimize over (Eq. (7.1)) is written as

7.17

To ensure that a change in $c07-math-144$ results in a change in $c07-math-145$ , the matrix $c07-math-146$ contains the restrictions

7.18

for $c07-math-148$ . Here, $c07-math-149$ is the observed value for variable $c07-math-150$ . One may check, using Eq. (7.13), that the above equation can only hold when either $c07-math-151$ and $c07-math-152$ (the original value is retained) or $c07-math-153$ and $c07-math-154$ (the value changes).

7.3.4 Mixed-Type Restrictions

Records $c07-math-169$ containing both numerical and categorical data can be denoted as a concatenation of categorical and numerical variables taking values in $c07-math-170$ :

7.20

where $c07-math-172$ is defined in Eq. (7.9). As stated earlier, categorical edits are usually defined negatively as a region of $c07-math-173$ that is disallowed, while linear rules define regions in $c07-math-174$ that are allowed. We may choose a negative formulation of rules containing both variable types by defining a single rule $c07-math-175$ as follows:

7.21

where $c07-math-177$ and $c07-math-178$ is a convex subset of $c07-math-179$ defined by a (possibly empty) set of $c07-math-180$ linear inequalities of the form $c07-math-181$ . It is understood that if $c07-math-182$ , then $c07-math-183$ violates the rule. An example of a restriction pertaining to a categorical and a numerical variable is ‘A company employs staff if and only if it has positive personnel cost’. The corresponding rule can be denoted as $c07-math-184$ .

To obtain a positive reformulation, we first negate the set membership condition and apply basic rules of proposition logic:

7.22

This then yields a positive formulation of $c07-math-186$ . That is, a record $c07-math-187$ satisfies $c07-math-188$ if and only if

7.23

Observe that this formulation allows one to define multiple disconnected regions in $c07-math-190$ containing valid records using just a single rule. For example, one may define a numeric variable to be either smaller than 0 or larger than 1. This type of restriction cannot be formulated using just linear numerical restrictions.

This formulation is both a generalization of linear inequality (Eq. (7.5)) and categorical rules (Eq. (7.11)). Choosing $c07-math-191$ , we get $c07-math-192$ , and only the categorical part remains. Similarly, choosing $c07-math-193$ , only the disjunction of linear inequalities remains. A system of linear equations that must simultaneously be obeyed like in Eq. (7.5) can be obtained by defining multiple rules $c07-math-194$ , each containing a single linear restriction.

The definition in Eq. (7.22) can be rewritten as a ‘conditional rule’ using the implication replacement rule from propositional logic, which states that $c07-math-195$ may be replaced by $c07-math-196$ . If we limit Eq. (7.22) to a single inequality, we obtain the normal form of de Waal (2003).

7.24

If we choose $c07-math-198$ and leave two inequalities, we obtain a conditional rule on numerical data:

7.25

Writing mixed-type rules in conditional form seems more user-friendly as they can directly be translated into an if statement in a scripting language. Finally, note that equalities can be introduced by defining pairs of rules like the following:

7.26

To reformulate Eq. (7.22) as a MIP problem, we first define binary variables $c07-math-201$ that indicate whether $c07-math-202$ obeys $c07-math-203$ :

7.27

Using the or-form of the set condition (Eq. (7.22)), we can write the mixed-data rule as

7.28

Recall from Eqs. (7.14) and (7.13) that $c07-math-206$ is the binary vector representation of a categorical rule and $c07-math-207$ the binary vector representation of a categorical record. In the above equation, the ‘ $c07-math-208$ ’ is the arithmetic translation of the logical ‘ $c07-math-209$ ’ operator in Eq. (7.22) that connects the categorical with the linear restrictions. When any one of the two terms is positive, record $c07-math-210$ satisfies rule $c07-math-211$ .

Rules of this form constitute the user-defined part of the $c07-math-212$ part of the restriction matrix. To explicitly identify $c07-math-213$ with the linear restrictions, we also add

7.29

to $c07-math-215$ with $c07-math-216$ a suitably large positive constant. Indeed, if $c07-math-217$ , the inequality $c07-math-218$ is enforced, and Eq. (7.28) always is satisfied. When $c07-math-219$ , the whole restriction can hold regardless of whether the inequality holds. Finally, similar to the purely categorical case, we need to add restrictions on the binary representation of $c07-math-220$ as in Eq. (7.15), so Eqs. (7.15) (7.28), and (7.29) constitute $c07-math-221$ .

There may be multiple mixed-type rules, each yielding one or more $c07-math-222$ indicator variables for each rule. The decision vector for the MIP problem may therefore be written as

7.30

where $c07-math-224$ is the total number of linear rules occurring in all the mixed-type rules. Finally, the $c07-math-225$ matrix connecting the change indicator variables ( $c07-math-226$ ) with the actual recorded values consists of the union of the restrictions for categorical variables (Eq. (7.18)) and those for numerical variables (Eq. (7.7)).

Example 7.3.8

We consider a record $c07-math-227$ with the variables type of business $c07-math-228$ , which takes values in $c07-math-229$ , where “sp” stands for “sole proprietorship”, personnel cost $c07-math-230$ , and number of staff $c07-math-231$ . Hence, we have $c07-math-232$ . We impose the following rules on $c07-math-233$ : $c07-math-234$ , $c07-math-235$ , $c07-math-236$ , and if the business type is a sole proprietorship, then the number of staff must equal zero. This may be expressed as $c07-math-237$ or equivalently $c07-math-238$ . For a record $c07-math-239$ , the error localization problem takes the following form:

The first row in the restriction represents the mixed-type rule, translated as shown in Eq. (7.28). Row 6 connects the indicator variable $c07-math-240$ with the numerical rule in the consequent of $c07-math-241$ . Rows 2, 3, and 4 represent the numerical rules limiting values of $c07-math-242$ and $c07-math-243$ . Row 5 forces $c07-math-244$ to have only one value, and row 7 connects the value of $c07-math-245$ with that of $c07-math-246$ . Finally, rows 8–11 connect the numerical variables with the corresponding change indicators.

7.4 Numerical Stability Issues

An error localization problem, in its original formulation, is an optimization problem over $c07-math-247$ binary decision variables that indicate which variables in a record should be adapted. Depending on the type of rules, its reformulation as a MIP problem adds at least $c07-math-248$ variables and $c07-math-249$ restrictions. Moreover, the reformulation as a MIP problem introduces a constant $c07-math-250$ , the value of which has no mathematical significance but for which a value must be chosen in practice. Because of limitations in machine accuracy, which is typically on the order of $c07-math-251$ , the range of problems that can be solved is limited as well. In particular, MIP problems that involve both very large and very small numbers in the objective function and/or the restriction matrix may yield erroneous solutions or become numerically unfeasible. Indeed, the manual of lp_solve (Berkelaar et al., 2010) points out that ‘[…] to improve stability, one must try to work with numbers that are somewhat in the same range. Ideally, in the neighborhood of 1’. The following sections point out a number of sources of numerical instabilities and provide ways to handle them.

7.4.1 A Short Overview of MIP Solving

Consider a set of linear restrictions on numerical data of the form $c07-math-252$ , where we assume $c07-math-253$ , and the restrictions consist solely of inequalities ( $c07-math-254$ ). In practice, these restrictions will not limit the type of linear rules covered by this discussion, since it can be shown that all linear rules can be brought to this form, possibly by introducing dummy variables [see, e.g., Bradley et al. (1977); Schrijver (1998)]. Furthermore, suppose that we have a record $c07-math-255$ that does not obey the restrictions. The MIP formulation of the error localization problem can be written as follows:

7.31

and $c07-math-257$ . Also, $c07-math-258$ denotes the unit matrix, $c07-math-259$ a vector with all coefficients equal to 1, and $c07-math-260$ . The last row is added to force $c07-math-261$ . This is necessary because we will initially treat the binary variables $c07-math-262$ as if they are real numbers in the range $c07-math-263$ .

The lp_solve library uses an approach based on the revised Phase I– Phase II simplex algorithm to solve MIP problems. In this approach, every inequality of Eq. (7.31) is transformed to an equality by adding dummy variables: each row $c07-math-264$ is replaced by $c07-math-265$ , with $c07-math-266$ . Depending on the sign of the inequality, the extra variable $c07-math-267$ is called a slack or surplus variable. In Eq. (7.31), there are four sets of restrictions (rows). We therefore need to add four sets of surplus and slack variables (columns) in order to rewrite the whole system in terms of equalities.

Note that after this transformation, the whole problem including the cost function is written in terms of equalities. It is customary to organize this set of equality objective function in a single tableau notation as follows:

7.32

Here, the first row and column represent the cost function. Columns 2 and 3 correspond to the original set of variables in Eq. (7.31), while columns 4-7 correspond to sets of slack and surplus variables. The final column contains the constant vector.

A tableau representation shows all the numbers that are relevant in an LP problem at a glance. By examining how LP solvers typically manipulate these numbers, we gain some insight into how and where numerical stability issues may arise.

Since the tableau represents a set of linear equalities, it may be manipulated as such. In fact, the simplex method is based on performing a number of cleverly chosen Gauss–Jordan elimination steps on the tableau. For a complete discussion, the reader is referred to one of the many textbooks discussing it (e.g., Bradley et al. (1977)), but in short, the Phase I–Phase II simplex algorithm consists of the following steps:

Phase I.
Repeatedly apply Gauss–Jordan elimination steps (called pivots) to derive a decision vector that obeys all restrictions. A vector obeying all restrictions is called a basic solution.
Phase II.
Repeatedly apply pivots to move from the initial nonoptimal solution to the solution that minimizes the objective function $c07-math-269$ .

In Phase I, a decision vector $c07-math-270$ (with $c07-math-271$ the vector of slack and surplus variables) is derived that obeys all restrictions. The precise algorithm need not be described here. It involves adding again extra variables where necessary and then manipulating the system of equalities represented by the tableau so that those extra variables are driven to zero. The binary variables $c07-math-272$ are first treated as if they are real variables. In Appendix 7.A, it is shown in detail how an initial solution for Eq. (7.32) can be found; here, we just state the result of a Phase I operation:

7.33

This tableau immediately suggests a valid solution: it is easily confirmed by matrix multiplication that the vector $c07-math-274$ obeys all restrictions. In the above form of a tableau, where the restriction matrix contains (a column permutation of) the unit matrix, the right-hand side has only nonnegative coefficients, and the cost vector equals zero for the columns above the unit matrix, which is called the canonical form.

Now, a pivot operation consists of the following steps:

1. Select a positive element $c07-math-275$ from the restriction matrix. This is called the pivot element.
2. Multiply the $c07-math-276$ th row by $c07-math-277$ .
3. Subtract the $c07-math-278$ th row, possibly after rescaling, from all other rows of the tableau such that their $c07-math-279$ th column equals zero.

The result of a pivot operation is again a tableau in canonical form but with possibly a different value for the cost function. The simplex algorithm proceeds by selecting pivots that decrease the cost function until the minimum is reached or the problem is shown to be unfeasible.

Up until this point, we have treated the binary variables $c07-math-280$ as if they were real variables, so the tableaux discussed above do not represent solutions to our original problem, which demands that all $c07-math-281$ are either 0 or 1. In the lp_solve library, this is solved as follows:

1. For each optimized value $c07-math-282$ , test whether it is 0 or 1. If all $c07-math-283$ are integers, we have a valid solution of objective value $c07-math-284$ , and we are done.
2. For the first variable $c07-math-285$ that is not integer, create two submodels: one where the minimum value of $c07-math-286$ equals 1, and one where the maximum value of $c07-math-287$ equals 0.
3. Optimize the two submodels. If solutions exist, the result will contain an integer $c07-math-288$ .
4. For the submodels that have a solution and whose current objective value does not exceed that of an earlier found solution, return to step 1.

The above branch-and-bound approach completes this overview. The discussion of pivot and branch-and-bound operations has so far been purely mathematical: no choices have been made regarding issues such as how to decide when the floating-point representation of a value is regarded zero, or how to handle badly scaled problems. Do note, however, that in the course of going from Phase I to Phase II, the LP solver is handling numbers that may range from $c07-math-289$ to $c07-math-290$ , which typically differ many orders of magnitude.

7.4.2 Scaling Numerical Records

In the MIP formulation of error localization over numerical records under linear restrictions, Eq. (7.7) restricts the search space around the original value $c07-math-291$ to $c07-math-292$ . This restriction may prohibit a MIP solver from finding the actual minimal set of values to adapt or even render the MIP problem unsolvable. As an example, consider the following error localization problem on a two-variable record:

Obviously, the record can be made to obey the restriction by multiplying $c07-math-293$ by $c07-math-294$ or by dividing $c07-math-295$ by the same amount. However, in errorlocate, the default value for $c07-math-296$ , which renders the corresponding MIP problem unsolvable. Practical example where such errors occur is when a value is recorded in the wrong unit of measure (e.g., in € instead of k€.).

It is therefore advisable to remove such unit-of-measure errors prior to error localization² and to express numerical records on a scale such that all $c07-math-297$ . Note that under linear restrictions (Eq. (7.5)) one may always apply a scaling factor $c07-math-298$ to a numerical record $c07-math-299$ by replacing $c07-math-300$ with $c07-math-301$ . In the above example, one may replace $c07-math-302$ by $c07-math-303$ for the purpose of error localization. If $c07-math-304$ and the coefficients of $c07-math-305$ do not vary over many orders of magnitude, such a scaling will suffice to numerically stabilize the MIP problem.

7.4.3 Setting Numerical Threshold Values

On most modern computer systems, real numbers are represented in IEEE (2008) double precision format. In essence, real numbers are represented as rounded-off fractions so that arithmetic operations on such numbers always result in loss of precision and round-off errors. For example, even though mathematically we have $c07-math-306$ , in the floating-point representation (denoted $c07-math-307$ ), we have $c07-math-308$ . In fact, the difference is about $c07-math-309$ in this case.

This means that in practice one cannot rely on equality tests to determine whether two floating-point numbers are equal. Rather, one considers two numbers $c07-math-310$ and $c07-math-311$ equal when $c07-math-312$ is smaller than a predefined tolerance. For this reason, lp_solve comes with a number of predefined tolerances. These tolerances have default values, but these may be altered by the user.

The tolerances implemented by lp_solve are summarized in Table 7.1. The value of epspivot is used to determine whether an element of the restriction matrix is positive so that it may be used as a pivoting element. Its default value is $c07-math-313$ , but note that after Phase I, our restriction matrix contains elements on the order of $c07-math-314$ . For this reason, the value of epspivot is lowered in errorlocate by default, but users may override these settings. For the same reason, the value of epsint, which determines when a value for one of the $c07-math-315$ can be considered integer, is lowered in errorlocate as well. The other tolerance settings of lp_solve, epsb (to test if the right-hand side of the restrictions differs from 0), epsd (to test if two values of the objective function differ), epsel (all other values), and mip_gap (to test whether a bound condition has been hit in the branch-and-bound algorithm), have not been altered.

The limited precision inherent to floating-point calculations implies that computations get more inaccurate as the operands differ more in magnitude. For example, on any system that uses double precision arithmetic, the difference $c07-math-316$ is indistinguishable from $c07-math-317$ . This, then, leads to two contradictory demands on our translation of an error localization problem to a MIP problem. On one hand, one would like to set $c07-math-318$ as large as possible so that the ranges $c07-math-319$ contain a valid value of $c07-math-320$ . On the other hand, large values for $c07-math-321$ imply that MIP problems such as Eq. (7.31) may become numerically unstable.

In practice, the tableau used by lp_solve will not be exactly the same as represented in Eq. (7.33). Over the years, many optimizations and heuristics have been developed to make solving linear programming problems fast and reliable, and several of those optimizations have been implemented in lp_solve. However, the tableau of Eq. (7.33) does fundamentally show how numerical instabilities may occur: the tableau simultaneously contains numbers on the order of $c07-math-322$ and on the order of $c07-math-323$ . It is not at all unlikely that the two differ in many orders of magnitude.

The above discussion suggests the following rules of thumb to avoid numerical instabilities in error localization problems:

1. Make sure that elements of $c07-math-324$ are expressed in units such that $c07-math-325$ , $c07-math-326$ , and $c07-math-327$ are on the order of 1 wherever possible.
2. Choose a value of $c07-math-328$ appropriate for $c07-math-329$ .
3. If the above does not help in stabilizing the problem, try lowering the numerical constants of Table 7.1.

In our experience, the settings denoted in Table 7.1 have performed well in a range of problems where elements of $c07-math-330$ and $c07-math-331$ are on the order of 1 and values of $c07-math-332$ are in the range $c07-math-333$ . However, these settings have been made configurable so that users may choose their own settings as needed.

c07-math-335 — **Table 7.1** Numerical parameters for MIP-based error localization

7.5 Practical Issues

The following sections provide some background that can aid in choosing a set of reliability weights or to make a error localization problem more tractable by showing how certain rules can be simplified.

7.5.1 Setting Reliability Weights

The error localization problem can be reformulated as follows: For a record $c07-math-357$ violating one or more validation rules imposed on it, find the set of variables $c07-math-358$ of which the values can be altered such that (1) $c07-math-359$ satisfies all restrictions and (2) the sum

is minimized. Here, the vector $c07-math-360$ is a vector of chosen reliability weights. Solutions $c07-math-361$ that satisfy restriction (1) are called feasible solutions, and solutions that satisfy (2) are called minimal feasible solutions. We also denote with $c07-math-362$ the size of the solution. The solution for which $c07-math-363$ is minimal is called the feasible solution of minimal size.

When all weights are equal, the solution found by the error localization procedure is the solution that alters the least number of variables: in that case the solution of minimal weight is equal to the solution of minimal size. If the weights differ enough across the variables, this is not necessarily the case anymore, the solution of minimal weight may contain more variables than the solution of minimal size. As an example, suppose that there are $c07-math-364$ variables. We find for a certain error localization problem that there are three feasible solutions: $c07-math-365$ , $c07-math-366$ , and the trivial solution $c07-math-367$ . If the weight vector $c07-math-368$ , we find that $c07-math-369$ is the solution of minimal size, while $c07-math-370$ is the solution of minimal weight.

In practice, weights are often determined by domain experts, based on their experience with repeatedly analyzing a type of dataset. One question arising in this context is whether it is possible to choose a set of weights such that the solution of minimal weight can be guaranteed to be a solution of minimal size. Such a set of weights then represents the guiding principle that states that we wish to change as few values as possible while still allowing to control the localization procedure with information on the relative reliability of variables. Perhaps surprisingly, it is possible to create such a vector. To show this we will derive the following result (this is a slightly stricter version of the result in van der Loo (2015a)).

We can apply this result to construct a rescaling for any weight vector so that the feasible solution of minimal weight becomes a feasible solution of minimal size.

7.34

Here, we defined $c07-math-400$ so that $c07-math-401$ . This of course ensures that $c07-math-402$ .

To illustrate this scaling, return again to the example with $c07-math-403$ , $c07-math-404$ , and feasible solutions $c07-math-405$ $c07-math-406$ and $c07-math-407$ . After scaling, we get $c07-math-408$ , so $c07-math-409$ , $c07-math-410$ , and $c07-math-411$ .

7.5.2 Simplifying Conditional Validation Rules

Validation rules that combine linear (in)equality restrictions can put a severe performance burden on the error localization algorithm. Depending on the algorithm, the computational time needed for error localization may be as much as double with each extra conditional rule. In the case of MIP solving, a conditional rule means the introduction of dummy variables that increase the size of the problem. There are some cases, however, where conditional restrictions on combinations of linear inequalities can be simplified and rewritten into a set of pure inequalities. Here, we discuss two types of (partial) rule sets where simplification is possible.

Case 1

The first type is a ruleset of the form

7.35

As a practical example, one can think of $c07-math-413$ as the number of employees and $c07-math-414$ as wages paid. Both cannot be negative, and if there are any employees, some wages must have been paid. Figure 7.1(a) shows in gray the valid region corresponding to this ruleset in the $c07-math-415$ -plane. The black dots correspond to valid combinations of $c07-math-416$ and $c07-math-417$ , while the crosses correspond to invalid combinations. Now as depicted in 7.1(b), if we replace the conditional rule with $c07-math-418$ , with some appropriately chosen $c07-math-419$ , the valid region shrinks somewhat, while all valid observations remain valid, and all invalid observations remain invalid. So, the question is what choice of $c07-math-420$ would be appropriate, given some observed pairs $c07-math-421$ . This is stated in the following result:

Image described by caption and surrounding text. — **Figure 7.1** (a) In gray, the area of valid combinations $c07-math-422$ according to the rule set of Eq. (7.36), black dots represent valid combinations, and crosses represent invalid combinations. The $c07-math-423$ -axis is part of the valid area. (b) The valid area is defined by $c07-math-424$ and $c07-math-425$ , with an appropriate choice for $c07-math-426$ .

In the above calculation, the smallest possible value for $c07-math-454$ was chosen so that the replacing rule still works. In practice, one can choose for $c07-math-455$ some reasonable minimum. For example, suppose again that $c07-math-456$ stands for number of employees and $c07-math-457$ for wages paid. The smallest nonzero value for $c07-math-458$ would then be 1, and the value $c07-math-459$ is then the smallest wage that can be paid to an employee.

Case 2

Now, consider a second type of ruleset, given by

7.39

Again, we can think of $c07-math-461$ as number of employees and of $c07-math-462$ as wages paid. The ruleset then states that wages are paid if and only if the number of employees is positive. Figure 7.2(a) shows in gray the associated valid region in the $c07-math-463$ -plane. Black dots again depict valid observations, while crosses depict invalid observations. In Figure 7.2(b), we see the valid regions bordered by the lines $c07-math-464$ and $c07-math-465$ . The question is, again what values for $c07-math-466$ and $c07-math-467$ to choose?

As an example, consider a dataset on hospitals, where $c07-math-500$ is the number of beds and $c07-math-501$ the number of patients occupying a bed. These numbers cannot be negative. We also assume that if a hospital has beds, then there must be a patient (hospitals are never empty), and if there is at least one patient, there must be a nonzero number of beds. This amounts to the ruleset

Using the above proposition, we can replace this ruleset with the simpler

Here, $c07-math-502$ is the smallest number of beds per patient, which we can choose equal to 1. $c07-math-503$ is the largest number of beds per patient or the inverse of the smallest occupancy rate that can reasonably be expected for a hospital.

Generalizations

Finally, we state a few simple generalizations of the above simplifications. The proofs of these generalizations follow the same reasoning as the proofs for Propositions 7.5.10 and 7.5.11 and will not be stated here.

7.6 Conclusion

When a record is invalidated by a set of record scoped rules, it can be a problem to identify which fields are likely to be responsible for the error. In this chapter, we have seen that these fields can be indentified using the Fellegi–Holt formalism. This error localization problem for linear, categorical, and mixed-type restrictions can be formulated in terms of a MIP-problem. Mixed-type restrictions can be seen as a generalization of both record-wise linear and categorical restrictions. MIP-problems can be solved.

Although MIP-problems can be solved by readily available software packages, there may be a trade-off in numerical stability with respect to a branch-and-bound approach. This holds especially when typical default settings of such software are left unchanged and data records and/or linear coefficients of the restriction sets cover several orders of magnitude. In this paper, we locate the origin of these instabilities and provide some pointers to avoiding such problems.

The lp_solve package has now been introduced as a MIP solver back end to our errorlocate R package for error localization and rule management. Our benchmarks indicate that in generic cases where no prior knowledge is available about which values in a record may be erroneous (as may be expressed by lower reliability weights), the MIP method is much faster than the previously implemented branch-and-bound-based algorithms. On the other hand, the branch-and-bound-based approach returns extra information, most notably the number of equivalent solutions. The latter can be used as an indicator for quality of automatic data editing.

Appendix 7.A: Derivation of Eq. (7.33)

In the Phase I–Phase II simplex method, Phase I is aimed to derive a valid solution, which is then iteratively updated to an optimal solution in Phase II. Here, we derive a Phase I solution, specific for error localization problems.

Recall the tableau of Eq. (7.32); for clarity, the top row indicates to what variables the columns of the tableau pertain to.

Here, the $c07-math-509$ are slack or surplus variables, aimed to write the original inequality restrictions as equalities. The $c07-math-510$ are used to rewrite restrictions on observed variables, the $c07-math-511$ to write the upper and lower limits on $c07-math-512$ as equalities, and $c07-math-513$ to write the upper limits on $c07-math-514$ as equalities.

Observe that the above tableau almost suggests a trivial solution. If we choose $c07-math-515$ , $c07-math-516$ , and $c07-math-517$ , we may set $c07-math-518$ . However, recall that we demand all variables to be nonnegative, so $c07-math-519$ may not be equal to $c07-math-520$ . To resolve this problem, we introduce a set of artificial variables $c07-math-521$ and extend the restrictions involving $c07-math-522$ as follows:

This yields the following tableau:

The essential point to note is that the tableau now contains a unit matrix in columns 4, 5, 7, and 8, so choosing $c07-math-523$ , $c07-math-524$ , $c07-math-525$ , $c07-math-526$ , and $c07-math-527$ is a solution obeying all restrictions. The artificial variables have no relation to the original problem, so we want them to be zero in the final solution. Since the tableau represents a set of linear equalities, we are allowed to multiply rows with a constant and add and subtract rows. If this is done in such a way that we are again left with a unit matrix in part of the columns, a new valid solution is generated. Here, we add the fourth row to row 3 and subtract $c07-math-528$ times the fourth row from the last row. Row 4 is multiplied with $c07-math-529$ . This gives

This almost gives a new valid solution, except that the first row, representing the objective function, has not vanished at the third column. However, we may re-express the objective function in terms of the other variables, namely, using row 4, we have

Substituting this equation in the top row of the tableau shows that $c07-math-530$ , $c07-math-531$ , $c07-math-532$ , $c07-math-533$ , and $c07-math-534$ represent a valid solution. Since the artificial variables have vanishing values, we may now delete the corresponding column and arrive at the tableau of Eq. (7.33).

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.

Name	Age	Married	Attends
Eric	3	Yes	Kindergarten

	Default value
Parameter	`lp_solve`	`errorlocate`	Description
`M`	—	$c07-math-334$	Set bounds so $c07-math-335$
`eps`	—	$c07-math-336$	Translate $c07-math-337$ to $c07-math-338$
`epspivot`	$c07-math-339$	$c07-math-340$	Test pivot element $c07-math-341$
`epsint`	$c07-math-342$	$c07-math-343$	Test $c07-math-344$
`epsb`	$c07-math-345$	$c07-math-346$	Test $c07-math-347$
`epsd`	$c07-math-348$	$c07-math-349$	Test obj. values $c07-math-350$ during simplex
`epsel`	$c07-math-351$	$c07-math-352$	Test other numbers $c07-math-353$
`mip_gap`	$c07-math-354$	$c07-math-355$	Test obj. values $c07-math-356$ during B&B