Chapter 9
Methods Based on Models for Domain Knowledge

9.1 Correction with Data Modifying Rules

Domain experts who are familiar with a recurring statistical analysis acquire knowledge about errors that are typical for that process. This knowledge is often put to use by writing data processing scripts, for example in SQL, R, or Python, that modify the data to correct for such errors. The statements in these scripts can be fairly simple, and in fact they are often a sequence of rules that can be stated as

  if ( condition ) then ( action ).    (9.1)

The aim of each such statement is to resolve an error (detected in the condition) for which an obvious solution or other amendment can be found. Some examples of data modifying rules that we have encountered in practice include the following:

  • If total revenues per total employees (in full-time equivalent) is less than 1000, all variables from the income statement must be multiplied by 1000.
  • If other staff costs is less than zero, replace it with its absolute value.
  • If general expenditures exceeds one half of total expenditures, then total other expenditures is set to missing.

In the last case, a value is set to missing so that a better value can later be estimated by an imputation method.
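For illustration, rules like these could be hard-coded in a base R script along the following lines. This is a minimal sketch only: the data.frame dat and the column names revenue, fte, staff_costs, gen_exp, total_exp, and other_exp are hypothetical.

  # Hypothetical data.frame 'dat'; 'income_vars' lists the income statement columns.
  income_vars <- c("revenue", "staff_costs", "gen_exp", "other_exp", "total_exp")

  # Rule 1: income statement reported in the wrong unit of measure
  r <- dat$revenue / dat$fte
  i <- !is.na(r) & r < 1000
  dat[i, income_vars] <- dat[i, income_vars] * 1000

  # Rule 2: negative other staff costs are replaced by their absolute value
  i <- !is.na(dat$staff_costs) & dat$staff_costs < 0
  dat$staff_costs[i] <- abs(dat$staff_costs[i])

  # Rule 3: implausibly large general expenditures: erase other expenditures
  i <- !is.na(dat$gen_exp) & !is.na(dat$total_exp) & dat$gen_exp > 0.5 * dat$total_exp
  dat$other_exp[i] <- NA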

This method of data correction is common practice and may have a significant effect on the final results. In a study by Pannekoek et al. (2014), the effects of various data cleaning steps in two different multistep data cleaning processes were investigated. It was found that of all cells that were altered during the process, a majority was touched by data modifying rules. It was also shown that the mean value of some key variables may change significantly by application of such rules, for example, because they correct for unit-of-measure errors.

The main advantage of data correction through such scripts is that it allows for a quick way to translate knowledge to (well-documented) code. Possible disadvantages include lack of reproducibility when code is not under a decent version control system or when manipulations are done by hand. Also, each statement solves a single problem, not taking into account other such statements or validation rules. Indeed, it is our experience that in some cases these scripts do more damage to a data set, in terms of the number of violated validation rules, than they do good. One would therefore like to keep track of such modifications, and at least have some knowledge about whether they interfere with each other or not.

One solution is not to hard-code simple knowledge rules in the processing software, but to provide them as parameters to a program that applies the rules to the data and monitors their effects. This way, the impact of each rule can be tracked and evaluated. The implementation of such an approach is necessarily a trade-off between flexibility and maintainability. In the most flexible case, a user can insert arbitrary code to alter the data and the system only keeps track of the changes. This means that maintaining such a rule set is as complex as maintaining any software system. In particular, it will be very difficult to discover contradictions or redundancies in such code, and one needs to resort to software testing techniques to guarantee any level of quality. On the more restricted side, one can limit the type of rules that may be specified, allowing for some (automated) ways of interpreting their properties before confronting them with the data.

Two interesting properties of such rules present themselves. First, note that the condition in rules of the type (9.1) detects an invalid situation. The first requirement we have on a modifying rule is therefore that it actually solves that particular invalidity. In other words, if we apply the same rule twice to a data set, no modification is done at the second round. Second, it would be nice if the rules do not interfere with each other. That is, if one such knowledge rule is applied, we do not wish it to modify the data in such a way that a second rule is necessary to repair the damage done. In general, we shall demand that the order of execution of such rules is not important. Both properties together greatly enhance maintainability and understandability of the rule sets.

In the following section, we will give a formal definition of modifying rules in terms of 'modifying functions'. This allows us to make the two demands on modifying functions more precise and to study the conditions under which they hold. We shall see that we can obtain a necessary and sufficient condition for the first demand, and only a sufficient condition for the second. After discussing these general conditions, they will be made explicit for an important class of rules on numerical data. In Section 9.2, we will show how modifying rules can be applied to data from R. Readers who are less interested in the theoretical aspects can skip the remainder of Section 9.1 on first reading.

9.1.1 Modifying Functions

In the following discussion, it will be assumed that the data to be treated is a rectangular data set consisting of $n$ records and $p$ variables, which are not necessarily stored in the correct type. The measurement domain of each variable is denoted $D_j$, so a record of $p$ variables is an element of $D = D_1\times D_2\times\cdots\times D_p$ (this is a special case of the definition of the measurement domain in Section 6.3.1).

We interpret a modifying rule as a function $m: D\to D$. Both the input and the output are in the measurement domain, since we do not demand that a single modifying rule fixes every issue a record might have (although it is also not excluded). We will use the term modifying function (to be made precise at the end of this section) to indicate a function based on a data modifying rule.

Before giving the formal definitions, let us look at an example. Consider the following rule:

  if ( general expenditures > total expenditures / 2 ) then set other expenditures to missing.    (9.2)

For compactness, we will use the shorthands $g$ for general expenditures, $t$ for total expenditures, and $o$ for other expenditures. The measurement domain of each variable is given by $D_g = D_t = D_o = [0,\infty)\cup\{\mathrm{NA}\}$, so we have $D = D_g\times D_t\times D_o$. A formal expression of the rule in terms of a function is simply the following:

  $m(g,t,o) = \begin{cases} (g,t,\mathrm{NA}) & \text{if } g > t/2 \text{ and } o \neq \mathrm{NA},\\ (g,t,o) & \text{otherwise.} \end{cases}$    (9.3)

Here, the notation $[0,\infty)$ indicates the range of nonnegative real numbers. We also explicitly include the demand $o\neq\mathrm{NA}$ since we do not want to (re)set a value unnecessarily.
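As a small sketch of how this function might look in R, assuming that a record is represented as a list with numeric elements g, t, and o (this representation is our own illustration):

  # Modifying function of Eq. (9.3): NA represents a missing value
  m <- function(rec) {
    cond <- !is.na(rec$g) && !is.na(rec$t) && !is.na(rec$o) && rec$g > rec$t / 2
    if (cond) rec$o <- NA_real_   # move the record out of the invalid region
    rec
  }
  m(list(g = 60, t = 100, o = 20))   # returns g = 60, t = 100, o = NA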

The form of function (9.3) is easy to generalize in the following formal definition.

Definition 9.1.1 (Modifying function) Let $S\subseteq D$ and let $f: D\to D$. The modifying function $m$ defined by the pair $(S, f)$ is given by

  $m(x) = \begin{cases} f(x) & \text{if } x\in S,\\ x & \text{otherwise.} \end{cases}$

This definition includes the case where $S = D$, meaning that $x$ is transformed to $f(x)$ in any case. If $S$ is a proper subset of $D$ ($S\subsetneq D$), then we can interpret $S$ as an invalid region, and the rule is an attempt to move a record $x\in S$ out of this invalid region.

Idempotency

Consider again the modifying function of Eq. (9.3). It is easy to see that if we replace a record $(g,t,o)$ with $m(g,t,o)$, a second application of $m$ will not change the result. To see this, we need to check two cases: the case where the initial record satisfies the condition and the case where it does not. Starting with the latter case, suppose that for our original record we have $g \le t/2$ or $o = \mathrm{NA}$. In that case, we have $m(g,t,o) = (g,t,o)$ and therefore $m(m(g,t,o)) = m(g,t,o)$. Now, if we have $g > t/2$ and $o\neq\mathrm{NA}$ for the original record, we get $m(g,t,o) = (g,t,\mathrm{NA})$. Since we now have $o = \mathrm{NA}$, the result of $m(m(g,t,o)) = m(g,t,\mathrm{NA}) = (g,t,\mathrm{NA}) = m(g,t,o)$.
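Using the sketch of m given earlier, this is easily verified numerically:

  rec <- list(g = 60, t = 100, o = 20)   # in the invalid region: g > t/2
  identical(m(m(rec)), m(rec))           # TRUE: a second application changes nothing
  rec <- list(g = 10, t = 100, o = 20)   # valid record
  identical(m(m(rec)), m(rec))           # TRUE: m leaves it untouched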

The point is that once a record is moved out of an invalid region, we do not want the function to move it again. The property that iterating a function twice does not alter the result is referred to as idempotency, which we now define formally.

Definition 9.1.2 (Idempotency) A function $m: D\to D$ is called idempotent on $D$ when $m(m(x)) = m(x)$ for every $x\in D$.

The idempotent property is not only desirable from the theoretical or subject-matter point of view. The main practical benefit is that the function, or its underlying rule, can be interpreted just by understanding the definition of the condition and the function $f$, without regarding possible iterations of the rule. It is therefore useful to have some general conditions under which a modifying function is idempotent. Intuitively, a modifying function should be idempotent when $f$ takes a record $x\in S$ out of the invalid region, as depicted in Figure 9.1. The following proposition claims that this is indeed a necessary and sufficient condition.

Proposition 9.1.3 Let $m$ be the modifying function defined by the pair $(S, f)$. Then $m$ is idempotent on $D$ if and only if for every $x\in S$, either $f(x)\notin S$ or $f(f(x)) = f(x)$.

To show that $m$ is indeed idempotent on the whole of $D$ under these conditions, we prove the property separately for the complement of $S$ in $D$ (denoted $D\setminus S$) and for $S$ itself.

It is not difficult to define modifying functions that are not idempotent. Exercise 9.1.2 aims to give an example of how such situations may occur.


Figure 9.1 Idempotency of a modifying function. The full rectangle represents the measurement domain $D$. The portion of the rectangle outside of $S$ is valid according to the condition in $m$. The region $S$ is considered invalid. The function $m$ maps an invalid record to a valid one and a valid record to itself.

Commutativity

Consider an example rule that can be used to detect and repair unit-of-measure errors where units are off by a factor of 1000. We use the same variables as in Eq. (9.3).

  if ( total expenditures ≥ 1000 × (general expenditures + other expenditures) ) then divide total expenditures by 1000.    (9.4)

The formal definition can be written as follows:

  $m'(g,t,o) = \begin{cases} (g, t/1000, o) & \text{if } t \ge 1000(g+o),\\ (g,t,o) & \text{otherwise.} \end{cases}$    (9.5)

The result of applying both $m$ and $m'$ to a record $(g,t,o)$ may well be order-dependent. Consider a record with $o\neq\mathrm{NA}$ for which $t \ge 1000(g+o)$ and $g \le t/2$. Then for sure it must be in the invalid region of $m'$, but not in that of $m$. Now, if we apply first $m'$ and then $m$, we get $m(g, t/1000, o)$. If $g > t/2000$, then $m(g, t/1000, o) = (g, t/1000, \mathrm{NA})$. Now, suppose we first execute $m$ and then $m'$. We have $m'(m(g,t,o)) = m'(g,t,o) = (g, t/1000, o)$.
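With the list-based sketches of $m$ given earlier and a corresponding sketch of $m'$, the order dependence can be demonstrated directly in R:

  # Sketch of the modifying function of Eq. (9.5)
  m2 <- function(rec) {
    cond <- !is.na(rec$g) && !is.na(rec$t) && !is.na(rec$o) &&
      rec$t >= 1000 * (rec$g + rec$o)
    if (cond) rec$t <- rec$t / 1000   # repair the unit-of-measure error
    rec
  }
  rec <- list(g = 3, t = 5000, o = 1)
  str(m(m2(rec)))   # g = 3, t = 5, o = NA
  str(m2(m(rec)))   # g = 3, t = 5, o = 1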

So here we have a case where the order of execution of a set of modifying functions $\{m, m'\}$ may be important for the result. For larger sets of modifying rules, such interdependencies make the action of any single rule on a data set hard to judge. In the worst case, one function may undo the action of another. When designing a set of rules, it is therefore desirable to detect these cases so that the 'risk of collision' can be assessed, or avoided altogether.

The technical term for independence of order of execution is commutativity of two functions, which we are now ready to define precisely.

Definition 9.1.4 (Commutativity) Two functions $m_1, m_2: D\to D$ commute on $x\in D$ when $m_1(m_2(x)) = m_2(m_1(x))$. They commute on $D$ when they commute on every $x\in D$.

Two functions trivially commute when they act on different variables. If two modifying functions $m_1$ and $m_2$ act on different variables in the condition as well as in the consequent, the action of the first function cannot alter the outcome of the second. Apart from acting on different variables, we can derive some very general sufficient (but not necessary) conditions under which two modifying functions commute. These are stated in the following proposition.

Proposition 9.1.5 Let $m_1$ and $m_2$ be idempotent modifying functions, defined by the pairs $(S_1, f_1)$ and $(S_2, f_2)$. Then $m_1$ and $m_2$ commute on $D$ if

1. $f_1(x)\notin S_1\cup S_2$ for all $x\in S_1$,
2. $f_2(x)\notin S_1\cup S_2$ for all $x\in S_2$,
3. $f_1(x) = f_2(x)$ for all $x\in S_1\cap S_2$.

Before stating the proof, it is elucidating to consider the situation depicted in Figure 9.2. Here, $m_1$ and $m_2$ are modifying functions that we define to be idempotent on all of $D$. On the right, we see that $m_1$ and $m_2$ commute on the point $y$: the dashed lines correspond to $m_1(m_2(y))$ and the solid arrows correspond to $m_2(m_1(y))$. On the left, we see that $m_1$ and $m_2$ do not commute on the point $x$, because $m_1$ takes $x$ into the invalid region of $m_2$ (labeled $S_2$), after which $m_2$ takes $m_1(x)$ out of $S_2$. Again, following the solid arrows corresponds to $m_2(m_1(x))$ and following the dashed lines corresponds to $m_1(m_2(x))$.

The proof of the proposition now consists of carefully considering the four regions in the diagram: $D$ less the invalid regions, that is $D\setminus(S_1\cup S_2)$, region $S_1$ less $S_2$, region $S_2$ less $S_1$, and the intersection of $S_1$ and $S_2$.


Figure 9.2 Commutativity of idempotent modifying functions. The full rectangle including the subsets $S_1$ and $S_2$ corresponds to the measurement domain $D$. The idempotent modifying functions $m_1$ and $m_2$ commute on $y$ but not on $x$.

The observations above suggest the following procedure for checking a set of modifying rules pairwise.

Procedure 9.1.1 For each pair of modifying functions $m_1$, $m_2$ in the rule set:
1. If $m_1$ and $m_2$ share no variables, neither in the condition nor in the consequent, conclude that they commute.
2. Otherwise, check that $m_1$ and $m_2$ are idempotent (Proposition 9.1.3) and satisfy the conditions of Proposition 9.1.5. If so, conclude that they commute.
3. Otherwise, flag the pair for manual inspection.

The above outline of a procedure combines the trivial notion that two functions commute when they act on different variables with the general conditions derived above. This procedure does not guarantee that mutually interfering rules can always be found automatically. It does provide a strategy that can be partially automated to indicate possible conflicts. In order to be able to explicitly test the conditions for idempotency or commutativity given in Propositions 9.1.3 and 9.1.5, we need to assume more about the type of rules. In the following section, these conditions are made explicit for an important class of direct rules.

9.1.2 A Class of Modifying Functions on Numerical Data

Many modification rules on numerical data that are used in practice can be written in the form

  $m(x) = \begin{cases} Bx + b & \text{if } Ax \le a,\\ x & \text{otherwise.} \end{cases}$    (9.6)

Here, the real matrices $A$ and $B$ are of dimensions $k\times n$ and $n\times n$, respectively, and we assume that the system $Ax \le a$ is feasible. A transformation that sends a vector $x$ to $Bx + b$ is called an affine map. Hence, we will call a modifying function as in Eq. (9.6) an affine modifier with linear inequality conditions, or AMLIC for short.

It was noted by Scholtus (2015) that affine maps cover several common operations that are performed either manually by analysts or via automated processing of modifying rules. In particular, we will show that such mappings can be used to replace values by a constant, multiply values by a constant, swap values in a record, and compute a value from a linear combination of other values in the record.

As an example, we will study three variables: turnover, other revenue, and total revenue. A record containing a value for each of these variables can be represented as a vector $x = (x_1, x_2, x_3)$, where $x_1$ is the value of turnover, $x_2$ the value of other revenue, and $x_3$ the value of total revenue.
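For instance, a value swap and a recomputation of the total can both be written as affine maps. The following matrices are our own illustration of this idea:

  # Affine maps acting on x = (turnover, other revenue, total revenue)
  B_swap  <- matrix(c(0, 1, 0,    # swap turnover and other revenue
                      1, 0, 0,
                      0, 0, 1), nrow = 3, byrow = TRUE)
  B_total <- matrix(c(1, 0, 0,    # recompute total as turnover + other revenue
                      0, 1, 0,
                      1, 1, 0), nrow = 3, byrow = TRUE)
  x <- c(12, 3, 8)
  B_swap  %*% x   # (3, 12, 8)
  B_total %*% x   # (12, 3, 15)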

There are other examples, which will be discussed in Exercise 9.1.5.

Given that several such common operations can be expressed in the form of Eq. (9.6), it is interesting to see if general and explicit conditions for idempotency and commutativity can be derived. As will be shown next, this is indeed the case.

Idempotency

According to Proposition 9.1.3, we need to check if every vector $x$ in $S = \{x : Ax \le a\}$ is mapped to a vector that satisfies the opposite. That is, we need to ensure that any vector $Bx + b$ (with $x\in S$) does not satisfy $A(Bx+b) \le a$; that is, we demand that

  $\neg\left[\,A(Bx + b) \le a\,\right]$

for all $x$ satisfying $Ax \le a$. These demands are summarized by stating that the system

  $Ax \le a \quad\text{and}\quad ABx \le a - Ab$    (9.7)

has no solutions. The consistency of such systems can be determined, for example, through Fourier–Motzkin elimination.
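For example, idempotency of the simple rule 'if $x \le -1$ then $x \leftarrow 0$' can be checked by testing feasibility of the stacked system. The following is a sketch, assuming the is_feasible function of the lintools package for checking systems of linear (in)equalities:

  library(lintools)
  # Rule "if (x <= -1) x <- 0" as an AMLIC: A = 1, a = -1, B = 0, b = 0
  A <- matrix(1); a <- -1
  B <- matrix(0); b0 <- 0
  # Stack the subsystems of Eq. (9.7): A x <= a and A B x <= a - A b
  AA <- rbind(A, A %*% B)
  aa <- c(a, a - A %*% b0)
  is_feasible(A = AA, b = aa, neq = 0, nleq = nrow(AA))
  ## [1] FALSE  (infeasible, so the modifier is idempotent)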

Commutativity

Given two modifying functions $m_1$ and $m_2$ of the class defined by Eq. (9.6), define $S_1$ and $S_2$ to be their respective invalid regions, given by $A_1x \le a_1$ and $A_2x \le a_2$. The affine transformation of $m_1$ is defined by $f_1(x) = B_1x + b_1$ and that of $m_2$ by $f_2(x) = B_2x + b_2$. Before stating Procedure 9.1.1 explicitly for this case, we derive sufficient conditions for the two functions to be commutative, in the case that they are idempotent.

If $m_1$ and $m_2$ are idempotent, we need to check the three conditions of Proposition 9.1.5. Conditions 1 and 2 are related to regions $S_1$ and $S_2$, respectively, and they state that any vector in those regions must be mapped outside of $S_1\cup S_2$. Using a similar reasoning as for Eq. (9.7), we arrive at the conclusion that this is equivalent to stating that the systems of inequalities given by

  $\{A_1x \le a_1,\; A_2B_1x \le a_2 - A_2b_1\}$ and $\{A_2x \le a_2,\; A_1B_2x \le a_1 - A_1b_2\}$    (9.8)

have empty solution spaces (idempotency already guarantees that $f_1$ maps $S_1$ outside $S_1$, and $f_2$ maps $S_2$ outside $S_2$).

The third condition states that for any vector in $S_1\cap S_2$, the modifying functions must be equal, so $B_1x + b_1 = B_2x + b_2$ for all $x\in S_1\cap S_2$. Let us denote with $E$ the region where this equality holds, so $E = \{x : (B_1 - B_2)x = b_2 - b_1\}$. The condition implies that every $x\in S_1\cap S_2$ is also in $E$, or $S_1\cap S_2\subseteq E$. To express this in terms of systems of linear restrictions, we can restate this by demanding that $S_1\cap S_2\cap E^{c} = \emptyset$, with $E^{c}$ the complement of $E$. Now, $E^{c}$ is a set of vectors that is defined by not satisfying a set of linear equalities. A vector is a member of this set when it violates at least one of those restrictions. Let us denote with $d_i$ the $i$th row of the matrix $B_1 - B_2$ and with $e_i$ the $i$th element of $b_2 - b_1$. The $i$th restriction defining $E$ is given by

  $d_i\cdot x = e_i.$

For a vector to violate this restriction, it should satisfy either

  $d_i\cdot x > e_i$

or

  $d_i\cdot x < e_i.$

We are now able to state the third condition in terms of demands on sets of linear restrictions. The set $S_1\cap S_2\cap E^{c}$ is empty if, for each of the rows $i$, both of the two systems

  $\{A_1x \le a_1,\; A_2x \le a_2,\; d_i\cdot x > e_i\}$ and $\{A_1x \le a_1,\; A_2x \le a_2,\; d_i\cdot x < e_i\}$    (9.9)

are infeasible.

Now that we have derived explicit conditions for the commutativity of two modifiers, we can specialize Procedure 9.1.1 to affine modifiers with linear inequality conditions. The procedure is stated in a fairly general way and may in principle involve many feasibility checks on systems of linear inequations. In practice, user-defined rules commonly involve only one or a few linear inequality conditions and only very simple transformations that involve few variables. So in many cases, this procedure will be computationally tractable.

Exercises for Section 9.1

9.2 Rule-Based Correction with dcmodify

The package dcmodify implements a work flow where data modifying rules are not part of the data processing software, but instead are stored and manipulated as parameters of a data cleaning process. The situation is depicted schematically as follows:

Scheme of a work flow in which data modifying rules are stored and manipulated as parameters to a data cleaning process: a rule file and an input data set both enter the modifying program, which outputs the modified data.

Here, a set of rules defines how data are modified. The allowed rules are assignments, possibly executed within a (nested) conditional statement. Just like the validate package, dcmodify employs a subset of the R language for rule definition.

The simplest command-line use is implemented as the modify_so function, accepting a data.frame and one or more (conditional) modifying rules.

  library(dcmodify)
  dat <- data.frame(x = c(5, 17), y = c("a", "b"), stringsAsFactors = FALSE)
  modify_so(dat, if( x > 6 ) y <- "c")
  ##    x y
  ## 1  5 a
  ## 2 17 c

The function is set up so it can be integrated in a stream-like syntax facilitated by the magrittr package.

  library(magrittr)
  dat %>% modify_so(
      if (x > 6) y <- "c"
   , if (y == "c") x <- 11 )
  ##    x y
  ## 1  5 a
  ## 2 11 c

We point out two observations. First, the conditional statements are interpreted as if they are executed record by record. Under the hood, optimization takes place to make sure that the code is executed in a vectorized manner. Second, the rules are executed sequentially. In the example, the first rule changes the value of y from "b" to "c" in the second record. Next, the second rule changes the value of x from 17 to 11 in that record because the value of y is now "c". (Needless to say, these two rules do not commute in the sense explained in the previous section.) This sequential behavior is the default; it can be toggled by setting the sequential option to FALSE (see Section 9.2.4).
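For example, the effect of the first rule on this data set corresponds to the following vectorized base R statement. This is a sketch of what happens conceptually, including the default treatment of missing conditions as FALSE:

  dat$y <- ifelse(!is.na(dat$x) & dat$x > 6, "c", dat$y)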

9.2.1 Reading Rules from File

The dcmodify package follows the structure of the validate package. Like in validate, there is an object storing rules; the difference is that rules are stored in an object called modifier rather than validator, reflecting the purpose of the rules. To execute the rules, there is a function called modify that 'confronts' a data set with a set of modifying rules.

  mod <- modifier(
    if ( x > 6 ) y <- "c"
   , if ( y == "c" ) x <- 11
 )
  modify(dat, mod)
  ##    x y
  ## 1  5 a
  ## 2 11 c

Modifying rules may be read from and stored to files just like validation rules, and the structure of the files is the same as discussed in Section 6.5.6, except that the rule syntax is different. So rules may be defined in a free-format text file such as the following:

# modifiers.txt
if ( x > 6 ){
  y <- "c"
}
if ( y == "c" ){
  x <- 11
}

Subsequently, rules may be read with modifier using the .file argument.

  mod <- modifier(.file="modifiers.txt")

Moreover, modifier objects support the same metadata as validator objects, so functions such as origin, label, and variables also work on modifier objects.

  variables(mod)
  ## [1] "x" "y"

Metadata can be defined using the yaml format described earlier in Section 6.5.6 as well. So, for example, the following file defines a bit of metadata for one of the rules defined earlier:

rules:
-
  expr: |
    if ( x > 6 ) y <- "c"
  name: fix_x
  description: |
    When `x' is too large, the y-category must be 'c'.

9.2.2 Modifying Rule Syntax

The general structure of a modifying statement currently allowed by the dcmodify package is given by the following very simple syntax diagram, rendered here in text form:

  [ if ( condition ) ] variable <- expression

Curly braces (blocks) have been left out for clarity. In short, a modifying statement consists of one or more assignments, possibly executed under a (nested) conditional statement.

The following examples illustrate the type of rules that can be defined with this syntax:

  # multiple assignment
  if ( x > 0 ){
    x <- 0
    y <- 2 * y
  }
  # nested conditions
  if ( x > 0 ){
    x <- 0
    if ( y < 10 ){
      y <- 2 * y
    }
    }
  }
  # reassignment
  x <- y - 7

Like in the validate package, it is possible to define variable macros (using the := operator) or variable groups (with vargroup). In other words, the modifier defined by

  modifier(
    x_limit := 10*median(x)
  , if ( x > x_limit) x <- NA
 )

is equivalent to

  modifier( if ( x > 10*median(x) ) x <- NA )

9.2.3 Missing Values

It may occur that the condition in a modifying rule cannot be evaluated because one or more of the necessary values are missing. By default, the package then assumes that the result is FALSE. For example, compare the output for the two-record data set in the following code:

  dat <- data.frame(x=c(NA, 0), y=c(0,0))
  m <- modifier( if ( x == 0 ) y <- 1 )
  modify(dat, m)
  ##    x y
  ## 1 NA 0
  ## 2  0 1

In the first record, the condition x == 0 could not be evaluated so the result FALSE was assumed and consequently, y is left untouched. The second record could be evaluated normally.

The default assumption that an NA in the condition is interpreted as FALSE can be altered with the option na.condition.

  modify(dat, m, na.condition=TRUE)
  ##    x y
  ## 1 NA 1
  ## 2  0 1

9.2.4 Sequential and Sequence-Independent Execution

In principle, execution of one rule can influence the condition of another rule (see also the discussion of Section 9.1.1). As an example, consider the following single-row data set and sequence of modifications:

  dat <- data.frame(x = 0, y = 0)
  m <- modifier(
    if ( x == 0 ) x <- 1
  , if ( x == 0 ) y <- 1
 )

If we apply the modifier to the data, the first modifying rule will set x to 1, since x equals zero. The consequent of the second rule is not executed since x has now changed.

It may occur that the decision to execute the second statement should not depend on previously executed rules. This can be controlled by setting the option sequential=FALSE. As an example, consider the following outputs:

  modify(dat, m)
  ##   x y
  ## 1 1 0
  modify(dat, m, sequential=FALSE)
  ##   x y
  ## 1 1 1

With sequential=FALSE, the conditional part of each rule is evaluated using the original data. So in the second case, the condition x == 0 evaluated to TRUE for both rules, regardless of the changes made by the first rule.

9.2.5 Options Settings Management

The option settings for the dcmodify package are handled by the same options manager as for the validate package. This means (again) that options can be set on a global level using validate::voptions.

  voptions(na.condition=TRUE)

It is also possible to attach options to objects carrying rules, including modifier objects.

  voptions(m, sequential=FALSE)

The options hierarchy, where global options (set with voptions) are overruled by options attached to an object, which are in turn overruled by options passed during a function call, is the same as for the validate package as well.

9.3 Deductive Correction

In some cases, inconsistencies in a data record can be traced back to a single origin, either by (automatic) reasoning based on the validation rules or by allowing for a weak assumption based on subject matter knowledge. In general, deductive correction methods allow one to use data available within a faulty record to find a unique correction that fixes one or more inconsistencies.

For example, suppose we have a numerical record with integer values $(p, c, t) = (123, 129, 225)$, subject to the rule $p + c = t$ ($p$rofit plus $c$ost equals $t$urnover). Without any further information, Fellegi and Holt's principle tells us that one randomly chosen variable should be altered. However, if we study the values a little, we see that the record can be made consistent with the rule if we swap the last two digits in the value for turnover: $225 \rightarrow 252$, assuming that a typo has been made.

Over the years, several algorithms have been developed and implemented that allow various forms of deductive data correction. Such algorithms are often specific to a certain data type, a certain type of validation rules, and certain assumptions about the cause of error. In the following sections, we highlight a number of those algorithms and show how they can be applied in R.

9.3.1 Correcting Typing Errors in Numeric Data

Integer data that is entered by humans may contain typing errors that cause violations of linear equality restrictions. If such errors can be detected, they may be repaired by replacing the erroneous value with a valid one. Scholtus (2009) describes a method for detecting and repairing typing errors in numerical data under linear restrictions. The basic idea is both elegant and simple: given a record violating one or more restrictions, one generates a number of candidate solutions by keeping one or more variables fixed and solving for the other ones using the linear restrictions. If the values in the candidate solution are no more than, say, a single typing error away from the original values, it is assumed that a typing error has been made. Finally, from all the candidate solutions, a trade-off between the minimal number of changes and the maximum number of resolved violations determines which changes are actually applied.

Area of Application

This method can be applied to records containing integer data, subject to at least a set of linear equality restrictions that can be written as

  $Ax = b,$    (9.10)

where $A$ and $b$ have integer coefficients. Such a subset is necessary to generate candidate solutions. There may be more (in)equality restrictions, but these are only used to check that proposed solutions do not generate extra violations. The demand that $x$ contains only integer values poses a restriction on the constraint matrix. In particular, $A$ must be a totally unimodular matrix for Eq. (9.10) to be guaranteed to have integer solutions. A matrix is totally unimodular when every nonsingular square submatrix has determinant $+1$ or $-1$, which in turn implies that each coefficient $A_{ij}\in\{-1, 0, 1\}$. There are several tests for total unimodularity available, such as those described by Heller and Tompkins (1956) or Raghavachari (1976), and the lintools R package implements a combination of them.
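For example, the coefficient matrix of the single balance rule $x_1 + x_2 = x_3$ passes the test. A minimal sketch using lintools:

  library(lintools)
  # Coefficient matrix of the balance restriction x1 + x2 - x3 == 0
  A <- matrix(c(1, 1, -1), nrow = 1)
  is_totally_unimodular(A)
  ## [1] TRUE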

Although a general classification of unimodular matrices seems to be missing, the authors have never encountered a practical set of restrictions that did not satisfy the unimodularity criterion. The reason is that many restriction sets are balance restrictions of the form $x_1 + x_2 + \cdots + x_k = x_{\mathrm{total}}$ that follow a tree-like structure. In such systems, there is a single overall result, for example, total profit $=$ total revenue $-$ total cost, and each term on the right-hand side is the sum of new subvariables (a list of particular costs, a list of particular revenues, and so on). Each subvariable can be the sum of new variables again. It is not hard to see that for this important class of tree-like balance structures, an integer solution can always be constructed by choosing an integer for the top result variable and recursively choosing values down the tree, so in these cases the corresponding restriction matrix must be totally unimodular.

The Algorithm

In his original work, Scholtus identifies typographical errors by studying their effect on the decimal representation of integers. Here, we generalize the method a little by treating an integer as a string and allowing for any typographical error based on an edit-based string metric as described in Section 5.4.1. This has the advantage that sign errors can be detected and repaired as well. Moreover, in the original formulation, the set of equality restrictions was limited to balance restrictions of the form $\sum_i x_i = x_{\mathrm{total}}$, whereas we treat the general case of Eq. (9.10). The approach described below was first described in van der Loo et al. (2011).
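Both a digit transposition and a sign error are a single edit away from the original value under the optimal string alignment metric, as a quick check with the stringdist package shows:

  library(stringdist)
  stringdist("225", "252", method = "osa")    # transposition of adjacent digits
  ## [1] 1
  stringdist("-123", "123", method = "osa")   # sign error: one deletion
  ## [1] 1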

The algorithm consists of two main parts. First, a list of candidate solutions is generated. Next, a subset of candidates that is in some sense optimal is selected from the set of candidates. The following procedure gives an overview of the algorithm.

Procedure 9.3.1 (overview)
1. Determine the violated equality restrictions, and select the variables that occur only in violated restrictions.
2. For each violated restriction and each selected variable occurring in it, solve the variable from the restriction. Add the solution to the candidate set $C$ if it differs no more than a fixed string distance from the original value.
3. From $C$, select a subset with at most one suggestion per variable that maximizes the number of satisfied restrictions without introducing new violations, and apply it to the record.

Here, in step 1, attention is restricted to variables that occur only in violated restrictions, since altering a variable that also occurs in a satisfied equality restriction would lead to new violations. To explain the second step, first note that given a violated restriction of the form $a_1x_1 + \cdots + a_nx_n = b$, one can always resolve the violation by choosing a variable and solving it from the restriction. In step 2, this is done for each variable occurring in each violated restriction. Each such solution is considered a candidate if it differs no more than a certain string distance from the original value. The set of candidates $C$ may therefore in general contain multiple suggestions for each variable. In step 3, the set of candidate solutions is reduced by selecting the subset of candidates that (1) contains only one suggestion for each chosen variable, (2) maximizes the set of satisfied restrictions, and (3) does not violate any extra restrictions that were not violated before. This maximization can be performed by generating all subsets without duplicate suggestions and computing the violations after substituting the suggestions in the original record $x$.

There are some extensions of this algorithm that make it more useful in practice. The first extension is to allow for linear (in)equations. The subset selection method in the final step can take these into account by checking that the chosen solution does not violate any new inequality restrictions. The second extension allows for missing values, by eliminating the variables containing missing values from the set of (in)equations before running the algorithm just described.

Correcting Typographical Errors in Numerical Data with R

The package deductive implements the algorithm for correcting typographical errors described above. The restrictions may be defined using the validate package. Example 9.3.10 can be computed in R as follows:

  library(validate)
  library(deductive)
  dat <- data.frame(x1 = 123, x2 = 192, x3 = 252)
  v <- validate::validator(x1 + x2 == x3)
  deductive::correct_typos(dat, v)
  ##    x1  x2  x3
  ## 1 123 129 252

The correct_typos function accepts a data set and a rule set and returns a data set with corrections where possible. By default the maximum accepted edit distance (measured as optimal string alignment, see Section 5.4.1) equals one. The distance measure may be set by passing method="[methodId]" where [methodId] is any method accepted by stringdist::stringdist (see Section 5.4.2).

Here is an example with a sign error, which may be detected as a typographical error.

  dat <- data.frame(x1 = -123, x2 = 129, x3 = 252)
  deductive::correct_typos(dat,v)
  ##    x1  x2  x3
  ## 1 123 129 252

The deductive package extends Procedure 9.3.1 by also rejecting solutions that would cause new violations of restrictions that are not used in deriving candidate solutions.

  dat <- data.frame(x1 = -123, x2 = 129, x3 = 252)
  v <- validate::validator(x1 + x2 == x3, x1 < 0)
  deductive::correct_typos(dat, v)
  ##     x1  x2  x3
  ## 1 -123 129 252

Here, no correction is applied to x1 since switching it from negative to positive would break the rule x1 < 0.

Exercises for Section 9.3.1

9.3.2 Deductive Imputation Using Linear Restrictions

Correction of numerical records that are subject to a set of linear (in)equations is often a multistep process during which existing values are accepted as correct, missing values are imputed, or erroneous values are removed. Once all erroneous values are removed, a record can be thought of as consisting of two parts: a part with correct values and a part that is missing. Since the values that are present are deemed correct, they may be substituted in the system of restrictions, yielding a reduced system that applies to the unobserved values. If sufficiently many values have been substituted, the reduced set of restrictions may restrict one or more unobserved values to a unique value.

One way of flushing out such implied equalities for a variable, say $x_j$, is by eliminating one by one all other variables until either no restriction pertaining to $x_j$ is left, or an obvious unique value for $x_j$ occurs. Repeating this procedure for each missing value yields every value that is uniquely implied by the reduced system of restrictions. For systems with inequalities, the elimination procedure involves Fourier–Motzkin elimination, which quickly becomes intractable as the number of variables grows; the number of implied inequations grows at worst exponentially with the number of eliminated variables.
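Variable elimination of this kind is available in the lintools package. As a small sketch (assuming lintools' eliminate function with arguments A, b, neq, and variable): eliminating $y$ from the system $\{x + y = 10,\; y = 4\}$ leaves the implied value $x = 6$.

  library(lintools)
  A <- rbind(c(1, 1),    # x + y == 10
             c(0, 1))    # y == 4
  b <- c(10, 4)
  # Eliminate the second variable (y); the resulting system is x == 6
  eliminate(A = A, b = b, neq = 2, variable = 2)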

As an alternative, one may attempt to derive as many implied values as possible using other, more heuristic methods. These methods do not necessarily find every unique imputation, but they have significantly lower complexity than Fourier–Motzkin-based methods. Once the heuristic methods have been applied, a (possibly) smaller set of missing values remains, leaving a simpler problem for the elimination-based methods.

Below, we describe three such methods. The first one is based on computing the Moore–Penrose pseudoinverse for the subset of equality restrictions, the second finds zero imputations in specific cases, and the third recursively finds combined inequations that imply a simple equality.

Extracting Implied Values from the Subsystem of Equalities

For this method, we assume we have a numerical record $x$ for which a number of values are missing. The values may have been missing to begin with, or they may have been removed after application of a suitable error localization method. We furthermore suppose that $x$ must satisfy a set of linear restrictions denoted

  $Ax = b.$    (9.11)

It is possible that the record is subject to inequality restrictions as well, but these may be ignored here.

Sorting the values conveniently, we may write $x = (x^{\mathrm{obs}}, x^{\mathrm{mis}})$, where 'obs' stands for observed and 'mis' for missing. Likewise, the columns of the restriction matrix $A$ can be sorted so that we can write

  $Ax = A^{\mathrm{obs}}x^{\mathrm{obs}} + A^{\mathrm{mis}}x^{\mathrm{mis}} = b.$

This may be rewritten to obtain a set of demands on the missing values, expressed in the observed values:

  $A^{\mathrm{mis}}x^{\mathrm{mis}} = b - A^{\mathrm{obs}}x^{\mathrm{obs}}.$    (9.12)

Note that the right-hand side of this equation is fully known since it is computed from the observed values.

We are now left with three cases. The first option is that the system (9.12) has no solutions. In that case, either the system of Eq. (9.11) was inconsistent to begin with, or the values of $x^{\mathrm{obs}}$ are such that the system (9.12) is rendered inconsistent. Note that this is never the case when the original system was consistent and an appropriate error localization algorithm was applied. The second option is that $A^{\mathrm{mis}}$ is square and of full rank so it can be inverted. In that case, we can impute the missing values uniquely by setting $x^{\mathrm{mis}} = (A^{\mathrm{mis}})^{-1}(b - A^{\mathrm{obs}}x^{\mathrm{obs}})$. The third option is that there are an infinite number of solutions. It can be shown that every solution $x^{\mathrm{mis}}$ can be expressed as

  $x^{\mathrm{mis}} = A^{\mathrm{mis}+}(b - A^{\mathrm{obs}}x^{\mathrm{obs}}) + (\mathbb{1} - A^{\mathrm{mis}+}A^{\mathrm{mis}})z,$    (9.13)

where $A^{\mathrm{mis}+}$ is the so-called Moore–Penrose pseudoinverse of $A^{\mathrm{mis}}$ and $z$ is any vector in $\mathbb{R}^{m}$, with $m$ the number of missings. We refer the reader to the appendix of this section for a short discussion of the Moore–Penrose matrix and the notation for this solution.

We may use Eq. (9.13) for deductive imputation in the following way. Suppose that the $j$th row of the matrix $\mathbb{1} - A^{\mathrm{mis}+}A^{\mathrm{mis}}$ is filled with zeros. Since the Moore–Penrose pseudoinverse is unique, the $j$th coefficient of $x^{\mathrm{mis}}$ is then uniquely determined by the $j$th element of the first term in Eq. (9.13):

  $\hat{x}^{\mathrm{mis}}_j = \left[A^{\mathrm{mis}+}(b - A^{\mathrm{obs}}x^{\mathrm{obs}})\right]_j.$    (9.14)
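As a sketch of this computation with MASS::ginv, using the restrictions and the record of the impute_lr example shown later in this section (the missing variables there are x2, x3, x5, x6, and x9):

  library(MASS)
  # Columns ordered as (x2, x3, x5, x6, x9); rows are the four balance rules
  # with the observed values moved to the right-hand side.
  A_mis <- matrix(c( 1, -1, 0, 0, 0,    #  x2 - x3  == -x1
                     0,  0, 1, 1, 0,    #  x5 + x6  ==  x1 - x4
                    -1,  0, 0, 0, 0,    # -x2       == -(x7 + x8)
                     0, -1, 0, 0, 1),   # -x3 + x9  == -x10
                  nrow = 4, byrow = TRUE)
  b_mis <- c(-100, 100 - 15, -(25 + 35), -5)
  Ap <- ginv(A_mis)
  round(diag(5) - Ap %*% A_mis, 8)  # rows 1, 2, and 5 are zero: x2, x3, x9 are unique
  round(Ap %*% b_mis, 8)            # 60, 160, 42.5, 42.5, 155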

Using Nonnegativity Constraints

Under some specific circumstances, nonnegativity constraints combined with a linear equality restriction can lead to unique solutions. For example, suppose we have the following balance rule:

  $x_1 + x_2 + x_3 + x_4 = x_5,$    (9.15)

where all $x_j \ge 0$. Now if $x_1$, $x_2$, and $x_5$ are observed with values such that $x_1 + x_2 = x_5$, and $x_3$ and $x_4$ are missing, the unique possible value for both $x_3$ and $x_4$ equals zero (assuming that the observed values are indeed correct). Such cases occur, for instance, where only nonzero values have been submitted to a survey or administrative system.

To generalize this to an arbitrary linear equality restriction, we write

  $a_1x_1 + a_2x_2 + \cdots + a_nx_n = b.$    (9.16)

We may replace a missing value $x_j$ with $0$ if the following conditions are fulfilled (de Waal et al., 2011); a code sketch follows the list:

  • The observed values already satisfy the restriction: $\sum_{j:\,x_j\ \text{observed}} a_jx_j = b$.
  • Each missing $x_j$ is restricted by $x_j \ge 0$.
  • All coefficients $a_j$ of the missing variables have the same sign.
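A small sketch of this check for a single restriction row. The helper impute_zeros_row is our own illustration, not part of any package, and it assumes that all missing variables are indeed restricted to be nonnegative:

  # a: coefficients, x: record with NAs for missing values, b: constant
  impute_zeros_row <- function(a, x, b) {
    mis <- is.na(x)
    obs_ok  <- isTRUE(all.equal(sum(a[!mis] * x[!mis]), b))  # observed part adds up
    sign_ok <- length(unique(sign(a[mis]))) <= 1             # same-sign coefficients
    if (obs_ok && sign_ok) x[mis] <- 0
    x
  }
  # x1 + x2 + x3 + x4 == x5 with x3, x4 missing and 3 + 7 == 10:
  impute_zeros_row(a = c(1, 1, 1, 1, -1), x = c(3, 7, NA, NA, 10), b = 0)
  ## [1]  3  7  0  0 10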

Deducing Values from Combined Inequalities

Given a record $x$ subject to a set of rules $Ax \le b$, substituting the observed values $x^{\mathrm{obs}}$ in the set of inequations, we obtain a reduced system

  $A^{\mathrm{mis}}x^{\mathrm{mis}} \le b - A^{\mathrm{obs}}x^{\mathrm{obs}} = \tilde{b}.$    (9.17)

If the reduced system contains one or more pairs of inequations of the form

  $a_i\cdot x^{\mathrm{mis}} \le \tilde{b}_i \quad\text{and}\quad a_k\cdot x^{\mathrm{mis}} \le \tilde{b}_k,$

where

  $a_i = -a_k \quad\text{and}\quad \tilde{b}_i = -\tilde{b}_k,$

then the pair may be substituted by the implied equality $a_i\cdot x^{\mathrm{mis}} = \tilde{b}_i$. In particular, when such an equality involves a single missing variable, it fixes that variable uniquely. Since this method of imputation depends on substituting known values, the procedure is iterated by substituting imputed values into the system of inequations (9.17), detecting new pairs, and imputing those until no new pairs are found.
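A sketch of the pair detection in R; the function find_equality_pairs is our own illustration:

  # Find row pairs (i, k) of A x <= b with a_i = -a_k and b_i = -b_k;
  # each such pair implies the equality a_i . x == b_i.
  find_equality_pairs <- function(A, b) {
    which(outer(seq_len(nrow(A)), seq_len(nrow(A)), Vectorize(
      function(i, k) i < k && all(A[i, ] == -A[k, ]) && b[i] == -b[k]
    )), arr.ind = TRUE)
  }
  A <- rbind(c(1, 1), c(-1, -1), c(1, 0))
  b <- c(85, -85, 60)
  find_equality_pairs(A, b)   # rows 1 and 2 imply x1 + x2 == 85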

Combining Methods

The three methods described above are not completely independent. For example, it is possible that after a record is partially imputed using Eq. (9.14), more zeros can be deductively imputed based on the nonnegativity constraints. These methods should therefore be applied in sequence, and iterated until no further imputations can be found.

Deductive Imputation in R

The methods for deductive imputation mentioned above have been implemented in the deductive package. The function impute_lr (lr for linear restrictions) iteratively applies the heuristic methods, before computing the variable ranges of missing values by Fourier–Motzkin elimination. Here, we demonstrate Example 9.3.11.

  library(validate)
  library(deductive)
  v <- validate::validator(
      x1 + x2      == x3
     , x4 + x5 + x6 == x1
     , x7 + x8      == x2
     , x9 + x10     == x3)
  dat <- data.frame(
      x1 = 100, x2=NA_real_, x3=NA_real_, x4 = 15, x5 = NA_real_
    , x6 = NA_real_, x7 = 25, x8 = 35, x9 = NA_real_, x10 = 5)
  dat
  ##    x1 x2 x3 x4 x5 x6 x7 x8 x9 x10
  ## 1 100 NA NA 15 NA NA 25 35 NA   5
  impute_lr(dat,v)
  ##    x1 x2  x3 x4 x5 x6 x7 x8  x9 x10
  ## 1 100 60 160 15 NA NA 25 35 155   5

The function impute_lr accepts a data set and a number of rules in the form of a validator object. The result is a data set where fields that have a unique solution are imputed.

The following exercise shows how to apply impute_lr to a larger data set and how to interpret the results.

Exercises for Section 9.3.2

Appendix 9.A: Discussion of Equation (9.13)

Using some properties of the pseudoinverse, we will show that Eq. (9.13) is indeed a solution to the system of Eq. (9.12). The theory of the Moore–Penrose inverse is available in many textbooks, such as the one by Lipschutz and Lipson (2009) (where it is called the generalized inverse). The paper by Greville (1959) is a good early reference that discusses applications to linear systems and statistics.

9.A.1 The Moore–Penrose Pseudoinverse

The Moore–Penrose pseudoinverse (or just: pseudoinverse) is a generalization of the standard inverse of a square invertible matrix. Suppose that $A$ is an invertible square matrix with singular value decomposition $A = U\Sigma V^{\top}$; then its inverse is given by $A^{-1} = V\Sigma^{-1}U^{\top}$, where $\Sigma^{-1}$ is the diagonal matrix with the reciprocal singular values of $A$ on the diagonal. If $A$ is not of full rank (this includes the case where it is not square), we can still decompose it using a singular value decomposition, but since one or more singular values are zero, $\Sigma^{-1}$ is not defined. To compute the Moore–Penrose pseudoinverse $A^{+} = V\Sigma^{+}U^{\top}$, $\Sigma^{-1}$ is replaced by $\Sigma^{+}$, which is the matrix that has zero on the diagonal where the singular values are zero and the reciprocal singular value otherwise. We note that it is not necessary to use the SVD to define the Moore–Penrose pseudoinverse, but the explicit construction does make it easier to understand the generalization and some of its properties.

In particular, it is easy to demonstrate via the SVD that the pseudoinverse of a matrix $A$ has the property

  $AA^{+}A = A.$    (9.18)

This equation also shows that the matrix $AA^{+}$ is a left identity of $A$: it has the property that $(AA^{+})A = A$. Using associativity of matrix multiplication and Eq. (9.18), one sees that $A^{+}A$ is idempotent: $(A^{+}A)(A^{+}A) = A^{+}(AA^{+}A) = A^{+}A$.

9.A.2 Equation (9.13) is a Solution

First recall that if $x_0$ is a solution to $Ax = b$, then for any vector $z$ for which $Az = \mathbf{0}$ (i.e., $z$ in the null space of $A$), the vector $x_0 + z$ is also a solution. Using Eq. (9.18), we see that the second term in Eq. (9.13) is in the null space of $A^{\mathrm{mis}}$:

  $A^{\mathrm{mis}}(\mathbb{1} - A^{\mathrm{mis}+}A^{\mathrm{mis}})z = (A^{\mathrm{mis}} - A^{\mathrm{mis}}A^{\mathrm{mis}+}A^{\mathrm{mis}})z = (A^{\mathrm{mis}} - A^{\mathrm{mis}})z = \mathbf{0}.$

So to show that Eq. (9.13) is a solution to the system of Eq. (9.12), we only need to show that the term

  $A^{+}\tilde{b}$

satisfies $AA^{+}\tilde{b} = \tilde{b}$ (we introduce $A$ as a convenient shorthand for $A^{\mathrm{mis}}$ and $\tilde{b}$ for $b - A^{\mathrm{obs}}x^{\mathrm{obs}}$). This can be done with the following calculation. Since the system (9.12) is consistent, $\tilde{b}$ lies in the column space of $A$, so we may write $\tilde{b} = Aw$ for some vector $w$, and

  $AA^{+}\tilde{b} = AA^{+}Aw = Aw = \tilde{b}.$

Here, we used Eq. (9.18) and the fact that the columns of $A$ span the space containing $\tilde{b}$.
