VII.10 Mathematical Statistics

Persi Diaconis


1 Introduction

Suppose you want to measure something: your height, or the velocity of an airplane for example. You take repeated measurements x1,x2, . . . , xn and you would like to combine them into a final estimate. An obvious way of doing this is to use the sample mean (x1 + x2 + · · · + xn)/n. However, modern statisticians use many other estimators, such as the median or the trimmed mean (where you throw away the largest and smallest 10% of the measurements and take the average of what is left). Mathematical statistics helps us to decide when one estimate is preferable to another. For example, it is intuitively clear that throwing away a random half of the data and averaging the rest is foolish, but setting up a framework that shows this clearly turns out to be a serious enterprise. One benefit of the undertaking is the discovery that the mean turns out to be inferior to nonintuitive “shrinkage estimators” even when the data are drawn from a PROBABILITY DISTRIBUTION [III.71] as natural as the bell-shaped curve (that is, are NORMALLY DISTRIBUTED [III.71 §5]).

To get an idea of why the mean may not always give you the most useful estimate, consider the following situation. You have a collection of a hundred coins and you would like to estimate their biases. That is, you would like to estimate a sequence of a hundred numbers, where the nth number θn is the probability that the nth coin will come up heads when it is flipped. Suppose that you flip each coin five times and note down how many times it shows heads. What should your estimate be for the sequence (θ1, . . . , θ100)? If you use the means, then your guess for θn will be the number of times the nth coin shows heads, divided by 5. However, if you do this, then you are likely to get some very anomalous results. For instance, if all the coins happen to be unbiased, then the probability that any given coin shows up heads five times is 1/32, so you are likely to guess that around three of the coins have biases of 1. So you will be guessing that if you flip those coins five hundred times then they will come up heads every single time.
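
To see the scale of the problem, here is a minimal simulation sketch (added here for illustration, not from the article; the seed and counts are arbitrary choices): one hundred genuinely fair coins are each flipped five times and each bias is estimated by the raw proportion of heads, so a handful of coins inevitably receive the absurd estimate 1.

```python
import random

# A minimal simulation sketch (not from the article): 100 genuinely fair coins,
# 5 flips each, with each bias estimated by the raw proportion of heads.
random.seed(0)

n_coins, n_flips = 100, 5
estimates = []
for _ in range(n_coins):
    heads = sum(random.random() < 0.5 for _ in range(n_flips))   # a fair coin
    estimates.append(heads / n_flips)                            # the "obvious" estimator

# Each fair coin lands heads five times with probability 1/32, so on average
# about 100/32, i.e. roughly three, coins receive the absurd estimate 1.
print("coins estimated to have bias 1:", sum(e == 1.0 for e in estimates))
print("coins estimated to have bias 0:", sum(e == 0.0 for e in estimates))
```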

Many alternative methods of estimation have been proposed in order to deal with this obvious problem. However, one must be careful: if a coin comes up heads five times it could be that θi really is equal to 1. What reason is there to believe that a different method of estimation is not in fact taking us further from the truth?

Here is a second example, drawn from work of Bradley Efron, this time concerning a situation from real life. Table 1 shows the batting averages of eighteen baseball players. The first column shows the proportion of “hits” for each player in their first forty-five times at bat, and the second column shows the proportion of hits at the end of the season. Consider the task of predicting the second column given only the first column. Once again, the obvious approach is to use the average. In other words, one would simply use the first column as a predictor of the second column. The third column is obtained by a shrinkage estimator: more precisely, it takes a number y in the first column and replaces it by 0.265 + 0.212(y − 0.265). The number 0.265 is the average of the entries in the first column, so the shrinkage estimator is replacing each entry in the first column by one that is about five times closer to the average. (How the number 0.212 is chosen will be explained later.) If you look at the table, you will see that the shrinkage estimators in the third column are better predictors of the second column in almost every case, and certainly on average. Indeed, the sum of squared differences between the James–Stein estimator and the truth, divided by the sum of squared differences between the usual estimator and the truth, is 0.29. That is a threefold improvement.
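
For readers who want to reproduce the comparison, here is a small sketch of the computation. The lists `first45` and `season` are placeholders for the first two columns of table 1 (whose entries are not reproduced here), and the constants 0.265 and 0.212 are the ones quoted in the text.

```python
# A small sketch of the comparison described above. The lists `first45` and
# `season` stand for the first two columns of table 1 (not reproduced here);
# filling them in from the table should reproduce the quoted ratio of 0.29.

def shrink(y, center=0.265, factor=0.212):
    """The shrinkage estimator from the text: pull y most of the way toward the center."""
    return center + factor * (y - center)

def sum_squared_error(predictions, truth):
    return sum((p - t) ** 2 for p, t in zip(predictions, truth))

def error_ratio(first45, season):
    """Squared error of the shrinkage predictions divided by that of the raw averages."""
    shrunk = [shrink(y) for y in first45]
    return sum_squared_error(shrunk, season) / sum_squared_error(first45, season)
```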

There is beautiful mathematics behind this improvement and a clear sense in which the new estimator is always better than the average. We describe the framework, ideas, and extensions of this example as an introduction to the mathematics of statistics.

Before beginning, it will be useful to distinguish between probability and statistics. In probability theory,

Table 1 Batting averages for eighteen
major league players in 1970.

[The table’s three columns, described in the text, list each player’s average over his first forty-five at-bats, his end-of-season average, and the shrinkage (James–Stein) prediction; the numerical entries are not reproduced here.]

one begins with a set X (for the moment taken to be finite) and a collection of numbers P(x), one for each x ∈ X, which are positive and sum to one. This function P(x) is called a probability distribution. The basic problem of probability is this. You are given the probability distribution P(x) and a subset A ⊂ X, and you must compute or approximate P(A), which is defined to be the sum of P(x) for x in A. (In probabilistic terms, each x has a probability P(x) of being chosen, and P(A) is the probability that x belongs to A.) This simple formulation hides wonderful mathematical problems. For example, X might be the set of all sequences of pluses and minuses of length 100 (e.g., + − − + + − − − − − · · ·), and each pattern might be equally likely, in which case P(x) = 1/2^100 for every sequence x. Finally, A might be the set of sequences such that for every positive integer k ≤ 100 the number of + symbols in the first k places is larger than the number of − symbols in the first k places. This is a mathematical model for the following probability problem: if you and a friend flip a fair coin a hundred times, then what is the chance that your friend is always ahead? One might expect this chance to be very small. It turns out, however, to be about 1 in 25, though verifying this is a far from trivial exercise. (Our poor intuitions about chance fluctuations have been used to explain road rage: suppose you choose one of two lines at a toll booth. As you wait, you notice whether your line or the other has made more progress. We feel it should all balance out, but the calculations above show that a fair proportion of the time you are always behind, and frustrated!)
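
Under the strict reading of A given above (the friend is ahead after every single flip), the probability can be computed exactly by a short dynamic program. The following sketch is an illustration added here, not part of the original article.

```python
from fractions import Fraction

# Exact computation (a sketch added here, not from the article) of the chance
# that, in 100 fair flips, the number of + symbols strictly exceeds the number
# of - symbols after every one of the 100 flips -- the event A described above.

n = 100
# dist[s] = probability that the lead (+ count minus - count) equals s after the
# current flip, having stayed strictly positive at every flip so far.
dist = {1: Fraction(1, 2)}        # the first flip must be a +

for _ in range(n - 1):
    new = {}
    for s, p in dist.items():
        for step in (+1, -1):
            t = s + step
            if t >= 1:            # the lead must remain strictly positive
                new[t] = new.get(t, Fraction(0)) + p / 2
    dist = new

prob = sum(dist.values())
print(float(prob))                # roughly 0.04, i.e. about 1 chance in 25
```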

2 The Basic Problem of Statistics

Statistics is a kind of opposite of probability. In statistics, we are given a collection of probability distributions Pθ(x), indexed by some parameter θ. We see just one x and are required to guess which member of the family (which θ) was used to generate x. For example, let us keep X as the set of sequences of pluses and minuses of length 100, but this time let Pθ(x) be the chance of obtaining the sequence x if the probability of a plus is θ and the probability of a minus is 1 − θ, with all terms in the sequence chosen independently. Here 0 ≤ θ ≤ 1, and Pθ(x) is easily seen to be θ^S(1 − θ)^T, where S is the number of times “+” appears in the sequence x and T = 100 − S is the number of times “−” appears. This is a mathematical model for the following enterprise. You have a biased coin with a probability θ of turning up heads, but you do not know θ. You flip the coin a hundred times, and are required to estimate θ based on the outcome of the flips.
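
As a small illustration of the model (with a hypothetical head count S = 62, chosen only for the example), the sketch below evaluates the likelihood θ^S(1 − θ)^T on a grid and confirms that it is largest near the observed proportion S/100.

```python
import math

# An illustration of the model P_theta(x) = theta^S (1 - theta)^T: for a
# hypothetical head count S = 62 out of 100 flips, the likelihood is largest
# near the observed proportion S/100.

def log_likelihood(theta, S, n=100):
    """Logarithm of theta^S (1 - theta)^(n - S)."""
    return S * math.log(theta) + (n - S) * math.log(1 - theta)

S = 62                                        # hypothetical observed number of heads
grid = [i / 1000 for i in range(1, 1000)]     # candidate values of theta in (0, 1)
best = max(grid, key=lambda t: log_likelihood(t, S))
print(best)                                   # close to S/100 = 0.62
```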

In general, for each x ∈ X, we want to find a guess, which we denote by θ̂(x), for the parameter θ. That is, we want to come up with a function θ̂, which will be defined on the observation space X. Such functions are called estimators. The above simple formulation hides a wealth of complexity, since both the observation space X and the space Θ of possible parameters may be infinite, or even infinite dimensional. For example, in nonparametric statistics, Θ is often taken as the set of all probability distributions on X. All of the usual problems of statistics (design of experiments, testing hypotheses, prediction, and many others) fit into this framework. We will stick with the imagery of estimation.

To evaluate and compare estimators, one more ingredient is needed: you have to know what it means to get the right answer. This is formalized through the notion of a loss function L(θ, θ̂(x)). One can think of this in practical terms: wrong guesses have financial consequences, and the loss function is a measure of how much it will cost if θ is the true value of the parameter but the statistician’s guess is θ̂(x). The most widely used choice is the squared error (θ − θ̂(x))², but |θ − θ̂(x)| or |θ − θ̂(x)|/θ and many other variants are also used. The risk function R(θ, θ̂) measures the expected loss if θ is the true parameter and the estimator θ̂ is used. That is,

R(θ, θ̂) = ∫ L(θ, θ̂(x)) Pθ(dx).

Here, the right-hand side is notation for the average value of L(θ, θ̂(x)) if x is chosen randomly according to the probability distribution Pθ. In general, one would like to choose estimators that will make the risk function as small as possible.
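
Here is a Monte Carlo sketch of this risk calculation in the normal measurement model discussed in the next section, comparing the sample mean with the sample median under squared-error loss; the sample size, number of repetitions, and random seed are arbitrary illustrative choices.

```python
import random
import statistics

# A Monte Carlo sketch (an illustration, not the article's computation) of the
# risk R(theta, estimator) = E[L(theta, estimator(x))] under squared-error loss
# in the normal measurement model, comparing the sample mean with the median.

def risk(estimator, theta=0.0, n=10, reps=20000, seed=1):
    rng = random.Random(seed)
    total = 0.0
    for _ in range(reps):
        x = [theta + rng.gauss(0, 1) for _ in range(n)]   # x_i = theta + error_i
        total += (theta - estimator(x)) ** 2              # squared-error loss
    return total / reps

print("estimated risk of the mean:  ", risk(statistics.mean))    # close to 1/n = 0.1
print("estimated risk of the median:", risk(statistics.median))  # larger, roughly pi/(2n)
```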

3 Admissibility and Stein’s Paradox

We now have the basic ingredients: a family Pθ(x) and a loss function L. An estimator θ̂ is called inadmissible if there is a better estimator θ*, in the sense that

R(θ, θ*) < R(θ, θ̂) for all θ.

In other words, the expected loss with θ* is less than the expected loss with θ̂, whatever the true value of θ.

Given our assumptions (the model Pθ and loss function L) it seems silly to use an inadmissible estimator. However, one of the great achievements of mathematical statistics is Charles Stein’s proof that the usual least-squares estimator, which does not at first glance seem silly at all, is inadmissible in natural problems. Here is that story.

Consider the basic measurement model

Xi = θ + εi, 1 ≤ i ≤ n.

Here Xi is the ith measurement, θ is the quantity to be estimated, and εi is measurement error. The classical assumptions are that the measurement errors are independently and normally distributed: that is, they are distributed according to the bell-shaped, or Gaussian, curve e^(−x²/2)/√(2π), −∞ < x < ∞. In terms of the language we introduced earlier, the measurement space X is ℝ^n, the parameter space Θ is ℝ, and the observation x = (x1, x2, . . . , xn) has probability density Pθ(x) = exp(−½((x1 − θ)² + · · · + (xn − θ)²))/(2π)^(n/2). The usual estimator is the mean: that is, if x = (x1, . . . , xn), then one takes θ̂(x) to be (x1 + · · · + xn)/n. It has been known for a long time that if the loss function L(θ, θ̂(x)) is defined to be (θ − θ̂(x))², then the mean is an admissible estimator. It has many other optimal properties as well (for example, it is the best linear unbiased estimator, and it is minimax, a property that will be defined later in this article).

Now suppose that we wish to estimate two parameters, θ1 and θ2, say. This time we have two sets of observations, X1, . . . , Xn and Y1, . . . , Ym, with Xi = θ1 + εi and Yj = θ2 + ηj. The errors εi and ηj are independent and normally distributed, as above. The loss function L((θ1, θ2), (θ̂1(x), θ̂2(y))) is now defined to be (θ1 − θ̂1(x))² + (θ2 − θ̂2(y))²: that is, you add up the squared errors from the two parts. Again, the mean of the Xi and the mean of the Yj make up an admissible estimator for (θ1, θ2).

Consider the same setup with three parameters, θ1, θ2, θ3. Again, Xi = θ1 + εi, Yj = θ2 + ηj, and Zk = θ3 + δk, where all the error terms are independent and normally distributed. Stein’s surprising result is that for three (or more) parameters the estimator

θ̂1(x) = (x1 + · · · + xn)/n,

θ̂2(y) = (y1 + · · · + ym)/m,

θ̂3(z) = (z1 + · · · + zl)/l,

is inadmissible: there are other estimators that do better in all cases. For example, if p is the number of parameters (and p ≥ 3), then the James–Stein estimator is defined to be

θ̂JS = (1 − (p − 2)/||θ̂||²)+ θ̂.

Here we are using the notation X+ to denote the maximum of X and 0; θ̂ stands for the vector (θ̂1, . . . , θ̂p) of all the averages, and ||θ̂|| is notation for (θ̂1² + · · · + θ̂p²)^(1/2).

The James–Stein estimator satisfies the inequality R(θ, θ̂JS) < R(θ, θ̂) for all θ, and therefore the usual estimator θ̂ is indeed inadmissible. The James–Stein estimator shrinks the classical estimator toward zero. The amount of shrinkage is small if ||θ̂||² is large and appreciable for ||θ̂||² near zero. Now the problem as we have described it is invariant under translation, so if we can improve the classical estimate by shrinking toward zero, then we must be able to improve it by shrinking toward any other point. This seems very strange at first, but one can obtain some insight into the phenomenon by considering the following informal description of the estimator. It makes an a priori guess θ0 at θ. (This guess was zero above.) If the usual estimator θ̂ is close to the guess, in the sense that ||θ̂ − θ0|| is small, then it moves θ̂ toward the guess. If θ̂ is far from the guess, it leaves θ̂ essentially alone. Thus, although the estimator moves the classical estimator toward an arbitrary guess, it does so only if there are reasons to believe that the guess is a good one. With four or more parameters the data can in fact be used to suggest which point θ0 one should use as the initial guess. In the example of table 1, there are eighteen parameters, and the initial guess θ0 was the constant vector with all its eighteen coordinates equal to the average 0.265. The number 0.212 that was used for the shrinking is equal to 1 − 16/||θ̂ − θ0||² in suitably standardized units; note that 16 = p − 2 with p = 18. (Note also that for this choice of θ0, ||θ̂ − θ0|| is proportional to the standard deviation of the averages that make up θ̂.)
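
As a concrete check, here is a sketch (under the simplifying assumption that each coordinate average is normal with mean θi and variance one, the scale in which the formula above is stated) comparing the total squared-error risk of the usual estimator with that of the positive-part James–Stein estimator.

```python
import random

# A sketch comparing total squared-error risk of the usual estimator with the
# positive-part James-Stein estimator, assuming each coordinate average is
# normal with mean theta_i and variance 1 (the scale used in the formula above).

def james_stein(x_bar, center=None):
    """Shrink the vector of averages toward `center` (the origin by default)."""
    p = len(x_bar)
    if center is None:
        center = [0.0] * p
    diff = [x - c for x, c in zip(x_bar, center)]
    norm_sq = sum(d * d for d in diff)
    factor = max(0.0, 1.0 - (p - 2) / norm_sq) if norm_sq > 0 else 0.0
    return [c + factor * d for c, d in zip(center, diff)]

def total_squared_error(estimate, theta):
    return sum((e - t) ** 2 for e, t in zip(estimate, theta))

rng = random.Random(0)
p, reps = 10, 5000
theta = [0.5] * p                                    # hypothetical true parameters
usual_risk = js_risk = 0.0
for _ in range(reps):
    x_bar = [t + rng.gauss(0, 1) for t in theta]     # the vector of sample averages
    usual_risk += total_squared_error(x_bar, theta)
    js_risk += total_squared_error(james_stein(x_bar), theta)

print("usual estimator, average loss:", usual_risk / reps)   # close to p = 10
print("James-Stein, average loss:    ", js_risk / reps)      # strictly smaller
```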

The mathematics used to prove inadmissibility is an elegant blend of harmonic function theory and tricky calculus. The proof itself has had many ramifications: it gave rise to what is called “Stein’s method” in probability theory—this is a method for proving things like the central limit theorem for complex dependent problems. The mathematics is “robust,” since it is applicable to nonnormal error distributions, a variety of different loss functions, and estimation problems far from the measurement model.

The result has had enormous practical application. It is routinely used in problems where many parameters have to be simultaneously estimated. Examples include national laboratories’ estimates of the percentage of defectives when they are looking at many different products at once, and the simultaneous estimate of census undercounts for each of the fifty states in the United States. The apparent robustness of the method is very useful for such applications: even though the James–Stein estimator was derived for the bell-shaped curve, it seems to work well, without special assumptions, in problems where its assumptions hold only roughly. Consider the baseball players above, for example. Adaptations and variations abound. Two popular ones are called empirical Bayes estimates (now widely used in genomics) and hierarchical modeling (now widely used in the assessment of education).

The mathematical problems are far from completely solved. For example, the James–Stein estimator is itself inadmissible. (It can be shown that any admissible estimator in a normal measurement problem is an analytic function of the observations. The James–Stein estimator is, however, clearly not analytic because it involves the nondifferentiable function x ↦ x+.) While it is known that there is little practical improvement possible, the search for an admissible estimator that is always better than the James–Stein estimator is a tantalizing research problem.

Another active area of research in modern mathematical statistics is to understand which statistical problems give rise to Stein’s paradox. For example, although at the beginning of this essay we discussed some inadequacies of the usual maximum-likelihood estimator for estimating the biases of a hundred coins, it turns out that that estimator is admissible! In fact, the maximum-likelihood estimator is admissible for any problem with finite state spaces.

4 Bayesian Statistics

The Bayesian approach to statistics adds one further ingredient to the family Pθ and loss function L. This is known as a prior probability distribution π(θ), which gives different weights to different values of the parameter θ. There are many ways of generating a prior distribution: it may quantify the working scientist’s best guess at θ; it may be derived from previous studies or estimates; or it may just be a convenient way to generate estimators. Once the prior distribution π(θ) has been specified, the observation x and Bayes’s theorem combine to give a posterior distribution for θ, here denoted π(θ|x). Intuitively, if the parameter is thought of as having been generated from the prior distribution π, then π(θ|x) measures how likely each value of θ is once the observation x has been seen. The mean value of θ with respect to the posterior distribution π(θ|x) gives a Bayes estimator:

θ̂Bayes(x) = ∫ θ π(θ|x) dθ.

For the squared-error loss function, all Bayes estimators are admissible, and, in the converse direction, any admissible estimator is a limit of Bayes estimators. (However, not every limit of Bayes estimators is admissible: indeed, the average, which we have seen to be inadmissible, is a limit of Bayes rules.) The point for the present discussion is this. In a wide variety of practical variations of the measurement problem (things like regression analysis or the estimation of correlation matrices) it is relatively straightforward to write down sensible Bayes estimators that incorporate available prior knowledge. These estimators include close cousins of the James–Stein estimator, but they are more general, and they allow the shrinkage idea to be extended routinely to almost any statistical problem.
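
As a toy instance of a Bayes estimator, recall the coins flipped five times each at the start of this article and suppose we place a Beta(a, b) prior on θ; the conjugate Beta prior is a convenient choice made here for illustration, not something prescribed by the article. The posterior mean then has a closed form.

```python
# A sketch of a Bayes estimator for the coin example, assuming a Beta(a, b)
# prior for theta (a convenient conjugate choice made here for illustration).

def posterior_mean(heads, flips, a=1.0, b=1.0):
    """Posterior mean of theta given `heads` heads in `flips` flips under a Beta(a, b) prior."""
    # The Beta prior is conjugate to the likelihood theta^S (1 - theta)^(n - S),
    # so the posterior is Beta(a + S, b + n - S), whose mean is as below.
    return (a + heads) / (a + b + flips)

# Five flips, all heads: the raw proportion says theta = 1, but with a uniform
# prior (a = b = 1) the Bayes estimate is pulled back toward 1/2.
print(posterior_mean(5, 5))      # 6/7, about 0.857
print(5 / 5)                     # the raw estimate, 1.0
```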

Because of the high-dimensional integrals involved, Bayes estimates can be difficult to compute. One of the great advances in this area is the use of computer-simulation algorithms, called variously Markov chain Monte Carlo or Gibbs samplers, to compute useful approximations to Bayes estimators. The whole package—provable superiority, easy adaptability, and ease of computation—has made this Bayesian version of statistics a practical success.
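
Here is a toy random-walk Metropolis sketch for the same little coin posterior; it is meant only to illustrate the idea of approximating a Bayes estimate by simulation, and its output can be checked against the exact posterior mean 6/7 from the conjugate calculation above. The proposal scale, chain length, and burn-in are arbitrary choices.

```python
import math
import random

# A toy random-walk Metropolis sketch (illustrative only): approximate the
# posterior mean for the coin model above (5 heads in 5 flips, Beta(1, 1) prior),
# whose exact value is 6/7.

def log_posterior(theta, heads=5, flips=5, a=1.0, b=1.0):
    """Log posterior density, up to an additive constant."""
    if not 0.0 < theta < 1.0:
        return float("-inf")
    return (a + heads - 1) * math.log(theta) + (b + flips - heads - 1) * math.log(1 - theta)

rng = random.Random(0)
theta, samples = 0.5, []
for i in range(50000):
    proposal = theta + rng.gauss(0, 0.1)               # random-walk proposal
    log_ratio = log_posterior(proposal) - log_posterior(theta)
    if log_ratio >= 0 or rng.random() < math.exp(log_ratio):
        theta = proposal                               # accept the move
    if i >= 5000:                                      # discard a burn-in period
        samples.append(theta)

print(sum(samples) / len(samples))                     # close to 6/7, about 0.857
```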

5 A Bit More Theory

Mathematical statistics makes good use of a wide range of mathematics: fairly esoteric analysis, logic, combinatorics, algebraic topology, and differential geometry all play a role. Here is an application of group theory. Let us return to the basic setup of a sample space X, a family of probability distributions Pθ(x), and a loss function L(θ, θ̂(x)). It is natural to consider how the estimator changes when you change the units of the problem: from pounds to grams, or from centimeters to inches, say. Will this have a significant impact on the mathematics? One would expect not, but if we want to think about this question precisely then it is useful to consider a group G of transformations of X. For example, linear changes of units correspond to the affine group, which consists of transformations of the form x ↦ ax + b. The family Pθ(x) is said to be invariant under G if for each element g of G the transformed distribution Pθ(xg) is equal to a distribution Pθ′(x) for some other θ′ in Θ. For example, the family of normal distributions

Pμ,σ²(x) = exp(−(x − μ)²/2σ²)/√(2πσ²)

is invariant under ax + b transformations: if you change x to ax + b, then after some easy manipulations you can rewrite the resulting modified formula in the form exp(−(x − μ′)²/2σ′²)/√(2πσ′²) for some new parameters μ′ and σ′². An estimator θ̂ is called equivariant if θ̂(xg) = θ̂(x)g, where θ̂(x)g denotes the corresponding transformation applied to the estimate. This is a formal way of saying that if you change the data from one unit to another, then the estimate transforms as it should. For example, suppose your data are temperatures presented in centigrade and you want an answer in Fahrenheit. If your estimator is equivariant, then it will make no difference whether you first apply the estimator and then convert the answer into Fahrenheit or first convert all the data into Fahrenheit and then apply the estimator.
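
A quick numerical check of equivariance for the sample mean under an affine change of units (the temperature readings below are hypothetical):

```python
# A quick check (an illustration, not from the article) that the sample mean is
# equivariant under an affine change of units x -> a*x + b: transforming the data
# and then estimating gives the same answer as estimating and then transforming.

def mean(xs):
    return sum(xs) / len(xs)

def to_fahrenheit(c):
    return 9.0 / 5.0 * c + 32.0                       # an affine map a*x + b

celsius = [21.5, 19.0, 23.2, 20.8]                    # hypothetical readings
estimate_then_convert = to_fahrenheit(mean(celsius))
convert_then_estimate = mean([to_fahrenheit(c) for c in celsius])
print(abs(estimate_then_convert - convert_then_estimate) < 1e-9)   # True
```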

The multivariate normal problem underlying Stein’s paradox is invariant under a variety of groups, including the p-dimensional group of Euclidean motions (rotations and translations). However, the James–Stein estimator is not equivariant, since, as we have already discussed, it depends on the choice of origin. This is not necessarily bad, but it is certainly thought provoking. If you ask a working scientist if they want a “most accurate” estimator, they will say “of course.” If you ask if they insist on equivariance, “of course” will follow as well. One way of expressing Stein’s paradox is the statement that the two desiderata, accuracy and invariance, are incompatible. This is one of many places where mathematics and statistics part company. Deciding whether mathematically optimal procedures are “sensible” is important and hard to mathematize.

Here is a second use of group theory. An estimator θ̂ is called minimax if it minimizes the maximum of the risk R(θ, θ̂) over all θ: no other estimator has a smaller worst-case risk. Minimax corresponds to playing things safe: you have optimal behavior (that is, the least possible risk) in the worst case. Finding minimax estimators in natural problems is hard, honest work. For example, the vector of means is a minimax estimator in normal location problems. The work is easier if the problem is invariant under a group. Then one can first search for best invariant estimators. Invariance often reduces things to a straightforward calculus problem. Now the question arises of whether an estimator that is minimax among invariant estimators is minimax among all estimators. A celebrated theorem of Hunt and Stein says “yes” if the group involved is nice (e.g., Abelian or compact or amenable). Determining whether the best invariant estimator is minimax when the group is not nice is a challenging open problem in mathematical statistics. And it is not just a mathematical curiosity. For example, the following problem is very natural, and invariant under the group of invertible matrices: given a sample from the multivariate normal distribution, estimate its correlation matrix. In this case, the group is not nice and good estimates are not known.

6 Conclusion

The point of this article is to show how mathematics enters and enriches statistics. To be sure, there are parts of statistics that are hard to mathematize: graphical displays of data are an example. Further, much of modern statistical practice is driven by the computer. There is no longer any need to restrict attention to tractable families of probability distributions. Complex and more realistic models can be used. This gives rise to the subject of statistical computing. Nonetheless, every once in a while someone has to think about what the computer should do and determine whether one innovative procedure works better than another. Then, mathematics holds its own. Indeed, mathematizing modern statistical practice is a challenging, rewarding enterprise, of which Stein’s estimator is a current highlight. This endeavor gives us something to aim for and helps us to calibrate our day-to-day achievements.

Further Reading

Berger, J. O. 1985. Statistical Decision Theory and Bayesian Analysis, 2nd edn. New York: Springer.

Lehmann, E. L., and G. Casella. 2003. Theory of Point Estimation. New York: Springer.

Lehmann, E. L., and J. P. Romano. 2005. Testing Statistical Hypotheses. New York: Springer.

Schervish, M. 1996. Theory of Statistics. New York: Springer.
