8.4 The Stein estimator
This section is about an aspect of classical statistics which is related to the aforementioned discussion, but an understanding of it is by no means necessary for developing a knowledge of Bayesian statistics per se. The Bayesian analysis of the hierarchical normal model is continued in Section 8.5.
One of the most puzzling and provocative results in classical statistics in the past half century was Stein's startling discovery (see Stein, 1956, and James and Stein, 1961) that the 'obvious' estimator $\hat{\theta} = X$ of the multivariate normal mean $\theta$, based on an observation $X = (X_1, \dots, X_r)$ with independent components $X_i \sim N(\theta_i, 1)$, is inadmissible if $r \geqslant 3$. In fact, if $c$ is any constant with
$$0 < c < 2(r-2),$$
then
$$\hat{\theta} = \mu + \left(1 - \frac{c}{S_1}\right)(X - \mu), \qquad S_1 = \sum_i (X_i - \mu_i)^2,$$
where $\mu$ is any fixed origin, dominates $\hat{\theta} = X$. The best value of $c$ is $r-2$, leading to the James–Stein estimator
$$\hat{\theta}^{JS} = \mu + \left(1 - \frac{r-2}{S_1}\right)(X - \mu).$$
Because it may be considered as a weighted mean of $X$ and $\mu$, it is often called a shrinkage estimator which 'shrinks' the ordinary estimator $X$ towards $\mu$, despite the fact that if $S_1 < r-2$ it 'shrinks' past $\mu$. Note, incidentally, that points which are initially far from $\mu$, so that $S_1$ is large, are little affected by this shrinkage. Of course, this ties in with the results of Section 8.3, because the James–Stein estimator has turned out to be just the same as the empirical Bayes estimator found there.
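As a concrete illustration, here is a minimal sketch of the estimator in Python (NumPy only); the function name is ours, and the unit-variance sampling model $X_i \sim N(\theta_i, 1)$ is the one assumed throughout this section.

```python
# A minimal sketch of the James-Stein estimator, assuming the set-up of
# this section: X_i ~ N(theta_i, 1) independently and a fixed origin mu.
import numpy as np

def james_stein(x, mu):
    """Shrink the observation vector x towards the fixed origin mu."""
    x = np.asarray(x, dtype=float)
    mu = np.asarray(mu, dtype=float)
    r = x.size                       # dimension; the result needs r >= 3
    s1 = np.sum((x - mu) ** 2)       # S_1 = sum_i (X_i - mu_i)^2
    # If S_1 < r - 2 the factor below is negative and the estimate
    # 'shrinks' past mu, exactly as noted in the text.
    return mu + (1.0 - (r - 2) / s1) * (x - mu)
```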
In fact, it can be shown that the risk of $\hat{\theta}^{JS}$ (under the sum-of-squared-errors loss, for which the risk of the obvious estimator $X$ is $r$) is
$$R(\theta, \hat{\theta}^{JS}) = r - (r-2)^2\, E\!\left(\frac{1}{S_1}\right).$$
The expectation on the right-hand side depends on $\theta$ and $\mu$, but as it must be positive, $\hat{\theta}^{JS}$ dominates $X$, that is,
$$R(\theta, \hat{\theta}^{JS}) < R(\theta, X) = r$$
for all $\theta$. It turns out that $S_1$ has a distribution which depends solely on $r$ and the quantity
$$\lambda = \sum_i (\theta_i - \mu_i)^2,$$
a fact which can be proved by considering an orthogonal transformation of the variates $X_i - \mu_i$ to variates $W_i$ such that
$$W_1 = \frac{1}{\sqrt{\lambda}} \sum_i (\theta_i - \mu_i)(X_i - \mu_i)$$
(so that $W_1 \sim N(\sqrt{\lambda}, 1)$ while $W_2, \dots, W_r \sim N(0, 1)$ and $S_1 = \sum_i W_i^2$). Evidently if $\theta = \mu$ then $\lambda = 0$, and in general we say that $S_1$ has a non-central chi-squared distribution on $r$ degrees of freedom with non-centrality parameter $\lambda$. We denote this by
$$S_1 \sim \chi'^{2}_r(\lambda).$$
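This claim is easy to check by simulation; the following illustrative sketch (the particular values of $\theta$ and $\mu$ are our own) compares simulated moments of $S_1$ with those of scipy's ncx2 distribution.

```python
# An illustrative check that S_1 has the stated non-central chi-squared
# distribution, using scipy.stats.ncx2 with df = r and non-centrality
# nc = lambda = sum_i (theta_i - mu_i)^2.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
theta = np.array([1.0, 2.0, 0.0, -1.0, 0.5])
mu = np.zeros(5)
r = theta.size
lam = np.sum((theta - mu) ** 2)

x = rng.normal(loc=theta, scale=1.0, size=(100_000, r))
s1 = np.sum((x - mu) ** 2, axis=1)

# Simulated moments against the ncx2 values (the mean should be r + lambda).
print(s1.mean(), stats.ncx2.mean(df=r, nc=lam))
print(s1.var(), stats.ncx2.var(df=r, nc=lam))
```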
It is fairly obvious that as $\lambda \to \infty$ typical values of $S_1$ will tend to infinity and we will get
$$R(\theta, \hat{\theta}^{JS}) \to r,$$
whereas when $\theta = \mu$ the variate $S_1$ has a central $\chi^2$ distribution on $r$ degrees of freedom (no parameters are estimated), so
$$E\!\left(\frac{1}{S_1}\right) = \frac{1}{r-2},$$
and hence
$$R(\mu, \hat{\theta}^{JS}) = r - \frac{(r-2)^2}{r-2} = r - (r-2) = 2,$$
which, particularly for large values of $r$, is notably less than the risk $r$ of the obvious estimator.
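An illustrative Monte Carlo check of this (taking the origin at $0$ without loss of generality, so that $\theta = \mu$ means $\theta$ is the zero vector) shows the risk staying close to 2 however large $r$ becomes:

```python
# An illustrative Monte Carlo check that the risk at theta = mu is about 2
# whatever the dimension r, compared with r for the obvious estimator X.
import numpy as np

rng = np.random.default_rng(1)
for r in (5, 10, 50):
    theta = np.zeros(r)                        # take theta = mu = 0
    x = rng.normal(loc=theta, size=(200_000, r))
    s1 = np.sum(x ** 2, axis=1)
    js = (1.0 - (r - 2) / s1)[:, None] * x     # James-Stein with mu = 0
    risk = np.mean(np.sum((js - theta) ** 2, axis=1))
    print(r, risk)                             # close to 2 for every r
```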
In the particular case where the arbitrary origin $\mu$ is taken at $0$, the James–Stein estimator takes the form
$$\hat{\theta}^{JS} = \left(1 - \frac{r-2}{S_1}\right)X \qquad \text{with } S_1 = \sum_i X_i^2,$$
but it is important to note that this is only a special case.
Variants of the James–Stein estimator have been derived. For example, if $c$ is any constant with
$$0 < c < 2(r-3),$$
then
$$\hat{\theta} = \bar{X}\mathbf{1} + \left(1 - \frac{c}{S}\right)(X - \bar{X}\mathbf{1}), \qquad \bar{X} = \frac{1}{r}\sum_i X_i, \quad S = \sum_i (X_i - \bar{X})^2$$
(where $\mathbf{1}$ denotes a vector of ones) dominates $X$, this time provided $r \geqslant 4$ (loss of one dimension as a result of estimating a mean is something we are used to in statistics). The best value of $c$ in this case is $r-3$, leading to the Efron–Morris estimator
$$\hat{\theta}^{EM} = \bar{X}\mathbf{1} + \left(1 - \frac{r-3}{S}\right)(X - \bar{X}\mathbf{1}).$$
In this case the 'shrinkage' is towards the overall mean $\bar{X}$.
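A matching sketch of this estimator in Python (again the function name is ours):

```python
# A minimal sketch of the Efron-Morris estimator: shrinkage of X towards
# the overall mean Xbar instead of a fixed origin.
import numpy as np

def efron_morris(x):
    x = np.asarray(x, dtype=float)
    r = x.size                        # the result needs r >= 4
    xbar = x.mean()
    s = np.sum((x - xbar) ** 2)       # S = sum_i (X_i - Xbar)^2
    return xbar + (1.0 - (r - 3) / s) * (x - xbar)
```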
In the case of the Efron–Morris estimator, it can be shown (see Lehmann, 1983, Section 4.6) that the risk of $\hat{\theta}^{EM}$ is
$$R(\theta, \hat{\theta}^{EM}) = r - (r-3)^2\, E\!\left(\frac{1}{S}\right).$$
When the $\theta_i$ are all equal, $S$ has a central $\chi^2$ distribution on $r-1$ degrees of freedom, so
$$E\!\left(\frac{1}{S}\right) = \frac{1}{r-3},$$
and hence
$$R(\theta, \hat{\theta}^{EM}) = r - \frac{(r-3)^2}{r-3} = r - (r-3) = 3,$$
which, particularly for large values of $r$, is again notably less than the risk $r$ of the obvious estimator.
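Again this is easy to check by simulation; in the following illustrative sketch the common value of the $\theta_i$ is an arbitrary choice of ours.

```python
# An illustrative Monte Carlo check that, when all the theta_i are equal,
# the risk of the Efron-Morris estimator is about 3 whatever r may be.
import numpy as np

rng = np.random.default_rng(2)
for r in (5, 10, 50):
    theta = np.full(r, 4.2)                    # any common value will do
    x = rng.normal(loc=theta, size=(200_000, r))
    xbar = x.mean(axis=1, keepdims=True)
    s = np.sum((x - xbar) ** 2, axis=1, keepdims=True)
    em = xbar + (1.0 - (r - 3) / s) * (x - xbar)
    risk = np.mean(np.sum((em - theta) ** 2, axis=1))
    print(r, risk)                             # close to 3 for every r
```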
When we consider using such estimates in practice we encounter the ‘speed of light’ rhetorical question,
Do you mean that if I want to estimate tea consumption in Taiwan, I will do better to estimate simultaneously the speed of light and the weight of hogs in Montana?
The question then arises as to why this happens. Stein's own explanation was that the sample distance squared of $X$ from $\mu$, that is, $S_1 = \sum_i (X_i - \mu_i)^2$, overestimates the squared distance $\lambda$ of $\theta$ from $\mu$ (indeed $E\,S_1 = r + \lambda$), and hence that the estimator $X$ could be improved by bringing it nearer to $\mu$ (whatever $\mu$ is). Following an idea due to Lawrence Brown, the effect was illustrated as shown in Figure 8.1 in a paper by Berger (1980, Figure 2, p. 736).
The four points $x_1$, $x_2$, $x_3$ and $x_4$ represent a spherical distribution of observations centred at $\theta$.

Consider the effect of shrinking these points towards the origin as shown. The points $x_1$ and $x_2$, which lie on the line through $0$ and $\theta$, move, on average, slightly further away from $\theta$, but the points $x_3$ and $x_4$ move slightly closer (while distant points hardly move at all). In three dimensions, there are a further two points $x_5$ and $x_6$ (not on the line between $0$ and $\theta$) that are shrunk closer to $\theta$.
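A small numerical version of this picture may help; the configuration below (the choice of $\theta$, the unit radius and the step $c$) is our own, and each point is moved a fixed step along the line towards the origin.

```python
# A numerical version of the Brown/Berger picture: four points at unit
# distance from theta, each moved a small fixed step c towards the origin.
import numpy as np

theta = np.array([4.0, 0.0])
c = 0.2
pts = theta + np.array([[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]])
shrunk = pts - c * pts / np.linalg.norm(pts, axis=1, keepdims=True)

before = np.linalg.norm(pts - theta, axis=1)      # all equal to 1
after = np.linalg.norm(shrunk - theta, axis=1)
print(after)                                      # off-axis points get closer
print((before ** 2).mean(), (after ** 2).mean())  # mean squared distance falls
```

With these numbers the two points on the axis end up at distances $0.8$ and $1.2$ from $\theta$ (further away in mean square), while the two off-axis points both end up at distance about $0.97$, so the mean squared distance over the four points falls slightly below 1.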
Another explanation that has been offered is that $\hat{\theta}^{JS}$ can be viewed as a 'pre-test' estimator: if one performs a preliminary test of the hypothesis that $\theta = \mu$ and then uses the estimator $\mu$ or the estimator $X$ depending on the outcome of the test, then the resulting estimator is a weighted average of $\mu$ and $X$, of which the James–Stein estimator is a smoothed version, although why this particular smoothing is to be used is not obvious from this chain of reasoning (cf. Lehmann, 1983, Section 4.5).
8.4.1 Evaluation of the risk of the James–Stein estimator
We can prove that the James–Stein estimator has the risk quoted earlier, namely
$$R(\theta, \hat{\theta}^{JS}) = r - (r-2)^2\, E\!\left(\frac{1}{S_1}\right).$$
[An alternative approach can be found in Lehmann (1983, Sections 4.5 and 4.6).] We proceed by writing
$$g(\lambda) = E \sum_i \left(\hat{\theta}^{JS}_i - \theta_i\right)^2, \qquad h(\lambda) = r - (r-2)^2\, E\!\left(\frac{1}{S_1}\right),$$
where the expectations are over repeated sampling for fixed $\theta$. The function $g$ depends on $\lambda = \sum_i (\theta_i - \mu_i)^2$ alone by spherical symmetry about $\mu$. Similarly, the function $h$ depends on $\lambda$ alone since $S_1 \sim \chi'^{2}_r(\lambda)$. Now suppose, purely for the purposes of the proof, that $\theta$ is given the prior distribution $N(\mu, \phi I)$ of Section 8.2, so that unconditionally $X_i - \mu_i \sim N(0, 1+\phi)$. We note that because the unconditional distribution of $S_1$ is then $(1+\phi)\chi^2_r$, we have
$$E\, h(\lambda) = r - \frac{(r-2)^2}{(1+\phi)(r-2)} = r - \frac{r-2}{1+\phi},$$
the expectation being taken over values of $\lambda$ or equivalently over values of $\theta$, that is,
$$E\, g(\lambda) = r - \frac{r-2}{1+\phi} = E\, h(\lambda),$$
using the result at the very end of Section 8.2 and bearing in mind that the James–Stein estimator is the same as the empirical Bayes estimator. Now writing $k = g - h$ we have
$$E\, k(\lambda) = 0,$$
and hence, since under this prior $\lambda \sim \phi\chi^2_r$,
$$\int_0^\infty k(\lambda)\, \lambda^{r/2-1} \exp\{-\lambda/(2\phi)\}\, \mathrm{d}\lambda = 0$$
for all $\phi > 0$, which can happen only if $k$ vanishes identically, by the uniqueness of Laplace transforms. Thus $g = h$, which is the required result.
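As an illustrative sanity check on this identity, one can estimate both $g$ and $h$ by simulation at an arbitrary fixed $\theta$ (the values below are our own) and confirm that they agree:

```python
# An illustrative Monte Carlo check of the identity just proved: at a
# fixed theta, the sampling risk of the James-Stein estimator agrees
# with r - (r-2)^2 E(1/S_1).
import numpy as np

rng = np.random.default_rng(3)
theta = np.array([2.0, -1.0, 0.5, 0.0, 1.5, -0.5])
mu = np.zeros(6)
r = theta.size

x = rng.normal(loc=theta, size=(400_000, r))
s1 = np.sum((x - mu) ** 2, axis=1)
js = mu + (1.0 - (r - 2) / s1)[:, None] * (x - mu)

lhs = np.mean(np.sum((js - theta) ** 2, axis=1))   # direct estimate of g
rhs = r - (r - 2) ** 2 * np.mean(1.0 / s1)         # direct estimate of h
print(lhs, rhs)                                    # the two nearly agree
```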