Chapter 35
A Test for Normality Based on Sample Entropy

J. Roy. Statist. Soc., Series B, 38 (1) (1976), 54–59.

Abstract

This chapter introduces a test of the composite hypothesis of normality. The test is based on the property of the normal distribution that its entropy exceeds that of any other distribution with a density that has the same variance. The test statistic is based on a class of estimators of entropy constructed here. The test is shown to be a consistent test of the null hypothesis for all alternatives without a singular continuous part. The power of the test is estimated against several alternatives. It is observed that the test compares favorably with other tests for normality.

Entropy Estimation

The entropy of a distribution F with a density function f is defined as

(1)  H(f) = -\int_{-\infty}^{\infty} f(x) \log f(x) \, dx.

Let x_1, x_2, \ldots, x_n be a sample from the distribution F. Express (1) in the form

(2)  H(f) = \int_0^1 \log \Big\{ \frac{d}{dp} F^{-1}(p) \Big\} \, dp.

An estimate of (2) can be constructed by replacing the distribution function F by the empirical distribution function F_n, and using a difference operator in place of the differential operator. The derivative of F^{-1}(p) is then estimated by n(x_{(i+m)} - x_{(i-m)})/(2m) for (i-1)/n < p \le i/n, i = m+1, \ldots, n-m, where x_{(1)} \le x_{(2)} \le \cdots \le x_{(n)} are the order statistics and m is a positive integer smaller than n/2. One-sided differences of the type n(x_{(i+m)} - x_{(1)})/(2m), n(x_{(n)} - x_{(i-m)})/(2m) are used in place of n(x_{(i+m)} - x_{(i-m)})/(2m) when i \le m, i \ge n-m+1, respectively. This produces an estimate H_{mn} of H(f),

(3)  H_{mn} = n^{-1} \sum_{i=1}^{n} \log \Big\{ \frac{n}{2m} \big( x_{(i+m)} - x_{(i-m)} \big) \Big\},

where x_{(i)} = x_{(1)} for i < 1, and x_{(i)} = x_{(n)} for i > n.

To investigate the behavior of H_{mn}, it is useful to write it as a sum of three components,

(4)  H_{mn} = S_1 + S_2 + S_3,

where

S_1 = -n^{-1} \sum_{i=1}^{n} \log f(x_{(i)}),

S_2 = n^{-1} \sum_{i=1}^{n} \log \frac{ \big( x_{(i+m)} - x_{(i-m)} \big) f(x_{(i)}) }{ F(x_{(i+m)}) - F(x_{(i-m)}) },

S_3 = n^{-1} \sum_{i=1}^{n} \log \frac{ n \big\{ F(x_{(i+m)}) - F(x_{(i-m)}) \big\} }{ 2m }.

The first term in Eq. (4) does not depend on m and represents the sample mean estimate of H(f) assuming that the value of f at the points x_{(i)} is known. If the variance of -\log f(x) is finite, it is the minimum variance unbiased estimate of H(f) given the values of f at the sample points. The two remaining terms represent two sources of additional estimation error. The term S_2 is due to estimation of f by finite differences. For fixed n, its effect decreases with decreasing values of m. The term S_3 corresponds to the error due to estimating increments of F by increments of F_n. The increments are taken over the intervals (x_{(i-m)}, x_{(i+m)}), whose length increases with m, and therefore the disturbance due to S_3 is the smaller the larger is the value of m.

As n \to \infty, simultaneous reduction of the effect of these two noise terms requires that m \to \infty while m/n \to 0. An optimal choice of m for a given n, however, depends on the (unknown) distribution F. In general, the smoother the density of F, the larger is such optimal value of m.

Since F(x_{(1)}), \ldots, F(x_{(n)}) are distributed as an ordered sample of size n from the uniform distribution on (0, 1), the distribution of S_3 does not depend on F. Its limiting behavior is given by the following lemma.

Lemma 1. The variable S_3 converges to zero in probability as n \to \infty, m \to \infty, m/n \to 0.

Proof. Put a_i = (n/2m)\{F(x_{(i+m)}) - F(x_{(i-m)})\}, i = 1, 2, \ldots, n, so that S_3 = n^{-1} \sum \log a_i. Each increment F(x_{(j+1)}) - F(x_{(j)}) enters at most 2m of the intervals (x_{(i-m)}, x_{(i+m)}), so that n^{-1} \sum_{i=1}^{n} a_i \le 1. Since the geometric mean does not exceed the arithmetic mean, it follows that n^{-1} \sum \log a_i \le \log \big\{ n^{-1} \sum a_i \big\} \le 0. Therefore, S_3 is a nonpositive variable with the mean

E S_3 = \log \frac{n}{2m} + n^{-1} \sum_{i=1}^{n} E \log \big\{ F(x_{(i+m)}) - F(x_{(i-m)}) \big\}.

The variable F(x_{(i+m)}) - F(x_{(i-m)}) has the beta distribution with parameters (d_i, n - d_i + 1), where d_i = \min(i+m, n) - \max(i-m, 1). The expected value of its logarithm is easily evaluated by differentiation of the moment generating function at zero as \psi(d_i) - \psi(n+1), where \psi is the digamma function. Thus, after some algebra,

(5)  b(m, n) = E S_3 = \log \frac{n}{2m} - \psi(n+1) + \Big( 1 - \frac{2m}{n} \Big) \psi(2m) + \frac{2}{n} \sum_{k=m}^{2m-1} \psi(k).

The right-hand side of the last equality converges to zero as n \to \infty, m \to \infty, m/n \to 0. Thus, S_3 forms a series of nonpositive variables with expectations approaching zero, and consequently

\operatorname{plim}_{n \to \infty} S_3 = 0.

Since the distribution of S_3 is independent of F, the bias due to the presence of S_3 in (4) can be eliminated by using

(6)  H^{*}_{mn} = H_{mn} - b(m, n),

rather than H_{mn}, as an estimate of entropy. Here b(m, n) is given by (5). The following theorem deals with consistency of H_{mn} (and, by Lemma 1, also that of H^{*}_{mn}).
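The correction b(m, n) is easy to evaluate numerically from the beta-distribution argument in the proof of Lemma 1. A sketch, with my helper names and a hand-rolled digamma (a library routine such as scipy.special.psi would serve equally well), assuming the clamped-index convention of Eq. (3):

```python
import math

def digamma(x):
    """psi(x) computed from the recurrence psi(x+1) = psi(x) + 1/x
    and the asymptotic series for large arguments."""
    acc = 0.0
    while x < 10.0:
        acc -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    tail = 1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252)
    return acc + math.log(x) - 0.5 / x - inv2 * tail

def entropy_bias(m, n):
    """b(m, n) of Eq. (5): the spacing F(x_(i+m)) - F(x_(i-m)) is
    Beta(d_i, n - d_i + 1) with d_i = min(i+m, n) - max(i-m, 1),
    and E log Beta(a, b) = psi(a) - psi(a+b)."""
    total = 0.0
    for i in range(1, n + 1):
        d = min(i + m, n) - max(i - m, 1)
        total += math.log(n / (2.0 * m)) + digamma(d) - digamma(n + 1)
    return total / n
```

The bias-corrected estimate (6) is then the raw estimate (3) minus entropy_bias(m, n); the correction is negative and vanishes as m \to \infty with m/n \to 0.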

Theorem 1. Let x_1, x_2, \ldots, x_n be a sample from a distribution F with a density f and a finite variance. Then

\operatorname{plim} H_{mn} = H(f) \quad \text{as } n \to \infty,\ m \to \infty,\ m/n \to 0.

Proof. With some reorganization, H_{mn} can be written

(7)  H_{mn} = (2m)^{-1} \sum_{j=1}^{2m} S^{(j)},

where

S^{(j)} = \frac{2m}{n} \sum_{i \equiv j \,(\mathrm{mod}\ 2m)} \log \Big\{ \frac{n}{2m} \big( x_{(i+m)} - x_{(i-m)} \big) \Big\}, \qquad j = 1, \ldots, 2m,

and F_n denotes the empirical distribution function. When x_{(i-m)}, x_{(i+m)} belong to an interval in which f is positive and continuous, then there exists a value \xi_i \in (x_{(i-m)}, x_{(i+m)}) such that

F(x_{(i+m)}) - F(x_{(i-m)}) = \big( x_{(i+m)} - x_{(i-m)} \big) f(\xi_i).

Therefore, since F_n(x_{(i+m)}) - F_n(x_{(i-m)}) = 2m/n for m < i \le n - m, S^{(j)} is a Stieltjes sum of the function -\log f with respect to the measure F_n over the sum of intervals of continuity of f in which f > 0. The contribution of terms in S^{(j)} that corresponds to intervals in which f \to 0 approaches zero with n \to \infty. Since in any interval in which f is positive, x_{(i+m)} - x_{(i-m)} \to 0 a.s. as n \to \infty and F_n(x) \to F(x) a.s. uniformly over x, S^{(j)} converges a.s. to -\int f \log f \, dx = H(f), which is either finite or -\infty in virtue of the finite variance of F. Moreover, this convergence is uniform over j. Consequently,

\lim_{n \to \infty} S^{(j)} = H(f) \quad \text{a.s., uniformly over } j.

Since

\min_j S^{(j)} \le H_{mn} \le \max_j S^{(j)},

the statement of the theorem follows from (7).

Test for Normality

A well-known theorem of information theory (Shannon, 1949, p. 55) states that among all distributions that possess a density function f and have a given variance \sigma^2, the entropy H(f) is maximized by the normal distribution. The entropy of the normal distribution with variance \sigma^2 is \log \sigma (2\pi e)^{1/2}. The question arises as to whether a test of the composite hypothesis of normality can be based on this property. The estimate H_{mn} will be used for that purpose.
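The maximum-entropy property invoked here can be verified in a few lines. Writing g for the normal density with the same mean \mu and variance \sigma^2 as f, nonnegativity of the Kullback-Leibler divergence (Kullback, 1959) gives

```latex
0 \le \int f \log\frac{f}{g}\,dx
  = -H(f) - \int f(x)\log g(x)\,dx
  = -H(f) + \log \sigma\sqrt{2\pi} + \frac{1}{2\sigma^2}\int f(x)(x-\mu)^2\,dx
  = -H(f) + \log \sigma\sqrt{2\pi e},
```

so that H(f) \le \log \sigma (2\pi e)^{1/2}, with equality precisely when f is the normal density g.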

Definition. Let x_1, x_2, \ldots, x_n be a sample from a distribution F and let x_{(1)} \le x_{(2)} \le \cdots \le x_{(n)} be the order statistics. Let m be a positive integer smaller than n/2 and define x_{(i)} = x_{(1)} for i < 1, x_{(i)} = x_{(n)} for i > n. The K_{mn} test of the composite hypothesis of normality is a test with critical region K_{mn} \le K^{*}_{mn,\alpha}, where

(8)  K_{mn} = \frac{n}{2ms} \Big\{ \prod_{i=1}^{n} \big( x_{(i+m)} - x_{(i-m)} \big) \Big\}^{1/n}

and

s = \Big\{ n^{-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 \Big\}^{1/2}.

Note that K_{mn} = s^{-1} \exp(H_{mn}).

Under the null hypothesis,

\operatorname{plim} K_{mn} = (2\pi e)^{1/2} \quad \text{as } n \to \infty,\ m \to \infty,\ m/n \to 0.

Under an alternative distribution with density f and a finite variance \sigma^2,

\operatorname{plim} K_{mn} = \sigma^{-1} \exp\{H(f)\} < (2\pi e)^{1/2}.

This means that the K test is consistent for such alternatives. There is no need, however, to restrict the use of the test to distributions with a density and a finite second moment, as will be established in Theorem 2. First, a lemma will be proven.

Lemma 2. Let F be a distribution with a density function f and without a finite second moment. Put

w(c) = F(c) - F(-c).

For each c such that w(c) > 0, define a density function f_c by

(9)  f_c(x) = f(x)/w(c) \ \text{for } |x| \le c, \qquad f_c(x) = 0 \ \text{otherwise}.

Denote the variance of f_c by \sigma_c^2. Then H(f_c) - \log \sigma_c \to -\infty as c \to \infty.

Proof. Let d be such that w(d) > 0. Then for c \ge d,

H(f_c) = \log w(c) - w(c)^{-1} \Big\{ A_d + \int_{d < |x| \le c} f(x) \log f(x) \, dx \Big\},

where A_d = \int_{-d}^{d} f(x) \log f(x) \, dx. According to an inequality in information theory (cf., for instance, Kullback, 1959, p. 15),

(10)  \int p \log \frac{p}{q} \, dx \ge \Big( \int p \, dx \Big) \log \frac{\int p \, dx}{\int q \, dx}

for nonnegative functions p, q. Let g be the density of the normal distribution with the same mean \mu_c and variance \sigma_c^2 as f_c. An application of inequality (10) with p = f, q = g over the region d < |x| \le c, together with the inequality \int_{d < |x| \le c} f(x)(x - \mu_c)^2 \, dx \le w(c)\sigma_c^2, then yields

H(f_c) - \log \sigma_c \le \log w(c) - w(c)^{-1} A_d - w(c)^{-1} \{w(c) - w(d)\} \log\{w(c) - w(d)\} + \log \sqrt{2\pi} + \tfrac{1}{2} - \{w(d)/w(c)\} \log \sigma_c.

Since F does not possess a finite second moment, \sigma_c \to \infty as c \to \infty. For a fixed d and c \to \infty, the right-hand side of the last inequality therefore approaches minus infinity, as was to be proven.

Theorem 2. The K_{mn} test of any size \alpha > 0 is a consistent test, as n \to \infty, m \to \infty, m/n \to 0, for all alternatives without a singular continuous part.

Proof. Let x_1, x_2, \ldots, x_n be a sample from a distribution F. If F has a density and a finite variance, the consistency of the test follows from Theorem 1. Assume that F has a density f but the second moment is infinite. Let f_c be the truncated density (9) with variance \sigma_c^2. Define a statistic K'_{n,c} as

K'_{n,c} = \frac{n_c}{2ms_c} \Big\{ \prod_{i=1}^{n_c} \big( x'_{(i+m)} - x'_{(i-m)} \big) \Big\}^{1/n_c},

where x'_{(1)} \le \cdots \le x'_{(n_c)} is the subsample of all x_i such that |x_i| \le c, and s_c is the standard deviation of the subsample. Since the subsample has the density f_c and n_c/n \to w(c) a.s. as n \to \infty, it follows that

\operatorname{plim}_{n \to \infty} K'_{n,c} = \sigma_c^{-1} \exp\{H(f_c)\}.

The difference K_{mn} - K'_{n,c} converges to zero in probability with c \to \infty uniformly over n. Therefore,

\operatorname{plim}_{n \to \infty} K_{mn} = \lim_{c \to \infty} \sigma_c^{-1} \exp\{H(f_c)\} = 0

in virtue of Lemma 2, which establishes consistency for that class of alternatives.

Finally, let F have an atom a with a weight p > 0, and let N_a denote the number of observations equal to a. Then

N_a / n \to p \quad \text{a.s.}

as n \to \infty. Thus, whenever N_a > 2m, at least N_a - 2m of the differences x_{(i+m)} - x_{(i-m)} in (8) vanish, so that

K_{mn} = 0 \quad \text{for all sufficiently large } n \ \text{a.s.},

and the consistency of the test for alternatives with an atom follows. This completes the proof.

It can be shown that always

K_{mn} \le \big( x_{(n)} - x_{(1)} \big) / s.

Except in the simplest cases, the distribution of K_{mn} under the null hypothesis has not been obtained analytically. To determine the percentage points K^{*}_{mn,\alpha}, Monte Carlo simulations were employed. For each n, samples of size n from the normal distribution were formed, using the congruence method of generating pseudo-random numbers and obtaining approximately normal deviates as sums of 12 uniform deviates. The statistic K_{mn} was calculated from each sample for several values of m, and percentage points of the distribution of K_{mn} were estimated by the corresponding order statistics. For each significance level and each value of m, the estimates were smoothed by fitting a polynomial in powers of n^{-1/2}. The lower-tail 5 percent significance points of K_{mn} for selected values of n, m are given in Table 35.1.

Table 35.1 0.05 points for the K statistic

   n    m=1    m=2    m=3    m=4    m=5
   3   0.99
   4   1.05
   5   1.19   1.70
   6   1.33   1.77
   7   1.46   1.87   1.97
   8   1.57   1.97   2.05
   9   1.67   2.06   2.13
  10   1.76   2.15   2.21
  12   1.90   2.31   2.36
  14   2.01   2.43   2.49
  16   2.11   2.54   2.60   2.57
  18   2.18   2.62   2.69   2.67
  20   2.25   2.69   2.77   2.76
  25          2.83   2.93   2.93   2.91
  30          2.93   3.04   3.06   3.05
  35          3.00   3.13   3.16   3.16
  40                 3.19   3.24   3.24
  45                 3.25   3.29   3.30
  50                 3.29   3.34   3.35

The power of the test was estimated against several alternatives. The method was that of Monte Carlo simulation of the distribution of K_{mn} under alternative population distributions. For each alternative, 1,000 samples of sizes n = 10, 20, 50 were generated, and the test power was estimated by the frequency of the samples falling into the critical region. The continuous alternatives investigated were the gamma (1) (exponential), gamma (2), beta (1,1) (uniform), beta (2,1), and Cauchy distributions.

For these alternatives, the maximum power was typically attained by small values of m, such as m = 3 for n = 20, the choice used in Table 35.2. With increasing n, the optimal choice of m also increases, while the ratio m/n tends to zero.
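The power computation can be reproduced in outline. The following sketch (my code, not the author's; it reuses the statistic of Eq. (8) and the n = 20, m = 3 critical point 2.77 from Table 35.1) estimates the rejection frequency under a chosen alternative; for the exponential it should land near the .85 reported in Table 35.2:

```python
import math
import random

def k_mn(sample, m):
    # Statistic of Eq. (8): (n / (2*m*s)) * geometric mean of the m-spacings.
    x = sorted(sample)
    n = len(x)
    mean = sum(x) / n
    s = math.sqrt(sum((v - mean) ** 2 for v in x) / n)
    log_gmean = sum(math.log(x[min(i + m, n - 1)] - x[max(i - m, 0)])
                    for i in range(n)) / n
    return n / (2 * m * s) * math.exp(log_gmean)

def estimated_power(draw, n=20, m=3, critical=2.77, reps=1000, seed=7):
    """Frequency of simulated samples of size n, drawn by draw(rng),
    falling into the critical region K_mn <= critical."""
    rng = random.Random(seed)
    rejections = sum(
        1 for _ in range(reps)
        if k_mn([draw(rng) for _ in range(n)], m) <= critical
    )
    return rejections / reps
```

For example, estimated_power(lambda rng: rng.expovariate(1.0)) estimates the exponential entry of Table 35.2, and replacing the sampler with rng.uniform(0.0, 1.0) gives the uniform row.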

The power of the K test was compared to that of some other tests for normality against the same alternatives. The tests investigated by Stephens (1974) were considered. These are the Kolmogorov-Smirnov D, Cramér-von Mises W2, Kuiper V, Watson U2, Anderson-Darling A2, and Shapiro-Wilk W tests. Of these, only the Shapiro-Wilk test is a test of the composite hypothesis of normality. The tests D, W2, V, U2, and A2, based on the empirical distribution function (EDF), require a complete specification of the null hypothesis. When these tests are used to test the composite hypothesis, the parameters must be estimated from the sample. Critical values corresponding to such modification of the test statistics are then applicable.

Table 35.2 lists power estimates of .05 size tests with sample size n = 20. These results have been obtained by Stephens (1974) for the EDF statistics against the exponential, uniform, and Cauchy alternatives; by Van Soest (1967) for W2 against gamma (2); and by Shapiro and Wilk (1965) for W. The powers of D, V, U2, and A2 against gamma (2) and of the EDF statistics against beta (2,1) were estimated by the author from 2,000 samples, using the critical values given in Stephens (1974). The standard error of the power estimates in Table 35.2 does not exceed .015.

Table 35.2 Powers of .05 tests against some alternatives (n = 20)

Alternative     D    W2     V    U2    A2     W    K3
Exponential   .59   .74   .71   .70   .82   .84   .85
Gamma (2)     .33   .45   .33   .37   .48   .50   .45
Uniform       .12   .16   .17   .18   .21   .23   .44
Beta (2,1)    .17   .23   .20   .23   .28   .35   .43
Cauchy        .86   .88   .87   .88   .98   .88   .75

It is apparent from Table 35.2 that none of the tests considered performs better than all the other tests against all alternatives. Compared with any other single test, however, the K test exhibits higher power against at least three of the five alternative distributions. For three of the alternatives, the power of the K test is uniformly the highest. Similar results hold for other sample sizes and sizes of the test.

These results, together with the relative simplicity of the K test (no tables of coefficients or function values are needed to calculate the test statistic) and its asymptotic properties against any alternative, suggest that the K test may be preferred in many situations.

Acknowledgment

This research was partly supported by Wells Fargo Bank, N.A. The author is indebted to Larry J. Cuneo for help with the computer simulations. The author wishes to express his thanks to the editor and referees of the Journal for their helpful suggestions.

References

  1. Kullback, S. (1959). Information Theory and Statistics. New York: John Wiley & Sons.
  2. Shannon, C.E. (1949). The Mathematical Theory of Communication. Urbana: University of Illinois Press.
  3. Shapiro, S.S., and Wilk, M.B. (1965). “An Analysis of Variance Test for Normality (Complete Samples).” Biometrika, 52, 591–611.
  4. Stephens, M.A. (1974). “EDF Statistics for Goodness of Fit and Some Comparisons.” J. Amer. Statist. Ass., 69, 730–737.
  5. Van Soest, J. (1967). “Some Empirical Results Concerning Tests of Normality.” Statist. Neerland., 21, 91–97.