Chapter 1: Basic Probability Theory

PART I: THEORY

It is assumed that the reader has had a course in elementary probability. In this chapter we discuss more advanced material, which is required for further developments.

1.1 OPERATIONS ON SETS

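Let $\mathcal{S}$ denote a sample space. Let $E_1, E_2$ be subsets of $\mathcal{S}$. We denote the union by $E_1 \cup E_2$ and the intersection by $E_1 \cap E_2$. $\bar{E} = \mathcal{S} - E$ denotes the complement of $E$. By De Morgan's laws, $\overline{E_1 \cup E_2} = \bar{E}_1 \cap \bar{E}_2$ and $\overline{E_1 \cap E_2} = \bar{E}_1 \cup \bar{E}_2$.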

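Given a sequence of sets $\{E_n, n \ge 1\}$ (finite or infinite), we define

(1.1.1)
$$\sup_{n \ge 1} E_n = \bigcup_{n \ge 1} E_n, \qquad \inf_{n \ge 1} E_n = \bigcap_{n \ge 1} E_n.$$

Furthermore, $\liminf_{n \to \infty} E_n$ and $\limsup_{n \to \infty} E_n$ are defined as

(1.1.2)
$$\liminf_{n \to \infty} E_n = \bigcup_{n \ge 1} \bigcap_{k \ge n} E_k, \qquad \limsup_{n \to \infty} E_n = \bigcap_{n \ge 1} \bigcup_{k \ge n} E_k.$$

If a point of $\mathcal{S}$ belongs to $\limsup_{n \to \infty} E_n$, it belongs to infinitely many sets $E_n$. The sets $\liminf_{n \to \infty} E_n$ and $\limsup_{n \to \infty} E_n$ always exist and

(1.1.3)
$$\liminf_{n \to \infty} E_n \subset \limsup_{n \to \infty} E_n.$$

If $\liminf_{n \to \infty} E_n = \limsup_{n \to \infty} E_n$, we say that a limit of $\{E_n, n \ge 1\}$ exists. In this case,

(1.1.4)
$$\lim_{n \to \infty} E_n = \liminf_{n \to \infty} E_n = \limsup_{n \to \infty} E_n.$$

A sequence $\{E_n, n \ge 1\}$ is called monotone increasing if $E_n \subset E_{n+1}$ for all $n \ge 1$. In this case $\lim_{n \to \infty} E_n = \bigcup_{n \ge 1} E_n$. The sequence is monotone decreasing if $E_n \supset E_{n+1}$ for all $n \ge 1$. In this case $\lim_{n \to \infty} E_n = \bigcap_{n \ge 1} E_n$.

We conclude this section with the definition of a partition of the sample space. A collection of sets $\mathcal{D} = \{E_1, \ldots, E_k\}$ is called a finite partition of $\mathcal{S}$ if all elements of $\mathcal{D}$ are pairwise disjoint and their union is $\mathcal{S}$, i.e., $E_i \cap E_j = \emptyset$ for all $i \ne j$, $E_i, E_j \in \mathcal{D}$, and $\bigcup_{i=1}^{k} E_i = \mathcal{S}$. If $\mathcal{D}$ contains a countable number of sets that are mutually exclusive and $\bigcup_{i \ge 1} E_i = \mathcal{S}$, we say that $\mathcal{D}$ is a countable partition.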
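The definitions of the lower and upper limits of a sequence of sets (the union of tail intersections and the intersection of tail unions, respectively) can be tried out on a small example. The following is only a sketch: it works on a finite truncation of the sequence, which is adequate here because the illustrative sequence is periodic. The helper names and the example sequence are assumptions made for illustration.

```python
def tail_union(sets, n):
    """Union of E_k over k >= n, within the finite list `sets`."""
    return set().union(*sets[n:])

def tail_intersection(sets, n):
    """Intersection of E_k over k >= n, within the finite list `sets`."""
    return set.intersection(*sets[n:])

def lim_sup(sets, horizon):
    """Intersection over n < horizon of the tail unions."""
    return set.intersection(*(tail_union(sets, n) for n in range(horizon)))

def lim_inf(sets, horizon):
    """Union over n < horizon of the tail intersections."""
    return set().union(*(tail_intersection(sets, n) for n in range(horizon)))

# Illustrative periodic sequence: E_n = {0, 1} for even n, {1, 2} for odd n.
seq = [{0, 1} if n % 2 == 0 else {1, 2} for n in range(40)]
horizon = 20  # every tail used below still contains at least 20 sets

upper = lim_sup(seq, horizon)  # points lying in infinitely many E_n
lower = lim_inf(seq, horizon)  # points lying in all but finitely many E_n
print(lower, upper)
```

Here the lower limit is strictly contained in the upper limit, so this sequence has no limit.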

1.2 ALGEBRA AND σ–FIELDS

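Let $\mathcal{S}$ be a sample space. An algebra $\mathcal{A}$ is a collection of subsets of $\mathcal{S}$ satisfying

(1.2.1)
(i) $\mathcal{S} \in \mathcal{A}$;
(ii) if $E \in \mathcal{A}$ then $\bar{E} \in \mathcal{A}$;
(iii) if $E_1, E_2 \in \mathcal{A}$ then $E_1 \cup E_2 \in \mathcal{A}$.

We consider $\emptyset = \bar{\mathcal{S}}$. Thus, (i) and (ii) imply that $\emptyset \in \mathcal{A}$. Also, if $E_1, E_2 \in \mathcal{A}$ then $E_1 \cap E_2 \in \mathcal{A}$.

The trivial algebra is $\mathcal{A}_0 = \{\emptyset, \mathcal{S}\}$. An algebra $\mathcal{A}_1$ is a subalgebra of $\mathcal{A}_2$ if all sets of $\mathcal{A}_1$ are contained in $\mathcal{A}_2$. We denote this inclusion by $\mathcal{A}_1 \subset \mathcal{A}_2$. Thus, the trivial algebra $\mathcal{A}_0$ is a subalgebra of every algebra $\mathcal{A}$. We will denote by $\mathcal{A}(\mathcal{S})$ the algebra generated by all subsets of $\mathcal{S}$ (see Example 1.1).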

If a sample space $\mathcal{S}$ has a finite number of points $n$, say $1 \le n < \infty$, then the collection of all subsets of $\mathcal{S}$ is called the discrete algebra generated by the elementary events of $\mathcal{S}$. It contains $2^n$ events.

Let $\mathcal{D}$ be a partition of $\mathcal{S}$ having $k$, $2 \le k$, disjoint sets. Then, the algebra generated by $\mathcal{D}$, $\mathcal{A}(\mathcal{D})$, is the algebra containing all the $2^k - 1$ unions of the elements of $\mathcal{D}$ and the empty set.
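As a small check of the count above, the algebra generated by a finite partition can be enumerated directly: every event is a union of some subcollection of the partition's blocks, the empty subcollection giving the empty set. The sketch below uses an arbitrary illustrative partition of a six-point sample space; the function name is hypothetical.

```python
from itertools import combinations

def algebra_from_partition(blocks):
    """All unions of subcollections of the blocks (the empty union is the empty set)."""
    blocks = [frozenset(b) for b in blocks]
    events = set()
    for r in range(len(blocks) + 1):
        for chosen in combinations(blocks, r):
            events.add(frozenset().union(*chosen))
    return events

sample_space = frozenset(range(6))
partition = [{0, 1}, {2}, {3, 4, 5}]          # k = 3 disjoint blocks
algebra = algebra_from_partition(partition)

# Closure checks corresponding to the defining properties of an algebra.
closed_under_complement = all(sample_space - e in algebra for e in algebra)
closed_under_union = all(a | b in algebra for a in algebra for b in algebra)
print(len(algebra), closed_under_complement, closed_under_union)
```

With $k = 3$ blocks the enumeration yields $2^3 = 8$ events: the $2^3 - 1$ nonempty unions plus the empty set.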

An algebra $\mathcal{A}$ on $\mathcal{S}$ is called a σ–field if, in addition to being an algebra, the following holds.

(iv) If $E_n \in \mathcal{A}$, $n \ge 1$, then $\bigcup_{n \ge 1} E_n \in \mathcal{A}$.

We will denote a σ–field by $\mathcal{F}$. In a σ–field $\mathcal{F}$ the supremum, infimum, limsup, and liminf of any sequence of events belong to $\mathcal{F}$. If $\mathcal{S}$ is finite, the discrete algebra $\mathcal{A}(\mathcal{S})$ is a σ–field. In Example 1.3 we show an algebra that is not a σ–field.

The minimal σ–field containing the algebra generated by $\{(-\infty, x], -\infty < x < \infty\}$ is called the Borel σ–field on the real line $\mathbb{R}$.

A sample space $\mathcal{S}$ with a σ–field $\mathcal{F}$, $(\mathcal{S}, \mathcal{F})$, is called a measurable space.

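The following lemmas establish the existence of a smallest σ–field containing a given collection of sets.

Lemma 1.2.1 Let $\mathcal{E}$ be a collection of subsets of a sample space $\mathcal{S}$. Then, there exists a smallest σ–field $\sigma(\mathcal{E})$ containing the elements of $\mathcal{E}$.

Proof. The algebra of all subsets of $\mathcal{S}$, $\mathcal{A}(\mathcal{S})$, obviously contains all elements of $\mathcal{E}$. Similarly, the σ–field $\mathcal{F}$ containing all subsets of $\mathcal{S}$ contains all elements of $\mathcal{E}$. Define $\sigma(\mathcal{E})$ to be the intersection of all σ–fields that contain all elements of $\mathcal{E}$. Obviously, $\sigma(\mathcal{E})$ is an algebra. It is also closed under countable unions, since each σ–field in the intersection is, and it is contained in every σ–field that contains $\mathcal{E}$. QED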

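A collection $\mathcal{M}$ of subsets of $\mathcal{S}$ is called a monotonic class if the limit of any monotone sequence in $\mathcal{M}$ belongs to $\mathcal{M}$.

If $\mathcal{E}$ is a collection of subsets of $\mathcal{S}$, let $\mathcal{M}^*(\mathcal{E})$ denote the smallest monotonic class containing $\mathcal{E}$.

Lemma 1.2.2. A necessary and sufficient condition for an algebra $\mathcal{A}$ to be a σ–field is that it is a monotonic class.

Proof. (i) Obviously, if $\mathcal{A}$ is a σ–field, it is a monotonic class.

(ii) Let $\mathcal{A}$ be a monotonic class.

Let $E_n \in \mathcal{A}$, $n \ge 1$. Define $B_n = \bigcup_{i=1}^{n} E_i$. Obviously $B_n \subset B_{n+1}$ for all $n \ge 1$. Hence $\lim_{n \to \infty} B_n = \bigcup_{n \ge 1} B_n \in \mathcal{A}$. But $\bigcup_{n \ge 1} B_n = \bigcup_{n \ge 1} E_n$. Thus, $\sup_{n \ge 1} E_n \in \mathcal{A}$. Similarly, $\bigcap_{n \ge 1} E_n \in \mathcal{A}$. Thus, $\mathcal{A}$ is a σ–field. QED

Theorem 1.2.1. Let $\mathcal{A}$ be an algebra. Then $\mathcal{M}^*(\mathcal{A}) = \sigma(\mathcal{A})$, where $\sigma(\mathcal{A})$ is the smallest σ–field containing $\mathcal{A}$.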

Proof. See Shiryayev (1984, p. 139).

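The measurable space $(\mathbb{R}, \mathcal{B})$, where $\mathbb{R}$ is the real line and $\mathcal{B} = \mathcal{B}(\mathbb{R})$, called the Borel measurable space, plays a most important role in the theory of statistics. Another important measurable space is $(\mathbb{R}^n, \mathcal{B}^n)$, $n \ge 2$, where $\mathbb{R}^n = \mathbb{R} \times \mathbb{R} \times \cdots \times \mathbb{R}$ is the Euclidean n–space, and $\mathcal{B}^n = \mathcal{B} \times \cdots \times \mathcal{B}$ is the smallest σ–field containing $\mathbb{R}^n$, $\emptyset$, and all n–dimensional rectangles $I = I_1 \times \cdots \times I_n$, where

$$I_i = (a_i, b_i], \quad i = 1, \ldots, n, \quad -\infty < a_i < b_i < \infty.$$

The measurable space $(\mathbb{R}^\infty, \mathcal{B}^\infty)$ is used as a basis for probability models of experiments with infinitely many trials. $\mathbb{R}^\infty$ is the space of ordered sequences $x = (x_1, x_2, \ldots)$, $-\infty < x_n < \infty$, $n = 1, 2, \ldots$. Consider the cylinder sets

$$C(I_1 \times \cdots \times I_n) = \{x: x_i \in I_i, \ i = 1, \ldots, n\}$$

and

$$C(B_1 \times \cdots \times B_n) = \{x: x_i \in B_i, \ i = 1, \ldots, n\},$$

where $B_i$ are Borel sets, i.e., $B_i \in \mathcal{B}$. The smallest σ–field containing all these cylinder sets, $n \ge 1$, is $\mathcal{B}(\mathbb{R}^\infty)$. Examples of Borel sets in $\mathcal{B}(\mathbb{R}^\infty)$ are

(a) $\{x: x \in \mathbb{R}^\infty, \ \sup_{n \ge 1} x_n > a\}$

(b) $\{x: x \in \mathbb{R}^\infty, \ \limsup_{n \to \infty} x_n \le a\}$.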

1.3 PROBABILITY SPACES

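Given a measurable space $(\mathcal{S}, \mathcal{F})$, a probability model ascribes a countably additive function $P$ on $\mathcal{F}$, which assigns a probability $P\{A\}$ to all sets $A \in \mathcal{F}$. This function should satisfy the following properties.

(1.3.1)
$$0 \le P\{A\} \le 1 \ \text{ for all } A \in \mathcal{F}, \qquad P\{\mathcal{S}\} = 1.$$

(1.3.2)
$$\text{If } \{E_n, n \ge 1\} \text{ is a sequence of disjoint sets in } \mathcal{F}, \text{ then } P\Big\{\bigcup_{n \ge 1} E_n\Big\} = \sum_{n \ge 1} P\{E_n\}.$$

Recall that if $A \subset B$ then $P\{A\} \le P\{B\}$, and $P\{\bar{A}\} = 1 - P\{A\}$. Other properties will be given in the examples and problems. In the sequel we often write $AB$ for $A \cap B$.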

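Theorem 1.3.1. Let $(\mathcal{S}, \mathcal{F}, P)$ be a probability space, where $\mathcal{F}$ is a σ–field of subsets of $\mathcal{S}$ and $P$ a probability function. Then

(i) if $B_n \subset B_{n+1}$, $n \ge 1$, $B_n \in \mathcal{F}$, then

(1.3.3)
$$P\Big\{\lim_{n \to \infty} B_n\Big\} = \lim_{n \to \infty} P\{B_n\};$$

(ii) if $B_n \supset B_{n+1}$, $n \ge 1$, $B_n \in \mathcal{F}$, then

(1.3.4)
$$P\Big\{\lim_{n \to \infty} B_n\Big\} = \lim_{n \to \infty} P\{B_n\}.$$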

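Proof. (i) Since $B_n \subset B_{n+1}$, $\lim_{n \to \infty} B_n = \bigcup_{n \ge 1} B_n$. Moreover,

(1.3.5)
$$P\Big\{\bigcup_{n \ge 1} B_n\Big\} = P\{B_1\} + \sum_{n=2}^{\infty} P\{B_n - B_{n-1}\}.$$

Notice that for $n \ge 2$, since $\bar{B}_n B_{n-1} = \emptyset$,

(1.3.6)
$$P\{B_n - B_{n-1}\} = P\{B_n \bar{B}_{n-1}\} = P\{B_n\} - P\{B_n B_{n-1}\} = P\{B_n\} - P\{B_{n-1}\}.$$

Also, in (1.3.5)

(1.3.7)
$$P\{B_1\} + \sum_{n=2}^{\infty} P\{B_n - B_{n-1}\} = \lim_{N \to \infty} \Big( P\{B_1\} + \sum_{n=2}^{N} \big(P\{B_n\} - P\{B_{n-1}\}\big) \Big) = \lim_{N \to \infty} P\{B_N\}.$$

Thus, Equation (1.3.3) is proven.

(ii) Since $B_n \supset B_{n+1}$, $n \ge 1$, we have $\bar{B}_n \subset \bar{B}_{n+1}$, $n \ge 1$, and $\lim_{n \to \infty} B_n = \bigcap_{n \ge 1} B_n$. Hence,

$$P\Big\{\lim_{n \to \infty} B_n\Big\} = 1 - P\Big\{\bigcup_{n \ge 1} \bar{B}_n\Big\} = 1 - \lim_{n \to \infty} P\{\bar{B}_n\} = \lim_{n \to \infty} P\{B_n\}.$$

QED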

Sets in a probability space are called events.

1.4 CONDITIONAL PROBABILITIES AND INDEPENDENCE

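The conditional probability of an event $A \in \mathcal{F}$ given an event $B \in \mathcal{F}$ such that $P\{B\} > 0$ is defined as

(1.4.1)
$$P\{A \mid B\} = \frac{P\{A \cap B\}}{P\{B\}}.$$

We see first that $P\{\cdot \mid B\}$ is a probability function on $\mathcal{F}$. Indeed, for every $A \in \mathcal{F}$, $0 \le P\{A \mid B\} \le 1$. Moreover, $P\{\mathcal{S} \mid B\} = 1$ and if $A_1$ and $A_2$ are disjoint events in $\mathcal{F}$, then

(1.4.2)
$$P\{A_1 \cup A_2 \mid B\} = \frac{P\{(A_1 \cup A_2) B\}}{P\{B\}} = \frac{P\{A_1 B\} + P\{A_2 B\}}{P\{B\}} = P\{A_1 \mid B\} + P\{A_2 \mid B\}.$$

If $P\{B\} > 0$ and $P\{A\} \ne P\{A \mid B\}$, we say that the events $A$ and $B$ are dependent. On the other hand, if $P\{A\} = P\{A \mid B\}$ we say that $A$ and $B$ are independent events. Notice that two events are independent if and only if

(1.4.3)
$$P\{A \cap B\} = P\{A\} P\{B\}.$$

Given $n$ events in $\mathcal{F}$, namely $A_1, \ldots, A_n$, we say that they are pairwise independent if $P\{A_i \cap A_j\} = P\{A_i\} P\{A_j\}$ for any $i \ne j$. The events are said to be independent in triplets if

$$P\{A_i \cap A_j \cap A_k\} = P\{A_i\} P\{A_j\} P\{A_k\}$$

for any $i \ne j \ne k$. Example 1.4 shows that pairwise independence does not imply independence in triplets.

Given $n$ events $A_1, \ldots, A_n$ of $\mathcal{F}$, we say that they are independent if, for any $2 \le k \le n$ and any k–tuple $(1 \le i_1 < i_2 < \cdots < i_k \le n)$,

(1.4.4)
$$P\Big\{\bigcap_{j=1}^{k} A_{i_j}\Big\} = \prod_{j=1}^{k} P\{A_{i_j}\}.$$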

Events in an infinite sequence $\{A_1, A_2, \ldots\}$ are said to be independent if $\{A_1, \ldots, A_n\}$ are independent for each $n \ge 2$. Given a sequence of events $A_1, A_2, \ldots$ of a σ–field $\mathcal{F}$, we have seen that

$$\limsup_{n \to \infty} A_n = \bigcap_{n \ge 1} \bigcup_{k \ge n} A_k.$$

This event means that points $w$ in $\limsup_{n \to \infty} A_n$ belong to infinitely many of the events $\{A_n\}$. Thus, the event $\limsup_{n \to \infty} A_n$ is denoted also as $\{A_n, \text{i.o.}\}$, where i.o. stands for “infinitely often.”

The following important theorem, known as the Borel–Cantelli Lemma, gives conditions under which $P\{A_n, \text{i.o.}\}$ is either 0 or 1.

Theorem 1.4.1 (Borel–Cantelli) Let $\{A_n\}$ be a sequence of sets in $\mathcal{F}$.

(i) If $\sum_{n=1}^{\infty} P\{A_n\} < \infty$, then $P\{A_n, \text{i.o.}\} = 0$.

(ii) If $\sum_{n=1}^{\infty} P\{A_n\} = \infty$ and $\{A_n\}$ are independent, then $P\{A_n, \text{i.o.}\} = 1$.

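Proof. (i) Notice that $B_n = \bigcup_{k \ge n} A_k$ is a decreasing sequence. Thus

$$P\{A_n, \text{i.o.}\} = P\Big\{\bigcap_{n \ge 1} B_n\Big\} = \lim_{n \to \infty} P\{B_n\}.$$

But

$$P\{B_n\} = P\Big\{\bigcup_{k \ge n} A_k\Big\} \le \sum_{k \ge n} P\{A_k\}.$$

The assumption that $\sum_{n=1}^{\infty} P\{A_n\} < \infty$ implies that $\lim_{n \to \infty} \sum_{k \ge n} P\{A_k\} = 0$.

(ii) Since $A_1, A_2, \ldots$ are independent, $\bar{A}_1, \bar{A}_2, \ldots$ are independent. This implies that

$$P\Big\{\bigcap_{k \ge n} \bar{A}_k\Big\} = \prod_{k \ge n} P\{\bar{A}_k\} = \prod_{k \ge n} \big(1 - P\{A_k\}\big).$$

If $0 < x \le 1$ then $\log(1 - x) \le -x$. Thus,

$$\log \prod_{k \ge n} \big(1 - P\{A_k\}\big) = \sum_{k \ge n} \log\big(1 - P\{A_k\}\big) \le -\sum_{k \ge n} P\{A_k\} = -\infty,$$

since $\sum_{n=1}^{\infty} P\{A_n\} = \infty$. Thus $P\{\bigcap_{k \ge n} \bar{A}_k\} = 0$ for all $n \ge 1$. This implies that $P\{A_n, \text{i.o.}\} = 1$. QED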
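The two quantities that drive the proof can be evaluated numerically for concrete choices of $P\{A_k\}$. The sketch below is illustrative only: the probabilities $1/k$ and $1/k^2$ are assumed examples, and the infinite tail is truncated at a finite index $N$. For independent events with $P\{A_k\} = 1/k$, the probability that none of $A_n, \ldots, A_N$ occurs is a product that telescopes to $(n-1)/N$ and so tends to zero; for $P\{A_k\} = 1/k^2$ the tail sums, which bound the probability of the tail unions, are small.

```python
import math

def prob_none_occur(p, n, N):
    """Product of (1 - p(k)) over n <= k <= N: for independent events,
    the probability that none of A_n, ..., A_N occurs."""
    return math.prod(1.0 - p(k) for k in range(n, N + 1))

def tail_sum(p, n, N):
    """Sum of p(k) over n <= k <= N: an upper bound for the probability
    that at least one of A_n, ..., A_N occurs."""
    return sum(p(k) for k in range(n, N + 1))

def harmonic(k):
    return 1.0 / k        # probabilities with a divergent sum

def squares(k):
    return 1.0 / k ** 2   # probabilities with a convergent sum

none_harmonic = prob_none_occur(harmonic, 2, 10_000)  # equals (2 - 1) / 10000
tail_squares = tail_sum(squares, 100, 100_000)        # bounded by 1 / 99
print(none_harmonic, tail_squares)
```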

We conclude this section with the celebrated Bayes Theorem.

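Let $\mathcal{D} = \{B_i, i \in J\}$ be a partition of $\mathcal{S}$, where $J$ is an index set having a finite or countable number of elements. Let $B_j \in \mathcal{F}$ and $P\{B_j\} > 0$ for all $j \in J$. Let $A \in \mathcal{F}$, $P\{A\} > 0$. We are interested in the conditional probabilities $P\{B_j \mid A\}$, $j \in J$.

Theorem 1.4.2 (Bayes).

(1.4.5)
$$P\{B_j \mid A\} = \frac{P\{B_j\} P\{A \mid B_j\}}{\sum_{j' \in J} P\{B_{j'}\} P\{A \mid B_{j'}\}}, \quad j \in J.$$

Proof. Left as an exercise. QED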

Bayes Theorem is widely used in scientific inference. Examples of the application of Bayes Theorem are given in many elementary books. Advanced examples of Bayesian inference will be given in later chapters.
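As a minimal numerical sketch of Bayes Theorem, the posterior probabilities $P\{B_j \mid A\}$ are obtained by dividing each product $P\{B_j\} P\{A \mid B_j\}$ by their sum over the partition, which is $P\{A\}$ by the law of total probability. The three-set partition and all numbers below are invented for illustration; exact rational arithmetic is used so that the result can be checked by hand.

```python
from fractions import Fraction

def posterior(prior, likelihood):
    """Posterior P{B_j | A} from priors P{B_j} and likelihoods P{A | B_j}."""
    joint = {b: prior[b] * likelihood[b] for b in prior}  # P{B_j} P{A | B_j}
    total = sum(joint.values())                           # P{A}, by total probability
    return {b: joint[b] / total for b in joint}

# Hypothetical partition B1, B2, B3 with assumed priors and likelihoods.
prior = {"B1": Fraction(1, 2), "B2": Fraction(3, 10), "B3": Fraction(1, 5)}
likelihood = {"B1": Fraction(1, 100), "B2": Fraction(2, 100), "B3": Fraction(5, 100)}

post = posterior(prior, likelihood)
for name, value in post.items():
    print(name, value)
```

In this example the set with the smallest prior, B3, receives the largest posterior probability because the event A is much more likely under it.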

1.5 RANDOM VARIABLES AND THEIR DISTRIBUTIONS

Random variables are finite real-valued functions on the sample space $\mathcal{S}$, such that measurable subsets of $\mathcal{S}$ are mapped into Borel sets on the real line and thus can be assigned probability measures. The situation is simple if $\mathcal{S}$ contains only a finite or countably infinite number of points.

In the general case, $\mathcal{S}$ might contain uncountably many points. Even if $\mathcal{S}$ is the space of all infinite binary sequences $w = (i_1, i_2, \ldots)$, the number of points in $\mathcal{S}$ is uncountable. To make our theory rich enough, we will require that the probability space be $(\mathcal{S}, \mathcal{F}, P)$, where $\mathcal{F}$ is a σ–field. A random variable $X$ is a finite real-valued function on $\mathcal{S}$. We wish to define the distribution function of $X$, on $\mathbb{R}$, as

(1.5.1)
$$F_X(x) = P\{w: X(w) \le x\}.$$

For this purpose, we must require that every Borel set on $\mathbb{R}$ has a measurable inverse image with respect to $\mathcal{F}$. More specifically, given $(\mathcal{S}, \mathcal{F}, P)$, let $(\mathbb{R}, \mathcal{B})$ be the Borel measurable space, where $\mathbb{R}$ is the real line and $\mathcal{B}$ the Borel σ–field of subsets of $\mathbb{R}$. A subset $B$ of $(\mathbb{R}, \mathcal{B})$ is called a Borel set if $B$ belongs to $\mathcal{B}$. Let $X: \mathcal{S} \to \mathbb{R}$. The inverse image of a Borel set $B$ with respect to $X$ is

(1.5.2)
$$X^{-1}(B) = \{w: X(w) \in B\}.$$

A function $X: \mathcal{S} \to \mathbb{R}$ is called $\mathcal{F}$–measurable if $X^{-1}(B) \in \mathcal{F}$ for all $B \in \mathcal{B}$. Thus, a random variable with respect to $(\mathcal{S}, \mathcal{F}, P)$ is an $\mathcal{F}$–measurable function on $\mathcal{S}$. The class $\mathcal{F}_X = \{X^{-1}(B): B \in \mathcal{B}\}$ is also a σ–field, generated by the random variable $X$. Notice that $\mathcal{F}_X \subset \mathcal{F}$.

By definition, every random variable $X$ has a distribution function $F_X$. The probability measure $P_X\{\cdot\}$ induced by $X$ on $(\mathbb{R}, \mathcal{B})$ is

(1.5.3)
$$P_X\{B\} = P\{X^{-1}(B)\}, \quad B \in \mathcal{B}.$$

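A distribution function $F_X$ is a real-valued function satisfying the properties

(i) $\lim_{x \to -\infty} F_X(x) = 0$;

(ii) $\lim_{x \to \infty} F_X(x) = 1$;

(iii) if $x_1 < x_2$ then $F_X(x_1) \le F_X(x_2)$; and

(iv) $\lim_{\epsilon \downarrow 0} F_X(x + \epsilon) = F_X(x)$, and $\lim_{\epsilon \downarrow 0} F_X(x - \epsilon) = F_X(x-)$, for all $-\infty < x < \infty$.

Thus, a distribution function $F$ is right–continuous.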

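Given a distribution function $F_X$, we obtain from (1.5.1), for every $-\infty < a < b < \infty$,

(1.5.4)
$$P\{w: a < X(w) \le b\} = F_X(b) - F_X(a)$$

and

(1.5.5)
$$P\{w: X(w) = x_0\} = F_X(x_0) - F_X(x_0-).$$

Thus, if $F_X$ is continuous at a point $x_0$, then $P\{w: X(w) = x_0\} = 0$. If $X$ is a random variable, then $Y = g(X)$ is a random variable only if $g$ is $\mathcal{B}$–(Borel) measurable, i.e., for any $B \in \mathcal{B}$, $g^{-1}(B) \in \mathcal{B}$. Thus, if $Y = g(X)$, $g$ is $\mathcal{B}$–measurable and $X$ is $\mathcal{F}$–measurable, then $Y$ is also $\mathcal{F}$–measurable. The distribution function of $Y$ is

(1.5.6)
$$F_Y(y) = P\{w: g(X(w)) \le y\}.$$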

Any two random variables X, Y having the same distribution are equivalent. We denote this by Y ~ X.

A distribution function $F$ may have a countable number of distinct points of discontinuity. If $x_0$ is a point of discontinuity, $F(x_0) - F(x_0-) > 0$. In between points of discontinuity, $F$ is continuous. If $F$ assumes a constant value between points of discontinuity (step function), it is called discrete. Formally, let $-\infty < x_1 < x_2 < \cdots < \infty$ be points of discontinuity of $F$. Let $I_A(x)$ denote the indicator function of a set $A$, i.e.,

$$I_A(x) = \begin{cases} 1, & \text{if } x \in A, \\ 0, & \text{otherwise.} \end{cases}$$

Then a discrete $F$ can be written as

(1.5.7)
$$F_d(x) = \sum_{i} I_{[x_i, \infty)}(x) \, p_i = \sum_{\{i:\, x_i \le x\}} p_i, \qquad p_i = F(x_i) - F(x_i-).$$
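A discrete distribution function is a right–continuous step function: its value at $x$ is the sum of the jump sizes at all discontinuity points not exceeding $x$. The following sketch builds such a function from assumed jump points and jump sizes (both invented for illustration) and evaluates it with a binary search.

```python
from bisect import bisect_right
from itertools import accumulate

def make_discrete_cdf(points, probs):
    """Return F with F(x) = sum of probs[i] over points[i] <= x.
    `points` must be sorted in increasing order."""
    cumulative = list(accumulate(probs))
    def F(x):
        i = bisect_right(points, x)      # number of jump points <= x
        return cumulative[i - 1] if i > 0 else 0.0
    return F

# Assumed example: jumps of size 1/4, 1/2, 1/4 at the points 0, 1, 2.
F = make_discrete_cdf([0, 1, 2], [0.25, 0.5, 0.25])

values = [F(-1), F(0), F(0.5), F(1), F(2), F(10)]
jump_at_1 = F(1) - F(1 - 1e-9)           # approximates F(1) - F(1-)
print(values, jump_at_1)
```

Because `bisect_right` counts the jump point itself, the function takes the upper value at each discontinuity, which is exactly right–continuity.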

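Let $\mu_1$ and $\mu_2$ be measures on $(\mathbb{R}, \mathcal{B})$. We say that $\mu_1$ is absolutely continuous with respect to $\mu_2$, and write $\mu_1 \ll \mu_2$, if $B \in \mathcal{B}$ and $\mu_2(B) = 0$ imply $\mu_1(B) = 0$. Let $\lambda$ denote the Lebesgue measure on $(\mathbb{R}, \mathcal{B})$. For every interval $(a, b]$, $-\infty < a < b < \infty$, $\lambda((a, b]) = b - a$. The celebrated Radon–Nikodym Theorem (see Shiryayev, 1984, p. 194) states that if $\mu_1 \ll \mu_2$ and $\mu_1$, $\mu_2$ are σ–finite measures on $(\mathbb{R}, \mathcal{B})$, there exists a $\mathcal{B}$–measurable nonnegative function $f(x)$ so that, for each $B \in \mathcal{B}$,

(1.5.8)
$$\mu_1(B) = \int_B f(x) \, d\mu_2(x),$$

where the Lebesgue integral in (1.5.8) will be discussed later. In particular, if $P_c$ is absolutely continuous with respect to the Lebesgue measure $\lambda$, then there exists a function $f \ge 0$ so that

(1.5.9)
$$P_c\{B\} = \int_B f(x) \, dx, \quad B \in \mathcal{B}.$$

Moreover,

(1.5.10)
$$\int_{-\infty}^{\infty} f(x) \, dx = 1.$$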

A distribution function $F$ is called absolutely continuous if there exists a nonnegative function $f$ such that

(1.5.11)
$$F(x) = \int_{-\infty}^{x} f(y) \, dy, \quad -\infty < x < \infty.$$

The function $f$, which can be represented for “almost all $x$” by the derivative of $F$, is called the probability density function (p.d.f.) corresponding to $F$.

If $F$ is absolutely continuous, then $f(x) = \frac{d}{dx} F(x)$ “almost everywhere.” The term “almost everywhere” or “almost all $x$” means for all $x$ values, excluding possibly a set $N$ of Lebesgue measure zero. Moreover, the probability assigned to any interval $(\alpha, \beta]$, $\alpha \le \beta$, is

(1.5.12)
$$P\{\alpha < X \le \beta\} = F(\beta) - F(\alpha) = \int_{\alpha}^{\beta} f(x) \, dx.$$

Due to the continuity of $F$ we can also write

$$P\{\alpha < X \le \beta\} = P\{\alpha \le X \le \beta\} = P\{\alpha \le X < \beta\} = P\{\alpha < X < \beta\}.$$

Often the density functions $f$ are Riemann integrable, and the above integrals are Riemann integrals. Otherwise, these are all Lebesgue integrals, which are defined in the next section.

There are continuous distribution functions that are not absolutely continuous. Such distributions are called singular. An example of a singular distribution is the Cantor distribution (see Shiryayev, 1984, p. 155).

Finally, every distribution function $F(x)$ is a mixture of the three types of distributions: discrete distributions $F_d(\cdot)$, absolutely continuous distributions $F_{ac}(\cdot)$, and singular distributions $F_s(\cdot)$. That is, for some $0 \le p_1, p_2, p_3 \le 1$ such that $p_1 + p_2 + p_3 = 1$,

$$F(x) = p_1 F_d(x) + p_2 F_{ac}(x) + p_3 F_s(x).$$

In this book we treat only mixtures of $F_d(x)$ and $F_{ac}(x)$.

1.6 THE LEBESGUE AND STIELTJES INTEGRALS

1.6.1 General Definition of Expected Value: The Lebesgue Integral

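Let $(\mathcal{S}, \mathcal{F}, P)$ be a probability space. If $X$ is a random variable, we wish to define the integral

(1.6.1)
$$E\{X\} = \int_{\mathcal{S}} X(w) P(dw).$$

We define first $E\{X\}$ for nonnegative random variables, i.e., $X(w) \ge 0$ for all $w \in \mathcal{S}$. Generally, $X = X^+ - X^-$, where $X^+(w) = \max(0, X(w))$ and $X^-(w) = -\min(0, X(w))$.

Given a nonnegative random variable $X$ we construct for a given finite integer $n$ the events

$$A_{k,n} = \Big\{w: \frac{k-1}{2^n} \le X(w) < \frac{k}{2^n}\Big\}, \quad k = 1, \ldots, n 2^n,$$

and

$$A_{n 2^n + 1, n} = \{w: X(w) \ge n\}.$$

These events form a partition of $\mathcal{S}$. Let $X_n$, $n \ge 1$, be the discrete random variable defined as

(1.6.2)
$$X_n(w) = \sum_{k=1}^{n 2^n} \frac{k-1}{2^n} I\{w \in A_{k,n}\} + n \, I\{w \in A_{n 2^n + 1, n}\}.$$

Notice that for each $w$, $X_n(w) \le X_{n+1}(w) \le \cdots \le X(w)$ for all $n$. Also, if $w \in A_{k,n}$, $k = 1, \ldots, n 2^n$, then $|X(w) - X_n(w)| \le 2^{-n}$. Moreover, $A_{n 2^n + 1, n} \supset A_{(n+1) 2^{n+1} + 1, n+1}$, all $n \ge 1$. Thus

$$\lim_{n \to \infty} A_{n 2^n + 1, n} = \bigcap_{n \ge 1} \{w: X(w) \ge n\} = \emptyset.$$

Thus for all $w \in \mathcal{S}$

(1.6.3)
$$\lim_{n \to \infty} X_n(w) = X(w).$$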

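Now, for each discrete random variable $X_n(w)$

(1.6.4)
$$E\{X_n\} = \sum_{k=1}^{n 2^n} \frac{k-1}{2^n} P\{A_{k,n}\} + n P\{w: X(w) \ge n\}.$$

Obviously $E\{X_n\} \le n$, and $E\{X_{n+1}\} \ge E\{X_n\}$. Thus, $\lim_{n \to \infty} E\{X_n\}$ exists (it might be $+\infty$). Accordingly, the Lebesgue integral is defined as

(1.6.5)
$$E\{X\} = \int X(w) P\{dw\} = \lim_{n \to \infty} E\{X_n\}.$$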
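The construction of the expected value through simple dyadic approximations can be traced numerically. In the sketch below, the approximation of a nonnegative value $x$ at stage $n$ is $\lfloor x 2^n \rfloor / 2^n$ when $x < n$ and $n$ otherwise, which is how the approximating variable is defined on the events of the partition. The second function evaluates the expectation of the stage-$n$ approximation for a random variable assumed, for illustration only, to be uniform on $[0, 1]$; in that case the exact value is $1/2 - 2^{-(n+1)}$, which increases to $E\{X\} = 1/2$.

```python
import math
from fractions import Fraction

def dyadic_approx(x, n):
    """Stage-n simple approximation of a nonnegative value x."""
    if x >= n:
        return float(n)
    return math.floor(x * 2 ** n) / 2 ** n

def expected_simple_uniform(n):
    """E{X_n} when X is uniform on [0, 1] (an assumed example).
    The term n * P{X >= n} vanishes for n >= 1."""
    total = Fraction(0)
    one = Fraction(1)
    for k in range(1, n * 2 ** n + 1):
        lo = Fraction(k - 1, 2 ** n)
        hi = Fraction(k, 2 ** n)
        prob = max(Fraction(0), min(hi, one) - min(lo, one))  # P{lo <= X < hi}
        total += lo * prob
    return total

approximations = [dyadic_approx(math.pi, n) for n in range(1, 8)]
expectations = [expected_simple_uniform(n) for n in range(1, 6)]
print(approximations)
print(expectations)
```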

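The Lebesgue integral may exist when the Riemann integral does not. For example, consider the probability space $(\mathcal{S}, \mathcal{F}, P)$ where $\mathcal{S} = \{x: 0 \le x \le 1\}$, $\mathcal{F}$ is the Borel σ–field on $\mathcal{S}$, and $P$ is the Lebesgue measure on $[0, 1]$. Define

$$f(x) = \begin{cases} 0, & \text{if } x \text{ is irrational}, \\ 1, & \text{if } x \text{ is rational.} \end{cases}$$

Let $B_0 = \{x: 0 \le x \le 1, f(x) = 0\}$, $B_1 = [0, 1] - B_0$. The Lebesgue integral of $f$ is

$$\int_0^1 f(x) \, dx = 0 \cdot P\{B_0\} + 1 \cdot P\{B_1\} = 0,$$

since the Lebesgue measure of $B_1$ is zero. On the other hand, the Riemann integral of $f(x)$ does not exist. Notice that, contrary to the construction of the Riemann integral, the Lebesgue integral $\int f(x) P\{dx\}$ of a nonnegative function $f$ is obtained by partitioning the range of the function $f$ into $2^n$ subintervals $\mathcal{D}_n = \{B_j^{(n)}\}$ and constructing a discrete random variable $\hat{f}_n = \sum_{j=1}^{2^n} f_{n,j} I\{x \in B_j^{(n)}\}$, where $f_{n,j} = \inf\{f(x): x \in B_j^{(n)}\}$. The expected value of $\hat{f}_n$ is $E\{\hat{f}_n\} = \sum_{j=1}^{2^n} f_{n,j} P(X \in B_j^{(n)})$. The sequence $\{E\{\hat{f}_n\}, n \ge 1\}$ is nondecreasing, and its limit exists (might be $+\infty$). Generally, we define

(1.6.6)
$$E\{X\} = E\{X^+\} - E\{X^-\}$$

if either $E\{X^+\} < \infty$ or $E\{X^-\} < \infty$.

If $E\{X^+\} = \infty$ and $E\{X^-\} = \infty$, we say that $E\{X\}$ does not exist. As a special case, if $F$ is absolutely continuous with density $f$, then

$$E\{X\} = \int_{-\infty}^{\infty} x f(x) \, dx,$$

provided $\int_{-\infty}^{\infty} |x| f(x) \, dx < \infty$. If $F$ is discrete then

$$E\{X\} = \sum_{n} x_n P\{X = x_n\},$$

provided it is absolutely convergent.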

From the definition (1.6.4), it is obvious that if $P\{X(w) \ge 0\} = 1$ then $E\{X\} \ge 0$. This immediately implies that if $X$ and $Y$ are two random variables such that $P\{w: X(w) \ge Y(w)\} = 1$, then $E\{X - Y\} \ge 0$. Also, if $E\{X\}$ exists then, for all $A \in \mathcal{F}$,

$$E\{X I_A(X)\} = \int_A X(w) P(dw),$$

and $E\{X I_A(X)\}$ exists. If $E\{X\}$ is finite, $E\{X I_A(X)\}$ is also finite. From the definition of expectation we immediately obtain that for any finite constant $c$,

(1.6.7)
$$E\{cX\} = c E\{X\}, \qquad E\{X_1 + X_2\} = E\{X_1\} + E\{X_2\}.$$

Equation (1.6.7) implies that the expected value is a linear functional, i.e., if $X_1, \ldots, X_n$ are random variables on $(\mathcal{S}, \mathcal{F}, P)$ and $\beta_0, \beta_1, \ldots, \beta_n$ are finite constants, then, if all expectations exist,

(1.6.8)
$$E\Big\{\beta_0 + \sum_{i=1}^{n} \beta_i X_i\Big\} = \beta_0 + \sum_{i=1}^{n} \beta_i E\{X_i\}.$$

We present now a few basic theorems on the convergence of the expectations of sequences of random variables.

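Theorem 1.6.1 (Monotone Convergence) Let $\{X_n\}$ be a monotone sequence of random variables and $Y$ a random variable.

(i) Suppose that $X_n(w) \nearrow X(w)$, $X_n(w) \ge Y(w)$ for all $n$ and all $w \in \mathcal{S}$, and $E\{Y\} > -\infty$. Then

$$\lim_{n \to \infty} E\{X_n\} = E\{X\}.$$

(ii) If $X_n(w) \searrow X(w)$, $X_n(w) \le Y(w)$ for all $n$ and all $w \in \mathcal{S}$, and $E\{Y\} < \infty$, then

$$\lim_{n \to \infty} E\{X_n\} = E\{X\}.$$

Proof. See Shiryayev (1984, p. 184). QED

Corollary 1.6.1. If $X_1, X_2, \ldots$ are nonnegative random variables, then

(1.6.9)
$$E\Big\{\sum_{n=1}^{\infty} X_n\Big\} = \sum_{n=1}^{\infty} E\{X_n\}.$$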

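Theorem 1.6.2. (Fatou) Let $X_n$, $n \ge 1$, and $Y$ be random variables.

(i) If $X_n(w) \ge Y(w)$, $n \ge 1$, for each $w$ and $E\{Y\} > -\infty$, then

$$E\Big\{\liminf_{n \to \infty} X_n\Big\} \le \liminf_{n \to \infty} E\{X_n\};$$

(ii) if $X_n(w) \le Y(w)$, $n \ge 1$, for each $w$ and $E\{Y\} < \infty$, then

$$\limsup_{n \to \infty} E\{X_n\} \le E\Big\{\limsup_{n \to \infty} X_n\Big\};$$

(iii) if $|X_n(w)| \le Y(w)$ for each $w$, and $E\{Y\} < \infty$, then

(1.6.10)
$$E\Big\{\liminf_{n \to \infty} X_n\Big\} \le \liminf_{n \to \infty} E\{X_n\} \le \limsup_{n \to \infty} E\{X_n\} \le E\Big\{\limsup_{n \to \infty} X_n\Big\}.$$

Proof. (i)

$$\liminf_{n \to \infty} X_n(w) = \lim_{n \to \infty} \inf_{m \ge n} X_m(w).$$

The sequence $Z_n(w) = \inf_{m \ge n} X_m(w)$, $n \ge 1$, is monotonically increasing for each $w$, and $Z_n(w) \ge Y(w)$, $n \ge 1$. Hence, by Theorem 1.6.1,

$$E\Big\{\liminf_{n \to \infty} X_n\Big\} = \lim_{n \to \infty} E\{Z_n\} \le \liminf_{n \to \infty} E\{X_n\}.$$

The proof of (ii) is obtained by defining $Z_n(w) = \sup_{m \ge n} X_m(w)$ and applying the previous theorem. Part (iii) is a result of (i) and (ii). QED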

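Theorem 1.6.3. (Lebesgue Dominated Convergence) Let $Y, X, X_n$, $n \ge 1$, be random variables such that $|X_n(w)| \le Y(w)$, $n \ge 1$, for almost all $w$, and $E\{Y\} < \infty$. Assume also that $P\{w: \lim_{n \to \infty} X_n(w) = X(w)\} = 1$. Then $E\{|X|\} < \infty$ and

(1.6.11)
$$\lim_{n \to \infty} E\{X_n\} = E\{X\}$$

and

(1.6.12)
$$\lim_{n \to \infty} E\{|X_n - X|\} = 0.$$

Proof. By Fatou’s Theorem (Theorem 1.6.2)

$$E\Big\{\liminf_{n \to \infty} X_n\Big\} \le \liminf_{n \to \infty} E\{X_n\} \le \limsup_{n \to \infty} E\{X_n\} \le E\Big\{\limsup_{n \to \infty} X_n\Big\}.$$

But since $\lim_{n \to \infty} X_n(w) = X(w)$ with probability 1,

$$E\{X\} = E\Big\{\lim_{n \to \infty} X_n\Big\} = \lim_{n \to \infty} E\{X_n\}.$$

Moreover, $|X(w)| \le Y(w)$ for almost all $w$ (with probability 1). Hence, $E\{|X|\} < \infty$. Finally, since $|X_n(w) - X(w)| \le 2Y(w)$ with probability 1,

$$\lim_{n \to \infty} E\{|X_n - X|\} = E\Big\{\lim_{n \to \infty} |X_n - X|\Big\} = 0.$$

QED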

We conclude this section with a theorem on change of variables under Lebesgue integrals.

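Theorem 1.6.4 Let $X$ be a random variable with respect to $(\mathcal{S}, \mathcal{F}, P)$. Let $g: \mathbb{R} \to \mathbb{R}$ be a Borel measurable function. Then for each $B \in \mathcal{B}$,

(1.6.13)
$$\int_B g(x) P_X\{dx\} = \int_{X^{-1}(B)} g(X(w)) P\{dw\}.$$

The proof of the theorem is based on the following steps.

1. If $A \in \mathcal{B}$ and $g(x) = I_A(x)$ then

$$\int_B g(x) P_X\{dx\} = P_X\{A \cap B\} = P\{X^{-1}(A) \cap X^{-1}(B)\} = \int_{X^{-1}(B)} g(X(w)) P\{dw\}.$$

2. Show that Equation (1.6.13) holds for simple random variables.

3. Follow the steps of the definition of the Lebesgue integral.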

1.6.2 The Stieltjes–Riemann Integral

Let $g$ be a function of a real variable and $F$ a distribution function. Let $(\alpha, \beta]$ be a half–closed interval. Let

$$\alpha = x_0 < x_1 < \cdots < x_{n-1} < x_n = \beta$$

be a partition of $(\alpha, \beta]$ into $n$ subintervals $(x_{i-1}, x_i]$, $i = 1, \ldots, n$. In each subinterval choose $x_i'$, $x_{i-1} < x_i' \le x_i$, and consider the sum

(1.6.14)
$$S_n = \sum_{i=1}^{n} g(x_i') \big[F(x_i) - F(x_{i-1})\big].$$

If, as $n \to \infty$, $\max_i |x_i - x_{i-1}| \to 0$ and if $\lim_{n \to \infty} S_n$ exists (finite) independently of the partitions, then the limit is called the Stieltjes–Riemann integral of $g$ with respect to $F$. We denote this integral as

$$\int_{\alpha}^{\beta} g(x) \, dF(x).$$

This integral has the usual linear properties, i.e.,

(i)

(ii)

(1.6.15)

and

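(iii) $\int_{\alpha}^{\beta} g(x) \, d(\gamma F_1(x) + \delta F_2(x)) = \gamma \int_{\alpha}^{\beta} g(x) \, dF_1(x) + \delta \int_{\alpha}^{\beta} g(x) \, dF_2(x)$.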

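One can integrate by parts, if all expressions exist, according to the formula

(1.6.16)
$$\int_{\alpha}^{\beta} g(x) \, dF(x) = \big[g(\beta) F(\beta) - g(\alpha) F(\alpha)\big] - \int_{\alpha}^{\beta} g'(x) F(x) \, dx,$$

where $g'(x)$ is the derivative of $g(x)$. If $F$ is strictly discrete, with jump points $-\infty < \xi_1 < \xi_2 < \cdots < \infty$,

(1.6.17)
$$\int_{\alpha}^{\beta} g(x) \, dF(x) = \sum_{j=1}^{\infty} I\{\alpha < \xi_j \le \beta\} \, g(\xi_j) \, p_j,$$

where $p_j = F(\xi_j) - F(\xi_j-)$, $j = 1, 2, \ldots$. If $F$ is absolutely continuous, then at almost all points,

$$F(x + dx) - F(x) = f(x) \, dx + o(dx),$$

as $dx \to 0$. Thus, in the absolutely continuous case

(1.6.18)
$$\int_{\alpha}^{\beta} g(x) \, dF(x) = \int_{\alpha}^{\beta} g(x) f(x) \, dx.$$

Finally, the improper Stieltjes–Riemann integral, if it exists, is

(1.6.19)
$$\int_{-\infty}^{\infty} g(x) \, dF(x) = \lim_{\substack{\alpha \to -\infty \\ \beta \to \infty}} \int_{\alpha}^{\beta} g(x) \, dF(x).$$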

If $B$ is a set obtained by union and complementation of a sequence of intervals, we can write, by setting $g(x) = I\{x \in B\}$,

(1.6.20)
$$P\{B\} = \int_{-\infty}^{\infty} I\{x \in B\} \, dF(x) = \int_B dF(x),$$

where $F$ is either discrete or absolutely continuous.
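The Stieltjes–Riemann sum, in which each value of $g$ is weighted by the increment of $F$ over a half–closed subinterval, can be evaluated directly on an equally spaced partition. The sketch below is an illustration under assumed inputs: $g(x) = x$ on $(0, 1]$, first with the absolutely continuous $F(x) = x^2$, for which the integral equals $\int_0^1 x \cdot 2x \, dx = 2/3$, and then with a distribution function that adds a jump of size $1/2$ at $x = 1/2$. Right endpoints are used as the intermediate points, which is one admissible choice.

```python
def stieltjes_sum(g, F, alpha, beta, n):
    """Sum of g(x_i) * (F(x_i) - F(x_{i-1})) over an equally spaced
    partition of (alpha, beta], using right endpoints as the chosen points."""
    xs = [alpha + (beta - alpha) * i / n for i in range(n + 1)]
    return sum(g(xs[i]) * (F(xs[i]) - F(xs[i - 1])) for i in range(1, n + 1))

def g(x):
    return x

def F_ac(x):
    """Assumed absolutely continuous distribution function on [0, 1]."""
    return x * x

def F_mixed(x):
    """Assumed mixture: half of F_ac plus a jump of size 1/2 at x = 1/2."""
    return 0.5 * F_ac(x) + (0.5 if x >= 0.5 else 0.0)

approx_ac = stieltjes_sum(g, F_ac, 0.0, 1.0, 1000)        # close to 2/3
approx_mixed = stieltjes_sum(g, F_mixed, 0.0, 1.0, 1000)  # close to 0.5 * 2/3 + 0.5 * 0.5
print(approx_ac, approx_mixed)
```

In the mixed case the jump contributes $g(1/2)$ times the jump size, in line with the formula for strictly discrete $F$, while the continuous part contributes the ordinary integral against the density.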

1.6.3 Mixtures of Discrete and Absolutely Continuous Distributions

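Let $F_d$ be a discrete distribution function and let $F_{ac}$ be an absolutely continuous distribution function. Then for all $\alpha$, $0 \le \alpha \le 1$,

(1.6.21)
$$F(x) = \alpha F_d(x) + (1 - \alpha) F_{ac}(x)$$

is also a distribution function, which is a mixture of the two types. Thus, for such mixtures, if $-\infty < \xi_1 < \xi_2 < \cdots < \infty$ are the jump points of $F_d$, then for every $-\infty < \gamma \le \delta < \infty$ and $B = (\gamma, \delta]$,

(1.6.22)
$$P\{B\} = \int_B dF(x) = \alpha \sum_{j=1}^{\infty} I\{\gamma < \xi_j \le \delta\} \, p_j + (1 - \alpha) \int_{\gamma}^{\delta} f_{ac}(x) \, dx,$$

where $p_j = F_d(\xi_j) - F_d(\xi_j-)$. Moreover, if $B^+ = [\gamma, \delta]$ then

$$P\{B^+\} = P\{B\} + \alpha \big[F_d(\gamma) - F_d(\gamma-)\big].$$

The expected value of $X$, when $F(x) = p F_d(x) + (1 - p) F_{ac}(x)$, is

(1.6.23)
$$E\{X\} = p \sum_{\{j\}} \xi_j f_d(\xi_j) + (1 - p) \int_{-\infty}^{\infty} x f_{ac}(x) \, dx,$$

where $\{\xi_j\}$ is the set of jump points of $F_d$; $f_d$ and $f_{ac}$ are the corresponding p.d.f.s. We assume here that the sum and the integral are absolutely convergent.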

1.6.4 Quantiles of Distributions

The p–quantiles or fractiles of distribution functions are inverse points of the distributions. More specifically, the p–quantile of a distribution function $F$, designated by $x_p$ or $F^{-1}(p)$, is the smallest value of $x$ at which $F(x)$ is greater than or equal to $p$, i.e.,

(1.6.24)
$$x_p = F^{-1}(p) = \inf\{x: F(x) \ge p\}.$$

The inverse function defined in this fashion is unique. The median of a distribution, $x_{.5}$, is an important parameter characterizing the location of the distribution. The lower and upper quartiles are the .25– and .75–quantiles. The difference between these quantiles, $R_Q = x_{.75} - x_{.25}$, is called the interquartile range. It serves as one of the measures of dispersion of distribution functions.
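For a discrete distribution the p–quantile, defined as the smallest $x$ with $F(x) \ge p$, is always one of the jump points, so it can be found by searching the cumulative probabilities. The sketch below uses an assumed distribution that puts probability $1/4$ on each of the points 1, 2, 3, 4; exact fractions avoid rounding issues when $p$ coincides with a value of $F$.

```python
from bisect import bisect_left
from fractions import Fraction
from itertools import accumulate

def make_quantile(points, probs):
    """Return the function p -> inf{x : F(x) >= p} for a discrete
    distribution on the sorted `points` with probabilities `probs`."""
    cumulative = list(accumulate(probs))  # F evaluated at each jump point
    def quantile(p):
        # index of the first jump point whose cumulative probability is >= p
        return points[bisect_left(cumulative, p)]
    return quantile

quarter = Fraction(1, 4)
quantile = make_quantile([1, 2, 3, 4], [quarter] * 4)

median = quantile(Fraction(1, 2))
lower_quartile = quantile(Fraction(1, 4))
upper_quartile = quantile(Fraction(3, 4))
interquartile_range = upper_quartile - lower_quartile
print(median, lower_quartile, upper_quartile, interquartile_range)
```

Note that $F(2) = 1/2$ exactly, so the median is 2 rather than any point between 2 and 3; taking the infimum is what makes the inverse unique.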

1.6.5 Transformations

From the distribution function $F(x) = \alpha F_d(x) + (1 - \alpha) F_{ac}(x)$, $0 \le \alpha \le 1$, we can derive the distribution function of a transformed random variable $Y = g(X)$, which is

(1.6.25)
$$F_Y(y) = P\{g(X) \le y\} = \alpha \sum_{\{j:\, \xi_j \in B_y\}} f_d(\xi_j) + (1 - \alpha) \int_{B_y} f_{ac}(x) \, dx,$$

where

$$B_y = \{x: g(x) \le y\}.$$

In particular, if $F$ is absolutely continuous and if $g$ is a strictly increasing differentiable function, then the p.d.f. of $Y$, $h(y)$, is

(1.6.26)
$$h(y) = f_X\big(g^{-1}(y)\big) \frac{d}{dy} g^{-1}(y),$$

where $g^{-1}(y)$ is the inverse function. If $g'(x) < 0$ for all $x$, then

(1.6.27)
$$h(y) = -f_X\big(g^{-1}(y)\big) \frac{d}{dy} g^{-1}(y).$$

Suppose that $X$ is a continuous random variable with p.d.f. $f(x)$. Let $g(x)$ be a differentiable function that is not necessarily one–to–one, like $g(x) = x^2$. Excluding cases where $g(x)$ is a constant over an interval, like the indicator function, let $m(y)$ denote the number of roots of the equation $g(x) = y$. Let $\xi_j(y)$, $j = 1, \ldots, m(y)$, denote the roots of this equation. Then the p.d.f. of $Y = g(X)$ is

(1.6.28)
$$f_Y(y) = \sum_{j=1}^{m(y)} \frac{f_X(\xi_j(y))}{|g'(\xi_j(y))|}$$

if $m(y) > 0$ and zero otherwise.
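The density of a transformed variable under a transformation that is not one–to–one is obtained by summing, over the roots of $g(x) = y$, the density of $X$ at each root divided by the absolute value of $g'$ there. As a sketch, take $X$ standard normal and $g(x) = x^2$ (both assumed here for illustration). For $y > 0$ the roots are $\pm\sqrt{y}$, and the result should agree with the well-known chi-squared density with one degree of freedom, $e^{-y/2} / \sqrt{2 \pi y}$.

```python
import math

def density_of_transform(f_x, g_prime, roots, y):
    """Sum of f_x(r) / |g'(r)| over the roots r of g(x) = y."""
    return sum(f_x(r) / abs(g_prime(r)) for r in roots(y))

def normal_pdf(x):
    """Standard normal density (the assumed distribution of X)."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def square_prime(x):
    """Derivative of g(x) = x^2."""
    return 2.0 * x

def square_roots(y):
    """Roots of x^2 = y; none for y <= 0 (y = 0 is excluded since g'(0) = 0)."""
    if y <= 0:
        return []
    r = math.sqrt(y)
    return [-r, r]

def chi2_one_df_pdf(y):
    """Chi-squared density with one degree of freedom, for y > 0."""
    return math.exp(-0.5 * y) / math.sqrt(2.0 * math.pi * y)

y0 = 1.5
from_formula = density_of_transform(normal_pdf, square_prime, square_roots, y0)
print(from_formula, chi2_one_df_pdf(y0))
```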

1.7 JOINT DISTRIBUTIONS, CONDITIONAL DISTRIBUTIONS AND INDEPENDENCE

1.7.1 Joint Distributions

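Let $(X_1, \ldots, X_k)$ be a vector of $k$ random variables defined on the same probability space. These random variables represent variables observed in the same experiment. The joint distribution function of these random variables is a real-valued function $F$ of $k$ real arguments $(\xi_1, \ldots, \xi_k)$ such that

(1.7.1)
$$F(\xi_1, \ldots, \xi_k) = P\{X_1 \le \xi_1, \ldots, X_k \le \xi_k\}.$$

The joint distribution of two random variables is called a bivariate distribution function.

Every bivariate distribution function $F$ has the following properties.

(1.7.2)
(i) $\lim_{\xi_1 \to -\infty} F(\xi_1, \xi_2) = \lim_{\xi_2 \to -\infty} F(\xi_1, \xi_2) = 0$ for all $\xi_1, \xi_2$;
(ii) $\lim_{\xi_1 \to \infty,\, \xi_2 \to \infty} F(\xi_1, \xi_2) = 1$;
(iii) $\lim_{\epsilon \downarrow 0} F(\xi_1 + \epsilon, \xi_2 + \epsilon) = F(\xi_1, \xi_2)$ for all $(\xi_1, \xi_2)$;
(iv) for any $a > 0$, $b > 0$ and $(\xi_1, \xi_2)$, $F(\xi_1 + a, \xi_2 + b) - F(\xi_1 + a, \xi_2) - F(\xi_1, \xi_2 + b) + F(\xi_1, \xi_2) \ge 0$.

Property (iii) is the right continuity of $F(\xi_1, \xi_2)$. Property (iv) means that the probability of every rectangle is nonnegative. Moreover, the total increase of $F(\xi_1, \xi_2)$ is from 0 to 1. Similar properties are required in cases of a larger number of variables.

Given a bivariate distribution function $F$, the univariate distributions of $X_1$ and $X_2$ are $F_1$ and $F_2$, where

(1.7.3)
$$F_1(x) = \lim_{y \to \infty} F(x, y), \qquad F_2(y) = \lim_{x \to \infty} F(x, y).$$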

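$F_1$ and $F_2$ are called the marginal distributions of $X_1$ and $X_2$, respectively. In cases of joint distributions of three variables, we can distinguish between three marginal bivariate distributions and three marginal univariate distributions. As in the univariate case, multivariate distributions are either discrete, absolutely continuous, singular, or mixtures of the three main types. In the discrete case there are at most a countable number of points $\{(\xi_1^{(j)}, \ldots, \xi_k^{(j)}), j = 1, 2, \ldots\}$ on which the distribution concentrates. In this case the joint probability function is

(1.7.4)
$$f(x_1, \ldots, x_k) = \begin{cases} P\{X_1 = \xi_1^{(j)}, \ldots, X_k = \xi_k^{(j)}\}, & \text{if } (x_1, \ldots, x_k) = (\xi_1^{(j)}, \ldots, \xi_k^{(j)}), \ j = 1, 2, \ldots, \\ 0, & \text{otherwise.} \end{cases}$$

Such a discrete p.d.f. can be written as

$$f(x_1, \ldots, x_k) = \sum_{j=1}^{\infty} p_j \, I\{(x_1, \ldots, x_k) = (\xi_1^{(j)}, \ldots, \xi_k^{(j)})\},$$

where $p_j = P\{X_1 = \xi_1^{(j)}, \ldots, X_k = \xi_k^{(j)}\}$.

In the absolutely continuous case there exists a nonnegative function $f(x_1, \ldots, x_k)$ such that

(1.7.5)
$$F(\xi_1, \ldots, \xi_k) = \int_{-\infty}^{\xi_1} \cdots \int_{-\infty}^{\xi_k} f(x_1, \ldots, x_k) \, dx_1 \cdots dx_k.$$

The function $f(x_1, \ldots, x_k)$ is called the joint density function.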

The marginal probability or density functions of single variables or of a subvector of variables can be obtained by summing (in the discrete case) or integrating, in the absolutely continuous case, the joint distribution functions (densities) with respect to the variables that are not under consideration, over their range of variation.

Although the presentation here is in terms of k discrete or k absolutely continuous random variables, the joint distributions can involve some discrete and some continuous variables, or mixtures.

If $X_1$ has an absolutely continuous marginal distribution and $X_2$ is discrete, we can introduce the function $N(B)$ on $\mathcal{B}$, which counts the number of jump points of $X_2$ that belong to $B$. $N(B)$ is a σ–finite measure. Let $\lambda(B)$ be the Lebesgue measure on $\mathcal{B}$. Consider the σ–finite measure on $\mathcal{B}^{(2)}$, $\mu(B_1 \times B_2) = \lambda(B_1) N(B_2)$. If $X_1$ is absolutely continuous and $X_2$ discrete, their joint probability measure $P_X$ is absolutely continuous with respect to $\mu$. There exists then a nonnegative function $f_X$ such that

$$P_X\{B_1 \times B_2\} = \int_{B_1 \times B_2} f_X(x_1, x_2) \, \mu(dx_1, dx_2).$$

The function $f_X$ is a joint p.d.f. of $X_1, X_2$ with respect to $\mu$. The joint p.d.f. $f_X$ is positive only at jump points of $X_2$.

If $X_1, \ldots, X_k$ have a joint distribution with p.d.f. $f(x_1, \ldots, x_k)$, the expected value of a function $g(X_1, \ldots, X_k)$ is defined as

(1.7.6)
$$E\{g(X_1, \ldots, X_k)\} = \int g(x_1, \ldots, x_k) \, dF(x_1, \ldots, x_k).$$

We have used here the conventional notation for Stieltjes integrals.

Notice that if $(X, Y)$ have a joint distribution function $F(x, y)$ and if $X$ is discrete with jump points of $F_1(x)$ at $\xi_1, \xi_2, \ldots$, and $Y$ is absolutely continuous, then, as in the previous example,

$$E\{g(X, Y)\} = \sum_{j=1}^{\infty} \int_{-\infty}^{\infty} g(\xi_j, y) f(\xi_j, y) \, dy,$$

where $f(x, y)$ is the joint p.d.f. A similar formula holds for the case of $X$ absolutely continuous and $Y$ discrete.

1.7.2 Conditional Expectations: General Definition

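Let $X(w) \ge 0$, for all $w \in \mathcal{S}$, be a random variable with respect to $(\mathcal{S}, \mathcal{F}, P)$. Consider a σ–field $\mathcal{G}$, $\mathcal{G} \subset \mathcal{F}$. The conditional expectation of $X$ given $\mathcal{G}$ is defined as a $\mathcal{G}$–measurable random variable $E\{X \mid \mathcal{G}\}$ satisfying

(1.7.7)
$$\int_A X(w) P\{dw\} = \int_A E\{X \mid \mathcal{G}\}(w) P\{dw\}$$

for all $A \in \mathcal{G}$. Generally, $E\{X \mid \mathcal{G}\}$ is defined if $\min\{E\{X^+ \mid \mathcal{G}\}, E\{X^- \mid \mathcal{G}\}\} < \infty$ and $E\{X \mid \mathcal{G}\} = E\{X^+ \mid \mathcal{G}\} - E\{X^- \mid \mathcal{G}\}$. To see that such conditional expectations exist, where $X(w) \ge 0$ for all $w$, consider the σ–finite measure on $\mathcal{G}$,

(1.7.8)
$$Q(A) = \int_A X(w) P\{dw\}, \quad A \in \mathcal{G}.$$

Obviously $Q \ll P$ and by the Radon–Nikodym Theorem, there exists a nonnegative, $\mathcal{G}$–measurable random variable $E\{X \mid \mathcal{G}\}$ such that

(1.7.9)
$$Q(A) = \int_A E\{X \mid \mathcal{G}\}(w) P\{dw\}.$$

According to the Radon–Nikodym Theorem, $E\{X \mid \mathcal{G}\}$ is determined only up to a set of P–measure zero.

If $B \in \mathcal{F}$ and $X(w) = I_B(w)$, then $E\{X \mid \mathcal{G}\} = P\{B \mid \mathcal{G}\}$ and according to (1.6.13),

(1.7.10)
$$P\{A \cap B\} = \int_A I_B(w) P\{dw\} = \int_A P\{B \mid \mathcal{G}\} P\{dw\}, \quad A \in \mathcal{G}.$$

Notice also that if $X$ is $\mathcal{G}$–measurable then $X = E\{X \mid \mathcal{G}\}$ with probability 1.

On the other hand, if $\mathcal{G} = \{\emptyset, \mathcal{S}\}$ is the trivial algebra, then $E\{X \mid \mathcal{G}\} = E\{X\}$ with probability 1.

From the definition (1.7.7), since $\mathcal{S} \in \mathcal{G}$,

$$E\{X\} = \int_{\mathcal{S}} X(w) P\{dw\} = \int_{\mathcal{S}} E\{X \mid \mathcal{G}\} P\{dw\}.$$

This is the law of iterated expectation; namely, for all $\mathcal{G} \subset \mathcal{F}$,

(1.7.11)
$$E\{X\} = E\{E\{X \mid \mathcal{G}\}\}.$$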

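Furthermore, if $X$ and $Y$ are two random variables on $(\mathcal{S}, \mathcal{F}, P)$, the collection of all sets $\{Y^{-1}(B), B \in \mathcal{B}\}$ is a σ–field generated by $Y$. Let $\mathcal{F}_Y$ denote this σ–field. Since $Y$ is a random variable, $\mathcal{F}_Y \subset \mathcal{F}$. We define

(1.7.12)
$$E\{X \mid Y\} = E\{X \mid \mathcal{F}_Y\}.$$

Let $y_0$ be such that $f_Y(y_0) > 0$.

Consider the $\mathcal{F}_Y$–measurable set $A_\delta = \{w: y_0 < Y(w) \le y_0 + \delta\}$. According to (1.7.7)

(1.7.13)
$$\int_{A_\delta} X(w) P(dw) = \int_{A_\delta} E\{X \mid Y\}(w) P(dw).$$

The left–hand side of (1.7.13) is, if $E\{|X|\} < \infty$,

$$\int_{-\infty}^{\infty} x \int_{y_0}^{y_0 + \delta} f_{XY}(x, y) \, dy \, dx = \delta \int_{-\infty}^{\infty} x f_{XY}(x, y_0) \, dx + o(\delta), \quad \text{as } \delta \to 0,$$

where $\lim_{\delta \to 0} o(\delta)/\delta = 0$. The right–hand side of (1.7.13) is

$$\int_{y_0}^{y_0 + \delta} E\{X \mid Y = y\} f_Y(y) \, dy = E\{X \mid Y = y_0\} f_Y(y_0) \, \delta + o(\delta).$$

Dividing both sides of (1.7.13) by $f_Y(y_0) \delta$ and letting $\delta \to 0$, we obtain that

$$E\{X \mid Y = y_0\} = \int_{-\infty}^{\infty} x \, \frac{f_{XY}(x, y_0)}{f_Y(y_0)} \, dx.$$

We therefore define for $f_Y(y_0) > 0$

(1.7.14)
$$f_{X \mid Y}(x \mid y_0) = \frac{f_{XY}(x, y_0)}{f_Y(y_0)}.$$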

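More generally, for $k > 2$ let $f(x_1, \ldots, x_k)$ denote the joint p.d.f. of $(X_1, \ldots, X_k)$. Let $1 \le r < k$ and $g(x_1, \ldots, x_r)$ denote the marginal joint p.d.f. of $(X_1, \ldots, X_r)$. Suppose that $(\xi_1, \ldots, \xi_r)$ is a point at which $g(\xi_1, \ldots, \xi_r) > 0$. The conditional p.d.f. of $X_{r+1}, \ldots, X_k$ given $\{X_1 = \xi_1, \ldots, X_r = \xi_r\}$ is defined as

(1.7.15)
$$f(x_{r+1}, \ldots, x_k \mid \xi_1, \ldots, \xi_r) = \frac{f(\xi_1, \ldots, \xi_r, x_{r+1}, \ldots, x_k)}{g(\xi_1, \ldots, \xi_r)}.$$

We remark that conditional distribution functions are not defined on points $(\xi_1, \ldots, \xi_r)$ such that $g(\xi_1, \ldots, \xi_r) = 0$. However, it is easy to verify that the probability associated with this set of points is zero. Thus, the definition presented here is sufficiently general for statistical purposes. Notice that $f(x_{r+1}, \ldots, x_k \mid \xi_1, \ldots, \xi_r)$ is, for a fixed point $(\xi_1, \ldots, \xi_r)$ at which it is well defined, a nonnegative function of $(x_{r+1}, \ldots, x_k)$ and that

$$\int \cdots \int f(x_{r+1}, \ldots, x_k \mid \xi_1, \ldots, \xi_r) \, dx_{r+1} \cdots dx_k = 1.$$

Thus, $f(x_{r+1}, \ldots, x_k \mid \xi_1, \ldots, \xi_r)$ is indeed a joint p.d.f. of $(X_{r+1}, \ldots, X_k)$. The point $(\xi_1, \ldots, \xi_r)$ can be considered a parameter of the conditional distribution.

If $\psi(X_{r+1}, \ldots, X_k)$ is an (integrable) function of $(X_{r+1}, \ldots, X_k)$, the conditional expectation of $\psi(X_{r+1}, \ldots, X_k)$ given $\{X_1 = \xi_1, \ldots, X_r = \xi_r\}$ is

(1.7.16)
$$E\{\psi(X_{r+1}, \ldots, X_k) \mid \xi_1, \ldots, \xi_r\} = \int \cdots \int \psi(x_{r+1}, \ldots, x_k) f(x_{r+1}, \ldots, x_k \mid \xi_1, \ldots, \xi_r) \, dx_{r+1} \cdots dx_k.$$

This conditional expectation exists if the integral is absolutely convergent.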

1.7.3 Independence

Random variables $X_1, \ldots, X_n$, on the same probability space, are called mutually independent if, for any Borel sets $B_1, \ldots, B_n$,

(1.7.17)
$$P\{w: X_1(w) \in B_1, \ldots, X_n(w) \in B_n\} = \prod_{j=1}^{n} P\{w: X_j(w) \in B_j\}.$$

Accordingly, the joint distribution function of any k–tuple $(X_{i_1}, \ldots, X_{i_k})$ is a product of their marginal distributions. In particular,

(1.7.18)
$$F_{X_1, \ldots, X_n}(x_1, \ldots, x_n) = \prod_{i=1}^{n} F_{X_i}(x_i).$$

Equation (1.7.18) implies that if $X_1, \ldots, X_n$ have a joint p.d.f. $f_X(x_1, \ldots, x_n)$ and if they are independent, then

(1.7.19)
$$f_X(x_1, \ldots, x_n) = \prod_{j=1}^{n} f_{X_j}(x_j).$$

Moreover, if $g(X_1, \ldots, X_n) = \prod_{j=1}^{n} g_j(X_j)$, where $g(x_1, \ldots, x_n)$ is $\mathcal{B}^{(n)}$–measurable and $g_j(x)$ are $\mathcal{B}$–measurable, then under independence

(1.7.20)
$$E\{g(X_1, \ldots, X_n)\} = \prod_{j=1}^{n} E\{g_j(X_j)\}.$$

Probability models with independence structure play an important role in statistical theory. From (1.7.12) and (1.7.21), it follows that if $X^{(r)} = (X_1, \ldots, X_r)$ and $Y^{(r)} = (X_{r+1}, \ldots, X_n)$ are independent subvectors, then the conditional distribution of $X^{(r)}$ given $Y^{(r)}$ is independent of $Y^{(r)}$, i.e.,

(1.7.21)
$$f(x_1, \ldots, x_r \mid x_{r+1}, \ldots, x_n) = f(x_1, \ldots, x_r)$$

with probability one.

1.8 MOMENTS AND RELATED FUNCTIONALS

A moment of order $r$, $r = 1, 2, \ldots$, of a distribution $F(x)$ is

(1.8.1)
$$\mu_r = E\{X^r\} = \int_{-\infty}^{\infty} x^r \, dF(x).$$

The moments of $Y = X - \mu_1$ are called central moments and those of $|X|$ are called absolute moments. It is simple to prove that the existence of an absolute moment of order $r$, $r > 0$, implies the existence of all moments of order $s$, $0 < s \le r$ (see Section 1.13.3).

Let $\mu_r^* = E\{(X - \mu_1)^r\}$, $r = 1, 2, \ldots$, denote the $r$th central moment of a distribution. From the binomial expansion and the linear properties of the expectation operator we obtain the relationship between the moments (about the origin) $\mu_r$ and the central moments $\mu_r^*$:

(1.8.2)
$$\mu_r^* = \sum_{j=0}^{r} (-1)^j \binom{r}{j} \mu_1^j \, \mu_{r-j}, \quad r = 1, 2, \ldots,$$

where $\mu_0 \equiv 1$.

A distribution function $F$ is called symmetric about a point $\xi_0$ if its p.d.f. is symmetric about $\xi_0$, i.e.,

$$f(\xi_0 + x) = f(\xi_0 - x), \quad \text{for all } 0 \le x < \infty.$$

From this definition we immediately obtain the following results.

(i) If $F$ is symmetric about $\xi_0$ and $E\{|X|\} < \infty$, then $\xi_0 = E\{X\}$.

(ii) If $F$ is symmetric, then all central moments of odd order are zero, i.e., $E\{(X - E\{X\})^{2m+1}\} = 0$, $m = 0, 1, \ldots$, provided $E|X|^{2m+1} < \infty$.

The central moment of the second order occupies a central role in the theory of statistics and is called the variance of $X$. The variance is denoted by $V\{X\}$. The square root of the variance, called the standard deviation, is a measure of dispersion around the expected value. We denote the standard deviation by $\sigma$. The variance of $X$ is equal to

(1.8.3)
$$V\{X\} = E\{(X - E\{X\})^2\} = E\{X^2\} - (E\{X\})^2.$$

The variance is always nonnegative, and hence for every distribution having a finite second moment $E\{X^2\} \ge (E\{X\})^2$. One can easily verify from the definition that if $X$ is a random variable and $a$ and $b$ are constants, then $V\{a + bX\} = b^2 V\{X\}$.

The variance is equal to zero if and only if the distribution function is concentrated at one point (a degenerate distribution).

A famous inequality, called the Chebychev inequality, relates the probability of $X$ concentrating around its mean to the standard deviation $\sigma$.

Theorem 1.8.1. (Chebychev) If $F_X$ has a finite standard deviation $\sigma$, then, for every $a > 0$,

(1.8.4)
$$P\{w: |X(w) - \mu| \ge a\} \le \frac{\sigma^2}{a^2},$$

where $\mu = E\{X\}$.

Proof.

(1.8.5)
$$\sigma^2 = \int_{-\infty}^{\infty} (x - \mu)^2 \, dF_X(x) \ge \int_{\{|x - \mu| \ge a\}} (x - \mu)^2 \, dF_X(x) \ge a^2 P\{w: |X(w) - \mu| \ge a\}.$$

Hence,

$$P\{w: |X(w) - \mu| \ge a\} \le \frac{\sigma^2}{a^2}.$$

QED

Notice that in the proof of the theorem, we used the Riemann–Stieltjes integral. The theorem is true for any type of distribution for which $0 \le \sigma < \infty$. The Chebychev inequality is a crude inequality. Various types of better inequalities are available, under additional assumptions (see Zelen and Severo, 1968; Rohatgi, 1976, p. 102).
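How crude the Chebychev bound $\sigma^2 / a^2$ on $P\{|X - \mu| \ge a\}$ can be is easy to see on a small discrete example. The sketch below uses a fair six-sided die (an assumed example, not taken from the text) and exact rational arithmetic to compare the true tail probability with the bound.

```python
from fractions import Fraction

# Assumed example: X is the outcome of a fair die.
values = [1, 2, 3, 4, 5, 6]
prob = Fraction(1, 6)

mean = sum(prob * x for x in values)                    # mu = E{X}
variance = sum(prob * (x - mean) ** 2 for x in values)  # sigma^2 = V{X}

def tail_probability(a):
    """Exact P{|X - mu| >= a}."""
    return sum(prob for x in values if abs(x - mean) >= a)

def chebychev_bound(a):
    """The bound sigma^2 / a^2."""
    return variance / Fraction(a) ** 2

a = 2
exact = tail_probability(a)
bound = chebychev_bound(a)
print(mean, variance, exact, bound)
```

Here the exact probability is $1/3$ while the bound is $35/48$, more than twice as large, which is consistent with the remark that sharper inequalities are available under additional assumptions.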

The moment generating function (m.g.f.) of a random variable X, denoted by M, is defined as

(1.8.6) $$M(t) = E\{e^{tX}\},$$

where t is such that M(t) < ∞. Obviously, at t = 0, M(0) = 1. However, M(t) may not exist when t ≠ 0. Assume that M(t) exists for all t in some interval (a, b), a < 0 < b. There is a one-to-one correspondence between the distribution function F and the moment generating function M. M is analytic on (a, b), and can be differentiated under the expectation integral. Thus

(1.8.7) $$\frac{d^r}{dt^r} M(t) = E\{X^r e^{tX}\}, \quad r = 1, 2, \ldots$$

Under this assumption the rth derivative of M(t) evaluated at t = 0 yields the moment of order r.

To overcome the problem of M being undefined in certain cases, it is useful to use the characteristic function

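(1.8.8) $$\phi(t) = E\{e^{itX}\}, \quad -\infty < t < \infty,$$

where $i = \sqrt{-1}$. The characteristic function exists for all t since

(1.8.9) $$|\phi(t)| \le \int_{-\infty}^{\infty} |e^{itx}|\, dF(x) = 1.$$

Indeed, $|e^{itx}| = 1$ for all x and all t.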

If X assumes nonnegative integer values, it is often useful to use the probability generating function (p.g.f.)

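(1.8.10) $$G(t) = E\{t^X\} = \sum_{j=0}^{\infty} t^j P\{X = j\},$$

which is convergent if |t| < 1. Moreover, given the p.g.f. of a nonnegative integer valued random variable X, its p.d.f. can be obtained by the formula

(1.8.11) $$P\{X = k\} = \frac{1}{k!} \frac{d^k}{dt^k} G(t) \Big|_{t=0}, \quad k = 0, 1, \ldots$$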

The logarithm of the moment generating function is called the cumulant generating function. We denote this generating function by K. K exists for all t for which M is finite. Both M and K are analytic functions in the interior of their domains of convergence. Thus we can write for t close to zero

(1.8.12) $$K(t) = \log M(t) = \sum_{j=0}^{\infty} \frac{\kappa_j}{j!}\, t^j.$$

The coefficients {κj} are called cumulants. Notice that κ0 = 0, and κj, j ≥ 1, can be obtained by differentiating K(t) j times and setting t = 0. The relationships between the cumulants and the moments of a distribution are, for j = 1, …, 4,

(1.8.13) $$\begin{aligned} \kappa_1 &= \mu_1, \\ \kappa_2 &= \mu_2 - \mu_1^2 = \mu_2^*, \\ \kappa_3 &= \mu_3 - 3\mu_1\mu_2 + 2\mu_1^3 = \mu_3^*, \\ \kappa_4 &= \mu_4^* - 3(\mu_2^*)^2. \end{aligned}$$

The following two indices

(1.8.14) $$\beta_1 = \frac{\mu_3^*}{\sigma^3}$$

and

(1.8.15) $$\beta_2 = \frac{\mu_4^*}{\sigma^4},$$

where $\sigma^2 = \mu_2^*$ is the variance, are called coefficients of skewness (asymmetry) and kurtosis (steepness), respectively. If the distribution is symmetric, then β1 = 0. If β1 > 0 we say that the distribution is positively skewed; if β1 < 0, it is negatively skewed. If β2 > 3 we say that the distribution is steep, and if β2 < 3 we say that the distribution is flat.
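As a small illustration of these two indices, the sketch below computes them for finite discrete distributions, assuming the usual definitions $\beta_1 = \mu_3^*/\sigma^3$ and $\beta_2 = \mu_4^*/\sigma^4$. The function names and the two example distributions are illustrative, not from the text.

```python
def central_moment(values, probs, r):
    """r-th central moment of a finite discrete distribution."""
    mu = sum(p * x for x, p in zip(values, probs))
    return sum(p * (x - mu)**r for x, p in zip(values, probs))

def skewness_kurtosis(values, probs):
    """Return (beta_1, beta_2) = (mu*_3 / sigma^3, mu*_4 / sigma^4)."""
    var = central_moment(values, probs, 2)
    beta1 = central_moment(values, probs, 3) / var**1.5
    beta2 = central_moment(values, probs, 4) / var**2
    return beta1, beta2

# Symmetric case: Binomial(2, 1/2). Skewed case: Bernoulli(0.1).
sym = skewness_kurtosis([0, 1, 2], [0.25, 0.5, 0.25])
skew = skewness_kurtosis([0, 1], [0.9, 0.1])
print(sym, skew)
```

The symmetric example has zero skewness and is "flat" (kurtosis below 3), while the Bernoulli(0.1) example is positively skewed and "steep".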

The following equation is called the law of total variance.

If E{X2} < ∞ then

(1.8.16) $$V\{X\} = E\{V\{X \mid Y\}\} + V\{E\{X \mid Y\}\},$$

where V{X | Y} denotes the conditional variance of X given Y.

It is often the case that it is easier to find the conditional mean and variance, E{X | Y} and V{X | Y}, than to find E{X} and V{X} directly. In such cases, formula (1.8.16) becomes very handy.
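The decomposition of the variance into the mean of the conditional variance plus the variance of the conditional mean can be verified exactly on a small joint distribution. The sketch below is only an illustration: the function name and the four-point joint p.d.f. are hypothetical choices.

```python
from fractions import Fraction

def total_variance_terms(joint):
    """joint maps (x, y) -> P{X = x, Y = y}.

    Returns (V{X}, E{V{X|Y}}, V{E{X|Y}}).
    """
    ex = sum(p * x for (x, _), p in joint.items())
    ex2 = sum(p * x * x for (x, _), p in joint.items())
    var_x = ex2 - ex**2

    # Marginal distribution of Y.
    p_y = {}
    for (_, y), p in joint.items():
        p_y[y] = p_y.get(y, 0) + p

    mean_of_var = 0
    var_of_mean = 0
    for y, py in p_y.items():
        m1 = sum(p * x for (x, yy), p in joint.items() if yy == y) / py
        m2 = sum(p * x * x for (x, yy), p in joint.items() if yy == y) / py
        mean_of_var += py * (m2 - m1**2)   # contributes to E{V{X|Y}}
        var_of_mean += py * (m1 - ex)**2   # contributes to V{E{X|Y}}
    return var_x, mean_of_var, var_of_mean

q = Fraction(1, 4)
joint = {(0, 0): q, (1, 0): q, (0, 1): q, (2, 1): q}
var_x, mean_of_var, var_of_mean = total_variance_terms(joint)
print(var_x, mean_of_var, var_of_mean)
```

Here the conditional means and variances are easy to read off (given Y = 0, X is uniform on {0, 1}; given Y = 1, X is uniform on {0, 2}), which is the situation in which the decomposition is most useful.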

The product central moment of two variables (X, Y) is called the covariance and denoted by cov (X, Y). More specifically

(1.8.17) $$\mathrm{cov}(X, Y) = E\{(X - E\{X\})(Y - E\{Y\})\} = E\{XY\} - E\{X\}E\{Y\}.$$

Notice that cov(X, Y) = cov(Y, X), and cov(X, X) = V{X}. Moreover, if X is a random variable having a finite first moment and a is any finite constant, then cov(a, X) = 0. Furthermore, whenever the second moments of X and Y exist the covariance exists. This follows from the Schwarz inequality (see Section 1.13.3), i.e., if F is the joint distribution of (X, Y) and FX, FY are the marginal distributions of X and Y, respectively, then

(1.8.18) $$\left(\int\!\!\int g(x) h(y)\, dF(x, y)\right)^2 \le \int g^2(x)\, dF_X(x) \cdot \int h^2(y)\, dF_Y(y),$$

whenever $E\{g^2(X)\}$ and $E\{h^2(Y)\}$ are finite. In particular, for any two random variables having second moments

$$\mathrm{cov}^2(X, Y) \le V\{X\}\, V\{Y\}.$$

The ratio

(1.8.19) $$\rho = \frac{\mathrm{cov}(X, Y)}{\sqrt{V\{X\}\, V\{Y\}}}$$

is called the coefficient of correlation (Pearson's product moment correlation). From (1.8.18) we deduce that −1 ≤ ρ ≤ 1. The sign of ρ is that of cov(X, Y).

The m.g.f. of a multivariate distribution is a function of k variables

(1.8.20) $$M(t_1, \ldots, t_k) = E\left\{\exp\left(\sum_{i=1}^{k} t_i X_i\right)\right\}.$$

Let X1, …, Xk be random variables having a joint distribution. Consider the linear transformation $Y = \sum_{j=1}^{k} \beta_j X_j$, where β1, …, βk are constants. Some formulae for the moments and covariances of such linear functions are developed here. Assume that all the moments under consideration exist. Starting with the expected value of Y we prove:

(1.8.21) $$E\left\{\sum_{j=1}^{k} \beta_j X_j\right\} = \sum_{j=1}^{k} \beta_j E\{X_j\}.$$

This result is a direct implication of the definition of the integral as a linear operator.

Let X denote a random vector in column form and X' its transpose. The expected value of a random vector X' = (X1, …, Xk) is defined as the corresponding vector of expected values, i.e.,

(1.8.22) $$E\{X'\} = (E\{X_1\}, \ldots, E\{X_k\}).$$

Furthermore, let $\Sigma$ denote the k × k matrix whose elements are the variances and covariances of the components of X. In symbols

(1.8.23) $$\Sigma = (\sigma_{ij};\ i, j = 1, \ldots, k),$$

where σij = cov(Xi, Xj), σii = V{Xi}. If Y = β'X where β is a vector of constants, then

(1.8.24) $$V\{Y\} = \beta' \Sigma \beta = \sum_{i=1}^{k} \sum_{j=1}^{k} \beta_i \beta_j \sigma_{ij}.$$

The result given by (1.8.24) can be generalized in the following manner. Let Y1 = β'X and Y2 = α'X, where α and β are arbitrary constant vectors. Then

(1.8.25) $$\mathrm{cov}(Y_1, Y_2) = \beta' \Sigma \alpha.$$

More generally, if X is a k-dimensional random vector with covariance matrix $\Sigma$ and Y is an m-dimensional vector Y = AX, where A is an m × k matrix of constants, then the covariance matrix of Y is

(1.8.26) $$V = A \Sigma A'.$$

In addition, if the covariance matrix of X is $\Sigma$, then the covariance matrix of Y = ξ + AX is also $V = A \Sigma A'$, where ξ is a vector of constants and A is a matrix of constants. Finally, if Y = AX and Z = BX, where A and B are matrices of constants with compatible dimensions, then the covariance matrix of Y and Z is

(1.8.27) $$C(Y, Z) = A \Sigma B'.$$
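The rule that a linear map Y = AX transforms the covariance matrix of X into A Σ A' can be checked exactly on a small discrete random vector. The sketch below uses plain Python lists and exact fractions; the helper names, the four-point distribution, and the matrix A are illustrative assumptions.

```python
from fractions import Fraction

def cov_matrix(points, probs):
    """Exact covariance matrix of a finite discrete random vector."""
    k = len(points[0])
    mean = [sum(p * x[i] for x, p in zip(points, probs)) for i in range(k)]
    return [[sum(p * (x[i] - mean[i]) * (x[j] - mean[j]) for x, p in zip(points, probs))
             for j in range(k)] for i in range(k)]

def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a[i][m] * b[m][j] for m in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

def transpose(a):
    return [list(row) for row in zip(*a)]

def transform(a, x):
    """Return the vector A x as a tuple."""
    return tuple(sum(aij * xj for aij, xj in zip(row, x)) for row in a)

points = [(0, 0), (1, 0), (0, 1), (2, 1)]   # support of X = (X1, X2)'
probs = [Fraction(1, 4)] * 4
A = [[1, 1], [1, -1], [2, 0]]               # a 3 x 2 matrix of constants

sigma = cov_matrix(points, probs)
direct = cov_matrix([transform(A, x) for x in points], probs)
formula = matmul(matmul(A, sigma), transpose(A))
print(direct == formula)
```

The covariance matrix of Y is computed twice, once directly from the distribution of AX and once through the matrix formula, and the two results agree entry by entry.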

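We conclude this section with an important theorem concerning the characteristic function. Recall that $\phi$ is generally a complex valued function on $\mathbb{R}$, i.e.,

$$\phi(t) = \int_{-\infty}^{\infty} \cos(tx)\, dF(x) + i \int_{-\infty}^{\infty} \sin(tx)\, dF(x).$$

Theorem 1.8.2 A characteristic function $\phi$, of a distribution function F, has the following properties.

(i) $|\phi(t)| \le \phi(0) = 1$;

(ii) $\phi(t)$ is a uniformly continuous function of t, on $\mathbb{R}$;

(iii) $\phi(t) = \overline{\phi(-t)}$, where $\bar{z}$ denotes the complex conjugate of z;

(iv) $\phi(t)$ is real valued if and only if F is symmetric around x0 = 0;

(v) if $E\{|X|^n\} < \infty$ for some n ≥ 1, then the rth order derivative $\phi^{(r)}(t)$ exists for every 1 ≤ r ≤ n, and

(1.8.28) $$\phi^{(r)}(t) = \int_{-\infty}^{\infty} (ix)^r e^{itx}\, dF(x),$$

(1.8.29) $$\mu_r = \frac{1}{i^r}\, \phi^{(r)}(0),$$

and

(1.8.30) $$\phi(t) = \sum_{j=0}^{n} \frac{(it)^j}{j!}\, \mu_j + \frac{(it)^n}{n!}\, R_n(t),$$

where $|R_n(t)| \le 3E\{|X|^n\}$, $R_n(t) \to 0$ as t → 0;

(vi) if $\phi^{(2n)}(0)$ exists and is finite, then $\mu_{2n} < \infty$;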

(vii) if $E\{|X|^n\} < \infty$ for all n ≥ 1 and

(1.8.31)

then

(1.8.32) numbered Display Equation

Proof. The proof of (i) and (ii) is based on the fact that $|e^{itx}| = 1$ for all t and all x. Now, $\int e^{-itx}\, dF(x) = \phi(-t) = \overline{\phi(t)}$. Hence (iii) is proven.

(iv) Suppose F(x) is symmetric around x0 = 0. Then dF(x) = dF(−x) for all x. Therefore, since sin(−tx) = −sin(tx) for all x, $\int \sin(tx)\, dF(x) = 0$, and $\phi(t)$ is real. Conversely, if $\phi(t)$ is real, then $\phi(t) = \overline{\phi(t)}$. Hence $\phi_X(t) = \phi_{-X}(t)$. Thus, by the one-to-one correspondence between $\phi$ and F, for any Borel set B, $P\{X \in B\} = P\{-X \in B\} = P\{X \in -B\}$. This implies that F is symmetric about the origin.

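(v) If $E\{|X|^n\} < \infty$, then $E\{|X|^r\} < \infty$ for all 1 ≤ r ≤ n. Consider

$$\frac{\phi(t + h) - \phi(t)}{h} = E\left\{e^{itX}\, \frac{e^{ihX} - 1}{h}\right\}.$$

Since $\left|\frac{e^{ihx} - 1}{h}\right| \le |x|$, and E{|X|} < ∞, we obtain from the Dominated Convergence Theorem that

$$\phi^{(1)}(t) = \lim_{h \to 0} \frac{\phi(t + h) - \phi(t)}{h} = E\{iX e^{itX}\}.$$

Hence $\mu_1 = \frac{1}{i}\, \phi^{(1)}(0)$.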

Equations (1.8.28)–(1.8.29) follow by induction. Taylor expansion of eiy yields

Unnumbered Display Equation

where |θ1| ≤ 1 and |θ2| ≤ 1. Hence

Unnumbered Display Equation

where

Since |cos(ty)| ≤ 1 and |sin(ty)| ≤ 1, evidently $|R_n(t)| \le 3E\{|X|^n\}$. Also, by dominated convergence, $\lim_{t \to 0} R_n(t) = 0$.

(vi) By induction on n. Suppose $\phi^{(2)}(0)$ exists. By L'Hospital's rule,

Unnumbered Display Equation

By Fatou’s Lemma,

Unnumbered Display Equation

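Thus, $\mu_2 \le -\phi^{(2)}(0) < \infty$. Assume that $0 < \mu_{2k} < \infty$. Then, by (v),

$$\phi^{(2k)}(t) = \int (ix)^{2k} e^{itx}\, dF(x) = (-1)^k \int e^{itx}\, dG(x),$$

where $dG(x) = x^{2k}\, dF(x)$, or

$$G(x) = \int_{-\infty}^{x} u^{2k}\, dF(u).$$

Notice that $G(\infty) = \mu_{2k} < \infty$. Thus, $(-1)^k \phi^{(2k)}(t)/G(\infty)$ is the characteristic function of the distribution G(x)/G(∞). Since G(∞) > 0, $\int x^{2k+2}\, dF(x) = \int x^2\, dG(x) < \infty$. This proves that $\mu_{2k} < \infty$ for all k = 1, …, n.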

(vii) Assuming (1.8.31), if 0 < t0 < R, . Therefore,

By Stirling’s approximation, (n!)1/n = 1. Thus, for 0 < t0 < R,

Accordingly, by Cauchy’s test, < ∞. By (iv), for any n ≥ 1, for any t, |t| ≤ t0

where |(t)| ≤ 3 E{|X|n}. Thus, for every t, , which implies that

QED

1.9 MODES OF CONVERGENCE

In this section we formulate many definitions and results in terms of random vectors X = (X1, X2, ···, Xk)', 1 ≤ k < ∞. The notation ||X|| is used for the Euclidean norm, i.e., $\|x\|^2 = \sum_{i=1}^{k} x_i^2$.

We discuss here four modes of convergence of sequences of random vectors to a random vector.

(i) Convergence in distribution, $X_n \xrightarrow{d} X$;

(ii) Convergence in probability, $X_n \xrightarrow{p} X$;

(iii) Convergence in rth mean, $X_n \xrightarrow{r} X$; and

(iv) Convergence almost surely, $X_n \xrightarrow{a.s.} X$.

A sequence {Xn} is said to converge in distribution to X, $X_n \xrightarrow{d} X$, if the corresponding distribution functions Fn and F satisfy

(1.9.1) $$\lim_{n \to \infty} \int g(x)\, dF_n(x) = \int g(x)\, dF(x)$$

for every continuous bounded function g on $\mathbb{R}^k$.

One can show that this definition is equivalent to the following statement.

A sequence {Xn} converges in distribution to X, $X_n \xrightarrow{d} X$, if $\lim_{n \to \infty} F_n(x) = F(x)$ at all continuity points x of F.

If $X_n \xrightarrow{d} X$ we say that Fn converges to F weakly. The notation is $F_n \xrightarrow{d} F$ or $F_n \xrightarrow{w} F$.

We define now convergence in probability.

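A sequence {Xn} converges in probability to X, $X_n \xrightarrow{p} X$, if, for each ε > 0,

(1.9.2) $$\lim_{n \to \infty} P\{\|X_n - X\| > \epsilon\} = 0.$$

We define now convergence in rth mean.

A sequence of random vectors {Xn} converges in rth mean, r > 0, to X, $X_n \xrightarrow{r} X$, if $E\{\|X_n - X\|^r\} \to 0$ as n → ∞.

A fourth mode of convergence is almost sure convergence.

A sequence of random vectors {Xn} converges almost surely to X, $X_n \xrightarrow{a.s.} X$, as n → ∞ if

(1.9.3) $$P\{\lim_{n \to \infty} X_n = X\} = 1.$$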

The following is an equivalent definition.

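$X_n \xrightarrow{a.s.} X$ as n → ∞ if and only if, for any ε > 0,

(1.9.4) $$\lim_{m \to \infty} P\{\|X_n - X\| < \epsilon, \text{ for all } n \ge m\} = 1.$$

Equation (1.9.4) is equivalent to

$$P\{\|X_n - X\| \ge \epsilon, \text{ i.o.}\} = 0.$$

But,

$$P\{\|X_n - X\| \ge \epsilon, \text{ i.o.}\} = \lim_{m \to \infty} P\Big\{\bigcup_{n \ge m} \{\|X_n - X\| \ge \epsilon\}\Big\} \le \lim_{m \to \infty} \sum_{n \ge m} P\{\|X_n - X\| \ge \epsilon\}.$$

By the Borel–Cantelli Lemma (Theorem 1.4.1), a sufficient condition for $X_n \xrightarrow{a.s.} X$ is

(1.9.5) $$\sum_{n=1}^{\infty} P\{\|X_n - X\| \ge \epsilon\} < \infty$$

for all ε > 0.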

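Theorem 1.9.1. Let {Xn} be a sequence of random vectors. Then

(a.) $X_n \xrightarrow{a.s.} X$ implies $X_n \xrightarrow{p} X$;

(b.) $X_n \xrightarrow{r} X$, r > 0, implies $X_n \xrightarrow{p} X$;

(c.) $X_n \xrightarrow{p} X$ implies $X_n \xrightarrow{d} X$.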

Proof. (a) Since $X_n \xrightarrow{a.s.} X$, for any ε > 0,

(1.9.6) numbered Display Equation

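The inequality (1.9.6) implies that $X_n \xrightarrow{p} X$.

(b) It can be immediately shown that, for any ε > 0,

$$P\{\|X_n - X\| \ge \epsilon\} \le \frac{E\{\|X_n - X\|^r\}}{\epsilon^r}.$$

Thus, $X_n \xrightarrow{r} X$ implies $X_n \xrightarrow{p} X$.

(c) We show the scalar case. For any x0 and any ε > 0,

$$F_n(x_0) = P\{X_n \le x_0\} \le P\{X \le x_0 + \epsilon\} + P\{|X_n - X| > \epsilon\}.$$

Similarly,

$$F(x_0 - \epsilon) \le F_n(x_0) + P\{|X_n - X| > \epsilon\}.$$

Finally, since $X_n \xrightarrow{p} X$,

$$F(x_0 - \epsilon) \le \liminf_{n \to \infty} F_n(x_0) \le \limsup_{n \to \infty} F_n(x_0) \le F(x_0 + \epsilon).$$

Thus, if x0 is a continuity point of F, by letting ε → 0, we obtain

$$\lim_{n \to \infty} F_n(x_0) = F(x_0).$$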

QED

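Theorem 1.9.2 Let {Xn} be a sequence of random vectors. Then

(a.) if $c \in \mathbb{R}^k$, then $X_n \xrightarrow{d} c$ implies $X_n \xrightarrow{p} c$;

(b.) if $X_n \xrightarrow{a.s.} X$ and $\|X_n\|^r \le Z$, for some r > 0 and some (positive) random variable Z, with E{Z} < ∞, then $X_n \xrightarrow{r} X$.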

For proof, see Ferguson (1996, p. 9). Part (b) is implied also from Theorem 1.13.3.

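Theorem 1.9.3 Let {Xn} be a sequence of nonnegative random variables such that $X_n \xrightarrow{a.s.} X$ and E{Xn} → E{X}, E{X} < ∞. Then

(1.9.7) $$\lim_{n \to \infty} E\{|X_n - X|\} = 0.$$

Proof. Since E{Xn} → E{X} < ∞, for sufficiently large n, E{Xn} < ∞. For such n,

$$E\{|X_n - X|\} = E\{(X - X_n) I\{X \ge X_n\}\} + E\{(X_n - X) I\{X_n > X\}\} = 2E\{(X - X_n) I\{X \ge X_n\}\} + E\{X_n - X\}.$$

But,

$$0 \le (X - X_n) I\{X \ge X_n\} \le X.$$

Therefore, by the Lebesgue Dominated Convergence Theorem,

$$\lim_{n \to \infty} E\{(X - X_n) I\{X \ge X_n\}\} = 0.$$

This implies (1.9.7). QED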

1.10 WEAK CONVERGENCE

The following theorem plays a major role in weak convergence.

Theorem 1.10.1. The following conditions are equivalent.

(a.) $X_n \xrightarrow{d} X$;

(b.) E{g(Xn)} → E{g(X)}, for all continuous functions, g, that vanish outside a compact set;

(c.) E{g(Xn)} → E{g(X)}, for all continuous bounded functions g;

(d.) E{g(Xn)} → E{g(X)}, for all measurable functions g such that $P\{X \in C(g)\} = 1$, where C(g) is the set of all points at which g is continuous.

For proof, see Ferguson (1996, pp. 14–16).

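Theorem 1.10.2. Let {Xn} be a sequence of random vectors in $\mathbb{R}^k$, and $X_n \xrightarrow{d} X$. Then

(i) $f(X_n) \xrightarrow{d} f(X)$, for any $f: \mathbb{R}^k \to \mathbb{R}^l$ such that $P\{X \in C(f)\} = 1$;

(ii) if {Yn} is a sequence such that $X_n - Y_n \xrightarrow{p} 0$, then $Y_n \xrightarrow{d} X$;

(iii) if $X_n \in \mathbb{R}^k$ and $Y_n \in \mathbb{R}^l$ and $Y_n \xrightarrow{d} c$, then $(X_n', Y_n')' \xrightarrow{d} (X', c')'$.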

Proof. (i) Let $g: \mathbb{R}^l \to \mathbb{R}$ be bounded and continuous. Let h(x) = g(f(x)). If x is a continuity point of f, then x is a continuity point of h, i.e., $C(f) \subset C(h)$. Hence $P\{X \in C(h)\} = 1$. By Theorem 1.10.1 (c), it is sufficient to show that E{g(f(Xn))} → E{g(f(X))}. Theorem 1.10.1 (d) implies, since $P\{X \in C(h)\} = 1$ and $X_n \xrightarrow{d} X$, that E{h(Xn)} → E{h(X)}.

(ii) According to Theorem 1.10.1 (b), let g be a continuous function on $\mathbb{R}^k$ vanishing outside a compact set. Thus g is uniformly continuous and bounded. Let ε > 0, and find δ > 0 such that, if ||x − y|| < δ then |g(x) − g(y)| < ε. Also, g is bounded, say |g(x)| ≤ B < ∞. Thus,

Unnumbered Display Equation

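Hence $Y_n \xrightarrow{d} X$.

(iii) Since $Y_n \xrightarrow{d} c$ implies $Y_n \xrightarrow{p} c$ (Theorem 1.9.2), for every ε > 0,

$$P\{\|(X_n', Y_n')' - (X_n', c')'\| > \epsilon\} = P\{\|Y_n - c\| > \epsilon\} \to 0, \quad \text{as } n \to \infty.$$

Moreover, $(X_n', c')' \xrightarrow{d} (X', c')'$. Hence, from part (ii), $(X_n', Y_n')' \xrightarrow{d} (X', c')'$. QED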

As a special case of the above theorem we get

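Theorem 1.10.3 (Slutsky's Theorem) Let {Xn} and {Yn} be sequences of random variables, $X_n \xrightarrow{d} X$ and $Y_n \xrightarrow{p} c$. Then

(1.10.1) $$\begin{aligned} &\text{(i)}\ \ X_n + Y_n \xrightarrow{d} X + c; \\ &\text{(ii)}\ \ X_n Y_n \xrightarrow{d} cX; \\ &\text{(iii)}\ \ \text{if } c \ne 0, \text{ then } X_n / Y_n \xrightarrow{d} X / c. \end{aligned}$$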

A sequence of distribution functions may not converge to a distribution function. For example, let Xn be random variables with

$$P\{X_n = -n\} = P\{X_n = n\} = \frac{1}{2}, \quad n \ge 1.$$

Then, $\lim_{n \to \infty} F_n(x) = \frac{1}{2}$ for all x. $F(x) = \frac{1}{2}$ for all x is not a distribution function. In this example, half of the probability mass escapes to −∞ and half the mass escapes to +∞. In order to avoid such situations, we require collections (families) of probability distributions to be tight.

Let $\mathcal{F} = \{F_u, u \in U\}$ be a family of distribution functions on $\mathbb{R}^k$. $\mathcal{F}$ is tight if, for any ε > 0, there exists a compact set $C \subset \mathbb{R}^k$ such that

$$\sup_{u \in U} \int I\{x \in \mathbb{R}^k - C\}\, dF_u(x) < \epsilon.$$

In the above example, the sequence Fn(x) is not tight.

If $\mathcal{F}$ is tight, then every sequence of distributions of $\mathcal{F}$ contains a subsequence converging weakly to a distribution function (see Shiryayev, 1984, p. 315).

Theorem 1.10.4. Let {Fn} be a tight family of distribution functions on $\mathbb{R}$. A necessary and sufficient condition for $F_n \xrightarrow{w} F$, for some distribution function F, is that, for each $t \in \mathbb{R}$, $\lim_{n \to \infty} \phi_n(t)$ exists, where $\phi_n(t) = \int e^{itx}\, dF_n(x)$ is the characteristic function corresponding to Fn.

For proof, see Shiryayev (1984, p. 321).

Theorem 1.10.5 (Continuity Theorem) Let {Fn} be a sequence of distribution functions and $\{\phi_n\}$ the corresponding sequence of characteristic functions. Let F be a distribution function, with characteristic function $\phi$. Then $F_n \xrightarrow{w} F$ if and only if $\phi_n(t) \to \phi(t)$ for all $t \in \mathbb{R}^k$ (Shiryayev, 1984, p. 322).

1.11 LAWS OF LARGE NUMBERS

1.11.1 The Weak Law of Large Numbers (WLLN)

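Let X1, X2, … be a sequence of identically distributed uncorrelated random vectors. Let μ = E{X1} and let $\Sigma = E\{(X_1 - \mu)(X_1 - \mu)'\}$ be finite. Then the means $\bar{X}_n = \frac{1}{n} \sum_{j=1}^{n} X_j$ converge in probability to μ, i.e.,

(1.11.1) $$\bar{X}_n \xrightarrow{p} \mu, \quad \text{as } n \to \infty.$$

The proof is simple. Since cov(Xn, Xn') = 0 for all n ≠ n', the covariance matrix of $\bar{X}_n$ is $\frac{1}{n} \Sigma$. Moreover, since $E\{\bar{X}_n\} = \mu$,

$$E\{\|\bar{X}_n - \mu\|^2\} = \frac{1}{n}\, \mathrm{tr}\{\Sigma\} \to 0, \quad \text{as } n \to \infty.$$

Hence $\bar{X}_n \xrightarrow{2} \mu$, which implies that $\bar{X}_n \xrightarrow{p} \mu$. Here $\mathrm{tr}\{\Sigma\}$ denotes the trace of $\Sigma$.

If X1, X2, … are independent and identically distributed, with E{X1} = μ, then the characteristic function of $\bar{X}_n$ is

(1.11.2) $$\phi_{\bar{X}_n}(t) = \left(\phi\left(\frac{t}{n}\right)\right)^n,$$

where $\phi(t)$ is the characteristic function of X1. Fix t. Then for large values of n,

$$\phi\left(\frac{t}{n}\right) = 1 + \frac{i}{n}\, t'\mu + o\left(\frac{1}{n}\right).$$

Therefore,

(1.11.3) $$\phi_{\bar{X}_n}(t) = \left(1 + \frac{i}{n}\, t'\mu + o\left(\frac{1}{n}\right)\right)^n \to e^{it'\mu}, \quad \text{as } n \to \infty.$$

$\phi(t) = e^{it'\mu}$ is the characteristic function of X, where P{X = μ} = 1. Thus, since $e^{it'\mu}$ is continuous at t = 0, $\bar{X}_n \xrightarrow{d} \mu$. This implies that $\bar{X}_n \xrightarrow{p} \mu$ (left as an exercise).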
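The weak law can be illustrated without simulation in a case where the distribution of the sample mean is known exactly. The sketch below assumes i.i.d. Bernoulli(1/2) variables (so μ = 1/2, σ² = 1/4), computes the exact probability $P\{|\bar{X}_n - \mu| \ge \epsilon\}$ from the binomial distribution, and compares it with the Chebychev-type bound σ²/(nε²). The function name and the choice ε = 1/10 are illustrative.

```python
from fractions import Fraction
from math import comb

def tail_prob(n, eps):
    """Exact P{|Xbar_n - 1/2| >= eps} for n i.i.d. Bernoulli(1/2) variables."""
    half = Fraction(1, 2)
    count = sum(comb(n, k) for k in range(n + 1)
                if abs(Fraction(k, n) - half) >= eps)
    return Fraction(count, 2**n)

eps = Fraction(1, 10)
results = {}
for n in (10, 100, 1000):
    exact = tail_prob(n, eps)
    bound = Fraction(1, 4) / (n * eps**2)   # sigma^2 / (n * eps^2)
    results[n] = (exact, bound)
    print(n, float(exact), float(bound))
```

Both columns decrease to zero as n grows, and the exact probability decays much faster than the bound, which only uses the second moment.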

1.11.2 The Strong Law of Large Numbers (SLLN)

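Strong laws of large numbers, for independent random variables having finite expected values, are of the form

$$\frac{1}{n} \sum_{i=1}^{n} (X_i - \mu_i) \xrightarrow{a.s.} 0, \quad \text{as } n \to \infty,$$

where μi = E{Xi}.

Theorem 1.11.1 (Cantelli) Let {Xn} be a sequence of independent random variables having uniformly bounded fourth central moments, i.e.,

(1.11.4) $$0 \le E\{(X_n - E\{X_n\})^4\} \le C < \infty$$

for all n ≥ 1. Then

(1.11.5) $$\frac{1}{n} \sum_{i=1}^{n} (X_i - \mu_i) \xrightarrow{a.s.} 0, \quad \text{as } n \to \infty.$$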

Proof. Without loss of generality, we can assume that μn = E{Xn} = 0 for all n ≥ 1.

Unnumbered Display Equation

where $\mu_{4,i} = E\{X_i^4\}$ and $\sigma_i^2 = E\{X_i^2\}$. By the Schwarz inequality, $\sigma_i^2 \sigma_j^2 \le (\mu_{4,i} \cdot \mu_{4,j})^{1/2}$ for all i ≠ j. Hence,

By Chebychev’s inequality,

Unnumbered Display Equation

Hence, for any > 0,

where C* is some positive finite constant. Finally, by the Borel–Cantelli Lemma (Theorem 1.4.1),

Thus, P{|n| < , i.o.} = 1. QED

Cantelli's Theorem is quite stringent, in the sense that it requires the existence of the fourth moments of the independent random variables. Kolmogorov relaxed this condition and proved that, if the random variables have finite variances, $0 < \sigma_n^2 < \infty$, and

(1.11.6) $$\sum_{n=1}^{\infty} \frac{\sigma_n^2}{n^2} < \infty,$$

then $\frac{1}{n} \sum_{i=1}^{n} (X_i - \mu_i) \xrightarrow{a.s.} 0$ as n → ∞.

If the random variables are independent and identically distributed (i.i.d.), then Kolmogorov showed that E{|X1|} < ∞ is sufficient for the strong law of large numbers. To prove Kolmogorov’s strong law of large numbers one has to develop more theoretical results. We refer the reader to more advanced probability books (see Shiryayev, 1984).

1.12 CENTRAL LIMIT THEOREM

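The Central Limit Theorem (CLT) states that, under fairly general conditions, the distributions of properly normalized sample means converge weakly to the standard normal distribution.

A continuous random variable Z is said to have a standard normal distribution, denoted Z ~ N(0, 1), if its distribution function is absolutely continuous, having a p.d.f.

(1.12.1) $$f(x) = \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2}, \quad -\infty < x < \infty.$$

The c.d.f. of N(0, 1), called the standard normal integral, is

(1.12.2) $$\Phi(x) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x} e^{-y^2/2}\, dy.$$

The general family of normal distributions is studied in Chapter 2. Here we just mention that if Z ~ N(0, 1), the moments of Z are

(1.12.3) $$E\{Z^m\} = \begin{cases} \dfrac{(2l)!}{2^l\, l!}, & m = 2l, \\ 0, & m = 2l + 1, \end{cases} \quad l = 0, 1, \ldots$$

The characteristic function of N(0, 1) is

(1.12.4) $$\phi_Z(t) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} e^{itx - x^2/2}\, dx = e^{-t^2/2}, \quad -\infty < t < \infty.$$

A random vector Z = (Z1, …, Zk)' is said to have a multivariate normal distribution with mean μ = E{Z} = 0 and covariance matrix V (see Chapter 2), Z ~ N(0, V), if the p.d.f. of Z is

$$f(z; V) = \frac{1}{(2\pi)^{k/2} |V|^{1/2}} \exp\left\{-\frac{1}{2}\, z' V^{-1} z\right\}.$$

The corresponding characteristic function is

(1.12.5) $$\phi_Z(t) = \exp\left\{-\frac{1}{2}\, t' V t\right\}, \quad t \in \mathbb{R}^k.$$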

Using the method of characteristic functions, with the continuity theorem we prove the following simple two versions of the CLT. A proof of the Central Limit Theorem, which is not based on the continuity theorem of characteristic functions, can be obtained by the method of Stein (1986) for approximating expected values or probabilities.

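Theorem 1.12.1. (CLT) Let {Xn} be a sequence of i.i.d. random variables having a finite positive variance, i.e., μ = E{X1}, V{X1} = σ², 0 < σ² < ∞. Then

(1.12.6) $$\sqrt{n}\, \frac{\bar{X}_n - \mu}{\sigma} \xrightarrow{d} N(0, 1), \quad \text{as } n \to \infty.$$

Proof. Notice that $\sqrt{n}\, (\bar{X}_n - \mu)/\sigma = \frac{1}{\sqrt{n}} \sum_{i=1}^{n} Z_i$, where $Z_i = (X_i - \mu)/\sigma$, i ≥ 1. Moreover, E{Zi} = 0 and V{Zi} = 1, i ≥ 1. Let $\phi_Z(t)$ be the characteristic function of Z1. Then, since E{Z} = 0 and V{Z} = 1, (1.8.30) implies that

$$\phi_Z(t) = 1 - \frac{t^2}{2} + o(t^2), \quad \text{as } t \to 0.$$

Accordingly, since {Zn} are i.i.d.,

$$\phi_{\sqrt{n} \bar{Z}_n}(t) = \left(\phi_Z\left(\frac{t}{\sqrt{n}}\right)\right)^n = \left(1 - \frac{t^2}{2n} + o\left(\frac{1}{n}\right)\right)^n \to e^{-t^2/2}, \quad \text{as } n \to \infty.$$

Hence, $\sqrt{n}\, \bar{Z}_n \xrightarrow{d} N(0, 1)$. QED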
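The convergence asserted by the CLT can be observed numerically in a case where the distribution of the normalized sum is known exactly. The sketch below assumes i.i.d. Bernoulli(1/2) summands and computes the largest gap between the c.d.f. of the standardized Binomial(n, 1/2) sum and the standard normal c.d.f. Φ. The function names are illustrative; Φ is evaluated through the error function from the standard library.

```python
from math import comb, erf, sqrt

def std_normal_cdf(x):
    """Standard normal c.d.f. expressed through the error function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def kolmogorov_distance(n):
    """sup_x |P{(S_n - n/2) / sqrt(n/4) <= x} - Phi(x)| for S_n ~ Binomial(n, 1/2).

    The c.d.f. of S_n is a step function, so the supremum is attained
    at a jump point, either just before the jump or at it.
    """
    sd = sqrt(n) / 2.0
    cdf = 0.0
    worst = 0.0
    for k in range(n + 1):
        phi = std_normal_cdf((k - n / 2.0) / sd)
        worst = max(worst, abs(cdf - phi))   # left limit at the jump
        cdf += comb(n, k) / 2**n             # exact integer ratio, then rounded
        worst = max(worst, abs(cdf - phi))   # value at the jump
    return worst

distances = {n: kolmogorov_distance(n) for n in (10, 100, 1000)}
print(distances)
```

The distance shrinks roughly like 1/√n, which is the rate one expects for a lattice distribution.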

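Theorem 1.12.1 can be generalized to random vectors. Let $\bar{X}_n = \frac{1}{n} \sum_{j=1}^{n} X_j$, n ≥ 1. The generalized CLT is the following theorem.

Theorem 1.12.2 Let {Xn} be a sequence of i.i.d. random vectors with E{Xn} = 0, and covariance matrix E{Xn X'n} = V, n ≥ 1, where V is positive definite with finite eigenvalues. Then

(1.12.7) $$\sqrt{n}\, \bar{X}_n \xrightarrow{d} N(0, V), \quad \text{as } n \to \infty.$$

Proof. Let $\phi_X(t)$ be the characteristic function of X1. Then, since E{X1} = 0,

$$\phi_{\sqrt{n} \bar{X}_n}(t) = \left(\phi_X\left(\frac{t}{\sqrt{n}}\right)\right)^n = \left(1 - \frac{1}{2n}\, t' V t + o\left(\frac{1}{n}\right)\right)^n$$

as n → ∞. Hence

$$\lim_{n \to \infty} \phi_{\sqrt{n} \bar{X}_n}(t) = \exp\left\{-\frac{1}{2}\, t' V t\right\}, \quad t \in \mathbb{R}^k.$$

QED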

When the random variables are independent but not identically distributed, we need a stronger version of the CLT. The following celebrated CLT is sufficient for most purposes.

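Theorem 1.12.3 (Lindeberg–Feller) Consider a triangular array of random variables {Xn,k}, k = 1, …, n, n ≥ 1 such that, for each n ≥ 1, {Xn,k, k = 1, …, n} are independent, with E{Xn,k} = 0 and $V\{X_{n,k}\} = \sigma_{n,k}^2$. Let $S_n = \sum_{k=1}^{n} X_{n,k}$ and $B_n^2 = \sum_{k=1}^{n} \sigma_{n,k}^2$. Assume that Bn > 0 for each n ≥ 1, and $B_n \nearrow \infty$, as n → ∞. If, for every ε > 0,

(1.12.8) $$\frac{1}{B_n^2} \sum_{k=1}^{n} E\{X_{n,k}^2\, I\{|X_{n,k}| > \epsilon B_n\}\} \to 0$$

as n → ∞, then $S_n / B_n \xrightarrow{d} N(0, 1)$ as n → ∞. Conversely, if $\max_{1 \le k \le n} \sigma_{n,k}^2 / B_n^2 \to 0$ as n → ∞ and $S_n / B_n \xrightarrow{d} N(0, 1)$, then (1.12.8) holds.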

For a proof, see Shiryayev (1984, p. 326). The following theorem, known as Lyapunov’s Theorem, is weaker than the Lindeberg–Feller Theorem, but is often sufficient to establish the CLT.

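Theorem 1.12.4 (Lyapunov) Let {Xn} be a sequence of independent random variables. Assume that E{Xn} = 0, V{Xn} > 0 and $E\{|X_n|^3\} < \infty$, for all n ≥ 1. Moreover, assume that $B_n^2 = \sum_{j=1}^{n} V\{X_j\} \nearrow \infty$. Under the condition

(1.12.9) $$\frac{1}{B_n^3} \sum_{j=1}^{n} E\{|X_j|^3\} \to 0, \quad \text{as } n \to \infty,$$

the CLT holds, i.e., $S_n / B_n \xrightarrow{d} N(0, 1)$ as n → ∞, where $S_n = \sum_{j=1}^{n} X_j$.

Proof. It is sufficient to prove that (1.12.9) implies the Lindeberg–Feller condition (1.12.8). Indeed, for every ε > 0,

$$E\{X_j^2\, I\{|X_j| > \epsilon B_n\}\} \le \frac{1}{\epsilon B_n}\, E\{|X_j|^3\}.$$

Thus,

$$\frac{1}{B_n^2} \sum_{j=1}^{n} E\{X_j^2\, I\{|X_j| > \epsilon B_n\}\} \le \frac{1}{\epsilon} \cdot \frac{1}{B_n^3} \sum_{j=1}^{n} E\{|X_j|^3\} \to 0.$$

QED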

Stein (1986, p. 97) proved, using a novel approximation to expectation, that if X1, X2, … are independent and identically distributed, with E{X1} = 0, $E\{X_1^2\} = 1$ and $\gamma = E\{|X_1|^3\} < \infty$, then, for all −∞ < x < ∞ and all n = 1, 2, …,

Unnumbered Display Equation

where Φ(x) is the c.d.f. of N(0, 1). This immediately implies the CLT and shows that the convergence is uniform in x.

1.13 MISCELLANEOUS RESULTS

In this section we review additional results.

1.13.1 Law of the Iterated Logarithm

We denote by log2(x) the function log(log(x)), x > e.

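Theorem 1.13.1 Let {Xn} be a sequence of i.i.d. random variables, such that E{X1} = 0 and V{X1} = σ², 0 < σ < ∞. Let $S_n = \sum_{i=1}^{n} X_i$. Then

(1.13.1) $$P\left\{\limsup_{n \to \infty} \frac{S_n}{\psi(n)} = 1\right\} = 1,$$

where $\psi(n) = (2\sigma^2 n \log_2(n))^{1/2}$, n ≥ 3.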

For proof, in the normal case, see Shiryayev (1984, p. 372).

The theorem means that, for every ε > 0, the sequence |Sn| will cross the boundary (1 + ε)ψ(n), n ≥ 3, only a finite number of times, with probability 1, as n → ∞. Notice that although E{Sn} = 0, n ≥ 1, the variance of Sn is V{Sn} = nσ² and $P\{\limsup_{n \to \infty} |S_n| = \infty\} = 1$. However, if we consider $S_n / n$ then, by the SLLN, $S_n / n \xrightarrow{a.s.} 0$. If we divide only by $\sigma\sqrt{n}$ then, by the CLT, $S_n / (\sigma\sqrt{n}) \xrightarrow{d} N(0, 1)$. The law of the iterated logarithm says that, for every ε > 0, $P\{|S_n| > (1 + \epsilon)\psi(n), \text{ i.o.}\} = 0$. This means that the fluctuations of Sn are not too wild. In Example 1.19 we see that if {Xn} are i.i.d. with P{X1 = 1} = P{X1 = −1} = 1/2, then $S_n / n \xrightarrow{a.s.} 0$ as n → ∞. But n goes to infinity faster than $\sqrt{n \log_2(n)}$. Thus, by (1.13.1), if we consider the sequence $W_n = S_n / \sqrt{2 n \log_2(n)}$ then $P\{|W_n| < 1 + \epsilon, \text{ i.o.}\} = 1$. {Wn} fluctuates between −1 and 1 almost always.

1.13.2 Uniform Integrability

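A sequence of random variables {Xn} is uniformly integrable if

(1.13.2) $$\lim_{c \to \infty} \sup_{n \ge 1} E\{|X_n|\, I\{|X_n| > c\}\} = 0.$$

Clearly, if |Xn| ≤ Y for all n ≥ 1 and E{Y} < ∞, then {Xn} is a uniformly integrable sequence. Indeed, |Xn| I{|Xn| > c} ≤ |Y| I{|Y| > c} for all n ≥ 1. Hence,

$$\sup_{n \ge 1} E\{|X_n|\, I\{|X_n| > c\}\} \le E\{|Y|\, I\{|Y| > c\}\} \to 0$$

as c → ∞ since E{Y} < ∞.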

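Theorem 1.13.2 Let {Xn} be uniformly integrable. Then,

(i)

(1.13.3) $$E\{\liminf_{n \to \infty} X_n\} \le \liminf_{n \to \infty} E\{X_n\} \le \limsup_{n \to \infty} E\{X_n\} \le E\{\limsup_{n \to \infty} X_n\};$$

(ii) if in addition $X_n \xrightarrow{a.s.} X$, as n → ∞, then X is integrable and

(1.13.4) $$\lim_{n \to \infty} E\{X_n\} = E\{X\},$$

(1.13.5) $$\lim_{n \to \infty} E\{|X_n - X|\} = 0.$$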

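Proof. (i) For every c > 0

(1.13.6) $$E\{X_n\} = E\{X_n I\{X_n < -c\}\} + E\{X_n I\{X_n \ge -c\}\}.$$

By uniform integrability, for every ε > 0, take c sufficiently large so that

$$\sup_{n \ge 1} |E\{X_n I\{X_n < -c\}\}| < \epsilon.$$

By Fatou's Lemma (Theorem 1.6.2),

(1.13.7) $$\liminf_{n \to \infty} E\{X_n I\{X_n \ge -c\}\} \ge E\{\liminf_{n \to \infty} X_n I\{X_n \ge -c\}\}.$$

But Xn I{Xn ≥ −c} ≥ Xn. Therefore,

(1.13.8) $$\liminf_{n \to \infty} E\{X_n I\{X_n \ge -c\}\} \ge E\{\liminf_{n \to \infty} X_n\}.$$

From (1.13.6)–(1.13.8), we obtain

(1.13.9) $$\liminf_{n \to \infty} E\{X_n\} \ge E\{\liminf_{n \to \infty} X_n\} - \epsilon.$$

In a similar way, we show that

(1.13.10) $$\limsup_{n \to \infty} E\{X_n\} \le E\{\limsup_{n \to \infty} X_n\} + \epsilon.$$

Since ε is arbitrary we obtain (1.13.3). Part (ii) is obtained from (i) as in the Dominated Convergence Theorem (Theorem 1.6.3). QED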

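Theorem 1.13.3. Let Xn ≥ 0, n ≥ 1, and $X_n \xrightarrow{a.s.} X$, E{Xn} < ∞. Then E{Xn} → E{X} < ∞ if and only if {Xn} is uniformly integrable.

Proof. The sufficiency follows from part (ii) of the previous theorem.

To prove necessity, let

$$A = \{a : P\{X = a\} > 0\}.$$

Then, for each $c \notin A$,

$$X_n I\{X_n < c\} \xrightarrow{a.s.} X I\{X < c\}.$$

The family {Xn I{Xn < c}} is uniformly integrable. Hence, by sufficiency,

$$E\{X_n I\{X_n < c\}\} \to E\{X I\{X < c\}\}, \quad \text{and therefore} \quad E\{X_n I\{X_n \ge c\}\} \to E\{X I\{X \ge c\}\},$$

for $c \notin A$, n → ∞. A has at most a countable number of points. Since E{X} < ∞, we can choose $c_0 \notin A$ sufficiently large so that, for a given ε > 0, E{X I{X ≥ c0}} < ε/2. Choose N0(ε) sufficiently large so that, for n ≥ N0(ε),

$$E\{X_n I\{X_n \ge c_0\}\} \le E\{X I\{X \ge c_0\}\} + \frac{\epsilon}{2}.$$

Choose c1 > c0 sufficiently large so that E{Xn I{Xn ≥ c1}} ≤ ε, n ≤ N0. Then $\sup_n E\{X_n I\{X_n \ge c_1\}\} \le \epsilon$. QED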

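Lemma 1.13.1 If {Xn} is a sequence of uniformly integrable random variables, then

(1.13.11) $$\sup_{n \ge 1} E\{|X_n|\} < \infty.$$

Proof.

$$\sup_{n \ge 1} E\{|X_n|\} = \sup_{n \ge 1} \left(E\{|X_n| I\{|X_n| > c\}\} + E\{|X_n| I\{|X_n| \le c\}\}\right) \le \sup_{n \ge 1} E\{|X_n| I\{|X_n| > c\}\} + c < \infty$$

for 0 < c < ∞ sufficiently large. QED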

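Theorem 1.13.4 A necessary and sufficient condition for a sequence {Xn} to be uniformly integrable is that

(1.13.12) $$\sup_{n \ge 1} E\{|X_n|\} < \infty$$

and

(1.13.13) $$\lim_{P\{A\} \to 0}\ \sup_{n \ge 1} E\{|X_n| I_A\} = 0.$$

Proof. (i) Necessity: Condition (1.13.12) was proven in the previous lemma. Furthermore, for any 0 < c < ∞,

(1.13.14) $$E\{|X_n| I_A\} = E\{|X_n| I\{A \cap \{|X_n| \ge c\}\}\} + E\{|X_n| I\{A \cap \{|X_n| < c\}\}\} \le E\{|X_n| I\{|X_n| \ge c\}\} + c P\{A\}.$$

Choose c sufficiently large, so that E{|Xn| I{|Xn| ≥ c}} < ε/2 for all n, and A so that P{A} < ε/(2c); then E{|Xn| IA} < ε. This proves the necessity of (1.13.13).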

(ii) Sufficiency: Let ε > 0 be given. Choose δ(ε) so that P{A} < δ(ε) implies $\sup_n E\{|X_n| I_A\} \le \epsilon$.

By Chebychev's inequality, for every c > 0,

$$P\{|X_n| \ge c\} \le \frac{E\{|X_n|\}}{c}.$$

Hence,

(1.13.15) $$\sup_{n \ge 1} P\{|X_n| \ge c\} \le \frac{1}{c} \sup_{n \ge 1} E\{|X_n|\}.$$

The right-hand side of (1.13.15) goes to zero when c → ∞. Choose c sufficiently large so that P{|Xn| ≥ c} < δ(ε). Such a value of c exists, independently of n, due to (1.13.15). Let $A_n = \{|X_n| \ge c\}$. For sufficiently large c, P{An} < δ(ε) and, therefore,

$$\sup_{n \ge 1} E\{|X_n| I\{|X_n| \ge c\}\} \le \epsilon$$

as c → ∞. This establishes the uniform integrability of {Xn}. QED

Notice that according to Theorem 1.13.3, if $E\{|X_n|^r\} < \infty$, r ≥ 1, and $X_n \xrightarrow{a.s.} X$, then $\lim_{n \to \infty} E\{X_n^r\} = E\{X^r\}$ if and only if $\{X_n^r\}$ is a uniformly integrable sequence.

1.13.3 Inequalities

In previous sections we established several inequalities, among them the Chebychev inequality and the Kolmogorov inequality. In this section we establish some useful additional inequalities.

1. The Schwarz Inequality

Let (X, Y) be random variables with joint distribution function FXY and marginal distribution functions FX and FY, respectively. Then, for all Borel measurable functions g and h such that $E\{g^2(X)\} < \infty$ and $E\{h^2(Y)\} < \infty$,

(1.13.16) $$\left(\int\!\!\int g(x) h(y)\, dF_{XY}(x, y)\right)^2 \le \int g^2(x)\, dF_X(x) \cdot \int h^2(y)\, dF_Y(y).$$

To prove (1.13.16), consider the random variable $Q(t) = (g(X) + t h(Y))^2$, −∞ < t < ∞. Obviously, Q(t) ≥ 0, for all t, −∞ < t < ∞. Moreover,

$$E\{Q(t)\} = E\{g^2(X)\} + 2t E\{g(X) h(Y)\} + t^2 E\{h^2(Y)\} \ge 0$$

for all t. But, E{Q(t)} ≥ 0 for all t if and only if

$$(E\{g(X) h(Y)\})^2 \le E\{g^2(X)\}\, E\{h^2(Y)\}.$$

This establishes (1.13.16).

2. Jensen’s Inequality

A function $g: \mathbb{R} \to \mathbb{R}$ is called convex if, for any −∞ < x < y < ∞ and 0 ≤ α ≤ 1,

$$g(\alpha x + (1 - \alpha) y) \le \alpha g(x) + (1 - \alpha) g(y).$$

Suppose X is a random variable and E{|X|} < ∞. Then, if g is convex,

(1.13.17) $$g(E\{X\}) \le E\{g(X)\}.$$

To prove (1.13.17), notice that since g is convex, for every x0, −∞ < x0 < ∞, g(x) ≥ g(x0) + (x − x0) g*(x0) for all x, −∞ < x < ∞, where g*(x0) is finite. Substitute x0 = E{X}. Then

$$g(X) \ge g(E\{X\}) + (X - E\{X\})\, g^*(E\{X\})$$

with probability one. Since E{X − E{X}} = 0, we obtain (1.13.17).

3. Lyapunov’s Inequality

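If 0 < s < r and $E\{|X|^r\} < \infty$, then

(1.13.18) $$(E\{|X|^s\})^{1/s} \le (E\{|X|^r\})^{1/r}.$$

To establish this inequality, let t = r/s. Notice that $g(x) = |x|^t$ is convex, since t > 1. Let $\xi = E\{|X|^s\}$, and notice that $(|X|^s)^t = |X|^r$. Thus, by Jensen's inequality,

$$g(\xi) = (E\{|X|^s\})^{r/s} \le E\{g(|X|^s)\} = E\{|X|^r\}.$$

Hence, $(E\{|X|^s\})^{1/s} \le (E\{|X|^r\})^{1/r}$. As a result of Lyapunov's inequality we have the following chain of inequalities among absolute moments.

(1.13.19) $$E\{|X|\} \le (E\{|X|^2\})^{1/2} \le (E\{|X|^3\})^{1/3} \le \cdots \le (E\{|X|^n\})^{1/n}.$$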
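The monotonicity of the absolute-moment "norms" in Lyapunov's inequality is easy to observe numerically. The sketch below assumes the inequality in the form $(E\{|X|^s\})^{1/s} \le (E\{|X|^r\})^{1/r}$ for 0 < s < r; the function name and the four-point distribution are illustrative.

```python
def abs_moment_norm(values, probs, r):
    """(E{|X|^r})^(1/r) for a finite discrete distribution."""
    return sum(p * abs(x)**r for x, p in zip(values, probs)) ** (1.0 / r)

values = [-2.0, 0.5, 1.0, 3.0]
probs = [0.1, 0.4, 0.3, 0.2]
orders = [0.5, 1, 2, 3, 4]

norms = [abs_moment_norm(values, probs, r) for r in orders]
for r, v in zip(orders, norms):
    print(r, v)   # the printed values increase with the order r
```

The orders need not be integers, which is why r = 0.5 is included in the list.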

4. Hölder’s Inequality

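Let 1 < p < ∞ and 1 < q < ∞, such that $\frac{1}{p} + \frac{1}{q} = 1$. Suppose that $E\{|X|^p\} < \infty$ and $E\{|Y|^q\} < \infty$. Then

(1.13.20) $$E\{|XY|\} \le (E\{|X|^p\})^{1/p}\, (E\{|Y|^q\})^{1/q}.$$

Notice that the Schwarz inequality is a special case of Hölder's inequality for p = q = 2.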

For proof, see Shiryayev (1984, p. 191).

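5. Minkowski's Inequality

If $E\{|X|^p\} < \infty$ and $E\{|Y|^p\} < \infty$ for some 1 ≤ p < ∞, then $E\{|X + Y|^p\} < \infty$ and

(1.13.21) $$(E\{|X + Y|^p\})^{1/p} \le (E\{|X|^p\})^{1/p} + (E\{|Y|^p\})^{1/p}.$$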

For proof, see Shiryayev (1984, p. 192).

1.13.4 The Delta Method

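The delta method is designed to yield large sample approximations to nonlinear functions g of the sample mean $\bar{X}_n$ and to their variances. More specifically, let {Xn} be a sequence of i.i.d. random variables with μ = E{X} and σ² = V{X}. Assume that 0 < V{X} < ∞. By the SLLN, $\bar{X}_n \xrightarrow{a.s.} \mu$, as n → ∞, where $\bar{X}_n = \frac{1}{n} \sum_{j=1}^{n} X_j$, and by the CLT, $\sqrt{n}\, (\bar{X}_n - \mu)/\sigma \xrightarrow{d} N(0, 1)$. Let $g: \mathbb{R} \to \mathbb{R}$ have a continuous third order derivative. By the Taylor expansion of $g(\bar{X}_n)$ around μ,

(1.13.22) $$g(\bar{X}_n) = g(\mu) + (\bar{X}_n - \mu)\, g^{(1)}(\mu) + \frac{1}{2} (\bar{X}_n - \mu)^2 g^{(2)}(\mu) + R_n,$$

where $R_n = \frac{1}{6} (\bar{X}_n - \mu)^3 g^{(3)}(\mu_n^*)$, and $\mu_n^*$ is a point between $\bar{X}_n$ and μ, i.e., $|\bar{X}_n - \mu_n^*| < |\bar{X}_n - \mu|$. Since we assumed that $g^{(3)}(x)$ is continuous, it is bounded on the closed interval [μ − Δ, μ + Δ]. Moreover, $g^{(3)}(\mu_n^*) \xrightarrow{a.s.} g^{(3)}(\mu)$, as n → ∞. Thus $R_n \xrightarrow{a.s.} 0$, as n → ∞. The distribution of $g(\mu) + g^{(1)}(\mu)(\bar{X}_n - \mu)$ is asymptotically $N(g(\mu), (g^{(1)}(\mu))^2 \sigma^2 / n)$. Moreover, $\sqrt{n}\, (\bar{X}_n - \mu)^2 \xrightarrow{p} 0$, as n → ∞. Thus, $\sqrt{n}\, (g(\bar{X}_n) - g(\mu)) \xrightarrow{d} N(0, \sigma^2 (g^{(1)}(\mu))^2)$. Thus, if $\bar{X}_n$ satisfies the CLT, an approximation to the expected value of $g(\bar{X}_n)$ is

(1.13.23) $$E\{g(\bar{X}_n)\} \cong g(\mu) + \frac{\sigma^2}{2n}\, g^{(2)}(\mu).$$

An approximation to the variance of $g(\bar{X}_n)$ is

(1.13.24) $$V\{g(\bar{X}_n)\} \cong \frac{\sigma^2}{n}\, (g^{(1)}(\mu))^2.$$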

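Furthermore, expanding $g(\bar{X}_n)$ around μ to the second order, as in (1.13.22),

(1.13.25) $$\sqrt{n}\, \frac{g(\bar{X}_n) - g(\mu)}{\sigma g^{(1)}(\mu)} = \sqrt{n}\, \frac{\bar{X}_n - \mu}{\sigma} + D_n,$$

where

(1.13.26) $$D_n = \frac{\sqrt{n}\, (\bar{X}_n - \mu)^2}{2 \sigma g^{(1)}(\mu)}\, g^{(2)}(\mu_n^{**}),$$

and $|\mu_n^{**} - \bar{X}_n| \le |\mu - \bar{X}_n|$ with probability one. Thus, since $\bar{X}_n - \mu \to 0$ a.s., as n → ∞, and since $|g^{(2)}(\mu_n^{**})|$ is bounded, $D_n \xrightarrow{p} 0$, as n → ∞. Then

(1.13.27) $$\sqrt{n}\, \frac{g(\bar{X}_n) - g(\mu)}{\sigma |g^{(1)}(\mu)|} \xrightarrow{d} N(0, 1).$$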
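The quality of the delta-method approximations can be checked in a case where the exact moments are computable. The sketch below assumes the approximations $E\{g(\bar{X}_n)\} \cong g(\mu) + \frac{\sigma^2}{2n} g^{(2)}(\mu)$ and $V\{g(\bar{X}_n)\} \cong \frac{\sigma^2}{n} (g^{(1)}(\mu))^2$, and compares them with exact values for i.i.d. Bernoulli(1/2) observations and g(x) = eˣ. The function names, the choice of g, and n = 100 are illustrative.

```python
from math import comb, exp

def exact_mean_var(n, g):
    """Exact mean and variance of g(Xbar_n) for n i.i.d. Bernoulli(1/2) observations."""
    pmf = [comb(n, k) / 2**n for k in range(n + 1)]
    m1 = sum(p * g(k / n) for k, p in enumerate(pmf))
    m2 = sum(p * g(k / n)**2 for k, p in enumerate(pmf))
    return m1, m2 - m1**2

def delta_mean_var(n, mu, sigma2, g, g1, g2):
    """Delta-method approximations; g1 and g2 are the first two derivatives of g."""
    mean = g(mu) + sigma2 / (2 * n) * g2(mu)
    var = sigma2 / n * g1(mu)**2
    return mean, var

n = 100
exact_m, exact_v = exact_mean_var(n, exp)
# For g = exp, every derivative of g is exp itself.
approx_m, approx_v = delta_mean_var(n, 0.5, 0.25, exp, exp, exp)
print(exact_m, approx_m)
print(exact_v, approx_v)
```

For this smooth g the mean approximation is accurate to several decimal places already at n = 100, and the variance approximation is off by well under one percent.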

1.13.5 The Symbols op and Op

Let {Xn} and {Yn} be two sequences of random variables, Yn > 0 a.s. for all n ≥ 1. We say that Xn = op(Yn), i.e., Xn is of a smaller order of magnitude than Yn in probability, if

(1.13.28) $$\frac{X_n}{Y_n} \xrightarrow{p} 0, \quad \text{as } n \to \infty.$$

We say that Xn = Op(Yn), i.e., Xn has the same order of magnitude in probability as Yn, if, for all ε > 0, there exists $K_\epsilon$ such that $\sup_n P\{|X_n / Y_n| > K_\epsilon\} < \epsilon$.

One can verify the following relations.

(1.13.29) numbered Display Equation

1.13.6 The Empirical Distribution and Sample Quantiles

Let X1, X2, …, Xn be i.i.d. random variables having a distribution F. The function

(1.13.30) $$F_n(x) = \frac{1}{n} \sum_{i=1}^{n} I\{X_i \le x\}$$

is called the empirical distribution function (EDF).

Notice that E{I{Xi ≤ x}} = F(x). Thus, the SLLN implies that at each x, $F_n(x) \xrightarrow{a.s.} F(x)$ as n → ∞. The question is whether this convergence is uniform in x. The answer is given by

Theorem 1.13.5 (Glivenko–Cantelli) Let X1, X2, X3, … be i.i.d. random variables. Then

(1.13.31) $$\sup_{-\infty < x < \infty} |F_n(x) - F(x)| \xrightarrow{a.s.} 0, \quad \text{as } n \to \infty.$$

For proof, see Sen and Singer (1993, p. 185).

The pth sample quantile $x_{n,p}$ is defined as

(1.13.32) $$x_{n,p} = F_n^{-1}(p) = \inf\{x : F_n(x) \ge p\}$$

for 0 < p < 1, where Fn(x) is the EDF. When F(x) is continuous, the points of increase of Fn(x) are the order statistics X(1:n) < ··· < X(n:n) with probability one. Also, $F_n(X_{(i:n)}) = \frac{i}{n}$, i = 1, …, n. Thus,

(1.13.33) $$x_{n,p} = X_{(i:n)}, \quad \text{for } \frac{i - 1}{n} < p \le \frac{i}{n}, \quad i = 1, \ldots, n.$$
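The EDF and the sample quantile can be computed directly from their definitions. The sketch below assumes the definition $x_{n,p} = \inf\{x : F_n(x) \ge p\}$; since Fn only increases at observed values, the infimum is found by scanning the order statistics. The function names and the five-point sample are illustrative.

```python
def edf(sample, x):
    """Empirical distribution function F_n(x): fraction of observations <= x."""
    return sum(1 for xi in sample if xi <= x) / len(sample)

def sample_quantile(sample, p):
    """p-th sample quantile inf{x : F_n(x) >= p}, for 0 < p <= 1."""
    for x in sorted(sample):            # scan the order statistics
        if edf(sample, x) >= p:
            return x
    raise ValueError("p must satisfy 0 < p <= 1")

sample = [3, 1, 2, 5, 4]
print(edf(sample, 2))                   # two of the five observations are <= 2
print(sample_quantile(sample, 0.5))     # the sample median
print(sample_quantile(sample, 0.9))
```

This direct scan costs O(n²) operations; for large samples one would sort once and index the order statistic, but the version above mirrors the definition more closely.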

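Theorem 1.13.6 Let F be a continuous distribution function, and $\xi_p = F^{-1}(p)$, and suppose that $F(\xi_p) = p$ and for any ε > 0, $F(\xi_p - \epsilon) < p < F(\xi_p + \epsilon)$. Let X1, …, Xn be i.i.d. random variables from this distribution. Then

$$x_{n,p} \xrightarrow{a.s.} \xi_p, \quad \text{as } n \to \infty.$$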

For proof, see Sen and Singer (1993, p. 167).

The following theorem establishes the asymptotic normality of xn, p.

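Theorem 1.13.7 Let F(x) be an absolutely continuous distribution, with continuous p.d.f. f(x). Let p, 0 < p < 1, $\xi_p = F^{-1}(p)$ and $f(\xi_p) > 0$. Then

(1.13.34) $$\sqrt{n}\, (x_{n,p} - \xi_p) \xrightarrow{d} N\left(0, \frac{p(1 - p)}{f^2(\xi_p)}\right).$$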

For proof, see Sen and Singer (1993, p. 168).

The results of Theorems 1.13.6–1.13.7 will be used in Chapter 7 to establish the asymptotic relative efficiency of the sample median, relative to the sample mean.
