Chapter 7

Sampling

Abstract

The objective of this chapter is to present the main sampling concepts and methods, characterizing the differences between a population and a sample. The main random and nonrandom sampling techniques are described here, as well as their advantages and disadvantages. Thus, it is possible to select the most suitable sampling technique for the study we are interested in. For each type of random sampling, we calculate the sample size based on accuracy and on the confidence level we are interested in having.

Keywords

Population; Sample; Sampling techniques; Random sampling; Nonrandom sampling; Sample size

Our reason becomes obscure when we consider that the countless fixed stars that shine in the sky do not have any other purpose besides illuminating worlds in which weeping and pain rule, and, in the best case scenario, only unpleasantness exists; at least, judging by the sample we know.

Arthur Schopenhauer

7.1 Introduction

As discussed in the Introduction of this book, population is the set that has all the individuals, objects, or elements to be studied, which have one or more characteristics in common. A census is the study of data related to all the elements of the population.

According to Bruni (2011), populations can be finite or infinite. Finite populations have a limited size, allowing their elements to be counted; infinite populations, on the other hand, have an unlimited size, not allowing us to count their elements.

As examples of finite populations, we can mention the number of employees in a certain company, the number of members in a club, the number of products manufactured during a certain period, etc. When the number of elements in a population, even though they can be counted, is too high, we assume that the population is infinite. Examples of populations considered infinite are the number of inhabitants in the world, the number of residences in Rio de Janeiro, the number of points on a straight line, etc.

Therefore, there are situations in which a study with all the elements in a population is impossible or unwanted. Hence, the alternative is to extract a subset from the population under analysis, which is called a sample. The sample must be representative of the population being studied, hence the importance of this chapter. From the information gathered in the sample and using suitable statistical procedures, the results obtained can be used to generalize, infer, or draw conclusions regarding the population (statistical inference).

For Fávero et al. (2009) and Bussab and Morettin (2011), it is rarely possible to obtain the exact distribution of a variable, due to the high costs, the time needed and the difficulties in collecting the data. Hence, the alternative is to select part of the elements in the population (sample) and, after that, infer the properties for the whole (population).

Essentially, there are two types of sampling: (1) probability or random sampling, and (2) nonprobability or nonrandom sampling. In random sampling, samples are obtained randomly, that is, the probability of each element of the population being a part of the sample is the same. In nonrandom sampling, on the other hand, the probability of some or all the elements of the population being in the sample is unknown.

Fig. 7.1 shows the main random and nonrandom sampling techniques.

Fávero et al. (2009) show the advantages and disadvantages of random and nonrandom techniques. Regarding random sampling techniques, the main advantages are: a) the selection criteria of the elements are rigorously defined, not allowing the researchers’ or the interviewer’s subjectivity to interfere in the selection of the elements; b) the possibility to mathematically determine the sample size based on accuracy and on the confidence level desired for the results. On the other hand, the main disadvantages are: a) difficulty in obtaining current and complete listings or regions of the population; b) geographically speaking, a random selection can generate a highly disperse sample, increasing the costs, the time needed for the study, and the difficulty in collecting the data.

As regards nonrandom sampling techniques, the advantages are lower costs, less time to carry out the study, and less need of human resources. As disadvantages, we can mention: a) there are units in the population that cannot be chosen; b) a personal bias may happen; c) we do not know with what level of confidence the conclusions arrived at can be inferred for the population. These techniques do not use a random method to select the elements of the sample, so, there is no guarantee that the sample selected is a good representative of the population (Fávero et al., 2009).

Choosing the sampling technique must consider the goals of the survey, the acceptable error in the results, accessibility to the elements of the population, the desired representativeness, the time needed, and the availability of financial and human resources.

7.2 Probability or Random Sampling

In this type of sampling, samples are obtained randomly, that is, the probability of each element of the population being a part of the sample is the same, and all of the samples selected are equally probable.

In this section, we will study the main probability or random sampling techniques: (a) simple random sampling, (b) systematic sampling, (c) stratified sampling, and (d) cluster sampling.

7.2.1 Simple Random Sampling

According to Bolfarine and Bussab (2005), simple random sampling (SRS) is the simplest and most important method for selecting a sample.

Consider a population or universe (U) with N elements:

$U = \{1, 2, \dots, N\}$

According to Bolfarine and Bussab (2005), planning and selecting the sample include the following steps:

(a) Using a random procedure (for example, a table of random numbers or a gravity-pick machine), we draw an element from population U, with every element having the same probability of being drawn;
(b) We repeat the previous process until a sample with n observations is generated (the calculation of the size of a simple random sample will be studied in Section 7.4);
(c) When the value drawn is removed from U before the next draw, we have the SRS without replacement process. If drawing a unit more than once is allowed, we have the SRS with replacement process.

According to Bolfarine and Bussab (2005), from a practical point of view, an SRS without replacement is much more interesting, because it satisfies the intuitive principle that we do not gain more information in case the same unit appears more than once in the sample. On the other hand, an SRS with replacement has mathematical and statistical advantages, such as, the independence between the units drawn. Let’s now study each of them.

7.2.1.1 Simple Random Sampling Without Replacement

According to Bolfarine and Bussab (2005), an SRS without replacement works as follows:

(a) All of the elements in the population are numbered from 1 to N:

$U = \{1, 2, \dots, N\}$

(b) Using a procedure to generate random numbers, we must draw, with the same probability, one of the N observations of the population;
(c) We draw the following element, with the previous value being removed from the population;
(d) We repeat the procedure until n observations have been drawn (how to calculate n is explained in Section 7.4.1).

In this type of sampling, there are $C_{N, n} = \binom{N}{n} = \frac{N!}{n! (N - n)!}$ possible samples of n elements that can be obtained from the population, and each sample has the same probability of being selected, $1 / \binom{N}{n}$.

Example 7.1

Simple Random Sampling without Replacement

Table 7.E.1 shows the weight (kg) of 30 parts. Draw, without any replacements, a random sample of size n = 5. How many different samples of size n can be obtained from the population? What is the probability of a sample being selected?

Table 7.E.1

Weight (kg) of 30 parts
6.4	6.2	7.0	6.8	7.2	6.4	6.5	7.1	6.8	6.9	7.0	7.1	6.6	6.8	6.7
6.3	6.6	7.2	7.0	6.9	6.8	6.7	6.5	7.2	6.8	6.9	7.0	6.7	6.9	6.8

Solution

All 30 parts were numbered from 1 to 30, as shown in Table 7.E.2.

Table 7.E.2

Numbers given to the parts
1	2	3	4	5	6	7	8	9	10	11	12	13	14	15
6.4	6.2	7.0	6.8	7.2	6.4	6.5	7.1	6.8	6.9	7.0	7.1	6.6	6.8	6.7
16	17	18	19	20	21	22	23	24	25	26	27	28	29	30
6.3	6.6	7.2	7.0	6.9	6.8	6.7	6.5	7.2	6.8	6.9	7.0	6.7	6.9	6.8

Through a random procedure (as, for example, the RANDBETWEEN function in Excel), the following numbers were selected:

02 03 14 24 28

The parts associated with these numbers form the random sample selected.

There are $\binom{30}{5} = \frac{30 \cdot 29 \cdot 28 \cdot 27 \cdot 26}{5!} = 142{,}506$ different samples.

The probability of a certain sample being selected is 1/142,506.
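The draw in Example 7.1 can be sketched in Python as an alternative to a spreadsheet function. This is only an illustration: the seed is arbitrary, so the labels drawn will differ from the ones shown above, and the variable names are our own.

```python
import math
import random

# Weights (kg) of the 30 parts from Table 7.E.1, in the order they were numbered.
weights = [6.4, 6.2, 7.0, 6.8, 7.2, 6.4, 6.5, 7.1, 6.8, 6.9, 7.0, 7.1, 6.6, 6.8, 6.7,
           6.3, 6.6, 7.2, 7.0, 6.9, 6.8, 6.7, 6.5, 7.2, 6.8, 6.9, 7.0, 6.7, 6.9, 6.8]

N, n = len(weights), 5

# Arbitrary seed (an assumption), used only so that the draw can be repeated.
rng = random.Random(2024)

# random.sample draws n distinct labels from 1..N, that is, without replacement.
labels = sorted(rng.sample(range(1, N + 1), n))
sample = [weights[label - 1] for label in labels]

# Number of possible unordered samples and the probability of each one.
n_samples = math.comb(N, n)
prob_each = 1 / n_samples

print("labels drawn:", labels)
print("weights drawn:", sample)
print("possible samples:", n_samples)
```

Because `random.sample` never returns the same label twice, it matches the procedure without replacement described in this section.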

7.2.1.2 Simple Random Sampling With Replacement

According to Bolfarine and Bussab (2005), an SRS with replacement works as follows:

(a) All of the elements in the population are numbered from 1 to N:
$U = \{1, 2, \dots, N\}$
(b) Using a procedure to generate random numbers, we must draw, with the same probability, one of the N observations of the population;
(c) We put this unit back into the population and draw the following value;
(d) We repeat the procedure until n observations have been drawn (how to calculate n is explained in Section 7.4.1).

In this type of sampling, there are Nⁿ possible samples of n elements that can be obtained from the population, and each sample has the same probability 1/Nⁿ of being selected.

Example 7.2

Simple Random Sampling with Replacement

Redo Example 7.1 considering a simple random sampling with replacement.

Solution

The 30 parts were numbered from 1 to 30. Through a random procedure (for example, the RANDBETWEEN function in Excel), we drew the first part of the sample (12). This part is put back and the second element is drawn (33). The procedure is repeated until five parts have been drawn:

12 33 02 25 33

The parts associated with these numbers form the random sample selected.

There are 30⁵ = 24,300,000 different samples.

The probability of a certain sample being selected is 1/24,300,000.
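A draw with replacement can be sketched in the same way. The snippet below is a minimal illustration with an arbitrary seed, so it does not reproduce the numbers drawn in the example; it relies on `random.choices`, which allows the same label to appear more than once.

```python
import random

N, n = 30, 5

# Arbitrary seed (an assumption), used only so that the draw can be repeated.
rng = random.Random(7)

# Each of the n draws picks one label from 1..N independently of the others,
# so a label may be repeated.
labels = rng.choices(range(1, N + 1), k=n)

# Number of possible ordered samples and the probability of each one.
n_samples = N ** n
prob_each = 1 / n_samples

print("labels drawn:", labels)
print("possible samples:", n_samples)
```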

7.2.2 Systematic Sampling

According to Costa Neto (2002), when the elements of the population are sorted and removed periodically, we have systematic sampling. For example, in a production line, we can remove one element for every 50 items produced.

As advantages of systematic sampling, in comparison to simple random sampling, we can mention that it is carried out in a much faster and cheaper way, besides being less susceptible to errors made by the interviewer during the survey. The main disadvantage is the possibility of having variation cycles, especially if these cycles coincide with the period at which the elements are removed for the sample. For example, suppose that one part is inspected for every 60 parts produced by a certain machine, and that this machine has a recurring flaw that makes every 20th part defective. Because 60 is a multiple of 20, the inspected parts always fall at the same point of the defect cycle, so the sample would contain either only defective parts or none at all, depending on the starting unit.

Assuming that the elements of the population are sorted from 1 to N and that we already know the sample size (n), systematic sampling works as follows:

(a) We must determine the sampling interval (k), obtained by the quotient of the population size and the sample size:
$k = \frac{N}{n}$

This value must be rounded to the closest integer.
(b) In this phase, we introduce an element of randomness by choosing the starting unit. The first element chosen (X₁) can be any element between 1 and k;
(c) After choosing the first element, we take every kth element that follows it. The process is repeated until the sample size (n) is reached:
$X_{1}, X_{1} + k, X_{1} + 2 k, \dots, X_{1} + (n - 1) k$

Example 7.3

Systematic Sampling

Imagine a population with N = 500 sorted elements. We wish to remove a sample with n = 20 elements from this population. Use the systematic sampling procedure.

Solution

(a) The sampling interval (k) is:

$k = \frac{N}{n} = \frac{500}{20} = 25$

(b) The first element chosen (X₁) can be any element between 1 and 25; suppose that X₁ = 5;
(c) Since the first element of the sample is X₁ = 5, the second element will be X₂ = 5 + 25 = 30, the third X₃ = 5 + 50 = 55, and so on, so that the last element of the sample will be X₂₀ = 5 + 19 × 25 = 480:

$A = \{5, 30, 55, 80, 105, 130, 155, 180, 205, 230, 255, 280, 305, 330, 355, 380, 405, 430, 455, 480\}$
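The three steps of systematic sampling can be sketched as a short Python function. The function name and its `start` parameter are our own; passing `start=5` reproduces the example, while leaving it out draws the starting unit at random.

```python
import random


def systematic_sample(N, n, start=None, rng=random):
    """Return the labels of a systematic sample of size n from 1..N."""
    # Step (a): sampling interval, rounded to the closest integer.
    # Note: Python's round() sends exact halves to the nearest even integer,
    # and if k is rounded up the last labels may exceed N.
    k = round(N / n)
    # Step (b): random starting unit between 1 and k, unless one is given.
    if start is None:
        start = rng.randint(1, k)
    # Step (c): X1, X1 + k, X1 + 2k, ..., X1 + (n - 1)k.
    return [start + j * k for j in range(n)]


sample = systematic_sample(500, 20, start=5)
print(sample)
```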

7.2.3 Stratified Sampling

In this type of sampling, a heterogeneous population is stratified or divided into subpopulations or homogeneous strata, and, in each stratum, a sample is drawn. Hence, initially, we define the number of strata and, by doing that, we obtain the size of each stratum. For each stratum, we specify how many elements will be drawn from the subpopulation, and this can be a uniform or proportional allocation. According to Costa Neto (2002), uniform stratified sampling, in which we draw an equal number of elements from each stratum, is recommended when the strata are approximately the same size. In proportional stratified sampling, on the other hand, the number of elements drawn from each stratum is proportional to the number of elements in the stratum.

According to Freund (2006), if the elements selected in each stratum are simple random samples, the global process (stratification followed by random sampling) is called (simple) stratified random sampling.

According to Freund (2006), stratified sampling works as follows:

(a) A population of size N is divided into k strata of sizes N₁, N₂, …, N_k;
(b) For each stratum, a random sample of size n_i (i = 1, 2, …, k) is selected, resulting in k subsamples of sizes n₁, n₂, …, n_k.

In uniform stratified sampling, we have:

$n_{1} = n_{2} = \dots = n_{k}$

(7.1)

where the sample size obtained from each stratum is:

$n_{i} = \frac{n}{k}, \text{ for } i = 1, 2, \dots, k$

(7.2)

where n = n₁ + n₂ + … + n_k

In proportional stratified sampling, on the other hand, we have:

$\frac{n_{1}}{N_{1}} = \frac{n_{2}}{N_{2}} = \dots = \frac{n_{k}}{N_{k}}$

(7.3)

In proportional sampling, the sample size obtained from each stratum can be obtained according to the following expression:

$n_{i} = \frac{N_{i}}{N} \cdot n, \text{ for } i = 1, 2, \dots, k$

(7.4)

As examples of stratified sampling, we can mention the stratification of a city into neighborhoods, of a population by gender or age group, of customers by social class or of students by school.

The calculation of the size of a stratified sample will be studied in Section 7.4.3.

Example 7.4

Stratified Sampling

Consider a club that has N = 5000 members. The population can be divided by age group, aiming at identifying the main activities practiced by each group: from 0 to 4 years of age; from 5 to 11; from 12 to 17; from 18 to 25; from 26 to 36; from 37 to 50; from 51 to 65; and over 65 years of age. We have N₁ = 330, N₂ = 350, N₃ = 400, N₄ = 520, N₅ = 650, N₆ = 1030, N₇ = 980, N₈ = 740. We would like to draw a stratified sample from the population of size n = 80. What should be the size of the sample drawn from each stratum in case of uniform sampling and proportional sampling?

Solution

For uniform sampling, n_i = n/k = 80/8 = 10. Therefore, n₁ = … = n₈ = 10.

For proportional sampling, we calculate $n_{i} = \frac{N_{i}}{N} \cdot n, \text{ for } i = 1, 2, \dots, 8$:

$n_{1} = \frac{N_{1}}{N} \cdot n = \frac{330}{5{,}000} \cdot 80 = 5.3 \cong 6, \quad n_{2} = \frac{N_{2}}{N} \cdot n = \frac{350}{5{,}000} \cdot 80 = 5.6 \cong 6$

$n_{3} = \frac{N_{3}}{N} \cdot n = \frac{400}{5{,}000} \cdot 80 = 6.4 \cong 7, \quad n_{4} = \frac{N_{4}}{N} \cdot n = \frac{520}{5{,}000} \cdot 80 = 8.3 \cong 9$

$n_{5} = \frac{N_{5}}{N} \cdot n = \frac{650}{5{,}000} \cdot 80 = 10.4 \cong 11, \quad n_{6} = \frac{N_{6}}{N} \cdot n = \frac{1{,}030}{5{,}000} \cdot 80 = 16.5 \cong 17$

$n_{7} = \frac{N_{7}}{N} \cdot n = \frac{980}{5{,}000} \cdot 80 = 15.7 \cong 16, \quad n_{8} = \frac{N_{8}}{N} \cdot n = \frac{740}{5{,}000} \cdot 80 = 11.8 \cong 12$
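Both allocations in Example 7.4 can be checked with a short Python sketch. The function names are our own, and the proportional allocation rounds every stratum up to the next integer, which is the rounding used in the solution above.

```python
def uniform_allocation(n, k):
    """Expression (7.2): the same number of elements from each of the k strata.

    Assumes that n is divisible by k, as in Example 7.4.
    """
    return [n // k] * k


def proportional_allocation(strata_sizes, n):
    """Expression (7.4), with each n_i rounded up to the next integer."""
    N = sum(strata_sizes)
    # Integer ceiling of N_i * n / N, which avoids floating-point rounding.
    return [(N_i * n + N - 1) // N for N_i in strata_sizes]


strata_sizes = [330, 350, 400, 520, 650, 1030, 980, 740]

uniform = uniform_allocation(80, len(strata_sizes))
proportional = proportional_allocation(strata_sizes, 80)

print(uniform)
print(proportional, sum(proportional))
```

Because every stratum is rounded up, the proportional allocations add up to 84 rather than 80.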

7.2.4 Cluster Sampling

In cluster sampling, the total population must be subdivided into groups of elementary units, called clusters. The sampling is done from the groups and not from the individuals in the population. Hence, we must randomly draw a sufficient number of clusters and the objects from these will form the sample. This type of sampling is called one-stage cluster sampling.

According to Bolfarine and Bussab (2005), one of the inconveniences of cluster sampling is the fact that elements in the same cluster tend to have similar characteristics. The authors show that the more similar the elements in the cluster are, the less efficient the procedure is. Each cluster must be a good representative of the population, that is, it must be heterogeneous, containing all kinds of participants. It is the opposite of stratified sampling.

According to Martins and Domingues (2011), cluster sampling is a simple random sampling in which the sample units are the clusters; however, it is less expensive.

When we draw elements from the clusters selected, we have a two-stage cluster sampling: in the first stage, we draw the clusters and, in the second, we draw the elements. The number of elements to be drawn depends on the variability in the cluster. The higher the variability, the more elements must be drawn. On the other hand, when the units in the cluster are very similar, it is neither advisable nor necessary to draw all the elements, because they will bring the same kind of information (Bolfarine and Bussab, 2005).

Cluster sampling can be generalized to several stages.

The main advantages that justify the wide use of cluster sampling are: a) many populations are already grouped into natural or geographic subgroups, facilitating its application; b) it allows a substantial reduction in the costs to obtain the sample, without compromising its accuracy. In short, it is fast, cheap, and efficient. The only disadvantage is that clusters are rarely the same size, making it difficult to control the range of the sample. However, to overcome this problem, we have to use certain statistical techniques.

As examples of clusters, we can mention the production in a factory divided into assembly lines, company employees divided by area, students in a municipality divided by schools, or the population in a municipality divided into districts.

Consider the following notation for cluster sampling:

N: population size;
M: number of clusters into which the population was divided;
N_i: size of cluster i (i = 1, 2, ..., M);
n: sample size;
m: number of clusters drawn (m < M);
n_i: size of cluster i of the sample (i = 1, 2, ..., m), where n_i = N_i;
b_i: number of elements drawn from cluster i of the sample in the second stage (i = 1, 2, ..., m), where b_i < n_i.

In short, one-stage cluster sampling adopts the following procedure:

(a) The population is divided into M clusters (C₁, …, C_M) with sizes that are not necessarily the same;
(b) According to a sample plan, usually SRS, we draw m clusters (m < M);
(c) All the elements of each cluster drawn constitute the global sample $(n_{i} = N_{i} \text{ and } \sum_{i = 1}^{m} n_{i} = n)$.

The calculation of the number of clusters (m) will be studied in Section 7.4.4.

On the other hand, two-stage cluster sampling works as follows:

(a) The population is divided into M clusters (C₁, …, C_M) with sizes that are not necessarily the same;
(b) We must draw m clusters in the first stage, according to some kind of sample plan, usually SRS;
(c) From each cluster i drawn, of size n_i, we draw b_i elements in the second stage, according to the same or to another sample plan $(b_{i} < n_{i} \text{ and } n = \sum_{i = 1}^{m} b_{i})$.

Example 7.5

One-Stage Cluster Sampling

Consider a population with N = 20 elements, U = {1, 2, …, 20}. The population is divided into 7 clusters: C₁ = {1, 2}, C₂ = {3, 4, 5}, C₃ = {6, 7, 8}, C₄ = {9, 10, 11}, C₅ = {12, 13, 14}, C₆ = {15, 16}, C₇ = {17, 18, 19, 20}. The sample plan adopted says that we should draw three clusters (m = 3) by simple random sampling without replacement. Assuming that clusters C₁, C₃, and C₄ were drawn, determine the sample size, besides the elements that will constitute the one-stage cluster sampling.

Solution

In one-stage cluster sampling, all the elements of each cluster drawn constitute the sample, so, M = {C₁, C₃, C₄} = {(1, 2), (6, 7, 8), (9, 10, 11)}. Therefore, n₁ = 2, n₂ = 3 and n₃ = 3, and $n = \sum_{i = 1}^{3} n_{i} = 8$ .

Example 7.6

Two-Stage Cluster Sampling

Example 7.5 will be extended to the case of two-stage cluster sampling. Thus, from the clusters drawn in the first stage, the sample plan adopted tells us to draw a single element with equal probability from each cluster $(b_{i} = 1, i = 1, 2, 3 \text{ and } n = \sum_{i = 1}^{m} b_{i} = 3)$, which results in the following:

Stage 1: M = {C₁, C₃, C₄} = {(1, 2), (6, 7, 8), (9, 10, 11)}

Stage 2: M = {1, 8, 10}
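Examples 7.5 and 7.6 can be sketched in Python as follows. The function names are our own and the seed is arbitrary; the clusters C₁, C₃, and C₄ are fixed by hand to match the examples, so only the second-stage draw is random and it will not necessarily return {1, 8, 10}.

```python
import random

# Clusters C1..C7 from Example 7.5, keyed by cluster index.
clusters = {
    1: [1, 2], 2: [3, 4, 5], 3: [6, 7, 8], 4: [9, 10, 11],
    5: [12, 13, 14], 6: [15, 16], 7: [17, 18, 19, 20],
}


def draw_clusters(clusters, m, rng):
    """First stage: draw m cluster indices by SRS without replacement."""
    return sorted(rng.sample(sorted(clusters), m))


def one_stage(clusters, chosen):
    """One-stage sample: every element of every cluster drawn."""
    return [x for c in chosen for x in clusters[c]]


def two_stage(clusters, chosen, b, rng):
    """Two-stage sample: b elements drawn without replacement per cluster."""
    return [x for c in chosen for x in rng.sample(clusters[c], b)]


rng = random.Random(1)   # arbitrary seed (an assumption)
chosen = [1, 3, 4]       # clusters assumed drawn in Example 7.5

stage_one_sample = one_stage(clusters, chosen)
stage_two_sample = two_stage(clusters, chosen, 1, rng)

print(stage_one_sample, len(stage_one_sample))
print(stage_two_sample, len(stage_two_sample))
```

In a real application, `draw_clusters(clusters, 3, rng)` would replace the hand-picked list `chosen`.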

7.3 Nonprobability or Nonrandom Sampling

In nonprobability sampling methods, samples are obtained in a nonrandom way, that is, the probability of some or all elements of the population belonging to the sample is unknown. Thus, it is not possible to estimate the sample error, nor to generalize the results of the sample to the population, since the former is not representative of the latter.

For Costa Neto (2002), this type of sampling is often used because it is simple or because it is impossible to obtain probability samples, which would be more desirable.

Therefore, we must be careful when deciding to use this type of sampling, since it is subjective, based on the researcher’s criteria and judgment, and sample variability cannot be established with accuracy.

In this section, we will study the main nonprobability or nonrandom sampling techniques: (a) convenience sampling, (b) judgmental or purposive sampling, (c) quota sampling, (d) geometric propagation or snowball sampling.

7.3.1 Convenience Sampling

Convenience sampling is used when participation is voluntary or the sample elements are chosen due to convenience or simplicity, such as friends, neighbors, or students. The advantage this method offers is that it allows the researcher to obtain information in a quick and cheap way.

However, the sample process does not guarantee that the sample is representative of the population. It should only be employed in extreme situations and in special cases that justify its use.

Example 7.7

Convenience Sampling

A researcher wishes to study customer behavior in relation to a certain brand and, in order to do that, he develops a sampling plan. The collection of data is done through interviews with friends, neighbors, and workmates. This represents convenience sampling, since this sample is not representative of the population.

It is important to highlight that, if the population is very heterogeneous, the results of the sample cannot be generalized to the population.

7.3.2 Judgmental or Purposive Sampling

In judgmental or purposive sampling, the sample is chosen according to an expert’s opinion or previous judgment. It is a risky method due to possible mistakes made by the researcher in his prejudgment.

Using this type of sampling requires knowledge of the population and of the elements selected.

Example 7.8

Judgmental or Purposive Sampling

A survey is trying to identify the reasons why a group of employees of a certain company went on strike. In order to do that, the researcher interviews the main leaders of the trade union and of political movements, as well as the employees that are not involved in such movements.

Since the sample is small and not representative of the population, it is not possible to generalize the results to the population.

7.3.3 Quota Sampling

Quota sampling presents greater rigor when compared to other nonrandom samplings. For Martins and Domingues (2011), it is one of the most used sampling methods in market surveys and election polls.

Quota sampling is a variation of judgmental sampling. Initially, we set the quotas based on a certain criterion. Within the quotas, the selection of the sample items depends on the interviewer’s judgment.

Quota sampling can also be considered a nonprobability version of stratified sampling.

Quota sampling consists of three steps:

(a) We select the control variables or the population’s characteristics considered relevant for the study in question;
(b) We determine the percentage of the population (%) for each one of the relevant variable categories;
(c) We establish the size of the quotas (number of people to be interviewed that have the characteristics needed) for each interviewer, so that the sample can have the same proportions as the population.

The main advantages of quota sampling are the low costs, speed, and convenience or ease with which the interviewer can select elements. However, since the selection of elements is not random, there are no guarantees that the sample will be representative of the population. Hence, it is not possible to generalize the results of the survey to the population.

Example 7.9

Quota Sampling

We would like to carry out a municipal election poll in a municipality with 14,244 voters. The main objective of the survey is to identify how people intend to vote based on their gender and age group. Table 7.E.3 shows the absolute frequencies for each pair of categories of the variables analyzed. Apply quota sampling, considering that the sample size is 200 voters and that there are two interviewers.

Table 7.E.3

Absolute Frequencies for Each Pair of Categories
Age Group	Male	Female	Total
16 and 17	50	48	98
from 18 to 24	1097	1063	2160
from 25 to 44	3409	3411	6820
from 45 to 69	2269	2207	4476
> 69	359	331	690
Total	7184	7060	14,244

Solution

(a) The variables that are relevant for the study are gender and age;
(b) The percentage of the population (%) for each pair of categories of analyzed variables is shown in Table 7.E.4.

(c) If we multiply each cell in Table 7.E.4 by the sample size (200), we get the dimensions of the quotas that compose the global sample, as shown in Table 7.E.5.

Table 7.E.4

Percentage of the Population for Each Pair of Categories
Age Group	Male	Female	Total
16 and 17	0.35%	0.34%	0.69%
from 18 to 24	7.70%	7.46%	15.16%
from 25 to 44	23.93%	23.95%	47.88%
from 45 to 69	15.93%	15.49%	31.42%
> 69	2.52%	2.32%	4.84%
% of the Total	50.44%	49.56%	100.00%

Table 7.E.5

Dimensions of the Quotas
Age Group	Male	Female	Total
16 and 17	1	1	2
from 18 to 24	16	15	31
from 25 to 44	48	48	96
from 45 to 69	32	31	63
> 69	5	5	10
Total	102	100	202

Considering that there are two interviewers, the quota for each one will be:

Table 7.E.6

Dimensions of the Quotas per Interviewer
Age Group	Male	Female	Total
16 and 17	1	1	2
from 18 to 24	8	8	16
from 25 to 44	24	24	48
from 45 to 69	16	16	32
> 69	3	3	6
Total	52	52	104

Note: The data in Tables 7.E.5 and 7.E.6 were rounded up, resulting in a total number of 202 voters in Table 7.E.5 and 104 voters in Table 7.E.6.
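Steps (b) and (c) of Example 7.9 can be checked with a short Python sketch. The dictionary layout and variable names are our own. The sketch recomputes the percentages of Table 7.E.4 from the counts in Table 7.E.3 and leaves the quotas unrounded, since the way fractional quotas are rounded is a choice made by the analyst.

```python
# Absolute frequencies from Table 7.E.3: age group -> (male, female).
counts = {
    "16 and 17": (50, 48),
    "from 18 to 24": (1097, 1063),
    "from 25 to 44": (3409, 3411),
    "from 45 to 69": (2269, 2207),
    "> 69": (359, 331),
}

sample_size = 200

# Population total, computed from the cells of the table.
total = sum(male + female for male, female in counts.values())

# Step (b): percentage of the population in each pair of categories.
shares = {
    group: (round(100 * male / total, 2), round(100 * female / total, 2))
    for group, (male, female) in counts.items()
}

# Step (c): unrounded quotas, in the same proportions as the population.
raw_quotas = {
    group: (sample_size * male / total, sample_size * female / total)
    for group, (male, female) in counts.items()
}

for group in counts:
    print(group, shares[group], raw_quotas[group])
```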

7.3.4 Geometric Propagation or Snowball Sampling

Geometric propagation or snowball sampling is widely used when the elements of the population are rare, difficult to access, or unknown.

In this method, we must identify one or more individuals from the target population, and these will identify the other individuals that are in the same population. The process is repeated until the objective proposed is achieved, that is, the point of saturation. The point of saturation is reached when the last respondents do not add new relevant information to the research, thus, repeating the content of previous interviews.

As advantages, we can mention: a) it allows the researcher to find the desired characteristic in the population; b) it is easy to apply, because the recruiting is done through referrals from other people who are in the population; c) low cost, because we need less planning and people; and d) it is efficient to enter populations that are difficult to access.

Example 7.10

Snowball Sampling

A company is recruiting professionals with a specific profile. The group hired initially recommends other professionals with the same profile. The process is repeated until the number of employees needed is hired. Therefore, we have an example of snowball sampling.
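The referral process can be sketched in Python as a breadth-first walk over a referral list. Everything in this snippet is hypothetical: the names, the `referrals` dictionary, and the stopping rule, which uses a target sample size as a simple stand-in for the point of saturation.

```python
from collections import deque


def snowball_sample(referrals, seeds, target_size):
    """Follow referrals from the seeds until target_size people are recruited
    or nobody new is referred."""
    sample = []
    seen = set(seeds)
    queue = deque(dict.fromkeys(seeds))  # seeds in order, without duplicates
    while queue and len(sample) < target_size:
        person = queue.popleft()
        sample.append(person)
        # Each recruited person indicates others from the same population.
        for referred in referrals.get(person, []):
            if referred not in seen:
                seen.add(referred)
                queue.append(referred)
    return sample


# Hypothetical referral list: person -> people that person recommends.
referrals = {
    "Ana": ["Bruno", "Carla"],
    "Bruno": ["Carla", "Diego"],
    "Carla": ["Elisa"],
    "Diego": [],
    "Elisa": ["Ana"],
}

print(snowball_sample(referrals, ["Ana"], 4))
print(snowball_sample(referrals, ["Ana"], 10))
```

The second call stops before reaching 10 people because the referrals run out, which illustrates that the final sample depends entirely on who the initial individuals know.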

7.4 Sample Size

According to Cabral (2006), there are six decisive factors when calculating the sample size:

1) Characteristics of the population, such as, variance (σ²) and dimension (N);
2) Sample distribution of the estimator used;
3) The accuracy and reliability required in the results, being necessary to specify the estimation error (B), which is the maximum difference that the researcher accepts between the population parameter and the estimate obtained from the sample;
4) The costs: the larger the sample size, the higher the costs;
5) Costs vs. sample error: must we select a larger sample to reduce the sample error or must we reduce the sample size in order to minimize the resources and efforts necessary, thus ensuring better control for the interviewers, a higher response rate, and a precise and better processing of the information?
6) The statistical techniques that will be used: some statistical techniques demand larger samples than others.

The sample selected must be representative of the population. Based on Ferrão et al. (2001), Bolfarine and Bussab (2005), and Martins and Domingues (2011), this section discusses how to calculate the sample size for the mean (a quantitative variable) and the proportion (a binary variable) of finite and infinite populations, with a maximum estimation error B, and for each type of random sampling (simple, systematic, stratified, and cluster).

In the case of nonrandom samples, either we set the sample size based on a possible budget or we adopt a certain dimension that has already been used successfully in previous studies with the same characteristics. A third alternative would be to calculate the size of a random sample and use that dimension as a reference.

7.4.1 Size of a Simple Random Sample

This section discusses how to calculate the size of a simple random sample to estimate the mean (a quantitative variable) and the proportion (a binary variable) of finite and infinite populations, with a maximum estimation error B.

The estimation error (B) for the mean is the maximum difference that the researcher accepts between μ (population mean) and $\bar{X}$ (sample mean), that is, $B \geq |μ - \bar{X}|$ .

On the other hand, the estimation error (B) for the proportion is the maximum difference that the researcher accepts between p (proportion of the population) and $\hat{p}$ (proportion of the sample), that is, $B \geq |p - \hat{p}|$ .

7.4.1.1 Sample Size to Estimate the Mean of an Infinite Population

If the variable chosen is quantitative and the population infinite, the size of a simple random sample, where $P (|\bar{X} - μ| \leq B) = 1 - α$ , can be calculated as follows:

$n = \frac{σ^{2}}{B^{2} / z_{α}^{2}}$

(7.5)

where:

σ²: population variance;
B: maximum estimation error;
z_α: abscissa (coordinate) of the standard normal distribution, at the significance level α.

According to Bolfarine and Bussab (2005), to determine the sample size it is necessary to set the maximum estimation error (B), the significance level α (translated by the value of z_α), and to have some previous knowledge of the population variance (σ²). The first two are set by the researcher, while the third demands more work.

When we do not know σ², its value must be replaced by a reasonable initial estimate. In many cases, a pilot sample can provide sufficient information about the population. In other cases, sample surveys done previously about the population can also provide satisfactory initial estimates for σ². Finally, some authors suggest the use of an approximate value for the standard deviation, given by σ ≅ range/4.
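Expression (7.5) can be sketched in Python as follows. The function names and the numerical inputs (σ² = 100, B = 2) are hypothetical. The sketch assumes that z_α is the two-sided critical value of the standard normal distribution (about 1.96 for a 95% confidence level) and rounds the result up to the next integer.

```python
import math
from statistics import NormalDist


def z_value(confidence):
    """Two-sided critical value of the standard normal distribution
    (an assumption about how z_alpha is read from the table)."""
    alpha = 1 - confidence
    return NormalDist().inv_cdf(1 - alpha / 2)


def n_mean_infinite(sigma2, B, confidence=0.95):
    """Expression (7.5): sample size for the mean of an infinite population."""
    z = z_value(confidence)
    return math.ceil(sigma2 / (B ** 2 / z ** 2))


# Hypothetical inputs: variance of 100 and a maximum estimation error of 2.
print(round(z_value(0.95), 2))
print(n_mean_infinite(sigma2=100, B=2))
```

Halving B roughly quadruples the sample size, since B enters the expression squared.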

7.4.1.2 Sample Size to Estimate the Mean of a Finite Population

If the variable chosen is quantitative and the population finite, the size of a simple random sample, where $P (|\bar{X} - μ| \leq B) = 1 - α$ , can be calculated as follows:

$n = \frac{N . σ^{2}}{(N - 1) . \frac{B^{2}}{z_{α}^{2}} + σ^{2}}$

(7.6)

where:

N: size of the population;
σ²: population variance;
B: maximum estimation error;
z_α: coordinate of the standard normal distribution, at the significance level α.
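
A short algebraic check, added here as an illustration, shows how Expression (7.6) relates to Expression (7.5). Let $n_{0} = σ^{2} . z_{α}^{2} / B^{2}$ be the size given by (7.5); the symbol $n_{0}$ is introduced only for this remark. Dividing the numerator and the denominator of (7.6) by $B^{2} / z_{α}^{2}$ gives:

$n = \frac{N . n_{0}}{(N - 1) + n_{0}} = \frac{n_{0}}{1 + \frac{n_{0} - 1}{N}}$

Hence, whenever $n_{0} \geq 1$ , the finite-population size never exceeds $n_{0}$ , and it approaches $n_{0}$ as N grows, which is consistent with treating very large populations as infinite.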

7.4.1.3 Sample Size to Estimate the Proportion of an Infinite Population

If the variable chosen is binary and the population infinite, the size of a simple random sample, where $P (|\hat{p} - p| \leq B) = 1 - α$ , can be calculated as follows:

$n = \frac{p . q}{B^{2} / z_{α}^{2}}$

(7.7)

where:

p: proportion of the population that contains the characteristic desired;
$q = 1 - p;$
B: maximum estimation error;
z_α: coordinate of the standard normal distribution, at the significance level α.

In practice, we do not know the value of p and must, therefore, use an estimate ( $\hat{p}$ ). If no estimate is available either, we assume that $\hat{p} = 0.50$ , the value that maximizes the product p . q, hence obtaining a conservative size, that is, one at least as large as what is necessary to ensure the accuracy required.
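The conservative choice $\hat{p} = 0.50$ can be illustrated numerically. The sketch below evaluates Expression (7.7) over a grid of hypothetical values of p; the function name `n_proportion_infinite` and the error B = 0.05 are assumptions made only for this illustration.

```python
import math

Z_95 = 1.96      # tabulated z for a 95% confidence level
B = 0.05         # hypothetical maximum estimation error


def n_proportion_infinite(p, max_error, z):
    """Expression (7.7): n = p*q / (B^2 / z^2), before rounding up."""
    return p * (1 - p) / (max_error ** 2 / z ** 2)


# Sample size for several hypothetical values of p (0.1, 0.2, ..., 0.9).
sizes = {k / 10: math.ceil(n_proportion_infinite(k / 10, B, Z_95)) for k in range(1, 10)}
for p, n in sizes.items():
    print(f"p = {p:.1f} -> n = {n}")

# The largest sample size on the grid occurs at p = 0.5.
p_worst = max(sizes, key=sizes.get)
```

Because p . q = p . (1 - p) reaches its maximum at p = 0.5, any other value of p leads to a smaller n for the same B and z_α.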

7.4.1.4 Sample Size to Estimate the Proportion of a Finite Population

If the variable chosen is binary and the population finite, the size of a simple random sample, where $P (|\hat{p} - p| \leq B) = 1 - α$ , can be calculated as follows:

$n = \frac{N . p . q}{(N - 1) . \frac{B^{2}}{z_{α}^{2}} + p . q}$

(7.8)

where:

N: size of the population;
p: proportion of the population that contains the characteristic desired;
$q = 1 - p;$
B: maximum estimation error;
z_α: coordinate of the standard normal distribution, at the significance level α.

Example 7.11

Calculating the Size of a Simple Random Sample

Consider the population of residents in a condominium (N = 540). We would like to estimate the average age of these residents. Based on previous surveys, we can obtain an estimate for σ² of 463.32. Assume that a simple random sample will be drawn from the population. Requiring that the sample mean differ from the real population mean by 4 years at the most, with a confidence level of 95%, determine the sample size to be collected.

Solution

The value of z_α for α = 5% (a bilateral test) is 1.96. From expression (7.6), the sample size is:

$n = \frac{N . σ^{2}}{(N - 1) . \frac{B^{2}}{z_{α}^{2}} + σ^{2}} = \frac{540 \times 463.32}{539 \times \frac{4^{2}}{{1.96}^{2}} + 463.32} = 92.38 ≅ 93$

Therefore, if we collect a simple random sample of at least 93 residents from the population, we can infer, with a confidence level of 95%, that the sample mean ( $\bar{X}$ ) will differ by 4 years, at the most, from the real population mean (μ).
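The arithmetic of this example can be checked with a few lines of Python. This is only a sketch: the function name `n_mean_finite` is illustrative, and the inputs are the ones given in the example.

```python
import math


def n_mean_finite(pop_size, variance, max_error, z):
    """Expression (7.6) for a simple random sample from a finite population."""
    d = max_error ** 2 / z ** 2
    return pop_size * variance / ((pop_size - 1) * d + variance)


# Inputs of Example 7.11: N = 540, sigma^2 = 463.32, B = 4, z = 1.96.
n_raw = n_mean_finite(pop_size=540, variance=463.32, max_error=4, z=1.96)
n = math.ceil(n_raw)
print(round(n_raw, 2), n)
```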

Example 7.12

Calculating the Size of a Simple Random Sample

We would like to estimate the proportion of voters who are dissatisfied with a certain politician’s administration. Neither the real proportion nor an estimate of it is known. Assuming that a simple random sample will be drawn from an infinite population, with a maximum estimation error of 2% and a significance level of 5%, determine the sample size.

Solution

Since we know neither the real value of p nor its estimate, let’s assume that $\hat{p} = 0.50$ . Applying Expression (7.7) to estimate the proportion of an infinite population, we have:

$n = \frac{p . q}{B^{2} / z_{α}^{2}} = \frac{0.5 \times 0.5}{{0.02}^{2} / {1.96}^{2}} = 2{,}401$

Therefore, by randomly interviewing 2401 voters, we can infer the real proportion of voters who are dissatisfied, with a maximum estimation error of 2%, and a confidence level of 95%.
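As a hedged illustration, the snippet below reproduces this result and shows how strongly the sample size depends on B. The function name and the second scenario (B = 1%) are assumptions added for illustration only.

```python
import math


def n_proportion_infinite(p, max_error, z):
    """Expression (7.7), before rounding up."""
    return p * (1 - p) / (max_error ** 2 / z ** 2)


n_2pct = n_proportion_infinite(0.5, 0.02, 1.96)   # setting of Example 7.12
n_1pct = n_proportion_infinite(0.5, 0.01, 1.96)   # hypothetical tighter error

# Round to 6 decimals before ceil so floating-point noise cannot push
# an exact result such as 2401 up to 2402.
size_2pct = math.ceil(round(n_2pct, 6))
size_1pct = math.ceil(round(n_1pct, 6))
print(size_2pct, size_1pct)
```

Since n is proportional to 1/B², halving the maximum estimation error multiplies the required sample size by four.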

7.4.2 Size of the Systematic Sample

In systematic sampling, we use the same expressions as in simple random sampling (as studied in Section 7.4.1), according to the type of variable (quantitative or binary) and population (infinite or finite).

7.4.3 Size of the Stratified Sample

This section discusses how to calculate the size of a stratified sample to estimate the mean (a quantitative variable) and the proportion (a binary variable) of finite and infinite populations, with a maximum estimation error B.

The estimation error (B) for the mean is the maximum difference that the researcher accepts between μ (population mean) and $\bar{X}$ (sample mean), that is, $B \geq |μ - \bar{X}|$ .

Let’s use the following notation to calculate the size of the stratified sample:

k: number of strata;
N_i: size of stratum i, i = 1, 2,..., k;
N = N₁ + N₂ + … + N_k (population size);
W_i = N_i/N (weight or proportion of stratum i, with $\sum_{i = 1}^{k} W_{i} = 1$ );
μ_i: population mean of stratum i;
σ_i²: population variance of stratum i;
n_i: number of elements randomly selected from stratum i;
n = n₁ + n₂ + … + n_k (sample size);
${\bar{X}}_{i}$ : sample mean of stratum i;
S_i²: sample variance of stratum i;
p_i: proportion of elements that have the characteristic desired in stratum i;
$q_{i} = 1 - p_{i} .$

7.4.3.1 Sample Size to Estimate the Mean of an Infinite Population

If the variable chosen is quantitative and the population infinite, the size of the stratified sample, where $P (|\bar{X} - μ| \leq B) = 1 - α$ , can be calculated as:

$n = \frac{\sum_{i = 1}^{k} W_{i} . σ_{i}^{2}}{B^{2} / z_{α}^{2}}$

(7.9)

where:

W_i = N_i/N (weight or proportion of stratum i, where $\sum_{i = 1}^{k} W_{i} = 1$ );
σ_i²: population variance of stratum i;
B: maximum estimation error;
z_α: coordinate of the standard normal distribution, at the significance level α.

7.4.3.2 Sample Size to Estimate the Mean of a Finite Population

If the variable chosen is quantitative and the population finite, the size of the stratified sample, where $P (|\bar{X} - μ| \leq B) = 1 - α$ , can be calculated as:

$n = \frac{\sum_{i = 1}^{k} N_{i}^{2} . σ_{i}^{2} / W_{i}}{N^{2} . \frac{B^{2}}{z_{α}^{2}} + \sum_{i = 1}^{k} N_{i} . σ_{i}^{2}}$

(7.10)

where:

N_i: size of stratum i, i = 1, 2,..., k;
σ_i²: population variance of stratum i;
W_i = N_i/N (weight or proportion of stratum i, where $\sum_{i = 1}^{k} W_{i} = 1$ );
N: size of the population;
B: maximum estimation error;
z_α: coordinate of the standard normal distribution, at the significance level α.
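
A short simplification, added here as an illustration, makes Expression (7.10) easier to evaluate when the weights are exactly $W_{i} = N_{i} / N$ . Since $N_{i}^{2} / W_{i} = N . N_{i}$ :

$n = \frac{N . \sum_{i = 1}^{k} N_{i} . σ_{i}^{2}}{N^{2} . \frac{B^{2}}{z_{α}^{2}} + \sum_{i = 1}^{k} N_{i} . σ_{i}^{2}} = \frac{\sum_{i = 1}^{k} W_{i} . σ_{i}^{2}}{\frac{B^{2}}{z_{α}^{2}} + \frac{1}{N} . \sum_{i = 1}^{k} W_{i} . σ_{i}^{2}}$

As N grows, the last term of the denominator vanishes and the result reduces to Expression (7.9).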

7.4.3.3 Sample Size to Estimate the Proportion of an Infinite Population

If the variable chosen is binary and the population infinite, the size of the stratified sample, where $P (|\hat{p} - p| \leq B) = 1 - α$ , can be calculated as:

$n = \frac{\sum_{i = 1}^{k} W_{i} . p_{i} . q_{i}}{B^{2} / z_{α}^{2}}$

(7.11)

where:

W_i = N_i/N (weight or proportion of stratum i, where $\sum_{i = 1}^{k} W_{i} = 1$ );
p_i: proportion of elements that have the characteristic desired in stratum i;
$q_{i} = 1 - p_{i};$
B: maximum estimation error;
z_α: coordinate of the standard normal distribution, at the significance level α.

7.4.3.4 Sample Size to Estimate the Proportion of a Finite Population

If the variable chosen is binary and the population finite, the size of the stratified sample, where $P (|\hat{p} - p| \leq B) = 1 - α$ , can be calculated as:

$n = \frac{\sum_{i = 1}^{k} N_{i}^{2} . p_{i} . q_{i} / W_{i}}{N^{2} . \frac{B^{2}}{z_{α}^{2}} + \sum_{i = 1}^{k} N_{i} . p_{i} . q_{i}}$

(7.12)

where:

N_i: size of stratum i, i = 1, 2,..., k;
p_i: proportion of elements that have the characteristic desired in stratum i;
$q_{i} = 1 - p_{i};$
W_i = N_i/N (weight or proportion of stratum i, where $\sum_{i = 1}^{k} W_{i} = 1$ );
N: size of the population;
B: maximum estimation error;
z_α: coordinate of the standard normal distribution, at the significance level α.

Example 7.13

Calculating the Size of a Stratified Sample

A university has 11,886 students enrolled in 14 undergraduate courses, divided into three major areas: Exact Sciences, Human Sciences, and Biological Sciences. Table 7.E.7 shows the number of students enrolled per area. A survey will be carried out in order to estimate the average time students spend studying per week (in hours). Based on pilot samples, we obtain the following estimates for the variances in the areas of Exact, Human, and Biological Sciences: 124.36, 153.22, and 99.87, respectively. The samples selected must be proportional to the number of students per area. Determine the sample size, considering an estimation error of 0.8, and a confidence level of 95%.

Table 7.E.7

Number of students enrolled per area
Area	Number of students enrolled
Exact Sciences	5285
Human Sciences	3877
Biological Sciences	2724
Total	11,886

Solution

From the data, we have:

$k = 3, N_{1} = 5{,}285, N_{2} = 3{,}877, N_{3} = 2{,}724, N = 11{,}886, B = 0.8$

$W_{1} = \frac{5{,}285}{11{,}886} ≅ 0.44, W_{2} = \frac{3{,}877}{11{,}886} ≅ 0.33, W_{3} = \frac{2{,}724}{11{,}886} ≅ 0.23$

For α = 5%, we have z_α = 1.96. Based on the pilot sample, we must use the estimates for σ₁², σ₂², and σ₃². The sample size is calculated from Expression (7.10):

$n = \frac{\sum_{i = 1}^{k} N_{i}^{2} σ_{i}^{2} / W_{i}}{N^{2} \frac{B^{2}}{z_{α}^{2}} + \sum_{i = 1}^{k} N_{i} σ_{i}^{2}}$

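$n = \frac{(\frac{5{,}285^{2} \times 124.36}{0.44} + \frac{3{,}877^{2} \times 153.22}{0.33} + \frac{2{,}724^{2} \times 99.87}{0.23})}{11{,}886^{2} \times \frac{{0.8}^{2}}{{1.96}^{2}} + (5{,}285 \times 124.36 + 3{,}877 \times 153.22 + 2{,}724 \times 99.87)} = 722.52 ≅ 723$

The value 722.52 is obtained with the unrounded weights W_i = N_i/N; with the two-decimal weights shown, the quotient is approximately 722.09, which rounds up to the same sample size.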

Since the sampling is proportional, we can obtain the size of each stratum by using the expression n_i = W_i × n (i = 1, 2, 3), with the unrounded weights W_i = N_i/N:

$n_{1} = W_{1} \times n = \frac{5{,}285}{11{,}886} \times 723 = 321.48 ≅ 322$

$n_{2} = W_{2} \times n = \frac{3{,}877}{11{,}886} \times 723 = 235.83 ≅ 236$

$n_{3} = W_{3} \times n = \frac{2{,}724}{11{,}886} \times 723 = 165.70 ≅ 166$

Thus, to carry out the survey, we must select 322 students from the area of Exact Sciences, 236 from the area of Human Sciences, and 166 from Biological Sciences. From the sample selected, we can infer, with a 95% confidence level, that the difference between the sample mean and the real population mean will be a maximum of 0.8 hours.
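The calculation in this example can be sketched in Python as follows. The variable names are illustrative, and the weights are kept unrounded (W_i = N_i/N), which is an assumption about how the figures of the example were obtained.

```python
import math

sizes = [5285, 3877, 2724]             # N_i from Table 7.E.7
variances = [124.36, 153.22, 99.87]    # pilot estimates of sigma_i^2
B, z = 0.8, 1.96

N = sum(sizes)
weights = [size / N for size in sizes]  # unrounded W_i = N_i / N

# Expression (7.10)
numerator = sum(size ** 2 * var / w for size, var, w in zip(sizes, variances, weights))
denominator = N ** 2 * B ** 2 / z ** 2 + sum(size * var for size, var in zip(sizes, variances))
n_raw = numerator / denominator
n = math.ceil(n_raw)

# Proportional allocation n_i = W_i * n, each stratum rounded up.
allocation = [math.ceil(w * n) for w in weights]
print(round(n_raw, 2), n, allocation)
```

Because each stratum is rounded up separately, the allocated sizes add up to 322 + 236 + 166 = 724, one more than n = 723.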

Example 7.14

Calculating the Size of a Stratified Sample

Consider the same population from the previous example; however, the objective now is to estimate the proportion of students who work, for each area. Based on a pilot sample, we have the following estimates per area: ${\hat{p}}_{1} = 0.3$ (Exact Sciences), ${\hat{p}}_{2} = 0.6$ (Human Sciences), and ${\hat{p}}_{3} = 0.4$ (Biological Sciences). The type of sampling used in this case is uniform. Determine the sample size, considering an estimation error of 3%, and a 90% confidence level.

Solution

Since we do not know the real value of p for each area, we can use its estimate. For a 90% confidence level, we have z_α = 1.645. Applying Expression (7.12) from the stratified sampling to estimate the proportion of a finite population, we have:

$n = \frac{\sum_{i = 1}^{k} N_{i}^{2} . p_{i} . q_{i} / W_{i}}{N^{2} . \frac{B^{2}}{z_{α}^{2}} + \sum_{i = 1}^{k} N_{i} . p_{i} . q_{i}}$

$n = \frac{5{,}285^{2} \times 0.3 \times 0.7 / 0.44 + 3{,}877^{2} \times 0.6 \times 0.4 / 0.33 + 2{,}724^{2} \times 0.4 \times 0.6 / 0.23}{11{,}886^{2} \times \frac{{0.03}^{2}}{{1.645}^{2}} + 5{,}285 \times 0.3 \times 0.7 + 3{,}877 \times 0.6 \times 0.4 + 2{,}724 \times 0.4 \times 0.6}$

$n = 644.54 ≅ 645$

Since the sampling is uniform, we have n₁ = n₂ = n₃ = 215.

Therefore, to carry out the survey, we must randomly select 215 students from each area. From the sample selected, we can infer, with a 90% confidence level, that the difference between the sample proportion and the real population proportion will be a maximum of 3%.
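A minimal Python sketch of the same calculation is given below. It follows the steps of the example, with unrounded weights W_i = N_i/N (an assumption about how the printed value was obtained) and illustrative variable names.

```python
import math

sizes = [5285, 3877, 2724]     # N_i from Table 7.E.7
p_hat = [0.3, 0.6, 0.4]        # pilot estimates per area
B, z = 0.03, 1.645

N = sum(sizes)
weights = [size / N for size in sizes]
pq = [p * (1 - p) for p in p_hat]

# Expression (7.12)
numerator = sum(size ** 2 * v / w for size, v, w in zip(sizes, pq, weights))
denominator = N ** 2 * B ** 2 / z ** 2 + sum(size * v for size, v in zip(sizes, pq))
n_raw = numerator / denominator
n = math.ceil(n_raw)

# Uniform allocation: the same number of students in every stratum.
per_stratum = math.ceil(n / len(sizes))
print(round(n_raw, 2), n, per_stratum)
```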

7.4.4 Size of a Cluster Sample

This section discusses how to calculate the size of a one-stage and a two-stage cluster sample.

Let’s consider the following notation to calculate the size of a cluster sample:

N: population size;
M: number of clusters into which the population was divided;
N_i: size of cluster i (i = 1, 2, ..., M);
n: sample size;
m: number of clusters drawn (m < M);
n_i: size of cluster i from the sample drawn in the first stage (i = 1, 2, ..., m), where n_i = N_i;
b_i: size of cluster i from the sample drawn in the second stage (i = 1, 2, ..., m), where b_i < n_i;
$\bar{N} = N / M$ (average size of the population clusters);
$\bar{n} = n / m$ (average size of the sample clusters);
X_ij: j-th observation in cluster i;
σ_dc²: population variance within the clusters;
σ_ec²: population variance between clusters;
σ_i²: population variance in cluster i;
μ_i: population mean in cluster i;
σ_c² = σ_dc² + σ_ec² (total population variance).

According to Bolfarine and Bussab (2005), the calculation of σ_dc² and σ_ec² is given by:

$σ_{dc}^{2} = \frac{\sum_{i = 1}^{M} \sum_{j = 1}^{N_{i}} {(X_{ij} - μ_{i})}^{2}}{N} = \frac{1}{M} \cdot \sum_{i = 1}^{M} \frac{N_{i}}{\bar{N}} \cdot σ_{i}^{2}$

(7.13)

$σ_{ec}^{2} = \frac{1}{N} . \sum_{i = 1}^{M} N_{i} . {(μ_{i} - μ)}^{2} = \frac{1}{M} . \sum_{i = 1}^{M} \frac{N_{i}}{\bar{N}} . {(μ_{i} - μ)}^{2}$

(7.14)

Assuming that all the clusters are the same size, the previous expressions can be summarized as follows:

$σ_{dc}^{2} = \frac{1}{M} . \sum_{i = 1}^{M} σ_{i}^{2}$

(7.15)

$σ_{ec}^{2} = \frac{1}{M} . \sum_{i = 1}^{M} {(μ_{i} - μ)}^{2}$

(7.16)
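
The decomposition σ_c² = σ_dc² + σ_ec² can be checked numerically. The sketch below applies Expressions (7.13) and (7.14) to a small, entirely hypothetical population with clusters of unequal size and compares the sum with the population variance computed directly; the data and variable names are assumptions made for illustration.

```python
from statistics import fmean, pvariance

# Hypothetical population split into M = 3 clusters of unequal size.
clusters = [
    [4.0, 6.0, 8.0],
    [5.0, 7.0],
    [9.0, 10.0, 11.0, 12.0],
]

all_values = [x for cluster in clusters for x in cluster]
N = len(all_values)
mu = fmean(all_values)

# Expression (7.13): within-cluster variance, weighted by cluster size.
var_within = sum(len(c) * pvariance(c) for c in clusters) / N
# Expression (7.14): between-cluster variance.
var_between = sum(len(c) * (fmean(c) - mu) ** 2 for c in clusters) / N

# Population variance computed directly from all the elements.
var_total = pvariance(all_values)
print(var_within, var_between, var_total)
```

When every cluster has the same size, the weights N_i/N̄ equal 1 and the two sums become the plain averages of Expressions (7.15) and (7.16).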

7.4.4.1 Size of a One-Stage Cluster Sample

This section discusses how to calculate the size of a one-stage cluster sample to estimate the mean (a quantitative variable) and the proportion (a binary variable) of finite and infinite populations, with a maximum estimation error B.

The estimation error (B) for the mean is the maximum difference that the researcher accepts between μ (population mean) and $\bar{X}$ (sample mean), that is, $B \geq |μ - \bar{X}|$ .

7.4.4.1.1 Sample Size to Estimate the Mean of an Infinite Population

If the variable chosen is quantitative and the population infinite, the number of the clusters drawn in the first stage (m), where $P (|\bar{X} - μ| \leq B) = 1 - α$ , can be calculated as follows:

$m = \frac{σ_{c}^{2}}{B^{2} / z_{α}^{2}}$

(7.17)

where:

σ_c² = σ_dc² + σ_ec², according to Expressions (7.13)–(7.16);
B: maximum estimation error;
z_α: coordinate of the standard normal distribution, at the significance level α.

If the clusters are the same size, Bolfarine and Bussab (2005) demonstrate that:

$m = \frac{σ_{ec}^{2}}{B^{2} / z_{α}^{2}}$

(7.18)

According to the authors, generally, σ_c² is unknown and has to be estimated from pilot samples or obtained from previous sample surveys.

7.4.4.1.2 Sample Size to Estimate the Mean of a Finite Population

If the variable chosen is quantitative and the population finite, the number of clusters drawn in the first stage (m), where $P (|\bar{X} - μ| \leq B) = 1 - α$ , can be calculated as follows:

$m = \frac{M . σ_{c}^{2}}{M . \frac{B^{2} . {\bar{N}}^{2}}{z_{α}^{2}} + σ_{c}^{2}}$

(7.19)

where:

M: number of clusters into which the population was divided;
σ_c² = σ_dc² + σ_ec², according to Expressions (7.13)–(7.16);
B: maximum estimation error;
$\bar{N} = N / M$ (average size of the population clusters);
z_α: coordinate of the standard normal distribution, at the significance level α.

7.4.4.1.3 Sample Size to Estimate the Proportion of an Infinite Population

If the variable chosen is binary and the population infinite, the number of clusters drawn in the first stage (m), where $P (|\hat{p} - p| \leq B) = 1 - α$ , can be calculated as follows:

$m = \frac{1 / M . \sum_{i = 1}^{M} \frac{N_{i}}{\bar{N}} . p_{i} . q_{i}}{B^{2} / z_{α}^{2}}$

(7.20)

where:

M: number of clusters into which the population was divided;
N_i: size of cluster i (i = 1, 2, ..., M);
$\bar{N} = N / M$ (average size of the population clusters);
p_i: proportion of elements that have the characteristic desired in cluster i;
$q_{i} = 1 - p_{i};$
B: maximum estimation error;
z_α: coordinate of the standard normal distribution, at the significance level α.

7.4.4.1.4 Sample Size to Estimate the Proportion of a Finite Population

If the variable chosen is binary and the population finite, the number of clusters drawn in the first stage (m), where $P (|\hat{p} - p| \leq B) = 1 - α$ , can be calculated as follows:

$m = \frac{\sum_{i = 1}^{M} \frac{N_{i}}{\bar{N}} . p_{i} . q_{i}}{M . \frac{B^{2} . {\bar{N}}^{2}}{z_{α}^{2}} + 1 / M . \sum_{i = 1}^{M} \frac{N_{i}}{\bar{N}} . p_{i} . q_{i}}$

(7.21)

where:

M: number of clusters into which the population was divided;
N_i: size of cluster i (i = 1, 2, ..., M);
$\bar{N} = N / M$ (average size of the population clusters);
p_i: proportion of elements that have the characteristic desired in cluster i;
$q_{i} = 1 - p_{i};$
B: maximum estimation error;
z_α: coordinate of the standard normal distribution, at the significance level α.

7.4.4.2 Size of a Two-Stage Cluster Sample

In this case, we assume that all the clusters are the same size. Based on Bolfarine and Bussab (2005), let’s consider the following linear cost function:

$C = c_{1} . n + c_{2} . b$

(7.22)

where:

c₁: observation cost of one unit from the first stage;
c₂: observation cost of one unit from the second stage;
n: sample size in the first stage;
b: sample size in the second stage.

The optimal size for b that minimizes the linear cost function is given by:

$b^{*} = \frac{σ_{dc}}{σ_{ec}} . \sqrt{\frac{c_{1}}{c_{2}}}$

(7.23)
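
As a brief illustration of Expression (7.23), the sketch below evaluates b* for hypothetical variances and costs. The function name and all the input values are assumptions and do not come from the original text.

```python
import math


def optimal_second_stage_size(var_within, var_between, cost_first, cost_second):
    """Expression (7.23): b* = (sigma_dc / sigma_ec) * sqrt(c1 / c2)."""
    return math.sqrt(var_within / var_between) * math.sqrt(cost_first / cost_second)


# Hypothetical inputs: sigma_dc^2 = 36, sigma_ec^2 = 4, c1 = 9, c2 = 1.
b_star = optimal_second_stage_size(36.0, 4.0, 9.0, 1.0)
print(b_star)
```

According to the expression, b* increases when the elements within a cluster are more heterogeneous (larger σ_dc relative to σ_ec) or when observing a first-stage unit is expensive relative to a second-stage unit.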

Example 7.15

Calculating the Size of a Cluster Sample

Consider the members of a certain club in Sao Paulo (N = 4,500). We would like to estimate the average evaluation score (0 to 10) given by these members regarding the main features of the club. The population is divided into 10 groups of 450 elements each, based on their membership number. The estimate of the mean and of the population variance per group, based on previous surveys, can be seen in Table 7.E.8. Assuming that the cluster sampling is based on a single stage, determine the number of clusters that must be drawn, considering B = 2% and α = 1%.

Table 7.E.8

Mean and population variance per group
i	1	2	3	4	5	6	7	8	9	10
μ_i	7.4	6.6	8.1	7.0	6.7	7.3	8.1	7.5	6.2	6.9
σ_i²	22.5	36.7	29.6	33.1	40.8	51.7	39.7	30.6	40.5	42.7

Solution

From the data given to us, we have:

$N = 4{,}500, M = 10, \bar{N} = 4{,}500 / 10 = 450, B = 0.02, \text{ and } z_{α} = 2.575.$

Since all the clusters are the same size, the calculation of σ_dc² and σ_ec² is given by:

$σ_{dc}^{2} = \frac{1}{M} . \sum_{i = 1}^{M} σ_{i}^{2} = \frac{22.5 + 36.7 + \dots + 42.7}{10} = 36.79$

$σ_{ec}^{2} = \frac{1}{M} . \sum_{i = 1}^{M} {(μ_{i} - μ)}^{2} = \frac{{(7.4 - 7.18)}^{2} + \dots + {(6.9 - 7.18)}^{2}}{10} = 0.35$

Therefore, σ_c² = σ_dc² + σ_ec² = 36.79 + 0.35 = 37.14

The number of clusters to be drawn in one stage, for a finite population, is given by Expression (7.19):

$m = \frac{M . σ_{c}^{2}}{M . \frac{B^{2} . {\bar{N}}^{2}}{z_{α}^{2}} + σ_{c}^{2}} = \frac{10 \times 37.14}{10 \times \frac{{0.02}^{2} \times 450^{2}}{{2.575}^{2}} + 37.14} = 2.33 ≅ 3$

Therefore, the population of N = 4,500 members is divided into M = 10 clusters of the same size (N_i = 450, i = 1, ..., 10). From the total number of clusters, we must randomly draw m = 3 clusters. In one-stage cluster sampling, all the elements of each cluster drawn constitute the global sample (n = 450 × 3 = 1,350).

From the sample selected, we can infer, with a 99% confidence level, that the difference between the sample mean and the real population mean will be 2%, at the most.
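The steps of this example can be reproduced with the short Python sketch below, which uses the data of Table 7.E.8 and Expressions (7.15), (7.16), and (7.19). The variable names are illustrative.

```python
import math
from statistics import fmean

means = [7.4, 6.6, 8.1, 7.0, 6.7, 7.3, 8.1, 7.5, 6.2, 6.9]             # mu_i
variances = [22.5, 36.7, 29.6, 33.1, 40.8, 51.7, 39.7, 30.6, 40.5, 42.7]  # sigma_i^2
M, N_bar = 10, 450
B, z = 0.02, 2.575

mu = fmean(means)                                      # equal-size clusters
var_within = fmean(variances)                          # Expression (7.15)
var_between = fmean([(m_i - mu) ** 2 for m_i in means])  # Expression (7.16)
var_c = var_within + var_between

# Expression (7.19): number of clusters for a finite population.
m_raw = M * var_c / (M * B ** 2 * N_bar ** 2 / z ** 2 + var_c)
m = math.ceil(m_raw)
n = m * N_bar          # one-stage sampling keeps every element of each drawn cluster
print(round(var_within, 2), round(var_between, 2), round(m_raw, 2), m, n)
```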

Table 7.1 shows a summary of the expressions used to calculate the sample size for the mean (a quantitative variable) and the proportion (a binary variable) of finite and infinite populations, with a maximum estimation error B, and for each type of random sampling (simple, systematic, stratified, and cluster).

Table 7.1

Expressions to Calculate the Size of Random Samples
Type of Random Sample	Estimating the Mean (Infinite Population)	Estimating the Mean (Finite Population)	Estimating the Proportion (Infinite Population)	Estimating the Proportion (Finite Population)
Simple	$n = \frac{σ^{2}}{B^{2} / z_{α}^{2}}$	$n = \frac{N . σ^{2}}{(N - 1) . \frac{B^{2}}{z_{α}^{2}} + σ^{2}}$	$n = \frac{p . q}{B^{2} / z_{α}^{2}}$	$n = \frac{N . p . q}{(N - 1) . \frac{B^{2}}{z_{α}^{2}} + p . q}$
Systematic	$n = \frac{σ^{2}}{B^{2} / z_{α}^{2}}$	$n = \frac{N . σ^{2}}{(N - 1) . \frac{B^{2}}{z_{α}^{2}} + σ^{2}}$	$n = \frac{p . q}{B^{2} / z_{α}^{2}}$	$n = \frac{N . p . q}{(N - 1) . \frac{B^{2}}{z_{α}^{2}} + p . q}$
Stratified	$n = \frac{\sum_{i = 1}^{k} W_{i} . σ_{i}^{2}}{B^{2} / z_{α}^{2}}$	$n = \frac{\sum_{i = 1}^{k} N_{i}^{2} . σ_{i}^{2} / W_{i}}{N^{2} . \frac{B^{2}}{z_{α}^{2}} + \sum_{i = 1}^{k} N_{i} . σ_{i}^{2}}$	$n = \frac{\sum_{i = 1}^{k} W_{i} . p_{i} . q_{i}}{B^{2} / z_{α}^{2}}$	$n = \frac{\sum_{i = 1}^{k} N_{i}^{2} . p_{i} . q_{i} / W_{i}}{N^{2} . \frac{B^{2}}{z_{α}^{2}} + \sum_{i = 1}^{k} N_{i} . p_{i} . q_{i}}$
One-stage Cluster	$m = \frac{σ_{c}^{2}}{B^{2} / z_{α}^{2}}$	$m = \frac{M . σ_{c}^{2}}{M . \frac{B^{2} . {\bar{N}}^{2}}{z_{α}^{2}} + σ_{c}^{2}}$	$m = \frac{1 / M . \sum_{i = 1}^{M} \frac{N_{i}}{\bar{N}} . p_{i} . q_{i}}{B^{2} / z_{α}^{2}}$	$m = \frac{\sum_{i = 1}^{M} \frac{N_{i}}{\bar{N}} . p_{i} . q_{i}}{M . \frac{B^{2} . {\bar{N}}^{2}}{z_{α}^{2}} + 1 / M . \sum_{i = 1}^{M} \frac{N_{i}}{\bar{N}} . p_{i} . q_{i}}$

7.5 Final Remarks

It is rarely possible to obtain the exact distribution of a variable by examining all the elements of the population, because of the high costs, the time needed, and the difficulties in collecting the data. Therefore, the alternative is to select part of the elements of the population (a sample) and, after that, infer the properties of the whole (the population). Since the sample must be a good representative of the population, choosing the sampling technique is essential in this process.

Sampling techniques can be classified in two major groups: probability or random sampling and nonprobability or nonrandom sampling. Among the main random sampling techniques, we can highlight simple random sampling (with and without replacement), systematic, stratified, and cluster. The main nonrandom sampling techniques are convenience, judgmental or purposive, quota, and snowball sampling. Each one of these techniques has advantages and disadvantages, and choosing the best technique must take the characteristics of each study into consideration.

This chapter also discussed how to calculate the sample size for the mean and the proportion of finite and infinite populations, for each type of random sampling. In the case of nonrandom samples, the researcher must either establish the sample size based on a possible budget or adopt a certain dimension that has already been used successfully in previous studies with similar characteristics. Another alternative would be to calculate the size of a random sample and use it as a reference.

7.6 Exercises

1) Why is sampling important?
2) What are the differences between random and nonrandom sampling techniques? In what cases must they be used?
3) What is the difference between stratified and cluster sampling?
4) What are the advantages and limitations of each sampling technique?
5) What type of sampling is used in the EuroMillions Lottery?
6) To verify if a part meets certain quality specification demands, from every batch with 150 parts produced, we randomly pick a unit and inspect all the quality characteristics. What type of sampling should be used in this case?
7) Assume that the population of the city of Youngstown (OH) is divided by educational level. Thus, for each level, a percentage of the population will be interviewed. What type of sampling should be used in this case?
8) In a production line, one batch with 1500 parts is produced every hour. From each batch, we randomly pick a sample with 125 units. In each sample unit, we inspect all the quality characteristics to check whether the part is defective or not. What type of sampling should be used in this case?
9) The population of the city of Sao Paulo is divided into 96 districts. From this total, 24 districts will be randomly drawn and, for each one of them, a small sample of the population will be interviewed in a public opinion survey. What type of sampling should be used in this case?
10) We would like to estimate the illiteracy rate in a municipality with 4000 inhabitants who are 15 or over 15 years of age. Based on previous surveys, we can estimate that $\hat{p} = 0.24$ . A random sample will be drawn from the population. Assuming a maximum estimation error of 5%, and a 95% confidence level, what should the sample size be?
11) The population of a certain municipality with 120,000 inhabitants is divided into five regions (North, South, Center, East, and West). The table shows the number of inhabitants per region. A random sample will be collected in each region in order to estimate the average age of its inhabitants. The samples selected must be proportional to the number of inhabitants per region. Based on pilot samples, we obtain the following estimates for the variances in the five regions: 44.5 (North), 59.3 (South), 82.4 (Center), 66.2 (East), and 69.5 (West). Determine the sample size, considering an estimation error of 0.6 and a 99% confidence level.

Region	Inhabitants
North	14,060
South	19,477
Center	36,564
East	26,424
West	23,475

12) Consider a municipality with 120,000 inhabitants. We would like to estimate the percentage of the population that lives in urban and rural areas. The sampling plan used divides the municipality into 85 districts of different sizes. From all the districts, we would like to select some and, for each district chosen, all the inhabitants will be selected. The file Districts.xls shows the size of each district, as well as the estimated percentage of the urban and rural population. Determine the total number of districts to be drawn assuming a maximum estimation error of 10% and a 90% confidence level.
