The objective of this chapter is to present the main sampling concepts and methods, characterizing the differences between a population and a sample. The main random and nonrandom sampling techniques are described here, as well as their advantages and disadvantages. Thus, it is possible to select the most suitable sampling technique for the study we are interested in. For each type of random sampling, we calculate the sample size based on accuracy and on the confidence level we are interested in having.
Population; Sample; Sampling techniques; Random sampling; Nonrandom sampling; Sample size
Our reason becomes obscure when we consider that the countless fixed stars that shine in the sky do not have any other purpose besides illuminating worlds in which weeping and pain rule, and, in the best case scenario, only unpleasantness exists; at least, judging by the sample we know.
Arthur Schopenhauer
As discussed in the Introduction of this book, population is the set that has all the individuals, objects, or elements to be studied, which have one or more characteristics in common. A census is the study of data related to all the elements of the population.
According to Bruni (2011), populations can be finite or infinite. Finite populations have a limited size, allowing their elements to be counted; infinite populations, on the other hand, have an unlimited size, not allowing us to count their elements.
As examples of finite populations, we can mention the number of employees in a certain company, the number of members in a club, the number of products manufactured during a certain period, etc. When the number of elements in a population, even though they can be counted, is too high, we assume that the population is infinite. Examples of populations considered infinite are the number of inhabitants in the world, the number of residences in Rio de Janeiro, the number of points on a straight line, etc.
Therefore, there are situations in which a study with all the elements in a population is impossible or unwanted. Hence, the alternative is to extract a subset from the population under analysis, which is called a sample. The sample must be representative of the population being studied; hence the importance of this chapter. From the information gathered in the sample and using suitable statistical procedures, the results obtained can be used to generalize, infer, or draw conclusions regarding the population (statistical inference).
For Fávero et al. (2009) and Bussab and Morettin (2011), it is rarely possible to obtain the exact distribution of a variable, due to the high costs, the time needed and the difficulties in collecting the data. Hence, the alternative is to select part of the elements in the population (sample) and, after that, infer the properties for the whole (population).
Essentially, there are two types of sampling: (1) probability or random sampling, and (2) nonprobability or nonrandom sampling. In random sampling, samples are obtained randomly, that is, the probability of each element of the population being a part of the sample is the same. In nonrandom sampling, on the other hand, the probability of some or all the elements of the population being in the sample is unknown.
Fig. 7.1 shows the main random and nonrandom sampling techniques.
Fávero et al. (2009) show the advantages and disadvantages of random and nonrandom techniques. Regarding random sampling techniques, the main advantages are: a) the selection criteria of the elements are rigorously defined, not allowing the researchers’ or the interviewer’s subjectivity to interfere in the selection of the elements; b) the possibility to mathematically determine the sample size based on accuracy and on the confidence level desired for the results. On the other hand, the main disadvantages are: a) difficulty in obtaining current and complete listings or regions of the population; b) geographically speaking, a random selection can generate a highly disperse sample, increasing the costs, the time needed for the study, and the difficulty in collecting the data.
As regards nonrandom sampling techniques, the advantages are lower costs, less time to carry out the study, and less need of human resources. As disadvantages, we can mention: a) there are units in the population that cannot be chosen; b) a personal bias may happen; c) we do not know with what level of confidence the conclusions arrived at can be inferred for the population. These techniques do not use a random method to select the elements of the sample, so, there is no guarantee that the sample selected is a good representative of the population (Fávero et al., 2009).
Choosing the sampling technique must consider the goals of the survey, the acceptable error in the results, accessibility to the elements of the population, the desired representativeness, the time needed, and the availability of financial and human resources.
In this type of sampling, samples are obtained randomly, that is, the probability of each element of the population being a part of the sample is the same, and all of the samples selected are equally probable.
In this section, we will study the main probability or random sampling techniques: (a) simple random sampling, (b) systematic sampling, (c) stratified sampling, and (d) cluster sampling.
According to Bolfarine and Bussab (2005), simple random sampling (SRS) is the simplest and most important method for selecting a sample.
Consider a population or universe (U) with N elements:
U={1,2,…,N}
According to Bolfarine and Bussab (2005), planning and selecting the sample include the following steps:
According to Bolfarine and Bussab (2005), from a practical point of view, an SRS without replacement is much more interesting, because it satisfies the intuitive principle that we do not gain more information in case the same unit appears more than once in the sample. On the other hand, an SRS with replacement has mathematical and statistical advantages, such as, the independence between the units drawn. Let’s now study each of them.
According to Bolfarine and Bussab (2005), an SRS without replacement works as follows:
U={1,2,…,N}
In this type of sampling, there are $C_{N,n} = \binom{N}{n} = \frac{N!}{n!(N-n)!}$ possible samples of n elements that can be obtained from the population, and each sample has the same probability of being selected, $1/\binom{N}{n}$.
According to Bolfarine and Bussab (2005), an SRS with replacement works as follows:
U={1,2,…,N}
In this type of sampling, there are $N^n$ possible samples of n elements that can be obtained from the population, and each sample has the same probability $1/N^n$ of being selected.
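The two variants of simple random sampling can be sketched in Python as follows. This is only an illustration: the function names and the population of 10 units are hypothetical, and the standard-library `random` module stands in for whatever drawing mechanism a real study would use.

```python
import math
import random

def count_samples(N, n, replacement=False):
    """Number of possible samples of size n from a population of size N."""
    # With replacement: N**n ordered samples; without: C(N, n) unordered samples.
    return N ** n if replacement else math.comb(N, n)

def simple_random_sample(population, n, replacement=False, seed=None):
    """Draw a simple random sample of n units from `population`."""
    rng = random.Random(seed)
    if replacement:
        # Draws are independent, so the same unit may appear more than once.
        return rng.choices(population, k=n)
    # Without replacement, no unit can be drawn twice.
    return rng.sample(population, n)

U = list(range(1, 11))  # hypothetical population U = {1, ..., 10}
print(count_samples(10, 3))                    # 120
print(count_samples(10, 3, replacement=True))  # 1000
print(simple_random_sample(U, 3, seed=1))
print(simple_random_sample(U, 3, replacement=True, seed=1))
```

Fixing `seed` only makes the illustration reproducible; in practice the draw would not be seeded with a value chosen by the researcher.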
According to Costa Neto (2002), when the elements of the population are sorted and periodically removed, we have a systematic sampling. Hence, for example, in a production line, we can remove an element at every 50 items produced.
As advantages of systematic sampling, in comparison to simple random sampling, we can mention that it is carried out in a much faster and cheaper way, besides being less susceptible to errors made by the interviewer during the survey. The main disadvantage is the possibility of having variation cycles, especially if these cycles coincide with the sampling interval. For example, let’s suppose that at every 60 parts produced in a certain machine, one part is inspected; however, in this machine a certain flaw usually happens, so, at every 20 parts produced, one is defective. Since 60 is a multiple of 20, if the flaw recurs at a fixed position in the cycle, the inspected parts would be either all defective or all nondefective, and the sample would misrepresent the true defect rate.
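The effect of such a cycle can be illustrated with a small Python sketch. It assumes, for illustration only, a run of 6,000 parts in which the defective parts are exactly those whose position is a multiple of 20 (a true defect rate of 5%); the function name and its parameters are hypothetical.

```python
def inspected_defect_rate(total_parts, step, start, defect_period):
    """Share of inspected parts that are defective when every `step`-th part is inspected."""
    # Positions of the inspected parts: start, start + step, start + 2*step, ...
    inspected = range(start, total_parts + 1, step)
    # Assumption: a part is defective when its position is a multiple of defect_period.
    defective = [i for i in inspected if i % defect_period == 0]
    return len(defective) / len(inspected)

# Inspecting every 60th part, starting at a defective position or at any other one.
print(inspected_defect_rate(6000, step=60, start=20, defect_period=20))  # 1.0
print(inspected_defect_rate(6000, step=60, start=7, defect_period=20))   # 0.0
```

Under these assumptions the inspection reports either 100% or 0% defective parts, never the actual 5%.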
Assuming that the elements of the population are sorted from 1 to N and that we already know the sample size (n), systematic sampling works as follows:
$$k = \frac{N}{n}$$
$$X_1,\; X_1 + k,\; X_1 + 2k,\; \ldots,\; X_1 + (n-1)k$$
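A minimal Python sketch of this selection scheme is shown below. Two details are assumptions made for the example rather than statements from the text: the interval k is taken as the integer part of N/n, and the first element $X_1$ is drawn at random between 1 and k. The function name is hypothetical.

```python
import random

def systematic_sample(N, n, seed=None):
    """Positions X1, X1 + k, ..., X1 + (n - 1)k in a population sorted from 1 to N."""
    if not 1 <= n <= N:
        raise ValueError("n must be between 1 and N")
    rng = random.Random(seed)
    k = N // n              # sampling interval (integer part of N / n, an assumption)
    x1 = rng.randint(1, k)  # random start between 1 and k (an assumption)
    return [x1 + j * k for j in range(n)]

sample = systematic_sample(N=1000, n=50, seed=3)
print(sample[:5], "...", sample[-1])
```

Because $X_1 \leq k$ and $n \cdot k \leq N$, the last selected position never exceeds N.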
In this type of sampling, a heterogeneous population is stratified or divided into subpopulations or homogeneous strata, and, in each stratum, a sample is drawn. Hence, initially, we define the number of strata and, by doing that, we obtain the size of each stratum. For each stratum, we specify how many elements will be drawn from the subpopulation, and this can be a uniform or proportional allocation. According to Costa Neto (2002), uniform stratified sampling, in which we draw an equal number of elements from each stratum, is recommended when the strata are approximately the same size. In proportional stratified sampling, on the other hand, the number of elements drawn from each stratum is proportional to the number of elements in that stratum.
According to Freund (2006), if the elements selected in each stratum are simple random samples, the global process (stratification followed by random sampling) is called (simple) stratified random sampling.
According to Freund (2006), stratified sampling works as follows:
In uniform stratified sampling, we have:
$$n_1 = n_2 = \ldots = n_k$$
where the sample size obtained from each stratum is:
$$n_i = \frac{n}{k}, \quad \text{for } i = 1, 2, \ldots, k$$
with $n = n_1 + n_2 + \ldots + n_k$.
In proportional stratified sampling, on the other hand, we have:
$$\frac{n_1}{N_1} = \frac{n_2}{N_2} = \ldots = \frac{n_k}{N_k}$$
In proportional sampling, the sample size obtained from each stratum can be obtained according to the following expression:
$$n_i = \frac{N_i}{N} \cdot n, \quad \text{for } i = 1, 2, \ldots, k$$
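As a rough illustration of the two allocation rules, the Python sketch below computes $n_i$ for each stratum. The function names and the stratum sizes are hypothetical, and rounding the results to whole units is left to the analyst.

```python
def uniform_allocation(n, k):
    """n_i = n / k for each of the k strata."""
    return [n / k] * k

def proportional_allocation(n, strata_sizes):
    """n_i = (N_i / N) * n for each stratum, with N the sum of the stratum sizes."""
    N = sum(strata_sizes)
    return [n * N_i / N for N_i in strata_sizes]

# Hypothetical population with three strata of 500, 300, and 200 elements.
print(uniform_allocation(n=90, k=3))                     # [30.0, 30.0, 30.0]
print(proportional_allocation(n=100, strata_sizes=[500, 300, 200]))  # [50.0, 30.0, 20.0]
```

In both cases the allocations add up to the total sample size n.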
As examples of stratified sampling, we can mention the stratification of a city into neighborhoods, of a population by gender or age group, of customers by social class or of students by school.
The calculation of the size of a stratified sample will be studied in Section 7.4.3.
In cluster sampling, the total population must be subdivided into groups of elementary units, called clusters. The sampling is done from the groups and not from the individuals in the population. Hence, we must randomly draw a sufficient number of clusters and the objects from these will form the sample. This type of sampling is called one-stage cluster sampling.
According to Bolfarine and Bussab (2005), one of the inconveniences of cluster sampling is the fact that elements in the same cluster tend to have similar characteristics. The authors show that the more similar the elements in the cluster are, the less efficient the procedure is. Each cluster must be a good representative of the population, that is, it must be heterogeneous, containing all kinds of participants. It is the opposite of stratified sampling.
According to Martins and Domingues (2011), cluster sampling is a simple random sampling in which the sample units are the clusters; however, it is less expensive.
When we draw elements in the clusters selected, we have a two-stage cluster sampling: in the first stage, we draw the clusters and, in the second, we draw the elements. The number of elements to be drawn depends on the variability in the cluster. The higher the variability, the more elements must be drawn. On the other hand, when the units in the cluster are very similar, it is not advisable nor necessary to draw all the elements, because they will bring the same kind of information (Bolfarine and Bussab, 2005).
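The difference between one-stage and two-stage cluster sampling can be sketched in Python as follows. The data (students grouped by school), the function names, and the fixed number b of elements drawn per cluster are all assumptions made for the example.

```python
import random

def one_stage_cluster_sample(clusters, m, seed=None):
    """Draw m clusters at random and keep every element of each drawn cluster."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), m)
    return {c: list(clusters[c]) for c in chosen}

def two_stage_cluster_sample(clusters, m, b, seed=None):
    """Draw m clusters, then draw up to b elements at random within each of them."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), m)  # first stage: clusters
    return {
        c: rng.sample(list(clusters[c]), min(b, len(clusters[c])))  # second stage: elements
        for c in chosen
    }

# Hypothetical clusters: students grouped by school.
schools = {
    "A": ["a1", "a2", "a3", "a4"],
    "B": ["b1", "b2", "b3"],
    "C": ["c1", "c2", "c3", "c4", "c5"],
    "D": ["d1", "d2", "d3"],
}
print(one_stage_cluster_sample(schools, m=2, seed=7))
print(two_stage_cluster_sample(schools, m=2, b=2, seed=7))
```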
Cluster sampling can be generalized to several stages.
The main advantages that justify the wide use of cluster sampling are: a) many populations are already grouped into natural or geographic subgroups, facilitating its application; b) it allows a substantial reduction in the costs to obtain the sample, without compromising its accuracy. In short, it is fast, cheap, and efficient. The only disadvantage is that clusters are rarely the same size, making it difficult to control the range of the sample. However, to overcome this problem, we have to use certain statistical techniques.
As examples of clusters, we can mention the production in a factory divided into assembly lines, company employees divided by area, students in a municipality divided by schools, or the population in a municipality divided into districts.
Consider the following notation for cluster sampling:
In short, one-stage cluster sampling adopts the following procedure:
The calculation of the number of clusters (m) will be studied in Section 7.4.4.
On the other hand, two-stage cluster sampling works as follows:
In nonprobability sampling methods, samples are obtained in a nonrandom way, that is, the probability of some or all elements of the population belonging to the sample is unknown. Thus, it is not possible to estimate the sample error, nor to generalize the results of the sample to the population, since the former is not representative of the latter.
For Costa Neto (2002), this type of sampling is often used because of its simplicity or because it is impossible to obtain a probability sample, which would be preferable.
Therefore, we must be careful when deciding to use this type of sampling, since it is subjective, based on the researcher’s criteria and judgment, and sample variability cannot be established with accuracy.
In this section, we will study the main nonprobability or nonrandom sampling techniques: (a) convenience sampling, (b) judgmental or purposive sampling, (c) quota sampling, (d) geometric propagation or snowball sampling.
Convenience sampling is used when participation is voluntary or the sample elements are chosen due to convenience or simplicity, such as friends, neighbors, or students. The advantage this method offers is that it allows the researcher to obtain information in a quick and cheap way.
However, the sample process does not guarantee that the sample is representative of the population. It should only be employed in extreme situations and in special cases that justify its use.
In judgmental or purposive sampling, the sample is chosen according to an expert’s opinion or previous judgment. It is a risky method due to possible mistakes made by the researcher in his prejudgment.
Using this type of sampling requires knowledge of the population and of the elements selected.
Quota sampling presents greater rigor when compared to other nonrandom samplings. For Martins and Domingues (2011), it is one of the most used sampling methods in market surveys and election polls.
Quota sampling is a variation of judgmental sampling. Initially, we set the quotas based on a certain criterion. Within the quotas, the selection of the sample items depends on the interviewer’s judgment.
Quota sampling can also be considered a nonprobability version of stratified sampling.
Quota sampling consists of three steps:
The main advantages of quota sampling are the low costs, speed, and convenience or ease with which the interviewer can select elements. However, since the selection of elements is not random, there are no guarantees that the sample will be representative of the population. Hence, it is not possible to generalize the results of the survey to the population.
Geometric propagation or snowball sampling is widely used when the elements of the population are rare, difficult to access, or unknown.
In this method, we must identify one or more individuals from the target population, and these will identify the other individuals that are in the same population. The process is repeated until the objective proposed is achieved, that is, the point of saturation. The point of saturation is reached when the last respondents do not add new relevant information to the research, thus, repeating the content of previous interviews.
As advantages, we can mention: a) it allows the researcher to find the desired characteristic in the population; b) it is easy to apply, because the recruiting is done through referrals from other people who are in the population; c) low cost, because we need less planning and people; and d) it is efficient to enter populations that are difficult to access.
According to Cabral (2006), there are six decisive factors when calculating the sample size:
The sample selected must be representative of the population. Based on Ferrão et al. (2001), Bolfarine and Bussab (2005), and Martins and Domingues (2011), this section discusses how to calculate the sample size for the mean (a quantitative variable) and the proportion (a binary variable) of finite and infinite populations, with a maximum estimation error B, and for each type of random sampling (simple, systematic, stratified, and cluster).
In the case of nonrandom samples, either we set the sample size based on a possible budget or we adopt a certain dimension that has already been used successfully in previous studies with the same characteristics. A third alternative would be to calculate the size of a random sample and use that dimension as a reference.
This section discusses how to calculate the size of a simple random sample to estimate the mean (a quantitative variable) and the proportion (a binary variable) of finite and infinite populations, with a maximum estimation error B.
The estimation error (B) for the mean is the maximum difference that the researcher accepts between $\mu$ (population mean) and $\bar{X}$ (sample mean), that is, $B \geq |\mu - \bar{X}|$.
On the other hand, the estimation error (B) for the proportion is the maximum difference that the researcher accepts between $p$ (proportion of the population) and $\hat{p}$ (proportion of the sample), that is, $B \geq |p - \hat{p}|$.
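If the variable chosen is quantitative and the population infinite, the size of a simple random sample, where $P(|\bar{X} - \mu| \leq B) = 1 - \alpha$, can be calculated as follows:
$$n = \frac{\sigma^2}{B^2 / z_\alpha^2}$$
where $\sigma^2$ is the population variance, $B$ is the maximum estimation error, and $z_\alpha$ is the abscissa of the standard normal distribution at the significance level $\alpha$.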
According to Bolfarine and Bussab (2005), to determine the sample size it is necessary to set the maximum estimation error (B), the significance level $\alpha$ (translated by the value of $z_\alpha$), and to have some previous knowledge of the population variance ($\sigma^2$). The first two are set by the researcher, while the third demands more work.
When we do not know $\sigma^2$, its value must be replaced by a reasonable initial estimate. In many cases, a pilot sample can provide sufficient information about the population. In other cases, sample surveys done previously about the population can also provide satisfactory initial estimates for $\sigma^2$. Finally, some authors suggest the use of an approximate value for the standard deviation, given by $\sigma \cong \text{range}/4$.
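The calculation for an infinite population can be sketched in Python as below. Three points are assumptions of this example: $z_\alpha$ is taken as the two-sided standard normal abscissa (the value that makes $P(|\bar{X} - \mu| \leq B) = 1 - \alpha$ hold), the result is rounded up to the next whole unit, and the inputs (a range of 40 units, B = 2, $\alpha$ = 0.05) are hypothetical. The function name is also hypothetical.

```python
import math
from statistics import NormalDist

def sample_size_mean_infinite(sigma2, B, alpha):
    """n = sigma^2 / (B^2 / z_alpha^2), rounded up to the next whole unit."""
    # Two-sided normal abscissa, so that P(|Xbar - mu| <= B) = 1 - alpha.
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return math.ceil(sigma2 / (B ** 2 / z ** 2))

# Hypothetical inputs: observed range of 40 units, so sigma is roughly 40 / 4 = 10.
sigma = 40 / 4
print(sample_size_mean_infinite(sigma ** 2, B=2, alpha=0.05))  # 97
```

Halving the acceptable error B roughly quadruples the required sample size, since n grows with $1/B^2$.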
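If the variable chosen is quantitative and the population finite, the size of a simple random sample, where $P(|\bar{X} - \mu| \leq B) = 1 - \alpha$, can be calculated as follows:
$$n = \frac{N \cdot \sigma^2}{\frac{(N-1) \cdot B^2}{z_\alpha^2} + \sigma^2}$$
where $N$ is the population size, $\sigma^2$ is the population variance, $B$ is the maximum estimation error, and $z_\alpha$ is the abscissa of the standard normal distribution at the significance level $\alpha$.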
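If the variable chosen is binary and the population infinite, the size of a simple random sample, where $P(|\hat{p} - p| \leq B) = 1 - \alpha$, can be calculated as follows:
$$n = \frac{p \cdot q}{B^2 / z_\alpha^2}$$
where $p$ is the proportion of the population that has the characteristic of interest, $q = 1 - p$, $B$ is the maximum estimation error, and $z_\alpha$ is the abscissa of the standard normal distribution at the significance level $\alpha$.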
In practice, we do not know the value of $p$ and we must, therefore, find its estimate ($\hat{p}$). If, however, this value is also unknown, we assume that $\hat{p} = 0.50$, hence obtaining a conservative size, that is, larger than what is necessary to ensure the accuracy required.
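The conservative choice can be checked numerically: the product $\hat{p} \cdot \hat{q}$ reaches its maximum, 0.25, at $\hat{p} = 0.50$, so any other estimate yields a smaller n. The Python sketch below illustrates this under the same assumptions as before (two-sided $z_\alpha$, rounding up); the function name and the inputs are hypothetical.

```python
import math
from statistics import NormalDist

def sample_size_proportion_infinite(B, alpha, p_hat=0.5):
    """n = p*q / (B^2 / z_alpha^2); p_hat defaults to the conservative value 0.5."""
    z = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided normal abscissa
    q_hat = 1 - p_hat
    return math.ceil(p_hat * q_hat / (B ** 2 / z ** 2))

# Hypothetical inputs: maximum error of 5 percentage points, alpha = 0.05.
print(sample_size_proportion_infinite(B=0.05, alpha=0.05))             # 385
print(sample_size_proportion_infinite(B=0.05, alpha=0.05, p_hat=0.2))  # 246
```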
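If the variable chosen is binary and the population finite, the size of a simple random sample, where $P(|\hat{p} - p| \leq B) = 1 - \alpha$, can be calculated as follows:
$$n = \frac{N \cdot p \cdot q}{\frac{(N-1) \cdot B^2}{z_\alpha^2} + p \cdot q}$$
where $N$ is the population size, $p$ is the proportion of the population that has the characteristic of interest, $q = 1 - p$, $B$ is the maximum estimation error, and $z_\alpha$ is the abscissa of the standard normal distribution at the significance level $\alpha$.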
In systematic sampling, we use the same expressions as in simple random sampling (as studied in Section 7.4.1), according to the type of variable (quantitative or qualitative) and population (infinite or finite).
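Since the same expressions serve both simple random and systematic sampling, the finite-population versions can be sketched once in Python. As in any such sketch, the function names and inputs are hypothetical, $z_\alpha$ is assumed to be the two-sided standard normal abscissa, and results are rounded up.

```python
import math
from statistics import NormalDist

def sample_size_mean_finite(N, sigma2, B, alpha):
    """n = N*sigma^2 / ((N - 1)*B^2 / z_alpha^2 + sigma^2)."""
    z2 = NormalDist().inv_cdf(1 - alpha / 2) ** 2
    return math.ceil(N * sigma2 / ((N - 1) * B ** 2 / z2 + sigma2))

def sample_size_proportion_finite(N, B, alpha, p_hat=0.5):
    """n = N*p*q / ((N - 1)*B^2 / z_alpha^2 + p*q)."""
    z2 = NormalDist().inv_cdf(1 - alpha / 2) ** 2
    pq = p_hat * (1 - p_hat)
    return math.ceil(N * pq / ((N - 1) * B ** 2 / z2 + pq))

# Hypothetical population of 2,000 elements.
print(sample_size_mean_finite(N=2000, sigma2=100, B=2, alpha=0.05))  # 92
print(sample_size_proportion_finite(N=2000, B=0.05, alpha=0.05))     # 323
```

For the same inputs, the finite-population sizes are smaller than the infinite-population ones, and the two converge as N grows.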
This section discusses how to calculate the size of a stratified sample to estimate the mean (a quantitative variable) and the proportion (a binary variable) of finite and infinite populations, with a maximum estimation error B.
The estimation error (B) for the mean is the maximum difference that the researcher accepts between $\mu$ (population mean) and $\bar{X}$ (sample mean), that is, $B \geq |\mu - \bar{X}|$.
On the other hand, the estimation error (B) for the proportion is the maximum difference that the researcher accepts between $p$ (proportion of the population) and $\hat{p}$ (proportion of the sample), that is, $B \geq |p - \hat{p}|$.
Let’s use the following notation to calculate the size of the stratified sample, as follows:
If the variable chosen is quantitative and the population infinite, the size of the stratified sample, where $P(|\bar{X} - \mu| \leq B) = 1 - \alpha$, can be calculated as:
$$n = \frac{\sum_{i=1}^{k} W_i \cdot \sigma_i^2}{B^2 / z_\alpha^2}$$
where:
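If the variable chosen is quantitative and the population finite, the size of the stratified sample, where $P(|\bar{X} - \mu| \leq B) = 1 - \alpha$, can be calculated as:
$$n = \frac{\sum_{i=1}^{k} N_i^2 \cdot \sigma_i^2 / W_i}{\frac{N^2 \cdot B^2}{z_\alpha^2} + \sum_{i=1}^{k} N_i \cdot \sigma_i^2}$$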
where:
If the variable chosen is binary and the population infinite, the size of the stratified sample, where $P(|\hat{p} - p| \leq B) = 1 - \alpha$, can be calculated as:
$$n = \frac{\sum_{i=1}^{k} W_i \cdot p_i \cdot q_i}{B^2 / z_\alpha^2}$$
where:
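If the variable chosen is binary and the population finite, the size of the stratified sample, where $P(|\hat{p} - p| \leq B) = 1 - \alpha$, can be calculated as:
$$n = \frac{\sum_{i=1}^{k} N_i^2 \cdot p_i \cdot q_i / W_i}{\frac{N^2 \cdot B^2}{z_\alpha^2} + \sum_{i=1}^{k} N_i \cdot p_i \cdot q_i}$$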
where:
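The stratified expressions of this section can be sketched in Python as shown below. Several points are assumptions of the example and not statements from the text: the weight of each stratum is taken as $W_i = N_i / N$, $z_\alpha$ is the two-sided standard normal abscissa, results are rounded up, and the strata (sizes 500, 300, 200 with variances 100, 64, 36) are hypothetical. For a binary variable, the same functions apply with $p_i \cdot q_i$ in place of $\sigma_i^2$.

```python
import math
from statistics import NormalDist

def stratified_n_infinite(weights, variances, B, alpha):
    """n = sum(W_i * var_i) / (B^2 / z_alpha^2)."""
    z2 = NormalDist().inv_cdf(1 - alpha / 2) ** 2
    total = sum(w * v for w, v in zip(weights, variances))
    return math.ceil(total / (B ** 2 / z2))

def stratified_n_finite(sizes, variances, B, alpha):
    """n = sum(N_i^2 * var_i / W_i) / (N^2 * B^2 / z_alpha^2 + sum(N_i * var_i))."""
    z2 = NormalDist().inv_cdf(1 - alpha / 2) ** 2
    N = sum(sizes)
    weights = [N_i / N for N_i in sizes]  # assumption: W_i = N_i / N
    num = sum(N_i ** 2 * v / w for N_i, v, w in zip(sizes, variances, weights))
    den = N ** 2 * B ** 2 / z2 + sum(N_i * v for N_i, v in zip(sizes, variances))
    return math.ceil(num / den)

sizes = [500, 300, 200]     # hypothetical stratum sizes N_i
variances = [100, 64, 36]   # hypothetical stratum variances sigma_i^2
weights = [N_i / sum(sizes) for N_i in sizes]

print(stratified_n_infinite(weights, variances, B=2, alpha=0.05))  # 74
print(stratified_n_finite(sizes, variances, B=2, alpha=0.05))      # 69

# Binary variable: pass p_i * q_i as the per-stratum variances.
p = [0.5, 0.3, 0.2]         # hypothetical stratum proportions
pq = [p_i * (1 - p_i) for p_i in p]
print(stratified_n_finite(sizes, pq, B=0.05, alpha=0.05))
```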
This section discusses how to calculate the size of a one-stage and a two-stage cluster sample.
Let’s consider the following notation to calculate the size of a cluster sample:
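According to Bolfarine and Bussab (2005), the calculation of $\sigma_{dc}^2$ and $\sigma_{ec}^2$ is given by:
$$\sigma_{dc}^2 = \frac{\sum_{i=1}^{M} \sum_{j=1}^{N_i} (X_{ij} - \mu_i)^2}{N} = \frac{1}{M} \cdot \sum_{i=1}^{M} \frac{N_i}{\bar{N}} \cdot \sigma_i^2$$
$$\sigma_{ec}^2 = \frac{1}{N} \cdot \sum_{i=1}^{M} N_i \cdot (\mu_i - \mu)^2 = \frac{1}{M} \cdot \sum_{i=1}^{M} \frac{N_i}{\bar{N}} \cdot (\mu_i - \mu)^2$$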
Assuming that all the clusters are the same size, the previous expressions can be summarized as follows:
$$\sigma_{dc}^2 = \frac{1}{M} \cdot \sum_{i=1}^{M} \sigma_i^2$$
$$\sigma_{ec}^2 = \frac{1}{M} \cdot \sum_{i=1}^{M} (\mu_i - \mu)^2$$
This section discusses how to calculate the size of a one-stage cluster sample to estimate the mean (a quantitative variable) of a finite and infinite population, with a maximum estimation error B.
The estimation error (B) for the mean is the maximum difference that the researcher accepts between $\mu$ (population mean) and $\bar{X}$ (sample mean), that is, $B \geq |\mu - \bar{X}|$.
If the variable chosen is quantitative and the population infinite, the number of clusters drawn in the first stage (m), where $P(|\bar{X} - \mu| \leq B) = 1 - \alpha$, can be calculated as follows:
$$m = \frac{\sigma_c^2}{B^2 / z_\alpha^2}$$
where:
If the clusters are the same size, Bolfarine and Bussab (2005) demonstrate that:
$$m = \frac{\sigma_e^2}{B^2 / z_\alpha^2}$$
According to the authors, generally, $\sigma_c^2$ is unknown and has to be estimated from pilot samples or obtained from previous sample surveys.
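Once an estimate of $\sigma_c^2$ is available, the number of clusters for an infinite population follows directly from the expression above. The Python sketch below assumes a hypothetical pilot estimate of 25, a maximum error B = 1.5, and $\alpha$ = 0.01; as elsewhere, $z_\alpha$ is taken as the two-sided standard normal abscissa and the result is rounded up. The function name is hypothetical.

```python
import math
from statistics import NormalDist

def clusters_mean_infinite(sigma2_c, B, alpha):
    """m = sigma_c^2 / (B^2 / z_alpha^2), rounded up to a whole number of clusters."""
    z = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided normal abscissa
    return math.ceil(sigma2_c / (B ** 2 / z ** 2))

# Hypothetical pilot estimate of sigma_c^2.
print(clusters_mean_infinite(sigma2_c=25, B=1.5, alpha=0.01))  # 74
```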
If the variable chosen is quantitative and the population finite, the number of clusters drawn in the first stage (m), where $P(|\bar{X} - \mu| \leq B) = 1 - \alpha$, can be calculated as follows:
$$m = \frac{M \cdot \sigma_c^2}{\frac{M \cdot B^2 \cdot \bar{N}^2}{z_\alpha^2} + \sigma_c^2}$$
where:
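If the variable chosen is binary and the population infinite, the number of clusters drawn in the first stage (m), where $P(|\hat{p} - p| \leq B) = 1 - \alpha$, can be calculated as follows:
$$m = \frac{\frac{1}{M} \cdot \sum_{i=1}^{M} \frac{N_i}{\bar{N}} \cdot p_i \cdot q_i}{B^2 / z_\alpha^2}$$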
where:
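If the variable chosen is binary and the population finite, the number of clusters drawn in the first stage (m), where $P(|\hat{p} - p| \leq B) = 1 - \alpha$, can be calculated as follows:
$$m = \frac{\sum_{i=1}^{M} \frac{N_i}{\bar{N}} \cdot p_i \cdot q_i}{\frac{M \cdot B^2 \cdot \bar{N}^2}{z_\alpha^2} + \frac{1}{M} \cdot \sum_{i=1}^{M} \frac{N_i}{\bar{N}} \cdot p_i \cdot q_i}$$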
where:
In this case, we assume that all the clusters are the same size. Based on Bolfarine and Bussab (2005), let’s consider the following linear cost function:
$$C = c_1 \cdot n + c_2 \cdot b$$
where:
The optimal size for b that minimizes the linear cost function is given by:
$$b^{*} = \frac{\sigma_{dc}}{\sigma_{ec}} \cdot \sqrt{\frac{c_1}{c_2}}$$
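The expression for $b^{*}$ translates directly into code. In the Python sketch below, the function name and the input values (standard deviations within and between clusters, and the cost coefficients $c_1$ and $c_2$) are hypothetical; in practice $b^{*}$ would still have to be rounded to a whole number of elements.

```python
import math

def optimal_second_stage_size(sigma_dc, sigma_ec, c1, c2):
    """b* = (sigma_dc / sigma_ec) * sqrt(c1 / c2)."""
    return (sigma_dc / sigma_ec) * math.sqrt(c1 / c2)

# Hypothetical values: within-cluster variation twice the between-cluster variation,
# and a first cost coefficient nine times the second.
print(optimal_second_stage_size(sigma_dc=4.0, sigma_ec=2.0, c1=9.0, c2=1.0))  # 6.0
```

The expression shows that $b^{*}$ grows with the variability within clusters relative to the variability between them, and with the ratio $c_1 / c_2$.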
From the sample selected, we can infer, with a 99% confidence level, that the difference between the sample mean and the real population mean will be 2%, at the most.
Table 7.1 shows a summary of the expressions used to calculate the sample size for the mean (a quantitative variable) and the proportion (a binary variable) of finite and infinite populations, with a maximum estimation error B, and for each type of random sampling (simple, systematic, stratified, and cluster).
Table 7.1
Type of Random Sample | Estimating the Mean (Infinite Population) | Estimating the Mean (Finite Population) | Estimating the Proportion (Infinite Population) | Estimating the Proportion (Finite Population) |
---|---|---|---|---|
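Simple | $n=\dfrac{\sigma^2}{B^2/z_\alpha^2}$ | $n=\dfrac{N\cdot\sigma^2}{\frac{(N-1)\cdot B^2}{z_\alpha^2}+\sigma^2}$ | $n=\dfrac{p\cdot q}{B^2/z_\alpha^2}$ | $n=\dfrac{N\cdot p\cdot q}{\frac{(N-1)\cdot B^2}{z_\alpha^2}+p\cdot q}$ |
Systematic | $n=\dfrac{\sigma^2}{B^2/z_\alpha^2}$ | $n=\dfrac{N\cdot\sigma^2}{\frac{(N-1)\cdot B^2}{z_\alpha^2}+\sigma^2}$ | $n=\dfrac{p\cdot q}{B^2/z_\alpha^2}$ | $n=\dfrac{N\cdot p\cdot q}{\frac{(N-1)\cdot B^2}{z_\alpha^2}+p\cdot q}$ |
Stratified | $n=\dfrac{\sum_{i=1}^{k} W_i\cdot\sigma_i^2}{B^2/z_\alpha^2}$ | $n=\dfrac{\sum_{i=1}^{k} N_i^2\cdot\sigma_i^2/W_i}{\frac{N^2\cdot B^2}{z_\alpha^2}+\sum_{i=1}^{k} N_i\cdot\sigma_i^2}$ | $n=\dfrac{\sum_{i=1}^{k} W_i\cdot p_i\cdot q_i}{B^2/z_\alpha^2}$ | $n=\dfrac{\sum_{i=1}^{k} N_i^2\cdot p_i\cdot q_i/W_i}{\frac{N^2\cdot B^2}{z_\alpha^2}+\sum_{i=1}^{k} N_i\cdot p_i\cdot q_i}$ |
One-stage Cluster | $m=\dfrac{\sigma_c^2}{B^2/z_\alpha^2}$ | $m=\dfrac{M\cdot\sigma_c^2}{\frac{M\cdot B^2\cdot\bar{N}^2}{z_\alpha^2}+\sigma_c^2}$ | $m=\dfrac{\frac{1}{M}\cdot\sum_{i=1}^{M}\frac{N_i}{\bar{N}}\cdot p_i\cdot q_i}{B^2/z_\alpha^2}$ | $m=\dfrac{\sum_{i=1}^{M}\frac{N_i}{\bar{N}}\cdot p_i\cdot q_i}{\frac{M\cdot B^2\cdot\bar{N}^2}{z_\alpha^2}+\frac{1}{M}\cdot\sum_{i=1}^{M}\frac{N_i}{\bar{N}}\cdot p_i\cdot q_i}$ |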
It is rarely feasible to obtain the exact distribution of a variable by examining all the elements of the population, due to the high costs, the time needed, and the difficulties in collecting the data. Therefore, the alternative is to select part of the elements of the population (sample) and, after that, infer the properties for the whole (population). Since the sample must be a good representative of the population, choosing the sampling technique is essential in this process.
Sampling techniques can be classified in two major groups: probability or random sampling and nonprobability or nonrandom sampling. Among the main random sampling techniques, we can highlight simple random sampling (with and without replacement), systematic, stratified, and cluster. The main nonrandom sampling techniques are convenience, judgmental or purposive, quota, and snowball sampling. Each one of these techniques has advantages and disadvantages, and choosing the best technique must take the characteristics of each study into consideration.
This chapter also discussed how to calculate the sample size for the mean and the proportion of finite and infinite populations, for each type of random sampling. In the case of nonrandom samples, the researcher must either establish the sample size based on a possible budget or adopt a certain dimension that has already been used successfully in previous studies with similar characteristics. Another alternative would be to calculate the size of a random sample and use it as a reference.