2
Building a Valid and Reliable Experiment

In Chapter 1, we reviewed the characteristics of an experimental study. We saw that carrying out an experiment comes down to manipulating at least one variable, in a controlled manner, in order to bring to light its effect(s) on one or more other variables. We also stressed the fact that each experiment should be used for testing a precise research hypothesis in which the observed variables are defined via an operationalization process. In this chapter, we discuss in detail the different stages involved in the operationalization of a research question. We will see that the operationalization process requires making many choices concerning the variables studied, the manner of measuring them, the conditions examined and the experimental design chosen. These choices, in turn, have consequences for the validity and reliability of the experiment. We therefore begin this chapter with a presentation of the key concepts of validity and reliability. We then develop the notion of a variable introduced in the first chapter, in order to accurately define the types of variables involved in an experiment. Once this framework has been set, the rest of the chapter will deal with the stages for operationalizing a research question.

2.1. Validity and reliability of an experiment

In Chapter 1, we saw that the purpose of an experiment is to collect data in order to test a research hypothesis that states a cause-and-effect relationship between variables. In order to be valid, an experiment should lead to a trustworthy conclusion about this relationship, while ensuring that the results are not influenced by other variables not considered in the study. In other words, the cause identified in the research hypothesis must be the origin of the effects observed in the results. This is called the internal validity of an experiment. For example, in the case of an experiment aimed at showing a relationship between word length and reading time, it is necessary to ensure that the other variables that could have an impact on reading time do not influence the results. In this experiment, only word length should vary, whereas word frequency, the grammatical category or the number of phonological neighbors, for example, should be controlled. We will later return to this notion of control.

In the first chapter, we also saw that the results of an experimental study must be generalizable, and make it possible to draw conclusions concerning the relationship between two variables, regardless of the sample of participants and items included in the study, and of the conditions under which the study was carried out. This is called the external validity of an experiment. For example, going back to the aforementioned study on the relationship between word length and reading time, if the subjects studied were all 35–45-year-old women, external validity would not be achieved, since the results of the study could not be generalized to men or to women belonging to other age groups.

In addition to being valid, an experiment must also be reliable, that is, it must produce consistent results. In other words, if the same experiment were carried out several times, the results should demonstrate the same effects. For this reason, the results obtained in an experiment should be replicated in successive experiments before being communicated. However, in practice, this has rarely been the case, because replicating an experiment is costly in terms of time and resources. Thus, the use of similar methodologies or similar tasks by the same or by different research teams has long been considered as a roundabout way of ensuring the reliability of a result. This practice is now being called into question, and more and more voices are rising in favor of the application of different means for ensuring reliability. One of these means is, for example, pre-registering the hypotheses, the research method and the analyses planned for each study. Another means is the establishment of open science platforms for sharing the data collected, the analyses carried out or even the different versions of the scientific articles reporting on the study. The discussion of the problem of reproducibility goes beyond the scope of this book, but we can only encourage readers to learn about these practices before embarking on the path of research. A good starting point is the Center for Open Science website, which presents the steps to be followed for conducting transparent and open research.

Internal and external validity, as well as the reliability of an experiment, are influenced by many factors, which we will address throughout this chapter, and which we will illustrate by numerous studies in the following chapters. The validity of an experiment also crucially depends on the way in which the variables are chosen. In section 2.2, we will detail the types of variables that can be included in an experiment.

2.2. Independent and dependent variables

Let us recall that experimental research aims to empirically verify a cause-and-effect relationship between at least two variables. In general, a distinction is made between independent variables (the causes) and dependent variables (the effects). Independent variables are the parameters we identify as being responsible for influencing the value of one or more dependent variables. In other words, an independent variable is the variable whose effect we want to evaluate, the one that is manipulated in the experiment. The dependent variable, on the other hand, is the variable that changes as a function of the independent variable, the one whose change we want to measure. Let us take a first intuitive example. Let us imagine that we wish to find out whether the lack of sunshine causes seasonal depression. In that case, the independent variable would be the amount of sunshine, and the dependent variable, depression. Let us now take linguistic examples. If the research question is “Do Australian people speak faster than American people?” the independent variable is the person’s nationality (Australian or American), and the dependent variable is the articulation rate. For the question “At what age do children begin to understand scalar implicatures in the same way as adults?” the independent variable is the age of the children and the dependent variable is the understanding of scalar implicatures. Each experiment includes at least one independent variable and one dependent variable, but it can also include several independent variables and/or several dependent variables.

2.3. Different measurement scales for variables

Experimental research is based on the quantification of observable responses or types of behavior. Whether independent or dependent, a variable must be measured in order to be included in experimental research. According to the type of measurement scale used, a variable can be either qualitative or quantitative. These two general categories can, in turn, be subdivided, as we will see later.

2.3.1. Qualitative variables

Qualitative variables correspond to variables that are not numerical, but describe categories, such as having a specific mother tongue or a certain nationality. This type of variable includes variables which can be measured on two types of scales: nominal and ordinal.

The values of nominal scales correspond to categories that group individuals or things sharing some characteristic, for example, the fact of defining oneself as male or female. These values can be defined by nouns (masculine gender or feminine gender), or numbers (e.g. 1 for the masculine gender and 2 for the feminine gender), which bear no size relationship to each other. In the case of our example, the values 1 and 2 do not offer any indication of a size difference between the feminine and masculine genders (the feminine gender is not worth twice the masculine gender), but simply correspond to a means of defining or of categorizing a group. Numbers assigned to nominal scale values are often used for data coding purposes and should not be subjected to arithmetic operations. It would indeed be inappropriate to calculate an average of the gender of the people participating in a study. On the other hand, it is possible to count the observations for each of the variable’s conditions or modalities, in other words, to determine how many of the people taking part in the experiment defined themselves as male and how many defined themselves as female.
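The following Python snippet is a minimal sketch of this point. It codes a nominal variable with numbers and shows that counting the observations per category is meaningful, whereas the mean of the codes is not. The responses and the coding scheme (1 for male, 2 for female) are hypothetical and serve only as an illustration.

```python
from collections import Counter

# Hypothetical coding scheme for a nominal variable (illustration only).
GENDER_CODES = {1: "male", 2: "female"}

# Hypothetical coded responses from eight participants.
responses = [1, 2, 2, 1, 2, 2, 1, 2]


def count_by_category(coded_values, code_labels):
    """Return how many observations fall into each category label."""
    counts = Counter(code_labels[value] for value in coded_values)
    # Include categories without observations so the table is complete.
    return {label: counts.get(label, 0) for label in code_labels.values()}


counts = count_by_category(responses, GENDER_CODES)
print(counts)  # {'male': 3, 'female': 5}

# The mean of the codes can be computed, but it has no interpretation:
# it would change if the arbitrary codes were swapped or replaced.
mean_of_codes = sum(responses) / len(responses)
print(mean_of_codes)  # 1.625
```

The value 1.625 does not describe any participant or group; only the counts per category carry information.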

Other examples of nominal scales can be mother tongue, marital status or a yes/no answer to a question. In all these cases, there is no hierarchy between the different categories, which are simply a list of possibilities. Frequently, the independent variable of a research project is measured on a nominal scale for creating two or more conditions under which the dependent variable can be observed. Most of the examples presented so far and in the previous chapter illustrate this scenario: studies comparing monolingual versus bilingual people, less frequent words versus very frequent words, people who have had a language stay versus those who haven’t, people with a high working memory capacity versus those with a lower capacity, or people who have to perform a verbal working memory task while reading a text versus those who do not have to perform this task.

The second type of scale associated with qualitative variables offers more information on the relationship between the different values of the scale. This is the ordinal scale, whose values can be ordered, although the size of the difference between the values cannot be evaluated. These values can also correspond to tags such as a little, a lot and passionately, or to categories.

For example, imagine that people taking part in an experiment are included in the following age categories: 15–25 years old, 25–35 years old, 35–45 years old. By assigning every participant to their age category, it is possible to classify them. For example, if participants 1, 4, 6 and 7 belong to the 15–25 years old category and participants 2, 5, 8 and 9 to the 25–35 years old category, participants 1, 4, 6 and 7 should be ranked before the others on the age scale. However, in this configuration, it is not possible to determine the order in which participants 1, 4, 6 and 7 appear, because the only known indicator is that they belong to the same category. In other words, even if these participants are not the same age, this type of information cannot be retrieved from the data.

Ordinal scales do not offer an indication of the size of the difference between the values of the scale. In the case of our example, even if every category spans 10 years, it is not possible to conclude that a participant in the first category and a participant in the second category are 10 years apart. It is indeed possible that the first participant is 15 years old and the second 35 years old. The response scales typically used in questionnaires are another illustration of the impossibility of assessing the size of the difference between the values in ordinal scales. For example, in order to measure language proficiency, one could ask people to assess their level on a scale between 1 and 7, where 1 represents poor fluency and 7 represents perfect fluency. By observing the values chosen by people on such a scale, it would be possible to deduce that people with a score of 4 have better skills than people with a score of 2. However, it would not be possible to say that people with a score of 4 have twice the language proficiency of people with a score of 2. For this reason, it is not appropriate to perform arithmetic operations on the data obtained by means of ordinal scales.
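This property can be sketched in a few lines of Python. The ratings below are hypothetical self-assessments on the 1–7 scale described above, and the recoding table is an arbitrary choice made for illustration. Because an ordinal scale only carries order, any recoding that preserves the order conveys the same information: the median survives such a recoding, whereas ratios between scores do not.

```python
from statistics import median_low

# Hypothetical self-ratings of fluency on a 1-7 ordinal scale.
ratings = [2, 4, 4, 5, 6, 7, 3]

# Arbitrary order-preserving recoding of the seven scale points.
recoded = {1: 1, 2: 2, 3: 3, 4: 10, 5: 20, 6: 30, 7: 40}
recoded_ratings = [recoded[r] for r in ratings]

# The median relies only on order, so it points to the same scale point
# before and after recoding.
middle_rating = median_low(ratings)
middle_recoded = median_low(recoded_ratings)
print(middle_rating, middle_recoded == recoded[middle_rating])  # 4 True

# Ratios depend on the arbitrary numbers chosen for the scale points,
# so "twice as fluent" is not an interpretable statement.
ratio_original = 4 / 2
ratio_recoded = recoded[4] / recoded[2]
print(ratio_original, ratio_recoded)  # 2.0 5.0
```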

In experimental linguistics, some independent variables are often measured on an ordinal scale through the use of categories. This is the case for the age of speakers and the number of years of residence in a country, for example.

2.3.2. Quantitative variables

Unlike qualitative variables, quantitative variables can be subjected to arithmetic operations because they use scales based on quantifiable values. These scales are of two types: interval scales and ratio scales.

Interval scales are similar to the ordinal scales we have already described, but differ from them in that the interval between adjacent values always has the same size. It is thus possible to perform arithmetic operations on the differences between the values of the scale. However, interval scales do not have an absolute zero. Therefore, it is not possible to compute meaningful ratios between the scale values themselves. A simple illustration of an interval scale and its properties is the Celsius temperature scale. On this scale, the difference between 5°C and 10°C is the same as that between 20°C and 25°C, namely 5°C. On the other hand, a temperature of 30°C does not correspond to a heat three times higher than a temperature of 10°C.
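The temperature example can be checked with a short Python sketch. The conversion to kelvins is added here only as an illustration: the Kelvin scale has an absolute zero, which makes it possible to see what the ratio between the two temperatures actually is.

```python
def celsius_to_kelvin(celsius):
    """Convert degrees Celsius to kelvins (a scale with an absolute zero)."""
    return celsius + 273.15


# Differences are meaningful on an interval scale.
diff_low = 10 - 5
diff_high = 25 - 20
print(diff_low == diff_high)  # True

# Ratios are not: the apparent factor of three depends on where the
# arbitrary zero of the Celsius scale happens to lie.
ratio_celsius = 30 / 10
ratio_kelvin = celsius_to_kelvin(30) / celsius_to_kelvin(10)
print(ratio_celsius)           # 3.0
print(round(ratio_kelvin, 2))  # 1.07
```

Expressed on a scale with an absolute zero, 30°C is only about 7% higher than 10°C, not three times higher.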

The difference between ordinal scales and interval scales is clear in theory, but can be difficult to establish in practice. Let us take the example of an experiment in which the participants have to judge the acceptability of sentences, on a scale of 1–9, with 1 being equivalent to totally unacceptable and 9 to totally acceptable. In order to be able to consider this scale as an interval scale, we should assume that the difference between values 1 and 2 is the same as between values 4 and 5, or 8 and 9, for example. It would also imply that a sentence with a score of 9 is considered as more acceptable than a sentence with a score of 6, in the same proportion that the latter would be more acceptable than a sentence with a score of 3. As we can see here, it is impossible to formally verify these conditions and, in absolute terms, the scale of acceptability of this example should be considered as an ordinal scale. However, it is accepted that if a scale is presented in such a way as to highlight equal differences between the different scores, it is likely that people will assess the differences between the scores as being equal. This is why, in practice, we often consider these response scales as interval scales.

The second type of scale that quantitative variables are based on corresponds to the ratio scale. Just like the interval scale, this scale has equivalent intervals between its values. In addition, this scale has an absolute zero. This means that it is possible to perform operations on the values themselves. For example, sentence length calculated per number of letters is a ratio scale. If a sentence contains 178 letters, it is twice as long as a sentence containing 89 letters. Likewise, when we measure it continuously, age is considered as a ratio scale. A 60-year-old woman is three times as old as a 20-year-old woman. Online measurements used in experimental linguistics, such as response time or reading time, are typically considered as ratio scales. It is also possible to measure the opinion of people using a ratio scale, for example, by presenting an ungraduated line on which the interviewees have to indicate their degree of agreement. We can then measure the distance between the start of the line and the answer in order to obtain the value representing the level of agreement with each statement.

Figure 2.1. Illustrations of nominal, ordinal, interval and ratio scales

To sum up, the different measurement scales have divergent properties and cannot be subjected to the same arithmetic operations. The simplest scales, called nominal scales, only make it possible to differentiate between categories. With ordinal scales, categories can also be ranked. Interval scales also make it possible to take into account the distance between the different categories. Finally, ratio scales make it possible to perform all arithmetic operations on the scale values. As a corollary, ratio and interval scales can be transformed into ordinal or nominal scales. To do this, values can simply be grouped into categories. For example, if the exact age of the participants is recorded at the time of the experiment, it is then possible to set up different categories. Similarly, an ordinal scale can be transformed into a nominal scale, by simply decreasing the number of categories. In this way, we can see that it is always possible to go from a more informative scale to a less informative scale, but the opposite cannot be done.
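The one-way nature of these transformations can be sketched as follows in Python. The age bands, the convention that the lower bound is included and the upper bound excluded, the two-way grouping and the ages themselves are all assumptions made for this illustration.

```python
# Hypothetical age bands (lower bound inclusive, upper bound exclusive).
AGE_BANDS = [(15, 25), (25, 35), (35, 45)]


def age_to_band(age):
    """Map an exact age (ratio scale) to a ranked age band (ordinal scale)."""
    for rank, (low, high) in enumerate(AGE_BANDS, start=1):
        if low <= age < high:
            return rank, f"{low}-{high}"
    raise ValueError(f"age {age} is outside the defined bands")


def band_to_group(rank):
    """Collapse the ranked bands into two labels used as a nominal variable."""
    return "youngest band" if rank == 1 else "other bands"


ages = [15, 24, 25, 35, 44]  # hypothetical exact ages
bands = [age_to_band(age) for age in ages]
groups = [band_to_group(rank) for rank, _ in bands]
print(bands)
print(groups)

# Information is lost at each step: 15 and 24 both end up in band 1,
# so the exact ages cannot be recovered from the bands.
```

No function can be written for the opposite direction, since a band such as 15-25 is compatible with many different exact ages.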

Due to their different properties, the types of scales we have just described do not allow the same statistical tests. It is useful to know that data from ratio scales and interval scales can be subjected to what are known as parametric tests. These are compatible with a wide variety of statistical analyses and are most commonly used in research. We will present them in Chapter 7. The data obtained from nominal and ordinal scales can be subjected to non-parametric tests, which offer fewer possibilities for analysis. From the very beginning of research, it is extremely important to proceed with caution when choosing the types of scales to be used, since these will not only shape the analyses that can be performed on the data, but also the conclusions that will be drawn on the basis of the analyses.

2.4. Operationalizing variables

Now that we have seen the characteristics of the variables involved in an experiment, we will describe the operationalization process, whereby the variables of interest are defined in terms of measurements. While some variables can be easily measured using objective indicators, others are more difficult to operationalize. On the one hand, age, mother tongue, word length, sound frequency and reading time are variables that can be measured directly and objectively. On the other hand, language proficiency, the derivation of inferences, the understanding of discourse connectives and even the access to the meaning of a word cannot be measured directly. As a matter of fact, this type of variable refers to abstract concepts which (1) cannot be observed directly and (2) are based on definitions or theoretical models. These variables require a process of reflection on how to operationalize them, by which we try to define a signal of the abstract concept. Measuring this signal amounts to measuring the abstract concept, in a roundabout manner.

To illustrate this process with a concrete example, let us imagine a plane flying in the sky and leaving a white trace behind. By seeing its trace, we know that an airplane has passed, even if the airplane is no longer visible. Thus, the trace works as a signal, making it possible to deduce the presence of the aircraft in a more or less precise manner, depending on its quality. If we apply this idea to some of the abstract concepts mentioned above, we can consider the results of language tests as a signal of language proficiency and certain eye movement patterns while reading a text (e.g. going back on certain words, longer reading time) as a signal of inference construction. Drawing a parallel, the goal of researchers is to measure the signal that best reflects an abstract concept, just as a clear, good-quality trace should be as close as possible to the path of the plane.

During operationalization, we have to define not only the signal we want to measure, but also the scale on which this signal will be measured. We have already pointed out that experiments very often compare different groups of participants in relation to the values of the dependent variable. In general, the independent variables involved in research are measured on nominal (monolingual vs. bilingual, for example) or ordinal (different age categories) scales, in order to be able to create conditions, whereas dependent variables are preferably measured on interval or ratio scales (e.g. the number of correct answers given to a linguistic task). We will first review the general choices to make when choosing a measure for the variables before turning to the specifics of the independent variable.

2.5. Choosing a measure for every variable

There are different ways to operationalize the linguistic concepts investigated in experimental research. The choice of the measure mainly depends on the process that one wishes to examine, as well as the possibility of gaining access to it in a more or less direct manner. When the process is accessible to consciousness, one possibility is to use a survey response scale, as is the case of studies aiming to measure the acceptability of sentences according to their syntactic structure. When the process is not accessible to consciousness, or when one wishes to measure it implicitly (see section 1.3.2), it is possible to use behavioral measures. For instance, this is the case of studies using action tasks (such as performing an action on the basis of a sentence), and those measuring reading time or reaction time. Finally, when the process is not accessible through behavioral measures, it is possible to measure physiological reactions, as is the case in studies using an electroencephalogram, or magnetic resonance imaging (MRI), for example. Certainly, these different methods can be combined within the same study, in order to shed light on the same process from different perspectives.

These multiple operationalization methods do not only concern different concepts. In fact, there are many ways of operationalizing the same abstract concept through the use of different types of measures. Let us suppose that you are interested in the theory of linguistic relativity, according to which the speaker’s language may have an influence on their world view or cognition. Studies have shown that the way a language encodes different phenomena such as time, colors or gender has an influence on the representations that speakers have of such phenomena (e.g. Vigliocco et al. 2005; Athanasopoulos et al. 2011; Boroditsky et al. 2011). The representations often examined in these studies are those built on the basis of grammatical gender. Some languages, such as French or Italian, have two grammatical genders, feminine and masculine, and all nouns in these languages are related to a grammatical gender. In French, for example, this association can be based on the person’s gender, such as un infirmier (a nurse, masculine) or une astronaute (an astronaut, feminine). It can also be completely arbitrary, as when talking about une chaise (a chair, feminine), une tomate (a tomato, feminine), un train (a train, masculine), un clavier (a keyboard, masculine) or even une envie (a desire, feminine) or un souhait (a wish, masculine). In other languages, such as English or Japanese, nouns do not have a specific gender (English has certain exceptions, such as a ship or a bell, which can be regarded as feminine). On the basis of this difference, one might wonder whether the speakers of languages with grammatical genders associate feminine or masculine characteristics with certain words in the language, depending on their grammatical gender. 
In order to carry out a study on this phenomenon, it would be necessary to define what we understand by the presence of grammatical gender in the language and by feminine or masculine characteristics assigned to certain words, as well as how these variables would be measured.

A first possibility for operationalizing such a question can be drawn from the study by Konishi (1993) in which German-speaking and Spanish-speaking participants had to evaluate the words man and woman in their language, as well as nouns for objects (newspaper, cigarette), places (mountain, desert) or abstract concepts (love, record) on a potency scale. The nouns to be evaluated (except man and woman) were selected according to their German grammatical gender, which was always different from their Spanish grammatical gender. Half of the words presented in each language were masculine and the other half feminine. In this study, the independent variable was the words’ grammatical genders in the participants’ language. Words were either masculine in Spanish and feminine in German, or feminine in Spanish and masculine in German. The score on the potency scale made it possible to operationalize the concept of masculinity, in order to measure the dependent variable. In this example, we can see the transformation of an abstract concept, considering an object as more or less masculine, into a response on a potency scale. The results of this study showed that the participants subjectively placed the word man on a higher rank on the potency scale than the word woman. In addition, masculine words were also rated as more powerful than feminine words in both languages.

A second example of how to operationalize this research question comes from a study described by Boroditsky et al. (2003), in which Spanish-speaking and German-speaking participants carried out a memorization task in a language that does not have a grammatical gender, namely English. As in Konishi’s (1993) study, Boroditsky et al. compiled a list of objects with opposite grammatical genders in German and Spanish. Half of the nouns referring to the objects were masculine in Spanish and feminine in German, whereas the other half were feminine in Spanish and masculine in German. To form object–name pairs, each object was then associated with a first name, which was congruent with the object’s gender (e.g. apple-Patricia) for half of the participants and incongruent for the other half (e.g. tomato-Peter). The participants had to learn the first names associated with the objects and were then tested on their memory of the pairs. Here, the independent variable was operationalized as the association between a first name and an object’s grammatical gender in the participant’s mother tongue and had two modalities: congruent (masculine object and masculine first name or feminine object and feminine first name), or incongruent (masculine object and feminine first name, or feminine object and masculine first name). The dependent variable corresponded to the number of correct answers provided during the recall task. The results of this study showed that the participants remembered the first names associated with the objects under the congruent condition better than those under the incongruent condition.

A third example of operationalization can also be found in Boroditsky et al. (2003). Once again, participants whose mother tongue was German or Spanish completed a task in English, using a list of words of opposite grammatical genders in German and Spanish, similar to that of the previous study. This time, the participants were asked to list, for each word, the first three adjectives that came to mind. A group of English speakers then evaluated these adjectives to determine whether they predominantly related to female or to male characteristics. In this study, the independent variable was operationalized as the grammatical gender of a noun for an object in the participants’ mother tongue (feminine or masculine). The dependent variable corresponded to the feminine or masculine perception of the adjectives attributed to words. The results showed that the adjectives associated with feminine words in the participants’ mother tongue were assessed as predominantly feminine when compared to the adjectives associated with masculine words. The words that were feminine in Spanish but masculine in German were associated by the Spanish-speaking participants with adjectives perceived as predominantly feminine, whereas they were associated by German-speaking participants with adjectives perceived as predominantly masculine. The opposite was also true for words that were masculine in Spanish and feminine in German.

These three examples illustrate the fact that it is possible to operationalize the same concept in different ways. How can we decide on how to operationalize variables for research? A first clue can be found in the scientific literature already published on the subject of interest. It is therefore strongly advised to build on existing studies and to pay special attention to the way in which the variables have been operationalized. An in-depth literature review should make it possible to identify the different measures that have been used so far, as well as the results obtained on the basis of these measures.

The choice of a measure for operationalizing a variable should also be made keeping in mind the statistical analyses that will later be performed on the data. In fact, the quantitative data acquired in an experimental study must be statistically tested, in order to check whether an effect is real or not. In addition, as we have already discussed, not every statistical test can be applied to every measurement scale. This is very important to think about before collecting the data, in order to avoid reaching the last stage of the research process and realizing that the data cannot be analyzed as they should be.

2.6. Notions of reliability and validity of measurements

Finally, the essential element when choosing how to operationalize variables is to ensure the quality of the measurement. In the same way as for an experiment, this quality can be assessed by means of two concepts: the validity and the reliability of the measurement. The validity of a measurement refers to how well it measures what it intends to measure. Imagine an experiment in which we want to study the effect of word length on reading time. One way of measuring word length could be to count the number of letters in each word. It could also be possible to decide to count the number of syllables, rather than the number of letters. To measure the reading time, you could present isolated words on a computer screen and ask people to press a key when the word has been read. By measuring the time between the word’s appearance on the screen and the key press, it would be possible to deduce how long it took for each word to be read. In this example, the number of letters and the number of syllables are both valid measures for calculating word length. As regards the measurement of reading time, the proposal made here also seems valid, in that it makes it possible to accurately record the amount of time taken by people to read each word.
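The two operationalizations described above can be sketched in Python as follows. The `Trial` record, its field names and the timestamps are hypothetical and serve only as an illustration; in a real experiment, the timestamps would be supplied by the presentation software. Counting syllables is left aside here, because it requires language-specific resources.

```python
from dataclasses import dataclass


@dataclass
class Trial:
    word: str
    onset_ms: float     # time at which the word appeared on the screen
    keypress_ms: float  # time at which the participant pressed the key


def word_length_in_letters(word):
    """Operationalize word length as the number of alphabetic characters."""
    return sum(1 for char in word if char.isalpha())


def reading_time_ms(trial):
    """Operationalize reading time as the delay between onset and key press."""
    if trial.keypress_ms < trial.onset_ms:
        raise ValueError("key press recorded before word onset")
    return trial.keypress_ms - trial.onset_ms


# Hypothetical trials (timestamps in milliseconds).
trials = [Trial("cat", 1000.0, 1420.0), Trial("linguistics", 3000.0, 3655.0)]
measures = [(t.word, word_length_in_letters(t.word), reading_time_ms(t)) for t in trials]
for word, length, time_ms in measures:
    print(word, length, time_ms)
```

Each trial thus yields one value of the independent variable (word length) and one value of the dependent variable (reading time).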

Let us now think about an experiment designed to examine the impact of people’s personality on their ability to learn a foreign language. We can see that the variables of this research question are much more abstract than those previously examined and that they cannot be directly observed. Let us first examine the dependent variable in this question, namely the ease of learning a foreign language, and explore some different ways of operationalizing it. A first option would be to directly ask for people’s opinion by asking them to assess this ease on a scale from 0 to 5, for example. A second option would be to evaluate them after a few months of learning by means of a dictation in their second language and counting the number of errors. A third option would be to assess people’s skills in the areas of language production and comprehension after a few months of learning, using standardized tests, that is, tests developed and validated in previous studies.

The first option is the least valid measurement, because it is based on a subjective assessment and on a single question, which can be interpreted in many different ways by participants. It is therefore very likely that the scores on this scale do not directly reflect the ease of learning or, in some cases, do not reflect the concept that the researchers wish to measure at all. The second option seems to be a more valid measurement than the first, in the sense that it leaves practically no room for interpretation, since a number of errors is a concrete and objective measurement. However, it measures the ease of learning a foreign language on the basis of a single language-related task, a dictation, which only reflects spelling competence. Language-related skills obviously go well beyond the simple fact of not making mistakes in a dictation. Consequently, the validity of this measurement is limited, since it is too distant from the construct it is supposed to assess, and only targets one facet of language. On the other hand, the third option, which evaluates different skills in the second language, makes it possible to measure the dependent variable more comprehensively. Besides, as this measurement has already been used and validated, it seems the most adequate one.

Now, let us go back to the examples on how to operationalize the concept of assigning feminine or masculine characteristics to words, described above. We can notice that, in all cases, the measurement was a relatively distant signal from the original concept. In the first case, measuring the potency associated with a word as a signal of its masculinity is based on the idea that potency is a good indicator of masculinity, as a result of the gender stereotypes present in society. This measurement also assumes that the perception of potency is directly related to the grammatical gender of the word, rather than to other characteristics, such as its phonology, for example. In the second study, measuring the recall of word-first name associations is based on the idea that people generally remember congruent things, and that congruence is based on the association of the word’s grammatical gender with the gender of the first name, and not on other aspects. Finally, in the third study, the perception of the femininity or masculinity of the adjectives associated with the words also draws on other concepts, which are directly associated with the basic concept to greater or lesser degrees. This measurement is based on the stereotypes pervading society which potentially encourage people to associate feminine words with feminine characteristics and masculine words with masculine characteristics. It also depends on the evaluation of the masculinity or femininity of the adjectives chosen, made by external persons.

If we go back to the metaphor of the plane we introduced above, these measurements would correspond to the plane’s distant signals, which have a lower intensity in the sky and a less clear course. The quality of this signal, or in other words, its validity, could be questioned more easily than that of measurements such as reading time or the actions performed following a set of instructions. This illustrates the fact that the more abstract a concept, the more difficult it becomes to operationalize, and the more its measurements can become a topic for discussion. In cases like these, it would be appropriate to choose different ways of operationalizing the concept and to check that the results are consistent between the measurements, as has been done by the authors of these studies.

In addition to being valid, the measurement must be reliable. The reliability of a measurement denotes the fact that it always produces the same or almost the same result under the same conditions. For example, your scale gives a reliable measurement if, when you weigh yourself several times in the same day, the result is the same. When working with measurements, such as those used in an experimental linguistics study, things can be a bit more complicated. Indeed, these measurements are dependent on many factors, such as the fact that they are carried out on different people and that the conditions are impossible to keep completely constant. Thus, if we take the example of a study aiming to measure foreign language skills using standardized tests to assess production and comprehension skills, it is unlikely that the participants’ responses will be exactly the same if the tests are carried out several times. However, these responses should be similar enough so as to ensure the reliability of the measurement. There are different ways to assess the reliability of a measurement, but their presentation is beyond the scope of this chapter. Those interested can turn to the resources listed at the end of the chapter.

The validity and reliability of a measurement are two distinct concepts, and the fact that a measurement is reliable does not necessarily make it valid. Let us take the example of the scale again. If you weigh yourself several times a few minutes apart and every time the scale shows the same weight, which you know is yours, then the measurement can be considered both reliable and valid. If it indicates a weight twenty pounds higher than yours every time, the measurement is reliable, but not valid. A scale indicating your weight give or take one pound every time would be valid but not completely reliable. Finally, if the result was different every time and far from your weight, the measurement would be neither valid nor reliable. When choosing a measurement, you should, whenever possible, try to find the one that is both the most valid and the most reliable.
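The four scales described above can be sketched numerically. In the minimal Python sketch below, validity is approximated by the systematic error of the readings (how far their mean lies from the true weight) and reliability by their spread. The true weight of 150 pounds, the readings and the two tolerance thresholds are hypothetical values chosen only for illustration.

```python
from statistics import mean, pstdev

TRUE_WEIGHT = 150.0  # hypothetical true weight, in pounds


def describe(readings, bias_tol=2.0, spread_tol=0.5):
    """Summarize a series of readings (both tolerances are arbitrary assumptions)."""
    bias = mean(readings) - TRUE_WEIGHT   # systematic error, related to validity
    spread = pstdev(readings)             # variability, related to reliability
    return {
        "bias": round(bias, 2),
        "spread": round(spread, 2),
        "valid": abs(bias) <= bias_tol,
        "reliable": spread <= spread_tol,
    }


# Hypothetical readings for the four scales discussed in the text.
scales = {
    "accurate and consistent": [150.0, 150.0, 150.0, 150.0],
    "twenty pounds too high": [170.0, 170.0, 170.0, 170.0],
    "give or take one pound": [149.0, 151.0, 150.0, 151.0, 149.0],
    "erratic and far off": [120.0, 185.0, 140.0, 199.0],
}

for name, readings in scales.items():
    print(name, describe(readings))
```

Treating bias and spread as two separate quantities mirrors the point made above: a measurement can score well on one and poorly on the other.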

Finally, we should point out that the validity and reliability of measurements will greatly influence the overall validity and reliability of the experiment. As a matter of fact, the internal validity of an experiment partly depends on its ability to measure the variables, in order to be able to draw solid conclusions on the relationships between them. Likewise, its reliability depends on that of the measurements. In order to be replicated, an experiment requires reliable measurements that assess the phenomenon we want to observe in a consistent manner.

2.7. Choosing the modalities of independent variables

When the different variables have been operationalized, we still have to define the modalities of the independent variable, that is, we have to determine the conditions in which the participants will be included. These conditions must make it possible to clearly evaluate the influence of the independent variable on the dependent variable. For this, every experiment should compare at least two conditions: one in which the independent variable is present and one in which it is absent. For example, we could decide to compare the linguistic competences of children with language impairments with those of children without language impairments, or those of high-level learners with those of low-level learners.

There are different general ways to build conditions. In a between-subject design, each participant only takes part in one experimental condition (or modality). In a within-subject design, each participant takes part in all the experimental conditions. Between-subject designs must be used for evaluating the influence of an independent variable that cannot be manipulated, for example, being monolingual or bilingual. When the independent variable can be manipulated, it is possible for the same person to take part in all the conditions, or in only one condition. For instance, going back to the examples of studies on linguistic relativity, it is possible to ask a participant to take part only in the word-first name congruent condition, or in both conditions (congruent and incongruent). In the first case, we could observe whether the participants in the congruent condition recall more pairs than the participants in the incongruent condition. In the second case, we could compare the numbers of pairs recalled between the two conditions for all participants. We will return to these different designs, as well as their advantages and disadvantages, in Chapter 6, which will describe the practical aspects of an experiment.
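The difference between the two designs can be sketched as two assignment procedures. The Python sketch below is only an illustration: the function names, the participant labels and the fixed random seed are assumptions, and the condition labels are borrowed from the word-first name example.

```python
import random


def between_subject(participants, conditions, seed=0):
    """Each participant is placed in exactly one condition."""
    rng = random.Random(seed)        # fixed seed only to make the sketch repeatable
    shuffled = list(participants)
    rng.shuffle(shuffled)
    # Deal the shuffled participants out in turn so that group sizes stay balanced.
    return {p: [conditions[i % len(conditions)]] for i, p in enumerate(shuffled)}


def within_subject(participants, conditions):
    """Each participant takes part in all the conditions."""
    return {p: list(conditions) for p in participants}


participants = ["P1", "P2", "P3", "P4", "P5", "P6"]   # hypothetical labels
conditions = ["congruent", "incongruent"]

print(between_subject(participants, conditions))
print(within_subject(participants, conditions))
```

In the first layout, the comparison is made between two groups of people. In the second, it is made between two sets of scores from the same people.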

As to the choice of the independent variable modalities, it is often less simple than it seems at first glance to construct a condition in which the independent variable is present and one in which it is absent. To illustrate this difficulty, we will begin with an example outside the field of linguistics, namely the question of drug effectiveness. To test its effectiveness, the drug should be given to one group of patients, not given to another group of patients, and the condition of the two groups should be compared after some time. If the experimental group, who took the drug, reports a greater decrease in symptoms than the reference group, who took nothing, then we can conclude that the drug is effective. However, this conclusion would be wrong in the presence of the so-called placebo effect. This effect reflects the notion that the mere act of taking a pill leads certain people to believe that their condition will improve, something which can actually have an impact on their general condition. For this reason, the results of the experimental group should be compared with those of a control group, made up of people who do not take the medicine, but a pill looking exactly like it except that it lacks the active substance. By comparing the results of the different groups, we can thus show the real effect of the substance (the experimental group vs. the control group) and the placebo effect (the control group vs. the reference group).
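The two comparisons just described amount to simple differences between group means. The sketch below works through them in Python with invented symptom-reduction scores. The numbers are hypothetical and only serve to show which groups are compared in each case.

```python
from statistics import mean

# Hypothetical symptom-reduction scores (higher = greater improvement).
groups = {
    "experimental": [8, 9, 7, 8],  # took the drug
    "control": [4, 5, 3, 4],       # took a look-alike pill without the active substance
    "reference": [1, 2, 1, 2],     # took nothing
}

means = {name: mean(scores) for name, scores in groups.items()}

# Real effect of the substance: experimental group vs. control group.
substance_effect = means["experimental"] - means["control"]
# Placebo effect: control group vs. reference group.
placebo_effect = means["control"] - means["reference"]
# Comparison without a control group: both effects are mixed together.
naive_effect = means["experimental"] - means["reference"]

print("substance:", substance_effect)
print("placebo:", placebo_effect)
print("experimental vs. reference:", naive_effect)
```

With these invented scores, the comparison between the experimental and reference groups equals the sum of the other two differences, which is why it cannot separate the effect of the substance from the placebo effect on its own.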

In experimental linguistics, an effect similar to the placebo effect could be problematic in an experiment aimed at determining the effectiveness of a language teaching method, for example. Let us imagine the hypothesis that the positive feedback from a teacher improves learners’ skills. Comparing only one condition including a positive comment with a condition where there is no comment would not lead to reliable conclusions. Indeed, if the results of the group receiving positive comments are better than those of the other group, this could be due to the simple presence of a comment. For this experiment to be valid, a group receiving another type of comment (e.g. a neutral or a negative one) should then be added. The results of this group should then be compared with those of the experimental group. This would make it possible to differentiate the effect of the presence of a comment (neutral comment vs. no comment) from that of a positive comment (positive comment vs. neutral comment).

Certain research questions also require the presence of more than two modalities for testing the independent variable. For example, in order to investigate the effect of age on certain linguistic competences, it might seem appropriate to examine more than two age categories, in order to offer a complete vision of the phenomenon. This is illustrated by an experiment in which Zufferey and Gygax (2020) studied the knowledge of connectives such as aussi (which roughly corresponds to therefore) and en outre (which roughly corresponds to in addition) in French-speaking adults. As connectives may differ on many aspects (their preferential use in the spoken or written discourse, their frequency, the type of coherence relation they encode), a choice was made of four connectives mainly used in speech and four connectives mainly used in the written modality, also differing on other aspects. The connectives were inserted into sentences correctly or incorrectly. Participants had to judge whether each sentence was correct or incorrect. Using several connectives made it possible to highlight certain variables which influence the mastery of connectives, something which would not have been possible in an experiment using only one type of connective.

In the above-mentioned examples, we can see that there is no simple answer to the question about the number or type of modalities to be chosen for an independent variable. The choice strongly depends on the research question and the conclusions that the researcher wants to draw. The following chapters, devoted to studies in language production (Chapter 3) and language comprehension (Chapters 4 and 5), will describe research related to different fields in experimental linguistics. These chapters will offer additional illustrations of choices related to the operationalization of different research questions, as well as the characterization of the experimental conditions for the chosen variables.

2.8. Identifying and controlling external and confounding variables

From the examples discussed so far, we can conclude that, when choosing the modalities of the independent variable, it is not only necessary to build a condition in which the variable is present and another in which it is absent, but that these conditions should be comparable in all other respects. If the two conditions differ in other respects, it is not possible to draw conclusions on the effect of the independent variable. When reflecting on the operationalization of variables, it is therefore essential to think about external variables to be taken into account during the construction of the experimental design. External variables are the variables which can influence the results but are not directly investigated in an experiment. Let us take the example of a study that uses reading time to investigate the effect of a variable. In this case, the reading time should vary according to the modality of the independent variable. But the reading time will certainly also be influenced by other variables, such as the participants’ reading habits, their reading speed, their personal reaction to the variable under study, or even the characteristics of the items (word length and word frequency, or sentence complexity, for instance), and those related to the act of conducting the experiment (time of day, place, etc.).

All of these external variables add “noise” to the dependent variable. This means that the measurement does not only depend on the influence of the independent variable, but also on that of all the external variables. One of the ways to minimize the impact of these external variables on the dependent variable is to test a large number of people using a large number of items. By doing this, the measurement portion associated with noise will decrease, since the external variables generally have a random influence on the measurement. This influence could be high for one trial or one participant, and weak for another trial or another participant. By testing many participants and many items, the noise portion in the measurement should therefore tend towards zero. On the other hand, the measurement portion associated with the independent variable should be maximized, since the effect of this variable should be the same for every trial and every participant.
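The reasoning above can be illustrated with a small simulation. The Python sketch below assumes a hypothetical true effect of 30 ms on reading time and random noise with a standard deviation of 100 ms. It then checks how far the estimated effect typically lies from the true one when 10, 100 or 1,000 trials are collected per condition. All values, names and the fixed seed are assumptions made for the illustration.

```python
import random
from statistics import mean

TRUE_EFFECT = 30.0   # hypothetical effect of the independent variable, in ms
NOISE_SD = 100.0     # hypothetical size of the random influence of external variables


def estimated_effect(n_trials, rng):
    """Difference between condition means, each based on n_trials noisy measurements."""
    baseline = [500.0 + rng.gauss(0, NOISE_SD) for _ in range(n_trials)]
    treated = [500.0 + TRUE_EFFECT + rng.gauss(0, NOISE_SD) for _ in range(n_trials)]
    return mean(treated) - mean(baseline)


def typical_error(n_trials, repetitions=100, seed=1):
    """Average distance between the estimate and the true effect over many simulated experiments."""
    rng = random.Random(seed)
    errors = [abs(estimated_effect(n_trials, rng) - TRUE_EFFECT) for _ in range(repetitions)]
    return mean(errors)


for n in (10, 100, 1000):
    print(n, "trials per condition, typical error:", round(typical_error(n), 1))
```

Because the simulated noise is random, it partly cancels out when many measurements are averaged, whereas the effect of the independent variable is added to every trial and is therefore preserved.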

Another way to minimize the effect of external variables is to implement, whenever possible, a within-subject design. As we said above, in this type of design, every participant takes part in all the conditions. Since the external variables associated with each participant (reading speed, reaction to the independent variable, for example) remain constant from condition to condition, in principle, the comparison of the measurement between conditions should provide a precise indication of the effect of the independent variable. Similarly, whenever possible, a within-item design should also be implemented. In such a design, every item is presented under the different conditions, in order to minimize the impact of external variables related to the items themselves. We will return in detail to the means for building such designs in Chapter 6.
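A minimal simulation can show why comparing each participant with themselves is informative. In the Python sketch below, each simulated participant has their own habitual reading speed, which is the same in both conditions. The effect size, the noise levels, the number of participants and the seed are hypothetical.

```python
import random
from statistics import mean, stdev

rng = random.Random(42)   # fixed seed only to make the sketch repeatable
TRUE_EFFECT = 30.0        # hypothetical slowdown in condition B, in ms

# External variable tied to the participant: habitual reading speed.
baselines = [rng.gauss(500, 80) for _ in range(40)]

# Within-subject design: every participant is measured in both conditions.
condition_a = [b + rng.gauss(0, 10) for b in baselines]
condition_b = [b + TRUE_EFFECT + rng.gauss(0, 10) for b in baselines]

# The participant's own baseline cancels out in the per-participant difference.
differences = [b - a for a, b in zip(condition_a, condition_b)]

print("spread of raw reading times:", round(stdev(condition_a), 1))
print("spread of per-participant differences:", round(stdev(differences), 1))
print("estimated effect:", round(mean(differences), 1))
```

The raw reading times vary widely from one participant to the next, but the differences computed within each participant vary much less, since the constant individual baseline is subtracted out.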

When it is not possible to test every item or every participant under the different conditions, it may be useful to control the external variables in other ways. A first possibility would be to randomly assign the participants to the different conditions of the experiment. By doing this, we rely on the principle that chance tends to even things out, so that there is a good chance that the different modalities of the external variables will be evenly distributed across the conditions of the experiment. However, this solution is not suitable for experiments using a limited number of participants.
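The limitation mentioned for small samples can be made concrete with a simulation. The Python sketch below randomly splits simulated participants into two groups and measures how much the groups differ in mean age, an external variable chosen here only as an example. The age range, sample sizes, number of repetitions and seed are assumptions.

```python
import random
from statistics import mean


def age_gap_after_random_assignment(n_participants, rng):
    """Absolute difference in mean age between two randomly formed groups."""
    ages = [rng.randint(18, 70) for _ in range(n_participants)]
    rng.shuffle(ages)                 # random assignment to the two conditions
    half = n_participants // 2
    return abs(mean(ages[:half]) - mean(ages[half:]))


def average_gap(n_participants, repetitions=300, seed=7):
    """Average age gap between groups over many simulated random assignments."""
    rng = random.Random(seed)
    gaps = [age_gap_after_random_assignment(n_participants, rng) for _ in range(repetitions)]
    return mean(gaps)


for n in (10, 100, 1000):
    print(n, "participants, average age gap:", round(average_gap(n), 2))
```

With only a handful of participants, random assignment can leave the groups noticeably unbalanced on the external variable, while the imbalance shrinks as the sample grows.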

A second possibility would be to keep the external variable at a constant level, by choosing a modality of the external variable and only testing people or items corresponding to this modality. For example, we could decide to test only people of the same age, of a similar educational level or to take into account only low frequency words or sentences with the same complexity. However, in this case, the external validity of the study might be threatened, since the results cannot be generalized to other groups of people or to other types of items.

Another possibility would be to gather groups of participants in which all the modalities of the external variables are represented. For example, we could include as many women as men, or as many low and high frequency words for every modality of the independent variable examined. This possibility might solve the problems of external validity raised above but could complicate data collection, depending on the number of external variables to be controlled. It is also practically impossible to control external variables such as reading speed, the level of involvement of the participants in the study or their reaction to the independent variable, because the modality of these variables to which each participant belongs can only be known during or after the experiment.

To conclude, we can note that these different possibilities can be combined in the same experiment, in order to control the effect of several external variables. In general, perfect control of external variables is unattainable, since these variables are multiple and varied. Therefore, researchers generally only focus on the most relevant external variables for a specific experiment. However, there is one type of external variable that cannot be neglected, namely confounding variables.

Confounding variables are variables whose levels vary systematically along with the levels of the independent variable. Due to their systematic variation along with the independent variable, confounding variables offer an alternative explanation to the results found in the project and can thereby threaten the internal validity of the experiment.

Let us take a few examples to illustrate the problem of confounding variables. Let us imagine that an experiment has shown that people take longer to read infrequent words than frequent ones. Let us suppose that in this experiment, the participants had to read either frequent words or infrequent words. The words appeared one by one on a computer screen and the participants had to press a key after having read each word, which recorded the reading time. Frequent words were: body, hotel, husband, water, paper, table. Less frequent words were: abdomen, clarinet, eloquence, manuscript, obelisk, rosemary. By examining the words used in the experiment, we quickly perceive that these words differ not only in terms of their frequency of appearance in the language, but also as regards their length, and that this occurs systematically. In other words, infrequent items are longer (three syllables) than frequent items (two syllables). The result could therefore just as easily stem from the fact that longer words take longer to read. The existence of this confounding variable makes it impossible to draw a conclusion as to the relationship between frequency and reading time. In order to overcome this problem, items of similar length should have been chosen for the high and low frequency conditions.
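A quick check of the item lists makes the confound visible before any data are collected. The Python sketch below uses the two word lists given above and compares their lengths. Counting letters is used here as a simple stand-in for the syllable counts mentioned in the text, which is an assumption of this sketch.

```python
from statistics import mean

# Item lists from the example above.
frequent = ["body", "hotel", "husband", "water", "paper", "table"]
infrequent = ["abdomen", "clarinet", "eloquence", "manuscript", "obelisk", "rosemary"]


def mean_length(words):
    """Average number of letters per word (a rough proxy for word length)."""
    return mean([len(word) for word in words])


print("frequent words, mean length:", round(mean_length(frequent), 2))
print("infrequent words, mean length:", round(mean_length(infrequent), 2))
print("length differs systematically:", mean_length(infrequent) > mean_length(frequent))
```

Running the same check on candidate items for every known external variable (length, frequency, and so on) is one simple way to spot a systematic difference between conditions while the materials can still be changed.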

Imagine another experiment in which we are interested in the role played by practicing a language in tandem, so as to speed up the learning process. Let us suppose that, in this study, we decide to recruit a group of learners who take part in exchanges with other learners one evening per week, and a group of learners who do not take part in any activity of this type. These two groups of learners are then compared using language proficiency indicators. Here, a confounding variable could be the motivation to learn a language. It is indeed very likely that people who invest their own time in an additional activity for studying a foreign language are more motivated to learn it. This characteristic will probably have effects on their foreign language proficiency. One way of avoiding the confounding variable would have been to manipulate the independent variable by only choosing participants of the same level who do not practice the language outside the classroom, and then to ask half of them to form tandems. In this way, it could no longer be argued that the people in one group are intrinsically more motivated to study than those in the other.

In the examples above, we can see that a confounding variable is more likely to appear in the experiments having between-subject or between-item designs. In these designs, different participants or items are included in the different conditions. As a consequence, there is a higher risk that an additional variable plays a role in the results than in within-subject or within-item designs, where the participants and the items are included in all the conditions. Likewise, a confounding variable is more likely to appear in a quasi-experiment in which the independent variable cannot be manipulated by researchers and is inherent to the participants or the items. When we build conditions on the basis of a variable that cannot be manipulated, the groups have a significant probability of systematically differing on other aspects than the one examined. Therefore it is essential to think about the different designs possible for an experiment. When possible, variables should be manipulated instead of simply observed, and repeated measurements should be used.

2.9. Conclusion

In this chapter, we first saw that a good experiment must be valid and reliable. The validity of an experiment is based on two main aspects: internal validity and external validity. From the point of view of internal validity, an experiment should lead to a clear conclusion concerning the influence of an independent variable on a dependent variable. In other words, the changes observed on the dependent variable should only stem from the manipulation of the independent variable. From the point of view of external validity, the conclusions observed at the sample level should be generalizable beyond the specific conditions of the experiment. We have also seen that an experiment must be reliable, that is, it should lead to similar results if it is conducted several times.

Next, we defined the different variables involved in an experiment and described four types of scales used for measuring them. We saw that the different types of scales do not have the same properties and therefore do not support the same analyses. Finally, we presented the steps involved in the operationalization of a research hypothesis. Firstly, we have to choose a valid and reliable measure for quantifying the variables. Secondly, the modalities of the independent variable must be defined in order to make it easier to reach a clear conclusion as to the effect of this variable. Finally, we discussed the control of those external variables which possibly influence the results of an experiment, as well as the importance of identifying any confounding variables that could jeopardize the conclusions drawn from a study.

2.10. Revision questions and answer key

2.10.1. Questions

  1) Identify the independent and dependent variables for the following hypotheses:
    a) bilingual children have better math skills and a better ability to learn a new language than monolingual children;
    b) mastery in the use of connectives depends on their frequency in language and the reading habits of speakers.
  2) What type of scale (nominal, ordinal, interval, ratio) corresponds to each of the variables below?
    a) The time required for fixating words, measured using a device that records eye movements.
    b) Each participant’s mother tongue.
    c) Each participant’s year of birth, from 1990 to 2000.
    d) Agreement with a statement, measured on a scale with the following options: strongly disagree, somewhat disagree, somewhat agree, strongly agree.
  3) List different ways of operationalizing the following question, specifying the measurements used for the different variables and the conditions chosen for the independent variable: does a person’s empathy level (ability to understand the emotions of others) have an influence on understanding emotions when reading?
  4) What type of validity is threatened in the following studies? For what reasons?
    a) An experiment carried out on bilingual university students has shown that the comprehension of anaphora depends on verbal working memory capacities.
    b) A study has shown that bilingual people change their personality depending on the language used. In their mother tongue, people were described by their friends as being more extroverted and communicative than in their second language.
  5) Which external variables should be controlled to investigate the following hypothesis: sentences conveying an emotional content are read more slowly than neutral sentences?
  6) Identify the confounding variables that may be involved in the following study: an experiment has investigated the influence of private lessons on the reading skills of deaf children and children without any hearing loss. To do this, the two groups of children had to read a half-page text and then answer questions about it. They then benefited from private lessons for two months, after which they had to read a one-page text and answer questions. The results showed no significant benefit of private lessons for the children. Children without any hearing loss provided the same number of correct answers, whereas deaf children gave fewer correct answers on the second test than on the first one.

2.10.2. Answer key

  1) a) The independent variable is the number of languages spoken by children (one vs. two). This hypothesis has two dependent variables, math skills and the ability to learn a new language.
    b) In this case, there are two independent variables. The first corresponds to connective frequency in the language and the second to reading habits. The dependent variable is the mastery of connectives.
  2) a) Word fixation time is a quantitative variable measured on a ratio scale. The data acquired on this scale can range from zero to several thousand milliseconds. It is also possible to rank the fixation times and to apply arithmetic operations to them.
    b) Each participant’s mother tongue corresponds to a qualitative variable, measured on a nominal scale (e.g. French, Italian, German).
    c) Each participant’s birth year is a quantitative variable measured on an interval scale. It is possible to rank the participants and to find out their age difference. However, it would be inappropriate to apply other operations, such as multiplication or division, to birth years.
    d) Agreement on a scale with four options is a qualitative variable, measured on an ordinal scale. The answers can be ordered, but the size of the gap between the options cannot be determined.
  3) In order to operationalize the question, it is necessary to identify the variables and then choose an objective measurement for these variables. Here, the independent variable corresponds to the empathy level of the participants, whereas the dependent variable corresponds to the understanding of emotions while reading. In order to measure the empathy level, one can turn to a standardized test for measuring this ability, for example, the Interpersonal Reactivity Index (Davis 1980) or the Empathy Quotient (Baron-Cohen and Wheelwright 2004). On the basis of the score obtained in one of these questionnaires, it would be possible to classify the participants into two groups. There are different possibilities for quantifying the understanding of emotions while reading. A first solution could be to present short excerpts describing emotions, ask the participants to name the emotion of the characters, and then count the number of correct answers. Another possibility would be to turn to online measurements (see Chapter 5) to evaluate the derivation of emotional inferences during reading. To do this, we could present the participants with short excerpts, again in which the character feels an emotion, and then measure the reading time of target sentences displaying the emotion. Of course, there are other ways of studying this question, using the different methods presented in Chapters 4 and 5.
  4) a) The validity of an experiment depends on the possibility of drawing reliable conclusions concerning the relationship between variables under study (internal validity), as well as on the possibility of generalizing the results beyond the method, the participants and the items examined in the experiment (external validity). In the first case, the external validity of the study would be limited, as only female students were tested. It would not be possible to say that, in general, the understanding of anaphora depends on verbal working memory capacities; this conclusion should be limited to the population the sample of participants comes from.
    b) In this case, the study aimed at evaluating changes in the personality of people speaking a foreign language. Personality aspects were assessed by friends of the participants, who probably had to agree with statements such as “This person is more extroverted when they speak their mother tongue than when they speak another language.” The internal validity of this study is compromised, since the validity of the measurement employed can be called into question. It would have been appropriate to choose a more objective personality measurement indicator or to use other personality assessment tools for confirming the results.
  5) The dependent variable of this hypothesis corresponds to the time spent reading sentences (reading time), whereas the independent variable corresponds to whether such sentences convey emotional content or not. In order to isolate the effect of the independent variable on the dependent variable, it is essential to create conditions in which the sentences differ only in terms of emotional content and not on other variables that may influence the reading time. At the item level, the variables to control are typically sentence length, syntactic complexity and the frequency of the words used. At the participant level, general reading speed or comprehension skills can also influence the dependent variable. It would be appropriate to set up a design with repeated measurements, in which each person would take part in all the conditions in order to keep the external variables related to the participants at a constant level. It could also be interesting to measure the general competences associated with the participants’ understanding of emotions (e.g. empathy levels) in order to see whether and how they influence the reading times in the different conditions.
  6) Different points make the conclusions of this study questionable. First, children’s comprehension was operationalized as the number of correct answers given to questions that had to be answered in writing. By asking the children to respond in writing, an additional variable comes into play in the experiment, namely the children’s writing skills. It was therefore not only the comprehension skills that were measured, but also the children’s skills for writing down their understanding. Second, the tests performed two months apart were different. While the first text was half a page long, the second text was one page long, twice as long as the first. This introduced an additional variable into the experiment, namely the memory capacities of children. These are likely to play a more significant role during a test on a one-page excerpt than on a half-page excerpt.

2.11. Further reading

For more detailed explanations on the different types of measurement and the choices to be made during operationalization, we recommend Chapter 2 of Field and Hole (2003). For more details on the different types of validity and reliability, for experiments or measurements, we refer readers to Chapters 5 and 6 of Price et al. (2013). This book illustrates, with simple examples, the different validities involved in research, as well as their influence on each other. For more information on questionnaires, it is possible to turn to Rasinger (2010) and Wagner (2015), among others.

  1 http://cos.io.