5. EVALUATING INTERACTIVE IR USER STUDIES OF DIFFERENT TYPES
According to the first premise, the values of the major facets should be determined according
to the specific research focus and problem(s). The second premise suggests that different facets or
dimensions cannot be evaluated in a separate, individual manner. Instead, we should focus on how
different facet values interact and “collaborate” in facilitating the investigation of research problems.
Existing work on IR study evaluation mainly focuses on data analysis and statistical result re-
porting practices (e.g., Sakai, 2016), or reports a small set of user study components separately without
fully revealing the possible connections among them (e.g., Kelly and Sugimoto, 2013). To address
this issue and to emphasize the role of the connections among different facets of user studies, we
employ our faceted framework in evaluating different combinations of facet values that represent
different decisions, and even compromises, made in varying problem spaces.
In the following sections, we explain the idea of faceted evaluation for each of the three types of user
studies (i.e., understanding user behavior and experience, evaluating IIR system/interface features,
and meta-evaluation of evaluation metrics). To fully illustrate the connections among facets
and to evaluate divergent study design decisions, we apply the faceted framework in evaluating a set
of representative user studies reported in recently published research papers. Given the comprehen-
siveness of our faceted framework, it is safe to say that our approach can be applied in evaluating a
wide range of IIR user studies.
5.1 UNDERSTANDING USER BEHAVIOR AND EXPERIENCE
In faceted evaluation, we first selected and focused on a series of major facets based on the research
focus and questions, aiming to accurately identify and represent the major decisions and compro-
mises made in study design. Specifically, for each examined user study under the corresponding
category (here, understanding user behavior and experience), we reviewed the values of the sub-
facets (i.e., independent variable, quasi-independent variable, dependent variable) under the variable
facet, and assessed the implicit connections between them and the values of other relevant facets
and subfacets in our framework. The core argument behind this approach is that the decision on,
and evaluation of, one facet value should take into consideration other relevant facet values and their
impacts on the procedure and results of the study.
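Although faceted evaluation itself is a qualitative exercise, it may help to think of each examined study as a structured record of facet values whose connections, rather than the individual values, are the unit of analysis. The minimal sketch below (in Python) illustrates one way such a record could be organized, using the Edwards and Kelly (2017) study discussed in the next paragraph; all field names are hypothetical stand-ins for the framework's facets and subfacets, not part of the framework itself.

```python
# Illustrative sketch only: facet and subfacet names are hypothetical
# stand-ins, keyed to the Edwards and Kelly (2017) example below.

study = {
    "research_focus": "inferring emotion from behavior in search tasks",
    "facets": {
        "variable": {  # subfacets under the 'variable' facet
            "independent": ["task", "SERP quality"],
            "dependent": ["search behavior", "emotional state"],
        },
        "task_and_topic": {"task_type": "evaluate", "topics": "varied"},
        "system_interface": {"elements_varied": ["SERP quality"]},
        "study_design": {"assignment": "within-subjects",
                         "setting": "laboratory"},
    },
    # The connections among facet values are what faceted evaluation
    # inspects: each entry records why one design decision depends
    # on another.
    "connections": [
        ("task_and_topic.task_type", "research_focus",
         "evaluate tasks elicit rich, multi-round interaction"),
        ("task_and_topic.topics", "variable.independent",
         "varying topics isolates task effects from topic effects"),
    ],
}

# Walk the recorded connections, the unit of analysis in this sketch.
for source, target, rationale in study["connections"]:
    print(f"{source} -> {target}: {rationale}")
```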
For instance, in Edwards and Kelly (2017), researchers used both the task and SERP quality to
manipulate participants’ emotional states in web search in a within-subjects study. In this
case, our major facets and subfacets of interest in evaluation include: search task and topic, system
interface elements varied (SERP quality), and search behavior and experience. Given the goal of this
research (inferring emotion from behavior in search tasks), the researchers assigned evaluate tasks in
the study, as these tasks are engaging and complex enough to elicit multi-round, rich interactions
between users and systems in laboratory settings. The task topics were varied so that researchers could
better control the effects of specific topics and observe relatively “clean” task effects. Liu et al. (2019)