Getting the First and Second Decimals Right: Psychometrics of Stealth Assessment

Copyright © 2023 | Pages: 29
DOI: 10.4018/979-8-3693-0568-3.ch006
Abstract

Stealth assessment, like all assessments, must have three essential psychometric properties: validity, reliability, and fairness. Evidence-centered assessment design (ECD) provides a psychometrically sound framework for designing assessments based on a validity argument. This chapter describes how using ECD in the design of a stealth assessment helps designers meet the psychometric goals. It also discusses how to evaluate a stealth assessment's validity, reliability, and fairness after it is designed and implemented.

Introduction

The field of psychometrics in education is mainly concerned with quantitative and qualitative methods, techniques, and guidelines for designing and developing high-quality measurements and assessments. According to Messick (1994), “…validity, reliability, comparability, and fairness need to be uniformly addressed for all assessments because they are not just measurement principles, they are social values that have meaning and force outside of measurement wherever evaluative judgments and decisions are made” (p. 13). In this chapter, we focus on reliability, validity, and fairness in stealth assessment (Shute, 2011). First, we review what these three psychometric properties mean, and later in the chapter we discuss common methods for evaluating them.

Reliability refers to the consistency of an assessment. For example, a highly reliable bathroom scale shows a person’s weight as about 140 lbs. in the morning, afternoon, and evening. Conversely, a scale with low reliability shows that same person’s weight as 140 lbs. in the morning, 80 lbs. in the afternoon, and 190 lbs. in the evening. Reliability is an inherent property of a measurement: the question is not whether a measure is reliable, but whether it is sufficiently reliable for a given purpose (American Educational Research Association et al., 2014). The extent to which an assessment is reliable (consistent) can be evaluated using various techniques (e.g., correlating scores from two parallel test forms). We discuss some of those techniques later in this chapter.
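As an illustration of the parallel-forms technique mentioned above, the sketch below estimates reliability as the Pearson correlation between scores on two parallel forms of the same test. The scores and the `pearson` helper are hypothetical, invented for this example; in practice one would use a statistics library and real examinee data.

```python
# Parallel-forms reliability sketch: correlate scores from two
# hypothetical parallel forms of the same test. Values near 1
# indicate high consistency between the forms.
from math import sqrt

def pearson(x, y):
    """Pearson product-moment correlation between two score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical scores for ten examinees on Form A and Form B.
form_a = [12, 15, 9, 20, 17, 11, 14, 18, 10, 16]
form_b = [13, 14, 10, 19, 18, 12, 13, 17, 11, 15]

reliability = pearson(form_a, form_b)
```

The same correlation machinery underlies other classical reliability estimates (e.g., test–retest correlations), differing only in how the two score vectors are obtained.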

Validity refers to the extent to which an assessment is assessing what it claims to assess (Messick, 1994; Shute, 2009). Similarly, the Standards for Educational and Psychological Testing (American Educational Research Association et al., 2014) indicates that “validity refers to the degree to which evidence and theory support the interpretations of test scores for proposed uses of tests” (p. 11). One might say, “I am assessing creativity,” but are they? If they are, how accurately are they assessing creativity (or any other competency), and how accurately are they interpreting the results of their assessment (Shute, 2009)? An alternative word for validity could be accuracy. Kane (2006) approaches validation as constructing a formal argument, accumulating evidence for why the scores support the proposed interpretation. There are several types of validity evidence (e.g., content, construct, and criterion validity), and a complete validity argument will draw on multiple types. Note that reliability is a prerequisite for validity: an assessment cannot accurately measure the target construct (high validity) if its results are not consistent (low reliability).

Fairness refers to the extent to which an assessment is equitable and unbiased across various subgroups (DiCerbo et al., 2016; Dorans & Cook, 2016; Mislevy et al., 2013). To say what is fair, one can start by saying what is not fair (Dorans & Cook, 2016). For example, from the assessment-design perspective, an assessment is not fair if it includes items involving culturally sensitive concepts that are appropriate for some groups of test takers but not for others. From the assessment-administration perspective, if an assessment requires equipment that some test takers have and others do not (e.g., a computer and Internet access for an assessment administered in a remote village in Africa), that assessment is not fair.
