-
Modeling the Intraindividual Relation of Ability and Speed within a Test Journal of Educational Measurement (IF 1.188) Pub Date : 2024-04-20 Augustin Mutak, Robert Krause, Esther Ulitzsch, Sören Much, Jochen Ranger, Steffi Pohl
Understanding the intraindividual relation between an individual's speed and ability in testing scenarios is essential to assure a fair assessment. Different approaches exist for estimating this relationship, each relying either on specific study designs or on specific assumptions. This paper aims to add to the toolbox of approaches for estimating this relationship. We propose the intraindividual speed‐ability‐relation
-
Differential and Functional Response Time Item Analysis: An Application to Understanding Paper versus Digital Reading Processes Journal of Educational Measurement (IF 1.188) Pub Date : 2024-04-09 Sun‐Joo Cho, Amanda Goodwin, Matthew Naveiras, Jorge Salas
Despite the growing interest in incorporating response time data into item response models, there has been a lack of research investigating how the effect of speed on the probability of a correct response varies across different groups (e.g., experimental conditions) for various items (i.e., differential response time item analysis). Furthermore, previous research has shown a complex relationship between
-
Modeling Hierarchical Attribute Structures in Diagnostic Classification Models with Multiple Attempts Journal of Educational Measurement (IF 1.188) Pub Date : 2024-03-30 Tae Yeon Kwon, A. Corinne Huggins-Manley, Jonathan Templin, Mingying Zheng
In classroom assessments, examinees can often answer test items multiple times, resulting in sequential multiple-attempt data. Sequential diagnostic classification models (DCMs) have been developed for such data. As student learning processes may be aligned with a hierarchy of measured traits, this study aimed to develop a sequential hierarchical DCM (sequential HDCM), which combines a sequential DCM
-
A Bayesian Moderated Nonlinear Factor Analysis Approach for DIF Detection under Violation of the Equal Variance Assumption Journal of Educational Measurement (IF 1.188) Pub Date : 2024-03-16 Sooyong Lee, Suhwa Han, Seung W. Choi
Research has shown that multiple‐indicator multiple‐cause (MIMIC) models can result in inflated Type I error rates in detecting differential item functioning (DIF) when the assumption of equal latent variance is violated. This study explains how the violation of the equal variance assumption adversely impacts the detection of nonuniform DIF and how it can be addressed through moderated nonlinear factor
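As a rough illustration of the moderated nonlinear factor analysis idea (the parameterization below is a generic MNLFA sketch, not necessarily the authors' exact specification), item intercepts, loadings, and the latent variance can all be written as functions of a group covariate $x$, so that group differences in latent variance are modeled rather than mistaken for DIF:

$$
\nu_i(x) = \nu_{0i} + \nu_{1i}x, \qquad
\lambda_i(x) = \lambda_{0i} + \lambda_{1i}x, \qquad
\psi(x) = \psi_0 \exp(\gamma x),
$$

where a nonzero $\nu_{1i}$ signals uniform DIF, a nonzero $\lambda_{1i}$ signals nonuniform DIF, and $\gamma$ captures the violation of the equal-variance assumption.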
-
Optimal Calibration of Items for Multidimensional Achievement Tests Journal of Educational Measurement (IF 1.188) Pub Date : 2024-03-14 Mahmood Ul Hassan, Frank Miller
Multidimensional achievement tests have recently been gaining importance in educational and psychological measurement. For example, multidimensional diagnostic tests can help students determine which particular domain of knowledge they need to improve for better performance. To estimate the characteristics of candidate items (calibration) for future multidimensional achievement tests, we use optimal
-
Issue Information Journal of Educational Measurement (IF 1.188) Pub Date : 2024-03-02
Editor CHUN WANG, University of Washington
-
Argument-Based Approach to Validity: Developing a Living Document and Incorporating Preregistration Journal of Educational Measurement (IF 1.188) Pub Date : 2024-02-14 Daria Gerasimova
I propose two practical advances to the argument-based approach to validity: developing a living document and incorporating preregistration. First, I present a potential structure for the living document that includes an up-to-date summary of the validity argument. As the validation process may span across multiple studies, the living document allows future users of the instrument to access the entire
-
DIF Detection for Multiple Groups: Comparing Three-Level GLMMs and Multiple-Group IRT Models Journal of Educational Measurement (IF 1.188) Pub Date : 2024-02-14 Carmen Köhler, Johannes Hartig, Lale Khorramdel, Artur Pokropek
For assessment scales applied to different groups (e.g., students from different states; patients in different countries), multigroup differential item functioning (MG-DIF) needs to be evaluated in order to ensure that respondents with the same trait level but from different groups have equal response probabilities on a particular item. The current study compares two approaches for DIF detection: a
-
A Dual-Purpose Model for Binary Data: Estimating Ability and Misconceptions Journal of Educational Measurement (IF 1.188) Pub Date : 2024-01-04 Wenchao Ma, Miguel A. Sorrel, Xiaoming Zhai, Yuan Ge
Most existing diagnostic models are developed to detect whether students have mastered a set of skills of interest, but few have focused on identifying what scientific misconceptions students possess. This article developed a general dual-purpose model for simultaneously estimating students' overall ability and the presence and absence of misconceptions. The expectation-maximization algorithm was developed
-
Issue Information Journal of Educational Measurement (IF 1.188) Pub Date : 2023-12-05
Editor CHUN WANG, University of Washington
-
A Highly Adaptive Testing Design for PISA Journal of Educational Measurement (IF 1.188) Pub Date : 2023-12-03 Andreas Frey, Christoph König, Aron Fink
The highly adaptive testing (HAT) design is introduced as an alternative test design for the Programme for International Student Assessment (PISA). The principle of HAT is to be as adaptive as possible when selecting items while accounting for PISA's nonstatistical constraints and addressing issues concerning PISA such as item position effects. HAT combines established methods from the field of computerized
-
Computation and Accuracy Evaluation of Comparable Scores on Culturally Responsive Assessments Journal of Educational Measurement (IF 1.188) Pub Date : 2023-11-16 Sandip Sinharay, Matthew S. Johnson
Culturally responsive assessments have been proposed as potential tools to ensure equity and fairness for examinees from all backgrounds, including those from traditionally underserved or minoritized groups. However, these assessments are relatively new and, with few exceptions, are yet to be implemented on a large scale. Consequently, there is a lack of guidance on how one can compute comparable scores
-
Incorporating Test-Taking Engagement into Multistage Adaptive Testing Design for Large-Scale Assessments Journal of Educational Measurement (IF 1.188) Pub Date : 2023-11-10 Okan Bulut, Guher Gorgun, Hacer Karamese
The use of multistage adaptive testing (MST) has gradually increased in large-scale testing programs as MST achieves a balanced compromise between linear test design and item-level adaptive testing. MST works on the premise that each examinee gives their best effort when attempting the items, and their responses truly reflect what they know or can do. However, research shows that large-scale assessments
-
Information Functions of Rank-2PL Models for Forced-Choice Questionnaires Journal of Educational Measurement (IF 1.188) Pub Date : 2023-10-29 Jianbin Fu, Xuan Tan, Patrick C. Kyllonen
This paper presents the item and test information functions of the Rank two-parameter logistic models (Rank-2PLM) for items with two (pair) and three (triplet) statements in forced-choice questionnaires. The Rank-2PLM model for pairs is the MUPP-2PLM (Multi-Unidimensional Pairwise Preference) and, for triplets, is the Triplet-2PLM. Fisher's information and directional information are described, and
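For readers unfamiliar with the model, the pairwise preference probability in the MUPP tradition can be sketched as follows (the notation here is assumed, not taken from the article): a statement $s$ measuring dimension $d(s)$ follows a 2PL, and the probability of preferring statement $s$ over statement $t$ is

$$
P(s \succ t \mid \boldsymbol{\theta}) =
\frac{P_s\,(1 - P_t)}{P_s\,(1 - P_t) + (1 - P_s)\,P_t},
\qquad
P_s = \frac{1}{1 + \exp\{-a_s(\theta_{d(s)} - b_s)\}},
$$

from which item and test information functions follow by differentiating the log-likelihood with respect to $\boldsymbol{\theta}$.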
-
Detecting Multidimensional DIF in Polytomous Items with IRT Methods and Estimation Approaches Journal of Educational Measurement (IF 1.188) Pub Date : 2023-10-15 Güler Yavuz Temel
The purpose of this study was to investigate multidimensional DIF with simple and nonsimple structures in the context of the multidimensional Graded Response Model (MGRM). This study examined and compared the performance of the IRT-LR and Wald tests using MML-EM and MHRM estimation approaches with different test factors and test structures in simulation studies and real-data applications. When the test
-
MSAEM Estimation for Confirmatory Multidimensional Four-Parameter Normal Ogive Models Journal of Educational Measurement (IF 1.188) Pub Date : 2023-10-09 Jia Liu, Xiangbin Meng, Gongjun Xu, Wei Gao, Ningzhong Shi
In this paper, we develop a mixed stochastic approximation expectation-maximization (MSAEM) algorithm coupled with a Gibbs sampler to compute the marginalized maximum a posteriori estimate (MMAPE) of a confirmatory multidimensional four-parameter normal ogive (M4PNO) model. The proposed MSAEM algorithm not only has the computational advantages of the stochastic approximation expectation-maximization
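For reference, a minimal sketch of the confirmatory multidimensional four-parameter normal ogive item response function, assuming a conventional parameterization:

$$
P(U_{ij} = 1 \mid \boldsymbol{\theta}_j) = c_i + (d_i - c_i)\,\Phi\!\big(\mathbf{a}_i^{\top}\boldsymbol{\theta}_j - b_i\big),
\qquad 0 \le c_i < d_i \le 1,
$$

where $c_i$ is the lower (guessing) asymptote, $d_i$ the upper (slipping) asymptote, $\mathbf{a}_i$ a loading vector with confirmatory zero constraints, and $\Phi$ the standard normal distribution function.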
-
Sociocognitive Processes and Item Response Models: A Didactic Example Journal of Educational Measurement (IF 1.188) Pub Date : 2023-09-15 Tao Gong, Lan Shuai, Robert J. Mislevy
The usual interpretation of the person and task variables in between-persons measurement models such as item response theory (IRT) is as attributes of persons and tasks, respectively. They can be viewed instead as ensemble descriptors of patterns of interactions among persons and situations that arise from sociocognitive complex adaptive systems (CASs). This view offers insights for interpreting and
-
Issue Information Journal of Educational Measurement (IF 1.188) Pub Date : 2023-09-04
Editor CHUN WANG, University of Washington
-
Using Response Time in Multidimensional Computerized Adaptive Testing Journal of Educational Measurement (IF 1.188) Pub Date : 2023-07-07 Yinhong He, Yuanyuan Qi
In multidimensional computerized adaptive testing (MCAT), item selection strategies are generally constructed based on responses, and they do not consider the response times required by items. This study constructed two new criteria (referred to as DT-inc and DT) for MCAT item selection by utilizing information from response times. The new designs maximize the amount of information per unit time. Furthermore
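The information-per-unit-time idea can be sketched, under assumed notation rather than the article's exact DT-inc/DT definitions, as choosing the next item $j$ from the remaining pool $R$ to maximize a scalarized information matrix divided by the item's expected response time at the current speed estimate:

$$
j^{*} = \arg\max_{j \in R}\;
\frac{\det\!\big[\mathbf{I}_j(\hat{\boldsymbol{\theta}})\big]}
{E\big[T_j \mid \hat{\tau}\big]},
$$

so that items delivering much information while demanding little time are favored.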
-
Gender Bias in Test Item Formats: Evidence from PISA 2009, 2012, and 2015 Math and Reading Tests Journal of Educational Measurement (IF 1.188) Pub Date : 2023-06-09 Benjamin R. Shear
Large-scale standardized tests are regularly used to measure student achievement overall and for student subgroups. These uses assume tests provide comparable measures of outcomes across student subgroups, but prior research suggests score comparisons across gender groups may be complicated by the type of test items used. This paper presents evidence that among nationally representative samples of
-
Issue Information Journal of Educational Measurement (IF 1.188) Pub Date : 2023-06-06
Editor CHUN WANG, University of Washington
-
Detecting Differential Item Functioning in CAT Using IRT Residual DIF Approach Journal of Educational Measurement (IF 1.188) Pub Date : 2023-04-28 Hwanggyu Lim, Edison M. Choe
The residual differential item functioning (RDIF) detection framework was developed recently under a linear testing context. To explore the potential application of this framework to computerized adaptive testing (CAT), the present study investigated the utility of the RDIFR statistic both as an index for detecting uniform DIF of pretest items in CAT and as a direct measure of the effect size of uniform
-
Controlling the Speededness of Assembled Test Forms: A Generalization to the Three-Parameter Lognormal Response Time Model Journal of Educational Measurement (IF 1.188) Pub Date : 2023-04-27 Benjamin Becker, Sebastian Weirich, Frank Goldhammer, Dries Debeer
When designing or modifying a test, an important challenge is controlling its speededness. To achieve this, van der Linden (2011a, 2011b) proposed using a lognormal response time model, more specifically the two-parameter lognormal model, and automated test assembly (ATA) via mixed integer linear programming. However, this approach has a severe limitation, in that the two-parameter lognormal model
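For reference, van der Linden's two-parameter lognormal response time model, which the article generalizes, specifies for person $j$ and item $i$

$$
\ln T_{ij} \sim N\!\big(\beta_i - \tau_j,\; \alpha_i^{-2}\big),
$$

with $\tau_j$ the person's speed, $\beta_i$ the item's time intensity, and $\alpha_i$ a time-discrimination parameter. A three-parameter extension (sketched here; see the article for the exact formulation) additionally weights speed by an item-specific slope, e.g., $\ln T_{ij} \sim N(\beta_i - \varphi_i \tau_j,\; \alpha_i^{-2})$.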
-
A Note on Latent Traits Estimates under IRT Models with Missingness Journal of Educational Measurement (IF 1.188) Pub Date : 2023-04-26 Jinxin Guo, Xin Xu, Tao Xin
Missingness due to not-reached items and omitted items has received much attention in the recent psychometric literature. Such missingness, if not handled properly, leads to biased parameter estimation and inaccurate inferences about examinees, and further erodes the validity of the test. This paper reviews some commonly used IRT-based models allowing missingness, followed by three popular
-
Online Monitoring of Test-Taking Behavior Based on Item Responses and Response Times Journal of Educational Measurement (IF 1.188) Pub Date : 2023-04-17 Suhwa Han, Hyeon-Ah Kang
The study presents multivariate sequential monitoring procedures for examining test-taking behaviors online. The procedures monitor examinees' responses and response times and signal aberrancy as soon as a significant change is detected in the test-taking behavior. The study in particular proposes three schemes to track different indicators of a test-taking mode—the observable manifest variables
-
Issue Information Journal of Educational Measurement (IF 1.188) Pub Date : 2023-03-16
Editor CHUN WANG, University of Washington
-
Pretest Item Calibration in Computerized Multistage Adaptive Testing Journal of Educational Measurement (IF 1.188) Pub Date : 2023-03-10 Rabia Karatoprak Ersen, Won-Chan Lee
The purpose of this study was to compare calibration and linking methods for placing pretest item parameter estimates on the item pool scale in a 1-3 computerized multistage adaptive testing design in terms of item parameter recovery. Two models were used: embedded-section, in which pretest items were administered within a separate module, and embedded-items, in which pretest items were distributed
-
Classical Item Analysis from a Signal Detection Perspective Journal of Educational Measurement (IF 1.188) Pub Date : 2023-02-27 Lawrence T. DeCarlo
A conceptualization of multiple-choice exams in terms of signal detection theory (SDT) leads to simple measures of item difficulty and item discrimination that are closely related to, but also distinct from, those used in classical item analysis (CIA). The theory defines a “true split,” depending on whether or not examinees know an item, and so it provides a basis for using total scores to split item
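As a rough sketch of the signal detection view (generic equal-variance SDT indices, not necessarily the exact statistics derived in the article): if $H_i$ is the proportion correct on item $i$ among examinees in the upper total-score split and $F_i$ the proportion correct in the lower split, then

$$
d_i' = \Phi^{-1}(H_i) - \Phi^{-1}(F_i),
\qquad
c_i = -\tfrac{1}{2}\big[\Phi^{-1}(H_i) + \Phi^{-1}(F_i)\big],
$$

where $d_i'$ plays the role of item discrimination and the criterion $c_i$ relates to item difficulty.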
-
Corrigendum: A Residual-Based Differential Item Functioning Detection Framework in Item Response Theory Journal of Educational Measurement (IF 1.188) Pub Date : 2023-02-26 Hwanggyu Lim, Edison M. Choe, Kyung T. Han
In the original article, it was written that “Then the MLE scoring and DIF analysis with RDIF statistics were performed using the est_score and rdif functions, respectively, in the R (R Core Team, 2019) package irtplay (p.90).” However, the irtplay package has been removed from the CRAN repository due to intellectual property (IP) violation issues. Instead, a new R package called irtQ (Lim & Wells
-
Using Simulated Retests to Estimate the Reliability of Diagnostic Assessment Systems Journal of Educational Measurement (IF 1.188) Pub Date : 2023-02-19 W. Jake Thompson, Brooke Nash, Amy K. Clark, Jeffrey C. Hoover
As diagnostic classification models become more widely used in large-scale operational assessments, we must give consideration to the methods for estimating and reporting reliability. Researchers must explore alternatives to traditional reliability methods that are consistent with the design, scoring, and reporting levels of diagnostic assessment systems. In this article, we describe and evaluate a
-
Using Linkage Sets to Improve Connectedness in Rater Response Model Estimation Journal of Educational Measurement (IF 1.188) Pub Date : 2023-02-19 Jodi M. Casabianca, John R. Donoghue, Hyo Jeong Shin, Szu-Fu Chao, Ikkyu Choi
Using item-response theory to model rater effects provides an alternative solution for rater monitoring and diagnosis, compared to using standard performance metrics. To fit such models, the ratings data must be sufficiently connected in order to estimate rater effects. Due to popular rating designs used in large-scale testing scenarios, there tends to be a large proportion of missing data
-
An Exploration of an Improved Aggregate Student Growth Measure Using Data from Two States Journal of Educational Measurement (IF 1.188) Pub Date : 2023-01-31 Katherine E. Castellano, Daniel F. McCaffrey, J. R. Lockwood
The simple average of student growth scores is often used in accountability systems, but it can be problematic for decision making. When computed using a small/moderate number of students, it can be sensitive to the sample, resulting in inaccurate representations of growth of the students, low year-to-year stability, and inequities for low-incidence groups. An alternative designed to address these
-
Classification Accuracy and Consistency of Compensatory Composite Test Scores Journal of Educational Measurement (IF 1.188) Pub Date : 2023-01-28 J. Carl Setzer, Ying Cheng, Cheng Liu
Test scores are often used to make decisions about examinees, such as in licensure and certification testing, as well as in many educational contexts. In some cases, these decisions are based upon compensatory scores, such as those from multiple sections or components of an exam. Classification accuracy and classification consistency are two psychometric characteristics of test scores that are often
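A minimal sketch of how classification accuracy and consistency can be estimated for a composite score with cut points, assuming a Rudner-style normal approximation to each examinee's score distribution (an illustrative choice, not necessarily the article's estimator):

import numpy as np
from scipy.stats import norm

def classification_indices(score, se, cuts):
    """Rudner-style accuracy/consistency from score estimates and SEs."""
    bounds = np.concatenate(([-np.inf], cuts, [np.inf]))
    # probability that each examinee's true score falls in each category
    z_hi = (bounds[None, 1:] - score[:, None]) / se[:, None]
    z_lo = (bounds[None, :-1] - score[:, None]) / se[:, None]
    probs = norm.cdf(z_hi) - norm.cdf(z_lo)
    assigned = np.digitize(score, cuts)              # observed category
    accuracy = probs[np.arange(len(score)), assigned].mean()
    consistency = (probs ** 2).sum(axis=1).mean()    # P(same category on a retest)
    return accuracy, consistency

acc, con = classification_indices(np.array([-0.4, 0.1, 1.2]),
                                  np.array([0.30, 0.25, 0.28]),
                                  cuts=np.array([0.0, 1.0]))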
-
Issue Information Journal of Educational Measurement (IF 1.188) Pub Date : 2023-01-06
Editor SANDIP SINHARAY, Educational Testing Service
-
Specifying the Three Ws in Educational Measurement: Who Uses Which Scores for What Purpose? Journal of Educational Measurement (IF 1.188) Pub Date : 2022-12-25 Andrew Ho
I argue that understanding and improving educational measurement requires specificity about actors, scores, and purpose: Who uses which scores for what purpose? I show how this specificity complements Briggs’ frameworks for educational measurement that he presented in his 2022 address as president of the National Council on Measurement in Education.
-
Online Calibration in Multidimensional Computerized Adaptive Testing with Polytomously Scored Items Journal of Educational Measurement (IF 1.188) Pub Date : 2022-12-15 Lu Yuan, Yingshi Huang, Shuhang Li, Ping Chen
Online calibration is a key technology for item calibration in computerized adaptive testing (CAT) and has been widely used in various forms of CAT, including unidimensional CAT, multidimensional CAT (MCAT), CAT with polytomously scored items, and cognitive diagnostic CAT. However, as multidimensional and polytomous assessment data become more common, only a few published reports focus on online calibration
-
Measuring the Uncertainty of Imputed Scores Journal of Educational Measurement (IF 1.188) Pub Date : 2022-12-14 Sandip Sinharay
Technical difficulties and other unforeseen events occasionally lead to incomplete data on educational tests, which necessitates the reporting of imputed scores to some examinees. While there exist several approaches for reporting imputed scores, there is a lack of any guidance on the reporting of the uncertainty of imputed scores. In this paper, several approaches are suggested for quantifying the
-
An Exponentially Weighted Moving Average Procedure for Detecting Back Random Responding Behavior Journal of Educational Measurement (IF 1.188) Pub Date : 2022-12-09 Yinhong He
Back random responding (BRR) behavior is one of the commonly observed careless response behaviors. Accurately detecting BRR behavior can improve test validity. Yu and Cheng (2019) showed that the change point analysis (CPA) procedure based on weighted residuals (CPA-WR) performed well in detecting BRR. Compared with the CPA procedure, the exponentially weighted moving average (EWMA) obtains more detailed
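A minimal sketch of an EWMA chart applied to a sequence of per-item person-fit residuals (illustrative only; the residual definition, weighting constant, and control limit below are assumptions rather than the article's exact procedure):

import numpy as np

def ewma_flags(residuals, lam=0.2, L=2.7, mu0=0.0, sigma0=1.0):
    """Return EWMA statistics and a per-position flag for a change in responding."""
    w, stats, flags = mu0, [], []
    for t, r in enumerate(residuals, start=1):
        w = lam * r + (1 - lam) * w                              # EWMA recursion
        var_t = sigma0 ** 2 * lam / (2 - lam) * (1 - (1 - lam) ** (2 * t))
        flags.append(abs(w - mu0) > L * np.sqrt(var_t))          # control limit
        stats.append(w)
    return np.array(stats), np.array(flags)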
-
Multiple-Group Joint Modeling of Item Responses, Response Times, and Action Counts with the Conway-Maxwell-Poisson Distribution Journal of Educational Measurement (IF 1.188) Pub Date : 2022-12-07 Xin Qiao, Hong Jiao, Qiwei He
Multiple-group modeling is one method for addressing measurement noninvariance. Traditional studies on multiple-group modeling have mainly focused on item responses. In computer-based assessments, joint modeling of response times and action counts with item responses helps estimate the latent speed and action levels in addition to latent ability. These two new data sources can also be
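For reference, the Conway-Maxwell-Poisson distribution used for the action counts has probability mass function

$$
P(X = x \mid \lambda, \nu) = \frac{\lambda^{x}}{(x!)^{\nu}\, Z(\lambda, \nu)},
\qquad
Z(\lambda, \nu) = \sum_{j=0}^{\infty} \frac{\lambda^{j}}{(j!)^{\nu}},
$$

where $\nu = 1$ recovers the Poisson, $\nu < 1$ allows overdispersion, and $\nu > 1$ allows underdispersion in the counts.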
-
A Unified Comparison of IRT-Based Effect Sizes for DIF Investigations Journal of Educational Measurement (IF 1.188) Pub Date : 2022-11-07 R. Philip Chalmers
Several marginal effect size (ES) statistics suitable for quantifying the magnitude of differential item functioning (DIF) have been proposed in the area of item response theory; for instance, the Differential Functioning of Items and Tests (DFIT) statistics, signed and unsigned item difference in the sample statistics (SIDS, UIDS, NSIDS, and NUIDS), the standardized indices of impact, and the differential
-
A Statistical Test for the Detection of Item Compromise Combining Responses and Response Times Journal of Educational Measurement (IF 1.188) Pub Date : 2022-10-28 Wim J. van der Linden, Dmitry I. Belov
A test of item compromise is presented which combines the test takers' responses and response times (RTs) into a statistic defined as the number of correct responses on the item for test takers with RTs flagged as suspicious. The test has null and alternative distributions belonging to the well-known family of compound binomial distributions, is simple to calculate, and has results that are easy to
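The construction can be sketched as follows (notation assumed): for item $i$, let $\mathcal{F}_i$ be the set of test takers whose response times on the item were flagged as suspicious; the statistic is

$$
S_i = \sum_{j \in \mathcal{F}_i} U_{ij},
$$

the number of correct responses among the flagged test takers. Under the null hypothesis of no compromise, $S_i$ follows a compound (Poisson-)binomial distribution whose success probabilities are, plausibly, the model-implied probabilities $P_i(\hat{\theta}_j)$ for the flagged test takers.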
-
Fully Gibbs Sampling Algorithms for Bayesian Variable Selection in Latent Regression Models Journal of Educational Measurement (IF 1.188) Pub Date : 2022-10-25 Kazuhiro Yamaguchi, Jihong Zhang
This study proposed Gibbs sampling algorithms for variable selection in a latent regression model under a unidimensional two-parameter logistic item response theory model. Three types of shrinkage priors were employed to obtain shrinkage estimates: double-exponential (i.e., Laplace), horseshoe, and horseshoe+ priors. These shrinkage priors were compared to a uniform prior case in both simulation and
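For reference, the three shrinkage priors on a latent regression coefficient $\beta_k$ take the following standard forms (hyperparameter settings are the study's to specify):

$$
\text{Laplace: } p(\beta_k \mid \lambda) = \tfrac{\lambda}{2} e^{-\lambda |\beta_k|};
\qquad
\text{horseshoe: } \beta_k \mid \lambda_k, \tau \sim N(0, \lambda_k^{2}\tau^{2}),\;
\lambda_k \sim C^{+}(0, 1),\; \tau \sim C^{+}(0, 1);
$$

the horseshoe+ adds one more layer, $\lambda_k \mid \eta_k \sim C^{+}(0, \eta_k)$ with $\eta_k \sim C^{+}(0, 1)$, yielding even heavier shrinkage of noise coefficients while leaving large effects nearly untouched.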
-
A Factor Mixture Model for Item Responses and Certainty of Response Indices to Identify Student Knowledge Profiles Journal of Educational Measurement (IF 1.188) Pub Date : 2022-10-10 Chia-Wen Chen, Björn Andersson, Jinxin Zhu
The certainty of response index (CRI) measures respondents' confidence level when answering an item. Previous studies have used descriptive statistics and arbitrary thresholds on the CRIs, in conjunction with the answers to the items, to identify student knowledge profiles. However, this approach overlooks the measurement error of the observed item responses and indices; we address this by proposing
-
Issue Information Journal of Educational Measurement (IF 1.188) Pub Date : 2022-10-03
Editor SANDIP SINHARAY, Educational Testing Service
-
Using Item Scores and Distractors in Person-Fit Assessment Journal of Educational Measurement (IF 1.188) Pub Date : 2022-09-16 Kylie Gorney, James A. Wollack
In order to detect a wide range of aberrant behaviors, it can be useful to incorporate information beyond the dichotomous item scores. In this paper, we extend the $l_z$ and $l_z^*$ person-fit statistics so that unusual behavior in item scores and unusual behavior in item distractors can be used as indicators of aberrance. Through detailed simulations, we show that the new statistics are
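For reference, the dichotomous $l_z$ statistic that the new indices extend is the standardized log-likelihood of a response pattern,

$$
l_0 = \sum_{i=1}^{n} \big[u_i \ln P_i(\theta) + (1-u_i)\ln\{1 - P_i(\theta)\}\big],
\qquad
l_z = \frac{l_0 - E(l_0)}{\sqrt{\operatorname{Var}(l_0)}},
$$

with $E(l_0)=\sum_i [P_i \ln P_i + (1-P_i)\ln(1-P_i)]$ and $\operatorname{Var}(l_0)=\sum_i P_i(1-P_i)[\ln\{P_i/(1-P_i)\}]^2$; $l_z^*$ is Snijders's correction for the case where $\theta$ is replaced by the estimate $\hat{\theta}$.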
-
A New Bayesian Person-Fit Analysis Method Using Pivotal Discrepancy Measures Journal of Educational Measurement (IF 1.188) Pub Date : 2022-09-02 Adam Combs
A common method of checking person fit in Bayesian item response theory (IRT) is the posterior-predictive (PP) method. In recent years, more powerful approaches based on resampling methods using the popular $L_z^*$ statistic have been proposed. A new Bayesian model checking method based on pivotal discrepancy measures (PDMs) has also been proposed. A PDM T is a discrepancy measure
-
Several Variations of Simple-Structure MIRT Equating Journal of Educational Measurement (IF 1.188) Pub Date : 2022-07-28 Stella Y. Kim, Won-Chan Lee
The current study proposed several variants of simple-structure multidimensional item response theory equating procedures. Four distinct sets of data were used to demonstrate the feasibility of the proposed equating methods for two different equating designs: a random groups design and a common-item nonequivalent groups design. Findings indicated some notable differences between the multidimensional and unidimensional
-
Validity Arguments Meet Artificial Intelligence in Innovative Educational Assessment Journal of Educational Measurement (IF 1.188) Pub Date : 2022-07-08 David W. Dorsey, Hillary R. Michaels
Advances in technology have dramatically expanded our ability to create rich, complex, and effective assessments across a range of uses. Artificial intelligence (AI)-enabled assessments represent one such area of advancement—one that has captured our collective interest and imagination. Scientists and practitioners within the domains of organizational and workforce assessment have increasingly
-
A Deterministic Gated Lognormal Response Time Model to Identify Examinees with Item Preknowledge Journal of Educational Measurement (IF 1.188) Pub Date : 2022-07-07 Murat Kasli, Cengiz Zopluoglu, Sarah L. Toton
Response times (RTs) have recently attracted a significant amount of attention in the literature as they may provide meaningful information about item preknowledge. In this study, a new model, the Deterministic Gated Lognormal Response Time (DG-LNRT) model, is proposed to identify examinees with item preknowledge using RTs. The proposed model was applied to two different data sets and performance was
-
Cognitive Diagnostic Multistage Testing by Partitioning Hierarchically Structured Attributes Journal of Educational Measurement (IF 1.188) Pub Date : 2022-07-05 Rae Yeong Kim, Yun Joo Yoo
In cognitive diagnostic models (CDMs), a set of fine-grained attributes is required to characterize complex problem solving and provide detailed diagnostic information about an examinee. However, it is challenging to ensure reliable estimation and control computational complexity when the test aims to identify the examinee's attribute profile in a large-scale map of attributes. To address this problem
-
Issue Information Journal of Educational Measurement (IF 1.188) Pub Date : 2022-06-22
Editor SANDIP SINHARAY, Educational Testing Service
-
Estimating Classification Accuracy and Consistency Indices for Multiple Measures with the Simple Structure MIRT Model Journal of Educational Measurement (IF 1.188) Pub Date : 2022-06-20 Seohee Park, Kyung Yong Kim, Won-Chan Lee
Multiple measures, such as multiple content domains or multiple types of performance, are used in various testing programs to classify examinees for screening or selection. Despite the widespread use of multiple measures, there is little research on the classification consistency and accuracy of multiple measures. Accordingly, this study introduces an approach to estimate classification consistency and
-
Optimizing Implementation of Artificial-Intelligence-Based Automated Scoring: An Evidence Centered Design Approach for Designing Assessments for AI-based Scoring Journal of Educational Measurement (IF 1.188) Pub Date : 2022-06-12 Kadriye Ercikan, Daniel F. McCaffrey
Artificial-intelligence-based automated scoring is often an afterthought, considered only after assessments have been developed, which limits the possibilities for implementing automated scoring solutions. In this article, we provide a review of artificial intelligence (AI)-based methodologies for scoring in educational assessments. We then propose an evidence-centered design framework for developing
-
Latent Space Model for Process Data Journal of Educational Measurement (IF 1.188) Pub Date : 2022-06-12 Yi Chen, Jingru Zhang, Yi Yang, Young-Sun Lee
The development of human-computer interactive items in educational assessments provides opportunities to extract useful process information for problem-solving. However, the complex, intensive, and noisy nature of process data makes it challenging to model with the traditional psychometric methods. Social network methods have been applied to visualize and analyze process data. Nonetheless, research
-
Validity Arguments Meet Artificial Intelligence in Innovative Educational Assessment: A Discussion and Look Forward Journal of Educational Measurement (IF 1.188) Pub Date : 2022-06-09 David W. Dorsey, Hillary R. Michaels
In this concluding article of the special issue, we provide an overall discussion and point to future emerging trends in AI that might shape our approach to validity and building validity arguments.
-
Validity Arguments for AI-Based Automated Scores: Essay Scoring as an Illustration Journal of Educational Measurement (IF 1.188) Pub Date : 2022-06-08 Steve Ferrara, Saed Qunbar
In this article, we argue that automated scoring engines should be transparent and construct relevant—that is, as much as is currently feasible. Many current automated scoring engines cannot achieve high degrees of scoring accuracy without allowing in some features that may not be easily explained and understood and may not be obviously and directly relevant to the target assessment construct. We address
-
Toward Argument-Based Fairness with an Application to AI-Enhanced Educational Assessments Journal of Educational Measurement (IF 1.188) Pub Date : 2022-06-01 A. Corinne Huggins-Manley, Brandon M. Booth, Sidney K. D'Mello
The field of educational measurement places validity and fairness as central concepts of assessment quality. Prior research has proposed embedding fairness arguments within argument-based validity processes, particularly when fairness is conceived as comparability in assessment properties across groups. However, we argue that a more flexible approach to fairness arguments that occurs outside of and
-
Psychometric Methods to Evaluate Measurement and Algorithmic Bias in Automated Scoring Journal of Educational Measurement (IF 1.188) Pub Date : 2022-06-01 Matthew S. Johnson, Xiang Liu, Daniel F. McCaffrey
With the increasing use of automated scores in operational testing settings comes the need to understand the ways in which they can yield biased and unfair results. In this paper, we provide a brief survey of some of the ways in which the predictive methods used in automated scoring can lead to biased, and thus unfair, automated scores. After providing definitions of fairness from machine learning and
-
Linking and Comparability across Conditions of Measurement: Established Frameworks and Proposed Updates Journal of Educational Measurement (IF 1.188) Pub Date : 2022-05-30 Tim Moses
One result of recent changes in testing is that previously established linking frameworks may not adequately address challenges in current linking situations. Test linking through equating, concordance, vertical scaling or battery scaling may not represent linkings for the scores of tests developed to measure constructs differently for different examinees, or tests that are administered in different
-
Anchoring Validity Evidence for Automated Essay Scoring Journal of Educational Measurement (IF 1.188) Pub Date : 2022-05-15 Mark D. Shermis
One of the challenges of discussing validity arguments for machine scoring of essays centers on the absence of a commonly held definition and theory of good writing. At best, the algorithms attempt to measure select attributes of writing and calibrate them against human ratings with the goal of accurate prediction of scores for new essays. Sometimes these attributes are based on the fundamentals of