Catalog Advanced Search


75 Results

  • Contains 1 Component(s)

    In this print module, Dr. Stefanie Wind provides an introduction to Mokken scale analysis (MSA) as a probabilistic nonparametric item response theory (IRT) framework in which to explore measurement quality with an emphasis on its application in the context of educational assessment. Keywords: item response theory, IRT, Mokken scaling, nonparametric item response theory, model fit, monotone homogeneity model, double monotonicity model, scaling coefficients

    Mokken scale analysis (MSA) is a probabilistic-nonparametric approach to item response theory (IRT) that can be used to evaluate fundamental measurement properties with less strict assumptions than parametric IRT models. This instructional module provides an introduction to MSA as a probabilistic-nonparametric framework in which to explore measurement quality, with an emphasis on its application in the context of educational assessment. The module describes both dichotomous and polytomous formulations of the MSA model. Examples of the application of MSA to educational assessment are provided using data from a multiple-choice physical science assessment and a rater-mediated writing assessment. (A brief illustrative R sketch, not part of the module, follows this listing.)

    Keywords: item response theory, IRT, Mokken scaling, nonparametric item response theory, model fit, monotone homogeneity model, double monotonicity model, scaling coefficients

    Stefanie A. Wind

    Assistant Professor, Department of Educational Research, University of Alabama, Tuscaloosa, AL

    Dr. Wind conducts methodological and applied research on educational assessments with an emphasis on issues related to raters, rating scales, Rasch models, nonparametric IRT, and parametric IRT. 

    Contact Stefanie via stefanie.wind@ua.edu
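
    The following minimal sketch, which is not part of the module, shows how Loevinger's H scalability coefficients and the monotonicity assumption of the monotone homogeneity model can be examined with the mokken R package on simulated dichotomous responses. The sample size, item difficulties, and object names are illustrative assumptions.

      # Minimal sketch: scalability coefficients and a monotonicity check with the
      # 'mokken' R package on simulated dichotomous responses (toy data, not the
      # module's examples).
      # install.packages("mokken")   # if needed
      library(mokken)

      set.seed(1)
      n_persons  <- 500
      n_items    <- 6
      theta      <- rnorm(n_persons)                      # latent trait
      difficulty <- seq(-1.5, 1.5, length.out = n_items)  # item locations

      # Simulate responses from a simple logistic IRT model
      prob <- plogis(outer(theta, difficulty, "-"))
      resp <- matrix(rbinom(n_persons * n_items, 1, prob),
                     nrow = n_persons, ncol = n_items)
      colnames(resp) <- paste0("item", 1:n_items)

      # Loevinger's H scalability coefficients (item pairs, items, total scale)
      coefH(as.data.frame(resp))

      # Check the monotonicity assumption of the monotone homogeneity model
      summary(check.monotonicity(as.data.frame(resp)))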

  • Contains 1 Component(s)

    In this print module, Dr. Avi Allalouf, Dr. Tony Gutentag, and Dr. Michal Baumer discuss errors that might occur at the different stages of the continuous mode tests (CMT) process as well as the recommended quality-control (QC) procedure to reduce the incidence of each error. Keywords: automated review, computer-based testing, CBT, continuous mode tests, CMT, human review, quality control, QC, scoring, test administration, test analysis, test scoring

    Quality control (QC) in testing is paramount. QC procedures for tests can be divided into two types. The first type, which has been well researched, is QC for tests administered to large population groups on few administration dates using a small set of test forms (e.g., large-scale assessment). The second type is QC for tests, usually computerized, that are administered to small population groups on many administration dates using a wide array of test forms (continuous mode tests, or CMT). Since the world of testing is headed in this direction, developing QC for CMT is crucial. In the current ITEMS module we discuss errors that might occur at the different stages of the CMT process, as well as the recommended QC procedure to reduce the incidence of each error. An illustration from a recent study is provided, and a computerized system that applies these procedures is presented. Instructions on how to develop one's own QC procedure are also included. (A toy R sketch of one automated QC check, not the authors' system, follows this listing.)

    Keywords: automated review, computer-based testing, CBT, continuous mode tests, CMT, human review, quality control, QC, scoring, test administration, test analysis, test scoring

    Avi Allalouf

    Director of Scoring and Equating, National Institute for Testing and Evaluation, Jerusalem, Israel

    Dr. Avi Allalouf is the director of Scoring and Equating at the National Institute for Testing and Evaluation (NITE). He received his PhD in Psychology from the Hebrew University of Jerusalem (1995) and teaches at the Academic College of Tel Aviv-Yaffo. His primary areas of research are test adaptation, DIF, test scoring, quality control, and testing and society. Dr. Allalouf leads the Exhibition on Testing and Measurement project and has served as co-editor of the International Journal of Testing (IJT).

    Tony Gutentag

    PhD student, Department of Psychology, The Hebrew University, Jerusalem, Israel

    Michal Baumer

    Computerized Test Unit, National Institute for Testing and Evaluation, Jerusalem, Israel
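
    As a toy illustration only, and not the authors' QC system, the base-R sketch below flags administration days in a continuous-mode test whose daily mean score drifts from an assumed historical baseline. The baseline values, flagging threshold, and planted anomaly are all hypothetical.

      # Toy illustration of one automated QC check for continuous mode tests:
      # flag days whose mean score departs sharply from a historical baseline,
      # routing them to human review.
      set.seed(2)
      days       <- 60
      daily_n    <- sample(20:40, days, replace = TRUE)              # examinees per day
      daily_mean <- rnorm(days, mean = 500, sd = 10 / sqrt(daily_n))
      daily_mean[45] <- 530                                          # planted anomaly

      baseline_mean <- 500   # assumed historical values (hypothetical)
      baseline_sd   <- 10

      # Standardize each daily mean against the baseline and flag large deviations
      z_score <- (daily_mean - baseline_mean) / (baseline_sd / sqrt(daily_n))
      flagged <- which(abs(z_score) > 3)

      data.frame(day  = flagged,
                 mean = round(daily_mean[flagged], 1),
                 z    = round(z_score[flagged], 2))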

  • Contains 4 Component(s)

    In this print module, Dr. Sandip Sinharay provides a review of data mining techniques for classification and regression, which is accessible to a wide audience in educational measurement. Keywords: bagging, boosting, classification and regression tree, CART, cross-validation error, data mining, predicted values, random forests, supervised learning, test error, TIMSS

    Data mining methods for classification and regression are becoming increasingly popular in various scientific fields. However, these methods have not been explored much in educational measurement. This module first provides a review, which should be accessible to a wide audience in educational measurement, of some of these methods. The module then demonstrates, using three real-data examples, that these methods may lead to an improvement over traditionally used methods such as linear and logistic regression in educational measurement. (An illustrative R comparison, not the module's examples, follows this listing.)

    Keywords: bagging, boosting, classification and regression tree, CART, cross-validation error, data mining, predicted values, random forests, supervised learning, test error, TIMSS

    Sandip Sinharay

    Principal Research Scientist, Educational Testing Service

    Sandip Sinharay is a principal research scientist in the Research and Development division at ETS. He received his Ph.D. degree in statistics from Iowa State University in 2001. He was editor of the Journal of Educational and Behavioral Statistics between 2011 and 2014. Sandip Sinharay has received four awards from the National Council on Measurement in Education: the award for Technical or Scientific Contributions to the Field of Educational Measurement (in 2009 and 2015), the Jason Millman Promising Measurement Scholar Award (2006), and the Alicia Cascallar Award for an Outstanding Paper by an Early Career Scholar (2005). He received the ETS Scientist Award in 2008 and the ETS Presidential Award twice. He has coedited two published volumes and authored or coauthored more than 75 articles in peer-reviewed statistics and psychometrics journals and edited books.
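
    The sketch below, which is not one of the module's TIMSS examples, contrasts logistic regression with a random forest (via the randomForest R package) on simulated data containing a nonlinear signal. The data-generating model and the train/test split are illustrative assumptions.

      # Compare test-set error of logistic regression and a random forest on
      # simulated data (toy example, not the module's analyses).
      # install.packages("randomForest")   # if needed
      library(randomForest)

      set.seed(3)
      n  <- 1000
      x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n)
      p  <- plogis(1.5 * x1 * x2 - x3^2 + 0.5)   # nonlinear true model
      y  <- factor(rbinom(n, 1, p))
      dat <- data.frame(y, x1, x2, x3)

      train <- sample(n, 700)                    # simple train/test split
      test  <- setdiff(seq_len(n), train)

      glm_fit  <- glm(y ~ x1 + x2 + x3, data = dat[train, ], family = binomial)
      glm_pred <- ifelse(predict(glm_fit, dat[test, ], type = "response") > 0.5, "1", "0")

      rf_fit  <- randomForest(y ~ x1 + x2 + x3, data = dat[train, ])
      rf_pred <- predict(rf_fit, dat[test, ])

      # Test error rates; the forest typically captures the nonlinearity better here
      mean(glm_pred != dat$y[test])
      mean(rf_pred  != dat$y[test])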

  • Contains 8 Component(s)

    In this print module, Dr. Richard A. Feinberg and Dr. Jonathan D. Rubright provide a comprehensive introduction to the topic of simulation studies in psychometrics using R that can be easily understood by measurement specialists at all levels of training and experience. Keywords: bias, experimental design, mean absolute difference, MAD, mean squared error, MSE, root mean squared error, RMSE, psychometrics, R, research design, simulation study, standard error

    Simulation studies are fundamental to psychometric discourse and play a crucial role in operational and academic research. Yet, resources for psychometricians interested in conducting simulations are scarce. This Instructional Topics in Educational Measurement Series (ITEMS) module is meant to address this deficiency by providing a comprehensive introduction to the topic of simulation that can be easily understood by measurement specialists at all levels of training and experience. Specifically, this module describes the vocabulary used in simulations, reviews their applications in recent literature, and recommends specific guidelines for designing simulation studies and presenting results. Additionally, an example (including computer code in R) is given to demonstrate how common aspects of simulation studies can be implemented in practice and to provide a template to help users build their own simulation. (A minimal recovery-simulation sketch in R follows this listing.)

    Keywords: bias, experimental design, mean absolute difference, MAD, mean squared error, MSE, root mean squared error, RMSE, psychometrics, R, research design, simulation study, standard error

    Richard A. Feinberg

    National Board of Medical Examiners, Philadelphia, PA

    Richard Feinberg is a Senior Psychometrician with NBME, where he leads and oversees the data analysis and score reporting activities for large-scale high-stakes licensure and credentialing examinations. He is also an Assistant Professor at the Philadelphia College of Osteopathic Medicine, Philadelphia, PA, where he teaches a course on Research Methods and Statistics.

    His research interests include psychometric applications in the fields of educational and psychological testing.

    He earned a PhD in Research Methodology and Evaluation from the University of Delaware, Newark, DE.

    Jonathan D. Rubright

    National Board of Medical Examiners, Philadelphia, PA
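
    A minimal recovery-simulation sketch in base R (not the module's example) illustrates the evaluation criteria named above: generate data under a known true value, estimate it in each replication, and summarize recovery with bias, MAD, MSE, and RMSE. The true value, number of replications, and sample size are arbitrary choices.

      # Recovery simulation: how far do estimates fall from a known true value?
      set.seed(4)
      n_reps    <- 1000
      n_persons <- 250
      true_p    <- 0.70                             # true proportion-correct

      estimates <- replicate(n_reps, {
        responses <- rbinom(n_persons, 1, true_p)   # simulate one data set
        mean(responses)                             # estimate of interest
      })

      error <- estimates - true_p
      c(bias = mean(error),
        MAD  = mean(abs(error)),
        MSE  = mean(error^2),
        RMSE = sqrt(mean(error^2)))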

  • Contains 1 Component(s)

    In this print module, Dr. Sun-Joo Cho, Dr. Youngsuk Suh, and Dr. Woo-yeol Lee provide an introduction to differential item functioning (DIF) analysis using mixture item response theory (IRT) models, which involves comparing item profiles across latent, instead of manifest, groups. Keywords: differential item functioning, DIF, estimation, latent class, latent DIF, item response model, IRT, mixture model, model fit, model selection

    The purpose of this ITEMS module is to provide an introduction to differential item functioning (DIF) analysis using mixture item response models. The mixture item response models for DIF analysis involve comparing item profiles across latent groups, instead of manifest groups. First, an overview of DIF analysis based on latent groups, called latent DIF analysis, is provided and its applications in the literature are surveyed. Then, the methodological issues pertaining to latent DIF analysis are described, including mixture item response models, parameter estimation, and latent DIF detection methods. Finally, recommended steps for latent DIF analysis are illustrated using empirical data. (A conceptual toy example in R, not a fitted mixture model, follows this listing.)

    Keywords: differential item functioning, DIF, estimation, latent class, latent DIF, item response model, IRT, mixture model, model fit, model selection

    Sun-Joo Cho

    Associate Professor, Department of Psychology and Human Development, Vanderbilt University, Nashville, TN

    Dr. Cho has collaborated with researchers from a variety of disciplines including reading education, math education, special education, psycholinguistics, clinical psychology, cognitive psychology, neuropsychology, and audiology. She serves on the editorial boards of the Journal of Educational Psychology, Behavior Research Methods, and the International Journal of Testing.

    Youngsuk Suh

    Department of Educational Psychology, Rutgers, The State University of New Jersey, New Brunswick, NJ

    Woo-yeol Lee

    Graduate Student, Department of Psychology and Human Development, Vanderbilt University, Nashville, TN
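
    The base-R toy below does not fit an actual mixture IRT model; it simulates two latent classes whose difficulty profiles differ on a single item and prints the class-specific item profiles that a latent DIF analysis aims to recover without any manifest grouping variable. All parameter values are illustrative.

      # Two latent classes answer six items; item 3 is much harder for class 2,
      # i.e., it functions differentially across latent rather than manifest groups.
      set.seed(5)
      n_per_class <- 400
      diff_class1 <- c(-1.0, -0.5, 0.0, 0.5, 1.0, 1.5)
      diff_class2 <- c(-1.0, -0.5, 1.5, 0.5, 1.0, 1.5)   # item 3 shifted

      simulate_class <- function(difficulty, n) {
        theta <- rnorm(n)
        prob  <- plogis(outer(theta, difficulty, "-"))
        matrix(rbinom(n * length(difficulty), 1, prob), nrow = n)
      }

      resp1 <- simulate_class(diff_class1, n_per_class)
      resp2 <- simulate_class(diff_class2, n_per_class)

      # Class-specific proportions correct diverge on item 3 only; a mixture IRT
      # analysis tries to recover such classes from the pooled data alone.
      round(rbind(class1 = colMeans(resp1), class2 = colMeans(resp2)), 2)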

  • Contains 8 Component(s)

    In this print module, Dr. Allison J. Ames and Dr. Randall D. Penfield provide an overview of methods used for evaluating the fit of item response theory (IRT) models. Keywords: Bayesian statistics, estimation, item response theory, IRT, Markov chain Monte Carlo, MCMC, model fit, posterior distribution, posterior predictive checks

    Drawing valid inferences from item response theory (IRT) models is contingent upon a good fit of the data to the model. Violations of model-data fit have numerous consequences, limiting the usefulness and applicability of the model. This instructional module provides an overview of methods used for evaluating the fit of IRT models. Upon completing this module, the reader will have an understanding of traditional and Bayesian approaches for evaluating model-data fit of IRT models, the relative advantages of each approach, and the software available to implement each method. (A brief R sketch of traditional fit checks follows this listing.)

    Keywords: Bayesian statistics, estimation, item response theory, IRT, Markov chain Monte Carlo, MCMC, model fit, posterior distribution, posterior predictive checks

    Allison J. Ames

    Assistant Professor

    Allison is an assistant professor in the Educational Statistics and Research Methods program in the Department of Rehabilitation, Human Resources and Communication Disorders, Research Methodology, and Counseling at the University of Arkansas. There, she teaches courses in educational statistics, including a course on Bayesian inference. Allison received her Ph.D. from the University of North Carolina at Greensboro. Her research interests include Bayesian item response theory, with an emphasis on prior specification; model-data fit; and models for response processes. Her research has been published in prominent peer-reviewed journals. She enjoyed collaborating on this project with a graduate student, a senior faculty member, and the Instructional Design Team.
    Contact Allison via boykin@uark.edu

    Randall D. Penfield

    Professor, Educational Research Methodology, University of North Carolina at Greensboro, NC

    Dr. Penfield is Dean of the School of Education and a Professor of educational measurement and assessment. His research focuses on issues of fairness in testing, validity of test scores, and the advancement of methods and statistical models used in the field of assessment. In recognition of his scholarly productivity, he was awarded the 2005 early career award by the National Council on Measurement in Education and was named a Fellow of the American Educational Research Association in 2011. In addition, he has served as co-principal investigator or consultant on numerous federal grants funded by the National Science Foundation and the Department of Education.
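
    As one small illustration of the traditional (non-Bayesian) side of the module, the sketch below fits a two-parameter logistic model with the mirt R package and requests common model-data fit summaries. The simulated item parameters are assumptions, and Bayesian posterior predictive checks would require additional tooling.

      # Fit a 2PL to simulated dichotomous data and inspect model-data fit.
      # install.packages("mirt")   # if needed
      library(mirt)

      set.seed(6)
      n_persons <- 500
      a <- c(1.0, 1.2, 0.8, 1.5, 1.1)       # true slopes (assumed)
      b <- c(-1.0, -0.3, 0.0, 0.6, 1.2)     # true difficulties (assumed)
      theta <- rnorm(n_persons)
      prob  <- plogis(sweep(outer(theta, b, "-"), 2, a, "*"))
      resp  <- matrix(rbinom(n_persons * 5, 1, prob), ncol = 5)
      colnames(resp) <- paste0("item", 1:5)

      fit <- mirt(resp, 1, itemtype = "2PL", verbose = FALSE)

      M2(fit)        # limited-information overall fit statistic
      itemfit(fit)   # item-level fit (S-X2 by default)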

  • Contains 1 Component(s)

    In this print module, Dr. David Andrich discusses conceptual problems with the step metaphor for polytomous item response theory (IRT) models as a response to a previous ITEMS module. Keywords: graded response model, item response theory, IRT, polytomous items, polytomous Rasch model, step function, step metaphor

    Penfield’s (2014) “Instructional Module on Polytomous Item Response Theory Models” begins with a review of dichotomous response models, which he refers to as the building blocks of polytomous IRT models: the step function. The mathematics of these models and their interrelationships with the polytomous models are correct. Unfortunately, the step characterization for dichotomous responses, which he uses to explain the two most commonly used classes of polytomous models for ordered categories, is incompatible with the mathematical structure of these models. These two classes of models are referred to in Penfield’s paper as adjacent category models and cumulative models. At best, taken in the dynamic sense of taking a step, the step metaphor leads to a superficial understanding of the models as mere descriptions of the data; at worst, it leads to a misunderstanding of the models and of how they can be used to assess whether the empirical ordering of the categories is consistent with the intended ordering. The purpose of this note is to explain why the step metaphor is incompatible with both models and to summarize the distinct processes for each. It also shows, with concrete examples, how one of these models can be applied to better understand assessments in ordered categories.

    Keywords: graded response model, item response theory, IRT, polytomous items, polytomous Rasch model, step function, step metaphor

    David Andrich

    Chapple Professor, Graduate School of Education, The University of Western Australia, Crawley, Western Australia

  • Contains 1 Component(s)

    In this print module, Dr. Richard A. Feinberg and Dr. Howard Wainer help analysts determine, through the use of a simple linear equation, whether a particular subscore adds enough value to be worth reporting. Keywords: added value, classical test theory, CTT, linear equation, subscores, reliability, orthogonal, proportional reduction in mean squared error, PRMSE

    Subscores are often used to indicate test-takers’ relative strengths and weaknesses and so help focus remediation. But a subscore is not worth reporting if it is too unreliable to believe or if it contains no information that is not already contained in the total score. It is possible, through the use of a simple linear equation provided in this note, to determine if a particular subscore adds enough value to be worth reporting.

    Keywords: added value, classical test theory, CTT, linear equation, subscores, reliability, orthogonal, proportional reduction in mean squared error, PRMSE

    Richard A. Feinberg

    National Board of Medical Examiners, Philadelphia, PA

    Richard Feinberg is a Senior Psychometrician with NBME, where he leads and oversees the data analysis and score reporting activities for large-scale high-stakes licensure and credentialing examinations. He is also an Assistant Professor at the Philadelphia College of Osteopathic Medicine, Philadelphia, PA, where he teaches a course on Research Methods and Statistics.

    His research interests include psychometric applications in the fields of educational and psychological testing.

    He earned a PhD in Research Methodology and Evaluation from the University of Delaware, Newark, DE.

    Howard Wainer

    Retired

    Howard Wainer is an American statistician, past principal research scientist at the Educational Testing Service, adjunct professor of statistics at the Wharton School of the University of Pennsylvania, and author, known for his contributions in the fields of statistics, psychometrics, and statistical graphics.

  • Contains 1 Component(s)

    In this print module, Dr. Richard A. Feinberg and Dr. Howard Wainer show, for a broad range of conditions of item overlap on subscores, that the value of the subscore is always improved through the removal of items with little diagnostic value. Keywords: added value, classical test theory, CTT, diagnostic value, empirical Bayes, item removal, overlapping items, ReliaVAR plots, simulation, subscores

    Subscores can be of diagnostic value for tests that cover multiple underlying traits. Some items require knowledge or ability that spans more than a single trait. It is thus natural for such items to be included on more than a single subscore. Subscores only have value if they are reliable enough to justify conclusions drawn from them and if they contain information about the examinee that is distinct from what is in the total test score. In this study we show, for a broad range of conditions of item overlap on subscores, that the value of the subscore is always improved through the removal of items with little diagnostic value.

    Keywords: added value, classical test theory, CTT, diagnostic value, empirical Bayes, item removal, overlapping items, ReliaVAR plots, simulation, subscores

    Richard A. Feinberg

    National Board of Medical Examiners, Philadelphia, PA

    Richard Feinberg is a Senior Psychometrician with NBME, where he leads and oversees the data analysis and score reporting activities for large-scale high-stakes licensure and credentialing examinations. He is also an Assistant Professor at the Philadelphia College of Osteopathic Medicine, Philadelphia, PA, where he teaches a course on Research Methods and Statistics.

    His research interests include psychometric applications in the fields of educational and psychological testing.

    He earned a PhD in Research Methodology and Evaluation from the University of Delaware, Newark, DE.

    Howard Wainer

    Retired

    Howard Wainer is an American statistician, past principal research scientist at the Educational Testing Service, adjunct professor of statistics at the Wharton School of the University of Pennsylvania, and author, known for his contributions in the fields of statistics, psychometrics, and statistical graphics.

  • Contains 1 Component(s)

    In this print module, Dr. Tim Moses describes and extends X-to-Y regression measures that have been proposed for use in the assessment of X-to-Y scaling and equating results. Keywords: concordance, equating, heteroscedastic, regression, prediction error, scaling, scaling error, X-to-Y, Y-to-X

    This module describes and extends X-to-Y regression measures that have been proposed for use in the assessment of X-to-Y scaling and equating results. Measures are developed that are similar to those based on prediction error in regression analyses but that are directly suited to interests in scaling and equating evaluations. The regression and scaling function measures are compared in terms of their uncertainty reductions, error variances, and the contribution of true score and measurement error variances to the total error variances. The measures are also demonstrated as applied to an assessment of scaling results for a math test and a reading test. The results of these analyses illustrate the similarity of the regression and scaling measures for scaling situations when the tests have a correlation of at least .80, and also show the extent to which the measures can be adequate summaries of nonlinear regression and nonlinear scaling functions, and of heteroscedastic errors. After reading this module, readers will have a comprehensive understanding of the purposes, uses, and differences of regression and scaling functions.

    Keywords: concordance, equating, heteroscedastic, regression, prediction error, scaling, scaling error, X-to-Y, Y-to-X

    Tim Moses

    Chief Psychometrician and Robert L. Brennan Chair of Psychometric Research at College Board