Module 05: Criterion-referenced Tests in Program Curriculum Evaluation
Contains 1 Component(s)
This ITEM module describes how criterion-referenced tests (CRTs) can be used in program and curriculum evaluations for developing information to form judgments about educational programs and curricula.
This monograph describes how criterion-referenced tests (CRTs) can be used in program and curriculum evaluations for developing information to form judgments about educational programs and curricula. The material is organized as follows: a brief introduction to the monograph, its purpose and goals, a discussion of the relationship between evaluation and CRTs, pertinent information about program and curriculum evaluation, relevant facts about CRTs, and a summation. A principal goal of the essay is to describe concepts and procedures in terms that are instructionally illuminating. The reader is guided to identify and examine particular points at which decisions must be made about how, when, and why CRTs may aid the evaluation process. These steps are each identified as "An Instructional Step" and presented in separate boxes with pertinent guiding questions. Conciseness is a further aim of this monograph, one consequence of which is that several important concepts are only cursorily described or alluded to. Annotated references are included. Accompanying this instructional monograph is a "Student's Self-Test." An "Instructor's Guide" with expanded references and materials for photocopying or preparing transparencies is available by mail order (see "Teaching Aids" ordering information).
Keywords: criterion-referenced test, CRT, curriculum evaluation, program evaluation
Module 04: Formula Scoring of Multiple-Choice Tests
Contains 1 Component(s)
This ITEM module discusses the formula scoring of multiple-choice tests.
Formula scoring is a procedure designed to reduce multiple-choice test score irregularities due to guessing. Typically, a formula score is obtained by subtracting a proportion of the number of wrong responses from the number correct. Examinees are instructed to omit items when their answers would be sheer guesses among all choices but otherwise to guess when unsure of an answer. Thus, formula scoring is not intended to discourage guessing when an examinee can rule out one or more of the options within a multiple-choice item. Examinees who, contrary to the instructions, do guess blindly among all choices are not penalized by formula scoring on the average; depending on luck, they may obtain better or worse scores than if they had refrained from this guessing. In contrast, examinees with partial information who refrain from answering tend to obtain lower formula scores than if they had guessed among the remaining choices. (Examinees with misinformation may be exceptions.) Formula scoring is viewed as inappropriate for most classroom testing but may be desirable for speeded tests and for difficult tests with low passing scores. Formula scores do not approximate scores from comparable fill-in-the-blank tests, nor can formula scoring preclude unrealistically high scores for examinees who are very lucky.
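The scoring rule described above can be sketched in a few lines. The summary does not state which proportion of wrong responses is subtracted; this sketch assumes the conventional correction, in which each wrong answer on a k-option item costs 1/(k - 1) of a point and omitted items cost nothing. The function name and the example numbers are illustrative only.

```python
from fractions import Fraction


def formula_score(num_right, num_wrong, num_options):
    """Formula score = rights - wrongs / (options - 1).

    Assumes the conventional correction; omitted items are simply
    not counted as right or wrong.
    """
    if num_options < 2:
        raise ValueError("a multiple-choice item needs at least 2 options")
    return Fraction(num_right) - Fraction(num_wrong, num_options - 1)


# A 40-item, 4-option test: 28 right, 9 wrong, 3 omitted.
score = formula_score(28, 9, 4)

# Blind guessing on 12 four-option items yields, on average,
# 3 right and 9 wrong, so the expected net gain is zero.
expected_gain_from_blind_guessing = formula_score(3, 9, 4)

print(score, expected_gain_from_blind_guessing)
```

Under this assumption, an examinee who can rule out one option before guessing expects a positive net gain, which matches the point that formula scoring is not meant to discourage informed guessing.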
Keywords: formula scoring, multiple-choice test score, guessing
Module 03: Reliability of Scores From Teacher-Made Tests
Contains 1 Component(s)
This ITEM module discusses the reliability of teacher-made tests.
Reliability is the property of a set of test scores that indicates the amount of measurement error associated with the scores. Teachers need to know about reliability so that they can use test scores to make appropriate decisions about their students. The level of consistency of a set of scores can be estimated by using the methods of internal analysis to compute a reliability coefficient. This coefficient, which can range between 0.0 and +1.0, usually has values around 0.50 for teacher-made tests and around 0.90 for commercially prepared standardized tests. Its magnitude can be affected by such factors as test length, test-item difficulty and discrimination, time limits, and certain characteristics of the group: the extent of their testwiseness, their level of motivation, and their homogeneity in the ability measured by the test.
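As an illustration of an internal-analysis estimate, the sketch below computes coefficient alpha from a small matrix of item scores and uses the Spearman-Brown formula to project the effect of test length. The summary does not say which coefficient the module uses, so the choice of alpha is an assumption, and the data are made up for the example.

```python
from statistics import pvariance


def cronbach_alpha(item_scores):
    """Coefficient alpha for rows of item scores (one row per student)."""
    n_items = len(item_scores[0])
    if n_items < 2:
        raise ValueError("alpha needs at least two items")
    item_variances = [
        pvariance([row[j] for row in item_scores]) for j in range(n_items)
    ]
    total_variance = pvariance([sum(row) for row in item_scores])
    return (n_items / (n_items - 1)) * (1 - sum(item_variances) / total_variance)


def spearman_brown(reliability, length_factor):
    """Projected reliability when the test is lengthened by length_factor."""
    return (length_factor * reliability) / (1 + (length_factor - 1) * reliability)


# Hypothetical right/wrong scores for 5 students on 4 items.
scores = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
alpha = cronbach_alpha(scores)

# Doubling a test whose reliability is 0.50 with comparable items.
projected = spearman_brown(0.50, 2)
print(round(alpha, 3), round(projected, 3))
```

The Spearman-Brown projection assumes the added items behave like the existing ones, so it is best read as a rough planning figure.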
Keywords: reliability, test scores, reliability coefficient, internal analysis
Module 02: Obtaining Intended Weights When Combining Students' Scores
Contains 1 Component(s)
This ITEM module describes how scores can be adjusted so that the intended weights are obtained.
An instructor typically combines students' scores from several measures such as assignments and exams when assigning course grades. The relative weights intended for these scores are at least inferred and often stated explicitly by the instructor. This module describes how scores can be adjusted so that the intended weights are obtained. Techniques are discussed for two grading criteria: (a) grading students through comparison to others in the class and (b) grading students through comparison to predetermined levels of performance.
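A minimal sketch of the two situations follows. The summary does not give the module's own adjustment procedures, so both functions rest on common assumptions: for comparison within the class, each component is converted to z-scores before weighting so that a component with a large spread does not dominate; for predetermined performance levels, each component is converted to a percentage of its maximum before weighting. All names and numbers are hypothetical.

```python
from statistics import mean, pstdev


def z_scores(scores):
    """Standardize scores to mean 0 and standard deviation 1."""
    m, s = mean(scores), pstdev(scores)
    return [(x - m) / s for x in scores]


def norm_referenced_composite(components, weights):
    """Weighted sum of z-scores; components maps name -> list of scores."""
    standardized = {name: z_scores(vals) for name, vals in components.items()}
    n_students = len(next(iter(components.values())))
    return [
        sum(weights[name] * standardized[name][i] for name in components)
        for i in range(n_students)
    ]


def criterion_referenced_composite(earned, maximum, weights):
    """Weighted sum of percent-of-maximum scores for one student."""
    return sum(weights[k] * 100.0 * earned[k] / maximum[k] for k in earned)


# Homework scores are bunched together; exam scores are spread out.
components = {"homework": [96, 94, 92, 90], "exam": [40, 60, 80, 100]}
weights = {"homework": 0.5, "exam": 0.5}

# Raw weighted sums follow the exam ranking almost entirely.
raw = [0.5 * h + 0.5 * e for h, e in zip(components["homework"], components["exam"])]

# After standardizing, the two opposite rankings cancel, as equal weights imply.
adjusted = norm_referenced_composite(components, weights)

percent = criterion_referenced_composite(
    {"homework": 45, "exam": 60}, {"homework": 50, "exam": 100},
    {"homework": 0.3, "exam": 0.7},
)
print(raw, [round(a, 6) for a in adjusted], percent)
```

Standardizing equalizes the spread of the components but does not account for correlations among them, so the resulting weights are only approximately the intended ones.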
Keywords: grading criteria, intended weights, combine scores, course grade
Module 01: Performance Assessments: Design and Development
Contains 1 Component(s)
This ITEM module presents and illustrates specific rules of test design in the form of a step-by-step strategy.
Achievement can be, and often is, measured by means of observation and professional judgment. This form of measurement is called performance assessment. Developers of large-scale assessments of communication skills often rely on performance assessments in which carefully devised exercises elicit performance that is observed and judged by trained raters. Teachers also rely heavily on day-to-day observation and judgment. Like other tests, quality performance assessment must be carefully planned and developed to conform to specific rules of test design. This module presents and illustrates those rules in the form of a step-by-step strategy for designing such assessments, through the specification of (a) reason(s) for assessment, (b) type of performance to be evaluated, (c) exercises that will elicit performance, and (d) systematic rating procedures. General guidelines are presented for maximizing the reliability, validity, and economy of performance assessments.
Keywords: performance assessment, reliability, validity
Digital Module 09: Sociocognitive Assessment for Diverse Populations
Contains 2 Component(s)
In this digital ITEMS module, Dr. Robert [Bob] Mislevy and Dr. Maria Elena Oliveri introduce and illustrate a sociocognitive perspective on educational measurement, which focuses on a variety of design and implementation considerations for creating fair and valid assessments for learners from diverse populations with diverse sociocultural experiences.
In this digital ITEMS module, Dr. Robert [Bob] Mislevy and Dr. Maria Elena Oliveri introduce and illustrate a sociocognitive perspective on educational measurement, which focuses on a variety of design and implementation considerations for creating fair and valid assessments for learners from diverse populations with diverse sociocultural experiences. The first part of the module, narrated by Dr. Mislevy, contains a general overview section, a description of the sociocognitive framing of assessment issues, and a section on implications for assessment around key concepts such as reliability, validity, and fairness. The second part of the module, narrated by Dr. Oliveri, contains a section on frameworks for fairness investigations and principled assessment design as well as brief vignette-based illustrations of the principles using a prototype activity to support collaboration and communication skills in the workplace. The module is designed to provide a relatively high-level, conceptual, and non-statistical overview and is intended for interdisciplinary team members who need to create fair and equitable learning and assessment systems for diverse populations.
Keywords: assessment design, Bayesian statistics, cross-cultural assessment, diverse populations, educational measurement, evidence-centered design, fairness, international assessments, prototype, reliability, sociocognitive assessment, validity
Frederic M. Lord Chair in Measurement and Statistics
Dr. Robert [Bob] Mislevy is the Frederic M. Lord Chair in Measurement and Statistics at Educational Testing Service as well as Professor Emeritus of Measurement, Statistics, and Evaluation at the University of Maryland, with affiliations with Second Language Acquisition and Survey Methods. Dr. Mislevy’s research applies developments in statistics, technology, and cognitive science to practical problems in educational assessment. His work includes a multiple-imputation approach to integrate sampling and psychometric models in the National Assessment of Educational Progress (NAEP), an evidence-centered framework for assessment design, and simulation- and game-based assessment with the Cisco Networking Academy. Among his many awards are AERA’s Raymond B. Cattell Early Career Award for Programmatic Research, NCME’s Triennial Award for Technical Contributions to Educational Measurement (3 times), NCME’s Award for Career Contributions, AERA’s E.F. Lindquist Award for contributions to educational assessment, the International Language Testing Association's Messick Lecture Award, and AERA Division D’s inaugural Robert L. Linn Distinguished Address Award. He is a member of the National Academy of Education and a past president of the Psychometric Society. He has served on projects for the National Research Council, the Spencer Foundation, and the MacArthur Foundation concerning assessment, learning, and cognitive psychology, and on the Gordon Commission on the Future of Educational Assessment. His most recent book is "Sociocognitive Foundations of Educational Assessment" for which he received the 2019 NCME Annual Award and on which this ITEMS module is based.
Maria Elena Oliveri
Dr. María Elena Oliveri is a Research Scientist in the Academic to Career research center at the Educational Testing Service (ETS). Her research focuses on fairness, validity, diversity, equity, and innovative assessment design and development of competency-based digital formative assessments of 21st century skills. She has actively disseminated her research in numerous published articles in journals such as Applied Measurement in Education and the International Journal of Testing; she has led various professional development workshops at national and international conferences such as AERA, NCME, and ITC; and she has presented at numerous national and international conferences. In earlier stages of her career, she was a literacy mentor to second-language teachers in the Vancouver School District as well as a teacher of second language learners and students with disabilities and she has hosted workshops for educators on innovative approaches to assessing culturally and linguistically diverse learners. She also was a lecturer at the University of British Columbia, Vancouver, Canada where she taught courses on assessment and developmental psychology to students pursuing Bachelor of Education degrees in French Immersion programs.
Digital Module 08: Foundations of Operational Item Analysis
Contains 7 Component(s)
In this digital ITEMS module, Dr. Hanwook Yoo and Dr. Ronald K. Hambleton provide an accessible overview of operational item analysis approaches for dichotomously scored items within the frameworks of classical test theory and item response theory.
Item analysis is an integral part of operational test development and is typically conducted within two popular statistical frameworks: classical test theory (CTT) and item response theory (IRT). In this digital ITEMS module, Dr. Hanwook Yoo and Dr. Ronald K. Hambleton provide an accessible overview of operational item analysis approaches for dichotomously scored items within these frameworks. They review the different stages of test development and the associated item analyses used to identify poorly performing items and to select items effectively. Moreover, they walk through the computational and interpretational steps for CTT- and IRT-based evaluation statistics using simulated data examples and review various graphical displays such as distractor response curves, item characteristic curves, and item information curves. The digital module contains sample data, Excel sheets with various templates and examples, diagnostic quiz questions, data-based activities, curated resources, and a glossary.
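To make the CTT side concrete, the sketch below computes two standard statistics for dichotomously scored items: difficulty (the proportion of examinees answering correctly) and discrimination (the correlation between the item score and the total score on the remaining items). It is not taken from the module's Excel templates; the function names and the tiny data set are made up for illustration.

```python
from statistics import mean, pstdev


def pearson(x, y):
    """Pearson correlation between two equal-length numeric lists."""
    mx, my = mean(x), mean(y)
    covariance = mean([(a - mx) * (b - my) for a, b in zip(x, y)])
    return covariance / (pstdev(x) * pstdev(y))


def item_analysis(responses):
    """CTT difficulty and corrected item-total discrimination per item.

    responses: one row per examinee, each a list of 0/1 item scores.
    """
    n_items = len(responses[0])
    results = []
    for j in range(n_items):
        item = [row[j] for row in responses]
        # Exclude the item itself so it does not inflate its own correlation.
        rest = [sum(row) - row[j] for row in responses]
        results.append(
            {
                "item": j + 1,
                "difficulty": mean(item),
                "discrimination": pearson(item, rest),
            }
        )
    return results


responses = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
for row in item_analysis(responses):
    print(row)
```

Items with very low or negative discrimination are the usual candidates for review; an item that everyone answers the same way has no variance, and the correlation is undefined for it.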
Keywords: Classical test theory, corrections, difficulty, discrimination, distractors, item analysis, item response theory, R Shiny, TAP, test development
Hanwook (Henry) Yoo
Managing Senior Psychometrician
Henry is a managing senior psychometrician in the Psychometric Analysis and Research division at Educational Testing Service (ETS). At ETS, he manages operational psychometric work for graduate admissions programs. He received his Ed.D. from the Research and Evaluation Methods Program at the University of Massachusetts, Amherst. His research interests include measurement invariance across subgroups, innovative score reporting, construct validity of English language proficiency assessment, and applications of IRT to computer-based testing. He is a co-author of a bibliography of research on test score reporting, which is available at the NCME website (https://ncme.connectedcommunity.org/ncmedev/viewdocument/score-reporting-bibliography).
Ronald K. Hambleton
Ronald holds the titles of Distinguished University Professor and Executive Director of the Center for Educational Assessment at the University of Massachusetts, Amherst. He earned his Ph.D. in 1969 from the University of Toronto with specialties in psychometric methods and statistics. He is the co-author or co-editor of eight measurement books as well as author or co-author of many research papers, reports, and reviews spanning 50 years on topics such as standard-setting, score reporting, test adaptation, and applications of IRT. He is currently conducting research on a number of topics: computer-based testing, methods and guidelines for adapting tests from one language and culture to another, and design and field-testing of new approaches for reporting test scores.
Digital Module 07: Subscores - Evaluation & Reporting
Contains 3 Component(s) Recorded On: 09/09/2019
In this digital ITEMS module, Dr. Sandip Sinharay reviews the status quo on the reporting of subscores, which includes how they are used in operational reporting, what kinds of professional standards they need to meet, and how their psychometric properties can be evaluated.
In this digital ITEMS module, Dr. Sandip Sinharay reviews the status quo on the reporting of subscores. Specifically, he first provides examples of operationally-reported subscores, discusses why subscores are in high demand, and discusses professional quality standards that subscores have to satisfy. He then describes various statistical methods that can be used to evaluate whether subscores satisfy professional standards, which include descriptive statistics, DIMTEST / DETECT, factor analysis, multidimensional item response theory, and the Haberman method. He provides guidance for how to implement these methods on real data using the R package ‘subscores’.
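The core comparison in the Haberman method can be sketched from summary statistics alone. The sketch below does not use the R package mentioned above; it is a hand-rolled Python illustration under classical test theory assumptions (measurement errors uncorrelated with true scores and with each other, and the total score includes the subscore). It compares the proportional reduction in mean squared error (PRMSE) when the true subscore is predicted from the observed subscore versus from the observed total score. All input values are hypothetical.

```python
def prmse_from_subscore(subscore_reliability):
    """PRMSE of the observed subscore equals the subscore reliability."""
    return subscore_reliability


def prmse_from_total(var_sub, var_total, cov_sub_total, subscore_reliability):
    """PRMSE when the true subscore is predicted from the total score.

    Assumes the total score contains the subscore, so the covariance of the
    true subscore with the total is cov(sub, total) minus the subscore's
    error variance.
    """
    error_var = (1.0 - subscore_reliability) * var_sub
    true_var = subscore_reliability * var_sub
    cov_true_total = cov_sub_total - error_var
    return cov_true_total ** 2 / (true_var * var_total)


def has_added_value(var_sub, var_total, cov_sub_total, subscore_reliability):
    """A subscore adds value if it predicts its true score better than the total."""
    return prmse_from_subscore(subscore_reliability) > prmse_from_total(
        var_sub, var_total, cov_sub_total, subscore_reliability
    )


# Hypothetical summary statistics for one subscore and the total score.
p_total = prmse_from_total(
    var_sub=16.0, var_total=100.0, cov_sub_total=32.0, subscore_reliability=0.75
)
print(round(p_total, 4), has_added_value(16.0, 100.0, 32.0, 0.75))
```

In practice the variances, covariance, and reliability would be estimated from response data, and the comparison is usually reported for every subscore on the test.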
Keywords: Diagnostic scores, disattenuation, DETECT, DIMTEST, factor analysis, multidimensional item response theory (MIRT), proportional reduction in mean squared error (PRMSE), reliability, subscores
Principal Research Scientist
Dr. Sandip Sinharay is a Principal Research Scientist at ETS. He received his Ph.D. and M.S. degrees from the Department of Statistics at Iowa State University. He has received five awards from the National Council on Measurement in Education including the Bradley Hanson Award (2018), the Technical or Scientific Contributions to the Field of Educational Measurement (2009 and 2015), the Jason Millman Promising Measurement Scholar Award (2006), and the Alicia Cascallar Award for an Outstanding Paper by an Early Career Scholar (2005). Sandip has coedited two published volumes and authored or coauthored more than 100 articles in peer-reviewed statistics and psychometrics journals and edited books. His research interests include statistical methods for detecting test fraud, reporting of subscores, Bayesian statistical methods, as well as model-data fit and model selection methods. The collaboration with the instructional design team on this project was a unique learning experience for Sandip.
Digital Module 06: Posterior Predictive Model Checking
Contains 14 Component(s) Recorded On: 04/24/2019
In this digital ITEMS module, Dr. Allison Ames and Aaron Myers discuss the most common Bayesian approach to model-data fit evaluation, which is called Posterior Predictive Model Checking (PPMC), for simple linear regression and item response theory models.
In this digital ITEMS module, Dr. Allison Ames and Aaron Myers discuss the most common Bayesian approach to model-data fit evaluation, called Posterior Predictive Model Checking (PPMC). Drawing valid inferences from modern measurement models is contingent upon a good fit of the data to the model, and violations of model-data fit have numerous adverse consequences that limit the usefulness and applicability of the model. As Bayesian estimation is becoming more common, understanding the Bayesian approaches for evaluating model-data fit is critical. The instructors review the conceptual foundation of Bayesian inference as well as PPMC and walk through the computational steps of PPMC using real-life data examples from simple linear regression and item response theory (IRT) analysis. They provide guidance for how to interpret PPMC results and discuss how to implement PPMC for other model(s) and data. The digital module contains sample data, SAS code, diagnostic quiz questions, data-based activities, curated resources, and a glossary.
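The computational steps of PPMC for simple linear regression can be sketched as follows. This is not the module's SAS code; it is a small Python illustration that assumes the standard noninformative prior (flat on the coefficients, proportional to 1/sigma^2 on the error variance), for which the posterior can be sampled directly. The predictor is centered so that intercept and slope are independent given sigma. The function name, the simulated data, and the choice of the maximum as discrepancy statistic are all assumptions made for the example.

```python
import random
from statistics import mean


def ppmc_simple_regression(x, y, statistic, n_draws=2000, seed=1):
    """Posterior predictive p-value for a statistic under y = a + b*(x - xbar) + e."""
    rng = random.Random(seed)
    n = len(y)
    x_bar = mean(x)
    xc = [xi - x_bar for xi in x]                      # centered predictor
    sxx = sum(v * v for v in xc)
    a_hat = mean(y)                                    # least-squares intercept
    b_hat = sum(v * yi for v, yi in zip(xc, y)) / sxx  # least-squares slope
    residuals = [yi - (a_hat + b_hat * v) for v, yi in zip(xc, y)]
    df = n - 2
    s2 = sum(r * r for r in residuals) / df
    observed = statistic(y)
    exceed = 0
    for _ in range(n_draws):
        # 1. Draw sigma^2 from its scaled inverse chi-square posterior.
        chi2 = rng.gammavariate(df / 2, 2.0)
        sigma = (df * s2 / chi2) ** 0.5
        # 2. Draw the coefficients given sigma.
        a = rng.gauss(a_hat, sigma / n ** 0.5)
        b = rng.gauss(b_hat, sigma / sxx ** 0.5)
        # 3. Simulate a replicated data set and compare the statistic.
        y_rep = [rng.gauss(a + b * v, sigma) for v in xc]
        if statistic(y_rep) >= observed:
            exceed += 1
    return exceed / n_draws


# Simulated data that follow the assumed model.
data_rng = random.Random(0)
x = list(range(1, 21))
y = [2.0 + 0.5 * xi + data_rng.gauss(0.0, 1.0) for xi in x]

ppp = ppmc_simple_regression(x, y, statistic=max)
print(ppp)
```

Posterior predictive p-values near 0 or 1 suggest that the model fails to reproduce the chosen feature of the data; values in the middle of the range give no evidence of misfit for that feature.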
Keywords: Bayesian inference; simple linear regression; item response theory (IRT); model-data fit; posterior predictive model checking (PPMC); Bayes’ theorem; Yen’s Q3; item fit
Allison J. Ames
Allison is an assistant professor in the Educational Statistics and Research Methods program in the Department of Rehabilitation, Human Resources and Communication Disorders, Research Methodology, and Counseling at the University of Arkansas. There, she teaches courses in educational statistics, including a course on Bayesian inference. Allison received her Ph.D. from the University of North Carolina at Greensboro. Her research interests include Bayesian item response theory, with an emphasis on prior specification; model-data fit; and models for response processes. Her research has been published in prominent peer-reviewed journals. She enjoyed collaborating on this project with a graduate student, senior faculty member, and the Instructional Design Team.
Graduate Assistant / Doctoral Student
Aaron is a doctoral student in the Educational Statistics and Research Methods program at the University of Arkansas. His research interests include Bayesian inference, data mining, multidimensional item response theory, and multilevel modeling. Aaron previously received his M.A. in Quantitative Psychology from James Madison University. He currently serves as a graduate assistant where he teaches introductory statistics and works in a statistical consulting lab.
Digital Module 05: The G-DINA Framework
Contains 4 Component(s) Recorded On: 11/14/2019
In this digital ITEMS module, Dr. Wenchao Ma and Dr. Jimmy de la Torre introduce the G-DINA model, which is a general framework for specifying, estimating, and evaluating a wide variety of cognitive diagnosis models for the purpose of diagnostic measurement.
In this digital ITEMS module, Dr. Wenchao Ma and Dr. Jimmy de la Torre introduce the generalized deterministic inputs, noisy “and” gate (G-DINA) model, which is a general framework for specifying, estimating, and evaluating a wide variety of cognitive diagnosis models (CDMs). The module contains a non-technical introduction to diagnostic measurement, an introductory overview of the G-DINA model as well as common special cases, and a review of model-data fit evaluation practices within this framework. They use the flexible GDINA R package, which is available for free within the R environment and provides a user-friendly graphical interface in addition to the code-driven layer. The digital module also contains videos of worked examples, solutions to data activity questions, curated resources, a glossary, and quizzes with diagnostic feedback.
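One of the common special cases, the DINA model, is simple enough to sketch directly. The example below does not use the GDINA R package; it is a plain Python illustration of the DINA response function, in which an examinee who has mastered every attribute required by an item answers correctly with probability 1 minus the slip parameter, and any other examinee answers correctly with the guessing probability. The Q-matrix row and the parameter values are made up for the example.

```python
from itertools import product


def dina_probability(alpha, q_row, guess, slip):
    """P(correct response) under the DINA model.

    alpha: tuple of 0/1 attribute mastery indicators for one examinee.
    q_row: tuple of 0/1 flags marking which attributes the item requires.
    """
    has_all_required = all(
        mastered == 1
        for mastered, required in zip(alpha, q_row)
        if required == 1
    )
    return 1.0 - slip if has_all_required else guess


# A hypothetical item requiring attributes 1 and 2 but not attribute 3.
q_row = (1, 1, 0)
table = {
    alpha: dina_probability(alpha, q_row, guess=0.2, slip=0.1)
    for alpha in product((0, 1), repeat=3)
}
for alpha, prob in table.items():
    print(alpha, prob)
```

The general G-DINA formulation relaxes this all-or-nothing structure by allowing each combination of required attributes to have its own success probability.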
Keywords: diagnostic measurement; cognitive diagnosis models (CDMs); diagnostic classification models (DCMs); G-DINA framework; GDINA package; model fit; model comparison; Q-matrix; validation
Dr. Wenchao Ma is an assistant professor in the Educational Research program in the Department of Educational Studies in Psychology, Research Methodology, and Counseling at the University of Alabama. He received his Ph.D. from Rutgers, The State University of New Jersey. His research interests lie in educational and psychological measurement in general, and item response theory and cognitive diagnosis modeling in particular. Wenchao was a recipient of the 2017 Bradley Hanson Award for Contributions to Educational Measurement given by the National Council on Measurement in Education as well as the 2018 Outstanding Dissertation Award given by the American Educational Research Association.
Jimmy de la Torre
Dr. Jimmy de la Torre is a Professor at the Faculty of Education at The University of Hong Kong. His research interests include latent variable models for educational and psychological measurement and how to use assessment to improve classroom instruction and learning. His recent work includes the development of various cognitive diagnosis models, implementation of estimation codes for cognitive diagnosis models, and development of the G-DINA framework for model estimation, test comparison, and Q-matrix validation, which is the focus of this module. He is an ardent advocate of CDM, and, to date, has conducted more than a dozen national and international CDM workshops. Jimmy was a recipient of the 2008 Presidential Early Career Award for Scientists and Engineers given by the White House, the 2009 Jason Millman Promising Measurement Scholar Award, and the 2017 Bradley Hanson Award for Contributions to Educational Measurement awarded by the National Council on Measurement in Education (NCME).