Society for Mathematical Psychology

SMP 2026 Salon ABC

Measurement & Testing

Nathan Gillespie

Inferring individual-difference characteristics from free-response text data, a process commonly known as content analysis, is a foundational research technique across the behavioral and social sciences. Yet the methods used to conduct it are strikingly varied. Research practice highlights three distinct styles of content analysis: 1) human-centered content analysis, 2) dictionary-centered content analysis, and 3) machine-learning-centered content analysis. Across these approaches, it is unclear how useful and generalizable the results of any given content analysis are, which leaves open the question of how suitable content analysis really is for informing and testing hypotheses about individual differences. While much has been said about ways to measure the empirical reliability of a content analysis, there is little theory about the cognitive processes that generate its output: those of the people producing the text and those of the researchers interpreting it. This review triangulates perspectives from generalizability theory, latent variable theory, and cognitive psychometrics to categorize variability in the outputs of each type of content analysis. Together, these perspectives clarify how suitable content analysis is for understanding latent aspects of behavior and cognition. Implications for more robust design and data modeling in content analytic studies will be discussed, as well as recommendations for researchers seeking to use content analysis in future work.
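
Generalizability theory, one of the perspectives named above, partitions variability in coded scores into facets such as persons and raters. As a minimal sketch, assuming the simplest crossed persons x raters design with one coded score per cell (a design chosen here for illustration, not taken from the review), the code below estimates variance components from ANOVA mean squares and computes the relative G coefficient for a chosen number of raters. The function names `g_study` and `g_coefficient` and the ratings are hypothetical.

```python
"""Sketch: one-facet crossed persons x raters G-study (hypothetical example)."""


def g_study(scores):
    """Estimate variance components for a persons x raters design.

    scores[p][r] is the code assigned by rater r to person p's text.
    Returns (var_person, var_rater, var_residual); the residual confounds
    the person-by-rater interaction with error.
    """
    n_p, n_r = len(scores), len(scores[0])
    grand = sum(sum(row) for row in scores) / (n_p * n_r)
    person_means = [sum(row) / n_r for row in scores]
    rater_means = [sum(row[r] for row in scores) / n_p for r in range(n_r)]

    # Sums of squares for the two main effects and the total
    ss_p = n_r * sum((m - grand) ** 2 for m in person_means)
    ss_r = n_p * sum((m - grand) ** 2 for m in rater_means)
    ss_tot = sum((x - grand) ** 2 for row in scores for x in row)
    ss_res = ss_tot - ss_p - ss_r

    # Mean squares
    ms_p = ss_p / (n_p - 1)
    ms_r = ss_r / (n_r - 1)
    ms_res = ss_res / ((n_p - 1) * (n_r - 1))

    # Expected-mean-square solutions; negative estimates truncated at zero
    var_res = ms_res
    var_p = max((ms_p - ms_res) / n_r, 0.0)
    var_r = max((ms_r - ms_res) / n_p, 0.0)
    return var_p, var_r, var_res


def g_coefficient(var_p, var_res, n_raters):
    """Relative G coefficient when scores are averaged over n_raters raters."""
    return var_p / (var_p + var_res / n_raters)


if __name__ == "__main__":
    # Hypothetical codes from 3 raters for 5 respondents' texts
    ratings = [
        [4, 5, 4],
        [2, 3, 2],
        [5, 5, 4],
        [1, 2, 2],
        [3, 3, 4],
    ]
    vp, vr, ve = g_study(ratings)
    for k in (1, 3, 6):
        print(f"{k} rater(s): G = {g_coefficient(vp, ve, k):.3f}")
```

The relative coefficient ignores the rater main effect (a constant offset shared by everyone a rater codes), so it speaks to rank-ordering individuals rather than to absolute score levels.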

This is an in-person presentation on July 20, 2026 (10:40 ~ 11:00 EDT).

Meike Snijder-Steinhilber
Anna-Lena Schubert

The replication crisis has prompted growing interest in sequential testing as an alternative to fixed-sample designs. Here, the Sequential Probability Ratio Test (SPRT), originally formalized by Wald (1945) as a framework for continuous evidence evaluation, is applied in the form of the sequential ANOVA: after each new observation, the likelihood ratio is calculated and compared against two predefined boundaries. If the ratio crosses the upper boundary, the alternative hypothesis is accepted; if it crosses the lower boundary, the null hypothesis is accepted; otherwise, sampling continues. This framework improves sampling efficiency while controlling Type I and Type II error rates, reducing the number of observations by 58% on average compared to fixed-sample designs (Steinhilber et al., 2024). Yet practical questions remain. Are decisions made after only a few observations trustworthy in terms of their error rates? If data collection continues past an initial boundary crossing, violating the SPRT stopping rule (whether out of curiosity or by accident), what are the properties of the resulting second decisions? And how common is it for the likelihood ratio, after crossing one boundary, to reverse direction and drift back into the continuation region? Our simulations show that researchers need not be concerned about how quickly a boundary is crossed: early decisions are indeed trustworthy. Second decisions show lower error rates than first decisions but are less efficient. A boundary reversal occurs in 66% of cases, but the likelihood ratio typically returns to the original boundary, confirming the first decision; the remaining 34% of cases show no reversal. Based on these findings, the talk will offer concrete guidance for researchers facing practical questions when implementing sequential ANOVA.
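
The stopping rule described above can be sketched with Wald's original SPRT. As a simplifying assumption, the sketch tests a normal mean with known standard deviation (H0: mu = 0 vs. H1: mu = delta) rather than the sequential ANOVA used in the talk, whose likelihood ratio is built from the F statistic. The boundaries use Wald's standard approximations A = (1 - beta) / alpha and B = beta / (1 - alpha), applied on the log scale. The names `sprt_normal_mean` and `simulate` and all parameter values are hypothetical.

```python
import math
import random


def sprt_normal_mean(stream, delta, sigma=1.0, alpha=0.05, beta=0.20):
    """Wald SPRT of H0: mu = 0 against H1: mu = delta (sigma known).

    Returns (decision, n_observations, log_likelihood_ratio), where
    decision is "H1", "H0", or "continue" if the stream ran out first.
    """
    upper = math.log((1 - beta) / alpha)  # crossing it accepts H1
    lower = math.log(beta / (1 - alpha))  # crossing it accepts H0
    llr, n = 0.0, 0
    for x in stream:
        n += 1
        # Log of N(x; delta, sigma) / N(x; 0, sigma) for one observation
        llr += (delta * x - delta ** 2 / 2) / sigma ** 2
        if llr >= upper:
            return "H1", n, llr
        if llr <= lower:
            return "H0", n, llr
    return "continue", n, llr


def simulate(true_mu, delta=0.5, reps=2000, max_n=10_000, seed=1):
    """Monte Carlo proportion of H1 decisions and mean sample size."""
    rng = random.Random(seed)
    n_h1, total_n = 0, 0
    for _ in range(reps):
        stream = (rng.gauss(true_mu, 1.0) for _ in range(max_n))
        decision, n, _ = sprt_normal_mean(stream, delta)
        n_h1 += decision == "H1"
        total_n += n
    return n_h1 / reps, total_n / reps


if __name__ == "__main__":
    # Under H0 the H1 rate estimates the Type I error rate; under H1, the power.
    print("H0 true: P(accept H1), mean n =", simulate(0.0))
    print("H1 true: P(accept H1), mean n =", simulate(0.5))
```

Because the log-likelihood ratio is a running sum, continuing to sample after a first crossing (the "second decision" scenario above) only requires ignoring the early `return` and recording every later crossing.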

This is an in-person presentation on July 20, 2026 (11:00 ~ 11:20 EDT).

Ms. Laura Windred
Prof. Aaron Seitz
Prof. Susanne Jaeggi
Prof. Andrew Heathcote
Dr. Quentin Gronau
Guy Hawkins

Attentional control is commonly measured using conflict paradigms such as the Flanker, Simon, and Stroop tasks, which consistently yield large group-level congruency effects. The reliability of individual-difference estimates derived from these tasks, however, remains less certain. Increasing interference demands, as in squared conflict paradigms, may improve the precision of individual-difference estimates. The reliability of both group- and individual-level estimates may also vary as a function of testing duration. We examined the relationship between cumulative testing time and the reliability of group- and individual-level estimates. Participants completed 10 short blocks of a single squared conflict paradigm (Flanker Squared, Simon Squared, or Stroop Squared). First, we quantified the evidence for the group-level congruency effect by computing Bayes factors sequentially across cumulative blocks, identifying the point at which additional testing yields diminishing returns for group-level inference. Second, we evaluated the reliability of individual-difference estimates using hierarchical mixed-effects models fit to trial-level response times, examining how the variance and stability of individual congruency estimates evolve as additional blocks are included. Together, these findings characterise conflict tasks as dynamic measurement systems in which the precision of both group- and individual-level estimates changes as data accumulate. Formalising the relationship between effect size and testing duration may guide the development of more efficient and psychometrically robust assessments of attentional control.
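
The talk models individual-level reliability with hierarchical mixed-effects models. As a much simpler classical stand-in (not the authors' method), the sketch below tracks how the reliability of individual congruency effects changes as blocks accumulate: using the first k blocks, it computes each participant's congruency effect (mean incongruent RT minus mean congruent RT) separately from odd and even blocks, correlates the two halves across participants, and applies the Spearman-Brown correction. The data layout, function names, and simulated RT values are all assumptions for illustration.

```python
import math
import random


def congruency_effect(trials):
    """Mean incongruent RT minus mean congruent RT; trials are (congruent, rt)."""
    inc = [rt for congruent, rt in trials if not congruent]
    con = [rt for congruent, rt in trials if congruent]
    return sum(inc) / len(inc) - sum(con) / len(con)


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def split_half_reliability(data, k):
    """Spearman-Brown-corrected odd/even-block reliability from the first k blocks.

    data[p][b] is the list of (congruent, rt) trials for participant p, block b.
    Requires k >= 2 so that both halves contain at least one block.
    """
    odd_fx, even_fx = [], []
    for blocks in data:
        used = blocks[:k]
        # Blocks 1, 3, 5, ... (0-based even indices) vs. blocks 2, 4, 6, ...
        odd = [t for i, block in enumerate(used) if i % 2 == 0 for t in block]
        even = [t for i, block in enumerate(used) if i % 2 == 1 for t in block]
        odd_fx.append(congruency_effect(odd))
        even_fx.append(congruency_effect(even))
    r = pearson(odd_fx, even_fx)
    return 2 * r / (1 + r)  # step up from half-length to all k blocks


def simulate_data(n_participants=40, n_blocks=10, trials_per_cond=10, seed=7):
    """Hypothetical RTs: each participant has a true effect plus trial noise."""
    rng = random.Random(seed)
    data = []
    for _ in range(n_participants):
        true_effect = rng.gauss(60, 20)  # ms; illustrative values only
        blocks = []
        for _ in range(n_blocks):
            block = [(True, rng.gauss(500, 80)) for _ in range(trials_per_cond)]
            block += [(False, rng.gauss(500 + true_effect, 80))
                      for _ in range(trials_per_cond)]
            blocks.append(block)
        data.append(blocks)
    return data


if __name__ == "__main__":
    data = simulate_data()
    for k in range(2, 11, 2):
        print(f"blocks 1-{k}: reliability = {split_half_reliability(data, k):.2f}")
```

A hierarchical model would instead shrink noisy individual estimates toward the group mean and separate trial noise from true individual variance in a single fit; the split-half version only conveys the shape of the question being asked.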

This is an in-person presentation on July 20, 2026 (11:20 ~ 11:40 EDT).

Dr. Kenny Yu
Dr. Maria Robinson

Parameter invariance serves as a key criterion for evaluating whether cognitive models capture mechanisms that generalize across tasks. In this talk, I compare two approaches to testing such invariance. Joint fitting estimates shared parameters from multiple tasks simultaneously; parameter substitution estimates parameters in one task and evaluates them in another without refitting. These methods answer fundamentally different questions. Joint fitting asks whether a shared parameterization exists that can accommodate multiple tasks, while substitution asks whether parameters estimated in one context actually transfer to another. Accordingly, the two methods serve different roles in testing invariance, and their verdicts need not agree. Joint fitting is systematically easier to pass because the shared estimate compromises between task-specific optima rather than matching either one. This divergence is structural, rooted in whether estimation and evaluation operate on the same data, and persists regardless of how strictly one sets the acceptance threshold. Only substitution, by fixing parameters in advance of evaluation, subjects invariance to risky prediction. Joint fitting, however, remains well suited to efficiently screening whether shared parameterization is viable. Together, the two methods form a staged program for building progressively stronger evidence for parameter invariance.
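
The structural claim that joint fitting compromises between task-specific optima can be seen in a toy case. Assuming, purely for illustration, a one-parameter linear model y = theta * x fit by least squares to two hypothetical tasks, the sketch compares task-B misfit under the jointly fitted parameter with misfit under the parameter substituted from task A. Because the joint estimate here is a weighted average of the two task-specific estimates, it never lies farther from task B's optimum than task A's estimate does, so its task-B misfit is never larger. The model, data, and function names are hypothetical.

```python
def fit_slope(xs, ys):
    """Least-squares estimate of theta in y = theta * x (no intercept)."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)


def sse(theta, xs, ys):
    """Sum of squared prediction errors for a fixed theta."""
    return sum((y - theta * x) ** 2 for x, y in zip(xs, ys))


def compare_invariance_tests(task_a, task_b):
    """Contrast joint fitting with parameter substitution on task B."""
    xa, ya = task_a
    xb, yb = task_b
    theta_a = fit_slope(xa, ya)
    theta_b = fit_slope(xb, yb)
    # Joint fit: one shared theta minimizing the summed loss over both tasks
    theta_joint = fit_slope(list(xa) + list(xb), list(ya) + list(yb))
    return {
        "theta_a": theta_a,
        "theta_b": theta_b,
        "theta_joint": theta_joint,
        "best_loss_b": sse(theta_b, xb, yb),          # task-specific optimum
        "joint_loss_b": sse(theta_joint, xb, yb),     # estimate already saw B's data
        "substitution_loss_b": sse(theta_a, xb, yb),  # fixed from A, no refitting
    }


if __name__ == "__main__":
    x = [1.0, 2.0, 3.0]
    task_a = (x, [1.0, 2.0, 3.0])  # consistent with theta = 1
    task_b = (x, [2.0, 4.0, 6.0])  # consistent with theta = 2
    for name, value in compare_invariance_tests(task_a, task_b).items():
        print(f"{name}: {value:.2f}")
```

With these toy data the joint estimate is 1.5 and its task-B loss is 3.5, versus 14.0 when task A's estimate (1.0) is substituted. The ordering does not depend on where an acceptance threshold is placed, which mirrors the point that the divergence is structural.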

This is an in-person presentation on July 20, 2026 (11:40 ~ 12:00 EDT).
