
Title: Does the Use of Crowdsourced Listeners Yield Different Speech Intelligibility Results Than In-Person Listeners for Typically Developing Children?
Legend: Expected multiword intelligibility for an average in-person listener, an average crowdsourced listener, and the difference between these averages, each as a function of age. Points represent the average intelligibility score (or difference in average scores) for a child. Note that crowdsourced listener data are for listeners who met the 90% in-task accuracy criterion.
Citation: Salvo, H. D., Mahr, T. J., Sandgren, C., Mabie, H., & Hustad, K. C. (2026). Does the Use of Crowdsourced Listeners Yield Different Speech Intelligibility Results Than In-Person Listeners for Typically Developing Children? Journal of Speech, Language, and Hearing Research.
Abstract:
Introduction: We examined the performance of crowdsourced listeners compared with in-person listeners on the measurement of speech intelligibility for typically developing children. We used three different in-task quality check criteria to screen listeners and examined between-listener intelligibility differences and interrater reliability under each criterion. We also examined how crowdsourced intelligibility results compared with in-person results.
Method: Sixty neurotypical children between ages 2;6 and 9;11 (years;months), drawn from Hustad et al. (2021), contributed speech samples. We used the online platform Prolific to collect intelligibility data from five crowdsourced listeners per child (N = 300 total) and compared their scores with in-person results from two listeners per child. We computed intraclass correlation coefficients (ICCs) and pairwise between-listener differences for each of the three in-task quality check criterion groups and for the in-person group. We modeled intelligibility as a function of listener source (in-person vs. crowdsourced) and child age using mixed-effects regression with smoothing splines.
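
For readers unfamiliar with the reliability analysis, the sketch below illustrates one common way to compute an interrater ICC from a children-by-listeners matrix of intelligibility scores, using the two-way random-effects ICC(2,1) of Shrout and Fleiss (1979). The specific ICC form, software, and data layout used in the study are not stated in this summary, so this is an illustrative assumption rather than the authors' implementation.

```python
# Illustrative only: two-way random-effects ICC(2,1) (absolute agreement,
# single rater) computed from the ANOVA mean squares of an
# n_children x k_listeners score matrix.
import numpy as np

def icc_2_1(scores: np.ndarray) -> float:
    """ICC(2,1) for a complete n_children x k_listeners matrix of scores."""
    n, k = scores.shape
    grand = scores.mean()
    row_means = scores.mean(axis=1)   # per-child means
    col_means = scores.mean(axis=0)   # per-listener means

    # Two-way ANOVA decomposition into sums of squares
    ss_rows = k * np.sum((row_means - grand) ** 2)
    ss_cols = n * np.sum((col_means - grand) ** 2)
    ss_total = np.sum((scores - grand) ** 2)
    ss_error = ss_total - ss_rows - ss_cols

    # Mean squares for children, listeners, and residual error
    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))

    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )

# Hypothetical demo data: 60 children x 5 listeners, scores on a 0-100 scale,
# with real child-to-child differences plus listener-level noise.
rng = np.random.default_rng(0)
child_means = rng.normal(75, 15, size=(60, 1))
demo = np.clip(child_means + rng.normal(0, 5, size=(60, 5)), 0, 100)
print(round(icc_2_1(demo), 3))
```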
Results: ICCs were lower and between-listener differences were larger for crowdsourced listeners than for in-person listeners under every in-task quality check criterion, although stricter criteria narrowed the gap. Crowdsourced listeners produced intelligibility scores up to 7 percentage points lower than in-person listeners, even under the most stringent in-task quality check criterion. Crowdsourced results showed the same pattern of change with children’s age as the in-person findings. Children with midrange (65%–83%) intelligibility were the most negatively affected by the use of crowdsourced listeners.
Conclusions: Rigorous in-task quality check criteria improved the quality of crowdsourced listener data. Speakers with midrange intelligibility were the most negatively affected by the use of crowdsourced listeners, with an intelligibility difference of about 7 percentage points. Intelligibility data obtained from crowdsourced listeners should be interpreted with caution, and future studies should evaluate how crowdsourced intelligibility data differ from in-person data for speakers with speech disorders.

Investigator: Katherine Hustad, PhD
About the Lab:
The WISC lab is dedicated to the study of communication development in children with cerebral palsy (CP).