Context effects on phoneme categorization in children with dyslexia

Research shows that, on average, children with dyslexia behave less categorically in phoneme categorization tasks. This study investigates three subtle ways that struggling readers may perform differently than their typically developing peers in this experimental context: sensitivity to the frequency distribution from which speech tokens are drawn, bias induced by previous stimulus presentations, and fatigue during the course of the task. We replicate findings that reading skill is related to categorical labeling, but we do not find evidence that sensitivity to the stimulus frequency distribution, the influence of previous stimulus presentations, or a measure of task engagement differs in children with dyslexia. It is, therefore, unlikely that the reliable relationship between reading skill and categorical labeling is attributable to artifacts of the task design, abnormal neural encoding, or executive function. Rather, categorical labeling may index a general feature of linguistic development whose causal relationship to literacy remains to be ascertained.


I. INTRODUCTION
It is well established that reading skill is correlated with performance on phoneme categorization tasks in which listeners are asked to categorize spoken syllables based on a single contrastive feature (Goswami et al., 2002; Noordenbos and Serniclaes, 2015; O'Brien et al., 2018; O'Brien et al., 2019; Vandermosten et al., 2010). However, the mechanism underlying the link between impaired processing of phonemes and developmental dyslexia remains unclear. While phonological awareness, the ability to identify and manipulate phonemes in speech, is one of the strongest predictors of dyslexia, there are several reasons to question whether phonological processing is the "core deficit" that explains why all children with dyslexia struggle with learning to read. Some researchers have criticized this "core phonological deficit theory" on the grounds that not enough children could be accurately diagnosed on the basis of phonological awareness alone (Wolf and Bowers, 2000). Perhaps the most popular line of reasoning, though, is that children with dyslexia perform (on average) poorly on many measures of auditory processing, visual processing, working memory, and automaticity, which cannot be explained by a phonological deficit alone. These observations have motivated a new wave of research, searching for a more fundamental mechanism that might explain the myriad deficits (including phonological awareness) that are associated with reading (dis)ability (Jaffe-Dax et al., 2017; Lieder et al., 2019; Ziegler, 2008).
While some researchers have taken the perspective that individuals with dyslexia have fundamentally impaired auditory or visual processing (Goswami, 2011; Stein, 2018; Tallal et al., 1996), the psychophysical literature on the whole is currently inconsistent with a homogeneous and uniform pattern of sensory impairment (Amitay et al., 2002; Hämäläinen et al., 2013; Rosen, 2003; Stuart et al., 2006). Noting this, some researchers have argued that individuals with dyslexia are constrained not by sensory processing at a basic level but by the demands posed by psychophysical tasks (Ahissar, 2007; Ramus and Ahissar, 2012).
While the appeal to a domain-general mechanism could potentially explain the heterogeneity observed in the sensory processing literature, a consensus is yet to be reached regarding which particular aspects of the psychophysical tasks are the "bottleneck" in performance. One candidate is attention and task vigilance; dyslexia is often comorbid with attention-deficit/hyperactivity disorder (ADHD; Germanò et al., 2010; Light et al., 1995; Stevenson et al., 2005). In accord with this hypothesis, one previous study has shown that performance on "catch trials" tends to degrade faster over the course of a task in poor readers than in controls (Messaoud-Galusi et al., 2011; see also Roach et al., 2004, but note that Vandermosten et al., 2018, reported null results in a similar design). Another candidate is statistical learning. Statistical learning was originally defined as "a way of extracting statistical regularities from the environment" (Saffran et al., 1996). It has been proposed that individuals with dyslexia are less able than their typically developing peers to take advantage of regularities in their environment (Banai and Ahissar, 2006; Gabay et al., 2015; Lieder et al., 2019). The statistical learning hypothesis is especially appealing because the process of learning to read involves forming connections between phonological and orthographic representations, which requires a learner to extract regularities from visual and auditory sequences (Ziegler and Goswami, 2005) and also to learn and automate the probabilistic relationship between a given letter and the phonemes it represents (Apfelbaum et al., 2013). Distributional learning, that is, sensitivity to the distribution from which stimuli are drawn, is known to be a key part of language acquisition in development (Maye et al., 2002).
Thus, there is growing interest in the possibility that individual differences in mechanisms such as sensitivity to environmental statistics could explain degraded performance across an array of experiments as well as difficulty with learning to read.
The phoneme categorization task provides a reasonably reliable setting to explore how task performance may be differentially affected by task demands in struggling readers. Because it has been so extensively used, experimenters can have reasonable confidence that the key effect (shallower psychometric functions in struggling readers) is broadly replicable for many stop consonant continua (Noordenbos and Serniclaes, 2015; O'Brien et al., 2018; O'Brien et al., 2019). However, the mechanisms underlying the psychometric function shape may be difficult (if not impossible) to disambiguate, as there are at least two plausible explanations for individual differences. First, individual differences in noise at the level of phonetic cue encoding could influence shape: increased noise around the category boundary will lead to a flatter function. Second, individual differences in categorization strategy must be considered. The optimal strategy, consistently applying the same label to every token with a phonetic cue above some threshold, would lead to a steep psychometric function, whereas probability matching based on the statistics of the cue distribution (detailed in Clayards et al., 2008) would lead to a shallower function. It is unclear which strategy children might use in this experimental context or which mechanism is most relevant to children's performance on the task. While these limitations do not invalidate the task as a probe of some dimension of speech perception related to literacy skill, they are worth bearing in mind.
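The contrast between the two categorization strategies can be illustrated with a minimal sketch. It assumes two equal-variance Gaussian cue distributions with equal priors; the category means, the standard deviation, the boundary, and all function names are hypothetical values chosen for illustration, not quantities estimated in this study.

```python
import math

def gaussian_pdf(x, mean, sd):
    """Density of a normal distribution at x."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

def p_da_probability_matching(cue, mean_ba, mean_da, sd):
    """Posterior probability of /da/ under equal priors.

    A probability-matching listener is assumed to respond 'da'
    with exactly this probability.
    """
    like_ba = gaussian_pdf(cue, mean_ba, sd)
    like_da = gaussian_pdf(cue, mean_da, sd)
    return like_da / (like_ba + like_da)

def p_da_threshold(cue, boundary):
    """Deterministic strategy: always 'da' above the boundary."""
    return 1.0 if cue > boundary else 0.0

# Hypothetical category means (steps 2 and 6), SD, and boundary (step 4).
steps = range(1, 8)
matching = [p_da_probability_matching(s, 2.0, 6.0, 1.5) for s in steps]
threshold = [p_da_threshold(s, 4.0) for s in steps]

for s, m, th in zip(steps, matching, threshold):
    print(f"step {s}: matching={m:.3f}  threshold={th:.0f}")
```

Under these assumptions the threshold strategy yields a step function, whereas probability matching yields a graded sigmoid whose steepness depends on how much the two cue distributions overlap.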
Previously, we showed that reduced categorization in struggling readers is somewhat influenced by the working memory demands of the task but could not be entirely explained by task difficulty: irrespective of the task difficulty, we found a correlation between reading skill and task performance (O'Brien et al., 2018). We now investigate several other aspects of task performance to clarify the extent to which the phoneme categorization-reading relationship depends on specific experimental conditions. We first consider the effects of varying the stimulus distribution from which speech tokens are drawn. Typically, in categorization experiments, stimuli are drawn from a uniform distribution. Adults without reading disability show sensitivity to the stimulus distribution in the categorization task (Clayards et al., 2008): the task elicits more categorical behavior when stimuli are drawn from narrow bimodal distributions than when drawn from broad distributions. Recently, work by Vandermosten et al. (2018) suggested that children with dyslexia were, on average, less able to utilize distributional cues to learn a non-native speech contrast. In the present study, we examine how children aged 8-12 years performed on a categorical phoneme labeling task with two conditions: a bimodal and a uniform distribution of native speech tokens (note that although Clayards et al., 2008, compared two bimodal distributions of different widths, a uniform distribution is equivalent to an infinitely wide bimodal distribution). This comparison is of interest for two reasons. First, we hope to better characterize the matter of distributional sensitivity in struggling readers, which is largely unsettled in the literature. Second, studying nonuniform stimulus distributions may bring the speech categorization task closer to representing ethological conditions.
In natural speech, utterances are typically "drawn" from a structured distribution; it is this structure that may enable children to learn categories in the first place (McMurray et al., 2009). If struggling readers are indeed more categorical when presented with stimuli from bimodal versus uniform distributions, that suggests their ability to perceive speech in naturalistic conditions may be less impaired than many researchers argue on the basis of the categorical labeling task (Noordenbos and Serniclaes, 2015).
Next, we explore how the immediate context of recently presented speech tokens affects judgments about the identity of the current stimulus (e.g., after a clear phoneme exemplar is heard, a listener might be more likely to judge an ambiguous speech sound as representing a different category). Considering how recent stimulus presentations influence performance is of interest for several reasons: it addresses longstanding claims that individuals with dyslexia struggle when stimuli are presented sequentially (Tallal, 1980) or that they show abnormal stimulus adaptation and faster implicit memory decay (Jaffe-Dax et al., 2017; Perrachione et al., 2016). Finally, we look for hallmarks of fatigue and disengagement in our participants' responses by examining changes in task performance over the duration of the experiment. In line with previous work, we find that there is a moderate relationship between phoneme categorization and reading skill; this relationship cannot be attributed to (1) the stimulus distribution, (2) stimulus recency effects, or (3) task disengagement. Thus, we conclude that some people with dyslexia have difficulties categorizing speech sounds and that this deficit, though likely not universal, is not an artifact of experimental conditions, such as the distribution and order of the stimuli or the duration of the experiment.

II. METHODS
A. Participants
A total of 62 native English-speaking children aged 8-12 years were recruited for the study. Children without known auditory disorders were recruited from a database of volunteers in the Seattle area (University of Washington Reading and Dyslexia Research Database). Parents and/or legal guardians of all participants provided written informed consent under the University of Washington Institutional Review Board protocol. All subjects demonstrated normal or corrected-to-normal vision. Participants were tested on a battery of cognitive and literacy assessments, including the Woodcock-Johnson IV (WJ-IV) Letter-Word Identification and Word Attack subtests, the Test of Word Reading Efficiency (TOWRE), and the Wechsler Abbreviated Scale of Intelligence (WASI). All participants underwent a hearing screening to ensure pure tone detection at octave frequencies between 500 and 8000 Hz in both ears at 25 dB hearing level (HL) or better.

B. Demographics
Here, we present analyses of task performance where reading skill is treated as either a continuous or discrete group variable. It has been reasonably established that reading skill is best modeled as a continuous variable with no clear demarcation between readers who are below-average and readers who are dyslexic (Shaywitz et al., 1992). Many results on phoneme categorization published so far (Goswami et al., 2002; O'Brien et al., 2018; O'Brien et al., 2019; Vandermosten et al., 2011) support the perspective that there is a continuous relationship between task performance and reading skill. However, for completeness and ease of comparison with existing literature on dyslexia, we also provide group-level analyses.
Reading skill was summarized in a composite variable: as both the Woodcock-Johnson Basic Reading Skill measure (WJ-BRS; a composite of the Word Attack and Letter-Word Identification subtests) and the TOWRE index (a composite of the sight word efficiency and phonemic decoding efficiency subtests) are scored on the same standardized scale [mean = 100, standard deviation (SD) = 15], a composite reading skill measure was created by averaging the two metrics for each participant. Using a composite of both measures as the criterion improves the reliability of our group assignments because they are highly correlated measures (r = 0.877, p < 0.001 in our sample). Participants were assigned to the dyslexia group if their composite reading score was at least 1 SD below the population mean (i.e., <85). Six participants had reading scores above the 1 SD cutoff but a parental report of dyslexia; as in our previous work (O'Brien et al., 2019), these participants were excluded from the control group for group-level comparisons but are included in all other statistical analyses and data. The scores for these six participants fell between 86.5 and 92; they may represent children who, at one point, met criteria for reading disability but have since been remediated into the low end of the typical range. Another possibility is that these children struggled on measures that we did not consider here but that would be of interest to the diagnosing professional. We cannot be certain, as there is no standard for diagnosis among professionals in our area that we can relate to the reading measures assessed here.
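The group-assignment rule described above can be sketched as follows. This is a minimal illustration; the function names and the label "uncategorized" are hypothetical and are not taken from the study's analysis code.

```python
def composite_reading_score(wj_brs, towre_index):
    """Average the two standard scores (both on a mean-100, SD-15 scale)."""
    return (wj_brs + towre_index) / 2.0

def assign_group(composite, parental_report_of_dyslexia, cutoff=85.0):
    """Assign a participant to a group for group-level comparisons.

    Scores below the cutoff (1 SD below the population mean) go to the
    dyslexic group. Participants above the cutoff with a parental report
    of dyslexia are kept out of the control group.
    """
    if composite < cutoff:
        return "dyslexic"
    if parental_report_of_dyslexia:
        return "uncategorized"
    return "control"

# Hypothetical participant: WJ-BRS = 80, TOWRE index = 88.
score = composite_reading_score(80, 88)
print(score, assign_group(score, parental_report_of_dyslexia=False))
```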
Additionally, all subjects were required to have a nonverbal intelligence quotient (IQ) and a full-scale IQ (WASI matrix reasoning and FS-2 scores, respectively) no more than 1 SD below the population mean (as in O'Brien et al., 2018; O'Brien et al., 2019); three subjects were below this cutoff and excluded from further analysis. This left a total of 59 participants eligible for the study based on their cognitive characteristics, 53 of whom could be confidently categorized as dyslexic or control for the purpose of group-level comparisons.
There were 24 subjects in the dyslexic group (13 male) and 29 subjects in the control group (14 male). The mean age and SD were 9.5 yr (1.4 yr) and 10.0 yr (1.5 yr), respectively, in the dyslexic and control groups; the difference in age was not significant (Kruskal-Wallis rank sum test, H(2) = 3.573, p = 0.168), although we noted that there was a small correlation between age and reading skill (r = 0.253, p = 0.053). Accordingly, we tested age as a covariate in our exploratory secondary models. We did not exclude participants with ADHD diagnoses from the study because ADHD is highly comorbid with dyslexia (Germanò et al., 2010). Indeed, research suggests that there is little validity in distinguishing children with dyslexia and a secondary comorbid diagnosis (Peters and Ansari, 2019). Therefore, we accounted for ADHD diagnosis in our exploratory covariate analysis. Of 59 total participants, 13 had a formal diagnosis of ADHD: 7 in the dyslexic group and 4 in the control group. The difference in prevalence of ADHD across groups was not significant (H(1) = 1.851, p = 0.174). Table I shows group comparisons on measures of reading and cognitive skills. Note that IQ (measured as either full-scale or nonverbal) differed by group. While we were not concerned that low IQ prevented any subject from understanding the task, because low IQ was an exclusion criterion, we included nonverbal IQ as a covariate in our statistical analyses.

C. Stimuli
A seven-step /ba/~/da/ speech continuum was created using Praat version 6.0.37 (Boersma and Weenink, 2020). Synthesis of the continuum followed the procedure described in O'Brien et al. (2018), using linear predictive coding to alter the formant contours of a naturally produced /ba/ token. In brief, the seven speech tokens were identical except for the contour of the second formant (F2): only the starting frequency of the F2 transition was varied. All tokens were resynthesized from a /ba/ utterance spoken by a male American English speaker. The starting frequency of F2 was varied in seven linearly spaced steps from 1085 Hz (/ba/) to 1460 Hz (/da/). F2 followed a linear ramp to a terminal value of 1225 Hz over the course of 100 ms, at which point the steady-state portion of the vowel was maintained for 250 ms.
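As a worked illustration of the continuum parameters stated above, the following sketch computes the seven F2 onset frequencies and the idealized F2 trajectory for one token. It only reproduces the arithmetic of the description; it is not the Praat synthesis procedure, and the function names are hypothetical.

```python
def f2_onsets(start_hz=1085.0, end_hz=1460.0, n_steps=7):
    """Linearly spaced F2 starting frequencies from /ba/ to /da/."""
    step = (end_hz - start_hz) / (n_steps - 1)
    return [start_hz + i * step for i in range(n_steps)]

def f2_contour(onset_hz, t_ms, target_hz=1225.0, ramp_ms=100.0):
    """Idealized F2 at time t_ms: linear ramp to the target, then steady state."""
    if t_ms >= ramp_ms:
        return target_hz
    return onset_hz + (target_hz - onset_hz) * (t_ms / ramp_ms)

onsets = f2_onsets()
print(onsets)  # adjacent steps differ by 62.5 Hz
print([f2_contour(onsets[0], t) for t in (0, 50, 100, 350)])
```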

D. Procedure
Stimulus presentation and participant response collection were managed with PsychToolbox for MATLAB (Brainard, 1997). Auditory stimuli were presented at 75 dB sound pressure level (SPL) via circumaural headphones (Sennheiser HD 600, Wedemark, Germany). Children were trained to associate sounds from the two ends of the speech continuum with animal cartoons on the left and right sides of the screen and to indicate their answers by pressing the right or left arrow key. Large text labels were provided over each animal cartoon ("Ba" on the left side and "Da" on the right) so that participants did not have to memorize the animal associated with each sound. Throughout all blocks, each cartoon was always associated with the same stimulus end point.
Participants first completed a practice round consisting of ten presentations, five of each continuum end point, with feedback on each trial. Participants were allowed to repeat the practice round up to three times until they had achieved at least 75% accuracy. All participants were able to meet this minimum standard.
The main task was presented in two parts, one in which the stimuli were drawn from a uniform frequency distribution and another in which they were drawn from a bimodal distribution. In the uniform condition, all stimuli were presented 15 times. In the bimodal condition, the presentation frequency was greatest at the continuum end points and least in the center of the continuum (see Table II).
Because we were interested in exploring the effects of recently presented stimuli on judgments about the current stimulus, we used a "random but frozen" list of stimuli. This means that we randomly generated the order in which stimuli would be presented in each condition, but every participant was tested with this fixed stimulus order. This reduced one source of variability across subjects so that we could perform more targeted investigations about how recent stimulus presentations differentially affect strong and poor readers.
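One way to realize a "random but frozen" trial list is to shuffle once with a fixed seed and reuse the result for every participant. The sketch below assumes this approach; the seed and the per-step counts are hypothetical placeholders (the actual bimodal counts are those given in Table II), chosen only so that they sum to the 210 trials of one condition.

```python
import random

def frozen_trial_order(counts_per_step, seed):
    """Build a trial list from per-step counts and shuffle it reproducibly.

    The same seed always yields the same order, so every participant
    can be tested with an identical sequence.
    """
    trials = [step for step, n in sorted(counts_per_step.items()) for _ in range(n)]
    rng = random.Random(seed)  # private generator; global state is untouched
    rng.shuffle(trials)
    return trials

# Hypothetical bimodal counts: most frequent at the end points, least at the center.
hypothetical_bimodal_counts = {1: 45, 2: 35, 3: 20, 4: 10, 5: 20, 6: 35, 7: 45}

order_a = frozen_trial_order(hypothetical_bimodal_counts, seed=2020)
order_b = frozen_trial_order(hypothetical_bimodal_counts, seed=2020)
print(len(order_a), order_a == order_b)
```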
In each condition (uniform or bimodal frequency distribution), participants heard a total of 210 speech sounds. After every 35 stimulus presentations, a quick optional break was presented. Between the two test conditions, reading assessments were performed. If a participant did not already have an IQ measure on file from a previous laboratory visit, the WASI-III was also administered.
Note that three subjects did not wish to complete the task and opted to quit part of the way through; one such participant came from the control group and the other two participants were from the dyslexic group. Their data were omitted from the study. Complete data were, therefore, collected from a total of 56 participants (28 control, 22 dyslexic, and 6 not categorized).
Seven participants completed the uniform condition first and 49 completed the bimodal condition first. The reason for this discrepancy is that during data collection for the first 15 subjects, we alternated which distribution was presented first. After data were collected for these subjects, we were surprised to see little evidence that participants behaved differently in the two conditions, particularly given the positive evidence from two published studies (Clayards et al., 2008; Vandermosten et al., 2018) and our own pilot data in six subjects, which appeared consistent with an effect of stimulus distribution. We, thus, changed to a policy of always providing the bimodal distribution first, wary that initial exposure to the uniform distribution could affect category learning in subsequent conditions. We were unable to detect any significant differences between task performance in these individuals and the remainder of the cohort. Psychometric function slope did not significantly differ by group [β = −0.142, standard error (SE) = 0.365, p = 0.698], nor was there a significant interaction between the order in which the distributions were presented and slope in each condition. Based on this evidence, we retained these seven subjects in the data set.

E. Psychometric curve fitting
We used the Psignifit 4.0 toolbox for MATLAB (The MathWorks, Natick, MA) to fit psychometric functions. The fitting routine optimized the fit of a logistic function with four parameters modeling the upper and lower asymptotes, the width of the logistic function, and the category boundary. The width of the logistic function was transformed to the slope at the category boundary value (the estimated point on the continuum where 50% of tokens are labeled Da) to give a standardized measure of psychometric function slope. Psignifit uses a Bayesian framework to optimize parameter estimates not only according to the likelihood of generating a given set of behavioral responses but also with regard to prior distributions of each parameter. In the case of psychometric function fitting, where the number of presentations of each stimulus is often relatively low, inappropriate priors can have an outsized influence on parameter estimates, particularly, as we and many others have summarized elsewhere, when it comes to estimates of the slope and asymptotes. We, therefore, used priors identical to a previous study using this /ba/~/da/ continuum (O'Brien et al., 2019): the asymptotic priors were modeled as a uniform distribution on the range [0, 0.10]. In other words, the lower and upper asymptotic parameters could vary freely in the range [0, 0.10] to give a lower asymptote between 0% and 10% and an upper asymptote between 90% and 100%. This prior width was chosen on the basis of tenfold cross-validation over the data set to determine the psychometric fitting parameters that best predicted the participant's decisions on held-out trials (see O'Brien et al., 2018; O'Brien et al., 2019, for further details).
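The relationship between the width parameter and the slope at the category boundary can be sketched as follows. This is an illustration under stated assumptions, not the toolbox's code: it assumes the width is defined as the distance between the 5% and 95% points of the unscaled logistic (the `alpha=0.05` argument, which should be checked against the Psignifit documentation), and all function and parameter names are hypothetical.

```python
import math

def psychometric(x, boundary, width, lower, upper_lapse, alpha=0.05):
    """Four-parameter logistic: P('da') at continuum position x.

    lower       : lower asymptote (response rate at the /ba/ end)
    upper_lapse : distance of the upper asymptote from 1
    width       : assumed distance between the alpha and 1 - alpha
                  points of the unscaled logistic
    """
    c = 2.0 * math.log(1.0 / alpha - 1.0)
    core = 1.0 / (1.0 + math.exp(-c * (x - boundary) / width))
    return lower + (1.0 - lower - upper_lapse) * core

def slope_at_boundary(width, lower, upper_lapse, alpha=0.05):
    """Analytic derivative of the function above at x = boundary."""
    c = 2.0 * math.log(1.0 / alpha - 1.0)
    return (1.0 - lower - upper_lapse) * c / (4.0 * width)

# Hypothetical fit: boundary at step 4, width of 3 steps, 5% asymptotes.
print(psychometric(4.0, 4.0, 3.0, 0.05, 0.05))
print(slope_at_boundary(3.0, 0.05, 0.05))
```

Under this parameterization a narrower width or smaller asymptote parameters both increase the slope, which is one reason the two are difficult to separate when few trials are available per step.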
To ensure the validity of psychometric function parameter estimates, we excluded any psychometric functions that could not be fit with a category boundary between continuum steps 1-7. Only one psychometric function (produced by a subject in the dyslexic group presented with a uniform stimulus distribution) was excluded on these grounds.
We checked for correlations between reading skill and several metrics of psychometric function fit. The correlation between reading skill and sum of squared residuals (averaged over each participant's two psychometric function fits) was not significant (r = −0.185, p = 0.092). Likewise, deviance of the fits was not significantly associated with reading skill (r = −0.076, p = 0.41).

F. Statistical analysis of parameter estimates
After we fit psychometric functions for each subject in each condition, we used a series of generalized linear mixed models to determine the relationship between reading ability, the frequency distribution from which stimuli are drawn, and four dependent measures. These dependent measures were estimates of task performance based on behavioral responses: (1) psychometric function slope, (2) asymptote, (3) category boundary, and (4) a composite measure of psychometric function shape.
Each participant's average asymptote was determined by averaging the upper and lower asymptote estimates of a given function (i.e., their deviations from zero and one, respectively). The composite measure PC1 was constructed from a principal components analysis on the four parameters of each psychometric function collected in the study. The first principal component captured 46.1% of variance in the four parameters and was defined by the following linear weights: category boundary, −0.385; slope, 0.479; upper asymptote, −0.504; lower asymptote, −0.607.
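To make the composite concrete, the sketch below projects one set of psychometric parameters onto the reported PC1 weights. It assumes the parameters were standardized (z-scored) before the principal components analysis, which is not stated in the text; the dictionary keys and example values are hypothetical.

```python
import math

# Weights as reported for the first principal component.
PC1_WEIGHTS = {
    "boundary": -0.385,
    "slope": 0.479,
    "upper_asymptote": -0.504,
    "lower_asymptote": -0.607,
}

def pc1_score(z_params):
    """Weighted sum of (assumed) z-scored psychometric parameters."""
    return sum(PC1_WEIGHTS[name] * z_params[name] for name in PC1_WEIGHTS)

# The weight vector should have approximately unit length.
print(math.sqrt(sum(w * w for w in PC1_WEIGHTS.values())))

# Hypothetical participant: steep slope, small asymptote parameters.
example = {"boundary": 0.0, "slope": 1.0, "upper_asymptote": -1.0, "lower_asymptote": -1.0}
print(pc1_score(example))
```

With these signs, a higher PC1 score corresponds to a steeper slope and asymptotes closer to 0 and 1, that is, to more categorical labeling.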
Linear modeling was performed with the lme4 library for R. For each dependent measure, fixed-effect predictors with sum coding were used for the distribution (uniform or bimodal) variable. Reading ability was entered as a continuous fixed-effect predictor except where otherwise stated.
We tested a core model,

$$\text{parameter} \sim \text{reading} + \text{distribution} + \text{reading} \times \text{distribution} + (1 \mid \text{subject ID}),$$

where parameter was the psychometric parameter of interest: slope, asymptote, category boundary, or PC1. We also tested the addition of three "nuisance" predictors to this core model: the presence/absence of ADHD diagnosis (treatment coding), age (continuous predictor), and nonverbal IQ (WASI-III matrix reasoning score; continuous predictor). The core modeling results are reported in the text, and the effects of the individual nuisance predictors are reported in Fig. 1 and Table III.

G. Data availability statement
Data are available immediately in a GitHub repository hosted by the laboratory.

III. RESULTS
As expected on the basis of previous studies, we found relationships between reading skill and psychometric function shape (Noordenbos and Serniclaes, 2015; O'Brien et al., 2018; O'Brien et al., 2019; Vandermosten et al., 2010). In Fig. 1, we can see that some psychometric parameters were correlated with reading ability, most notably the asymptote and PC1. Category boundary (the estimated point on the continuum with 50% of tokens labeled Da) was not significantly correlated with reading ability in either the uniform or bimodal condition.
We confirmed this with a generalized linear mixed model analysis, first, with regard to the relationship between reading ability and psychometric slope. In our core model of slope (Table III), reading skill was associated with a sharper slope as expected, although this effect did not reach the threshold of significance (β = 0.224, SE = 0.118, p = 0.062). Note that the main effect of distribution (β = 0.238, SE = 0.161, p = 0.15) and the interaction of distribution and reading ability (β = −0.026, SE = 0.163, p = 0.87) were not significant. No nuisance variable proved to have a significant effect on slope (see Table IV).
Thus, we did not detect a significant relationship between psychometric slope and the frequency distribution from which stimuli were drawn. Moreover, we did not find evidence supporting the hypothesized interaction between reading skill and experimental condition (bimodal or uniform distribution).
Similarly, we tested the core model as a predictor of the asymptote and found a significant main effect of reading skill (β = −0.013, SE = 0.004, p < 0.001). There was also a modest main effect of distribution on the asymptote; the uniform distribution was associated with a 0.008-point greater asymptote than the bimodal distribution (SE = 0.004, p = 0.049). While this effect is small, it is in the direction we would expect if the bimodal distribution had a stabilizing effect on phoneme categories in most participants (at least, more reliable labeling of the clear category exemplars at the end points of the continuum). Importantly, the interaction of distribution and reading skill was not significant. Of the nuisance variables, only age was significant (see Table V). Even when age was included in the model, the main effect of reading skill remained significant.
Next, we modeled category boundary. Neither reading skill nor the interaction of reading skill and distribution was a significant predictor. There was a significant main effect of distribution on category boundary; in the bimodal condition, participants, on average, tended to have a higher category boundary than in the uniform condition. In other words, they were slightly biased to label sounds as ba in the bimodal condition relative to the uniform condition. The effect is modest (an average shift of approximately 1/4 of a step on the continuum). We had not hypothesized that the category boundary would shift with the stimulus distribution, and it is important to note that having a higher or lower category boundary is not generally considered better for speech perception. As such, this finding should be considered post hoc. Still, the associated p-value would pass most standard corrections for multiple comparisons (β = −0.230, SE = 0.064, p 0.001). No nuisance variable was a significant predictor of the category boundary (see Table VI). Last, considering PC1 as the dependent variable, reading skill was a significant predictor of PC1 (Table III). Neither distribution nor the interaction of reading skill and distribution was a significant predictor. Of the nuisance variables, only age was associated with a significant effect; after accounting for age, the effect of reading skill was still significant (see Table VII).
Taken as a whole, we detected only minor effects of the stimulus distribution on the asymptote and category boundary. Neither of these effects was hypothesized from the outset. Most importantly, we did not detect that any effect of the stimulus distribution on any psychometric parameter meaningfully varied with reading skill.
We also computed the Bayes factor (BF), which describes the ratio of the likelihoods that our data set was generated by either of two models: H0, the null model with no interaction between reading skill and stimulus distribution, and H1, the model containing the hypothesized interaction (Kass and Raftery, 1995). While a p-value describes how probable data at least as extreme as ours would be if the null were true, the BF estimates which model is more likely given the data (Wagenmakers, 2007).
We used an estimation of the BF from the Bayesian information criterion (BIC; Wagenmakers, 2007),

$$\mathrm{BF}_{01} \approx \exp\!\left(\frac{\mathrm{BIC}(H_1) - \mathrm{BIC}(H_0)}{2}\right).$$

In this case, the model H0 is defined as

$$\text{parameter} \sim \text{reading} + \text{distribution} + (1 \mid \text{subject ID}),$$

where the dependent variable parameter is a parameter of the fitted psychometric functions: slope, asymptote, category boundary, or PC1. Similarly, the model H1 is defined as

$$\text{parameter} \sim \text{reading} + \text{distribution} + \text{reading} \times \text{distribution} + (1 \mid \text{subject ID}).$$

For the slope, BF01 = 25.6; for the asymptote, BF01 = 1007.6; for the category boundary, BF01 = 65.4; and for PC1, BF01 = 32.6. In all cases, the BF indicates considerably stronger evidence for the null model (i.e., no interaction of reading skill and stimulus distribution). By standard BF reporting, these results would be considered strong to very strong evidence for the null.
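The BIC approximation to the Bayes factor described by Wagenmakers (2007), BF01 ≈ exp[(BIC(H1) − BIC(H0))/2], can be checked numerically with a short sketch. The BIC values below are hypothetical; only the BF of 25.6 is taken from the text.

```python
import math

def bf01_from_bic(bic_h0, bic_h1):
    """Approximate Bayes factor in favor of H0 from two BIC values."""
    return math.exp((bic_h1 - bic_h0) / 2.0)

def delta_bic_from_bf01(bf01):
    """BIC difference, BIC(H1) - BIC(H0), implied by a given BF01."""
    return 2.0 * math.log(bf01)

# A BF01 of 25.6 (reported for the slope) implies that the interaction
# model's BIC exceeds the null model's BIC by roughly 6.5 points.
print(delta_bic_from_bf01(25.6))

# Hypothetical BIC values differing by that amount recover the same BF.
print(bf01_from_bic(300.0, 300.0 + delta_bic_from_bf01(25.6)))
```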
For completeness, we also tested the interaction of reading skill and stimulus distribution with reading skill treated as a categorical variable (dyslexic versus control). A mixed effects analysis of variance (ANOVA) with a random effect of subject was used to evaluate the interaction term, using the Kenward-Roger estimation of degrees of freedom. (Noordenbos and Serniclaes, 2015). For separation of the asymptote by group, d = 0.87 (CI = [0.27, 1.47]); for category boundary, d = 0.08 (CI = [−0.50, 0.65]); and for separation of PC1 by group, d = 0.67 (CI = [0.08, 1.27]).
Altogether, our analyses indicate that from the behavioral responses alone, there is little evidence that poor readers and strong readers are differentially affected by the stimulus distribution of the categorical labeling task. Our findings are somewhat complicated by the fact that we did not find clear evidence for a robust effect of stimulus distribution on categorical behavior; the small main effects on category boundary and asymptote are difficult to interpret and not expected from prior literature. To the extent that our data can speak to the effects of stimulus distribution on task performance, though, our results do not support a relationship between dyslexia and altered distributional sensitivity on this particular task.

A. Effects of recent stimulus presentations on phoneme labeling
Because we collected 420 responses per individual, our data set may provide sufficient power to examine stimulus recency effects. To explore this possibility, we employed the modeling approach of Lieder et al. (2019), which uses generalized linear models (GLMs) to investigate how recent stimulus presentations affect the judgment of the current stimulus' identity.
For every stimulus presentation in the data set, we determined the identity of the preceding four stimuli. As in Lieder et al. (2019), we adopt the following notation.
Let $d_0$ be the stimulus step (1-7) of the current stimulus presentation, $t$. Then, $d_1$ is the difference in steps between $d_0$ and the stimulus presented at trial $t - 1$. Similarly, $d_2$ is the difference in steps between $d_0$ and the stimulus presented at trial $t - 2$, and so on for values $d_3$ and $d_4$.
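The notation above can be illustrated with a minimal sketch that builds the recency terms from a sequence of continuum steps. The function name `lag_differences`, the sign convention (current step minus earlier step), and the choice to skip trials that lack four predecessors are assumptions made for illustration; the text does not specify them.

```python
def lag_differences(steps, n_back=4):
    """Build (d0, d1, ..., d_n_back) for every trial with n_back predecessors.

    steps  : continuum steps (1-7) in presentation order.
    n_back : how many preceding trials to look at (four in the text).

    Assumption: d_k = step at trial t minus step at trial t - k.
    Assumption: trials with fewer than n_back predecessors are skipped.
    """
    rows = []
    for t in range(n_back, len(steps)):
        d0 = steps[t]
        lags = [d0 - steps[t - k] for k in range(1, n_back + 1)]
        rows.append((d0, *lags))
    return rows


# Hypothetical presentation order, for illustration only.
example_steps = [4, 7, 1, 3, 5, 2]
for row in lag_differences(example_steps):
    print(row)
```

Each returned tuple corresponds to one row of predictors for the GLM: the current step followed by its differences from the four preceding stimuli.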
The mixed effects GLM specifying the relationship between the label assigned to the current stimulus presentation and the recent presentations is as follows: where $f$ is the probit link function, the $\beta$ coefficients are linear weights to be estimated, $d_0$ represents the continuum step of the current stimulus presentation, and $(1 \mid \text{subject ID})$ is a random intercept for subject. The probit was chosen as the link function because the dependent variable, response, is binomially distributed; i.e., participants decided whether a sound was da or not. We note that the probit function contains only two parameters that variously adjust the slope and category boundary of a sigmoid and, therefore, differences in the asymptote would have an effect on the estimated slope.
For the purpose of illustrating this approach clearly, we begin with an exploration of group differences and then move on to a model where reading skill is treated as a continuous variable. We first fit a mixed effects GLM to the responses of each group (control and dyslexic). The estimated coefficients are compared in Fig. 2.
We can immediately see that stimulus recency effects ($d_1$ through $d_4$) exist and are quite similar between groups. The coefficient that differs by group is the weighting of $d_0$; in other words, the mixed effects GLM estimates that the probit slope is lower in the dyslexic group even when recent stimulus presentations are accounted for.
Having visualized the group-level differences, we follow up with a treatment of reading skill as a continuous measure in the GLM. On the basis of our initial exploration, we first consider the mixed effects GLM containing main effects of $d_1$, $d_2$, $d_0$, and reading skill, plus an interaction of $d_0$ and reading skill. We hypothesized a significant interaction between reading skill and $d_0$ on the basis of the previous group-level model. Indeed, the interaction, as well as the stimulus recency terms, were all highly significant (Table VIII).
We also tested augmenting our hypothesized model to include an interaction between reading skill and stimulus recency terms $d_1$ and $d_2$. Adding a $d_1 \times$ reading skill interaction increased the Akaike information criterion (AIC) from 15 868.8 to 15 869.0 and increased the BIC from 15 925.0 to 15 933.3, and the new interaction term was not significant ($\beta = 0.007$, SE = 0.005, $p = 0.182$). We, again, computed the BF to assess the relative evidence for the presence of an interaction; we estimated $\mathrm{BF}_{01} = 61.9$, consistent with strong evidence for the null. Considering a $d_2 \times$ reading skill interaction term fared no better: the interaction was not significant ($\beta = 0.0007$, SE = 0.005, $p = 0.896$), and AIC and BIC increased (to 15 870.8 and 15 935.0, respectively) compared to the simpler model. Again, the BF indicated strong evidence for the null with $\mathrm{BF}_{01} = 149.6$.
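As a rough check on these values, the BIC approximation to the Bayes factor (Wagenmakers, 2007) can be applied to the BICs reported above. This is a minimal sketch: the function name `bf01_from_bic` is ours, and because the inputs are the rounded BICs from the text, the results only approximate the reported Bayes factors.

```python
import math


def bf01_from_bic(bic_null, bic_alt):
    """Approximate BF01 (evidence for the null) from two BIC values.

    Uses BF01 ~= exp((BIC_alt - BIC_null) / 2); values above 1 favor the null.
    """
    return math.exp((bic_alt - bic_null) / 2.0)


# BICs as reported in the text (rounded to one decimal place).
bic_simple = 15925.0          # model without recency-by-reading interaction
bic_d1_interaction = 15933.3  # adds d1 x reading skill
bic_d2_interaction = 15935.0  # adds d2 x reading skill

bf_d1 = bf01_from_bic(bic_simple, bic_d1_interaction)
bf_d2 = bf01_from_bic(bic_simple, bic_d2_interaction)
print(f"d1 interaction: BF01 ~ {bf_d1:.1f}")
print(f"d2 interaction: BF01 ~ {bf_d2:.1f}")
```

The approximations land near 63 and 148, close to the reported 61.9 and 149.6; the small discrepancies are what one would expect from rounding the BICs before exponentiating.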
From this investigation of stimulus recency effects, we find that we are able to detect highly significant effects of the last two stimulus presentations on judgments of the current stimulus' identity. We did not detect a significant interaction of reading skill and the influence of previous stimulus presentations, and Bayesian analysis provides evidence against the presence of an interaction. In all, the model upholds the interpretation that psychometric functions are steeper in stronger readers regardless of the context in which each stimulus presentation occurs.
Last, we performed an analysis to characterize the mechanism by which recently presented stimuli influence judgments of the current stimulus. We hypothesized that if the current stimulus was ambiguous (i.e., it was drawn from the center of the /ba/~/da/ continuum), then the influence of the previous stimulus would be greatest. In other words, listeners might make greater use of the contrast between the current and previous stimuli when the current stimulus is ambiguous than when the current stimulus is a clear category exemplar. We tested this hypothesis with another mixed effects GLM. First, we created a new binary feature that distinguishes stimuli drawn from the center of the continuum versus stimuli drawn from the end points:

$$\text{ambiguous} = \begin{cases} 0, & \text{if } d_0 \in \{1, 2, 6, 7\}, \\ 1, & \text{otherwise.} \end{cases}$$

We then tested the model where $f$ is a probit function as in Eq. (5). We were specifically interested in the interaction of $d_1$ and ambiguous: a significant interaction indicates that the influence of the difference between the current and previous stimulus depends on whether the current stimulus is a category exemplar or not. As expected, the interaction of $d_1$ and ambiguous was significant ($\beta = 0.085$, SE = 0.007, $p < 0.001$). This analysis upholds the intuition that previous trials influence the present judgment by providing a contrast by which to judge ambiguous stimuli. If this is indeed the primary mechanism by which stimulus recency effects influence performance on the categorization task, then our study is not alone in finding a lack of interaction between reading skill and such contextual effects: Blomert and Mitterer (2004) also found no evidence for context effects at several linguistic scales in a phoneme labeling task like ours.
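The binary feature described above (0 for continuum steps 1, 2, 6, and 7; 1 otherwise) and its product with $d_1$ can be sketched as follows. The names `CLEAR_STEPS`, `ambiguous`, and `interaction_term` are hypothetical, and the sketch shows only how the predictors might be coded, not the fitted model itself.

```python
# Steps treated as clear category exemplars in this analysis.
CLEAR_STEPS = {1, 2, 6, 7}


def ambiguous(d0):
    """Return 0 for end-point steps (1, 2, 6, 7) and 1 for central steps."""
    return 0 if d0 in CLEAR_STEPS else 1


def interaction_term(d0, d1):
    """Product of d1 and the ambiguous indicator for one trial.

    The term is zero whenever the current stimulus is a clear exemplar,
    so its coefficient captures the extra weight of d1 on ambiguous trials.
    """
    return d1 * ambiguous(d0)


# Indicator across the full 7-step continuum.
print([ambiguous(step) for step in range(1, 8)])
```

Coding the indicator this way means the main effect of $d_1$ describes the recency effect on clear exemplars, while the interaction coefficient describes how much that effect changes for stimuli near the center of the continuum.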

B. Quantifying fatigue during the task
The relatively large number of trials collected per subject allows us to revisit an analysis proposed by Messaoud-Galusi et al. (2011) to determine whether poor readers show precipitous declines in task performance as the study goes on. If poor readers become fatigued or distracted at a faster rate than strong readers do, that could explain overall differences in task performance.
To this end, we modeled the probability of correctly labeling a clear ba or da exemplar as a function of the trial number and reading skill (Fig. 3). Note that clear category exemplars are stimuli drawn from the two ends of the continuum (steps 1 and 7). Having already established that poor readers produce shallower psychometric functions overall, we should expect that the probability of correctly labeling these tokens will be lower overall in poor readers. If an interaction of trial number and reading skill is found to be significant, that would suggest task fatigue occurs differentially across the spectrum of reading skill.
Once again, we used a mixed effects GLM with the subject as a random effect (as each subject participated in two test conditions, each with 210 trials). The dependent variable was accuracy on labeling an end point of the continuum, which was coded as 1 or 0. The model included a main effect of trial number, a main effect of reading skill, and an interaction of the two. Trial number and reading skill were scaled and centered prior to modeling. The results of this analysis are provided in Table IX. While there was a significant interaction of reading skill and trial number, the direction of this effect is actually opposite to what we might have predicted: greater reading skill is associated with a more deleterious effect of trial number on accuracy. Inspection of our data reveals that this trend is strongly influenced by one particular subject in the dyslexic group who began with nearly chance accuracy and became more accurate over the course of the task. When we removed this subject from the model, the magnitude of the interaction effect more than halved ($\beta$ went from $-0.044$ to $-0.018$) and the interaction was no longer significant ($p = 0.397$).
We also considered reaction time as a measure of task engagement. Looking again at the continuum end points, we selected reaction times between 200 ms and 3 s (to remove spurious responses and outliers as in O'Brien et al., 2018) and log-transformed these observations. The model of reaction time is given in Table X. Faster reaction time is associated with stronger reading skills (replicating O'Brien et al., 2018). However, we did not detect a significant interaction between trial number and reading skill.
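The reaction-time preprocessing described above can be sketched as a filter followed by a log transform. Several details here are assumptions, since the text does not state them: reaction times are taken to be in milliseconds, the 200 ms and 3 s bounds are treated as inclusive, and the natural logarithm is used. The function name `log_reaction_times` is hypothetical.

```python
import math


def log_reaction_times(rts_ms, lower_ms=200.0, upper_ms=3000.0):
    """Keep reaction times within [lower_ms, upper_ms] and log-transform them.

    Assumptions: inputs are in milliseconds, bounds are inclusive,
    and the natural logarithm is used.
    """
    return [math.log(rt) for rt in rts_ms if lower_ms <= rt <= upper_ms]


# Hypothetical reaction times (ms), for illustration only.
example_rts = [150.0, 200.0, 850.0, 3000.0, 4200.0]
kept = log_reaction_times(example_rts)
print(len(kept), "of", len(example_rts), "responses retained")
```

Filtering before transforming keeps anticipatory responses and very long lapses from dominating the model, and the log transform reduces the right skew typical of reaction-time data.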
As such, our results do not corroborate the idea that poor readers are especially prone to becoming disengaged, distracted, or tired during the task, at least by our proposed measures of engagement.

IV. DISCUSSION
It is well established that phoneme categorization is related to reading skill, but there are a variety of explanations for this relationship. We considered how the frequency distribution from which stimuli are drawn during a standard phoneme labeling task might differentially affect task performance in children with dyslexia versus task performance in typical readers. Indeed, some authors have posited that differences in psychophysical task performance may, in some situations, actually reflect a difference in an individual's sensitivity to task distributions (i.e., the distribution of the stimuli). Because our task did not appear to induce the sort of changes in psychometric functions that other authors have noted (such as slope), it is challenging to draw direct comparisons between our results and others (e.g., Clayards et al., 2008; Vandermosten et al., 2018). However, insofar as we detected some effect of the stimulus distribution statistics on aspects of task performance (where listeners draw a category boundary and a slight tendency to make more labeling errors on clear stimuli), we do not find evidence that the effect differs in children with dyslexia.
There are several reasons why we may not have seen the same effects of stimulus distribution as other authors: unlike in the study of distributional learning in children with and without reading disability by Vandermosten et al. (2018), we used a native-language contrast that may have been overlearned by our participants prior to our study. If this explanation were true, it would imply that children with dyslexia are entirely equipped to leverage statistical learning to establish phonetic categories from their natural environment (although we cannot rule out that they may do so to a lesser degree than their typically developing peers). However, there is evidence from the literature against this interpretation: even in typically developing children, identification functions produced by categorizing stop consonants do not fully resemble those of adults (Hazan and Barrett, 2000, showed this in children aged 6-12 years, and McMurray et al., 2018, largely replicated the finding in adolescents up to age 18 years). These findings argue that even native-language contrasts are unlikely to be fully learned by age 12 years, the age of the oldest child in our study. Another potential explanation for our differing results from previous reports is that our measurements may not have been sufficiently precise: Clayards et al. (2008) detected stimulus distribution effects on psychometric function shape for a native-language contrast in adults using eye-tracking to recover a time-series measure of looks to a closed set of choices displayed on a screen. We do not have access to a similarly fine-grained measure.
Our data set also allowed us to apply the modeling approach of Lieder et al. (2019) to investigate how previous stimulus presentations affect judgments about the current stimulus. We were able to detect effects of the previous two stimulus presentations but, critically, did not find that these effects interacted with reading skill. In other words, people with dyslexia show worse overall phoneme categorization performance but equivalent stimulus recency effects compared to people with typical reading skills. Our results are broadly consistent with the findings of Lieder et al. (2019), who showed that stimulus recency effects were similar in adults with and without dyslexia (albeit in a task involving the judgment of tone frequency differences).
For stimulus recency effects to be intact in children with dyslexia, it seems necessary that at least one aspect of sensory encoding is intact: if stimuli were not encoded with sufficiently high fidelity, it is difficult to imagine that children would be sensitive to differences between previous and current presentations. However, because we do not yet rigorously understand the neural or perceptual basis of categorical labeling, it is difficult to extend these results to understanding the quality of neural encoding involved in categorical decision-making. Thus, while our results indicate that, at some perceptual level, speech encoding is similar between groups, more work is needed to connect our findings to the broader debate about sensory encoding in dyslexia (Casini et al., 2018; Goswami, 2011; Hancock et al., 2017). At this point, our results can mainly be taken to contradict claims that adaptation or anchoring to recent stimuli is different in children with dyslexia (Jaffe-Dax et al., 2017; Krause, 2015; Nicolson and Fawcett, 2018; Perrachione et al., 2016).
Finally, we tested whether children with dyslexia showed signs of increased fatigue during the task by analyzing their performance on relatively easy trials throughout the course of the experiment. Although individuals with dyslexia showed a tendency to make more errors on "easy" trials throughout the experiment, we did not find evidence, as Messaoud-Galusi et al. (2011) did in a similar task, that they were merely becoming less attentive over time. It is possible that our results differ because we allowed children a brief break (typically less than one minute) every 35 trials, whereas participants in the study by Messaoud-Galusi and colleagues adhered to a different schedule. Additionally, our participant demographics may have differed, as many of our children are well accustomed to computer games at home and at school. While we are, therefore, cautious about generalizing our results broadly, we can conclude that task fatigue is unlikely to explain the patterns of categorical labeling we present here. This may be reassuring with regard to the large amount of literature on categorical labeling in individuals with dyslexia: while experimenters must remain vigilant of ways that overall decreased accuracy can bias measures of task performance (Roach et al., 2004; Wichmann and Hill, 2001), our results suggest that a simple explanation of task engagement alone is unlikely to account for the entire relationship between reading and categorical labeling.
Considering our results and the current state of the field, we believe researchers are at an intriguing moment: there is compelling evidence that in certain experiments apparent deficits in groups of participants with dyslexia are well explained by nonlinguistic and non-sensory mechanisms (Ahissar, 2004, 2006; Gabay et al., 2015), and this framework has considerably more power to explain the diversity of deficits associated with reading disability than purely sensory or phonological models. Still, there are considerable gaps in this explanation: not only are there experimental contexts where individuals with dyslexia appear to have no statistical learning deficit (Du and Kelly, 2013; Gabay and Holt, 2018; Gould and Glencross, 1990; Inácio et al., 2018; Jiménez-Fernández et al., 2011; Samara and Caravolas, 2017; Staels and Van den Broeck, 2015; perhaps reflecting ongoing vagueness in what "statistical learning" encompasses), but effect sizes in studies that do detect group differences are still too small to accurately separate most cases of dyslexia from typical reading (Lieder et al., 2019; Vandermosten et al., 2018).
Even if we take the view that reduced categorical labeling in struggling readers is entirely the consequence of impaired sensitivity to phonetic categories over the course of many years of language exposure, group separability would remain quite modest: the average effect size in categorical labeling studies is 0.66 (Noordenbos and Serniclaes, 2015), meaning that only 9.7% of individuals with dyslexia would fall below the 95% confidence interval of the control population. A further problem for the statistical learning hypothesis is that it often relies on an assumption that a very subtle impairment can cascade to have drastic effects on literacy by disrupting the development of typical phonological processing. However, a growing body of literature (Booth et al., 2000; Calcus et al., 2018; O'Brien et al., 2018; Robertson et al., 2009; Snowling et al., 2019; Talcott et al., 2000) suggests that a cascading model is inadequate: performance on psychophysical tasks can relate to reading skill separately from the proposed phonological processing mediation pathway. It may be that the categorical labeling task is an index of something far broader than phonological awareness, picking up on other aspects of developmental and linguistic experience.
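The 9.7% figure can be reproduced with a short calculation. The sketch below assumes that both groups are normally distributed with equal variance, that the effect size of 0.66 is a standardized mean difference, and that the "95% confidence interval of the control population" refers to the central 95% range of control scores (lower bound at the 2.5th percentile). These are our reading of the statement, and the function name is hypothetical.

```python
from statistics import NormalDist


def fraction_below_control_cutoff(effect_size, coverage=0.95):
    """Fraction of the lower-scoring group falling below the control range.

    Scores are expressed in control-group standard deviations, with the
    control mean at 0 and the other group's mean at -effect_size.
    Assumes equal-variance normal distributions in both groups.
    """
    standard_normal = NormalDist()
    # Lower bound of the central `coverage` range of controls (about -1.96).
    lower_cutoff = standard_normal.inv_cdf((1.0 - coverage) / 2.0)
    # Shift the cutoff into the lower-scoring group's own z-scale.
    return standard_normal.cdf(lower_cutoff + effect_size)


fraction = fraction_below_control_cutoff(0.66)
print(f"{fraction:.1%}")
```

With an effect size of 0.66, the cutoff sits about 1.30 standard deviations below the mean of the dyslexic group, which leaves roughly 9.7% of that group below the control range.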
Sharpening of category boundaries may be partially a result of reading experience itself. In this hypothesis, the acquisition of reading, and spelling skills in particular, may create or reshape the representation of sound categories (Dich and Cohn, 2013). This hypothesis has been considerably understudied and would best be addressed via careful intervention studies in which category boundaries may be assessed before and after literacy training.
With that said, considering the results of the present study in conjunction with previous results from our group (O'Brien et al., 2018; O'Brien et al., 2019), we are hesitant to recommend the phoneme categorization task for future research. While these psychometric functions are reliably correlated with reading skill in our laboratory and many others, our ability to interpret them with regard to a mechanistic view of reading development is limited. As we have discussed, it is challenging to disambiguate sensory factors from broader aspects of language and cognitive development using these behavioral results. Further, it is not obvious how to extend behavioral results from this narrow experimental context to speech perception "in the wild" (Holt and Lotto, 2010). Seeing as this task has a long history as a probe for speech perception in reading-impaired populations dating back to the early 1980s (Noordenbos and Serniclaes, 2015), we are eager for researchers to develop new experimental paradigms that are guided by our field's maturing understanding of speech perception. For example, statistical methods and experimental tools have sufficiently advanced that researchers can consider time-series measures of behavioral responses (such as eye-tracking, as in McMurray et al., 2018) and naturalistic, sentence-length stimuli incorporating talker variability. It is more feasible than ever to create tasks that both resemble ethological speech perception and afford fine experimenter control, and we encourage researchers guided by mechanistic hypotheses to think beyond phoneme identification.
In summary, our results are consistent with the perspective that multiple causal routes relate performance on various behavioral and psychophysical measures to reading skill (Ziegler et al., 2019). Under this model, deficits in learning category boundaries from speech sounds may be one of many factors that contribute to difficulties with reading or, potentially, a consequence of developmental literacy challenges (Dich and Cohn, 2013). In light of this, we are most optimistic toward future research that explores how constellations of risk factors (Schatschneider et al., 2016; Spencer et al., 2014), including but not limited to reduced category learning, phonological processing, and sensitivity to the statistics of nonlinguistic stimuli, act in concert to determine a child's reading skill.