Introduction

Inter-individual variability in olfactory performance is substantial and very well documented (e.g. Doty et al. 1984; Hummel et al. 2007b; Kobal et al. 2000). Nevertheless, there are still gaps in our understanding of how this variability develops. Cross-sectional studies show that the most important demographic predictor is age (Doty and Kamath 2014; Hawkes and Doty 2009), although one must bear in mind that age itself, or the passing of time, is not what drives developmental changes across the human lifespan and should only be considered a proxy for the actual causal mechanisms. For instance, a child’s olfactory performance improves with age, but this progress is most likely driven by growing experience with odours and improving linguistic abilities (Monnery-Patris et al. 2009; Stevenson et al. 2007), broadening working memory span (Larjola and von Wright 1976), improving recognition memory (Frank et al. 2011; Hvastja and Zanuttini 1989), changes in nasal aerodynamics, or more effective inhalation of olfactory stimuli (Mennella and Beauchamp 1992). To this day, there are very few longitudinal studies on olfactory development in general and in children in particular. Thus, our knowledge derives almost exclusively from cross-sectional studies even on the level of this proxy (Monnery-Patris et al. 2009; Renner et al. 2009; Richman et al. 1992; Richman et al. 1995a; Schriever et al. 2014). Besides, there is an unsettled general debate about the extent to which variation observed in cross-sectional studies reflects actual ongoing changes within individuals or is caused by methodological artefacts. These may include, for example, discrepancies between cross-sectional and longitudinal modelling of age-related effects, that is, whether they are estimated by performing ordinary least squares on all data or by summarising each individual’s data by a slope and then averaging these slopes.
To be specific, when the age effect is modelled as linear when it is, in fact, non-linear, linear trends estimated cross-sectionally and longitudinally may differ (e.g. Louis et al. 1986). Furthermore, in olfactory research, children have been an underrepresented cohort and often treated indiscriminately as a single demographic group regardless of age (e.g. Monnery-Patris et al. 2009; Renner et al. 2009). This is despite the number of changes that take place over childhood in cognition, self-regulatory capacity and other factors that may affect olfactory performance considerably (e.g. Goswami 2011). Thus, a more discerning approach to developmental staging in olfactory studies must be embraced. This knowledge gap warrants further, longitudinally oriented research.

An area of study that calls for a longitudinal approach is the gender variation in olfaction and how it develops. A great body of literature reports the superior performance of women on odour identification tests (e.g. Hummel et al. 2007b; Larsson et al. 2004), although not uniformly so (e.g. Kobal et al. 2000). The results for odour discrimination and memory are mixed, with studies indicating small, if any, differences (e.g. Choudhury et al. 2003; Kobal et al. 2000; Öberg et al. 2002). Research assessing olfactory thresholds revealed either no gender difference or greater sensitivity of women to some odorants, but the differences were also small (e.g. Cometto-Muniz and Abraham 2008; Chopra et al. 2008; Koelega and Köster 1974). Some researchers argue that gender differences are present at birth (Doty and Laing 2015), but this proposition is based on cross-sectional data, as studies observing a single sample of children over any period of time are missing. Moreover, some studies indicate that gender variation in olfaction may not be reliably present in infancy and pre-school (Bastos et al. 2015; Cameron and Doty 2013; Martinec Nováková and Vojtušová Mrzílková 2016a, 2016b; Saxton et al. 2014; Schriever et al. 2014). Also, there is little clarity about the magnitude of the gender effect on olfaction in early childhood. These questions invite testing in longitudinal studies.

Studies in which repeated measurements are taken require controlling for test practice effects (that is, remembering the previous responses and proficiency in handling the research tasks encountered before; see Heilbronner et al. 2010 for a review). Repeated olfactory testing itself can be seen as an opportunity for perceptual learning on the part of the participant, that is, a relatively long-term, learned increase in perceptual acuity (Wilson et al. 2009). Olfactory performance benefits from repeated odour exposures, which can be relatively brief (Dalton et al. 2002; Wysocki et al. 1989). It is, therefore, necessary to account for the number of sessions attended and their spacing. Another experiential effect to consider is school entry. Formal education modulates and facilitates performance on cognitive tasks across the human lifespan (Ceci 1991; Steinbrink et al. 2014; for reviews, see e.g. Rogoff 1981; Rosselli and Ardila 2003). Nevertheless, its potential effect on children’s olfactory performance has been largely overlooked in research on the chemical senses.

Here, we repeatedly tested children’s olfactory abilities across five waves over a 2-year period. We focused on pre-school children who were aged between 3.5 and 7 years at the beginning of the study. Even though some of the previous studies show that 6 years is the youngest age at which it is feasible to carry out psychophysical olfactory tests (e.g. Hummel et al. 2007a), in our study, data from younger children did not manifest themselves as outliers or influential cases. This age group is also of special interest because of the possible effect of experiential factors related to school entry. We hypothesised that olfactory scores would be higher in older children and would increase with time, that girls would outperform boys, that children who attended more testing occasions at shorter intervals would outperform those who participated in fewer sessions over a wider time span, and that schoolchildren would outperform pre-schoolers. As this was an initial study, we did not delve into the actual factors behind the proxies of age and gender.

Materials and Methods

Participants

The participants were 157 children of Czech origin: 79 boys, mean age at study commencement 5.87 ± 0.71 years, range 3.58–7.08 years, and 78 girls, mean age at study commencement 5.75 ± 0.57 years, range 4.42–7.00 years. At the start of the study, the children attended one of six public mixed-sex kindergartens in Prague and its suburbs. The kindergartens were attended by children from varied social backgrounds. Kindergarten principals were contacted via telephone, e-mail, and in person to inform them about the planned study. Those who had provided permission to perform the study on the kindergarten’s premises were asked to pass the information on to the teachers, who distributed leaflets to children’s parents. We kept the e-mail addresses and phone numbers the parents had provided to contact them later with an invitation for their children to take part in subsequent sessions. After the given child entered school, the session took place either on the school premises or at our department. The sessions were scheduled in approximately 6-month intervals. Namely, data was collected in the late autumn and early winter of 2010 (Wave 1), late spring and early summer of 2011 (Wave 2), late autumn and early winter of 2011 (Wave 3), late spring and early summer of 2012 (Wave 4), and late autumn and early winter of 2012 (Wave 5). Half the children were recruited and tested for the first time during Wave 1 (48.41%, N = 76, 33 boys), whilst the other half entered the study and were first tested during Wave 2 (51.59%, N = 81, 47 boys). Because on the first testing occasion, these two groups did not differ in age, t (153.72) = 0.45, p = 0.65 or baseline olfactory performance (identification: t (153) = − 1.17, p = 0.24; discrimination: t (153) = − 1.96, p = 0.30; threshold: t (138) = 1.12, p = 0.28), group membership was disregarded. 
All the children were last tested during Wave 5, meaning that there was data from the maximum of five testing occasions available for those who entered the study during Wave 1, but a maximum of four for those who were recruited during Wave 2.

Due to the longitudinal nature of the study, missing data was a concern. However, the majority of the missing data was the result of participant absence on the day of data collection rather than attrition from the study. The number of tests administered was 76 (100% pre-schoolers) in Wave 1, 135 (100% pre-schoolers) in Wave 2, 91 (49% pre-schoolers) in Wave 3, 34 (94% pre-schoolers) in Wave 4, and 81 (100% school age) in Wave 5. For logistic reasons, we were unable to collect data from school children in Wave 4, resulting in lower N, a greater proportion of pre-schoolers and a drop in mean age within this wave. Participant-wise, complete data was available from two children recruited to the study during Wave 1 and 21 children recruited during Wave 2.

Thus, 85.35% of participants had some missing data due to skipping at least one testing occasion. Online Resource 1 shows that prior to Wave 5, children with complete and incomplete data did not differ on any measures. Because the children entered the study at different times (Wave 1 vs. Wave 2) and the pattern of skipping the sessions was highly individual (i.e. the total number of testing occasions already attended by the given wave varied), experiential effects on the children’s reports could not be simply operationalised in terms of the waves. For example, of the 39 children recruited to the study during Wave 1 who participated in Wave 5, which would have been, in theory, their fifth testing occasion, in reality only two children had actually attended all four of the previous sessions, whilst 19 children had been tested only three times and 16 children only twice, and two had thus far attended only a single session. Therefore, the level of individual experience with the tasks within the specific wave was expressed as the cumulative total of sessions actually attended by the given child thus far. Only a single session was attended by 31 children (19.7%). The number of children who attended two, three, or four sessions was N = 33 (21.0%), 51 (32.5%), and 40 (25.5%), respectively. Additionally, we also took into account the time interval since any previous testing. For instance, for children recruited to the study during Wave 1 who participated in Wave 5, this time interval ranged from 5 to 25 months. The number of children (boys, girls, and total) participating in each wave is given in Table 1, along with the descriptive data on their age, experience level, and olfactory performance. Figure 1 shows the frequencies of boys and girls participating within Waves 2–5 relative to the cumulative total of sessions already attended (newly recruited children within Wave 2 are not shown).
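The two experiential covariates described above can be derived from each child’s attendance record. The following is a minimal Python sketch under stated assumptions: the function name `experience_covariates`, the record layout, and the example dates are hypothetical; the count is taken to exclude the current session (i.e. sessions "already attended"); and the interval is expressed in whole calendar months, a convention that the text does not specify.

```python
from datetime import date

def experience_covariates(session_dates):
    """For one child, return a list of (sessions_already_attended, months_since_previous).

    session_dates: dates of the sessions the child actually attended.
    months_since_previous is None at the child's first session.
    """
    ordered = sorted(session_dates)
    rows = []
    for i, current in enumerate(ordered):
        if i == 0:
            interval = None  # no previous testing occasion
        else:
            previous = ordered[i - 1]
            # difference in whole calendar months (assumed convention)
            interval = (current.year - previous.year) * 12 + (current.month - previous.month)
        rows.append((i, interval))
    return rows

# Hypothetical child tested in Waves 1, 3 and 5 (dates are illustrative only)
child = [date(2010, 11, 15), date(2011, 12, 1), date(2012, 12, 10)]
covariates = experience_covariates(child)
```

Because the covariates are computed per child from sessions actually attended, two children tested within the same wave can carry different experience levels, which is the reason the waves themselves could not serve as the measure of experience.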

Table 1 Mean ± SD (range) and valid N of the children’s age, interval in months since any previous testing occasion, olfactory scores (identification, discrimination, and threshold), and median of sessions already attended across the individual waves in boys, girls, and the total sample.
Fig. 1 The absolute frequencies of boys and girls participating within Waves 2–5 relative to the cumulative total of sessions already attended

Binomial tests showed that the proportion of boys and girls participating in each wave was not significantly different from the 50:50 ratio: p = 0.30 (Wave 1), p = 0.39 (Wave 2), p = 1.00 (Wave 3), p = 0.18 (Wave 4), and p = 0.65 (Wave 5), nor did the proportions of boys and girls differ across the waves, p = 0.81. T-tests confirmed that children participating in each wave were significantly older than those who attended the preceding one: p = 0.003 (Wave 2 vs. Wave 1), p < 0.001 (Wave 3 vs. Wave 2), and p < 0.001 (Wave 5 vs. Wave 4), with the exception of the age difference between Wave 3 and 4, which was non-significant (p = 0.93). This means that the effect of different children participating in the individual waves was in general overshadowed by the effect of passing time. This also held true for boys and girls, as well as the recruitment groups analysed separately. The boys and girls within each group participating in each of the waves did not differ in age.
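A test of the boy-to-girl ratio against 50:50 can be illustrated with an exact two-sided binomial test. This is a sketch only: the text does not state which variant of the binomial test was used, the function name `binomial_two_sided_p` is hypothetical, and the only counts used are those given for Wave 1 in the Participants section (33 boys among 76 children).

```python
from math import comb

def binomial_two_sided_p(k, n):
    """Exact two-sided binomial test of k successes in n trials against p = 0.5."""
    tail = min(k, n - k)
    # probability of a result at least as extreme in the smaller tail
    one_tail = sum(comb(n, i) for i in range(tail + 1)) / 2 ** n
    # the null distribution is symmetric at p = 0.5, so the tail is doubled
    return min(1.0, 2 * one_tail)

# Wave 1 as reported in the text: 33 boys among 76 children
p_wave1 = binomial_two_sided_p(33, 76)
```

The resulting value can be compared with the p = 0.30 reported for Wave 1.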

Olfactory Measures

General Considerations

The Sniffin’ Sticks test (Hummel et al. 1997), manufactured by Burghart Messtechnik GmbH, was used to assess odour identification, discrimination, and threshold. This is one of the most widely used tests of olfactory performance, based on pen-like odour-dispensing devices. The Sniffin’ Sticks test has been widely used by clinicians and researchers across Europe to test olfactory abilities in adults (Hummel et al. 2007b) and children (Ferdenzi et al. 2008b; Renner et al. 2009; Sorokowska et al. 2015), including Czech ones (Martinec Nováková et al. 2018a; Martinec Nováková et al. 2018b; Martinec Nováková et al. 2015; Martinec Nováková and Vojtušová Mrzílková 2016a, 2016b; Saxton et al. 2014).

Odour Identification

The 16-item “blue” identification test consists of odours familiar to the general European population, namely orange, leather, cinnamon, mint, banana, lemon, liquorice, turpentine, garlic, coffee, apple, clove, pineapple, rose, anise, and fish (exact chemicals are not specified by the manufacturer). In the original version of the test, cued identification is employed in which participants select a verbal label for the target odour from a candidate list of four alternatives. The resulting score is the sum of correct answers (maximum of 16).

In the present study, the test was adapted to children who could not yet read or were only beginning to learn how to read. This was done by presenting both the targets and distractors in the form of colour pictures instead of verbal labels. Selection of pictures was based on a pilot study (N = 31, 16 boys, 5.85 ± 0.89 years) carried out in one of the kindergartens. It was split over two sessions. In the first session, children were interviewed about their understanding of the individual verbal descriptors, both the targets and the distractors. Prompted by the question “What do you think…is, what does it look like, when or where can you find it?,” children were encouraged to elaborate on the meanings of the individual labels. Based on these accounts, images depicting items most frequently associated with the given verbal label were selected. For instance, since the majority of children tended to associate “mint” with chewing gum and mints, a picture of confectionery was used (see e.g. Bastos et al. 2015 for a similar approach). These interviews had also revealed that a majority of children expressed uncertainty about what most of the spices, menthol, and turpentine were, looked like and smelled like, and in which foods or products they could be found. Therefore, items 3, 8, 12, and 15 (targets: cinnamon, turpentine, clove, and anise) were excluded from the odour set. In the following session of the pilot study, four colour pictures arranged in a 2 × 2 square on an A4 page were presented for each item. The researcher pointed at each picture, one by one, and encouraged the child to describe it. On average, children produced the veridical label in 88% of cases and a near miss in 12% of cases. The veridical label is the commonly used name of the given stimulus (Dubois and Rouby 2002). The children often started off by explaining where and when they last encountered the depicted item. 
In that case, the researcher would engage in a conversation with the child to see whether he or she could produce the veridical label in the end. A near-miss referred to an incorrect label, though one that was precise and that represented an item readily confusable with the stimulus. Specifically, children tended to confuse blackberry with raspberry or blueberry (item 1), chives with parsley and fir with spruce (item 4), grapefruit with orange or clementine (item 6), peach with apricot (items 6, 11, and 13), wine with beer (item 10), and chamomile with daisy (item 14). For at least some of the items, there is evidence for an age-related increase in familiarity (e.g. Noll et al. 1990). Hence, some near-misses arose more as a result of children’s limited experience with certain items, such as alcoholic beverages, not the depictions per se. The final set of pictures employed in the study has been published elsewhere (see Supplementary Material to Martinec Nováková and Vojtušová Mrzílková 2016a). Results of the pilot study, given in Online Resource 2, show that the scores were close to those reported in studies on children’s Sniffin’ Sticks performance that were available at the time.

In the study, at the beginning of each trial, children were first asked to describe each picture and were given immediate feedback about the correct answer. Stimuli were presented one by one, in the order recommended by Hummel (2004). The researcher removed the cap of the pen and placed the tip approximately 2 cm in front of both nostrils during several respiratory cycles (at least 5 s). At the same time, she asked the child to choose a label for the given odour by pointing at one of the four pictures. The interval between odour presentations was between 20 and 30 s to prevent olfactory adaptation. The theoretical score range was 0–12, with higher scores indicating better identification performance.

Odour Discrimination

The pilot study (see Online Resource 2 for details) showed that children’s discrimination scores came close to those reported by Kobal et al. (2000), Hummel et al. (2007b), Hummel et al. (2007a), and Ferdenzi et al. (2008b), albeit those studies were carried out in broader age groups. Hence, no alterations were made to the discrimination test aside from masking the colour bands on the Sniffin’ Sticks pens with opaque Sellotape because most children participating in the pilot study found the blindfold uncomfortable. The test of odour discrimination assesses the degree to which an individual can differentiate between odours in suprathreshold concentrations. The set comprises 16 triplets of odorised pens (marked with a blue, green, and red colour band), of which two are identical, and the participant is asked to indicate the odd one, which is always the green one. The odorants used in the test and the order of presentation (which was followed) are given in Hummel et al. (1997) and in Online Resource 3c. Presentation of triplets was separated by circa 20 s. The score is the total of correct trials (0–16), with higher scores indicating a better odour discrimination ability. Prior to the first trial, the task was demonstrated using three colour pencils, of which two were of the same colour and the third one was different, explaining to the children that they had to identify the odd pen by smell, not colour. Next, task comprehension was determined by presenting the child with another three pencils of different colours, upon which they were reminded once again about the olfactory, not visual, nature of the task.

Odour Threshold

The olfactory threshold refers to the minimum concentration of a tested odorant that an individual is able to reliably differentiate from a blank sample. The set employed in the present study consisted of 16 dilution steps of n-butanol (targets), each of which formed a triplet with two blanks. As recommended by Hummel et al. (1997), a single-staircase, three-alternative forced-choice (3-AFC) method was used, in which, starting with the lowest concentration (dilution number 16), an ascending (low to high concentration) series of even-numbered triplets was presented, with successful trials prompting another presentation of the same triplet in a random order. Two successful trials in a row marked a turning point; starting with the nearest lower concentration, a descending series of triplets was presented until the child failed to detect the target. This marked a reversal towards the higher concentrations and, starting with the next higher concentration, an ascending series of triplets was presented until two correct trials occurred, marking another reversal. The testing was finished after a total of seven reversals was reached. To sustain the children’s attention throughout the assessment, they were rewarded with a candy for each response, regardless of whether it was correct or not. They were not given any feedback on their performance during or after the test. The threshold score was computed as the arithmetic mean of the dilution number at the last four reversals. Ranging from 1 to 16, higher scores indicate greater olfactory sensitivity (i.e. lower threshold).
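The staircase logic described above can be sketched in Python as follows. This is an illustration under stated assumptions rather than the procedure as implemented in the study: `detects` is a hypothetical response model supplied by the caller, two consecutive correct trials are also required before moving to a lower concentration during descending runs, the dilution number is clamped to the range 1–16, and the `max_trials` safeguard is an addition of this sketch.

```python
def staircase_threshold(detects, n_reversals=7, n_scored=4, max_trials=200):
    """Single-staircase 3-AFC run; returns the mean dilution number at the last reversals.

    detects(dilution) -> bool is a caller-supplied response model (hypothetical).
    Dilution 16 is the lowest concentration, dilution 1 the highest.
    """
    dilution, step = 16, 2            # the initial ascent uses even-numbered triplets
    ascending = True                  # ascending = towards higher concentrations
    reversals = []
    trials = 0
    while len(reversals) < n_reversals:
        # a dilution counts as detected only after two correct trials in a row
        correct = True
        for _ in range(2):
            trials += 1
            if not detects(dilution):
                correct = False
                break
        if trials > max_trials:
            return None               # safeguard (assumption): the run did not converge
        if ascending:
            if correct:
                reversals.append(dilution)        # turning point: start descending
                ascending, step = False, 1
                dilution = min(16, dilution + 1)
            else:
                dilution = max(1, dilution - step)
        else:
            if correct:
                dilution = min(16, dilution + 1)  # keep descending
            else:
                reversals.append(dilution)        # failure: reverse towards higher concentrations
                ascending = True
                dilution = max(1, dilution - 1)
    scored = reversals[-n_scored:]
    return sum(scored) / len(scored)

# Idealised observer who always detects dilution steps 1-8 and never 9-16
threshold = staircase_threshold(lambda dilution: dilution <= 8)
```

For this idealised observer the reversals alternate between dilution steps 8 and 9, so the mean of the last four reversals is 8.5. Real responses include correct guesses (one in three under 3-AFC), which is why several reversals are averaged.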

Procedure

Parents and teachers were instructed to encourage the children to attend the testing sessions, scheduled between 9 a.m. and 3 p.m., only when in good respiratory health. We relied on parental reports; the absence of any condition potentially affecting the sense of smell was not further verified. Testing took place in a secluded, well-ventilated room without strong ambient odours. Firstly, children were briefly familiarised with the tasks, which were presented as a game, and assured that they could stop or quit at any time. The order of the olfactory tests was randomised across children. However, within the olfactory tests, the stimuli were presented in the order recommended by Hummel et al. (1997). Since this study was part of a larger project (see Martinec Nováková et al. 2018a; Martinec Nováková et al. 2018b for other studies on this sample), the children were also interviewed about their olfactory behaviours using the Children’s Olfactory Behaviours in Everyday Life (COBEL) questionnaire (Ferdenzi et al. 2008a). It includes questions designed to evaluate self-reported awareness and reactivity to odours in significant everyday contexts: food (e.g. “When you smell a food odour, do you try to guess for fun what it is (never/sometimes/often)?”), social (e.g. “Do you happen to smell parts of your body (never/sometimes/often)?”), and environment (e.g. “Imagine someone is smoking next to you. Do you care/like or not like/love or hate this odour?”). The sheer number of the various olfactory tests and the interview presented a cognitive load that could only be alleviated by splitting them over two sessions. Therefore, each child was tested on two consecutive days or within a week at the very latest. Each session took circa 30 min. Parents or teachers were never present in the room during the testing session.

Statistical Analysis

SPSS 24.0 (IBM Corp 2016) was used to carry out the majority of analyses except for the calculations of Cohen’s f2, which were performed with the SAS University Edition software (SAS 2017). Plots were produced in SPSS and R 3.4.0 (R Development Core Team 2008).

The sample size was calculated using formulae for continuous outcomes given in Kim and Seo (2013). Data in a cohort that was close to the age group we planned to recruit had only been reported by Ferdenzi et al. (2008b) for children aged 7 to 11. Because information was only available on gender differences in odour identification, we calculated a sample size for this effect. At a significance level (α) of 0.05 and power (1-β) of 0.6, which is typical of studies published in major psychology journals, the sample size required per group in each wave was N = 40. Despite the high dropout rates, we were able to achieve those sizes, except for Wave 4.
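For orientation, a per-group sample size for comparing two means can be sketched with the textbook normal-approximation formula n = 2(z_{1-α/2} + z_{1-β})² / d². This is an assumption-laden illustration: it may differ from the exact formulae in Kim and Seo (2013), the function name `n_per_group` is hypothetical, and the standardised difference d = 0.5 is chosen for illustration only and is not a value taken from the study or from Ferdenzi et al. (2008b).

```python
from math import ceil
from statistics import NormalDist

def n_per_group(d, alpha=0.05, power=0.6):
    """Per-group n for a two-sided, two-sample comparison of means (normal approximation)."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)   # critical value for the two-sided test
    z_beta = z.inv_cdf(power)            # quantile corresponding to the desired power
    return ceil(2 * (z_alpha + z_beta) ** 2 / d ** 2)

# d = 0.5 is a hypothetical standardised difference, not the authors' input
n = n_per_group(0.5)
```

With these illustrative inputs the formula gives 40 children per group; raising the power to the more conventional 0.8 with the same d would require 63 per group.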

The normality of the raw data was checked, firstly, by visually examining the individual histograms of all relevant variables, secondly, by producing skewness and kurtosis values and their respective standard errors, from which z-scores were computed and compared to the value of 1.96, as suggested by Field (2005), and, thirdly, with multiple Shapiro-Wilk’s W tests. Two approaches to analysis of longitudinal data were adopted. Both utilised linear mixed models (LMM), were run using the SPSS syntax MIXED command and yielded very similar results. The first data analytic strategy consisted in fitting individual growth curve (IGC) models. In so doing, we followed the procedure recommended by Shek and Ma (2011). One of the advantages of IGC models is that they allow the irregularity of number and spacing of waves by means of a time-structured predictor (“time”) (Singer and Willett 2003). Thus, at Wave 1, the values of time were set at 0, and the number of months from the date of data collection within Wave 1 was calculated for each subsequent wave (i.e. Waves 2–5). In order to be able to perform these calculations for the children who did not take part in Wave 1 and were first tested during Wave 2, for these participants this date was set to November 1, which was the average date of data collection in Wave 1. As an alternative, an arbitrary date in the past was established, at which time was set at 0 and the number of months was calculated from this date. This approach to treating the time variable led to very similar results. The continuous variables of initial age and time interval since any previous testing were treated by a grand mean centring method (i.e. by subtracting the mean, which is generally recommended in order to simplify the interpretation of the results (Hox 2002)). Next, following the strategy suggested by Singer and Willett (2003), several models were fitted and then compared by means of − 2 log likelihood (i.e. 
a likelihood ratio test/deviance test) and Akaike Information Criterion (AIC, “smaller is better”) in order to select the best model. Namely, to compare models, we calculated delta AIC (Δi) as follows: AICi – AICmin, where AICi is the AIC value for model i, and AICmin is the AIC value of the “best” model. Then we followed the rule of thumb, whereby a ∆i < 2 indicates substantial evidence for the given model, values between 3 and 7 suggest that the model has considerably less support, whilst a ∆i > 10 says that the model is very unlikely (Burnham and Anderson 2002). Firstly, two unconditional models were generated to examine mean differences in the outcome variable across individuals and to compare the fit of models estimated by means of the restricted maximum likelihood method (REML; default option) and the maximum likelihood (ML) method. The methods yielded similar model fits (∆i ≤ 1), therefore the default REML method was used to estimate all subsequent models. Secondly, an unconditional growth model was tested, which served as a baseline model to explore whether the growth curves were linear or curvilinear, and thirdly, two higher-order polynomial models (quadratic and cubic, respectively) were fitted to investigate whether the rate of change accelerated or decelerated across time. In this way, we found that models including only the linear term were by far superior to those with the quadratic and cubic growth curve parameters (∆i > 10). Fourthly, a conditional model was formed to determine whether the variables of initial age, gender, the cumulative total of sessions actually attended by the given child thus far, the time interval since any previous testing, and kindergarten/primary school attendance were related to the growth parameters (i.e. initial status and linear growth). It transpired that the best model fit was obtained when only the main effects of all of these independent variables (IVs) were retained in the model (∆i in the order of hundreds). 
Thus, the interactions of time and IVs, as well as those among the IVs, were omitted from the subsequent models. Finally, several covariance structure models were explored to assess the error covariance structure of the longitudinal data. Several structures yielded an equally good model fit, namely heterogeneous compound symmetry (CSH), diagonal (DIAG), Huynh-Feldt (HF), unstructured (UN), and variance components (VC). Results using the UN model are reported, but the other models yielded similar results both in terms of statistical significance and effect size. The intercept and linear slope were allowed to vary across individuals. Missing data was handled through pairwise/listwise deletion. This procedure was followed to model the effects on each of the three olfactory scores (identification, discrimination, and threshold). For t-tests, a standardised measure of effect size, Cohen’s d, was calculated after Cohen (1988). It is the difference between two means expressed in units of standard deviations. Hence, when d = 1, the two groups’ means differ by one standard deviation. For mixed models, Cohen’s f2 was computed using SAS PROC MIXED according to Selya et al. (2012). Cohen’s f2 for a given IV is a ratio of the proportion of the variance in the DV uniquely explained by the IV to the proportion of the variance in the DV unexplained by any variable in the model. Global effect sizes across the waves (i.e. for the overall model) are reported, as well as local ones within the individual waves. According to Cohen’s (1988) guidelines, f2 ≥ 0.02, f2 ≥ 0.15, and f2 ≥ 0.35 represent small, medium, and large effect sizes, respectively. Values of Cohen’s f2 < 0.02 are below the recommended minimum effect size representing a “practically” significant effect for social science data (Ferguson 2009), which is why the exact values are not reported. Ninety-five percent confidence intervals (95% CIs) for the estimates were taken from the SPSS output. 
CIs can be interpreted in various ways (e.g. Cumming 2014). Here, we favoured the one stating that a 95% CI is an 83% prediction interval for the effect size estimate of a replication experiment (Cumming and Maillardet 2006).
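Two of the quantities used in the model comparison and effect size steps described above, delta AIC and Cohen’s f2, can be sketched in Python. This is a minimal illustration, not the SPSS or SAS code used in the study: the function names, the AIC values, and the R2 values are hypothetical, and the f2 function takes the variance explained by the full model and by the model without the IV of interest as inputs, leaving aside how those proportions are obtained from a mixed model.

```python
def delta_aic(aic_by_model):
    """Return {model: AIC_i - AIC_min} for a dict of AIC values."""
    aic_min = min(aic_by_model.values())
    return {name: aic - aic_min for name, aic in aic_by_model.items()}

def support(delta):
    """Verbal label following the rule of thumb as quoted in the text."""
    if delta < 2:
        return "substantial"
    if 3 <= delta <= 7:
        return "considerably less"
    if delta > 10:
        return "very unlikely"
    return "intermediate"   # ranges the rule, as stated, does not cover

def cohens_f2(r2_full, r2_reduced):
    """Variance uniquely explained by the IV over variance left unexplained by the full model."""
    return (r2_full - r2_reduced) / (1 - r2_full)

def f2_label(f2):
    """Cohen's (1988) benchmarks as listed in the text."""
    if f2 >= 0.35:
        return "large"
    if f2 >= 0.15:
        return "medium"
    if f2 >= 0.02:
        return "small"
    return "below small"

# Hypothetical AIC values for three growth models (illustrative only)
aics = {"linear": 1000.0, "quadratic": 1012.5, "cubic": 1015.0}
deltas = delta_aic(aics)
labels = {name: support(value) for name, value in deltas.items()}

# Hypothetical proportions of explained variance with and without one IV
f2 = cohens_f2(r2_full=0.30, r2_reduced=0.25)
```

In this made-up example the linear model has a delta AIC of 0 and the polynomial models exceed 10, mirroring the pattern reported for the growth curves; the illustrative f2 of about 0.07 would count as a small effect.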

The other data analytic strategy involved a repeated-measures analysis with time-dependent (time-varying) covariates. Waves represented the repeated-measures effect, gender was treated as a fixed factor and the child’s age on the given testing occasion, the cumulative total of sessions actually attended by the given child thus far, the time interval since any previous testing, and kindergarten/primary school attendance as individual-level covariates that were also measured across the waves. Again, the model with the best fit included only the main effects of all the IVs. The residual covariance matrix structure was diagonal with heterogeneous variance, which is the default covariance structure for repeated effects. The model was again run separately for each of the three olfactory scores. Since the results matched those obtained with the first analytic strategy both in terms of statistical significance and effect size, they are not reported in the paper.

To model the relationships on a larger dataset, we then re-ran the analyses on imputed data (N = 587). Data imputation was performed with the missForest package (Stekhoven and Bühlmann 2012), available from the Comprehensive R Archive Network (CRAN) and run in R (R Development Core Team 2008). Recommended particularly for conducting multiple imputation of mixed data (numeric and factor variables in one data frame) (Starkweather 2014), it has been compared to other imputation methods and found to have the least imputation error for both continuous and categorical variables and the smallest prediction difference (error) (Waljee et al. 2013). Default settings were used. Imputation of the DVs was only carried out when the IVs were available. Explorations of imputed data revealed that the only difference from non-imputed data was in the link between the cumulative total of sessions attended and the time interval from any previous testing (non-imputed: r = 0.08 [− 0.03, 0.19], p = 0.23, N = 262, R2 < 0.01; imputed: r = 0.27 [0.18, 0.33], p < 0.001, N = 587, R2 = 0.07).

Finally, as some of the previous studies suggested that olfactory testing only becomes meaningful in children around 6 years of age (e.g. Hummel et al. 2007a), we arbitrarily filtered out data on children under 50, 60, and 70 months of age, respectively, and reanalysed the data. It transpired that the results did not change in terms of statistical significance or effect size. Also, the olfactory data of these children did not manifest themselves as outliers or influential cases. Therefore, they were retained in the study.

Results

Individual Growth Curve Models: Influence of Time, Age, Gender, and Experiential Factors on Olfactory Measures

Non-imputed Data

The sample sizes for non-imputed data were N = 246 for odour identification and discrimination, and N = 204 for odour threshold. As detailed in Table 2, higher odour identification scores were linked to being a girl, having attended fewer testing sessions, and shorter intervals between the sessions. Across the waves, global sizes of all the effects, both significant and non-significant, barely qualified as small (Cohen’s f2 < 0.02). Within the waves, the effect sizes were also small, varying between < 0.02 and 0.08. To explore the effect of the number of sessions already attended, we ran paired-sample t-tests comparing baseline and session 2, session 2 and 3, 3 and 4, and 4 and 5 (i.e. baseline scores and scores in children who had attended one, two, three, or four sessions, irrespective of wave). These revealed a significant initial increase in odour identification scores between baseline (7.33 ± 1.91) and session 2 (7.82 ± 1.94), t (118) = 2.47, p = 0.02, Cohen’s d = 0.25. This rise was followed by fluctuations which were non-significant, with a mean difference ranging from − 0.07 to 0.11, Cohen’s d = 0.04 to 0.06. This means that between baseline and session 2, there was a 90% overlap between the two distributions of odour identification scores and a 98% overlap between the other distributions.
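The step from the reported means and standard deviations to Cohen’s d and then to a percentage overlap can be illustrated with a short sketch. The paper does not state which formulas were used, so two assumptions are made here: d is computed with the root mean square of the two standard deviations as the standardiser, and overlap is taken as the overlapping coefficient of two equal-variance normal distributions, 2Φ(−|d|/2). The function names are illustrative only. Under these assumptions, the baseline versus session 2 figures above are recovered.

```python
import math


def cohens_d(mean_a, sd_a, mean_b, sd_b):
    """Standardised mean difference (assumed standardiser: RMS of the two SDs)."""
    pooled_sd = math.sqrt((sd_a ** 2 + sd_b ** 2) / 2.0)
    return (mean_b - mean_a) / pooled_sd


def normal_cdf(x):
    """Cumulative distribution function of the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def overlap(d):
    """Overlapping coefficient of two unit-variance normal curves whose means differ by d."""
    return 2.0 * normal_cdf(-abs(d) / 2.0)


# Odour identification, baseline vs. session 2 (values taken from the text above)
d = cohens_d(7.33, 1.91, 7.82, 1.94)
print(f"d = {d:.2f}, overlap = {100 * overlap(d):.0f}%")
```

The same overlap function can be applied to any other Cohen’s d reported in the paper, provided the equal-variance normal assumption is considered acceptable for the scores in question.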

Table 2 F-statistics, p values, estimates, 95% confidence intervals (95% CIs), and effect sizes global (i.e. across waves) and local (i.e. within waves) for the effects of time, age at initial testing, gender, cumulative total of sessions attended, interval since last testing, and kindergarten/school attendance on olfactory measures in non-imputed and imputed data

Girls also outperformed boys on the test of odour discrimination. The effect sizes across the waves did not exceed Cohen’s f2 = 0.02 and within the waves varied between < 0.02 and 0.29, indicating small to medium effects. Odour threshold was not affected by any of the IVs, with Cohen’s f2 < 0.02 across the waves, and mostly varying between 0.01 and 0.19 within the individual waves. Figures 2, 3, and 4 show component plus residual plots in non-imputed data modelling the residuals of time, centred initial age, the cumulative total of sessions attended, and the time interval since any previous testing session against odour identification, discrimination, and threshold, respectively. Online Resource 3 gives item-by-item performance on odour identification and discrimination across the testing occasions, irrespective of wave.

Fig. 2 Component residual plots for the regression of odour identification on the residuals of time, centred initial age, the cumulative total of sessions attended, and the time interval since any previous testing in non-imputed data. The time and age data are given in months. The fitted black dashed lines represent a linear relationship and the loess fit lines are shown in grey. Deviations of the loess fit lines from the linear ones indicate non-linearity. Effect size for all of the effects was small (Cohen’s f2 < 0.15)

Fig. 3 Component residual plots for the regression of odour discrimination on the residuals of time, centred initial age, the cumulative total of sessions attended, and the time interval since any previous testing in non-imputed data. The time and age data are given in months. The fitted black dashed lines represent a linear relationship and the loess fit lines are shown in grey. Deviations of the loess fit lines from the linear ones indicate non-linearity. Effect sizes were small (Cohen’s f2 < 0.15)

Fig. 4 Component residual plots for the regression of odour threshold on the residuals of time, centred initial age, the cumulative total of sessions attended, and the time interval since any previous testing in non-imputed data. The time and age data are given in months. The fitted black dashed lines represent a linear relationship and the loess fit lines are shown in grey. Deviations of the loess fit lines from the linear ones indicate non-linearity. Effect sizes were small (Cohen’s f2 < 0.15)

Imputed Data

In imputed data (N = 587), odour identification scores linearly increased with time and were higher in children who entered the study at a higher age than their peers, in girls, in children who had attended fewer sessions in shorter intervals, and in those who had already started school. However, effect sizes both across and within the waves were only small, varying between 0.02 and 0.12, or barely qualified as such (< 0.02). The only exception was the effect of pre-school/school attendance within Wave 4 (Cohen’s f2 = 0.23). Odour discrimination performance, which also increased further into the study, was better in older children, in girls, and in schoolchildren. Effect sizes across the waves remained small, not exceeding Cohen’s f2 = 0.06, and those within the waves ranged between < 0.02 and 0.20, indicating small to medium effects. Finally, older children and those already attending school showed higher olfactory sensitivity (odour threshold scores) than younger ones and pre-schoolers. Effect sizes across the waves varied between < 0.02 and 0.08 and those within the individual waves between < 0.02 and 0.22.
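For readers less familiar with Cohen’s f2, the following minimal sketch shows how such values map onto the verbal labels used above. It assumes the conventional benchmarks from Cohen (1988), that is, 0.02 for a small, 0.15 for a medium, and 0.35 for a large effect, and the textbook conversion f2 = R2 / (1 − R2). The function names and the label "below small" are illustrative and do not come from the paper, and the sketch does not reproduce how the global and local effect sizes in Table 2 were derived.

```python
def f2_from_r2(r2):
    """Convert a proportion of explained variance (R squared) into Cohen's f squared."""
    if not 0.0 <= r2 < 1.0:
        raise ValueError("R squared must lie in the interval [0, 1)")
    return r2 / (1.0 - r2)


def classify_f2(f2):
    """Label an f squared value using Cohen's (1988) conventional benchmarks."""
    if f2 < 0.02:
        return "below small"
    if f2 < 0.15:
        return "small"
    if f2 < 0.35:
        return "medium"
    return "large"


# Values quoted in the paragraph above
for value in (0.01, 0.12, 0.20, 0.23):
    print(value, classify_f2(value))
```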

Discussion

To the best of our knowledge, the present study represents the first longitudinal examination of the development of children’s olfactory abilities. We expected that older children would outperform younger ones and that olfactory scores would increase with time, would be higher in girls than in boys, and that children who had attended more testing occasions in shorter intervals would outperform those who had participated in fewer sessions over a wider time span. Overall, we observed that girls outperformed boys on the odour identification and discrimination tests but these effects were small in Cohen’s (1988) terms. Also, having attended fewer sessions in shorter intervals improved odour identification performance, although the effect was small. This was true for both non-imputed and imputed data. The other small effects of time, initial age, and pre-school/school attendance gained statistical significance only after data imputation. Hence, in terms of effect size, which should be the focus of interpretation (e.g. Cumming 2012; Cumming 2014), non-imputed and imputed data produced very similar results.

The present study adds to the bulk of data provided by cross-sectional studies that show an age-related increase in olfactory performance in the first two decades of life (e.g. Doty et al. 1984; Hummel et al. 2007b; Hummel et al. 1997; Kobal et al. 2000; Sorokowska et al. 2015). However, the effects were rather minute or barely qualified as small. A possible explanation is that a greater variation in age at the commencement of the study, a longer time span than 2 years, and/or longer, possibly irregularly spaced intervals between the sessions were needed to capture any changes in the olfactory scores. Indeed, one of the issues to consider when designing a longitudinal study is to allow for enough repeated observations to recognise the change. Despite the analytical implications associated with the unequal spacing of observations, under certain circumstances, it is actually preferable (Ployhart and Vandenberg 2010). An argument in favour of irregularly spaced repeated olfactory measures would be that children’s cognitive development, in general, is anything but gradual and linear. Rather, it seems analogous to overlapping waves (Siegler 1996), whereby development is characterised by changes in the distributions of strategies children use for problem solving. It appears that specific cognitive abilities mutually cause each other across the developmental life course under a particular profile of environmental constraints, as posited, for example, by the dynamical model of specific abilities (van der Maas et al. 2006). If olfactory abilities can be thought of as a specific category of cognitive abilities (e.g. McGrew 2005, 2009), their development should follow similar general principles. 
Hence, future studies should carefully consider major developmental milestones in cognitive abilities that are known to affect olfactory perception, co-examine other sensory modalities that could be expected to influence olfaction the most at the given age, and design the duration and spacing of intervals accordingly. Also, development over time likely needs a longer observation period than 2 years.

The effect of gender on odour identification and discrimination is another finding routinely reported in the literature on the development of the sense of smell (Ferdenzi et al. 2008a; Ferdenzi et al. 2008b; Renner et al. 2009; Richman et al. 1992; Stevenson et al. 2007; van Spronsen et al. 2013). There are nonetheless many studies in which it was not observed (Cameron and Doty 2013; Dzaman et al. 2013; Martinec Nováková and Vojtušová Mrzílková 2016a, 2016b; Richman et al. 1995a; Saxton et al. 2014; Schriever et al. 2014; Sorokowska et al. 2015). This inconsistency may stem from differences in sample size, on which statistical significance depends (e.g. Cumming 2012). The effect of gender nevertheless tends to be quite small across studies. Perhaps even more importantly, sex or gender takes on many meanings, e.g., chromosomal, hormonal or endocrine, gonadal, genital, body-type sex, sex of assignment and rearing, brain sex/gender, social and psychological gender (Fausto-Sterling 2012; Karkazis 2008; Zderic et al. 2002), and individuals may or may not develop in sex-typical ways and behave in a gender-conforming manner. This absence of a clear-cut distinction between males and females has been reported in olfaction as well (Nováková et al. 2013). Hence, the focus should shift from a mere search for so-called gender or sex differences to specific factors that may influence olfactory performance and actually contribute to this aspect of inter-individual variation. Research indicates that verbal fluency is one such factor in children (Monnery-Patris et al. 2009; Richman et al. 1992; Richman et al. 1995b) and adults (Öberg et al. 2002). For example, Monnery-Patris et al. (2009) reported that the gender effect vanished when verbal proficiency (verbal age and olfactory verbal fluency) was controlled for. Superior performance of females on verbal fluency tasks has been demonstrated in adults (Halari et al. 2006) and children (Anderson et al. 2001b), although there is also considerable within-gender variability (Rahman et al. 2003). Individuals with higher verbal fluency scores tend to outperform low-scoring ones on the task of odour identification (Larsson et al. 2000). Nevertheless, it is important to recognise that sources of variation, particularly in repeated assessments, are likely to be multifactorial and usually do not relate to a single construct. As this was an initial study, we did not delve into the actual factors behind the proxy of gender, but there is clearly a knowledge gap that needs to be addressed for us to fully understand the development of gender- or sex-related inter-individual variation in olfactory perception.

In longitudinal studies, a key element requiring careful consideration is the practice effect, whereby performance on re-testing improves as a result of previous exposure to the same or a similar neuropsychological measure rather than an actual change in the individual’s ability (e.g. Collie et al. 2003). Tests with a single solution, such as the odour identification and discrimination tasks in the present study, are more likely to exhibit significant practice effects upon repeated testing (Basso et al. 1999; McCaffrey et al. 1992). This is particularly the case when content is repeated from the original test to the next (Krumboltz and Christal 1960; Kulik et al. 1984). However, scores may increase on re-test even when different items are employed (Benedict and Zgaljardic 1998; Wilson et al. 2000) because participants learn how to handle the task as such more effectively.

Further, numerous studies have shown the critical role of repeated testing in consolidating learning (Karpicke and Roediger 2008; Roediger and Butler 2011), including several that are now considered classics (Carrier and Pashler 1992; Gates 1917; Glover 1989; Jones 1923-1924; Spitzer 1939; Tulving 1967). Thus, a testing occasion may, in fact, represent a learning opportunity with a potentially greater impact than that of olfactory perceptual learning within the everyday olfactory environment we originally meant to address. Even though in the present study children were not corrected upon making an incorrect choice or given any feedback on their performance, the powerful mnemonic benefits of retrieval practice during the testing cannot be ruled out. This is because retrieval practice is often effective even without feedback (for review see Roediger and Butler 2011). Hence, there was an expectation based on multiple lines of evidence from previous studies that multiple testing might positively affect children’s scores on subsequent occasions.

We, however, observed the exact opposite: the children performed worse after having attended more sessions, although the size of the effect was small. The decrease in performance may have arisen because young children often exhibit poor self-regulatory capacities (for review see McCabe et al. 2004) and thus are more heavily influenced by the external environment than older children (López et al. 2005; Mahone 2005). Familiarity with test stimuli may have negatively affected performance in subsequent waves if the stimuli were perceived as less novel, interesting, and stimulating, resulting in reduced attention and interest (Courage et al. 2006; Sheese et al. 2008).

To analyse the practice effect in greater detail, one must acknowledge that gains (or losses) in test scores may not be equivalent across multiple testing sessions and may be modulated by the length of intervals between the occasions. For instance, it has been found that with brief intervals, gains in scores stabilise after an initial practice effect on measures of attention and processing speed (Falleti et al. 2006). In tests with a ceiling effect, the strongest practice effects have been found between the first and second testing with little to no improvement thereafter (Benedict and Zgaljardic 1998; Ivnik et al. 1999). Here, we did not observe any ceiling or floor effects in any of the tests.

In the present study, odour identification scores, which showed a significant main effect of the cumulative total of sessions attended in the LMM, initially increased and then fluctuated, but the mean differences were less than half a point and the effects were small or barely qualified as such. This effect was statistically non-significant in the LMM analyses on non-imputed odour discrimination data. However, additional explorations using paired-sample t-tests revealed that odour discrimination also significantly improved by almost two points at the second testing occasion and that the effect bordered on large. It stagnated after that (session 2 vs. 3, mean difference < 0.5 point) and eventually dropped (session 3 vs. 4, small effect, mean difference = 0.97 point). Odour threshold also showed no main effect of the number of sessions in the LMM analyses but explorations with paired-sample t-tests yielded an initial increase by about half a point in odour sensitivity that was not statistically significant (but note that N = 70) followed by a stagnation (a mean difference of less than 0.1 point), with Cohen’s d < 0.02. This means a 99% overlap of the distributions. Similar findings were obtained when the covariate of the time interval between the testing occasions was included in the models. Thus, it can be cautiously concluded that the positive influence of repeated exposure to the same testing format and stimuli was limited to the session immediately following the baseline. In the following sessions, the scores reached a plateau or slightly dropped. Finally, the positive influence of schooling on cognitive development is well known (Rutter 1985), although in children who are not socially challenged, such as those in the present sample, the effects are far less noticeable than in children of lower socioeconomic status (Downey et al. 2004; Raudenbush 2009).

An additional consideration for repeated cognitive testing is that the degree of gain may vary for different tests. To explore this possibility, we calculated differences in scores between baseline and session 2, sessions 2 and 3, 3 and 4, and 4 and 5 irrespective of wave for each olfactory task. It transpired that between the baseline and the second session, children improved significantly more on the test of odour discrimination than on that of odour identification (Cohen’s d = 0.61) or threshold (Cohen’s d = 0.40). This means an overlap of 76 and 84%, respectively. Also, between sessions 3 and 4, the drop was more marked for odour discrimination scores compared to those for identification (Cohen’s d = 0.47, 81% overlap).

Further, the factors contributing to change may vary across tasks. Also, whilst certain variables may affect baseline performance on a given measure, change in performance on that measure can be influenced by others (Attix et al. 2009). The present sample size only allowed us to explore these ideas with the simplest univariate models. It transpired that when change was operationalised as a difference between baseline and after having attended one session, one vs. two sessions, and so on, the time elapsed from the preceding session seemed the most important factor in each of the three olfactory measures, albeit after a different number of exposures. Yet another issue, which we were unable to explore within the present study, is that with repeated assessments, a given task may begin to target a different cognitive function than initially intended. We present these explorations more as ideas to be addressed as proper hypotheses in future studies.

In paediatric populations, such considerations are of particular consequence because, in children, variability in performance may be inversely related to age, that is, variability can decrease as children grow older. At the very least, this would be true for measures of cognitive functions that can be modelled as linear and exhibit limited floor and ceiling effects. Nevertheless, to complicate matters, linear models of cognitive development are often an over-simplification because brain maturation proceeds in rapid developmental progression during growth spurts. Hence, during periods of accelerated development, a temporary rise in variance may be observed in at least some measures (Anderson et al. 2001a; Huizinga et al. 2006). In the present study, variability dropped significantly in odour threshold scores between the first and last wave, which was probably attributable to children’s better ability to handle this demanding task. Nevertheless, threshold testing, along with the odour identification and discrimination tasks, could be carried out effectively enough even in children under the age of 6. Their data did not represent outliers and additional analyses from which they were omitted led to results similar in both statistical significance and effect size to those reported above.

Future studies may also find it useful to employ statistical methods that may facilitate the interpretation of change, such as reliable-change index (RCI) scores (Iverson 2011) or standardised regression-based (SRB) change scores (McSweeny et al. 1993). Moreover, to correct for the initial practice effects on odour identification and discrimination, two or more baseline assessments may be considered in future studies (McCaffrey and Westervelt 1995).
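To indicate what an RCI calculation might look like for a single child, the sketch below follows the common Jacobson and Truax formulation, in which the retest difference is divided by the standard error of the difference, with an optional subtraction of the mean practice effect. This is an assumption about the formula, not necessarily the variant described by Iverson (2011). The baseline SD (1.91) and the mean practice gain (0.49 points) are taken from the odour identification results above, whereas the test-retest reliability of 0.70 and the child’s scores are purely hypothetical.

```python
import math


def reliable_change_index(score_1, score_2, baseline_sd, retest_reliability,
                          practice_effect=0.0):
    """Jacobson-Truax style RCI, optionally corrected for a mean practice effect."""
    # Standard error of measurement of a single score
    sem = baseline_sd * math.sqrt(1.0 - retest_reliability)
    # Standard error of the difference between two scores
    s_diff = math.sqrt(2.0) * sem
    return (score_2 - score_1 - practice_effect) / s_diff


# Hypothetical child scoring 7 at baseline and 10 at session 2;
# the reliability of 0.70 is an assumed value, not one reported in the paper.
raw_rci = reliable_change_index(7, 10, baseline_sd=1.91, retest_reliability=0.70)
adjusted_rci = reliable_change_index(7, 10, baseline_sd=1.91, retest_reliability=0.70,
                                     practice_effect=0.49)
print(f"raw RCI = {raw_rci:.2f}, practice-adjusted RCI = {adjusted_rci:.2f}")
```

An absolute RCI above 1.96 is conventionally read as change unlikely to be due to measurement error alone. In this hypothetical case the raw index exceeds that criterion but the practice-adjusted one does not, which shows how correcting for practice can alter the interpretation of an individual gain.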

One of the limitations of the present study was the amount of missing data and related issues, such as the higher proportion of pre-schoolers in Wave 4 compared to Wave 3, similar mean age of children participating in these two waves, and the low N in Wave 4. Collection of data within Wave 4 took place in May and June 2012, which was towards the busy end of the school year. As school principals, teachers, and parents each time provided a one-time permission only and had to be asked again within the next wave, sometimes they responded in the negative to our request. This was mostly because school teachers had busy curricula to follow and children’s absence from classes due to testing would have been a major complication. This posed less of a problem for kindergarten teachers, who were generally more relaxed about the testing and more interested or even enthusiastic about co-operating. Besides, pre-schoolers tended to have similar schedules and could be approached within the kindergarten on almost any given day, as opposed to school children. When the first participants started school during Wave 3, they turned out to be difficult to reach because of their busy curricula and after-school activities. Also, after the majority of children in a given kindergarten or school had been tested, we did not have the luxury of coming once again to test those children who were previously absent or ill because the principals would not allow it. Parents of children who could not be tested at school were then invited to visit our laboratory, but they often found it logistically inconvenient. Future studies could overcome many of these issues by recruiting children from institutions with more relaxed curricula, such as outdoor kindergartens and elementary schools. 
This approach might also allow ecologically (externally) valid observation of olfactory behaviours, even though the generalisation of findings may be limited as such institutions are generally attended by children from specific backgrounds.

Conclusion

In the present study, we investigated the development of children’s olfactory abilities over five waves which took place every 6 months. Odour identification and discrimination but not sensitivity were on average higher in girls, but the effect was small. Further, children’s performance on the odour identification task was affected by the cumulative total of sessions attended and the time elapsed since any previous testing. After data imputation, other small effects gained statistical significance, namely those of time, initial age, and pre-school/school attendance. In terms of effect sizes, non-imputed and imputed data yielded very similar results. Despite the small magnitude of the effects reported in this paper, the unexpected findings, particularly regarding the practice effects, warrant replication and extension in longitudinal studies carried out over a longer time span.