1 Background

The vocalic contrasts and harmony processes found in Khalkha Mongolian have been discussed for over 50 years in the Western (generative) tradition starting with Lightner (1965), Binnick (1969, 1980), Odden (1980), Steriade (1979), Hamp (1980), Rialland & Djamouri (1984), Svantesson (1985), and more recently Svantesson et al. (2005), Vaux (2009), and Ko (2011, 2019), among others. An important milestone in this line of research was Svantesson’s (1985) discovery that the presumed front rounded vowels /ü/ and /ö/ of the Classical language are realized as back vowels in several modern varieties including Khalkh. In some Mongolian dialects these derived back vowels merged with the corresponding original back round vowels /u/ and /o/; but in others, including Khalkh, merger was avoided by lowering /u/ and /o/ in the vowel space. Svantesson made the intriguing proposal that the lowering of the original back vowels reflected a reorganization of the phonological opposition between /ü/ and /ö/ versus /u/ and /o/ from [±back] to a contrast in pharyngeal width. The /ü,ö/ vowels were reassociated with the Advanced Tongue Root [ATR] feature and /u,o/ with Retracted Tongue Root [RTR]. It should be noted that Svantesson (1985) actually uses the feature [pharyngeal] from Wood (1979) implemented by the hyoglossus and pharyngeal constrictors; but he aligns this distinction with the [ATR] feature introduced by Stewart (1967) and associated with pharyngeal width via positioning of the tongue root in Lindau’s (1975, 1979) X-ray studies of Akan. Svantesson’s hypothesis was supported by the classification of the vowels as ‘female’ lax (eme) versus ‘male’ tense (ere) in the native Mongolian literature. According to Chinggeltei (1979) the tense vowels are produced with the upper part of the throat tensed and the tongue root retracted. 
The positions in the vowel space implied by the backing and lowering ‘rotation’ of the Modern Mongolian round vowels have been accepted in most subsequent studies. More recently a controversy has arisen as to whether it is the contrast in tongue-body backness or the tongue-root posture that represents the original contrast: Vaux (2009), Ko (2011, 2019), and Ko et al. (2014) see the Mongolian tongue-root expression of the contrast as part of a more general Northeast Asian sprachbund also observed in Tungusic and Kamchatkan languages, and possibly Korean as well, and treat the oral-cavity front-back tongue-body articulation as an innovation, primarily of the Turkic languages.

While the phonetic correlates underlying the ATR harmony contrasts in (West) African languages have been relatively well studied (Ladefoged 1964, Lindau 1979, Fulop et al. 1998, Hess 1992, Guion et al. 2004, Przezdziecki 2005, Starwalt 2008, among others; see Ko 2019 for a helpful summary), the same is not true for the East Asian languages. The few investigations that do exist are often based on just a single speaker, with minimal indication of the words comprising the corpus, the recording methods, and the statistical analyses of the data: Rialland and Djamouri (1984), seven vowel phonemes recorded in isolation (9–12 repetitions) by one male Khalkh speaker; Svantesson (1985), 10 words with three repetitions from two Inner Mongolian male speakers and one male Khalkh speaker; Kang and Ko (2011), one male speaker of Western Buriat and one female speaker of Tungusic Ewen. This paper is an attempt to help fill this gap, at least with respect to the Mongolian variety with the largest number of native speakers: the Khalkh dialect of the capital Ulaanbaatar, whose population of 1.5 million comprises roughly one half of the entire nation according to a recent (2019) census (Wikipedia.org). Our results broadly but not completely align with earlier findings. We add value to the investigation of the phonetic basis of the Khalkh vowel contrasts by replicating earlier results, especially as they relate to harmony, with a larger number of speakers representing several generations, a larger number of test items, and consistent statistical analyses of the three parameters of contrast defining the Mongolian vocalic inventory: length, height and backness, and the volume/width of the pharyngeal cavity. Our study breaks new ground by investigating a larger number of phonetic correlates to the contrast between the female /e,ö,ü/ and male /a,o,u/ vowels.
Like almost all previous studies, our data are acoustic: duration (for phonological length), formants (for vowel quality), and phonation (for the pharyngeal-laryngeal articulators). The articulatory correlates to the phonological contrasts remain an outstanding gap that needs to be filled in future research.

2 Phonemic vowel system

In the Mongolian phonological literature the vowel phonemes are transliterated from the Cyrillic alphabet with the customary five /i,u,e,o,a/ letters as well as /ү/ and /ɵ/ for the rotated front rounded vowels of the Classical language. For the sake of typographic convenience and in conformity with the traditional literature, we represent these elements with the umlaut sign (i.e. as /ü/ and /ö/, respectively) rather than a particular IPA symbol since their phonetic realization is a point of controversy and a research question. This seven-vowel system is crossed for length to yield fourteen vowel phonemes. There is also a diphthong /aj/ which, for our speakers, has coalesced into a long low/mid front vowel /æ:/. We put this vowel aside for the moment and return to it in Sect. 5.2. As seen in Table 1, the Mongolian vowel system (Classical or Modern Khalkh) is distinguished by three basic phonological features: front versus back or ATR versus RTR, high versus nonhigh, and round versus nonround. The /i/ lacks a back/RTR counterpart.

Table 1 Phonological contrasts comprising the Khalkh vowel inventory

As in most Altaic languages, vowel harmony in Khalkha Mongolian is a root-based progressive assimilation process (there are no prefixes). The /i/ is a ‘neutral’ vowel that does not alternate while the front/ATR /e,ü,ö/ alternate with their back/RTR counterparts /a,u,o/ as a function of the rightmost stem vowel. The neutral /i/ is transparent so that harmony appears to skip over this segment. When it is the sole stem vowel, then suffixes take the /e,ü,ö/ set vowels. In contrast to the Classical language, Khalkha Mongolian has an additional labial-harmony process. It is also progressive and rounds a following nonhigh vowel when the stem contains a round vowel which is also nonhigh (/ö/ or /o/). Svantesson (1985) notes that labial harmony only occurs in Mongolian varieties that have shifted to the tongue-root/pharyngeal based harmony. Table 2 illustrates the vowel-harmony alternations with ablative case forms; the ablative suffix has a lexically determined n ≈ 0 alternation.

Table 2 Ablative case forms illustrating vowel harmony

3 Outline of the study

As mentioned above, our study was designed to investigate the acoustic correlates to the phonological contrasts of length, vowel height and backness, and pharyngeal width in the Khalkh dialect. We first describe the methods: our speakers and the word list comprising our corpus; the recording equipment and types of data analysis. We then present our results in three successive sections: length, formant measures, and phonation/voice quality, each followed by discussion of the findings. The paper closes with a brief summary of the overall results.

4 Methods

Our data were provided by five native speakers who were born and raised in the Mongolian capital Ulaanbaatar. They represent three generations: younger (two females c. 25 years of age), middle-aged (one male and one female both c. 50 years of age), and elderly (one female c. 75 years of age). A set of five monosyllabic CVC nouns for each of the fourteen vowel phonemes as well as for the coalesced diphthong /aj/ > /æ:/ was assembled (see Appendix 3). Before recording we went over the word list with each speaker to make sure they were familiar with the words. In a couple of cases, a word was unfamiliar to one of our speakers and so an alternative was substituted. The 75 words comprising the speaker’s list were randomized and then successively elicited in four case forms: nominative (identical to stem), accusative -iig, ablative -AAs, and comitative/possessive -taj. The speakers each read the same randomized word list with the minor substitutions noted above. Our data corpus thus comprises 1500 words: (14 + 1) vowels * 5 word stems * 4 case forms * 5 speakers. The recordings were made in a sound-insulated booth with a Shure SM10A Unidirectional Head-Worn Dynamic Microphone and a USB Pre 2 Preamp at a sampling rate of 44.1 kHz and quantizing resolution of 16 bits. The resulting sound files were analyzed in Praat (Boersma & Weenink 2019) with segmentation and labeling by hand based on auditory and visual inspection of the spectrograms for large changes in the waveform indicative of the transition to and from the nuclear vowel with steady F1. Duration and formant measurements based on vowel midpoint were collected by Praat scripts. The resultant data were stored in Excel files and then analyzed in R (4.1.3) with mixed-effects linear regressions (speaker and word as random factors). At the suggestion of a reviewer, speaker age and gender were entered as fixed effects for the phonation correlates. A t-value of 2.0 was taken as the threshold of statistical significance.
Formant measures were normalized with Z-scores based on the mean and standard deviation for each speaker. Presumed tongue-root correlates (H1-H2, H1-A1, H1-A2, HNR05, HNR15, HNR25, HNR35, as well as f0, F1, F2, and F3) were investigated with VoiceSauce (Shue et al. 2009) based on a 20 ms window centered on the vowel’s midpoint.
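
As an illustration only, the per-speaker Z-score normalization can be sketched in Python as follows. The function name and the toy F1 values are hypothetical and are not taken from the study; the sketch also assumes the sample standard deviation (n - 1 denominator), which the text does not specify.

```python
from collections import defaultdict
from statistics import mean, stdev


def zscore_by_speaker(rows):
    """Standardize each value against its own speaker's mean and SD.

    rows: list of (speaker, value) pairs.
    Assumption: sample standard deviation (n - 1 denominator).
    """
    by_speaker = defaultdict(list)
    for speaker, value in rows:
        by_speaker[speaker].append(value)
    # One (mean, SD) pair per speaker.
    stats = {s: (mean(v), stdev(v)) for s, v in by_speaker.items()}
    return [(s, (x - stats[s][0]) / stats[s][1]) for s, x in rows]


# Hypothetical F1 values (Hz) for two speakers; not measurements from the study.
toy = [("S1", 300.0), ("S1", 500.0), ("S1", 700.0),
       ("S2", 400.0), ("S2", 600.0), ("S2", 800.0)]
normalized = zscore_by_speaker(toy)
```

Because each speaker is standardized separately, the two toy speakers end up with the same normalized values even though their raw formant ranges differ, which is the point of the normalization.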

5 Results

5.1 Duration

Our data file permits several research questions to be posed concerning the length of the stem vowel. First, vowel length can be affected by the number of syllables in the word. In particular, the vowel of a monosyllable is expected to be longer than in a disyllable. Indeed, this factor is often phonologized as a so-called “minimality effect”, as in Kansai Japanese where monomoraic words of the standard (Tokyo) dialect like [yu] ‘hot water’ are lengthened to two moras when spoken in isolation [yuu] (Ito 1990). Hence, other things being equal, we may expect the stem vowel in our Mongolian data to be longer in the zero-suffixed nominative case form where the word is monosyllabic compared to the same vowel in the accusative and comitative cases formed with the suffixes -iig and -taj, respectively, where the word is disyllabic. Another factor influencing the duration of a vowel is syllable structure: vowels are typically longer in open syllables compared to closed syllables (Maddieson 1997). The diachronic sound changes and synchronic phonological processes of Open-syllable Lengthening and Closed-syllable Shortening are reflexes of these phonetic factors. Other things being equal, we may expect the vowel of a CVC stem to be longer in the accusative where the stem-final consonant parses as an onset [CV.Ciig] compared to the comitative where that consonant closes the syllable of the stem [CVC.taj]. Finally, we can ask how the long versus short phonemic contrast might be affected by these three case-form contexts: will the phonemic long-short duration ratio remain stable under these associated shifts in word length and syllable structure?

Figure 1 indicates the log-transformed durations for the stem vowels in three case forms (nominative, accusative, comitative) as a function of the long versus short phonological contrast. In the bare stem nominative context, long vowels are overall more than twice the duration of short vowels: raw score means 350 versus 136 ms (ratio = 2.57). The stem vowels are shortened in the disyllabic accusative -iig; but the ratio between the long and short vowel categories remains comparable: raw score means 237 versus 90 ms (ratio = 2.63). The consonant-initial comitative -taj closes the stem vowel’s syllable leading to shorter stem-vowel durations; but the ratio between the long and short vowel categories remains largely unchanged: raw score means 186 versus 78 ms (ratio = 2.38). Also, the /æ:/ vowel derived from coalescence of earlier /aj/ has mean durations that are very close to the long-vowel categories: raw score means = 349 ms (nominative), 246 ms (accusative), 177 ms (comitative). A mixed-effects linear regression over the logs of the stem-vowel durations for the three case forms found significant differences between the baseline accusative and the nominative on the one hand (est = 0.15, t=4.57) and between the accusative and comitative on the other (est = − 0.10, t = − 2.86); see Appendix 1 Model 1.1 for details. The same pattern of results obtained when the data was split as a function of the phonemic length of the stem vowel: long vowels (est = 0.16, t = 11.42 for nominative and est = − 0.10, t = − 7.13 for comitative) and short vowels (est = 0.19, t = 7.33 for nominative and est = − 0.06, t = − 2.27 for comitative); see Appendix 1, Models 1.2 and 1.3 for details. Our data are thus consistent with the hypotheses that changing the word’s length from one to two syllables shortens the vowel of the stem as does changing the stem vowel’s syllable from open to closed. 
The test also suggested that the former effect is greater in magnitude and reliability compared to the latter. Finally, the ratio between the long versus short vowel contrast was largely stable across these different prosodic contexts.
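
The long-to-short ratios quoted above follow directly from the reported raw means, as the short Python sketch below shows. The variable names are illustrative. Note that the log gaps computed here are logs of ratios of raw means, which need not equal the differences between mean log durations used in the regressions.

```python
import math

# Raw mean durations (ms) reported in the text: (long, short) per case form.
means_ms = {
    "nominative": (350, 136),
    "accusative": (237, 90),
    "comitative": (186, 78),
}

# Long-to-short ratio per case form, rounded as in the text.
ratios = {case: round(long_ms / short_ms, 2)
          for case, (long_ms, short_ms) in means_ms.items()}

# On a log scale a stable ratio appears as a stable vertical gap.
log_gaps = {case: math.log(long_ms) - math.log(short_ms)
            for case, (long_ms, short_ms) in means_ms.items()}
```

A roughly constant gap on the log scale is what a stable long-to-short ratio looks like in a plot of log-transformed durations.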

Figure 1 Log-transformed long versus short stem-vowel durations (ms) in three case forms

5.2 Formants

Figure 2 shows the placement of the 15 stem vowels (normalized) in F1*F2 space based on their means in the citation form nominative case. F1 is indicated on the vertical axis and F2 on the horizontal axis.

Figure 2 F1 (vertical axis) * F2 (horizontal axis) plot of stem vowel means from citation form nominative

The following observations can be made concerning this Excel plot. First and most importantly the /ü/ and /ö/ counterparts to the Classical Mongolian presumed front round vowels have retracted to the back vowel region with F2 values that are on average lower than /a/. Second, the counterparts to the Classical Mongolian back round vowels have lowered in the vowel space so that /u,u:/ are below /ü,ü:/ and even below /ö,ö:/, respectively. The vowels /o,o:/ are much lower in the vowel space. Our results thus accord well with Svantesson’s (1985) original findings and confirm the striking chain-shift rotation of the round vowels that he discovered. It should be noted that Rialland & Djamouri (1984) independently reached much the same conclusion. Third, the short vowels /ü/ and /ö/ are located more towards the center of the vowel space compared to their long-vowel counterparts. On the other hand, there is little difference between the short versus long positions of the /i/, /a/ and /o/ vowels. This disparity might be taken to indicate that the backing of /ü/ and /ö/ is a sound change in progress with the short vowels lagging behind their long counterparts. Fourth, the front mid vowel /e/ is very close to /i/, especially when short: no significant differences were found for normalized F1 (est = − 0.06, t = − 0.99) and normalized F2 (est = 0.01, t = 0.04); see Appendix 1 Model 2 for details. The long vowel /e:/ is quite a bit higher in the vowel space compared to the back vowels /o,o:/, which sit roughly halfway between the rotated /u,u:/ and /a,a:/.
Finally, the coalesced [æ:] from /aj/ is roughly halfway between /e:/ and /a:/ in the normalized F1*F2 space; it occupies a position comparable to the lowered /o/ in vowel height and is well separated from /a:/. Since stem vowel /a/ and /aj/ > [æ:] both choose the same suffixal RTR harmonic alternants (cf. [pag-a:s] ‘team’ abl. and [xæ:r-a:s] ‘love’ abl.), the nonhigh RTR region has introduced a [± back] distinction in addition to the [±round] contrast that separates /a/ from /o/. The plot in Fig. 3 shows the realization of the vowel of the ablative suffix as a function of whether the stem vowel is /a:/ or [æ:] < /aj/. The former belong to a more compact region while the latter are more widely dispersed. A mixed-effects regression test predicting F1 and F2 for the ablative -a:s as a function of stem-vowel backness found a significant difference in normalized F2 (est = 0.16, t = 3.67) and a marginal difference in normalized F1 (est = − 0.22, t = − 2.04). See Appendix 1 Model 3 for details. These data suggest that the [±back] feature that distinguishes /a:/ from /æ:/ < /aj/ is being spread as an autosegmental dependent of RTR so that when the -aas alternant is derived by tongue-root harmony, it splits into a minor front versus back difference, as suggested in the plot. The data are admittedly quite noisy for the [æ:] < /aj/ case and so this is a point that requires further study.

Figure 3 Ablative suffixal vowel as function of /a:/ vs [æ:] < /aj/ stem vowel

As a final observation, we note that the Khalkh rotated /ü,ü:/ and /ö,ö:/ vowels have F2 values considerably lower than /a/ (especially when long). In this respect they differ from the front round vowels of Turkish (Korkmaz & Aytug 2018) as well as French and German (Strange et al. 2007), which have higher values. Figure 4 plots the round vowels in our corpus in F1*F3 space and shows that the third formant does a much better job of isolating the /ö,ö:,ü,ü:/ set than the second formant does. A mixed-effects linear regression found the difference in normalized F3 as a function of the /ö,ü/ versus /o,u/ contrast to be significant (est = 0.713, t = 6.24). For details see Appendix 1 Model 4. A similar F3 hierarchy, /i/ > /u/ > /y/ and /e/ > /o/ > /ø/, appears in French (Law 2008), German (Piroth et al. 2015), and Turkish (Korkmaz & Aytug 2018), suggesting that the lower F3 of the Khalkh /ü/ and /ö/ vowels relative to /u/ and /o/ may reflect their former front vowel status.

Figure 4 F1 (vertical axis) * F3 (horizontal axis) plot of round stem vowel means from citation form nominative

5.3 Voice quality

In order to investigate potential spectral offset and phonation correlates to the ATR-RTR contrast in our Khalkha Mongolian data, we utilized the VoiceSauce (Shue et al. 2009) Matlab program. The program’s options were set to their default values: the STRAIGHT algorithm (Kawahara et al. 1999) for the harmonics and the Snack Sound Toolkit (Sjölander 2004) for the formant frequencies. The analysis was restricted to the stem vowels in our data. The input sound files for each speaker were accompanied by a Praat textgrid marking off 20 ms intervals centered on the midpoint for each stem vowel in the corpus: hence 75 vowels * 4 case forms * 20 measurements (one per millisecond) * 5 speakers. In a few cases (c. 1% of the data) the program failed to make a measurement and so the resultant file contained 29,664 lines of data. The data were reduced by taking the average of the stem vowel across the 20 measurement points per word and then further reduced by taking the average across the four case forms. This resulted in a file of 377 items across the five speakers. The resultant data were then normalized with Z-scores based on the mean and standard deviation for each speaker. Two of the resultant items were removed because their Z-scores exceeded three standard deviations and were considered outliers. The data were analyzed with mixed-effects linear regression tests in R (4.1.3) with the ATR (baseline) versus RTR specification of the stem vowel as the main predictor variable (coded as VQ). At the suggestion of a reviewer, the gender and age of the speaker were also entered into the models as predictors. Female was the baseline for gender, and age had three levels: old, middle (baseline), and young. Speaker and word were specified as random intercepts and, in full models, by-speaker and by-word random slopes for VQ were included as well.
In cases where the full model did not converge, the random slope by speaker (and in a few cases the random slope by word as well) were dropped in order to obtain convergence.
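
The data-reduction steps described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the function names and the toy measurements are hypothetical, and the sketch does not reproduce the actual VoiceSauce output format.

```python
from collections import defaultdict
from statistics import mean

# Expected number of measurement rows: words * case forms * frames * speakers.
EXPECTED_ROWS = 75 * 4 * 20 * 5
OBSERVED_ROWS = 29_664                      # figure reported in the text
missing_share = 1 - OBSERVED_ROWS / EXPECTED_ROWS


def reduce_measurements(frames):
    """Average frames within each token, then tokens within each word.

    frames: list of (speaker, word, case_form, value) tuples, one per frame.
    Returns a dict keyed by (speaker, word).
    """
    per_token = defaultdict(list)
    for speaker, word, case_form, value in frames:
        per_token[(speaker, word, case_form)].append(value)
    per_word = defaultdict(list)
    for (speaker, word, _case_form), values in per_token.items():
        per_word[(speaker, word)].append(mean(values))
    return {key: mean(values) for key, values in per_word.items()}


def drop_outliers(zscores, limit=3.0):
    """Keep only items whose absolute Z-score does not exceed the limit."""
    return {key: z for key, z in zscores.items() if abs(z) <= limit}


# Hypothetical frames for one word in two case forms; not data from the study.
toy_frames = [("S1", "w1", "nom", 1.0), ("S1", "w1", "nom", 3.0),
              ("S1", "w1", "acc", 4.0)]
reduced = reduce_measurements(toy_frames)
```

Averaging first within tokens and then across case forms gives each case form equal weight even when some frames are missing for a token.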

Three groups of dependent variables were regressed on the VQ + gender + age predictor variables. First, the spectral drop-off factors that measure the difference in amplitude (dB) between the first and second harmonics (H1H2c) as well as the difference in amplitude between the fundamental and the harmonics located nearest the first, second, and third formants (H1A1c, H1A2c, H1A3c) were considered. Based on the ATR harmony literature, the Khalkh RTR vowels /a,o,u/ should be associated with less drop-off than the corresponding ATR vowels since ATR vowels are expected to have weaker amplitudes in the higher regions of the spectrum and hence they will diminish the value of H1 less. In other words, given that ATR is the baseline category in our regression models, the change to RTR should be negative (less drop-off) for these variables. The second class of factors we considered involves the harmonics-to-noise ratio (HNR) at various levels of the speech spectrum: HNR05 for 0–500 Hz, HNR15 for 0–1500 Hz, HNR25 for 0–2500 Hz, and HNR35 for 0–3500 Hz. If the RTR vowels involve pharyngeal narrowing we might expect greater turbulence and hence an increase in the noise factor compared to the vowels that advance the tongue root. Again, given that ATR is the baseline in our regression models, increased noise will lower the value of HNR and so we expect a switch from ATR to RTR to be associated with negative coefficients. The final group of factors we considered are the fundamental (sF0) and first three formants. We already know from Fig. 2 that the RTR vowels are lower in the Khalkh vowel space compared to their ATR counterparts and so again, given that ATR is the baseline, we expect the switch to RTR to result in positive coefficients for sF1.
The ATR vowel-harmony literature indicates that F1 is the most widespread and reliable correlate to the ATR-RTR contrast so we might also expect this factor to be associated with greater magnitude and reliability compared to the other predictors. The literature is far from unanimous on the effect of the second formant; but given that the ATR /e,ö,ü/ are located in the more anterior region of the vowel space, we might expect that a switch to RTR should be associated with a negative value in the regression coefficient for sF2 if tongue-root posture is affected by tongue-body position. We saw above that the third formant distinguished between the ATR /ö,ü/ and RTR /o,u/ and so we anticipate that a switch from baseline ATR in VQ should be associated with positive coefficients for sF3. Finally, given that higher vowels are associated with greater inherent f0 (Lehiste 1970), the switch from ATR to RTR should be negative for sF0.
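
To make the direction of the predictions concrete, the two families of voice-quality measures can be written out from their textbook definitions. The Python sketch below is only that: the function names are hypothetical, the spectral measure is the uncorrected harmonic difference, and VoiceSauce estimates these quantities with its own algorithms, which are not reproduced here.

```python
import math


def db_from_amplitude(amplitude):
    """Convert a linear amplitude to decibels (20 * log10)."""
    return 20.0 * math.log10(amplitude)


def h1_minus_h2(h1_amplitude, h2_amplitude):
    """Uncorrected H1-H2 in dB; larger values mean a steeper drop-off."""
    return db_from_amplitude(h1_amplitude) - db_from_amplitude(h2_amplitude)


def hnr_db(harmonic_power, noise_power):
    """Harmonics-to-noise ratio in dB (10 * log10 of a power ratio)."""
    return 10.0 * math.log10(harmonic_power / noise_power)


# Hypothetical values: adding noise at constant harmonic power lowers HNR.
quiet = hnr_db(100.0, 1.0)    # little noise
noisy = hnr_db(100.0, 10.0)   # ten times more noise power
```

Under these definitions, more turbulence (a larger noise term) can only lower HNR, which is why a switch from baseline ATR to RTR is expected to yield negative coefficients for the HNR variables.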

5.3.1 Results

Table 3 summarizes the results of the regression tests over the Khalkha Mongolian data processed with VoiceSauce. For these tests the predictor variables received default dummy coding. (We ran a separate set of tests in which the predictors were centered with simple coding; the overall pattern of results was largely the same; see Appendix 2.) The number of observations was 374 grouped by word (77) and speaker (5). In the second column, 2 means that the model converged with by-speaker and by-word random slopes for VQ (for example, H1H2cZ ∼ VQ + age + gender + (1|speaker) + (1|word) + (1+VQ|speaker) + (1+VQ|word)); 1 indicates models with by-word random slopes and 0 are models with no random slopes. It turned out that in none of these models were there significant effects for gender or age.
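
The difference between the two coding schemes can be illustrated for the three-level age factor. The Python sketch below assumes that "simple coding" refers to the standard scheme in which each dummy column is shifted by 1/k for a factor with k levels; the function names are hypothetical and the sketch is not the R code used for the models.

```python
from fractions import Fraction

# "middle" is listed first because it is the baseline level in the text.
LEVELS = ["middle", "old", "young"]


def dummy_code(level, levels=LEVELS):
    """Treatment (dummy) coding: one 0/1 column per non-baseline level."""
    return [1 if level == other else 0 for other in levels[1:]]


def simple_code(level, levels=LEVELS):
    """Simple coding: the dummy columns shifted by 1/k, so each sums to zero."""
    k = len(levels)
    return [Fraction(d) - Fraction(1, k) for d in dummy_code(level, levels)]


dummy_table = {level: dummy_code(level) for level in LEVELS}
simple_table = {level: simple_code(level) for level in LEVELS}
```

In both schemes a coefficient compares one level with the baseline. Under dummy coding the intercept is the estimate for the baseline levels, whereas under simple coding each column sums to zero across levels, so the intercept is instead the mean of the level means.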

Table 3 Summary of mixed-effects linear regression tests of full data set

Turning now to the results, for the first group of factors testing spectral slope there was either no significant effect (H1H2c, H1A3c) or an effect in the opposite direction from what was expected (H1A1c, H1A2c), where a switch of VQ from baseline ATR to RTR was associated with positive coefficients indicating greater drop-off. This result suggests that the Khalkh contrast between /e,ö,ü/ and /a,o,u/ is modal versus RTR rather than ATR versus modal or binary ATR versus RTR. For the noise factor there is a relatively strong and reliable (***) decrease in HNR (negative coefficients) indicating that the /a,o,u/ vowels are associated with greater noise and thus lower the harmonics-to-noise ratio. This finding held for all but the lowest region of the spectrum where the association went in the opposite direction. This result would be consistent with pharyngeal narrowing (generating turbulence) for the /a,o,u/ set and dovetails with the presence of spectral drop-off for these vowels. As anticipated, there is a very strong and reliable effect of sF1 being positively associated with the RTR vowels. The pattern with sF2 went in the opposite direction from sF1 and is consistent with the more anterior position of /i,e,ö,ü/ in the vowel space. As for sF3, no significant association was found when sF3 was regressed on VQ for the entire vowel set. However, when the same test was run on just the round vowels /ö,ü/ and /o,u/ (205 observations) then there was a significant positive effect for RTR. Finally, sF0 had significant negative coefficients for RTR consistent with the association between vowel height and f0.

A second set of regressions with the same settings (ATR as the VQ baseline and speaker and word as random factors) was run over a subset of the data comprising the vowels that are harmonic partners for RTR harmony, i.e. a≈e, o≈ö, and u≈ü (number of observations: 302; groups: word 60, speaker 5). The results, seen in Table 4, are generally congruent with the larger data set seen in Table 3: positive spectral drop-off for H1A1c and H1A2c, negative coefficients for HNR15 and HNR25, positive coefficients for sF1, and negative coefficients for sF0 and sF2.

Table 4 Summary of mixed-effects linear regression tests of reduced data set

The Excel plots below indicate how the individual harmonic pairs perform with respect to the dependent variables (raw scores). The error bars are standard errors. For spectral drop-off, the low /a,e/ and mid /o,ö/ vowel pairs pattern together (except for H1A3c) while the high /u,ü/ go in the opposite direction (H1H2c) or show no difference (H1A1c, H1A2c) (Figs. 5, 6, 7, 8).

Figure 5 H1H2c for Khalkha Mongolian vowels

Figure 6 H1A1c for Khalkha Mongolian vowels

Figure 7 H1A2c for Khalkha Mongolian vowels

Figure 8 H1A3c for Khalkha Mongolian vowels

For the harmonics-to-noise ratios, the /a,o,u/ set vowels contribute more turbulence than /e,ö,ü/ at levels above 500 Hz, lowering the HNR values at all three levels of vowel height (Figs. 9, 10, 11, 12).

Figure 9 HNR05 for Khalkha Mongolian vowels

Figure 10 HNR15 for Khalkha Mongolian vowels

Figure 11 HNR25 for Khalkha Mongolian vowels

Figure 12 HNR35 for Khalkha Mongolian vowels

For sF0 and sF1 the vowels separate cleanly into the /a,o,u/ versus /e,ö,ü/ sets at each level of vowel height. For sF2 the nonround vowels separate cleanly while for the round vowels the VQ factor is confounded with vowel length. As was seen in Fig. 2, above, the short ATR /ö,ü/ are positioned more toward the center of the vowel space compared to their long vowel counterparts. A regression test over the round vowels found a significant positive effect for sF2 as a function of length:short (est = 0.79, t = 6.55); see Appendix 1 Model 5 for details. Lastly, for sF3 the VQ patterns differ as a function of rounding as noted in Table 4 (Figs. 13, 14, 15, 16).

Figure 13 sF0 for Khalkha Mongolian vowels

Figure 14 sF1 for Khalkha Mongolian vowels

Figure 15 sF2 for Khalkha Mongolian vowels

Figure 16 sF3 for Khalkha Mongolian vowels

Finally, univariate binomial logistic regression models were run in R with glmer to see how well the individual factors classified the reduced data set (302 observations) into the ATR /e,ö,ü/ versus RTR /a,o,u/ categories. It turned out that the only significant predictor was F1: z = 3.31, AIC = 71.3. Among the noise factors, none of the HNR predictors reached significance by itself. But when combined into a single model, HNR05 and HNR15, which cover the lower region of the spectrum, each reached significance: HNR05 (z = 2.62) and HNR15 (z = −2.59). The AIC for this model was 65.2. A model combining the noise factors HNR05 and HNR15 with F1 had a significantly lower AIC of 59.7 (anova Chisq = 15.61, p = 0.0004) and was the best-fitting model that could be constructed.
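
The reported model-comparison figures can be cross-checked with a little arithmetic, sketched below in Python. The text does not state which model served as the baseline in the anova comparison. The numbers are, however, consistent with a likelihood-ratio test against the F1-only model with two added parameters (HNR05 and HNR15); this reading is an inference from the reported values and should be treated as an assumption. The function names are illustrative.

```python
import math


def chi2_sf_df2(x):
    """Upper-tail probability of a chi-square variable with 2 degrees of freedom.

    For 2 degrees of freedom the survival function has the closed form exp(-x / 2).
    """
    return math.exp(-x / 2.0)


def aic_drop(chisq, extra_params):
    """AIC(smaller model) - AIC(larger model) for nested models.

    Since AIC = 2k - 2 logLik and chisq = 2 * (logLik_large - logLik_small),
    the drop equals chisq - 2 * extra_params.
    """
    return chisq - 2 * extra_params


p_value = chi2_sf_df2(15.61)            # compare with the reported p = 0.0004
predicted_drop = aic_drop(15.61, 2)     # AIC drop implied by the chi-square value
reported_drop = 71.3 - 59.7             # F1-only AIC minus combined-model AIC
```

On this reading the chi-square statistic, the p-value, and the two AIC values agree with one another to within rounding.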

In order to investigate further the role of the various predictors for the ATR versus RTR distinction in our data, we employed R’s Recursive Partitioning algorithm in the mlr package. The random search parameter was set to 200 iterations with fivefold cross validation. Three different models were investigated. A full model employed both the voice-quality predictors of spectral slope and the harmonics-to-noise ratios as well as the formant-based sF1 and sF2 (plus sF0). Two reduced models employed either the voice-quality predictors alone or the formant-based predictors alone. The models are detailed below in Table 5 along with their mean misclassification error (mmce) scores. The results indicate that the Formants-only model performed at essentially the same rate as the Full model with both correctly classifying roughly 90% of the data. The misclassification rate for the model predicting the RTR versus ATR distinction based on the voice-quality predictors alone was substantially higher with an error rate of over 20%.

Table 5 Rpart analysis of reduced data set

In sum, based on the location of a vowel in F1*F2 space c. 90% of the data can be correctly classified with respect to the ATR versus RTR distinction. The voice-quality factors are less reliable in this regard; but they nevertheless succeed in classifying roughly 80% of the data correctly. There is thus considerable collinearity between the vowel quality and voice quality predictors.
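
The evaluation logic behind the mean misclassification error can be sketched in Python. The sketch is a deliberately simplified stand-in: it fits a single-threshold classifier (a one-split tree) on one predictor and scores it with fivefold cross validation, without the random hyperparameter search. It is not the rpart/mlr procedure itself, and the toy values are hypothetical normalized F1 scores rather than data from the study.

```python
def mmce(truth, predicted):
    """Mean misclassification error: share of items labeled incorrectly."""
    wrong = sum(1 for t, p in zip(truth, predicted) if t != p)
    return wrong / len(truth)


def predict(xs, threshold):
    """Label values at or above the threshold RTR, the rest ATR."""
    return ["RTR" if x >= threshold else "ATR" for x in xs]


def fit_stump(xs, ys):
    """Choose the midpoint threshold that minimizes training error."""
    values = sorted(set(xs))
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])] or values
    return min(candidates, key=lambda c: mmce(ys, predict(xs, c)))


def cross_validated_mmce(xs, ys, folds=5):
    """Average held-out mmce over interleaved folds."""
    errors = []
    for k in range(folds):
        test_idx = [i for i in range(len(xs)) if i % folds == k]
        train_idx = [i for i in range(len(xs)) if i % folds != k]
        threshold = fit_stump([xs[i] for i in train_idx],
                              [ys[i] for i in train_idx])
        errors.append(mmce([ys[i] for i in test_idx],
                           predict([xs[i] for i in test_idx], threshold)))
    return sum(errors) / folds


# Hypothetical, perfectly separable normalized F1 values (ATR low, RTR high).
toy_f1 = [-1.5, 0.3, -1.2, 0.6, -0.9, 0.9, -0.6, 1.2, -0.3, 1.5]
toy_vq = ["ATR", "RTR"] * 5
toy_error = cross_validated_mmce(toy_f1, toy_vq)
```

An mmce of about 0.10 on held-out data corresponds to the roughly 90% correct classification reported for the full and formants-only models, and an mmce above 0.20 to the voice-quality-only model.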

5.3.2 Discussion

The most notable result of our investigation into the phonetic parameters underlying the Khalkha Mongolian /a,o,u/ versus /e,ö,ü/ contrast relevant for vowel harmony is that the voice-quality factors of spectral drop-off and acoustic turbulence align with the formant-based F1 and F2 predictors. While not as great in magnitude or reliability, the voice-quality parameters nevertheless do play a significant role. This general result is compatible with and supports the Laryngeal Articulator Model (LAM) of voice quality proposed in Esling (2005) and the phonological potentials model (PPM) of Esling et al. (2019) which it grounds. The LAM is presented as an alternative to the traditional source-filter model of the vocal tract in which the larynx provides the basic sound wave which is then filtered and shaped by articulators located primarily in the oral cavity. LAM recognizes and attempts to model the contribution of articulators in the pharynx and larynx in shaping speech sounds. With regard to the typology of tongue-root harmony, Esling (2005) notes that Lindau’s (1979) X-rays for Akan were not able to thoroughly resolve the lower region of the pharyngeal cavity (a point made by Svantesson 1985 as well) though she notes the presence of laryngeal raising for the [-ATR] vowel set raising the question of whether and, if so, how such nonlingual articulations are involved in ATR-harmony. Esling sheds light on this question with photographic (laryngoscopic) evidence from the northern Togolese (Gur) language Kabiye. Kabiye has a harmony system that divides its five basic vowels into two sets comparable to many other West African languages such as Akan. Esling finds that the primary gesture distinguishing the Kabiye [-ATR] vowels is an aryepiglotto-epiglottal stricture with concomitant reduction in the size of the pharyngeal cavity. 
This active articulation is accompanied by tongue retraction, larynx raising, and possible narrowing of the pharyngeal walls; but the primary factor is claimed to be the aryepiglotto-epiglottal stricture. The relevance of this finding for Mongolian harmony is that such a stricture would provide a plausible articulatory basis for the turbulence we have observed in the Khalkh RTR vowels. This, in turn, would make the RTR set the marked or active factor phonologically. As noted by Ko (2019), Clements & Rialland (2008: 53) observe that in many African harmony systems the low vowels are the site of neutralization for ATR contrasts, with [ATR] /a̘/ replaced by /a/, while in East Asian languages like Mongolian it is the high vowels that are the site of neutralization, with [RTR] /i̙/ replaced by /i/ (reversed polarity). We recall that /i/ lacks an RTR counterpart in the phonemic inventory of (1) and is transparent to the spread of tongue-root harmony in Mongolian. The generalization across these two language types would be a tendency for loss of contrast in the regions of the vowel space that are least compatible with the relevant articulation: ATR (open pharynx) in the West African systems versus RTR (constricted pharynx) in the East Asian ones.

Esling et al.’s PPM (2019: 166) depicts the positive and negative synergies at play among the various articulations linking vowel quality, phonatory quality, and tonal quality. Phonological systems may override the positive connections for purposes of contrast; but a natural default state is implicit in the model. Of particular relevance for our Khalkha Mongolian results are the following synergistic (positive) connections posited in the PPM: lingual (tongue-body and tongue-root) retraction and lowering (leading to increased F1 and decreased F2), epilaryngeal constriction (leading to noise), larynx raising, vocal fold adduction (for voicing), and increased vocal fold tension (leading to higher f0). The acoustic correlates for the Khalkh /a,o,u/ vowels that we have found are consistent with the first four of these five pharyngeal-laryngeal articulations. The exception is f0, where we found the /a,o,u/ set to be associated with lower pitch. The PPM also treats tongue-body raising as an anti-synergy for epilaryngeal constriction. This premise aligns with our data for spectral drop-off, where H1H2 has the profile /a/ < /e/ and /o/ < /ö/ but /u/ > /ü/, and H1A1 has the profile /a/ > /e/ and /o/ > /ö/ but /u/ ≈ /ü/. In both cases it is the high vowel pair that is the odd one out. And as noted above, the high front region of the vowel space lacks an RTR element in Mongolian.

As a final remark, one cannot help but wonder whether the throat singing for which Mongolia is justly famous recruits a basic feature of the language’s phonological structure for artistic purposes, much as Japanese haiku and tanka verse are based on the fundamental role of the mora in shaping that language’s prosody.

6 Conclusions

In this article we presented the results of a study of the phonetic correlates of the phonological contrasts of vowel length, vowel quality, and voice quality in Khalkha Mongolian. The study is based on recordings of 75 monosyllabic noun stems in four case forms collected from five native speakers. The principal results are as follows. First, we confirmed and replicated Svantesson’s (1985) important finding that the presumed front round vowels of Classical Mongolian have retracted to the back vowel region while original [u] and [o] have lowered in the vowel space. Svantesson interpreted this rotation of the round vowels as evidence that the phonological contrast between /ü/ and /ö/ versus /u/ and /o/ changed from an opposition in tongue-body backness to an opposition based on pharyngeal width. Second, we presented the results of an attempt to find voice-quality evidence for the postulated ATR versus RTR contrast. The most reliable correlate found was the harmonics-to-noise ratio, where the RTR /u/ and /o/ vowels presented consistently lower values compared to /ü/ and /ö/ at most of the spectral levels analyzed. We speculated that this finding supported Esling’s (2005) Laryngeal Articulator Model, according to which the Khalkh RTR /a,o,u/ set may reflect aryepiglotto-epiglottal stricture. We also examined the positioning of the ablative suffixal vowels in F1*F2 space and found preliminary evidence that the [±back] contrast between /a:/ and /æ:/ is transmitted to the suffixal vowel along with assimilation in pharyngeal opening and lip rounding in the two vowel harmony processes for which the Mongolian language is well known. Finally, we found that long stem vowels were approximately two and a half times the duration of short vowels. This ratio was maintained in the face of changes in absolute duration as a function of suffixation and changes in syllable structure.
Future research should try to pinpoint the articulatory basis of the tongue-root/pharyngeal harmony contrasts with glottography and other tools. Also, tests modifying the voice-quality correlates of spectral drop-off and turbulence could be devised to see if these variables cue the ATR versus RTR distinction perceptually.