Introduction

Vocal cues are a valuable source of socially relevant information about others (Kamiloğlu & Sauter, 2021; Latinus & Belin, 2011). Their meaning, evolutionary significance, and social effects have been emphasized across diverse disciplines, including biology, psychology, anthropology, ethology, linguistics, computer science, and media studies. Given the increasing quantity of studies on voice perception, there is a growing need to better understand how the methodologies applied to investigate human nonverbal vocal communication may influence research outcomes (Groyecka-Bernard et al., 2022; Lavan, 2023; Pisanski et al., 2021; Sorokowski et al., 2022).

One prevalent paradigm in voice research is to use perception/playback experiments to assess listeners’ evaluations of unseen speakers based only on their voices. Such studies have, for example, investigated listeners’ assessments of social or psychological characteristics of the speakers, such as dominance (Puts et al., 2007), cooperativeness (Knowles & Little, 2016), or authority (Sorokowski et al., 2019), and biological traits including body size or physical strength (reviewed in Aung & Puts, 2020; Pisanski & Bryant, 2019). These experiments consistently highlight the social value of vocal cues in human communication; however, research into methodological aspects of such studies is scarce despite considerable variation in methodological protocols across studies, and particularly in the types of voice stimuli used to measure listeners’ voice-based assessments.

Although we know that even extremely short speech stimuli, such as a single word, can encode a great deal of socially relevant information about the speaker (McAleer et al., 2014), few researchers have systematically tested how the perception or accuracy of voice ratings is influenced by variations in the length and type of voice samples (e.g., Ferdenzi et al., 2013; Mahrholz et al., 2018; Painter et al., 2021; Groyecka-Bernard et al., 2022). In one of the earliest studies to test this, Ferdenzi and colleagues (2013) showed that voice attractiveness judgments were higher for words than for vowel sounds. Mahrholz and colleagues (2018) later showed that listeners’ voice-based ratings of trustworthiness, dominance, and attractiveness were highly correlated when based on a word versus a sentence produced by the same speakers, and did not depend on whether the speech was socially relevant or neutral. Painter and colleagues (2021) found that accuracy in listeners’ perceptions of sexual orientation from speech, though generally low, varied depending on whether speakers produced a single scripted word or one or two scripted sentences. Most recently, Groyecka-Bernard and colleagues (2022) showed that listeners’ perceptions of several socially relevant traits, including attractiveness, trustworthiness, dominance, likability, health, masculinity, and femininity, increased as the length of speech stimuli increased. Listeners’ ratings of these traits were slightly higher for sentence-long greetings or read renditions of the first paragraph of the classic, content-neutral Rainbow Passage than they were for vowels and single words. Nevertheless, listeners’ judgments were highly similar for the same speakers regardless of the length of their scripted speech, indicating that socially relevant perceptions of speakers are not wholly altered, but rather moderated, by the length of their speech.
Critically, all of these studies focused on scripted or read speech stimuli, with none having included free spontaneous speech, despite known differences in the way that people talk when they read versus speak freely (e.g., Szekely et al., 2020; Winkworth et al., 1994).

It thus remains unknown whether freely produced spontaneous speech elicits different impressions of speakers than does standardized scripted or read speech, a key aim of the present study. This research question is important both theoretically and methodologically. Theoretically, answering it can provide insight into the extent to which acoustic cues to speaker traits are encoded in different kinds of speech, as we predict they will be, given that socially rich acoustic parameters such as voice pitch are preserved across speech types and even nonverbal vocalizations within speakers (Pisanski et al., 2020, 2021). Methodologically, answering it can clarify whether studies using different kinds of speech stimuli are comparable and, indeed, whether free versus read speech is preferable for eliciting robust responses from listeners. Practically, this research can also uncover whether reading text or speeches aloud, as many public speakers in politics and broadcasting do, leads listeners to perceive speakers more or less favorably.

The large body of existing research on human voice production and perception is highly inconsistent in terms of methodology and voice stimulus choice. Some research designs provide a high degree of standardization by ensuring that all participants produce vocal stimuli with the same verbal content (e.g., Atkinson et al., 2012; Knowles & Little, 2016; Pipitone & Gallup, 2008; amongst many others), while others prioritize ecological validity over internal validity, and therefore record spontaneous, unscripted free-speech sentences (e.g., Mayew et al., 2013; Puts et al., 2007; Schild et al., 2020a, b; Tigue, et al., 2012; and many others). Ecological validity refers to the extent to which the findings of a research study can be generalized to real-life settings, and it is generally clear that listening to a free speech sample more closely reflects a real-life social interaction than does listening to a standardized recording of someone reading, particularly if the read text is socially irrelevant and affectively neutral, as it most often is.

At the same time, it is also true that the free speech design bears certain risks, as there are many factors that can influence the nonverbal vocal parameters or prosody of speech, and thus perceptions of the freely speaking person. Indeed, according to previous studies, free speech has certain characteristics that distinguish it from read speech (Szekely et al., 2020; Winkworth et al., 1994). For example, free speech tends to have a higher proportion of filled pauses such as “um” and “uh” (Szekely et al., 2019), as well as more variation in rhythm, intonation, speaking rate, and pitch (Wester, et al., 2016). The relatively greater variance in free speech might emphasize psycho-biological traits of speakers, namely by drawing attention to stable individual differences in vocal parameters that can index speaker traits, such as dominance or age, which are often indexed by individual differences in mean voice pitch (Kreiman & Sidtis, 2011). This could in turn increase accuracy in listeners’ assessments of such traits based on free speech compared to read speech. In contrast, increased vocal variance might occlude or distract from otherwise reliable vocal indices of speaker traits, for example by drawing attention away from relatively stable vocal markers toward more idiosyncratic speaking patterns that are less biologically constrained and thus less biologically relevant, such as speaking rate.

The aim of the current study was to test whether several commonly studied traits, including social, biological, and psychological characteristics of men and women, would be assessed differently by listeners from the voice depending on whether these same vocalizers were producing read or free speech with neutral content. We thus compared listeners’ assessments of attractiveness, trustworthiness, dominance, likability, femininity/masculinity, and health as well as accuracy in their judgments of vocalizer sex, age, height, and weight based on voice recordings of free speech versus standardized read speech produced by the same men and women.

Methods

The study was conducted in accordance with the Declaration of Helsinki. Study protocols were accepted by the ethics committee at the Institute of Psychology, University of Wroclaw, Wroclaw, Poland. Vocalizers and raters provided informed consent prior to study inclusion. Vocalizers were informed that their voice recordings would be used for research purposes and would be played to other participants in perception experiments.

Participants

Vocalizers

Voice samples for the current study were recorded from 208 adult men and women (Mage = 30.4 years, SDage = 11.54 years; 49% women), hereafter referred to as vocalizers. We aimed for a sample size of approximately 100 vocalizers per sex, based on a trade-off between the sample size of raters that would be required to judge such a large sample of speech stimuli (totaling 416), while also retaining a representative sample of men and women with variable interindividual differences in the traits of interest and in nonverbal features of the voice. Vocalizers were recruited through snowball sampling by researchers, research assistants, their colleagues, and acquaintances. All vocalizers were native Polish speakers and were not compensated for study participation.

Voice raters

A separate group of participants took part in the study as listeners (voice raters), with a final sample size of 4,088 men and women (Mage = 32.16 years, SDage = 12.97 years, 46% women), after removal of approximately 5% of data from participants who failed listening or attention checks (see below). The majority of raters (67% or n = 2446, with 51.9% women, Mage = 26.2 years, SDage = 10.3 years) were recruited by the research team using the snowball sampling method, wherein researchers posted recruitment ads on their social media profiles and around their city of residence, both inside and outside of the university, aiming to recruit a representative sample (i.e., including older individuals and non-students). These participants were reimbursed through a lottery draw of small prizes such as pen drives, hard drives, and thermoses. To further increase the representativeness of our rater sample, the remainder of raters (n = 1642, with 37.6% women, Mage = 41.1 years, SDage = 12.2 years) were recruited with the help of a research recruitment firm (Imas International). These participants completed the experiment online and were reimbursed in cash. The number of raters was determined such that each voice stimulus would be rated approximately 20 times, as high inter-rater agreement (alphas > 0.80, p < .001) among judges can typically be achieved with 15 raters (see Kordsmeyer et al., 2018; Schild et al., 2020a, b).
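The agreement benchmark cited above can be quantified with Cronbach’s alpha, treating raters as items. The sketch below is a hypothetical Python illustration with simulated data, not the study’s actual analysis code; the function name and simulation parameters are our own:

```python
import numpy as np

def cronbach_alpha(ratings):
    """Cronbach's alpha for a raters-as-items matrix.

    ratings: 2-D array of shape (n_stimuli, n_raters); each column holds
    one rater's scores across the same set of voice stimuli.
    """
    ratings = np.asarray(ratings, dtype=float)
    k = ratings.shape[1]                          # number of raters
    item_vars = ratings.var(axis=0, ddof=1)       # variance of each rater
    total_var = ratings.sum(axis=1).var(ddof=1)   # variance of summed scores
    return k / (k - 1) * (1 - item_vars.sum() / total_var)

# Simulated example: 15 raters judging 40 stimuli that share a "true" signal
rng = np.random.default_rng(0)
true_scores = rng.normal(4, 1, size=(40, 1))
ratings = true_scores + rng.normal(0, 1, size=(40, 15))
print(round(cronbach_alpha(ratings), 2))  # high agreement, above the 0.80 benchmark
```

With 15 simulated raters whose scores share a common signal, agreement comfortably exceeds the 0.80 threshold referenced above; fewer raters or noisier judgments would lower it.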

Procedure

Voice recording

Voice recording took place in a quiet room. Vocalizers were invited to take part in a recording session. Prior to voice recording, each vocalizer completed a short survey, and their height and weight were measured using metric tape and a scale. Afterwards, they were requested to provide specified voice samples (see Voice stimuli), which were recorded using a Zoom H4n recorder positioned 10 cm from the mouth. Voice recordings were saved as WAV files at 96 kHz sampling frequency and 16-bit resolution, and later transferred to a laptop. After recording the voice samples, participants were thanked and debriefed about the study aims.

Voice stimuli

Each vocalizer provided eight randomized voice samples, among which two were analyzed for the purposes of the current study, namely:

  1. Reading a short, scripted paragraph describing a weather forecast (read speech)

  2. Free speech about the same topic, the current weather conditions (free speech)

Perception experiment

Listeners in the perception experiment rated voice stimuli through a dedicated web platform. The recorded voice samples (two per vocalizer) were incorporated into an online web app prepared for the purpose of this study. Raters were invited to complete the survey at the University or in their homes. Those who took part in the rating sessions at the University completed the survey either individually or in small groups in a designated room, with sufficient distance between participants to ensure the privacy and independence of their responses. Those who completed the survey at home were asked to avoid competing noises or distractions while performing the assessments. Lab raters used Sennheiser HD 210 professional headphones to listen to stimuli, while participants who completed the playback online were instructed to use high-quality headphones.

At the beginning of the session, raters were exposed to a test voice sample to ensure that they could hear the stimuli properly and understood the study instructions. Additionally, the survey included one attention-check item placed in the middle of the survey (instead of a voice sample to rate, listeners heard the instruction: “This is an attention checking question – please mark 1”). Approximately 5% of participants failed the entry test or the attention check, and their data were thus excluded from further analysis.

For each rater, a sample of ten vocalizers (five per sex), producing both read and free speech, was randomly drawn from the pool of all 208 vocalizers. Thus, each rater heard a total of 20 speech stimuli. Each speech stimulus was judged independently on a single trial and for a single vocalizer characteristic, and was played in full to listeners before a rating could be given. Raters were randomly assigned to assess only one of the following ten characteristics for all voice stimulus samples presented to them: sex/gender (male or female); age in years (How old is this person?); height in centimeters (How tall is this person?); weight in kilograms (How much does this person weigh?); attractiveness on a 1–7 scale, with 1 indicating very unattractive and 7 indicating very attractive (How attractive is this person?); dominance on a 1–7 scale, with 1 indicating not dominant at all and 7 indicating very dominant (How dominant is this person?); trustworthiness on a 1–7 scale, with 1 indicating not trustworthy at all and 7 indicating very trustworthy (How trustworthy is this person?); likability on a 1–7 scale, with 1 indicating not likable at all and 7 indicating very likable (How likable is this person?); femininity/masculinity on a 1–7 scale, with 1 indicating very feminine and 7 indicating very masculine (How feminine/masculine is this person?); and finally, health on a 1–7 scale, with 1 indicating very unhealthy and 7 indicating very healthy (How healthy is this person?). Each speech stimulus was, therefore, evaluated by an average of approximately 20 raters (M = 19.57, SD = 4.75; see Online Supplementary Material (OSM)). Finally, participants completed a short demographic survey, and were thanked and debriefed. The procedure took approximately 20–30 min.
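The stimulus-assignment scheme described above can be sketched as follows. This is a hypothetical Python illustration; the function and variable names are our own and do not come from the study’s software:

```python
import random

def draw_stimuli(vocalizers, n_per_sex=5, seed=None):
    """Draw a random rating set: n_per_sex vocalizers of each sex,
    each contributing both a read-speech and a free-speech recording.

    vocalizers: list of (vocalizer_id, sex) tuples, sex in {"F", "M"}.
    Returns a shuffled list of (vocalizer_id, speech_type) trials.
    """
    rng = random.Random(seed)
    females = [v for v, s in vocalizers if s == "F"]
    males = [v for v, s in vocalizers if s == "M"]
    chosen = rng.sample(females, n_per_sex) + rng.sample(males, n_per_sex)
    trials = [(v, st) for v in chosen for st in ("read", "free")]
    rng.shuffle(trials)  # randomize presentation order
    return trials

# Hypothetical pool of 208 vocalizers (odd ids female, even ids male)
pool = [(i, "F" if i % 2 else "M") for i in range(208)]
trials = draw_stimuli(pool, seed=42)
print(len(trials))  # 20 stimuli: 10 vocalizers x 2 speech types
```

Each rater would receive an independently drawn set, so stimuli accumulate roughly equal numbers of ratings across raters over the whole sample.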

Statistical analyses

To examine the effect of speech stimulus type on perceived traits of the vocalizers, as well as on the accuracy of listeners’ ratings of biological traits, we fit a series of Linear Mixed Models (using the Restricted Maximum Likelihood estimator) with responses clustered within both speakers and raters. Because of the sizeable sexual dimorphism in the acoustic properties of the human voice (Titze, 1989), we conducted separate models for male and female vocalizers. Most models included ratings of a given trait as the outcome variable; speech type (categorical variable coded 0 – read, 1 – free), method of data collection (categorical variable coded 0 – online, 1 – lab), vocalizer age, rater gender (categorical variable coded 0 – F, 1 – M), and rater age as fixed effects; and vocalizers’ and raters’ coded identities as random effects. In the model examining the effect of speech type on the accuracy of sex/gender ratings, we entered accuracy in sex assessment as the outcome variable (0 – incorrect, 1 – correct). Additionally, in the models with age, height, and weight as outcomes, we included the actual age, height, or weight of the speakers (as the relationship between actual and rated values reflects the accuracy of the ratings) and the actual age/height/weight × speech type interaction, to test whether the accuracy of listeners’ ratings differs between read and free speech. All continuous predictors were grand-mean centered. All model results are presented in Tables S1–S11 in the OSM. For means and standard deviations of actual and rated properties (by speech type), see Tables S12 and S13, respectively, in the OSM. We also conducted additional analyses controlling for the duration of the vocal stimuli; because their results closely resemble those presented below (OSM, pages 25–52), we report the simpler models, without duration as a covariate, in the main article.
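As a hedged illustration of this model structure, the following Python sketch fits a simplified version of such a model to simulated data using statsmodels. Note that statsmodels’ MixedLM supports a single grouping factor, so this sketch includes random intercepts for speakers only, whereas the reported models crossed speakers and raters (as is possible in, e.g., lme4); all names and simulated values here are our own assumptions:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulate ratings with a speaker random intercept and a small negative
# effect of free speech (speech_free = 1) relative to read speech (= 0)
rng = np.random.default_rng(1)
rows = []
for speaker in range(30):
    speaker_intercept = rng.normal(0, 0.5)
    for rater in rng.choice(60, size=10, replace=False):
        for speech_free in (0, 1):
            rating = 4 + speaker_intercept - 0.3 * speech_free + rng.normal(0, 1)
            rows.append({"speaker": speaker, "rater": rater,
                         "speech_free": speech_free, "rating": rating})
df = pd.DataFrame(rows)

# Random intercepts for speakers; REML estimation, as in the reported models
model = smf.mixedlm("rating ~ speech_free", df, groups=df["speaker"])
fit = model.fit(reml=True)
print(round(fit.params["speech_free"], 2))  # negative, recovering the simulated effect
```

The fixed-effect coefficient for speech type recovers the simulated difference between read and free speech; in the actual analyses, additional fixed effects (data-collection method, vocalizer age, rater gender and age) would enter the formula in the same way.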

Due to multiple comparisons, we applied a Bonferroni correction and set the alpha level to 0.005 (0.05/10). Estimates presented in the main text are unstandardized; for standardized betas, see Fig. 1. All analyses and figures were produced in Jamovi, in R using the packages ggplot2 (Wickham et al., 2016) and parameters (Lüdecke et al., 2020), and in Python using matplotlib (Ari & Ustazhanov, 2014). Raw data, codebooks for Jamovi, and scripts can be retrieved from: https://osf.io/ga4tp/?view_only=6e0f6b455db14bf79882c0f910277d96
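The Bonferroni adjustment amounts to dividing the nominal alpha by the number of tests. A minimal sketch; the illustrative p-values echo the male attractiveness and female likability coefficients reported in the Results:

```python
# Bonferroni correction across the ten rated characteristics
alpha_nominal = 0.05
n_tests = 10
alpha_corrected = alpha_nominal / n_tests
print(round(alpha_corrected, 3))  # 0.005

# Illustrative p-values from the Results section
p_values = {"attractiveness": 0.003, "likability": 0.009}
survives = {trait: p < alpha_corrected for trait, p in p_values.items()}
print(survives)  # {'attractiveness': True, 'likability': False}
```

Under the corrected threshold, a p-value of .009 that would be significant at the conventional .05 level no longer counts as significant, which is why the likability effects reported below are treated as non-significant.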

Fig. 1

Listeners’ ratings of speaker traits based on read versus free speech. Standardized beta coefficients with 95% confidence intervals representing the effects of free vs. read speech across all traits of interest (each coefficient is derived from a separate analysis) for female and male vocalizers. The plot illustrates that read speech elicited relatively higher ratings of attractiveness, dominance, and trustworthiness in both sexes, and of health in male vocalizers, compared to free speech.

Results

Assessments of biological traits

Sex/gender recognition

The overall accuracy of voice-based sex/gender judgments was very high, averaging 99% for both types of speech stimuli (read and free speech) and for both vocalizer sexes. None of the predictors included in our models significantly predicted variance in sex rating accuracy (see Table S1 in the OSM). Accuracy in voice-based sex recognition was therefore extremely high for both female and male vocalizers, independent of speech type, method of data collection, speaker age, rater age, and rater gender. Indeed, the effect of speech type was very small (b = 0.001; CIs: -0.01, 0.01; p = .781 and b < 0.001; CIs: -0.01, 0.01; p = .857, for female and male vocalizers, respectively), indicating that listeners could gauge sex extremely and equally well from both read and free speech.

Age assessment

For both male and female vocalizers, assessed age was predicted by the vocalizer’s actual age (females: b = 0.69, p < .001; males: b = 0.524, p < .001), indicating an overall high accuracy in listeners’ voice-based ratings of age. For female vocalizers, assessed age was not statistically significantly related to speech type following Bonferroni correction (b = -0.50; CIs: -0.94, -0.06; p = .026). In males, speech type likewise did not explain any variance in raters’ age assessments (b = -0.01; CIs: -0.42, -0.44; p = .967) (see Table S2, OSM, for full outcomes). Speech type also did not predict accuracy in listeners’ age assessments of either female (b = -0.03; CIs: -0.07, 0.002; p = .062) or male vocalizers (b = -0.03; CIs: -0.07, 0.003; p = .069), as reflected by the absence of interaction effects. Whether participants completed the experiment in the lab or online did not significantly predict their age assessments of either sex (p = .318 and p = .720 for female and male vocalizers, respectively).

Height assessment

In both female and male vocalizers, actual height predicted perceived height, although for females this relationship fell just short of the Bonferroni-corrected threshold of statistical significance (b = 0.09, p = .014 in females and b = 0.19, p < .001 in males). For female vocalizers, speech type did not predict listeners’ assessments of speaker height (b = -0.32; CIs: -0.71, 0.07; p = .109; see Table S3, OSM, for detailed outcomes), and there was no significant interaction between speech type and vocalizers’ actual height (b = -0.09; CIs: -0.16, -0.013; p = .022). In male vocalizers, no effect of speech type was found (b = 0.08; CIs: -0.36, 0.52; p = .728), and there was no interaction between speech type and actual height (b = -0.02; CIs: -0.09, 0.06; p = .667). Whether participants completed the experiment in the lab or online did not significantly predict their height assessments of either sex (p = .114 and p = .062 for female and male vocalizers, respectively).

Weight assessment

For female vocalizers, the model revealed no significant main effect of vocalizers’ actual weight on listeners’ weight assessments following Bonferroni correction (b = 0.07; p = .018). We found no effect of speech type on listeners’ assessments of weight (b = 0.02; CIs: -0.48, 0.45; p = .948), and no interaction between speech type and vocalizers’ actual weight (b = -0.03; CIs: -0.07, 0.006; p = .099). In male vocalizers, the effect of speech type was also not significant (b = 0.05; CIs: -0.47, 0.57; p = .845), nor was the effect of vocalizers’ actual weight on assessed weight (p = .137), with no interaction between these two variables (b = 0.02; CIs: -0.02, 0.06; p = .257). These results thus suggest that the type of speech stimulus did not influence listeners’ weight judgments (see Table S4, OSM). Whether participants completed the experiment in the lab or online did not significantly predict their weight assessments of either sex (p = .875 and p = .982 for female and male vocalizers, respectively).

Assessment of psychosocial traits

Subsequent models examined listeners’ assessments of socially relevant traits of vocalizers including attractiveness, dominance, likeability, trustworthiness, masculinity/femininity, and health. While many of these traits can be framed as both psychological and biological to some degree, the key distinction for the purposes of this study is that these traits were not defined by any objective ground-truth measure. Thus, while we could test whether listeners’ perceptions of these traits varied when judging free versus read speech, we could not test whether speech type affected the accuracy of these assessments, as we could for biological traits for which we had objective measures (sex, age, and body size). The statistical models were thus composed of the same set of predictors (speech type, vocalizer/rater age and sex); however, we did not assess accuracy in listeners’ assessments of these psychosocial traits.

Speech type predicted differences in listeners’ ratings of several psychosocial traits, with similar results for male and female vocalizers (see Fig. 1). Specifically, read speech elicited relatively higher ratings of attractiveness (females: b = -0.13; CIs: -0.20, -0.06; p < .001; males: b = -0.11; CIs: -0.18, -0.04; p = .003; Table S5, OSM), dominance (females: b = -0.37; CIs: -0.45, -0.29; p < .001; males: b = -0.39; CIs: -0.47, -0.31; p < .001; Table S6, OSM), and trustworthiness (females: b = -0.45; CIs: -0.53, -0.36; p < .001; males: b = -0.44; CIs: -0.52, -0.36; p < .001; Table S8, OSM) than did free speech. Likability ratings tended to be relatively higher for free speech but did not vary significantly between speech types after Bonferroni correction in either female (b = 0.099; CIs: 0.024, 0.173; p = .009) or male vocalizers (b = 0.08; CIs: 0.01, 0.15; p = .033; Table S7, OSM). Masculinity-femininity ratings also did not vary between speech types for either female (b = 0.04; CIs: -0.01, 0.10; p = .146) or male vocalizers (b = -0.037; CIs: -0.09, 0.02; p = .165; Table S9, OSM), nor did health assessments of female vocalizers (b = -0.05; CIs: -0.12, 0.02; p = .152), whereas the same male vocalizers were evaluated as healthier when producing read rather than free speech (b = -0.25; CIs: -0.32, -0.17; p < .001; Table S10, OSM). In general, differences between speech types were small (Fig. 1).

Whether participants completed the experiment in the lab or online did not affect the pattern of results, with the exception of ratings of men’s health (b = -0.20; CIs: -0.44, 0.03; p < .001). We therefore performed additional analyses for lab and online raters separately, with perceived health as the outcome variable (see Table S11, OSM). The effect of speech stimulus type was statistically significant among lab raters (b = -0.21; CIs: -0.32, -0.09; p < .001) but not among online raters (b = -0.03; CIs: -0.14, 0.09; p = .663), with higher ratings observed for read than for free speech.

Additional analyses

To corroborate our findings, in addition to the multilevel models reported above, we performed analyses at the vocalizer level (n = 208). For all ratings for which we collected objective continuous measures from vocalizers (i.e., actual age, height, and weight), we conducted correlation analyses to illustrate accuracy in listeners’ ratings by speech type (i.e., we correlated assessed traits with objective measures of those traits for both read and free speech). The results are presented in Fig. 2; in Fig. 3 we further present correlations between mean ratings of psychosocial traits based on read versus free speech. In both sets of analyses, the relationships were significant in virtually all cases (at p < .001) and ranged in strength from moderate (r = .48, p < .001), in the case of dominance ratings of female vocalizers, to extremely strong (r = .96, p < .001), in the case of age ratings of female vocalizers (see Figs. 2 and 3). These results demonstrate that listeners’ voice-based ratings of vocalizer traits correlate substantially between read and free speech produced by the same vocalizers, indicating that vocalizers are perceived similarly regardless of whether they are reading a scripted passage or producing free speech about the weather.
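The vocalizer-level logic can be illustrated with a small simulation: if a stable underlying speaker trait drives mean ratings for both speech types, ratings based on read and free speech will correlate strongly, and the squared correlation gives the shared variance (e.g., r = .48 corresponds to 23%, and r = .96 to 92%). This Python sketch uses simulated, not actual, data:

```python
import numpy as np

# A stable underlying speaker trait drives mean ratings for both speech types
rng = np.random.default_rng(7)
n_vocalizers = 208
trait = rng.normal(0, 1, n_vocalizers)
mean_rating_read = trait + rng.normal(0, 0.4, n_vocalizers)
mean_rating_free = trait + rng.normal(0, 0.4, n_vocalizers)

r = np.corrcoef(mean_rating_read, mean_rating_free)[0, 1]
print(round(r, 2))       # strong positive correlation between speech types
print(round(r ** 2, 2))  # proportion of variance shared between speech types
```

Increasing the rating noise relative to the trait variance would pull the correlation down toward the moderate end of the range reported above.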

Fig. 2

Relationships between actual objective measures and listeners’ ratings of biological vocalizer traits for read versus free speech. The upper rows in each panel (A, B, C) reflect correlations between objective measures of actual age/height/weight and listeners’ mean ratings for these same traits, based on read versus free speech. The lower scatterplot in each panel represents the relationship between mean ratings from read and free speech for each biological trait. Each datapoint represents a single vocalizer (n = 208), wherein black circles represent female vocalizers and yellow diamonds represent male vocalizers. Grey shading represents 95% confidence intervals. Mean ratings are derived from voice-based assessments given by 4,088 male and female listeners

Fig. 3

Relationships in listeners’ ratings of psychosocial traits based on read versus free speech. Relationships between listeners’ ratings of psychosocial traits based on read versus free speech were moderate to strong. Each datapoint represents a single vocalizer (n = 208), wherein black circles represent female vocalizers and yellow diamonds represent male vocalizers. Grey shading represents 95% confidence intervals. Mean ratings are derived from voice-based assessments given by 4,088 male and female listeners

Discussion

In our species, as in many other terrestrial mammals, the voice is a crucial modality for social communication. This study corroborates a large body of literature showing that simply by hearing another person speak, even for a brief moment and largely regardless of what they are saying, we can effectively gauge their sex, their general age and body size, and we can make a rapid series of judgments regarding other psychological and social traits such as how attractive, dominant, or trustworthy they are (for reviews, see Kamiloğlu & Sauter, 2021; Kreiman & Sidtis, 2011; Pisanski & Bryant, 2019). Here, we go one step further to also show that these judgments are highly similar for the same speaker whether they are reading a passage or producing free speech about the same neutral topic. In other words, a person who is perceived as particularly dominant or as relatively young when reading is perceived very similarly when speaking spontaneously, even when using different words, though this depends to some extent on the trait that is being judged, as described below. Listeners are also similarly accurate in their ability to judge objective traits, such as sex, age, and body size, from read and free speech. Thus, despite some acoustic variation in how we humans talk when we read, largely tied to breathing (Winkworth et al., 1994), this mild departure from our natural spontaneous way of speaking does not profoundly affect listeners’ judgments of us. This is almost certainly due to intraindividual stability in underlying vocal parameters such as fundamental and formant frequencies across different kinds of speech (see, e.g., Pisanski et al., 2016, 2020, 2021) that often explain the majority of variance in listeners’ voice-based judgments of speakers (for review, see Pisanski & Bryant, 2019).

While the present study shows that voice-based assessments of biological traits, including speaker sex, age, and body size, depend very little on whether a person is speaking freely or reading a scripted text, one exception emerged: listeners judged female vocalizers as slightly older (by less than one year) when those women produced read speech than when they spoke freely about the weather. This is a relatively small difference in ecological terms, considering that our sample of vocalizers ranged from 18 to 67 years of age. Moreover, accuracy in age assessments of female vocalizers was the same for read and free speech.

Interestingly, we show that for assessments of some psychosocial traits (e.g., attractiveness, dominance, and trustworthiness), listeners’ judgments are slightly more affected by whether speech is read or spontaneous compared to what we observed for quantifiable biological traits. Indeed, we observed statistically significant differences between listeners’ ratings for read and free speech across most psychosocial traits and for both vocalizer sexes, with the exceptions of masculinity-femininity assessments of males and females and health assessments of females. It should be noted, however, that the effect sizes were small and not likely to be ecologically or socially meaningful.

Critically, using regression analyses, we show that listeners’ ratings of both biological and psychosocial traits share a large proportion of variance across speech types for the same speakers. Even in cases where the differences between read and free speech were significant, ratings between speech types were moderately to strongly correlated for the same group of vocalizers, with at least 23% and up to 92% of the variance in ratings for one speech type explained by ratings for the other. Nevertheless, the finding that assessments based on different types of speech can differ is important to consider when comparing voice-based ratings across studies that use different speech types, as well as in designing further research. Regarding practical implications, such differences may be relevant in the social and media spheres, for example in terms of how listeners may regard public figures or persons in advertisements based on the perceived spontaneity of their speech.

Our study also corroborates previous research showing that human listeners can judge physical characteristics fairly reliably from the voice alone (for reviews, see Kamiloğlu & Sauter, 2021; Pisanski & Bryant, 2019). In our study, with a large sample of over 4,000 listeners, sex/gender was assessed almost perfectly, age with high accuracy, and body size with above-chance accuracy. Weight ratings were the least accurate; in particular, male weight was not accurately gauged by our listeners from the voice, regardless of speech type.

Limitations and future research recommendations

Our research design does not allow us to draw direct conclusions regarding the “accuracy” of listeners’ judgments of psychosocial traits. We appreciate that many of these traits, such as masculinity, femininity, attractiveness, dominance, and health, have biological components. Nevertheless, in this study we did not have quantifiable, objective measures of these traits, nor of likeability or trustworthiness. Such traits are indeed difficult to quantify (see, e.g., Rubenstein, 2005) compared to objective traits such as age or body size. Our study also has technical limitations. For example, although online raters were instructed to use high-quality headphones, and headphone use was verified with hearing tests, we cannot be certain that the quality of their headphones was comparable to that of those used by lab participants. However, this method allowed us to increase our sample of respondents, and we did not find systematic differences in results between our lab and online samples. Finally, as our sample was homogeneous in terms of language, the study could be replicated in other countries to assess its cross-cultural generalizability. In the future, it would also be informative to examine the effect of speech type in more ecologically valid or multimodal conditions, i.e., where a speaker is also seen (Groyecka et al., 2017), to test whether speech type remains relevant when other social cues and modalities are available to the perceiver.

Conclusion

The human voice can be a critical social tool that informs listeners about biological and psychosocial traits of human vocalizers. While past work has shown that some acoustic properties of the voice, such as fundamental frequency (voice pitch), are relatively stable across speech types (Pisanski et al., 2016, 2021), and that listeners’ judgments do not vary substantially with the duration or complexity of scripted speech (Groyecka-Bernard et al., 2022; Mahrholz et al., 2018), this study is among the first to compare listeners’ ratings of vocalizers based on read versus free speech across a broad range of biological and psychosocial traits. Our results thus provide the most comprehensive evidence to date that listeners’ judgments of speakers based on vocal cues, and the validity of those judgments, vary to a small yet observable extent with the type of speech stimulus (free vs. read speech). This indicates that while the same vocalizer might be perceived as, for example, slightly more trustworthy when producing read than free speech, the two sets of ratings will correlate, often strongly, for any given speaker. Nevertheless, although differences in voice-based assessments for free versus read speech are minor, they are worth acknowledging when comparing across studies or designing future research paradigms.