Introduction

Assessment plays a crucial role in higher education, providing a means to certify levels of achievement, verify learning outcomes, and test student knowledge. However, assessment has considerable additional educational value due to its potential to motivate, facilitate and enhance learning (Carless et al., 2017; Entwistle & Entwistle, 1991; Kickert et al., 2022; Marton & Säljö, 1997; Ramsden, 1997), and lay the foundations for future learning (Boud & Falchikov, 2006; Boud, 1995, 2000). The way in which students are assessed also has profound implications for both student wellbeing (Baik et al., 2019; Jones et al., 2021; Slavin et al., 2014) and student engagement (Vaughan, 2014) and, arguably more than any other aspect of teaching, signals to students what is valued by their teachers, the discipline, and the institution.

The challenge of designing effective assessment is a perennial problem for universities, and one of the primary issues of concern identified by higher education quality assurance bodies. In reviews conducted by the Quality Assurance Agency for Higher Education in the UK, for example, deficiencies related to assessment practices have persistently emerged as the main criticism of university courses, especially the ‘very narrow range of assessment methods in use and over-reliance on traditional examinations’ (Boud & Falchikov, 2006, p. 402). This over-reliance on examinations is problematic for two key reasons. Firstly, it constrains diversity in assessment methods. Such diversity is necessary to assess a broad range of learning outcomes, provide a multi-dimensional understanding of students’ skills and knowledge, maintain student engagement, and involve students in learning activities that lead to higher order thinking and a deeper understanding of content (Biggs et al., 2022). Secondly, high-stakes final examinations tend to serve a purely summative function, which becomes an issue when they dominate the curriculum at the expense of opportunities for formative assessment and feedback. It is important that formative assessment and feedback feature prominently in curriculum design (Morris et al., 2021) to allow students to advance their learning by actively engaging with and implementing feedback (Henderson et al., 2020; Winstone & Carless, 2020).

In the scholarly literature on assessment in higher education, questions relating to the pedagogical value of final examinations surface repeatedly. For example, John Biggs (2001, p. 234) argues that “invigilated examinations are hard to justify educationally”, citing concerns about plagiarism and contract cheating as the leading “distorted priority” for their ongoing use. Scholars such as Gibbs (1992) and Ramsden (1992) question the reliability of examinations as a measure of student learning, noting that questions assessing the recall of facts can often be answered without an understanding of the fundamental principles of the topic or a more complex understanding of the ways in which concepts are integrated in real-world scenarios. Such critiques are by no means new. Indeed, examinations have attracted criticism since their inception in Imperial China, where they were widely criticised for their emphasis on rote memorisation, testing of skills rather than knowledge, the prevalence of cheating, and for cases of mental disorders that were anecdotally attributed to failing the high-pressure examinations (Kellaghan & Greaney, 2019). Similar criticisms were made of the written examinations introduced at Oxford and Cambridge in the nineteenth century, which were perceived to dissuade originality through their focus on recall and to contribute to social stratification by benefiting the most privileged (Kellaghan & Greaney, 2019). Yet examinations continued to hold a privileged place in universities; indeed, they became increasingly prevalent in educational systems worldwide throughout the nineteenth and twentieth centuries (Kellaghan & Greaney, 2019), despite strong criticisms of their low reliability (Hartog & Rhodes, 1936).

Since the 1970s, there has been a shift away from the use of high-stakes final examinations in many countries, including New Zealand (Bassey, 1971), Finland (Sahlberg & Hargreaves, 2011), Australia, and the UK (Richardson, 2015a). This shift is in part a response to growing concern that heavily weighted summative examinations may negatively impact student learning and wellbeing (Ecclestone, 1999; Jones et al., 2021; Pascoe et al., 2020) and to debate about their efficacy and validity as assessment instruments (Knight, 2002). It is also related to broader macro-level processes, including the internationalization of higher education, which brings global dimensions into the curriculum that impact assessment design (Jamil et al., 2021), and digitalization, which has opened up new possibilities for more diverse and creative assessment methods that can be employed at scale. However, in many systems, high-stakes examinations remain strongly entrenched. For example, Wong et al. (2020) observe that in Singapore attempts to introduce new modes of assessment have been constrained by a high-stakes examination culture that has been a key feature of the national education system since the 1960s. Similarly, researchers in China (Chen et al., 2020; Song, 2016; Wang et al., 2022; Wang & Brown, 2014), Korea (Kwon et al., 2017), South Africa (Mutereko, 2018), the U.S. (Berliner, 2011; Fook & Sidhu, 2014; Gorgodze & Chakhaia, 2021), and Canada (Rawlusyk, 2018) suggest that high-stakes examinations continue to be used as a dominant mode of assessment, despite growing awareness of the need for more formative and diversified assessment practices.

In this article, we report a scoping review of the literature that considers whether the enduring popularity of high-stakes summative examinations is justified by the empirical evidence. To our knowledge, this study represents the first comprehensive synthesis of the purported benefits and drawbacks of high-stakes examinations. It offers a summary of the key arguments and an analysis of the evidence that will be of value to university teachers, learning designers, institutional policy makers, and anyone with an interest in assessment in higher education.

While examinations can take many forms, our interest here is in individual, closed-book assessments “undertaken in strict formal and invigilated time-constrained conditions” (Bridges et al., 2002, p. 36), either in-person or via proctored online testing, which occur at the end of a subject (and therefore have a purely summative function). We employ this restricted definition in part because of the prevalence of such examinations in university curricula and in part in acknowledgement that the limitations of examinations identified in the literature and summarized here can be overcome to some extent with more creative design. For example, well-designed open-book and take-home examinations, groupwork examinations, and shorter nested examinations scheduled throughout the semester all potentially offer an advance on the traditional high-weighted closed-book examination. We consider high-stakes examinations to be those which are strongly consequential for student progression, due to heavy weighting (often 50% or more of the overall assessment weight associated with a subject), the assignment of ‘hurdle’ status (students must obtain a passing grade in the examination to pass the subject), or both. While high weightings assigned to any assessment task reduce opportunities for students to demonstrate their abilities through more diversified forms of assessment, we argue that the time-constrained, information-restricted, and typically written format of the examination makes it especially problematic.

In the current higher education landscape, questions about the reliability of assessment methods and their relative vulnerabilities to cheating are highly topical, especially in the wake of the recent availability of generative AI technologies such as ChatGPT. These developments seem likely to spur calls for even greater curricular emphasis on high-stakes examinations. Equally, ongoing concerns about rising costs and growing student numbers mean that universities are under considerable pressure to prioritise assessment methods that are cost-effective and efficient, which can result in a default to examinations without sufficient consideration of their pedagogical merits. It is therefore timely to revisit and review the empirical evidence and arguments for and against the use of examinations.

Methods

We conducted a scoping review of the education literature for arguments and evidence for and against the use of high-stakes final examinations in higher education. The purpose of a scoping review is to summarize the body of literature on a given topic and to assess the quality of the evidence (Gómez & Suárez, 2021; Munn et al., 2018). This methodological approach was therefore well suited to the objective of our study. Our selection of studies followed the Preferred Reporting Items for Systematic Reviews and Meta-Analyses extension for Scoping Reviews (PRISMA-ScR) guidelines, which involve four stages: identification of records, screening of records, assessment of eligibility, and inclusion of studies (Gómez & Suárez, 2021; Liberati et al., 2009). In the first stage, we conducted a search in three databases (Web of Science, Scopus, and ProQuest) for all articles written in English and published before July 2023 that contained the term ‘high-stakes test,’ ‘high-stakes exam,’ or ‘high-stakes assessment’ in the title, abstract or keywords, coupled with ‘university’ or synonyms (‘tertiary,’ ‘higher education,’ and ‘college’) and ‘review,’ ‘evidence,’ ‘empirical,’ or ‘analysis.’ Our database search returned 406 articles. As part of stage one, in line with the PRISMA guidelines, we also identified articles through other sources, including papers that were known to the authors, articles from reference lists, additional references suggested by reviewers of an earlier version of this paper, and papers that addressed the related topic of ‘exam culture.’ Through this process, we identified a further 82 articles.
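
To make the search strategy concrete, the boolean query described above can be expressed as three AND-ed groups of OR-ed terms. The following minimal sketch (in Python) reconstructs the search string from the terms listed in the text; the TITLE-ABS-KEY field wrapper is an assumed Scopus-style convention, and the exact field syntax differs for Web of Science and ProQuest.

```python
# Minimal sketch reconstructing the boolean search string described above.
# The three term groups are taken from the text; the TITLE-ABS-KEY wrapper
# is an assumed Scopus-style convention, not the exact query used.

topic_terms = ['"high-stakes test"', '"high-stakes exam"', '"high-stakes assessment"']
setting_terms = ['university', 'tertiary', '"higher education"', 'college']
evidence_terms = ['review', 'evidence', 'empirical', 'analysis']

def or_group(terms):
    """Join a list of terms into a parenthesised OR group."""
    return "(" + " OR ".join(terms) + ")"

# Combine the groups with AND, as in the search described in the Methods.
query = " AND ".join(or_group(g) for g in (topic_terms, setting_terms, evidence_terms))
print(f"TITLE-ABS-KEY({query})")
```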

After removing duplicates, in stage two of the process, two of the authors (SF and RM) independently screened the abstracts of the retrieved articles to determine whether they were relevant. Papers were rejected if they were deemed a) off-topic or out of scope; b) not relevant to a higher education teaching and learning setting; c) narrowly focussed on a specific assessment type, assessment instrument or educational context; or d) only peripherally related to the topic. We aimed to include as many claims and sources of evidence for and against examinations as possible, and thus empirical grounding was not used as a criterion for inclusion. After independent screening, we reviewed cases where the authors’ initial assignments of relevance did not match (n = 8, 2%) to achieve consensus about inclusion or exclusion (six papers were excluded and two were retained by consensus). After screening, the 406 articles returned by our database search were narrowed to 40, while the records identified through other sources remained at 82, giving us a total of 122 relevant papers. In the third stage, the authors read the full text of the selected studies to confirm that they met the inclusion criteria.

Article coding and analysis

Articles were coded according to emergent themes. Our method of data coding combined an inductive and a deductive approach (Braun & Clarke, 2012). Firstly, using an inductive approach, we derived thematic codes from the articles, which allowed us to create a long list of codes and to tag each article in our database with the relevant codes. Secondly, we identified the themes, considering areas of commonality between the codes and the key ideas around which the codes clustered (Braun & Clarke, 2012). We refined the themes and combined those that were conceptually aligned, resulting in seven key themes. Some themes, such as optimal exam design or the prevalence of examination cultures in certain countries, were not helpful in addressing our central question and were therefore omitted from the thematic analysis, though they inform the introductory section of this article. We then returned to the papers and applied deductive reasoning to determine whether each paper presented arguments in relation to one or more of the themes, as well as whether it offered evidence to support its claims. Each article was classified according to the themes and entered into a table (Table 1); articles spanning more than one theme are listed multiple times. Once the texts were grouped thematically, we summarized and synthesized the arguments and evidence presented within each of the key themes, identified contradictions and inconclusive findings, and highlighted any gaps in the literature.
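
As a minimal sketch of the coding workflow described above (in Python), the fragment below shows how inductively derived codes tagged to each article can be deductively mapped onto the key themes and grouped into a Table 1-style listing. The article-to-code assignments and the code names are illustrative assumptions, not the study’s actual coding data.

```python
from collections import defaultdict

# Illustrative inductive step: each article is tagged with emergent codes
# (these assignments are hypothetical examples, not the study's data).
article_codes = {
    "Rawson et al. (2013)": ["retrieval practice", "knowledge retention"],
    "Bretag et al. (2019a)": ["cheating in exams"],
    "Ballen et al. (2017)": ["exam anxiety", "gender differences"],
}

# Deductive step: map refined codes onto the key themes
# (the code-to-theme mapping is likewise illustrative).
code_to_theme = {
    "retrieval practice": "Theme 1: memory recall and knowledge retention",
    "knowledge retention": "Theme 1: memory recall and knowledge retention",
    "cheating in exams": "Theme 5: academic misconduct and contract cheating",
    "exam anxiety": "Theme 6: stress, anxiety, and wellbeing",
    "gender differences": "Theme 7: fairness and equity",
}

# Group articles by theme; an article whose codes span several themes
# is listed under each of them, as in Table 1.
theme_table = defaultdict(set)
for article, codes in article_codes.items():
    for code in codes:
        theme_table[code_to_theme[code]].add(article)

for theme in sorted(theme_table):
    print(theme, "->", ", ".join(sorted(theme_table[theme])))
```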

Table 1 Listing of broad themes and the publications corresponding to each theme

Results and discussion

Below, we explore each of the above key themes and synthesize the scholarly literature on the topic. Within each theme, we closely examine the empirical evidence relating to both the benefits and drawbacks of high-stakes summative examinations and analyze the key arguments.

Theme 1: memory recall and knowledge retention

Many studies provide empirical evidence that answering questions in tests and exams can improve memory recall and the retention of information (Butler & Roediger, 2007; Deng et al., 2015; Karpicke & Roediger, 2008; McDaniel et al., 2007; Roediger & Karpicke, 2006). These studies, in the field of cognitive psychology, define this phenomenon as ‘the testing effect’ or ‘test-enhanced learning,’ and show that testing produces greater retention than further study of the same material. This evidence suggests that within certain disciplines, testing students is of value, and is especially useful in courses in which students need to learn a large amount of factual information.

However, the findings of these studies show that the spacing, timing, and frequency of effective testing are not aligned with the conditions commonly associated with high-stakes final examinations. Rather, research on the benefits of test-enhanced learning consistently indicates that regular short-answer tests or quizzes taken shortly after the content is taught are of greater value for knowledge retention than single, high-stakes summative examinations (Butler & Roediger, 2007; McDaniel et al., 2007; Santovena-Casal, 2019). Pointing to the power of successive relearning for knowledge retention, Rawson et al. (2013) show that retrieval becomes more durable as the number of successful retrievals increases, illustrating the limitations of the one-off learning encouraged by summative examinations. Low-stakes tests during the semester are therefore a better alternative, in terms of retention, to high-stakes examinations at the end of the semester.

Despite the benefits of test-enhanced learning, it is also well known that the retention of knowledge demonstrated in exams can be short-lived (Greene, 1931; Jones et al., 2015; Rawson et al., 2013). This is likely because the kind of cognitive activity required for long-term retention is at odds with that required for rote-learning facts, which tends to be the dominant form of activity in exam preparation. Content analyses of examinations in university science and medicine courses, for example, have shown that most questions merely test the isolated recall of factual knowledge (Ramsden, 1992). Without a deeper understanding of the relevance, context and application of concepts, abstract ideas are easily forgotten. Examinations may also undermine long-term knowledge acquisition because they emphasise extrinsic reward (Kuhbandner et al., 2016), enticing students to memorise facts so that they can perform well on an examination rather than engage in deeper learning.

Thus, while there is evidence that testing may assist with knowledge retention, high-stakes summative examinations are ultimately ill-suited to delivering the benefits of test-enhanced learning: they do not involve repetition, occur towards the end of the learning process, cover large amounts of content that is difficult to retain, and encourage rote learning.

Theme 2: student motivation and learning

A potential pedagogical benefit of examinations lies in the role they can play in motivating students to study. Studies examining self-reported levels of motivation have found (perhaps unsurprisingly) that student motivation to study is higher for high-stakes assessments than for low-stakes assessments, and that motivation is a positive predictor of outcomes (Wise, 2009; Wise & Demars, 2005). Study habits, including the practices of self-testing, rereading, and scheduling of study, are also known to be an important factor in retrieval, retention, and student achievement (Hartwig & Dunlosky, 2012; Roediger & Butler, 2011). However, if student motivation to study stems from a desire to rote-learn information to perform well on an examination, extrinsic motivation is activated, but not intrinsic or autonomous motivation, which has been shown to be far more important for student learning (Ryan & Deci, 2000) and long-term memory acquisition (Kuhbandner et al., 2016). In their systematic review of research on the impact of testing on students’ motivation for learning, Harlen and Deakin Crick (2003) found that high-stakes examinations reduce intrinsic motivation and have a negative impact on student self-regulation. Similarly, studies reporting on student preferences find that students do not perceive high-stakes examinations to be motivating or beneficial for their learning (Benediktsson & Ragnarsdóttir, 2020; Gijbels & Dochy, 2006; Wang & Brown, 2014). Students prefer a broad range of assessment tasks spread throughout the semester (Surgenor, 2013; Trotter, 2006; Wass et al., 2015), provided that continuous assessment is not so frequent that students are over-assessed (Harland et al., 2015; Wass et al., 2015). There are also limits to the efficacy of exam study as a learning method, since students tend to adopt traditional methods of studying such as memorization-related activities (Biggs et al., 2022) or reviewing past exams, rather than application-related activities or more authentic modes of study. For example, Zhan and Andrews (2014) found that English language students in China developed their listening skills for a high-stakes examination by listening to past examination recordings, rather than by listening to authentic English media as the designers of the exam intended.

A significant pedagogical disadvantage of summative high-stakes examinations identified in the literature is their encouragement of superficial or ‘strategic’ learning, whereby students focus only on studying content likely to yield higher grades (Williams, 2014). The term backwash, coined by Elton (1987), refers to the effect that assessment has on student learning, which can take either negative or positive forms. As Biggs et al. (2022, p. 188) observe, ‘negative backwash always occurs in an exam-dominated system’ in which ‘strategy becomes more important than substance.’ Examinations therefore tend to encourage surface-level learning (DeWitt et al., 2013; Gibbs, 1992), as discussed in Theme 1.

In part, such limitations are an outcome of poorly designed examinations that purely test the recall of information. Examinations can be designed to promote higher levels of thinking, such as by posing questions that demand in-depth analysis and synthesis, and by placing questions within contexts relevant to the field being studied (McConnell et al., 2015; Villarroel et al., 2019). The format of the examination also determines the range of skills and knowledge that can be fostered and assessed. For example, open-book and take-home examinations that allow students to engage with materials and integrate knowledge have a greater potential than closed-book invigilated examinations to measure the application and synthesis of knowledge (Durning et al., 2016) and to assess authentic tasks using real-world scenarios (Deneen, 2020). However, even when they are designed to promote higher order thinking, there are limits to the kinds of skills examinations can assess, given that their format is almost exclusively written.

A further impediment to student learning arises from teachers’ responses to high-stakes examinations, which encourage educators to ‘teach to the test’ (Marchant & Paulson, 2005; Smith, 1991) and to spend large amounts of time on test preparation (Jones et al., 2003). Berliner (2011) writes of the way in which examinations lead to curriculum narrowing, which he describes as a rational and inevitable response to high-stakes testing. Teachers will naturally align their curriculum with assessment, narrowing the content they teach to what will (and can) be assessed within the final examination. This practice is especially problematic when curriculum narrowing favors activities that call for low-level cognitive processes. Surface learning is a consequence of high-stakes testing because examinations are limited in their capacity to measure higher-order thinking, and teachers structure their curriculum accordingly (Berliner, 2011; Jones et al., 2003; Smith, 1991).

Therefore, while high-stakes examinations undoubtedly motivate students to invest time and prepare for the assessment, the nature of this investment is often geared towards maximising grades rather than learning, and is at odds with the demonstration of higher-order thinking and skills.

Theme 3: authenticity and real-world relevance

Some argue that examinations reflect real-life situations in the workplace, especially in fields such as the medical and health professions, where information and facts must often be recalled, and decisions made, under time-pressure and without recourse to materials (Durning et al., 2016; Van Bergen & Lane, 2014). Such performance under pressure is considered an essential skill by employers in a limited set of academic disciplines. However, it is debatable whether this rationale applies to most other disciplines, and whether performance in an examination is actually a useful proxy for performance under pressure in the workplace, since the artificially constructed nature of the exam format is unlikely to authentically reflect a genuine workplace situation.

Indeed, a recurring argument against the use of examinations in the literature is their lack of authenticity and limited capacity to foster the kinds of skills and knowledge students will need in their future careers, which are more likely to require critical thinking, problem solving, communication, and the application of knowledge than the ability to recall facts (Boud & Falchikov, 2006; Gibbs & Lucas, 1997; Williams, 2008, 2014). While improving the authenticity of examinations by designing questions that reflect real-life situations and require evaluative judgement has been shown to support deeper approaches to learning (Villarroel et al., 2020), even when designed effectively, examinations are limited in their capacity to support the principles of authentic assessment. Examinations have especially limited capacity to foster and assess listening and communication skills (Choi & Chun, 2022; Stopar & Ilc, 2017), two of the skills employers report wanting most in university graduates (Bauer-Wolf, 2019). Assessment tasks that foster these skills, such as presentations, debates, peer-to-peer learning activities, and inquiry-based group projects, all offer potentially valuable opportunities for authentic learning that are reduced when there is a heavy emphasis on high-stakes examinations.

It is also highly unlikely that summative exams will prepare students to be life-long learners, because they position students as passive recipients of feedback without encouraging them to judge the quality of their work or to apply the feedback they receive (Boud, 2018; Boud & Falchikov, 2006). The results of summative examinations are final and preclude any opportunity for students to learn from mistakes or improve their performance (Knight, 2002). The capacity to reflect upon, critically appraise, and improve one’s own work is likely to be essential to students’ future lives and careers. Examination formats are therefore poorly aligned with the imperative for students to develop the capacity to self-assess and to receive and implement feedback. The provision of quality and timely feedback is an essential element of formative assessment, as is the active role of the student in engaging with and implementing the feedback they receive (Henderson et al., 2020; Winstone & Carless, 2020), to ensure continuous improvement and enhanced performance. The capacity for formative assessments to better allow students to demonstrate their ability is evidenced by numerous studies showing that student performance improves in assessments that take place outside of the examination context (Bridges et al., 2002; Richardson, 2015a; Simonite, 2003).

The high-stakes nature of most summative examinations also discourages students from adopting an experimental approach to learning, which is a key desired graduate capability. Making mistakes and experiencing misconceptions are an essential part of learning (Metcalfe, 2017; Verdake et al., 2017), which is why it is important that students are given sufficient opportunities to engage in low-stakes (or no-stakes) assessments early in a subject and to use ‘error detection’ (Biggs et al., 2022, p. 186) as the basis for correcting and learning from their mistakes.

In summary, the information-limited, high-pressure context of high-stakes examinations is far removed from most authentic workplace situations. The restricted format of such assessments and their summative nature furthermore limit opportunities to assess key generic skills and discourage an experimental approach to learning through iterative cycles of feedback and improvement.

Theme 4: validity and reliability

When assessments are highly consequential for selection, progression, recognition or certification, it is self-evident that students, teachers and other stakeholders must have confidence that the instrument has high validity and reliability. A simple definition of the validity of an assessment is that it assesses what it claims to assess, and thus allows meaningful, accurate, and appropriate inferences to be made from its scores (Messick, 1992). However, validity is a complex and multifaceted construct with dimensions ranging from construct validity (whether the test measures the concept it was intended to measure), to content validity (whether the test includes a representative set of all aspects of the construct), face validity (whether the content of a test seems appropriate to its aims), criterion validity (whether test scores correlate with the functional behaviours the test seeks to measure), and consequential validity (whether the test has potential or actual positive or negative consequences for teaching and learning; Messick, 1992).

High-stakes university examinations are problematic from the perspective of validity for several reasons. Firstly, while there is a well-developed literature on validity testing, validation of high-stakes examinations administered in university courses is neither required nor routinely undertaken; indeed, even formal validation of large national high-stakes tests is uncommon (Mason, 2007; Messick, 1992; Stobart, 2009). University examinations are typically set by academic staff who have extensive disciplinary knowledge but scant expertise in maximizing or evaluating the validity of their examinations (an exception is the clinical sciences, where ‘blueprinting’ is often used to improve the content validity of examinations; Eweda et al., 2020). University assessment policies often require staff to prepare ‘supplementary’ examination papers for students who are unable to sit the initial examination, but formal comparison of the validity equivalence of exam variants is also seldom undertaken. The rarity of formal validation is likely due to the complexity of the concept of validity, the lack of agreed approaches to validation, the multiple methods needed to provide comprehensive validity evidence, and the effort required to undertake such endeavours (Shaw et al., 2012).

Secondly, because validity is not an inherent characteristic of a test, but of the test in the context in which it is used (Cronbach, 1971), even if a particular test is found to be valid for one purpose (e.g., norm-referenced comparison of accomplishment across students in a cohort), the same test may not be valid for another purpose (e.g., as a pass/fail decision-making tool; Smith & Fey, 2000). Sternberg (1997) found that summative assessments with high predictive validity for achievement at particular points in an undergraduate degree were often moderate or poor predictors of subsequent career achievement.

Thirdly, depending on their perspectives, teachers and students may have very different perceptions about whether a test is valid and fair (Caines et al., 2014). As discussed in Theme 2, high-stakes examinations have been extensively criticised for their poor consequential validity. Frederiksen and Collins (1989) defined systemically valid tests as those that drive curricular and instructional changes in education systems that foster the development of the cognitive traits the tests are designed to measure. They suggested that high-stakes examinations led to an undesirable narrowing of what is taught due to excessive focus on meeting test requirements. Sambell et al. (1997) reported that students had very negative views of traditional ‘unseen’ (closed-book) examinations, perceiving that the strategies required to perform well on such examinations encouraged shallow approaches that distorted the quality of their learning. Entwistle and Entwistle (1991) similarly found both that examinations distorted students’ efforts to achieve genuine understanding and that examination questions often did not tap conceptual understanding.

Examinations have also long been criticized for their low reliability (Hartog & Rhodes, 1936), defined as their ability to produce consistent and reproducible results. Low reliability of examination performance can result from a range of factors related to examinees, examiners, the subject being examined, the test items, and how the examination is scored (Haertel, 2006). For instance, the same examinee may perform differently on an examination because of their psychological or physical health, the conditions of the examination space, or their familiarity with the test items selected for the exam. There are also many factors that can affect the performance and judgement of the examiner, including bias, inconsistency, and rater drift (Cox, 1973; Hartog & Rhodes, 1936; Kellaghan & Greaney, 2019; Knight, 2002). Finally, features of the examination itself, such as the number of test items and whether rubrics are used, can affect how consistently it is marked. Collectively, these factors significantly reduce confidence in the reliability of examination scores.
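
This definition can be made precise in classical test theory terms; the framing below is added here for clarity (it underlies reliability treatments such as Haertel, 2006, but the notation is ours rather than drawn from the sources reviewed). An observed examination score is modelled as a true score plus error:

\[
X = T + E, \qquad \rho_{XX'} = \frac{\sigma_T^2}{\sigma_T^2 + \sigma_E^2},
\]

so that reliability \(\rho_{XX'}\) is the proportion of observed-score variance attributable to true-score variance. Each of the examinee, examiner, and test-item factors listed above contributes to the error variance \(\sigma_E^2\) and thereby depresses reliability.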

Issues of validity and reliability are arguably not unique to examination assessments. However, the absence of a culture of validation, the difficulty of achieving high validity, the evidence for low consequential validity, and the myriad factors that can affect the reliability of examination performance mean that there is a troubling lack of validity and reliability evidence underpinning the culture of high-stakes, ‘one-chance’ examinations.

Theme 5: academic misconduct and contract cheating

One of the most common arguments in favour of high-stakes summative examinations is the belief that they are more effective than other forms of assessment at preventing contract cheating. This is one reason why invigilated closed-book examination formats are favoured: students complete these assessments in a tightly controlled environment, providing photo ID and sitting the examination in an open public space under close observation. This ought to minimize opportunities for cheating and plagiarism (Crossley, 2022; Van Bergen & Lane, 2014).

However, it is evident that even these tightly controlled contexts do not provide reliable protection against academic misconduct and cheating (Lancaster & Clarke, 2017). Indeed, contract cheating in relation to examinations appears to be prevalent, involving behaviors that range from collusion to impersonation (Bretag et al., 2019a), facilitated by the apparent ease with which university student identification cards can be forged (Potaka & Huang, 2015). Sheard and Dick (2003) estimated that the frequency of cheating in examinations in a cohort of graduate students in IT courses approached ten percent, while in McCabe’s (2005) study of 64,000 North American university students, over one-third admitted to some form of exam cheating. Bretag et al. (2019a) found in a large survey of Australian universities that students participated in undetected cheating in invigilated examinations at higher rates than any other type of cheating, including contract cheating in written assessments. The frequency of academic misconduct in examinations undoubtedly increased with the move to online learning during the COVID-19 pandemic (Hill et al., 2021; Peh et al., 2021; Reedy et al., 2021), and challenges around online proctoring (Raman et al., 2021) mean that there is only minimal capacity to ensure academic integrity in closed-book online examinations. Noting that students perceive contract cheating to be most likely for heavily weighted assignments, Bretag et al. (2019b, pp. 685–686) suggest that “examinations provide universities and accrediting bodies with a false sense of security” and that “an over reliance on examinations, without a thorough and comprehensive approach to integrity, is likely to lead to more cheating, not less.”

Effective exam design and delivery can minimize opportunities for contract cheating; for example, examination papers should not be reused, online tests should not be unsupervised, and low-level or ‘one right answer’ tasks and questions should be avoided (Dawson, 2020). However, concerns about academic misconduct can more effectively be alleviated through alternative forms of assessment that limit or remove the potential for cheating. The increasing prevalence of “assignment outsourcing” to ghost writers and essay mills is well known (Ali & Alhassan, 2021; Awdry, 2021), and the essay format is likely to continue to raise academic misconduct concerns, especially given the increasing use of AI software such as ChatGPT. However, careful assessment design may help combat or reduce opportunities and incentives for cheating in a range of ways (Baird & Clare, 2017). For instance, assessment tasks or questions that ask students to reflect or draw on personal circumstances or experiences (Sutherland-Smith, 2008) or local contexts and environments, or tasks that are conducted within a specific class or tutorial activity, should generally be more difficult to procure from external sources than standard essays on common topics. Similarly, where tasks involve repeated contributions (reflective journals and blogs), audit trails of progress, or other forms of ‘authentic’ assessment, they ought to be difficult or costly to obtain from external providers. Finally, authorship of some assessment tasks, such as vivas, individual or group oral presentations, or video presentations, can be verified with a relatively high degree of confidence. It is nevertheless important to recognise that ultimately, with the possible exception of labor-intensive formats like interviews or vivas, very few assessment tasks are immune to outsourcing (Bretag et al., 2019b; Ellis et al., 2020).

In summary, several large empirical studies challenge the widespread belief that invigilated high-stakes examinations offer better security against contract cheating (and instead suggest that they may be particularly vulnerable). While effective exam design and delivery measures can reduce cheating opportunities, academic integrity concerns alone do not provide compelling grounds for maintaining an overreliance on high-stakes examinations. Educational institutions should explore a broader range of assessment methods that better align with the evolving challenges of academic misconduct in the digital age.

Theme 6: stress, anxiety, and wellbeing

High-stakes examinations have long been associated with psychological distress and anxiety (Hembree, 1988; Kellaghan & Greaney, 2019; Lotz & Sparfeldt, 2017). This issue has come to the fore in recent years with an increased focus on the role that curriculum and assessment design play in supporting student mental wellbeing (Baik et al., 2019; Slavin et al., 2014). Physiological measures of stress, including cardiovascular parameters and stress hormones, have been shown to be higher during examination periods than outside these periods (Fejes et al., 2020; Maes et al., 1998; Weekes et al., 2006; Wolf & Smith, 1995; Zhang et al., 2011). Students also self-report higher levels of anxiety during examination periods (Ballen et al., 2017; Högberg & Horn, 2022; Zhang et al., 2011), and self-reported anxiety is a reliable correlate of physiological stress (Roos et al., 2021). Studies have also found a correlation between the peer competition promoted by high-stakes exams and negative mental health impacts, including emotions of shame and self-loathing (Fang et al., 2023) and suicidal ideation, especially in female students (Fawaz & Lee, 2022).

There is some dispute regarding whether distress and anxiety negatively impact on examination performance. Many studies have found that anxiety does have a negative impact on performance (Hembree, 1988; Stenlund et al., 2018; Von Der Embse et al., 2018; Wolf & Smith, 1995), while others have found that anxiety does not have a statistically significant effect on performance (Monrad et al., 2021; Sommer & Arendasy, 2015). Some even argue that examination anxiety is useful as it promotes study and preparation (Hamzah et al., 2018) and can increase performance (Shean, 2019). There is also disagreement on whether stress interferes with the retrieval of previously learned knowledge, with some studies finding that stress impairs memory retrieval (Vogel & Schwabe, 2016), while others find it does not (Theobald et al., 2022).

Such contradictory findings hinder a conclusive understanding of the relationship between anxiety and performance. Nevertheless, there is evidence to suggest that a range of other factors associated with anxiety may affect students’ capacity to perform well. For example, anxiety has been found to correlate negatively with motivation, which has a direct effect on achievement (ShayesteFar, 2020). Students more prone to examination anxiety are also more likely to have lower self-esteem and to sleep less during examination periods, which reduces concentration (Fernández-Castillo & Caurcel, 2019). It is likely that more heavily weighted examinations have stronger negative impacts on students’ wellbeing, due to the greater perceived consequences of the outcomes (Franke, 2018; Salehi et al., 2019; Wolf & Smith, 1995). There is also a range of cultural and genetic factors that exacerbate experiences of examination stress (Zhang et al., 2011), meaning that students will be affected unequally, as we discuss in more detail below. Finally, there is evidence that assessments which are perceived by students as threatening, and which provoke anxiety, may drive students to adopt surface rather than deep approaches to learning (Gibbs, 1992; Ramsden, 1992).

As outlined above, there is substantial evidence that examinations cause elevated distress and anxiety. Although the impact of examination anxiety on student performance is inconclusive, the documented adverse effects of examinations on student mental health and wellbeing are concerning, as is the negative impact of examination anxiety on student motivation.

Theme 7: fairness and equity

There is some anecdotal opinion that examinations are a fair form of assessment, a ‘test of truth’ that allows students to compete and demonstrate their individual capabilities on an equal footing (Crossley, 2022). However, such opinions are not supported by the empirical evidence. It is known that students perform differently under time pressure (De Paola & Gioia, 2016), and there is considerable evidence that examinations have the potential to generate academic inequity due to differential performance based on gender (Ballen et al., 2017; Mehrazmay et al., 2021; Rask & Tiefenthaler, 2008; Salehi et al., 2019), socio-economic status (Gliatto et al., 2016; Jackson et al., 2020; Uy et al., 2015), race and ethnicity (Claypool & Preston, 2013; Klenowski, 2009, 2016; Marchant & Paulson, 2005; Preston & Claypool, 2021; Richardson, 2015a; Trumbull & Nelson-Barber, 2019), and disability (Meeks et al., 2022; Nieminen, 2022; Nieminen & Tuohilampi, 2020; Tai et al., 2022). This substantial body of literature on the equity implications of high-stakes examinations provides compelling evidence that examinations can disadvantage marginalized groups and contribute to academic inequity, which intersects with impacts on wellbeing and student learning.

A range of studies within the STEM disciplines suggests that examinations differentially affect students based on gender, finding that women tend to suffer higher levels of assessment anxiety, leading to lower wellbeing and reduced concentration during an examination and resulting in lower performance (Fernández-Castillo & Caurcel, 2015, 2019; Roos et al., 2021; Salehi et al., 2019), an effect that may be stronger at introductory levels of university study (Ballen et al., 2017; Salehi et al., 2019). A study by Ballen et al. (2017) found that women in an introductory biology course underperformed on examinations compared to their male counterparts but outperformed them on combined non-examination methods of assessment. Salehi et al. (2019) argue that the use of high-stakes examinations as a primary assessment method in the STEM disciplines, especially in introductory level courses, imposes a “gender penalty” on female students that may prevent them from advancing in the discipline. Gendered differences have also been found in how students respond to outcomes, with studies suggesting that women may be more likely than men to leave a discipline after receiving low marks in an introductory course (Rask & Tiefenthaler, 2008; Salehi et al., 2019).

However, there is some debate about the relationship of gender to performance and preferences across modes of assessment, and not all studies provide conclusive evidence of gender bias. For example, a study at the University of Sussex (Woodfield et al., 2005) found that women outperformed men by a small margin on both coursework assignments and final examinations, and that students of both sexes performed better on coursework than examinations. Some studies have found no evidence of differential performance on the basis of gender (Karami, 2013; Niessen et al., 2019), while others have found that different components of examinations and styles of questions favour one gender over the other (Bordbar, 2020; Burgoyne et al., 2021), suggesting that examination design might have more significant equity implications than the assessment method itself. A range of intersecting factors and independent variables makes it difficult to draw a definitive conclusion about whether exams perpetuate gender bias. However, there is sufficient evidence in the literature to suggest that there may be correlations between assessment modes and gendered styles of learning that warrant consideration when designing assessment tasks.

In addition to potential gender biases, examinations may be biased towards Western students, with equity implications for international students from non-Western countries and for Indigenous students. Richardson (2015a, b) suggests it is likely that the under-attainment of students from ethnic minorities is connected to assessment methods, while many have argued that examinations disadvantage Indigenous students (Claypool & Preston, 2013; Klenowski, 2009; Preston & Claypool, 2021; Trumbull & Nelson-Barber, 2019), as they tend to promote Western intellectual knowledge and values by supporting the view that knowledge can be given, accumulated, and tested in a linear manner. Recent scholarship on inclusive assessment design further argues that examinations fail to meet the needs of student diversity, especially with respect to students with disabilities (Nieminen, 2022; Nieminen & Tuohilampi, 2020; Tai et al., 2022). If examinations favour students from certain groups (male, Western, able-bodied), as this research suggests, their reliability and validity as measures of student achievement also come into question.

Overwhelmingly, the research suggests that ‘one-chance,’ time-pressured final examinations have exclusionary effects and disadvantage marginalised student groups. Alternative forms of assessment that allow for more diverse formats (including non-written formats), as well as formative assessments that offer more support for students, are therefore better aligned with the principles of inclusive assessment design.

Conclusion

Our scoping review of the literature suggests that the current heavy reliance on high-stakes final examinations in many university subjects is poorly justified by the balance of empirical evidence, and that traditional examinations (closed-book, individual, invigilated, time-constrained, summative, final, and high-stakes) have limited pedagogical value. However, the evidence on the benefits of test-enhanced learning for memory recall and knowledge retention (Theme 1), along with the role that examinations can play in motivating students to study (Theme 2), indicates that well-designed examinations in a revised format do have a role to play in the curriculum in some subjects, especially when they are formative and low-weighted. To be beneficial to student learning, the format of the examination must engage students in higher-order skills, which can potentially be achieved in open-book and take-home examinations, in short examinations or tests scheduled throughout the semester that build student learning over time, and in groupwork examinations, which can be employed to engage students in collaborative learning tasks. Regardless of their format, it is imperative that examinations are well designed for both pedagogical and security reasons. For example, short-answer questions and context-rich multiple-choice questions that require the application of knowledge can enhance learning relative to multiple-choice questions that require the recall of facts (McConnell et al., 2015). To reduce opportunities for contract cheating, low-level or ‘one right answer’ tasks and questions should be avoided (Dawson, 2020).

While the evidence presented in this paper within Themes 1 and 2 suggests that well-designed examinations can be of value under the kinds of conditions outlined above, the pedagogical drawbacks of examinations across all other themes illustrate that it is highly problematic when high-stakes final examinations dominate the curriculum. The artificially constructed nature of the examination format limits its authenticity and real-world relevance (Theme 3) and precludes opportunities for students to self-assess and implement feedback, which are fundamental to becoming life-long learners. The absence of formal validation, the difficulty of achieving validity, and the low reliability of examination performance (Theme 4) raise serious concerns about the role examinations frequently play as highly consequential measurements of student performance and capacity. Although guarding against academic misconduct and contract cheating (Theme 5) is a commonly cited reason for the ongoing use of examinations, the evidence shows that contract cheating in examinations is not only prevalent but may be even more pronounced than in other forms of assessment. Increasing the use of invigilated final examinations will not fix this problem. Instead, universities need a comprehensive approach to integrity that includes careful assessment design and forms of authentic assessment that mitigate the potential for cheating.

The role that high-stakes examinations play in contributing to increased stress and anxiety and decreased student wellbeing (Theme 6) is one of the most troubling findings of this review. The literature provides considerable evidence to show that examinations have adverse effects on student physical and mental health and demonstrates the negative impacts of examination anxiety on student motivation, concentration, and deep approaches to learning. Even if, as some studies claim, anxiety can lead to increased study and performance (and many studies dispute this claim), such potential gains need to be weighed carefully against the negative impacts on wellbeing. Moreover, students have differing capacities to perform effectively under stress and time-pressure, and differential performance is also likely on the basis of gender, socio-economic status, race, ethnicity, and disability. The potential for examinations to disadvantage marginalized student groups and perpetuate educational and social inequity (Theme 7) is especially concerning when they are high-weighted and have significant consequences for students’ lives and careers.

The pronounced lack of empirical evidence for the pedagogical benefits of high-stakes examinations suggests that they are employed primarily for reasons related to cost, efficiency, practicality, scalability, and administrative convenience. However, we question whether these reasons remain valid in the contemporary higher education landscape, where many advances in assessment design are already well established and others are emerging. There are promising examples of educational technology that can assist with the administrative burden of distributing and grading assessments other than high-stakes examinations at scale. Examples include platforms that support peer assessment (Søndergaard & Mulder, 2012), social annotation (Miller et al., 2018), personalized feedback via digital recordings (Ryan et al., 2021), and automated feedback and grading (Cavalcanti et al., 2021; Hegarty-Kelly & Mooney, 2021; Kumar & Boulanger, 2021). Programmatic assessment also offers an approach to assessment design that has the potential both to increase assessment security (Dawson, 2020) and to reduce the reliance on high-stakes summative examinations by diversifying assessment methods across the curriculum of an entire program (Baartman et al., 2022; Heeneman et al., 2021). While not trivial to implement, a programmatic approach allows for intentional emphasis on low-weighted assessments in the foundational years of a degree, placing more emphasis on assessment tasks that foster the development of student capabilities and cohort connections. Examinations could then be employed at key moments for accreditation purposes (although other methods might also fulfil this role), but final summative examinations would no longer be the default assessment mode.

The use of high-stakes examinations becomes particularly problematic when they dominate the curriculum at the expense of other valuable forms of assessment. It is essential that students are provided with the opportunity to engage with a broad range of assessment tasks, to develop their learning and build diverse skills that align with desired graduate outcomes and promote a culture of life-long learning. Variation in assessment is critical to allow students to build, apply, and demonstrate different kinds of skills; to foster skills in diverse areas such as self-assessment, inquiry-based learning, communication, and teamwork; and to be rewarded for originality rather than conformity (Ramsden, 1992). Designing assessment practices that encourage continuous and high-quality learning while supporting student wellbeing is a challenging but important task that requires creative and innovative approaches that move beyond an over-reliance on high-stakes summative examinations.