The Good Behavior Game (GBG), a universal classroom behavior management method, is among the most effective programs for promoting mental health in children and adolescents, preventing and/or reducing students’ disruptive behavior and increasing on-task behavior (Joslyn et al., 2019; Smith et al., 2021; Tingstrom et al., 2006). GBG provides teachers with classroom strategies for reducing aggressive or disruptive behavior, shyness, and social isolation. These behaviors are risk factors for future negative mental health-related outcomes, such as depression, violence, and suicide (Johansson et al., 2020; Kellam et al., 2014). Early interventions benefit the short- and long-term mental health of children and adolescents (Baker-Henningham, 2014), which is particularly relevant because up to one-fifth of children worldwide are affected by emotional and behavioral problems (Caetano et al., 2021). Evidence shows that preventive interventions implemented early in life, within educational or familial contexts, for children with pre-existing risk factors can reduce the incidence and/or progression of mental disorders throughout life (Humphrey et al., 2018; Joslyn et al., 2019).

GBG has been used worldwide, including in the United States (Barrish et al., 1969), Belgium (Saigh & Umar, 1983), the Netherlands (Witvliet et al., 2009), the United Kingdom (Leflot et al., 2010, 2013), Sudan (Coombes et al., 2016), and Estonia (Streimann et al., 2020). It is an evidence-based practice with positive effects in the following areas: (1) reduction of disruptive and aggressive behavior in the school context (Tingstrom et al., 2006), (2) prevention of the use and abuse of psychoactive substances (Kellam et al., 2014; Poduska et al., 2008; Wilcox et al., 2008), (3) prevention of suicidal ideation and suicide attempts (Kellam et al., 2008; Newcomer et al., 2016; Wilcox et al., 2008), and (4) prevention of the development of violent behavior and antisocial personality disorder in adulthood (Petras et al., 2008). Nonetheless, recent meta-analytic reviews of randomized controlled trials (RCTs) reported only a modest effect of classroom-based behavioral programs, primarily in reducing the occurrence of conduct problems (Ashworth et al., 2020; Smith et al., 2021; Veenman et al., 2018). Moreover, there were no significant results regarding inattention or teacher-rated shyness, which led the authors to suggest that GBG may not have the effect magnitude originally reported (Smith et al., 2021).

If GBG yields a modest or robust effect, proposing its integration into school curricula would be innovative, considering its potential utility for improving public health in the short and long term (Braun et al., 2023; Durlak & DuPre, 2008). GBG may facilitate teaching and learning processes and improve interactions between teachers and students (Kellam et al., 2011). Positive teacher-student and social interactions have been shown to correlate with better academic and behavioral skills in students. This is primarily because the teacher makes fewer negative remarks and fosters a friendly environment using social reinforcers, such as acknowledgments, praise, and descriptions of appropriate behavior (Baker et al., 2008; Gomes & Pereira, 2014). Moreover, GBG seems to reduce negative teacher remarks and mitigate disruptive student behavior, thereby reinforcing positive teacher-student relationships. This phenomenon arises from a reinforcing loop, wherein students’ prosocial behavior and attentiveness continually improve, ultimately contributing to better group dynamics and teacher-student relationships (Holmdahl et al., 2023; Leflot et al., 2010).

In 2013, the GBG program was culturally adapted to and implemented in Brazil for children between 6 and 10 years of age. This program was titled “Programa Elos: construindo coletivos” (Elos program: constructing collectives) to promote cooperative and democratic interactions between teachers and students (Lorenzo et al., 2018). The Elos program also aimed to promote the early integration of protective factors to minimize the risks of drug use/abuse in the long term, which was an important outcome identified in other GBG RCTs (Kellam et al., 2014; Petras et al., 2008). The program was implemented in Brazilian schools across 17 municipalities on a quasi-experimental pilot basis in 2014 and 2015, with no control groups (Schneider et al., 2018). An RCT evaluating the effectiveness of this program in Brazil has not yet been conducted. The program materials were revised in 2018 to include adaptations identified during qualitative studies of the content (Lorenzo et al., 2018). The updated version of the program was named Elos 2.0. Before disseminating this tool as part of a national public policy, it was essential to first test its effectiveness. This was recommended by the PROMISE project, which provides mental health promotion training guidelines and resources for healthcare professionals (Greacen et al., 2012).

This RCT aimed to evaluate the effectiveness of the Elos 2.0 program. Elos 2.0 is a version of the GBG that was adapted in and for Brazil, an upper-middle-income country, to assist in the reduction of problem behavior and promotion of prosocial behavior in children attending public schools. If the Elos 2.0 program proves to have a positive effect, it can then be implemented by the Ministry of Health in Brazilian public schools as a national policy.

Method

This study was a cluster-randomized controlled trial (RCT) with two parallel groups (intervention and control) of children aged 6 to 10 years (students in the first to fourth years of elementary education) from Brazilian public schools. As of 2019, the schools enrolled in this RCT did not offer programs or activities for drug use prevention or mental health promotion. The National Coordination of Mental Health, Alcohol, and Other Drugs of the Brazilian Ministry of Health (BMH) was responsible for implementing the program in collaboration with the Municipal and State Departments of Health and Education. The inclusion criteria for schools in this study were as follows: (1) being a public school, (2) having at least one class in each grade (first, second, third, and fourth), and (3) belonging to one of the three municipalities selected by the BMH. Further details of this specific RCT design can be found in Mariano et al. (2021).

According to the sample size calculation, a sample of at least n = 3800 (1900 children in the control group and 1900 in the intervention group; i.e., 19 clusters with an average of 100 individuals per group in each cluster) was needed to achieve a power of 84% to detect a difference of at least 0.2 in the mean scores for the Teacher Observation of Child Adaptation-revised (TOCA) between the two groups. The units of randomization were the schools. Thus, there were no intervention classes offered at control schools and vice versa, as a way to avoid sample contamination. The standard deviation was set at 0.80, based on Storr et al.’s (2002) work. The cluster correlation coefficient was set at 0.050, which is a conservative value for intraclass correlation compared to the previously reported value of 0.039 (Koth et al., 2009). The coefficient of variation for the cluster sizes was set to 0.700. A t-test with a significance level of 0.050 was used, and the degrees of freedom were based on the number of clusters. The PASS 15.0 software program’s cluster sample calculation module for randomized controlled trials was used to test two proportions; this was based on Donner and Klar’s (1996) equation.
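The role of the intraclass correlation and of unequal cluster sizes in this calculation can be illustrated with a minimal Python sketch. It uses a common textbook design-effect formula and a normal approximation, not the PASS routine, so it is not expected to reproduce the reported 84% exactly; the function names are hypothetical, and the inputs are the values stated above.

```python
import math
from statistics import NormalDist


def design_effect(mean_cluster_size, icc, cv=0.0):
    """Variance inflation due to clustering.

    cv is the coefficient of variation of cluster sizes; cv=0 gives the
    familiar 1 + (m - 1) * icc for equal-sized clusters.
    """
    return 1.0 + ((1.0 + cv ** 2) * mean_cluster_size - 1.0) * icc


def approx_power(delta, sd, n_per_arm, deff, alpha=0.05):
    """Normal-approximation power for a two-sided comparison of two means."""
    n_eff = n_per_arm / deff                 # effective sample size per arm
    se = sd * math.sqrt(2.0 / n_eff)         # standard error of the mean difference
    z_crit = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    return NormalDist().cdf(delta / se - z_crit)


# Inputs taken from the text: 19 clusters x 100 children per arm,
# ICC = 0.05, SD = 0.80, detectable difference = 0.2, cluster-size CV = 0.7.
deff_equal = design_effect(100, 0.05)            # ignores cluster-size variation
deff_unequal = design_effect(100, 0.05, cv=0.7)  # accounts for it
power_equal = approx_power(0.2, 0.80, 1900, deff_equal)
power_unequal = approx_power(0.2, 0.80, 1900, deff_unequal)
```

Under these assumptions, clustering inflates the variance several-fold, which is why the required number of children is far larger than it would be for an individually randomized trial.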

We randomly selected 30 schools (15 experimental and 15 control) from three cities: Fortaleza and Eusébio (in the northeastern Brazilian state of Ceará) and São Paulo city (in the southeastern Brazilian state of São Paulo). We began by evaluating the schools in the state of Ceará in 2019. In 2020, we intended to introduce the program in São Paulo city; however, its implementation was halted due to the COVID-19 pandemic. Fortunately, the Ceará sample provided sufficient statistical power to allow for an analysis of the program’s effectiveness in the fifth-poorest of Brazil’s 27 states (federative units).

Participants

In 2019, a total of 2030 elementary school children studying in 22 schools in two cities, Fortaleza and Eusébio, in the state of Ceará, were enrolled in the study. There were 917 students (mean age = 7.7 years; 50.3% girls) who received the intervention and 1113 students (mean age = 7.8 years; 51.5% girls) in the control group (see Table 1).

Table 1 Means and robust standard errors for the three subscales of TOCA

Randomization Process

Schools, as the units of randomization, were allocated to the control and experimental groups at a ratio of 1:1 per municipality. Schools were drawn at random from the list of the Anísio Teixeira National Institute of Educational Studies and Research (the Portuguese acronym is INEP), which included all Brazilian schools separated by municipality. All eligible schools (n = 30) were randomized via Efron’s biased coin algorithm (PASS, 2023), which ensured that both allocation groups would be balanced. The biased coin procedure was implemented by a biostatistician who was not involved in field data collection or intervention implementation. Those assessing outcomes were blind to group designation. Figure 1 shows the flow of schools and participants through the phases of the two-group parallel randomized trial. Randomization was performed before the baseline assessments.
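For readers unfamiliar with the procedure, the following Python sketch illustrates the general idea of Efron’s biased coin. It is an illustration only, not the PASS implementation used in the trial: the bias parameter p = 2/3 is Efron’s classic suggestion and is assumed here because the value used in the study is not stated, and the function name and seed are hypothetical.

```python
import random


def efron_biased_coin(n_units, p=2 / 3, seed=None):
    """Allocate units sequentially to two arms with Efron's biased coin.

    When the arms are balanced, a fair coin is used; otherwise the
    under-represented arm is chosen with probability p (> 0.5).
    """
    rng = random.Random(seed)
    counts = {"intervention": 0, "control": 0}
    allocation = []
    for _ in range(n_units):
        diff = counts["intervention"] - counts["control"]
        if diff == 0:
            prob_intervention = 0.5      # balanced: fair coin
        elif diff < 0:
            prob_intervention = p        # intervention arm is behind
        else:
            prob_intervention = 1 - p    # control arm is behind
        arm = "intervention" if rng.random() < prob_intervention else "control"
        counts[arm] += 1
        allocation.append(arm)
    return allocation


# Hypothetical example: 30 schools, arbitrary seed for reproducibility.
allocation = efron_biased_coin(30, seed=2019)
```

With p below 1, the procedure pushes the arms toward balance without guaranteeing exact equality; stratifying by municipality, as in the trial, would mean running the procedure separately within each municipality.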

Fig. 1 Flow diagram of the RCT to evaluate the effectiveness of the Elos 2.0 Program in the State of Ceará, Brazil

Procedure

Every school that was invited by the BMH agreed to participate. At the intervention schools, all students in the first through fourth grades participated in the Elos program as a part of their regular school activities. Each school designated one teacher per class who would receive training on the program. In Brazil, there is usually only one teacher in each elementary grade classroom. The BMH trained teachers in Ceará between March and April of 2019. Baseline data were collected from the control and intervention groups simultaneously 2 weeks before the beginning of program implementation in May 2019. The BMH implemented the Elos 2.0 program from June to October, and we collected the follow-up data in November 2019.

This RCT was registered with the Brazilian Ministry of Health Register of Clinical Trials (Registro Brasileiro de Ensaios Clínicos; REBEC) under protocol number U1111-1228-2342 (https://ensaiosclinicos.gov.br/rg/RBR-86c6jp). This methodology has previously been published (Mariano et al., 2021). The Research Ethics Committee at the UNIFESP approved this study (protocol number 3.099.878). All school principals provided informed consent before randomization, and the teachers and parents provided signed consent after randomization.

Measures

The Teacher Observation of Classroom Adaptation Checklist (TOCA-C) (Koth et al., 2009) was used to assess the primary outcomes, namely elementary school students’ behaviors. It was adapted to binary answers (“Never” and “Very often”). The teacher answers 21 items that classify the frequency of each student’s behaviors in the classroom. The questions were divided into three dimensions: concentration problems (seven items, e.g., “In the last three weeks, would you say the following statements were never or very often true of this child... pays attention”, “stays on task”), disruptive behavior (nine items, e.g., “Breaks rules”, “Harms others”), and prosocial behavior (five items, e.g., “Is friendly”, “Shows empathy & compassion for others’ feelings”). Each dimension score is the sum of the item scores divided by the total number of items for that subscale. In the original version, the alphas for the TOCA-C subscales ranged from 0.86 to 0.96 (Koth et al., 2009). The instrument was adapted to Brazil with a satisfactory model fit (χ2 = 961, df = 265, RMSEA = 0.078 [95% CI = 0.07–0.08], and CFI = 0.9; Schneider et al., 2020). Teachers completed the TOCA-C for all children at pre- and post-test with the mediation of a trained interviewer, who was blind to the group’s designation. Notably, teachers had observed the students for a minimum of 2 months before the pre-test administration. The interviewers were graduate students from the health sector (nurses and psychologists) who were trained by our team.
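The scoring rule described above can be sketched in a few lines of Python. The numeric coding (0 = “Never”, 1 = “Very often”) and the function name are assumptions made for illustration; the text does not state how answers were coded or whether any items were reverse-scored, so no reverse-scoring is applied here.

```python
# Number of items per TOCA-C subscale, as described in the text (7 + 9 + 5 = 21).
SUBSCALE_SIZES = {"concentration": 7, "disruptive": 9, "prosocial": 5}


def subscale_score(item_responses, subscale):
    """Sum of binary item responses divided by the number of items.

    Assumed coding: 0 = "Never", 1 = "Very often".
    """
    expected = SUBSCALE_SIZES[subscale]
    if len(item_responses) != expected:
        raise ValueError(f"{subscale} needs {expected} items")
    if any(r not in (0, 1) for r in item_responses):
        raise ValueError("responses must be coded 0 or 1")
    return sum(item_responses) / expected


# Hypothetical child rated "Very often" on three of the five prosocial items.
example = subscale_score([1, 0, 1, 1, 0], "prosocial")
```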

Intervention

The teachers underwent a 16-h training program conducted by the BMH during which they received printed material explaining the program, oral presentations, and role-playing. Moreover, at the first implementation of Elos 2.0, the teachers were accompanied by one of the trainers from the BMH.

To implement the Elos 2.0 program, the teacher followed three steps: (a) they divided the students in the same classroom into heterogeneous teams according to sex, academic performance, and behavior patterns (concentration problems, disruptive behavior, and prosocial behavior); (b) they then reminded each team of the four rules that must be followed (the instructions for activities, voice levels [silence, whisper, group voice, presentation, or street voice], the arrangement of seats [remain seated, stand and walk as agreed, or stand and walk freely], and to be kind [the kind behaviors were agreed upon collaboratively at the beginning of the game by teachers and students]); and (c) they announced the start of the game. This is the same procedure adopted in the first GBG version (Barrish et al., 1969). During Elos 2.0, the teachers observed their students, and if the rules were broken, they (1) described the situation to the classroom in a neutral tone of voice immediately after the inappropriate behavior occurred; (2) recorded the breaking of the rule on the blackboard, assigning a point to the specific team; and (3) praised the other students who were following the rules.

The rules were illustrated on posters displayed in the classrooms while the game took place, similar to the GBG version found in Van Lier et al. (2004). To implement the Elos 2.0 program, teachers had to select pedagogical activities that were in accordance with the school curriculum and that could be carried out autonomously by students, such as exercises and artistic activities. Thus, the teacher did not need to offer pedagogical support to the students during GBG, which promoted autonomy and group cooperation.

We also adopted a procedure from Dolan et al. (1993), where one student from each group was the teacher’s assistant. Specifically, this student helped to reward their peers at the end of each game. We altered this framework by assigning this task to all children in each group using a weekly rotation. These children had to adhere to the following criteria: the first assistant would be a prosocial child, then followed by the disruptive child, and finally, the child with concentration problems. At the end of 3 weeks, the rotation restarted. This rotation was based on the principles of model learning, where the children learn to behave according to the model of the prosocial peer who spent the first week assisting the teacher.

At the end of the game, the teams that broke no more than five agreements won the game. Notably, all teams can win the game at the same time, which avoids promoting competition. The teachers publicly announce the teams that won and do not mention those that lost. The winning teams and students receive prizes (tangible immediate reinforcement, such as stamps in their notebook, colored pencils, and so on). Depending on the progress of the program, they may also receive intangible but immediate reinforcement, such as praise and applause. A score is also recorded for teams that win games over a longer period, such as a week, after which delayed reinforcement is given. In summary, awards and recognition for following the rules were provided in three ways: reinforcement given to the student immediately after the game, reinforcement given to the team immediately after the game, and reinforcement given to the team at the end of each week of the game (Barrish et al., 1969; Hansen et al., 2010; Dion et al., 2011; Weis et al., 2015). The teachers had to play the Elos 2.0 game at least three times per week, with each session lasting between 10 and 30 min, over a period of 4 months (see Mariano et al. (2021) for further details).
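The winning rule and the weekly tally can be expressed as a short Python sketch. The team names, tallies, and function names are hypothetical; only the threshold of five broken agreements comes from the text.

```python
from collections import Counter

MAX_BROKEN_AGREEMENTS = 5  # threshold stated in the text


def winning_teams(broken_by_team, limit=MAX_BROKEN_AGREEMENTS):
    """Return every team that broke no more than `limit` agreements.

    Teams are not ranked against each other, so all of them can win.
    """
    return sorted(team for team, broken in broken_by_team.items() if broken <= limit)


def weekly_wins(games):
    """Count how many games each team won across a week of games."""
    wins = Counter()
    for broken_by_team in games:
        wins.update(winning_teams(broken_by_team))
    return dict(wins)


# Hypothetical blackboard tallies for three games played in one week.
week = [
    {"Team A": 2, "Team B": 5, "Team C": 6},
    {"Team A": 0, "Team B": 1, "Team C": 3},
    {"Team A": 7, "Team B": 4, "Team C": 5},
]
wins = weekly_wins(week)
```

Because the criterion is an absolute threshold rather than a comparison between teams, the second game in this example is won by all three teams.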

Notably, the Elos 2.0 program is similar to the PAX GBG version (PAX stands for “peace, productivity, health, and happiness,” developed by the Paxis Institute in the United States) (Embry et al., 2016). Compared to other GBGs tested in RCTs, Elos 2.0 and PAX GBG applied more reinforcement of positive child behaviors in the classroom, thereby decreasing the number of negative punishment processes. Although punitive procedures have been used since the beginning of GBG (Barrish et al., 1969), subsequent adaptations tended to reduce or eliminate punishments (Ialongo et al., 1999; Van Lier et al., 2004; Witvliet et al., 2009). Unlike PAX GBG, the Elos 2.0 program did not use the OK/Not OK desk cards to provide feedback to individuals, and students did not craft a vision of a wonderful classroom, outlining comprehensive lists of elements they wish to observe, hear, engage in, and experience more frequently (Streimann et al., 2017).

Regarding fidelity, the Elos 2.0 program provided a spreadsheet listing all the steps required for the teacher to execute the game properly; the teacher filled it out and sent it to the supervisor. The BMH supervisors were the same people who had trained the teachers. When the supervisor was present in the classroom, he/she would observe the game and fill out the spreadsheet. The implementation of Elos 2.0 was divided into the following stages: (1) familiarization, when supervisors were present in the classroom every 2 weeks; (2) consolidation, when supervisors were present every month; and (3) expansion, when supervisors were again present every 2 weeks. In the first phase, supervisors were able to observe and assist during the game, correcting the teacher if necessary. After the game, the supervisor was able to discuss the fidelity assessment with the teacher. In the consolidation phase, the autonomy of the teacher became noticeable, as the supervisors only officially observed once a month. Finally, in the expansion phase, the fortnightly visits aimed to help teachers master the game and prepare them to disseminate knowledge of the activities. The supervisor decided when the teacher was ready to move on to the next phase, regardless of the time spent in each phase. In general, phase 1 lasted between 6 and 8 weeks, phase 2 between 8 and 12 weeks, and phase 3 between 6 and 8 weeks. During the entire implementation process, teachers could review their fidelity assessments and schedule calls with supervisors. For every 4 weeks in which a teacher successfully engaged in the program and played Elos, a certificate congratulating them was shared in a closed group of teachers and supervisors on social media (WhatsApp, Facebook). This also occurred when teachers completed a phase. Different certificates were provided on each occasion, and the recipients could customize their next certificate; for example, they could suggest adding their pictures, quotes, and so on to the next certificate. Successful activities were posted in the groups. At the end of the program, the teachers received a certificate that listed their total time spent playing and engaging with Elos 2.0.

Statistical Analysis

Three different regression models were run to evaluate the effects of the Elos program on the three subscales of the TOCA-C. Given the different distributional features of each subscale, different types of models were used. For the concentration problem subscale, linear regression was used due to its continuous nature. For disruptive behavior, zero-inflated Poisson regression was used, given that such an outcome is a count-dependent variable, and approximately 50% of participants scored zero on this outcome. Finally, for prosocial behavior, we used a censored-above regression (continuous outcome, where almost 50% of the sample achieved a ceiling effect for a score of 5).

Regarding disruptive behavior, the zero-inflated Poisson model comprised two jointly estimated regressions (see Long, 1997). The first models the count variable for individuals who can assume values of zero and above. For this regression on the count variable, the coefficient was interpreted as the difference between groups in the log count of the number of points on the TOCA-C disruptive behavior subscale. The second is a logistic regression of the binary latent inflation variable on the covariates. It predicts the probability of being unable to assume non-zero values and was the focus of this study, with its effect reported as an odds ratio (OR). For the concentration problem and prosocial behavior subscales, the magnitude of the effect was reported in terms of Cohen’s d.
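For reference, the standard zero-inflated Poisson formulation consistent with this description can be written as follows. The notation is ours and is assumed for illustration: y_i is the disruptive behavior count, G_i the group indicator (1 = Elos 2.0), x_i the vector of adjustment covariates, π_i the probability of belonging to the always-zero class, and λ_i the Poisson mean.

```latex
\begin{aligned}
P(Y_i = 0) &= \pi_i + (1 - \pi_i)\, e^{-\lambda_i}, \\
P(Y_i = y) &= (1 - \pi_i)\, \frac{e^{-\lambda_i} \lambda_i^{\,y}}{y!}, \qquad y = 1, 2, \dots \\
\log \lambda_i &= \beta_0 + \beta_1 G_i + \boldsymbol{\beta}_x^{\top} \mathbf{x}_i
  && \text{(count part)} \\
\operatorname{logit}(\pi_i) &= \gamma_0 + \gamma_1 G_i + \boldsymbol{\gamma}_x^{\top} \mathbf{x}_i
  && \text{(inflation part)} \\
\mathrm{OR} &= e^{\gamma_1}
\end{aligned}
```

In this notation, β_1 is the between-group difference in the log count, and exp(γ_1) is the odds ratio for membership in the always-zero class.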

All between-group effect estimates were adjusted for sex, age, and baseline measurements of the outcome under assessment. It is crucial to note that, instead of relying on baseline statistical testing to select potential adjustment variables, we adopted a pre-specified adjustment policy based on arguments against baseline testing of the total randomized group (Altman, 1985; Senn, 1994) and relevant literature on statistical testing (Molenberghs et al., 2008; Rubin, 1976).

In our attrition analysis, we compared students who provided data at the two different times with those who only completed the initial questionnaire. Supplementary Tables S1 and S2 (Online Resource X) provide a detailed description of the distribution of the baseline continuous outcomes and the socio-demographic features for the groups involved in the attrition analysis.

We assumed a missing-at-random mechanism and, consistent with intention-to-treat (ITT) analysis, dealt with missing data using two methods: full information maximum likelihood (FIML) and multiple imputation. We made this assumption given that ‘…the empirical data can never provide evidence either for or against the missing at random assumption, and the missing completely at random assumption can be falsified by the data, but it can never be proven true’ (Rhoads, 2012). Additionally, a complete case analysis (which assumes missing completely at random) was performed to evaluate how far the FIML and multiple imputation estimates deviated from the complete-case estimates.

In the FIML approach, the model is estimated conditional on the independent variables. In our study, these were sex, age, baseline measurements of the outcome under assessment, and group random assignment. In regular regression, the means, variances, and covariances of the independent variables are not model parameters. However, to avoid the loss of cases due to missingness, we mentioned the variances of the continuous independent variables (age and baseline assessment) in our model specification. Distributional assumptions were then made regarding these variables, and their means, variances, and covariances were used as model parameters. Assuming a missing-at-random mechanism, FIML estimates are unbiased and more efficient than those from listwise deletion (i.e., complete case analysis, a common routine implemented in statistical software), pairwise deletion, and similar-response-pattern imputation (for further details regarding FIML, see Enders & Bandalos, 2001).

As a second approach to dealing with missing data that complies with the ITT paradigm and also assumes a missing-at-random mechanism, we used multiple imputation. A Bayes estimation of an unrestricted variance–covariance model was used to impute the missing values. For the unrestricted imputation model, we used the regression setting available in Mplus (Muthén & Muthén, 2019) version 8.4. This type of setting is recommended when all variables with missing values are of the same nature (e.g., in this study, all variables were treated as categorical to avoid imputing non-integer values arising from continuous assumptions). The following variables were included in the imputation step: random allocation status (control vs. Elos), age, sex, and both baseline and follow-up measurements. Fifty imputed datasets were generated and used in the subsequent analysis, following Rubin (1987).
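Estimates from multiply imputed datasets are conventionally combined with the pooling rules described by Rubin (1987). The Python sketch below illustrates those standard rules under the assumption that they are what was applied here (in the study the pooling was handled within Mplus); the function name and the toy numbers are hypothetical.

```python
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Combine per-imputation estimates and their squared standard errors.

    Returns the pooled estimate and its total variance:
    total = within + (1 + 1/m) * between.
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need matching lists from at least two imputations")
    pooled = mean(estimates)
    within = mean(variances)       # average sampling variance
    between = variance(estimates)  # sample variance across imputations (divisor m - 1)
    total = within + (1 + 1 / m) * between
    return pooled, total


# Toy example with three imputations (the study used fifty).
pooled_estimate, total_variance = pool_rubin([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
```

The between-imputation term is what carries the extra uncertainty due to the missing values into the final standard error.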

The estimator used for all analyses was maximum likelihood with robust standard errors (MLR), which accommodates the non-independence of observations (e.g., in this study, children nested in classrooms). The standard errors were computed with a sandwich estimator that takes the multilevel structure into account, using the TYPE = Complex command in Mplus, as proposed by Asparouhov (2006). The level of significance was set at 5%.

Results

Table 1 shows the means and robust standard errors for the three TOCA subscales at the baseline assessment. The robust standard error accounted for 76 classes. The groups that received Elos 2.0 as well as the control groups were all similar in terms of age and sex of the student populations.

Table 2 displays the intraclass correlation coefficients (ICCs) for the three subscales at baseline. This analysis shows that almost 30% of the variance in the assessments comes from teacher-class interactions.

Table 2 ICC for the three subscales’ baseline assessments

Table 3 shows the standardized effect sizes, confidence intervals [CI], and p values for both approaches (FIML and multiple imputation) for the three scale outcomes; this is adjusted for cluster effects, age, sex, and measured baseline assessment. The Elos 2.0 program has a positive effect on all three outcomes. In terms of concentration problems (where the higher the score, the better the concentration), students allocated to Elos schools showed an improvement in concentration by 0.254 standard deviations (95% CI = 0.038–0.469).

Table 3 Standardized effect sizes for the three scales’ outcomes after adjusting for cluster effects

Regarding disruptive behavior in participants enrolled in Elos schools, the log-odds of membership in the excess zero-generating process (i.e., having no disruptive behavior) increased by 1.586. This corresponded to an odds ratio of 4.88 (95% CI = 1.89 to 12.60) after adjusting for baseline disruptive behavior, sex, and age. In other words, the odds of no disruptive behavior being observed in classrooms where Elos 2.0 was deployed were 4.88 times the odds of no disruptive behavior being observed among students nested in the control schools. For the count part of the zero-inflated regression, there was no evidence of group differences under any of the three scenarios (i.e., complete cases, multiple imputation, and FIML).
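The relationship between the reported log-odds coefficient and the odds ratio can be checked with a short Python calculation. The inputs are the values reported above; the implied standard error assumes a symmetric Wald interval on the log scale, which is an assumption on our part rather than something stated in the text.

```python
import math

log_odds = 1.586              # reported coefficient for the inflation part
ci_low, ci_high = 1.89, 12.60  # reported 95% CI for the odds ratio

# Exponentiating the log-odds coefficient gives the odds ratio (about 4.88).
odds_ratio = math.exp(log_odds)

# On the log scale, a Wald interval is symmetric around the coefficient,
# so the midpoint of the log limits should sit close to 1.586.
log_midpoint = (math.log(ci_low) + math.log(ci_high)) / 2

# Standard error implied by the interval width (assumes a Wald-type 95% CI).
implied_se = (math.log(ci_high) - math.log(ci_low)) / (2 * 1.96)
```

Note that an odds ratio compares odds, not probabilities, so it should not be read as "4.88 times more likely".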

Finally, for the prosocial domain (the higher the score, the more prosocial behavior), there was a close-to-moderate effect size in favor of Elos schools (Cohen’s d = 0.436, 95% CI = 0.139 to 0.734). The participants exposed to Elos showed an average 0.436 standard deviation more prosocial behavior than the control group.

For all three outcomes, the FIML and multiple imputation estimates agreed with each other, showing the robustness of the estimates. Supplementary Tables S1 and S2 (Online Resource X) provide details of attrition across random group assignments.

Discussion

Elementary school students who participated in Elos 2.0 showed a significant increase in their prosocial and concentration scores at the end of the intervention when compared to children in the control groups. The effects that Elos 2.0 had on reducing disruptive behavior in the classroom were significant and robust; this trial found that the odds of exhibiting no disruptive behavior were 4.88 times higher among children who participated in Elos 2.0 than among those in the control classrooms. Thus, our study demonstrates the effectiveness of the Elos 2.0 program regarding three behaviors among first- to fourth-grade students.

Our results align with the meta-analysis of RCTs for classroom behavioral programs by Veenman et al. (2018), which reported significant effects on teacher-rated concentration and students’ disruptive behavior. Furthermore, we present a modest-to-robust effect, in contrast to the minimal effect found in the meta-analysis. One possible reason for the difference in effect size is the large reduction in our initially calculated sample size due to the COVID-19 pandemic and subsequent school closures. Moreover, Smith et al. (2021) conducted a meta-analysis including only the results of GBG RCTs and presented similar outcomes. The exception was teacher-rated shyness, for which they did not obtain a positive result. Notably, the positive results from this adapted GBG seemed to be more closely related to an increase in the scores of children with or at high risk for behavioral and emotional problems than in children who were not at high risk (Ashworth et al., 2020; Bowman-Perrott et al., 2013; Bradshaw et al., 2009; Streimann et al., 2020).

Although our intervention was universal (rather than limited to at-risk children with higher scores for disruptive behavior), Brazil has one of the least positive school disciplinary climates globally, which may have impacted our results. For example, the Teaching and Learning International Survey (TALIS) reports that an average of 12.7% of classroom time is spent keeping the classroom in order; Brazilian teachers spent an additional 19.8% of their time doing so. Similarly, students in Brazil were ranked 74th out of 77 countries for their cooperativeness (−0.35 on the Programme for International Student Assessment [PISA] index, 2018; OECD, 2015). According to the PISA (2018), Brazil’s disciplinary climate in language-of-instruction lessons is the second-worst among the countries studied (−0.37 PISA Index, rank 75/76, 2018; OECD, 2018).

The quality of education remains a challenge in several countries, most of which are low- and middle-income countries (LMICs) or have low-income communities (OECD, 2018). In these settings, evidence-based prevention programs in schools represent an enormous opportunity to reduce the impact of poverty on children’s mental health (Belkin et al., 2017). Furthermore, as a universal intervention, GBG may reduce stressful social interactions and increase students’ chances of high school graduation and college attendance by providing a better school disciplinary climate and a sense of belonging for the students (Bradshaw et al., 2009; Streimann et al., 2020).

Overall, GBG programs are intended to increase peer reinforcement to achieve prosocial actions and decrease behavioral problems that could undermine positive peer and student–teacher relationships (Johansson et al., 2020; Newcomer et al., 2016; Witvliet et al., 2009). Johansson et al. (2020) explained that the set of rules used to play a GBG provides a scenario where students can experience the consequences of rule breaking as well as the recognition and reward that accompany appropriate behavior.

The Elos 2.0 program methodology also prioritizes strategies to reinforce positive and appropriate behaviors, such as compliance with rules, cooperative play, and autonomy, which may improve social skills. In Brazil, children aged 4–5 years showed a 30% delay in socioemotional development in preschools, which is approximately 50% more than the worldwide prevalence (Caetano et al., 2021).

The Elos 2.0 program is similar to previous versions of GBG, in particular to the most important GBG version, PAX GBG (Embry et al., 2016), regarding its main procedural characteristics. In general, we observed that PAX GBG and other versions of GBG prioritized the display of rules in a visible place in the classroom, in addition to signaling the exact moment when the game started. The signal could be a specific sound or a pre-established phrase like “The game starts now!” The teachers also selected team leaders and assistants for the duration of the game. This is important because the student team leader assists the teacher in conducting the activities, specifically by distributing the materials to their peers, in addition to serving as a positive role model that other students in the group can imitate (Husband & Chong, 2011). It is worth highlighting that in the subsequent versions of the GBG, teachers and students democratically agreed upon classroom rules. During the game, we observed that the latest GBG versions gradually increased the reinforcing procedures, despite maintaining at least one punitive procedure. Finally, all the evaluated studies emphasized the gradual increase in reinforcing procedures at the end of the game, both immediate and delayed, tangible (e.g., stickers and school supplies) as well as intangible (e.g., praise and positive social recognition). Meanwhile, all studies carried out after the original version removed punitive procedures from the final stage of the game.

Moreover, the PAX GBG was adapted by incorporating extra activities and elements aimed at enhancing compliance and classroom management. The PAX GBG incorporates proven kernels, or strategies for influencing behavior (Domitrovich et al., 2010). These kernels are classified into four categories (antecedent, relational, physiological, and reinforcement), each with demonstrated effectiveness in shaping behavior. They aid children in preparing for and attaining goals, reduce anxiety, and provide rewards for desirable behavior (Embry et al., 2016). Regular use of these kernels affords students opportunities to practice prosocial skills and self-regulation (Streimann et al., 2017).

Compared with the diverse applications of GBG evaluated in RCTs, the Elos 2.0 program had a shorter duration (four months). Veenman et al. (2018) demonstrated that GBG programs with shorter durations had increased effectiveness. This might be related to the teacher burnout and general non-compliance that often occur with lengthier programs (Berg et al., 2017; Bradshaw et al., 2020). Moreover, we conducted a fidelity assessment for each game, and the teachers could assess the supervisor, which increased their support. We also intended to increase their engagement using gamification elements, such as positive reinforcement, customized prizes, feedback, and challenges. This new teacher-focused approach should be further investigated as a potential strategy to decrease teacher burnout. Unlike PAX GBG, we did not use the OK/Not OK desk cards to provide feedback to individuals (Streimann et al., 2017), because Elos 2.0 prioritizes strategies for reinforcing the positive and appropriate behaviors that students should exhibit in the classroom.

Considering that the Elos 2.0 program increased prosocial behavior and decreased disruptive behavior, and that low prosocial behavior and disruptive behavior are two important risk factors for drug use (NIDA, 2020), we expect that this program can also lead to a reduction in future drug use among these students. This evidence suggests that the federal government’s strategy to include Elos 2.0 as one of its drug-use prevention programs through the School Health Program (PSE) of the Unified Health System (SUS; Federal decree 6.286/2007) will have positive effects on Brazil’s health policy.

This study had several limitations. First, our study ultimately had a smaller sample size than intended. We began scheduling our RCT of the Elos 2.0 program in São Paulo in 2020, but were forced to stop prematurely because of the COVID-19 pandemic. Therefore, we analyzed only the data collected from the sample in Ceará, where the RCT was conducted in 2019. Nevertheless, even this smaller sample demonstrated a modest/robust improvement in the outcomes. Second, the game was not repeated if a child was absent on a specific day, but this is a common limitation of preventive intervention designs (Ariza et al., 2013). Third, we were not able to directly observe children because of the high cost of such observations. However, rating scales completed by teachers constitute the most common method used to evaluate children’s behavior and functioning at school. Pas et al. (2015) noted that if the teachers who provide the scale ratings are also the teachers who implement the intervention, they know whether they are in the control or intervention group, which may influence their ratings. Nevertheless, the BMH supervisors observed and completed an implementation spreadsheet for the first game (to verify whether students’ teams were divided as planned) and for the subsequent visits. We randomized our sample by school, avoiding the contamination effect that can occur when randomization is performed at the classroom level. A student could have characteristics of more than one group; for example, a student could have concentration problems and exhibit disruptive behavior. However, most GBG studies divided their samples in the same manner. Finally, we performed a follow-up assessment only 1 month after the completion of the intervention. Due to the COVID-19 pandemic, we were unable to perform further re-evaluations.

In conclusion, our findings indicate that Elos 2.0, as a universal teacher-delivered behavior management intervention, is associated with modest/robust improvements in children’s prosocial behavior and concentration skills as well as a decrease in disruptive behavior among all students, regardless of their behaviors in the classroom. This study provides preliminary support for the national implementation of the Elos 2.0 program across Brazilian public schools. More rigorously designed trials that use classroom-observed measures and include all Brazilian regions are needed.