Advances in technology have dramatically transformed the landscape of scientific inquiry assessment. Unlike traditional paper-based assessments, which primarily measure what has been achieved or whether students’ responses are correct or incorrect (product data), digital-based assessments offer further insight into how the responses are produced via the analysis of computer-generated log files (process data). Log files document the on-screen behavior of test-takers during the assessment, capturing activities like clicks and keystrokes, each accompanied by its respective timestamp. The integration of product and process data offers unprecedented opportunities to understand not only what students know, but also how they apply that knowledge in problem-solving contexts.

Recent research has begun to integrate both the product and process data from large-scale assessments for various purposes, such as identifying disengaged behavior (Kuang & Sahin, 2023) and detecting a risk of failure (Ulitzsch et al., 2023). Despite these developments, a noticeable gap persists in the literature when it comes to the application of these data within the context of science education (see reviews by Teig et al., 2022, and Reis Costa & Leoncio Netto, 2022). Specifically, there has been limited exploration of how process data from large-scale assessments can enhance understanding of students’ problem-solving strategies in scientific inquiry tasks.

Understanding how students solve inquiry tasks is of crucial importance for several reasons. Firstly, it provides insights into their problem-solving strategies and cognitive processes. For instance, by examining process data, researchers can gain a granular understanding of the strategies that lead to correct or incorrect responses (Scalise & Clarke-Midura, 2018; Teig et al., 2020; Zumbo et al., 2023). Such analyses allow us to identify strengths and weaknesses in student strategies and thus shed light on why some students succeed at solving inquiry tasks while others find them challenging. Even students who receive identical scores on the same task may interact with the computer-based environment differently. Some students may apply the most effective strategy immediately, whereas others need to explore various strategies before successfully completing the task. Insights from process data can offer a more refined picture of individual differences in problem-solving strategies.

Secondly, scientific inquiry tasks often require higher-order thinking skills, such as hypothesis design, investigation, synthesis, and argumentation (Rönnebeck et al., 2016). These skills are key to scientific literacy, which is not only important for STEM careers but also for informed decision-making in a scientifically and technologically advanced society (Schwartz et al., 2023). As such, understanding how students approach and solve scientific inquiry tasks can guide educators in fostering these critical skills. Moreover, understanding students’ problem-solving strategies can inform instructional design. The knowledge of how different students approach the same problem allows educators to tailor their instruction to diverse student needs, potentially improving learning outcomes.

To advance our understanding of student strategies for solving scientific inquiry tasks, the present study seeks to leverage the rich process data from the digital-based assessments of the Programme for International Student Assessment (PISA). These process data were chosen for two main reasons. First, these data—including the frequency of student actions, response times, and response accuracy—are publicly available (see Footnote 1) and underutilized for improving teaching and learning (Reis Costa & Leoncio Netto, 2022; Stadler et al., 2023), especially in the science domain. Their availability may open a window of opportunity for future studies to investigate student performance across a range of science competencies. Additionally, integrating the process data with other existing data, such as student achievement and background questionnaires, could yield valuable insights into the relationships between contextual variables (e.g., teaching and learning activities) and student problem-solving strategies. Second, the large-scale nature of the PISA dataset (for instance, PISA 2015 covers 57 countries and economies) offers potential for generalizability and for describing science achievement over time, both within a country and across a range of educational contexts worldwide (Teig et al., 2022). While the original PISA tasks remain confidential, this study also examines process data from a publicly released task from the PISA field trial study, making it possible to show the format and complexity of the inquiry tasks.

The present study illustrates how process data from PISA can be used to facilitate a deeper understanding of student strategies in solving scientific inquiry tasks. Specifically, it uses examples from process mining (PM) to visualize and analyze students’ problem-solving paths and latent profile analysis (LPA) to identify unique groups of students based on their problem-solving patterns. This study aims to introduce science education researchers to the wealth of information available in process data from large-scale assessments, potentially encouraging further exploration of these data to advance educational research, policy, and practice.

Student Log Files and Process Data from Large-Scale Assessments

Technological innovations have transformed the way student performance is assessed, moving from conventional paper-and-pencil to digital-based assessments. In 2006, PISA piloted computer-based assessments of science for the first time in Denmark, Iceland, and South Korea, followed by a worldwide implementation in PISA 2015 (OECD, 2016). A similar digital shift has also been applied in other large-scale assessments, including the Trends in International Mathematics and Science Study (TIMSS) and National Assessment of Educational Progress (NAEP). This digital shift offers an opportunity to evaluate more than just the outcome of the assessment by utilizing log files, which capture all students’ on-screen behaviors and their corresponding timestamps. The information extracted from these log files can be valuable in many ways, including in examining the process-related constructs that lead to the outcome, enhancing measurement precision, optimizing the assessment design, and validating score interpretations (Goldhammer et al., 2021; Provasnik, 2021; Zumbo et al., 2023).

Currently, log files from large-scale assessments are not purposely designed to provide insight into how students think and solve the tasks; instead, they are byproducts of the assessment shift to the digital mode (Provasnik, 2021). As a result, not every recorded event in the log files is immediately useful or can be considered process data. Process data are “the empirical data that reflect the process of working on a test question—reflecting cognitive and noncognitive, particularly psychological, constructs” (Provasnik, 2021, p. 3). To transform the raw data captured in log files into meaningful process data, researchers need to extract specific indicators that represent their construct of interest. The interpretation of indicators derived from log files needs to be theoretically and empirically validated to make log files valuable as sources of process data (Goldhammer et al., 2021). Data sources for process data are not limited to log files but also include other sources, such as eye-tracking, brain imaging, video recordings, and think-aloud protocols. In the present study, various behavioral indicators from log files are selected as process data to represent student strategies for solving inquiry tasks (for further details, see “The Present Study”).

A great number of studies have examined process data from large-scale assessments, especially in problem-solving and reading domains (see review by Reis Costa & Leoncio Netto, 2022). Even though digital-based assessment has been implemented to assess science performance for many years (e.g., since PISA 2006 and TIMSS 2019), hardly any study has explored student process data in science (Reis Costa & Leoncio Netto, 2022). To this end, the present study examines science performance and presents two examples of how process data can be used to uncover typically unobserved problem-solving processes that potentially explain differences in student performance on scientific inquiry tasks.

Scientific Inquiry

Scientific inquiry has been a fundamental aspect of science education reforms around the world, both historically and in contemporary discussions. Despite its critical role, the concept of “inquiry” is not uniformly defined, resulting in a wide array of interpretations in the literature. To address this, Rönnebeck et al. (2016) conducted a systematic review to identify the variations in the range and type of inquiry by distinguishing inquiry as either individual activities or a set of integrated activities. From this analysis, they proposed a framework of scientific inquiry that consists of four distinct phases: preparing (e.g., formulating hypotheses), carrying out (e.g., conducting investigations, analyzing data), explaining and evaluating (e.g., engaging in argumentation and reasoning, developing explanations), and communication. A salient feature of this framework is its emphasis on connecting inquiry with student understanding of both scientific concepts and the inherent nature of science (Rönnebeck et al., 2016). This perspective of inquiry is largely in line with policy reforms (e.g., National Research Council, 2013) as well as science assessment frameworks from large-scale assessments like TIMSS and PISA (Mullis & Martin, 2017; OECD, 2016).

The present study builds on Rönnebeck et al.’s framework of scientific inquiry (2016). It focuses on two pivotal scientific inquiry skills, namely, coordinating the effects of multiple variables and coordinating a theoretical claim with evidence. These inquiry skills—central to the “carrying out” and “explaining and evaluating” phases, respectively—lay a foundation for developing students’ scientific understanding and the nature of science (Kuhn, 2016; Osborne et al., 2016). However, despite their significance, few empirical studies have simultaneously examined these skills using the insights from student process data. Understanding the strategies students employ and the challenges they encounter during inquiry practice is key to designing digital-based learning environments that efficiently support the integration of both skills. The following sections discuss these inquiry skills in greater detail.

Coordinating the Effects of Multiple Variables

Multivariable reasoning is crucial in understanding scientific phenomena, which typically consist of multiple interacting variables whose effects on one another must be identified (Lesperance & Kuhn, 2023; Zimmerman, 2007). This skill is especially relevant during the “carrying out” or investigative phase of inquiry, for instance, when designing and conducting experiments. During this phase, students demonstrate various exploration strategies to examine the causal relations among variables and generate interpretable evidence (Scalise & Clarke-Midura, 2018; Teig et al., 2020).

One such effective strategy is the vary-one-thing-at-a-time or VOTAT strategy, which is used to identify causal relationships among variables (Tschirgi, 1980). This strategy adheres to the principle of isolated variation by changing one variable of interest at a time while ensuring all other variables remain unchanged. For instance, to investigate the effect of sunlight on plant growth, a student might use the VOTAT strategy by placing one plant in the sun and another in the shade, ensuring all other conditions like water, soil, and pot size are identical. This strategy allows for a clearer understanding of how the independent variable (in this case, sunlight) affects the dependent variable (plant growth). The VOTAT strategy is critical in determining the effect of a single variable on an outcome in a univariable context. However, many real-world phenomena, like climate change or human health, arise from interactions of multiple variables rather than a single one. Consider a student exploring various factors affecting human health. The VOTAT strategy is useful to understand an individual effect, such as how diet affects health while keeping exercise constant. However, to fully understand the broader picture, it is crucial to analyze other variables simultaneously, such as how both diet and exercise jointly influence health outcomes, in a multivariable context.
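
To make the principle of isolated variation concrete, the check below expresses it in code: two trials satisfy VOTAT only if exactly one variable differs between them. This is a minimal illustrative sketch, not code from the study; the trial encoding and variable names are hypothetical.

```python
# Minimal sketch (hypothetical encoding): two trials follow the VOTAT
# principle if exactly one variable changes while all others stay constant.
def is_votat_pair(trial_a: dict, trial_b: dict) -> bool:
    changed = [var for var in trial_a if trial_a[var] != trial_b[var]]
    return len(changed) == 1

# The plant-growth example: only sunlight varies between trials 1 and 2.
trial_1 = {"sunlight": "sun",   "water_ml": 100, "soil": "loam", "pot": "M"}
trial_2 = {"sunlight": "shade", "water_ml": 100, "soil": "loam", "pot": "M"}
trial_3 = {"sunlight": "shade", "water_ml": 200, "soil": "loam", "pot": "M"}

print(is_votat_pair(trial_1, trial_2))  # True: isolated variation of sunlight
print(is_votat_pair(trial_1, trial_3))  # False: sunlight and water both change
```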

While the VOTAT strategy helps identify the effects of individual variables, it is not adequate for complex inquiry tasks involving several interacting variables (Kuhn, 2016; Kuhn et al., 2017; Teig et al., 2020). To successfully navigate a multivariable task, students should first determine the individual effects of each variable, potentially by applying the VOTAT strategy. Following this, they need to understand how these multiple effects collectively affect an outcome, either in an additive or interactive way (Kuhn, 2016; Teig et al., 2020; Zimmerman, 2007). Learning to coordinate the effects of multiple variables and mastering the multivariable prediction task presents a significant challenge for students (Lesperance & Kuhn, 2023; Teig et al., 2020; Zimmerman, 2007).

Digital-based assessments like PISA offer great potential for investigating students’ control-of-variables reasoning, both in univariable and multivariable contexts. The process data drawn from the assessment provide information not only on whether students solve the tasks correctly or incorrectly but also on how they interact with the tasks and the challenges they encounter during the investigative phase of inquiry.

Coordinating a Theoretical Claim and Evidence

Coordinating a theoretical claim and evidence is a fundamental skill for the “explaining and evaluating” or inferential phase of inquiry (Kuhn et al., 2017). This phase involves assessing evidence to make inferences, draw conclusions, and explain the phenomenon under investigation (Zimmerman, 2007). The skill to coordinate a claim and evidence plays a central role during inquiry practice, especially in constructing a valid scientific argument (Delen & Krajcik, 2015; Osborne et al., 2016). This skill encompasses various stages in inquiry practice: (a) zero degrees of coordination where a claim or evidence is identified without presenting a logical link between them, (b) one degree of coordination where a distinct link between the claim and evidence is established through a warrant, and (c) two or more degrees of coordination where a complex argument is built on multiple justifying warrants (Osborne et al., 2016).

Research highlights common challenges students face when navigating these stages. They may find it difficult to differentiate between claim and evidence, relate data to the hypotheses or claims, provide appropriate evidence backing their claims, or justify the significance of their data as valid evidence to support their claims (for a review of studies, see de Jong & van Joolingen, 1998 and Zimmerman, 2007). Students’ understanding of the different aspects of argument and their level of scientific knowledge influence their ability to coordinate theory and evidence successfully (Osborne et al., 2016).

Enabling students to collect and interpret their own data is a critical aspect of inquiry practice (Delen & Krajcik, 2015). Interestingly, a review by Smetana and Bell (2012) indicates that the quality of participants’ reasoning remains consistent regardless of whether they interpreted data from a simulated or a physical investigation. This research becomes particularly relevant in light of PISA’s decision to incorporate hands-on investigations into its digital-based assessment of scientific literacy. This innovation offers unique opportunities to conduct a large-scale study assessing students’ ability to coordinate a theoretical claim and evidence based on the data they collected themselves.

The Present Study

The main aim of this study is to demonstrate how process data can be used to examine the strategies students used when solving scientific inquiry tasks. Specifically, the following research questions (RQs) are used to guide the study:

  1. What patterns can be identified in students’ strategies for solving scientific inquiry tasks?

  2. How do these patterns relate to the (a) accuracy of students’ responses and (b) their exposure to inquiry-based teaching and learning practices?

To address these questions, this study offers two empirical examples using PM and LPA to analyze students’ log files from the 2015 PISA field trial and main studies, respectively. In the first step, relevant information is extracted from the log files (see Footnote 2) and organized into a target dataset to create a more explicit data structure. This target dataset is then preprocessed, for example, by handling missing data, removing noise, and aggregating information. The preprocessed dataset includes several key indicators: (1) inquiry exploration behavior, represented by the number of actions and number of trials; (2) inquiry strategy, denoted by the sequence of actions and their timestamps; (3) response time, quantified by the amount of time that elapsed before students conducted their first action and the total amount of time students spent on the task (time-on-task); and (4) response accuracy, measured by the scores on the multiple-choice and constructed responses from the tasks.
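
As an illustration of this preprocessing step, the sketch below aggregates raw log events into the four indicator families. It is a minimal example with hypothetical column and event names; the actual PISA log-file format and extraction pipeline differ.

```python
# Minimal sketch (hypothetical event names and schema; the actual PISA
# log-file format differs): deriving the four indicator families per student.
import pandas as pd

events = pd.DataFrame({
    "student_id": ["s1", "s1", "s1", "s2", "s2"],
    "event":      ["task_start", "AirTemp-1", "run_trial", "task_start", "answer_mc"],
    "timestamp":  [0.0, 19.3, 45.8, 0.0, 12.1],  # seconds since task onset
})

def extract_indicators(log: pd.DataFrame) -> pd.Series:
    log = log.sort_values("timestamp")
    actions = log[log["event"] != "task_start"]
    return pd.Series({
        "n_actions": len(actions),                               # exploration behavior
        "n_trials": int((log["event"] == "run_trial").sum()),    # exploration behavior
        "sequence": " -> ".join(actions["event"]),               # inquiry strategy
        "time_before_first_action": actions["timestamp"].min(),  # response time
        "time_on_task": log["timestamp"].max() - log["timestamp"].min(),
    })

print(events.groupby("student_id").apply(extract_indicators))
```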

Subsequently, PM and LPA are applied to explore the patterns of students’ strategies for solving inquiry tasks and their relations to performance (RQs 1 and 2a). Moreover, through the second example using LPA, this study also demonstrates how process data from the science assessment can be integrated with data from student background questionnaires. This integration allows the examination of differences in inquiry-based teaching and learning practices across various student groups (RQ 2b).

Example 1: Process Mining

Sample and Procedure

The first example examines process data from the PISA 2015 field trial study in Norway. This dataset was selected as it involves interactive tasks that have been publicly released, allowing for a clear demonstration of the task’s format and complexity within the context of scientific inquiry. Of the total 850 tenth-grade students who participated in the study, 81 students (51.9% boys and 48.1% girls) were assigned to the assessment unit Running in Hot Weather.

As shown in Fig. 1, this study examines the second question from the unit (henceforth referred to as Task 1). Students are required to use the simulation to establish the effect of drinking water on dehydration and heat stroke while running for an hour on a hot and humid day. To succeed in this task, students need to grasp the core principle of coordinating multiple variables by determining the effect of a single variable on an outcome. VOTAT is the most effective strategy for solving this task, which can be addressed efficiently with just two trials: holding the air temperature and humidity constant (AirTemp = 35 °C and AirHum = 60%) while varying whether the runner drinks water (Drink = Yes on trial 1 and Drink = No on trial 2, or vice versa).

Fig. 1

A screenshot of Task 1 from the PISA 2015 field trial item from the Running in Hot Weather unit. To try the simulation, please see https://www.oecd.org/pisa/PISA2015Questions/platform/index.html?user=&domain=SCI&unit=S623-RunningInHotWeather&lang=eng-ZZZ

To solve the task correctly, students must select two specific rows of data from the table and use these data to conclude that drinking water would reduce the risk of heat stroke and dehydration (multiple-choice answer). If a student selects either the correct multiple-choice answer or the appropriate rows of data, the response is considered partially correct. Students who choose neither the correct multiple-choice answer nor the appropriate rows of data answer incorrectly.
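
This scoring rule can be summarized in a few lines of code. The sketch below only restates the logic described above with a hypothetical response encoding; it is not the official PISA coding scheme.

```python
# Sketch of Task 1's scoring logic (hypothetical response encoding).
def score_task1(mc_correct: bool, rows_correct: bool) -> str:
    if mc_correct and rows_correct:
        return "correct"            # right claim and supporting rows of data
    if mc_correct or rows_correct:
        return "partially correct"  # only one of the two components
    return "incorrect"

print(score_task1(True, True))    # correct
print(score_task1(False, True))   # partially correct
print(score_task1(False, False))  # incorrect
```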

Task 1 is anchored in a univariable context (Fig. 1). It requires zero degrees of coordination because students need to select the relevant claim (multiple-choice) and two rows of data to support it (table). While students are not explicitly asked to explain the logical connection between data and claim, in practice they need to demonstrate this connection to answer both parts correctly and receive full credit.

Method: Process Mining

PM is a method of data analysis that is employed to extract insights from event logs available in operational systems. It is a subset of data mining that focuses on understanding processes by extracting process-related knowledge from event logs to discover, monitor, and improve these processes (Bogarín et al., 2018). This method is useful for analyzing data that involve a sequence of activities or events, usually resulting from a particular process. This is often the case with interactive tasks, which require repeated engagement, such as solving an inquiry task in the present study. Through PM, a comprehensive, data-driven perspective of how these processes actually unfold in practice can be obtained (van der Aalst, 2016).

PM typically involves three core types of analysis: discovery, conformance, and enhancement (van der Aalst, 2016). Discovery creates a process model from the event log, offering a clear picture of the sequence and concurrent events. Conformance compares an existing process model with the event log to detect deviations and understand how reality matches with what has been planned. Enhancement uses data from the event log to extend or improve an existing process model. In this study, the discovery techniques are utilized to create a process model based on a sequence of activities from the students’ log files to describe how they used the simulation to solve Task 1. The following information is needed to capture process-based data in the log files and generate the process model: (1) case or a unique identifier of the individual who performed the process, such as student ID; (2) event or activity, a well-defined step as part of the process, such as AirTemp-1 that represents a student setting the air temperature to a specific value in the first simulation trial; and (3) timestamp from each activity to determine the order of events.
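
Disco is commercial software, but the core ingredient of discovery (counting how often one activity directly follows another within each case) can be illustrated in a few lines of plain Python. The sketch below is not Disco's fuzzy miner; it merely shows how (case, activity, timestamp) records yield a directly-follows graph, the structure that discovery algorithms then simplify and visualize. The log entries are hypothetical.

```python
# Minimal illustration of process discovery's raw material: counting
# directly-follows relations from (case, activity, timestamp) records.
# Not Disco's fuzzy miner; entries are hypothetical.
from collections import Counter

log = [
    ("s1", "AirTemp-1", 10.0), ("s1", "AirHum-1", 14.5),
    ("s1", "Drink-1", 17.0),   ("s1", "run_trial", 20.2),
    ("s2", "AirTemp-1", 8.0),  ("s2", "run_trial", 11.3),
]

# Order events within each case by timestamp to recover each trace.
traces = {}
for case, activity, ts in sorted(log, key=lambda e: (e[0], e[2])):
    traces.setdefault(case, []).append(activity)

# Count how often activity a is directly followed by activity b.
dfg = Counter((a, b) for t in traces.values() for a, b in zip(t, t[1:]))
for (a, b), n in dfg.most_common():
    print(f"{a} -> {b}: {n}")
```

Open-source toolkits such as pm4py implement full discovery algorithms on this same event-log structure.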

This study employed the PM software Disco to uncover models of complex processes. Disco utilizes the fuzzy miner algorithm, an approach that is especially beneficial when sifting through massive amounts of unstructured event data (Günther & Rozinat, 2012). The algorithm simplifies complex process models by accentuating the most frequent events and relationships, making significant patterns and structures easier to identify, while less important details are deemphasized or “fuzzed out” (Günther & Rozinat, 2012).

The data analysis in this study was carried out in two stages. First, a universal process model was developed for all participating students. Next, this model was used to compare and contrast the process models of students who answered the task correctly, partially correctly, and incorrectly. Disco provides visual representations of entire processes, making it easier to understand the process and detect hidden strategies that may not be immediately evident from the raw data.

Findings

Descriptive statistics for the inquiry exploration behavior, inquiry strategy, response time, and response accuracy for all the cases can be found in Table 1.

Table 1 Descriptive statistics for all variables on Task 1 (N = 81 students)

Figure 2 presents a process model map with the absolute frequency of cases for the 81 students who were assigned to Task 1. In Fig. 3, this map is further categorized based on the accuracy of student responses. These figures show how the activities in the process of solving the task are connected to each other. In the process model, each rectangular box corresponds to a specific activity, with colors distinguishing activity types or categories and color intensity representing quantitative measures associated with the activities. Likewise, the color intensity of the arrows represents quantitative measures related to the transitions, such as the number of cases following a specific transition.

Fig. 2

Process model map for all student strategies during the inquiry process in Task 1 (N = 81 students)

Fig. 3

Process model map for students who solved Task 1 (a) correctly, by selecting the correct multiple-choice answer and rows of data (N = 33 students); (b) partially correctly, by selecting either the correct multiple-choice answer or rows of data (N = 30 students); and (c) incorrectly (N = 17 students)

Figure 2 shows that a total of 72 students conducted the simulation, while the others either answered the multiple-choice question immediately without conducting any simulation or skipped the task. Among those who conducted the simulation, 22 unique sequences of events, or variants, were identified across the cases. Overall, 63.7% of the students shared similar variants that included the VOTAT strategy (AirTemp = 35 °C ➔ AirHum = 60% ➔ Drink = Yes followed by AirTemp = 35 °C ➔ AirHum = 60% ➔ Drink = No). Nevertheless, only 40.7% of all students solved the task correctly, even though most demonstrated the VOTAT strategy (Fig. 3a). These successful students typically demonstrated a goal-oriented approach by immediately applying the VOTAT strategy, except for one student who applied the VOTAT strategy followed by repeated trials without changing any variables. This group of students displayed, on average, the highest quantity of exploration behaviors (M_action = 11.59; M_trial = 2.32). They also took the longest before making their initial move in the task (M_time-before-first-action = 19.25 s) and had the longest response time (M_time-on-task = 256.97 s).

Figure 3b shows that approximately 37% of the students either only selected the two rows of data or only answered the multiple-choice question correctly. More than half of these students applied the VOTAT strategy, while the others either manipulated more than one variable at a time, conducted random behavior, or repeatedly ran similar trials. Task 1 requires students to provide a claim and generate appropriate data to support it. Hence, students who solved the task only partially—either by selecting the relevant claim (multiple-choice) or by providing two rows of data to support it (table)—did not receive any credit. Interestingly, these students showed behaviors similar to those of the students who did receive full credit. They also performed high levels of exploration behaviors (M_action = 10.62; M_trial = 2.31) and spent a considerable amount of time before starting their first action (M_time-before-first-action = 17.57 s) and solving the task (M_time-on-task = 249.92 s).

Figure 3c shows that students who answered both the multiple-choice and data questions incorrectly (22.2%) did not apply the VOTAT strategy. This group of students either conducted no trial (8.6%) or engaged in unsuccessful exploration behaviors (12.3%), such as varying more than one variable at a time, executing trials without changing any variable, or running other unstructured trials. The students who conducted the simulation but did not solve Task 1 performed fewer exploration behaviors than those who solved the task correctly or partially correctly (M_action = 9.11; M_trial = 1.08). Compared to the other groups, they spent the least time before starting their first action (M_time-before-first-action = 16.00 s) and had the shortest response time (M_time-on-task = 89.15 s).

Example 2: Latent Profile Analysis

Sample and Procedure

The second example uses the PISA 2015 data from Norway. This dataset was chosen because it comes from the PISA main study, is characterized by a large sample size, and offers a reliable representation of the inquiry strategies in this country. The sample comprised 1222 students (51.3% girls) in Grade 9 (0.8%) and Grade 10 (99.2%) from 221 schools.

The original interactive PISA task examined in this study cannot be shown here in compliance with the OECD’s confidentiality policies. Instead, a modified version of the released PISA field trial item from the Running in Hot Weather unit (Fig. 1) is used to illustrate the complexities of inquiry skills required to solve the original PISA task, hereafter referred to as Task 2 (Fig. 4).

Fig. 4

A screenshot of Task 2, a modified PISA 2015 field trial item that represents the original PISA task used in the study

As shown in Fig. 4, students use the simulation to assess whether it is safe or unsafe to run under specific conditions. The challenge lies in understanding how the interactions between the variables of running duration and amount of water influence two outcome variables: water loss and body temperature. To solve this task, students can adjust the values of the key variables in uniform intervals for each trial (e.g., setting the slider for the running duration at 45 min and the amount of water at 150 ml) while holding the other variables constant, such as maintaining the air temperature at 30 °C and the air humidity at 40%. Hereinafter, this approach is labeled the “Interactive strategy.” By employing this strategy, students can identify patterns in how the variables interact and predict outcomes, including outcomes that are not directly presented in the simulation.
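
The sketch below illustrates what a trial sequence consistent with the Interactive strategy could look like: the two focal variables increase in uniform steps across trials while the remaining variables stay fixed. The specific values are hypothetical and only serve to show the structure of such a sequence.

```python
# Hypothetical trial sequence for the "Interactive strategy": both focal
# variables change in uniform steps; all other variables are held constant.
fixed = {"air_temp_c": 30, "air_humidity_pct": 40}

durations = [15, 30, 45]        # running duration (min), uniform 15-min steps
water_amounts = [50, 100, 150]  # amount of water (ml), uniform 50-ml steps

trials = [{"duration_min": d, "water_ml": w, **fixed}
          for d, w in zip(durations, water_amounts)]
for t in trials:
    print(t)
# Uniform increments make the joint pattern of the two variables visible,
# so outcomes for untested settings can be extrapolated.
```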

Task 2 is situated within a multivariable context. It involves one degree of coordination because students are required to make their claims using multiple-choice responses and provide reasoning that explains the links between their claims and the evidence they generated from the simulations (constructed text response). To successfully complete the task and receive credit, students must correctly answer both the claim and reasoning components.

In addition to the science assessment, PISA also asked students to complete a background questionnaire that gathered contextual information about students and their learning environments. The questionnaire may offer a nuanced understanding of the factors that could potentially influence students’ academic performance. This study used nine items from the student questionnaire to capture the inquiry-based teaching and learning practices students experienced in school. The items are presented in Table 2 in the “Findings” section. Analysis of the responses to these items could further illuminate the relationship between students’ exposure to inquiry practice and their performance on scientific inquiry tasks.

Table 2 Descriptive statistics for all variables on Task 2 (N = 1222)

Method: Latent Profile Analysis

In the second example, LPA was applied as a statistical technique to identify unobserved or latent subgroups within a population. LPA is a person-centered approach that identifies distinct profiles or groups of individuals who show similar response patterns across a set of variables or indicators (Masyn, 2013). LPA is a type of finite mixture modeling, which assumes that the population consists of a certain number of latent or unobserved groups (Masyn, 2013). Each latent group is characterized by a specific profile or pattern of means on the observed variables. In this study, LPA was applied to cluster students who demonstrated similar inquiry performance based on the continuous and categorical indicators from the process data (Fig. 5).

Fig. 5

Conceptual model for the LPA on Task 2. Note. c, latent profile; Continuous indicators: Action, number of actions; Trial, number of trials; Time-on-task, the total amount of time students spent during the task; Categorical indicators: Interactive, the interactive strategy to determine the effects of interactions among variables on outcomes; MC, multiple-choice response; Text, open constructed text response

The following steps guided the analytical approach: First, models with an increasing number of profiles were compared to determine the optimal number and to derive well-defined profiles with theoretical interpretability. As LPA is exploratory in nature, there is no predefined assumption about the number of latent profiles. Multiple fit statistics and other measures were inspected to identify the model that best fits the data (Masyn, 2013): Akaike’s Information Criterion (AIC), Consistent AIC (CAIC), Bayesian Information Criterion (BIC), the Vuong–Lo–Mendell–Rubin (VLMR) and Lo–Mendell–Rubin likelihood ratio tests (LMR-LRT), entropy, class separation, and homogeneity. It is important to note that the final decision on the number of profiles is not entirely statistical but also involves substantive and theoretical considerations. Once the optimal number of profiles was determined, each individual in the dataset was assigned to the profile for which they had the highest probability of membership.
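
Because the study's models combine continuous and categorical indicators, they require dedicated mixture-modeling software (see below). As a rough, hedged approximation of the model-comparison step only, the sketch fits Gaussian mixture models with an increasing number of profiles to simulated continuous indicators and compares AIC/BIC; it is illustrative and not equivalent to the study's models.

```python
# Hedged approximation of LPA model selection using scikit-learn's
# GaussianMixture on simulated continuous indicators only (the study's
# actual models also include categorical indicators).
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(42)
# Simulated indicators: n_actions, n_trials, time_on_task (three latent groups).
X = np.vstack([
    rng.normal([15.0, 4.5, 185.0], [3.0, 1.0, 40.0], size=(100, 3)),
    rng.normal([11.0, 2.0, 100.0], [3.0, 1.0, 30.0], size=(100, 3)),
    rng.normal([7.0, 1.5, 45.0],  [2.0, 0.8, 15.0], size=(100, 3)),
])

for k in range(1, 6):
    gmm = GaussianMixture(n_components=k, n_init=10, random_state=0).fit(X)
    print(f"profiles={k}  AIC={gmm.aic(X):>9.1f}  BIC={gmm.bic(X):>9.1f}")
# Candidate models with the lowest AIC/BIC are then weighed against entropy,
# likelihood-ratio tests, and substantive interpretability, as described above.
```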

After the number of profiles had been determined, a multivariate analysis of variance (MANOVA) was used to test profile mean differences in the nine items related to inquiry-based teaching and learning practices. This step aims to uncover the potential influence of different teaching practices on the distinct profiles of student inquiry performance.
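
A hedged sketch of such a MANOVA in Python's statsmodels is shown below; it uses simulated data and hypothetical names for two of the nine inquiry-practice items and only illustrates the structure of the test.

```python
# Hedged sketch of the follow-up MANOVA (simulated data; item names are
# hypothetical and stand in for two of the nine inquiry-practice items).
import numpy as np
import pandas as pd
from statsmodels.multivariate.manova import MANOVA

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "profile": np.repeat(["strategic", "emergent", "disengaged"], 50),
    "debate": np.concatenate([rng.normal(m, 1.0, 50) for m in (3.0, 2.5, 2.0)]),
    "lab":    np.concatenate([rng.normal(m, 1.0, 50) for m in (3.2, 3.0, 2.4)]),
})

# Test whether item means differ jointly across the latent profiles.
fit = MANOVA.from_formula("debate + lab ~ profile", data=df)
print(fit.mv_test())
```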

The LPA was performed in Mplus 8.5 (Muthén & Muthén, 1998-2021), whereas the MANOVA analyses were conducted in IBM SPSS Statistics 28.

Findings

Table 2 presents descriptive statistics for the inquiry exploration behavior, inquiry strategy, response time, and response accuracy for Task 2 as well as students’ exposure to various inquiry-based science teaching and learning practices.

Results from the LPA showed that a model with three profiles was optimal. The log-likelihood and information criteria favored this model, and the VLMR and LMR-LRT tests were significant. Furthermore, the three-profile model demonstrated a high entropy measure as well as distinct levels of class separation and homogeneity. For detailed supporting data, please refer to Tables A1 and A2 in the Supplementary Information.

Figure 6 shows the means and conditional item probabilities derived from the three-profile model. These profiles—labeled strategic, emergent, and disengaged—depict distinct student behaviors during the inquiry task. To explore the simulation effectively, students need to execute a minimum of nine actions across three trials using the Interactive strategy. This approach aids in identifying patterns of how variables interact and influence the outcomes, which, in turn, enables the prediction of values that are not explicitly shown in the simulation.

Fig. 6

Characteristics of the three-profile model of strategic, emergent, and disengaged students on Task 2

Students clustered into the strategic profile exhibited the highest likelihood of employing the Interactive strategy and solving the task. Students in this profile conducted the most exploration behaviors (M_action = 14.99; M_trial = 4.58). On average, they spent the longest time before initiating their first action on the screen (M_time-before-first-action = 31.69 s) and had the highest time-on-task (M_time-on-task = 186.58 s). In comparison, the emergent students conducted more exploration behaviors (M_action = 10.77; M_trial = 2.13) than the disengaged students (M_action = 6.88; M_trial = 1.45). However, there was no significant difference in the time before the first action between the emergent and disengaged students (M_time-before-first-action = 29.03 s and 29.97 s, respectively). In contrast, the emergent students took on average significantly longer to solve the task (M_time-on-task = 102.59 s) than the disengaged students (M_time-on-task = 43.97 s). Interestingly, while students in the emergent profile were more likely to use the Interactive strategy and choose the right claim, their probability of appropriately explaining the relationship between the claim and evidence was as low as that of the disengaged students.

Furthermore, the results from the MANOVA tests indicated significant mean differences in the frequency of particular inquiry-based teaching and learning practices reported by the students in the strategic, emergent, and disengaged profiles. Students in the strategic profile reported a higher frequency than the disengaged profile for several inquiry activities: students spend time in the laboratory doing practical experiments (Lab), students are asked to draw conclusions from an experiment they have conducted (Draw), the teacher explains how a <school science> idea can be applied (Idea), students are allowed to design their own experiments (Design), there is a class debate about investigations (Debate), and students are asked to do an investigation to test ideas (Test). Similarly, emergent students reported higher occurrences on the items Lab, Draw, and Debate compared to the disengaged students. Engaging students in a class debate about investigations (Debate) was the only item that clearly differentiated among the strategic, emergent, and disengaged profiles: the strategic profile reported the highest occurrences on this item, followed by the emergent and disengaged profiles. In contrast, no significant differences were observed across the profiles for three inquiry activities: students are given opportunities to explain their ideas (Explain), students are required to argue about science questions (Argue), and the teacher clearly explains the relevance of <broad science> concepts to our lives (Concept). Table 3 summarizes the mean differences between the strategic, emergent, and disengaged profiles on Task 2 across the inquiry-based teaching and learning practices.

Table 3 Profile-specific descriptive statistics, multivariate analysis of variance statistics, and pairwise comparisons for inquiry-based teaching and learning practices on Task 2

Discussion and Implications

This study represents one of the first efforts to examine log files from large-scale assessments to shed light on students’ strategies in solving scientific inquiry tasks by combining product data (i.e., response accuracy) with fine-grained process data (students’ exploration behavior, inquiry strategy, and response time). This study reveals that students frequently struggled with tasks requiring univariable and multivariable reasoning, as well as coordinating theory and evidence. The findings also indicated a significant association between certain inquiry-based practices and higher inquiry performance.

Students face various challenges in solving scientific inquiry tasks that demand the skill to coordinate the effects of multiple variables. As shown in Task 1, while a majority of students were able to implement the VOTAT strategy, only 40.7% solved the task correctly by determining the effect of a single predictor on an outcome within a univariable context. This observation aligns with previous studies showing that unsuccessful students tend to vary more than one variable at a time instead of applying the VOTAT strategy, repeat similar trials without changing any variables, conduct random behavior such as unstructured clicks and trials, and spend very little time planning a goal-oriented approach or exploring the task environment (e.g., de Jong & van Joolingen, 1998; Greiff et al., 2016; Lesperance & Kuhn, 2023; Teig et al., 2020). These findings have significant implications for classroom instruction, especially in teaching the VOTAT strategy. Teachers can take a proactive role by emphasizing the importance of changing one variable at a time to ensure a fair and reliable investigation. They should also highlight the need to plan a goal-oriented approach before starting the investigation and to take the time to explore the task environment. In practice, this could involve structured exercises that provide students with a scaffolded environment to practice these skills (see a review by Zacharia et al., 2015 on different types of guidance within computer-supported inquiry learning).

Furthermore, this study shows that even when students have mastered univariable reasoning, they face significant challenges in transferring the VOTAT strategy to multivariable situations, such as identifying the additive or interactive effects among the variables that simultaneously contribute to an outcome (Kuhn, 2016; Teig et al., 2020). Thus, it is hardly surprising that only 38.7% of the students who were assigned to Task 2 in PISA 2015 showed a high probability of adapting the VOTAT strategy to a multivariable context. Applying multivariable reasoning during the investigative phase of inquiry is demanding for students, even for those who have mastered the VOTAT strategy (Kuhn, 2016; Kuhn et al., 2017; Teig et al., 2020). To solve a multivariable task, they must manage multiple cognitive resources in working memory in order to identify the interactive or additive effects of multiple independent variables on one or more dependent variables (Kuhn, 2016; Kuhn et al., 2017). Future research could investigate different types of support aimed at reducing demands on working memory capacity (e.g., knowing and remembering multiple relevant effects). Understanding how to integrate such support into a digital-based learning environment could be instrumental in aiding students’ mastery of multivariable reasoning.

Coordinating theory and evidence is another pivotal skill that students need to navigate, especially during the inferential phase of inquiry. Tasks 1 and 2 in this study require students to identify a claim, evidence, and an explicit connection between them. Although most students who did not solve the tasks correctly were able to produce sufficient data through the simulation, they could not explain why these data serve as evidence or how the data may support their claims. This finding supports previous studies that indicated student difficulties in providing a logical connection between their observations and scientific theories (e.g., Kuhn et al., 2017; McLure, 2023; Teig et al., 2020; Zimmerman, 2007) and extends them by focusing specifically on tasks situated in univariable and multivariable contexts.

Constructing a scientific explanation can be very demanding for students as they need to incorporate many different elements (Osborne et al., 2016; McLure et al., 2022; Sandoval & Millwood, 2005). These include choosing appropriate evidence, providing reasoning to support claims, and connecting evidence to underlying theoretical models to justify their explanations (Sandoval & Millwood, 2005). This process involves a high degree of theory-evidence coordination, which relies heavily upon the level of students’ scientific knowledge (Kuhn et al., 2017; Osborne et al., 2016). The findings underscore the need for explicit instruction and well-structured learning activities that challenge students to consider how to simultaneously coordinate theory and evidence during inquiry practice. For instance, the inquiry practice may involve activities where students need to gather authentic data, identify appropriate evidence, and determine how this evidence supports a given claim, all while linking these steps to an underlying theoretical model. These activities should ideally span from univariable to various multivariable contexts, thereby offering diverse learning opportunities to students.

Furthermore, this study revealed that students who were exposed to particular inquiry-based teaching and learning practices, such as engaging in a class debate about investigations, were more likely to be in a profile with higher inquiry performance (Table 3). The link between specific inquiry practices and student performance aligns with previous research that emphasizes the benefits of students’ active engagement through inquiry learning (e.g., de Jong et al., 2023; Teig et al., 2018). Class debates about investigations, for instance, can facilitate students’ understanding of the different phases of inquiry, such as hypothesis formulation, experimental design, and interpretation of results, as well as the links between these phases and the development of their scientific concepts and the nature of science. Given that this study and others (e.g., Kuhn et al., 2017; McLure et al., 2022; Teig et al., 2022) found that students struggle to apply scientific reasoning skills, teachers may need to provide ample opportunities for students to engage with diverse inquiry practices. This is particularly crucial for practices centered on exploring multivariable phenomena and linking evidence with theory to make sense of these phenomena.

Concluding Remarks

This study extends previous research in two ways. First, it demonstrates the advantages of integrating both product and process data to explore the strategies and challenges students face while applying the skills to coordinate the effects of multiple variables and to coordinate a theoretical claim with evidence. PM was helpful in visualizing the sequence of students’ steps and actions during the simulation, whereas LPA proved to be an effective method for uncovering unique patterns in students’ interactions with the technology-based environment, leading to the identification of distinct profiles of their inquiry performance. Together, PM and LPA shed light on students’ strategies, capturing how they adapt and modify their approach over time in response to the requirements of a complex task. Furthermore, these analyses can reveal where students encounter difficulties, making it possible to design targeted interventions.

Second, this study showcases the potential of integrating publicly available assessment, log-file, and questionnaire data from PISA to provide an in-depth understanding of students’ science performance and its relationship to teaching and learning activities. It suggests the potential for researchers to use similar data from large-scale assessments to explore student performance across other science competencies, assessment cycles, and educational systems. The integration of product and process data presented here offers a more nuanced understanding of student performance in scientific inquiry tasks. The practical significance of this lies in its potential to guide pedagogical interventions and contribute to more effective science instruction.