

LEVELING UP OR LEVELING DOWN? THE IMPACT OF GENERATIVE AI ON STUDENT PERFORMANCE IN BUSINESS SCHOOLS

This study investigates how generative AI (GenAI) access impacts student performance in ill-defined, time-pressured business school exams. Through an embedded mixed-methods design combining an experimental study with qualitative interviews, we identify an equalizing effect: low performers improve while high performers decline, resulting in performance convergence. Our qualitative analysis reveals the mechanism driving this convergence—GenAI-induced cognitive load inversion. Low performers experience cognitive load relief by copying chatbot output, thus bypassing the analytical work the task requires. High performers experience cognitive load amplification, struggling to process voluminous output under time pressure, disrupting their analytical processes. We argue that task structure shapes GenAI’s effects in time-constrained situations: the ill-defined nature of our task elicits different cognitive challenges compared to well-defined tasks of prior research, helping reconcile mixed findings on GenAI’s democratizing effects. Furthermore, the findings reveal how traditional assessments fail when GenAI masks performance differences.


Innovation Copilots

March 20, 2026

CARSTEN BERGENHOLTZ*, OANA VUCULESCU, FRANZISKA GÜNZEL-JENSEN, and LARS FREDERIKSEN
Aarhus University

Academy of Management Learning & Education, 2026, Vol. 00, No. 00, 1–32. https://doi.org/10.5465/amle.2025.0029. Copyright of the Academy of Management, all rights reserved. Contents may not be copied, emailed, posted to a listserv, or otherwise transmitted without the copyright holder's express written permission. Users may print, download, or email articles for individual use only. Accepted by Christine Moser.

We appreciate the generous and constructive engagement of associate editor Christine Moser and the three anonymous reviewers, whose excellent guidance helped us advance this manuscript substantially. We also thank the participants at the Faculty AI Forum at CBS and the Academy of Management Learning & Education Paper Development Workshop in Copenhagen for their helpful feedback, as well as Christian Hendriksen, Rasmus R. Hansen, Jacob V. Simonsen, and Niklas Stausberg for their valuable comments on prior versions of the manuscript. Finally, we are grateful to Pedro N. Jørgensen for his assistance with data collection in the lab, and to It-vest and the Department of Management, BSS, Aarhus University for providing research support.

*Corresponding author

1 When using the term generative AI (GenAI) we refer to large language models (LLMs) that students typically access through conversational chatbot interfaces. In this paper, we use GenAI and chatbots interchangeably when describing the interface or application through which students predominantly access GenAI capabilities. Our study specifically examines ChatGPT-4, which exemplifies this chatbot-mediated access to LLM technology.

Business schools face a pressing question: Should they systematically redesign their assessment practices to account for students' access to Generative AI (GenAI)1 tools like ChatGPT? This question strikes at the heart of educational evaluation, as assessing student performance is crucial in management education, providing essential feedback on learning and skill development (Armstrong & Fukami, 2010). Business schools have long relied on carefully designed assessments to measure students' ability to apply theoretical knowledge to complex, real-world scenarios. However, if GenAI access changes how students are able to engage with these tasks, traditional assessment methods may no longer capture the student capabilities they aimed to assess (Krammer, 2023; Rudolph, Tan & Tan, 2023).

Understanding to what extent business schools need to adjust their assessment practices requires understanding how GenAI affects student performance during assessment. Recent studies show performance enhancements across professional tasks, with lower performers particularly benefiting (Brynjolfsson, Li & Raymond, 2025; Dell'Acqua et al., 2023; Noy & Zhang, 2023). This implies a democratizing effect where performance differences are reduced, while raising both low and high performers. However, educational contexts reveal more varied results (Choi & Schwarcz, 2024; Fisher et al., 2025; Prather et al., 2024; Stadler, Bannert & Sailer, 2024). Some report an overall negative effect (Stadler et al., 2024), others show field-specific differences (Fisher et al., 2025), and some indicate diverging impacts for high versus low performers (Choi
& Schwarcz, 2024; Prather et al., 2024).

Emerging evidence points to task structure as a critical reason for the variance of GenAI effectiveness on performance (Otis, Clarke, Delecourt, Holtz & Koning, 2024; Valcea, Hamdani & Wang, 2024). At one end of the spectrum of task structure are well-defined tasks with specified constraints and clear success criteria, such as highly structured questions or specific writing prompts. Here GenAI can act as a straightforward performance booster, especially for low performers (Dell'Acqua et al., 2023; Doshi & Hauser, 2024). In contrast, in ill-defined challenges a solver must define problems and evaluate GenAI's output. Notably, Otis et al. (2024) show that low performers experience a performance decline in such settings.

Business school case analysis can present an ill-defined challenge. A case might specify a focal issue, yet students must independently frame problems and structure their analytical approach under time constraints. When students use GenAI for such tasks, we posit that the composition of mental work might change (Simkute, Tankelevitch, Kewenig, Scott, Sellen & Rintel, 2025; Valcea et al., 2024). Rather than simply adding content, they shift cognitive effort from composing to monitoring, evaluating, and integrating GenAI output. From a cognitive load theory perspective, GenAI can lower intrinsic cognitive load by providing structure, domain vocabulary, and fluent phrasing. Yet extraneous load can simultaneously be raised by the voluminous, plausible text that must be processed (Tankelevitch et al., 2024). The net effect depends on students' metacognitive ability to plan, monitor, and regulate their interaction with GenAI output—capabilities that may be particularly constrained under exam time pressure (Kalyuga & Singh, 2016; Yan, Greiff, Lodge & Gašević, 2025).
Given the mixed findings across different contexts (e.g., Brynjolfsson et al., 2025; Choi & Schwarcz, 2024; Dell'Acqua et al., 2023; Prather et al., 2024), it remains unclear whether GenAI helps or hinders student performance in ill-defined business case exams, and whether its effect is consistent between high- and low-performing students. Understanding this effect and how it may vary across ability levels requires examining the underlying cognitive mechanisms at play, an area that remains notably underexplored (Simkute et al., 2025) despite business schools' fast adoption of GenAI tools (AACSB International, 2025). Given these considerations, we ask: How does access to GenAI impact high- and low-performing students' performance in an ill-defined business school exam?

To answer this question, we employ an embedded mixed-methods study (Creswell & Plano Clark, 2011) following a test-and-explore approach (Wellman, Tröster, Grimes, Roberson, Rink & Gruber, 2023). We first experimentally establish to what extent GenAI differentially affects high and low performers, then qualitatively uncover the cognitive mechanisms that explain why these differences emerge.

In our experimental study, students at a business school performed a traditional business school case analysis in organizational behavior. We find that low performers increase their performance when allowed to work with GenAI, while high performers' performance decreases. This constitutes a specific, undesirable equalizing effect, in contrast to the democratizing effect highlighted by prior studies (Brynjolfsson et al., 2025; Dell'Acqua et al., 2023; Merali, 2024; Noy & Zhang, 2023). Where a democratizing effect lifts both groups toward a higher level, the equalizing effect narrows the gap by raising lower performers and hindering higher performers. Our qualitative analysis reveals the mechanism underlying this equalizing effect: GenAI-induced cognitive load inversion.
Low-performing students experience cognitive load relief as the chatbot performs analytic steps they struggle to execute on their own (Kalyuga & Singh, 2016). This enables them to adopt metacognitive shortcuts, copying and modifying comprehensive GenAI output rather than engaging in demanding evaluation processes. In contrast, high-performing students experience cognitive load amplification: the chatbot's output creates additional load that disrupts their established iterative analytical processes (Kalyuga & Singh, 2016). In the end, both groups adopt similar duplication strategies (i.e., copying large parts of the chatbot's output), leading to similar performance outcomes.

Our findings contribute to theory and practice as follows. Theoretically, we distinguish an equalizing effect (as opposed to a democratizing one; Brynjolfsson et al., 2025; Dell'Acqua et al., 2023; Noy & Zhang, 2023) of GenAI in ill-defined, time-limited case exams and explain the cognitive mechanism driving the equalization. We hereby clarify when and why GenAI helps or hinders student performance and thus reconcile prior mixed findings (Choi & Schwarcz, 2024; Fisher et al., 2025; Prather et al., 2024; Stadler et al., 2024) by foregrounding task structure and varying cognitive loads. For business school educators, we highlight how traditional assessments can mis-measure capability when GenAI is available: the equalizing effect shows how very different abilities converge on similar scores. For management education practice, our findings suggest introducing a staged approach to curriculum design and assessment, capacity building for students and faculty, and ultimately reclaiming institutional agency over GenAI architecture through thoughtful integration of GenAI in teaching, for example, with the help of customized learning tools.

TECHNOLOGY AND HIGHER EDUCATION

Technology is intertwined with education, influencing how students study and how student performance is assessed (Farazouli, Cerratto-Pargman, Bolander-Laksov & McGrath, 2024; Lajoie & Azevedo, 2006). Salomon, Perkins, and Globerson (1991) offered an important distinction regarding technology's impact on individual performance by arguing that technology can have two cognitive effects: effects with technology and effects of technology. Effects with technology refer to performance enhancements observed during technology use, such as improved problem-solving or reasoning. Effects of technology, on the other hand, are transferable cognitive benefits that persist even after the technology is no longer in use.2 This framing highlights how a given technology shapes a student's cognitive processes.

GenAI constitutes the most recent development in technology and education (Valcea et al., 2024), with surveys showing widespread student uptake (Flaherty, 2025; Freeman, 2025). Outside education, the dominant emerging narrative is that using GenAI has positive effects. Generally, individuals who obtain access to a chatbot perform better on the same task than those who do not have access. These results have emerged across various tasks, such as customer service (Brynjolfsson et al., 2025), consultancy (Dell'Acqua et al., 2023), coding (Cui, Demirer, Jaffe, Musolff, Peng & Salz, 2025; Nie et al., 2024; Peng, Kalliamvakou, Cihon & Demirer, 2023), creativity (Doshi & Hauser, 2024), and more general writing tasks (Noy & Zhang, 2023; Yu, Xu, CH-Wang & Arum, 2024).
Beyond the positive average treatment effect, studies often imply a democratizing distributional effect of GenAI, where lower performers improve more than high performers (e.g., Brynjolfsson et al., 2025; Cui et al., 2025; Dell'Acqua et al., 2023; Doshi & Hauser, 2024; Merali, 2024; Noy & Zhang, 2023; Riedl & Weidmann, 2025). However, the results are in fact heterogeneous, with some studies showing mixed (Otis et al., 2024) or even negative results (Becker, Rush, Barnes & Rein, 2025), particularly in the context of higher education (Choi & Schwarcz, 2024; Fisher et al., 2025; Lehmann, Cornelius & Sting, 2024; Prather et al., 2024; Stadler et al., 2024). For example, Choi and Schwarcz (2024) showed that ChatGPT-4 improved performance on multiple-choice tests in a legal educational context, yet they found no positive effect in a more complex essay-writing exam. Similarly, Prather et al. (2024) noted that novice programmers often struggled when using GenAI. This challenge is not unique to students; even experts (with both domain and GenAI expertise) face similar hurdles (Tankelevitch et al., 2024). For instance, a recent study of experienced software developers found that access to GenAI actually made them slower when coding, due to the increased cognitive overhead of problem-framing and evaluation (Becker et al., 2025; see also Simkute et al., 2025).

Nature of the Task

What explains this discrepancy in results? We argue that the nature of the task shapes the user's cognitive engagement. Tasks can be described as existing on a continuum from well-defined to ill-defined (Reitman, 1964; Simon, 1973). Well-defined problems are unambiguous: they specify initial conditions and involve known procedures and clear end goals. In contrast, problems are ill-defined when one or more of these elements is underspecified; for example, the goal is ambiguous and procedures are unknown or tacit, thus requiring problem-solvers to (re)frame the problem and structure an approach.
This distinction has cognitive consequences, since well-defined and ill-defined problems engage distinct cognitive processes, in particular in terms of monitoring and evaluating the problem-solving process (Schraw, Dunkle & Bendixen, 1995).

Strikingly, most studies that document clear performance gains from the use of GenAI rely on well-defined tasks with specified framing, end goals, and relatively clear solutions (e.g., Dell'Acqua et al., 2023; Doshi & Hauser, 2024; Noy & Zhang, 2023; Riedl & Weidmann, 2025). Experiment participants may have to solve a business case based on 18 well-defined questions (Dell'Acqua et al., 2023) or write an eight-sentence creative story about an adventure on the open seas (Doshi & Hauser, 2024). In both scenarios, it was possible and efficient to simply copy the task into a chatbot to obtain a better response than what the average human would have composed in the given timeframe. Recent results on Olympiad-style math and competitive programming align with this pattern: state-of-the-art GenAI models (e.g., ChatGPT-5, Gemini 2.5 Pro) achieve gold-level or top-rank performance on well-defined tasks with clear solutions (Financial Times, 2025; Luong & Lockhart, 2025).

2 The emergence of calculators illustrates the distinction: when used, they can boost performance by offloading routine calculations (an effect with technology), whereas continued use can lead to a weakened ability to perform calculations independently, while also enabling deeper engagement with conceptual reasoning (an effect of technology). Ellington (2003) shows that the main effects of technology were generally positive; that is, the introduction of calculators in teaching led to deeper problem-solving efforts and abilities.

In contrast, for ill-defined tasks, where participants must define the problem, choose an approach, and monitor and revise model output, the pattern can change. This is showcased in Otis et al.'s (2024) study of Kenyan entrepreneurs; on average, participants did not benefit from accessing the chatbot. Yet outcomes diverged: stronger-performing entrepreneurs improved their task performance when given access to GenAI, whereas lower-performing entrepreneurs saw their performance decline. Analysis of chatbot logs indicates that high performers asked more narrowly defined and implementable questions, while low performers asked ambiguous or even unanswerable questions, which yielded advice they could not implement (Otis et al., 2024). The aforementioned Becker et al. (2025) study is of a somewhat similar nature. Here, expert programmers were tasked with solving some of their own ongoing challenges; after having selected a problem, they were randomized into a condition in which GenAI could be used, or not. During the problem-solving process, they had to continuously reframe the problem and evaluate the obtained GenAI output. Overall, the experienced programmers performed worse when using state-of-the-art GenAI than when they did not use this technology (Becker et al., 2025).

Taken together, these mixed results are often attributed to differences in discipline, task difficulty, GenAI model, or user population. Following Otis et al.
(2024), we argue that the nature of the task plays an important role. However, it is also crucial to explain the cognitive mechanisms at play. GenAI changes the composition of mental work because it typically reduces the effort spent on planning and writing, while increasing the effort needed to monitor, evaluate, select, and revise suggested text (Tankelevitch et al., 2024). In well-defined tasks, this reallocation aligns with the assignment, since the model's polished, pre-structured output can be copied and submitted, perhaps with small refinements. Yet in ill-defined tasks, users must first define the problem and then monitor, evaluate, and revise model output (see Otis et al., 2024).

This creates a pedagogical tension: GenAI produces fluent, high-volume, seemingly sophisticated, well-structured outputs across domains (Hannigan, McCarthy & Spicer, 2024; Lindebaum & Fleming, 2024) that can either scaffold performance or displace the cognitive processes the assessment seeks to elicit, especially under time pressure. Put differently, task type (well- vs. ill-defined) and the epistemic character of GenAI outputs shift the mental effort required to engage effectively with the task. Cognitive load theory offers a relevant lens to theorize these shifts in effort and cognitive processing (Tankelevitch et al., 2024; Yan et al., 2025).

Cognitive Load and Metacognition

Defined as the intensity of goal-directed cognitive activity (Kalyuga & Singh, 2016), cognitive load arises while learners process information in order to complete a task. It is commonly split into three types. Intrinsic load is the unavoidable effort caused by the task's inherent complexity; easier tasks have lower intrinsic load. Extraneous load is the mental effort caused by the poor design or presentation of instructional materials.
For example, unnecessary animations, redundant text, or a confusing layout can force individuals to spend mental resources on activities that do not support their ability to solve the given task. Finally, germane load is the effort invested in organizing, connecting, and refining information into useful schemas and understanding (Kalyuga & Singh, 2016; Sweller & Chandler, 1994). To illustrate: watching a video about a complicated, unfamiliar case produces intrinsic load; poorly structured slides or other kinds of digital assistance add extraneous load; and deliberately integrating information and creating understanding constitutes germane load.

When total load exceeds working memory capacity, individuals cannot allocate enough resources to evaluate and construct task-relevant schemas, and performance suffers (Kalyuga & Singh, 2016). Thus, individuals must actively manage their own cognitive processes; metacognition, "the psychological ability to monitor and control one's thoughts and behavior" (Tankelevitch et al., 2024: 1), guides what to attend to and what to ignore, keeping extraneous load in check.

GenAI poses a particular challenge to metacognitive regulation and overall cognitive load because it can generate large amounts of text very quickly. In other words, GenAI may lower intrinsic load by providing structure, examples, and relevant terminology, yet this technology can also raise extraneous load by overwhelming users with plausible,
pre-organized text that must be monitored and evaluated (Simkute
et al., 2025). The net effect might be higher metacognitive demand, especially under time pressure, when attention for monitoring and selection is limited (Galy, Cariou & Mélan, 2012). In other words, the "cognitive ease of chatbots" may come at a cost, as illustrated by Stadler et al. (2024), who show that students using ChatGPT (vs. conventional web search) reported lower mental effort across all three kinds of cognitive load and produced weaker arguments. This discrepancy occurs because the reduction in cognitive load does not guarantee that cognitive resources are directed toward germane processing, the deep cognitive work necessary for learning and critical thinking. Without germane processing, GenAI's voluminous and convincing output can create an illusion of understanding without achieving real depth (Stadler et al., 2024). Consistent with this, recent meta-analytic evidence on cognitive load studies (Tetzlaff, Simonsmeier, Peters & Brod, 2025)3 indicates uneven benefits: lower performers tend to gain from assistance, since they may receive relevant information that they did not have (increased intrinsic cognitive load). In contrast, higher performers can see smaller or even negative effects because additional information introduces redundancy and may disrupt existing schemas and workflows.

To summarize, the "democratizing" narrative lacks a clear cognitive mechanism to explain conflicting results. A cognitive-process lens clarifies that GenAI redistributes mental work in ways that vary by task structure and user ability: in well-defined tasks with clear criteria, GenAI can function as a straightforward booster by supplying fluent, pre-structured text that imposes relatively low metacognitive load (e.g., Dell'Acqua et al., 2023).
In ill-defined tasks, by contrast, users must first specify the problem, decide how to interact with the model, and then monitor, evaluate, and revise its outputs, raising extraneous load and metacognitive demands, particularly under time pressure (Simkute et al., 2025; Tankelevitch et al., 2024). From this cognitive perspective, gains from chatbots are likely to be unevenly distributed across high and low performers, because the two groups experience different load compositions and possess different capacities for metacognitive regulation. These considerations directly motivate the following hypotheses.

Hypothesis 1

Our study is anchored in a realistic exam scenario in a Nordic business school setting. Students are asked to analyze a traditional business case, identify and prioritize relevant information, and construct a coherent, structured written response under time constraints. The questions they respond to are not well-defined; rather, they have ill-defined characteristics, since they are ambiguously open-ended and do not specify a particular procedure for answering them (see Supplementary Material, Section B).

GenAI is extensively trained on texts that include academic literature, managerial frameworks, and real-world examples from business contexts. As such, it is well-positioned to assist students in generating relevant, domain-specific content (Dell'Acqua et al., 2023). Furthermore, the chatbot also supports the structuring and synthesis of information, helping students with the cognitive challenge of structuring arguments, developing a logical flow, and articulating ideas fluently (Lodge, Yang, Furze & Dawson, 2023; Noy & Zhang, 2023). From a cognitive load theory perspective, this scaffolding function can reduce intrinsic load. GenAI can also provide relevant schemas to deploy, which frees up working memory for task engagement and reasoning (Kalyuga & Singh, 2016; Sweller & Chandler, 1994).
It can also help produce coherent, fluent writing, which is particularly beneficial for non-native English speakers (Herbold, Hautli-Janisz, Heuer, Kikteva & Trautsch, 2023). Taken together, we expect that students with access to the chatbot will, on average, produce stronger responses than those without such access.

Hypothesis 1. When analyzing a business case, having access to ChatGPT-4 improves student performance.

Hypothesis 2

The reviewed studies suggest that the performance benefit of GenAI is not uniform across ability levels. In particular, lower-performing individuals often gain more from GenAI support than their higher-performing peers (e.g., Brynjolfsson et al., 2025; Choi & Schwarcz, 2024; Dell'Acqua et al., 2023; Doshi & Hauser, 2024). We expect a similar, uneven pattern in our setup for the following reasons. First, low performers often struggle with two basic challenges: a lack of domain-specific knowledge and difficulty structuring their task approach under time pressure. GenAI addresses these challenges by offering organized, relevant content (Lodge et al., 2023), particularly benefiting students who struggle with English proficiency or academic writing (Herbold et al., 2023). For these students, the chatbot reduces intrinsic cognitive load, acting as a scaffold that raises performance with relatively little additional effort.

Second, task design influences the effectiveness of GenAI's support. Although the chatbot performs well with detailed prompts (Dell'Acqua et al., 2023; Doshi & Hauser, 2024), its output quality decreases when it encounters ambiguous expectations or limited contextual details (Mollick, 2024). We argue that achieving excellence with GenAI on ill-defined tasks requires additional planning, monitoring, and evaluative effort from high performers. Yet under time pressure (such as an exam situation), there might not be adequate time for such metacognitive regulation, making it harder to monitor and evaluate guidance efficiently (Becker et al., 2025; Choi & Schwarcz, 2024; Otis et al., 2024; Stadler et al., 2024). In contrast, low-performing students stand to gain substantially more from using GenAI than high-performing students, because its structured responses effectively bridge their knowledge gaps and can help them move from low to average performance with relatively little effort (Choi & Schwarcz, 2024).

Finally, the closed grading scale used in our task setting constrains the potential for further improvement among high performers, who may already be near the upper limit of the scale. In contrast, starting from a much lower baseline, low performers have more room for substantial improvement. This grading structure reinforces our expectation that low performers will benefit more than high performers when given access to GenAI. Hence, our second hypothesis predicts an uneven impact, as follows:

Hypothesis 2. When analyzing a business case, having access to ChatGPT-4 improves student performance more for low performers than for high performers.

3 This meta-analysis did not aim to cover studies involving GenAI, and its conclusions thus draw on non-GenAI instructional settings.
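One common way to formalize the two hypotheses is a linear model of the Task 2 score with a treatment-by-baseline interaction. The notation below is ours, offered only as an illustrative sketch; the paper's preregistered specification may differ.

```latex
\text{Score}_{i,2} \;=\; \beta_0 \;+\; \beta_1\,\text{GenAI}_i \;+\; \beta_2\,\text{Score}_{i,1} \;+\; \beta_3\,\bigl(\text{GenAI}_i \times \text{Score}_{i,1}\bigr) \;+\; \varepsilon_i
```

Here $\text{GenAI}_i$ is an indicator for chatbot access and $\text{Score}_{i,1}$ is participant $i$'s baseline (Task 1) grade. In this sketch, Hypothesis 1 corresponds to a positive average treatment effect, and Hypothesis 2 to a negative interaction ($\beta_3 < 0$): the benefit of access shrinks, and may reverse, as baseline performance rises.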
METHODS

We employed an embedded mixed-methods design (Creswell & Plano Clark, 2011) following a test-and-explore approach (Wellman et al., 2023), with methodological integration at both the design and inferential levels. We first test the causal effect of GenAI on student performance through an experiment; then we explore the underlying cognitive mechanisms through qualitative inquiry. The quantitative component provides causal identification of GenAI's impact across different performance levels, while the qualitative component uncovers the cognitive processes and interaction patterns that explain why these effects occur. This integration enables us to move from identifying to what extent GenAI differentially impacts high and low performers to understanding why these differences emerge, generating actionable implementation recommendations for management education.

Quantitative Study: Experimental Design

We conducted a randomized controlled trial to assess the impact of using ChatGPT-4. The study setup, hypotheses, and analytical approach are preregistered, and we received ethical approval from our university. Participants were recruited from the university's behavioral lab subject pool via email invitations; no course enrollment or course credit was involved. Data were collected in October–November 2023 in the on-campus social science lab, with sessions of 24 participants. The sample comprised 146 participants (73 in each condition4), almost all students, with an average age of 24.1 years. Of the participants, 85 identified as female and 55 as male (see Table 1). English fluency was the only inclusion criterion. Sessions lasted approximately 70 minutes. Participants were paid a flat rate (about €19) and a potential performance bonus; the top 20% of best-performing participants had a 10% chance of winning about €53 in a lottery.

Experimental task: Procedure.
Participants watched a 13-minute video presenting an organiza- tional behavior case typical of business school exams—see Supplementary Material (Section B) for a complete overview of case material. The problem was ambiguous by design, since students needed to clarify the end goal, reframe the problem, and argue a position without a well-defined route to “the” right answer. Core issues related to extrinsic and intrinsic motivation, group or individual incentives, and creat- ing a collective identity. We relied on two attention checks: the video contained a number of female quotes and they were asked to recall the number of such quotes and the name of the case company. No participants failed the attention check. Task 1: Baseline performance: In their first writing task, all participants answered a question about the 4 Although we had an even distribution of participants, a handful in the treatment group did not use the chatbot. We decided to focus our subsequent analyses on differ- ences between those who used the chatbot and those who did not, rather than on intention to treat, which is the treat- ment allocation. Results obtained using the treatment allo- cation distinction are virtually identical. 6 Academy of Management Learning & Education Month
given case study. They had 22 minutes to complete Task 1 and were given access to a transcript of the video. Task 1 constituted the baseline performance in the experiment and was subsequently used to distinguish high and low performers in the quantitative and qualitative analysis. Following this first task, participants completed a short survey on efficacy and enjoyment.

Task 2: Treatment and control: In Task 2, participants received a new question (see Footnote 5) and were again informed they had 22 minutes to respond. Some participants were now randomly given access to a React web app interface linked to a GPT-4 API. The interface of the web app was similar to OpenAI's chatbot interface (see Supplementary Material, Section A, Figure S1), and participants could access all their interactions through the interface. Through the web app, we could record all participant interactions (prompts and answers). The system prompt was: "I am a helpful assistant. Ask me anything." The temperature parameter (see Footnote 6) was left at the default value of 0.8, which is appropriate given that the task was not well-defined. Thus, generated responses were expected to be diverse, but not random (the upper limit is 2). As in Task 1, participants had access to the transcript of the video but could not copy from it (see Footnote 7).

For logistic reasons, the randomization was carried out per experimental session, not per participant. The setup explicitly imitated an exam situation. The treatment group was informed that they had access to ChatGPT-4. The control group was informed that this was an exam setting in which accessing the Internet was allowed, while chatbots could not be accessed. We could monitor all behavior, and no one in the control group was excluded. Receiving this kind of information on the availability of resources was typical for students at the focal university. It was thus not straightforward for participants to infer whether they were in a control or treatment condition.
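To illustrate the setup, here is a minimal sketch of how such a web app could assemble each GPT-4 request. The system prompt and the temperature of 0.8 are taken from the text above; the function name, parameter dictionary, and history format are our own assumptions, not the authors' implementation.

```python
# Sketch of request assembly for the exam chatbot described above.
# The system prompt, model family, and temperature come from the study text;
# all names here are hypothetical, not the authors' code.
SYSTEM_PROMPT = "I am a helpful assistant. Ask me anything."

REQUEST_PARAMS = {
    "model": "gpt-4",    # assumed model identifier for the GPT-4 API
    "temperature": 0.8,  # default value; diverse but not random (range 0-2)
}

def build_messages(history, user_prompt):
    """Assemble one chat-completion request, keeping the full interaction
    history so every prompt and answer can be logged, as in the web app."""
    return (
        [{"role": "system", "content": SYSTEM_PROMPT}]
        + list(history)
        + [{"role": "user", "content": user_prompt}]
    )
```

Sending this message list to a chat-completion endpoint with REQUEST_PARAMS, and appending each prompt and response to the history, would reproduce the logging behavior described above.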
Post-task survey: After completing Task 2, participants answered a survey, including demographic questions, questions about how they experienced the tasks, and their experience with cases. Participants in the treatment group were asked about their prior experience with GenAI. Finally, all participants rated their optimism about using AI technologies in higher education (Table 1).

TABLE 1
Descriptive Statistics

                                Treatment                      Control
Measured after                  Task 1         Task 2          Task 1         Task 2
Gender                          45 females and 24 males        40 females and 31 males
Age                             23.6 (3.86 SD)                 24.6 (6.38 SD)
Variable                        M      SD      M      SD       M      SD      M      SD
Case experience (a)             NA     NA      2.913  1.369    NA     NA      3.084  1.295
Felt effective (b)              5.913  2.349   6.797  2.304    5.577  1.887   6.366  2.051
Felt enjoyment (c)              6.565  2.172   7.057  2.120    6.098  2.224   6.549  2.068
Felt skilled (d)                5.724  2.064   6.217  2.215    5.394  1.744   6.408  1.761
ChatGPT experience (e)          NA     NA      5.666  2.888    NA     NA      5.169  3.018
Useful aid (f)                  NA     NA      3.289  1.295    NA     NA      2.323  1.317
Effort (g)                      NA     NA      3.768  0.876    NA     NA      3.591  0.803
Optimistic AI productivity (h)  NA     NA      6.927  2.475    NA     NA      6.239  2.411
AI impact (i)                   NA     NA      6.434  2.245    NA     NA      6.084  2.334
Task evaluation (grade)         5.623  1.791   5.478  1.819    4.633  2.205   4.323  1.895

Footnote 5: The question participants answered in Task 2 was: "Keeping in mind that scientific work is often more successful when undertaken in teams, please discuss what the Cure Factory Copenhagen [name of the case company] should consider when motivating teamwork."

Footnote 6: The temperature parameter controls how predictable (versus random) the chatbot's output is. Low temperatures render outputs that are predictable and repetitive.

Footnote 7: This was a deliberate choice because we expected participants in the treatment group would have an unfair advantage if they could copy-paste the entire case. This also captures a real-life exam situation in which students must deal with considerably longer cases that could not be parsed easily by the—at the time—state-of-the-art GenAI.
2026 Bergenholtz, Vuculescu, Günzel-Jensen, and Frederiksen
Measures

Task 1 and 2 evaluation: Performance in Task 2 constituted our dependent variable. On average, participants produced about 270 words across both conditions. Performances in Tasks 1 and 2 were assessed by two of the authors with extensive experience in teaching and grading organizational behavior exams of a similar nature. All tasks were rated by both on a scale from 1 to 10. Interrater agreement was 90.5% and 90.9% for Tasks 1 and 2, respectively. In other words, on a scale from 1 to 10, the assessors differed on average by 0.85 (Task 1) and 0.82 (Task 2). If the assessors disagreed on a score, they discussed the submission and agreed on a rating. The grading criteria followed how exams in an organizational behavior bachelor-level course would be graded at the given university: identifying and prioritizing key challenges implied by the exam case and question; using relevant terminology and clear, consistent language; and using relevant insights to answer the question. See Supplementary Material (Section C) for selected examples of low-, medium-, and high-quality submissions to Tasks 1 and 2.

Case study experience: Participants were asked, on a scale from 1 to 5, to assess if this case-based task reflects something they have done in the past.

Felt enjoyment: Participants were asked to assess how much they enjoyed the task, on a scale from 1 to 10 (following Noy & Zhang, 2023). This was asked for both Task 1 and Task 2.

Felt skilled: Participants were asked to assess how skilled they felt while completing the task, on a scale from 1 to 10. This was asked for both Task 1 and Task 2.

Felt effective: Participants were asked to assess how effective they felt while completing the task, on a scale from 1 to 10. This was asked for both Task 1 and Task 2. The last two variables are categorized as self-efficacy by Noy and Zhang (2023).
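The discrepancy figures above (assessors differing on average by 0.85 and 0.82 points on the 1–10 scale) are mean absolute differences between the two raters. As a small illustration, with invented ratings rather than the study's data:

```python
# Sketch of the interrater discrepancy measure reported above: the mean
# absolute difference between two raters' 1-10 scores. Ratings are made up.
def mean_abs_diff(rater_a, rater_b):
    """Average absolute gap between paired ratings of the same submissions."""
    assert len(rater_a) == len(rater_b)
    return sum(abs(a - b) for a, b in zip(rater_a, rater_b)) / len(rater_a)

# Example: three submissions scored by both raters; gaps are 1, 0, 0.
gap = mean_abs_diff([7, 5, 9], [8, 5, 9])
```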
ChatGPT experience: Participants were asked to assess whether they had previously engaged with chatbots such as ChatGPT, Bing, or others (following Doshi & Hauser, 2024; Noy & Zhang, 2023).

Useful aid: Participants were asked to assess how useful their aid was: either the Internet or ChatGPT-4 plus the Internet, depending on the experimental condition (following Noy & Zhang, 2023).

Effort: Participants were asked to assess how much effort they put into the tasks, on a scale from 1 to 5 (adapted from Ryan, Mims & Koestner, 1983).

Optimistic AI productivity: On a scale from 1 to 10, we asked participants how optimistic they are about AI making higher education students more productive (adapted from Noy & Zhang, 2023).

AI impact: On a scale from 1 to 10, we asked participants how they feel about the impacts of future advances in AI on higher education (adapted from Noy & Zhang, 2023).

Education: We asked participants about their field of study and level of education.

Mother tongue: We asked participants what their mother tongue is (5% stated English).

Preliminary data analysis: One participant was dropped from the dataset due to technical issues recording their data. Two participants were excluded because their answers did not match the minimum requirements set by the preregistration. Finally, after inspecting the records of the chatbot, two more participants were flagged because they had found a technical workaround and copy-pasted the entire transcript (see Footnote 8). Thus, the final sample size was 141 participants. In Supplementary Material Table S1 we show the correlation between all collected controls and the main dependent variable: performance in Task 2.

Qualitative Study: Semi-Structured Interviews

To identify the mechanisms underlying the performance effects of chatbot utilization in business school exam settings, we conducted qualitative interviews with participants from the treatment group within 24 hours post experiment to maximize recall accuracy.
Following the experiment, interviews with treatment group participants were conducted on a first-come, first-served basis. We stopped adding further participants to the interviews once new data began to echo earlier observations (Glaser & Strauss, 1967; Lund, 2014). Altogether, we interviewed 18 participants, all of whom were students at the business school; seven were low performers (labeled L1–L7 in the findings section), and 11 were high performers (labeled H1–H11).

The semi-structured interviews were conducted online and designed to foster open dialogue. Following a narrative approach (Weiss, 1995), we explored participants' prior GenAI experience, experiences with Tasks 1 and 2, objectives for using GenAI in Task 2, assessments of Tasks 1 and 2 outcomes, and preferences for working independently or with the chatbot. We used an interview protocol as a guide and encouraged participants to freely share their experiences and emotions to capture what they considered relevant and important. To encourage candor, we ensured anonymity. The interviews averaged 25 minutes and were recorded and transcribed verbatim, yielding 173 single-spaced pages. In addition, we collected all chatbot interactions—including the 18 participants' prompts and the chatbot output they received—to triangulate self-reported experiences with actual usage patterns.

We conducted a rigorous coding and analysis process following established qualitative procedures (Gioia, Corley & Hamilton, 2013; Miles, Huberman & Saldana, 2014) on the interview data as well as the interview participants' chatbot interactions. The inductive coding process began with first-order coding. We identified informant-centric concepts regarding how students used the chatbot, their experiences with Tasks 1 and 2, and their perceptions of the outcomes and process of working with and without the chatbot. This initial coding revealed a striking contrast: although students demonstrated markedly different experiences in Task 1, their experiences became similarly chatbot-dependent in Task 2. This observation led us to consult literature on cognitive load theory and human–AI interactions (Kalyuga & Singh, 2016; Simkute et al., 2025; Sweller & Chandler, 1994; Tankelevitch et al., 2024) to develop our second-order themes, looking for deeper theoretical patterns in how students approached and integrated chatbot outputs. In the final stage, we aggregated these themes into overarching theoretical dimensions, which revealed contrasting patterns between high and low performers, particularly in their intended versus actual utilization of the chatbot, in cognitive processing of chatbot output, and in preferences for working with chatbots.

Footnote 8: Results are robust to these exclusions, with coefficient estimates showing minimal variation and maintaining the same level of statistical significance.

RESULTS: EXPERIMENTAL STUDY

Figure 1 presents performance score distributions for Task 2 across control and treatment conditions. The distribution indicated that participants in the treatment condition tended to obtain higher scores.
The first preregistered hypothesis stated that having access to ChatGPT-4 improves performance. To test this, we conducted a paired sample t-test. We found support for our hypothesis (p = .0014); participants in the control group displayed lower performance than those in the treatment group (Figure 2A). We then conducted an ordinary least squares (OLS) regression that included demographic controls (age and gender) and our independent variable (Table 2; see Footnote 9). Model 3 includes the grade in the first period as a control because we expect performances to be correlated. We included additional controls in subsequent models (Models 1–5 in Table S2 in the Supplementary Material). No demographic variables were significantly associated with the dependent variable. The additional task-related variables (e.g., case experience, perceived enjoyment, perceived skill, or effectiveness) were also poor predictors of the dependent variable. A similar trend aligns with general perspectives on AI (use of GenAI, AI impact, and perception of AI in education). See Table S2 in the Supplementary Materials for a detailed overview of the results of the OLS, including all variables.

For our second hypothesis, we anticipated that low performers (measured as performance in Task 1) would benefit more from using the chatbot than high performers. Consistent with this, we carried out within-subject comparisons that show that low performers improved from Task 1 to Task 2 (paired t-test, p < .0001), whereas high performers, on average, did not. In fact, their mean score decreased (p < .001). To further explore these results, we fit a regression model with an interaction effect between a dummy variable (low performers, coded 1 if their performance in Task 1 was less than or equal to 5, and 0 otherwise; see Footnote 10) and the treatment.
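The within-subject comparisons above rest on the standard paired t statistic. A minimal sketch in pure Python, with invented scores rather than the study's data:

```python
# Generic paired t statistic, t = mean(d) / (sd(d) / sqrt(n)), where d is the
# per-participant Task 2 minus Task 1 score difference. The scores below are
# invented for illustration; this is not the authors' analysis code.
from math import sqrt
from statistics import mean, stdev

def paired_t(task1_scores, task2_scores):
    """Return the t statistic for the within-subject Task 1 -> Task 2 change."""
    diffs = [t2 - t1 for t1, t2 in zip(task1_scores, task2_scores)]
    return mean(diffs) / (stdev(diffs) / sqrt(len(diffs)))

# A positive t indicates improvement from Task 1 to Task 2.
```

The reported p values then come from the t distribution with n - 1 degrees of freedom.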
Results are shown in Table 3 and support Figure 2: the differences between treatment and control seem to be driven by the gains in performance among low performers, while those who performed well in Task 1 did not benefit from using the chatbot.

Exploratory Results

We analyzed participants' prompting behavior using data collected through the React web app to explain the chatbot's impact on performance. Specifically, we constructed two additional measures: the number of prompts a participant used (see Footnote 11) and the average number of words per prompt. Results showed that good performance in Task 2 was correlated with the number of prompts used (p = .039, r = .27) and the average number of words used per prompt (p = .075, r = .24, log transformed). However, as illustrated in Figure 3, most participants used relatively few and short prompts. Only 25% of participants had prompts averaging more than 26 words (see Footnote 12), and the median number of words per prompt was 15. We found no significant differences between low and high performers, as defined by their performance in Task 1, concerning the number or length of prompts.

Footnote 9: Mother tongue is not included because only seven participants self-categorized as native English speakers.

Footnote 10: The results are qualitatively the same if we use 6 as a threshold, although the interaction effect is somewhat reduced. By using 4 and 7 as a threshold, the interaction effect becomes insignificant at the conventional level of 0.05, but the direction remains unchanged.

Footnote 11: After eliminating prompts like "Hi" or "Thank you!", 153 prompts remained, on average 2.5 per user.

FIGURE 1
Performance in Task 2 for Control (Diagonal Stripes) or Treatment (Solid Gray) Conditions
[Histogram omitted. Note: Figure 1 compares performance in Task 2 across treatment (solid gray) and control (diagonal stripes) conditions.]

FIGURE 2
Performance Comparison in Task 2
[Boxplots omitted; panel annotations report p = 0.003018, p = 0.939, and p < 0.001. Notes: Figure 2A compares performance in Task 2 across treatment (solid gray) and control (diagonal stripes) groups. Figure 2B contrasts the performance in Task 2 of those who scored in the bottom half versus the top half in Task 1, with treatment (solid gray) and control (diagonal stripes) groups.]

RESULTS: QUALITATIVE STUDY

Our qualitative analysis reveals the mechanism we term GenAI-induced cognitive load inversion, which explains how utilizing the chatbot is associated with contrasting cognitive experiences for low and high performers, resulting in performance convergence. Working with the chatbot enabled low performers to adopt cognitive shortcuts, resulting in cognitive load relief, whereas high performers faced cognitive load amplification that forced them to abandon their established gradual, iterative work processes.

Low Performers

Low-performing participants encountered significant challenges with Task 1, struggling to simultaneously manage the multiple cognitive demands of the task, ranging from understanding the case content to structuring their thoughts and articulating responses. They expressed being "overwhelmed with all the information provided in the case" (L3), while others had difficulty with organization, stating, "It was a bit hard for me to begin and to do it" (L6), with one describing feeling "lost" (L6) about how to structure their work. Many also perceived that they lacked sufficient domain knowledge about the case's challenge to process the relevant information ("I don't know much about this topic," L7), leaving them dissatisfied with their ability to articulate coherent responses, stating, "I'm not proud of what I've written there" (L6). Altogether, participants felt "frustrated" (L3) and "stuck" (L5), echoing a sentiment of unease that pervaded their experience with Task 1.

Introducing the chatbot for Task 2 marked a turning point for these participants, transforming their task experience from struggle to ease.
TABLE 2
OLS Regression

Dependent variable: Task evaluation, Period 2

                           (1)                  (2)                     (3)
Task evaluation, Period 1                                              0.372*** (0.071)
Treatment (a)                                   1.204*** (0.312)       0.823*** (0.295)
Gender (male)              0.206 (0.332)        0.335 (0.319)          0.234 (0.292)
Age                        0.019 (0.031)        0.030 (0.029)          0.033 (0.027)
Constant                   4.414*** (0.764)     3.479*** (0.767)       1.707** (0.780)
Observations               141                  141                    141
R2                         .006                 .103                   .253
Adjusted R2                -.009                .083                   .231
Residual SE                1.924 (df = 138)     1.834 (df = 137)       1.679 (df = 136)
F statistic                0.389 (df = 2, 138)  5.241*** (df = 3, 137) 11.523*** (df = 4, 136)

Notes: (a) This variable does not signify belonging to a treatment group, but whether participants used the chatbot. Results are virtually unchanged for treatment allocation. * p < .1; ** p < .05; *** p < .01.

Footnote 12: We used the average number of words for participants who used more than one prompt. Results are virtually unchanged when taking the maximum number of words across their prompts.

FIGURE 3
Frequency of Prompts and Number of Words Used
[Histograms omitted. Notes: The figure on the left shows the number of prompts per user; the figure on the right shows the number of words per prompt per user.]

TABLE 3
Interaction Effect

Dependent variable: Task evaluation, Period 2

                                   (1)                    (2)
Low performers (yes)               -1.160*** (0.306)      -1.940*** (0.396)
Treatment                          0.787** (0.307)        -0.186 (0.443)
Low performers (yes) × treatment                          1.788*** (0.600)
Constant                           5.272*** (0.276)       5.767*** (0.316)
Observations                       141                    141
R2                                 .152                   .203
Adjusted R2                        .139                   .186
Residual SE                        1.777 (df = 138)       1.728 (df = 137)
F statistic                        12.323*** (df = 2, 138)  11.643*** (df = 3, 137)

Note: * p < .1; ** p < .05; *** p < .01.

One participant noted that the chatbot could provide a "helicopter view" (L3), while others described it as "a catalyst" (L6) that "jumpstarted my process" (L1). This shift stemmed from the chatbot's ability to reduce the complexity participants needed to navigate independently, as participants discovered it could serve multiple functions: generating "structure" (L5), providing comprehensive case responses, and delivering polished language with "good grammar and rich vocabulary" (L4). Given these
varied capabilities that handled what had previously overwhelmed them, many low performers became heavily reliant on the chatbot for creating the answer to Task 2, using it "for everything" (L2).

This heavy reliance, however, exposed a challenge. While the chatbot could generate comprehensive output, participants lacked the established frameworks or sufficient domain knowledge to evaluate what it produced. In consequence, students struggled to engage critically with the chatbot's analysis, as one participant admitted: "I lacked some basic knowledge to understand the output" (L6). Instead of systematic evaluation, they relied on their "intuition" (L3) or drew limited connections to personal work experiences, with another noting, "I also work in teams every day" (L5). Unable to perform this evaluation, participants took a metacognitive shortcut, applying a duplication approach where they either "copy and modify it slightly" (L3) or simply "summarized [the chatbot's output] in my answer" (L5). This pattern was reflected in most low performers' prompting behavior: they typically asked one or two simple, direct questions, accepting whatever the chatbot produced without iterative refinement or follow-up requests to elaborate, clarify, or revise specific aspects of the response (see Supplementary Material Section D for examples). This approach allowed them to bypass the cognitively demanding work of content evaluation and synthesis, with any modifications limited to cosmetic refinements to improve readability.

Despite limited critical engagement, low performers preferred using the chatbot because it reduced the intrinsic cognitive load they had struggled to manage in Task 1. The ill-defined task required students to simultaneously understand case content, select relevant information, structure arguments, and articulate responses, which overwhelmed their metacognitive abilities.
By providing ready-made case framing, argumentative structure, and draft text, the chatbot offloaded these cognitive demands, enabling low performers to adopt cognitive shortcuts. Their duplication approach did not add extraneous load. As they explained, "I didn't need to figure out how to start this task" (L2), "I knew what to write about" (L3), and they "had more to say" (L1), creating a sense of support that one participant likened to "the feeling when your father's got your back, it gives you that feeling like somebody is with you" (L6). The chatbot produced substantial cognitive load relief, allowing participants to "feel more relaxed" (L1) and work without the cognitive strain they experienced in Task 1.

High Performers

Most of the high performers articulated that they found joy in Task 1. In comparison to the low performers, they felt they had "sufficient knowledge about motivation" (H8) to understand the case content and actively embraced the analytical work required, with one noting, "I found it rewarding to think about the case deeply" (H5). Furthermore, they valued the analytical process involved in structuring their approach and articulating their responses, as another explained, "it was satisfying to organize my thoughts" (H8), and many experienced that they delivered "a good answer" (H11). Their positive experience reflected their ability to effectively manage all elements of the task, from case comprehension to response formulation.

Introducing the chatbot in Task 2 was initially met with enthusiasm by many high performers, like their lower-performing peers. They understood its potential to enhance "structure and efficiency" (H3) in their work, provide "some inspiration" (H1), help them "work more efficiently" (H5), and elevate their written language to a more "scientific level" (H4).
However, unlike the low performers who embraced the tool immediately, some high performers also reported feeling pressured to utilize the chatbot due to concerns about performance and peer perception: "If I'm not working with the chatbot but the others are, I'll not perform as good and will look bad" (H7). This awareness that the chatbot introduced new considerations beyond the straightforward analytical work they typically excelled at foreshadowed the challenges they would encounter.

The high-performing students initially aimed to collaborate with the chatbot. They intended to actively integrate their perspectives with the chatbot's output, with one stating, "I aimed at modifying the way that the chatbot presented the output" (H6), and another embracing a "teamwork makes dream work" (H10) attitude. This collaborative approach mirrored their established problem-solving method, which was gradual and iterative. One student described this process: "Normally, I would start thinking a bit, then formulate some ideas and writing out the most promising" (H3)—an approach that had served them well in Task 1.

However, this intended co-creation proved challenging in practice, as the chatbot's output created overwhelming cognitive demands. Instead of providing starting points for their iterative approach, participants encountered complete, well-articulated answers in sheer volume, as one noted: "the chatbot started listing quite a lot of stuff" (H8). This demanded simultaneous processing and evaluation, leading to "information overload" (H7). The experienced cognitive load, combined with time pressure, disrupted their established problem-solving approach.

Unable to manage the cognitive demands of working with the chatbot while maintaining their preferred analytical approach, many high performers ultimately felt constrained to adopt a duplication approach, just like the low performers. One participant stated that their submission was "basically, 99% ChatGPT" (H1), while another noted, "in the end I just copied most of the chatbot output" (H7). Notably, many high performers used only one or two simple prompts, similar to the low performers, failing to employ iterative prompting strategies that might have helped manage this complexity (see Supplementary Material Section D). Even the few who did attempt extensive revisions through multiple follow-up prompts found they lacked time to process each iteration given the time pressure of the exam situation. Despite their initial intentions to collaborate with the chatbot, the cognitive demands of processing and evaluating the comprehensive output forced them to abandon their preferred gradual, iterative approach.

Many high performers ultimately expressed a preference for working without the chatbot, despite recognizing its potential benefits. Rather than experiencing cognitive relief like low performers, high performers faced additional cognitive load demands across multiple dimensions. First, the chatbot provided (at best) minimal reduction in intrinsic load because high performers already possessed the domain knowledge and analytical schemas needed to understand the case and structure their arguments.
Second, the chatbot's voluminous output imposed substantial extraneous cognitive load: participants had to process, compare, and integrate large amounts of new information, much of which duplicated or conflicted with their own emerging ideas, creating what one described as "too much information to handle" (H6). This extraneous overload strained working memory, depleting resources that would otherwise be available for germane cognitive processing—the deep, generative thinking necessary for developing new insights and refining mental models that they could draw on in Task 1.

Forced to rely on duplication rather than their preferred gradual, iterative approach, high performers questioned whether the final output could genuinely be considered their own work, with one stating, "I didn't feel like the author" (H7). The detachment from their normal work process led to feelings of diminished accomplishment, as they reflected: "I feel I have accomplished something in Task 1, but not Task 2" (H8), "I don't think I learned [while performing Task 2]" (H5), and "I didn't challenge myself" (H1).

DISCUSSION

Our study assesses the impact of GenAI on student performance in a traditional business case analysis, a task representative of typical business school exams. We hypothesized that GenAI disproportionately benefits lower-performing students, a prediction supported by our findings. However, our results also revealed an unexpected and substantial decline in performance by high-performing students, resulting in an equalizing effect wherein performance levels across both groups converged. This equalizing effect diverges from the more positive democratizing effect noted in prior studies, in which both high and low performers benefited, albeit with low performers seeing more pronounced gains (e.g., Brynjolfsson et al., 2025; Dell'Acqua et al., 2023; Doshi & Hauser, 2024; Merali, 2024; Noy & Zhang, 2023; Riedl & Weidmann, 2025).
Our results imply that whether GenAI "democratizes" or "equalizes" performance depends less on GenAI model quality and more on how the usage of the tool reshapes cognitive work in ill-defined tasks. Because the case analysis was ill-defined, the chatbot could not fully complete the task; students had to (re)frame the problem, monitor outputs, and actively engage in revisions. We therefore interpret the equalizing pattern through a cognitive lens, where GenAI imposes heterogeneous cognitive load demands on students. We label this GenAI-induced cognitive load inversion—a mechanism whereby GenAI can simultaneously relieve the cognitive burden for low performers while amplifying it for high performers, which ultimately drives both groups toward similar performance outcomes. We thus contribute to current work on GenAI in higher education by shifting the emphasis from the output generated by GenAI tools to how cognitive work is carried out and orchestrated in collaboration with GenAI tools (Simkute et al., 2025). This lens clarifies when and why equalization, rather than democratization, emerges, by specifying the load-based pathways through which GenAI alters cognitive processes for (high vs. low) performers.

For low performers, GenAI helps with the two typical challenges, gaps in domain-specific knowledge and difficulty structuring responses, by offering relevant and well-organized output
(Lodge et al., 2023). This reduces their intrinsic cognitive load, essentially scaffolding their analytical capabilities with minimal additional effort required (Kalyuga & Singh, 2016). In contrast, high-performing students encounter a different set of demands. Rather than benefiting from ready-made content, they face the cognitively demanding task of monitoring, evaluating, and integrating GenAI's output to meet their typical high standards. In our time-constrained exam setting, this process is challenging. The voluminous, plausible-sounding output from the chatbot often introduced information that disrupted their established analytical workflows, creating what cognitive load theory would describe as extraneous load for experts. This can be considered a manifestation of the expertise-reversal effect, where support designed to help novices becomes counterproductive for those with existing expertise (Tetzlaff et al., 2025).

Furthermore, our analysis of prompting behavior reveals a critical dimension of the GenAI-induced cognitive load inversion process. Students who engaged in more extensive prompting—using longer, more detailed queries—achieved better performance outcomes. While this indicates that effective GenAI interaction requires active, iterative engagement rather than passive consumption, it also highlights a missed opportunity for high-performing students: although they possessed the domain-specific knowledge to monitor and evaluate the chatbot's outputs, their struggle to employ task-specific metacognitive regulation (e.g., critical evaluation and iterative revision of the output) limited their ability to leverage the chatbot effectively. Those who ended up underperforming while using the chatbot relied on fewer and simpler prompts, which reflects their lack of ability to plan and delegate processes to the chatbot (Tankelevitch et al., 2024).
This suggests that successful GenAI integration demands a dual competency: both domain expertise and GenAI interaction skills—a combination that neither group fully displayed in our time-constrained setting.

Seen through this lens, effective GenAI use in ill-defined tasks makes students the orchestrators of cognition: decomposing the task, delegating sub-tasks to the chatbot, and continuously monitoring, evaluating, and integrating outputs under time pressure. Even high-performing students can be overloaded by these demands, which shift effort from production to metacognitive regulation (see also Simkute et al., 2025). In this respect, GenAI functions less like a calculator (see also Lodge et al., 2023) and more like a high-volume social-media environment where attention is fragmented and working memory is taxed (Haidt, 2024); the result is increased extraneous load and heavier metacognitive control demands, especially under time limits. This helps explain convergence in our setting: high performers incurred monitoring and integration costs while low performers benefited from scaffolded structure.

These dynamics also clarify when different patterns should be expected across exam task environments. In well-defined tasks where the GenAI model holds relevant domain knowledge, both low and high performers can benefit because outputs are readily transferable. In time-limited, ill-defined tasks (e.g., our traditional business case analysis), low performers typically gain more via domain-relevant information and structural scaffolds, while high performers face increased evaluative and integration load—leading to equalization. Time pressure thus appears to act as a moderator in these dynamics: the cognitive load burden intensifies in time-pressured situations, where evaluating and integrating GenAI output must happen rapidly.
This does not imply that equalization effects would disappear in less time-pressured settings, but rather that time constraints amplify the equalization effect. Finally, in settings without tight time constraints, the advantage may shift to high performers: richer domain knowledge and stronger metacognitive strategies enable them to define the problem, monitor and evaluate output, and iteratively revise responses, using the chatbot as an enhancement tool (Bubeck et al., 2025; Ide & Talamas, 2024; Otis et al., 2024). In different kinds of assessment formats, GenAI therefore might have varying impacts on students at different performance levels.

These task-dependent performance patterns have implications for assessment validity in management education. Traditional assessments have long served a dual evaluative function: they measure students’ ability to apply domain knowledge (Armstrong & Fukami, 2010; Biggs, 1999) while simultaneously capturing metacognitive skills (planning, monitoring, and evaluating one’s work) that predict performance beyond domain knowledge alone (Veenman, Van Hout-Wolters & Afflerbach, 2006). High and low performers typically differ more in their monitoring accuracy and strategic regulation than in domain knowledge (Dent & Koenka, 2016; Zimmerman & Schunk, 2011). With GenAI access, this differentiation erodes as technology masks underlying performance differences: low performers appear more competent than they are, while high performers’ metacognitive abilities and domain knowledge remain undemonstrated. Assessments thus fail to capture core competencies (Bearman, Tai, Dawson, Boud & Ajjawi, 2024) that are central to management education (Larson, Moser, Caza, Muehfeld & Colombo, 2024; Powley & Taylor, 2014).

2026 Bergenholtz, Vuculescu, Günzel-Jensen, and Frederiksen 15

Over time, the cognitive load inversion we observed may not only compromise assessment validity but also hinder learning itself. GenAI’s dual effect of providing procedural guidance on content and structure while simultaneously masking underlying performance deficits can deprive students of the critical feedback signals necessary for accurate self-assessment (Armstrong & Fukami, 2010), skill acquisition, and self-regulated learning (Ritz, Rietsche & Leimeister, 2023). We thus argue that addressing this assessment validity challenge requires diversifying assessment formats (Corbin, Dawson & Liu, 2025) and rethinking how GenAI is integrated in management education programs, to which we now turn.

PRACTICAL IMPLICATIONS

Our findings offer three key implications for management learning and education practice. First, the equalization effect we identified shows how traditional assessment methods fail to reliably measure student capabilities when GenAI tools are available. This necessitates a reassessment of evaluation practices in management education. We propose a staged approach to curriculum design and assessment at the program level that addresses the cognitive load challenges our study unveils. In foundational courses, assessments must occur without GenAI access to establish students’ baseline domain knowledge and analytical capabilities. This foundation is essential because, as our findings show, students need robust conceptual understanding to engage meaningfully with GenAI output rather than simply adopting it wholesale. Only after demonstrating domain knowledge in exams without GenAI access should students progress to GenAI-integrated assessments that evaluate their ability to create, monitor, evaluate, and revise GenAI output.
Such a staged approach prevents the cognitive shortcuts we observed among low performers while ensuring students develop the metacognitive skills necessary for effective GenAI collaboration rather than facing information overload. This challenge is particularly acute in online management education, where students already manage substantial cognitive load from navigating learning management systems, video platforms, and digital collaboration tools (Bol & Garner, 2011). Introducing GenAI into this already technology-dense environment may compound cognitive demands, particularly for students who struggle with self-regulation in remote settings. Consequently, the combination of GenAI’s equalizing effect and the inherent challenge of validating performance in remote education creates conditions where online educational credentials risk becoming unreliable indicators of competence.

Second, institutions should invest in capacity building for both faculty and students. When working with GenAI tools, students must learn how to guide their inquiry process: clarifying their current understanding, breaking down complex problems, monitoring and evaluating output, and continuously reflecting on knowledge gaps and reasoning. This self-guided metacognitive process is essential for using chatbots effectively and maintaining cognitive balance and critical thinking amidst the overwhelming content these chatbots provide (Larson et al., 2024). Faculty development programs must parallel this effort and enable student learning through interconnected components (Valcea et al., 2024). Educators need foundational GenAI literacy, that is, an understanding of how these tools work, their capabilities and limitations, and effective interaction strategies (Yan, Greiff, Teuber & Gašević, 2024).
However, while considerable attention has been paid to GenAI literacy (Southworth et al., 2023) and prompting strategies (Mollick & Mollick, 2023), our findings suggest that management education must prioritize addressing the deeper pedagogical and cognitive challenges GenAI introduces. This pressing challenge implies a return to educational fundamentals, placing metacognitive pedagogy at the forefront of faculty training to preserve human judgment in an age of algorithmic reckoning (Moser, Den Hond & Lindebaum, 2022). Instructors must be explicitly trained to understand the cognitive load implications of GenAI integration: specifically, how these tools differentially affect students with varying expertise levels, and subsequently how to teach and assess the self-regulatory skills that effective tool use requires (Xu, Qiao, Cheng, Liu & Zhao, 2025).

We recognize that organizational inertia is a persistent challenge for institutions attempting to engage in comprehensive pedagogical redesign. However, it is noteworthy that our results link back to established educational literature and core pedagogical principles. The challenges of managing cognitive friction and developing student self-regulation during tool use are, fundamentally, generic learning problems (Kalyuga & Singh, 2016; Tankelevitch et al., 2024). This strategic focus demands less reliance on new (tech) investments and more on a basic rethinking of teaching. Therefore, institutions should leverage existing structures, such as educational development centers, to drive a centralized approach. This ensures a consistent, research-backed GenAI strategy focused on pedagogy that unifies efforts across different business school departments and faculty, rather than outsourcing the response to individual, isolated efforts.

16 Academy of Management Learning & Education Month

Finally, while the organizational and pedagogical shift is the immediate priority for preserving assessment validity, a more demanding next step is required to secure the long-term future: institutions must actively reclaim agency over the cognitive architecture of learning from commercial GenAI providers. This requires building the relevant GenAI architecture that bridges the technical and pedagogical domains. Execution hinges on building capacity among selected faculty to integrate GenAI thoughtfully (Hyde, Busby & Bonner, 2024), for example through the development of customized chatbots (e.g., GPTs in ChatGPT or retrieval-augmented generation systems; see Lee, 2024) that can explicitly support metacognitive processes through guided inquiry in specific courses. Commercial chatbots readily provide answers yet can disrupt users’ natural workflows by interrupting problem-solving processes and providing suggestions that are difficult to integrate. Hence, rather than simply providing answers or forcing students to restructure their studying processes around the tool, these customized systems could be designed to align with students’ learning workflows—walking them through structured thinking processes that develop deeper understanding at appropriate moments (Dalsgaard, 2025). For students who lack domain-specific knowledge, such systems could provide additional resources, such as definitions of core concepts and specific examples. Further, these customized systems could also be designed to provide feedback on students’ metacognitive processes, encouraging the further development of these skills.
This approach transforms chatbots from simple answer providers into metacognitive scaffolds that work with students’ natural studying processes (Salomon et al., 1991) rather than against them, helping students internalize critical thinking patterns while engaging with course material (Larson et al., 2024).

LIMITATIONS AND FUTURE RESEARCH

Our data come from a single Nordic business school and were gathered during autumn 2023. Students completed an ill-structured, time-pressured traditional business case analysis with access to ChatGPT-4. This context is closest to on-site, essay-style assessments in management education, while being less informative for multiple-choice tests (Choi & Schwarcz, 2024) or longer projects (such as a master’s thesis). Most participants were non-native English speakers: while they could interact with the chatbot in their mother tongue, final submissions were in English. This may shape how students interpret, condense, and justify GenAI output (e.g., vocabulary and argumentation demands; Herbold et al., 2023). It also differentiates our study from prior experiments that primarily relied on native or highly proficient English speakers (e.g., Choi & Schwarcz, 2024; Doshi & Hauser, 2024; Noy & Zhang, 2023; Prather et al., 2024). Thus, our non-native sample both bounds external validity to similar settings and extends it by enabling theorization about multilingual use and language-moderated cognitive load in management education.

Our study also captured an early stage of institutional adaptation. At the time, the university still restricted GenAI use for exam work, limiting students’ opportunity to develop fluency with prompting and iterative use. Institutional policies are changing, yet the metacognitive challenge persists.
Granted, newer models (e.g., GPT-5 and successors) do continuously raise average output quality, but our central mechanism—the metacognitive work of monitoring and evaluating large volumes of domain-specific suggestions under time pressure—likely remains. In fact, as models generate more (high-quality) content, cognitive load can even increase unless students are taught to monitor, evaluate, and orchestrate output. An interesting parallel comes from aviation: as autopilots have improved, pilots face higher demands for cognitive control and metacognitive regulation to supervise, intervene, and validate system outputs (Simkute et al., 2025).

We explored whether individual differences explained performance but found no consistent associations with demographics, prior GenAI experience, or attitudes toward GenAI (see Supplementary Material Table S1). This pattern is consistent with our interpretation that domain knowledge and metacognitive control, rather than simple exposure or beliefs, are the binding constraints in time-limited, ill-defined tasks. Looking ahead, three individual differences are especially promising to test because they align with our mechanism: (a) working memory, which should predict the ability to monitor, evaluate, and orchestrate high-volume GenAI suggestions under cognitive load (Barrett, Tugade & Engle, 2004); (b) tolerance for ambiguity, which should predict premature acceptance versus reflective revision when outputs are plausible but uncertain (McLain, 2009); and (c) confidence (Moore & Healy, 2008). For some low performers, GenAI access may inflate confidence, encouraging shallow use. In contrast, for some high performers the
perceived presence of a “teammate” that instantly produces comprehensive answers may threaten self-image and induce performance anxiety, diverting attention toward monitoring rather than problem-solving. Accordingly, collecting confidence ratings before and after a given task or treatment could illuminate how changes in confidence relate to the nature of GenAI usage (Lee et al., 2025; Li, Yang, Liao, Zhang & Lee, 2025).

A critical question for future research concerns the cognitive consequences of GenAI memory capabilities. GenAI now features memory systems that store and retrieve information across interactions, creating the possibility of bidirectional memory where both student and tool accumulate knowledge over time. Yet this development deepens rather than resolves the challenges surfaced in our study: if students rely on GenAI memory rather than committing information to their own long-term memory, they may fail to develop the cognitive schemas required to monitor and evaluate GenAI output. Unlike the extended mind thesis’s canonical example of Otto’s reliable notebook (Clark, 2025; Clark & Chalmers, 1998), GenAI requires constant verification. Yet students cannot effectively evaluate what they have not first retained themselves (Kruger & Dunning, 1999). It is therefore important to inquire to what extent reliance on GenAI memory undermines the development of the very metacognitive skills—monitoring, error detection, and relevance judgment—that effective GenAI use requires.

Taken together, these challenges point beyond single-moment assessments toward longitudinal designs that can trace how students’ metacognitive skills evolve, as well as the balance between effects with and effects of GenAI technologies (Salomon et al., 1991).
In practice, this means following the same cohorts across courses or semesters to see whether facilitating metacognitive regulation and embedding scaffolds to reduce cognitive load can strengthen the ability to work with GenAI, and thus avoid the expertise-reversal logic (Tetzlaff et al., 2025), or whether working with GenAI diminishes students’ capacity to develop metacognitive skills. Embracing different kinds of methods will be key to capturing the different cognitive and behavioral processes and effects over time. For example, pairing simple process indicators (iteration patterns, prompt specificity, revision depth) with brief load checks and periodic performance tasks can show whether scaffolds are lowering extraneous load now and, over time, shifting effort toward germane processing that consolidates understanding and supports transfer to novel problems (Yan et al., 2025).

CONCLUSION

Our findings on the impact of GenAI on student performance present a critical challenge to the prevailing narrative of democratization. Instead of uniformly enhancing performance, our results suggest an equalization effect: a convergence of outcomes where GenAI raises the performance of low-performing students by reducing intrinsic cognitive load, yet concurrently “levels down” high performers. We argue that in ill-defined, time-pressured tasks, this dynamic is driven by GenAI-induced cognitive load inversion, where the tools intended to augment human intelligence can disrupt the analytical processes of high-performing students. Crucially, we argue that this mechanism is not merely a byproduct of current GenAI limitations. Even as future models (e.g., GPT-5.1 and successors) continuously raise average output quality, the central metacognitive challenge likely persists given the nature of current models: the need to monitor and evaluate voluminous, domain-specific suggestions under time pressure.
This challenge may also deepen as GenAI becomes embedded in software applications, where diminished user control over automated workflows would make cognitive load management increasingly difficult.

The equalization of performance signals profound risks for management education that unfold over different time horizons. In the short run, grades can mask actual capability. Low performers risk an illusion of competence (Stadler et al., 2024) as inflated scores hide foundational deficits. Meanwhile, high performers are immediately impeded by a spike in extraneous load caused by monitoring voluminous GenAI output. In the long run, the danger is that students will simply learn and understand less because, during GenAI use, students might abandon the germane cognitive processing—the deep, iterative struggle—that is essential for long-term retention and development (Kalyuga & Singh, 2016). As our findings show, without the metacognitive ability to regulate interactions with GenAI, students risk shifting from active problem-solvers into passive duplicators of content. Addressing this risk requires viewing GenAI not as a mere technological tool but as a learning environment that demands human supervision. Much like aviation, where increased automation demands higher, not lower, cognitive oversight from pilots (Simkute et al., 2025), the future of management education depends on developing students’ metacognitive ability to monitor and supervise automated outputs. Crucially,
this change in cognitive processes also clarifies one path forward for management education. We must look beyond technological solutions and consider educational fundamentals, making sure to place pedagogical principles in the driving seat of technical integration. A staged approach to GenAI integration can facilitate this, preserving GenAI-free exams to certify foundational domain knowledge and enable the development of metacognition, followed by advanced assessments that specifically measure the orchestration of GenAI. By certifying robust domain knowledge first, we safeguard the “effects of” technology, specifically the retention of independent capability. Only then can we effectively teach the “effects with” capability of orchestrating GenAI activities. Enforcing this separation can help ensure that GenAI serves as a scaffold for expertise rather than a ceiling on student potential. Ultimately, this requires business schools to reclaim agency rather than outsourcing the cognitive architecture of learning to commercial providers.

REFERENCES

AACSB International. 2025. GenAI adoption in business schools: Deans and faculty respond. AACSB. Retrieved from https://www.aacsb.edu/insights/reports/2025/genai-adoption-in-business-schools-deans-and-faculty-respond

Armstrong, S. J., & Fukami, C. 2010. Self-assessment of knowledge: A cognitive learning or affective measure? Perspectives from the management learning and education community. Academy of Management Learning & Education, 9: 335–341.

Barrett, L. F., Tugade, M. M., & Engle, R. W. 2004. Individual differences in working memory capacity and dual-process theories of the mind. Psychological Bulletin, 130: 553–573.

Bearman, M., Tai, J., Dawson, P., Boud, D., & Ajjawi, R. 2024. Developing evaluative judgement for a time of generative artificial intelligence. Assessment & Evaluation in Higher Education, 49: 893–905.

Becker, J., Rush, N., Barnes, E., & Rein, D. 2025.
Measuring the impact of early-2025 AI on experienced open-source developer productivity. arXiv preprint. arXiv:2507.09089.

Biggs, J. 1999. What the student does: Teaching for enhanced learning. Higher Education Research & Development, 18: 57–75.

Bol, L., & Garner, J. K. 2011. Challenges in supporting self-regulation in distance education environments. Journal of Computing in Higher Education, 23: 104–123.

Brynjolfsson, E., Li, D., & Raymond, L. 2025. Generative AI at work. Quarterly Journal of Economics, 140: 889–942.

Bubeck, S., Coester, C., Eldan, R., Gowers, T., Lee, Y. T., Lupsasca, A., Sawhney, M., Scherrer, R., Sellke, M., Spears, B. K., Unutmaz, D., Weil, K., Yin, S., & Zhivotovskiy, N. 2025. Early science acceleration experiments with GPT-5. arXiv preprint. arXiv:2511.16072.

Choi, J. H., & Schwarcz, D. 2024. AI assistance in legal analysis: An empirical study. Journal of Legal Education, 73: 384–420.

Clark, A. 2025. Extending minds with generative AI. Nature Communications, 16: 4627.

Clark, A., & Chalmers, D. 1998. The extended mind. Analysis, 58: 7–19.

Corbin, T., Dawson, P., & Liu, D. 2025. Talk is cheap: Why structural assessment changes are needed for a time of GenAI. Assessment & Evaluation in Higher Education, 50: 1087–1097.

Creswell, J. W., & Plano Clark, V. L. 2011. Designing and conducting mixed methods research (2nd ed.). Thousand Oaks, CA: Sage.

Cui, Z., Demirer, M., Jaffe, S., Musolff, L., Peng, S., & Salz, T. 2025. The effects of generative AI on high-skilled work: Evidence from three field experiments with software developers. SSRN. Retrieved from https://ssrn.com/abstract=4945566

Dalsgaard, P. 2025. Scaffolding metacognition in design with generative AI tools. Paper presented at the IASDR International Association of Societies of Design Research Congress, Taiwan.

Dell’Acqua, F., McFowland, E., III, Mollick, E., Lifshitz-Assaf, H., Kellogg, K. C., Rajendran, S., Krayer, L., Candelon, F., & Lakhani, K. R. 2023.
Navigating the jagged technological frontier: Field experimental evidence of the effects of AI on knowledge worker productivity and quality. Working paper no. 24-013, Harvard Business School, Boston, MA.

Dent, A. L., & Koenka, A. C. 2016. The relation between self-regulated learning and academic achievement across childhood and adolescence: A meta-analysis. Educational Psychology Review, 28: 425–474.

Doshi, A. R., & Hauser, O. P. 2024. Generative AI enhances individual creativity but reduces the collective diversity of novel content. Science Advances, 10: eadn5290.

Ellington, A. J. 2003. A meta-analysis of the effects of calculators on students’ achievement and attitude levels in precollege mathematics classes. Journal for Research in Mathematics Education, 34: 433–463.

Farazouli, A., Cerratto-Pargman, T., Bolander-Laksov, K., & McGrath, C. 2024. Hello GPT! Goodbye home examination? An exploratory study of AI chatbots impact on university teachers’ assessment practices. Assessment & Evaluation in Higher Education, 49: 363–375.

Financial Times. 2025. Sept 17: DeepMind and OpenAI achieve gold
at “coding Olympics” in AI milestone. Financial Times.

Fisher, R., Gogan, T., Williams, J., Laferriere, R., Campbell, G., Gunasekara, A., & Nguyen, J. 2025. Can ChatGPT enhance business student creativity? Evidence from a randomised controlled trial. Studies in Higher Education. Forthcoming.

Flaherty, C. 2025. August 29: How AI is changing—not “killing”—college. Inside Higher Ed.

Freeman, J. 2025. Student generative AI survey 2025. Policy note 61, Higher Education Policy Institute, London.

Galy, E., Cariou, M., & Mélan, C. 2012. What is the relationship between mental workload factors and cognitive load types? International Journal of Psychophysiology, 83: 269–275.

Gioia, D. A., Corley, K. G., & Hamilton, A. L. 2013. Seeking qualitative rigor in inductive research: Notes on the Gioia methodology. Organizational Research Methods, 16: 15–31.

Glaser, B., & Strauss, A. 1967. The discovery of grounded theory: Strategies of qualitative research. London: Wiedenfeld and Nicholson.

Haidt, J. 2024. The anxious generation: How the great rewiring of childhood is causing an epidemic of mental illness. New York: Penguin.

Hannigan, T., McCarthy, I. P., & Spicer, A. 2024. Beware of botshit: How to manage the epistemic risks of generative chatbots. Business Horizons, 67: 471–486.

Herbold, S., Hautli-Janisz, A., Heuer, U., Kikteva, Z., & Trautsch, A. 2023. A large-scale comparison of human-written versus ChatGPT-generated essays. Scientific Reports, 13: 18617.

Hyde, S. J., Busby, A., & Bonner, R. L. 2024. Tools or fools: Are we educating managers or creating tool-dependent robots? Journal of Management Education, 48: 708–734.

Ide, E., & Talamas, E. 2024. Artificial intelligence in the knowledge economy. In Proceedings of the 25th ACM conference on economics and computation: 834–836. New York: Association for Computing Machinery.

Kalyuga, S., & Singh, A. M. 2016. Rethinking the boundaries of cognitive load theory in complex learning.
Educational Psychology Review, 28: 831–852.

Krammer, S. M. 2023. Is there a glitch in the matrix? Artificial intelligence and management education. Management Learning, 56: 367–388.

Kruger, J., & Dunning, D. 1999. Unskilled and unaware of it: How difficulties in recognizing one’s own incompetence lead to inflated self-assessments. Journal of Personality and Social Psychology, 77: 1121–1134.

Lajoie, S. P., & Azevedo, R. 2006. Teaching and learning in technology-rich environments. In P. A. Alexander & P. H. Winne (Eds.), Handbook of educational psychology: 803–821. Mahwah, NJ: Lawrence Erlbaum Associates.

Larson, B. Z., Moser, C., Caza, A., Muehfeld, K., & Colombo, L. A. 2024. Critical thinking in the age of generative AI. Academy of Management Learning & Education, 23: 373–378.

Lee, H. P., Sarkar, A., Tankelevitch, L., Drosos, I., Rintel, S., Banks, R., & Wilson, N. 2025. The impact of generative AI on critical thinking: Self-reported reductions in cognitive effort and confidence effects from a survey of knowledge workers. In Proceedings of the 2025 CHI conference on human factors in computing systems: 1–22. New York: Association for Computing Machinery.

Lee, Y. 2024. Developing a computer-based tutor utilizing generative artificial intelligence (GAI) and retrieval-augmented generation (RAG). Education and Information Technologies, 30: 7841–7862.

Lehmann, M., Cornelius, P. B., & Sting, F. J. 2024. AI meets the classroom: When does ChatGPT harm learning? SSRN. Retrieved from https://ssrn.com/abstract=4941259

Li, J., Yang, Y., Liao, Q. V., Zhang, J., & Lee, Y. C. 2025. As confidence aligns: Exploring the effect of AI confidence on human self-confidence in human-AI decision making. arXiv preprint.

Lindebaum, D., & Fleming, P. 2024. ChatGPT undermines human reflexivity, scientific responsibility and responsible management research. British Journal of Management, 35: 566–575.

Lodge, J. M., Yang, S., Furze, L., & Dawson, P. 2023.
It’s not like a calculator, so what is the relationship between learners and generative artificial intelligence? Learning: Research and Practice, 9: 117–124.

Lund, C. 2014. Of what is this a case? Analytical movements in qualitative social science research. Human Organization, 73: 224–234.

Luong, T., & Lockhart, E. 2025. July 21: Advanced version of Gemini with Deep Think officially achieves gold-medal standard at the International Mathematical Olympiad. Google DeepMind.

McLain, D. L. 2009. Evidence of the properties of an ambiguity tolerance measure: The multiple stimulus types ambiguity tolerance scale–II (MSTAT–II). Psychological Reports, 105: 975–988.

Merali, A. 2024. Scaling laws for economic productivity: Experimental evidence in LLM-assisted translation. arXiv preprint. arXiv:2409.02391.

Miles, M. B., Huberman, M., & Saldana, J. 2014. Qualitative data analysis: A methods sourcebook. Thousand Oaks, CA: SAGE.

Mollick, E. 2024. Co-intelligence. London: Random House.

Mollick, E., &
Mollick, L. 2023. Assigning AI: Seven approaches for students, with prompts. arXiv preprint.

Moore, D. A., & Healy, P. J. 2008. The trouble with overconfidence. Psychological Review, 115: 502–517.

Moser, C., Den Hond, F., & Lindebaum, D. 2022. Morality in the age of artificially intelligent algorithms. Academy of Management Learning & Education, 21: 139–155.

Nie, A., Chandak, Y., Suzara, M., Malik, A., Woodrow, J., Peng, M., Sahami, M., Brunskill, E., & Piech, C. 2024. The GPT surprise: Offering large language model chat in a massive coding class reduced engagement but increased adopters’ exam performances. arXiv preprint.

Noy, S., & Zhang, W. 2023. Experimental evidence on the productivity effects of generative artificial intelligence. Science, 381: 187–192.

Otis, N., Clarke, R. P., Delecourt, S., Holtz, D., & Koning, R. 2024. The uneven impact of generative AI on entrepreneurial performance. SSRN. Retrieved from 10.2139/ssrn.4671369

Peng, S., Kalliamvakou, E., Cihon, P., & Demirer, M. 2023. The impact of AI on developer productivity: Evidence from GitHub Copilot. arXiv preprint.

Powley, E. H., & Taylor, S. N. 2014. Pedagogical approaches to develop critical thinking and crisis leadership. Journal of Management Education, 38: 560–585.

Prather, J., Reeves, B. N., Leinonen, J., MacNeil, S., Randrianasolo, A. S., Becker, B. A., Kimmel, B., Wright, J., & Briggs, B. 2024. The widening gap: The benefits and harms of generative AI for novice programmers. In Proceedings of the 2024 ACM conference on international computing education research, volume 1: 469–486. New York: Association for Computing Machinery.

Reitman, W. R. 1964. Heuristic decision procedures, open constraints, and the structure of ill-defined problems. Human Judgments and Optimality, 282: 283–315.

Riedl, C., & Weidmann, B. 2025. Quantifying human-AI synergy. PsyArXiv preprint.

Ritz, E., Rietsche, R., & Leimeister, J. M. 2023.
How to support students’ self-regulated learning in times of crisis: An embedded technology-based intervention in blended learning pedagogies. Academy of Management Learning & Education, 22: 357–382.

Rudolph, J., Tan, S., & Tan, S. 2023. ChatGPT: Bullshit spewer or the end of traditional assessments in higher education? Journal of Applied Learning and Teaching, 6: 342–363.

Ryan, R. M., Mims, V., & Koestner, R. 1983. Relation of reward contingency and interpersonal context to intrinsic motivation: A review and test using cognitive evaluation theory. Journal of Personality and Social Psychology, 45: 736–750.

Salomon, G., Perkins, D. N., & Globerson, T. 1991. Partners in cognition: Extending human intelligence with intelligent technologies. Educational Researcher, 20: 2–9.

Schraw, G., Dunkle, M. E., & Bendixen, L. D. 1995. Cognitive processes in well-defined and ill-defined problem solving. Applied Cognitive Psychology, 9: 523–538.

Simkute, A., Tankelevitch, L., Kewenig, V., Scott, A. E., Sellen, A., & Rintel, S. 2025. Ironies of generative AI: Understanding and mitigating productivity loss in human-AI interaction. International Journal of Human–Computer Interaction, 41: 2898–2919.

Simon, H. A. 1973. The structure of ill structured problems. Artificial Intelligence, 4: 181–201.

Southworth, J., Migliaccio, K., Glover, J., Glover, J. N., Reed, D., McCarty, C., Brendemuhl, J., & Thomas, A. 2023. Developing a model for AI across the curriculum: Transforming the higher education landscape via innovation in AI literacy. Computers and Education: Artificial Intelligence, 4: 100127.

Stadler, M., Bannert, M., & Sailer, M. 2024. Cognitive ease at a cost: LLMs reduce mental effort but compromise depth in student scientific inquiry. Computers in Human Behavior, 160: 108386.

Sweller, J., & Chandler, P. 1994. Why some material is difficult to learn. Cognition and Instruction, 12: 185–233.

Tankelevitch, L., Kewenig, V., Simkute, A., Scott, A.
E., Sarkar, A., Sellen, A., & Rintel, S. 2024. The metacog- nitive demands and opportunities of generative AI. In Proceedings of the CHI conference on human factors in computing systems, 1–24. New York: Asso- ciation for Computing Machinery. Tetzlaff, L., Simonsmeier, B., Peters, T., & Brod, G. 2025. A cornerstone of adaptivity—A meta-analysis of the expertise reversal effect. Learning and Instruction, 98: 102142. Valcea, S., Hamdani, M. R., & Wang, S. 2024. Exploring the impact of ChatGPT on business school education: Pro- spects, boundaries, and paradoxes. Journal of Man- agement Education, 48: 915–947. Veenman, M. V. J., Van Hout-Wolters, B. H. A. M., & Affler- bach, P. 2006. Metacognition and learning: Conceptual and methodological considerations. Metacognition and Learning, 1: 3–14. Weiss, R. S. 1995. Learning from strangers: The art and method of qualitative interview studies. New York: Simon and Schuster. 2026 Bergenholtz, Vuculescu, G€ unzel-Jensen, and Frederiksen 21
Wellman, N., Tröster, C., Grimes, M., Roberson, Q., Rink, F., & Gruber, M. 2023. Publishing multimethod research in AMJ: A review and best-practice recommendations. Academy of Management Journal, 66: 1007–1015.
Xu, X., Qiao, L., Cheng, N., Liu, H., & Zhao, W. 2025. Enhancing self-regulated learning and learning experience in generative AI environments: The critical role of metacognitive support. British Journal of Educational Technology, 56: 1842–1863.
Yan, L., Greiff, S., Lodge, J. M., & Gašević, D. 2025. Distinguishing performance gains from learning when using generative AI. Nature Reviews Psychology, 4: 435–436.
Yan, L., Greiff, S., Teuber, Z., & Gašević, D. 2024. Promises and challenges of generative artificial intelligence for human learning. Nature Human Behaviour, 8: 1839–1850.
Yu, R., Xu, Z., CH-Wang, S., & Arum, R. 2024. Whose ChatGPT? Unveiling real-world educational inequalities introduced by large language models. arXiv preprint.
Zimmerman, B. J., & Schunk, D. H. 2011. Self-regulated learning and performance: An introduction and an overview. In B. J. Zimmerman & D. H. Schunk (Eds.), Handbook of self-regulation of learning and performance: 1–12. New York: Routledge.

Carsten Bergenholtz ([email protected]) is an associate professor in the Department of Management, School of Business and Social Sciences, Aarhus University. His research examines how people and groups solve problems, with a recent focus on generative AI and its implications for management education. Specifically, he studies how cognitive mechanisms shape search and adaptation.

Oana Vuculescu ([email protected]) is an associate professor in the Department of Management, School of Business and Social Sciences, Aarhus University. Her research examines how individuals and groups search for solutions and solve complex problems. Her primary focus is on understanding the mechanisms by which feedback, cognition, and emotion influence how solutions are searched for and adapted.
Franziska Günzel-Jensen ([email protected]) is an associate professor of entrepreneurship in the Department of Management, School of Business and Social Sciences, Aarhus University. Her research focuses on entrepreneurship and innovation as a vehicle to address societal grand challenges and achieve the UN SDG agenda. She also studies entrepreneurship education, with recent work exploring the pedagogical implications of generative AI in management education.

Lars Frederiksen ([email protected]) is a professor of strategy, entrepreneurship, and innovation in the Department of Management, School of Business and Social Sciences, Aarhus University. His current research interests include knowledge-producing online communities, AI adoption and organizational design in SMEs, and digital transformation and sustainability in the primary sector.

22 Academy of Management Learning & Education Month
SUPPLEMENTARY MATERIAL

SECTION A
ADDITIONAL TABLES AND FIGURES

FIGURE S1
GPT-4 Experimental Interface, via the React Web App
TABLE S1
Correlations with Confidence Intervals

1. Felt enjoyment
2. Felt skilled: (1) .61** [.49, .70]
3. Felt effective: (1) .63** [.51, .72]; (2) .70** [.60, .78]
4. Felt enjoyment 2: (1) .58** [.45, .68]; (2) .41** [.27, .54]; (3) .36** [.20, .49]
5. Felt skilled 2: (1) .36** [.20, .49]; (2) .51** [.38, .63]; (3) .36** [.21, .50]; (4) .68** [.58, .76]
6. Felt effective 2: (1) .29** [.13, .44]; (2) .42** [.27, .55]; (3) .42** [.28, .55]; (4) .62** [.51, .71]; (5) .69** [.59, .77]
7. Age: (1) .05 [−.11, .22]; (2) .02 [−.14, .19]; (3) .08 [−.09, .24]; (4) −.04 [−.20, .13]; (5) −.08 [−.25, .08]; (6) −.11 [−.27, .06]
8. Case experience: (1) .06 [−.10, .23]; (2) .20* [.04, .36]; (3) .03 [−.13, .20]; (4) .10 [−.07, .26]; (5) .15 [−.02, .31]; (6) −.01 [−.18, .16]; (7) −.00 [−.17, .16]
9. ChatGPT experience: (1) .06 [−.11, .22]; (2) .07 [−.10, .23]; (3) .11 [−.05, .27]; (4) .15 [−.02, .31]; (5) .15 [−.02, .31]; (6) .15 [−.01, .31]; (7) −.21* [−.36, −.04]; (8) .02 [−.15, .18]
10. Useful aid: (1) .02 [−.15, .18]; (2) .07 [−.10, .23]; (3) .18* [.01, .33]; (4) .30** [.15, .45]; (5) .23** [.06, .38]; (6) .48** [.34, .60]; (7) −.06 [−.22, .11]; (8) .04 [−.12, .21]; (9) .28** [.12, .43]
11. Effort: (1) .07 [−.10, .23]; (2) −.04 [−.21, .12]; (3) .03 [−.14, .19]; (4) .04 [−.12, .21]; (5) .01 [−.16, .17]; (6) −.01 [−.18, .15]; (7) .15 [−.02, .31]; (8) .08 [−.08, .25]; (9) −.05 [−.22, .11]; (10) .09 [−.07, .25]
12. Optimistic AI productivity: (1) −.01 [−.18, .15]; (2) .02 [−.15, .18]; (3) −.02 [−.18, .15]; (4) .29** [.13, .44]; (5) .25** [.09, .40]; (6) .20* [.03, .35]; (7) −.10 [−.26, .06]; (8) .10 [−.07, .26]; (9) .44** [.30, .57]; (10) .23** [.07, .38]; (11) .09 [−.07, .26]
13. AI impact: (1) .01 [−.16, .17]; (2) .03 [−.14, .19]; (3) .10 [−.07, .26]; (4) .23** [.07, .38]; (5) .21* [.05, .36]; (6) .21* [.04, .36]; (7) −.08 [−.24, .09]; (8) .04 [−.13, .20]; (9) .36** [.21, .50]; (10) .22** [.06, .37]; (11) .02 [−.14, .19]; (12) .78** [.70, .83]
14. Task evaluation: Grade 1: (1) .07 [−.10, .23]; (2) .18* [.02, .34]; (3) .09 [−.08, .25]; (4) −.16 [−.32, .00]; (5) −.13 [−.29, .04]; (6) −.05 [−.21, .12]; (7) −.03 [−.19, .14]; (8) .00 [−.16, .17]; (9) .02 [−.14, .19]; (10) .03 [−.14, .19]; (11) .14 [−.02, .30]; (12) −.08 [−.24, .09]; (13) −.15 [−.30, .02]
15. Task evaluation: Grade 2: (1) .14 [−.03, .30]; (2) .18* [.01, .34]; (3) .14 [−.03, .30]; (4) −.05 [−.21, .12]; (5) −.01 [−.18, .15]; (6) .00 [−.16, .17]; (7) .07 [−.10, .23]; (8) .09 [−.08, .25]; (9) .14 [−.03, .30]; (10) .09 [−.08, .25]; (11) .01 [−.16, .18]; (12) .04 [−.13, .20]; (13) −.02 [−.18, .15]; (14) .46** [.32, .58]

Notes: Values in square brackets indicate the 95% confidence interval for each correlation. Questions concerning enjoyment, skill, and effectiveness were asked after both tasks, and Variables 4 through 6 refer to answers following Task 2.
* p < .10
** p < .05
*** p < .01
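The confidence intervals in Table S1 are consistent with the standard Fisher z-transform construction (the paper does not state the method, so this is our assumption; the helper name `corr_ci` is hypothetical). A minimal sketch, using n = 141 (the number of observations reported in Table S2), recovers, for example, the [.49, .70] interval reported for r = .61:

```python
import math

def corr_ci(r, n, z_crit=1.96):
    """Two-sided 95% CI for a Pearson correlation via the Fisher z-transform."""
    z = math.atanh(r)               # Fisher z-transform of r
    se = 1.0 / math.sqrt(n - 3)     # approximate standard error of z
    lo = math.tanh(z - z_crit * se) # back-transform the bounds to the r scale
    hi = math.tanh(z + z_crit * se)
    return lo, hi

# r = .61 (felt enjoyment vs. felt skilled) with n = 141 participants
lo, hi = corr_ci(0.61, 141)
print(round(lo, 2), round(hi, 2))  # 0.49 0.7
```

The same call with r = −.21 (ChatGPT experience vs. age) reproduces the reported [−.36, −.04], an interval that excludes zero, matching that coefficient's significance star.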
TABLE S2
Additional Controls
Dependent variable: Task evaluation, Period 2. Models (1)–(5); entries are coefficients with standard errors in parentheses, listed in column order.

Task evaluation, Period 1: 0.380*** (0.075); 0.456*** (0.095)
Treatment: 0.661** (0.309); 1.872** (0.926)
Gender Male: 0.336 (0.340); 0.257 (0.303); 0.321 (0.363); 0.246 (0.325)
Field Economics: 0.258 (0.607); 0.230 (0.540); 0.196 (0.658); 0.002 (0.587)
Field Education: 1.550 (1.244); 1.510 (1.112); 1.525 (1.368); 1.701 (1.226)
Field Engineering: −1.093* (0.600); −0.522 (0.541); −1.113 (0.684); −0.641 (0.611)
Field Finance: 1.757 (1.168); 1.932* (1.037); 1.581 (1.236); 1.714 (1.097)
Field Health: −0.685 (0.746); −0.807 (0.662); −0.451 (0.840); −0.614 (0.747)
Field Humanities: 0.175 (0.621); 0.194 (0.552); 0.173 (0.712); 0.088 (0.632)
Field Other (please specify): 0.085 (0.496); −0.003 (0.440); −0.003 (0.544); −0.065 (0.485)
Field Psychology: 0.959 (0.818); 0.354 (0.733); 1.050 (0.873); 0.499 (0.782)
Education Doctoral degree: −0.567 (1.236); −0.068 (1.100); −0.553 (1.300); −0.310 (1.155)
Education High School: −0.117 (0.387); −0.017 (0.348); −0.031 (0.418); −0.017 (0.374)
Education Master's degree: −0.338 (0.597); 0.114 (0.535); −0.230 (0.626); 0.243 (0.560)
Education Secondary School: −1.380 (1.258); −1.396 (1.125); −1.129 (1.364); −1.456 (1.222)
Education Vocational Training: 0.290 (0.943); 0.486 (0.843); 0.259 (0.985); 0.617 (0.882)
Age: 0.038 (0.038); 0.039 (0.034); 0.043 (0.041); 0.038 (0.037)
Felt enjoyment 2: −0.072 (0.113); −0.048 (0.121); −0.028 (0.109)
Felt effective 2: 0.051 (0.123); −0.018 (0.133); −0.051 (0.118)
Felt skilled 2: −0.001 (0.130); 0.062 (0.140); 0.108 (0.126)
Case experience: 0.082 (0.129); 0.017 (0.150); 0.033 (0.134)
ChatGPT experience: 0.093 (0.063); 0.086 (0.073); 0.050 (0.065)
Useful aid: 0.047 (0.144); 0.014 (0.151); −0.055 (0.140)
Effort: 0.063 (0.198); 0.079 (0.216); −0.104 (0.196)
Optimistic AI productivity: 0.047 (0.112); 0.057 (0.118); 0.028 (0.105)
AI impact: −0.129 (0.115); −0.113 (0.120); −0.047 (0.108)
Task evaluation Period 1 × treatment: −0.217 (0.161)
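The interaction row is easiest to read as a marginal effect. Assuming the Treatment coefficient of 1.872 and the interaction of −0.217 come from the same (full) specification, the implied effect of GenAI access at a given Task 1 grade is 1.872 − 0.217 × grade, which flips sign near the top of the grade scale. A minimal sketch (the function name is ours, and note the interaction term carries no significance stars, so this is illustrative only):

```python
# Implied marginal treatment effect from the interaction row of Table S2.
# Assumption: 1.872 (Treatment) and -0.217 (Task 1 x treatment) belong to
# the same model column; the interaction is not statistically significant.

TREATMENT = 1.872     # main effect of GenAI access
INTERACTION = -0.217  # Task 1 grade x treatment

def genai_effect(task1_grade):
    """Predicted change in Task 2 grade from GenAI access, by baseline grade."""
    return TREATMENT + INTERACTION * task1_grade

for g in (2, 6, 10):  # low, middle, and high Task 1 grades
    print(g, round(genai_effect(g), 2))
```

The positive effect for low baseline grades and the sign flip for high ones mirror the equalizing pattern described in the abstract.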
SECTION B
CASE OVERVIEW

Participants in our experiments were shown a 13-minute video that presented the Cure Factory case. In the following, we present a written summary of the case, the two questions asked (Tasks 1 & 2), and a link to the original video shown.

i) Summary of the Cure Factory case

The case describes the Cure Factory, a worldwide pharmaceutical company (founded in Copenhagen in 2005) that develops innovative drugs to treat diseases that have no cure. The company has successes in cancer and diabetes, has won Innovation Awards, and attracts talent from all around the world. It is led by Dr. Grace Miller, a pioneer known for her dedication, educated in chemistry at MIT with a PhD from Stanford, who worked in renowned labs before becoming director in 2017. She works closely with scientists, sometimes pulling all-nighters, and sets direction, coordinates projects, ensures relevant new knowledge, and seeks an environment where scientists can thrive. The media have called the group a super star team, yet coordination is difficult: there is disagreement on priorities, and teamwork is described as inefficient. Management introduced individual bonuses of up to 20% of annual salary. At the start of each year, targets are set for the number of scientific papers and patent or grant applications; if a scientist meets these targets, the bonus is paid. The bonus does not consider the quality or impact of any paper, patent, or grant. Milestones are also recognized with cake and drinks. Recently, many scientists have left, output has been declining, and funding is at an all-time low (40% below peak).

The case presents multiple voices. First, a narrator establishes the organizational setting and practices outlined above. Then, additional voices provide supplementary analytical frames. Two professors advise on extrinsic and intrinsic motivation, team rewards, and issues like free-riding and the need for collective goals. Internal voices provide lived experience.
Scientist 1 recounts coordination frictions and credit-taking by low contributors, explaining a drift toward solo work despite knowing collaboration is essential. Scientist 2 describes a decade-long project that finally yielded a publication and patent, attributing success to persistence, luck, and team passion—and calls the individual bonuses demotivating given the collective nature of the achievement. Manager 1 likens scientists to priests with a calling, emphasizing that thriving requires resources, shared goals, and autonomy more than just incentives.

ii) Case-based questions

All participants were informed that they had 22 minutes to complete each task.

Task 1 question: Please discuss the benefits and disadvantages of the managerial levers put in place by the Cure Factory Copenhagen to ensure that their individual scientists are sufficiently motivated.

Task 2 question: Keeping in mind that scientific work is often more successful when undertaken in teams, please discuss what the Cure Factory Copenhagen should consider when motivating teamwork.

- Control group was encouraged to utilize the Internet (access via browsers)
- Treatment group was encouraged to utilize ChatGPT (access via a link)

iii) Link to the video that participants watched: https://zenodo.org/records/17192320 (Participants also had access to a transcript of the video).

TABLE S2 (Continued)
Dependent variable: Task evaluation, Period 2. Models (1)–(5).
Constant: 4.025*** (0.991); 1.661* (0.965); 4.488*** (1.026); 3.391** (1.698); 1.463 (1.553)
Observations: 141; 141; 141; 141; 141
R2: 0.105; 0.307; 0.035; 0.128; 0.334
Adjusted R2: −0.002; 0.211; −0.031; −0.053; 0.175
Residual SE: 1.917 (df = 125); 1.701 (df = 123); 1.945 (df = 131); 1.965 (df = 116); 1.740 (df = 113)
F statistic: 0.981 (df = 15; 125); 3.202*** (df = 17; 123); 0.528 (df = 9; 131); 0.708 (df = 24; 116); 2.098*** (df = 27; 113)
Note: AI = artificial intelligence.
* p < .1
** p < .05
*** p < .01

SECTION C
ILLUSTRATING PARTICIPANT RESPONSES TO TASKS 1 & 2

In the following, we show selected examples of how we have graded the participant answers to Tasks 1 and 2. This should help the reader get a sense of the nature of our grading criteria, and how they were applied. We first list the question answered, then the criteria, and then document a low-, medium-, and high-quality answer to the task questions.

Task 1 question to answer: Please discuss the benefits and disadvantages of the managerial levers put in place by the Cure Factory Copenhagen to ensure that their individual scientists are sufficiently motivated. Please keep in mind that you only have 22 minutes to complete this task.

Criteria for assessing submissions (see the Measures section in the Methods): "The grading criteria followed how exams in an organizational behavior bachelor-level course would be graded at the given university: identifying and prioritizing key challenges implied by the exam case and question; using relevant terminology and clear, consistent language; and using relevant insights to answer the question."

Task 1: Grade 2.
Explanation of assessment: The following response fails to identify the case's central challenges, instead offering generalized opinions in colloquial language. It lacks relevant organizational behavior terminology and does not analyze the specific managerial levers mentioned in the case, thereby providing no meaningful insight into their benefits or disadvantages.

Participant's answer: To have cake and drinks to reward one's employees is a sweet way to show your pride and to make sure they know they are valued and that their hard work has not gone by unnoticed.
But for those who have not put in the work as much as others, are being told that they did just as well. Despite the fact that they have not contributed as much as others—or maybe not contributed at all. Rewarding bad behavior shows a person how little they can contribute and still get the same recognition as the person who has been with the research since the beginning and knows it inside and out. To individually show one's appreciation might be the way to show the employees at the company that to give it one's all means reward. Just goes to show that to some extent researchers are like priests with their "holier than thou"—but I won't generalize them. Some are in it for the money, some are in it because it's their calling. We all have our own motivation for whatever path we have chosen in life.

Task 1: Grade: 5.
Explanation of assessment: This answer shows a foundational grasp of the case by identifying the "free rider problem" and using correct terms like "extrinsic" and "monetary incentives." However, the analysis is superficial and does not explore other key issues, such as the impact on intrinsic motivation, limiting the depth of its insights.

Participant's answer: The Cure Factory is using monetary incentives based on the individual performance of each scientist. This motivates employees to be part of as many successful projects as possible and put in enough work for the project to succeed. Though as the quality is not assessed, it can lead to the free rider problem, in which employees do not contribute as much as their team members and get rewarded for other's work. This can be very frustrating for hard-working employees, and often leads to copycats as it doesn't seem "fair." Allover, the project has a higher risk of failing and the work tends to take much longer than anticipated, as not everyone is putting in all their effort.
The cake and drinks is a great example on how to motivate employees without monetary incentives, as the whole team celebrates the accomplishment of one individual, leading to others wanting the same admiration and pride. Here, we have of course also the problem of free riders, which again might not seem fair. In general, these types of non-monetary rewards can be very useful to let the whole team feel pride and enjoy a reward (here cake) instead of giving monetary rewards to one single person.

Both these motivations are extrinsic, and only rewarded for great accomplishments that bring the company money, though long-term, it makes more sense to reward behavior that helps the overall strategy of the company, i.e. finishing projects in a timely manner and working well in a team to succeed in more projects and therefor [authors' note: unfinished in the original].

Task 1: Grade 10.
Explanation of assessment: This submission excels by clearly identifying, structuring, and prioritizing multiple key challenges, such as the conflict between "extrinsic and intrinsic motivation" and the focus on "short-term success." The analysis is nuanced, well-articulated with relevant terminology, and provides sophisticated, case-specific insights and recommendations.

Participant's answer: Below I will try to discuss the benefits and disadvantages of the managerial levers put in place by the Cure Factory Copenhagen.

Benefits:
A clear benefit of the managerial levers in place—especially the bonus or rewards system is keeping the scientists focused on a common goal. As an extrinsic reward, the bonuses serve to reward the overall scientific output, thus keeping the employees focused on the task at hand. The quote from the manager and scientists did hint toward a high level of intrinsic motivation among the scientists. While this may indicate that there is not a great need for generating motivation as a whole (As it is already present), extrinsic motivation can help keep the people focused on the task, as intrinsic motivation often is related to the individual scientist's vision and can be hard to align with the strategic goals of the organization.

Disadvantages:
As mentioned above, the quotes from the scientists hint at a high level of intrinsic motivation among some or most of the employees. Extrinsic rewards systems can in some cases diminish intrinsic motivation among employees. The extrinsic rewards systems in place: Namely the bonuses related to number of publications, patents etc., could cause more competition among employees. While this can help increase the scientific output, it can also harm the overall team dynamic of the organization.
An example of this can be seen with the scientist who mentioned how one of the men he worked with barely contributed to a project, yet took all the credit. It is easy to see how this problem arises when bonuses are only decided on the basis of the number of publications and patents of the individual employee. As such "free riders" only need to get their name on a publication to get the rewards. Likewise, the scientists tell us how their work is often long term and requires patience. Making a rewards system based on number of publications is more geared toward generating short-term success. As such it risks not rewarding the scientists who are doing hard work on long-term projects, which hurts the overall team dynamic of the organization and could generate feelings of resentment between coworkers.

Conclusion:
There are both advantages and disadvantages related to the managerial levers of the organization. To improve the current system, Dr. Miller should focus on using motivation and managerial levers only as a tool to keep the already intrinsically motivated scientists focused on the overall task of the organization. Likewise, more weight should be put on evaluating the contents of the papers and patents that generate the bonuses. One way to do this could be to have some of the scientists review their colleagues from other projects' work. This has the advantage of boosting teamwork, while also improving the scientists understanding of each other's strengths and weaknesses as well as promoting interdisciplinary work. As such it is my opinion that the rewards system should be structured so that it does not destroy the intrinsic motivation of the employees but instead captures the intrinsic motivation already present in the scientists and nudges it closer toward the overall goal of the organization. Extrinsic motivation should also keep playing a part in motivating scientists to do more of the "boring" and undesirable tasks that inevitably...
(participant's response ends here).

Task 2 question to answer: Keeping in mind that scientific work is often more successful when undertaken in teams, please discuss what the Cure Factory Copenhagen should consider when motivating teamwork. We invite and encourage you to utilize ChatGPT, which you can access here. Please note that you only have 22 minutes for this task.

(Same) criteria for assessing submissions (see the Measures section in the Methods): "The grading criteria followed how exams in an organizational behavior bachelor-level course would be graded at the given university: identifying and prioritizing key challenges implied by the exam case and question; using relevant terminology and clear, consistent language; and using relevant insights to answer the question."

Task 2 (Treatment). Grade: 2.
Explanation of assessment: This answer is overly brief and generic, listing abstract concepts like "trust" and "autonomy" without any connection to the Cure Factory case. It fails to identify any specific challenges or use the case context to provide relevant insights on motivating teamwork.

Participant's answer: Teamwork requires addressing and ensuring the elements of thriving relationships, such as acknowledgment of when things are done right, communication and developing trust. This may include actively making sure that the emotional intelligence of the individuals involved is of sufficient "quality" to productively engage in the collaborative environment. It also includes offering autonomy to the entire team while minimizing micromanaging. It's important to have a clear, meaningful vision of the research that is being done.

Task 2 (Treatment). Grade: 5.
Explanation of assessment: This response uses relevant terminology in its comprehensive but generic list of teamwork principles. However, the answer fails to apply these points to the specific context of the case, neglecting to identify or prioritize the actual challenges faced by the scientists at the Cure Factory. The submission thus lacks original analysis and case-specific insights.

Participant's answer: Encouraging teamwork within the Cure Factory Copenhagen is key to fostering innovation, creativity, and productivity. Here are a few factors they should consider:
1. Clear Communication: Team members should be encouraged to communicate openly and honestly. This helps to build trust, prevent misunderstandings, and ensure that everyone is on the same page.
2. Defined Roles and Responsibilities: Each team member should have a clear understanding of their roles and responsibilities. This can help to avoid confusion and conflicts while boosting efficiency.
3. Common Goals: The Cure Factory Copenhagen should ensure that all team members understand the collective goals of the team. Shared goals can help to foster a sense of camaraderie and collective responsibility.
4.
Recognition and Rewards: Acknowledging the efforts and achievements of team members can go a long way in boosting morale and motivation. It's important to appreciate not only individual contributions but team successes as well.
5. Encourage Collaboration: Promoting a collaborative culture where ideas are shared and diverse perspectives are valued can enhance innovation and problem-solving.
6. Conflict Resolution: Differences of opinion are inevitable in team settings. Having protocols in place for constructive conflict resolution can help to maintain harmony and productivity within the team.
7. Provide Resources and Support: Team members should have access to the necessary resources to complete their tasks efficiently. This includes technical resources, training, and emotional support when needed.
8. Trust and Respect: For successful teamwork, it's crucial to create an atmosphere of mutual respect and trust. Each member should feel valued, heard, and respected for their unique skills and contributions.
9. Regular Meetings: Routine meetings can help keep everyone informed about the progress, address issues promptly, and give space for brainstorming and collaborative decision-making.
10. Flexibility: Teams should be adaptable to change, ready to take up new challenges, learn from past mistakes, and continuously improve their strategies and methods.

Incorporating these elements into the Cure Factory Copenhagen's team management strategy could potentially lead to a more motivated, cohesive, and productive team.

There are also some things they should avoid:
1. Unresolved Conflicts: Allowing conflicts and disagreements between team members to fester without resolution can create tensions and hostilities, damaging the team's cohesion and productivity.
2. Overburdening Certain Team Members: Heaping responsibilities on certain members while others are underutilized leads to imbalance and resentment.
Tasks and responsibilities should be evenly distributed as much as possible.
3. Lack of Clear Communication: Inadequate communication can lead to misunderstandings and confusion. It's vital to ensure that every team member has access to the same information and understands their roles and responsibilities.
4. Ignoring Individual Needs: Each employee has unique needs and working styles. Ignoring these can result in decreased motivation and productivity. Recognizing individual strengths and providing personalized support can enhance satisfaction and output.
5. Neglecting Feedback: Not listening to or valuing feedback from team members can make them feel undervalued and ignored. Encourage and value feedback from all team members.
6. Favoritism: Favoritism creates an unhealthy work environment and can demotivate team members who feel they are not being treated fairly.
7. Ignoring Training and Development: Not investing in the skills development of your team members can result in stagnation and reduced productivity. Regular training and development opportunities should be provided.
8. Lack of Recognition and Reward: Failing to acknowledge hard work and achievements can lead to decreased morale and motivation. Regularly reward and recognize both individual and team accomplishments.
9. Micromanagement: It's important to trust your team members and give them autonomy to do their jobs. Constantly scrutinizing their work or dictating their every move can demotivate team members.
10. Isolating Decision-Making: Making decisions without including team members or considering their input can lead to resentment and a lack of ownership. Encourage participation in decision-making processes.

Avoiding these behaviors can help the Cure Factory Copenhagen create a more positive, inclusive, and productive team environment.

Task 2 (Treatment). Grade: 10.
Explanation of assessment: This high-level answer effectively integrates general principles with specific details from the case. It directly addresses core case issues like "Balancing Extrinsic and Intrinsic Motivation" and "Free-Riding." This demonstrates a strong ability to "prioritize key challenges implied by the exam case and question" (see assessment criteria), resulting in tailored and insightful recommendations in its answer to the question.

Participant's answer: To motivate teamwork at the Cure Factory Copenhagen, several key considerations are essential:
1. Balancing Extrinsic and Intrinsic Motivation: As Professor 1 explained, both extrinsic and intrinsic motivations fuel people's drive to work.
While current incentive structures are primarily extrinsic (monetary bonuses, recognition), the intrinsic motivation (passion for scientific discovery) is also vital. The Cure Factory should aim to strike a balance between these two types of motivation.
- Extrinsic motivation: The current bonus system is tied to individual performance and does not take into account the quality or impact of the work done. This may inadvertently discourage teamwork and collaboration. Instead, the Factory could consider team-based rewards and recognitions. It would also be beneficial to tie bonuses to project milestones rather than individual achievements. This would encourage collective responsibility and teamwork.
- Intrinsic motivation: The Cure Factory should foster an environment that nurtures the scientists' love for their work. This could mean providing resources for learning and development, supporting exploratory research, and recognizing the value of "failed" projects that lead to unexpected discoveries.
2. Redefining Goals and Targets: Goals and targets should be clearly defined, measurable, and aligned with the organization's mission. However, as scientific research involves a great deal of uncertainty, the Cure Factory should consider employing flexible goal-setting strategies. This could involve setting broad objectives that can accommodate unexpected outcomes and pivots in the research process.
3. Fostering Teamwork and Collaboration: The Cure Factory faces challenges in coordinating the efforts of its "superstar" scientists. To overcome this, the organization could invest in team-building activities and training programs designed to improve collaborative skills.
They could also consider implementing structures that facilitate better communication and coordination, such as regular team meetings, project management tools, and clear roles and responsibilities within teams even when individual work makes up a big part of an individual scientist's time.
4. Addressing Free-Riding and Disunity: As Professor 2 points out, it's essential to design reward systems that discourage free-riding and promote unity toward common goals. This could involve performance assessments that consider both individual contributions and team efforts, as well as strong leadership that actively addresses instances of free-riding and ensures all team members are working in harmony toward shared objectives.
5. Creating a Supportive Work Environment: Lastly, the Cure Factory should focus on creating a work environment where scientists feel valued and supported. While celebrating milestones with cakes and drinks is a good start, the organization could also consider implementing measures such as flexible working hours, emotional support programs, and forums for scientists to voice their ideas and concerns.

In conclusion, motivating teamwork in scientific research involves carefully balancing individual and collective needs and motivations, fostering an environment of collaboration and support, and designing flexible yet clear goal-setting and reward mechanisms.

SECTION D
PROMPTING EXAMPLES OF QUALITATIVE STUDY PARTICIPANTS

In the following table we illustrate the different kinds of metacognitive regulation during GenAI use for low versus high performers. We show all the initial prompts and follow-ups from each of the selected participants, as well as the participants' grade changes from Task 1 to Task 2.

Lower level of metacognitive regulation demonstrated (limited planning, monitoring, and evaluating of GenAI output during interaction)

Low performers:
Initial prompt: How to motivate teamwork.
Follow-up prompt 1 (of 1): What if the team members don't like to work with each other?
Participant L1–grade development from 3 to 4

Initial prompt: How to encourage the teamwork.
Follow-up prompt 1 (of 1): How to put a team together.
Participant L5–grade development from 3 to 4

High performers:
Initial prompt: Hi. My question is relating to motivating teamwork in a scientific setting. Could you suggest motivational tools as well as reward systems that would motivate a team of scientists to reach shared goals, as well as celebrate achievements of scientific significance and rigor?
No follow-up prompt
Participant H6–grade development from 8 to 3

Initial prompt: What do you think about completing research participation?
Follow-up prompt 1 (of 1): Answer this question: Keeping in mind that scientific work is often more successful when undertaken in teams, please discuss what the Cure Factory Copenhagen should consider when motivating teamwork.
Participant H4–grade development from 7 to 5

Higher level of metacognitive regulation demonstrated (active planning, monitoring, and evaluating of GenAI output during interaction)

Low performers:
Initial prompt: What would be a good incentivizing mechanism for a company?
Follow-up prompt 1 (of 5): Now take into consideration that the next characteristics of the company
Follow-up prompt 2 (of 5): 1. Pharmaceutical company. 2. It focuses on discovering medicine for very rare diseases. 3. It might be a good idea to focus on intrinsic motivation of the scientists because most of them are working from passion. 4. A failed project might sometimes be a breakthrough in another domain, so they are also highly valued. 5. A project can last 10–15 years. 6. Scientists mostly are frustrated by the free-riding, some colleagues do not do anything but are rewarded
Follow-up prompt 3 (of 5): 1. Currently they celebrate together all the companies' small successes 2. The manager communicates the expectations and goals 3. Yes, there are a lot of challenges and frustration in the team, unequal bonus at the end of the year 4. They do not have tools yet 5. Mostly each team works on its own project 6. They reward individual, team achievements only with cake for everyone 7. They are motivated and passionate about science, they want to make curing treatments for diseases

High performers:
Initial prompt: Hi there, can you help me answer the following question: "Keeping in mind that scientific work is often more successful when undertaken in teams, please discuss what the Cure Factory Copenhagen should consider when motivating teamwork?"
Follow-up prompt 1 (of 10): I will give you some background on The Cure Factory: it is a worldwide pharmaceutical company, and it has a bonus structure with individual bonuses that seem to be hurting teamwork. Could you please update your text to be more specific?
Follow-up prompt 2 (of 10): Could you please revise so strategy 1 is described a little more?
Follow-up prompt 3 (of 10): That was too much. Please reword strategy 1 to describe it in about 7 sentences. Then paste that into the response with the remaining strategies.
Follow-up prompt 4 (of 10): Great! Now I want you to take those seven strategies and turn them into a proper academic discussion where the pros and cons for each step are considered. This should be put in 4–6 paragraphs of 6–14 sentences each.
  32. (Continued) Low performers High performers Follow-up prompt 4 (of 5):

    Now take into consideration the next. The company is currently having a bonus mechanism where the scientists can earn 20% of their annual salary based on their contribution like scientific papers, successful projects, etc. Follow-up prompt 5 (of 5): There’re multiple issues with the current approach: 1. Free-riding, this is the main issue as some scientists are frustrated that others are rewarded the same way as them, but they did not do anything 2. Lack of funds, the funding dropped to 40% so it is harder to reward successful projects 3. Failed projects that resulted in a breakthrough in another domain that is also highly valuable are not rewarded, because the project was failed 4. The current approach does not focus so much on the intrinsic motivation of the scientists. Here are my recommendations: 1. Promote non-formal activities within teams, like movie-nights, bowling, drinking a beer 2. Try to work always within the same team, not permutating the employees 3. Focus on giving and receiving feedback, they should be very professional in giving and receiving feedback from their peers 4. The feedback should also be reviewed by the upper management when calculating the bonuses 5. The breakthrough in other domains should also be highly compensated 6. Put in place a share-giving mechanism where the bonuses are paid in company shares instead of cash, and they also have to be vested, in this way you keep more of the employees, and they are more motivated to do the tasks 7. Focus on teamwork. Based on what I told you before about the company, on my recommendations, and your knowledge, please answer the following question: Keeping in mind that scientific work is often more successful when undertaken in teams, please discuss what the Cure Factory Copenhagen should consider when motivating teamwork. 
Participant L3–grade development from 5 to 5 Follow-up prompt 5 (of 10): Please remove the headers of each paragraph (a line break may instead be noted with just two asterisks), and adjust the first and last sentences of each paragraph as needed to create a good flow in the text. Follow-up prompt 6 (of 10): Please recall earlier instructions: For each paragraph, i.e., between each set of asterisks, I want 6–14 sentences. And I want 4–6 paragraphs altogether. Still no headers. Follow-up prompt 7 (of 10): Your response sounds too much like a recommendation and less like an academic discussion about the pros and cons of each suggested element. Please amend. Follow-up prompt 8 (of 10): It seems you forgot what I told you about the number of paragraphs and the number of sentences in each paragraph. Please adjust. Follow-up prompt 9 (of 10): Please recall earlier instructions: For each paragraph, i.e., between each set of asterisks, I want 6–14 sentences. And I want 4–6 paragraphs altogether. Still no headers. Follow-up prompt 10 (of 10): Could you try and rewrite using phrases like “on one hand ... on the other hands,” “although,” “however we should note that” etc. For each paragraph, i.e. between each set of asterisks, I want 6–14 sentences. And I want 4–6 paragraphs altogether. Still no headers. Participant H11–grade development from 8 to 8 32 Academy of Management Learning & Education Month