Inside the Mind of an LLM

Luca Baggi
April 09, 2026

What if you could watch an AI’s thought take shape? For years, LLMs have been impenetrable "black boxes," but we are finally beginning to find ways to see how the ghost in the machine actually works.

This talk explores **mechanistic interpretability**, a subfield of AI that aims to understand the internal workings of neural networks. Mapping these internal "circuits" is not just a philosophical curiosity, or duty: it is a high-stakes engineering necessity for safety, debugging, and trust.

Transcript

  1. Inside the mind of an LLM

    Mechanistic interpretability from its inception to the latest advances. 🇬🇧 PyData London (2026/06/06) 👤 Luca Baggi 💼 AI Engineer @ xtream
  2. Mechanistic interpretability: reverse-engineering neural networks

    The two examples above (Ramp Labs model steering and Anthropic’s emotion vectors) change a model’s behavior without changing the prompt: they intervene directly on (or even “hack”) its internal activations. This marks a key shift from prompt engineering to feature-level control, and it is what mechanistic interpretability is all about: reverse-engineering a neural network.
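
To make "intervening on internal activations" concrete, here is a minimal NumPy sketch of steering. It is not the method used by Ramp Labs or Anthropic: the activation vector, the feature direction, the `steer` helper and the strength value are all hypothetical stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)

d_model = 16                        # hypothetical hidden size
hidden = rng.normal(size=d_model)   # stand-in for one internal activation vector

# Hypothetical unit-norm feature direction (e.g. found by an interpretability method).
direction = rng.normal(size=d_model)
direction /= np.linalg.norm(direction)


def steer(activation: np.ndarray, direction: np.ndarray, strength: float) -> np.ndarray:
    """Add `strength` units of `direction` to the activation; the prompt is untouched."""
    return activation + strength * direction


before = hidden @ direction          # how strongly the feature is expressed originally
steered = steer(hidden, direction, strength=4.0)
after = steered @ direction          # the same readout after the intervention

print(f"feature readout before: {before:.3f}, after: {after:.3f}")
```

Because the direction has unit norm, the readout along it grows by exactly the chosen strength, while the input to the model stays the same.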
  3. 📍 Outline

    🕰 Mechinterp: a timeline 🌊 The Residual Stream 🌐 Where do concepts live in neural networks? ⚛ Superposition and polysemanticity 🔖 Feature recovery and circuit tracing 🤌 So what? 🎯 Takeaways
  4. 🌐 Where do concepts live in neural networks? Activations

    When we say that a “neuron fires”, we mean that its activation is non-zero.
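
A small sketch of what "firing" means in practice, assuming a single layer with a ReLU non-linearity (the layer sizes and random weights are hypothetical, and real LLMs often use other activation functions):

```python
import numpy as np

rng = np.random.default_rng(0)

x = rng.normal(size=8)           # hypothetical input to the layer
W = rng.normal(size=(8, 32))     # hypothetical weights: 8 inputs -> 32 neurons
b = np.zeros(32)

pre_activation = x @ W + b
activation = np.maximum(pre_activation, 0.0)   # ReLU clips negative values to zero

fires = activation > 0           # a neuron "fires" when its activation is non-zero
print(f"{fires.sum()} of {fires.size} neurons fire on this input")
```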
  5. 🌐 Where do concepts live in neural networks? Assume one neuron = one concept

    Each neuron then becomes an orthogonal direction in a high-dimensional linear vector space.
  6. 🌐 Where do concepts live in neural networks? Monosemanticity

    Neurons essentially coincide with activations, and each becomes an orthogonal direction in a high-dimensional linear vector space.
  7. 🌐 Where do concepts live in neural networks? Beyond monosemanticity

    Concepts don’t always align with neurons; in a linear vector space, all directions are equally important. The neurons are just a basis for representing any vector: we need to abstract them away and talk about directions instead.
  8. 🌐 Where do concepts live in neural networks? Beyond monosemanticity

    If directions represent features, how can we find them in neurons?
  9. 🌐 Where do concepts live in neural networks? Beyond monosemanticity

    If directions represent features, how can we find them in neurons?
  10. 🌐 Where do concepts live in neural networks? Enter: superposition

    In any sufficiently powerful LLM, features >> neurons: there are far more features than neurons. Because of this, superposition arises: the model stores more than one concept in overlapping representations that share the same weights.
  11. 🌐 Where do concepts live in neural networks? Almost-orthogonality

    Superposition is not a bug, but an efficient encoding strategy: to make the best use of the available space, concepts are spread out along almost-orthogonal directions. [Figure: directions for “Hat”, “Cat” and “Car”.] While the number of orthogonal directions scales linearly with the number of neurons, the number of almost-orthogonal directions scales exponentially, so the model can learn features >> neurons.
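
The almost-orthogonality claim can be checked numerically. The sketch below uses random directions purely as an illustration (a trained model does not place its features at random), with hypothetical sizes: it packs four times more unit vectors than dimensions and measures the largest overlap between any two of them.

```python
import numpy as np

rng = np.random.default_rng(0)

n_neurons = 512      # dimension of the space (hypothetical)
n_features = 2048    # four times more directions than dimensions

directions = rng.normal(size=(n_features, n_neurons))
directions /= np.linalg.norm(directions, axis=1, keepdims=True)   # unit-norm rows

cosines = directions @ directions.T
np.fill_diagonal(cosines, 0.0)   # ignore each vector's similarity with itself

max_overlap = np.abs(cosines).max()
print(f"largest |cosine| between two distinct directions: {max_overlap:.3f}")
```

Only 512 directions can be exactly orthogonal here, yet all 2048 stay far from parallel: the typical overlap between two random directions is on the order of 1/sqrt(n_neurons).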
  12. ⚛ Superposition and polysemanticity: The interference-capacity tradeoff

    With polysemantic neurons comes the risk of interference between concepts, or of confusing them. On the other hand, the model is able to store more concepts. [Figure, two panels over the directions “Cat”, “Car” and “Hat”. Interference: “Cat” spills onto its neighbors. Confusion: “Cat” can be confused with “Hat” + “Car”.]
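
A toy illustration of interference, assuming three feature directions squeezed into two neurons and spaced 120 degrees apart (the geometry and the dot-product readout are assumptions made for this example, not taken from the slides):

```python
import numpy as np

# Three hypothetical feature directions in a 2-neuron space, 120 degrees apart.
names = ["Cat", "Car", "Hat"]
angles = np.deg2rad([90.0, 210.0, 330.0])
features = np.stack([np.cos(angles), np.sin(angles)], axis=1)   # shape (3, 2)

# Only "Cat" is present in the input.
present = np.array([1.0, 0.0, 0.0])
activation = present @ features        # what the 2 neurons actually hold

# Read each feature back with a dot product against its direction.
readout = features @ activation
for name, value in zip(names, readout):
    print(f"{name}: {value:+.2f}")
```

“Cat” reads back at full strength, but “Car” and “Hat” also get non-zero readouts even though they are absent: that spill-over is the interference paid for storing three concepts in two neurons.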
  13. ⚛ Superposition and polysemanticity: Improving the tradeoff with more neurons or sparse features

    Apart from adding neurons, the main way to improve the tradeoff is to have independent, sparse features, i.e. features that don’t occur too often. A sparse feature is a more helpful concept, since it discriminates better between inputs. [Table: features (Being Gray, Purring) against data points (Cat, Hat, Car), with features labelled Sparse or Dense.]
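
The effect of sparsity on interference can be simulated. In this sketch, random feature directions are superposed in fewer neurons and read back with a naive dot product; the sizes, activation probabilities and the `readout_error` helper are all hypothetical choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

n_neurons, n_features, n_samples = 64, 256, 2000   # hypothetical sizes

# Random unit-norm feature directions in superposition (more features than neurons).
directions = rng.normal(size=(n_features, n_neurons))
directions /= np.linalg.norm(directions, axis=1, keepdims=True)


def readout_error(p_active: float) -> float:
    """Mean squared readout error when each feature is present with probability p_active."""
    present = (rng.random((n_samples, n_features)) < p_active).astype(float)
    activations = present @ directions         # superposed representation
    recovered = activations @ directions.T     # naive linear readout
    return float(np.mean((recovered - present) ** 2))


error_sparse = readout_error(0.01)   # features that rarely occur
error_dense = readout_error(0.5)     # features that occur half of the time
print(f"sparse features: {error_sparse:.3f}, dense features: {error_dense:.3f}")
```

With the same number of neurons, rarely active features are read back with much less error, because fewer features are present at once to interfere with each other.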
  14. ⚛ Superposition and polysemanticity: Learning with an over-complete basis

    We think the model learns concepts as combinations of a few basis vectors at a time. These basis vectors are the sparse features, and they can be thought of as forming an over-complete basis of the vector space from which every activation can be recreated. Cat = (0.4, 0.8) [“Having a tail”] + (0.1, -0.3) [“Purring”] + (-0.4, 0.2) [“Cuteness”] + (0.1, 0.4) [“Having a job”] + (-0.8, -0.1) [“Sport”] + …
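
The decomposition above can be written as a sparse coefficient vector times a dictionary of feature vectors. The five 2D vectors below are copied from the slide; the coefficients are an assumption made for this sketch (the first three features active with weight 1, the last two inactive).

```python
import numpy as np

# Feature vectors from the slide: 5 vectors in a 2D space, so the basis is over-complete.
feature_names = ["Having a tail", "Purring", "Cuteness", "Having a job", "Sport"]
dictionary = np.array([
    [0.4, 0.8],
    [0.1, -0.3],
    [-0.4, 0.2],
    [0.1, 0.4],
    [-0.8, -0.1],
])

# Assumption: only the first three features are active for "Cat", each with weight 1.
coefficients = np.array([1.0, 1.0, 1.0, 0.0, 0.0])

cat = coefficients @ dictionary   # linear combination of the active feature vectors
active = [name for name, c in zip(feature_names, coefficients) if c != 0]
print("active features:", active)
print("cat activation:", cat)
```

Under these assumed coefficients the activation is approximately (0.1, 0.7): a 2D point described by which of the five features are switched on.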
  15. ⚛ Superposition and polysemanticity: The Linear Representation hypothesis

    We can summarize everything so far as follows:

    • The neuron activations in the residual stream are a linear combination of feature vectors;
    • The features form an over-complete basis of the space;
    • The features have some level of sparsity, i.e. not many of them are used for any given activation.

    Together, these form the linear representation hypothesis (LRH).
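
One way to write these three statements compactly is shown below. The notation is an assumption introduced here, not taken from the slides: $\mathbf{a} \in \mathbb{R}^d$ is an activation vector, $\mathbf{f}_1, \dots, \mathbf{f}_m$ are the feature vectors, and $\mathbf{c}$ collects the coefficients $c_i$.

```latex
\mathbf{a} \;=\; \sum_{i=1}^{m} c_i \, \mathbf{f}_i ,
\qquad m > d ,
\qquad \lVert \mathbf{c} \rVert_0 \ll m
```

The sum expresses linearity, $m > d$ says that the basis is over-complete, and the small number of non-zero coefficients $\lVert \mathbf{c} \rVert_0$ expresses sparsity.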
  16. ⚛ Superposition and polysemanticity: LRH has been observed in toy models

    The assumptions in LRH have been partially verified in toy models or controlled settings, and are somewhat intuitive and plausible.
  17. ⚛ Superposition and polysemanticity: Feature recovery

    If we assume LRH, then learning features = finding the over-complete sparse vectors that represent them. We call this task sparse dictionary learning. To find these sparse vectors, in practice we need to map the activations to an even higher-dimensional space, where the features can be separated.
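
One common way to set up sparse dictionary learning is a sparse autoencoder. The sketch below shows only its forward pass and loss, with hypothetical sizes, randomly initialised (untrained) parameters and an arbitrary L1 weight; it is not presented as the exact setup used in the papers referenced in this deck.

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, d_dict, batch = 16, 64, 8   # hypothetical sizes; d_dict > d_model (over-complete)

activations = rng.normal(size=(batch, d_model))   # stand-in for residual-stream activations

# Randomly initialised parameters; in practice they are trained to minimise `loss`.
W_enc = rng.normal(scale=0.1, size=(d_model, d_dict))
b_enc = np.zeros(d_dict)
W_dec = rng.normal(scale=0.1, size=(d_dict, d_model))
b_dec = np.zeros(d_model)

# Encode into the higher-dimensional space; ReLU keeps the codes non-negative.
codes = np.maximum(activations @ W_enc + b_enc, 0.0)
# Decode: each row of W_dec plays the role of one dictionary (feature) vector.
reconstruction = codes @ W_dec + b_dec

l1_coefficient = 1e-3   # hypothetical weight of the sparsity penalty
mse = np.mean((reconstruction - activations) ** 2)
sparsity = np.mean(np.sum(np.abs(codes), axis=1))
loss = mse + l1_coefficient * sparsity
print(f"reconstruction error: {mse:.3f}, L1 of codes: {sparsity:.3f}, loss: {loss:.3f}")
```

The two terms of the loss pull in opposite directions: the reconstruction error asks the dictionary to explain every activation, while the L1 penalty asks it to do so with as few active codes as possible.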
  18. 🔖 Feature recovery and circuit tracing

    How do we do feature recovery? We build a replacement model.
  19. 🔖 Feature recovery and circuit tracing: Going beyond feature recovery with circuits

    • Scaling feature recovery allowed researchers to map subnetworks of the LLM (called circuits) that are sensitive to inputs belonging to a feature.
  20. 🤌 So what? The achievements of the past ~3-4 years

    • Researchers found empirical evidence of superposition in language models.
    • Researchers honed feature recovery methods on toy models.
    • Labs, most notably Anthropic, scaled feature recovery to SoTA models.
    • Anthropic went beyond (per-layer) feature recovery to circuit tracing applied to SoTA models, and explored new feature recovery strategies.
  21. 🤌 So what? Implications and claims about these results

    • Interpretability is becoming scalable.
    • Interpretability is becoming causal*: we can analyse models**.
    • Interpretability is becoming actionable: we are seeing attempts to steer the models.
    • The goal: interpretability as an engineering tool.
  22. 🤌 So what? Circuit tracing applications

    • Safety: suppress undesirable behaviour (users bypassing guardrails, or the LLM defying rules in agent harnesses).
    • Customisation: control tone, persona and reasoning more reliably than with system prompts. Could this reduce token usage, or potentially replace fine-tuning?
    • Debugging / Diagnostics: identify the circuits responsible for errors (e.g. hallucinations), for underperformance on certain domains, or perhaps even for sensitivity to fine-tuning.
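
As a rough illustration of the "suppress undesirable behaviour" idea, one simple intervention is to project a direction out of an activation. This is a sketch under strong assumptions: the activation, the "unwanted" direction and the `suppress` helper are hypothetical, and real interventions on circuits are considerably more involved.

```python
import numpy as np

rng = np.random.default_rng(0)

activation = rng.normal(size=16)     # stand-in for one internal activation vector

# Hypothetical unit-norm direction associated with an undesirable behaviour.
unwanted = rng.normal(size=16)
unwanted /= np.linalg.norm(unwanted)


def suppress(activation: np.ndarray, direction: np.ndarray) -> np.ndarray:
    """Remove the component of `activation` along the unit vector `direction`."""
    return activation - (activation @ direction) * direction


cleaned = suppress(activation, unwanted)
print(f"readout before: {activation @ unwanted:.3f}, after: {cleaned @ unwanted:.3f}")
```

After the projection the activation has no component left along the unwanted direction, while everything orthogonal to it is unchanged.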
  23. 🤌 So what? Some of the current limits

    • Still a post-hoc process. The number of features is derived after the model has been trained.
    • Feature subjectivity. Features are labelled using automated procedures involving inference and LLM-based labelling. In other words, feature labels are assigned post-hoc.
    • Suppressed features. We don’t understand what causes some features not to fire.
  24. 🤌 So what? Circuits and their current limits

    • Locality of circuits. Authors admit that analysing circuits that encompass the full model is “quite challenging”.
    • Correlation vs causation (“mechanistic faithfulness”). Interventions change behaviour, but may not reflect the true underlying causes.
    • Anecdotal evidence. Recent findings only report episodes that belong to a certain model.
  25. 🎯 Takeaways: We’re just getting started

    • Neural networks still operate largely as black boxes, though the field of mechanistic interpretability has accelerated significantly.
    • Building upon decades of research in interpretability, researchers found empirical evidence of key assumptions (superposition) and scaled both feature recovery and circuit tracing methods to state-of-the-art models.
    • We are starting to analyse some LLM behaviours, such as “functional emotions”, using circuits, and to experiment with model steering.
  26. 📚 References

    The most salient papers and blog posts:

    • Toy Models of Superposition (2022)
    • Towards Monosemanticity: Decomposing Language Models With Dictionary Learning (2023)
    • Scaling Monosemanticity (2024)
    • Circuit Tracing: Revealing Computational Graphs in Language Models (2025)
    • Emotion Concepts and their Function in a Large Language Model (2026)
    • How we built Steer, our interpretability playground (2026)
    • A Unified Theory of Sparse Dictionary Learning in Mechanistic Interpretability (2026)