
Inside the Mind of an LLM

Luca Baggi
April 09, 2026


What if you could watch an AI’s thought take shape? For years, LLMs have been impenetrable "black boxes," but we are finally beginning to find ways to see how the ghost in the machine actually works.

This talk explores **mechanistic interpretability**, a subfield of AI that aims to understand the internal workings of neural networks. Mapping these internal "circuits" is not just a philosophical curiosity: it is a high-stakes engineering necessity for safety, debugging, and trust.


Transcript

  1. A paradigm shift: we can change a model's behavior without changing the prompt, by intervening directly on (or even "hacking") its internal activations. This is a key shift from prompt engineering to feature-level control: from changing what the model sees to changing how it computes. It can be seen as "reverse engineering" a neural network, the manifesto of a branch of AI research named mechanistic interpretability. A minimal sketch of such an intervention follows below.
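
A purely illustrative sketch of what "intervening on activations" means: a toy model whose residual stream gets nudged along a feature direction mid-forward-pass. Every name and number below is a stand-in, not a real LLM or library API.

```python
# Purely illustrative: "steering" a toy model by editing its residual stream
# instead of its prompt. All names and numbers are stand-ins, not a real LLM API.
import numpy as np

rng = np.random.default_rng(0)
d_model = 16
layers = [rng.standard_normal((d_model, d_model)) * 0.05 for _ in range(6)]

steer_vector = rng.standard_normal(d_model)   # pretend: a learned feature direction
steer_vector /= np.linalg.norm(steer_vector)
alpha = 4.0                                   # intervention strength
steer_at_layer = 3

def forward(x, steer=False):
    for i, weights in enumerate(layers):
        x = x + np.tanh(weights @ x)          # each block writes into the residual stream
        if steer and i == steer_at_layer:
            x = x + alpha * steer_vector      # the intervention: nudge the stream along a feature
    return x

prompt_embedding = rng.standard_normal(d_model)            # same input both times
print("baseline:", forward(prompt_embedding)[:4])
print("steered :", forward(prompt_embedding, steer=True)[:4])
```
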
  2. 📍 Outline: 📦 LLMs Are Opaque · 📏 Neurons and features · ⚛ Superposition and polysemanticity · ❓ Latest developments · 🤌 Practical implications · 🎯 Takeaways
  3. LLMs Are Opaque. Large language models are black boxes: we observe behavior, not computation. To probe their internal state we focus on the residual stream, the central communication channel of the model: it is where the attention heads read from and write to, and it carries along all previous computation. A toy sketch of this accumulation follows below.
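
A toy illustration (not a real transformer) of why the residual stream carries the whole computation: each block reads the stream and adds its output back, so at any point the stream is the embedding plus everything written so far.

```python
# Toy illustration (not a real transformer): the residual stream at any point
# equals the original embedding plus everything each block has written so far.
import numpy as np

rng = np.random.default_rng(1)
d_model = 8

def attention_block(x):                  # stand-ins for real blocks: what matters
    return 0.1 * np.tanh(x[::-1])        # is that each one *adds* to the stream

def mlp_block(x):
    return 0.1 * np.tanh(x * 2)

embedding = rng.standard_normal(d_model)
stream, writes = embedding.copy(), []
for _ in range(3):                       # three toy layers
    for block in (attention_block, mlp_block):
        out = block(stream)
        writes.append(out)
        stream = stream + out            # blocks read the stream and write back into it

# The stream is literally the embedding plus all previous writes:
assert np.allclose(stream, embedding + sum(writes))
```
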
  4. The vector space of neurons. The state of the neurons at a certain point in the residual stream represents what the LLM is currently "thinking"; hopefully we can identify, and then act upon, the concepts living inside it. With each neuron holding a number, the residual stream forms a high-dimensional linear vector space. (Diagram: activations plotted along Neuron 1 and Neuron 2 axes.)
  5. Monosemanticity. We would like to have a map linking each neuron to a concept, to understand what LLMs think when neurons fire up. A neuron encoding a single concept is called monosemantic. (Diagram: "Catness" and "Hatness" as separate neuron axes, with Cat = (1,0) and Hat = (0,1); for the prompt "I'm wearing a ", the "Hatness" neuron lights up and the completion is "Hat".)
  6. Abstraction level-up. Concepts don't always align with neurons: in a linear vector space, all directions are equally important. The neurons are just a basis for representing any vector; we need to abstract them away and talk about directions instead. (Diagram: Cat = (0.4, 0.7) and Hat = (-0.8, 0.5), concept directions that no longer coincide with the neuron axes.) A sketch of reading a concept off a direction follows below.
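
A minimal sketch of reading a concept off a direction rather than a neuron, reusing the slide's toy numbers; the activation value is made up.

```python
# Illustrative only: reading a concept off a *direction* rather than a single neuron.
# The direction mirrors the slide's toy example; nothing here comes from a real model.
import numpy as np

cat_direction = np.array([0.4, 0.7])          # a learned direction, not a neuron axis
cat_direction = cat_direction / np.linalg.norm(cat_direction)

activation = np.array([0.2, 0.35])            # some residual-stream state (two "neurons")

# How strongly is the model "thinking about cats"? Project onto the direction.
cat_score = float(activation @ cat_direction)
print(cat_score)                              # neither neuron alone carries this meaning
```
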
  7. Features. If neurons are not the "true variables", what are? We call such units features: generic directions learned in the residual stream. (Diagram: for the prompt "I have a gray ", the activation (0.4, 0.7) lies along the Cat direction, even though neither neuron corresponds to "Catness" or "Hatness" on its own, and the completion is "Cat".)
  8. Superposition. Unfortunately, features >> neurons in any powerful enough LLM. Because of this, superposition arises: the model stores concepts in overlapping representations that share the same weights. (Diagram: Cat = 0.4 × (1,0) + 0.7 × (0,1) and Hat = -0.8 × (1,0) + 0.5 × (0,1), both expressed on the Neuron 1 / Neuron 2 axes.) A toy example follows below.
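
A toy numerical view of superposition: three concept directions squeezed into a two-neuron space necessarily overlap. Cat and Hat come from the slide; the "car" direction is made up for illustration.

```python
# Toy version of "more features than neurons": three concept directions in a
# two-neuron space. Cat and Hat come from the slide; "car" is made up here.
import numpy as np

cat = np.array([0.4, 0.7])
hat = np.array([-0.8, 0.5])
car = np.array([0.7, -0.2])

# With only two neurons, at most two directions can be exactly orthogonal, so the
# concepts necessarily overlap, i.e. they share the same weights:
for name, (a, b) in {"cat·hat": (cat, hat), "cat·car": (cat, car), "hat·car": (hat, car)}.items():
    print(name, round(float(a @ b), 2))
```
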
  9. Polysemanticity. Polysemanticity is a consequence of superposition at the neuron level. (Diagram: for "I have a gray " → "Cat", the activations 0.4 and 0.7 sit on neurons that respond to mixtures of concepts such as "Is Gray" + "Wearable" and "Purrs" + "Has wheels", rather than to Cat or Hat alone.)
  10. Almost-orthogonality. This is not a bug but an efficient encoding strategy: to optimize the use of space, concepts are spread out along almost-orthogonal directions. While the number of exactly orthogonal directions scales linearly with the number of neurons, the number of almost-orthogonal ones scales exponentially, reaching a size where features >> neurons can be learned. (Diagram: Hat, Cat and Car directions packed at near-right angles.) A rough numerical check follows below.
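
A rough numerical check of the almost-orthogonality claim: random unit vectors in a high-dimensional space are nearly orthogonal to one another, so many more directions than neurons can coexist with little overlap. The sizes below are arbitrary toy choices.

```python
# Rough numerical check: random unit vectors in a high-dimensional space barely
# overlap, so far more than `dim` of them can coexist. Sizes are toy choices.
import numpy as np

rng = np.random.default_rng(0)
dim, n_vectors = 256, 2000                    # ~8x more directions than "neurons"

vecs = rng.standard_normal((n_vectors, dim))
vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)

cosines = vecs @ vecs.T
np.fill_diagonal(cosines, 0.0)
print("largest pairwise |cos|:", round(float(np.abs(cosines).max()), 3))   # still small
print("fraction of pairs with |cos| < 0.1:", round(float((np.abs(cosines) < 0.1).mean()), 3))
```
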
  11. The interference / capacity trade-off. With polysemantic neurons comes the risk of interference, i.e. of confusing concepts with one another; on the other hand, the model is able to store more concepts. (Diagram: interference, where "Cat" spills onto its neighbours Hat and Car; and confusion, where "Cat" can be confused with "Hat" + "Car".) A toy calculation follows below.
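
A toy calculation of the trade-off, reusing the made-up directions from the previous sketch: a purely-"cat" activation still projects onto its neighbours (interference), and with more concepts than neurons the "cat" direction is literally expressible as a mix of "hat" and "car" (confusion).

```python
# Toy picture of the trade-off, reusing the made-up directions from above.
import numpy as np

cat = np.array([0.4, 0.7])
hat = np.array([-0.8, 0.5])
car = np.array([0.7, -0.2])

activation = cat                                           # the input is only about cats...
print("hat readout:", round(float(activation @ hat), 2))   # ...yet these readouts are non-zero
print("car readout:", round(float(activation @ car), 2))

# Confusion: in this 2-neuron space the cat direction is exactly a mix of hat and car.
coeffs = np.linalg.solve(np.column_stack([hat, car]), cat)
print("cat =", round(float(coeffs[0]), 2), "* hat +", round(float(coeffs[1]), 2), "* car")
```
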
  12. Sparsity. Apart from increasing the number of neurons, the main way to improve the trade-off is having independent, sparse features, i.e. features which don't occur too often. A sparse feature is a more helpful concept since it better discriminates between inputs. (Table: across the data points Cat, Hat and Car, "Being Gray" is a dense feature that applies to all of them, while "Purring" is sparse and applies only to Cat.) A tiny example follows below.
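
A tiny, made-up example of why sparse features discriminate better than dense ones:

```python
# Made-up data: a feature that fires on almost every input tells you little about the input.
inputs = ["cat", "hat", "car"]
is_gray = {"cat": 1, "hat": 1, "car": 1}   # dense: fires on everything
purrs = {"cat": 1, "hat": 0, "car": 0}     # sparse: fires rarely

for name, feature in [("is_gray", is_gray), ("purrs", purrs)]:
    active = [x for x in inputs if feature[x]]
    print(f"{name}: active on {len(active)}/{len(inputs)} inputs -> {active}")
# Seeing "purrs" fire pins the input down to the cat; seeing "is_gray" fire does not.
```
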
  13. The underlying learning structure. We think the model learns concepts as combinations of a few basis vectors at a time, the sparse features, which can be thought of as forming an over-complete basis of the vector space from which every activation can be recreated. (Example: Cat = (0.4, 0.8) + (0.1, -0.3) + (-0.4, 0.2) + (0.1, 0.4) + (-0.8, -0.1) + …, the terms corresponding to features such as "Having a tail", "Purring", "Cuteness", "Having a job" and "Sport".) A toy version follows below.
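
A toy version of that decomposition, using the slide's vectors as an over-complete dictionary; which features are "active" is an illustrative choice, not taken from a real model.

```python
# Toy decomposition: an activation built from a *few* entries of an over-complete
# dictionary (five features, two neurons). The active set is an illustrative choice.
import numpy as np

dictionary = {
    "having_a_tail": np.array([0.4, 0.8]),
    "purring":       np.array([0.1, -0.3]),
    "cuteness":      np.array([-0.4, 0.2]),
    "having_a_job":  np.array([0.1, 0.4]),
    "sport":         np.array([-0.8, -0.1]),
}

active = ["having_a_tail", "purring", "cuteness"]        # sparsity: only a few fire at once
cat_activation = sum(dictionary[name] for name in active)
print("Cat activation:", cat_activation)                 # roughly (0.1, 0.7) in this toy example
```
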
  14. The Linear Representation Hypothesis. We can summarize everything so far into the following:
    • the neuron activations in the residual stream are a linear combination of feature vectors;
    • the features form an over-complete basis of the space;
    • the features have some level of sparsity, i.e. not many of them are used for any given activation.
    The above form the linear representation hypothesis (stated compactly below).
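
A compact symbolic restatement of the hypothesis (the notation is mine, not from the slides):

```latex
% Linear Representation Hypothesis, stated compactly: an activation x in the
% residual stream is (approximately) a sparse linear combination of feature
% directions f_i, of which there are far more than there are neurons.
x \;\approx\; \sum_{i=1}^{n} a_i \, f_i ,
\qquad f_i \in \mathbb{R}^{d_{\text{model}}},
\qquad n \gg d_{\text{model}},
\qquad \lVert a \rVert_0 \ll n .
```
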
  15. The Linear Representation Hypothesis. The assumptions in the LRH have been partially verified in toy models or controlled settings, and are somewhat intuitive and plausible.
  16. Feature recovery. If we assume the LRH, then learning features means finding the over-complete set of sparse vectors representing them. We call this task sparse dictionary learning. To find these sparse vectors, in practice we need to map activations to an even higher-dimensional space, where they can be separated (a sketch of one common approach follows below).
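
A minimal sketch of one common approach, a sparse autoencoder trained on residual-stream activations; this is a simplified toy (random weights, no training loop), not the exact method of any particular paper.

```python
# Minimal sketch of a sparse autoencoder: map residual-stream activations into a
# much wider space, keep activity sparse, and reconstruct. The learned decoder
# columns are the candidate feature directions. Toy sizes, untrained weights.
import numpy as np

rng = np.random.default_rng(0)
d_model, d_features = 64, 1024          # over-complete: features >> neurons

W_enc = rng.standard_normal((d_features, d_model)) * 0.01
b_enc = np.zeros(d_features)
W_dec = rng.standard_normal((d_model, d_features)) * 0.01
l1_coeff = 1e-3                         # sparsity penalty strength

def sae_forward(x):
    """x: a batch of residual-stream activations, shape (batch, d_model)."""
    acts = np.maximum(x @ W_enc.T + b_enc, 0.0)          # ReLU -> sparse feature activations
    recon = acts @ W_dec.T                               # reconstruction from features
    loss = np.mean((recon - x) ** 2) + l1_coeff * np.abs(acts).mean()
    return acts, recon, loss

batch = rng.standard_normal((32, d_model))               # stand-in for real activations
acts, recon, loss = sae_forward(batch)
print("mean active features per example:", (acts > 0).sum(axis=1).mean())
print("loss:", loss)
# In practice W_enc / W_dec are trained with SGD on this loss over many real
# activations; each column of W_dec is then interpreted as one feature.
```
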
  17. ❓ Latest developments: scaling feature recovery.
    • Researchers found empirical evidence of superposition in language models.
    • Researchers managed to hone feature recovery methods on toy models.
    • Labs, most notably Anthropic, scaled feature recovery to SoTA models.
  18. ❓ Latest developments: going beyond feature recovery with circuits. Scaling feature recovery allowed researchers to map sub-networks of the LLM (called circuits) that are sensitive to inputs belonging to a given feature.
  19. ❓ Latest developments: key achievements.
    • Interpretability is becoming scalable.
    • Interpretability is becoming causal*: we can analyse models**.
    • Interpretability is becoming actionable: we are seeing attempts to steer the models.
    The goal: interpretability as an engineering tool.
  20. 🤌 Practical implications: circuit tracing applications.
    • Safety: suppress undesirable behaviour (users bypassing guardrails, or the LLM defying rules in agent harnesses).
    • Customisation: control tone, persona and reasoning more reliably than system prompts. Could this reduce token usage, or potentially replace fine-tuning?
    • Debugging / diagnostics: identify circuits responsible for errors (e.g. hallucinations), for underperformance on certain domains, or perhaps even for sensitivity to fine-tuning.
  21. 🤌 Practical implications: features and their current limits.
    • Still a post-hoc process: the number of features is derived after the model has been trained.
    • Feature subjectivity: features are labelled using automated procedures involving inference and LLM-based labelling; in other words, labels are assigned post-hoc.
    • Inactive or ambiguous features: some features are too specific, others overlap too much. This complicates how circuits are drawn.
    • Suppressed features: we don't understand what causes some other features not to fire.
  22. 🤌 Practical implications: circuits and their current limits.
    • Locality of circuits: the authors admit that analysing circuits that encompass the full model is "quite challenging".
    • Correlation vs causation ("mechanistic faithfulness"): interventions change behaviour, but may not reflect the true underlying causes.
    • Anecdotal evidence: recent findings only report episodes specific to a particular model.
  23. 🎯 Takeaways: we're just getting started.
    • Neural networks currently still operate largely as black boxes, though the field of mechanistic interpretability has accelerated significantly.
    • Building upon decades of research in interpretability, researchers found empirical evidence for key assumptions (superposition) and scaled both feature recovery and circuit tracing methods to state-of-the-art models.
    • We are starting to analyse some LLM behaviours, such as "functional emotions", using circuits, and to play around with model steering.
  24. 📚 References: a short list of readings.
    • Toy Models of Superposition (2022)
    • Towards Monosemanticity: Decomposing Language Models With Dictionary Learning (2023)
    • Scaling Monosemanticity (2024)
    • Circuit Tracing: Revealing Computational Graphs in Language Models (2025)
    • Emotion Concepts and their Function in a Large Language Model (2026)
    • How we built Steer, our interpretability playground (2026)
    • A Unified Theory of Sparse Dictionary Learning in Mechanistic Interpretability (2026)