
Inside the Mind of an LLM

Luca Baggi
April 09, 2026


What if you could watch an AI’s thought take shape? For years, LLMs have been impenetrable "black boxes," but we are finally beginning to find ways to see how the ghost in the machine actually works.

This talk explores **mechanistic interpretability**, a subfield of AI that aims to understand the internal workings of neural networks. Mapping these internal "circuits" is not just a philosophical curiosity: it is a high-stakes engineering necessity for safety, debugging, and trust.


Transcript

  1. A paradigm shift: we can change a model's behavior without changing the prompt, by intervening directly on (or even "hacking") its internal activations. This is a key shift from prompt engineering to feature-level control: from changing what the model sees to changing how it computes. It can be seen as "reverse engineering" a neural network, the manifesto of a branch of AI research named mechanistic interpretability. A minimal sketch of such an intervention follows below.
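
A purely illustrative sketch of what "intervening on activations" means: a toy model whose residual stream gets nudged along a feature direction mid-forward-pass. Every name and number below is a stand-in, not a real LLM or library API.

```python
# Purely illustrative: "steering" a toy model by editing its residual stream
# instead of its prompt. All names and numbers are stand-ins, not a real LLM API.
import numpy as np

rng = np.random.default_rng(0)
d_model = 16
layers = [rng.standard_normal((d_model, d_model)) * 0.05 for _ in range(6)]

steer_vector = rng.standard_normal(d_model)   # pretend: a learned feature direction
steer_vector /= np.linalg.norm(steer_vector)
alpha = 4.0                                   # intervention strength
steer_at_layer = 3

def forward(x, steer=False):
    for i, weights in enumerate(layers):
        x = x + np.tanh(weights @ x)          # each block writes into the residual stream
        if steer and i == steer_at_layer:
            x = x + alpha * steer_vector      # the intervention: nudge the stream along a feature
    return x

prompt_embedding = rng.standard_normal(d_model)            # same input both times
print("baseline:", forward(prompt_embedding)[:4])
print("steered :", forward(prompt_embedding, steer=True)[:4])
```
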
  2. 📍 Outline: 📦 LLMs Are Opaque · 📏 Neurons and features · ⚛ Superposition and polysemanticity · ❓ Latest developments · 🤌 Practical implications · 🎯 Takeaways
  3. LLMs Are Opaque. Large language models are black boxes: we observe behavior, not computation. To probe their internal state we focus on the residual stream, the central communication channel of the model: it is where the attention heads read from and write to, and it carries along all previous computation. A toy sketch of this accumulation follows below.
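
A toy illustration (not a real transformer) of why the residual stream carries the whole computation: each block reads the stream and adds its output back, so at any point the stream is the embedding plus everything written so far.

```python
# Toy illustration (not a real transformer): the residual stream at any point
# equals the original embedding plus everything each block has written so far.
import numpy as np

rng = np.random.default_rng(1)
d_model = 8

def attention_block(x):                  # stand-ins for real blocks: what matters
    return 0.1 * np.tanh(x[::-1])        # is that each one *adds* to the stream

def mlp_block(x):
    return 0.1 * np.tanh(x * 2)

embedding = rng.standard_normal(d_model)
stream, writes = embedding.copy(), []
for _ in range(3):                       # three toy layers
    for block in (attention_block, mlp_block):
        out = block(stream)
        writes.append(out)
        stream = stream + out            # blocks read the stream and write back into it

# The stream is literally the embedding plus all previous writes:
assert np.allclose(stream, embedding + sum(writes))
```
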
  4. The vector space of neurons. The state of the neurons at a certain point in the residual stream represents what the LLM is currently "thinking"; hopefully we can identify, and then act upon, the concepts living inside it. With each neuron holding a number, the residual stream forms a high-dimensional linear vector space. (Diagram: activations plotted along Neuron 1 and Neuron 2 axes.)
  5. Monosemanticity. We would like to have a map linking each neuron to a concept, to understand what LLMs think when neurons fire up. A neuron encoding a single concept is called monosemantic. (Diagram: "Catness" and "Hatness" as separate neuron axes, with Cat = (1,0) and Hat = (0,1); for the prompt "I'm wearing a ", the "Hatness" neuron lights up and the completion is "Hat".)
  6. Abstraction level-up. Concepts don't always align with neurons: in a linear vector space, all directions are equally important. The neurons are just a basis for representing any vector; we need to abstract them away and talk about directions instead. (Diagram: Cat = (0.4, 0.7) and Hat = (-0.8, 0.5), concept directions that no longer coincide with the neuron axes.) A sketch of reading a concept off a direction follows below.
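
A minimal sketch of reading a concept off a direction rather than a neuron, reusing the slide's toy numbers; the activation value is made up.

```python
# Illustrative only: reading a concept off a *direction* rather than a single neuron.
# The direction mirrors the slide's toy example; nothing here comes from a real model.
import numpy as np

cat_direction = np.array([0.4, 0.7])          # a learned direction, not a neuron axis
cat_direction = cat_direction / np.linalg.norm(cat_direction)

activation = np.array([0.2, 0.35])            # some residual-stream state (two "neurons")

# How strongly is the model "thinking about cats"? Project onto the direction.
cat_score = float(activation @ cat_direction)
print(cat_score)                              # neither neuron alone carries this meaning
```
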
  7. Features. If neurons are not the "true variables", what are? We call such units features: generic directions learned in the residual stream. (Diagram: for the prompt "I have a gray ", the activation (0.4, 0.7) lies along the Cat direction, even though neither neuron corresponds to "Catness" or "Hatness" on its own, and the completion is "Cat".)
  8. Superposition. Unfortunately, features >> neurons in any powerful enough LLM. Because of this, superposition arises: the model stores concepts in overlapping representations that share the same weights. (Diagram: Cat = 0.4 × (1,0) + 0.7 × (0,1) and Hat = -0.8 × (1,0) + 0.5 × (0,1), both expressed on the Neuron 1 / Neuron 2 axes.) A toy example follows below.
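
A toy numerical view of superposition: three concept directions squeezed into a two-neuron space necessarily overlap. Cat and Hat come from the slide; the "car" direction is made up for illustration.

```python
# Toy version of "more features than neurons": three concept directions in a
# two-neuron space. Cat and Hat come from the slide; "car" is made up here.
import numpy as np

cat = np.array([0.4, 0.7])
hat = np.array([-0.8, 0.5])
car = np.array([0.7, -0.2])

# With only two neurons, at most two directions can be exactly orthogonal, so the
# concepts necessarily overlap, i.e. they share the same weights:
for name, (a, b) in {"cat·hat": (cat, hat), "cat·car": (cat, car), "hat·car": (hat, car)}.items():
    print(name, round(float(a @ b), 2))
```
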
  9. Polysemanticity. Polysemanticity is a consequence of superposition at the neuron level. (Diagram: for "I have a gray " → "Cat", the activations 0.4 and 0.7 sit on neurons that respond to mixtures of concepts such as "Is Gray" + "Wearable" and "Purrs" + "Has wheels", rather than to Cat or Hat alone.)
  10. Almost-orthogonality. This is not a bug but an efficient encoding strategy: to optimize the use of space, concepts are spread out along almost-orthogonal directions. While the number of exactly orthogonal directions scales linearly with the number of neurons, the number of almost-orthogonal ones scales exponentially, reaching a size where features >> neurons can be learned. (Diagram: Hat, Cat and Car directions packed at near-right angles.) A rough numerical check follows below.
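
A rough numerical check of the almost-orthogonality claim: random unit vectors in a high-dimensional space are nearly orthogonal to one another, so many more directions than neurons can coexist with little overlap. The sizes below are arbitrary toy choices.

```python
# Rough numerical check: random unit vectors in a high-dimensional space barely
# overlap, so far more than `dim` of them can coexist. Sizes are toy choices.
import numpy as np

rng = np.random.default_rng(0)
dim, n_vectors = 256, 2000                    # ~8x more directions than "neurons"

vecs = rng.standard_normal((n_vectors, dim))
vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)

cosines = vecs @ vecs.T
np.fill_diagonal(cosines, 0.0)
print("largest pairwise |cos|:", round(float(np.abs(cosines).max()), 3))   # still small
print("fraction of pairs with |cos| < 0.1:", round(float((np.abs(cosines) < 0.1).mean()), 3))
```
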
  11. The interference / capacity trade-off. With polysemantic neurons comes the risk of interference, i.e. of confusing concepts with one another; on the other hand, the model is able to store more concepts. (Diagram: interference, where "Cat" spills onto its neighbours Hat and Car; and confusion, where "Cat" can be confused with "Hat" + "Car".) A toy calculation follows below.
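
A toy calculation of the trade-off, reusing the made-up directions from the previous sketch: a purely-"cat" activation still projects onto its neighbours (interference), and with more concepts than neurons the "cat" direction is literally expressible as a mix of "hat" and "car" (confusion).

```python
# Toy picture of the trade-off, reusing the made-up directions from above.
import numpy as np

cat = np.array([0.4, 0.7])
hat = np.array([-0.8, 0.5])
car = np.array([0.7, -0.2])

activation = cat                                           # the input is only about cats...
print("hat readout:", round(float(activation @ hat), 2))   # ...yet these readouts are non-zero
print("car readout:", round(float(activation @ car), 2))

# Confusion: in this 2-neuron space the cat direction is exactly a mix of hat and car.
coeffs = np.linalg.solve(np.column_stack([hat, car]), cat)
print("cat =", round(float(coeffs[0]), 2), "* hat +", round(float(coeffs[1]), 2), "* car")
```
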
  12. Sparsity. Apart from increasing the number of neurons, the main way to improve the trade-off is having independent, sparse features, i.e. features which don't occur too often. A sparse feature is a more helpful concept since it better discriminates between inputs. (Table: across the data points Cat, Hat and Car, "Being Gray" is a dense feature that applies to all of them, while "Purring" is sparse and applies only to Cat.) A tiny example follows below.
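
A tiny, made-up example of why sparse features discriminate better than dense ones:

```python
# Made-up data: a feature that fires on almost every input tells you little about the input.
inputs = ["cat", "hat", "car"]
is_gray = {"cat": 1, "hat": 1, "car": 1}   # dense: fires on everything
purrs = {"cat": 1, "hat": 0, "car": 0}     # sparse: fires rarely

for name, feature in [("is_gray", is_gray), ("purrs", purrs)]:
    active = [x for x in inputs if feature[x]]
    print(f"{name}: active on {len(active)}/{len(inputs)} inputs -> {active}")
# Seeing "purrs" fire pins the input down to the cat; seeing "is_gray" fire does not.
```
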
  13. The underlying learning structure. We think the model learns concepts as combinations of a few basis vectors at a time, the sparse features, which can be thought of as forming an over-complete basis of the vector space from which every activation can be recreated. (Example: Cat = (0.4, 0.8) + (0.1, -0.3) + (-0.4, 0.2) + (0.1, 0.4) + (-0.8, -0.1) + …, the terms corresponding to features such as "Having a tail", "Purring", "Cuteness", "Having a job" and "Sport".) A toy version follows below.
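
A toy version of that decomposition, using the slide's vectors as an over-complete dictionary; which features are "active" is an illustrative choice, not taken from a real model.

```python
# Toy decomposition: an activation built from a *few* entries of an over-complete
# dictionary (five features, two neurons). The active set is an illustrative choice.
import numpy as np

dictionary = {
    "having_a_tail": np.array([0.4, 0.8]),
    "purring":       np.array([0.1, -0.3]),
    "cuteness":      np.array([-0.4, 0.2]),
    "having_a_job":  np.array([0.1, 0.4]),
    "sport":         np.array([-0.8, -0.1]),
}

active = ["having_a_tail", "purring", "cuteness"]        # sparsity: only a few fire at once
cat_activation = sum(dictionary[name] for name in active)
print("Cat activation:", cat_activation)                 # roughly (0.1, 0.7) in this toy example
```
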
  14. The Linear Representation Hypothesis. We can summarize everything so far into the following:
    • the neuron activations in the residual stream are a linear combination of feature vectors;
    • the features form an over-complete basis of the space;
    • the features have some level of sparsity, i.e. not many of them are used for any given activation.
    The above form the linear representation hypothesis (stated compactly below).
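
A compact symbolic restatement of the hypothesis (the notation is mine, not from the slides):

```latex
% Linear Representation Hypothesis, stated compactly: an activation x in the
% residual stream is (approximately) a sparse linear combination of feature
% directions f_i, of which there are far more than there are neurons.
x \;\approx\; \sum_{i=1}^{n} a_i \, f_i ,
\qquad f_i \in \mathbb{R}^{d_{\text{model}}},
\qquad n \gg d_{\text{model}},
\qquad \lVert a \rVert_0 \ll n .
```
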
  15. The Linear Representation Hypothesis. The assumptions in the LRH have been partially verified in toy models or controlled settings, and are somewhat intuitive and plausible.
  16. Feature recovery. If we assume the LRH, then learning features means finding the over-complete set of sparse vectors representing them. We call this task sparse dictionary learning. To find these sparse vectors, in practice we need to map activations to an even higher-dimensional space, where they can be separated (a sketch of one common approach follows below).
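
A minimal sketch of one common approach, a sparse autoencoder trained on residual-stream activations; this is a simplified toy (random weights, no training loop), not the exact method of any particular paper.

```python
# Minimal sketch of a sparse autoencoder: map residual-stream activations into a
# much wider space, keep activity sparse, and reconstruct. The learned decoder
# columns are the candidate feature directions. Toy sizes, untrained weights.
import numpy as np

rng = np.random.default_rng(0)
d_model, d_features = 64, 1024          # over-complete: features >> neurons

W_enc = rng.standard_normal((d_features, d_model)) * 0.01
b_enc = np.zeros(d_features)
W_dec = rng.standard_normal((d_model, d_features)) * 0.01
l1_coeff = 1e-3                         # sparsity penalty strength

def sae_forward(x):
    """x: a batch of residual-stream activations, shape (batch, d_model)."""
    acts = np.maximum(x @ W_enc.T + b_enc, 0.0)          # ReLU -> sparse feature activations
    recon = acts @ W_dec.T                               # reconstruction from features
    loss = np.mean((recon - x) ** 2) + l1_coeff * np.abs(acts).mean()
    return acts, recon, loss

batch = rng.standard_normal((32, d_model))               # stand-in for real activations
acts, recon, loss = sae_forward(batch)
print("mean active features per example:", (acts > 0).sum(axis=1).mean())
print("loss:", loss)
# In practice W_enc / W_dec are trained with SGD on this loss over many real
# activations; each column of W_dec is then interpreted as one feature.
```
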
  17. ❓ Latest developments: scaling feature recovery.
    • Researchers found empirical evidence of superposition in language models.
    • Researchers managed to hone feature recovery methods on toy models.
    • Labs, most notably Anthropic, scaled feature recovery to SoTA models.
  18. ❓ Latest developments: going beyond feature recovery with circuits. Scaling feature recovery allowed researchers to map sub-networks of the LLM (called circuits) that are sensitive to inputs belonging to a given feature.
  19. ❓ Latest developments: key achievements.
    • Interpretability is becoming scalable.
    • Interpretability is becoming causal*: we can analyse models**.
    • Interpretability is becoming actionable: we are seeing attempts to steer the models.
    The goal: interpretability as an engineering tool.
  20. 🤌 Practical implications: circuit tracing applications.
    • Safety: suppress undesirable behaviour (users bypassing guardrails, or the LLM defying rules in agent harnesses).
    • Customisation: control tone, persona and reasoning more reliably than system prompts. Could this reduce token usage, or potentially replace fine-tuning?
    • Debugging / diagnostics: identify circuits responsible for errors (e.g. hallucinations), for underperformance on certain domains, or perhaps even for sensitivity to fine-tuning.
  21. 🤌 Practical implications: features and their current limits.
    • Still a post-hoc process: the number of features is derived after the model has been trained.
    • Feature subjectivity: features are labelled using automated procedures involving inference and LLM-based labelling; in other words, labels are assigned post-hoc.
    • Inactive or ambiguous features: some features are too specific, others overlap too much. This complicates how circuits are drawn.
    • Suppressed features: we don't understand what causes some other features not to fire.
  22. 🤌 Practical implications: circuits and their current limits.
    • Locality of circuits: the authors admit that analysing circuits that encompass the full model is "quite challenging".
    • Correlation vs causation ("mechanistic faithfulness"): interventions change behaviour, but may not reflect the true underlying causes.
    • Anecdotal evidence: recent findings only report episodes specific to a particular model.
  23. 🎯 Takeaways: we're just getting started.
    • Neural networks currently still operate largely as black boxes, though the field of mechanistic interpretability has accelerated significantly.
    • Building upon decades of research in interpretability, researchers found empirical evidence for key assumptions (superposition) and scaled both feature recovery and circuit tracing methods to state-of-the-art models.
    • We are starting to analyse some LLM behaviours, such as "functional emotions", using circuits, and to play around with model steering.
  24. 📚 References: a short list of readings.
    • Toy Models of Superposition (2022)
    • Towards Monosemanticity: Decomposing Language Models With Dictionary Learning (2023)
    • Scaling Monosemanticity (2024)
    • Circuit Tracing: Revealing Computational Graphs in Language Models (2025)
    • Emotion Concepts and their Function in a Large Language Model (2026)
    • How we built Steer, our interpretability playground (2026)
    • A Unified Theory of Sparse Dictionary Learning in Mechanistic Interpretability (2026)