Steer Thy Language Model

Data Science Summit

_themessier

November 21, 2025

Transcript

  1. Steer Thy Language Model - Sarah Masud, KU. Based on our previous workshop on Latent Space Navigation at D3A 2025.
  2. Human vs LLM World Modelling? • Mapping intention -> action is not always explicit. • A Large Language Model (LLM) learns only from which words accompany other words. • An LLM cannot experience “sexism”, but it can classify whether a tweet contains strong language that appears to be “sexist” in nature. Figure from https://fibertide.com/knowledge/why-are-llms-so-capable-at-understanding-language/
  3. Human vs LLM World Modelling? • Human knowledge maps to “specialised action areas” they have been “trained on”. • Every input (image, sentence, sound) is mapped into a vector in N-dimensional space. • LLM “latent spaces” are similar to these specialised knowledge areas. Figure from https://www.baeldung.com/cs/dl-latent-space.
  4. Properties of Latent Space • Hidden: not directly observable but can be “semantically” inferred. • Compressed representation. • Emergent, i.e. not explicitly programmed.
  5. Properties of Latent Space Latent spaces are vectors where meaning

    lives. Steering them = influencing outcomes.
  6. Why Does Steering Matter? • Understanding ⇾ interpretability, science, transparency. • Performance ⇾ debugging, pruning, transfer learning. • Responsibility ⇾ bias mitigation, safety, alignment, explainability. Figures from https://thesephist.com/posts/prism/; Katsaros et al., ICWSM ’22; Sun & Meng, Scientific Reports ’25. Caution: toxic content.
  7. 1-D Finite Sentiment Steering. The same logic can be extended to the D-dimensional latent space of each layer of the model; the overall effect is to steer the final generated output (see the sketch below).
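Not from the deck: a minimal sketch of that idea, assuming a toy hidden state of size D and a hypothetical unit-norm "sentiment" direction. Steering amounts to adding a scaled direction vector to the latent vector; the 1-D case is just one coordinate of this.

```python
# Minimal illustration (toy sizes, made-up names): steering a single "sentiment"
# coordinate generalises to adding a scaled D-dimensional direction vector
# to a layer's hidden state.
import numpy as np

D = 8                                       # hypothetical hidden size
hidden = np.random.randn(D)                 # latent vector at some layer
sentiment_direction = np.random.randn(D)    # stand-in for a learned direction
sentiment_direction /= np.linalg.norm(sentiment_direction)

alpha = 2.0                                 # steering strength; sign flips the direction
steered = hidden + alpha * sentiment_direction

# The steered vector moves toward the direction by exactly alpha (unit-norm direction).
print(np.dot(steered, sentiment_direction) - np.dot(hidden, sentiment_direction))
```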
  8. General Steps for Steering 1. Have access to open-weight LLMs. 2. Load or generate a “contrastive dataset”: examples of the +/- aspects you desire. 3. Embed the dataset from step 2 to obtain “steering vectors” as the difference between latent vectors. 4. Apply control in the next inference/generation via the steering vectors. 5. Visualise how positive and negative sentences receive different attention scores. (A hedged end-to-end sketch follows below.)
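A hedged sketch of steps 1-4, assuming a Hugging Face open-weight model. The model name ("gpt2"), the steered layer index, the toy contrastive sentences, and the scale value are illustrative choices, not from the deck: the steering vector is taken as the difference of mean hidden states at one layer and added back during generation via a forward hook.

```python
# Sketch of contrastive steering-vector extraction and application (assumptions noted above).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"                       # any open-weight causal LM (step 1)
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

# Step 2: a tiny contrastive dataset of +/- examples of the desired aspect.
positives = ["The movie was wonderful and uplifting.", "What a delightful day."]
negatives = ["The movie was dreadful and boring.", "What a miserable day."]

layer_idx = 6                             # which transformer block to steer (illustrative)

def mean_hidden(texts):
    # Step 3 helper: average the chosen layer's hidden states over tokens and examples.
    states = []
    for t in texts:
        ids = tok(t, return_tensors="pt")
        with torch.no_grad():
            out = model(**ids, output_hidden_states=True)
        states.append(out.hidden_states[layer_idx].mean(dim=1))   # (1, hidden_size)
    return torch.cat(states).mean(dim=0)                          # (hidden_size,)

# Step 3: steering vector = difference of mean latent vectors.
steering_vector = mean_hidden(positives) - mean_hidden(negatives)

# Step 4: add the steering vector to the layer's output during generation via a hook.
scale = 4.0                               # steering strength; negative values steer the other way

def hook(module, inputs, output):
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + scale * steering_vector.to(hidden.dtype)
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

handle = model.transformer.h[layer_idx].register_forward_hook(hook)
prompt = tok("The weather today is", return_tensors="pt")
generated = model.generate(**prompt, max_new_tokens=20, do_sample=False)
handle.remove()

print(tok.decode(generated[0], skip_special_tokens=True))
```

Increasing the scale pushes generations further toward the positive side of the contrast, and a negative scale steers the other way; the Dialz toolkit cited in the references packages this kind of steering-vector workflow.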
  9. Limitations, Risks, and Future Directions • Entanglement of latent factors → imperfect control. • Risks: malicious use (deepfakes, manipulation). • Linearity assumption. • Dataset dependence → lack of standardised benchmarks.
  10. References
    • https://github.com/LenkaTetkova/Latent-space-navigation
    • https://fibertide.com/knowledge/why-are-llms-so-capable-at-understanding-language/
    • https://www.baeldung.com/cs/dl-latent-space.
    • Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks, https://arxiv.org/abs/1511.06434
    • https://thesephist.com/posts/prism/
    • Reconsidering Tweets: Intervening during Tweet Creation Decreases Offensive Content, https://ojs.aaai.org/index.php/ICWSM/article/view/19308
    • StyDiff: A Refined Style Transfer Method Based on Diffusion Models, https://www.nature.com/articles/s41598-025-17899-x
    • Analogies Explained: Towards Understanding Word Embeddings, https://arxiv.org/abs/1901.09813
    • Effectively Steer LLM To Follow Preference via Building Confident Directions, https://arxiv.org/abs/2503.02989
    • STEER-BENCH: A Benchmark for Evaluating the Steerability of Large Language Models, https://aclanthology.org/2025.emnlp-main.925/
    • Dialz: A Python Toolkit for Steering Vectors, https://aclanthology.org/2025.acl-demo.35/
    • ToxiGen: A Large-Scale Machine-Generated Dataset for Adversarial and Implicit Hate Speech Detection, https://aclanthology.org/2022.acl-long.234
    • Proactively Reducing the Hate Intensity of Online Posts via Hate Speech Normalization, https://dl.acm.org/doi/10.1145/3534678.3539161
    • https://www.emergentmind.com/topics/steering-vectors
    • https://bobrupakroy.medium.com/steering-large-language-models-with-activation-vectors-a-practical-guide-45866b3697ac
    • Improving Instruction-Following in Language Models through Activation Steering, https://arxiv.org/abs/2410.12877
    • Language Model Alignment in Multilingual Trolley Problems, https://arxiv.org/abs/2407.02273
    • ValuesRAG: Enhancing Cultural Alignment Through Retrieval-Augmented Contextual Learning, https://arxiv.org/abs/2501.01031
    • https://www.lesswrong.com/posts/ndyngghzFY388Dnew/implementing-activation-steering