Steer Thy Language Model

Data Science Summit

_themessier

November 21, 2025

Transcript

  1. Steer Thy Language Model - Sarah Masud, KU. Based on our previous workshop on Latent Space Navigation at D3A 2025.
  2. Human vs LLM World Modelling? • Mapping intention -> action is not always explicit. • A Large Language Model (LLM) learns only from which words accompany other words. • An LLM cannot experience “sexism”, but it can classify whether a tweet contains strong language that appears to be “sexist” in nature. Figure from https://fibertide.com/knowledge/why-are-llms-so-capable-at-understanding-language/
  3. Human vs LLM World Modelling? • Human knowledge maps to “specialised action areas” they have been “trained on”. • Every input (image, sentence, sound) is mapped into a vector in N-dimensional space. • LLM “latent spaces” are similar to these specialised knowledge areas. Figure from https://www.baeldung.com/cs/dl-latent-space.
  4. Properties of Latent Space • Hidden: not directly observable but can be “semantically” inferred. • Compressed representation. • Emergent, i.e. not explicitly programmed.
  5. Properties of Latent Space Latent spaces are vectors where meaning

    lives. Steering them = influencing outcomes.
  6. Why Does Steering Matter? • Understanding ⇾ interpretability, science, transparency. • Performance ⇾ debugging, pruning, transfer learning. • Responsibility ⇾ bias mitigation, safety, alignment, explainability. Figures from https://thesephist.com/posts/prism/; Katsaros et al., ICWSM ’22; Sun & Meng, Scientific Reports ’25. Caution: toxic content.
  7. 1-D Finite Sentiment Steering. The same logic can be extended to the D-dimensional latent space of each layer of the model; the overall effect is to steer the final generated output (see the sketch below).
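Not from the deck: a minimal sketch of that idea, assuming a toy hidden state of size D and a hypothetical unit-norm "sentiment" direction. Steering amounts to adding a scaled direction vector to the latent vector; the 1-D case is just one coordinate of this.

```python
# Minimal illustration (toy sizes, made-up names): steering a single "sentiment"
# coordinate generalises to adding a scaled D-dimensional direction vector
# to a layer's hidden state.
import numpy as np

D = 8                                       # hypothetical hidden size
hidden = np.random.randn(D)                 # latent vector at some layer
sentiment_direction = np.random.randn(D)    # stand-in for a learned direction
sentiment_direction /= np.linalg.norm(sentiment_direction)

alpha = 2.0                                 # steering strength; sign flips the direction
steered = hidden + alpha * sentiment_direction

# The steered vector moves toward the direction by exactly alpha (unit-norm direction).
print(np.dot(steered, sentiment_direction) - np.dot(hidden, sentiment_direction))
```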
  8. General Steps for Steering 1. Have access to open-weight LLMs. 2. Load or generate a “contrastive dataset”: examples of the +/- aspects you desire. 3. Embed the dataset from step 2 to obtain “steering vectors” as the difference between latent vectors. 4. Apply control in the next inference/generation via the steering vectors. 5. Visualise how positive and negative sentences receive different attention scores. (A hedged end-to-end sketch follows below.)
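A hedged sketch of steps 1-4, assuming a Hugging Face open-weight model. The model name ("gpt2"), the steered layer index, the toy contrastive sentences, and the scale value are illustrative choices, not from the deck: the steering vector is taken as the difference of mean hidden states at one layer and added back during generation via a forward hook.

```python
# Sketch of contrastive steering-vector extraction and application (assumptions noted above).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"                       # any open-weight causal LM (step 1)
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

# Step 2: a tiny contrastive dataset of +/- examples of the desired aspect.
positives = ["The movie was wonderful and uplifting.", "What a delightful day."]
negatives = ["The movie was dreadful and boring.", "What a miserable day."]

layer_idx = 6                             # which transformer block to steer (illustrative)

def mean_hidden(texts):
    # Step 3 helper: average the chosen layer's hidden states over tokens and examples.
    states = []
    for t in texts:
        ids = tok(t, return_tensors="pt")
        with torch.no_grad():
            out = model(**ids, output_hidden_states=True)
        states.append(out.hidden_states[layer_idx].mean(dim=1))   # (1, hidden_size)
    return torch.cat(states).mean(dim=0)                          # (hidden_size,)

# Step 3: steering vector = difference of mean latent vectors.
steering_vector = mean_hidden(positives) - mean_hidden(negatives)

# Step 4: add the steering vector to the layer's output during generation via a hook.
scale = 4.0                               # steering strength; negative values steer the other way

def hook(module, inputs, output):
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + scale * steering_vector.to(hidden.dtype)
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

handle = model.transformer.h[layer_idx].register_forward_hook(hook)
prompt = tok("The weather today is", return_tensors="pt")
generated = model.generate(**prompt, max_new_tokens=20, do_sample=False)
handle.remove()

print(tok.decode(generated[0], skip_special_tokens=True))
```

Increasing the scale pushes generations further toward the positive side of the contrast, and a negative scale steers the other way; the Dialz toolkit cited in the references packages this kind of steering-vector workflow.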
  9. Limitations, Risks, and Future Directions • Entanglement of latent factors → imperfect control. • Risks: malicious use (deepfakes, manipulation). • Linearity assumption. • Dataset dependence → lack of standardised benchmarks.
  10. References
    • https://github.com/LenkaTetkova/Latent-space-navigation
    • https://fibertide.com/knowledge/why-are-llms-so-capable-at-understanding-language/
    • https://www.baeldung.com/cs/dl-latent-space.
    • Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks, https://arxiv.org/abs/1511.06434
    • https://thesephist.com/posts/prism/
    • Reconsidering Tweets: Intervening during Tweet Creation Decreases Offensive Content, https://ojs.aaai.org/index.php/ICWSM/article/view/19308
    • StyDiff: A Refined Style Transfer Method Based on Diffusion Models, https://www.nature.com/articles/s41598-025-17899-x
    • Analogies Explained: Towards Understanding Word Embeddings, https://arxiv.org/abs/1901.09813
    • Effectively Steer LLM To Follow Preference via Building Confident Directions, https://arxiv.org/abs/2503.02989
    • STEER-BENCH: A Benchmark for Evaluating the Steerability of Large Language Models, https://aclanthology.org/2025.emnlp-main.925/
    • Dialz: A Python Toolkit for Steering Vectors, https://aclanthology.org/2025.acl-demo.35/
    • ToxiGen: A Large-Scale Machine-Generated Dataset for Adversarial and Implicit Hate Speech Detection, https://aclanthology.org/2022.acl-long.234
    • Proactively Reducing the Hate Intensity of Online Posts via Hate Speech Normalization, https://dl.acm.org/doi/10.1145/3534678.3539161
    • https://www.emergentmind.com/topics/steering-vectors
    • https://bobrupakroy.medium.com/steering-large-language-models-with-activation-vectors-a-practical-guide-45866b3697ac
    • Improving Instruction-Following in Language Models through Activation Steering, https://arxiv.org/abs/2410.12877
    • Language Model Alignment in Multilingual Trolley Problems, https://arxiv.org/abs/2407.02273
    • ValuesRAG: Enhancing Cultural Alignment Through Retrieval-Augmented Contextual Learning, https://arxiv.org/abs/2501.01031
    • https://www.lesswrong.com/posts/ndyngghzFY388Dnew/implementing-activation-steering