is not always explicit.
• A Large Language Model (LLM) learns only from how words accompany other words.
• An LLM cannot experience “sexism”, but it can classify whether a tweet contains strong language that appears to be “sexist” in nature.
Figure from https://fibertide.com/knowledge/why-are-llms-so-capable-at-understanding-language/
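As a rough illustration of "learning from words that accompany other words", the sketch below counts which words share a sentence. The corpus and the function name `cooccurrence_counts` are invented for this example; a real LLM learns far richer statistics than these raw counts.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy corpus, invented for illustration only.
corpus = [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "the cat chased the dog",
]

def cooccurrence_counts(sentences):
    """Count how often two distinct words appear in the same sentence."""
    counts = Counter()
    for sentence in sentences:
        # Sort the unique words so each pair has one canonical key order.
        words = sorted(set(sentence.split()))
        for first, second in combinations(words, 2):
            counts[(first, second)] += 1
    return counts

counts = cooccurrence_counts(corpus)
print(counts[("cat", "the")])  # sentences containing both "cat" and "the"
print(counts[("cat", "rug")])  # these two words never share a sentence
```

Words that never appear together get a count of zero, which is the only sense in which this toy model "knows" they are unrelated.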
“specialised action areas” they have been “trained on”.
• Every input (image, sentence, sound) is mapped into a vector in N-dimensional space.
• LLM “latent spaces” are similar to these specialised knowledge areas.
Figure from https://www.baeldung.com/cs/dl-latent-space.
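A minimal sketch of mapping inputs into an N-dimensional space, assuming a simple word-count encoding as a stand-in for a learned encoder. The vocabulary, the sentences and the helper names `embed` and `cosine_similarity` are hypothetical; a trained model would produce dense vectors with many more dimensions.

```python
import math

def embed(sentence, vocabulary):
    """Map a sentence to a count vector with one dimension per vocabulary word."""
    words = sentence.lower().split()
    return [float(words.count(term)) for term in vocabulary]

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical 5-dimensional space: one axis per word.
vocabulary = ["cat", "dog", "pet", "car", "road"]

pets_a = embed("my cat is a pet", vocabulary)
pets_b = embed("my dog is a pet", vocabulary)
traffic = embed("the car is on the road", vocabulary)

same_area = cosine_similarity(pets_a, pets_b)    # the two pet sentences share the "pet" axis
other_area = cosine_similarity(pets_a, traffic)  # no shared axes
print(same_area, other_area)
```

Sentences about the same topic land closer together (higher cosine similarity) than sentences about unrelated topics, which is the property the "specialised areas" analogy relies on.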
2. Load or generate a “contrastive dataset”: examples of the positive and negative aspects you desire.
3. Embed the dataset from step 2 to obtain “steering vectors” as the difference in latent vectors.
4. Set the control for the next inference/generation via the steering vectors.
5. Visualise how positive and negative sentences receive different attention scores.
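Steps 2 to 4 can be sketched with plain vectors. This is a toy illustration under stated assumptions, not a definitive implementation: the two-dimensional "latent vectors" are hand-written stand-ins for real model activations, and the names `mean_vector`, `steering_vector`, `steer` and `strength` are hypothetical. The steering vector is taken here as the difference between the mean positive and mean negative latent vectors, which is one common way to compute it.

```python
def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]

def steering_vector(positive, negative):
    """Difference between the mean positive and mean negative latent vectors."""
    pos_mean = mean_vector(positive)
    neg_mean = mean_vector(negative)
    return [p - n for p, n in zip(pos_mean, neg_mean)]

def steer(hidden, direction, strength):
    """Shift a latent vector along the steering direction, scaled by strength."""
    return [h + strength * d for h, d in zip(hidden, direction)]

# Step 2: hypothetical contrastive dataset, already embedded (step 3, first half).
# In practice these would be activations read from a model, not hand-written numbers.
positive_latents = [[1.0, 0.5], [1.0, 0.0]]    # examples of the desired aspect
negative_latents = [[-1.0, 0.5], [-1.0, 0.0]]  # examples of the undesired aspect

# Step 3: steering vector as the difference in latent vectors.
direction = steering_vector(positive_latents, negative_latents)

# Step 4: apply the control to a new latent vector during generation.
hidden = [0.0, 1.0]
steered = steer(hidden, direction, strength=0.5)
print(direction, steered)
```

The `strength` parameter controls how far the latent vector is pushed; a negative value would push towards the undesired aspect instead.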
Learning with Deep Convolutional Generative Adversarial Networks, https://arxiv.org/abs/1511.06434
• https://thesephist.com/posts/prism/
• Reconsidering Tweets: Intervening during Tweet Creation Decreases Offensive Content, https://ojs.aaai.org/index.php/ICWSM/article/view/19308
• StyDiff: a refined style transfer method based on diffusion models, https://www.nature.com/articles/s41598-025-17899-x
• Analogies Explained: Towards Understanding Word Embeddings, https://arxiv.org/abs/1901.09813
• Effectively Steer LLM To Follow Preference via Building Confident Directions, https://arxiv.org/abs/2503.02989
• STEER-BENCH: A Benchmark for Evaluating the Steerability of Large Language Models, https://aclanthology.org/2025.emnlp-main.925/
• Dialz: A Python Toolkit for Steering Vectors, https://aclanthology.org/2025.acl-demo.35/
• ToxiGen: A Large-Scale Machine-Generated Dataset for Adversarial and Implicit Hate Speech Detection, https://aclanthology.org/2022.acl-long.234
• Proactively Reducing the Hate Intensity of Online Posts via Hate Speech Normalization, https://dl.acm.org/doi/10.1145/3534678.3539161
• https://www.emergentmind.com/topics/steering-vectors
• https://bobrupakroy.medium.com/steering-large-language-models-with-activation-vectors-a-practical-guide-45866b3697ac
• Improving Instruction-Following in Language Models through Activation Steering, https://arxiv.org/abs/2410.12877
• Language Model Alignment in Multilingual Trolley Problems, https://arxiv.org/abs/2407.02273
• ValuesRAG: Enhancing Cultural Alignment Through Retrieval-Augmented Contextual Learning, https://arxiv.org/abs/2501.01031
• https://www.lesswrong.com/posts/ndyngghzFY388Dnew/implementing-activation-steering