Abliteration is a technique for removing an LLM's guardrails (its learned refusal behaviour) by editing the model's weights directly rather than retraining it. Here I show how to make Llama 3.1 'less kind and good' when asked about explosives, financial restructuring advice, rude jokes and security vulnerabilities. The question I'm interested in is this: whilst guardrails stop us asking 'awkward questions', what other answers are watered down to the point that we don't get useful responses?
This came out of my playgroup research days: https://www.linkedin.com/feed/update/urn:li:activity:7396293087674933248/