LW - [Paper] Programming Refusal with Conditional Activation Steering by Bruce W. Lee

The Nonlinear Library

A tartalmat a The Nonlinear Fund biztosítja. Az összes podcast-tartalmat, beleértve az epizódokat, grafikákat és podcast-leírásokat, közvetlenül a The Nonlinear Fund vagy a podcast platform partnere tölti fel és biztosítja. Ha úgy gondolja, hogy valaki az Ön engedélye nélkül használja fel a szerzői joggal védett művét, kövesse az itt leírt folyamatot https://hu.player.fm/legal.

6d ago 18:30

MP3•Epizód kép

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: [Paper] Programming Refusal with Conditional Activation Steering, published by Bruce W. Lee on September 12, 2024 on LessWrong.
For full content, refer to the arXiv preprint at https://arxiv.org/abs/2409.05907. This post is a lighter, 15-minute version.
Abstract
Existing activation steering methods alter LLM behavior indiscriminately, limiting their practical applicability in settings where selective responses are essential, such as content moderation or domain-specific assistants.
We propose Conditional Activation Steering (CAST), which analyzes LLM activation patterns during inference to selectively apply or withhold activation steering based on the input context.
Using CAST, one can systematically control LLM behavior with rules like "if input is about hate speech or adult content, then refuse" or "if input is not about legal advice, then refuse."
This allows for selective modification of responses to specific content while maintaining normal responses to other content, all without requiring weight optimization.
We release an open-source implementation of the activation steering toolkit at https://github.com/IBM/activation-steering.
Introduction
Problem: Lack of conditional control in activation steering.
Activation steering offers a promising alternative to optimization-based techniques by directly manipulating the model's native representations, often requiring only a simple activation addition step during each forward call. Our work here builds on Refusal in LLMs is mediated by a single direction, which has shown promise in altering LLM behavior, such as removing or inducing refusal behavior. However, the key limitation of current methods is the inability to condition when and what to refuse.
That is, adding a "refusal vector" using existing activation steering methods increases refusal rates indiscriminately across all inputs, limiting the model's utility.
Contribution: Expanding activation steering formulation.
We introduce Conditional Activation Steering (CAST), a method that enables fine-grained, context-dependent control over LLM behaviors. We introduce a new type of steering vector in the activation steering formulation, the condition vector, representing certain activation patterns induced by the prompt during the inference process.
A simple similarity calculation between this condition vector and the model's activation at inference time effectively serves as a switch, determining whether to apply the refusal vector. This approach allows for selective refusal of harmful prompts while maintaining the ability to respond to harmless ones, as depicted below.
Application: Selecting what to refuse.
Many alignment goals concern contextually refusing specific classes of instructions. Traditional methods like preference modeling are resource-intensive and struggle with subjective, black-box rewards. Additionally, the definition of harmful content varies across contexts, complicating the creation of universal harm models.
The usage context further complicates this variability; for instance, discussing medical advice might be harmful in some situations but essential in others, such as in medical chatbots.
We show CAST can implement behavioral rules like "if input is about hate speech or adult content, then refuse" or "if input is not about legal advice, then refuse", allowing
for selective modification of responses to specific content without weight optimization.
On a technical level, our primary insight is that different prompts consistently activate distinct patterns in the model's hidden states during inference. These patterns can be extracted as a steering vector and used as reference points for detecting specific prompt categories or contexts. This observation allows us to use steering vectors not only as behavior modification mechanisms but also as condition ...

2443 epizódok

#Podcasting Education #The Nonlinear Fund