Improving Activation Steering
Gated Cropped Attention-Delta Steering fixes KV-cache contamination in multi-turn dialogue
This page describes the paper "Prompt–Activation Duality: Improving Activation Steering via Attention-Level Interventions", submitted to NeurIPS 2026.
Activation steering controls language model behavior by adding a direction to the residual stream at inference time, which makes it a lightweight, reversible alternative to fine-tuning. But standard residual-stream steering has a hidden failure mode in stateful, multi-turn dialogue: steered token states are stored in the KV cache and reused on every later turn, turning a local perturbation into cumulative coherence degradation across the conversation.
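The caching effect can be illustrated with a toy, pure-Python sketch. It is not the paper's experimental setup: it assumes a single attention head with identity key and value projections, and the names `run_with_cache`, `steer_vec`, `alpha`, and `steered_positions` are hypothetical. It only shows that once steered states sit in the cache, a later token that was never steered still reads them.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Single-head dot-product attention over the cached keys and values."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(len(query))]

def run_with_cache(states, steer_vec, alpha, steered_positions):
    """Process tokens left to right, caching each (possibly steered) state.

    Toy assumption: keys and values are the token state itself.
    """
    cache_k, cache_v, outputs = [], [], []
    for pos, h in enumerate(states):
        if pos in steered_positions:
            # Residual-stream steering: add a scaled direction to the state.
            h = [x + alpha * s for x, s in zip(h, steer_vec)]
        cache_k.append(h)  # the steered state is what gets stored and reused
        cache_v.append(h)
        outputs.append(attend(h, cache_k, cache_v))
    return outputs

states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
steer_vec = [1.0, 0.0]

clean = run_with_cache(states, steer_vec, alpha=0.0, steered_positions=set())
steered = run_with_cache(states, steer_vec, alpha=2.0, steered_positions={0, 1})

# Position 3 is never steered, yet its attention output changes
# because it attends over cached states from positions 0 and 1.
drift = max(abs(a - b) for a, b in zip(clean[3], steered[3]))
print(f"max change at the unsteered position: {drift:.3f}")
```

In a real model the cached keys and values are learned projections of the residual stream and the effect passes through many layers, so this sketch says nothing about the size of the degradation, only about where it enters.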
We identify this failure mode as KV-cache contamination and show that coherence deteriorates across turns even when single-turn behavior looks strong. Crucially, prompt-only control remains stable under the same protocol — so the problem is not long context alone. The intervention is entering the computation at the wrong site.
To address this, we propose Gated Cropped Attention-Delta steering (GCAD), which extracts steering signals from system-prompt contributions to self-attention and applies them with token-level gating. Rather than injecting a large residual-stream perturbation after attention and MLP computation have been combined, GCAD introduces smaller attention-level perturbations that subsequent layers can transform and integrate — following the same pathways through which system prompts already exercise behavioral control.
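One possible reading of this description is sketched below in pure Python. It is an assumption-laden illustration, not the authors' implementation. The "attention delta" is taken to be the attention output with the system prompt in context minus the output without it, and "gating" is taken to be a per-token scalar in [0, 1]. The function names `attention_delta` and `apply_gated`, the parameters `gate` and `alpha`, and the identity key/value projections are all hypothetical.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Single-head dot-product attention (toy: no learned projections)."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(len(query))]

def attention_delta(query, sys_states, ctx_states):
    """Return (delta, base): the system prompt's contribution to the
    attention output, and the output computed without the system prompt."""
    full = sys_states + ctx_states
    with_sys = attend(query, full, full)
    base = attend(query, ctx_states, ctx_states)
    delta = [a - b for a, b in zip(with_sys, base)]
    return delta, base

def apply_gated(attn_out, delta, gate, alpha=1.0):
    """Add the delta to one token's attention output, scaled by a gate in [0, 1]."""
    return [o + alpha * gate * d for o, d in zip(attn_out, delta)]

sys_states = [[2.0, 0.0]]              # stands in for system-prompt tokens
ctx_states = [[0.0, 1.0], [1.0, 1.0]]  # stands in for the dialogue context
query = [1.0, 0.5]

delta, base = attention_delta(query, sys_states, ctx_states)
off = apply_gated(base, delta, gate=0.0)  # gate closed: output unchanged
on = apply_gated(base, delta, gate=1.0)   # gate open: matches the prompted output
```

With the gate fully open and `alpha=1.0`, the steered output equals what the attention layer would have produced had the system prompt been present, which is the sense in which the intervention follows a prompt-mediated pathway in this toy. How the gate is computed per token is not specified here and is left as a plain input.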
On persona-steering experiments with Qwen2.5-7B-Instruct:
- Average coherence drift: improved from −18.6 to −1.9 (standard steering vs. GCAD)
- Turn-10 trait expression: raised from 78.0% to 93.1%
- Trait control is preserved while long-horizon stability improves substantially
Mechanistic analysis shows that GCAD produces smoother perturbation trajectories that better align with downstream computation, while standard residual-stream steering elicits downstream resistance. The results suggest that activation steering becomes reliably usable in production settings only when interventions follow the prompt-mediated pathways that models already use for behavioral control.