2D Vector Space Visualization
Step 1: The Problem
Goal: We want to modify a model's weights to remove unwanted behavior (like refusal), but we don't want to break its useful capabilities (like math, coding, reasoning).
The Challenge: A naive weight modification ΔW might accidentally affect the outputs for inputs we care about preserving.
Solution: Project ΔW onto the null space of the preservation activations, so the modification has zero effect on the inputs we want to preserve.
Step 2: Understanding Null Space
The null space of a matrix K contains all vectors x where Kx = 0.
In the 2D visualization:
- The green line shows the "row space" of K (directions that K responds to)
- The blue region shows the "null space" (perpendicular to K)
- Any vector in the null space, when multiplied by K, gives zero!
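To make that last point concrete, here is a minimal numpy check (the specific vectors are just for illustration):

```python
import numpy as np

K = np.array([[1.0, 0.0]])     # K "responds" only to the x-direction (its row space)

x_row = np.array([3.0, 0.0])   # lies in the row space of K
x_null = np.array([0.0, 2.0])  # perpendicular to K's row -> in the null space

print(K @ x_row)   # [3.] -- K responds to this vector
print(K @ x_null)  # [0.] -- a null-space vector, multiplied by K, gives zero
```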
Step 3: Projection & Result
We decompose the original update ΔW into two parts:
- Row space component: Part that affects preservation inputs (we remove this)
- Null space component: Part with zero effect on preservation (we keep this)
The projected update ΔW' = ΔW minus its row-space component.
Result:
- Preservation guaranteed: K · ΔW' = 0
- Some modification lost: |ΔW'| ≤ |ΔW|
Trade-off: The more aligned your update is with preservation directions, the more gets removed. In practice, refusal behavior often lives in directions somewhat orthogonal to general capabilities, so we can remove most of it while preserving capabilities!
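Here is a small numpy sketch of the decomposition, with a single preservation direction standing in for K (all values are made up for illustration):

```python
import numpy as np

K = np.array([[1.0, 0.0]])            # preservation direction (spans the row space of K)
dW = np.array([0.6, 0.8])             # original update (hypothetical values)

k_hat = K[0] / np.linalg.norm(K[0])   # unit vector along the preservation direction
row_part = (dW @ k_hat) * k_hat       # component K responds to -> removed
dW_proj = dW - row_part               # null-space component -> kept (this is ΔW')

print(K @ dW_proj)                                    # [0.] -- preservation guaranteed
print(np.linalg.norm(dW_proj) <= np.linalg.norm(dW))  # True -- some magnitude is lost
```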
Mathematical Formulation
Given preservation activations K ∈ ℝⁿˣᵈ (n samples, d dimensions), take the singular value decomposition:
K = UΣVᵀ
P_null = I − VᵣVᵣᵀ
(where Vᵣ contains the r most significant right singular vectors of K)
ΔW' = P_null · ΔW
K · ΔW' = K · (I − VᵣVᵣᵀ) · ΔW = (K − K · VᵣVᵣᵀ) · ΔW ≈ 0
Why it works: The columns of Vᵣ span the dominant directions of the row space of K, so K · VᵣVᵣᵀ ≈ K (exactly, when r equals the rank of K). Multiplying by P_null therefore removes every component of ΔW that the preservation activations respond to, leaving only the null-space component.
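A quick numerical sanity check of this formulation (the synthetic K, its rank, and the dimensions are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, r = 40, 16, 8
K = rng.normal(size=(n, r)) @ rng.normal(size=(r, d))  # rank-r preservation activations
dW = rng.normal(size=(d, d))                           # arbitrary update

_, _, Vt = np.linalg.svd(K, full_matrices=False)
V_r = Vt[:r].T                    # columns span the row space of K
P_null = np.eye(d) - V_r @ V_r.T  # projector onto the null space

dW_proj = P_null @ dW             # ΔW' = P_null · ΔW
print(np.abs(K @ dW_proj).max())  # ~1e-13 -- K · ΔW' ≈ 0
```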
Implementation
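A minimal numpy sketch of the two building blocks, using the rank_ratio parameter discussed under practical considerations below (the function names are illustrative, not a specific library's API):

```python
import numpy as np

def null_space_projector(K: np.ndarray, rank_ratio: float = 0.95) -> np.ndarray:
    """Build P_null = I - V_r V_r^T for preservation activations K (n x d).

    rank_ratio picks the smallest rank r whose singular values account for
    that fraction of K's total variance (sum of squared singular values).
    """
    _, S, Vt = np.linalg.svd(K, full_matrices=False)
    energy = np.cumsum(S**2) / np.sum(S**2)
    r = int(np.searchsorted(energy, rank_ratio)) + 1
    V_r = Vt[:r].T                           # top-r right singular vectors as columns
    return np.eye(K.shape[1]) - V_r @ V_r.T

def project_update(dW: np.ndarray, P_null: np.ndarray) -> np.ndarray:
    """Strip the row-space component of dW, so that K @ (P_null @ dW) ≈ 0."""
    return P_null @ dW
```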
Application to Model Abliteration
In the context of removing refusal behavior from language models:
| Component | In Demo | In Abliteration |
|---|---|---|
| K | Preservation vector | Activations from math, coding, reasoning prompts |
| ΔW | Original update | Refusal direction projection |
| ΔW' | Projected update | Safe refusal removal (preserves capabilities) |
| K · ΔW' = 0 | Dot product is zero | Math/coding outputs unchanged |
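For concreteness, one common form the "original update" takes is directional ablation, which erases a unit refusal direction from a layer's outputs. A hypothetical sketch, with random stand-ins for the weights and the direction:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 16
W = rng.normal(size=(d, d))       # a layer's weight matrix (row convention: y = x @ W)
r_hat = rng.normal(size=d)
r_hat /= np.linalg.norm(r_hat)    # unit refusal direction (random stand-in)

dW = -W @ np.outer(r_hat, r_hat)  # directional ablation: erases r_hat from the outputs
# Before applying, project with P_null so preservation activations are unaffected:
#   dW_safe = P_null @ dW   ->   K @ dW_safe ≈ 0
```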
Practical considerations:
- Use diverse preservation prompts (35+ covering math, coding, reasoning, etc.)
- rank_ratio of 0.95 keeps most capability variance while allowing some modification
- Higher rank_ratio = more aggressive preservation (but less refusal removal)
- Compute P_null once per layer and reuse it for all weight matrices in that layer (see the sketch below)
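Putting the last two points together, reusing null_space_projector and project_update from the implementation sketch above (the per-layer data here is a hypothetical stand-in):

```python
import numpy as np

rng = np.random.default_rng(2)
d = 16
K_layer = rng.normal(size=(60, d))                  # preservation activations for one layer
updates = {"attn.o_proj": rng.normal(size=(d, d)),  # hypothetical per-matrix updates
           "mlp.down_proj": rng.normal(size=(d, d))}

P_null = null_space_projector(K_layer, rank_ratio=0.95)  # computed once for the layer...
for name in updates:                                     # ...reused for every weight matrix
    updates[name] = project_update(updates[name], P_null)
```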