Null Space Projection

Interactive visualization of how null space constraints preserve capabilities during model modification

2D Vector Space Visualization

[Interactive plot: the preservation vector K, the original update ΔW, the projected update ΔW', the row space of K, and the null space of K, with sliders for the components of K and ΔW.]

Worked example at the default slider values, K = [0.80, 0.30] and ΔW = [0.60, 0.70]:

K · ΔW (before projection):
  = (0.80 × 0.60) + (0.30 × 0.70)
  = 0.48 + 0.21
  = 0.69

Projection: ΔW' = ΔW − [(K·ΔW) / |K|²] × K
  |K|² = 0.73, so the scale factor is 0.69 / 0.73 ≈ 0.95
  ΔW' = [0.60, 0.70] − [0.76, 0.28] = [-0.16, 0.42]

K · ΔW' (after projection):
  = (0.80 × -0.16) + (0.30 × 0.42)
  = -0.128 + 0.126
  ≈ 0 ✓
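The worked example above can be checked in a few lines of plain Python:

```python
# Reproduce the 2D demo: project dW onto the null space of K.
K = [0.80, 0.30]   # preservation vector
dW = [0.60, 0.70]  # original update

dot = K[0] * dW[0] + K[1] * dW[1]  # K · ΔW = 0.69
k_sq = K[0] ** 2 + K[1] ** 2       # |K|² = 0.73
scale = dot / k_sq                 # ≈ 0.95

# ΔW' = ΔW − [(K·ΔW) / |K|²] × K
dW_proj = [dW[0] - scale * K[0], dW[1] - scale * K[1]]
print(dW_proj)  # ≈ [-0.16, 0.42]

# Verify: K · ΔW' ≈ 0
residual = K[0] * dW_proj[0] + K[1] * dW_proj[1]
print(abs(residual) < 1e-12)  # True
```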

Step 1: The Problem

Goal: We want to modify a model's weights to remove unwanted behavior (like refusal), but we don't want to break its useful capabilities (like math, coding, reasoning).

The Challenge: A naive weight modification ΔW might accidentally affect the outputs for inputs we care about preserving.

Solution: Project ΔW into the null space of the preservation activations, ensuring the modification has zero effect on preserved capabilities.

Step 2: Understanding Null Space

The null space of a matrix K contains all vectors x where Kx = 0.

In 2D visualization:

  • The green line shows the "row space" of K (directions that K responds to)
  • The blue region shows the "null space" (perpendicular to K)
  • Any vector in the null space, when multiplied by K, gives zero!
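For example, taking the demo's K = [0.80, 0.30], the perpendicular vector [-0.30, 0.80] lies in K's null space:

```python
# A vector perpendicular to K = [0.80, 0.30] lies in K's null space.
K = [0.80, 0.30]
x = [-0.30, 0.80]  # perpendicular to K: swap components, negate one

Kx = K[0] * x[0] + K[1] * x[1]  # (0.80 × -0.30) + (0.30 × 0.80)
print(Kx)  # -0.24 + 0.24 = 0.0
```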

Step 3: Projection & Result

We decompose the original update ΔW into two parts:

  • Row space component: Part that affects preservation inputs (we remove this)
  • Null space component: Part with zero effect on preservation (we keep this)

The projected update ΔW' = ΔW minus its row-space component.

Result:

  • Preservation guaranteed: K · ΔW' = 0
  • Some modification lost: |ΔW'| ≤ |ΔW|

Trade-off: The more aligned your update is with preservation directions, the more gets removed. In practice, refusal behavior often lives in directions somewhat orthogonal to general capabilities, so we can remove most of it while preserving capabilities!
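The decomposition can be checked numerically with the demo's 2D values (the same split works in any dimension):

```python
# Decompose dW into a row-space part (removed) and a null-space part (kept).
K = [0.80, 0.30]
dW = [0.60, 0.70]

scale = (K[0] * dW[0] + K[1] * dW[1]) / (K[0] ** 2 + K[1] ** 2)
row_part = [scale * K[0], scale * K[1]]                 # affects preservation inputs
null_part = [dW[0] - row_part[0], dW[1] - row_part[1]]  # ΔW': zero effect on K

# The two parts reassemble the original update exactly:
print([r + n for r, n in zip(row_part, null_part)])  # ≈ [0.6, 0.7]

# And the kept part is never longer than the original: |ΔW'| ≤ |ΔW|
norm = lambda v: (v[0] ** 2 + v[1] ** 2) ** 0.5
print(norm(null_part) <= norm(dW))  # True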

Mathematical Formulation

Given preservation activations K ∈ ℝⁿˣᵈ (n samples, d dimensions):

1. Compute the SVD of K:
K = U Σ Vᵀ
2. Build the null space projector:
P_null = I − V_r V_rᵀ
(where V_r contains the r significant right singular vectors)
3. Project the update into the null space:
ΔW' = ΔW · P_null
4. Verify preservation (P_null is symmetric, so it can act on either side of the update):
K · ΔW' = K · (I − V Vᵀ) · ΔW = K · ΔW − K V Vᵀ · ΔW ≈ 0, since K V Vᵀ = U Σ Vᵀ V Vᵀ = U Σ Vᵀ = K

Why it works: The columns of V (the right singular vectors) span the row space of K. Subtracting V Vᵀ removes every component in the row space, leaving only the null space component.
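A quick numerical check of this argument (a NumPy sketch with arbitrary shapes): the projector P_null = I − V Vᵀ is symmetric and idempotent, and it annihilates every row of K.

```python
import numpy as np

rng = np.random.default_rng(0)
K = rng.standard_normal((5, 8))  # 5 preservation samples, 8 dimensions

# Rows of Vh (columns of V) span the row space of K.
U, S, Vh = np.linalg.svd(K, full_matrices=False)
P_null = np.eye(8) - Vh.T @ Vh

print(np.allclose(P_null @ P_null, P_null))  # idempotent: True
print(np.allclose(P_null, P_null.T))         # symmetric: True
print(np.allclose(K @ P_null, 0))            # K has no null-space component: True
```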

Implementation

import torch

def compute_null_space_projector(K):
    """
    Compute projector onto null space of K.

    Args:
        K: Preservation activations [n_samples, d_model]

    Returns:
        P_null: Projector matrix [d_model, d_model]
    """
    # Compute SVD (no centering - we want exact null space of K)
    U, S, Vh = torch.linalg.svd(K, full_matrices=False)

    # Use all right singular vectors (rows of Vh) as row space basis;
    # the columns of V_r span the row space of K
    V_r = Vh.T  # [d_model, rank] where rank = min(n_samples, d_model)

    # Null space projector: I - V_r @ V_r.T
    # Projects onto the orthogonal complement of the row space
    P_null = torch.eye(K.shape[1], device=K.device, dtype=K.dtype) - V_r @ V_r.T
    return P_null

def apply_null_space_projection(delta_W, P_null):
    """Project weight update into null space."""
    return delta_W @ P_null

# ===========================================
# Runnable example with synthetic data:
# ===========================================

# Simulate preservation activations (e.g., from math/coding prompts)
# Shape: [n_samples, hidden_dim]
n_samples, hidden_dim = 50, 128
K = torch.randn(n_samples, hidden_dim)

# Compute the null space projector
P_null = compute_null_space_projector(K)

# Simulate a weight update we want to apply (e.g., refusal direction)
delta_W = torch.randn(hidden_dim)

# Project into null space (safe update)
delta_W_safe = apply_null_space_projection(delta_W, P_null)

# Verify: K @ delta_W_safe should be ~0 for all samples
effect_on_preservation = K @ delta_W_safe
print(f"Max effect on preservation inputs: {effect_on_preservation.abs().max():.2e}")
# Output: Max effect on preservation inputs: ~1e-6 (effectively zero!)

Application to Model Abliteration

In the context of removing refusal behavior from language models:

Component    | In Demo             | In Abliteration
-------------|---------------------|--------------------------------------------------
K            | Preservation vector | Activations from math, coding, reasoning prompts
ΔW           | Original update     | Refusal direction projection
ΔW'          | Projected update    | Safe refusal removal (preserves capabilities)
K · ΔW' = 0  | Dot product is zero | Math/coding outputs unchanged

Practical considerations:

  • Use diverse preservation prompts (35+ covering math, coding, reasoning, etc.)
  • rank_ratio of 0.95 keeps most capability variance while allowing some modification
  • Lower rank_ratio = more aggressive preservation (but less refusal removal)
  • Compute P_null once per layer, reuse for all weight matrices in that layer
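The rank_ratio knob mentioned above can be sketched by truncating the SVD before building the projector. This is a sketch under the assumption that rank_ratio selects singular vectors by cumulative squared-singular-value mass; the exact rule in any given abliteration codebase may differ:

```python
import torch

def compute_null_space_projector(K, rank_ratio=0.95):
    """Projector onto the complement of the top singular directions of K.

    rank_ratio (assumed semantics): fraction of squared-singular-value
    mass to treat as the preserved row space. 1.0 preserves every
    direction; lower values preserve less but allow a larger update.
    """
    U, S, Vh = torch.linalg.svd(K, full_matrices=False)
    energy = torch.cumsum(S**2, dim=0) / torch.sum(S**2)
    r = min(int((energy < rank_ratio).sum().item()) + 1, S.numel())
    V_r = Vh[:r].T  # top-r right singular vectors span the preserved subspace
    return torch.eye(K.shape[1], dtype=K.dtype) - V_r @ V_r.T

K = torch.randn(50, 128)
delta_W = torch.randn(128)

P_strict = compute_null_space_projector(K, rank_ratio=1.0)
P_loose = compute_null_space_projector(K, rank_ratio=0.90)

# Lower rank_ratio leaves more of the update intact (more refusal removal),
# at the cost of a weaker preservation guarantee:
print(torch.norm(delta_W @ P_loose) >= torch.norm(delta_W @ P_strict))
```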