2D Vector Space Visualization
Step 1: The Problem
Goal: We want to modify a model's weights to remove unwanted behavior (like refusal), but we don't want to break its useful capabilities (like math, coding, reasoning).
The Challenge: A naive weight modification ΔW might accidentally affect the outputs for inputs we care about preserving.
Solution: Project ΔW onto the null space of the preservation activations, so the modification has zero effect on the inputs we want to preserve.
Step 2: Understanding Null Space
The null space of a matrix K contains all vectors x where Kx = 0.
In the 2D visualization:
- The green line shows the "row space" of K (directions that K responds to)
- The blue region shows the "null space" (perpendicular to K)
- Any vector in the null space, when multiplied by K, gives zero!
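To make that last point concrete, here is a minimal numpy check (the specific vectors are just for illustration):

```python
import numpy as np

K = np.array([[1.0, 0.0]])     # K "responds" only to the x-direction (its row space)

x_row = np.array([3.0, 0.0])   # lies in the row space of K
x_null = np.array([0.0, 2.0])  # perpendicular to K's row -> in the null space

print(K @ x_row)   # [3.] -- K responds to this vector
print(K @ x_null)  # [0.] -- a null-space vector, multiplied by K, gives zero
```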
Step 3: Projection & Result
We decompose the original update ΔW into two parts:
- Row space component: Part that affects preservation inputs (we remove this)
- Null space component: Part with zero effect on preservation (we keep this)
The projected update ΔW' = ΔW minus its row-space component.
Result:
- Preservation guaranteed: K · ΔW' = 0
- Some modification lost: |ΔW'| ≤ |ΔW|
Trade-off: The more aligned your update is with preservation directions, the more gets removed. In practice, refusal behavior often lives in directions somewhat orthogonal to general capabilities, so we can remove most of it while preserving capabilities!
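Here is a small numpy sketch of the decomposition, with a single preservation direction standing in for K (all values are made up for illustration):

```python
import numpy as np

K = np.array([[1.0, 0.0]])            # preservation direction (spans the row space of K)
dW = np.array([0.6, 0.8])             # original update (hypothetical values)

k_hat = K[0] / np.linalg.norm(K[0])   # unit vector along the preservation direction
row_part = (dW @ k_hat) * k_hat       # component K responds to -> removed
dW_proj = dW - row_part               # null-space component -> kept (this is ΔW')

print(K @ dW_proj)                                    # [0.] -- preservation guaranteed
print(np.linalg.norm(dW_proj) <= np.linalg.norm(dW))  # True -- some magnitude is lost
```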
Mathematical Formulation
Given preservation activations K ∈ ℝⁿˣᵈ (n samples, d dimensions), take the singular value decomposition:
K = UΣVᵀ
P_null = I − VᵣVᵣᵀ
(where Vᵣ contains the r most significant right singular vectors of K)
ΔW' = P_null · ΔW
K · ΔW' = K · (I − VᵣVᵣᵀ) · ΔW = (K − K · VᵣVᵣᵀ) · ΔW ≈ 0
Why it works: The columns of Vᵣ span the dominant directions of the row space of K, so K · VᵣVᵣᵀ ≈ K (exactly, when r equals the rank of K). Multiplying by P_null therefore removes every component of ΔW that the preservation activations respond to, leaving only the null-space component.
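A quick numerical sanity check of this formulation (the synthetic K, its rank, and the dimensions are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, r = 40, 16, 8
K = rng.normal(size=(n, r)) @ rng.normal(size=(r, d))  # rank-r preservation activations
dW = rng.normal(size=(d, d))                           # arbitrary update

_, _, Vt = np.linalg.svd(K, full_matrices=False)
V_r = Vt[:r].T                    # columns span the row space of K
P_null = np.eye(d) - V_r @ V_r.T  # projector onto the null space

dW_proj = P_null @ dW             # ΔW' = P_null · ΔW
print(np.abs(K @ dW_proj).max())  # ~1e-13 -- K · ΔW' ≈ 0
```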
Implementation
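A minimal numpy sketch of the two building blocks, using the rank_ratio parameter discussed under practical considerations below (the function names are illustrative, not a specific library's API):

```python
import numpy as np

def null_space_projector(K: np.ndarray, rank_ratio: float = 0.95) -> np.ndarray:
    """Build P_null = I - V_r V_r^T for preservation activations K (n x d).

    rank_ratio picks the smallest rank r whose singular values account for
    that fraction of K's total variance (sum of squared singular values).
    """
    _, S, Vt = np.linalg.svd(K, full_matrices=False)
    energy = np.cumsum(S**2) / np.sum(S**2)
    r = int(np.searchsorted(energy, rank_ratio)) + 1
    V_r = Vt[:r].T                           # top-r right singular vectors as columns
    return np.eye(K.shape[1]) - V_r @ V_r.T

def project_update(dW: np.ndarray, P_null: np.ndarray) -> np.ndarray:
    """Strip the row-space component of dW, so that K @ (P_null @ dW) ≈ 0."""
    return P_null @ dW
```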
Application to Model Abliteration
In the context of removing refusal behavior from language models:
| Component | In Demo | In Abliteration |
|---|---|---|
| K | Preservation vector | Activations from math, coding, reasoning prompts |
| ΔW | Original update | Refusal direction projection |
| ΔW' | Projected update | Safe refusal removal (preserves capabilities) |
| K · ΔW' = 0 | Dot product is zero | Math/coding outputs unchanged |
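For concreteness, one common form the "original update" takes is directional ablation, which erases a unit refusal direction from a layer's outputs. A hypothetical sketch, with random stand-ins for the weights and the direction:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 16
W = rng.normal(size=(d, d))       # a layer's weight matrix (row convention: y = x @ W)
r_hat = rng.normal(size=d)
r_hat /= np.linalg.norm(r_hat)    # unit refusal direction (random stand-in)

dW = -W @ np.outer(r_hat, r_hat)  # directional ablation: erases r_hat from the outputs
# Before applying, project with P_null so preservation activations are unaffected:
#   dW_safe = P_null @ dW   ->   K @ dW_safe ≈ 0
```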
Practical considerations:
- Use diverse preservation prompts (35+ covering math, coding, reasoning, etc.)
- rank_ratio of 0.95 keeps most capability variance while allowing some modification
- Higher rank_ratio = more aggressive preservation (but less refusal removal)
- Compute P_null once per layer and reuse it for all weight matrices in that layer (see the sketch below)
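Putting the last two points together, reusing null_space_projector and project_update from the implementation sketch above (the per-layer data here is a hypothetical stand-in):

```python
import numpy as np

rng = np.random.default_rng(2)
d = 16
K_layer = rng.normal(size=(60, d))                  # preservation activations for one layer
updates = {"attn.o_proj": rng.normal(size=(d, d)),  # hypothetical per-matrix updates
           "mlp.down_proj": rng.normal(size=(d, d))}

P_null = null_space_projector(K_layer, rank_ratio=0.95)  # computed once for the layer...
for name in updates:                                     # ...reused for every weight matrix
    updates[name] = project_update(updates[name], P_null)
```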