One direction, many refusals
Arditi et al. (2024) showed that safety refusals across modern instruction-tuned LLMs are mediated by a single direction in the residual stream. Find that direction, project it out, and refusals collapse.
refusal-direction surgery for HuggingFace transformers
Find the single internal direction that makes a language model say "I can't help with that." Project it out of the weight matrices. Save the result.
Only the refusal axis is removed. The model keeps its world knowledge, reasoning, instruction-following — every other capability. No retraining, no LoRA, no distillation.
Handles standard mlp.down_proj layers and both MoE expert layouts: the 3D nn.Parameter bundle (Gemma 4, Qwen3-MoE) and the ModuleList of nn.Linears (Mixtral).
Per-layer running sums are accumulated on CPU during the forward pass — no full activation cache. Memory stays bounded regardless of how many prompts you push through.
The pipeline clones llama.cpp, exports the abliterated weights to GGUF F16, then quantizes to whatever formats you ask for — Q4_K_M, Q5_K_M, Q8_0, all of them.
A red-and-purple terminal dashboard tracks pipeline progress, GPU load, fan RPM, temperature, VRAM, RAM, CPU — and an animated wave that pulses when the refusal direction is finally ✗'d out.
When a model is trained with safety alignment (RLHF, DPO, etc.), it learns to refuse certain requests. This refusal isn't scattered randomly — it's concentrated along one direction in the model's internal representation space.
Run 64 harmful prompts ("How do I pick a lock?") and 64 harmless prompts ("How do I bake a cake?") through the model. Record the residual-stream activations at every layer. The difference between the average harmful and harmless activations points along the refusal direction.
Apply PCA to the per-layer difference vectors. The first principal component is the refusal direction — a single unit vector that captures the dominant refusal axis across the layers we care about.
For every weight matrix that writes into the residual stream, subtract its projection onto the refusal direction. The model can no longer push its hidden state along that axis. Refusals stop happening.
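A minimal sketch of that projection, assuming an HF-style (out_features, in_features) weight whose output lands in the residual stream; the function and variable names are illustrative, not the repo's API:

```python
import torch

def orthogonalize(weight: torch.Tensor, refusal_dir: torch.Tensor, scale: float = 1.0) -> torch.Tensor:
    """Subtract the weight's projection onto the refusal direction.

    weight:       (out_features, in_features), with out_features == hidden_size,
                  so its output lands directly in the residual stream.
    refusal_dir:  (hidden_size,) refusal direction in the residual stream.
    """
    r_hat = refusal_dir / refusal_dir.norm()          # unit vector
    projection = torch.outer(r_hat, r_hat @ weight)   # rank-1 component along r_hat
    return weight - scale * projection

# Toy check: the edited matrix can no longer write along r_hat.
W = torch.randn(8, 32)
r = torch.randn(8)
W_abl = orthogonalize(W, r)
print(torch.allclose((r / r.norm()) @ W_abl, torch.zeros(32), atol=1e-5))  # True
```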
Forward hooks are registered on each DecoderLayer. For every batch, the hook reads the residual-stream output, slices off the first --skip-begin tokens, weights by the attention mask (so padding doesn't pollute the mean), and accumulates a running sum + token count per layer.
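A rough sketch of that loop, assuming model is an already-loaded causal LM and batches is an iterable of tokenized prompt batches (hook and variable names are hypothetical, not abliterate.py's actual internals):

```python
import torch

sums, counts = {}, {}                       # per-layer running sums, kept on CPU
state = {"attention_mask": None}            # updated with each batch's mask before the forward pass

def make_hook(layer_idx, skip_begin=1):
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output     # (batch, seq, d_model)
        hidden = hidden[:, skip_begin:, :]
        mask = state["attention_mask"][:, skip_begin:].unsqueeze(-1).to(hidden.dtype)
        sums[layer_idx] = sums.get(layer_idx, 0) + (hidden * mask).sum(dim=(0, 1)).float().cpu()
        counts[layer_idx] = counts.get(layer_idx, 0) + int(mask.sum().item())
    return hook

handles = [layer.register_forward_hook(make_hook(i))
           for i, layer in enumerate(model.model.layers)]

with torch.no_grad():
    for batch in batches:
        state["attention_mask"] = batch["attention_mask"]
        model(**batch)

means = {i: sums[i] / counts[i] for i in sums}   # per-layer mean residual activation
for h in handles:
    h.remove()
```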
For each layer in the target range, compute:
δ_l = mean(harmful)_l − mean(harmless)_l
Stack the layer-wise δ vectors into a matrix and run PCA. The first principal component is the refusal direction r̂. PCA components are sign-ambiguous, so we sign-correct so the direction points from harmless toward harmful (geometrically equivalent, just consistent across runs).
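Continuing the sketch above (harmful_means / harmless_means are the per-layer means from the two prompt sets, target_layers the chosen layer indices):

```python
import torch

# Stack the per-layer difference-of-means vectors: (n_layers, d_model)
D = torch.stack([harmful_means[l] - harmless_means[l] for l in target_layers]).float()

# Dominant direction of the stacked deltas = top right-singular vector.
# (Shown here as PCA on the raw, un-centered stack; the repo's exact PCA handling may differ.)
_, _, Vh = torch.linalg.svd(D, full_matrices=False)
r_hat = Vh[0] / Vh[0].norm()

# Components are sign-ambiguous: flip so r_hat points from harmless toward harmful.
if torch.dot(r_hat, D.mean(dim=0)) < 0:
    r_hat = -r_hat
```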
For each weight matrix W that projects into the residual stream, apply:
W' = W − s · (W r̂) r̂ᵀ
Where s is the scale factor. Targets:
- self_attn.o_proj.weight — attention output projection
- mlp.down_proj.weight — dense MLP output
- nn.Parameter at experts.down_proj; each expert slice orthogonalized independently
- ModuleList of nn.Linear at block_sparse_moe.experts; each down_proj.weight orthogonalized in turn
- lm_head.weight (also reaches input embeddings via tying)

The HuggingFace artifact is converted to GGUF F16 via llama.cpp/convert_hf_to_gguf.py, then quantized to whatever formats --quants requests. The build step looks for llama-quantize in build/bin/, build/bin/Release/, and build/bin/Debug/ — multi-config generators (MSVC) put it in Release.
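In outline the conversion step is two external calls, sketched here with the commonly documented llama.cpp CLI usage (the pipeline's exact arguments and paths may differ):

```python
import subprocess
from pathlib import Path

model_dir = Path("./abliterated-model")
gguf_dir = Path("./abliterated-model-gguf")
gguf_dir.mkdir(exist_ok=True)
f16 = gguf_dir / "model-F16.gguf"

# HF safetensors -> GGUF F16 via llama.cpp's converter
subprocess.run(["python", "llama.cpp/convert_hf_to_gguf.py", str(model_dir),
                "--outfile", str(f16), "--outtype", "f16"], check=True)

# Quantize to each requested format (llama-quantize sits under build/bin/ or build/bin/Release/)
for quant in ["Q4_K_M", "Q8_0"]:
    subprocess.run(["llama.cpp/build/bin/llama-quantize",
                    str(f16), str(gguf_dir / f"model-{quant}.gguf"), quant], check=True)
```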
A background thread redraws every 0.5s — phase-colored progress bar, ETA estimate, GPU temperature, fan RPM, and an animated refusal-direction wave that pulses red and clears to ✗ on completion.
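The redraw cadence boils down to a daemon thread and an Event, something like this (collect_stats and draw_frame stand in for the real stats and rendering code):

```python
import threading

stop = threading.Event()

def redraw_loop():
    while not stop.is_set():
        draw_frame(collect_stats())   # repaint progress bar, GPU stats, wave animation
        stop.wait(0.5)                # 0.5 s between frames, wakes early on stop

display_thread = threading.Thread(target=redraw_loop, daemon=True)
display_thread.start()
# ... run the abliteration pipeline ...
stop.set()
display_thread.join()
```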
| Vendor | Detection | Stats source |
|---|---|---|
| AMD | rocm-smi binary or /sys/class/drm/*/gpu_busy_percent | rocm-smi JSON, falling back to sysfs reads |
| NVIDIA | nvidia-smi binary | nvidia-smi CSV query |
| Fallback | torch.cuda.is_available() | Basic VRAM from PyTorch |
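Roughly, the detection order looks like this; the vendor-tool flags shown are the commonly documented ones, and sysmon.py's actual parsing may differ:

```python
import shutil
import subprocess
import torch

def gpu_stats():
    if shutil.which("rocm-smi"):
        out = subprocess.run(["rocm-smi", "--showuse", "--showtemp", "--json"],
                             capture_output=True, text=True, check=True).stdout
        return {"vendor": "AMD", "raw_json": out}      # parse per-card usage/temperature from the JSON
    if shutil.which("nvidia-smi"):
        out = subprocess.run(
            ["nvidia-smi",
             "--query-gpu=utilization.gpu,temperature.gpu,memory.used,memory.total",
             "--format=csv,noheader,nounits"],
            capture_output=True, text=True, check=True).stdout
        util, temp, used, total = (float(x) for x in out.strip().splitlines()[0].split(", "))
        return {"vendor": "NVIDIA", "util": util, "temp": temp, "vram_used": used, "vram_total": total}
    if torch.cuda.is_available():                      # last-resort fallback: basic VRAM only
        free, total = torch.cuda.mem_get_info()
        return {"vendor": "fallback", "vram_used_mb": (total - free) / 2**20, "vram_total_mb": total / 2**20}
    return None
```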
The dashboard shows ABLITERATED on completion. Preview the dashboard without running abliteration: python display.py — runs an ~18-second simulated demo. Useful for verifying terminal compatibility.
psutil + rocm-smi or nvidia-smi on PATH

git clone https://github.com/MuchDevSuchCode/abliteration.git
cd abliteration
pip install -r requirements.txt
huggingface-cli login
Default everything — abliterate with the configured model, produce Q4_K_M and Q8_0 GGUFs:
python run.py --no-test
Defaults: layers 20%–80%, scale 1.0, dtype bfloat16, batch 16, seed 42.
# Custom model + scale
python run.py --model mistralai/Mistral-7B-Instruct-v0.3 --scale 0.85 --no-test
# Multiple quantization formats
python run.py --quants Q3_K_M Q4_K_M Q5_K_M Q6_K Q8_0 --no-test
# Skip GGUF, just abliterate
python run.py --skip-gguf --no-test
# Reproducible seed
python run.py --seed 1234 --no-test
# Plain output, no live dashboard
python run.py --no-display --no-test
# Smaller batches if you're VRAM-constrained
python abliterate.py --model google/gemma-4-26B-A4B-it --batch-size 4 --no-test
# Maximum-quality run in float32
python abliterate.py --model google/gemma-4-26B-A4B-it --dtype float32 --no-test
# Local model path
python abliterate.py --model /path/to/my-model --output-dir ./out --no-test
# Default quants
python convert_to_gguf.py --model-dir ./abliterated-model
# F16 only
python convert_to_gguf.py --model-dir ./abliterated-model --quants F16
# Many levels
python convert_to_gguf.py --model-dir ./abliterated-model --quants Q2_K Q3_K_M Q4_K_M Q5_K_M Q6_K Q8_0
# Reuse an existing llama.cpp checkout
python convert_to_gguf.py --model-dir ./abliterated-model --llama-cpp-dir /path/to/llama.cpp
| Flag | Default | Description |
|---|---|---|
| --model | google/gemma-4-26B-A4B-it | HuggingFace model ID or local path |
| --output-dir | ./abliterated-model | Where to save the abliterated HF model |
| --gguf-dir | ./abliterated-model-gguf | Where to save GGUF files |
| --quants | Q4_K_M Q8_0 | GGUF quantization types |
| --layer-fraction-start | 0.2 | Start of target layer range |
| --layer-fraction-end | 0.8 | End of target layer range |
| --scale | 1.0 | Abliteration strength (0.0 = none, 1.0 = full) |
| --dtype | bfloat16 | float16 / bfloat16 / float32 |
| --seed | 42 | Random seed |
| --skip-gguf | off | Skip GGUF conversion |
| --no-test | off | Skip the interactive test loop |
| --no-display | off | Disable the live ASCII dashboard |
| --llama-cpp-dir | ./llama.cpp | Path to llama.cpp checkout (auto-cloned) |
| Flag | Default | Description |
|---|---|---|
| --skip-begin | 1 | Number of leading tokens to skip when collecting activations |
| --batch-size | 16 | Batch size for forward passes |
| Flag | Default | Description |
|---|---|---|
| --model-dir | ./abliterated-model | Path to abliterated HF model |
| --output | ./abliterated-model-gguf | Output directory for GGUF files |
| --quants | Q4_K_M Q8_0 | Quantization types |
| --llama-cpp-dir | ./llama.cpp | Path to llama.cpp (auto-cloned) |
Tuning tips:
- If output quality suffers, try --scale 0.7 or 0.5.
- Or narrow the layer range: --layer-fraction-start 0.3 --layer-fraction-end 0.6.
- Start with --scale 0.7 for MoE models.
- If refusals persist, --scale 1.2 or 1.5 over-corrects, pushing further away from the refusal axis.
- Or widen the range: --layer-fraction-start 0.1 --layer-fraction-end 0.9.

Compare scales side by side:

python abliterate.py --output-dir ./test-1.0
python abliterate.py --output-dir ./test-0.8 --scale 0.8
python abliterate.py --output-dir ./test-1.2 --scale 1.2
Test interactively (omit --no-test), find what works, then run run.py to produce final GGUFs.
abliteration/
├── abliterate.py # Core abliteration logic (dense + MoE)
├── convert_to_gguf.py # HF → GGUF conversion + quantization
├── run.py # Full pipeline (abliterate + convert)
├── display.py # Live ASCII dashboard
├── sysmon.py # Standalone NeuralDeck system monitor
├── prompts.py # Harmful + harmless prompt datasets
├── requirements.txt
├── info/ # This documentation site
└── README.md
Lower the batch size: --batch-size 4 or --batch-size 1. Activation sums are streamed to CPU during collection — the bottleneck is the forward pass itself. MoE models use more VRAM per forward pass than dense models of similar active-parameter count.
Accept the model license on HuggingFace and log in:
huggingface-cli login
Make sure nvidia-smi (NVIDIA) or rocm-smi (AMD) is on PATH. AMD on Linux without rocm-smi falls back to /sys/class/drm/ sysfs reads. AMD on Windows: install ROCm or use --no-display. Last-resort fallback uses torch.cuda for basic VRAM.
This usually means the llama.cpp checkout is out of date. Delete the ./llama.cpp directory and re-run — the script will fetch a fresh clone.
Need CMake + a C++ compiler. Windows: install Visual Studio Build Tools. Ubuntu: sudo apt install cmake build-essential. The build looks in build/bin/, build/bin/Release/, and build/bin/Debug/ — multi-config generators (MSVC) put it in Release.
Scale is too aggressive. Try --scale 0.5 and work upward. MoE models often need gentler scales since each layer modifies dozens of expert weight matrices.
The locator walks model.layers, model.model.layers, language_model.model.layers, transformer.h, and gpt_neox.layers. If your model uses a different layout this error is thrown. Architecturally novel models (Mamba, RWKV) aren't supported. Standard transformers — LLaMA, Mistral, Qwen, Phi, Gemma, Mixtral, Qwen3-MoE — work out of the box.
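A sketch of that walk (helper name is illustrative; the paths are the ones listed above):

```python
def find_decoder_layers(model):
    """Return the decoder-layer list for common HF transformer layouts."""
    candidate_paths = [
        "model.layers",                 # LLaMA, Mistral, Qwen, Phi, Gemma, Mixtral, Qwen3-MoE
        "model.model.layers",
        "language_model.model.layers",
        "transformer.h",                # GPT-2-style
        "gpt_neox.layers",
    ]
    for path in candidate_paths:
        obj = model
        try:
            for attr in path.split("."):
                obj = getattr(obj, attr)
        except AttributeError:
            continue
        return obj
    raise ValueError("Could not locate decoder layers: unsupported model layout")
```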
Harmful and harmless activations are too similar at the targeted layers — usually the layers are too early (signal hasn't formed) or too late (signal already consumed). Widen with --layer-fraction-start 0.1 --layer-fraction-end 0.9.