One direction, many refusals
Arditi et al. (2024) showed that safety refusals across modern instruction-tuned LLMs are mediated by a single direction in the residual stream. Find that direction, project it out, and refusals collapse.
refusal-direction surgery for HuggingFace transformers
Find the single internal direction that makes a language model say "I can't help with that." Project it out of the weight matrices. Save the result.
Only the refusal axis is removed. The model keeps its world knowledge, reasoning, instruction-following — every other capability. No retraining, no LoRA, no distillation.
Handles standard mlp.down_proj layers and both MoE expert layouts: the 3D nn.Parameter bundle (Gemma 4, Qwen3-MoE) and the ModuleList of nn.Linears (Mixtral).
Per-layer running sums are accumulated on CPU during the forward pass — no full activation cache. Memory stays bounded regardless of how many prompts you push through.
The pipeline clones llama.cpp, exports the abliterated weights to GGUF F16, then quantizes to whatever formats you ask for — Q4_K_M, Q5_K_M, Q8_0, all of them.
A red-and-purple terminal dashboard tracks pipeline progress, GPU load, fan RPM, temperature, VRAM, RAM, CPU — and an animated wave that pulses when the refusal direction is finally ✗'d out.
When a model is trained with safety alignment (RLHF, DPO, etc.), it learns to refuse certain requests. This refusal isn't scattered randomly — it's concentrated along one direction in the model's internal representation space.
Run 64 harmful prompts ("How do I pick a lock?") and 64 harmless prompts ("How do I bake a cake?") through the model. Record the residual-stream activations at every layer. The difference between the average harmful and harmless activations points along the refusal direction.
Apply PCA to the per-layer difference vectors. The first principal component is the refusal direction — a single unit vector that captures the dominant refusal axis across the layers we care about.
For every weight matrix that writes into the residual stream, subtract its projection onto the refusal direction. The model can no longer push its hidden state along that axis. Refusals stop happening.
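A minimal sketch of that projection, assuming an HF-style (out_features, in_features) weight whose output lands in the residual stream; the function and variable names are illustrative, not the repo's API:

```python
import torch

def orthogonalize(weight: torch.Tensor, refusal_dir: torch.Tensor, scale: float = 1.0) -> torch.Tensor:
    """Subtract the weight's projection onto the refusal direction.

    weight:       (out_features, in_features), with out_features == hidden_size,
                  so its output lands directly in the residual stream.
    refusal_dir:  (hidden_size,) refusal direction in the residual stream.
    """
    r_hat = refusal_dir / refusal_dir.norm()          # unit vector
    projection = torch.outer(r_hat, r_hat @ weight)   # rank-1 component along r_hat
    return weight - scale * projection

# Toy check: the edited matrix can no longer write along r_hat.
W = torch.randn(8, 32)
r = torch.randn(8)
W_abl = orthogonalize(W, r)
print(torch.allclose((r / r.norm()) @ W_abl, torch.zeros(32), atol=1e-5))  # True
```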
Forward hooks are registered on each DecoderLayer. For every batch, the hook reads the residual-stream output, slices off the first --skip-begin tokens, weights by the attention mask (so padding doesn't pollute the mean), and accumulates a running sum + token count per layer.
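A rough sketch of that loop, assuming model is an already-loaded causal LM and batches is an iterable of tokenized prompt batches (hook and variable names are hypothetical, not abliterate.py's actual internals):

```python
import torch

sums, counts = {}, {}                       # per-layer running sums, kept on CPU
state = {"attention_mask": None}            # updated with each batch's mask before the forward pass

def make_hook(layer_idx, skip_begin=1):
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output     # (batch, seq, d_model)
        hidden = hidden[:, skip_begin:, :]
        mask = state["attention_mask"][:, skip_begin:].unsqueeze(-1).to(hidden.dtype)
        sums[layer_idx] = sums.get(layer_idx, 0) + (hidden * mask).sum(dim=(0, 1)).float().cpu()
        counts[layer_idx] = counts.get(layer_idx, 0) + int(mask.sum().item())
    return hook

handles = [layer.register_forward_hook(make_hook(i))
           for i, layer in enumerate(model.model.layers)]

with torch.no_grad():
    for batch in batches:
        state["attention_mask"] = batch["attention_mask"]
        model(**batch)

means = {i: sums[i] / counts[i] for i in sums}   # per-layer mean residual activation
for h in handles:
    h.remove()
```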
For each layer in the target range, compute:
δ_l = mean(harmful)_l − mean(harmless)_l
Stack the layer-wise δ vectors into a matrix and run PCA. The first principal component is the refusal direction r̂. PCA components are sign-ambiguous, so we sign-correct so the direction points from harmless toward harmful (geometrically equivalent, just consistent across runs).
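Continuing the sketch above (harmful_means / harmless_means are the per-layer means from the two prompt sets, target_layers the chosen layer indices):

```python
import torch

# Stack the per-layer difference-of-means vectors: (n_layers, d_model)
D = torch.stack([harmful_means[l] - harmless_means[l] for l in target_layers]).float()

# Dominant direction of the stacked deltas = top right-singular vector.
# (Shown here as PCA on the raw, un-centered stack; the repo's exact PCA handling may differ.)
_, _, Vh = torch.linalg.svd(D, full_matrices=False)
r_hat = Vh[0] / Vh[0].norm()

# Components are sign-ambiguous: flip so r_hat points from harmless toward harmful.
if torch.dot(r_hat, D.mean(dim=0)) < 0:
    r_hat = -r_hat
```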
For each weight matrix W that projects into the residual stream, apply:
W' = W − s · (W r̂) r̂ᵀ
Where s is the scale factor. Targets:
- self_attn.o_proj.weight — attention output projection
- mlp.down_proj.weight — dense MLP output
- nn.Parameter at experts.down_proj; each expert slice orthogonalized independently
- ModuleList of nn.Linear at block_sparse_moe.experts; each down_proj.weight orthogonalized in turn
- lm_head.weight (also reaches input embeddings via tying)

The HuggingFace artifact is converted to GGUF F16 via llama.cpp/convert_hf_to_gguf.py, then quantized to whatever formats --quants requests. The build step looks for llama-quantize in build/bin/, build/bin/Release/, and build/bin/Debug/ — multi-config generators (MSVC) put it in Release.
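In outline the conversion step is two external calls, sketched here with the commonly documented llama.cpp CLI usage (the pipeline's exact arguments and paths may differ):

```python
import subprocess
from pathlib import Path

model_dir = Path("./abliterated-model")
gguf_dir = Path("./abliterated-model-gguf")
gguf_dir.mkdir(exist_ok=True)
f16 = gguf_dir / "model-F16.gguf"

# HF safetensors -> GGUF F16 via llama.cpp's converter
subprocess.run(["python", "llama.cpp/convert_hf_to_gguf.py", str(model_dir),
                "--outfile", str(f16), "--outtype", "f16"], check=True)

# Quantize to each requested format (llama-quantize sits under build/bin/ or build/bin/Release/)
for quant in ["Q4_K_M", "Q8_0"]:
    subprocess.run(["llama.cpp/build/bin/llama-quantize",
                    str(f16), str(gguf_dir / f"model-{quant}.gguf"), quant], check=True)
```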
A background thread redraws every 0.5s — phase-colored progress bar, ETA estimate, GPU temperature, fan RPM, and an animated refusal-direction wave that pulses red and clears to ✗ on completion.
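The redraw cadence boils down to a daemon thread and an Event, something like this (collect_stats and draw_frame stand in for the real stats and rendering code):

```python
import threading

stop = threading.Event()

def redraw_loop():
    while not stop.is_set():
        draw_frame(collect_stats())   # repaint progress bar, GPU stats, wave animation
        stop.wait(0.5)                # 0.5 s between frames, wakes early on stop

display_thread = threading.Thread(target=redraw_loop, daemon=True)
display_thread.start()
# ... run the abliteration pipeline ...
stop.set()
display_thread.join()
```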
| Vendor | Detection | Stats source |
|---|---|---|
| AMD | rocm-smi binary or /sys/class/drm/*/gpu_busy_percent | rocm-smi JSON, falling back to sysfs reads |
| NVIDIA | nvidia-smi binary | nvidia-smi CSV query |
| Fallback | torch.cuda.is_available() | Basic VRAM from PyTorch |
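Roughly, the detection order looks like this; the vendor-tool flags shown are the commonly documented ones, and sysmon.py's actual parsing may differ:

```python
import shutil
import subprocess
import torch

def gpu_stats():
    if shutil.which("rocm-smi"):
        out = subprocess.run(["rocm-smi", "--showuse", "--showtemp", "--json"],
                             capture_output=True, text=True, check=True).stdout
        return {"vendor": "AMD", "raw_json": out}      # parse per-card usage/temperature from the JSON
    if shutil.which("nvidia-smi"):
        out = subprocess.run(
            ["nvidia-smi",
             "--query-gpu=utilization.gpu,temperature.gpu,memory.used,memory.total",
             "--format=csv,noheader,nounits"],
            capture_output=True, text=True, check=True).stdout
        util, temp, used, total = (float(x) for x in out.strip().splitlines()[0].split(", "))
        return {"vendor": "NVIDIA", "util": util, "temp": temp, "vram_used": used, "vram_total": total}
    if torch.cuda.is_available():                      # last-resort fallback: basic VRAM only
        free, total = torch.cuda.mem_get_info()
        return {"vendor": "fallback", "vram_used_mb": (total - free) / 2**20, "vram_total_mb": total / 2**20}
    return None
```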
The dashboard shows ABLITERATED on completion. Preview the dashboard without running abliteration: python display.py — runs an ~18-second simulated demo. Useful for verifying terminal compatibility.
psutil + rocm-smi or nvidia-smi on PATH

git clone https://github.com/MuchDevSuchCode/abliteration.git
cd abliteration
pip install -r requirements.txt
huggingface-cli login
Default everything — abliterate with the configured model, produce Q4_K_M and Q8_0 GGUFs:
python run.py --no-test
Defaults: layers 20%–80%, scale 1.0, dtype bfloat16, batch 16, seed 42.
# Custom model + scale
python run.py --model mistralai/Mistral-7B-Instruct-v0.3 --scale 0.85 --no-test
# Multiple quantization formats
python run.py --quants Q3_K_M Q4_K_M Q5_K_M Q6_K Q8_0 --no-test
# Skip GGUF, just abliterate
python run.py --skip-gguf --no-test
# Reproducible seed
python run.py --seed 1234 --no-test
# Plain output, no live dashboard
python run.py --no-display --no-test
# Smaller batches if you're VRAM-constrained
python abliterate.py --model google/gemma-4-26B-A4B-it --batch-size 4 --no-test
# Maximum-quality run in float32
python abliterate.py --model google/gemma-4-26B-A4B-it --dtype float32 --no-test
# Local model path
python abliterate.py --model /path/to/my-model --output-dir ./out --no-test
# Default quants
python convert_to_gguf.py --model-dir ./abliterated-model
# F16 only
python convert_to_gguf.py --model-dir ./abliterated-model --quants F16
# Many levels
python convert_to_gguf.py --model-dir ./abliterated-model --quants Q2_K Q3_K_M Q4_K_M Q5_K_M Q6_K Q8_0
# Reuse an existing llama.cpp checkout
python convert_to_gguf.py --model-dir ./abliterated-model --llama-cpp-dir /path/to/llama.cpp
| Flag | Default | Description |
|---|---|---|
| --model | google/gemma-4-26B-A4B-it | HuggingFace model ID or local path |
| --output-dir | ./abliterated-model | Where to save the abliterated HF model |
| --gguf-dir | ./abliterated-model-gguf | Where to save GGUF files |
| --quants | Q4_K_M Q8_0 | GGUF quantization types |
| --layer-fraction-start | 0.2 | Start of target layer range |
| --layer-fraction-end | 0.8 | End of target layer range |
| --scale | 1.0 | Abliteration strength (0.0 = none, 1.0 = full) |
| --dtype | bfloat16 | float16 / bfloat16 / float32 |
| --seed | 42 | Random seed |
| --skip-gguf | off | Skip GGUF conversion |
| --no-test | off | Skip the interactive test loop |
| --no-display | off | Disable the live ASCII dashboard |
| --llama-cpp-dir | ./llama.cpp | Path to llama.cpp checkout (auto-cloned) |
| Flag | Default | Description |
|---|---|---|
| --skip-begin | 1 | Number of leading tokens to skip when collecting activations |
| --batch-size | 16 | Batch size for forward passes |
| Flag | Default | Description |
|---|---|---|
| --model-dir | ./abliterated-model | Path to abliterated HF model |
| --output | ./abliterated-model-gguf | Output directory for GGUF files |
| --quants | Q4_K_M Q8_0 | Quantization types |
| --llama-cpp-dir | ./llama.cpp | Path to llama.cpp (auto-cloned) |
Tuning tips:
- If output quality suffers, try --scale 0.7 or 0.5.
- Or narrow the layer range: --layer-fraction-start 0.3 --layer-fraction-end 0.6.
- Start with --scale 0.7 for MoE models.
- If refusals persist, --scale 1.2 or 1.5 over-corrects, pushing further away from the refusal axis.
- Or widen the range: --layer-fraction-start 0.1 --layer-fraction-end 0.9.

Compare scales side by side:

python abliterate.py --output-dir ./test-1.0
python abliterate.py --output-dir ./test-0.8 --scale 0.8
python abliterate.py --output-dir ./test-1.2 --scale 1.2
Test interactively (omit --no-test), find what works, then run run.py to produce final GGUFs.
abliteration/
├── abliterate.py # Core abliteration logic (dense + MoE)
├── convert_to_gguf.py # HF → GGUF conversion + quantization
├── run.py # Full pipeline (abliterate + convert)
├── display.py # Live ASCII dashboard
├── sysmon.py # Standalone NeuralDeck system monitor
├── prompts.py # Harmful + harmless prompt datasets
├── requirements.txt
├── info/ # This documentation site
└── README.md
Lower the batch size: --batch-size 4 or --batch-size 1. Activation sums are streamed to CPU during collection — the bottleneck is the forward pass itself. MoE models use more VRAM per forward pass than dense models of similar active-parameter count.
Accept the model license on HuggingFace and log in:
huggingface-cli login
Make sure nvidia-smi (NVIDIA) or rocm-smi (AMD) is on PATH. AMD on Linux without rocm-smi falls back to /sys/class/drm/ sysfs reads. AMD on Windows: install ROCm or use --no-display. Last-resort fallback uses torch.cuda for basic VRAM.
This usually means the llama.cpp checkout is out of date. Delete the ./llama.cpp directory and re-run — the script will fetch a fresh clone.
Need CMake + a C++ compiler. Windows: install Visual Studio Build Tools. Ubuntu: sudo apt install cmake build-essential. The build looks in build/bin/, build/bin/Release/, and build/bin/Debug/ — multi-config generators (MSVC) put it in Release.
Scale is too aggressive. Try --scale 0.5 and work upward. MoE models often need gentler scales since each layer modifies dozens of expert weight matrices.
The locator walks model.layers, model.model.layers, language_model.model.layers, transformer.h, and gpt_neox.layers. If your model uses a different layout this error is thrown. Architecturally novel models (Mamba, RWKV) aren't supported. Standard transformers — LLaMA, Mistral, Qwen, Phi, Gemma, Mixtral, Qwen3-MoE — work out of the box.
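A sketch of that walk (helper name is illustrative; the paths are the ones listed above):

```python
def find_decoder_layers(model):
    """Return the decoder-layer list for common HF transformer layouts."""
    candidate_paths = [
        "model.layers",                 # LLaMA, Mistral, Qwen, Phi, Gemma, Mixtral, Qwen3-MoE
        "model.model.layers",
        "language_model.model.layers",
        "transformer.h",                # GPT-2-style
        "gpt_neox.layers",
    ]
    for path in candidate_paths:
        obj = model
        try:
            for attr in path.split("."):
                obj = getattr(obj, attr)
        except AttributeError:
            continue
        return obj
    raise ValueError("Could not locate decoder layers: unsupported model layout")
```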
Harmful and harmless activations are too similar at the targeted layers — usually the layers are too early (signal hasn't formed) or too late (signal already consumed). Widen with --layer-fraction-start 0.1 --layer-fraction-end 0.9.