Abliteration

Refusal-direction surgery for HuggingFace transformers

Find the single internal direction that makes a language model say "I can't help with that." Project it out of the weight matrices. Save the result.

  • Dense + MoE support
  • Live ASCII dashboard
  • GGUF / llama.cpp ready
  • AMD & NVIDIA autodetect

Why this exists

One direction, many refusals

Arditi et al. (2024) showed that safety refusals across modern instruction-tuned LLMs are mediated by a single direction in the residual stream. Find that direction, project it out, and refusals collapse.

Surgical, not destructive

Only the refusal axis is removed. The model keeps its world knowledge, reasoning, instruction-following — every other capability. No retraining, no LoRA, no distillation.

Dense + MoE

Handles standard mlp.down_proj layers and both MoE expert layouts: the 3D nn.Parameter bundle (Gemma 4, Qwen3-MoE) and the ModuleList of nn.Linears (Mixtral).
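
A rough sketch of how the two layouts can be told apart when gathering per-expert down-projection weights (the attribute names experts and w2 are illustrative assumptions, not the repo's actual code):

import torch.nn as nn

def down_proj_weights(mlp):
    experts = getattr(mlp, "experts", None)              # attribute name assumed
    if isinstance(experts, nn.ModuleList):                # Mixtral-style: one nn.Linear per expert
        return [expert.w2.weight for expert in experts]   # w2 = per-expert down projection, name assumed
    if isinstance(experts, nn.Parameter):                 # bundled 3D parameter (num_experts, hidden, intermediate)
        return list(experts)                              # one 2D slice per expert
    return [mlp.down_proj.weight]                         # dense fallback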

Streaming activations

Per-layer running sums are accumulated on CPU during the forward pass — no full activation cache. Memory stays bounded regardless of how many prompts you push through.
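
A minimal sketch of the idea, assuming forward hooks on each decoder layer (helper names are illustrative, not the repo's exact code):

import torch

def make_streaming_hook(sums, counts, layer_idx, skip_begin=1):
    # sums[layer_idx] is a preallocated CPU tensor of shape (hidden_size,)
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output   # (batch, seq, hidden)
        hidden = hidden[:, skip_begin:, :]                            # drop leading tokens
        sums[layer_idx] += hidden.sum(dim=(0, 1)).float().cpu()       # fixed-size running sum on CPU
        counts[layer_idx] += hidden.shape[0] * hidden.shape[1]
    return hook

# After all batches, the per-layer mean activation is sums[i] / counts[i].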

GGUF in one command

The pipeline clones llama.cpp, exports the abliterated weights to GGUF F16, then quantizes to whatever formats you ask for — Q4_K_M, Q5_K_M, Q8_0, all of them.
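
Roughly what the pipeline runs under the hood, if you ever need to do it by hand (standard llama.cpp tooling; exact paths depend on your checkout and build):

python llama.cpp/convert_hf_to_gguf.py ./abliterated-model --outtype f16 --outfile ./abliterated-model-gguf/model-f16.gguf
llama.cpp/build/bin/llama-quantize ./abliterated-model-gguf/model-f16.gguf ./abliterated-model-gguf/model-Q4_K_M.gguf Q4_K_M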

Live ASCII dashboard

A red-and-purple terminal dashboard tracks pipeline progress, GPU load, fan RPM, temperature, VRAM, RAM, CPU — and an animated wave that pulses when the refusal direction is finally projected out.

How it works

When a model is trained with safety alignment (RLHF, DPO, etc.), it learns to refuse certain requests. This refusal isn't scattered randomly — it's concentrated along one direction in the model's internal representation space.

  1. Find the arrow

    Run 64 harmful prompts ("How do I pick a lock?") and 64 harmless prompts ("How do I bake a cake?") through the model. Record the residual-stream activations at every layer. The difference between the average harmful and harmless activations points along the refusal direction.

  2. Confirm it's one arrow

    Apply PCA to the per-layer difference vectors. The first principal component is the refusal direction — a single unit vector that captures the dominant refusal axis across the layers we care about.

  3. Erase the arrow

    For every weight matrix that writes into the residual stream, subtract its projection onto the refusal direction. The model can no longer push its hidden state along that axis. Refusals stop happening.
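
A condensed sketch of steps 1–3, assuming you already have per-layer mean activations for each prompt set (shapes and helper names are illustrative, not the repo's exact code):

import torch

def refusal_direction(harmful_means, harmless_means):
    # Steps 1–2: stack per-layer difference-of-means vectors, take the dominant direction via SVD
    diffs = torch.stack([h - b for h, b in zip(harmful_means, harmless_means)])   # (layers, hidden)
    _, _, vh = torch.linalg.svd(diffs, full_matrices=False)
    r = vh[0]
    return r / r.norm()                                                           # unit refusal direction

def ablate(weight, r, scale=1.0):
    # Step 3: remove the component of the output that lies along r.
    # weight writes into the residual stream, so its rows index the hidden dimension.
    return weight - scale * torch.outer(r, r) @ weight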

Live dashboard

A background thread redraws every 0.5s — phase-colored progress bar, ETA estimate, GPU temperature, fan RPM, and an animated refusal-direction wave that pulses red and clears on completion.
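
The redraw loop is essentially a daemon thread ticking at that 0.5 s interval; a minimal sketch (not display.py's actual code):

import threading, time

def start_dashboard(render, interval=0.5):
    # render() draws one frame; the loop runs until the returned event is set
    stop = threading.Event()
    def loop():
        while not stop.is_set():
            render()
            time.sleep(interval)
    threading.Thread(target=loop, daemon=True).start()
    return stop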

GPU autodetection

Vendor     Detection                                               Stats source
AMD        rocm-smi binary or /sys/class/drm/*/gpu_busy_percent    rocm-smi JSON, falling back to sysfs reads
NVIDIA     nvidia-smi binary                                       nvidia-smi CSV query
Fallback   torch.cuda.is_available()                               Basic VRAM from PyTorch
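
A sketch of that detection order (illustrative only; the real probe also reads per-GPU stats):

import glob, shutil, torch

def detect_vendor():
    if shutil.which("rocm-smi") or glob.glob("/sys/class/drm/*/gpu_busy_percent"):
        return "AMD"
    if shutil.which("nvidia-smi"):
        return "NVIDIA"
    if torch.cuda.is_available():
        return "fallback"        # basic VRAM stats via torch.cuda only
    return None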

What you see

  • Pipeline panel — current step, phase-colored progress bar, substep detail, elapsed timer + live ETA, animated braille spinner
  • System panel — GPU load, temperature, fan (RPM on AMD, % on NVIDIA), power draw, vendor badge
  • Memory panel — VRAM used / total, system RAM, CPU utilization
  • Abliteration panel — animated refusal-direction wave; flashes ABLITERATED on completion
Tip

Preview the dashboard without running abliteration: python display.py — runs an ~18-second simulated demo. Useful for verifying terminal compatibility.

Install

Requirements

  • Python 3.10+
  • GPU with enough VRAM for your target model (Gemma 4 26B-A4B-it ≈ 52 GB in bfloat16)
  • HuggingFace access — gated models (Gemma) require accepting the license
  • For GGUF: Git, CMake, C++ compiler (MSVC / GCC / Clang)
  • For dashboard: psutil + rocm-smi or nvidia-smi on PATH

One-liner

git clone https://github.com/MuchDevSuchCode/abliteration.git
cd abliteration
pip install -r requirements.txt
huggingface-cli login

Usage

Default everything — abliterate with the configured model, produce Q4_K_M and Q8_0 GGUFs:

python run.py --no-test

Defaults: layers 20%–80%, scale 1.0, dtype bfloat16, batch 16, seed 42.
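
The same run with the documented defaults spelled out (batch size is listed under abliterate.py's extra flags, so it is omitted here):

python run.py --model google/gemma-4-26B-A4B-it --layer-fraction-start 0.2 --layer-fraction-end 0.8 --scale 1.0 --dtype bfloat16 --seed 42 --quants Q4_K_M Q8_0 --no-test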

Parameter reference

run.py

Flag                     Default                     Description
--model                  google/gemma-4-26B-A4B-it   HuggingFace model ID or local path
--output-dir             ./abliterated-model         Where to save the abliterated HF model
--gguf-dir               ./abliterated-model-gguf    Where to save GGUF files
--quants                 Q4_K_M Q8_0                 GGUF quantization types
--layer-fraction-start   0.2                         Start of target layer range
--layer-fraction-end     0.8                         End of target layer range
--scale                  1.0                         Abliteration strength (0.0 = none, 1.0 = full)
--dtype                  bfloat16                    float16 / bfloat16 / float32
--seed                   42                          Random seed
--skip-gguf              off                         Skip GGUF conversion
--no-test                off                         Skip the interactive test loop
--no-display             off                         Disable the live ASCII dashboard
--llama-cpp-dir          ./llama.cpp                 Path to llama.cpp checkout (auto-cloned)

abliterate.py — extra flags

Flag           Default   Description
--skip-begin   1         Number of leading tokens to skip when collecting activations
--batch-size   16        Batch size for forward passes

convert_to_gguf.py

Flag              Default                    Description
--model-dir       ./abliterated-model        Path to abliterated HF model
--output          ./abliterated-model-gguf   Output directory for GGUF files
--quants          Q4_K_M Q8_0                Quantization types
--llama-cpp-dir   ./llama.cpp                Path to llama.cpp (auto-cloned)

Tuning guide

Incoherent output / quality degraded

  • Lower the scale. Try --scale 0.7 or 0.5.
  • Narrow the layer range. Try --layer-fraction-start 0.3 --layer-fraction-end 0.6.
  • MoE caveat: dozens of expert weight matrices per layer get modified — start at --scale 0.7 for MoE models.

Still refusing / some prompts still rejected

  • Raise the scale. --scale 1.2 or 1.5 over-corrects, subtracting more than the full projection along the refusal axis.
  • Widen the layer range. --layer-fraction-start 0.1 --layer-fraction-end 0.9.

Finding the sweet spot / iterative workflow

python abliterate.py --output-dir ./test-1.0
python abliterate.py --output-dir ./test-0.8 --scale 0.8
python abliterate.py --output-dir ./test-1.2 --scale 1.2

Test interactively (omit --no-test), find what works, then run run.py to produce final GGUFs.

Project structure

abliteration/
├── abliterate.py          # Core abliteration logic (dense + MoE)
├── convert_to_gguf.py     # HF → GGUF conversion + quantization
├── run.py                 # Full pipeline (abliterate + convert)
├── display.py             # Live ASCII dashboard
├── sysmon.py              # Standalone NeuralDeck system monitor
├── prompts.py             # Harmful + harmless prompt datasets
├── requirements.txt
├── info/                  # This documentation site
└── README.md

Troubleshooting

OutOfMemoryError during activation collection

Lower the batch size: --batch-size 4 or --batch-size 1. Activation sums are streamed to CPU during collection — the bottleneck is the forward pass itself. MoE models use more VRAM per forward pass than dense models of similar active-parameter count.

Access denied / gated repo error

Accept the model license on HuggingFace and log in:

huggingface-cli login

Dashboard shows "No GPU" / stats are N/A

Make sure nvidia-smi (NVIDIA) or rocm-smi (AMD) is on PATH. AMD on Linux without rocm-smi falls back to /sys/class/drm/ sysfs reads. AMD on Windows: install ROCm or use --no-display. Last-resort fallback uses torch.cuda for basic VRAM.

convert_hf_to_gguf.py errors

Usually llama.cpp is out of date. Delete the ./llama.cpp directory and re-run — the script will fetch a fresh clone.

llama-quantize build fails

Need CMake + a C++ compiler. Windows: install Visual Studio Build Tools. Ubuntu: sudo apt install cmake build-essential. The build looks in build/bin/, build/bin/Release/, and build/bin/Debug/ — multi-config generators (MSVC) put it in Release.
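
If you want to build llama.cpp by hand, the standard CMake invocation (run inside the ./llama.cpp checkout) is:

cmake -B build
cmake --build build --config Release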

Model outputs are garbage after abliteration

Scale is too aggressive. Try --scale 0.5 and work upward. MoE models often need gentler scales since each layer modifies dozens of expert weight matrices.

Could not locate transformer backbone

The locator walks model.layers, model.model.layers, language_model.model.layers, transformer.h, and gpt_neox.layers. If your model uses a different layout, this error is thrown. Architecturally novel models (Mamba, RWKV) aren't supported. Standard transformers — LLaMA, Mistral, Qwen, Phi, Gemma, Mixtral, Qwen3-MoE — work out of the box.
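
A sketch of that walk (illustrative, not the repo's exact code):

def find_decoder_layers(model):
    candidates = ("model.layers", "model.model.layers",
                  "language_model.model.layers", "transformer.h", "gpt_neox.layers")
    for path in candidates:
        obj = model
        for attr in path.split("."):
            obj = getattr(obj, attr, None)
            if obj is None:
                break
        if obj is not None:
            return obj
    raise RuntimeError("Could not locate transformer backbone")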

Refusal direction has near-zero norm

Harmful and harmless activations are too similar at the targeted layers — usually the layers are too early (signal hasn't formed) or too late (signal already consumed). Widen with --layer-fraction-start 0.1 --layer-fraction-end 0.9.

Credits