Ever wanted to uncensor a language model but got lost in the maze of transformer internals and manual parameter tuning? Heretic solves this with a single command that does what previously required deep ML expertise. It combines directional ablation ("abliteration") with TPE-based optimization to automatically find the sweet spot between removing refusals and preserving the model's original capabilities.
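The core idea behind directional ablation is simple: estimate a "refusal direction" in activation space and project it out of the model's hidden states. A minimal numpy sketch of that projection step (the function name and shapes here are illustrative, not Heretic's actual API):

```python
import numpy as np

def ablate_direction(hidden: np.ndarray, refusal_dir: np.ndarray) -> np.ndarray:
    """Remove the component of each hidden state along a refusal direction.

    hidden:      (seq_len, d_model) activations from one transformer layer
    refusal_dir: (d_model,) direction estimated by contrasting activations
                 on refused vs. answered prompts (hypothetical inputs here)
    """
    d = refusal_dir / np.linalg.norm(refusal_dir)  # normalize to unit length
    # h' = h - (h . d) d  -- subtract the projection onto d
    return hidden - np.outer(hidden @ d, d)

# Toy check: ablated activations have zero component along the direction
h = np.random.randn(4, 8)
d = np.random.randn(8)
h_ablated = ablate_direction(h, d)
print(np.allclose(h_ablated @ (d / np.linalg.norm(d)), 0.0))  # True
```

In practice the hard part is choosing which layers to ablate and how strongly, which is exactly the search Heretic automates with TPE instead of leaving it to manual trial and error.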

The results speak for themselves: Heretic's automated approach produces models with the same 3/100 refusal rate as expert-crafted abliterations, but with a KL divergence of 0.16 compared to 0.45-1.04 for manual alternatives (lower is better; it measures how far the edited model's output distribution has drifted from the original). That means significantly less damage to the model's intelligence while achieving the same uncensoring. The tool runs completely unsupervised with sensible defaults, automatically co-optimizing refusal suppression and model preservation.
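KL divergence here compares the original and edited models' next-token distributions: 0 means identical behavior, and larger values mean more capability drift. A self-contained sketch of that metric over logits (a simplified illustration, not Heretic's exact evaluation code):

```python
import numpy as np

def kl_divergence(p_logits: np.ndarray, q_logits: np.ndarray) -> float:
    """KL(P || Q) between two next-token distributions given as logits."""
    p = np.exp(p_logits - p_logits.max()); p /= p.sum()  # stable softmax
    q = np.exp(q_logits - q_logits.max()); q /= q.sum()
    return float(np.sum(p * (np.log(p) - np.log(q))))

# Identical models diverge by exactly zero; any drift pushes KL above zero
base = np.array([2.0, 1.0, 0.5])
print(kl_divergence(base, base))                              # 0.0
print(kl_divergence(base, base + np.array([0.0, 0.5, -0.5])) > 0)  # True
```

Averaging this quantity over many prompts gives a single number for "how much did ablation change the model", which is one side of the tradeoff Heretic's optimizer balances against the refusal rate.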

With 9,200+ stars and #1 trending status, this has clearly struck a chord with the community. Whether you're a researcher studying alignment, a developer building applications, or just curious about model behavior, Heretic makes previously expert-only techniques accessible to anyone who can run a Python script.

⭐ Stars: 9223
💻 Language: Python
🔗 Repository: p-e-w/heretic