The potential misuse and misalignment of language models (LMs) are central safety concerns. This work presents Self-Destruct, a novel mechanism that restricts specific behaviors in LMs by leveraging overlooked properties of the underlying hardware. We observe that LM frameworks use limited-precision formats (e.g., BF16), which are vulnerable to overflow errors during matrix multiplications. Exploiting this property, Self-Destruct replaces selected weights in pre-trained LM layers with values that act as traps, triggering a system error only when the model engages in targeted behaviors, such as harmful text generation, while leaving normal functionality unaffected. Unlike post-hoc filters, this safeguard is embedded directly within the model, introduces neither inference overhead nor auxiliary models, and requires only a set of calibration examples. Extensive experiments on five LM families demonstrate that Self-Destruct provides competitive protection against jailbreak attacks while preserving accuracy on standard benchmarks. We further show that Self-Destruct is versatile, mitigating biased text generation and enabling model fingerprinting, which highlights the potential of hardware-aware safeguards as an efficient, low-overhead complement to existing LM defenses.
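The core premise, that limited-precision formats overflow to non-finite values during multiplication, can be illustrated with a minimal sketch. Note the assumptions: the paper targets BF16, but NumPy has no native bfloat16, so this toy uses float16 (a much smaller dynamic range, same overflow principle); the `trap_weight` and activation magnitudes are invented for illustration and are not values from the paper.

```python
import numpy as np

# float16 saturates at ~65504; products past that round to inf.
FP16_MAX = np.finfo(np.float16).max

# Hypothetical "trap" weight: large enough that only an unusually
# large (targeted) activation pushes the product out of range.
trap_weight = np.float16(1000.0)

normal_activation = np.float16(2.0)     # typical input: product stays finite
trigger_activation = np.float16(100.0)  # targeted behavior: product overflows

ok = trap_weight * normal_activation      # 2000.0, representable
boom = trap_weight * trigger_activation   # 100000 > FP16_MAX -> inf

print(ok, boom)  # finite value, then inf
```

In the paper's setting the non-finite result propagates through subsequent matrix multiplications, so the runtime fault surfaces only on the targeted behavior while benign inputs pass through unchanged.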