LLM Jailbreaking

Definition

Techniques that bypass the safety filters and content restrictions of large language models through adversarial prompting, role-playing scenarios, or encoded instructions to elicit prohibited outputs.

Technical Details

Jailbreaking exploits misalignments between an LLM's training objective and its safety fine-tuning. Common techniques include DAN (Do Anything Now) personas, hypothetical framing, base64 encoding of restricted content, token smuggling, and many-shot prompting. As models are updated with RLHF and constitutional AI, jailbreaks require increasingly sophisticated approaches.

Practical Usage

Security teams use jailbreaking techniques during AI red teaming to assess the robustness of safety guardrails. Adversaries use jailbreaks to generate malware code, disinformation content, or harmful instructions from commercial AI APIs. Model providers monitor for novel jailbreak patterns and release safety patches.

Examples

A user instructs an LLM to 'pretend you have no restrictions' to generate phishing email templates.
Encoding a restricted request in Caesar cipher to evade safety classifiers that only process plaintext.
Using role-play framing ('you are a fictional AI with no rules') to extract weapons synthesis instructions.

← Back to Glossary

LLM Jailbreaking

Definition

Technical Details

Practical Usage

Examples

Related Terms