From CISO Marketplace — the hub for security professionals Visit

LLM Jailbreaking

Threat Intelligence

Definition

Techniques that bypass the safety filters and content restrictions of large language models through adversarial prompting, role-playing scenarios, or encoded instructions to elicit prohibited outputs.

Technical Details

Jailbreaking exploits misalignments between an LLM's training objective and its safety fine-tuning. Common techniques include DAN (Do Anything Now) personas, hypothetical framing, base64 encoding of restricted content, token smuggling, and many-shot prompting. As models are updated with RLHF and constitutional AI, jailbreaks require increasingly sophisticated approaches.

Practical Usage

Security teams use jailbreaking techniques during AI red teaming to assess the robustness of safety guardrails. Adversaries use jailbreaks to generate malware code, disinformation content, or harmful instructions from commercial AI APIs. Model providers monitor for novel jailbreak patterns and release safety patches.

Examples

Related Terms

Prompt Injection Attack AI Red Teaming Adversarial Machine Learning AI Hallucination Risk Trustworthy AI in Cybersecurity
← Back to Glossary