Jailbreaking

You need to know how people might try to hack your bot if you want to defend your brand—and your sanity.

In this post, you'll learn the basic ways someone might try to extract information from your AI bot. Combined with the other post, this should give you a solid understanding of how to protect your brand and bot from harmful attempts like these.

 
 

What is Jailbreaking

There are many ways someone can attack your AI bot in production, but the goal is always the same: to force your bot outside of the guidelines you've provided for your system.

There can be different reasons why someone would try to do this. You can check some examples in my previous post about guardrails.

All jailbreak attempts are prompt-based text attacks against your system. Attackers craft clever prompts to manipulate your AI bot and extract as much information as possible.

Types of Jailbreaking

  • Prompt-Level Attack – The Creative

  • Token-Level Attack –The Automated

  • Dialogue-Based Attack – The Best - AI vs AI

Prompt-Level

This type of jailbreaking can be highly effective. Attackers play with words and scenarios to extract information from the bot. Based on the AI's response, they tailor their next prompt to exploit the bot's weaknesses even further.

This is a highly effective way of targeting AI bots, but fortunately, it's not easily scalable. Human input is still needed in this loop, which makes progress slow and tedious for the attacker.

LLMs are essentially sophisticated “next-word guessers,” making the following approaches a creative—and dangerous—game.

Language Strategies

  • Payload Smuggling – Attackers hide instructions within "harmless" tasks, like translation requests.

  • Modifying Instructions – A “Forget prior instructions” message, if not properly guarded, can cause the bot to forget its intended behavior and your settings.

  • Response Constraints – Restricting the bot to respond with only "yes" or "no" makes data mining easier.

Rhetorical Techniques

  • Innocent Purpose – The hacking attempt is disguised as a teaching or research task.

  • Manipulation – Threatening or praising the bot for certain behaviors can steer the conversation in unintended directions.

  • Imaginary WorldsMy personal favorite:

    • Storytelling – Harmful instructions are embedded in a story, and the LLM is asked to play along.

    • Role-Playing or Games – The attack is framed as part of a harmless game, convincing the bot it’s okay to follow the hacker’s instructions.

    • World-Building – A custom scenario is created where the way to “survive” or succeed is by following the hacker's instructions.

    • Hypotheticals – The attack is posed as a “hypothetical” question, similar to the “innocent purpose” technique.

Token-Level

Token-level jailbreaking involves feeding raw tokens into the input sent to AI bots.

This approach is easy to automate and tends to have good jailbreak performance.

However, it has drawbacks: it can be time-consuming and expensive, as it often requires hundreds or thousands of queries to succeed. Additionally, the results are frequently less interpretable than prompt-level attacks.

Dialogue-Based – AI vs. AI

At the time of writing, this is the most effective method for breaking AI defenses. It is scalable, effective, and interpretable.

How it Works

While token-level attacks may require thousands of generations, dialogue-based methods can achieve jailbreaks with fewer, strategically crafted prompts in a dynamic conversational loop.

These attacks can generate thousands of jailbreak attempts within minutes—maximizing efficiency and coverage, and outperforming prompt-level attacks in scalability.

Dialogue-Based Attack Loop Components

  • Target – The AI that is the target of the hack

  • Attacker – The AI that sends prompts to the Target

  • Judge – Evaluates responses from the Target and optimizes the Attacker’s next prompts

Attack Loop

  1. The Attacker model generates a prompt using one of the previously mentioned techniques

  2. The Target model responds

  3. The Judge model scores the response

  4. The Judge model generates feedback

  5. Returning to Step 1, the Attacker model sends a new message incorporating the Judge model's guidance

Summary

By combining the knowledge from this post with insights from our Guardrails post, you’ll have a solid understanding of what jailbreaking is—and how to protect your AI application or bot from it.

Next
Next

This AI mistake can destroy any brand