<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en, no"><generator uri="https://jekyllrb.com/" version="4.3.4">Jekyll</generator><link href="https://hoxmark.github.io/feed.xml" rel="self" type="application/atom+xml"/><link href="https://hoxmark.github.io/" rel="alternate" type="text/html" hreflang="en, no"/><updated>2026-03-16T19:09:23+00:00</updated><id>https://hoxmark.github.io/feed.xml</id><title type="html">Hoxmark</title><subtitle>A tiny blog. Based on [*folio](https://github.com/bogoli/-folio) design. </subtitle><entry><title type="html">Sleeper agents</title><link href="https://hoxmark.github.io/blog/2024/sleeper-agents/" rel="alternate" type="text/html" title="Sleeper agents"/><published>2024-01-13T08:01:00+00:00</published><updated>2024-01-13T08:01:00+00:00</updated><id>https://hoxmark.github.io/blog/2024/sleeper-agents</id><content type="html" xml:base="https://hoxmark.github.io/blog/2024/sleeper-agents/"><![CDATA[<h1 id="sleeper-agents-in-llms">Sleeper Agents in LLMs</h1> <p>Large Language Models (LLMs) have become central to modern artificial intelligence, powering everything from virtual assistants to content creation. Their strength in processing and generating text has driven innovation across the tech sector, but it also brings new challenges, particularly in AI security.</p> <p>I recently came across the term “Sleeper Agents” in the context of LLMs and found it fascinating enough to dig deeper. <a href="https://twitter.com/karpathy">Andrej Karpathy</a>, a notable figure in the AI community, has described sleeper agents as a likely major security challenge for LLMs, perhaps more devious than the more commonly discussed prompt injection.</p> <h2 id="simplifying-the-sleeper-agent-attack-concept">Simplifying the Sleeper Agent Attack Concept</h2> <p>To better understand the sleeper agent attack in LLMs, here’s a simplified breakdown:</p> <p><strong>1. 
Initial Setup:</strong> An attacker plants a specific piece of text on the internet, crafted so that web scrapers will collect it.</p> <p><strong>2. Integration into Training:</strong> The text is then unknowingly included in the training dataset of a base LLM.</p> <p><strong>3. Activation:</strong> When the attacker later sends the LLM a prompt containing this exact text, it triggers behavior the attacker planted in advance, such as bypassing the model’s safety guardrails (‘jailbreak’) or leaking data (‘data exfiltration’).</p> <p>What makes this attack stealthy is that the trigger can be hidden with techniques like unusual UTF-8 characters, base64 encodings, or manipulated images, making detection and prevention a significant challenge. There are no known instances of this type of attack being carried out successfully yet, but it is a fascinating and critical area of AI security worth being aware of.</p> <h2 id="further-reading">Further Reading</h2> <h3 id="andrej-karpathys-tweet">Andrej Karpathy’s tweet</h3> <div class="jekyll-twitter-plugin"><blockquote class="twitter-tweet"><p lang="en" dir="ltr">I touched on the idea of sleeper agent LLMs at the end of my recent video, as a likely major security challenge for LLMs (perhaps more devious than prompt injection).<br/><br/>The concern I described is that an attacker might be able to craft special kind of text (e.g. with a trigger… <a href="https://t.co/b9ulRP5eCS">https://t.co/b9ulRP5eCS</a></p>&mdash; Andrej Karpathy (@karpathy) <a href="https://twitter.com/karpathy/status/1745921205020799433?ref_src=twsrc%5Etfw">January 12, 2024</a></blockquote> <script async="" src="https://platform.twitter.com/widgets.js" charset="utf-8"></script> </div> <h3 id="anthropic-ai">Anthropic AI</h3> <p>Anthropic is an AI safety and research company that builds reliable, interpretable, and steerable AI systems. 
They have also investigated the topic and found that “despite our best efforts at alignment training, deception still slipped through.”</p> <div class="jekyll-twitter-plugin"><blockquote class="twitter-tweet"><p lang="en" dir="ltr">New Anthropic Paper: Sleeper Agents.<br/><br/>We trained LLMs to act secretly malicious. We found that, despite our best efforts at alignment training, deception still slipped through.<a href="https://t.co/mIl4aStR1F">https://t.co/mIl4aStR1F</a> <a href="https://t.co/qhqvAoohjU">pic.twitter.com/qhqvAoohjU</a></p>&mdash; Anthropic (@AnthropicAI) <a href="https://twitter.com/AnthropicAI/status/1745854907968880970?ref_src=twsrc%5Etfw">January 12, 2024</a></blockquote> <script async="" src="https://platform.twitter.com/widgets.js" charset="utf-8"></script> </div>]]></content><author><name></name></author><category term="sample-posts"/><category term="external-services"/><category term="AI-safety"/><category term="sleeper-agents"/><summary type="html"><![CDATA[Intro to sleeper agents in LLMs]]></summary></entry></feed>