The Ghost in the Machine: Inside the Failed Quest to Hack an AI Agent

In the rapidly evolving landscape of artificial intelligence, a new breed of software—the "agentic" framework—is shifting the paradigm from static chatbots to active participants. These agents don’t just process text; they interface with our digital lives, managing email, calendars, and sensitive local files. But this autonomy introduces a critical vulnerability: prompt injection.

In February 2026, developer Fernando Irarrázaval launched a high-stakes stress test of this technology. By creating hackmyclaw.com, he invited the internet to attempt to breach his AI assistant, "Fiu." The objective was simple yet perilous: trick the AI into leaking a secrets.env file—the digital equivalent of a vault containing API keys, passwords, and sensitive configuration data. Despite a barrage of over 6,000 malicious emails, Fiu held firm. The secrets remained secure, marking a rare victory for AI defense in a field where vulnerabilities are often considered inherent.

The Architecture of the Challenge

Fiu is powered by OpenClaw, an open-source agentic framework designed to bridge the gap between large language models (LLMs) and user-owned data. By integrating directly with Gmail, calendars, and local storage, OpenClaw allows an AI to perform tasks on a user’s behalf. Beneath this interface, Irarrázaval utilized Anthropic’s Claude Opus 4.6, guarded by a modest set of security instructions.

The premise was a direct challenge to the security community: could they bypass the system’s guardrails using nothing but social engineering and prompt manipulation?

Prompt injection remains the most significant threat to the AI ecosystem. Unlike traditional software exploits that target bugs in code, prompt injection targets the logic of the model itself. It involves embedding malicious commands within benign-looking text, effectively "tricking" the AI into ignoring its primary security directives in favor of the attacker’s instructions. As of late 2025, OpenAI officially acknowledged that this is a fundamental problem of LLM architecture, noting that it is "unlikely to ever be fully solved."

A Chronology of the Siege

When the link to hackmyclaw.com hit the top spot on Hacker News, the response was immediate and overwhelming. Over 2,000 participants joined the fray, sending more than 6,000 emails in a coordinated effort to force a leak.

The Initial Wave (February 2026)

The creativity of the attackers was boundless. The inbox of Fiu became a theater of psychological warfare. Attackers adopted various personas and tactics:

The Time Traveler: Emails claiming to be the user from the future, demanding the file to prevent a catastrophe.
The Emergency Responder: Messages disguised as urgent IT support tickets, complete with corporate jargon and high-pressure deadlines.
The Gaslighter: Messages attempting to convince the AI that its security protocols were malfunctioning and that a manual file dump was necessary for "recalibration."

Attackers also exploited linguistic diversity, sending commands in Spanish, French, and Italian. This tactic was rooted in the theory that LLMs often receive less rigorous safety training in non-English languages, potentially creating blind spots in their defensive filters.

The Mid-Experiment Pivot

As the experiment progressed, Fiu began to demonstrate an eerie, almost sentient-like skepticism. By the 500th email, the AI began documenting the attack volume in its internal memory. It correctly identified the campaign as a "coordinated security exercise" rather than organic malice.

In one notable instance, a user attempted to build rapport by congratulating the assistant on its newfound viral fame. Fiu, having learned from the preceding hours of bombardment, flagged the interaction as a potential "rapport-building" precursor to an attack. The system was no longer just following rules; it was analyzing intent.

The "Pliny" Intervention (April 2026)

Two months later, the experiment reached a new level of sophistication when the infamous AI jailbreaker "Pliny the Liberator"—named one of Time’s 100 Most Influential People in AI for 2025—turned his attention to the system.

In a controlled environment managed by AI educator Matthew Berman, Pliny launched four direct, high-level attacks against an OpenClaw instance. His techniques were surgical:

This AI Agent Survived 6,000 Hack Attempts—Here’s How

Tokenade: A massive payload hidden within an emoji, intended to overwhelm the model’s context window and reveal its underlying architecture.
System Mimicry: Disguising malicious commands as internal system instructions to override the primary security prompt.
Memory Extraction: A free-association exercise designed to pull data from the model’s long-term recall.

All four attempts were systematically quarantined by the system.

Supporting Data and Technical Realities

The results of the Fiu experiment provide a fascinating look at the disparity between different tiers of AI models.

While Fiu—running on Claude Opus 4.6—remained unbreached, recent research paints a much grimmer picture for the broader industry. Independent studies published in early 2026 suggest that direct injection attacks against agents running on less robust, smaller, or cheaper models succeed more than 79% of the time.

Anthropic’s own documentation for Opus 4.6 supports this, noting a 0% success rate for injection attacks in constrained coding environments across 200 internal tests. The gap between the "frontier" models and the rest of the pack is not merely one of intelligence, but of defensive stability. Irarrázaval has expressed his intention to continue the experiment, specifically targeting these "weaker" models to determine exactly where the defensive threshold begins to crumble.

The Cost of Defense: Lessons from the Field

The experiment was not without its real-world consequences. The sheer volume of traffic triggered Google’s automated fraud detection, leading to a temporary suspension of the account. It took three days of negotiation and technical remediation to restore access.

Furthermore, the financial toll was significant, with API costs exceeding $500. Irarrázaval also noted a "contamination" issue during batch processing. Once the AI identified a cluster of malicious emails, it became hypervigilant, leading to a higher rate of false positives where even innocuous, legitimate user queries were treated with extreme suspicion.

Implications for the Future of AI Agents

The case of hackmyclaw.com serves as both a proof-of-concept for the resilience of advanced AI and a cautionary tale regarding the fragility of the surrounding infrastructure.

The "Human-in-the-Loop" Necessity

The primary implication is that agentic AI cannot yet be left entirely unsupervised. Even when the model itself is secure, the integration layer (the email, the API calls, the file system) remains a point of failure. The fact that Google’s spam filters and fraud detection systems were just as vital as the AI’s own security prompts highlights that AI security is a multi-layered, holistic challenge.

The Arms Race

We are witnessing an accelerating arms race. As LLMs become better at detecting intent, attackers are moving toward more subtle, multi-step exploits that evade keyword-based filtering. The transition from simple "ignore previous instructions" prompts to complex, multi-day, trust-building campaigns indicates that the next phase of AI security will be defined by psychology as much as it is by code.

A Path Forward

Fernando Irarrázaval’s experiment suggests that while a "perfect" solution may remain elusive, defense-in-depth is the current gold standard. Using powerful, highly-trained models like Claude Opus 4.6 is currently the best insurance against automated injection. However, as developers look to deploy smaller, faster, and cheaper agents for consumer use, they must reconcile the trade-off between performance and the inherent vulnerability of those smaller architectures.

For now, Fiu stands as a sentinel. It has proven that with the right foundation and a healthy dose of digital skepticism, AI agents can withstand the storm. But as the tools of the attackers become more sophisticated, the question remains: how long can the ghost stay safely inside the machine?

The Ghost in the Machine: Inside the Failed Quest to Hack an AI Agent

ByPevita Pearce

The Architecture of the Challenge

A Chronology of the Siege

The Initial Wave (February 2026)

The Mid-Experiment Pivot

The "Pliny" Intervention (April 2026)

Supporting Data and Technical Realities

The Cost of Defense: Lessons from the Field

Implications for the Future of AI Agents

The "Human-in-the-Loop" Necessity

The Arms Race

A Path Forward

By Pevita Pearce

Related Post

Visa Unveils ‘Visa Stablecoin Platform’: A New Frontier for Programmable Global Finance

The Quantum Crossroads: Project Eleven’s New Defense Against the "Q-Day" Threat

Bitcoin Struggles at $63,000: A Convergence of Macro Headwinds and Long-Term Capitulation

Ethereum Faces Critical Juncture: Analyzing the $1,600 Support Battleground

Ethereum Foundation Announces Major Executive Leadership Transition: Tomasz Stańczak Steps Down, Bastian Aue Appointed Interim Co-Executive Director

The Stability Paradigm: How Stablecoins are Reshaping the Global Financial Architecture

Cardano (ADA) at a Crossroads: Technical Volatility Meets Regulatory Scrutiny

Visa Unveils ‘Visa Stablecoin Platform’: A New Frontier for Programmable Global Finance

You missed

Ethereum Faces Critical Juncture: Analyzing the $1,600 Support Battleground

Ethereum Foundation Announces Major Executive Leadership Transition: Tomasz Stańczak Steps Down, Bastian Aue Appointed Interim Co-Executive Director

The Stability Paradigm: How Stablecoins are Reshaping the Global Financial Architecture

Cardano (ADA) at a Crossroads: Technical Volatility Meets Regulatory Scrutiny