17 releases, 442 tests, 9 days: how we shipped a multi-channel, multi-backend pentest agent

By Lyrie Threat Intelligence·4/27/2026

Published: April 27, 2026

Author: Lyrie Research (research.lyrie.ai)

Repo: github.com/overthetopseo/lyrie-agent

Tags v0.3.0 – v0.3.3 shipped: April 27, 2026

TL;DR

On April 27, 2026 — day 9 of the repo being public — we shipped four pull requests (PR #36 through PR #39) covering a Python SDK, a vetted tool catalog with NL-recommend, 7 new channel adapters, and a pluggable backend abstraction. End-of-day: 442 tests, 0 failures, 10 channels, 3 execution backends, one pip install away from scanning. The repo is at ~514 stars with no marketing spend.

What shipped

PR #36 — Python SDK (v0.3.0 + sdk-py-v0.3.0)

pip install lyrie-agent now works. PR #36 is a pure-Python port of every pentest primitive we had in TypeScript: Shield Doctrine enforcement, Attack-Surface Mapper, Stages A–F validator, Multi-Lang Scanners, an HTTP Proxy with capture/classify/replay/mutator pipeline, EditEngine, and a Threat-Intel client. Zero mandatory runtime dependencies — httpx is opt-in via the lyrie-agent[http] extra so you don't pull network libraries into environments that don't need them.

We validated across the full matrix: Python 3.10, 3.11, 3.12, and 3.13 on Ubuntu and macOS — 63 tests, all green, every combination. The SDK mirrors the TypeScript surface faithfully enough that a security engineer who reads the TypeScript tests can run the Python equivalents without relearning anything. That was the design constraint.

PR #37 — Tools Catalog + cross-CI (v0.3.1)

PR #37 adds packages/core/src/tools-catalog/ — 19 categories, 35 vetted tools. The UX drew inspiration from Z4nzu/hackingtool (66K stars) but every line of code is original, and the curation policy is ours: Nuclei, Amass, Subfinder, httpx, Katana, theHarvester, sqlmap, Nikto, ffuf, Feroxbuster, Gobuster, Dirsearch, ZAP, Arjun, DalFox, XSStrike, Gitleaks, TruffleHog, BloodHound, NetExec, Impacket, Kerbrute, Prowler, ScoutSuite, Trivy, MobSF, Frida, Objection, Ghidra, Radare2, JadX, Volatility 3, Binwalk, Hashcat, John the Ripper, testssl.sh, PEASS-ng, Chisel, mitmproxy, Caido, Nmap, SpiderFoot, Shodan CLI, wafw00f.

What's excluded by policy: phishing kits, mass-DDoS frameworks, RATs, and wireless jammers. The catalog includes only enterprise-grade defensive tools and validated offensive tools with legitimate research use cases.

The recommend() engine is Lyrie-original: 16 idiomatic phrase hints (such as "find subdomains", "scan for SQLi", and "enumerate AD users") are scored against tag overlap. There is no LLM call at runtime. Scoring is deterministic and token-based, so it works offline and returns the same results for the same query.

$ lyrie tools recommend "find subdomains for example.com"
# → Amass, Subfinder, theHarvester, SpiderFoot
# Matched via: tag:recon, tag:subdomain, phrase:"find subdomains"

PR #37 also ships cross-CI templates under action/templates/: drop-in configs for GitLab CI, Jenkins, and CircleCI alongside the existing GitHub Actions workflow. 24 new tests.

PR #38 — Multi-channel gateway (v0.3.2)

Lyrie started with Telegram, WhatsApp, and Discord. PR #38 added 7 more adapters: Slack, Matrix, Mattermost, IRC, Feishu/Lark, Rocket.Chat, and Lyrie WebChat — bringing the total to 10. Every adapter implements the same ChannelBot contract:

// packages/gateway/src/common/types.ts
export interface ChannelBot {
  start(): Promise<void>;
  stop(): Promise<void>;
  send(response: UnifiedResponse): Promise<void>;
  onMessage(handler: (msg: UnifiedMessage) => Promise<void>): void;
}

UnifiedMessage in, UnifiedResponse out. The pentest agent never sees which channel it's running on. Shield Doctrine enforcement, the DM-pairing policy (only respond in DMs, never in group channels), and the Stages A–F validator run identically regardless of transport.
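
To make the contract concrete, here is a sketch of an in-memory adapter of the kind a mock transport test might use. The ChannelBot interface is copied from above; the fields on UnifiedMessage and UnifiedResponse, the MockChannelBot class, its inject() helper, and attachAgent() are hypothetical and only illustrate how channel-agnostic wiring and the DM-only rule could fit together.

```typescript
// Field names are assumptions; the real types live in packages/gateway/src/common/types.ts.
interface UnifiedMessage {
  channel: string;
  chatId: string;
  isDirect: boolean;
  text: string;
}

interface UnifiedResponse {
  chatId: string;
  text: string;
}

interface ChannelBot {
  start(): Promise<void>;
  stop(): Promise<void>;
  send(response: UnifiedResponse): Promise<void>;
  onMessage(handler: (msg: UnifiedMessage) => Promise<void>): void;
}

// In-memory adapter: records outbound responses instead of calling a service.
class MockChannelBot implements ChannelBot {
  readonly sent: UnifiedResponse[] = [];
  private handler: ((msg: UnifiedMessage) => Promise<void>) | null = null;
  private running = false;

  async start(): Promise<void> {
    this.running = true;
  }

  async stop(): Promise<void> {
    this.running = false;
  }

  async send(response: UnifiedResponse): Promise<void> {
    this.sent.push(response);
  }

  onMessage(handler: (msg: UnifiedMessage) => Promise<void>): void {
    this.handler = handler;
  }

  // Test helper (not part of the contract): simulate an inbound message.
  async inject(msg: UnifiedMessage): Promise<void> {
    if (this.running && this.handler !== null) await this.handler(msg);
  }
}

// Channel-agnostic wiring: the handler only sees the unified types.
function attachAgent(bot: ChannelBot): void {
  bot.onMessage(async (msg) => {
    if (!msg.isDirect) return; // DM-only policy: ignore group channels
    await bot.send({ chatId: msg.chatId, text: `ack: ${msg.text}` });
  });
}
```

Because attachAgent() depends only on ChannelBot, the same function would work unchanged against any adapter that honors the interface.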

A few implementation notes worth sharing:

IRC has no native rich-text buttons. We implemented a bracketed text fallback and a UTF-8-safe ~410-byte line splitter. IRCv3 server-time is parsed to maintain accurate message timestamps. It works, but IRC operators reviewing the adapter will notice we took some shortcuts — the line-split logic doesn't handle every edge case with multi-byte CJK strings. It's on the backlog.
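
A byte-budgeted splitter that never cuts inside a code point can be sketched as follows. This is not the adapter's code: the function names are hypothetical, and the sketch deliberately ignores word boundaries and multi-code-point grapheme clusters (combining marks, emoji sequences), which is the kind of edge case a production splitter would still need to handle.

```typescript
// UTF-8 byte length of a single Unicode code point.
function utf8Length(codePoint: number): number {
  if (codePoint <= 0x7f) return 1;
  if (codePoint <= 0x7ff) return 2;
  if (codePoint <= 0xffff) return 3;
  return 4;
}

// Split text into chunks whose UTF-8 encoding is at most maxBytes each.
// The 410-byte default mirrors the budget mentioned in the article.
function splitForIrc(text: string, maxBytes = 410): string[] {
  if (maxBytes < 4) throw new RangeError("maxBytes must fit at least one code point");
  const lines: string[] = [];
  let current = "";
  let currentBytes = 0;
  // for...of walks the string by code point, so surrogate pairs stay together.
  for (const ch of text) {
    const size = utf8Length(ch.codePointAt(0) ?? 0);
    if (currentBytes + size > maxBytes) {
      lines.push(current);
      current = "";
      currentBytes = 0;
    }
    current += ch;
    currentBytes += size;
  }
  if (current.length > 0) lines.push(current);
  return lines;
}
```

Counting bytes per code point rather than per UTF-16 unit is what keeps a three-byte CJK character or a four-byte emoji from being split across two lines.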

Matrix is E2EE-ready: the adapter has a device ID slot in its config struct, wired but not activated. Enabling Olm/Megolm would pull in matrix-js-sdk, which adds ~2MB to the bundle and changes the auth flow. We stubbed it rather than ship half-finished encryption.

Feishu/Lark is a single adapter with dual-host routing. Set isLark: true and apiHost() switches from open.feishu.cn to open.larksuite.com. The tenant_access_token cache slot handles the 2-hour token TTL. Same interactive-card rendering, both hosts.
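
The dual-host switch and the token slot can be sketched like this. Only isLark, apiHost(), and the two hostnames come from the adapter description; the config type, the https:// prefix, the TokenCache class, its injectable clock, and the one-minute refresh margin are assumptions for illustration.

```typescript
// Hypothetical config shape; only the isLark flag is taken from the article.
interface FeishuLarkConfig {
  isLark?: boolean;
}

function apiHost(config: FeishuLarkConfig): string {
  return config.isLark ? "https://open.larksuite.com" : "https://open.feishu.cn";
}

// Single-slot cache that treats a token as expired slightly before its TTL
// runs out, so a request never goes out with a token about to lapse.
class TokenCache {
  private token: string | null = null;
  private expiresAtMs = 0;
  private readonly now: () => number;
  private readonly marginMs: number;

  // The clock is injectable so tests do not depend on wall time.
  constructor(now: () => number = () => Date.now(), marginMs = 60000) {
    this.now = now;
    this.marginMs = marginMs;
  }

  get(): string | null {
    return this.token !== null && this.now() < this.expiresAtMs ? this.token : null;
  }

  set(token: string, ttlSeconds: number): void {
    this.token = token;
    this.expiresAtMs = this.now() + ttlSeconds * 1000 - this.marginMs;
  }
}
```

A caller would check get() first and fetch a fresh tenant_access_token only when it returns null, storing the result with set(token, 7200) for a two-hour TTL.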

WebChat is Lyrie's own browser widget. The WebSocket frame schema is stable and the origin allow-list accepts *.lyrie.ai wildcard subdomains, so research.lyrie.ai, app.lyrie.ai, and any future subdomain work without config changes.
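
A wildcard origin check of this kind is easy to get subtly wrong (a plain suffix match would accept evil-lyrie.ai), so here is a hedged sketch of one way to do it. The function name, the https-only rule, and the decision that the bare apex does not match the wildcard are assumptions; the widget's real matcher may differ.

```typescript
// Returns true when the origin's host matches one of the patterns.
// A pattern starting with "*." matches any subdomain, but not the bare apex.
function isAllowedOrigin(origin: string, patterns: string[] = ["*.lyrie.ai"]): boolean {
  // Assumption: only https origins are accepted; an optional port is allowed.
  const match = /^https:\/\/([a-z0-9.-]+)(?::\d+)?$/i.exec(origin);
  if (match === null) return false;
  const host = (match[1] ?? "").toLowerCase();
  return patterns.some((pattern) => {
    if (pattern.startsWith("*.")) {
      const suffix = pattern.slice(1); // ".lyrie.ai", leading dot included
      // The leading dot prevents "evil-lyrie.ai" from matching, and the
      // length check requires at least one label before the suffix.
      return host.endsWith(suffix) && host.length > suffix.length;
    }
    return host === pattern;
  });
}
```

Matching on the suffix with its leading dot is the important detail: it ties the wildcard to a label boundary instead of to a raw string ending.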

66 new tests. All adapters tested with mock transports — no live external service calls in CI.

PR #39 — Pluggable execution backends (v0.3.3)

Until now, Lyrie ran scans on whatever machine executed the command. PR #39 abstracts that into a Backend interface:

// packages/core/src/backends/index.ts
export interface Backend {
  isConfigured(): boolean;
  preflight(): Promise<PreflightResult>;
  run(req: BackendRunRequest): Promise<BackendRunResult>;
  cleanup?(runId: string): Promise<void>;
}

Three implementations ship:

LocalBackend — the default. Runs on the caller's machine. Always configured. Honors LYRIE_LOCAL_DRY_RUN=1 for test environments where you want to validate the pipeline without executing actual tool binaries.

DaytonaBackend — ephemeral, snapshot-based devboxes via Daytona's workspace API. The flow: POST /workspaces → POST /exec → GET /files/lyrie-runs/lyrie.sarif → DELETE /workspaces/{id}. TTL safety net defaults to 1800 seconds — the workspace is force-deleted even if the run hangs. This is the right backend for sandboxed PR scans where you don't want a misbehaving tool writing to your CI runner's filesystem.

ModalBackend — serverless burst via Modal's function invocation API. POST to https://api.modal.com/v1/functions/invoke, wait for result, surface costUsd if Modal reports it. Pay-per-second billing means you can run hundreds of parallel repo scans without provisioning infrastructure.
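
The Daytona lifecycle described above (create, exec, fetch SARIF, delete) can be sketched with deletion in a finally block. This is an illustration under stated assumptions, not DaytonaBackend itself: the FetchFn and HttpResponse shapes, the function name, the nesting of /exec and /files under the workspace id, and the JSON bodies (ttlSeconds, command, id) are all hypothetical.

```typescript
// Minimal HTTP abstraction so a test can inject a fake; not the real fetch type.
interface HttpResponse {
  status: number;
  text(): Promise<string>;
}
type FetchFn = (
  url: string,
  init: { method: string; body?: string },
) => Promise<HttpResponse>;

interface RunResult {
  workspaceId: string;
  sarif: string;
}

async function runInEphemeralWorkspace(
  fetchFn: FetchFn,
  baseUrl: string,
  command: string,
  ttlSeconds = 1800,
): Promise<RunResult> {
  // 1. Create the workspace. Passing the TTL here is an assumption.
  const created = await fetchFn(`${baseUrl}/workspaces`, {
    method: "POST",
    body: JSON.stringify({ ttlSeconds }),
  });
  if (created.status >= 300) throw new Error(`create failed: HTTP ${created.status}`);
  const parsed = JSON.parse(await created.text()) as { id?: unknown };
  if (typeof parsed.id !== "string") throw new Error("create response missing id");
  const workspaceId = parsed.id;

  try {
    // 2. Execute the scan command inside the workspace.
    const exec = await fetchFn(`${baseUrl}/workspaces/${workspaceId}/exec`, {
      method: "POST",
      body: JSON.stringify({ command }),
    });
    if (exec.status >= 300) throw new Error(`exec failed: HTTP ${exec.status}`);

    // 3. Fetch the SARIF report written by the run.
    const file = await fetchFn(
      `${baseUrl}/workspaces/${workspaceId}/files/lyrie-runs/lyrie.sarif`,
      { method: "GET" },
    );
    if (file.status >= 300) throw new Error(`fetch failed: HTTP ${file.status}`);
    return { workspaceId, sarif: await file.text() };
  } finally {
    // 4. Delete on success and on failure alike.
    await fetchFn(`${baseUrl}/workspaces/${workspaceId}`, { method: "DELETE" });
  }
}
```

The finally block covers failed runs, while a hung run never reaches it. That gap is what a server-side TTL is for.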

Switching backends is one environment variable:

LYRIE_BACKEND=modal lyrie scan --target github.com/org/repo

LYRIE_BACKEND=daytona lyrie scan --target github.com/org/repo
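
One way such a switch could be wired is a small registry keyed by the LYRIE_BACKEND value. The sketch below is hypothetical: the Backend interface is trimmed to two methods, the request and result shapes are invented, and the environment is passed in as a plain object so the example stays testable. Only the two variable names and the "always configured" and dry-run behaviors of LocalBackend come from the article.

```typescript
// Trimmed-down, hypothetical shapes; the real interface has more members.
interface BackendRunRequest {
  command: string[];
}
interface BackendRunResult {
  exitCode: number;
  dryRun: boolean;
}
interface Backend {
  isConfigured(): boolean;
  run(req: BackendRunRequest): Promise<BackendRunResult>;
}

type Env = Record<string, string | undefined>;
type BackendFactory = (env: Env) => Backend;

class LocalBackend implements Backend {
  private readonly env: Env;

  constructor(env: Env) {
    this.env = env;
  }

  isConfigured(): boolean {
    return true; // the local backend needs no credentials
  }

  async run(req: BackendRunRequest): Promise<BackendRunResult> {
    if (this.env["LYRIE_LOCAL_DRY_RUN"] === "1") {
      // Validate the pipeline without executing any tool binary.
      return { exitCode: 0, dryRun: true };
    }
    throw new Error(`real execution is out of scope for this sketch: ${req.command.join(" ")}`);
  }
}

function defaultRegistry(): Map<string, BackendFactory> {
  return new Map<string, BackendFactory>([
    ["local", (env: Env) => new LocalBackend(env)],
  ]);
}

// Pick a backend by name, defaulting to "local" when the variable is unset.
function selectBackend(env: Env, registry: Map<string, BackendFactory>): Backend {
  const name = env["LYRIE_BACKEND"] ?? "local";
  const factory = registry.get(name);
  if (factory === undefined) throw new Error(`unknown backend: ${name}`);
  const backend = factory(env);
  if (!backend.isConfigured()) throw new Error(`backend not configured: ${name}`);
  return backend;
}
```

Failing fast on an unknown or unconfigured name keeps a typo in the environment variable from silently falling back to local execution.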

The Daytona and Modal backends accept an injectable FetchFn for testing, which is why all 33 backend tests are fully deterministic — the state machine is driven by mocked HTTP responses, no real network calls. Deployment recipes live in deploy/modal/lyrie_modal.py and deploy/daytona/lyrie.devcontainer.json.
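
To show what an injectable fetch buys in a test, here is a sketch of a scripted test double driving a Modal-style invoke call. The invoke URL is the one quoted above; everything else is assumed for illustration, including the FetchFn shape, the bearer-token header, the request payload, the output field in the response, and the helper names.

```typescript
interface HttpRequest {
  method: string;
  headers: Record<string, string>;
  body: string;
}
interface HttpResponse {
  status: number;
  json(): Promise<unknown>;
}
type FetchFn = (url: string, init: HttpRequest) => Promise<HttpResponse>;

interface InvokeResult {
  output: string;
  costUsd?: number; // only set when the service reports a cost
}

async function invokeScan(
  fetchFn: FetchFn,
  token: string,
  target: string,
): Promise<InvokeResult> {
  const res = await fetchFn("https://api.modal.com/v1/functions/invoke", {
    method: "POST",
    headers: { authorization: `Bearer ${token}`, "content-type": "application/json" },
    body: JSON.stringify({ target }),
  });
  if (res.status < 200 || res.status >= 300) {
    throw new Error(`invoke failed: HTTP ${res.status}`);
  }
  const data = ((await res.json()) ?? {}) as { output?: unknown; costUsd?: unknown };
  const result: InvokeResult = { output: String(data.output ?? "") };
  if (typeof data.costUsd === "number") result.costUsd = data.costUsd;
  return result;
}

// Test double: replays queued responses in order and records every request.
function scriptedFetch(responses: Array<{ status: number; body: unknown }>) {
  const queue = responses.slice();
  const requests: Array<{ url: string; init: HttpRequest }> = [];
  const fetchFn: FetchFn = async (url, init) => {
    requests.push({ url, init });
    const next = queue.shift();
    if (next === undefined) throw new Error("unexpected request: script exhausted");
    return { status: next.status, json: async () => next.body };
  };
  return { fetchFn, requests };
}
```

A test builds scriptedFetch([...]) with the exact responses it wants, passes fetchFn into the code under test, and then asserts on both the return value and the recorded requests, with no network and no billing involved.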

The architectural through-line

PR #36 through PR #39 weren't planned as a single sprint. They converged on the same day because three independent lines of work arrived at the same design at the same time.

ChannelBot (PR #38), Backend (PR #39), and recommend() (PR #37) are all the same pattern: define a contract, ship multiple implementations, let the core engine stay ignorant of which implementation it's talking to. The Python SDK (PR #36) extends this to the language boundary — the SDK implements the same doctrine interface that the TypeScript engine uses, so a Python caller and a TypeScript caller have identical behavior guarantees.

The result is a stack that looks like this:

Execution target:     [ Local ] [ Daytona ] [ Modal ]
                               ↑ Backend interface
Core pentest engine:  [ Shield ] [ ASM ] [ Stages A-F ] [ Tools Catalog ]
                               ↑ UnifiedMessage / UnifiedResponse  
Channel:              [ Telegram ] [ Slack ] [ Matrix ] [ IRC ] [ ... 6 more ]
Language:             [ TypeScript ] [ Python SDK ]

Nothing in the core engine knows which channel it's on. Nothing in the gateway knows which backend will execute the scan. The recommend engine doesn't call an LLM. The Python SDK doesn't import httpx unless you ask it to.

That discipline — keep interfaces narrow, keep implementations substitutable, keep hard dependencies optional — is what let us ship 4 major PRs on one day without regressions. Every new component was independently testable from day one.

What we learned

Interface-first made parallel development safe. We defined ChannelBot and Backend as TypeScript interfaces before writing any implementation. That let us build Slack and LocalBackend in parallel without stepping on each other. The contract was the synchronization point, not a shared mutable module.

Mock transports are worth the setup cost. Writing injectable FetchFn for the Daytona and Modal backends took maybe two hours. Those two hours bought us 33 deterministic tests that catch regressions without network calls, billing, or flaky CI. The IRC line-split logic is tested against a mock TCP socket — we found three off-by-one errors before the adapter touched a real IRC server.

Python's packaging story is better than it was. pyproject.toml with optional extras ([http] for httpx) and a clean src/ layout gave us a package that builds cleanly across 8 matrix combinations (4 Python versions × 2 OS). The only real friction was Pydantic v1/v2 compatibility — we ended up writing a thin compatibility shim rather than pinning a version.

Cross-CI templates expose assumptions. Writing GitLab CI and CircleCI configs after the GitHub Actions workflow already worked forced us to audit what our CI scripts assumed about the environment. We were implicitly relying on GitHub's GITHUB_OUTPUT convention in two places. Those assumptions would have caused silent failures in GitLab. Fixing them made the local runner path more robust too.

IRC in 2026 still has users. We added IRC mostly as a proof-of-concept for "what's the hardest channel to support." Three people filed issues asking for it in the first week. We weren't expecting that.

What's still rough: The Matrix E2EE slot is stubbed, not implemented. The recommend engine's 16 phrase hints cover the most common recon and web-app testing queries but miss cloud-specific and mobile testing intents. The Python SDK doesn't yet have async variants for the HTTP proxy components. We shipped what was solid and documented what wasn't.

Why this matters for defenders

A pentest agent that runs only on one machine, speaks only one protocol, and requires one specific runtime is a tool you evaluate once and forget. The channel abstraction in v0.3.2 means a red team can run Lyrie from their existing Slack workspace or Matrix homeserver without standing up a new communication channel. The backend abstraction in v0.3.3 means a CISO can run sandboxed PR scans in ephemeral Daytona workspaces without giving Lyrie write access to production CI runners.

The tools catalog matters differently: 35 vetted tools with an NL-recommend interface means a developer unfamiliar with offensive tooling can type lyrie tools recommend "check for secrets in this repo" and get Gitleaks and TruffleHog — not a 185-tool menu they have to navigate manually.

The Python SDK matters for integration. Security teams already run Python — SOAR playbooks, custom hunting scripts, threat-intel pipelines. Being importable as a Python library means Lyrie primitives can be called from those environments without shelling out to a subprocess.

None of this replaces a skilled red team. It makes a skilled red team faster and makes defender tooling easier to integrate.

The competitive picture

| Dimension | Strix | hackingtool | Lyrie |
|---|:---:|:---:|:---:|
| Channels | 0 | 0 | 10 |
| Python SDK | ❌ | ❌ | pip install ready |
| Pluggable backend | ❌ | ❌ | Local + Daytona + Modal |
| Vendor lock-in | n/a | n/a | None |

Strix and hackingtool solve different problems — Strix is an AI security agent, hackingtool is a tool launcher. We're not claiming to replace either. But on the specific dimensions that matter for embedding a pentest agent into team workflows and CI pipelines, those gaps are real.

The 185 tools in hackingtool are not vetted against any security policy. That's a different philosophy: catalog everything, let the user decide. Lyrie's philosophy is smaller surface, explicit policy. We'll never have more tools than hackingtool; that's a feature.

What's next

A few things are already in-progress or on the roadmap:

Matrix E2EE — activate the device ID slot, integrate Olm/Megolm, measure bundle size impact
Python SDK async — await proxy.capture() etc. for the HTTP proxy components
recommend() expansion — cloud-specific and mobile testing intent phrases
Backend cost reporting — surface costUsd uniformly across all backends, not just Modal
SARIF viewer — a lightweight web view of scan results without requiring a full CI integration
v0.4.0 — TBD, but likely centered on the report layer

No promises on timeline. We ship when it's solid.

Reproducible artifacts

All four PRs are merged and tagged on the public repo:

PR #36 — Python SDK: github.com/overthetopseo/lyrie-agent/pull/36
PR #37 — Tools Catalog + cross-CI: github.com/overthetopseo/lyrie-agent/pull/37
PR #38 — Multi-channel gateway: github.com/overthetopseo/lyrie-agent/pull/38
PR #39 — Pluggable backends: github.com/overthetopseo/lyrie-agent/pull/39

Tags: v0.3.0, sdk-py-v0.3.0, v0.3.1, v0.3.2, v0.3.3

# Python SDK
pip install lyrie-agent
pip install lyrie-agent[http]   # with httpx

# Tools catalog
lyrie tools list
lyrie tools recommend "find subdomains for example.com"
lyrie tools search nuclei

# Run against Modal backend
LYRIE_BACKEND=modal lyrie scan --target github.com/org/repo

# Dry-run locally
LYRIE_LOCAL_DRY_RUN=1 lyrie scan --target ./local-repo

Repo: github.com/overthetopseo/lyrie-agent
PyPI candidate: pypi.org/project/lyrie-agent/

Lyrie Verdict

Four PRs, four clean CI matrices, 186 new tests against a zero-failure baseline, all on day 9 of the repo being public. The interfaces we defined early (ChannelBot, Backend, the Python SDK contract) held up under parallel development without regressions. The biggest remaining technical debt is Matrix E2EE and the Python async proxy layer; both are documented and scoped. If you run security tooling in Python, integrate Lyrie Agent into Slack or Mattermost, or want sandboxed PR scans in Daytona or serverless burst on Modal, the abstractions are ready. Pick up the tag.

Lyrie Research is the offensive security research arm of Lyrie.ai and OTT UAE. We build open-source pentest infrastructure and publish what we learn.

Repo: github.com/overthetopseo/lyrie-agent

#lyrie-original #engineering #release #architecture #open-source #pentest
