AgentSey
AgentSey is a verified software factory. We help our customers run their entire software development lifecycle on AI, with minimal human involvement. The factory is a multi-agent system, and we train the small models behind it ourselves. Every artifact we ship comes with a machine-checkable proof of correctness and security.
The bottleneck in modern engineering is coordination between people: handoffs across roles, code review, regression testing, and verification. Verification gets skipped almost everywhere because it is too slow and too expensive. AgentSey closes that loop and runs verification by default.
When something inside our system goes wrong, we fix the station that produced it. Every customer downstream of that station gets the fix from then on. That is how a factory compounds.
The factory
Seven coordinated stations, one learned orchestrator, one verification gate before anything ships.
Outputs that fail at Verify cycle back to Implement. Anything that passes ships as (code, spec, proof).
Each station is an ensemble of small models trained for that station's task, backed by structured external memory (AST graphs, proof DAGs, accumulated Miss→Rule libraries) offloaded across DRAM, NVMe, and computational storage. A learned orchestrator routes tasks, allocates context budget, and decides when to verify.
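The routing loop can be sketched in a few lines. This is an illustrative toy, not the production orchestrator: the station names beyond Implement, Security, and Verify are hypothetical, and the real routing policy is learned rather than hard-coded.

```python
from dataclasses import dataclass, field

# Hypothetical station names; only Implement, Security, and Verify
# are named in the text above. The real set and order are learned.
STATIONS = ["Plan", "Spec", "Implement", "Test", "Security", "Prove", "Verify"]

@dataclass
class Task:
    payload: str
    station: str = "Plan"
    history: list = field(default_factory=list)

def run_station(task: Task) -> bool:
    """Stand-in for a station's small-model ensemble. Every station
    trivially succeeds except Verify, whose gate we simulate: it passes
    only after Implement has run a second time."""
    task.history.append(task.station)
    if task.station == "Verify":
        return task.history.count("Implement") >= 2
    return True

def orchestrate(task: Task) -> Task:
    """Route a task through the stations; a failure at Verify cycles
    the task back to Implement, mirroring the loop described above."""
    i = 0
    while i < len(STATIONS):
        task.station = STATIONS[i]
        ok = run_station(task)
        if task.station == "Verify" and not ok:
            i = STATIONS.index("Implement")  # cycle back to Implement
            continue
        i += 1
    return task

result = orchestrate(Task("add login endpoint"))
```

After one simulated Verify failure, `result.history` records two passes through Implement and two visits to Verify, ending with the gate passing.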
Why small models
Modern LLMs are weak at exploration and proof construction. These are exactly the operations software verification requires: walking adversarial paths, finding edge cases, constructing or checking invariants. A frontier model run once against a real codebase will miss most of what matters and produce confident-looking but unsound reasoning on the rest.
The fix is parallelism. A few hundred specialized agents probing the system from different angles will collectively find issues that any single agent would miss: fuzzers, invariant checkers, symbolic executors, exploit constructors, taint trackers, regression generators. Coverage produces the signal.
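The fan-out can be sketched as follows. The three checker stubs are placeholders for real agents (fuzzers, symbolic executors, taint trackers); the point is the shape, fanning one target out to many specialized probes and taking the union of their findings.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical checker stubs: each probes the same target from a
# different angle and reports a set of findings.
def fuzzer(code):
    return {"overflow in parse()"} if "parse" in code else set()

def invariant_checker(code):
    return {"balance invariant violated"} if "withdraw" in code else set()

def taint_tracker(code):
    return {"unsanitized input reaches exec()"} if "exec" in code else set()

CHECKERS = [fuzzer, invariant_checker, taint_tracker]

def parallel_verify(code: str) -> set:
    """Fan the same target out to every specialized checker and union
    the results: coverage, not any single agent, produces the signal."""
    with ThreadPoolExecutor(max_workers=len(CHECKERS)) as pool:
        results = pool.map(lambda check: check(code), CHECKERS)
    findings = set()
    for r in results:
        findings |= r
    return findings

issues = parallel_verify("def withdraw(amt): exec(parse(amt))")
```

Each stub alone reports at most one issue; the ensemble reports all three.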
But running 200 frontier-model instances against one codebase for a single day costs more than a senior engineer's annual salary. At current pricing, parallel verification at production density is economically impossible. This is why nobody runs it today.
The architecture we use is many narrow specialized small models, one per station, with a single frontier model handling the steps that genuinely require reasoning. We train these models on data the factory generates as it runs. Every customer workflow improves the next generation.
Inference
The factory runs in the background. We built an inference stack for exactly this profile: background, multi-agent, throughput-first.
Inference runs roughly 10–100× cheaper than a frontier-only baseline, and the same GPU hour that serves one chat response in an interactive system serves hundreds of agent steps in ours. The mechanics: GPU scheduling that mixes small-model calls with occasional frontier reasoning steps, aggressive use of spot capacity for the underlying hardware, and agents that self-evict their KV cache on preemption so spot interruptions never lose in-flight work.
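The self-eviction mechanic can be sketched in miniature. This is a minimal sketch under two assumptions not stated above: the provider delivers SIGTERM ahead of reclaiming a spot node, and the checkpoint is a small JSON record holding progress plus a handle to a KV cache already flushed to fast storage. The path and state shape are illustrative.

```python
import json
import os
import signal
import tempfile

# Illustrative checkpoint location; a real system would use durable storage.
CKPT = os.path.join(tempfile.gettempdir(), "agent_step.ckpt")

class AgentStep:
    """One in-flight agent step that self-evicts on a preemption signal."""

    def __init__(self):
        self.state = {"tokens_done": 0, "kv_cache_ref": None}
        # Assume the cloud provider signals SIGTERM before reclamation.
        signal.signal(signal.SIGTERM, self._evict)

    def _evict(self, signum, frame):
        # Persist enough to resume: a progress counter plus a reference
        # to the evicted KV cache, so no in-flight work is lost.
        with open(CKPT, "w") as f:
            json.dump(self.state, f)

    def resume(self):
        # On rescheduling, pick up exactly where the preempted step left off.
        if os.path.exists(CKPT):
            with open(CKPT) as f:
                self.state = json.load(f)

step = AgentStep()
step.state["tokens_done"] = 128
os.kill(os.getpid(), signal.SIGTERM)   # simulate a preemption notice
fresh = AgentStep()                    # stand-in for the rescheduled process
fresh.resume()
```

After the simulated preemption, `fresh.state["tokens_done"]` is 128: the interruption cost is a checkpoint write, not lost work.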
At these costs, running hundreds of specialized agents against one codebase becomes affordable.
Vulnerabilities found
The same architecture, applied to the security station, has produced dozens of critical- and high-severity findings across production browsers, crypto infrastructure, databases, and AI systems, including multiple sandbox escapes in Chromium.
Full disclosure record
Chromium / Chrome
| Product area | Bug class | Severity | Identifier | Date |
|---|---|---|---|---|
| Chrome / WebCodecs | Use-after-free | Critical | CVE-2026-5280 | 2026-03-31 |
| Chrome / Proxy | Use-after-free | Critical | CVE-2026-6297 | 2026-04-15 |
| Chrome / Rendering / Canvas | Use-after-free | Critical | CVE-2026-7363 | 2026-04-28 |
| Chrome / Network | Use-after-free | High | CVE-2026-4454 | 2026-03-18 |
| Chrome / Codecs | Integer overflow | High | CVE-2026-5274 | 2026-03-31 |
| Chrome / Rendering | Heap buffer overflow | High | CVE-2026-4462 | 2026-03-18 |
| Chrome / WebCodecs | Out-of-bounds read | Medium | CVE-2026-7933 | 2026-05-05 |
| Chrome / WebGL | Inappropriate implementation | Medium | CVE-2026-5291 | 2026-03-31 |
Other findings
- RisingWave, 2024. Remote code execution on any compute node from read-only or low-privilege accounts.
- Devin.ai, 2024. SSRF leading to user information leaks and complete system takeover.
- Kaito, 2024. API issues leading to user information leaks and complete system takeover.
- Etherscan, 2024. XSS and Cloudflare bypass enabling account takeover risk and phishing attacks.
Publications
- Synthesizing Multi-Agent Harnesses for Vulnerability Discovery (AgentFlow). Formalizes harness synthesis for security agents; the foundation of the architecture described above.
- Your Agent Is Mine: Measuring Malicious Intermediary Attacks on the LLM Supply Chain (MINE). First work to characterize LLM routing as a supply-chain attack surface; the basis for the integrity layer in our production router.
- Semia: Auditing Agent Skills via Constraint-Guided Representation Synthesis (SEMIA). Audits hybrid skill specifications as Datalog before deployment.
Team
AgentSey is built by researchers in programming languages, formal verification, and computer security, with operating experience in AI infrastructure. Members of the team have received Distinguished or Best Paper Awards at PLDI 2018, PLDI 2022, ASE 2020, CHI 2021, and ACM SIGSOFT, alongside a Google Research Scholar Award and Ethereum Academic Award. Earlier work from the group is widely cited as foundational in formal methods and program analysis.
Between us we have shipped a certified JavaScript compiler verified end-to-end, a multi-agent security system responsible for dozens of critical- and high-severity findings in production systems, the only production LLM router with a research-grade integrity layer, a managed cloud runtime for autonomous agents, and an in-house training stack for the specialized small models behind the factory.
We are hiring
We are hiring security researchers, systems engineers, and ML specialists to extend the factory and the infrastructure beneath it. The work spans program analysis, fuzzing, exploitation, sandboxing, formal verification, large-scale agent orchestration, small-model training, and storage-aware inference. The best candidates move comfortably between low-level systems work and agent system design.
Software Engineering Intern (Remote)
What you'll do
- Ship real engineering work using Codex, Claude Code, and autoresearch as your primary tools.
- Build and refine the multi-agent development workflows the rest of the team runs on.
- Write production Python and Rust by hand for the parts that matter.
- Work on real systems: databases, AWS, Kubernetes, deployment, and architecture.
- Make AI coding agents do actual engineering, not just answer questions.
What we look for
- Daily use of Codex and Claude Code.
- Ability to run several agent sessions and development tasks in parallel without losing the thread.
- Hands-on experience with autoresearch or similar AI coding workflows.
- Strong Python and Rust without AI: you can write the critical logic yourself.
- Working knowledge of databases, AWS, and Kubernetes.
- You drive the agents; the agents do not drive you.
- Comfort reading English technical documentation and unfamiliar codebases.
Nice to have
- Prior work in security research, fuzzing, systems programming, compilers, blockchain, or infrastructure.
- Public GitHub projects of your own.
- A real point of view on context management, task decomposition, and tool-call design in Claude Code, Codex, or other agent systems.
- Genuine interest in developer tooling and automation.
You may still be in school, but you already ship faster with AI than most working engineers. You may not have a big-company title, but you can land in an unfamiliar repo and find the real problem before the people who wrote it do. You are not really an intern in the traditional sense; you are a hacker who has already figured out how to use agents as leverage.
Security Research Intern (Remote)
What you'll do
- Work on the Security and Verify stations of the factory: applying LLMs to theorem proving, program verification, and adversarial reasoning at production scale.
- Contribute to formal verification of smart contracts, zero-knowledge proof systems, and consensus algorithms, including ongoing work with the Ethereum Foundation and Google Security Team.
- Build LLM-guided search (MCTS, beam search, structured exploration) for verification tactics, compiler optimizations, and exploit synthesis.
- Turn informal proofs and specifications into machine-checkable forms.
- Run real audits on production targets and ship the findings.
What we look for
- Solid background in at least one of: formal methods, logic, automated theorem proving, programming languages theory, cryptography, or applied AI/ML.
- Hands-on experience with at least one proof assistant or verification tool (Coq, Lean, Isabelle, Dafny, Z3, CBMC, or similar).
- Strong programming ability in a systems or research language (Rust, C++, OCaml, Python, etc.).
- Comfort reading research papers and turning ideas into working prototypes within days, not months.
- Strong written and spoken English.
Nice to have
- Publications at PLDI, POPL, OOPSLA, CAV, CCS, S&P, USENIX Security, or similar venues.
- Open-source contributions to formal verification or program analysis tools.
- Prior work on smart contract security, zkEVMs, or consensus protocols.
- Experience automating theorem provers with LLMs.
You want to build research, not just consume it. You read papers because you want to write the next one, not pass an exam. You have an opinion on how LLMs should — and should not — be used inside a verification pipeline, and you are comfortable straddling theory and implementation.
For careers, research collaborations, and pilot inquiries, contact hr@riema.xyz.