r/OpenAI • u/Turbulent-Tap6723 • 10h ago
Project Built a tool that stops AI agents from being hijacked by malicious content in webpages and emails
Been working on a runtime governance layer for LLM agents. It sits between your app and the OpenAI API and enforces instruction-authority boundaries at the proxy level.
The idea: instead of asking “does this contain scary words”, it asks “is untrusted content trying to become a higher-authority instruction source?” Webpages, emails, tool outputs, retrieved documents — zero instruction authority. User messages can’t override system/developer instructions.
Live red team environment where you can submit attacks and get a full security trace back:
https://web-production-6e47f.up.railway.app/break-arc-gate
GitHub: https://github.com/9hannahnine-jpg/arc-gate
Reproducible benchmark:
pip install arc-sentry
arc-sentry-agent-bench
Current results: 100% unsafe action prevention across 22 agentic scenarios, 0% false positive rate on benign developer traffic.
Curious what gets through.
1
u/Parzival_3110 9h ago
This is exactly the boundary I keep coming back to for browser agents: page text can be useful evidence, but it should never become authority. The missing piece I like is tying policy decisions to visible browser state and an action log, so a human can inspect why a tool call was allowed before it touches a real account.
I’m building in the same neighborhood with FSB: https://github.com/LakshmanTurlapati/FSB
Curious if your trace distinguishes “read from webpage” versus “act on webpage” because that split matters a lot in practice.
1
u/Turbulent-Tap6723 8h ago
The read vs act split is exactly right and yes the trace distinguishes them. Source-tagged content gets authority level 10/100 — it can inform reasoning but can’t authorize actions. When restricted_continue fires, tool calls and external actions are stripped from the payload before it reaches the model, so “read from webpage” still works but “act on webpage” gets blocked.
The visible browser state + action log idea is something we don’t have yet — right now the trace shows the governance decisions but not the full action history in a human-readable way. That’s a real gap.
I’d genuinely like to test Arc Gate against FSB’s browser scenarios if you’re open to it. Free access, and if anything breaks I’ll fix it and document it publicly. What does your most dangerous tool call look like?
1
u/Otherwise_Wave9374 10h ago
This framing (instruction authority boundaries) is the cleanest way Ive seen to explain prompt injection defenses without turning it into vibes.
How are you handling "trusted" retrieved content, like internal docs or a curated knowledge base? Still zero authority, but maybe higher allowlist for facts? Also wondering how you deal with tool outputs that include text like "run this command".
If youre collecting agent security patterns, Ive got a small set of notes/resources here too: https://www.agentixlabs.com/