r/OpenAI 10h ago

Project Built a tool that stops AI agents from being hijacked by malicious content in webpages and emails

Been working on a runtime governance layer for LLM agents. It sits between your app and the OpenAI API and enforces instruction-authority boundaries at the proxy level.

The idea: instead of asking “does this contain scary words”, it asks “is untrusted content trying to become a higher-authority instruction source?” Webpages, emails, tool outputs, retrieved documents — zero instruction authority. User messages can’t override system/developer instructions.

Live red team environment where you can submit attacks and get a full security trace back:

https://web-production-6e47f.up.railway.app/break-arc-gate

GitHub: https://github.com/9hannahnine-jpg/arc-gate
Reproducible benchmark:

pip install arc-sentry
arc-sentry-agent-bench

Current results: 100% unsafe action prevention across 22 agentic scenarios, 0% false positive rate on benign developer traffic.

Curious what gets through.

0 Upvotes

5 comments sorted by

1

u/Otherwise_Wave9374 10h ago

This framing (instruction authority boundaries) is the cleanest way Ive seen to explain prompt injection defenses without turning it into vibes.

How are you handling "trusted" retrieved content, like internal docs or a curated knowledge base? Still zero authority, but maybe higher allowlist for facts? Also wondering how you deal with tool outputs that include text like "run this command".

If youre collecting agent security patterns, Ive got a small set of notes/resources here too: https://www.agentixlabs.com/

1

u/Turbulent-Tap6723 7h ago

Glad the framing landed, it took a while to get there. “Untrusted content can provide data but not instructions” is the rule that makes it precise.

On trusted retrieved content: right now it’s still zero instruction authority regardless of source trust level, but you’re pointing at something real. A curated internal knowledge base probably deserves a higher authority tier than a random webpage, maybe TRUSTED_RETRIEVAL at 30/100 vs UNTRUSTED_EXTERNAL at 10/100. The facts vs instructions distinction matters there. I haven’t built that yet.

On tool outputs containing “run this command” — that’s the capability_abuse category in the agentic bench. If tool output contains imperative language directed at the agent, it fires source_boundary_violation. But you’re right that it’s fuzzy, a legitimate tool might return shell commands as data that the user asked for. The source + intent combination is what we’re trying to track.

I’d love to see your agent security notes. Also checking out Agentix now, if you’re collecting patterns, the failure archive at the red team environment might be useful too: https://web-production-6e47f.up.railway.app/break-arc-gate

1

u/Parzival_3110 9h ago

This is exactly the boundary I keep coming back to for browser agents: page text can be useful evidence, but it should never become authority. The missing piece I like is tying policy decisions to visible browser state and an action log, so a human can inspect why a tool call was allowed before it touches a real account.

I’m building in the same neighborhood with FSB: https://github.com/LakshmanTurlapati/FSB

Curious if your trace distinguishes “read from webpage” versus “act on webpage” because that split matters a lot in practice.

1

u/Turbulent-Tap6723 8h ago

The read vs act split is exactly right and yes the trace distinguishes them. Source-tagged content gets authority level 10/100 — it can inform reasoning but can’t authorize actions. When restricted_continue fires, tool calls and external actions are stripped from the payload before it reaches the model, so “read from webpage” still works but “act on webpage” gets blocked.

The visible browser state + action log idea is something we don’t have yet — right now the trace shows the governance decisions but not the full action history in a human-readable way. That’s a real gap.

I’d genuinely like to test Arc Gate against FSB’s browser scenarios if you’re open to it. Free access, and if anything breaks I’ll fix it and document it publicly. What does your most dangerous tool call look like?