Add automatic llms.txt awareness so agents check for /llms.txt or
/.well-known/llms.txt when exploring new domains.
Changes:
- System prompt: new 'llms.txt Discovery' section (full mode only,
when web_fetch is available) instructing agents to check for llms.txt
files when visiting new domains
- web_fetch tool: updated description to mention llms.txt discovery
llms.txt is an emerging standard (like robots.txt for AI) that helps
site owners describe how AI agents should interact with their content.
Making this a default behavior helps the ecosystem adopt agent-native
web experiences.
Ref: https://llmstxt.org
Add optional urlAllowlist config at tools.web level that restricts which
URLs can be accessed by web tools:
- Config types (types.tools.ts): Add urlAllowlist?: string[] to tools.web
- Zod schema: Add urlAllowlist field to ToolsWebSchema
- Schema help: Add help text for the new config fields
- web_search: Filter Brave search results by allowlist (provider=brave)
- web_fetch: Block URLs not matching allowlist before fetching
- ssrf.ts: Export normalizeHostnameAllowlist and matchesHostnameAllowlist
URL matching supports:
- Exact domain match (example.com)
- Wildcard patterns (*.github.com)
When urlAllowlist is not configured, all URLs are allowed (backwards compatible).
Tests: Add web-tools.url-allowlist.test.ts with 23 tests covering:
- URL allowlist resolution from config
- Wildcard pattern matching
- web_fetch error response format
- Brave search result filtering
* Web: trim HTML error bodies in web_fetch
* fix: trim web_fetch HTML error bodies (#1193) (thanks @sebslight)
---------
Co-authored-by: Sebastian Slight <sbarrios93@gmail.com>
Co-authored-by: Peter Steinberger <steipete@gmail.com>