Agents: add system prompt safety guardrails (#5445)

* 🤖 agents: add system prompt safety guardrails

What:
- add safety guardrails to system prompt
- update system prompt docs
- update prompt tests

Why:
- discourage power-seeking or self-modification behavior
- clarify safety/oversight priority when conflicts arise

Tests:
- pnpm lint (pass)
- pnpm build (fails: DefaultResourceLoader missing in pi-coding-agent)
- pnpm test (not run; build failed)

* 🤖 agents: tighten safety wording for prompt guardrails

What:
- scope safety wording to system prompts/safety/tool policy changes
- document Safety inclusion in minimal prompt mode
- update safety prompt tests

Why:
- avoid blocking normal code changes or PR workflows
- keep prompt mode docs consistent with implementation

Tests:
- pnpm lint (pass)
- pnpm build (fails: DefaultResourceLoader missing in pi-coding-agent)
- pnpm test (not run; build failed)

* 🤖 docs: note safety guardrails are soft

What:
- document system prompt safety guardrails as advisory
- add security note on prompt guardrails vs hard controls

Why:
- clarify threat model and operator expectations
- avoid implying prompt text is an enforcement layer

Tests:
- pnpm lint (pass)
- pnpm build (fails: DefaultResourceLoader missing in pi-coding-agent)
- pnpm test (not run; build failed)
This commit is contained in:
Josh Palmer
2026-01-31 15:50:15 +01:00
committed by GitHub
parent 75093ebe1c
commit 7a6c40872d
5 changed files with 41 additions and 6 deletions

View File

@@ -16,6 +16,7 @@ The prompt is assembled by OpenClaw and injected into each agent run.
The prompt is intentionally compact and uses fixed sections:
- **Tooling**: current tool list + short descriptions.
- **Safety**: short guardrail reminder to avoid power-seeking behavior or bypassing oversight.
- **Skills** (when available): tells the model how to load skill instructions on demand.
- **OpenClaw Self-Update**: how to run `config.apply` and `update.run`.
- **Workspace**: working directory (`agents.defaults.workspace`).
@@ -28,6 +29,8 @@ The prompt is intentionally compact and uses fixed sections:
- **Runtime**: host, OS, node, model, repo root (when detected), thinking level (one line).
- **Reasoning**: current visibility level + /reasoning toggle hint.
Safety guardrails in the system prompt are advisory. They guide model behavior but do not enforce policy. Use tool policy, exec approvals, sandboxing, and channel allowlists for hard enforcement; operators can disable these by design.
## Prompt modes
OpenClaw can render smaller system prompts for sub-agents. The runtime sets a
@@ -36,9 +39,9 @@ OpenClaw can render smaller system prompts for sub-agents. The runtime sets a
- `full` (default): includes all sections above.
- `minimal`: used for sub-agents; omits **Skills**, **Memory Recall**, **OpenClaw
Self-Update**, **Model Aliases**, **User Identity**, **Reply Tags**,
**Messaging**, **Silent Replies**, and **Heartbeats**. Tooling, Workspace,
Sandbox, Current Date & Time (when known), Runtime, and injected context stay
available.
**Messaging**, **Silent Replies**, and **Heartbeats**. Tooling, **Safety**,
Workspace, Sandbox, Current Date & Time (when known), Runtime, and injected
context stay available.
- `none`: returns only the base identity line.
When `promptMode=minimal`, extra injected prompts are labeled **Subagent