Context & Compaction
LiberClaw agents use a two-part system prompt design that optimizes for inference performance while keeping memory and skills up to date on every turn. When conversations grow long, a compaction system summarizes older messages to stay within the model’s context window.
System prompt structure
The system prompt is split into a static prefix and a dynamic suffix:
```
┌──────────────────────────────────┐
│ System message (static prefix)   │ ← Cached in KV cache
│ - Identity + date                │
│ - Available tools                │
│ - User instructions              │
│ - Memory system instructions     │
├──────────────────────────────────┤
│ Conversation history             │ ← Cached (grows each turn)
├──────────────────────────────────┤
│ Dynamic context (injected)       │ ← Changes each turn
│ - Memory (MEMORY.md + daily)     │
│ - Skills summaries               │
├──────────────────────────────────┤
│ Latest user message              │
└──────────────────────────────────┘
```
Static prefix
Built by `build_static_system_prompt()`, the static prefix contains:
- Identity block — Agent name, current date (date-only, not time), workspace path, and available tool names
- User instructions — The custom system prompt set by the agent’s owner
- Memory system instructions — Tells the agent how to use MEMORY.md and daily notes
The date is formatted as YYYY-MM-DD (not a timestamp) so the system prompt stays identical across turns within the same day. This enables prefix caching in llama.cpp / vLLM.
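The date handling can be sketched as follows. This is a minimal illustration, not the actual implementation; the function signature and prompt wording are assumptions, and only the date-only formatting behavior is taken from the description above.

```python
from datetime import date

def build_static_system_prompt(name: str, workspace: str, tools: list[str],
                               user_instructions: str) -> str:
    """Sketch of the static prefix: identity, tools, owner instructions.

    The date is rendered as YYYY-MM-DD only (never a timestamp), so the
    string is byte-identical across turns within the same day and the
    KV-cache prefix stays valid.
    """
    today = date.today().isoformat()  # e.g. "2026-01-15", date only
    return "\n".join([
        f"You are {name}. Today is {today}. Workspace: {workspace}.",
        "Available tools: " + ", ".join(tools),
        user_instructions,
        "Use MEMORY.md and daily notes for long-term memory.",
    ])
```

Because nothing in the output changes within a day, two calls on the same day produce identical strings, and the inference engine can reuse the cached prefix.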
Dynamic context
Built by `build_dynamic_context()`, the dynamic context loads:
- Long-term memory from `workspace/memory/MEMORY.md`
- Today’s daily notes from `workspace/memory/YYYY-MM-DD.md`
- Skills summaries scanned from `workspace/skills/*/SKILL.md`
This content is injected as a system message just before the last user message in the conversation history. Placing it near the end means the tokens before it (system prompt + older history) form a stable prefix that can be cached.
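The injection position can be sketched like this. The function name and message shape are assumptions for illustration; only the placement rule (a system message just before the last user message) comes from the description above.

```python
def inject_dynamic_context(messages: list[dict], dynamic: str) -> list[dict]:
    """Insert the dynamic context (memory + skills) as a system message
    just before the last user message, so every token before it remains
    a stable, cacheable prefix."""
    injected = {"role": "system", "content": dynamic}
    # Walk backwards to find the last user message.
    for i in range(len(messages) - 1, -1, -1):
        if messages[i]["role"] == "user":
            return messages[:i] + [injected] + messages[i:]
    # No user message at all: append at the end.
    return messages + [injected]
```

Everything up to the insertion point is unchanged from the previous turn, so only the injected message and the latest user message need fresh prefill.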
Why the split matters
LLM inference engines like llama.cpp and vLLM cache the KV (key-value) activations for token prefixes. If the first N tokens of a request match a previous request, those N tokens are served from cache instead of recomputed.
By keeping the system prompt and conversation history as a stable prefix, and placing the changing content (memory, skills) at the end, the agent gets faster inference on every turn after the first.
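The cache benefit comes down to how long the shared token prefix is between consecutive requests. A toy helper makes this concrete (this is an illustration of the concept, not code from the project or from llama.cpp/vLLM):

```python
def cached_prefix_tokens(prev: list[int], cur: list[int]) -> int:
    """Length of the shared token prefix between two requests: the portion
    a prefix-caching engine can serve from KV cache instead of recomputing."""
    n = 0
    for a, b in zip(prev, cur):
        if a != b:
            break
        n += 1
    return n
```

With the static prompt and history up front and the per-turn dynamic context at the end, the divergence point sits late in the token stream, so most of each request is a cache hit.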
Subagent prompts
Subagents use a lightweight prompt built by `build_subagent_prompt()` that excludes the owner’s custom instructions, memory, and skills. It includes:
- Identity (as a subagent of the parent agent)
- Available tools
- Optional persona (if the `spawn` call included one)
- Brief guidelines (stay focused, be concise, no further spawning)
Subagents do not use the cached prompt path — they run with a combined system prompt since they are short-lived and do not benefit from prefix caching.
Context compaction
When a conversation grows long enough that the token count approaches the model’s context window, older messages are summarized to free up space.
How it works
- Estimate tokens — Each turn, the agent estimates the token count of the full message list using a `chars/2` heuristic (conservative for code and JSON, which tokenize poorly with BPE).
- Check threshold — If the estimate exceeds `compaction_threshold` (default 75%) of the available context budget, compaction triggers.
- Split messages — The history is divided into “old” messages (to be summarized) and “recent” messages (to keep intact). The number of recent messages to preserve is controlled by `compaction_keep_messages` (default 20).
- Summarize — The old messages are sent to the LLM with a compaction prompt asking for a concise summary of key facts, decisions, preferences, and ongoing tasks.
- Replace in DB — The old messages in the database are replaced with the summary (stored as a user/assistant pair), and the recent messages are preserved.
- Reload — The conversation history is reloaded from the database with the compacted state.
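The steps above (excluding database persistence) can be sketched in a few lines. The function names, the summary-pair wording, and the `summarize` callback are assumptions; the `chars/2` estimate, the threshold check, and the old/recent split follow the description above.

```python
def estimate_tokens(messages: list[dict]) -> int:
    # chars/2 heuristic: deliberately conservative for code and JSON,
    # which tokenize poorly with BPE.
    return sum(len(m["content"]) for m in messages) // 2

def maybe_compact(messages: list[dict], summarize, budget: int,
                  threshold: float = 0.75, keep_messages: int = 20) -> list[dict]:
    """Sketch of the compaction flow: estimate, check, split, summarize,
    replace. `summarize` stands in for the LLM call with the compaction
    prompt."""
    if estimate_tokens(messages) <= budget * threshold:
        return messages                          # under threshold: no-op
    old, recent = messages[:-keep_messages], messages[-keep_messages:]
    if not old:
        return messages                          # nothing to summarize
    summary = summarize(old)                     # LLM compaction call
    # Store the summary as a user/assistant pair; keep recent intact.
    return [{"role": "user", "content": "Summarize the earlier conversation."},
            {"role": "assistant", "content": summary}] + recent
```

The real system additionally persists the replacement to the database and reloads the history, which this sketch omits.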
Token budget calculation
Section titled “Token budget calculation”context_limit = model_context_size - generation_reservetrigger = context_limit * compaction_thresholdKnown model context sizes:
- `qwen3-coder-next`: 131,072 tokens
- `glm-4.7`: 131,072 tokens
- Default (unknown models): 32,768 tokens
The `generation_reserve` (default 4,096 tokens) is subtracted from the context limit to leave room for the model’s output.
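Putting the formula and the size table together (the function and dictionary names are illustrative; the numbers are the ones listed above):

```python
# Known context sizes from the table above; unknown models fall back to 32k.
MODEL_CONTEXT_SIZES = {
    "qwen3-coder-next": 131_072,
    "glm-4.7": 131_072,
}
DEFAULT_CONTEXT = 32_768

def compaction_trigger(model: str, generation_reserve: int = 4096,
                       threshold: float = 0.75) -> int:
    """Token count at which compaction triggers for a given model."""
    context_limit = MODEL_CONTEXT_SIZES.get(model, DEFAULT_CONTEXT) - generation_reserve
    return int(context_limit * threshold)
```

For `glm-4.7` this gives (131,072 − 4,096) × 0.75 = 95,232 tokens; an unknown model triggers at (32,768 − 4,096) × 0.75 = 21,504 tokens.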
Overflow handling
Section titled “Overflow handling”If the compaction request itself is too large for the context window, the system iteratively drops the oldest half of the “old” messages until the request fits. If even the recent messages alone exceed the budget, the system returns them as-is and lets the model do its best with truncated input.
If the summarization inference call fails, the system falls back to the un-compacted message list rather than losing the conversation.
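The iterative-shrinking step can be sketched as below. The function name and the `fits` predicate are assumptions; the halving strategy and the recent-messages fallback follow the description above.

```python
def shrink_old_messages(old: list, recent: list, fits) -> tuple[list, list]:
    """Drop the oldest half of `old` until the compaction request fits
    the context window; fall back to `recent` alone if nothing fits."""
    while old:
        if fits(old + recent):
            return old, recent
        # Drop the oldest half (always at least one message, so the
        # loop terminates even when len(old) == 1).
        old = old[max(1, len(old) // 2):]
    return [], recent  # recent returned as-is, possibly over budget
```

In the worst case the old messages are discarded entirely and the recent messages are passed through unsummarized, matching the truncated-input behavior described above.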
Configuration
| Setting | Default | Description |
|---|---|---|
| `max_context_tokens` | 0 (auto-detect) | Override context window size |
| `generation_reserve` | 4096 | Tokens reserved for model output |
| `compaction_threshold` | 0.75 | Fraction of budget that triggers compaction |
| `compaction_keep_messages` | 20 | Recent messages preserved during compaction |
| `max_history` | 100 | Maximum messages loaded from database |
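The defaults above could be bundled as a settings object; this is a hypothetical sketch (the class name and field grouping are assumptions, the values are from the table):

```python
from dataclasses import dataclass

@dataclass
class CompactionConfig:
    max_context_tokens: int = 0        # 0 = auto-detect from model name
    generation_reserve: int = 4096     # tokens reserved for model output
    compaction_threshold: float = 0.75 # fraction of budget that triggers
    compaction_keep_messages: int = 20 # recent messages kept intact
    max_history: int = 100             # max messages loaded from database
```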