Context & Compaction

LiberClaw agents use a two-part system prompt design that optimizes for inference performance while keeping memory and skills up to date on every turn. When conversations grow long, a compaction system summarizes older messages to stay within the model’s context window.

The system prompt is split into a static prefix and dynamic suffix:

┌────────────────────────────────────┐
│ System message (static prefix)     │ ← Cached in KV cache
│  - Identity + date                 │
│  - Available tools                 │
│  - User instructions               │
│  - Memory system instructions      │
├────────────────────────────────────┤
│ Conversation history               │ ← Cached (grows each turn)
├────────────────────────────────────┤
│ Dynamic context (injected)         │ ← Changes each turn
│  - Memory (MEMORY.md + daily)      │
│  - Skills summaries                │
├────────────────────────────────────┤
│ Latest user message                │
└────────────────────────────────────┘

Built by build_static_system_prompt(), the static prefix contains:

  1. Identity block — Agent name, current date (date-only, not time), workspace path, and available tool names
  2. User instructions — The custom system prompt set by the agent’s owner
  3. Memory system instructions — Tells the agent how to use MEMORY.md and daily notes

The date is formatted as YYYY-MM-DD (not a timestamp) so the system prompt stays identical across turns within the same day. This enables prefix caching in llama.cpp / vLLM.
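A minimal sketch of this idea (the function name and prompt layout are illustrative assumptions, not the actual LiberClaw API): formatting the date with `date.today().isoformat()` yields `YYYY-MM-DD`, so the assembled prompt is byte-identical for every turn within the same day.

```python
from datetime import date

def build_static_system_prompt(agent_name, workspace, tools, user_instructions):
    # Hypothetical sketch. Date-only (YYYY-MM-DD), never a timestamp,
    # so the prompt stays byte-identical all day and the KV-cache
    # prefix remains valid across turns.
    today = date.today().isoformat()
    return "\n\n".join([
        f"You are {agent_name}. Today is {today}. Workspace: {workspace}.",
        "Available tools: " + ", ".join(tools),
        user_instructions,
        "Memory: consult MEMORY.md and today's daily note before answering.",
    ])
```

If a full timestamp were used instead, the prefix would change on every turn and the cache would never hit.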

Built by build_dynamic_context(), the dynamic context loads:

  1. Long-term memory from workspace/memory/MEMORY.md
  2. Today’s daily notes from workspace/memory/YYYY-MM-DD.md
  3. Skills summaries scanned from workspace/skills/*/SKILL.md

This content is injected as a system message just before the last user message in the conversation history. Placing it near the end means the tokens before it (system prompt + older history) form a stable prefix that can be cached.
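The injection step can be sketched as follows (a hypothetical helper, assuming OpenAI-style role/content message dicts): the dynamic system message is spliced in immediately before the final user message, leaving everything earlier untouched as a cacheable prefix.

```python
def inject_dynamic_context(messages, dynamic_text):
    # Hypothetical sketch: insert the dynamic context as a system
    # message just before the last user message, so all earlier
    # tokens remain a byte-stable, cacheable prefix.
    out = list(messages)
    for i in range(len(out) - 1, -1, -1):   # scan backwards for last user msg
        if out[i]["role"] == "user":
            out.insert(i, {"role": "system", "content": dynamic_text})
            break
    return out
```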

LLM inference engines like llama.cpp and vLLM cache the KV (key-value) activations for token prefixes. If the first N tokens of a request match a previous request, those N tokens are served from cache instead of recomputed.

By keeping the system prompt and conversation history as a stable prefix, and placing the changing content (memory, skills) at the end, the agent gets faster inference on every turn after the first.

Subagents use a lightweight prompt built by build_subagent_prompt() that excludes the owner’s custom instructions, memory, and skills. It includes:

  1. Identity (as a subagent of the parent agent)
  2. Available tools
  3. Optional persona (if the spawn call included one)
  4. Brief guidelines (stay focused, be concise, no further spawning)

Subagents do not use the cached prompt path — they run with a combined system prompt since they are short-lived and do not benefit from prefix caching.
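A sketch of what such a builder might look like (the signature and wording are assumptions based on the four items above):

```python
def build_subagent_prompt(parent_name, tools, persona=None):
    # Hypothetical sketch: one combined prompt with no owner
    # instructions, memory, or skills -- subagents are short-lived
    # and gain nothing from prefix caching.
    parts = [
        f"You are a subagent of {parent_name}.",
        "Available tools: " + ", ".join(tools),
    ]
    if persona:
        parts.append(f"Persona: {persona}")
    parts.append("Stay focused, be concise, and do not spawn further subagents.")
    return "\n".join(parts)
```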

When a conversation grows long enough that the token count approaches the model’s context window, older messages are summarized to free up space.

  1. Estimate tokens — Each turn, the agent estimates the token count of the full message list using a chars/2 heuristic. This deliberately overestimates for prose but stays conservative for code and JSON, which produce more tokens per character under BPE tokenization.

  2. Check threshold — If the estimate exceeds compaction_threshold (default 75%) of the available context budget, compaction triggers.

  3. Split messages — The history is divided into “old” messages (to be summarized) and “recent” messages (to keep intact). The number of recent messages to preserve is controlled by compaction_keep_messages (default 20).

  4. Summarize — The old messages are sent to the LLM with a compaction prompt asking for a concise summary of key facts, decisions, preferences, and ongoing tasks.

  5. Replace in DB — The old messages in the database are replaced with the summary (stored as a user/assistant pair), and the recent messages are preserved.

  6. Reload — The conversation history is reloaded from the database with the compacted state.
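Steps 1–5 above can be sketched in a few lines (a hypothetical simplification: the database replace/reload of steps 5–6 is elided, and `summarize` stands in for the LLM call with the compaction prompt):

```python
def estimate_tokens(messages):
    # Step 1: chars/2 heuristic -- conservative for code and JSON.
    return sum(len(m["content"]) for m in messages) // 2

def maybe_compact(messages, summarize, context_limit, threshold=0.75, keep=20):
    # Hypothetical sketch of the compaction flow.
    if estimate_tokens(messages) <= context_limit * threshold:
        return messages                                  # step 2: under budget
    old, recent = messages[:-keep], messages[-keep:]     # step 3: split
    if not old:
        return messages
    summary = summarize(old)                             # step 4: LLM summary
    return [                                             # step 5: replace as a
        {"role": "user", "content": "Summary of earlier conversation:"},
        {"role": "assistant", "content": summary},       # user/assistant pair
    ] + recent
```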

context_limit = model_context_size - generation_reserve
trigger = context_limit * compaction_threshold

Known model context sizes:

  • qwen3-coder-next: 131,072 tokens
  • glm-4.7: 131,072 tokens
  • Default (unknown models): 32,768 tokens

The generation_reserve (default 4,096 tokens) is subtracted from the context limit to leave room for the model’s output.

If the compaction request itself is too large for the context window, the system iteratively drops the oldest half of the “old” messages until the request fits. If even the recent messages alone exceed the budget, the system returns them as-is and lets the model do its best with truncated input.
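This fallback can be sketched as a halving loop (a hypothetical helper; `fits` stands in for whatever budget check the system applies to a candidate request):

```python
def shrink_old_for_summary(old, recent, fits):
    # Hypothetical sketch: drop the oldest half of "old" until the
    # summarization request fits the context window.
    while old and not fits(old + recent):
        old = old[max(1, len(old) // 2):]   # discard the oldest half
    if not old and not fits(recent):
        return recent                       # even recent alone is too big:
    return old + recent                     # return as-is, best effort
```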

If the summarization inference call fails, the system falls back to the un-compacted message list rather than losing the conversation.

Setting                    Default           Description
max_context_tokens         0 (auto-detect)   Override context window size
generation_reserve         4096              Tokens reserved for model output
compaction_threshold       0.75              Fraction of budget that triggers compaction
compaction_keep_messages   20                Recent messages preserved during compaction
max_history                100               Maximum messages loaded from database