Context & Compaction

LiberClaw agents use a two-part system prompt design that optimizes for inference performance while keeping memory and skills up to date on every turn. When conversations grow long, a compaction system summarizes older messages to stay within the model’s context window.

The system prompt is split into a static prefix and dynamic suffix:

┌────────────────────────────────────┐
│ System message (static prefix)     │ ← Cached in KV cache
│  - Identity + date                 │
│  - Available tools                 │
│  - User instructions               │
│  - Memory system instructions      │
├────────────────────────────────────┤
│ Conversation history               │ ← Cached (grows each turn)
├────────────────────────────────────┤
│ Dynamic context (injected)         │ ← Changes each turn
│  - Memory (MEMORY.md + daily)      │
│  - Skills summaries                │
├────────────────────────────────────┤
│ Latest user message                │
└────────────────────────────────────┘

Built by build_static_system_prompt(), the static prefix contains:

  1. Identity block — Agent name, current date (date-only, not time), workspace path, and available tool names
  2. User instructions — The custom system prompt set by the agent’s owner
  3. Memory system instructions — Tells the agent how to use MEMORY.md and daily notes

The date is formatted as YYYY-MM-DD (not a timestamp) so the system prompt stays identical across turns within the same day. This enables prefix caching in llama.cpp / vLLM.
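A minimal sketch of this idea (the function name and prompt layout are illustrative assumptions, not the actual LiberClaw API): formatting the date with `date.today().isoformat()` yields `YYYY-MM-DD`, so the assembled prompt is byte-identical for every turn within the same day.

```python
from datetime import date

def build_static_system_prompt(agent_name, workspace, tools, user_instructions):
    # Hypothetical sketch. Date-only (YYYY-MM-DD), never a timestamp,
    # so the prompt stays byte-identical all day and the KV-cache
    # prefix remains valid across turns.
    today = date.today().isoformat()
    return "\n\n".join([
        f"You are {agent_name}. Today is {today}. Workspace: {workspace}.",
        "Available tools: " + ", ".join(tools),
        user_instructions,
        "Memory: consult MEMORY.md and today's daily note before answering.",
    ])
```

If a full timestamp were used instead, the prefix would change on every turn and the cache would never hit.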

Built by build_dynamic_context(), the dynamic context loads:

  1. Long-term memory from workspace/memory/MEMORY.md
  2. Today’s daily notes from workspace/memory/YYYY-MM-DD.md
  3. Skills summaries scanned from workspace/skills/*/SKILL.md

This content is injected as a system message just before the last user message in the conversation history. Placing it near the end means the tokens before it (system prompt + older history) form a stable prefix that can be cached.
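The injection step can be sketched as follows (a hypothetical helper, assuming OpenAI-style role/content message dicts): the dynamic system message is spliced in immediately before the final user message, leaving everything earlier untouched as a cacheable prefix.

```python
def inject_dynamic_context(messages, dynamic_text):
    # Hypothetical sketch: insert the dynamic context as a system
    # message just before the last user message, so all earlier
    # tokens remain a byte-stable, cacheable prefix.
    out = list(messages)
    for i in range(len(out) - 1, -1, -1):   # scan backwards for last user msg
        if out[i]["role"] == "user":
            out.insert(i, {"role": "system", "content": dynamic_text})
            break
    return out
```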

LLM inference engines like llama.cpp and vLLM cache the KV (key-value) activations for token prefixes. If the first N tokens of a request match a previous request, those N tokens are served from cache instead of recomputed.

By keeping the system prompt and conversation history as a stable prefix, and placing the changing content (memory, skills) at the end, the agent gets faster inference on every turn after the first.

Subagents use a lightweight prompt built by build_subagent_prompt() that excludes the owner’s custom instructions, memory, and skills. It includes:

  1. Identity (as a subagent of the parent agent)
  2. Available tools
  3. Optional persona (if the spawn call included one)
  4. Brief guidelines (stay focused, be concise, no further spawning)

Subagents do not use the cached prompt path — they run with a combined system prompt since they are short-lived and do not benefit from prefix caching.
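A sketch of what such a builder might look like (the signature and wording are assumptions based on the four items above):

```python
def build_subagent_prompt(parent_name, tools, persona=None):
    # Hypothetical sketch: one combined prompt with no owner
    # instructions, memory, or skills -- subagents are short-lived
    # and gain nothing from prefix caching.
    parts = [
        f"You are a subagent of {parent_name}.",
        "Available tools: " + ", ".join(tools),
    ]
    if persona:
        parts.append(f"Persona: {persona}")
    parts.append("Stay focused, be concise, and do not spawn further subagents.")
    return "\n".join(parts)
```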

When a conversation grows long enough that the token count approaches the model’s context window, older messages are summarized to free up space.

  1. Estimate tokens — Each turn, the agent estimates the token count of the full message list using a chars/2 heuristic. This deliberately overestimates for prose but stays conservative for code and JSON, which produce more tokens per character under BPE tokenization.

  2. Check threshold — If the estimate exceeds compaction_threshold (default 75%) of the available context budget, compaction triggers.

  3. Split messages — The history is divided into “old” messages (to be summarized) and “recent” messages (to keep intact). The number of recent messages to preserve is controlled by compaction_keep_messages (default 20).

  4. Summarize — The old messages are sent to the LLM with a compaction prompt asking for a concise summary of key facts, decisions, preferences, and ongoing tasks.

  5. Replace in DB — The old messages in the database are replaced with the summary (stored as a user/assistant pair), and the recent messages are preserved.

  6. Reload — The conversation history is reloaded from the database with the compacted state.
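Steps 1–5 above can be sketched in a few lines (a hypothetical simplification: the database replace/reload of steps 5–6 is elided, and `summarize` stands in for the LLM call with the compaction prompt):

```python
def estimate_tokens(messages):
    # Step 1: chars/2 heuristic -- conservative for code and JSON.
    return sum(len(m["content"]) for m in messages) // 2

def maybe_compact(messages, summarize, context_limit, threshold=0.75, keep=20):
    # Hypothetical sketch of the compaction flow.
    if estimate_tokens(messages) <= context_limit * threshold:
        return messages                                  # step 2: under budget
    old, recent = messages[:-keep], messages[-keep:]     # step 3: split
    if not old:
        return messages
    summary = summarize(old)                             # step 4: LLM summary
    return [                                             # step 5: replace as a
        {"role": "user", "content": "Summary of earlier conversation:"},
        {"role": "assistant", "content": summary},       # user/assistant pair
    ] + recent
```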

context_limit = model_context_size - generation_reserve
trigger = context_limit * compaction_threshold

Known model context sizes:

  • qwen3-coder-next: 131,072 tokens
  • glm-4.7: 131,072 tokens
  • Default (unknown models): 32,768 tokens

The generation_reserve (default 4,096 tokens) is subtracted from the context limit to leave room for the model’s output.

If the compaction request itself is too large for the context window, the system iteratively drops the oldest half of the “old” messages until the request fits. If even the recent messages alone exceed the budget, the system returns them as-is and lets the model do its best with truncated input.
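This fallback can be sketched as a halving loop (a hypothetical helper; `fits` stands in for whatever budget check the system applies to a candidate request):

```python
def shrink_old_for_summary(old, recent, fits):
    # Hypothetical sketch: drop the oldest half of "old" until the
    # summarization request fits the context window.
    while old and not fits(old + recent):
        old = old[max(1, len(old) // 2):]   # discard the oldest half
    if not old and not fits(recent):
        return recent                       # even recent alone is too big:
    return old + recent                     # return as-is, best effort
```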

If the summarization inference call fails, the system falls back to the un-compacted message list rather than losing the conversation.

Setting                    Default           Description
max_context_tokens         0 (auto-detect)   Override context window size
generation_reserve         4096              Tokens reserved for model output
compaction_threshold       0.75              Fraction of budget that triggers compaction
compaction_keep_messages   20                Recent messages preserved during compaction
max_history                100               Maximum messages loaded from database