Ultraworked with [Sisyphus](https://github.com/code-yeongyu/oh-my-opencode) Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>
Object-Addressed Memory Manager for OpenCode
Project Codename: Mnemosyne
A transparent proxy that implements demand-paged, object-addressed memory management for LLM context windows. Extends Pichay's demand paging with semantic objects, multi-fidelity compression, declared losses, queryable backing store, and goal-aware retrieval via a helper LLM.
1. Problem Statement
LLM coding agents (opencode, Claude Code) suffer from context window bloat:
- 21.8% of input tokens are structural waste (Pichay, 2026): unused tool schemas (11%), stale tool results reprocessed at 84.4x amplification (8.7%), duplicated content (2.2%)
- Context drift silently degrades reasoning quality before hitting hard token limits
- Binary eviction (resident vs evicted) is too coarse -- a 200-byte tombstone can't answer questions about 8KB of evicted code
- No semantic awareness -- eviction is by file path, not by conceptual relevance to the current task
2. Design Principles
- The context window is L1 cache, not memory. Everything lives in the backing store; context is a curated working set.
- Eviction is cooperative. The model participates in eviction decisions via cleanup tags and phantom tools. It has an incentive to do so: cleaner context = better attention quality.
- Compression is authored, not algorithmic. The model (or a helper LLM) writes summaries with declared losses. It knows what matters.
- The backing store is queryable. The model can ask questions of evicted content without materializing it. Micro-faults replace full page-ins.
- Objects, not blocks. The unit of memory is a semantic object (a design decision, a debugging session, a file understanding) -- not a fixed-size page keyed by file path.
- Transparency. The proxy is invisible to the client and the inference API. No changes to opencode or the model required.
3. System Architecture
opencode (client)
       |
       | HTTP (Messages API)
       v
+---------------------+
|  MNEMOSYNE PROXY    |
|                     |
|  +---------------+  |
|  | Context       |  |     +--------------------+
|  | Assembler     |--|---->| Helper LLM         |
|  |               |  |     | (Haiku / local)    |
|  +---------------+  |     |                    |
|  | Fidelity      |  |     | - Summarization    |
|  | Manager       |  |     | - Loss declaration |
|  +---------------+  |     | - Micro-fault QA   |
|  | Object        |  |     | - Segmentation     |
|  | Segmenter     |  |     +--------------------+
|  +---------------+  |
|  | Fault         |  |     +--------------------+
|  | Detector      |  |     | Object Store       |
|  +---------------+  |     | (PostgreSQL +      |
|  | Phantom Tool  |--|---->| pgvector)          |
|  | Handler       |  |     |                    |
|  +---------------+  |     | - Full content     |
|  | Cleanup Tag   |  |     | - Multi-fidelity   |
|  | Parser        |  |     |   summaries        |
|  +---------------+  |     | - Embeddings       |
|  | Pressure      |  |     | - Relationships    |
|  | Monitor       |  |     | - Fault history    |
|  +---------------+  |     +--------------------+
+---------------------+
       |
       | HTTP (Messages API, modified)
       v
Inference API (Anthropic)
Component Responsibilities
| Component | Role |
|---|---|
| Context Assembler | Builds the modified message array for each API call. Selects which objects are resident at which fidelity. Injects phantom tool definitions. |
| Fidelity Manager | Tracks current fidelity level of each object. Degrades fidelity under pressure. Upgrades on access. Manages the L0-L3 fidelity ladder. |
| Object Segmenter | Splits the conversation stream into semantic objects. Runs after each turn. Uses embedding coherence + structural signals (tool boundaries, topic shifts). |
| Fault Detector | Detects page faults (model re-requests evicted content). Records fault history for pinning decisions. Detects micro-fault queries. |
| Phantom Tool Handler | Intercepts phantom tool calls from the model's streaming response before they reach the client. Handles memory_release, memory_query, memory_restore. |
| Cleanup Tag Parser | Parses structured directives from the model's text output: drop, summarize, anchor, collapse. Extended with declare_losses. |
| Pressure Monitor | Tracks token consumption per request. Determines pressure zone (Normal/Caution/Warning/Critical/Emergency). Triggers fidelity degradation. |
| Helper LLM | Cheap model (Haiku, GPT-4o-mini, or local qwen2.5) that authors summaries, declares losses, answers micro-fault queries, and assists with object segmentation. |
| Object Store | PostgreSQL + pgvector database holding all semantic objects at all fidelity levels, with embeddings, metadata, relationships, and fault history. |
4. Memory Hierarchy
+----------------------------------------------------------------+
| L0: Full Content (in context window) |
| - Current working set of semantic objects |
| - Full text, no compression |
| - Capacity: ~60% of context window budget |
+----------------------------------------------------------------+
| L1: Detailed Summary (in context window) |
| - Model-authored summary, ~30% of original size |
| - Preserves: file paths, function names, decisions, errors |
| - Declared losses: specific values, exact code, edge cases |
| - Capacity: ~20% of context window budget |
+----------------------------------------------------------------+
| L2: Compact Summary (in context window) |
| - Model-authored headline, ~5% of original size |
| - Preserves: what was done, what was decided, key files |
| - Declared losses: implementation details, reasoning |
| - Capacity: ~15% of context window budget |
+----------------------------------------------------------------+
| L3: Metadata Stub (in context window) |
| - One-line description + type + timestamp |
| - ~50-100 tokens per object |
| - Capacity: ~5% of context window budget |
+----------------------------------------------------------------+
| L4: Evicted (not in context, in backing store only) |
| - Not present in context at all |
| - Queryable via memory_query phantom tool |
| - Restorable via memory_restore phantom tool |
+----------------------------------------------------------------+
| BACKING STORE (PostgreSQL + pgvector) |
| - All objects at all fidelity levels, always |
| - Full content preserved indefinitely |
| - Embeddings for semantic search |
| - Cross-session persistence (future) |
+----------------------------------------------------------------+
Fidelity Transitions
Pressure rising (token count increasing):
L0 (full) --[summarize]--> L1 (detailed) --[compress]--> L2 (compact) --[stub]--> L3 --[evict]--> L4
Access / fault (model needs content):
L4 --[memory_restore]--> L0 (full page-in)
L4 --[memory_query]--> (helper LLM answers, object stays at L4)
L3 --[model references]--> L1 or L0 (upgrade on access)
L2 --[model references]--> L1 (upgrade on access)
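The transitions above can be sketched as a small state machine. This is a minimal illustration, not the proxy's actual code: the `Fidelity` enum, the function names, and the `want_full` flag (used to choose between L1 and L0 when an L3 stub is referenced) are assumptions made for the example.

```python
from enum import IntEnum


class Fidelity(IntEnum):
    """Fidelity levels from the memory hierarchy; a higher value means less detail."""
    L0_FULL = 0
    L1_DETAILED = 1
    L2_COMPACT = 2
    L3_STUB = 3
    L4_EVICTED = 4


def degrade(level: Fidelity) -> Fidelity:
    """Move one step down the ladder under pressure; L4 is the floor."""
    return Fidelity(min(int(level) + 1, int(Fidelity.L4_EVICTED)))


def upgrade_on_reference(level: Fidelity, want_full: bool = False) -> Fidelity:
    """Upgrade a resident object when the model references it.

    L3 goes to L1 (or L0 when want_full is set, a hypothetical switch),
    L2 goes to L1. L4 objects are not resident, so they change only
    through memory_restore.
    """
    if level == Fidelity.L3_STUB:
        return Fidelity.L0_FULL if want_full else Fidelity.L1_DETAILED
    if level == Fidelity.L2_COMPACT:
        return Fidelity.L1_DETAILED
    return level


def restore(level: Fidelity) -> Fidelity:
    """memory_restore: full page-in to L0 regardless of the current level."""
    return Fidelity.L0_FULL
```

Note that memory_query is deliberately absent: it answers from the backing store and leaves the object's level unchanged.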
Pressure Zones (token thresholds, configurable)
| Zone | Token % | Action |
|---|---|---|
| Normal | < 50% | Observe only. No fidelity changes. |
| Caution | 50-70% | Degrade oldest L0 objects to L1. |
| Warning | 70-85% | Degrade L0 to L1, L1 to L2, oldest L2 to L3. |
| Critical | 85-95% | Aggressive degradation. L2+ to L3. Evict L3 to L4. |
| Emergency | > 95% | Force-evict everything except last 2 user turns + system prompt. |
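A minimal sketch of how the Pressure Monitor might map token usage to a zone. The function name and the handling of exact boundary values (for example, 50% counts as Caution and 95% as Critical) are assumptions; the table does not specify which side the boundaries fall on.

```python
def pressure_zone(tokens_used: int, context_limit: int) -> str:
    """Classify a request into a pressure zone using the default thresholds."""
    if context_limit <= 0:
        raise ValueError("context_limit must be positive")
    pct = 100.0 * tokens_used / context_limit
    if pct < 50:
        return "normal"      # observe only
    if pct < 70:
        return "caution"     # degrade oldest L0 objects to L1
    if pct < 85:
        return "warning"     # L0 -> L1, L1 -> L2, oldest L2 -> L3
    if pct <= 95:
        return "critical"    # aggressive degradation, evict L3 to L4
    return "emergency"       # keep only last 2 user turns + system prompt
```

For a 200,000-token window, 120,000 tokens in use (60%) falls in the Caution zone.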
5. Semantic Objects
5.1 Object Types
| Type | Description | Example |
|---|---|---|
| conversation_phase | A coherent stretch of dialogue about one topic | "Discussed auth architecture for 8 turns" |
| design_decision | An explicit decision with rationale | "Chose JWT over sessions because..." |
| debugging_session | A sequence of diagnose-hypothesize-fix-verify | "Tracked down the race condition in..." |
| file_context | A file read and understanding | "Read src/auth/middleware.ts (200 lines)" |
| tool_result | Output from a tool call (grep, bash, etc.) | "grep found 15 matches for handleAuth" |
| plan | A structured plan or todo list | "Implementation plan: 5 steps..." |
| error_context | An error and its diagnosis | "TypeError at line 42, caused by..." |
| external_reference | Docs, API reference, examples pulled from outside | "React docs on useEffect cleanup" |
5.2 Object Segmentation Algorithm
Segmentation runs after each model turn. Two strategies, selected by context:
Strategy A: Structural Segmentation (fast, rule-based)
- Tool call boundaries are natural object boundaries
- Each Read result = file_context object
- Each Bash/Grep result = tool_result object
- User message + assistant response = potential conversation_phase boundary
- Heuristic: if topic similarity (embedding cosine) between consecutive turns drops below threshold (0.7), start a new conversation_phase
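Strategy A can be sketched in a few lines, assuming each turn already has an embedding vector. The names `TOOL_OBJECT_TYPE`, `object_type_for_tool`, and `phase_boundaries` are illustrative, and treating any unlisted tool as a tool_result is an assumption rather than something the design states.

```python
import math
from typing import List, Sequence

# Tool name -> object type, following the structural rules above.
TOOL_OBJECT_TYPE = {"Read": "file_context", "Bash": "tool_result", "Grep": "tool_result"}


def object_type_for_tool(tool_name: str) -> str:
    """Map a tool call to an object type (unlisted tools default to tool_result)."""
    return TOOL_OBJECT_TYPE.get(tool_name, "tool_result")


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def phase_boundaries(turn_embeddings: Sequence[Sequence[float]],
                     threshold: float = 0.7) -> List[int]:
    """Return the indices of turns that start a new conversation_phase."""
    starts = [0] if turn_embeddings else []
    for i in range(1, len(turn_embeddings)):
        # A drop below the threshold between consecutive turns marks a topic shift.
        if cosine(turn_embeddings[i - 1], turn_embeddings[i]) < threshold:
            starts.append(i)
    return starts
```

With three turns embedded as `[1, 0]`, `[0.9, 0.1]`, and `[0, 1]`, the first two stay in one phase and the third starts a new one.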
Strategy B: Semantic Segmentation (slower, higher quality)
- Based on xMemory's sparsity-semantics objective (arXiv:2602.02007)
- Embed all messages with a lightweight model (all-MiniLM-L6-v2, ~20ms)
- Cluster by coherence: maximize inter-object semantic diversity, minimize intra-object redundancy
- Build hierarchy: messages -> episodes -> themes
- Use for long sessions (>50 turns) where structural boundaries are insufficient
Default: Strategy A for turns 1-50; Strategy B takes over once a session exceeds 50 turns.
5.3 Object Relationships
Objects have typed relationships stored in the backing store:
parent_of: conversation_phase -> design_decision (decision made during phase)
caused_by: error_context -> file_context (error was in this file)
references: debugging_session -> file_context (files examined during debug)
supersedes: file_context(v2) -> file_context(v1) (file re-read after edit)
depends_on: plan -> design_decision (plan relies on this decision)
Relationships are used by the Context Assembler: when upgrading an object's fidelity,
also consider upgrading its depends_on and references relationships.
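A minimal sketch of that lookup, assuming relationships are available as `(relation, source_id, target_id)` tuples. The tuple layout, the function name, and the choice to follow only one hop are assumptions for illustration.

```python
from typing import List, Tuple

Edge = Tuple[str, str, str]  # (relation, source_id, target_id)

# Only these relation types trigger a co-upgrade, per the rule above.
FOLLOWED_RELATIONS = {"depends_on", "references"}


def upgrade_candidates(object_id: str, edges: List[Edge]) -> List[str]:
    """List objects to consider upgrading together with object_id (one hop)."""
    seen = set()
    candidates = []
    for relation, source, target in edges:
        if source == object_id and relation in FOLLOWED_RELATIONS and target not in seen:
            seen.add(target)
            candidates.append(target)
    return candidates
```

The Context Assembler would still check the token budget before acting on these candidates; the function only proposes them.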
6. Multi-Fidelity Compression
6.1 Summary Generation
When an object degrades from L0 to L1, the Helper LLM generates a summary:
Input to Helper LLM:
You are a context compression engine for a coding agent. Summarize the following
content while preserving maximum utility for future reference.
CONTENT TYPE: {object.type}
CONTENT:
{object.content_full}
INSTRUCTIONS:
1. Write a detailed summary (~30% of original length)
2. MUST preserve: file paths, function names, variable names, library names,
error messages, decision rationale, specific values that may be referenced later
3. List DECLARED LOSSES: specific information you omitted that someone might need.
Be precise -- "specific error codes" not "some details"
4. List CAN_ANSWER: categories of questions this summary can answer without
needing the original content
OUTPUT FORMAT (JSON):
{
"summary": "...",
"losses": ["exact error code for token expiry", "rate limit threshold values", ...],
"can_answer": ["auth approach used", "middleware chain order", "why JWT over sessions", ...],
"key_entities": ["src/auth/middleware.ts", "handleAuth()", "jsonwebtoken", ...]
}
L1 -> L2 compression uses the L1 summary as input (not L0), with instruction to compress to ~5% of original. Additional losses are accumulated.
L2 -> L3 stub is generated from L2:
[debugging_session | 2026-03-13 14:30 | Fixed race condition in auth token refresh
by adding mutex lock in src/auth/refresh.ts | 12 related objects]
6.2 Declared Losses Schema
interface DeclaredLosses {
// What was dropped from this fidelity level
dropped: string[]
// What questions this fidelity level CAN still answer
can_answer: string[]
// Hint for when to fault (what would require the original)
fault_when: string[]
// Key entities preserved (for relationship tracking)
key_entities: string[]
}
6.3 Loss Accumulation
As objects degrade through fidelity levels, losses accumulate:
L0: (full content, no losses)
L1: losses = ["exact error codes", "line-by-line implementation"]
L2: losses = L1.losses + ["function signatures", "reasoning chain"]
L3: losses = L2.losses + ["what was decided", "which files involved"]
(at this point, only the one-line description remains)
The accumulated fault_when list tells the model exactly when it needs to fault:
"If you need exact error codes, specific function signatures, or the reasoning
behind the auth decision, restore this object."
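Loss accumulation amounts to an order-preserving union plus a rendered hint. The sketch below assumes losses are plain strings; `accumulate_losses` and `fault_hint` are hypothetical helper names, and the hint wording simply mirrors the example sentence above.

```python
from typing import List


def accumulate_losses(previous: List[str], newly_dropped: List[str]) -> List[str]:
    """Append the losses of a new fidelity level, dropping exact duplicates."""
    return list(dict.fromkeys(previous + newly_dropped))


def fault_hint(fault_when: List[str]) -> str:
    """Render the accumulated fault_when entries as one instruction for the model."""
    if not fault_when:
        return ""
    if len(fault_when) == 1:
        joined = fault_when[0]
    elif len(fault_when) == 2:
        joined = " or ".join(fault_when)
    else:
        joined = ", ".join(fault_when[:-1]) + ", or " + fault_when[-1]
    return f"If you need {joined}, restore this object."


# The L1 -> L2 step from the example above.
l1_losses = accumulate_losses([], ["exact error codes", "line-by-line implementation"])
l2_losses = accumulate_losses(l1_losses, ["function signatures", "reasoning chain"])
```

Deduplication matters because the L1 -> L2 helper call sees only the L1 summary and may re-declare a loss that was already recorded.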
7. Queryable Backing Store
7.1 The memory_query Phantom Tool
Instead of restoring a full object (8KB+) to answer a simple question, the model
calls memory_query:
{
"tool": "memory_query",
"input": {
"question": "What error code does the auth middleware return for expired tokens?",
"scope": "auth-related objects",
"max_tokens": 200
}
}
Proxy handling:
- Proxy intercepts memory_query from the model's streaming response
- Proxy queries the Object Store:
  - Embed the question
  - Find top-k relevant objects by cosine similarity (even evicted ones)
  - Retrieve their L0 (full content) from the backing store
- Proxy sends question + retrieved full content to the Helper LLM
- Helper LLM returns a targeted answer (~50-200 tokens)
- Proxy injects the answer as a synthetic tool result into the model's context
- The evicted objects stay evicted -- no fidelity change
Token savings per micro-fault:
- Traditional fault (Pichay): restore full page, ~4,000-8,000 tokens
- Micro-fault: inject targeted answer, ~50-200 tokens
- Savings: 95-99% per fault
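The handling steps can be sketched with the embedder, the store search, and the helper LLM passed in as plain callables. Everything here is illustrative: the function names, the shape of the returned dict, and the toy stand-ins are assumptions, not the proxy's real interfaces.

```python
from typing import Callable, Dict, List, Sequence

Vector = Sequence[float]
StoredObject = Dict[str, str]


def answer_micro_fault(
    question: str,
    embed: Callable[[str], Vector],
    search: Callable[[Vector, int], List[StoredObject]],
    ask_helper: Callable[[str, str, int], str],
    top_k: int = 5,
    max_tokens: int = 200,
) -> Dict[str, object]:
    """Answer a memory_query without changing any object's fidelity."""
    query_vec = embed(question)                      # embed the question
    objects = search(query_vec, top_k)               # top-k objects, evicted ones included
    full_text = "\n\n".join(obj["content_full"] for obj in objects)  # their L0 content
    answer = ask_helper(question, full_text, max_tokens)             # targeted answer
    # Hypothetical shape for the synthetic tool result injected into context.
    return {
        "type": "tool_result",
        "content": answer,
        "source_object_ids": [obj["id"] for obj in objects],
    }


def fault_savings(full_page_tokens: int, micro_fault_tokens: int) -> float:
    """Fraction of tokens saved by a micro-fault compared with a full page-in."""
    return 1.0 - micro_fault_tokens / full_page_tokens


# Toy stand-ins so the sketch runs without a database or an LLM.
def _toy_embed(text: str) -> Vector:
    return [float(len(text))]


def _toy_search(vec: Vector, k: int) -> List[StoredObject]:
    return [{"id": "obj_abc123", "content_full": "expired tokens return 401"}][:k]


def _toy_helper(question: str, context: str, max_tokens: int) -> str:
    return context[:max_tokens]


result = answer_micro_fault("What error code for expired tokens?",
                            _toy_embed, _toy_search, _toy_helper)
```

The savings range follows directly from the figures above: a 200-token answer in place of a 4,000-token page saves 95%, and a 50-token answer in place of an 8,000-token page saves just over 99%.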
7.2 Semantic Search for Micro-Faults
The backing store supports multiple retrieval strategies:
-- Parameters: $1 = session id, $2 = query embedding, $3 = raw query text
-- <=> is pgvector's cosine-distance operator (<-> would be Euclidean distance)
-- Vector similarity (primary)
SELECT * FROM semantic_objects
WHERE session_id = $1
ORDER BY embedding <=> $2
LIMIT 5;
-- Hybrid: vector + keyword (for exact matches)
SELECT * FROM semantic_objects
WHERE session_id = $1
AND (
to_tsvector('english', content_full) @@ plainto_tsquery('english', $3)
OR (embedding <=> $2) < 0.3
)
ORDER BY embedding <=> $2
LIMIT 5;
-- Metadata-filtered (for typed queries)
SELECT * FROM semantic_objects
WHERE session_id = $1
AND object_type = 'error_context'
AND 'src/auth' = ANY(tags)
ORDER BY created_at DESC
LIMIT 3;
7.3 memory_restore -- Full Page-In
When the model needs full content (editing a file, reviewing exact code), it calls
memory_restore which does a traditional page-in:
{
"tool": "memory_restore",
"input": {
"object_id": "obj_abc123",
"reason": "Need to edit the auth middleware"
}
}
This upgrades the object to L0, potentially triggering eviction of other objects under pressure.
7.4 memory_release -- Cooperative Eviction
The model can voluntarily release objects it no longer needs:
{
"tool": "memory_release",
"input": {
"object_ids": ["obj_abc123", "obj_def456"],
"reason": "Done with auth implementation, moving to tests"
}
}
This immediately degrades the objects to L3 (or L4 under pressure), freeing context budget for new work.
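A sketch of how the Phantom Tool Handler might separate phantom calls from real ones and apply a release. It assumes tool calls arrive as dicts with a "tool" key, as in the JSON examples above, and that fidelity levels are stored as integers 0-4; both function names are hypothetical.

```python
from typing import Dict, List, Tuple

PHANTOM_TOOLS = {"memory_release", "memory_query", "memory_restore"}


def split_tool_calls(tool_calls: List[dict]) -> Tuple[List[dict], List[dict]]:
    """Separate phantom calls (handled by the proxy) from calls passed to the client."""
    phantom = [c for c in tool_calls if c.get("tool") in PHANTOM_TOOLS]
    passthrough = [c for c in tool_calls if c.get("tool") not in PHANTOM_TOOLS]
    return phantom, passthrough


def apply_release(levels: Dict[str, int], object_ids: List[str],
                  under_pressure: bool = False) -> Dict[str, int]:
    """Degrade released objects to L3, or to L4 when under pressure."""
    target = 4 if under_pressure else 3
    updated = dict(levels)
    for object_id in object_ids:
        if object_id in updated:
            # max() ensures a release never raises an object's fidelity.
            updated[object_id] = max(updated[object_id], target)
    return updated
```

Unknown object ids are ignored rather than treated as errors, so a stale id in the model's request does not abort the whole release.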
8. Goal-Aware Retrieval
8.1 The Problem
When the model starts a new sub-task (e.g., "now write tests for auth"), the objects currently in context may be irrelevant (e.g., old debugging sessions for a different module). Goal-aware retrieval proactively swaps context based on the current task.
8.2 Detection: When Has the Goal Changed?
The Pressure Monitor also tracks goal transitions:
- Compare the current user message embedding with the previous user message embedding
- If cosine similarity < 0.5 (topic shift), trigger goal-aware retrieval
8.3 Goal-Aware Context Assembly
On goal transition:
- Helper LLM classifies the new goal (~100ms):
    Given this user message: "{message}"
    What is the user's current goal? What context would be most relevant?
    Return: { "goal": "...", "relevant_types": [...], "relevant_tags": [...] }
- Query the Object Store for relevant objects ($2 = goal embedding; <=> is pgvector's cosine-distance operator):
    SELECT * FROM semantic_objects
    WHERE session_id = $1
    ORDER BY embedding <=> $2
    LIMIT 20;
- Rank objects by relevance to the new goal (helper LLM or embedding similarity)
- Assemble the new context window:
  - Top-ranked objects at L0 or L1 (depending on budget)
  - Previously active but now irrelevant objects degraded to L2 or L3
  - Always preserve: system prompt, last 2 user turns, any pinned objects
- Inject into the next API call via experimental.chat.messages.transform
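The assembly step can be illustrated with a greedy budget pass over the ranked objects. This is one possible policy, not the design's stated algorithm: `plan_fidelity`, the per-object `(l0_tokens, l1_tokens)` cost tuples, and the decision to leave L2 costs out of the budget are all simplifying assumptions.

```python
from typing import Dict, List, Tuple


def plan_fidelity(ranked_ids: List[str],
                  token_cost: Dict[str, Tuple[int, int]],
                  budget: int) -> Dict[str, int]:
    """Assign a fidelity level (0, 1, or 2) to each object, most relevant first.

    token_cost maps an object id to (tokens at L0, tokens at L1).
    """
    plan: Dict[str, int] = {}
    remaining = budget
    for object_id in ranked_ids:
        full_cost, detailed_cost = token_cost[object_id]
        if full_cost <= remaining:
            plan[object_id] = 0              # fits at full fidelity
            remaining -= full_cost
        elif detailed_cost <= remaining:
            plan[object_id] = 1              # fall back to the detailed summary
            remaining -= detailed_cost
        else:
            plan[object_id] = 2              # compact summary only
    return plan
```

With a budget of 1,300 tokens and three ranked objects costing (1000, 300), (800, 250), and (500, 150), the first lands at L0, the second at L1, and the third at L2. Pinned objects, the system prompt, and the last two user turns would be reserved from the budget before this pass runs.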
8.4 Predictive Loading
After goal classification, the helper LLM can predict what the model will need next:
Given goal "write tests for auth", the model will likely need:
- The auth middleware implementation (file_context for src/auth/middleware.ts)
- The existing test patterns (file_context for tests/*)
- The design decision about JWT (design_decision)
- NOT: the debugging session for the database migration
Pre-load predicted objects at L1, so they're available if the model needs them.
9. Admission Control (Write Path)
Not everything deserves to become a stored object. Based on A-MAC (arXiv:2603.04549):
9.1 Admission Score
S(m) = w_T * TypePrior(m) + w_N * Novelty(m) + w_U * Utility(m) + w_R * Recency(m)
| Factor | Signal | Weight (learned) |
|---|---|---|
| TypePrior | design_decision > error_context > file_context > tool_result | ~0.35 |
| Novelty | Cosine distance to nearest existing object > 0.15 | ~0.25 |
| Utility | Helper LLM scores future relevance (0-1) | ~0.25 |
| Recency | Exponential decay from creation time | ~0.15 |
Threshold: S(m) >= 0.4 to admit. Below threshold, content is kept only in the client's unmodified history (Pichay's backing store) but not indexed in the Object Store.
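A worked sketch of the score, using the weights from the table. Several details are assumptions: the numeric `TYPE_PRIOR` values (the table gives only the ordering), treating novelty as a 0/1 signal at the 0.15 distance cutoff, and the one-hour half-life used for recency decay.

```python
from typing import Dict

# Weights from the table above (approximate; learned in the source design).
WEIGHTS: Dict[str, float] = {"type": 0.35, "novelty": 0.25, "utility": 0.25, "recency": 0.15}

# Hypothetical numeric priors: only the ordering comes from the table.
TYPE_PRIOR: Dict[str, float] = {
    "design_decision": 1.0,
    "error_context": 0.75,
    "file_context": 0.5,
    "tool_result": 0.25,
}

NOVELTY_MIN_DISTANCE = 0.15
ADMIT_THRESHOLD = 0.4


def recency(age_seconds: float, half_life_seconds: float = 3600.0) -> float:
    """Exponential decay from creation time (the half-life is an assumed parameter)."""
    return 0.5 ** (max(age_seconds, 0.0) / half_life_seconds)


def admission_score(object_type: str, nearest_distance: float,
                    utility: float, age_seconds: float) -> float:
    """S(m) = w_T*TypePrior + w_N*Novelty + w_U*Utility + w_R*Recency."""
    novelty = 1.0 if nearest_distance > NOVELTY_MIN_DISTANCE else 0.0
    return (WEIGHTS["type"] * TYPE_PRIOR.get(object_type, 0.0)
            + WEIGHTS["novelty"] * novelty
            + WEIGHTS["utility"] * utility
            + WEIGHTS["recency"] * recency(age_seconds))


def admit(score: float) -> bool:
    """Admit the object to the Object Store when the score meets the threshold."""
    return score >= ADMIT_THRESHOLD
```

Under these assumptions, a fresh and novel design_decision with utility 0.8 scores 0.35 + 0.25 + 0.20 + 0.15 = 0.95 and is admitted, while a ten-hour-old, near-duplicate tool_result with utility 0.1 scores about 0.11 and is rejected.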
9.2 What Gets Rejected
- Routine tool results with no lasting value (e.g., ls output, git status)
- Duplicate file reads where content hasn't changed
- Conversation turns that are purely procedural ("Sure, I'll do that")
10. Entropy-Gated Faulting (L-RAG Integration)
Based on L-RAG (arXiv:2601.06551): use the model's own uncertainty as a fault signal.
10.1 Mechanism
During the model's generation (streaming response), monitor token-level entropy:
- Normal entropy (H < 1.5): model is confident, no intervention
- Elevated entropy (1.5 < H < 2.2): model may benefit from more context. Check if any L2/L3 objects match the current generation topic. If so, silently upgrade to L1.
- High entropy (H > 2.2): model is struggling. Trigger a micro-fault -- query the backing store with the current generation context, inject relevant information.
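A minimal sketch of the gate, assuming the inference backend returns the top-k log probabilities for each generated token. The thresholds are interpreted here in nats, which is an assumption: the design does not state units. The entropy is computed over the renormalized top-k candidates, so it is only an approximation of the full-vocabulary entropy.

```python
import math
from typing import Sequence


def token_entropy(logprobs: Sequence[float]) -> float:
    """Shannon entropy (nats) over the candidate tokens, renormalized to sum to 1."""
    probs = [math.exp(lp) for lp in logprobs]
    total = sum(probs)
    if total <= 0.0:
        return 0.0
    entropy = 0.0
    for p in probs:
        q = p / total
        if q > 0.0:
            entropy -= q * math.log(q)
    return entropy


def fault_action(entropy: float, elevated: float = 1.5, high: float = 2.2) -> str:
    """Map an entropy value to the intervention described above."""
    if entropy > high:
        return "micro_fault"      # query the backing store, inject information
    if entropy > elevated:
        return "upgrade_to_l1"    # silently upgrade matching L2/L3 objects
    return "none"
```

One consequence of working from top-k candidates: entropy over k options cannot exceed ln(k), so under the nats interpretation the high-entropy branch can fire only when at least 10 candidates per token are available (ln 10 is about 2.30).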
10.2 Complementarity with Declared Losses
Entropy-gated faulting handles the case where the model doesn't know what it doesn't know. Declared losses handle the case where it does. Together:
- Declared losses: "I know I need exact error codes, let me fault" -> model calls memory_query
- Entropy signal: model's generation becomes uncertain around error handling -> proxy automatically upgrades relevant objects
10.3 Implementation Complexity
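Entropy monitoring requires access to token logprobs in the streaming response.
The Anthropic Messages API does not document a logprobs option (stream_options.include_logprobs
is not among its parameters), so this feature depends on an inference backend that exposes
per-token log probabilities. This is a Phase 4e feature because of that dependency and the
complexity of real-time entropy calculation during streaming.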
11. Integration Points
11.1 With OpenCode (via oh-my-opencode)
The proxy can integrate at two levels:
Level 1: Pure Proxy (Phase 1-3)
- Standalone HTTP proxy between opencode and Anthropic API
- Zero changes to opencode or oh-my-opencode
- Configuration: set ANTHROPIC_BASE_URL to the proxy address
Level 2: Plugin Integration (Phase 4+)
- oh-my-opencode hook: experimental.chat.messages.transform for context assembly
- oh-my-opencode hook: experimental.session.compacting for custom compaction
- oh-my-opencode hook: tool.execute.after for object segmentation on tool results
- MCP server exposing memory_query, memory_stats, memory_objects tools (so the user can inspect memory state)
11.2 With Pichay (fork and extend)
Start from fsgeek/pichay (commit b56701a):
- proxy.py -> extend with multi-fidelity eviction, object segmentation
- probe.py -> extend with object-level analytics
- Add: helper_llm.py for summary generation, micro-fault QA
- Add: object_store.py for PostgreSQL + pgvector integration
- Add: segmenter.py for semantic object detection
- Add: fidelity.py for multi-fidelity state machine
11.3 With the Helper LLM
The helper LLM is called via standard API (Anthropic for Haiku, or Ollama for local):
| Task | Model | Expected Latency | Tokens In | Tokens Out |
|---|---|---|---|---|
| Summarize L0 -> L1 | Haiku | ~200ms | ~2000 | ~600 |
| Compress L1 -> L2 | Haiku | ~100ms | ~600 | ~100 |
| Micro-fault answer | Haiku | ~150ms | ~3000 | ~100 |
| Goal classification | Haiku | ~100ms | ~200 | ~50 |
| Object segmentation | local (MiniLM) | ~20ms | embedding only | N/A |
| Admission scoring | local (qwen2.5) | ~50ms | ~500 | ~10 |
Cost estimate per session (200 turns):
- ~50 summarizations: 50 * ~3000 tokens = 150K Haiku tokens ($0.004)
- ~20 micro-faults: 20 * ~3000 tokens = 60K Haiku tokens ($0.002)
- ~10 goal classifications: 2K Haiku tokens ($0.00005)
- Total helper cost: ~$0.006 per session
- Savings on main model: 50-93% context reduction on Opus/Sonnet calls
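The estimate can be reproduced with the price left as a parameter. The workload figures come from the list above (goal classification is taken as ~200 tokens per call, per the latency table); the function names are illustrative. The per-token rate is not stated in this document, so the sketch derives the rate implied by the stated totals instead of assuming a price list.

```python
from typing import Dict, Tuple

# task -> (calls per 200-turn session, approximate tokens per call)
WORKLOAD: Dict[str, Tuple[int, int]] = {
    "summarization": (50, 3000),
    "micro_fault": (20, 3000),
    "goal_classification": (10, 200),
}


def session_tokens(workload: Dict[str, Tuple[int, int]]) -> Dict[str, int]:
    """Total helper tokens per task for one session."""
    return {task: calls * per_call for task, (calls, per_call) in workload.items()}


def session_cost(workload: Dict[str, Tuple[int, int]],
                 usd_per_million_tokens: float) -> float:
    """Helper cost in USD for one session at a flat per-token rate."""
    total = sum(session_tokens(workload).values())
    return total * usd_per_million_tokens / 1_000_000


def implied_rate(total_usd: float, total_tokens: int) -> float:
    """USD per million tokens implied by a stated total cost."""
    return total_usd / total_tokens * 1_000_000
```

The workload sums to 212K helper tokens per session, so the stated ~$0.006 total corresponds to a blended rate of roughly $0.03 per million tokens. That rate is inferred from the figures above and should be checked against the helper model's current pricing; passing the actual rate to `session_cost` gives the corrected estimate.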
12. Failure Modes and Mitigations
| Failure Mode | Consequence | Mitigation |
|---|---|---|
| Helper LLM produces bad summary | Model loses critical info, silent quality degradation | Validate via declared losses. Spot-check: can helper answer can_answer queries from summary? |
| Object segmentation too coarse | Related content split across objects, fidelity changes break coherence | Conservative defaults (prefer larger objects). Relationship tracking keeps related objects together. |
| Object segmentation too fine | Too many small objects, overhead dominates | Minimum object size (500 tokens). Merge adjacent objects of same type. |
| Thrashing | Objects repeatedly degraded and restored, wasting helper LLM calls | Fault-driven pinning (Pichay L2). After 1 fault, pin at current fidelity for N turns. |
| Goal misclassification | Wrong objects loaded for current task | Conservative: always keep last 2 turns at L0. Don't evict below L2 on goal change (can upgrade quickly). |
| Backing store latency spike | Micro-fault takes >500ms, model generation stalls | Timeout + fallback: if backing store slow, inject L2 summary instead of querying. |
| Declared losses are incomplete | Model doesn't know it's missing info, doesn't fault | Entropy-gated faulting (Phase 4e) as safety net. Also: periodic loss audit by helper LLM. |
| Helper LLM unavailable | No summaries, no micro-faults | Graceful degradation: fall back to Pichay-style binary eviction with tombstones. |
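The timeout-and-fallback mitigation for backing store latency can be sketched with asyncio, which the proxy already uses. `query_with_fallback` and the two toy store coroutines are hypothetical names; the 0.5 s default mirrors the >500ms figure in the table.

```python
import asyncio
from typing import Awaitable, Callable


async def query_with_fallback(
    query_store: Callable[[], Awaitable[str]],
    l2_summary: str,
    timeout_s: float = 0.5,
) -> str:
    """Return the store's answer, or the L2 summary if the store is too slow."""
    try:
        return await asyncio.wait_for(query_store(), timeout=timeout_s)
    except asyncio.TimeoutError:
        # Backing store exceeded the latency budget: inject the summary instead.
        return l2_summary


# Toy backends standing in for the real Object Store query.
async def slow_store() -> str:
    await asyncio.sleep(0.2)
    return "fresh answer"


async def fast_store() -> str:
    return "fresh answer"
```

`asyncio.wait_for` cancels the pending query on timeout, so a slow store does not keep holding a connection after the fallback has been injected.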
13. Metrics and Evaluation
13.1 Primary Metrics
| Metric | Target | How to Measure |
|---|---|---|
| Context reduction | >80% vs baseline | (baseline tokens - actual tokens) / baseline tokens |
| Fault rate | <0.1% | faults / total evictions |
| Micro-fault success rate | >90% | micro-faults that avoided full page-in / total micro-faults |
| Task quality | No degradation | LLM-judged equivalence: full-context vs managed-context outputs |
| Helper LLM overhead | <5% of main model cost | helper cost / main model cost |
| Latency overhead | <300ms per turn average | (managed turn time - baseline turn time) |
13.2 Evaluation Method
- Offline replay: Replay recorded opencode sessions through the proxy. Compare managed output vs original output via LLM judge.
- A/B testing: Run identical tasks with and without proxy. Measure token usage, task completion, and code quality.
- Fault analysis: Log every fidelity transition, fault, and micro-fault. Identify patterns in what causes faults (guides admission control tuning).
14. Technology Stack
| Component | Technology | Rationale |
|---|---|---|
| Proxy | Python (asyncio + httpx) | Fork from Pichay (Python). Streaming support critical. |
| Object Store | PostgreSQL 16 + pgvector | Proven at scale by Letta. Hybrid vector + relational. |
| Embeddings | all-MiniLM-L6-v2 (ONNX, local) | Fast (~20ms), no API dependency, good enough for similarity. |
| Helper LLM | Anthropic Haiku (primary) / Ollama qwen2.5 (fallback) | Haiku: fast + cheap. Ollama: offline capable. |
| Streaming parser | Custom SSE parser | Must parse tool calls from streaming response before client sees them. |
| Config | TOML | Simple, human-readable. |
| Testing | pytest + recorded session replay | Replay real sessions for regression testing. |