# Object-Addressed Memory Manager for OpenCode

## Project Codename: Mnemosyne

> A transparent proxy that implements demand-paged, object-addressed memory management
> for LLM context windows. Extends Pichay's demand paging with semantic objects,
> multi-fidelity compression, declared losses, queryable backing store, and
> goal-aware retrieval via a helper LLM.

---

## 1. Problem Statement

LLM coding agents (opencode, Claude Code) suffer from context window bloat:

- **21.8% of input tokens are structural waste** (Pichay, 2026): unused tool schemas (11%), stale tool results reprocessed at 84.4x amplification (8.7%), duplicated content (2.2%)
- **Context drift** silently degrades reasoning quality before hitting hard token limits
- **Binary eviction** (resident vs evicted) is too coarse -- a 200-byte tombstone can't answer questions about 8KB of evicted code
- **No semantic awareness** -- eviction is by file path, not by conceptual relevance to the current task

## 2. Design Principles

1. **The context window is L1 cache, not memory.** Everything lives in the backing store; context is a curated working set.
2. **Eviction is cooperative.** The model participates in eviction decisions via cleanup tags and phantom tools. It has an incentive: cleaner context means better attention quality.
3. **Compression is authored, not algorithmic.** The model (or a helper LLM) writes summaries with declared losses. It knows what matters.
4. **The backing store is queryable.** The model can ask questions of evicted content without materializing it. Micro-faults replace full page-ins.
5. **Objects, not blocks.** The unit of memory is a semantic object (a design decision, a debugging session, a file understanding) -- not a fixed-size page keyed by file path.
6. **Transparency.** The proxy is invisible to the client and the inference API. No changes to opencode or the model required.

---

## 3. System Architecture

```
opencode (client)
    |
    |  HTTP (Messages API)
    v
+---------------------+
|   MNEMOSYNE PROXY   |
|                     |
|  +---------------+  |
|  | Context       |  |     +--------------------+
|  | Assembler     |--|---->| Helper LLM         |
|  |               |  |     | (Haiku / local)    |
|  +---------------+  |     |                    |
|  | Fidelity      |  |     | - Summarization    |
|  | Manager       |  |     | - Loss declaration |
|  +---------------+  |     | - Micro-fault QA   |
|  | Object        |  |     | - Segmentation     |
|  | Segmenter     |  |     +--------------------+
|  +---------------+  |
|  | Fault         |  |     +--------------------+
|  | Detector      |  |     | Object Store       |
|  +---------------+  |     | (PostgreSQL +      |
|  | Phantom Tool  |--|---->| pgvector)          |
|  | Handler       |  |     |                    |
|  +---------------+  |     | - Full content     |
|  | Cleanup Tag   |  |     | - Multi-fidelity   |
|  | Parser        |  |     |   summaries        |
|  +---------------+  |     | - Embeddings       |
|  | Pressure      |  |     | - Relationships    |
|  | Monitor       |  |     | - Fault history    |
|  +---------------+  |     +--------------------+
+---------------------+
    |
    |  HTTP (Messages API, modified)
    v
Inference API (Anthropic)
```

### Component Responsibilities

| Component | Role |
|---|---|
| **Context Assembler** | Builds the modified message array for each API call. Selects which objects are resident at which fidelity. Injects phantom tool definitions. |
| **Fidelity Manager** | Tracks current fidelity level of each object. Degrades fidelity under pressure. Upgrades on access. Manages the L0-L3 fidelity ladder. |
| **Object Segmenter** | Splits the conversation stream into semantic objects. Runs after each turn. Uses embedding coherence + structural signals (tool boundaries, topic shifts). |
| **Fault Detector** | Detects page faults (model re-requests evicted content). Records fault history for pinning decisions. Detects micro-fault queries. |
| **Phantom Tool Handler** | Intercepts phantom tool calls from the model's streaming response before they reach the client. Handles `memory_release`, `memory_query`, `memory_restore`. |
| **Cleanup Tag Parser** | Parses structured directives from the model's text output: `drop`, `summarize`, `anchor`, `collapse`. Extended with `declare_losses`. |
| **Pressure Monitor** | Tracks token consumption per request. Determines pressure zone (Normal/Caution/Warning/Critical/Emergency). Triggers fidelity degradation. |
| **Helper LLM** | Cheap model (Haiku, GPT-4o-mini, or local qwen2.5) that authors summaries, declares losses, answers micro-fault queries, and assists with object segmentation. |
| **Object Store** | PostgreSQL + pgvector database holding all semantic objects at all fidelity levels, with embeddings, metadata, relationships, and fault history. |

---

## 4. Memory Hierarchy

```
+----------------------------------------------------------------+
| L0: Full Content (in context window)                           |
|   - Current working set of semantic objects                    |
|   - Full text, no compression                                  |
|   - Capacity: ~60% of context window budget                    |
+----------------------------------------------------------------+
| L1: Detailed Summary (in context window)                       |
|   - Model-authored summary, ~30% of original size              |
|   - Preserves: file paths, function names, decisions, errors   |
|   - Declared losses: specific values, exact code, edge cases   |
|   - Capacity: ~20% of context window budget                    |
+----------------------------------------------------------------+
| L2: Compact Summary (in context window)                        |
|   - Model-authored headline, ~5% of original size              |
|   - Preserves: what was done, what was decided, key files      |
|   - Declared losses: implementation details, reasoning         |
|   - Capacity: ~15% of context window budget                    |
+----------------------------------------------------------------+
| L3: Metadata Stub (in context window)                          |
|   - One-line description + type + timestamp                    |
|   - ~50-100 tokens per object                                  |
|   - Capacity: ~5% of context window budget                     |
+----------------------------------------------------------------+
| L4: Evicted (not in context, in backing store only)            |
|   - Not present in context at all                              |
|   - Queryable via memory_query phantom tool                    |
|   - Restorable via memory_restore phantom tool                 |
+----------------------------------------------------------------+
| BACKING STORE (PostgreSQL + pgvector)                          |
|   - All objects at all fidelity levels, always                 |
|   - Full content preserved indefinitely                        |
|   - Embeddings for semantic search                             |
|   - Cross-session persistence (future)                         |
+----------------------------------------------------------------+
```

### Fidelity Transitions

```
Pressure rising (token count increasing):
  L0 (full) --[summarize]--> L1 (detailed) --[compress]--> L2 (compact) --[stub]--> L3 --[evict]--> L4

Access / fault (model needs content):
  L4 --[memory_restore]--> L0 (full page-in)
  L4 --[memory_query]--> (helper LLM answers, object stays at L4)
  L3 --[model references]--> L1 or L0 (upgrade on access)
  L2 --[model references]--> L1 (upgrade on access)
```

### Pressure Zones (token thresholds, configurable)

| Zone | Token % | Action |
|---|---|---|
| **Normal** | < 50% | Observe only. No fidelity changes. |
| **Caution** | 50-70% | Degrade oldest L0 objects to L1. |
| **Warning** | 70-85% | Degrade L0 to L1, L1 to L2, oldest L2 to L3. |
| **Critical** | 85-95% | Aggressive degradation. L2+ to L3. Evict L3 to L4. |
| **Emergency** | > 95% | Force-evict everything except last 2 user turns + system prompt. |

---

## 5. Semantic Objects

### 5.1 Object Types

| Type | Description | Example |
|---|---|---|
| `conversation_phase` | A coherent stretch of dialogue about one topic | "Discussed auth architecture for 8 turns" |
| `design_decision` | An explicit decision with rationale | "Chose JWT over sessions because..." |
| `debugging_session` | A sequence of diagnose-hypothesize-fix-verify | "Tracked down the race condition in..." |
| `file_context` | A file read and understanding | "Read src/auth/middleware.ts (200 lines)" |
| `tool_result` | Output from a tool call (grep, bash, etc.) | "grep found 15 matches for handleAuth" |
| `plan` | A structured plan or todo list | "Implementation plan: 5 steps..." |
| `error_context` | An error and its diagnosis | "TypeError at line 42, caused by..." |
| `external_reference` | Docs, API reference, examples pulled from outside | "React docs on useEffect cleanup" |

### 5.2 Object Segmentation Algorithm

Segmentation runs after each model turn. Two strategies, selected by context:

**Strategy A: Structural Segmentation (fast, rule-based)**

- Tool call boundaries are natural object boundaries
- Each `Read` result = `file_context` object
- Each `Bash`/`Grep` result = `tool_result` object
- User message + assistant response = potential `conversation_phase` boundary
- Heuristic: if topic similarity (embedding cosine) between consecutive turns drops below threshold (0.7), start a new `conversation_phase`

**Strategy B: Semantic Segmentation (slower, higher quality)**

- Based on xMemory's sparsity-semantics objective (arXiv:2602.02007)
- Embed all messages with a lightweight model (all-MiniLM-L6-v2, ~20ms)
- Cluster by coherence: maximize inter-object semantic diversity, minimize intra-object redundancy
- Build hierarchy: messages -> episodes -> themes
- Use for long sessions (>50 turns) where structural boundaries are insufficient

**Default: Strategy A for turns 1-50, Strategy B kicks in after turn 50.**

### 5.3 Object Relationships

Objects have typed relationships stored in the backing store:

```
parent_of:   conversation_phase -> design_decision   (decision made during phase)
caused_by:   error_context -> file_context           (error was in this file)
references:  debugging_session -> file_context       (files examined during debug)
supersedes:  file_context(v2) -> file_context(v1)    (file re-read after edit)
depends_on:  plan -> design_decision                 (plan relies on this decision)
```

Relationships are used by the Context Assembler: when upgrading an object's fidelity, it also considers upgrading the objects linked to it via `depends_on` and `references`.
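The relationship-driven upgrade in section 5.3 can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's implementation: `SemanticObject`, its `level` and `relations` fields, and `upgrade_candidates` are hypothetical names, and the store is a plain in-memory dict standing in for PostgreSQL. Only direct `depends_on` and `references` edges are followed.

```python
from dataclasses import dataclass, field

# Relationship types followed on upgrade, per section 5.3.
FOLLOWED_RELATIONS = ("depends_on", "references")


@dataclass
class SemanticObject:
    """Hypothetical in-memory stand-in for a row in the Object Store."""
    object_id: str
    level: int  # 0 = L0 (full content) ... 4 = L4 (evicted)
    relations: dict[str, list[str]] = field(default_factory=dict)


def upgrade_candidates(store: dict[str, SemanticObject],
                       object_id: str, new_level: int) -> list[str]:
    """Return ids of directly related objects worth considering for upgrade.

    A related object qualifies when it is held at a lower fidelity
    (higher level number) than the level the target is moving to.
    """
    target = store[object_id]
    candidates: list[str] = []
    for relation in FOLLOWED_RELATIONS:
        for related_id in target.relations.get(relation, []):
            related = store.get(related_id)
            if related is None or related_id in candidates:
                continue  # dangling edge, or already queued
            if related.level > new_level:
                candidates.append(related_id)
    return candidates


store = {
    "plan_1": SemanticObject("plan_1", 3, {"depends_on": ["decision_1"],
                                           "references": ["file_1", "file_2"]}),
    "decision_1": SemanticObject("decision_1", 4),  # evicted
    "file_1": SemanticObject("file_1", 0),          # already at full fidelity
    "file_2": SemanticObject("file_2", 2),
}
print(upgrade_candidates(store, "plan_1", new_level=1))
# ['decision_1', 'file_2']
```

The function only proposes candidates; whether each one is actually upgraded would still depend on the available token budget.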
---

## 6. Multi-Fidelity Compression

### 6.1 Summary Generation

When an object degrades from L0 to L1, the Helper LLM generates a summary:

**Input to Helper LLM:**

```
You are a context compression engine for a coding agent. Summarize the
following content while preserving maximum utility for future reference.

CONTENT TYPE: {object.type}
CONTENT:
{object.content_full}

INSTRUCTIONS:
1. Write a detailed summary (~30% of original length)
2. MUST preserve: file paths, function names, variable names, library names,
   error messages, decision rationale, specific values that may be referenced later
3. List DECLARED LOSSES: specific information you omitted that someone might
   need. Be precise -- "specific error codes" not "some details"
4. List CAN_ANSWER: categories of questions this summary can answer without
   needing the original content

OUTPUT FORMAT (JSON):
{
  "summary": "...",
  "losses": ["exact error code for token expiry", "rate limit threshold values", ...],
  "can_answer": ["auth approach used", "middleware chain order", "why JWT over sessions", ...],
  "key_entities": ["src/auth/middleware.ts", "handleAuth()", "jsonwebtoken", ...]
}
```

**L1 -> L2 compression** uses the L1 summary as input (not L0), with instruction to compress to ~5% of original. Additional losses are accumulated.
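A minimal sketch of how the proxy might consume the helper's JSON output and carry losses forward from L1 to L2. The key names follow the output format above; the `FidelitySummary` class and `parse_helper_output` function are hypothetical names introduced for illustration, and de-duplicating repeated losses is an assumption rather than something the design specifies.

```python
import json
from dataclasses import dataclass
from typing import Sequence

REQUIRED_KEYS = ("summary", "losses", "can_answer", "key_entities")


@dataclass
class FidelitySummary:
    level: int               # 1 = L1, 2 = L2
    summary: str
    losses: list[str]        # accumulated across fidelity levels
    can_answer: list[str]
    key_entities: list[str]


def parse_helper_output(raw: str, level: int,
                        inherited_losses: Sequence[str] = ()) -> FidelitySummary:
    """Validate the helper's JSON and merge its losses with inherited ones."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("helper output must be a JSON object")
    missing = [key for key in REQUIRED_KEYS if key not in data]
    if missing:
        raise ValueError(f"helper output missing keys: {missing}")
    losses = list(inherited_losses)
    for loss in data["losses"]:
        if loss not in losses:  # keep first occurrence, preserve order
            losses.append(loss)
    return FidelitySummary(level, data["summary"], losses,
                           list(data["can_answer"]), list(data["key_entities"]))


l1_raw = json.dumps({"summary": "Detailed summary", "losses": ["exact error codes"],
                     "can_answer": ["auth approach used"],
                     "key_entities": ["src/auth/middleware.ts"]})
l2_raw = json.dumps({"summary": "Headline", "can_answer": [],
                     "losses": ["function signatures", "exact error codes"],
                     "key_entities": ["src/auth/middleware.ts"]})

l1 = parse_helper_output(l1_raw, level=1)
l2 = parse_helper_output(l2_raw, level=2, inherited_losses=l1.losses)
print(l2.losses)
# ['exact error codes', 'function signatures']
```

Rejecting malformed output early gives the proxy a clear point at which to retry the helper call or fall back to a coarser eviction strategy.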
**L2 -> L3 stub** is generated from L2:

```
[debugging_session | 2026-03-13 14:30 | Fixed race condition in auth token
 refresh by adding mutex lock in src/auth/refresh.ts | 12 related objects]
```

### 6.2 Declared Losses Schema

```typescript
interface DeclaredLosses {
  // What was dropped from this fidelity level
  dropped: string[]
  // What questions this fidelity level CAN still answer
  can_answer: string[]
  // Hint for when to fault (what would require the original)
  fault_when: string[]
  // Key entities preserved (for relationship tracking)
  key_entities: string[]
}
```

### 6.3 Loss Accumulation

As objects degrade through fidelity levels, losses accumulate:

```
L0: (full content, no losses)
L1: losses = ["exact error codes", "line-by-line implementation"]
L2: losses = L1.losses + ["function signatures", "reasoning chain"]
L3: losses = L2.losses + ["what was decided", "which files involved"]
    (at this point, only the one-line description remains)
```

The accumulated `fault_when` list tells the model exactly when it needs to fault: "If you need exact error codes, specific function signatures, or the reasoning behind the auth decision, restore this object."

---

## 7. Queryable Backing Store

### 7.1 The `memory_query` Phantom Tool

Instead of restoring a full object (8KB+) to answer a simple question, the model calls `memory_query`:

```json
{
  "tool": "memory_query",
  "input": {
    "question": "What error code does the auth middleware return for expired tokens?",
    "scope": "auth-related objects",
    "max_tokens": 200
  }
}
```

**Proxy handling:**

1. Proxy intercepts `memory_query` from the model's streaming response
2. Proxy queries the Object Store:
   - Embed the question
   - Find top-k relevant objects by cosine similarity (even evicted ones)
   - Retrieve their L0 (full content) from the backing store
3. Proxy sends question + retrieved full content to the Helper LLM
4. Helper LLM returns a targeted answer (~50-200 tokens)
5. Proxy injects the answer as a synthetic tool result into the model's context
6. The evicted objects stay evicted -- no fidelity change

**Token savings per micro-fault:**

- Traditional fault (Pichay): restore full page, ~4,000-8,000 tokens
- Micro-fault: inject targeted answer, ~50-200 tokens
- Savings: 95-99% per fault

### 7.2 Semantic Search for Micro-Faults

The backing store supports multiple retrieval strategies:

```sql
-- Parameters: $1 = session id, $2 = query embedding, $3 = raw query text.
-- <=> is pgvector's cosine-distance operator (<-> would be L2 distance).

-- Vector similarity (primary)
SELECT * FROM semantic_objects
WHERE session_id = $1
ORDER BY embedding <=> $2
LIMIT 5;

-- Hybrid: vector + keyword (for exact matches)
SELECT * FROM semantic_objects
WHERE session_id = $1
  AND (
    to_tsvector('english', content_full) @@ plainto_tsquery('english', $3)
    OR (embedding <=> $2) < 0.3
  )
ORDER BY embedding <=> $2
LIMIT 5;

-- Metadata-filtered (for typed queries)
SELECT * FROM semantic_objects
WHERE session_id = $1
  AND object_type = 'error_context'
  AND 'src/auth' = ANY(tags)
ORDER BY created_at DESC
LIMIT 3;
```

### 7.3 `memory_restore` -- Full Page-In

When the model needs full content (editing a file, reviewing exact code), it calls `memory_restore`, which does a traditional page-in:

```json
{
  "tool": "memory_restore",
  "input": {
    "object_id": "obj_abc123",
    "reason": "Need to edit the auth middleware"
  }
}
```

This upgrades the object to L0, potentially triggering eviction of other objects under pressure.

### 7.4 `memory_release` -- Cooperative Eviction

The model can voluntarily release objects it no longer needs:

```json
{
  "tool": "memory_release",
  "input": {
    "object_ids": ["obj_abc123", "obj_def456"],
    "reason": "Done with auth implementation, moving to tests"
  }
}
```

This immediately degrades the objects to L3 (or L4 under pressure), freeing context budget for new work.

---

## 8. Goal-Aware Retrieval

### 8.1 The Problem

When the model starts a new sub-task (e.g., "now write tests for auth"), the objects currently in context may be irrelevant (e.g., old debugging sessions for a different module). Goal-aware retrieval proactively swaps context based on the current task.

### 8.2 Detection: When Has the Goal Changed?

The Pressure Monitor also tracks goal transitions by comparing:

- The current user message embedding vs the previous user message embedding
- If cosine similarity < 0.5 (topic shift), trigger goal-aware retrieval

### 8.3 Goal-Aware Context Assembly

On goal transition:

1. **Helper LLM classifies the new goal** (~100ms):

   ```
   Given this user message: "{message}"
   What is the user's current goal? What context would be most relevant?
   Return: { "goal": "...", "relevant_types": [...], "relevant_tags": [...] }
   ```

2. **Query the Object Store** for relevant objects:

   ```sql
   -- $1 = session id, $2 = goal embedding; <=> is pgvector cosine distance
   SELECT * FROM semantic_objects
   WHERE session_id = $1
   ORDER BY embedding <=> $2
   LIMIT 20;
   ```

3. **Rank objects by relevance to new goal** (helper LLM or embedding similarity)
4. **Assemble new context window**:
   - Top-ranked objects at L0 or L1 (depending on budget)
   - Previously active but now irrelevant objects degraded to L2 or L3
   - Always preserve: system prompt, last 2 user turns, any pinned objects
5. **Inject into next API call** via `experimental.chat.messages.transform`

### 8.4 Predictive Loading

After goal classification, the helper LLM can predict what the model will need next:

```
Given goal "write tests for auth", the model will likely need:
- The auth middleware implementation (file_context for src/auth/middleware.ts)
- The existing test patterns (file_context for tests/*)
- The design decision about JWT (design_decision)
- NOT: the debugging session for the database migration
```

Pre-load predicted objects at L1, so they're available if the model needs them.

---

## 9. Admission Control (Write Path)

Not everything deserves to become a stored object. Based on A-MAC (arXiv:2603.04549):

### 9.1 Admission Score

```
S(m) = w_T * TypePrior(m) + w_N * Novelty(m) + w_U * Utility(m) + w_R * Recency(m)
```

| Factor | Signal | Weight (learned) |
|---|---|---|
| **TypePrior** | `design_decision` > `error_context` > `file_context` > `tool_result` | ~0.35 |
| **Novelty** | Cosine distance to nearest existing object > 0.15 | ~0.25 |
| **Utility** | Helper LLM scores future relevance (0-1) | ~0.25 |
| **Recency** | Exponential decay from creation time | ~0.15 |

**Threshold:** S(m) >= 0.4 to admit. Below threshold, content is kept only in the client's unmodified history (Pichay's backing store) but not indexed in the Object Store.

### 9.2 What Gets Rejected

- Routine tool results with no lasting value (e.g., `ls` output, `git status`)
- Duplicate file reads where content hasn't changed
- Conversation turns that are purely procedural ("Sure, I'll do that")

---

## 10. Entropy-Gated Faulting (L-RAG Integration)

Based on L-RAG (arXiv:2601.06551): use the model's own uncertainty as a fault signal.

### 10.1 Mechanism

During the model's generation (streaming response), monitor token-level entropy:

1. **Normal entropy** (H < 1.5): model is confident, no intervention
2. **Elevated entropy** (1.5 <= H <= 2.2): model may benefit from more context. Check if any L2/L3 objects match the current generation topic. If so, silently upgrade to L1.
3. **High entropy** (H > 2.2): model is struggling. Trigger a micro-fault -- query the backing store with the current generation context, inject relevant information.

### 10.2 Complementarity with Declared Losses

Entropy-gated faulting handles the case where the model doesn't know what it doesn't know. Declared losses handle the case where it does.
Together:

- **Declared losses**: "I know I need exact error codes, let me fault" -> model calls `memory_query`
- **Entropy signal**: model's generation becomes uncertain around error handling -> proxy automatically upgrades relevant objects

### 10.3 Implementation Complexity

Entropy monitoring requires access to token logprobs in the streaming response. The Anthropic Messages API does not currently document a logprobs option (`stream_options` is an OpenAI-style parameter), so this feature depends on an inference backend that exposes per-token log-probabilities, such as an OpenAI-compatible endpoint with `logprobs` enabled or a local model. This is a Phase 4e feature due to that dependency and the complexity of real-time entropy calculation during streaming.

---

## 11. Integration Points

### 11.1 With OpenCode (via oh-my-opencode)

The proxy can integrate at two levels:

**Level 1: Pure Proxy (Phase 1-3)**

- Standalone HTTP proxy between opencode and Anthropic API
- Zero changes to opencode or oh-my-opencode
- Configuration: set `ANTHROPIC_BASE_URL` to proxy address

**Level 2: Plugin Integration (Phase 4+)**

- oh-my-opencode hook: `experimental.chat.messages.transform` for context assembly
- oh-my-opencode hook: `experimental.session.compacting` for custom compaction
- oh-my-opencode hook: `tool.execute.after` for object segmentation on tool results
- MCP server exposing `memory_query`, `memory_stats`, `memory_objects` tools (so the user can inspect memory state)

### 11.2 With Pichay (fork and extend)

Start from `fsgeek/pichay` (commit `b56701a`):

- `proxy.py` -> extend with multi-fidelity eviction, object segmentation
- `probe.py` -> extend with object-level analytics
- Add: `helper_llm.py` for summary generation, micro-fault QA
- Add: `object_store.py` for PostgreSQL + pgvector integration
- Add: `segmenter.py` for semantic object detection
- Add: `fidelity.py` for multi-fidelity state machine

### 11.3 With the Helper LLM

The helper LLM is called via standard API (Anthropic for Haiku, or Ollama for local):

| Task | Model | Expected Latency | Tokens In | Tokens Out |
|---|---|---|---|---|
| Summarize L0 -> L1 | Haiku | ~200ms | ~2000 | ~600 |
| Compress L1 -> L2 | Haiku | ~100ms | ~600 | ~100 |
| Micro-fault answer | Haiku | ~150ms | ~3000 | ~100 |
| Goal classification | Haiku | ~100ms | ~200 | ~50 |
| Object segmentation | local (MiniLM) | ~20ms | embedding only | N/A |
| Admission scoring | local (qwen2.5) | ~50ms | ~500 | ~10 |

**Cost estimate per session (200 turns):**

- ~50 summarizations: 50 * ~3000 tokens = ~150K Haiku tokens (~$0.004)
- ~20 micro-faults: 20 * ~3000 tokens = ~60K Haiku tokens (~$0.002)
- ~10 goal classifications: ~2K Haiku tokens (~$0.00005)
- **Total helper cost: ~$0.006 per session**
- **Savings on main model**: 50-93% context reduction on Opus/Sonnet calls

---

## 12. Failure Modes and Mitigations

| Failure Mode | Consequence | Mitigation |
|---|---|---|
| **Helper LLM produces bad summary** | Model loses critical info, silent quality degradation | Validate via declared losses. Spot-check: can helper answer `can_answer` queries from summary? |
| **Object segmentation too coarse** | Related content split across objects, fidelity changes break coherence | Conservative defaults (prefer larger objects). Relationship tracking keeps related objects together. |
| **Object segmentation too fine** | Too many small objects, overhead dominates | Minimum object size (500 tokens). Merge adjacent objects of same type. |
| **Thrashing** | Objects repeatedly degraded and restored, wasting helper LLM calls | Fault-driven pinning (Pichay L2). After 1 fault, pin at current fidelity for N turns. |
| **Goal misclassification** | Wrong objects loaded for current task | Conservative: always keep last 2 turns at L0. Don't evict below L2 on goal change (can upgrade quickly). |
| **Backing store latency spike** | Micro-fault takes >500ms, model generation stalls | Timeout + fallback: if backing store slow, inject L2 summary instead of querying. |
| **Declared losses are incomplete** | Model doesn't know it's missing info, doesn't fault | Entropy-gated faulting (Phase 4e) as safety net. Also: periodic loss audit by helper LLM. |
| **Helper LLM unavailable** | No summaries, no micro-faults | Graceful degradation: fall back to Pichay-style binary eviction with tombstones. |

---

## 13. Metrics and Evaluation

### 13.1 Primary Metrics

| Metric | Target | How to Measure |
|---|---|---|
| **Context reduction** | >80% vs baseline | (baseline tokens - actual tokens) / baseline tokens |
| **Fault rate** | <0.1% | faults / total evictions |
| **Micro-fault success rate** | >90% | micro-faults that avoided full page-in / total micro-faults |
| **Task quality** | No degradation | LLM-judged equivalence: full-context vs managed-context outputs |
| **Helper LLM overhead** | <5% of main model cost | helper cost / main model cost |
| **Latency overhead** | <300ms per turn average | (managed turn time - baseline turn time) |

### 13.2 Evaluation Method

1. **Offline replay**: Replay recorded opencode sessions through the proxy. Compare managed output vs original output via LLM judge.
2. **A/B testing**: Run identical tasks with and without proxy. Measure token usage, task completion, and code quality.
3. **Fault analysis**: Log every fidelity transition, fault, and micro-fault. Identify patterns in what causes faults (guides admission control tuning).

---

## 14. Technology Stack

| Component | Technology | Rationale |
|---|---|---|
| **Proxy** | Python (asyncio + httpx) | Fork from Pichay (Python). Streaming support critical. |
| **Object Store** | PostgreSQL 16 + pgvector | Proven at scale by Letta. Hybrid vector + relational. |
| **Embeddings** | all-MiniLM-L6-v2 (ONNX, local) | Fast (~20ms), no API dependency, good enough for similarity. |
| **Helper LLM** | Anthropic Haiku (primary) / Ollama qwen2.5 (fallback) | Haiku: fast + cheap. Ollama: offline capable. |
| **Streaming parser** | Custom SSE parser | Must parse tool calls from streaming response before client sees them. |
| **Config** | TOML | Simple, human-readable. |
| **Testing** | pytest + recorded session replay | Replay real sessions for regression testing. |
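
The ratio-based metrics from section 13.1 could be computed from per-session counters during replay testing. The sketch below is illustrative only: `SessionCounters`, its field names, `primary_metrics`, and `meets_targets` are hypothetical, and task quality and latency overhead are left out because they need an LLM judge and timing data rather than simple counters.

```python
from dataclasses import dataclass


@dataclass
class SessionCounters:
    """Hypothetical per-session counters collected by the proxy."""
    baseline_tokens: int        # tokens sent without the proxy
    actual_tokens: int          # tokens sent with the proxy
    evictions: int
    faults: int
    micro_faults: int
    micro_faults_resolved: int  # micro-faults that avoided a full page-in
    helper_cost: float          # USD
    main_model_cost: float      # USD


def _ratio(numerator: float, denominator: float) -> float:
    """Divide, returning 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def primary_metrics(c: SessionCounters) -> dict[str, float]:
    """Compute the ratio metrics using the formulas in section 13.1."""
    return {
        "context_reduction": _ratio(c.baseline_tokens - c.actual_tokens,
                                    c.baseline_tokens),
        "fault_rate": _ratio(c.faults, c.evictions),
        "micro_fault_success_rate": _ratio(c.micro_faults_resolved,
                                           c.micro_faults),
        "helper_overhead": _ratio(c.helper_cost, c.main_model_cost),
    }


def meets_targets(m: dict[str, float]) -> dict[str, bool]:
    """Compare each metric against the targets listed in section 13.1."""
    return {
        "context_reduction": m["context_reduction"] > 0.80,
        "fault_rate": m["fault_rate"] < 0.001,
        "micro_fault_success_rate": m["micro_fault_success_rate"] > 0.90,
        "helper_overhead": m["helper_overhead"] < 0.05,
    }


counters = SessionCounters(baseline_tokens=1_000_000, actual_tokens=150_000,
                           evictions=2000, faults=1, micro_faults=20,
                           micro_faults_resolved=19, helper_cost=0.006,
                           main_model_cost=3.0)
metrics = primary_metrics(counters)
print(meets_targets(metrics))
```

A check like this fits naturally into a pytest replay suite, where each recorded session yields one `SessionCounters` instance.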