mnemosyne/ARCHITECTURE.md
Joey Yakimowich-Payne 7c6a3dbe4a docs: add architecture and reference documentation
Ultraworked with [Sisyphus](https://github.com/code-yeongyu/oh-my-opencode)

Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>
2026-03-13 11:41:41 -06:00

Object-Addressed Memory Manager for OpenCode

Project Codename: Mnemosyne

A transparent proxy that implements demand-paged, object-addressed memory management for LLM context windows. It extends Pichay's demand paging with semantic objects, multi-fidelity compression, declared losses, a queryable backing store, and goal-aware retrieval via a helper LLM.


1. Problem Statement

LLM coding agents (opencode, Claude Code) suffer from context window bloat:

  • 21.8% of input tokens are structural waste (Pichay, 2026): unused tool schemas (11%), stale tool results reprocessed at 84.4x amplification (8.7%), duplicated content (2.2%)
  • Context drift silently degrades reasoning quality before hitting hard token limits
  • Binary eviction (resident vs evicted) is too coarse -- a 200-byte tombstone can't answer questions about 8KB of evicted code
  • No semantic awareness -- eviction is by file path, not by conceptual relevance to the current task

2. Design Principles

  1. The context window is L1 cache, not memory. Everything lives in the backing store; context is a curated working set.
  2. Eviction is cooperative. The model participates in eviction decisions via cleanup tags and phantom tools. It has an incentive: cleaner context = better attention quality.
  3. Compression is authored, not algorithmic. The model (or a helper LLM) writes summaries with declared losses. It knows what matters.
  4. The backing store is queryable. The model can ask questions of evicted content without materializing it. Micro-faults replace full page-ins.
  5. Objects, not blocks. The unit of memory is a semantic object (a design decision, a debugging session, a file understanding) -- not a fixed-size page keyed by file path.
  6. Transparency. The proxy is invisible to the client and the inference API. No changes to opencode or the model required.

3. System Architecture

                    opencode (client)
                         |
                         | HTTP (Messages API)
                         v
              +---------------------+
              |    MNEMOSYNE PROXY  |
              |                     |
              |  +---------------+  |
              |  | Context       |  |     +-------------------+
              |  | Assembler     |--|---->| Helper LLM        |
              |  |               |  |     | (Haiku / local)   |
              |  +---------------+  |     |                   |
              |  | Fidelity      |  |     | - Summarization   |
              |  | Manager       |  |     | - Loss declaration|
              |  +---------------+  |     | - Micro-fault QA  |
              |  | Object        |  |     | - Segmentation    |
              |  | Segmenter     |  |     +-------------------+
              |  +---------------+  |
              |  | Fault         |  |     +------------------+
              |  | Detector      |  |     | Object Store     |
              |  +---------------+  |     | (PostgreSQL +    |
              |  | Phantom Tool  |--|---->|  pgvector)       |
              |  | Handler       |  |     |                  |
              |  +---------------+  |     | - Full content   |
              |  | Cleanup Tag   |  |     | - Multi-fidelity |
              |  | Parser        |  |     |   summaries      |
              |  +---------------+  |     | - Embeddings     |
              |  | Pressure      |  |     | - Relationships  |
              |  | Monitor       |  |     | - Fault history  |
              |  +---------------+  |     +------------------+
              +---------------------+
                         |
                         | HTTP (Messages API, modified)
                         v
                  Inference API (Anthropic)

Component Responsibilities

| Component | Role |
|---|---|
| Context Assembler | Builds the modified message array for each API call. Selects which objects are resident at which fidelity. Injects phantom tool definitions. |
| Fidelity Manager | Tracks current fidelity level of each object. Degrades fidelity under pressure. Upgrades on access. Manages the L0-L4 fidelity ladder. |
| Object Segmenter | Splits the conversation stream into semantic objects. Runs after each turn. Uses embedding coherence + structural signals (tool boundaries, topic shifts). |
| Fault Detector | Detects page faults (model re-requests evicted content). Records fault history for pinning decisions. Detects micro-fault queries. |
| Phantom Tool Handler | Intercepts phantom tool calls from the model's streaming response before they reach the client. Handles memory_release, memory_query, memory_restore. |
| Cleanup Tag Parser | Parses structured directives from the model's text output: drop, summarize, anchor, collapse. Extended with declare_losses. |
| Pressure Monitor | Tracks token consumption per request. Determines pressure zone (Normal/Caution/Warning/Critical/Emergency). Triggers fidelity degradation. |
| Helper LLM | Cheap model (Haiku, GPT-4o-mini, or local qwen2.5) that authors summaries, declares losses, answers micro-fault queries, and assists with object segmentation. |
| Object Store | PostgreSQL + pgvector database holding all semantic objects at all fidelity levels, with embeddings, metadata, relationships, and fault history. |

4. Memory Hierarchy

+----------------------------------------------------------------+
|  L0: Full Content (in context window)                          |
|  - Current working set of semantic objects                     |
|  - Full text, no compression                                   |
|  - Capacity: ~60% of context window budget                     |
+----------------------------------------------------------------+
|  L1: Detailed Summary (in context window)                      |
|  - Model-authored summary, ~30% of original size               |
|  - Preserves: file paths, function names, decisions, errors    |
|  - Declared losses: specific values, exact code, edge cases    |
|  - Capacity: ~20% of context window budget                     |
+----------------------------------------------------------------+
|  L2: Compact Summary (in context window)                       |
|  - Model-authored headline, ~5% of original size               |
|  - Preserves: what was done, what was decided, key files       |
|  - Declared losses: implementation details, reasoning          |
|  - Capacity: ~15% of context window budget                     |
+----------------------------------------------------------------+
|  L3: Metadata Stub (in context window)                         |
|  - One-line description + type + timestamp                     |
|  - ~50-100 tokens per object                                   |
|  - Capacity: ~5% of context window budget                      |
+----------------------------------------------------------------+
|  L4: Evicted (not in context, in backing store only)           |
|  - Not present in context at all                               |
|  - Queryable via memory_query phantom tool                     |
|  - Restorable via memory_restore phantom tool                  |
+----------------------------------------------------------------+
|  BACKING STORE (PostgreSQL + pgvector)                         |
|  - All objects at all fidelity levels, always                  |
|  - Full content preserved indefinitely                         |
|  - Embeddings for semantic search                              |
|  - Cross-session persistence (future)                          |
+----------------------------------------------------------------+

Fidelity Transitions

Pressure rising (token count increasing):
  L0 (full) --[summarize]--> L1 (detailed) --[compress]--> L2 (compact) --[stub]--> L3 --[evict]--> L4

Access / fault (model needs content):
  L4 --[memory_restore]--> L0 (full page-in)
  L4 --[memory_query]--> (helper LLM answers, object stays at L4)
  L3 --[model references]--> L1 or L0 (upgrade on access)
  L2 --[model references]--> L1 (upgrade on access)

Pressure Zones (token thresholds, configurable)

| Zone | Token % | Action |
|---|---|---|
| Normal | < 50% | Observe only. No fidelity changes. |
| Caution | 50-70% | Degrade oldest L0 objects to L1. |
| Warning | 70-85% | Degrade L0 to L1, L1 to L2, oldest L2 to L3. |
| Critical | 85-95% | Aggressive degradation. L2+ to L3. Evict L3 to L4. |
| Emergency | > 95% | Force-evict everything except last 2 user turns + system prompt. |
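
As an illustration of the thresholds above, here is a minimal sketch of zone classification. The function name `pressure_zone` is hypothetical, and the handling of exact boundary values (50%, 70%, 85% fall into the higher zone; 95% stays in Critical because Emergency is "> 95%") is an assumption, since the table does not specify it.

```python
def pressure_zone(used_tokens: int, window_tokens: int) -> str:
    """Map context usage to a pressure zone using the default thresholds."""
    if window_tokens <= 0:
        raise ValueError("window_tokens must be positive")
    pct = 100.0 * used_tokens / window_tokens
    if pct < 50:
        return "Normal"
    if pct < 70:
        return "Caution"
    if pct < 85:
        return "Warning"
    if pct <= 95:  # Emergency is strictly above 95%
        return "Critical"
    return "Emergency"
```

For example, `pressure_zone(150_000, 200_000)` evaluates 75% usage and returns `"Warning"`.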

5. Semantic Objects

5.1 Object Types

| Type | Description | Example |
|---|---|---|
| conversation_phase | A coherent stretch of dialogue about one topic | "Discussed auth architecture for 8 turns" |
| design_decision | An explicit decision with rationale | "Chose JWT over sessions because..." |
| debugging_session | A sequence of diagnose-hypothesize-fix-verify | "Tracked down the race condition in..." |
| file_context | A file read and understanding | "Read src/auth/middleware.ts (200 lines)" |
| tool_result | Output from a tool call (grep, bash, etc.) | "grep found 15 matches for handleAuth" |
| plan | A structured plan or todo list | "Implementation plan: 5 steps..." |
| error_context | An error and its diagnosis | "TypeError at line 42, caused by..." |
| external_reference | Docs, API reference, examples pulled from outside | "React docs on useEffect cleanup" |

5.2 Object Segmentation Algorithm

Segmentation runs after each model turn. Two strategies, selected by context:

Strategy A: Structural Segmentation (fast, rule-based)

  • Tool call boundaries are natural object boundaries
  • Each Read result = file_context object
  • Each Bash/Grep result = tool_result object
  • User message + assistant response = potential conversation_phase boundary
  • Heuristic: if topic similarity (embedding cosine) between consecutive turns drops below threshold (0.7), start a new conversation_phase
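
The similarity heuristic in Strategy A could look roughly like the sketch below. It assumes one embedding vector per turn is already available; the names `cosine_similarity` and `phase_boundaries` are hypothetical, and only the 0.7 threshold comes from the text.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def phase_boundaries(
    turn_embeddings: Sequence[Sequence[float]], threshold: float = 0.7
) -> List[int]:
    """Return indices of turns that start a new conversation_phase.

    The first turn always opens a phase; a later turn opens one when its
    similarity to the previous turn drops below the threshold.
    """
    boundaries = [0] if turn_embeddings else []
    for i in range(1, len(turn_embeddings)):
        if cosine_similarity(turn_embeddings[i - 1], turn_embeddings[i]) < threshold:
            boundaries.append(i)
    return boundaries
```

With the toy vectors `[[1, 0], [0.9, 0.1], [0, 1], [0.1, 0.9]]`, the first two turns stay together and a new phase starts at index 2.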

Strategy B: Semantic Segmentation (slower, higher quality)

  • Based on xMemory's sparsity-semantics objective (arXiv:2602.02007)
  • Embed all messages with a lightweight model (all-MiniLM-L6-v2, ~20ms)
  • Cluster by coherence: maximize inter-object semantic diversity, minimize intra-object redundancy
  • Build hierarchy: messages -> episodes -> themes
  • Use for long sessions (>50 turns) where structural boundaries are insufficient

Default: Strategy A for the first 50 turns; Strategy B takes over once a session exceeds 50 turns.

5.3 Object Relationships

Objects have typed relationships stored in the backing store:

parent_of:    conversation_phase -> design_decision (decision made during phase)
caused_by:    error_context -> file_context (error was in this file)
references:   debugging_session -> file_context (files examined during debug)
supersedes:   file_context(v2) -> file_context(v1) (file re-read after edit)
depends_on:   plan -> design_decision (plan relies on this decision)

Relationships are used by the Context Assembler: when upgrading an object's fidelity, it also considers upgrading the objects that object points to via depends_on and references.
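
A minimal sketch of how related objects might be collected as upgrade candidates follows. It assumes relationships are available as (source, relation, target) tuples and follows only direct edges; the function name and the object IDs in the usage example are hypothetical.

```python
from typing import Iterable, List, Sequence, Tuple

# One relationship edge: (source_id, relation, target_id)
Edge = Tuple[str, str, str]


def upgrade_candidates(
    object_id: str,
    edges: Iterable[Edge],
    follow: Sequence[str] = ("depends_on", "references"),
) -> List[str]:
    """List objects directly linked from object_id by the followed relations.

    Order follows the edge list and duplicates are dropped. Whether each
    candidate is actually upgraded is left to the caller (budget, pressure).
    """
    seen = set()
    candidates: List[str] = []
    for source, relation, target in edges:
        if source == object_id and relation in follow and target not in seen:
            seen.add(target)
            candidates.append(target)
    return candidates
```

For instance, with edges `[("plan_1", "depends_on", "decision_1"), ("plan_1", "references", "file_1")]`, upgrading `plan_1` would surface `decision_1` and `file_1` as candidates.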


6. Multi-Fidelity Compression

6.1 Summary Generation

When an object degrades from L0 to L1, the Helper LLM generates a summary:

Input to Helper LLM:

You are a context compression engine for a coding agent. Summarize the following
content while preserving maximum utility for future reference.

CONTENT TYPE: {object.type}
CONTENT:
{object.content_full}

INSTRUCTIONS:
1. Write a detailed summary (~30% of original length)
2. MUST preserve: file paths, function names, variable names, library names,
   error messages, decision rationale, specific values that may be referenced later
3. List DECLARED LOSSES: specific information you omitted that someone might need.
   Be precise -- "specific error codes" not "some details"
4. List CAN_ANSWER: categories of questions this summary can answer without
   needing the original content

OUTPUT FORMAT (JSON):
{
  "summary": "...",
  "losses": ["exact error code for token expiry", "rate limit threshold values", ...],
  "can_answer": ["auth approach used", "middleware chain order", "why JWT over sessions", ...],
  "key_entities": ["src/auth/middleware.ts", "handleAuth()", "jsonwebtoken", ...]
}

L1 -> L2 compression uses the L1 summary as input (not L0), with an instruction to compress to ~5% of the original length. The losses declared at this step are added to those inherited from L1.

L2 -> L3 stub is generated from L2:

[debugging_session | 2026-03-13 14:30 | Fixed race condition in auth token refresh
 by adding mutex lock in src/auth/refresh.ts | 12 related objects]

6.2 Declared Losses Schema

interface DeclaredLosses {
  // What was dropped from this fidelity level
  dropped: string[]
  
  // What questions this fidelity level CAN still answer
  can_answer: string[]
  
  // Hint for when to fault (what would require the original)
  fault_when: string[]
  
  // Key entities preserved (for relationship tracking)
  key_entities: string[]
}

6.3 Loss Accumulation

As objects degrade through fidelity levels, losses accumulate:

L0: (full content, no losses)
L1: losses = ["exact error codes", "line-by-line implementation"]
L2: losses = L1.losses + ["function signatures", "reasoning chain"]
L3: losses = L2.losses + ["what was decided", "which files involved"]
     (at this point, only the one-line description remains)

The accumulated fault_when list tells the model exactly when it needs to fault: "If you need exact error codes, specific function signatures, or the reasoning behind the auth decision, restore this object."
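
Loss accumulation can be sketched as a simple ordered merge. This is an illustration under assumptions: the helper names `accumulate_losses` and `fault_hint` are hypothetical, and the hint wording is a simplified version of the example sentence above.

```python
from typing import List


def accumulate_losses(inherited: List[str], new: List[str]) -> List[str]:
    """Append newly declared losses to the inherited ones.

    Order is preserved and exact duplicates are dropped, so the list grows
    monotonically as an object moves from L1 to L2 to L3.
    """
    merged = list(inherited)
    for item in new:
        if item not in merged:
            merged.append(item)
    return merged


def fault_hint(losses: List[str]) -> str:
    """Render a one-line hint telling the model when to restore the object."""
    if not losses:
        return ""
    return "If you need " + ", ".join(losses) + ", restore this object."
```

Using the lists from the example above, `accumulate_losses(["exact error codes", "line-by-line implementation"], ["function signatures", "reasoning chain"])` yields the four-item L2 list.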


7. Queryable Backing Store

7.1 The memory_query Phantom Tool

Instead of restoring a full object (8KB+) to answer a simple question, the model calls memory_query:

{
  "tool": "memory_query",
  "input": {
    "question": "What error code does the auth middleware return for expired tokens?",
    "scope": "auth-related objects",
    "max_tokens": 200
  }
}

Proxy handling:

  1. Proxy intercepts memory_query from the model's streaming response
  2. Proxy queries the Object Store:
    • Embed the question
    • Find top-k relevant objects by cosine similarity (even evicted ones)
    • Retrieve their L0 (full content) from the backing store
  3. Proxy sends question + retrieved full content to the Helper LLM
  4. Helper LLM returns a targeted answer (~50-200 tokens)
  5. Proxy injects the answer as a synthetic tool result into the model's context
  6. The evicted objects stay evicted -- no fidelity change
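
The handling steps above might be wired together as in the following sketch. Everything here is an assumption for illustration: `StoredObject`, `handle_memory_query`, and the three injected callables (`embed`, `search`, `ask_helper`) are hypothetical stand-ins for the embedding model, the Object Store query, and the Helper LLM call, and the returned dictionary is not the actual tool-result wire format.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence


@dataclass
class StoredObject:
    object_id: str
    content_full: str  # L0 content kept in the backing store
    fidelity: str      # e.g. "L4" for an evicted object


def handle_memory_query(
    question: str,
    embed: Callable[[str], Sequence[float]],
    search: Callable[[Sequence[float], int], List[StoredObject]],
    ask_helper: Callable[[str, List[str]], str],
    top_k: int = 5,
) -> Dict[str, object]:
    """Answer a micro-fault without changing any object's fidelity."""
    query_embedding = embed(question)               # step 2: embed the question
    hits = search(query_embedding, top_k)           # step 2: top-k by similarity
    contents = [obj.content_full for obj in hits]   # step 2: fetch L0 content
    answer = ask_helper(question, contents)         # steps 3-4: targeted answer
    # Step 5: the caller injects this as a synthetic tool result.
    # Step 6: hits are read-only here, so evicted objects stay evicted.
    return {
        "answer": answer,
        "source_object_ids": [obj.object_id for obj in hits],
    }
```

Passing the dependencies in as callables keeps the handler testable with stubs and leaves room for the timeout and fallback behavior described under failure modes.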

Token savings per micro-fault:

  • Traditional fault (Pichay): restore full page, ~4,000-8,000 tokens
  • Micro-fault: inject targeted answer, ~50-200 tokens
  • Savings: 95-99% per fault

7.2 Semantic Search for Micro-Faults

The backing store supports multiple retrieval strategies:

-- Vector similarity (primary)
-- <=> is pgvector's cosine-distance operator; <-> would be Euclidean (L2) distance
SELECT * FROM semantic_objects
WHERE session_id = $1
ORDER BY embedding <=> $query_embedding
LIMIT 5;

-- Hybrid: vector + keyword (for exact matches)
SELECT * FROM semantic_objects
WHERE session_id = $1
  AND (
    to_tsvector('english', content_full) @@ plainto_tsquery('english', $query)
    OR embedding <=> $query_embedding < 0.3
  )
ORDER BY embedding <=> $query_embedding
LIMIT 5;

-- Metadata-filtered (for typed queries)
SELECT * FROM semantic_objects
WHERE session_id = $1
  AND object_type = 'error_context'
  AND 'src/auth' = ANY(tags)
ORDER BY created_at DESC
LIMIT 3;

7.3 memory_restore -- Full Page-In

When the model needs full content (editing a file, reviewing exact code), it calls memory_restore which does a traditional page-in:

{
  "tool": "memory_restore",
  "input": {
    "object_id": "obj_abc123",
    "reason": "Need to edit the auth middleware"
  }
}

This upgrades the object to L0, potentially triggering eviction of other objects under pressure.

7.4 memory_release -- Cooperative Eviction

The model can voluntarily release objects it no longer needs:

{
  "tool": "memory_release",
  "input": {
    "object_ids": ["obj_abc123", "obj_def456"],
    "reason": "Done with auth implementation, moving to tests"
  }
}

This immediately degrades the objects to L3 (or L4 under pressure), freeing context budget for new work.


8. Goal-Aware Retrieval

8.1 The Problem

When the model starts a new sub-task (e.g., "now write tests for auth"), the objects currently in context may be irrelevant (e.g., old debugging sessions for a different module). Goal-aware retrieval proactively swaps context based on the current task.

8.2 Detection: When Has the Goal Changed?

The Pressure Monitor also tracks goal transitions by comparing:

  • The current user message embedding vs the previous user message embedding
  • If cosine similarity < 0.5 (topic shift), trigger goal-aware retrieval

8.3 Goal-Aware Context Assembly

On goal transition:

  1. Helper LLM classifies the new goal (~100ms):

    Given this user message: "{message}"
    What is the user's current goal? What context would be most relevant?
    Return: { "goal": "...", "relevant_types": [...], "relevant_tags": [...] }
    
  2. Query the Object Store for relevant objects:

    SELECT * FROM semantic_objects
    WHERE session_id = $1
    ORDER BY embedding <=> $goal_embedding  -- <=> is pgvector cosine distance
    LIMIT 20;
    
  3. Rank objects by relevance to new goal (helper LLM or embedding similarity)

  4. Assemble new context window:

    • Top-ranked objects at L0 or L1 (depending on budget)
    • Previously active but now irrelevant objects degraded to L2 or L3
    • Always preserve: system prompt, last 2 user turns, any pinned objects

  5. Inject into next API call via experimental.chat.messages.transform

8.4 Predictive Loading

After goal classification, the helper LLM can predict what the model will need next:

Given goal "write tests for auth", the model will likely need:
- The auth middleware implementation (file_context for src/auth/middleware.ts)
- The existing test patterns (file_context for tests/*)
- The design decision about JWT (design_decision)
- NOT: the debugging session for the database migration

Pre-load predicted objects at L1, so they're available if the model needs them.


9. Admission Control (Write Path)

Not everything deserves to become a stored object. Based on A-MAC (arXiv:2603.04549):

9.1 Admission Score

S(m) = w_T * TypePrior(m) + w_N * Novelty(m) + w_U * Utility(m) + w_R * Recency(m)

| Factor | Signal | Weight (learned) |
|---|---|---|
| TypePrior | design_decision > error_context > file_context > tool_result | ~0.35 |
| Novelty | Cosine distance to nearest existing object > 0.15 | ~0.25 |
| Utility | Helper LLM scores future relevance (0-1) | ~0.25 |
| Recency | Exponential decay from creation time | ~0.15 |

Threshold: S(m) >= 0.4 to admit. Below threshold, content is kept only in the client's unmodified history (Pichay's backing store) but not indexed in the Object Store.
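
The score and threshold can be worked through with a short sketch. It assumes all four signals are already normalized to the range 0-1, which the text states only for Utility; the names `WEIGHTS`, `admission_score`, and `admit` are hypothetical, and the weights are the approximate values from the table rather than learned ones.

```python
from typing import Dict

# Approximate weights from the table above; in practice these are learned.
WEIGHTS: Dict[str, float] = {
    "type_prior": 0.35,
    "novelty": 0.25,
    "utility": 0.25,
    "recency": 0.15,
}
ADMIT_THRESHOLD = 0.4


def admission_score(
    type_prior: float,
    novelty: float,
    utility: float,
    recency: float,
    weights: Dict[str, float] = WEIGHTS,
) -> float:
    """Compute S(m) as the weighted sum of the four signals (each in 0-1)."""
    return (
        weights["type_prior"] * type_prior
        + weights["novelty"] * novelty
        + weights["utility"] * utility
        + weights["recency"] * recency
    )


def admit(score: float, threshold: float = ADMIT_THRESHOLD) -> bool:
    """Admit the candidate into the Object Store when S(m) >= threshold."""
    return score >= threshold
```

As a worked example with made-up signal values: a fresh but routine tool result scored (0.2, 0.1, 0.1, 1.0) gets 0.07 + 0.025 + 0.025 + 0.15 = 0.27 and is rejected, while a design decision scored (0.9, 0.5, 0.6, 0.8) gets 0.315 + 0.125 + 0.15 + 0.12 = 0.71 and is admitted.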

9.2 What Gets Rejected

  • Routine tool results with no lasting value (e.g., ls output, git status)
  • Duplicate file reads where content hasn't changed
  • Conversation turns that are purely procedural ("Sure, I'll do that")

10. Entropy-Gated Faulting (L-RAG Integration)

Based on L-RAG (arXiv:2601.06551): use the model's own uncertainty as a fault signal.

10.1 Mechanism

During the model's generation (streaming response), monitor token-level entropy:

  1. Normal entropy (H < 1.5): model is confident, no intervention
  2. Elevated entropy (1.5 <= H <= 2.2): model may benefit from more context. Check if any L2/L3 objects match the current generation topic. If so, silently upgrade to L1.
  3. High entropy (H > 2.2): model is struggling. Trigger a micro-fault -- query the backing store with the current generation context, inject relevant information.
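
The gating logic above can be sketched as follows. Several points are assumptions: the text does not say whether H is measured in nats or bits (nats are used here), the candidate log-probabilities are renormalized as if they covered the whole distribution, and the names `token_entropy` and `entropy_action` are hypothetical. Only the 1.5 and 2.2 thresholds come from the text.

```python
import math
from typing import Sequence


def token_entropy(logprobs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of one token position.

    Input is a list of natural-log probabilities for the candidate tokens.
    Probabilities are renormalized, which approximates the true entropy
    when only the top candidates are available.
    """
    probs = [math.exp(lp) for lp in logprobs]
    total = sum(probs)
    if total <= 0.0:
        return 0.0
    entropy = 0.0
    for p in probs:
        q = p / total
        if q > 0.0:
            entropy -= q * math.log(q)
    return entropy


def entropy_action(h: float, elevated: float = 1.5, high: float = 2.2) -> str:
    """Map an entropy value to the intervention described above."""
    if h < elevated:
        return "none"            # model is confident
    if h <= high:
        return "upgrade_to_L1"   # silently upgrade matching L2/L3 objects
    return "micro_fault"         # query the backing store and inject
```

In practice the entropy would probably be smoothed over a window of tokens before gating, since single-token spikes are common; that choice is left open here.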

10.2 Complementarity with Declared Losses

Entropy-gated faulting handles the case where the model doesn't know what it doesn't know. Declared losses handle the case where it does. Together:

  • Declared losses: "I know I need exact error codes, let me fault" -> model calls memory_query
  • Entropy signal: model's generation becomes uncertain around error handling -> proxy automatically upgrades relevant objects

10.3 Implementation Complexity

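Entropy monitoring requires access to token logprobs in the streaming response. This dependency needs verification: stream_options.include_logprobs is not a documented Anthropic Messages API parameter, and the Messages API is not known to return token logprobs, so the mechanism may only be usable with an inference backend that exposes them (for example, a locally hosted model). This is a Phase 4e feature due to that dependency and the complexity of real-time entropy calculation during streaming.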


11. Integration Points

11.1 With OpenCode (via oh-my-opencode)

The proxy can integrate at two levels:

Level 1: Pure Proxy (Phase 1-3)

  • Standalone HTTP proxy between opencode and Anthropic API
  • Zero changes to opencode or oh-my-opencode
  • Configuration: set ANTHROPIC_BASE_URL to proxy address

Level 2: Plugin Integration (Phase 4+)

  • oh-my-opencode hook: experimental.chat.messages.transform for context assembly
  • oh-my-opencode hook: experimental.session.compacting for custom compaction
  • oh-my-opencode hook: tool.execute.after for object segmentation on tool results
  • MCP server exposing memory_query, memory_stats, memory_objects tools (so the user can inspect memory state)

11.2 With Pichay (fork and extend)

Start from fsgeek/pichay (commit b56701a):

  • proxy.py -> extend with multi-fidelity eviction, object segmentation
  • probe.py -> extend with object-level analytics
  • Add: helper_llm.py for summary generation, micro-fault QA
  • Add: object_store.py for PostgreSQL + pgvector integration
  • Add: segmenter.py for semantic object detection
  • Add: fidelity.py for multi-fidelity state machine

11.3 With the Helper LLM

The helper LLM is called via standard API (Anthropic for Haiku, or Ollama for local):

| Task | Model | Expected Latency | Tokens In | Tokens Out |
|---|---|---|---|---|
| Summarize L0 -> L1 | Haiku | ~200ms | ~2000 | ~600 |
| Compress L1 -> L2 | Haiku | ~100ms | ~600 | ~100 |
| Micro-fault answer | Haiku | ~150ms | ~3000 | ~100 |
| Goal classification | Haiku | ~100ms | ~200 | ~50 |
| Object segmentation | local (MiniLM) | ~20ms | embedding only | N/A |
| Admission scoring | local (qwen2.5) | ~50ms | ~500 | ~10 |

Cost estimate per session (200 turns):

  • ~50 summarizations: 50 * ~3000 tokens = 150K Haiku tokens ($0.004)
  • ~20 micro-faults: 20 * ~3000 tokens = 60K Haiku tokens ($0.002)
  • ~10 goal classifications: 2K Haiku tokens ($0.00005)
  • Total helper cost: ~$0.006 per session
  • Savings on main model: 50-93% context reduction on Opus/Sonnet calls

12. Failure Modes and Mitigations

| Failure Mode | Consequence | Mitigation |
|---|---|---|
| Helper LLM produces bad summary | Model loses critical info, silent quality degradation | Validate via declared losses. Spot-check: can helper answer can_answer queries from summary? |
| Object segmentation too coarse | Related content split across objects, fidelity changes break coherence | Conservative defaults (prefer larger objects). Relationship tracking keeps related objects together. |
| Object segmentation too fine | Too many small objects, overhead dominates | Minimum object size (500 tokens). Merge adjacent objects of same type. |
| Thrashing | Objects repeatedly degraded and restored, wasting helper LLM calls | Fault-driven pinning (Pichay L2). After 1 fault, pin at current fidelity for N turns. |
| Goal misclassification | Wrong objects loaded for current task | Conservative: always keep last 2 turns at L0. Don't evict below L2 on goal change (can upgrade quickly). |
| Backing store latency spike | Micro-fault takes >500ms, model generation stalls | Timeout + fallback: if backing store slow, inject L2 summary instead of querying. |
| Declared losses are incomplete | Model doesn't know it's missing info, doesn't fault | Entropy-gated faulting (Phase 4e) as safety net. Also: periodic loss audit by helper LLM. |
| Helper LLM unavailable | No summaries, no micro-faults | Graceful degradation: fall back to Pichay-style binary eviction with tombstones. |

13. Metrics and Evaluation

13.1 Primary Metrics

| Metric | Target | How to Measure |
|---|---|---|
| Context reduction | >80% vs baseline | (baseline tokens - actual tokens) / baseline tokens |
| Fault rate | <0.1% | faults / total evictions |
| Micro-fault success rate | >90% | micro-faults that avoided full page-in / total micro-faults |
| Task quality | No degradation | LLM-judged equivalence: full-context vs managed-context outputs |
| Helper LLM overhead | <5% of main model cost | helper cost / main model cost |
| Latency overhead | <300ms per turn average | (managed turn time - baseline turn time) |

13.2 Evaluation Method

  1. Offline replay: Replay recorded opencode sessions through the proxy. Compare managed output vs original output via LLM judge.
  2. A/B testing: Run identical tasks with and without proxy. Measure token usage, task completion, and code quality.
  3. Fault analysis: Log every fidelity transition, fault, and micro-fault. Identify patterns in what causes faults (guides admission control tuning).

14. Technology Stack

| Component | Technology | Rationale |
|---|---|---|
| Proxy | Python (asyncio + httpx) | Fork from Pichay (Python). Streaming support critical. |
| Object Store | PostgreSQL 16 + pgvector | Proven at scale by Letta. Hybrid vector + relational. |
| Embeddings | all-MiniLM-L6-v2 (ONNX, local) | Fast (~20ms), no API dependency, good enough for similarity. |
| Helper LLM | Anthropic Haiku (primary) / Ollama qwen2.5 (fallback) | Haiku: fast + cheap. Ollama: offline capable. |
| Streaming parser | Custom SSE parser | Must parse tool calls from streaming response before client sees them. |
| Config | TOML | Simple, human-readable. |
| Testing | pytest + recorded session replay | Replay real sessions for regression testing. |