Joey Yakimowich-Payne 7c6a3dbe4a docs: add architecture and reference documentation

Ultraworked with [Sisyphus](https://github.com/code-yeongyu/oh-my-opencode)

Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>

2026-03-13 11:41:41 -06:00


Object-Addressed Memory Manager for OpenCode

Project Codename: Mnemosyne

A transparent proxy that implements demand-paged, object-addressed memory management for LLM context windows. It extends Pichay's demand paging with semantic objects, multi-fidelity compression, declared losses, a queryable backing store, and goal-aware retrieval via a helper LLM.

1. Problem Statement

LLM coding agents (opencode, Claude Code) suffer from context window bloat:

21.8% of input tokens are structural waste (Pichay, 2026): unused tool schemas (11%), stale tool results reprocessed at 84.4x amplification (8.7%), duplicated content (2.2%)
Context drift silently degrades reasoning quality before hitting hard token limits
Binary eviction (resident vs evicted) is too coarse -- a 200-byte tombstone can't answer questions about 8KB of evicted code
No semantic awareness -- eviction is by file path, not by conceptual relevance to the current task

2. Design Principles

The context window is L1 cache, not memory. Everything lives in the backing store; context is a curated working set.
Eviction is cooperative. The model participates in eviction decisions via cleanup tags and phantom tools. It has incentive: cleaner context = better attention quality.
Compression is authored, not algorithmic. The model (or a helper LLM) writes summaries with declared losses. It knows what matters.
The backing store is queryable. The model can ask questions of evicted content without materializing it. Micro-faults replace full page-ins.
Objects, not blocks. The unit of memory is a semantic object (a design decision, a debugging session, a file understanding) -- not a fixed-size page keyed by file path.
Transparency. The proxy is invisible to the client and the inference API. No changes to opencode or the model required.

3. System Architecture

                    opencode (client)
                         |
                         | HTTP (Messages API)
                         v
              +---------------------+
              |    MNEMOSYNE PROXY  |
              |                     |
              |  +---------------+  |
              |  | Context       |  |     +------------------+
              |  | Assembler     |--|---->| Helper LLM       |
              |  |               |  |     | (Haiku / local)  |
              |  +---------------+  |     |                  |
              |  | Fidelity      |  |     | - Summarization  |
              |  | Manager       |  |     | - Loss declaration|
              |  +---------------+  |     | - Micro-fault QA |
              |  | Object        |  |     | - Segmentation   |
              |  | Segmenter     |  |     +------------------+
              |  +---------------+  |
              |  | Fault         |  |     +------------------+
              |  | Detector      |  |     | Object Store     |
              |  +---------------+  |     | (PostgreSQL +    |
              |  | Phantom Tool  |--|---->|  pgvector)       |
              |  | Handler       |  |     |                  |
              |  +---------------+  |     | - Full content   |
              |  | Cleanup Tag   |  |     | - Multi-fidelity |
              |  | Parser        |  |     |   summaries      |
              |  +---------------+  |     | - Embeddings     |
              |  | Pressure      |  |     | - Relationships  |
              |  | Monitor       |  |     | - Fault history  |
              |  +---------------+  |     +------------------+
              +---------------------+
                         |
                         | HTTP (Messages API, modified)
                         v
                  Inference API (Anthropic)

Component Responsibilities

Component	Role
Context Assembler	Builds the modified message array for each API call. Selects which objects are resident at which fidelity. Injects phantom tool definitions.
Fidelity Manager	Tracks current fidelity level of each object. Degrades fidelity under pressure. Upgrades on access. Manages the L0-L4 fidelity ladder.
Object Segmenter	Splits the conversation stream into semantic objects. Runs after each turn. Uses embedding coherence + structural signals (tool boundaries, topic shifts).
Fault Detector	Detects page faults (model re-requests evicted content). Records fault history for pinning decisions. Detects micro-fault queries.
Phantom Tool Handler	Intercepts phantom tool calls from the model's streaming response before they reach the client. Handles `memory_release`, `memory_query`, `memory_restore`.
Cleanup Tag Parser	Parses structured directives from the model's text output: `drop`, `summarize`, `anchor`, `collapse`. Extended with `declare_losses`.
Pressure Monitor	Tracks token consumption per request. Determines pressure zone (Normal/Caution/Warning/Critical/Emergency). Triggers fidelity degradation.
Helper LLM	Cheap model (Haiku, GPT-4o-mini, or local qwen2.5) that authors summaries, declares losses, answers micro-fault queries, and assists with object segmentation.
Object Store	PostgreSQL + pgvector database holding all semantic objects at all fidelity levels, with embeddings, metadata, relationships, and fault history.

4. Memory Hierarchy

+----------------------------------------------------------------+
|  L0: Full Content (in context window)                          |
|  - Current working set of semantic objects                     |
|  - Full text, no compression                                  |
|  - Capacity: ~60% of context window budget                     |
+----------------------------------------------------------------+
|  L1: Detailed Summary (in context window)                      |
|  - Model-authored summary, ~30% of original size               |
|  - Preserves: file paths, function names, decisions, errors    |
|  - Declared losses: specific values, exact code, edge cases    |
|  - Capacity: ~20% of context window budget                     |
+----------------------------------------------------------------+
|  L2: Compact Summary (in context window)                       |
|  - Model-authored headline, ~5% of original size               |
|  - Preserves: what was done, what was decided, key files       |
|  - Declared losses: implementation details, reasoning          |
|  - Capacity: ~15% of context window budget                     |
+----------------------------------------------------------------+
|  L3: Metadata Stub (in context window)                         |
|  - One-line description + type + timestamp                     |
|  - ~50-100 tokens per object                                   |
|  - Capacity: ~5% of context window budget                      |
+----------------------------------------------------------------+
|  L4: Evicted (not in context, in backing store only)           |
|  - Not present in context at all                               |
|  - Queryable via memory_query phantom tool                     |
|  - Restorable via memory_restore phantom tool                  |
+----------------------------------------------------------------+
|  BACKING STORE (PostgreSQL + pgvector)                         |
|  - All objects at all fidelity levels, always                  |
|  - Full content preserved indefinitely                         |
|  - Embeddings for semantic search                              |
|  - Cross-session persistence (future)                          |
+----------------------------------------------------------------+

Fidelity Transitions

Pressure rising (token count increasing):
  L0 (full) --[summarize]--> L1 (detailed) --[compress]--> L2 (compact) --[stub]--> L3 --[evict]--> L4

Access / fault (model needs content):
  L4 --[memory_restore]--> L0 (full page-in)
  L4 --[memory_query]--> (helper LLM answers, object stays at L4)
  L3 --[model references]--> L1 or L0 (upgrade on access)
  L2 --[model references]--> L1 (upgrade on access)
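The transitions above can be read as a small state machine. The following is a minimal sketch, not the project's implementation: the names `Fidelity`, `degrade`, `upgrade_on_reference`, and `restore` are hypothetical, and where the diagram allows "L1 or L0" for a referenced L3 object, the sketch assumes the cheaper L1.

```python
from enum import IntEnum


class Fidelity(IntEnum):
    L0 = 0  # full content, resident in context
    L1 = 1  # detailed summary
    L2 = 2  # compact summary
    L3 = 3  # metadata stub
    L4 = 4  # evicted, backing store only


def degrade(level: Fidelity) -> Fidelity:
    """Move one step down the ladder under pressure; L4 is the floor."""
    return Fidelity(min(level + 1, Fidelity.L4))


def upgrade_on_reference(level: Fidelity) -> Fidelity:
    """Upgrade when the model references an object.

    Assumption: both L2 and L3 go to L1 (the diagram also permits L3 -> L0).
    Objects already at L0/L1, and evicted L4 objects, are left unchanged.
    """
    if level in (Fidelity.L2, Fidelity.L3):
        return Fidelity.L1
    return level


def restore(level: Fidelity) -> Fidelity:
    """memory_restore always pages the full content back in."""
    return Fidelity.L0
```

Keeping each transition a pure function of the current level makes the transitions easy to log and to replay in tests.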

Pressure Zones (token thresholds, configurable)

Zone	Token %	Action
Normal	< 50%	Observe only. No fidelity changes.
Caution	50-70%	Degrade oldest L0 objects to L1.
Warning	70-85%	Degrade L0 to L1, L1 to L2, oldest L2 to L3.
Critical	85-95%	Aggressive degradation. Degrade all objects above L3 (L0-L2) to L3. Evict L3 to L4.
Emergency	> 95%	Force-evict everything except last 2 user turns + system prompt.
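As an illustration of the table above, the zone lookup could be sketched as follows. The function and enum names are hypothetical, and the handling of exact boundary values (50% counts as Caution, exactly 95% still counts as Critical) is an assumption, since the table does not say which side each boundary falls on.

```python
from enum import Enum


class Zone(Enum):
    NORMAL = "normal"
    CAUTION = "caution"
    WARNING = "warning"
    CRITICAL = "critical"
    EMERGENCY = "emergency"


def pressure_zone(used_tokens: int, budget_tokens: int) -> Zone:
    """Map token usage to a pressure zone using the default thresholds."""
    if budget_tokens <= 0:
        raise ValueError("budget_tokens must be positive")
    fraction = used_tokens / budget_tokens
    if fraction < 0.50:
        return Zone.NORMAL
    if fraction < 0.70:
        return Zone.CAUTION
    if fraction < 0.85:
        return Zone.WARNING
    if fraction <= 0.95:  # assumption: exactly 95% is still Critical
        return Zone.CRITICAL
    return Zone.EMERGENCY
```

Because the thresholds are configurable, a real implementation would presumably read them from the TOML config rather than hard-coding them.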

5. Semantic Objects

5.1 Object Types

Type	Description	Example
`conversation_phase`	A coherent stretch of dialogue about one topic	"Discussed auth architecture for 8 turns"
`design_decision`	An explicit decision with rationale	"Chose JWT over sessions because..."
`debugging_session`	A sequence of diagnose-hypothesize-fix-verify	"Tracked down the race condition in..."
`file_context`	A file read and understanding	"Read src/auth/middleware.ts (200 lines)"
`tool_result`	Output from a tool call (grep, bash, etc.)	"grep found 15 matches for handleAuth"
`plan`	A structured plan or todo list	"Implementation plan: 5 steps..."
`error_context`	An error and its diagnosis	"TypeError at line 42, caused by..."
`external_reference`	Docs, API reference, examples pulled from outside	"React docs on useEffect cleanup"

5.2 Object Segmentation Algorithm

Segmentation runs after each model turn. Two strategies, selected by context:

Strategy A: Structural Segmentation (fast, rule-based)

Tool call boundaries are natural object boundaries
Each Read result = file_context object
Each Bash/Grep result = tool_result object
User message + assistant response = potential conversation_phase boundary
Heuristic: if topic similarity (embedding cosine) between consecutive turns drops below threshold (0.7), start a new conversation_phase
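The conversation_phase heuristic in Strategy A can be sketched as below. This is an illustration under stated assumptions: embeddings are plain lists of floats, one per turn, and `segment_phases` is a hypothetical helper name. Only the 0.7 threshold comes from the text above.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def segment_phases(
    turn_embeddings: Sequence[Sequence[float]], threshold: float = 0.7
) -> List[List[int]]:
    """Group consecutive turn indices into phases.

    A new phase starts whenever similarity to the previous turn
    drops below the threshold.
    """
    phases: List[List[int]] = []
    for i, embedding in enumerate(turn_embeddings):
        if i == 0 or cosine_similarity(turn_embeddings[i - 1], embedding) < threshold:
            phases.append([i])
        else:
            phases[-1].append(i)
    return phases
```

For example, four turns whose embeddings point in two clearly different directions would be grouped as `[[0, 1], [2, 3]]`.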

Strategy B: Semantic Segmentation (slower, higher quality)

Based on xMemory's sparsity-semantics objective (arXiv:2602.02007)
Embed all messages with a lightweight model (all-MiniLM-L6-v2, ~20ms)
Cluster by coherence: maximize inter-object semantic diversity, minimize intra-object redundancy
Build hierarchy: messages -> episodes -> themes
Use for long sessions (>50 turns) where structural boundaries are insufficient

Default: Strategy A for turns 1-50; Strategy B takes over once a session exceeds 50 turns.

5.3 Object Relationships

Objects have typed relationships stored in the backing store:

parent_of:    conversation_phase -> design_decision (decision made during phase)
caused_by:    error_context -> file_context (error was in this file)
references:   debugging_session -> file_context (files examined during debug)
supersedes:   file_context(v2) -> file_context(v1) (file re-read after edit)
depends_on:   plan -> design_decision (plan relies on this decision)

Relationships are used by the Context Assembler: when upgrading an object's fidelity, it also considers upgrading the objects that object points to via depends_on and references.
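A minimal sketch of that lookup, assuming relationships are available as `(source_id, relation, target_id)` tuples. The tuple layout and the name `upgrade_candidates` are illustrative, not part of the design.

```python
from typing import Iterable, List, Tuple

Edge = Tuple[str, str, str]  # (source_id, relation, target_id), hypothetical layout

# Relations the Context Assembler follows when upgrading an object.
UPGRADE_RELATIONS = frozenset({"depends_on", "references"})


def upgrade_candidates(object_id: str, edges: Iterable[Edge]) -> List[str]:
    """Return targets of depends_on/references edges leaving object_id.

    Order follows the edge list; duplicates are dropped.
    """
    seen = set()
    candidates: List[str] = []
    for source, relation, target in edges:
        if source == object_id and relation in UPGRADE_RELATIONS and target not in seen:
            seen.add(target)
            candidates.append(target)
    return candidates
```

The sketch only follows direct edges. Whether to follow chains of relationships transitively is left open here, since a deep traversal could pull a large part of the store back into context.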

6. Multi-Fidelity Compression

6.1 Summary Generation

When an object degrades from L0 to L1, the Helper LLM generates a summary:

Input to Helper LLM:

You are a context compression engine for a coding agent. Summarize the following
content while preserving maximum utility for future reference.

CONTENT TYPE: {object.type}
CONTENT:
{object.content_full}

INSTRUCTIONS:
1. Write a detailed summary (~30% of original length)
2. MUST preserve: file paths, function names, variable names, library names,
   error messages, decision rationale, specific values that may be referenced later
3. List DECLARED LOSSES: specific information you omitted that someone might need.
   Be precise -- "specific error codes" not "some details"
4. List CAN_ANSWER: categories of questions this summary can answer without
   needing the original content

OUTPUT FORMAT (JSON):
{
  "summary": "...",
  "losses": ["exact error code for token expiry", "rate limit threshold values", ...],
  "can_answer": ["auth approach used", "middleware chain order", "why JWT over sessions", ...],
  "key_entities": ["src/auth/middleware.ts", "handleAuth()", "jsonwebtoken", ...]
}

L1 -> L2 compression uses the L1 summary as input (not L0), with an instruction to compress to ~5% of the original size. The losses declared at this step are added to those already declared at L1.
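Because the helper LLM's reply is free-form text, the proxy would likely want to validate it before storing it. The sketch below checks the JSON output format from the prompt above. The function name and the `max_ratio` guard (rejecting summaries longer than half the original) are assumptions added for illustration.

```python
import json

REQUIRED_KEYS = ("summary", "losses", "can_answer", "key_entities")


def parse_summary_response(raw: str, original_len: int, max_ratio: float = 0.5) -> dict:
    """Parse and validate the helper LLM's JSON reply.

    Raises ValueError on malformed JSON, missing keys, wrong types,
    or a summary that is not meaningfully shorter than the original.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"helper reply is not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("helper reply must be a JSON object")
    missing = [key for key in REQUIRED_KEYS if key not in data]
    if missing:
        raise ValueError(f"helper reply is missing keys: {missing}")
    if not isinstance(data["summary"], str) or not data["summary"].strip():
        raise ValueError("summary must be a non-empty string")
    for key in REQUIRED_KEYS[1:]:
        values = data[key]
        if not isinstance(values, list) or not all(isinstance(v, str) for v in values):
            raise ValueError(f"{key} must be a list of strings")
    if len(data["summary"]) > max_ratio * original_len:
        raise ValueError("summary is too long relative to the original")
    return data
```

On a validation failure the proxy could retry the helper call or keep the object at its current fidelity. Which of these is appropriate is a policy decision this sketch does not make.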

L2 -> L3 stub is generated from L2:

[debugging_session | 2026-03-13 14:30 | Fixed race condition in auth token refresh
 by adding mutex lock in src/auth/refresh.ts | 12 related objects]

6.2 Declared Losses Schema

interface DeclaredLosses {
  // What was dropped from this fidelity level
  dropped: string[]
  
  // What questions this fidelity level CAN still answer
  can_answer: string[]
  
  // Hint for when to fault (what would require the original)
  fault_when: string[]
  
  // Key entities preserved (for relationship tracking)
  key_entities: string[]
}

6.3 Loss Accumulation

As objects degrade through fidelity levels, losses accumulate:

L0: (full content, no losses)
L1: losses = ["exact error codes", "line-by-line implementation"]
L2: losses = L1.losses + ["function signatures", "reasoning chain"]
L3: losses = L2.losses + ["what was decided", "which files involved"]
     (at this point, only the one-line description remains)

The accumulated fault_when list tells the model exactly when it needs to fault: "If you need exact error codes, specific function signatures, or the reasoning behind the auth decision, restore this object."
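The accumulation rule can be sketched in a few lines. The helper names are hypothetical, and the hint wording simply mirrors the example sentence above.

```python
from typing import List, Sequence


def accumulate_losses(previous: Sequence[str], new: Sequence[str]) -> List[str]:
    """Append the losses declared at this level to those inherited from the level above.

    Order is preserved and exact duplicates are dropped.
    """
    return list(dict.fromkeys([*previous, *new]))


def fault_hint(fault_when: Sequence[str]) -> str:
    """Render an accumulated fault_when list as a hint for the model."""
    if not fault_when:
        return ""
    return "If you need " + ", ".join(fault_when) + ", restore this object."


l1 = accumulate_losses([], ["exact error codes", "line-by-line implementation"])
l2 = accumulate_losses(l1, ["function signatures", "reasoning chain"])
l3 = accumulate_losses(l2, ["what was decided", "which files involved"])
```

After the three steps, `l3` holds all six entries in the order they were declared, so the stub at L3 can still tell the model everything that is no longer in context.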

7. Queryable Backing Store

7.1 The `memory_query` Phantom Tool

Instead of restoring a full object (8KB+) to answer a simple question, the model calls memory_query:

{
  "tool": "memory_query",
  "input": {
    "question": "What error code does the auth middleware return for expired tokens?",
    "scope": "auth-related objects",
    "max_tokens": 200
  }
}

Proxy handling:

1. Proxy intercepts memory_query from the model's streaming response.
2. Proxy queries the Object Store:
   - Embed the question
   - Find top-k relevant objects by cosine similarity (even evicted ones)
   - Retrieve their L0 (full content) from the backing store
3. Proxy sends question + retrieved full content to the Helper LLM.
4. Helper LLM returns a targeted answer (~50-200 tokens).
5. Proxy injects the answer as a synthetic tool result into the model's context.
6. The evicted objects stay evicted -- no fidelity change.
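The handling steps can be sketched end to end with the embedding model and helper LLM passed in as plain callables. Everything named here (`StoredObject`, `answer_micro_fault`, the callable signatures) is a hypothetical stand-in. A real proxy would query PostgreSQL rather than rank an in-memory list.

```python
import math
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class StoredObject:
    object_id: str
    embedding: Sequence[float]
    content_full: str  # L0 content, always kept in the backing store


def _cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def answer_micro_fault(
    question: str,
    objects: Sequence[StoredObject],
    embed: Callable[[str], Sequence[float]],
    ask_helper: Callable[[str, str], str],
    top_k: int = 3,
) -> str:
    """Answer a memory_query without changing any object's fidelity."""
    query_embedding = embed(question)
    # Rank all stored objects, evicted or not, by similarity to the question.
    ranked = sorted(
        objects, key=lambda obj: _cosine(query_embedding, obj.embedding), reverse=True
    )[:top_k]
    context = "\n\n".join(obj.content_full for obj in ranked)
    # The helper sees full content; only its short answer reaches the main model.
    return ask_helper(question, context)
```

The function returns only the helper's answer, which the proxy would then wrap as a synthetic tool result. It never mutates the stored objects, which reflects the rule that evicted objects stay evicted.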

Token savings per micro-fault:

Traditional fault (Pichay): restore full page, ~4,000-8,000 tokens
Micro-fault: inject targeted answer, ~50-200 tokens
Savings: 95-99% per fault
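The 95-99% range follows directly from the token counts above, as this small check shows. The function name is illustrative.

```python
def micro_fault_savings(full_tokens: int, answer_tokens: int) -> float:
    """Fraction of tokens saved by injecting an answer instead of a full page-in."""
    if full_tokens <= 0:
        raise ValueError("full_tokens must be positive")
    return 1.0 - answer_tokens / full_tokens


# Least favourable case: smallest page, longest answer.
worst = micro_fault_savings(full_tokens=4000, answer_tokens=200)
# Most favourable case: largest page, shortest answer.
best = micro_fault_savings(full_tokens=8000, answer_tokens=50)
```

Here `worst` comes out at 0.95 and `best` at a little over 0.99, which matches the stated range.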

7.2 Semantic Search for Micro-Faults

The backing store supports multiple retrieval strategies:

-- Vector similarity (primary); <=> is pgvector's cosine distance operator
SELECT * FROM semantic_objects
WHERE session_id = $1
ORDER BY embedding <=> $query_embedding
LIMIT 5;

-- Hybrid: vector + keyword (for exact matches)
SELECT * FROM semantic_objects
WHERE session_id = $1
  AND (
    to_tsvector('english', content_full) @@ plainto_tsquery('english', $query)
    OR embedding <=> $query_embedding < 0.3
  )
ORDER BY embedding <=> $query_embedding
LIMIT 5;

-- Metadata-filtered (for typed queries)
SELECT * FROM semantic_objects
WHERE session_id = $1
  AND object_type = 'error_context'
  AND 'src/auth' = ANY(tags)
ORDER BY created_at DESC
LIMIT 3;

7.3 `memory_restore` — Full Page-In

When the model needs full content (editing a file, reviewing exact code), it calls memory_restore which does a traditional page-in:

{
  "tool": "memory_restore",
  "input": {
    "object_id": "obj_abc123",
    "reason": "Need to edit the auth middleware"
  }
}

This upgrades the object to L0, potentially triggering eviction of other objects under pressure.

7.4 `memory_release` -- Cooperative Eviction

The model can voluntarily release objects it no longer needs:

{
  "tool": "memory_release",
  "input": {
    "object_ids": ["obj_abc123", "obj_def456"],
    "reason": "Done with auth implementation, moving to tests"
  }
}

This immediately degrades the objects to L3 (or L4 under pressure), freeing context budget for new work.

8. Goal-Aware Retrieval

8.1 The Problem

When the model starts a new sub-task (e.g., "now write tests for auth"), the objects currently in context may be irrelevant (e.g., old debugging sessions for a different module). Goal-aware retrieval proactively swaps context based on the current task.

8.2 Detection: When Has the Goal Changed?

The Pressure Monitor also tracks goal transitions:

It compares the current user message embedding with the previous user message embedding.
If cosine similarity falls below 0.5 (topic shift), it triggers goal-aware retrieval.
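A minimal sketch of the topic-shift check, assuming embeddings arrive as lists of floats. `goal_changed` is a hypothetical name and the 0.5 default is the threshold given above.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def goal_changed(
    previous_embedding: Sequence[float],
    current_embedding: Sequence[float],
    threshold: float = 0.5,
) -> bool:
    """True when consecutive user messages are dissimilar enough to count as a topic shift."""
    return cosine_similarity(previous_embedding, current_embedding) < threshold
```

Comparing only adjacent user messages is cheap, but a gradual drift over many turns would not cross the threshold. Comparing against the embedding of the last classified goal would be one alternative.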

8.3 Goal-Aware Context Assembly

On goal transition:

Helper LLM classifies the new goal (~100ms):

Given this user message: "{message}"
What is the user's current goal? What context would be most relevant?
Return: { "goal": "...", "relevant_types": [...], "relevant_tags": [...] }

Query the Object Store for relevant objects:

SELECT * FROM semantic_objects
WHERE session_id = $1
ORDER BY embedding <=> $goal_embedding  -- cosine distance (pgvector)
LIMIT 20;

Rank objects by relevance to new goal (helper LLM or embedding similarity)
Assemble new context window:
- Top-ranked objects at L0 or L1 (depending on budget)
- Previously active but now irrelevant objects degraded to L2 or L3
- Always preserve: system prompt, last 2 user turns, any pinned objects
Inject into next API call via experimental.chat.messages.transform
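The assembly step leaves open how fidelity levels are assigned within a budget. One simple possibility is a greedy pass over the ranked objects, sketched below. The `Candidate` type, the token fields, and the greedy policy itself are assumptions, not something the design specifies.

```python
from dataclasses import dataclass
from typing import Dict, Sequence


@dataclass
class Candidate:
    object_id: str
    tokens_l0: int  # cost of including full content
    tokens_l1: int  # cost of including the detailed summary


def assign_fidelity(ranked: Sequence[Candidate], budget_tokens: int) -> Dict[str, str]:
    """Greedily assign L0, then L1, to objects in relevance order.

    Objects that fit at neither level are marked L2, and their L2 cost
    is not counted against the budget in this sketch.
    """
    plan: Dict[str, str] = {}
    remaining = budget_tokens
    for candidate in ranked:
        if candidate.tokens_l0 <= remaining:
            plan[candidate.object_id] = "L0"
            remaining -= candidate.tokens_l0
        elif candidate.tokens_l1 <= remaining:
            plan[candidate.object_id] = "L1"
            remaining -= candidate.tokens_l1
        else:
            plan[candidate.object_id] = "L2"
    return plan
```

The budget passed in would be what is left after reserving space for the system prompt, the last 2 user turns, and pinned objects, since those are always preserved.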

8.4 Predictive Loading

After goal classification, the helper LLM can predict what the model will need next:

Given goal "write tests for auth", the model will likely need:
- The auth middleware implementation (file_context for src/auth/middleware.ts)
- The existing test patterns (file_context for tests/*)
- The design decision about JWT (design_decision)
- NOT: the debugging session for the database migration

Pre-load predicted objects at L1, so they're available if the model needs them.

9. Admission Control (Write Path)

Not everything deserves to become a stored object. Based on A-MAC (arXiv:2603.04549):

9.1 Admission Score

S(m) = w_T * TypePrior(m) + w_N * Novelty(m) + w_U * Utility(m) + w_R * Recency(m)

Factor	Signal	Weight (learned)
TypePrior	`design_decision` > `error_context` > `file_context` > `tool_result`	~0.35
Novelty	Cosine distance to nearest existing object > 0.15	~0.25
Utility	Helper LLM scores future relevance (0-1)	~0.25
Recency	Exponential decay from creation time	~0.15

Threshold: S(m) >= 0.4 to admit. Below threshold, content is kept only in the client's unmodified history (Pichay's backing store) but not indexed in the Object Store.
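The score and threshold can be written out directly. In this sketch the weights are the approximate values from the table, every factor is assumed to be normalized to the range 0-1, and the one-hour half-life for recency is a placeholder, since the document specifies exponential decay but not its rate.

```python
WEIGHTS = {"type_prior": 0.35, "novelty": 0.25, "utility": 0.25, "recency": 0.15}
ADMIT_THRESHOLD = 0.4


def recency_decay(age_seconds: float, half_life_seconds: float = 3600.0) -> float:
    """Exponential decay from creation time (half-life is a placeholder value)."""
    return 0.5 ** (age_seconds / half_life_seconds)


def admission_score(type_prior: float, novelty: float, utility: float, recency: float) -> float:
    """S(m) = w_T*TypePrior + w_N*Novelty + w_U*Utility + w_R*Recency."""
    return (
        WEIGHTS["type_prior"] * type_prior
        + WEIGHTS["novelty"] * novelty
        + WEIGHTS["utility"] * utility
        + WEIGHTS["recency"] * recency
    )


def admit(type_prior: float, novelty: float, utility: float, recency: float) -> bool:
    """Admit the content to the Object Store when S(m) >= 0.4."""
    return admission_score(type_prior, novelty, utility, recency) >= ADMIT_THRESHOLD
```

With made-up inputs, a fresh but routine tool result (type prior 0.2, novelty 0.1, utility 0.1, recency 1.0) scores about 0.27 and is rejected, while a fresh design decision (1.0, 0.5, 0.6, 1.0) scores about 0.775 and is admitted.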

9.2 What Gets Rejected

Routine tool results with no lasting value (e.g., ls output, git status)
Duplicate file reads where content hasn't changed
Conversation turns that are purely procedural ("Sure, I'll do that")

10. Entropy-Gated Faulting (L-RAG Integration)

Based on L-RAG (arXiv:2601.06551): use the model's own uncertainty as a fault signal.

10.1 Mechanism

During the model's generation (streaming response), monitor token-level entropy:

Normal entropy (H < 1.5): model is confident, no intervention
Elevated entropy (1.5 < H < 2.2): model may benefit from more context. Check if any L2/L3 objects match the current generation topic. If so, silently upgrade to L1.
High entropy (H > 2.2): model is struggling. Trigger a micro-fault -- query the backing store with the current generation context, inject relevant information.
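A sketch of the gating logic, under several assumptions: entropy is measured in nats, it is computed over whatever top-k log-probabilities the API returns for a token (renormalized to sum to 1), and the boundary values 1.5 and 2.2 fall into the "elevated" band. The function names and returned action labels are illustrative.

```python
import math
from typing import Sequence


def token_entropy(logprobs: Sequence[float]) -> float:
    """Shannon entropy (nats) of one token's distribution.

    Computed from the available log-probabilities, renormalized because
    a top-k list does not cover the full vocabulary.
    """
    probs = [math.exp(lp) for lp in logprobs]
    total = sum(probs)
    if total <= 0.0:
        raise ValueError("at least one log-probability is required")
    probs = [p / total for p in probs]
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def fault_action(entropy: float) -> str:
    """Map an entropy value to the intervention described above."""
    if entropy < 1.5:
        return "none"         # model is confident
    if entropy <= 2.2:
        return "upgrade"      # silently upgrade matching L2/L3 objects to L1
    return "micro_fault"      # query the backing store and inject information
```

A truncated top-k list understates the true entropy, so the thresholds would need calibrating against the k actually available. Reacting to a single high-entropy token is also likely too noisy, and a moving average over recent tokens may be more practical.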

10.2 Complementarity with Declared Losses

Entropy-gated faulting handles the case where the model doesn't know what it doesn't know. Declared losses handle the case where it does. Together:

Declared losses: "I know I need exact error codes, let me fault" -> model calls memory_query
Entropy signal: model's generation becomes uncertain around error handling -> proxy automatically upgrades relevant objects

10.3 Implementation Complexity

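Entropy monitoring requires access to token logprobs in the streaming response. This design assumes the inference API can return them (the plan names a stream_options.include_logprobs option); that assumption must be verified against the current Anthropic API reference, because logprob support varies by provider. This is a Phase 4e feature due to the complexity of real-time entropy calculation during streaming.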

11. Integration Points

11.1 With OpenCode (via oh-my-opencode)

The proxy can integrate at two levels:

Level 1: Pure Proxy (Phase 1-3)

Standalone HTTP proxy between opencode and Anthropic API
Zero changes to opencode or oh-my-opencode
Configuration: set ANTHROPIC_BASE_URL to proxy address

Level 2: Plugin Integration (Phase 4+)

oh-my-opencode hook: experimental.chat.messages.transform for context assembly
oh-my-opencode hook: experimental.session.compacting for custom compaction
oh-my-opencode hook: tool.execute.after for object segmentation on tool results
MCP server exposing memory_query, memory_stats, memory_objects tools (so the user can inspect memory state)

11.2 With Pichay (fork and extend)

Start from fsgeek/pichay (commit b56701a):

proxy.py -> extend with multi-fidelity eviction, object segmentation
probe.py -> extend with object-level analytics
Add: helper_llm.py for summary generation, micro-fault QA
Add: object_store.py for PostgreSQL + pgvector integration
Add: segmenter.py for semantic object detection
Add: fidelity.py for multi-fidelity state machine

11.3 With the Helper LLM

The helper LLM is called via standard API (Anthropic for Haiku, or Ollama for local):

Task	Model	Expected Latency	Tokens In	Tokens Out
Summarize L0 -> L1	Haiku	~200ms	~2000	~600
Compress L1 -> L2	Haiku	~100ms	~600	~100
Micro-fault answer	Haiku	~150ms	~3000	~100
Goal classification	Haiku	~100ms	~200	~50
Object segmentation	local (MiniLM)	~20ms	embedding only	N/A
Admission scoring	local (qwen2.5)	~50ms	~500	~10

Cost estimate per session (200 turns):

~50 summarizations: 50 * ~3000 tokens = ~150K Haiku tokens (~$0.004)
~20 micro-faults: 20 * ~3000 tokens = ~60K Haiku tokens (~$0.002)
~10 goal classifications: ~2K Haiku tokens (~$0.00005)
Total helper cost: ~$0.006 per session
Savings on main model: 50-93% context reduction on Opus/Sonnet calls
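The token side of this estimate is simple arithmetic and can be reproduced as below. The per-token price is deliberately left as a caller-supplied parameter, since it depends on the helper model and current provider pricing. The function names are illustrative.

```python
def helper_tokens_per_session(
    summarizations: int = 50,
    micro_faults: int = 20,
    tokens_per_call: int = 3000,
    goal_classification_tokens: int = 2000,
) -> int:
    """Total helper-LLM tokens for a 200-turn session, using the counts above."""
    return (summarizations + micro_faults) * tokens_per_call + goal_classification_tokens


def helper_cost_usd(tokens: int, usd_per_million_tokens: float) -> float:
    """Convert a token count to dollars at a caller-supplied rate."""
    return tokens / 1_000_000 * usd_per_million_tokens


total_tokens = helper_tokens_per_session()
```

With the default counts, `total_tokens` is 212,000 (150K + 60K + 2K). Plugging in the helper model's actual input and output rates gives the per-session dollar figure.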

12. Failure Modes and Mitigations

Failure Mode	Consequence	Mitigation
Helper LLM produces bad summary	Model loses critical info, silent quality degradation	Validate via declared losses. Spot-check: can helper answer `can_answer` queries from summary?
Object segmentation too coarse	Related content split across objects, fidelity changes break coherence	Conservative defaults (prefer larger objects). Relationship tracking keeps related objects together.
Object segmentation too fine	Too many small objects, overhead dominates	Minimum object size (500 tokens). Merge adjacent objects of same type.
Thrashing	Objects repeatedly degraded and restored, wasting helper LLM calls	Fault-driven pinning (Pichay L2). After 1 fault, pin at current fidelity for N turns.
Goal misclassification	Wrong objects loaded for current task	Conservative: always keep last 2 turns at L0. Don't evict below L2 on goal change (can upgrade quickly).
Backing store latency spike	Micro-fault takes >500ms, model generation stalls	Timeout + fallback: if backing store slow, inject L2 summary instead of querying.
Declared losses are incomplete	Model doesn't know it's missing info, doesn't fault	Entropy-gated faulting (Phase 4e) as safety net. Also: periodic loss audit by helper LLM.
Helper LLM unavailable	No summaries, no micro-faults	Graceful degradation: fall back to Pichay-style binary eviction with tombstones.

13. Metrics and Evaluation

13.1 Primary Metrics

Metric	Target	How to Measure
Context reduction	>80% vs baseline	(baseline tokens - actual tokens) / baseline tokens
Fault rate	<0.1%	faults / total evictions
Micro-fault success rate	>90%	micro-faults that avoided full page-in / total micro-faults
Task quality	No degradation	LLM-judged equivalence: full-context vs managed-context outputs
Helper LLM overhead	<5% of main model cost	helper cost / main model cost
Latency overhead	<300ms per turn average	(managed turn time - baseline turn time)
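The ratio metrics in the table can be computed from logged counters. The sketch below covers three of them. The function names are hypothetical, and returning 0.0 when a denominator is zero is an assumption made to keep the functions total.

```python
def context_reduction(baseline_tokens: int, actual_tokens: int) -> float:
    """(baseline tokens - actual tokens) / baseline tokens."""
    if baseline_tokens <= 0:
        raise ValueError("baseline_tokens must be positive")
    return (baseline_tokens - actual_tokens) / baseline_tokens


def fault_rate(faults: int, evictions: int) -> float:
    """faults / total evictions (0.0 when nothing has been evicted)."""
    return faults / evictions if evictions else 0.0


def micro_fault_success_rate(avoided_page_ins: int, micro_faults: int) -> float:
    """Micro-faults that avoided a full page-in / total micro-faults."""
    return avoided_page_ins / micro_faults if micro_faults else 0.0


def meets_primary_targets(reduction: float, faults: float, micro_success: float) -> bool:
    """Check the three ratio targets: >80%, <0.1%, >90%."""
    return reduction > 0.80 and faults < 0.001 and micro_success > 0.90
```

As an example, a replayed session with 100,000 baseline tokens and 15,000 managed tokens, 1 fault in 2,000 evictions, and 19 of 20 micro-faults avoiding a page-in would meet all three targets.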

13.2 Evaluation Method

Offline replay: Replay recorded opencode sessions through the proxy. Compare managed output vs original output via LLM judge.
A/B testing: Run identical tasks with and without proxy. Measure token usage, task completion, and code quality.
Fault analysis: Log every fidelity transition, fault, and micro-fault. Identify patterns in what causes faults (guides admission control tuning).

14. Technology Stack

Component	Technology	Rationale
Proxy	Python (asyncio + httpx)	Fork from Pichay (Python). Streaming support critical.
Object Store	PostgreSQL 16 + pgvector	Proven at scale by Letta. Hybrid vector + relational.
Embeddings	all-MiniLM-L6-v2 (ONNX, local)	Fast (~20ms), no API dependency, good enough for similarity.
Helper LLM	Anthropic Haiku (primary) / Ollama qwen2.5 (fallback)	Haiku: fast + cheap. Ollama: offline capable.
Streaming parser	Custom SSE parser	Must parse tool calls from streaming response before client sees them.
Config	TOML	Simple, human-readable.
Testing	pytest + recorded session replay	Replay real sessions for regression testing.
