# Object-Addressed Memory Manager for OpenCode

## Project Codename: Mnemosyne

> A transparent proxy that implements demand-paged, object-addressed memory management
> for LLM context windows. Extends Pichay's demand paging with semantic objects,
> multi-fidelity compression, declared losses, queryable backing store, and
> goal-aware retrieval via a helper LLM.

---

## 1. Problem Statement

LLM coding agents (opencode, Claude Code) suffer from context window bloat:

- **21.8% of input tokens are structural waste** (Pichay, 2026): unused tool schemas (11%),
  stale tool results reprocessed at 84.4x amplification (8.7%), duplicated content (2.2%)
- **Context drift** silently degrades reasoning quality before hitting hard token limits
- **Binary eviction** (resident vs evicted) is too coarse -- a 200-byte tombstone can't answer
  questions about 8KB of evicted code
- **No semantic awareness** -- eviction is by file path, not by conceptual relevance to the
  current task

## 2. Design Principles

1. **The context window is L1 cache, not memory.** Everything lives in the backing store;
   context is a curated working set.
2. **Eviction is cooperative.** The model participates in eviction decisions via cleanup tags
   and phantom tools. It has incentive: cleaner context = better attention quality.
3. **Compression is authored, not algorithmic.** The model (or a helper LLM) writes summaries
   with declared losses. It knows what matters.
4. **The backing store is queryable.** The model can ask questions of evicted content without
   materializing it. Micro-faults replace full page-ins.
5. **Objects, not blocks.** The unit of memory is a semantic object (a design decision, a
   debugging session, a file understanding) -- not a fixed-size page keyed by file path.
6. **Transparency.** The proxy is invisible to the client and the inference API. No changes
   to opencode or the model required.

---

## 3. System Architecture

```
                    opencode (client)
                         |
                         | HTTP (Messages API)
                         v
              +---------------------+
              |    MNEMOSYNE PROXY  |
              |                     |
              |  +---------------+  |
              |  | Context       |  |     +------------------+
              |  | Assembler     |--|---->| Helper LLM       |
              |  |               |  |     | (Haiku / local)  |
              |  +---------------+  |     |                  |
              |  | Fidelity      |  |     | - Summarization  |
              |  | Manager       |  |     | - Declared losses|
              |  +---------------+  |     | - Micro-fault QA |
              |  | Object        |  |     | - Segmentation   |
              |  | Segmenter     |  |     +------------------+
              |  +---------------+  |
              |  | Fault         |  |     +------------------+
              |  | Detector      |  |     | Object Store     |
              |  +---------------+  |     | (PostgreSQL +    |
              |  | Phantom Tool  |--|---->|  pgvector)       |
              |  | Handler       |  |     |                  |
              |  +---------------+  |     | - Full content   |
              |  | Cleanup Tag   |  |     | - Multi-fidelity |
              |  | Parser        |  |     |   summaries      |
              |  +---------------+  |     | - Embeddings     |
              |  | Pressure      |  |     | - Relationships  |
              |  | Monitor       |  |     | - Fault history  |
              |  +---------------+  |     +------------------+
              +---------------------+
                         |
                         | HTTP (Messages API, modified)
                         v
                  Inference API (Anthropic)
```

### Component Responsibilities

| Component | Role |
|---|---|
| **Context Assembler** | Builds the modified message array for each API call. Selects which objects are resident at which fidelity. Injects phantom tool definitions. |
| **Fidelity Manager** | Tracks current fidelity level of each object. Degrades fidelity under pressure. Upgrades on access. Manages the L0-L4 fidelity ladder. |
| **Object Segmenter** | Splits the conversation stream into semantic objects. Runs after each turn. Uses embedding coherence + structural signals (tool boundaries, topic shifts). |
| **Fault Detector** | Detects page faults (model re-requests evicted content). Records fault history for pinning decisions. Detects micro-fault queries. |
| **Phantom Tool Handler** | Intercepts phantom tool calls from the model's streaming response before they reach the client. Handles `memory_release`, `memory_query`, `memory_restore`. |
| **Cleanup Tag Parser** | Parses structured directives from the model's text output: `drop`, `summarize`, `anchor`, `collapse`. Extended with `declare_losses`. |
| **Pressure Monitor** | Tracks token consumption per request. Determines pressure zone (Normal/Caution/Warning/Critical/Emergency). Triggers fidelity degradation. |
| **Helper LLM** | Cheap model (Haiku, GPT-4o-mini, or local qwen2.5) that authors summaries, declares losses, answers micro-fault queries, and assists with object segmentation. |
| **Object Store** | PostgreSQL + pgvector database holding all semantic objects at all fidelity levels, with embeddings, metadata, relationships, and fault history. |
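
The Phantom Tool Handler's interception step can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not Pichay's or Mnemosyne's actual code: it assumes the SSE stream has already been decoded into event dicts shaped like the Anthropic streaming events (`content_block_start`, `content_block_delta` carrying `input_json_delta`, `content_block_stop`), and `split_phantom_calls` is a hypothetical helper name. It does not renumber the remaining block indices or rewrite `stop_reason`, both of which a real proxy would have to handle.

```python
import json
from typing import Iterable

# Phantom tool names taken from the component table above.
PHANTOM_TOOLS = {"memory_release", "memory_query", "memory_restore"}


def split_phantom_calls(events: Iterable[dict]) -> tuple[list[dict], list[dict]]:
    """Separate phantom tool calls from events that should reach the client.

    Returns (forwarded_events, phantom_calls). Hypothetical helper; the event
    shapes are assumed to follow the Anthropic streaming format.
    """
    forwarded: list[dict] = []
    calls: list[dict] = []
    active: dict[int, dict] = {}  # content block index -> partially received call

    for event in events:
        kind = event.get("type")
        index = event.get("index")

        if kind == "content_block_start":
            block = event.get("content_block", {})
            if block.get("type") == "tool_use" and block.get("name") in PHANTOM_TOOLS:
                active[index] = {"id": block["id"], "name": block["name"], "parts": []}
                continue  # swallow: the client never sees this block
        elif kind == "content_block_delta" and index in active:
            delta = event.get("delta", {})
            if delta.get("type") == "input_json_delta":
                active[index]["parts"].append(delta.get("partial_json", ""))
            continue
        elif kind == "content_block_stop" and index in active:
            call = active.pop(index)
            raw = "".join(call["parts"])
            calls.append({
                "id": call["id"],
                "name": call["name"],
                "input": json.loads(raw) if raw else {},
            })
            continue

        forwarded.append(event)

    return forwarded, calls
```

Buffering the `partial_json` fragments until `content_block_stop` keeps the sketch simple; the tool input is only parsed once it is complete.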

---

## 4. Memory Hierarchy

```
+----------------------------------------------------------------+
|  L0: Full Content (in context window)                          |
|  - Current working set of semantic objects                     |
|  - Full text, no compression                                  |
|  - Capacity: ~60% of context window budget                     |
+----------------------------------------------------------------+
|  L1: Detailed Summary (in context window)                      |
|  - Model-authored summary, ~30% of original size               |
|  - Preserves: file paths, function names, decisions, errors    |
|  - Declared losses: specific values, exact code, edge cases    |
|  - Capacity: ~20% of context window budget                     |
+----------------------------------------------------------------+
|  L2: Compact Summary (in context window)                       |
|  - Model-authored headline, ~5% of original size               |
|  - Preserves: what was done, what was decided, key files       |
|  - Declared losses: implementation details, reasoning          |
|  - Capacity: ~15% of context window budget                     |
+----------------------------------------------------------------+
|  L3: Metadata Stub (in context window)                         |
|  - One-line description + type + timestamp                     |
|  - ~50-100 tokens per object                                   |
|  - Capacity: ~5% of context window budget                      |
+----------------------------------------------------------------+
|  L4: Evicted (not in context, in backing store only)           |
|  - Not present in context at all                               |
|  - Queryable via memory_query phantom tool                     |
|  - Restorable via memory_restore phantom tool                  |
+----------------------------------------------------------------+
|  BACKING STORE (PostgreSQL + pgvector)                         |
|  - All objects at all fidelity levels, always                  |
|  - Full content preserved indefinitely                         |
|  - Embeddings for semantic search                              |
|  - Cross-session persistence (future)                          |
+----------------------------------------------------------------+
```

### Fidelity Transitions

```
Pressure rising (token count increasing):
  L0 (full) --[summarize]--> L1 (detailed) --[compress]--> L2 (compact) --[stub]--> L3 --[evict]--> L4

Access / fault (model needs content):
  L4 --[memory_restore]--> L0 (full page-in)
  L4 --[memory_query]--> (helper LLM answers, object stays at L4)
  L3 --[model references]--> L1 or L0 (upgrade on access)
  L2 --[model references]--> L1 (upgrade on access)
```
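
The transitions above can be read as a small state machine. The following is a minimal sketch, assuming one fidelity level per object and single-step degradation; the function names are illustrative, and resolving "L1 or L0" to L1 for a referenced L3 object is an assumption (the cheaper of the two options).

```python
from enum import IntEnum


class Fidelity(IntEnum):
    L0 = 0  # full content
    L1 = 1  # detailed summary
    L2 = 2  # compact summary
    L3 = 3  # metadata stub
    L4 = 4  # evicted, backing store only


def degrade(level: Fidelity) -> Fidelity:
    """One step down the ladder under pressure; L4 is the floor."""
    return Fidelity(min(level + 1, Fidelity.L4))


def upgrade_on_reference(level: Fidelity) -> Fidelity:
    """Upgrade a resident object the model has referenced.

    L2 and L3 go to L1 (assumption: L1 rather than L0 for L3). L0 and L1 are
    unchanged. L4 is not resident, so it needs an explicit memory_restore.
    """
    if level in (Fidelity.L2, Fidelity.L3):
        return Fidelity.L1
    return level


def restore(level: Fidelity) -> Fidelity:
    """memory_restore: full page-in, regardless of the current level."""
    return Fidelity.L0
```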

### Pressure Zones (token thresholds, configurable)

| Zone | Token % | Action |
|---|---|---|
| **Normal** | < 50% | Observe only. No fidelity changes. |
| **Caution** | 50-70% | Degrade oldest L0 objects to L1. |
| **Warning** | 70-85% | Degrade L0 to L1, L1 to L2, oldest L2 to L3. |
| **Critical** | 85-95% | Aggressive degradation. L2+ to L3. Evict L3 to L4. |
| **Emergency** | > 95% | Force-evict everything except last 2 user turns + system prompt. |
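
A zone lookup over the thresholds in this table might look like the sketch below. The table does not say which zone owns an exact boundary value, so the choice here (50%, 70%, and 85% belong to the higher zone; exactly 95% is still Critical because Emergency is strictly above 95%) is an assumption, as are the names `Zone` and `pressure_zone`.

```python
from enum import Enum


class Zone(Enum):
    NORMAL = "normal"
    CAUTION = "caution"
    WARNING = "warning"
    CRITICAL = "critical"
    EMERGENCY = "emergency"


def pressure_zone(tokens_used: int, context_budget: int) -> Zone:
    """Map token usage to a pressure zone using the default thresholds."""
    if context_budget <= 0:
        raise ValueError("context_budget must be positive")
    usage = tokens_used / context_budget
    if usage < 0.50:
        return Zone.NORMAL
    if usage < 0.70:
        return Zone.CAUTION
    if usage < 0.85:
        return Zone.WARNING
    if usage <= 0.95:  # Emergency is strictly above 95%
        return Zone.CRITICAL
    return Zone.EMERGENCY
```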

---

## 5. Semantic Objects

### 5.1 Object Types

| Type | Description | Example |
|---|---|---|
| `conversation_phase` | A coherent stretch of dialogue about one topic | "Discussed auth architecture for 8 turns" |
| `design_decision` | An explicit decision with rationale | "Chose JWT over sessions because..." |
| `debugging_session` | A sequence of diagnose-hypothesize-fix-verify | "Tracked down the race condition in..." |
| `file_context` | A file read and understanding | "Read src/auth/middleware.ts (200 lines)" |
| `tool_result` | Output from a tool call (grep, bash, etc.) | "grep found 15 matches for handleAuth" |
| `plan` | A structured plan or todo list | "Implementation plan: 5 steps..." |
| `error_context` | An error and its diagnosis | "TypeError at line 42, caused by..." |
| `external_reference` | Docs, API reference, examples pulled from outside | "React docs on useEffect cleanup" |

### 5.2 Object Segmentation Algorithm

Segmentation runs after each model turn. Two strategies, selected by context:

**Strategy A: Structural Segmentation (fast, rule-based)**
- Tool call boundaries are natural object boundaries
- Each `Read` result = `file_context` object
- Each `Bash`/`Grep` result = `tool_result` object
- User message + assistant response = potential `conversation_phase` boundary
- Heuristic: if topic similarity (embedding cosine) between consecutive turns drops below
  threshold (0.7), start a new `conversation_phase`

**Strategy B: Semantic Segmentation (slower, higher quality)**
- Based on xMemory's sparsity-semantics objective (arXiv:2602.02007)
- Embed all messages with a lightweight model (all-MiniLM-L6-v2, ~20ms)
- Cluster by coherence: maximize inter-object semantic diversity, minimize intra-object
  redundancy
- Build hierarchy: messages -> episodes -> themes
- Use for long sessions (>50 turns) where structural boundaries are insufficient

**Default: Strategy A for turns 1-50; Strategy B takes over after turn 50.**
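
The topic-shift heuristic in Strategy A can be sketched as below, assuming one embedding vector per turn is already available. `phase_starts` is a hypothetical name, and the pure-Python cosine function stands in for whatever vector library the proxy would actually use.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def phase_starts(turn_embeddings: Sequence[Sequence[float]],
                 threshold: float = 0.7) -> list[int]:
    """Return the turn indices at which a new conversation_phase begins.

    A new phase starts whenever similarity to the previous turn drops
    below the threshold (0.7 in the text above).
    """
    starts = [0] if turn_embeddings else []
    for i in range(1, len(turn_embeddings)):
        if cosine_similarity(turn_embeddings[i - 1], turn_embeddings[i]) < threshold:
            starts.append(i)
    return starts
```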

### 5.3 Object Relationships

Objects have typed relationships stored in the backing store:

```
parent_of:    conversation_phase -> design_decision (decision made during phase)
caused_by:    error_context -> file_context (error was in this file)
references:   debugging_session -> file_context (files examined during debug)
supersedes:   file_context(v2) -> file_context(v1) (file re-read after edit)
depends_on:   plan -> design_decision (plan relies on this decision)
```

Relationships are used by the Context Assembler: when upgrading an object's fidelity,
it also considers upgrading the objects that object points to via `depends_on` and
`references`.
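
One way to collect those related objects is sketched below. It assumes relationships are available as `(source_id, relation, target_id)` triples and follows only one hop; both the representation and the name `upgrade_candidates` are illustrative.

```python
from typing import Iterable

# Relation types that propagate a fidelity upgrade, per the text above.
PROPAGATING_RELATIONS = {"depends_on", "references"}


def upgrade_candidates(object_id: str,
                       edges: Iterable[tuple[str, str, str]]) -> list[str]:
    """Return ids of objects to consider upgrading alongside object_id.

    Only direct (one-hop) depends_on / references targets are returned;
    following longer chains would be a separate design decision.
    """
    targets = {
        target
        for source, relation, target in edges
        if source == object_id and relation in PROPAGATING_RELATIONS
    }
    return sorted(targets)
```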

---

## 6. Multi-Fidelity Compression

### 6.1 Summary Generation

When an object degrades from L0 to L1, the Helper LLM generates a summary:

**Input to Helper LLM:**
```
You are a context compression engine for a coding agent. Summarize the following
content while preserving maximum utility for future reference.

CONTENT TYPE: {object.type}
CONTENT:
{object.content_full}

INSTRUCTIONS:
1. Write a detailed summary (~30% of original length)
2. MUST preserve: file paths, function names, variable names, library names,
   error messages, decision rationale, specific values that may be referenced later
3. List DECLARED LOSSES: specific information you omitted that someone might need.
   Be precise -- "specific error codes" not "some details"
4. List CAN_ANSWER: categories of questions this summary can answer without
   needing the original content

OUTPUT FORMAT (JSON):
{
  "summary": "...",
  "losses": ["exact error code for token expiry", "rate limit threshold values", ...],
  "can_answer": ["auth approach used", "middleware chain order", "why JWT over sessions", ...],
  "key_entities": ["src/auth/middleware.ts", "handleAuth()", "jsonwebtoken", ...]
}
```

**L1 -> L2 compression** uses the L1 summary as input (not L0), with the instruction to
compress to ~5% of the original (L0) size. Additional losses are accumulated.

**L2 -> L3 stub** is generated from L2:
```
[debugging_session | 2026-03-13 14:30 | Fixed race condition in auth token refresh
 by adding mutex lock in src/auth/refresh.ts | 12 related objects]
```

### 6.2 Declared Losses Schema

```typescript
interface DeclaredLosses {
  // What was dropped from this fidelity level (the "losses" array in the helper's JSON output)
  dropped: string[]
  
  // What questions this fidelity level CAN still answer
  can_answer: string[]
  
  // Hint for when to fault (what would require the original)
  fault_when: string[]
  
  // Key entities preserved (for relationship tracking)
  key_entities: string[]
}
```

### 6.3 Loss Accumulation

As objects degrade through fidelity levels, losses accumulate:

```
L0: (full content, no losses)
L1: losses = ["exact error codes", "line-by-line implementation"]
L2: losses = L1.losses + ["function signatures", "reasoning chain"]
L3: losses = L2.losses + ["what was decided", "which files involved"]
     (at this point, only the one-line description remains)
```

The accumulated `fault_when` list tells the model exactly when it needs to fault:
"If you need exact error codes, specific function signatures, or the reasoning
behind the auth decision, restore this object."
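
Loss accumulation across levels could be implemented along these lines. This is a sketch under assumptions: `dropped` and `fault_when` grow as an order-preserving union, while `can_answer` and `key_entities` are taken from the newer (lower-fidelity) level, since a shorter summary can answer less. The document does not specify that second rule.

```python
from dataclasses import dataclass, field


@dataclass
class DeclaredLosses:
    dropped: list[str] = field(default_factory=list)
    can_answer: list[str] = field(default_factory=list)
    fault_when: list[str] = field(default_factory=list)
    key_entities: list[str] = field(default_factory=list)


def _union(first: list[str], second: list[str]) -> list[str]:
    """Order-preserving union without duplicates."""
    return list(dict.fromkeys(first + second))


def accumulate(previous: DeclaredLosses, new: DeclaredLosses) -> DeclaredLosses:
    """Combine the losses of the previous level with those of the new level."""
    return DeclaredLosses(
        dropped=_union(previous.dropped, new.dropped),
        can_answer=list(new.can_answer),      # assumption: newer level wins
        fault_when=_union(previous.fault_when, new.fault_when),
        key_entities=list(new.key_entities),  # assumption: newer level wins
    )
```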

---

## 7. Queryable Backing Store

### 7.1 The `memory_query` Phantom Tool

Instead of restoring a full object (8KB+) to answer a simple question, the model
calls `memory_query`:

```json
{
  "tool": "memory_query",
  "input": {
    "question": "What error code does the auth middleware return for expired tokens?",
    "scope": "auth-related objects",
    "max_tokens": 200
  }
}
```

**Proxy handling:**

1. Proxy intercepts `memory_query` from the model's streaming response
2. Proxy queries the Object Store:
   - Embed the question
   - Find top-k relevant objects by cosine similarity (even evicted ones)
   - Retrieve their L0 (full content) from the backing store
3. Proxy sends question + retrieved full content to the Helper LLM
4. Helper LLM returns a targeted answer (~50-200 tokens)
5. Proxy injects the answer as a synthetic tool result into the model's context
6. The evicted objects stay evicted -- no fidelity change
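
The six steps above can be sketched as a single function. Everything here is a stand-in: the in-memory list replaces the Object Store, `embed` and `ask_helper` are injected callables representing the embedding model and the Helper LLM, and `handle_memory_query` is a hypothetical name. The returned dict mirrors the general shape of a tool result block, which is an assumption about how the proxy would inject the answer.

```python
import math
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class StoredObject:
    object_id: str
    content_full: str            # L0 content, always kept in the backing store
    embedding: Sequence[float]


def _cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def handle_memory_query(
    question: str,
    tool_use_id: str,
    store: Sequence[StoredObject],
    embed: Callable[[str], Sequence[float]],
    ask_helper: Callable[[str, list[str], int], str],
    top_k: int = 3,
    max_tokens: int = 200,
) -> dict:
    """Answer a micro-fault without changing any object's fidelity."""
    query_vec = embed(question)                                  # step 2: embed
    ranked = sorted(store, key=lambda o: _cosine(query_vec, o.embedding),
                    reverse=True)[:top_k]                        # step 2: top-k
    documents = [o.content_full for o in ranked]                 # step 2: L0 text
    answer = ask_helper(question, documents, max_tokens)         # steps 3-4
    # Step 5: synthetic tool result. Step 6: nothing in `store` is modified.
    return {"type": "tool_result", "tool_use_id": tool_use_id, "content": answer}
```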

**Token savings per micro-fault:**
- Traditional fault (Pichay): restore full page, ~4,000-8,000 tokens
- Micro-fault: inject targeted answer, ~50-200 tokens
- Savings: 95-99% per fault
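
The savings range follows from the two token ranges above. The small helper below (a hypothetical name, for illustration only) shows the arithmetic at both ends of the range.

```python
def micro_fault_savings(full_restore_tokens: int, answer_tokens: int) -> float:
    """Fraction of tokens saved by a micro-fault compared to a full restore."""
    if full_restore_tokens <= 0:
        raise ValueError("full_restore_tokens must be positive")
    return 1.0 - answer_tokens / full_restore_tokens


# Worst case: smallest page, largest answer.
worst = micro_fault_savings(full_restore_tokens=4_000, answer_tokens=200)
# Best case: largest page, smallest answer.
best = micro_fault_savings(full_restore_tokens=8_000, answer_tokens=50)
```

Here `worst` is 0.95 and `best` is just over 0.99, which matches the 95-99% range quoted above.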

### 7.2 Semantic Search for Micro-Faults

The backing store supports multiple retrieval strategies:

```sql
-- Parameters: $1 = session_id, $2 = query embedding (vector), $3 = query text.
-- <=> is pgvector's cosine-distance operator (<-> would be Euclidean distance).

-- Vector similarity (primary)
SELECT * FROM semantic_objects
WHERE session_id = $1
ORDER BY embedding <=> $2
LIMIT 5;

-- Hybrid: vector + keyword (for exact matches)
SELECT * FROM semantic_objects
WHERE session_id = $1
  AND (
    to_tsvector('english', content_full) @@ plainto_tsquery('english', $3)
    OR (embedding <=> $2) < 0.3
  )
ORDER BY embedding <=> $2
LIMIT 5;

-- Metadata-filtered (for typed queries)
SELECT * FROM semantic_objects
WHERE session_id = $1
  AND object_type = 'error_context'
  AND 'src/auth' = ANY(tags)
ORDER BY created_at DESC
LIMIT 3;
```

### 7.3 `memory_restore` -- Full Page-In

When the model needs full content (editing a file, reviewing exact code), it calls
`memory_restore` which does a traditional page-in:

```json
{
  "tool": "memory_restore",
  "input": {
    "object_id": "obj_abc123",
    "reason": "Need to edit the auth middleware"
  }
}
```

This upgrades the object to L0, potentially triggering eviction of other objects
under pressure.

### 7.4 `memory_release` -- Cooperative Eviction

The model can voluntarily release objects it no longer needs:

```json
{
  "tool": "memory_release",
  "input": {
    "object_ids": ["obj_abc123", "obj_def456"],
    "reason": "Done with auth implementation, moving to tests"
  }
}
```

This immediately degrades the objects to L3 (or L4 under pressure), freeing context
budget for new work.

---

## 8. Goal-Aware Retrieval

### 8.1 The Problem

When the model starts a new sub-task (e.g., "now write tests for auth"), the objects
currently in context may be irrelevant (e.g., old debugging sessions for a different
module). Goal-aware retrieval proactively swaps context based on the current task.

### 8.2 Detection: When Has the Goal Changed?

The Pressure Monitor also tracks goal transitions by comparing:
- The current user message embedding vs the previous user message embedding
- If cosine similarity < 0.5 (topic shift), trigger goal-aware retrieval

### 8.3 Goal-Aware Context Assembly

On goal transition:

1. **Helper LLM classifies the new goal** (~100ms):
   ```
   Given this user message: "{message}"
   What is the user's current goal? What context would be most relevant?
   Return: { "goal": "...", "relevant_types": [...], "relevant_tags": [...] }
   ```

2. **Query the Object Store** for relevant objects:
   ```sql
   SELECT * FROM semantic_objects
   WHERE session_id = $1
   ORDER BY embedding <=> $2  -- $2 = goal embedding; <=> is pgvector cosine distance
   LIMIT 20;
   ```

3. **Rank objects by relevance to new goal** (helper LLM or embedding similarity)

4. **Assemble new context window**:
   - Top-ranked objects at L0 or L1 (depending on budget)
   - Previously active but now irrelevant objects degraded to L2 or L3
   - Always preserve: system prompt, last 2 user turns, any pinned objects

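5. **Inject into next API call**: in pure-proxy mode the Context Assembler rewrites the
   message array directly; under plugin integration (Phase 4+) the same step runs via
   the `experimental.chat.messages.transform` hook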
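
Step 4 leaves the budget policy open. The sketch below shows one simple greedy reading: walk the ranked objects in order, place each at L0 if it fits, otherwise at L1 if that fits, otherwise skip it. The `Candidate` type, the `assemble` name, and the greedy policy itself are assumptions, and objects left out of the result would be handled by the Fidelity Manager.

```python
from typing import NamedTuple, Sequence


class Candidate(NamedTuple):
    object_id: str
    l0_tokens: int  # size of the full content
    l1_tokens: int  # size of the detailed summary


def assemble(ranked: Sequence[Candidate], budget_tokens: int) -> dict[str, str]:
    """Greedily assign fidelity levels to ranked objects within a token budget.

    `ranked` must be ordered most-relevant first. Objects that fit at
    neither L0 nor L1 are omitted from the result.
    """
    placement: dict[str, str] = {}
    remaining = budget_tokens
    for candidate in ranked:
        if candidate.l0_tokens <= remaining:
            placement[candidate.object_id] = "L0"
            remaining -= candidate.l0_tokens
        elif candidate.l1_tokens <= remaining:
            placement[candidate.object_id] = "L1"
            remaining -= candidate.l1_tokens
    return placement
```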

### 8.4 Predictive Loading

After goal classification, the helper LLM can predict what the model will need next:

```
Given goal "write tests for auth", the model will likely need:
- The auth middleware implementation (file_context for src/auth/middleware.ts)
- The existing test patterns (file_context for tests/*)
- The design decision about JWT (design_decision)
- NOT: the debugging session for the database migration
```

Pre-load predicted objects at L1, so they're available if the model needs them.

---

## 9. Admission Control (Write Path)

Not everything deserves to become a stored object. Based on A-MAC (arXiv:2603.04549):

### 9.1 Admission Score

```
S(m) = w_T * TypePrior(m) + w_N * Novelty(m) + w_U * Utility(m) + w_R * Recency(m)
```

| Factor | Signal | Weight (learned) |
|---|---|---|
| **TypePrior** | `design_decision` > `error_context` > `file_context` > `tool_result` | ~0.35 |
| **Novelty** | Cosine distance to nearest existing object > 0.15 | ~0.25 |
| **Utility** | Helper LLM scores future relevance (0-1) | ~0.25 |
| **Recency** | Exponential decay from creation time | ~0.15 |

**Threshold:** S(m) >= 0.4 to admit. Below threshold, content is kept only in the
client's unmodified history (Pichay's backing store) but not indexed in the Object Store.
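
The score and threshold can be sketched as below, using the approximate weights from the table. All four inputs are assumed to be normalized to [0, 1]; the half-life form of the recency decay and the function names are illustrative choices, not taken from A-MAC.

```python
# Approximate weights from the table above (they sum to 1.0).
WEIGHTS = {"type_prior": 0.35, "novelty": 0.25, "utility": 0.25, "recency": 0.15}
ADMISSION_THRESHOLD = 0.4


def recency(age_seconds: float, half_life_seconds: float) -> float:
    """Exponential decay from creation time: 1.0 when new, 0.5 after one half-life."""
    return 0.5 ** (age_seconds / half_life_seconds)


def admission_score(type_prior: float, novelty: float,
                    utility: float, recency_value: float) -> float:
    """S(m) as the weighted sum of the four factors, each in [0, 1]."""
    return (WEIGHTS["type_prior"] * type_prior
            + WEIGHTS["novelty"] * novelty
            + WEIGHTS["utility"] * utility
            + WEIGHTS["recency"] * recency_value)


def admit(score: float, threshold: float = ADMISSION_THRESHOLD) -> bool:
    """Admit the object into the Object Store when S(m) >= threshold."""
    return score >= threshold
```

For example, a fresh but routine tool result scored at (0.2, 0.1, 0.3, 1.0) reaches 0.32 and is rejected, while a fresh design decision scored at (1.0, 0.8, 0.9, 1.0) reaches 0.925 and is admitted. These input values are made up for illustration.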

### 9.2 What Gets Rejected

- Routine tool results with no lasting value (e.g., `ls` output, `git status`)
- Duplicate file reads where content hasn't changed
- Conversation turns that are purely procedural ("Sure, I'll do that")

---

## 10. Entropy-Gated Faulting (L-RAG Integration)

Based on L-RAG (arXiv:2601.06551): use the model's own uncertainty as a fault signal.

### 10.1 Mechanism

During the model's generation (streaming response), monitor token-level entropy:

1. **Normal entropy** (H < 1.5): model is confident, no intervention
2. **Elevated entropy** (1.5 <= H <= 2.2): model may benefit from more context.
   Check if any L2/L3 objects match the current generation topic. If so,
   silently upgrade to L1.
3. **High entropy** (H > 2.2): model is struggling. Trigger a micro-fault --
   query the backing store with the current generation context, inject relevant
   information.
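
The three bands can be sketched as follows. Two assumptions are made here: entropy is measured in nats (the text does not state units), and it is computed over whatever candidate logprobs the backend returns, renormalized, so it only approximates the entropy of the full vocabulary distribution. The function names are hypothetical.

```python
import math
from typing import Sequence


def entropy_nats(logprobs: Sequence[float]) -> float:
    """Shannon entropy (nats) of a distribution given as natural-log probabilities.

    The probabilities are renormalized because a top-k list is truncated
    and will not sum to exactly 1.
    """
    probs = [math.exp(lp) for lp in logprobs]
    total = sum(probs)
    if total <= 0.0:
        raise ValueError("logprobs must contain at least one finite value")
    return -sum((p / total) * math.log(p / total) for p in probs if p > 0.0)


def fault_action(entropy: float, elevated: float = 1.5, high: float = 2.2) -> str:
    """Map an entropy value to the action for its band."""
    if entropy < elevated:
        return "none"           # confident: no intervention
    if entropy <= high:
        return "upgrade_to_l1"  # silently upgrade matching L2/L3 objects
    return "micro_fault"        # query the backing store and inject
```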

### 10.2 Complementarity with Declared Losses

Entropy-gated faulting handles the case where the model doesn't know what it
doesn't know. Declared losses handle the case where it does. Together:

- **Declared losses**: "I know I need exact error codes, let me fault"
  -> model calls `memory_query`
- **Entropy signal**: model's generation becomes uncertain around error handling
  -> proxy automatically upgrades relevant objects

### 10.3 Implementation Complexity

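Entropy monitoring requires access to token logprobs in the streaming response.
As far as we know, the Anthropic Messages API does not document a logprobs option
(`stream_options` is an OpenAI Chat Completions parameter, and that API requests
logprobs via `logprobs` / `top_logprobs`), so this feature depends on routing through
a backend that exposes them. With only top-k logprobs available, the computed entropy
is an approximation. This is a Phase 4e feature due to the complexity of real-time
entropy calculation during streaming.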

---

## 11. Integration Points

### 11.1 With OpenCode (via oh-my-opencode)

The proxy can integrate at two levels:

**Level 1: Pure Proxy (Phase 1-3)**
- Standalone HTTP proxy between opencode and Anthropic API
- Zero changes to opencode or oh-my-opencode
- Configuration: set `ANTHROPIC_BASE_URL` to proxy address

**Level 2: Plugin Integration (Phase 4+)**
- oh-my-opencode hook: `experimental.chat.messages.transform` for context assembly
- oh-my-opencode hook: `experimental.session.compacting` for custom compaction
- oh-my-opencode hook: `tool.execute.after` for object segmentation on tool results
- MCP server exposing `memory_query`, `memory_stats`, `memory_objects` tools
  (so the user can inspect memory state)

### 11.2 With Pichay (fork and extend)

Start from `fsgeek/pichay` (commit `b56701a`):
- `proxy.py` -> extend with multi-fidelity eviction, object segmentation
- `probe.py` -> extend with object-level analytics
- Add: `helper_llm.py` for summary generation, micro-fault QA
- Add: `object_store.py` for PostgreSQL + pgvector integration
- Add: `segmenter.py` for semantic object detection
- Add: `fidelity.py` for multi-fidelity state machine

### 11.3 With the Helper LLM

The helper LLM is called via standard API (Anthropic for Haiku, or Ollama for local):

| Task | Model | Expected Latency | Tokens In | Tokens Out |
|---|---|---|---|---|
| Summarize L0 -> L1 | Haiku | ~200ms | ~2000 | ~600 |
| Compress L1 -> L2 | Haiku | ~100ms | ~600 | ~100 |
| Micro-fault answer | Haiku | ~150ms | ~3000 | ~100 |
| Goal classification | Haiku | ~100ms | ~200 | ~50 |
| Object segmentation | local (MiniLM) | ~20ms | embedding only | N/A |
| Admission scoring | local (qwen2.5) | ~50ms | ~500 | ~10 |

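**Cost estimate per session (200 turns), assuming ~$0.25 per million Haiku tokens
(Claude 3 Haiku input price; output tokens and newer Haiku models cost more, so treat
these as lower bounds):**
- ~50 summarizations: 50 * ~3000 tokens = ~150K Haiku tokens (~$0.04)
- ~20 micro-faults: 20 * ~3000 tokens = ~60K Haiku tokens (~$0.015)
- ~10 goal classifications: ~2K Haiku tokens (~$0.0005)
- **Total helper cost: ~$0.05 per session**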
- **Savings on main model**: 50-93% context reduction on Opus/Sonnet calls

---

## 12. Failure Modes and Mitigations

| Failure Mode | Consequence | Mitigation |
|---|---|---|
| **Helper LLM produces bad summary** | Model loses critical info, silent quality degradation | Validate via declared losses. Spot-check: can helper answer `can_answer` queries from summary? |
| **Object segmentation misplaces boundaries** | Related content split across objects, fidelity changes break coherence | Conservative defaults (prefer larger objects). Relationship tracking keeps related objects together. |
| **Object segmentation too fine** | Too many small objects, overhead dominates | Minimum object size (500 tokens). Merge adjacent objects of same type. |
| **Thrashing** | Objects repeatedly degraded and restored, wasting helper LLM calls | Fault-driven pinning (Pichay L2). After 1 fault, pin at current fidelity for N turns. |
| **Goal misclassification** | Wrong objects loaded for current task | Conservative: always keep last 2 turns at L0. Don't evict below L2 on goal change (can upgrade quickly). |
| **Backing store latency spike** | Micro-fault takes >500ms, model generation stalls | Timeout + fallback: if backing store slow, inject L2 summary instead of querying. |
| **Declared losses are incomplete** | Model doesn't know it's missing info, doesn't fault | Entropy-gated faulting (Phase 4e) as safety net. Also: periodic loss audit by helper LLM. |
| **Helper LLM unavailable** | No summaries, no micro-faults | Graceful degradation: fall back to Pichay-style binary eviction with tombstones. |
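
The timeout-and-fallback mitigation for a slow backing store can be sketched with `asyncio.wait_for`. The 500 ms default comes from the table; `query_with_fallback` and its parameters are hypothetical names.

```python
import asyncio
from typing import Awaitable, Callable


async def query_with_fallback(
    query_store: Callable[[], Awaitable[str]],
    l2_summary: str,
    timeout_s: float = 0.5,
) -> str:
    """Return the micro-fault answer, or the L2 summary if the store is too slow."""
    try:
        return await asyncio.wait_for(query_store(), timeout=timeout_s)
    except asyncio.TimeoutError:
        # The pending query is cancelled by wait_for; fall back to the summary.
        return l2_summary
```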

---

## 13. Metrics and Evaluation

### 13.1 Primary Metrics

| Metric | Target | How to Measure |
|---|---|---|
| **Context reduction** | >80% vs baseline | (baseline tokens - actual tokens) / baseline tokens |
| **Fault rate** | <0.1% | faults / total evictions |
| **Micro-fault success rate** | >90% | micro-faults that avoided full page-in / total micro-faults |
| **Task quality** | No degradation | LLM-judged equivalence: full-context vs managed-context outputs |
| **Helper LLM overhead** | <5% of main model cost | helper cost / main model cost |
| **Latency overhead** | <300ms per turn average | (managed turn time - baseline turn time) |
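
The ratio metrics in this table reduce to a few one-line formulas. A sketch, with hypothetical function names and a zero-denominator convention (return 0.0) that the document does not specify:

```python
def context_reduction(baseline_tokens: int, actual_tokens: int) -> float:
    """(baseline tokens - actual tokens) / baseline tokens."""
    if baseline_tokens <= 0:
        raise ValueError("baseline_tokens must be positive")
    return (baseline_tokens - actual_tokens) / baseline_tokens


def fault_rate(faults: int, total_evictions: int) -> float:
    """faults / total evictions; 0.0 when nothing has been evicted yet."""
    return faults / total_evictions if total_evictions else 0.0


def micro_fault_success_rate(avoided_page_ins: int, total_micro_faults: int) -> float:
    """Micro-faults that avoided a full page-in / total micro-faults."""
    return avoided_page_ins / total_micro_faults if total_micro_faults else 0.0
```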

### 13.2 Evaluation Method

1. **Offline replay**: Replay recorded opencode sessions through the proxy.
   Compare managed output vs original output via LLM judge.
2. **A/B testing**: Run identical tasks with and without proxy. Measure
   token usage, task completion, and code quality.
3. **Fault analysis**: Log every fidelity transition, fault, and micro-fault.
   Identify patterns in what causes faults (guides admission control tuning).

---

## 14. Technology Stack

| Component | Technology | Rationale |
|---|---|---|
| **Proxy** | Python (asyncio + httpx) | Fork from Pichay (Python). Streaming support critical. |
| **Object Store** | PostgreSQL 16 + pgvector | Proven at scale by Letta. Hybrid vector + relational. |
| **Embeddings** | all-MiniLM-L6-v2 (ONNX, local) | Fast (~20ms), no API dependency, good enough for similarity. |
| **Helper LLM** | Anthropic Haiku (primary) / Ollama qwen2.5 (fallback) | Haiku: fast + cheap. Ollama: offline capable. |
| **Streaming parser** | Custom SSE parser | Must parse tool calls from streaming response before client sees them. |
| **Config** | TOML | Simple, human-readable. |
| **Testing** | pytest + recorded session replay | Replay real sessions for regression testing. |