# Implementation Roadmap

## Overview

Nine phases (plus the optional, experimental Phase 4e), incrementally building from a working Pichay fork to the full object-addressed memory system. Each phase produces a testable, usable artifact.

Estimated total effort: 8-12 weeks for a solo developer.
## Phase 1: Pichay Baseline (Week 1)

Goal: Get the existing Pichay proxy running between opencode and the Anthropic API. Validate structural waste reduction on real sessions.

### Tasks
- 1.1 Fork `fsgeek/pichay` at tag `v0.1.0-paper` (commit `b56701a`)
- 1.2 Set up development environment
- Python 3.11+, asyncio, httpx
- Local opencode instance
- Proxy configuration: `ANTHROPIC_BASE_URL=http://localhost:8080`
- 1.3 Run proxy in passthrough mode (no eviction)
- Verify: opencode works normally through proxy
- Log: request/response sizes, token counts, tool call inventory
- 1.4 Enable FIFO eviction (Pichay default settings); a sketch of the rule follows this list
- `tau = 4` turns, `s_min = 500` bytes
- Verify: tombstones appear for old tool results
- Measure: tokens saved, fault rate
- 1.5 Record 5+ real coding sessions through the proxy
- Use `probe.py` to generate session analytics
- Baseline metrics: waste %, amplification factor, fault rate
- 1.6 Write session replay infrastructure
- Record full message traces (request + response pairs)
- Replay tool for offline testing of later phases
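A minimal sketch of the FIFO eviction rule from task 1.4, assuming the `tau`/`s_min` semantics described above; the `ToolResult` type and tombstone wording are illustrative, not Pichay's actual API:

```python
from dataclasses import dataclass

@dataclass
class ToolResult:
    turn: int        # turn in which this result entered the context
    size_bytes: int  # raw size of the result content
    content: str

def should_evict(result: ToolResult, current_turn: int,
                 tau: int = 4, s_min: int = 500) -> bool:
    """FIFO rule: tombstone a tool result once it is older than tau
    turns AND at least s_min bytes (small results aren't worth evicting)."""
    return (current_turn - result.turn) > tau and result.size_bytes >= s_min

def tombstone(result: ToolResult) -> str:
    # Stub left in the message array so the model can fault the content back in.
    return f"[Paged out: tool result from turn {result.turn}, {result.size_bytes} bytes]"
```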
### Deliverable

Working proxy with FIFO eviction. Baseline metrics on real sessions.

### Success Criteria
- Proxy is transparent (opencode works identically)
- Measurable token reduction (target: >15%)
- Fault rate < 0.1%
- 5+ recorded sessions for replay testing
## Phase 2: Multi-Fidelity Eviction (Weeks 2-3)

Goal: Replace binary eviction (resident/tombstone) with graduated fidelity levels. Introduce the Helper LLM for summary generation.

### Tasks
- 2.1 Implement the Fidelity Manager state machine
- States: L0 (full), L1 (detailed summary), L2 (compact summary), L3 (stub), L4 (evicted)
- Transitions: degrade (pressure), upgrade (access/fault), pin (fault-driven)
- In-memory state per session (no DB yet)
- 2.2 Implement pressure zones (see the sketch after this list)
- Normal (<50%), Caution (50-70%), Warning (70-85%), Critical (85-95%), Emergency (>95%)
- Token counting per fidelity level
- Configurable thresholds via TOML config
- 2.3 Integrate Helper LLM (Anthropic Haiku)
- API client with retry, timeout, error handling
- Prompt template for L0 -> L1 summarization (detailed summary + declared losses)
- Prompt template for L1 -> L2 compression
- Response parsing: extract summary, losses, can_answer, key_entities
- 2.4 Implement fidelity degradation on pressure
- Walk objects oldest-first
- Generate summaries via Helper LLM on transition
- Replace content in message array with summary + loss declaration
- Format: `[Summary of {type}: {stub}]\n{summary}\n[Cannot answer: {losses}]`
- 2.5 Implement fidelity upgrade on access
- When model references an L1/L2/L3 object (detected by content overlap or tool call), upgrade to L0
- Restore full content from in-memory cache (client history is the backing store)
- 2.6 Extend fault detection for fidelity-aware faults
- L3 stub referenced -> upgrade to L1 (not necessarily L0)
- L2 compact referenced -> upgrade to L1
- Only full page-in if model explicitly re-requests (tool call match)
- 2.7 Add declared losses to eviction tombstones
- Format: `[Paged out: {stub}. Lost: {losses}. Restore if you need: {fault_when}]`
- 2.8 Replay testing against Phase 1 baseline
- Compare: token reduction, fault rate, summary quality
- Manual review: are declared losses accurate?
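A sketch of the pressure zones from task 2.2 and the oldest-first degradation walk from 2.4, using the default thresholds listed above. The `MemObject` type is a placeholder, and the Helper LLM summarization step is elided:

```python
from dataclasses import dataclass
from enum import IntEnum

class Fidelity(IntEnum):
    L0 = 0  # full content
    L1 = 1  # detailed summary
    L2 = 2  # compact summary
    L3 = 3  # stub
    L4 = 4  # evicted

@dataclass
class MemObject:
    turn: int
    fidelity: Fidelity = Fidelity.L0
    pinned: bool = False  # fault-driven pins are never degraded

def pressure_zone(tokens_used: int, context_limit: int) -> str:
    ratio = tokens_used / context_limit
    for upper, name in [(0.50, "normal"), (0.70, "caution"),
                        (0.85, "warning"), (0.95, "critical")]:
        if ratio < upper:
            return name
    return "emergency"

def degrade_step(objects: list[MemObject], zone: str) -> None:
    """One degradation pass: walk oldest-first and push the first
    eligible object down one fidelity level, then re-check pressure."""
    if zone in ("normal", "caution"):
        return
    for obj in sorted(objects, key=lambda o: o.turn):
        if not obj.pinned and obj.fidelity < Fidelity.L4:
            obj.fidelity = Fidelity(obj.fidelity + 1)
            return
```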
### Deliverable

Multi-fidelity proxy with Helper LLM summarization and declared losses.

### Success Criteria
- Token reduction > 40% (up from Phase 1's ~20%)
- Fault rate < 0.05% (better than binary eviction -- summaries prevent unnecessary faults)
- Helper LLM cost < $0.01 per session
- Declared losses are accurate in >90% of spot checks
## Phase 3: Queryable Backing Store + Micro-Faults (Weeks 4-5)

Goal: Add PostgreSQL + pgvector as a persistent backing store. Implement the `memory_query` phantom tool for micro-faults. This is the single highest-impact feature.

### Tasks
- 3.1 Set up PostgreSQL + pgvector
- Docker Compose for local dev
- Schema from SCHEMA.md (sessions, semantic_objects tables)
- Connection pooling (asyncpg)
- 3.2 Implement Object Store module
- CRUD operations for semantic_objects
- Embedding generation (all-MiniLM-L6-v2 via sentence-transformers, ONNX runtime)
- Store objects on creation (every tool result, message span)
- Full content always preserved in DB regardless of context fidelity
- 3.3 Implement semantic search (a hybrid query sketch follows this list)
- Vector similarity search (pgvector cosine distance)
- Full-text search (PostgreSQL tsvector)
- Hybrid search (weighted combination)
- Metadata-filtered search (by type, tags, date range)
- 3.4 Implement `memory_query` phantom tool
- Add tool definition to phantom tool injection
- Intercept from streaming response
- Flow: intercept -> query Object Store -> send top-k results to Helper LLM -> inject answer
- Synthetic tool result format: `[Memory Query Result]\nQ: {question}\nA: {answer}\n[Source: {object stubs}]`
- 3.5 Implement `memory_restore` phantom tool
- Full page-in from backing store (traditional fault)
- Upgrade object to L0
- Trigger pressure-based eviction of other objects if needed
- 3.6 Implement `memory_release` phantom tool
- Model voluntarily releases objects
- Immediate degradation to L3 or L4
- Log cooperative eviction event
- 3.7 Implement phantom tool injection
- Add phantom tool definitions to the model's tool list in each request
- Parse phantom tool calls from streaming response BEFORE client receives them
- Handle phantom tool results transparently
- 3.8 Measure micro-fault effectiveness
- Track: micro-faults attempted, successful (model didn't need to restore after), tokens saved per micro-fault
- Compare: micro-fault answer quality vs full restore (LLM-judged)
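A sketch of the hybrid search from task 3.3, assuming `semantic_objects` columns per SCHEMA.md (`id`, `stub`, `content`, `session_id`, `embedding`) and pgvector's asyncpg integration; the 0.7/0.3 weights are illustrative defaults, not settled values:

```python
import asyncpg
from pgvector.asyncpg import register_vector  # pgvector's asyncpg type codec

# Weighted combination of cosine similarity (<=> is pgvector's cosine
# distance operator) and PostgreSQL full-text rank.
HYBRID_SQL = """
SELECT id, stub,
       0.7 * (1 - (embedding <=> $1))
     + 0.3 * ts_rank(to_tsvector('english', content),
                     plainto_tsquery('english', $2)) AS score
FROM semantic_objects
WHERE session_id = $3
ORDER BY score DESC
LIMIT $4
"""

async def hybrid_search(pool: asyncpg.Pool, query_embedding,
                        query_text: str, session_id: str, k: int = 5):
    async with pool.acquire() as conn:
        await register_vector(conn)  # teach asyncpg to encode vectors
        return await conn.fetch(HYBRID_SQL, query_embedding, query_text,
                                session_id, k)
```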
### Deliverable

Full proxy with persistent backing store, semantic search, and micro-fault capability.

### Success Criteria
- Token reduction > 60%
- Micro-fault success rate > 85% (model doesn't need full restore after micro-fault)
- Micro-fault latency < 500ms (embedding + DB query + Helper LLM)
- Token savings per micro-fault: >95% vs full restore
- DB query latency < 50ms for semantic search
## Phase 4a: Object Segmentation (Week 6)

Goal: Replace file-path-based page identity with semantic object detection. Conversations are segmented into coherent objects with types and relationships.

### Tasks
- 4a.1 Implement structural segmentation (Strategy A)
- Rule-based: tool call boundaries -> tool_result / file_context objects
- User turn + assistant response = conversation span
- Topic shift detection: embedding cosine between consecutive spans < 0.7 -> new object (sketched after this list)
- Object type classification: rules based on tool name, content patterns
- Read tool -> file_context
- Bash/Grep tool -> tool_result
- Error in output -> error_context
- "I'll implement..." / plan language -> plan
- "Let's use X because Y" / decision language -> design_decision
- 4a.2 Implement object type classifier
- Helper LLM or simple keyword-based classifier
- Input: content span + preceding context
- Output: object_type + stub + tags
- Latency budget: <100ms (prefer rules, fallback to Helper LLM)
- 4a.3 Retroactive segmentation
- On session start: no segmentation (objects created per-tool-result)
- Every 10 turns: re-examine recent objects, merge small ones, split large ones
- Merge criterion: consecutive objects of same type with embedding similarity > 0.8
- Split criterion: single object > 5000 tokens with internal topic shift
- 4a.4 Object deduplication
- file_context: dedup by source_key (file path). New read supersedes old.
- tool_result: no dedup (each is unique)
- conversation_phase: no dedup
- Add 'supersedes' relationship when replacing
- 4a.5 Store relationships
- Automatic: file_context referenced in debugging_session -> 'references' edge
- Automatic: design_decision made during conversation_phase -> 'parent_of' edge
- Detection: key_entities overlap between objects suggests relationship
- 4a.6 Test segmentation quality
- Replay recorded sessions
- Manual review: do objects correspond to intuitive "chunks" of work?
- Measure: average object size, type distribution, relationship density
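A sketch of the topic-shift split from task 4a.1, using the all-MiniLM-L6-v2 embedder and the 0.7 cosine threshold named above:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")
TOPIC_SHIFT = 0.7  # cosine below this between consecutive spans starts a new object

def segment_spans(spans: list[str]) -> list[list[str]]:
    """Structural segmentation (Strategy A): group consecutive spans,
    splitting wherever embedding cosine drops below the threshold."""
    if not spans:
        return []
    embs = encoder.encode(spans, normalize_embeddings=True)
    objects, current = [], [spans[0]]
    for prev, cur, span in zip(embs, embs[1:], spans[1:]):
        if float(np.dot(prev, cur)) < TOPIC_SHIFT:  # unit vectors: dot = cosine
            objects.append(current)
            current = []
        current.append(span)
    objects.append(current)
    return objects
```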
### Deliverable

Proxy segments conversations into typed semantic objects with relationships.

### Success Criteria
- Objects correspond to intuitive conversation segments (manual review)
- Average object size: 500-3000 tokens (not too fine, not too coarse)
- Type classification accuracy > 85%
- Segmentation latency < 50ms per turn (structural) or < 200ms (semantic)
## Phase 4b: Object Relationships + Co-Fidelity (Week 7)

Goal: Use relationships between objects to make smarter fidelity decisions. When one object is upgraded, related objects are considered for upgrade too.

### Tasks
- 4b.1 Implement relationship-aware fidelity management (see the sketch after this list)
- When upgrading object X to L0, check `depends_on` and `references` edges
- If related object Y is at L2+, upgrade Y to L1 (not L0 -- don't over-promote)
- Configurable: max relationship hops (default: 2), max co-upgrades (default: 3)
- 4b.2 Implement relationship-aware eviction
- When degrading object X, DON'T degrade objects that X `depends_on` if they're actively referenced by other L0 objects
- Eviction priority: objects with no inbound `references` or `depends_on` edges are evicted first (they're "leaf" objects)
- 4b.3 Relationship visualization (debug tool)
- CLI command: `mnemosyne graph --session <id>`
- Output: DOT graph of objects + relationships + fidelity levels
- For debugging: identify orphaned objects, over-connected clusters
- 4b.4 Test co-fidelity management
- Scenario: model starts working on auth -> auth objects at L0, related files at L1
- Model switches to tests -> auth demoted to L1/L2, test objects promoted
- Model returns to auth -> auth restored, relationships pull in dependencies
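A sketch of the co-upgrade walk from task 4b.1: a bounded breadth-first traversal over `depends_on`/`references` edges. The `store` interface (edge and fidelity accessors) is hypothetical:

```python
from collections import deque

MAX_HOPS = 2         # default relationship-hop bound
MAX_CO_UPGRADES = 3  # default cap on related objects promoted per upgrade

def co_upgrade(store, obj_id: str) -> None:
    """After obj_id is upgraded to L0, lift nearby related objects to L1.
    Never promote relatives to L0 -- that would over-promote."""
    seen, upgraded = {obj_id}, 0
    queue = deque([(obj_id, 0)])
    while queue and upgraded < MAX_CO_UPGRADES:
        current, hops = queue.popleft()
        if hops >= MAX_HOPS:
            continue
        for neighbor in store.edges(current, kinds=("depends_on", "references")):
            if neighbor in seen:
                continue
            seen.add(neighbor)
            if store.fidelity(neighbor) >= 2:    # at L2 or worse
                store.set_fidelity(neighbor, 1)  # promote to L1 only
                upgraded += 1
            queue.append((neighbor, hops + 1))
```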
### Deliverable

Relationship-aware fidelity management. Objects that belong together stay together.

### Success Criteria
- Co-fidelity reduces fault rate by >20% vs independent fidelity management
- No "orphaned dependency" faults (model needs X, but X's dependency Y is evicted)
## Phase 4c: Goal-Aware Retrieval (Week 8)

Goal: When the user's goal changes, proactively swap context. The Helper LLM reads the new goal, queries the Object Store, and assembles a focused context window.

### Tasks
- 4c.1 Implement goal transition detection (see the sketch after this list)
- Embed each user message
- Compare to previous user message embedding (cosine similarity)
- Threshold: < 0.5 = major topic shift, trigger goal-aware retrieval
- Also detect explicit signals: "now let's work on...", "moving to...", "switching to..."
- 4c.2 Implement goal classification
- Helper LLM call (~100ms). Input: current user message + last 2 turns of context. Output: `{ goal, relevant_types, relevant_tags, predicted_needs }`
- Cache goal classification (don't re-classify if message is a follow-up)
- 4c.3 Implement context swap on goal change
- Query Object Store: find top-20 objects by similarity to new goal
- Rank by: embedding similarity * recency_weight * type_match_bonus
- Assemble new context:
- Always: system prompt, last 2 user turns, pinned objects
- From goal query: top objects at appropriate fidelity (budget permitting)
- Previously active but now irrelevant: degrade to L2/L3 (not L4 -- recent work)
- 4c.4 Implement predictive loading
- Helper LLM predicts what the model will need for this goal
- Pre-load predicted objects at L1 (ready for quick upgrade)
- Track prediction accuracy: did the model actually access predicted objects?
- 4c.5 Test goal-aware retrieval
- Scenario: multi-task session (auth -> tests -> docs -> bugfix)
- Measure: context relevance at each goal transition
- Compare: goal-aware vs naive (no swap) in token efficiency and fault rate
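A sketch of goal-transition detection from task 4c.1, combining the explicit phrase signals with the 0.5 similarity threshold; the marker list and encoder reuse are assumptions:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")
SHIFT_THRESHOLD = 0.5  # cosine below this = major topic shift
EXPLICIT_MARKERS = ("now let's work on", "moving to", "switching to")

def is_goal_transition(prev_user_msg: str, cur_user_msg: str) -> bool:
    """True when the new user message signals a goal change, either
    explicitly or by drifting far from the previous message's embedding."""
    if any(m in cur_user_msg.lower() for m in EXPLICIT_MARKERS):
        return True
    prev_e, cur_e = encoder.encode([prev_user_msg, cur_user_msg],
                                   normalize_embeddings=True)
    return float(np.dot(prev_e, cur_e)) < SHIFT_THRESHOLD
```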
### Deliverable

Proxy proactively loads relevant context on goal transitions.

### Success Criteria
- Goal detection accuracy > 80% (catches real transitions, few false positives)
- Context relevance after swap > 70% (measured by: did the model use the loaded objects?)
- Predictive loading accuracy > 50% (better than random)
- No regressions in token efficiency or fault rate
## Phase 4d: Admission Control (Week 9)

Goal: Not everything deserves to be a stored object. Score incoming content and reject low-value items to keep the Object Store clean.

### Tasks
- 4d.1 Implement admission scorer (see the sketch after this list)
- Four factors: TypePrior, Novelty, Utility, Recency
- TypePrior: static weights per object_type (design_decision=1.0, tool_result=0.4, etc.)
- Novelty: cosine distance to nearest existing object in session (>0.15 = novel)
- Utility: Helper LLM scores future relevance 0-1 (or local model via Ollama)
- Recency: exponential decay from turn number
- Configurable weights (default from A-MAC: T=0.35, N=0.25, U=0.25, R=0.15)
- 4d.2 Implement admission threshold
- Default: 0.4
- Items below threshold: not stored in Object Store
- Still present in client's unmodified history (Pichay backing store)
- Log rejected items for threshold tuning
- 4d.3 Implement rejection patterns
- Always reject: `ls` output, `git status`, routine directory listings
- Always reject: duplicate file reads where content hash matches existing object
- Always reject: purely procedural assistant responses ("Sure, I'll do that")
- Configurable: rejection rules in TOML config
- 4d.4 Threshold tuning
- Replay recorded sessions with different thresholds
- Find threshold that minimizes (false rejections * fault_cost + storage * store_cost)
- Log admission_scores table for analysis
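A sketch of the four-factor admission score from task 4d.1, with the A-MAC default weights and the 0.4 threshold. Type priors beyond `design_decision` and `tool_result`, and the recency half-life, are illustrative:

```python
WEIGHTS = {"type": 0.35, "novelty": 0.25, "utility": 0.25, "recency": 0.15}
TYPE_PRIOR = {"design_decision": 1.0, "tool_result": 0.4}  # others assumed 0.5
ADMISSION_THRESHOLD = 0.4

def admission_score(obj_type: str, nearest_distance: float,
                    utility: float, turns_old: int,
                    half_life: float = 10.0) -> float:
    """Weighted sum of TypePrior, Novelty, Utility, Recency, each in [0, 1].
    nearest_distance: cosine distance to the nearest existing object.
    utility: Helper LLM's 0-1 future-relevance estimate."""
    type_prior = TYPE_PRIOR.get(obj_type, 0.5)
    novelty = min(nearest_distance / 0.15, 1.0)  # >0.15 counts as fully novel
    recency = 0.5 ** (turns_old / half_life)     # exponential decay by turn age
    return (WEIGHTS["type"] * type_prior + WEIGHTS["novelty"] * novelty
            + WEIGHTS["utility"] * utility + WEIGHTS["recency"] * recency)

def admit(score: float) -> bool:
    return score >= ADMISSION_THRESHOLD
```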
### Deliverable

Admission gate filters low-value content from the Object Store.

### Success Criteria

- >30% of tool results rejected (routine/duplicate content)
- No false rejections that cause faults later (measure: rejected items that model would have needed, detected via fault-after-reject tracking)
- Object Store grows linearly with session complexity, not session length
## Phase 4e: Entropy-Gated Faulting (Week 10)

Goal: Use the model's token-level entropy during generation as an automatic signal to inject more context. Complements declared losses.

### Tasks
- 4e.1 Implement logprob extraction from streaming response
- Anthropic API: `stream_options.include_logprobs` (if available)
- Parse token logprobs from SSE stream in real time
- Calculate rolling entropy: `H = -sum(p * log(p))` over top-k logprobs
- 4e.2 Implement entropy monitoring (see the sketch after this list)
- Rolling window: last 20 tokens
- Thresholds: normal (H < 1.5), elevated (1.5-2.2), high (H > 2.2)
- Debounce: don't trigger on single high-entropy token (require 3+ consecutive)
- 4e.3 Implement entropy-triggered context injection
- On elevated entropy: check if any L2/L3 objects match current generation topic
- Extract current generation context (last 50 tokens)
- Embed and search Object Store
- If match found: silently upgrade to L1 in NEXT request (can't modify current)
- On high entropy: more aggressive -- prepare micro-fault answer for likely question
- 4e.4 Evaluate entropy signal reliability
- Compare: entropy at points where model made errors vs correct generation
- Calibrate thresholds per model (Opus vs Sonnet vs Haiku have different baselines)
- Measure: false positive rate (elevated entropy but model was fine)
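A sketch of the rolling-entropy monitor from tasks 4e.1-4e.2, using the window, thresholds, and debounce values listed above; the exact logprob payload shape from the API is an assumption:

```python
import math
from collections import deque

ELEVATED, HIGH = 1.5, 2.2  # entropy thresholds (nats)
WINDOW, DEBOUNCE = 20, 3   # rolling window size; consecutive tokens required

class EntropyMonitor:
    def __init__(self) -> None:
        self.window: deque[float] = deque(maxlen=WINDOW)
        self.consecutive = 0

    def observe(self, top_logprobs: list[float]) -> str | None:
        """Feed the top-k logprobs for one generated token; returns
        "elevated"/"high" once enough consecutive tokens exceed threshold."""
        probs = [math.exp(lp) for lp in top_logprobs]
        total = sum(probs)  # renormalize over the truncated top-k
        h = -sum((p / total) * math.log(p / total) for p in probs)
        self.window.append(h)
        self.consecutive = self.consecutive + 1 if h > ELEVATED else 0
        if self.consecutive < DEBOUNCE:
            return None  # debounce: ignore isolated high-entropy tokens
        avg = sum(self.window) / len(self.window)
        return "high" if avg > HIGH else "elevated"
```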
### Deliverable

Proxy monitors generation entropy and proactively loads context when the model is uncertain.

### Success Criteria
- Entropy signal detects genuine uncertainty >70% of the time
- False positive rate < 30% (elevated entropy that didn't need intervention)
- Measurable quality improvement on tasks where entropy-gating activated
- Note: This is the most experimental phase. Success criteria may be revised.
### Fallback
If logprobs are not reliably available from the API, this phase can be deferred. The system works well without it -- declared losses + manual faulting cover most cases.
## Phase 5: OpenCode Plugin Integration (Week 11)

Goal: Package as an oh-my-opencode plugin with a companion MCP server for user-facing memory inspection tools.

### Tasks
- 5.1 Create oh-my-opencode plugin package
- npm package: `opencode-mnemosyne`
- Hook: `experimental.chat.messages.transform` for context assembly
- Hook: `experimental.session.compacting` for custom compaction
- Hook: `tool.execute.after` for object creation on tool results
- Configuration via `oh-my-opencode.json`
- 5.2 Create MCP server for user-facing tools
- `memory_stats`: show current context pressure, object counts by type/fidelity
- `memory_objects`: list all objects with fidelity, type, stub
- `memory_inspect <id>`: show object detail (all fidelity levels, losses, relationships)
- `memory_graph`: show object relationship graph
- `memory_config`: view/update runtime configuration
- 5.3 Documentation
- Installation guide
- Configuration reference
- Troubleshooting guide
- Architecture overview for contributors
- 5.4 Session dashboard (optional)
- Local web UI (served by proxy) showing:
- Real-time context pressure gauge
- Object timeline (creation, fidelity transitions, faults)
- Token savings over time
- Fault log
### Deliverable
Installable plugin + MCP server. Users can inspect and configure memory behavior.
## Phase 6: Semantic Segmentation + xMemory Hierarchy (Week 12)

Goal: Replace rule-based segmentation with xMemory's hierarchical approach for long sessions. Build the full messages -> episodes -> semantics -> themes hierarchy.

### Tasks
- 6.1 Implement xMemory-style sparsity-semantics segmentation
- Embed all messages in a session with all-MiniLM-L6-v2
- Cluster by coherence using the sparsity-semantics objective: maximize inter-cluster diversity, minimize intra-cluster redundancy
- Output: episodes (coherent sub-conversations)
- 6.2 Build hierarchy
- Level 0: individual messages/tool results (existing objects)
- Level 1: episodes (groups of related objects, from clustering)
- Level 2: semantics (abstract themes spanning multiple episodes)
- Level 3: themes (top-level categories for the entire session)
- 6.3 Hierarchical retrieval (see the sketch after this list)
- Top-down: query matches theme -> expand to semantics -> expand to episodes -> expand to objects
- Only expand when similarity score justifies it (reader uncertainty reduction)
- Prevents redundant retrieval (a key xMemory advantage over flat search)
- 6.4 Incremental hierarchy maintenance
- Don't rebuild from scratch every turn
- New objects: assign to nearest episode, update episode embedding
- Every 20 turns: re-cluster to catch topic drift
- Major goal change: full re-hierarchy
- 6.5 Hierarchy-aware fidelity management
- When an episode is at L2, all its objects are at L2 or lower
- Upgrading an episode promotes its most relevant objects to L1
- Themes can have their own summaries (super-summaries of episode summaries)
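A sketch of the top-down expansion from task 6.3: descend the themes -> semantics -> episodes -> objects hierarchy, expanding a node only when a child matches the query meaningfully better than the node's own summary. The `store` interface and the `min_gain` margin are hypothetical:

```python
def expand(node: str, query_emb, store, budget: int,
           min_gain: float = 0.05) -> list[str]:
    """Return node ids to retrieve: a node's summary stands in for its
    subtree unless a child is a meaningfully better match for the query."""
    children = store.children(node)
    if not children or budget <= 0:
        return [node]
    parent_score = store.similarity(node, query_emb)
    results: list[str] = []
    for child in sorted(children, key=lambda c: -store.similarity(c, query_emb)):
        if len(results) >= budget:
            break
        if store.similarity(child, query_emb) - parent_score > min_gain:
            results.extend(expand(child, query_emb, store,
                                  budget - len(results), min_gain))
    # If no child justified expansion, the node's own summary is returned.
    return results or [node]
```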
### Deliverable

Full hierarchical segmentation for long sessions (>50 turns).

### Success Criteria
- Retrieval quality improves for sessions >100 turns (measured by LLM-judged relevance)
- Hierarchy reduces redundancy in retrieved context (measured by token overlap between results)
- Incremental maintenance is fast (<500ms per turn)
## Dependencies Between Phases

```
Phase 1 (Pichay baseline)
        |
Phase 2 (Multi-fidelity + Helper LLM)
        |
Phase 3 (Backing store + micro-faults)
       / \
      /   \
    4a     4d         (can run in parallel)
(segmentation)  (admission control)
    |
    4b (relationships + co-fidelity)
    |
    4c (goal-aware retrieval)
    |
    4e (entropy-gated faulting) -- optional, experimental
    |
Phase 5 (plugin integration)
    |
Phase 6 (xMemory hierarchy)
```
Phases 4a-4e can be partially parallelized:
- 4a + 4d can be built simultaneously
- 4b depends on 4a
- 4c depends on 4a + 4b
- 4e is independent (depends only on Phase 3)
- Phase 5 can start after Phase 3 (plugin wrapping doesn't need 4a-4e)
- Phase 6 depends on 4a (needs basic segmentation first)
## Risk Register
| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| Anthropic API doesn't expose logprobs for streaming | Medium | Phase 4e blocked | Phase 4e is optional. System works without entropy gating. |
| Helper LLM summaries lose critical info | Medium | Quality degradation | Declared losses + micro-faults as safety net. Spot-check auditing. |
| Proxy adds too much latency | Low | User experience | Helper LLM calls are async (don't block response). Summarization happens post-response. |
| pgvector search too slow at scale | Low | Micro-fault latency | IVFFlat index. For extreme scale, switch to dedicated vector DB (Qdrant). |
| Object segmentation too noisy | Medium | Poor fidelity decisions | Conservative defaults (larger objects). Rule-based segmentation is robust. |
| Phantom tool parsing from streaming response is fragile | Medium | Proxy breaks | Extensive testing on recorded sessions. Fallback: don't parse, let tool call through. |
| Model doesn't use cooperative eviction (memory_release) | High | Reduced savings | Cooperative eviction is bonus. Pressure-based eviction works without model cooperation. |
| Cross-session memory introduces stale/wrong context | Medium | Wrong answers | Phase 6+ only. Confidence decay on persistent objects. |