# Implementation Roadmap

## Overview

Nine phases (plus the optional, experimental Phase 4e), incrementally building from a working Pichay fork to the full object-addressed memory system. Each phase produces a testable, usable artifact.

Estimated total effort: 8-12 weeks for a solo developer.
## Phase 1: Pichay Baseline (Week 1)

Goal: Get the existing Pichay proxy running between opencode and the Anthropic API. Validate structural waste reduction on real sessions.

### Tasks
- 1.1 Fork `fsgeek/pichay` at tag `v0.1.0-paper` (commit `b56701a`)
- 1.2 Set up development environment
- Python 3.11+, asyncio, httpx
- Local opencode instance
- Proxy configuration: `ANTHROPIC_BASE_URL=http://localhost:8080`
- 1.3 Run proxy in passthrough mode (no eviction)
- Verify: opencode works normally through proxy
- Log: request/response sizes, token counts, tool call inventory
- 1.4 Enable FIFO eviction (Pichay default settings); a sketch of the rule follows this list
- `tau = 4` turns, `s_min = 500` bytes
- Verify: tombstones appear for old tool results
- Measure: tokens saved, fault rate
- 1.5 Record 5+ real coding sessions through the proxy
- Use `probe.py` to generate session analytics
- Baseline metrics: waste %, amplification factor, fault rate
- 1.6 Write session replay infrastructure
- Record full message traces (request + response pairs)
- Replay tool for offline testing of later phases
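A minimal sketch of the FIFO eviction rule from task 1.4, assuming the `tau`/`s_min` semantics described above; the `ToolResult` type and tombstone wording are illustrative, not Pichay's actual API:

```python
from dataclasses import dataclass

@dataclass
class ToolResult:
    turn: int        # turn in which this result entered the context
    size_bytes: int  # raw size of the result content
    content: str

def should_evict(result: ToolResult, current_turn: int,
                 tau: int = 4, s_min: int = 500) -> bool:
    """FIFO rule: tombstone a tool result once it is older than tau
    turns AND at least s_min bytes (small results aren't worth evicting)."""
    return (current_turn - result.turn) > tau and result.size_bytes >= s_min

def tombstone(result: ToolResult) -> str:
    # Stub left in the message array so the model can fault the content back in.
    return f"[Paged out: tool result from turn {result.turn}, {result.size_bytes} bytes]"
```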
### Deliverable

Working proxy with FIFO eviction. Baseline metrics on real sessions.

### Success Criteria
- Proxy is transparent (opencode works identically)
- Measurable token reduction (target: >15%)
- Fault rate < 0.1%
- 5+ recorded sessions for replay testing
## Phase 2: Multi-Fidelity Eviction (Weeks 2-3)

Goal: Replace binary eviction (resident/tombstone) with graduated fidelity levels. Introduce the Helper LLM for summary generation.

### Tasks
- 2.1 Implement the Fidelity Manager state machine
- States: L0 (full), L1 (detailed summary), L2 (compact summary), L3 (stub), L4 (evicted)
- Transitions: degrade (pressure), upgrade (access/fault), pin (fault-driven)
- In-memory state per session (no DB yet)
- 2.2 Implement pressure zones (see the sketch after this list)
- Normal (<50%), Caution (50-70%), Warning (70-85%), Critical (85-95%), Emergency (>95%)
- Token counting per fidelity level
- Configurable thresholds via TOML config
- 2.3 Integrate Helper LLM (Anthropic Haiku)
- API client with retry, timeout, error handling
- Prompt template for L0 -> L1 summarization (detailed summary + declared losses)
- Prompt template for L1 -> L2 compression
- Response parsing: extract summary, losses, can_answer, key_entities
- 2.4 Implement fidelity degradation on pressure
- Walk objects oldest-first
- Generate summaries via Helper LLM on transition
- Replace content in message array with summary + loss declaration
- Format: `[Summary of {type}: {stub}]\n{summary}\n[Cannot answer: {losses}]`
- 2.5 Implement fidelity upgrade on access
- When model references an L1/L2/L3 object (detected by content overlap or tool call), upgrade to L0
- Restore full content from in-memory cache (client history is the backing store)
- 2.6 Extend fault detection for fidelity-aware faults
- L3 stub referenced -> upgrade to L1 (not necessarily L0)
- L2 compact referenced -> upgrade to L1
- Only full page-in if model explicitly re-requests (tool call match)
- 2.7 Add declared losses to eviction tombstones
- Format: `[Paged out: {stub}. Lost: {losses}. Restore if you need: {fault_when}]`
- 2.8 Replay testing against Phase 1 baseline
- Compare: token reduction, fault rate, summary quality
- Manual review: are declared losses accurate?
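A sketch of the pressure zones from task 2.2 and the oldest-first degradation walk from 2.4, using the default thresholds listed above. The `MemObject` type is a placeholder, and the Helper LLM summarization step is elided:

```python
from dataclasses import dataclass
from enum import IntEnum

class Fidelity(IntEnum):
    L0 = 0  # full content
    L1 = 1  # detailed summary
    L2 = 2  # compact summary
    L3 = 3  # stub
    L4 = 4  # evicted

@dataclass
class MemObject:
    turn: int
    fidelity: Fidelity = Fidelity.L0
    pinned: bool = False  # fault-driven pins are never degraded

def pressure_zone(tokens_used: int, context_limit: int) -> str:
    ratio = tokens_used / context_limit
    for upper, name in [(0.50, "normal"), (0.70, "caution"),
                        (0.85, "warning"), (0.95, "critical")]:
        if ratio < upper:
            return name
    return "emergency"

def degrade_step(objects: list[MemObject], zone: str) -> None:
    """One degradation pass: walk oldest-first and push the first
    eligible object down one fidelity level, then re-check pressure."""
    if zone in ("normal", "caution"):
        return
    for obj in sorted(objects, key=lambda o: o.turn):
        if not obj.pinned and obj.fidelity < Fidelity.L4:
            obj.fidelity = Fidelity(obj.fidelity + 1)
            return
```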
### Deliverable

Multi-fidelity proxy with Helper LLM summarization and declared losses.

### Success Criteria
- Token reduction > 40% (up from Phase 1's ~20%)
- Fault rate < 0.05% (better than binary eviction -- summaries prevent unnecessary faults)
- Helper LLM cost < $0.01 per session
- Declared losses are accurate in >90% of spot checks
## Phase 3: Queryable Backing Store + Micro-Faults (Weeks 4-5)

Goal: Add PostgreSQL + pgvector as a persistent backing store. Implement the `memory_query` phantom tool for micro-faults. This is the single highest-impact feature.

### Tasks
- 3.1 Set up PostgreSQL + pgvector
- Docker Compose for local dev
- Schema from SCHEMA.md (sessions, semantic_objects tables)
- Connection pooling (asyncpg)
- 3.2 Implement Object Store module
- CRUD operations for semantic_objects
- Embedding generation (all-MiniLM-L6-v2 via sentence-transformers, ONNX runtime)
- Store objects on creation (every tool result, message span)
- Full content always preserved in DB regardless of context fidelity
- 3.3 Implement semantic search (a hybrid query sketch follows this list)
- Vector similarity search (pgvector cosine distance)
- Full-text search (PostgreSQL tsvector)
- Hybrid search (weighted combination)
- Metadata-filtered search (by type, tags, date range)
- 3.4 Implement `memory_query` phantom tool
- Add tool definition to phantom tool injection
- Intercept from streaming response
- Flow: intercept -> query Object Store -> send top-k results to Helper LLM -> inject answer
- Synthetic tool result format: `[Memory Query Result]\nQ: {question}\nA: {answer}\n[Source: {object stubs}]`
- 3.5 Implement `memory_restore` phantom tool
- Full page-in from backing store (traditional fault)
- Upgrade object to L0
- Trigger pressure-based eviction of other objects if needed
- 3.6 Implement `memory_release` phantom tool
- Model voluntarily releases objects
- Immediate degradation to L3 or L4
- Log cooperative eviction event
- 3.7 Implement phantom tool injection
- Add phantom tool definitions to the model's tool list in each request
- Parse phantom tool calls from streaming response BEFORE client receives them
- Handle phantom tool results transparently
- 3.8 Measure micro-fault effectiveness
- Track: micro-faults attempted, successful (model didn't need to restore after), tokens saved per micro-fault
- Compare: micro-fault answer quality vs full restore (LLM-judged)
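A sketch of the hybrid search from task 3.3, assuming `semantic_objects` columns per SCHEMA.md (`id`, `stub`, `content`, `session_id`, `embedding`) and pgvector's asyncpg integration; the 0.7/0.3 weights are illustrative defaults, not settled values:

```python
import asyncpg
from pgvector.asyncpg import register_vector  # pgvector's asyncpg type codec

# Weighted combination of cosine similarity (<=> is pgvector's cosine
# distance operator) and PostgreSQL full-text rank.
HYBRID_SQL = """
SELECT id, stub,
       0.7 * (1 - (embedding <=> $1))
     + 0.3 * ts_rank(to_tsvector('english', content),
                     plainto_tsquery('english', $2)) AS score
FROM semantic_objects
WHERE session_id = $3
ORDER BY score DESC
LIMIT $4
"""

async def hybrid_search(pool: asyncpg.Pool, query_embedding,
                        query_text: str, session_id: str, k: int = 5):
    async with pool.acquire() as conn:
        await register_vector(conn)  # teach asyncpg to encode vectors
        return await conn.fetch(HYBRID_SQL, query_embedding, query_text,
                                session_id, k)
```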
### Deliverable

Full proxy with persistent backing store, semantic search, and micro-fault capability.

### Success Criteria
- Token reduction > 60%
- Micro-fault success rate > 85% (model doesn't need full restore after micro-fault)
- Micro-fault latency < 500ms (embedding + DB query + Helper LLM)
- Token savings per micro-fault: >95% vs full restore
- DB query latency < 50ms for semantic search
## Phase 4a: Object Segmentation (Week 6)

Goal: Replace file-path-based page identity with semantic object detection. Conversations are segmented into coherent objects with types and relationships.

### Tasks
- 4a.1 Implement structural segmentation (Strategy A)
- Rule-based: tool call boundaries -> tool_result / file_context objects
- User turn + assistant response = conversation span
- Topic shift detection: embedding cosine between consecutive spans < 0.7 -> new object (sketched after this list)
- Object type classification: rules based on tool name, content patterns
- Read tool -> file_context
- Bash/Grep tool -> tool_result
- Error in output -> error_context
- "I'll implement..." / plan language -> plan
- "Let's use X because Y" / decision language -> design_decision
- 4a.2 Implement object type classifier
- Helper LLM or simple keyword-based classifier
- Input: content span + preceding context
- Output: object_type + stub + tags
- Latency budget: <100ms (prefer rules, fallback to Helper LLM)
- 4a.3 Retroactive segmentation
- On session start: no segmentation (objects created per-tool-result)
- Every 10 turns: re-examine recent objects, merge small ones, split large ones
- Merge criterion: consecutive objects of same type with embedding similarity > 0.8
- Split criterion: single object > 5000 tokens with internal topic shift
- 4a.4 Object deduplication
- file_context: dedup by source_key (file path). New read supersedes old.
- tool_result: no dedup (each is unique)
- conversation_phase: no dedup
- Add 'supersedes' relationship when replacing
- 4a.5 Store relationships
- Automatic: file_context referenced in debugging_session -> 'references' edge
- Automatic: design_decision made during conversation_phase -> 'parent_of' edge
- Detection: key_entities overlap between objects suggests relationship
- 4a.6 Test segmentation quality
- Replay recorded sessions
- Manual review: do objects correspond to intuitive "chunks" of work?
- Measure: average object size, type distribution, relationship density
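A sketch of the topic-shift split from task 4a.1, using the all-MiniLM-L6-v2 embedder and the 0.7 cosine threshold named above:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")
TOPIC_SHIFT = 0.7  # cosine below this between consecutive spans starts a new object

def segment_spans(spans: list[str]) -> list[list[str]]:
    """Structural segmentation (Strategy A): group consecutive spans,
    splitting wherever embedding cosine drops below the threshold."""
    if not spans:
        return []
    embs = encoder.encode(spans, normalize_embeddings=True)
    objects, current = [], [spans[0]]
    for prev, cur, span in zip(embs, embs[1:], spans[1:]):
        if float(np.dot(prev, cur)) < TOPIC_SHIFT:  # unit vectors: dot = cosine
            objects.append(current)
            current = []
        current.append(span)
    objects.append(current)
    return objects
```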
### Deliverable

Proxy segments conversations into typed semantic objects with relationships.

### Success Criteria
- Objects correspond to intuitive conversation segments (manual review)
- Average object size: 500-3000 tokens (not too fine, not too coarse)
- Type classification accuracy > 85%
- Segmentation latency < 50ms per turn (structural) or < 200ms (semantic)
## Phase 4b: Object Relationships + Co-Fidelity (Week 7)

Goal: Use relationships between objects to make smarter fidelity decisions. When one object is upgraded, related objects are considered for upgrade too.

### Tasks
- 4b.1 Implement relationship-aware fidelity management (see the sketch after this list)
- When upgrading object X to L0, check `depends_on` and `references` edges
- If related object Y is at L2+, upgrade Y to L1 (not L0 -- don't over-promote)
- Configurable: max relationship hops (default: 2), max co-upgrades (default: 3)
- 4b.2 Implement relationship-aware eviction
- When degrading object X, DON'T degrade objects that X `depends_on` if they're actively referenced by other L0 objects
- Eviction priority: objects with no inbound `references` or `depends_on` edges are evicted first (they're "leaf" objects)
- 4b.3 Relationship visualization (debug tool)
- CLI command: `mnemosyne graph --session <id>`
- Output: DOT graph of objects + relationships + fidelity levels
- For debugging: identify orphaned objects, over-connected clusters
- 4b.4 Test co-fidelity management
- Scenario: model starts working on auth -> auth objects at L0, related files at L1
- Model switches to tests -> auth demoted to L1/L2, test objects promoted
- Model returns to auth -> auth restored, relationships pull in dependencies
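A sketch of the co-upgrade walk from task 4b.1: a bounded breadth-first traversal over `depends_on`/`references` edges. The `store` interface (edge and fidelity accessors) is hypothetical:

```python
from collections import deque

MAX_HOPS = 2         # default relationship-hop bound
MAX_CO_UPGRADES = 3  # default cap on related objects promoted per upgrade

def co_upgrade(store, obj_id: str) -> None:
    """After obj_id is upgraded to L0, lift nearby related objects to L1.
    Never promote relatives to L0 -- that would over-promote."""
    seen, upgraded = {obj_id}, 0
    queue = deque([(obj_id, 0)])
    while queue and upgraded < MAX_CO_UPGRADES:
        current, hops = queue.popleft()
        if hops >= MAX_HOPS:
            continue
        for neighbor in store.edges(current, kinds=("depends_on", "references")):
            if neighbor in seen:
                continue
            seen.add(neighbor)
            if store.fidelity(neighbor) >= 2:    # at L2 or worse
                store.set_fidelity(neighbor, 1)  # promote to L1 only
                upgraded += 1
            queue.append((neighbor, hops + 1))
```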
### Deliverable

Relationship-aware fidelity management. Objects that belong together stay together.

### Success Criteria
- Co-fidelity reduces fault rate by >20% vs independent fidelity management
- No "orphaned dependency" faults (model needs X, but X's dependency Y is evicted)
## Phase 4c: Goal-Aware Retrieval (Week 8)

Goal: When the user's goal changes, proactively swap context. The Helper LLM reads the new goal, queries the Object Store, and assembles a focused context window.

### Tasks
- 4c.1 Implement goal transition detection (see the sketch after this list)
- Embed each user message
- Compare to previous user message embedding (cosine similarity)
- Threshold: < 0.5 = major topic shift, trigger goal-aware retrieval
- Also detect explicit signals: "now let's work on...", "moving to...", "switching to..."
- 4c.2 Implement goal classification
- Helper LLM call (~100ms). Input: current user message + last 2 turns of context. Output: `{ goal, relevant_types, relevant_tags, predicted_needs }`
- Cache goal classification (don't re-classify if message is a follow-up)
- 4c.3 Implement context swap on goal change
- Query Object Store: find top-20 objects by similarity to new goal
- Rank by: embedding similarity * recency_weight * type_match_bonus
- Assemble new context:
- Always: system prompt, last 2 user turns, pinned objects
- From goal query: top objects at appropriate fidelity (budget permitting)
- Previously active but now irrelevant: degrade to L2/L3 (not L4 -- recent work)
- 4c.4 Implement predictive loading
- Helper LLM predicts what the model will need for this goal
- Pre-load predicted objects at L1 (ready for quick upgrade)
- Track prediction accuracy: did the model actually access predicted objects?
- 4c.5 Test goal-aware retrieval
- Scenario: multi-task session (auth -> tests -> docs -> bugfix)
- Measure: context relevance at each goal transition
- Compare: goal-aware vs naive (no swap) in token efficiency and fault rate
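A sketch of goal-transition detection from task 4c.1, combining the explicit phrase signals with the 0.5 similarity threshold; the marker list and encoder reuse are assumptions:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")
SHIFT_THRESHOLD = 0.5  # cosine below this = major topic shift
EXPLICIT_MARKERS = ("now let's work on", "moving to", "switching to")

def is_goal_transition(prev_user_msg: str, cur_user_msg: str) -> bool:
    """True when the new user message signals a goal change, either
    explicitly or by drifting far from the previous message's embedding."""
    if any(m in cur_user_msg.lower() for m in EXPLICIT_MARKERS):
        return True
    prev_e, cur_e = encoder.encode([prev_user_msg, cur_user_msg],
                                   normalize_embeddings=True)
    return float(np.dot(prev_e, cur_e)) < SHIFT_THRESHOLD
```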
### Deliverable

Proxy proactively loads relevant context on goal transitions.

### Success Criteria
- Goal detection accuracy > 80% (catches real transitions, few false positives)
- Context relevance after swap > 70% (measured by: did the model use the loaded objects?)
- Predictive loading accuracy > 50% (better than random)
- No regressions in token efficiency or fault rate
## Phase 4d: Admission Control (Week 9)

Goal: Not everything deserves to be a stored object. Score incoming content and reject low-value items to keep the Object Store clean.

### Tasks
- 4d.1 Implement admission scorer (see the sketch after this list)
- Four factors: TypePrior, Novelty, Utility, Recency
- TypePrior: static weights per object_type (design_decision=1.0, tool_result=0.4, etc.)
- Novelty: cosine distance to nearest existing object in session (>0.15 = novel)
- Utility: Helper LLM scores future relevance 0-1 (or local model via Ollama)
- Recency: exponential decay from turn number
- Configurable weights (default from A-MAC: T=0.35, N=0.25, U=0.25, R=0.15)
- 4d.2 Implement admission threshold
- Default: 0.4
- Items below threshold: not stored in Object Store
- Still present in client's unmodified history (Pichay backing store)
- Log rejected items for threshold tuning
- 4d.3 Implement rejection patterns
- Always reject: `ls` output, `git status`, routine directory listings
- Always reject: duplicate file reads where content hash matches existing object
- Always reject: purely procedural assistant responses ("Sure, I'll do that")
- Configurable: rejection rules in TOML config
- 4d.4 Threshold tuning
- Replay recorded sessions with different thresholds
- Find threshold that minimizes (false rejections * fault_cost + storage * store_cost)
- Log admission_scores table for analysis
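A sketch of the four-factor admission score from task 4d.1, with the A-MAC default weights and the 0.4 threshold. Type priors beyond `design_decision` and `tool_result`, and the recency half-life, are illustrative:

```python
WEIGHTS = {"type": 0.35, "novelty": 0.25, "utility": 0.25, "recency": 0.15}
TYPE_PRIOR = {"design_decision": 1.0, "tool_result": 0.4}  # others assumed 0.5
ADMISSION_THRESHOLD = 0.4

def admission_score(obj_type: str, nearest_distance: float,
                    utility: float, turns_old: int,
                    half_life: float = 10.0) -> float:
    """Weighted sum of TypePrior, Novelty, Utility, Recency, each in [0, 1].
    nearest_distance: cosine distance to the nearest existing object.
    utility: Helper LLM's 0-1 future-relevance estimate."""
    type_prior = TYPE_PRIOR.get(obj_type, 0.5)
    novelty = min(nearest_distance / 0.15, 1.0)  # >0.15 counts as fully novel
    recency = 0.5 ** (turns_old / half_life)     # exponential decay by turn age
    return (WEIGHTS["type"] * type_prior + WEIGHTS["novelty"] * novelty
            + WEIGHTS["utility"] * utility + WEIGHTS["recency"] * recency)

def admit(score: float) -> bool:
    return score >= ADMISSION_THRESHOLD
```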
### Deliverable

Admission gate filters low-value content from the Object Store.

### Success Criteria

- >30% of tool results rejected (routine/duplicate content)
- No false rejections that cause faults later (measure: rejected items that model would have needed, detected via fault-after-reject tracking)
- Object Store grows linearly with session complexity, not session length
## Phase 4e: Entropy-Gated Faulting (Week 10)

Goal: Use the model's token-level entropy during generation as an automatic signal to inject more context. Complements declared losses.

### Tasks
- 4e.1 Implement logprob extraction from streaming response
- Anthropic API: `stream_options.include_logprobs` (if available)
- Parse token logprobs from SSE stream in real time
- Calculate rolling entropy: `H = -sum(p * log(p))` over top-k logprobs
- 4e.2 Implement entropy monitoring (see the sketch after this list)
- Rolling window: last 20 tokens
- Thresholds: normal (H < 1.5), elevated (1.5-2.2), high (H > 2.2)
- Debounce: don't trigger on single high-entropy token (require 3+ consecutive)
- 4e.3 Implement entropy-triggered context injection
- On elevated entropy: check if any L2/L3 objects match current generation topic
- Extract current generation context (last 50 tokens)
- Embed and search Object Store
- If match found: silently upgrade to L1 in NEXT request (can't modify current)
- On high entropy: more aggressive -- prepare micro-fault answer for likely question
- 4e.4 Evaluate entropy signal reliability
- Compare: entropy at points where model made errors vs correct generation
- Calibrate thresholds per model (Opus vs Sonnet vs Haiku have different baselines)
- Measure: false positive rate (elevated entropy but model was fine)
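A sketch of the rolling-entropy monitor from tasks 4e.1-4e.2, using the window, thresholds, and debounce values listed above; the exact logprob payload shape from the API is an assumption:

```python
import math
from collections import deque

ELEVATED, HIGH = 1.5, 2.2  # entropy thresholds (nats)
WINDOW, DEBOUNCE = 20, 3   # rolling window size; consecutive tokens required

class EntropyMonitor:
    def __init__(self) -> None:
        self.window: deque[float] = deque(maxlen=WINDOW)
        self.consecutive = 0

    def observe(self, top_logprobs: list[float]) -> str | None:
        """Feed the top-k logprobs for one generated token; returns
        "elevated"/"high" once enough consecutive tokens exceed threshold."""
        probs = [math.exp(lp) for lp in top_logprobs]
        total = sum(probs)  # renormalize over the truncated top-k
        h = -sum((p / total) * math.log(p / total) for p in probs)
        self.window.append(h)
        self.consecutive = self.consecutive + 1 if h > ELEVATED else 0
        if self.consecutive < DEBOUNCE:
            return None  # debounce: ignore isolated high-entropy tokens
        avg = sum(self.window) / len(self.window)
        return "high" if avg > HIGH else "elevated"
```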
### Deliverable

Proxy monitors generation entropy and proactively loads context when the model is uncertain.

### Success Criteria
- Entropy signal detects genuine uncertainty >70% of the time
- False positive rate < 30% (elevated entropy that didn't need intervention)
- Measurable quality improvement on tasks where entropy-gating activated
- Note: This is the most experimental phase. Success criteria may be revised.
### Fallback
If logprobs are not reliably available from the API, this phase can be deferred. The system works well without it -- declared losses + manual faulting cover most cases.
## Phase 5: OpenCode Plugin Integration (Week 11)

Goal: Package as an oh-my-opencode plugin with a companion MCP server for user-facing memory inspection tools.

### Tasks
- 5.1 Create oh-my-opencode plugin package
- npm package: `opencode-mnemosyne`
- Hook: `experimental.chat.messages.transform` for context assembly
- Hook: `experimental.session.compacting` for custom compaction
- Hook: `tool.execute.after` for object creation on tool results
- Configuration via `oh-my-opencode.json`
- 5.2 Create MCP server for user-facing tools
- `memory_stats`: show current context pressure, object counts by type/fidelity
- `memory_objects`: list all objects with fidelity, type, stub
- `memory_inspect <id>`: show object detail (all fidelity levels, losses, relationships)
- `memory_graph`: show object relationship graph
- `memory_config`: view/update runtime configuration
- 5.3 Documentation
- Installation guide
- Configuration reference
- Troubleshooting guide
- Architecture overview for contributors
- 5.4 Session dashboard (optional)
- Local web UI (served by proxy) showing:
- Real-time context pressure gauge
- Object timeline (creation, fidelity transitions, faults)
- Token savings over time
- Fault log
### Deliverable
Installable plugin + MCP server. Users can inspect and configure memory behavior.
## Phase 6: Semantic Segmentation + xMemory Hierarchy (Week 12)

Goal: Replace rule-based segmentation with xMemory's hierarchical approach for long sessions. Build the full messages -> episodes -> semantics -> themes hierarchy.

### Tasks
- 6.1 Implement xMemory-style sparsity-semantics segmentation
- Embed all messages in a session with all-MiniLM-L6-v2
- Cluster by coherence using the sparsity-semantics objective: maximize inter-cluster diversity, minimize intra-cluster redundancy
- Output: episodes (coherent sub-conversations)
- 6.2 Build hierarchy
- Level 0: individual messages/tool results (existing objects)
- Level 1: episodes (groups of related objects, from clustering)
- Level 2: semantics (abstract themes spanning multiple episodes)
- Level 3: themes (top-level categories for the entire session)
- 6.3 Hierarchical retrieval (see the sketch after this list)
- Top-down: query matches theme -> expand to semantics -> expand to episodes -> expand to objects
- Only expand when similarity score justifies it (reader uncertainty reduction)
- Prevents redundant retrieval (a key xMemory advantage over flat search)
- 6.4 Incremental hierarchy maintenance
- Don't rebuild from scratch every turn
- New objects: assign to nearest episode, update episode embedding
- Every 20 turns: re-cluster to catch topic drift
- Major goal change: full re-hierarchy
- 6.5 Hierarchy-aware fidelity management
- When an episode is at L2, all its objects are at L2 or lower
- Upgrading an episode promotes its most relevant objects to L1
- Themes can have their own summaries (super-summaries of episode summaries)
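A sketch of the top-down expansion from task 6.3: descend the themes -> semantics -> episodes -> objects hierarchy, expanding a node only when a child matches the query meaningfully better than the node's own summary. The `store` interface and the `min_gain` margin are hypothetical:

```python
def expand(node: str, query_emb, store, budget: int,
           min_gain: float = 0.05) -> list[str]:
    """Return node ids to retrieve: a node's summary stands in for its
    subtree unless a child is a meaningfully better match for the query."""
    children = store.children(node)
    if not children or budget <= 0:
        return [node]
    parent_score = store.similarity(node, query_emb)
    results: list[str] = []
    for child in sorted(children, key=lambda c: -store.similarity(c, query_emb)):
        if len(results) >= budget:
            break
        if store.similarity(child, query_emb) - parent_score > min_gain:
            results.extend(expand(child, query_emb, store,
                                  budget - len(results), min_gain))
    # If no child justified expansion, the node's own summary is returned.
    return results or [node]
```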
### Deliverable

Full hierarchical segmentation for long sessions (>50 turns).

### Success Criteria
- Retrieval quality improves for sessions >100 turns (measured by LLM-judged relevance)
- Hierarchy reduces redundancy in retrieved context (measured by token overlap between results)
- Incremental maintenance is fast (<500ms per turn)
## Dependencies Between Phases

```
Phase 1 (Pichay baseline)
        |
Phase 2 (Multi-fidelity + Helper LLM)
        |
Phase 3 (Backing store + micro-faults)
       / \
      /   \
    4a     4d         (can run in parallel)
(segmentation)  (admission control)
    |
    4b (relationships + co-fidelity)
    |
    4c (goal-aware retrieval)
    |
    4e (entropy-gated faulting) -- optional, experimental
    |
Phase 5 (plugin integration)
    |
Phase 6 (xMemory hierarchy)
```
Phases 4a-4e can be partially parallelized:
- 4a + 4d can be built simultaneously
- 4b depends on 4a
- 4c depends on 4a + 4b
- 4e is independent (depends only on Phase 3)
- Phase 5 can start after Phase 3 (plugin wrapping doesn't need 4a-4e)
- Phase 6 depends on 4a (needs basic segmentation first)
## Risk Register
| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| Anthropic API doesn't expose logprobs for streaming | Medium | Phase 4e blocked | Phase 4e is optional. System works without entropy gating. |
| Helper LLM summaries lose critical info | Medium | Quality degradation | Declared losses + micro-faults as safety net. Spot-check auditing. |
| Proxy adds too much latency | Low | User experience | Helper LLM calls are async (don't block response). Summarization happens post-response. |
| pgvector search too slow at scale | Low | Micro-fault latency | IVFFlat index. For extreme scale, switch to dedicated vector DB (Qdrant). |
| Object segmentation too noisy | Medium | Poor fidelity decisions | Conservative defaults (larger objects). Rule-based segmentation is robust. |
| Phantom tool parsing from streaming response is fragile | Medium | Proxy breaks | Extensive testing on recorded sessions. Fallback: don't parse, let tool call through. |
| Model doesn't use cooperative eviction (memory_release) | High | Reduced savings | Cooperative eviction is bonus. Pressure-based eviction works without model cooperation. |
| Cross-session memory introduces stale/wrong context | Medium | Wrong answers | Phase 6+ only. Confidence decay on persistent objects. |