# Research References
All papers, repositories, and prior art that informed this design.
## Core Papers
### Pichay — Demand Paging for LLM Context Windows (PRIMARY)
- Paper: The Missing Memory Hierarchy: Demand Paging for LLM Context Windows
- Author: Tony Mason (UBC / Georgia Tech)
- Date: March 2026; accepted at ACM SIGOPS
- Repo: https://github.com/fsgeek/pichay (tag: v0.1.0-paper, commit b56701a)
- Archival: https://doi.org/10.5281/zenodo.18930122
- Key findings: 21.8% structural waste across 857 sessions / 4.45B tokens. 93% context reduction in live deployment. 0.0254% fault rate over 1.4M evictions. Cooperative eviction via phantom tools and cleanup tags. FIFO eviction with pressure zones (sketched below). Transparent HTTP proxy architecture.
- Used in: Phase 1 (fork baseline), Phase 2 (pressure zones, cleanup tags), Phase 3 (phantom tools)
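A minimal sketch of the pressure-zone eviction loop as we read it from the paper. The zone thresholds, the `Message` shape, and the pinning flag are illustrative assumptions, not Pichay's actual values; in Pichay an evicted entry is replaced by a phantom-tool stub the agent can fault back in.

```ts
// FIFO eviction with pressure zones, loosely after Pichay.
// Thresholds (0.7 / 0.9) and types are assumptions for illustration.
type Message = { id: string; tokens: number; pinned?: boolean };
type Zone = "green" | "yellow" | "red";

function zone(used: number, budget: number): Zone {
  const ratio = used / budget;
  if (ratio < 0.7) return "green";  // no pressure: evict nothing
  if (ratio < 0.9) return "yellow"; // moderate pressure: trim back to ~85%
  return "red";                     // high pressure: trim back to ~70%
}

/** Drop oldest unpinned messages (FIFO) until usage falls below the zone target. */
function evict(history: Message[], budget: number): Message[] {
  let used = history.reduce((sum, m) => sum + m.tokens, 0);
  const target = zone(used, budget) === "red" ? 0.7 * budget : 0.85 * budget;
  const kept: Message[] = [];
  for (const msg of history) {
    if (used > target && !msg.pinned) {
      used -= msg.tokens; // evicted; Pichay would leave a phantom-tool stub here
      continue;
    }
    kept.push(msg);
  }
  return kept;
}
```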
### MemGPT / Letta — Virtual Memory for LLMs
- Paper: MemGPT: Towards LLMs as Operating Systems
- Authors: Charles Packer, Sarah Wooders, Kevin Lin, Vivian Fang, Shishir G. Patil, Ion Stoica, Joseph E. Gonzalez (UC Berkeley)
- Date: October 2023 (revised February 2024)
- Repo: https://github.com/letta-ai/letta (SHA: 4cb2f21c)
- Key findings: Three-tier memory hierarchy (core/recall/archival). Agent-initiated paging via tool calls. PostgreSQL + pgvector for archival storage. Partial-evict summarization of the oldest 30% of messages (sketched below). LLM-driven retrieval is surprisingly effective.
- Used in: Object Store design (SCHEMA.md), multi-fidelity concept, backing store architecture (Phase 3)
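A minimal sketch of the tiering and the partial-evict move, under the assumption that the paper's 30% flush ratio applies to the in-context message list. Tier names follow MemGPT; the shapes and the `summarize` callback are illustrative, not Letta's API.

```ts
// MemGPT-style tiers plus partial-evict summarization (sketch).
interface MemoryTiers {
  core: string[];     // always in context: persona, key user facts
  recall: string[];   // full message log, searchable via tool calls
  archival: string[]; // long-term store (pgvector in Letta)
}

/** Replace the oldest 30% of in-context messages with a single summary. */
function partialEvict(
  context: string[],
  tiers: MemoryTiers,
  summarize: (msgs: string[]) => string, // one LLM call in practice
): string[] {
  if (context.length === 0) return context;
  const cut = Math.ceil(context.length * 0.3);
  const evicted = context.slice(0, cut);
  tiers.recall.push(...evicted); // evicted text stays retrievable by search
  return [summarize(evicted), ...context.slice(cut)];
}
```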
### xMemory — Hierarchical Structured Retrieval
- Paper: Beyond RAG for Agent Memory: Retrieval by Decoupling and Aggregation
- Venue: ICML 2026
- Key findings: Standard RAG on agent memory fails due to correlated content. Hierarchical retrieval (messages -> episodes -> semantics -> themes) prevents redundant retrieval. Sparsity-semantics objective for segmentation. Top-down retrieval reduces retrieved tokens while improving relevance (sketched below).
- Used in: Phase 6 (xMemory hierarchy), Phase 4a (segmentation concept)
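A sketch of the top-down traversal idea (themes down to messages), not xMemory's actual algorithm: the node shape, the per-level top-k pruning, and the `score` callback are assumptions standing in for the paper's sparsity-semantics machinery.

```ts
// Top-down retrieval over a theme -> semantic -> episode -> message tree (sketch).
interface MemNode { text: string; children: MemNode[] }
type Scorer = (query: string, text: string) => number; // e.g. embedding cosine

function topDownRetrieve(query: string, roots: MemNode[], score: Scorer, k = 2): string[] {
  let frontier = roots;
  const hits: string[] = [];
  while (frontier.length > 0) {
    // Keep only the k best nodes at this level, pruning whole subtrees early.
    const ranked = [...frontier]
      .sort((a, b) => score(query, b.text) - score(query, a.text))
      .slice(0, k);
    // Leaves are raw messages: collect them instead of descending further.
    hits.push(...ranked.filter((n) => n.children.length === 0).map((n) => n.text));
    frontier = ranked.flatMap((n) => n.children);
  }
  return hits;
}
```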
### L-RAG — Entropy-Based Lazy Context Loading
- Paper: L-RAG: Balancing Context and Retrieval with Entropy-Based Lazy Loading
- Date: January 2026
- Key findings: Token entropy reliably predicts model uncertainty (H=1.72 on correct answers vs H=2.20 on errors, p<0.001). 26% retrieval reduction at a balanced threshold. Training-free. Works with any model (gating sketch below).
- Used in: Phase 4e (entropy-gated faulting)
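A sketch of the gate, assuming the provider returns per-token top-k logprobs. The 1.95 threshold is our assumption, chosen between the paper's reported means (1.72 correct vs 2.20 errors); L-RAG itself tunes this per operating point.

```ts
// Entropy-gated faulting (sketch): page context in only when the model
// looks uncertain. Threshold 1.95 is an assumption, not the paper's value.
type TopLogprobs = Record<string, number>; // token -> log p(token)

function tokenEntropy(top: TopLogprobs): number {
  // H = -sum(p * ln p) over the returned alternatives (approximates full entropy).
  return -Object.values(top).reduce((h, lp) => h + Math.exp(lp) * lp, 0);
}

function shouldFault(recent: TopLogprobs[], threshold = 1.95): boolean {
  if (recent.length === 0) return false;
  const mean = recent.reduce((s, t) => s + tokenEntropy(t), 0) / recent.length;
  return mean > threshold; // high entropy => uncertain => retrieve more context
}
```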
### A-MAC — Adaptive Memory Admission Control
- Paper: Adaptive Memory Admission Control for LLM Agents
- Authors: Workday AI
- Date: March 2026
- Repo: https://github.com/GuilinDev/Adaptive_Memory_Admission_Control_LLM_Agents
- Key findings: 5-factor admission scorer (Utility, Confidence, Novelty, Recency, TypePrior; sketched below). TypePrior is the most influential factor. Uses a local LLM (Ollama/qwen2.5) for utility scoring. F1=0.583 on LoCoMo. 31% faster than LLM-native memory.
- Used in: Phase 4d (admission control)
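A sketch of the five-factor combination. The weights below are illustrative assumptions (chosen only to reflect the paper's finding that TypePrior dominates); A-MAC scores Utility with a local LLM rather than taking it as an input.

```ts
// Five-factor admission scoring (sketch). Weights are assumed, not A-MAC's.
interface Candidate {
  utility: number;    // 0..1, task usefulness (LLM-scored in A-MAC)
  confidence: number; // 0..1, extraction certainty
  novelty: number;    // 0..1, distance from existing memories
  recency: number;    // 0..1, decays with age
  typePrior: number;  // 0..1, prior for this memory type (dominant factor)
}

const W = { utility: 0.2, confidence: 0.15, novelty: 0.15, recency: 0.15, typePrior: 0.35 };

function admit(c: Candidate, threshold = 0.5): boolean {
  const score =
    W.utility * c.utility + W.confidence * c.confidence + W.novelty * c.novelty +
    W.recency * c.recency + W.typePrior * c.typePrior;
  return score >= threshold; // store only memories worth keeping
}
```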
## Supporting Papers
### Factory — Anchored Iterative Summarization
- Source: Factory's evaluation across 36,000 engineering sessions
- Key findings: Anchored summarization (persistent state with intent/changes/decisions/next_steps; sketched below) outperforms rolling reconstruction. Scores: Factory 4.04 vs Anthropic 3.74 vs OpenAI 3.43 on accuracy/completeness/continuity.
- Used in: Phase 2 (multi-fidelity compression design)
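A sketch of what an anchored state could look like, using the four fields named above; the merge rules (intent replaced, histories accreted) are our assumption about what "anchored" implies, not Factory's published format.

```ts
// Anchored summary state (sketch): updated in place on each compaction,
// rather than re-summarizing the whole history from scratch.
interface AnchoredSummary {
  intent: string;       // what the session is trying to accomplish
  changes: string[];    // edits made so far
  decisions: string[];  // choices taken and why
  next_steps: string[]; // remaining work
}

function updateAnchor(a: AnchoredSummary, delta: Partial<AnchoredSummary>): AnchoredSummary {
  return {
    intent: delta.intent ?? a.intent, // intent is replaced; histories accrete
    changes: [...a.changes, ...(delta.changes ?? [])],
    decisions: [...a.decisions, ...(delta.decisions ?? [])],
    next_steps: delta.next_steps ?? a.next_steps, // plan is rewritten wholesale
  };
}
```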
### SWE-Pruner — Neural Context Pruning for Coding
- Authors: Wang et al., 2026
- Key findings: 0.6B-parameter neural skimmer for task-aware pruning. 23-54% token reduction on SWE-bench. Maintains solve rates.
- Referenced for: Alternative approach to context reduction (learned pruning vs semantic objects)
### ACON — Failure-Driven Compression Optimization
- Venue: arXiv preprint, October 2025
- Key findings: Unified history + observation compression. 26-54% peak context reduction. Gradient-free, works with API models. Iteratively refines compression prompt based on failure cases.
- Referenced for: Compression strategy comparison
### Neural Paging — Learned Page Controller
- Paper: Neural Paging: Learning Context Management Policies for Turing-Complete Agents
- Date: February 2026
- Key findings: Differentiable page controller. Semantic analogue of Belady's optimal eviction. Reduces O(N^2) complexity to O(N*K^2). Theoretical framework.
- Referenced for: Future work (learned eviction policy)
### CMV — DAG-Based Session History Trimming
- Author: Santoni, 2026
- Key findings: DAG-based session history structure. Structurally lossless trimming. Up to 86% reduction for tool-heavy sessions.
- Referenced for: Alternative structural approach
### MemOS — Memory Operating System for AGI
- Authors: Li et al., 2025
- Key findings: Full "Memory OS" with lifecycle control and persistent representations.
- Referenced for: Long-term architecture vision
### SideQuest — KV Cache Eviction via Parallel Reasoning
- Authors: Kariyappa & Suh, 2026
- Key findings: Fine-tuned parallel reasoning thread for KV cache eviction. 56-65% peak memory reduction. Irreversible eviction.
- Referenced for: KV-cache-level optimization (complementary to our message-level approach)
### Quest — Query-Aware KV Cache Sparsity
- Paper: Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference
- Venue: ICML 2024, MIT Han Lab
- Repo: https://github.com/mit-han-lab/Quest
- Key findings: 2.23x self-attention speedup, 7.03x inference latency reduction. Query-aware page selection within KV cache.
- Referenced for: Within-model context selection (different layer than our system)
### SpeContext — Speculative Context Sparsity
- Paper: SpeContext: Enabling Efficient Long-context Reasoning
- Authors: SJTU / Infinigence-AI, November 2025
- Key findings: Small draft model predicts important KV cache tokens before main model runs. Analogous to speculative decoding but for context selection.
- Referenced for: Helper model concept (similar philosophy at different layer)
### SoK: Agentic RAG
- Paper: SoK: Agentic RAG: Taxonomy, Architectures, Evaluation
- Date: March 2026
- Key findings: Definitive 2026 survey. Taxonomy of planning, retrieval, memory, and tool coordination patterns. Identifies risks: compounding hallucination, memory poisoning, retrieval misalignment.
- Referenced for: Taxonomy and risk awareness
### Mem0 — Fact Extraction + Merge Pipeline
- Paper: Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory
- Repo: https://github.com/mem0ai/mem0 (49,561 stars)
- Key findings: 2-LLM-call pipeline (extract facts -> diff/merge with existing; sketched below). +26% accuracy over OpenAI Memory on LoCoMo. 91% faster, 90% fewer tokens. 20+ vector store backends.
- Referenced for: Future cross-session memory (Phase 6+)
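A sketch of the two-call shape only; `llm` is a hypothetical completion function, and the prompts and op format are placeholders, not Mem0's API.

```ts
// Mem0-style two-call pipeline (sketch): extract facts, then reconcile them
// against existing memories as ADD/UPDATE/DELETE operations.
type MemOp = { kind: "ADD" | "UPDATE" | "DELETE" | "NOOP"; fact: string };

async function memorize(
  messages: string[],
  existing: string[],
  llm: (prompt: string) => Promise<string>, // hypothetical completion call
): Promise<MemOp[]> {
  // Call 1: extract salient facts from the new turns.
  const facts: string[] = JSON.parse(
    await llm(`Extract salient facts as a JSON string array:\n${messages.join("\n")}`),
  );
  // Call 2: diff/merge against what is already stored.
  return JSON.parse(
    await llm(
      `Existing memories: ${JSON.stringify(existing)}\nNew facts: ${JSON.stringify(facts)}\n` +
      `Return a JSON array of {kind, fact} operations to apply.`,
    ),
  );
}
```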
## Key Repositories
### Direct Dependencies
| Repo | What We Use | Phase |
|---|---|---|
| fsgeek/pichay | Fork as starting point for proxy | Phase 1 |
| pgvector/pgvector | PostgreSQL vector similarity | Phase 3+ |
| sentence-transformers | all-MiniLM-L6-v2 embeddings | Phase 3+ |
### Reference Implementations
| Repo | What We Learn From | Stars |
|---|---|---|
| letta-ai/letta | 3-tier memory architecture, archival search | 15k+ |
| mem0ai/mem0 | Fact extraction pipeline, multi-backend vector store | 49k+ |
| alibaizhanov/mengram | 3-memory-type system (semantic/episodic/procedural) | 86 |
| PavanVkAlapati/memory_orchestration | Layered memory with Qdrant + Redis + MongoDB | - |
| GuilinDev/Adaptive_Memory_Admission_Control_LLM_Agents | A-MAC admission scoring | - |
| vivek-tiwari-vt/agmem | Git-like version control for agent memories | - |
| lm-sys/RouteLLM | BERT classifier router for model selection | - |
### MCP Servers (reference for Phase 5)
| Repo | What It Does |
|---|---|
| adamrdrew/agent-memory-mcp | Hybrid BM25 + vector search, local embeddings, 12 memory categories |
| Parswanadh/memory-mcp-server | 3-tier hierarchical memory (working/short-term/long-term) |
| vbcherepanov/claude-total-memory | 4-tier search, 20 tools, ChromaDB + SQLite |
| van-reflect/Reflect-Memory | Cross-agent memory, vendor-neutral |
## OpenCode / Oh-My-OpenCode Integration Points
### OpenCode Plugin Hooks (from sst/opencode)
| Hook | Location | Purpose for Mnemosyne |
|---|---|---|
| `experimental.chat.messages.transform` | `packages/opencode/src/session/prompt.ts:652` | Modify message array before LLM call (context assembly) |
| `experimental.session.compacting` | `packages/opencode/src/session/compaction.ts:169` | Custom compaction prompt/context |
| `experimental.chat.system.transform` | `packages/opencode/src/session/llm.ts:84` | Modify system prompt (inject memory instructions) |
| `tool.execute.before` | `packages/plugin/src/index.ts:184` | Intercept tool args before execution |
| `tool.execute.after` | `packages/plugin/src/index.ts:192` | Process tool results for object creation |
| `chat.params` | `packages/opencode/src/session/llm.ts:114` | Modify temperature and other options |
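A hypothetical skeleton showing where Mnemosyne would attach. The hook names come from the table above, but the plugin shape and handler signatures are assumptions to be verified against `packages/plugin/src/index.ts`, not the published API.

```ts
// Hypothetical Mnemosyne plugin skeleton; signatures are assumed, not sst/opencode's.
export const MnemosynePlugin = async () => ({
  "experimental.chat.messages.transform": async (messages: unknown[]) => {
    // Context assembly: swap evicted messages for low-fidelity stubs here.
    return messages;
  },
  "tool.execute.after": async (_input: unknown, output: { output?: string }) => {
    // Object creation: persist large tool results to the object store and
    // leave only a reference in the transcript.
    return output;
  },
});
```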
### Oh-My-OpenCode Hooks (from omc-sh/oh-my-opencode)
| Hook | Purpose for Mnemosyne |
|---|---|
| `context-window-monitor` | Existing hook; can extend or replace |
| `preemptive-compaction` | Existing hook; integrate with our pressure system |
| `tool-output-truncator` | Existing hook; our fidelity system supersedes this |
| `compaction-context-injector` | Inject our memory state into compaction prompt |
## Benchmark Datasets
For evaluating memory quality:
| Dataset | What It Tests | URL |
|---|---|---|
| LoCoMo | Long-conversation memory (QA over multi-session chat) | https://github.com/letta-ai/letta/tree/main/tests |
| PerLTQA | Personalized long-term QA | Referenced in xMemory paper |
| SWE-bench | Coding task completion (for measuring quality impact) | https://github.com/princeton-nlp/SWE-bench |
| Terminal-Bench | CLI agent task completion | Referenced in Letta Code evaluation |
## Key Metrics from Literature
| System | Context Reduction | Quality Impact | Cost |
|---|---|---|---|
| Pichay (baseline eviction) | 37% of tokens; up to 93% in extreme cases | 0.0254% fault rate | Zero (proxy only) |
| SWE-Pruner | 23-54% | Maintains solve rates | Training cost for 0.6B model |
| ACON | 26-54% peak | 95%+ task accuracy preserved | Multiple LLM calls for training |
| Factory summarization | High | 4.04/5 accuracy score | 1 LLM call per eviction |
| Cursor lazy MCP loading | 46.9% | No degradation | Zero (lazy loading) |
| Cline file deduplication | Variable | None (lossless) | Zero (dedup only) |
| Simple observation masking | ~50% | Matches LLM summarization | Zero |
| L-RAG entropy gating | 26% retrieval reduction | Marginal impact | Logprob monitoring |
| RouteLLM model routing | 85% cost reduction | 95% quality maintained | <10ms per route |