# Research References
All papers, repositories, and prior art that informed this design.
## Core Papers
### Pichay — Demand Paging for LLM Context Windows (PRIMARY)
- Paper: The Missing Memory Hierarchy: Demand Paging for LLM Context Windows
- Author: Tony Mason (UBC / Georgia Tech)
- Date: March 2026; accepted at ACM SIGOPS
- Repo: https://github.com/fsgeek/pichay (tag: v0.1.0-paper, commit b56701a)
- Archival: https://doi.org/10.5281/zenodo.18930122
- Key findings: 21.8% structural waste across 857 sessions / 4.45B tokens. 93% context reduction in live deployment. 0.0254% fault rate over 1.4M evictions. Cooperative eviction via phantom tools and cleanup tags. FIFO eviction with pressure zones (sketched below). Transparent HTTP proxy architecture.
- Used in: Phase 1 (fork baseline), Phase 2 (pressure zones, cleanup tags), Phase 3 (phantom tools)
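A minimal sketch of the pressure-zone eviction loop as we read it from the paper. The zone thresholds, the `Message` shape, and the pinning flag are illustrative assumptions, not Pichay's actual values; in Pichay an evicted entry is replaced by a phantom-tool stub the agent can fault back in.

```ts
// FIFO eviction with pressure zones, loosely after Pichay.
// Thresholds (0.7 / 0.9) and types are assumptions for illustration.
type Message = { id: string; tokens: number; pinned?: boolean };
type Zone = "green" | "yellow" | "red";

function zone(used: number, budget: number): Zone {
  const ratio = used / budget;
  if (ratio < 0.7) return "green";  // no pressure: evict nothing
  if (ratio < 0.9) return "yellow"; // moderate pressure: trim back to ~85%
  return "red";                     // high pressure: trim back to ~70%
}

/** Drop oldest unpinned messages (FIFO) until usage falls below the zone target. */
function evict(history: Message[], budget: number): Message[] {
  let used = history.reduce((sum, m) => sum + m.tokens, 0);
  const target = zone(used, budget) === "red" ? 0.7 * budget : 0.85 * budget;
  const kept: Message[] = [];
  for (const msg of history) {
    if (used > target && !msg.pinned) {
      used -= msg.tokens; // evicted; Pichay would leave a phantom-tool stub here
      continue;
    }
    kept.push(msg);
  }
  return kept;
}
```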
### MemGPT / Letta — Virtual Memory for LLMs
- Paper: MemGPT: Towards LLMs as Operating Systems
- Authors: Charles Packer, Sarah Wooders, Kevin Lin, Vivian Fang, Shishir G. Patil, Ion Stoica, Joseph E. Gonzalez (UC Berkeley)
- Date: October 2023 (revised February 2024)
- Repo: https://github.com/letta-ai/letta (SHA: 4cb2f21c)
- Key findings: Three-tier memory hierarchy (core/recall/archival). Agent-initiated paging via tool calls. PostgreSQL + pgvector for archival storage. Partial-evict summarization of the oldest 30% of messages (sketched below). LLM-driven retrieval is surprisingly effective.
- Used in: Object Store design (SCHEMA.md), multi-fidelity concept, backing store architecture (Phase 3)
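A minimal sketch of the tiering and the partial-evict move, under the assumption that the paper's 30% flush ratio applies to the in-context message list. Tier names follow MemGPT; the shapes and the `summarize` callback are illustrative, not Letta's API.

```ts
// MemGPT-style tiers plus partial-evict summarization (sketch).
interface MemoryTiers {
  core: string[];     // always in context: persona, key user facts
  recall: string[];   // full message log, searchable via tool calls
  archival: string[]; // long-term store (pgvector in Letta)
}

/** Replace the oldest 30% of in-context messages with a single summary. */
function partialEvict(
  context: string[],
  tiers: MemoryTiers,
  summarize: (msgs: string[]) => string, // one LLM call in practice
): string[] {
  if (context.length === 0) return context;
  const cut = Math.ceil(context.length * 0.3);
  const evicted = context.slice(0, cut);
  tiers.recall.push(...evicted); // evicted text stays retrievable by search
  return [summarize(evicted), ...context.slice(cut)];
}
```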
### xMemory — Hierarchical Structured Retrieval
- Paper: Beyond RAG for Agent Memory: Retrieval by Decoupling and Aggregation
- Venue: ICML 2026
- Key findings: Standard RAG on agent memory fails due to correlated content. Hierarchical retrieval (messages -> episodes -> semantics -> themes) prevents redundant retrieval. Sparsity-semantics objective for segmentation. Top-down retrieval reduces retrieved tokens while improving relevance (sketched below).
- Used in: Phase 6 (xMemory hierarchy), Phase 4a (segmentation concept)
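A sketch of the top-down traversal idea (themes down to messages), not xMemory's actual algorithm: the node shape, the per-level top-k pruning, and the `score` callback are assumptions standing in for the paper's sparsity-semantics machinery.

```ts
// Top-down retrieval over a theme -> semantic -> episode -> message tree (sketch).
interface MemNode { text: string; children: MemNode[] }
type Scorer = (query: string, text: string) => number; // e.g. embedding cosine

function topDownRetrieve(query: string, roots: MemNode[], score: Scorer, k = 2): string[] {
  let frontier = roots;
  const hits: string[] = [];
  while (frontier.length > 0) {
    // Keep only the k best nodes at this level, pruning whole subtrees early.
    const ranked = [...frontier]
      .sort((a, b) => score(query, b.text) - score(query, a.text))
      .slice(0, k);
    // Leaves are raw messages: collect them instead of descending further.
    hits.push(...ranked.filter((n) => n.children.length === 0).map((n) => n.text));
    frontier = ranked.flatMap((n) => n.children);
  }
  return hits;
}
```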
### L-RAG — Entropy-Based Lazy Context Loading
- Paper: L-RAG: Balancing Context and Retrieval with Entropy-Based Lazy Loading
- Date: January 2026
- Key findings: Token entropy reliably predicts model uncertainty (H=1.72 on correct answers vs H=2.20 on errors, p<0.001). 26% retrieval reduction at a balanced threshold. Training-free. Works with any model (gating sketch below).
- Used in: Phase 4e (entropy-gated faulting)
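A sketch of the gate, assuming the provider returns per-token top-k logprobs. The 1.95 threshold is our assumption, chosen between the paper's reported means (1.72 correct vs 2.20 errors); L-RAG itself tunes this per operating point.

```ts
// Entropy-gated faulting (sketch): page context in only when the model
// looks uncertain. Threshold 1.95 is an assumption, not the paper's value.
type TopLogprobs = Record<string, number>; // token -> log p(token)

function tokenEntropy(top: TopLogprobs): number {
  // H = -sum(p * ln p) over the returned alternatives (approximates full entropy).
  return -Object.values(top).reduce((h, lp) => h + Math.exp(lp) * lp, 0);
}

function shouldFault(recent: TopLogprobs[], threshold = 1.95): boolean {
  if (recent.length === 0) return false;
  const mean = recent.reduce((s, t) => s + tokenEntropy(t), 0) / recent.length;
  return mean > threshold; // high entropy => uncertain => retrieve more context
}
```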
### A-MAC — Adaptive Memory Admission Control
- Paper: Adaptive Memory Admission Control for LLM Agents
- Authors: Workday AI
- Date: March 2026
- Repo: https://github.com/GuilinDev/Adaptive_Memory_Admission_Control_LLM_Agents
- Key findings: 5-factor admission scorer (Utility, Confidence, Novelty, Recency, TypePrior; sketched below). TypePrior is the most influential factor. Uses a local LLM (Ollama/qwen2.5) for utility scoring. F1=0.583 on LoCoMo. 31% faster than LLM-native memory.
- Used in: Phase 4d (admission control)
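A sketch of the five-factor combination. The weights below are illustrative assumptions (chosen only to reflect the paper's finding that TypePrior dominates); A-MAC scores Utility with a local LLM rather than taking it as an input.

```ts
// Five-factor admission scoring (sketch). Weights are assumed, not A-MAC's.
interface Candidate {
  utility: number;    // 0..1, task usefulness (LLM-scored in A-MAC)
  confidence: number; // 0..1, extraction certainty
  novelty: number;    // 0..1, distance from existing memories
  recency: number;    // 0..1, decays with age
  typePrior: number;  // 0..1, prior for this memory type (dominant factor)
}

const W = { utility: 0.2, confidence: 0.15, novelty: 0.15, recency: 0.15, typePrior: 0.35 };

function admit(c: Candidate, threshold = 0.5): boolean {
  const score =
    W.utility * c.utility + W.confidence * c.confidence + W.novelty * c.novelty +
    W.recency * c.recency + W.typePrior * c.typePrior;
  return score >= threshold; // store only memories worth keeping
}
```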
## Supporting Papers
### Factory — Anchored Iterative Summarization
- Source: Factory's evaluation across 36,000 engineering sessions
- Key findings: Anchored summarization (persistent state with intent/changes/decisions/next_steps; sketched below) outperforms rolling reconstruction. Scores: Factory 4.04 vs Anthropic 3.74 vs OpenAI 3.43 on accuracy/completeness/continuity.
- Used in: Phase 2 (multi-fidelity compression design)
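A sketch of what an anchored state could look like, using the four fields named above; the merge rules (intent replaced, histories accreted) are our assumption about what "anchored" implies, not Factory's published format.

```ts
// Anchored summary state (sketch): updated in place on each compaction,
// rather than re-summarizing the whole history from scratch.
interface AnchoredSummary {
  intent: string;       // what the session is trying to accomplish
  changes: string[];    // edits made so far
  decisions: string[];  // choices taken and why
  next_steps: string[]; // remaining work
}

function updateAnchor(a: AnchoredSummary, delta: Partial<AnchoredSummary>): AnchoredSummary {
  return {
    intent: delta.intent ?? a.intent, // intent is replaced; histories accrete
    changes: [...a.changes, ...(delta.changes ?? [])],
    decisions: [...a.decisions, ...(delta.decisions ?? [])],
    next_steps: delta.next_steps ?? a.next_steps, // plan is rewritten wholesale
  };
}
```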
### SWE-Pruner — Neural Context Pruning for Coding
- Authors: Wang et al., 2026
- Key findings: 0.6B-parameter neural skimmer for task-aware pruning. 23-54% token reduction on SWE-bench. Maintains solve rates.
- Referenced for: Alternative approach to context reduction (learned pruning vs semantic objects)
### ACON — Failure-Driven Compression Optimization
- Venue: arXiv preprint, October 2025
- Key findings: Unified history + observation compression. 26-54% peak context reduction. Gradient-free, works with API models. Iteratively refines compression prompt based on failure cases.
- Referenced for: Compression strategy comparison
### Neural Paging — Learned Page Controller
- Paper: Neural Paging: Learning Context Management Policies for Turing-Complete Agents
- Date: February 2026
- Key findings: Differentiable page controller. Semantic analogue of Belady's optimal eviction. Reduces O(N^2) complexity to O(N*K^2). Theoretical framework.
- Referenced for: Future work (learned eviction policy)
### CMV — DAG-Based Session History Trimming
- Author: Santoni, 2026
- Key findings: DAG-based session history structure. Structurally lossless trimming. Up to 86% reduction for tool-heavy sessions.
- Referenced for: Alternative structural approach
### MemOS — Memory Operating System for AGI
- Authors: Li et al., 2025
- Key findings: Full "Memory OS" with lifecycle control and persistent representations.
- Referenced for: Long-term architecture vision
### SideQuest — KV Cache Eviction via Parallel Reasoning
- Authors: Kariyappa & Suh, 2026
- Key findings: Fine-tuned parallel reasoning thread for KV cache eviction. 56-65% peak memory reduction. Irreversible eviction.
- Referenced for: KV-cache-level optimization (complementary to our message-level approach)
### Quest — Query-Aware KV Cache Sparsity
- Paper: Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference
- Venue: ICML 2024, MIT Han Lab
- Repo: https://github.com/mit-han-lab/Quest
- Key findings: 2.23x self-attention speedup, 7.03x inference latency reduction. Query-aware page selection within KV cache.
- Referenced for: Within-model context selection (different layer than our system)
### SpeContext — Speculative Context Sparsity
- Paper: SpeContext: Enabling Efficient Long-context Reasoning
- Authors: SJTU / Infinigence-AI, November 2025
- Key findings: Small draft model predicts important KV cache tokens before main model runs. Analogous to speculative decoding but for context selection.
- Referenced for: Helper model concept (similar philosophy at different layer)
### SoK: Agentic RAG
- Paper: SoK: Agentic RAG: Taxonomy, Architectures, Evaluation
- Date: March 2026
- Key findings: Definitive 2026 survey. Taxonomy of planning, retrieval, memory, and tool coordination patterns. Identifies risks: compounding hallucination, memory poisoning, retrieval misalignment.
- Referenced for: Taxonomy and risk awareness
### Mem0 — Fact Extraction + Merge Pipeline
- Paper: Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory
- Repo: https://github.com/mem0ai/mem0 (49,561 stars)
- Key findings: 2-LLM-call pipeline (extract facts -> diff/merge with existing; sketched below). +26% accuracy over OpenAI Memory on LoCoMo. 91% faster, 90% fewer tokens. 20+ vector store backends.
- Referenced for: Future cross-session memory (Phase 6+)
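A sketch of the two-call shape only; `llm` is a hypothetical completion function, and the prompts and op format are placeholders, not Mem0's API.

```ts
// Mem0-style two-call pipeline (sketch): extract facts, then reconcile them
// against existing memories as ADD/UPDATE/DELETE operations.
type MemOp = { kind: "ADD" | "UPDATE" | "DELETE" | "NOOP"; fact: string };

async function memorize(
  messages: string[],
  existing: string[],
  llm: (prompt: string) => Promise<string>, // hypothetical completion call
): Promise<MemOp[]> {
  // Call 1: extract salient facts from the new turns.
  const facts: string[] = JSON.parse(
    await llm(`Extract salient facts as a JSON string array:\n${messages.join("\n")}`),
  );
  // Call 2: diff/merge against what is already stored.
  return JSON.parse(
    await llm(
      `Existing memories: ${JSON.stringify(existing)}\nNew facts: ${JSON.stringify(facts)}\n` +
      `Return a JSON array of {kind, fact} operations to apply.`,
    ),
  );
}
```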
## Key Repositories
### Direct Dependencies
| Repo | What We Use | Phase |
|---|---|---|
| fsgeek/pichay | Fork as starting point for proxy | Phase 1 |
| pgvector/pgvector | PostgreSQL vector similarity | Phase 3+ |
| sentence-transformers | all-MiniLM-L6-v2 embeddings | Phase 3+ |
### Reference Implementations
| Repo | What We Learn From | Stars |
|---|---|---|
| letta-ai/letta | 3-tier memory architecture, archival search | 15k+ |
| mem0ai/mem0 | Fact extraction pipeline, multi-backend vector store | 49k+ |
| alibaizhanov/mengram | 3-memory-type system (semantic/episodic/procedural) | 86 |
| PavanVkAlapati/memory_orchestration | Layered memory with Qdrant + Redis + MongoDB | - |
| GuilinDev/Adaptive_Memory_Admission_Control_LLM_Agents | A-MAC admission scoring | - |
| vivek-tiwari-vt/agmem | Git-like version control for agent memories | - |
| lm-sys/RouteLLM | BERT classifier router for model selection | - |
### MCP Servers (reference for Phase 5)
| Repo | What It Does |
|---|---|
| adamrdrew/agent-memory-mcp | Hybrid BM25 + vector search, local embeddings, 12 memory categories |
| Parswanadh/memory-mcp-server | 3-tier hierarchical memory (working/short-term/long-term) |
| vbcherepanov/claude-total-memory | 4-tier search, 20 tools, ChromaDB + SQLite |
| van-reflect/Reflect-Memory | Cross-agent memory, vendor-neutral |
## OpenCode / Oh-My-OpenCode Integration Points
### OpenCode Plugin Hooks (from sst/opencode)
| Hook | Location | Purpose for Mnemosyne |
|---|---|---|
| `experimental.chat.messages.transform` | `packages/opencode/src/session/prompt.ts:652` | Modify message array before LLM call (context assembly) |
| `experimental.session.compacting` | `packages/opencode/src/session/compaction.ts:169` | Custom compaction prompt/context |
| `experimental.chat.system.transform` | `packages/opencode/src/session/llm.ts:84` | Modify system prompt (inject memory instructions) |
| `tool.execute.before` | `packages/plugin/src/index.ts:184` | Intercept tool args before execution |
| `tool.execute.after` | `packages/plugin/src/index.ts:192` | Process tool results for object creation |
| `chat.params` | `packages/opencode/src/session/llm.ts:114` | Modify temperature and other options |
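A hypothetical skeleton showing where Mnemosyne would attach. The hook names come from the table above, but the plugin shape and handler signatures are assumptions to be verified against `packages/plugin/src/index.ts`, not the published API.

```ts
// Hypothetical Mnemosyne plugin skeleton; signatures are assumed, not sst/opencode's.
export const MnemosynePlugin = async () => ({
  "experimental.chat.messages.transform": async (messages: unknown[]) => {
    // Context assembly: swap evicted messages for low-fidelity stubs here.
    return messages;
  },
  "tool.execute.after": async (_input: unknown, output: { output?: string }) => {
    // Object creation: persist large tool results to the object store and
    // leave only a reference in the transcript.
    return output;
  },
});
```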
### Oh-My-OpenCode Hooks (from omc-sh/oh-my-opencode)
| Hook | Purpose for Mnemosyne |
|---|---|
| `context-window-monitor` | Existing hook; can extend or replace |
| `preemptive-compaction` | Existing hook; integrate with our pressure system |
| `tool-output-truncator` | Existing hook; our fidelity system supersedes this |
| `compaction-context-injector` | Inject our memory state into compaction prompt |
## Benchmark Datasets
For evaluating memory quality:
| Dataset | What It Tests | URL |
|---|---|---|
| LoCoMo | Long-conversation memory (QA over multi-session chat) | https://github.com/letta-ai/letta/tree/main/tests |
| PerLTQA | Personalized long-term QA | Referenced in xMemory paper |
| SWE-bench | Coding task completion (for measuring quality impact) | https://github.com/princeton-nlp/SWE-bench |
| Terminal-Bench | CLI agent task completion | Referenced in Letta Code evaluation |
## Key Metrics from Literature
| System | Context Reduction | Quality Impact | Cost |
|---|---|---|---|
| Pichay (baseline eviction) | 37% of tokens; up to 93% in extreme cases | 0.0254% fault rate | Zero (proxy only) |
| SWE-Pruner | 23-54% | Maintains solve rates | Training cost for 0.6B model |
| ACON | 26-54% peak | 95%+ task accuracy preserved | Multiple LLM calls for training |
| Factory summarization | High | 4.04/5 accuracy score | 1 LLM call per eviction |
| Cursor lazy MCP loading | 46.9% | No degradation | Zero (lazy loading) |
| Cline file deduplication | Variable | None (lossless) | Zero (dedup only) |
| Simple observation masking | ~50% | Matches LLM summarization | Zero |
| L-RAG entropy gating | 26% retrieval reduction | Marginal impact | Logprob monitoring |
| RouteLLM model routing | 85% cost reduction | 95% quality maintained | <10ms per route |