# Mnemosyne
Mnemosyne is a transparent HTTP proxy designed to manage LLM context memory. It sits between LLM clients and providers like Anthropic or OpenAI, providing semantic segmentation, multi-fidelity compression, and live monitoring to reduce context costs and improve agent performance.
## Overview
Mnemosyne acts as a drop-in proxy. It intercepts API requests, segments content into labeled blocks, and applies multi-fidelity compression to manage the context window. As context grows, Mnemosyne automatically degrades content fidelity to keep the most relevant information available while minimizing token usage.
## Architecture

```
Client → Mnemosyne Proxy → Provider API
                 |
          +------+------+
          |             |
     BlockStore     PageStore
          |             |
  FidelityManager       |
          |             |
  SSECleanupFilter      |
          |             |
     ObjectStore <------+
```
## Features
- Transparent Proxy: Operates as a drop-in replacement for LLM API endpoints. No client-side changes are required.
- Semantic Segmentation: Automatically labels conversation content with tensor/block IDs, allowing the model to reference specific segments.
- Multi-Fidelity Compression: Supports five levels of fidelity (L0-L4):
  - L0: Full content (no compression)
  - L1: Detailed summary (~30% size)
  - L2: Compact summary (~5% size)
  - L3: Metadata stub (~50-100 tokens)
  - L4: Evicted (not in context)
- Pressure-Based Degradation: Automatically downgrades object fidelity based on context window pressure zones (Normal, Caution, Warning, Critical, Emergency).
- SSE Cleanup Filter: Intercepts streaming responses to strip internal memory-management tags (e.g., `<memory_cleanup>`, `<yuyay-response>`) and executes cleanup operations in real time.
- Live Monitoring: Includes an embedded dashboard at `/dashboard` providing real-time metrics on context reduction, fidelity distribution, admission control, and latency.
- REST API: Exposes endpoints for session management, benchmarking, and memory state inspection.
- OpenCode Integration: Includes a plugin for OpenCode to enable memory-aware context injection and status reporting.
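The fidelity ladder and pressure zones above can be sketched as follows. The level names and size targets come from this README, but the zone thresholds and the one-step-per-zone degradation policy are illustrative assumptions, not Mnemosyne's actual implementation:

```python
from enum import IntEnum

class Fidelity(IntEnum):
    L0 = 0  # full content (no compression)
    L1 = 1  # detailed summary (~30% size)
    L2 = 2  # compact summary (~5% size)
    L3 = 3  # metadata stub (~50-100 tokens)
    L4 = 4  # evicted (not in context)

# Hypothetical pressure zones as fractions of the context window.
ZONES = [
    (0.50, "Normal"),
    (0.70, "Caution"),
    (0.85, "Warning"),
    (0.95, "Critical"),
    (1.00, "Emergency"),
]

def pressure_zone(used_tokens: int, window: int) -> str:
    """Map context-window utilization to a pressure zone."""
    ratio = used_tokens / window
    for threshold, zone in ZONES:
        if ratio <= threshold:
            return zone
    return "Emergency"

def degrade(current: Fidelity, zone: str) -> Fidelity:
    """Illustrative policy: drop fidelity one step per zone above Normal."""
    steps = [name for _, name in ZONES].index(zone)
    return Fidelity(min(int(Fidelity.L4), current + steps))
```

Under this sketch, an object held at L0 would fall to L2 once the session enters the Warning zone, freeing roughly 95% of its tokens while keeping a compact summary in context.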
## Installation
Mnemosyne is a Python project managed with uv.
1. Clone the repository:

   ```shell
   git clone https://github.com/jyapayne/mnemosyne
   cd mnemosyne
   ```

2. Install dependencies:

   ```shell
   uv sync
   ```
## Configuration
Mnemosyne is configured via environment variables:
- `ANTHROPIC_API_KEY`: API key for the provider.
- `MNEMOSYNE_PORT`: Port for the proxy server (default: 8080).
- `MNEMOSYNE_HOST`: Host for the proxy server (default: 127.0.0.1).
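For example, to run with the defaults made explicit (the key value is a placeholder):

```shell
export ANTHROPIC_API_KEY="your-key-here"  # replace with your provider key
export MNEMOSYNE_PORT=8080                # default
export MNEMOSYNE_HOST=127.0.0.1           # default
```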
## Usage
Run the proxy server:

```shell
uv run mnemosyne
```
Point your LLM client (e.g., Claude Code) to the proxy URL (e.g., `http://127.0.0.1:8080/v1`).
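Because the proxy is transparent, any client that lets you override its base URL will work. A minimal stdlib sketch of building an Anthropic-style request routed through the local proxy (the model name and API key here are placeholders, and the request body follows the Anthropic Messages API shape):

```python
import json
import urllib.request

# Default proxy address from the Usage section above.
PROXY_BASE = "http://127.0.0.1:8080/v1"

payload = {
    "model": "claude-example",  # placeholder model name
    "max_tokens": 256,
    "messages": [{"role": "user", "content": "Hello"}],
}

# Build the request; Mnemosyne forwards it to the real provider.
req = urllib.request.Request(
    f"{PROXY_BASE}/messages",
    data=json.dumps(payload).encode(),
    headers={"content-type": "application/json", "x-api-key": "your-key-here"},
)
# urllib.request.urlopen(req) would send it, assuming the proxy is running.
```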
## API Reference
- `/api/sessions`: List active conversation sessions.
- `/api/benchmark`: Retrieve performance and token usage metrics.
- `/api/memory`: Inspect the current state of the object store.
- `/api/blocks`: View tracked conversation blocks.
## How It Works

### Request Lifecycle
- Inbound:
  - `label_messages`: Injects `[block:xxxx]` markers into message content.
  - `apply_to_messages`: Drops or collapses blocks based on cleanup operations.
  - `compact_messages`: Compresses remaining content.
  - `apply_fidelity`: Adjusts content based on current fidelity levels.
- Outbound:
  - `SSECleanupFilter`: Strips cleanup tags from streaming responses, executes operations, and forwards the clean stream.
- Post-Response: Updates fidelity pressure, schedules summaries, and records benchmarks.
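The inbound labeling and outbound tag-stripping stages can be sketched as follows. The `[block:xxxx]` marker format and the `<memory_cleanup>` tag name come from this README; the block-ID scheme and the filtering logic are illustrative assumptions, not Mnemosyne's actual code:

```python
import re

def label_messages(messages: list[dict]) -> list[dict]:
    """Inject [block:xxxx] markers so the model can reference segments."""
    labeled = []
    for i, msg in enumerate(messages):
        block_id = f"{i:04x}"  # hypothetical sequential hex ID
        labeled.append({**msg, "content": f"[block:{block_id}] {msg['content']}"})
    return labeled

# Strip memory-management tags from a streamed chunk before forwarding it.
CLEANUP_TAG = re.compile(r"<memory_cleanup>.*?</memory_cleanup>", re.DOTALL)

def strip_cleanup_tags(chunk: str) -> str:
    """Remove cleanup tags; a real filter would also execute the operations."""
    return CLEANUP_TAG.sub("", chunk)
```

Note that a production SSE filter must also buffer tags that are split across stream chunks, which this single-chunk sketch does not handle.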
## License
MIT