Disabling for the second time: the auto-stub replacement produces content
that opencode rejects, triggering 16+ retries per second. Needs
deeper investigation into which content types can be safely stubbed
vs which ones break the model.
The rate limiter's time.sleep() blocked the single uvicorn worker
thread, deadlocking the entire server (health endpoint, dashboard,
all requests). Removed acquire() from both streaming and non-streaming
paths. The rate limiter still records 429s for circuit breaker stats
but no longer blocks.
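The change can be sketched as a record-only limiter; the class and method names here are illustrative, not the gateway's actual API:

```python
import time
from collections import deque

class RecordingRateLimiter:
    """Records 429s for circuit-breaker stats without ever sleeping
    on the request path (no acquire(), so the worker never blocks)."""

    def __init__(self, window_seconds=60.0):
        self.window_seconds = window_seconds
        self._events = deque()  # timestamps of observed 429s

    def record_429(self, now=None):
        now = time.monotonic() if now is None else now
        self._events.append(now)

    def recent_429s(self, now=None):
        """Count 429s inside the sliding window, pruning old ones."""
        now = time.monotonic() if now is None else now
        while self._events and now - self._events[0] > self.window_seconds:
            self._events.popleft()
        return len(self._events)
```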
When proxying with OAuth (no auth plugin), the gateway applies the same
body transforms as opencode-anthropic-auth:
- Prepend 'You are Claude Code' identity to system prompt
- Replace 'OpenCode' with 'Claude Code' in system text
- Prefix tool names with 'mcp_' in tools and tool_use blocks
These are only applied when ANTHROPIC_AUTH_TOKEN is set.
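Assuming a standard Anthropic Messages payload shape (string `system`, `tools` list, `tool_use` content blocks), the three transforms might be sketched like this; the exact identity wording is an assumption:

```python
import copy

IDENTITY = "You are Claude Code"  # assumed wording of the identity line

def apply_oauth_transforms(body):
    """Sketch of the three body transforms applied before forwarding."""
    body = copy.deepcopy(body)
    # 1. Prepend the identity line to the system prompt.
    system = body.get("system", "")
    if isinstance(system, str):
        system = f"{IDENTITY}\n\n{system}" if system else IDENTITY
        # 2. Replace 'OpenCode' with 'Claude Code' in system text.
        body["system"] = system.replace("OpenCode", "Claude Code")
    # 3. Prefix tool names with 'mcp_' in tools and tool_use blocks.
    for tool in body.get("tools", []):
        if not tool["name"].startswith("mcp_"):
            tool["name"] = "mcp_" + tool["name"]
    for msg in body.get("messages", []):
        if isinstance(msg.get("content"), list):
            for block in msg["content"]:
                if block.get("type") == "tool_use" and not block["name"].startswith("mcp_"):
                    block["name"] = "mcp_" + block["name"]
    return body
```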
Hardcoded 200K window caused 101% pressure at 201K tokens on 1M
models. Now detects model from request payload and sets window_size
accordingly (1M for opus-4-6/sonnet-4-6/sonnet-4-5, 200K for others).
Falls back to 200K for unknown models.
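A sketch of the detection, with the substring list as an illustrative stand-in for the real model mapping:

```python
# Hypothetical substrings; the real mapping lives in the gateway.
MILLION_TOKEN_MODELS = ("opus-4-6", "sonnet-4-6", "sonnet-4-5")
DEFAULT_WINDOW = 200_000

def detect_window_size(payload):
    """Pick window_size from the request's model field, falling
    back to 200K for unknown models."""
    model = payload.get("model", "")
    if any(tag in model for tag in MILLION_TOKEN_MODELS):
        return 1_000_000
    return DEFAULT_WINDOW
```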
Root cause: 306 embedding calls at 61ms each blocked the request
thread for ~19s before forwarding to Anthropic.
- Batch all admitted objects into single embed_batch() call
- Run in background thread (non-blocking)
- store_object accepts pre-computed embeddings
- Goal detection uses turn heuristic instead of blocking embed
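The batch-and-background approach can be sketched as follows; `embed_batch` and `store_object` are stand-ins for the real APIs, passed in as callables:

```python
import threading

def embed_and_store_async(objects, embed_batch, store_object):
    """Embed all admitted objects in one batched call on a background
    thread, then store each with its precomputed embedding. The request
    path starts the thread and moves on without waiting."""
    def work():
        embeddings = embed_batch([obj["text"] for obj in objects])
        for obj, emb in zip(objects, embeddings):
            store_object(obj, embedding=emb)
    t = threading.Thread(target=work, daemon=True)
    t.start()
    return t  # tests may join(); the request path does not
```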
When the SSE filter suppresses text deltas (buffering inside a
memory_cleanup/yuyay-response tag), no bytes reached the client,
causing opencode's SSE read timeout to fire. Now emits ':keepalive'
SSE comments during suppression to keep the connection alive.
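Per the SSE spec, lines starting with ':' are comments that clients ignore, so the substitution can be sketched as (with `suppressed` standing in for the real buffering check):

```python
KEEPALIVE = b":keepalive\n\n"  # SSE comment, ignored by clients

def filter_with_keepalive(events, suppressed):
    """Yield events, substituting an SSE comment whenever a delta is
    suppressed so the client's read timeout never fires."""
    for event in events:
        if suppressed(event):
            yield KEEPALIVE
        else:
            yield event
```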
With 200K+ token contexts, Anthropic can take 60+ seconds to reach
the first token. The 300s timeout was too aggressive for SSE reads
during long thinking phases.
Token bucket at 40 RPM to stay under Max 5x plan ceilings (~50 RPM).
Reads retry-after header from 429 responses to pause precisely.
Circuit breaker trips after 3 consecutive 429s, pausing 30s before
retrying. Stats exposed in /health endpoint.
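A minimal sketch of the trip/cooldown logic under the stated thresholds (class name and fields are assumptions, not the gateway's actual implementation):

```python
import time

class CircuitBreaker:
    """Trips after 3 consecutive 429s, then pauses 30s before
    allowing traffic again."""

    def __init__(self, threshold=3, cooldown=30.0):
        self.threshold = threshold
        self.cooldown = cooldown
        self.consecutive_429s = 0
        self.tripped_at = None

    def on_response(self, status, now=None):
        now = time.monotonic() if now is None else now
        if status == 429:
            self.consecutive_429s += 1
            if self.consecutive_429s >= self.threshold:
                self.tripped_at = now
        else:
            self.consecutive_429s = 0
            self.tripped_at = None

    def allow(self, now=None):
        if self.tripped_at is None:
            return True
        now = time.monotonic() if now is None else now
        if now - self.tripped_at >= self.cooldown:
            self.tripped_at = None
            self.consecutive_429s = 0
            return True
        return False
```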
Root cause of retry loops: mass degradation stubbed 85% of context at
once, confusing the model into infinite retries.
Fixes:
- Cap degradations at 20 per turn (gradual compression)
- Protect objects accessed within last 10 turns from degradation
- Estimate L1/L2 token counts (30%/10% of L0) so FM pressure tracks
correctly after degradation
- Improved stubs: '[compressed 3.4KB -> stub] first 200 chars...'
- Re-enabled _apply_fidelity
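The cap and recency protection might be sketched as follows (field names are illustrative):

```python
MAX_DEGRADATIONS_PER_TURN = 20
PROTECT_RECENT_TURNS = 10

def select_degradation_candidates(objects, current_turn):
    """Pick at most 20 objects per turn, skipping anything accessed
    within the last 10 turns, coldest first."""
    eligible = [
        obj for obj in objects
        if current_turn - obj["last_access_turn"] > PROTECT_RECENT_TURNS
    ]
    eligible.sort(key=lambda obj: obj["last_access_turn"])
    return eligible[:MAX_DEGRADATIONS_PER_TURN]
```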
L1/L2 stub replacement was producing responses that opencode
rejected, triggering rapid retries and rate limiting. Disabled
until stub format is validated for tool_result compatibility.
L1/L2 objects without LLM summaries kept full content as fallback,
increasing context instead of reducing it. Now uses auto-stub
(truncated preview) when no summary exists, ensuring degraded
objects always produce smaller content.
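A sketch of the fallback, using the stub format described above (object fields are assumptions):

```python
def degraded_content(obj, preview_chars=200):
    """Return the LLM summary when present, else a truncated-preview
    auto-stub, so degraded objects always shrink."""
    content = obj["content"]
    if obj.get("summary"):
        return obj["summary"]
    size_kb = len(content.encode()) / 1024
    return f"[compressed {size_kb:.1f}KB -> stub] {content[:preview_chars]}..."
```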
The fidelity distribution chart was always showing 0 for L1-L4
because it queried the ObjectStore (never populated) instead of
the FidelityManager (which actually tracks degradation state).
Window was restored before degrade() was called, so FM always saw
NORMAL pressure internally. Now keeps scaled window through the
degrade call. Adds /api/fidelity debug endpoint showing FM state,
object counts, pressure ratios, and fidelity distribution per session.
When the cleanup filter suppresses a text delta (buffering inside a
tag), the preceding 'event: content_block_delta' header was left in
the output, producing malformed SSE that caused opencode to retry
rapidly and freeze. Now removes the event header alongside the data
line.
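Over SSE's line-based framing, the fix can be sketched as holding the `event:` header until its `data:` line survives the filter (`suppress` stands in for the real buffering check):

```python
def strip_suppressed_delta(lines, suppress):
    """Drop the 'event:' header together with its 'data:' line when
    a delta is suppressed, keeping the stream well-formed SSE."""
    out = []
    pending_event = None
    for line in lines:
        if line.startswith("event:"):
            pending_event = line  # hold until the data line is kept
        elif line.startswith("data:"):
            if suppress(line):
                pending_event = None  # remove header with the data
            else:
                if pending_event is not None:
                    out.append(pending_event)
                    pending_event = None
                out.append(line)
        else:
            out.append(line)
    return out
```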
Blocks were all getting turn=1 because label_messages used a single
global counter. Now derives turn from message position (each user msg
increments the turn). Also updates turn on already-labeled blocks.
Adds /api/blocks endpoint to inspect BlockStore state per session.
This enables collapse_range(1,72) to correctly target early turns.
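The position-based labeling can be sketched as:

```python
def derive_turns(messages):
    """Return a turn number per message: each user message starts a
    new turn, and following assistant messages share that turn."""
    turns = []
    turn = 0
    for msg in messages:
        if msg["role"] == "user":
            turn += 1
        turns.append(turn)
    return turns
```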
Long cleanup tags (e.g. collapse summaries) can span 30+ SSE deltas.
The safety valve was flushing after 6 deltas regardless, dumping
incomplete tags into the output. Now only flushes when buffering a
partial opener (<m, <y) that never resolved — never when inside a
confirmed tag.
The model emits cleanup ops as XML elements (<drop>block:x</drop>,
<release handle="x"/>, <collapse>turns N-M "summary"</collapse>)
but the parser only handled prose format (drop: block:x). Add XML
regex matchers alongside the existing prose parser so both formats
are recognized, executed, and stripped from the streaming output.
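Hypothetical matchers for the three XML op forms might look like this (the real parser keeps its prose matchers alongside these):

```python
import re

XML_OPS = [
    ("drop", re.compile(r"<drop>(.*?)</drop>", re.S)),
    ("release", re.compile(r'<release\s+handle="([^"]+)"\s*/>')),
    ("collapse", re.compile(r"<collapse>(.*?)</collapse>", re.S)),
]

def parse_xml_ops(text):
    """Collect (op, argument) pairs and strip the ops from the text."""
    ops = []
    for name, pattern in XML_OPS:
        for match in pattern.finditer(text):
            ops.append((name, match.group(1).strip()))
        text = pattern.sub("", text)
    return ops, text
```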
The FidelityManager's internal pressure calculation uses its own tracked
object tokens divided by window_size, which is always tiny compared to
the real context. Temporarily scale window_size so the FM's pressure
matches the actual API input_tokens/window ratio, triggering L0→L1→L2
degradations when context exceeds 50%.
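The scaling reduces to one ratio: choose a window such that tracked_tokens divided by it equals the real input_tokens/window ratio. A sketch, with hypothetical parameter names:

```python
def scaled_window(window_size, tracked_tokens, api_input_tokens):
    """Scale window_size so FM pressure (tracked_tokens / window)
    matches the real ratio (api_input_tokens / window_size)."""
    if api_input_tokens <= 0:
        return window_size
    real_pressure = api_input_tokens / window_size
    return max(1, int(tracked_tokens / real_pressure))
```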
Measure incoming_bytes before _preprocess() so bytes_saved reflects true
reduction. Add SSECleanupFilter that intercepts memory_cleanup/yuyay-response
tags in streaming responses, strips them from output, and executes ops
(drops, collapses, releases) in real time. Handles partial tags split
across SSE chunks, with a safety valve that flushes stale buffers when
the buffered text turns out to be plain prose.
The Anthropic SDK appends /messages to baseURL, so the gateway
baseURL must include /v1. Also removes the static baseURL from
opencode.json — the plugin now injects it dynamically only when the
gateway health check passes, so requests fall through directly to
api.anthropic.com when Mnemosyne is not running.
Ultraworked with [Sisyphus](https://github.com/code-yeongyu/oh-my-opencode)
Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>
The /api/benchmark response uses total_admitted, total_rejected,
total_attempts etc. but the dashboard JS was reading admitted, rejected,
attempts. Also fixed the fidelity bar to read the fidelity_distribution
key, and the session table to derive context reduction from the byte ratio.
Ultraworked with [Sisyphus](https://github.com/code-yeongyu/oh-my-opencode)
Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>
Strip <yuyay-response>, <yuyay-manifest>, and <yuyay-query> tags from
the SSE stream before forwarding to the client. The cooperative memory
protocol tags are still processed by the gateway on the next inbound
request — they just no longer leak into the user's visible output.
Handles tags spanning across multiple text_delta SSE events.
Ultraworked with [Sisyphus](https://github.com/code-yeongyu/oh-my-opencode)
Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>
opencode validates tool calls against its own registry and rejects
unknown tools like memory_query. The opencode plugin provides
mnemosyne_query/mnemosyne_status as registered MCP tools instead.
Ultraworked with [Sisyphus](https://github.com/code-yeongyu/oh-my-opencode)
Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>
TypeScript plugin that injects baseURL to route Anthropic API calls
through the Mnemosyne gateway, enriches compaction with memory context,
and provides mnemosyne_status/mnemosyne_query custom tools.
Ultraworked with [Sisyphus](https://github.com/code-yeongyu/oh-my-opencode)
Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>
Admission control, entropy-based micro-faulting, phantom tool
injection for backing store queries, and xMemory session hierarchy
for long conversations (50+ turns).
Ultraworked with [Sisyphus](https://github.com/code-yeongyu/oh-my-opencode)
Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>
Object-addressed memory: segment messages into semantic objects,
embed with sentence-transformers, store in pgvector-backed store,
and reassemble context via goal-aware retrieval.
Ultraworked with [Sisyphus](https://github.com/code-yeongyu/oh-my-opencode)
Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>
5-level fidelity manager (L0-Full to L4-Evicted) with helper LLM
(Haiku 4.5) for intelligent summarization during degradation.
Ultraworked with [Sisyphus](https://github.com/code-yeongyu/oh-my-opencode)
Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>