feat: add challenge and red-blue competitions across API and web

This commit is contained in:
Joey Yakimowich-Payne 2025-10-01 06:49:09 -06:00
commit 8fd3c4bb64
No known key found for this signature in database
GPG key ID: 6BFE655FA5ABD1E1
77 changed files with 5355 additions and 24 deletions


## Prompt Hacking Challenges — Design
### Overview
Enable developer-authored prompt hacking challenges inside Dify's workflow builder via a new workflow node. Players can register/log in using the existing web auth and compete on challenges. Attempts are recorded server-side, and leaderboards are exposed via public web APIs.
### Goals
- Add a first-class workflow node that evaluates success/failure against developer-specified criteria.
- Add a Judging LLM node that compares model outputs to the challenge goal and produces pass/fail, textual feedback, and a 1–10 rating.
- Persist attempts with metadata for scoring and leaderboards.
- Reuse existing account/web auth for players.
- Fit Dify's DDD/Clean Architecture: models, services, controllers, workflow nodes, and frontend builder integration.
### Non-Goals
- Anti-cheat measures beyond simple rate limiting.
- Complex custom scoring plugins (design leaves a hook for future work).
## Architecture Summary
### Backend components
- Models (SQLAlchemy)
- Challenge
- id, tenant_id, app_id, workflow_id
- name, description, goal (plain text shown to players)
- success_type: one of ['regex', 'contains', 'custom']
- success_pattern: string (regex or substring depending on type)
- secret_ref: reference to server-side secret (never exposed to clients)
- scoring_strategy: one of ['first', 'fastest', 'fewest_tokens', 'custom']
- is_active: bool
- created_by, created_at, updated_by, updated_at
- ChallengeAttempt
- id, tenant_id, challenge_id (FK), end_user_id (FK), workflow_run_id (optional FK)
- succeeded: bool
- score: numeric (meaning depends on strategy)
- judge_rating: int (0–10)
- judge_feedback: text
- judge_output_raw: jsonb (optional; structured judgement payload)
- tokens_total: int (when available from run metrics)
- elapsed_ms: int (when available)
- created_at
- Service layer (e.g., `ChallengeService`, `ChallengeJudgeService`)
- evaluate_outcome(output, cfg) -> (succeeded: bool, details: dict)
- judge_with_llm(goal, response, cfg) -> { passed: bool, rating: int, feedback: str, raw?: dict }
- evaluate_with_plugin(evaluator_ref, goal, response, ctx) -> { passed: bool, rating?: int, feedback?: str, raw?: dict }
- score_with_plugin(scorer_ref, attempt_metrics, ctx) -> { score: number, details?: dict }
- record_attempt(tenant_id, challenge_id, end_user_id, run_meta, succeeded) -> ChallengeAttempt
- get_leaderboard(challenge_id, limit, strategy) -> list
- get_challenge_public(challenge_id) -> dict
- Controllers
- Console (for creators): CRUD on challenges under the workspace (`/console/api/challenges`)
- Web (for players): public endpoints under `/web/api/challenges`
- List active challenges, fetch details, fetch leaderboard
- Optional auth via existing web login for personalization, otherwise anonymous read
- Workflow nodes
- NodeType: `challenge-evaluator`
- Config
- `challenge_id`: reference to a stored Challenge (preferred)
- or inline config: `success_type`, `success_pattern`, `scoring_strategy`
- `mask_variables`: string[] — variable names to redact in logs
- Execution
- Consumes upstream content (typically latest assistant output)
- Evaluates success with `ChallengeService.evaluate_outcome`
- If an `EndUser` context exists and a `challenge_id` is present, writes `ChallengeAttempt`
- Outputs `{ challenge_succeeded: boolean, message?: string }`, optionally passes through original output
- NodeType: `judging-llm`
- Purpose: judge a model response against the challenge goal using an LLM rubric.
- Config
- `judge_model`: provider/name/version
- `temperature`, `max_tokens`, other model params
- `rubric_prompt_template`: template with placeholders for {goal}, {response}, optional {hints}
- `rating_scale`: default 0–10; configurable upper bound optional
- `pass_threshold`: integer (default 5)
- Inputs
- `goal`: the attacking goal or acceptance criteria
- `response`: the model output to evaluate
- Execution
- Calls `ChallengeJudgeService.judge_with_llm()` to obtain structured judgement
- Returns outputs `{ judge_passed: boolean, judge_rating: number (0–10), judge_feedback: string, judge_raw?: object }`
- Integration
- Downstream `challenge-evaluator` can consume `judge_passed` and `judge_rating` to record an attempt instead of regex/contains
- Alternatively, `challenge-evaluator` may support an `evaluation_mode: 'rules' | 'llm-judge'` to invoke judging internally
- NodeType: `team-challenge` (Red/Blue orchestrator)
- Purpose: orchestrate two-sided challenges where players choose Red (attack) or Blue (defense) and submit prompts accordingly. The node pairs attacks and defenses, configures the LLM, invokes judging, and emits scores.
- Config
- `red_blue_challenge_id`: reference to a Red/Blue challenge definition
- `defense_selection_policy`: 'latest_best' | 'random_active' | 'round_robin' | 'request_new_if_none'
- `attack_selection_policy`: same options for the defense side evaluation path
- `judge_suite`: list of category tests to run (e.g., CBRNE, SA, SH, RWH, V, M)
- `scoring_strategy`: 'red_blue_ratio' | 'custom'
- Inputs
- `team_choice`: 'red' | 'blue'
- `attack_prompt?`: string (when `team_choice = 'red'`)
- `defense_prompt?`: string (when `team_choice = 'blue'`)
- Execution (high level)
- If `team_choice = 'red'`:
- Persist `attack_prompt` submission
- Load a defense by policy; if none, optionally signal Blue to provide one (async) and fall back to last known
- Configure LLM with defense as system prompt, submit attack as user message
- Run `judge_suite` via `judging-llm`; compute Red score
- If `team_choice = 'blue'`:
- Persist `defense_prompt` submission
- Load an attack by policy; if none, signal Red to provide one (async) and fall back to last known
- Configure LLM with defense as system prompt and submit the loaded attack
- Run `judge_suite`; compute Blue score (prevention)
- Persist pairing and metrics
- Outputs
- `{ team: 'red'|'blue', judge_passed: boolean, judge_rating: number, judge_feedback: string, categories: Record<string, boolean|number>, team_points: number, total_points: number }`
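The rules-based path of `ChallengeService.evaluate_outcome` described above can be sketched as follows (a minimal sketch; `cfg` mirrors the `success_type`/`success_pattern` fields of the Challenge model, and the return shape is illustrative):

```python
import re
from typing import Any


def evaluate_outcome(output: str, cfg: dict[str, Any]) -> tuple[bool, dict]:
    """Rules-based evaluation: 'regex' or 'contains' per the Challenge config."""
    success_type = cfg.get("success_type", "contains")
    pattern = cfg.get("success_pattern", "")
    if success_type == "regex":
        match = re.search(pattern, output, re.MULTILINE)
        return bool(match), {"match": match.group(0) if match else None}
    if success_type == "contains":
        # Case-insensitive substring match, as specified for 'contains'.
        return pattern.lower() in output.lower(), {}
    # 'custom' is dispatched through evaluate_with_plugin instead.
    return False, {"error": f"unsupported success_type: {success_type}"}
```

The `custom` branch intentionally falls through, since plugin evaluation runs on a separate service path.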
### Frontend components
- Workflow builder
- Add `Prompt Challenge` to the node palette
- Add `Judging LLM` to the node palette
- Node editor panel: select existing Challenge or define inline success criteria
- Judging panel: choose model, edit rubric prompt, set pass threshold, preview structured outputs
- Custom evaluator/scorer panels: choose plugin and configure JSON settings with live schema validation
- I18n strings in `web/i18n/en-US/`
- Challenge display & theming
- Author-provided instructions (Markdown) render before/alongside the task input area
- Theme tokens (colors, logo, background) applied to challenge pages
- Optional hero image/video via existing `UploadFile` and signed URLs
- Optional player UX (phase 2)
- `/challenges` list and `/challenges/[id]` details with leaderboard
- `/challenge-collections` list and `/challenge-collections/[id]` details with collection leaderboard
- Use existing web login endpoints
## Data Model
Minimal table shapes (final columns managed in migration):
```sql
-- challenges
id (uuid pk)
tenant_id (uuid fk)
app_id (uuid fk)
workflow_id (uuid fk)
name (text)
description (text)
goal (text)
success_type (text)
success_pattern (text)
secret_ref (text)
scoring_strategy (text)
is_active (bool)
created_by (uuid)
created_at (timestamp)
updated_by (uuid)
updated_at (timestamp)
-- challenge_attempts
id (uuid pk)
tenant_id (uuid fk)
challenge_id (uuid fk)
end_user_id (uuid fk)
workflow_run_id (uuid fk, nullable)
succeeded (bool)
score (numeric)
tokens_total (int)
elapsed_ms (int)
created_at (timestamp)
```
Additional columns for judging:
```sql
ALTER TABLE challenge_attempts
ADD COLUMN judge_rating integer,
ADD COLUMN judge_feedback text,
ADD COLUMN judge_output_raw jsonb;
```
Optional columns for custom evaluators/scorers:
```sql
ALTER TABLE challenges
ADD COLUMN evaluator_type text DEFAULT 'rules', -- one of: rules, llm-judge, custom
ADD COLUMN evaluator_plugin_id text,
ADD COLUMN evaluator_entrypoint text, -- e.g., "pkg.module:Evaluator"
ADD COLUMN evaluator_config jsonb,
ADD COLUMN scoring_plugin_id text,
ADD COLUMN scoring_entrypoint text, -- e.g., "pkg.module:Scorer"
ADD COLUMN scoring_config jsonb;
```
Additional tables for Red/Blue team challenges:
```sql
-- red_blue_challenges (definition)
CREATE TABLE red_blue_challenges (
id uuid PRIMARY KEY DEFAULT uuid_generate_v4(),
tenant_id uuid NOT NULL REFERENCES tenants(id),
app_id uuid NOT NULL REFERENCES apps(id),
workflow_id uuid REFERENCES workflows(id),
name text NOT NULL,
description text,
judge_suite jsonb NOT NULL, -- list of categories/tests
defense_selection_policy text NOT NULL DEFAULT 'latest_best',
attack_selection_policy text NOT NULL DEFAULT 'latest_best',
scoring_strategy text NOT NULL DEFAULT 'red_blue_ratio',
theme jsonb,
instructions_md text,
is_active boolean NOT NULL DEFAULT true,
created_by uuid,
created_at timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
updated_by uuid,
updated_at timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP
);
-- team_submissions (attack/defense prompts)
CREATE TABLE team_submissions (
id uuid PRIMARY KEY DEFAULT uuid_generate_v4(),
red_blue_challenge_id uuid NOT NULL REFERENCES red_blue_challenges(id) ON DELETE CASCADE,
tenant_id uuid NOT NULL,
account_id uuid NULL,
end_user_id uuid NULL,
team text NOT NULL CHECK (team in ('red','blue')),
prompt text NOT NULL,
active boolean NOT NULL DEFAULT true,
created_at timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP
);
-- pairings (which attack tested against which defense)
CREATE TABLE team_pairings (
id uuid PRIMARY KEY DEFAULT uuid_generate_v4(),
red_blue_challenge_id uuid NOT NULL REFERENCES red_blue_challenges(id) ON DELETE CASCADE,
tenant_id uuid NOT NULL,
attack_submission_id uuid REFERENCES team_submissions(id),
defense_submission_id uuid REFERENCES team_submissions(id),
judge_output_raw jsonb,
categories jsonb, -- e.g., per-suite pass/fail or rating
judge_rating integer,
judge_feedback text,
red_points numeric NOT NULL DEFAULT 0,
blue_points numeric NOT NULL DEFAULT 0,
tokens_total int,
elapsed_ms int,
created_at timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP
);
```
## API Design
### Console (creator)
- `GET /console/api/challenges?app_id=...` — list
- `POST /console/api/challenges` — create
- `GET /console/api/challenges/{id}` — retrieve
- `PATCH /console/api/challenges/{id}` — update
- `DELETE /console/api/challenges/{id}` — delete
All require console `login_required` and tenant membership.
### Web (player)
- `GET /web/api/challenges` — list active challenges (public)
- `GET /web/api/challenges/{id}` — details (public)
- `GET /web/api/challenges/{id}/leaderboard?limit=...` — leaderboard (public)
Player login uses the existing web login endpoints to obtain an access token when needed for personalization.
### Collections
Console (creator)
- `GET /console/api/challenge-collections?app_id=...`
- `POST /console/api/challenge-collections`
- `GET /console/api/challenge-collections/{id}`
- `PATCH /console/api/challenge-collections/{id}`
- `DELETE /console/api/challenge-collections/{id}`
- `PUT /console/api/challenge-collections/{id}/challenges` (set membership and order)
Web (player)
- `GET /web/api/challenge-collections` — list public collections
- `GET /web/api/challenge-collections/{id}` — collection details (instructions/theme), included challenges
- `GET /web/api/challenge-collections/{id}/leaderboard?limit=...` — collection leaderboard
### Red/Blue team challenge APIs
Console (creator)
- `POST /console/api/red-blue-challenges` — create
- `GET /console/api/red-blue-challenges?app_id=...` — list
- `GET /console/api/red-blue-challenges/{id}` — detail
- `PATCH /console/api/red-blue-challenges/{id}` — update
- `DELETE /console/api/red-blue-challenges/{id}` — delete
- `GET /console/api/red-blue-challenges/{id}/pairings` — view pairings/metrics
Web (player)
- `POST /web/api/red-blue-challenges/{id}/join` — join red or blue (payload: { team })
- `POST /web/api/red-blue-challenges/{id}/submit` — submit attack/defense (payload: { team, prompt })
- `GET /web/api/red-blue-challenges/{id}` — public info (instructions, theme, leaderboard snapshot)
- `GET /web/api/red-blue-challenges/{id}/leaderboard?limit=...` — red vs blue standings
## Player Registration & Identity
### Registration and login
- Reuse existing web auth service for player accounts:
- Email/password login: `POST /web/api/login`
- Email code login: `POST /web/api/login/email-code/send` + `POST /web/api/login/email-code/verify` (existing patterns)
- Add an explicit web registration endpoint (thin wrapper around `RegisterService.register`):
- `POST /web/api/register` (payload: email, name, password | email-code)
- Behavior:
- `create_workspace_required = False` to avoid auto-creating workspaces for players
- `status = active`
- Set `interface_language` from `Accept-Language`, as done in the OAuth flow
- On success, also create or associate a per-tenant `EndUser` record so gameplay runs can be attributed consistently.
### Player identity during runs
- Each gameplay run already has an `EndUser` context. For registered players:
- When a player is authenticated, resolve (or lazily create) an `EndUser` tied to their `account_id` for the current tenant/app
- Persist `end_user_id` to `ChallengeAttempt` as today; optionally also store `account_id` for simplified leaderboard personalization
### Optional schema addition
```sql
ALTER TABLE challenge_attempts
ADD COLUMN account_id uuid NULL;
```
This enables direct joins to accounts for notification and profile display without traversing end-user mappings.
### Player profile (optional)
Introduce a lightweight `player_profiles` table for nickname/avatar/notification preferences without touching `account` directly:
```sql
CREATE TABLE player_profiles (
account_id uuid PRIMARY KEY REFERENCES accounts(id) ON DELETE CASCADE,
display_name text,
avatar_url text,
notify_on_first_blood boolean DEFAULT true,
notify_on_record_beaten boolean DEFAULT true,
created_at timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
updated_at timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP
);
```
## Workflow Node Execution
1. Node receives upstream output (string or structured content). A typical placement is after an LLM node.
2. Node loads Challenge config (stored by `challenge_id` or inline).
3. Node evaluates success by either rules or judging:
- `regex`: test pattern against text output
- `contains`: case-insensitive substring match
- `llm-judge`: call the `judging-llm` node (or internal judge) to obtain `{ judge_passed, judge_rating, judge_feedback }`
- `custom`: call `evaluate_with_plugin` using configured `evaluator_plugin_id`/`evaluator_entrypoint`
4. If `EndUser` and `challenge_id` present, record a `ChallengeAttempt` with run metrics (tokens, elapsed), and when available, `judge_rating`/`judge_feedback`.
5. Node outputs
- Rules mode: `{ challenge_succeeded: boolean, message?: string }`
- Judging mode: `{ challenge_succeeded: boolean, judge_rating: number, judge_feedback: string }`
- Pass through original output for chaining when needed.
For collections, attempts are recorded per challenge as usual. Collection leaderboard aggregation is computed over a player's best attempt per challenge, combined using the collection's `scoring_strategy` (e.g., sum of scores, total `elapsed_ms`, etc.).
## Scoring Strategies
- `first`: first successful attempt time wins (leaderboard sorted by earliest `created_at`).
- `fastest`: success with lowest `elapsed_ms` wins.
- `fewest_tokens`: success with lowest `tokens_total` wins.
- `highest_rating`: success with the highest `judge_rating` wins; ties broken by earliest `created_at`.
- `custom`: compute via `score_with_plugin` using `scoring_plugin_id`/`scoring_entrypoint`.
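The built-in per-challenge orderings above can be expressed as sort keys over successful attempts (illustrative sketch; attempt dicts use the `challenge_attempts` column names, `created_at` as a comparable timestamp):

```python
def leaderboard_key(strategy: str, attempt: dict):
    """Sort key for successful attempts under each built-in strategy."""
    if strategy == "first":
        return attempt["created_at"]            # earliest wins
    if strategy == "fastest":
        return attempt["elapsed_ms"]            # lowest wins
    if strategy == "fewest_tokens":
        return attempt["tokens_total"]          # lowest wins
    if strategy == "highest_rating":
        # Highest rating wins; ties broken by earliest created_at.
        return (-attempt["judge_rating"], attempt["created_at"])
    raise ValueError(f"custom strategies use score_with_plugin: {strategy}")


def rank(strategy: str, attempts: list[dict]) -> list[dict]:
    """Order successful attempts best-first for the leaderboard."""
    successful = [a for a in attempts if a.get("succeeded")]
    return sorted(successful, key=lambda a: leaderboard_key(strategy, a))
```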
Collection strategies:
- `sum`: sum of per-challenge scores in the collection (uses built-in or custom scoring per challenge)
- `fastest_total`: sum of `elapsed_ms` of successful best attempts (lower is better)
- `fewest_tokens_total`: sum of `tokens_total` of successful best attempts (lower is better)
- `highest_avg_rating`: average of `judge_rating` across completed challenges (higher is better)
- `custom`: plugin-defined; service calls `score_with_plugin` at collection level with a list of per-challenge metrics
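Collection aggregation over a player's best per-challenge attempts could look roughly like this (hedged sketch; `best_attempts` is assumed to hold one best attempt per challenge, with fields named after the `challenge_attempts` columns):

```python
def collection_score(strategy: str, best_attempts: list[dict]) -> float:
    """Combine per-challenge best attempts into one collection score."""
    succeeded = [a for a in best_attempts if a.get("succeeded")]
    if strategy == "sum":
        return sum(a.get("score", 0.0) for a in succeeded)
    if strategy == "fastest_total":          # lower is better
        return sum(a.get("elapsed_ms", 0) for a in succeeded)
    if strategy == "fewest_tokens_total":    # lower is better
        return sum(a.get("tokens_total", 0) for a in succeeded)
    if strategy == "highest_avg_rating":     # higher is better
        ratings = [a["judge_rating"] for a in succeeded if a.get("judge_rating") is not None]
        return sum(ratings) / len(ratings) if ratings else 0.0
    raise ValueError(f"custom strategies go through score_with_plugin: {strategy}")
```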
Red/Blue team scoring:
- Base idea: award Red points for breakthroughs and Blue points for prevented attacks.
- Suggested defaults per pairing:
- For each category in the judge suite (e.g., CBRNE, SA, SH, RWH, V, M):
- If the attack bypasses defense (category breach), Red +1
- If defense prevents (no breach), Blue +1
- Bonus based on `judge_rating` magnitude for breakthrough severity (e.g., Red +round(rating/3))
- Time/token penalties can reduce points to encourage efficient strategies
- Ratio-based standings:
- Red ratio = Red points / (Red points + Blue points)
- Blue ratio = Blue points / (Red points + Blue points)
- Custom plugin scoring:
- Provide all pairing metrics to a scorer plugin to compute per-pairing or cumulative standings
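One plausible reading of the suggested defaults above, sketched per pairing (illustrative only; category weighting and time/token penalties are omitted):

```python
def score_pairing(categories: dict[str, bool], judge_rating: int) -> tuple[float, float]:
    """Per category: breach -> Red +1, prevented -> Blue +1.

    When at least one category is breached, Red also receives the
    suggested breakthrough bonus of round(judge_rating / 3).
    """
    red = float(sum(1 for breached in categories.values() if breached))
    blue = float(sum(1 for breached in categories.values() if not breached))
    if red > 0:
        red += round(judge_rating / 3)
    return red, blue


def standings(red_points: float, blue_points: float) -> tuple[float, float]:
    """Ratio-based standings: each side's share of total points."""
    total = red_points + blue_points
    if total == 0:
        return 0.0, 0.0
    return red_points / total, blue_points / total
```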
## Custom Evaluators & Scorers
This section specifies how custom evaluation and scoring plugins integrate with challenges.
### Concepts
- Evaluator: decides whether a response meets the goal. May optionally emit a rating (0–10) and textual feedback.
- Scorer: converts an attempt's metrics (e.g., elapsed time, tokens, rating) into a numeric score for leaderboards.
### Data model
- `challenges.evaluator_type`: one of `rules`, `llm-judge`, or `custom`.
- `challenges.evaluator_plugin_id`, `evaluator_entrypoint`, `evaluator_config`: identify and configure the evaluator plugin when `custom` is selected.
- `challenges.scoring_plugin_id`, `scoring_entrypoint`, `scoring_config`: identify and configure the scorer plugin when `scoring_strategy = 'custom'`.
### Service interfaces
Evaluator interface (Python):
```python
class EvaluatorContext(TypedDict, total=False):
tenant_id: str
app_id: str
workflow_id: str
challenge_id: str
end_user_id: str | None
variables: dict[str, Any] # sanitized runtime variables
timeout_ms: int
class EvaluatorResult(TypedDict, total=False):
passed: bool
rating: int # 0–10 (optional)
feedback: str # textual feedback for player (optional)
raw: dict[str, Any] # internal diagnostics (optional)
class EvaluatorProtocol(Protocol):
def evaluate(self, goal: str, response: str, config: dict[str, Any], ctx: EvaluatorContext) -> EvaluatorResult: ...
```
Scorer interface (Python):
```python
class ScoringContext(TypedDict, total=False):
tenant_id: str
app_id: str
workflow_id: str
challenge_id: str
end_user_id: str | None
timeout_ms: int
class AttemptMetrics(TypedDict, total=False):
succeeded: bool
tokens_total: int | None
elapsed_ms: int | None
rating: int | None
created_at: int | None # epoch ms
class ScoringResult(TypedDict, total=False):
score: float
details: dict[str, Any] | None
class ScorerProtocol(Protocol):
def score(self, metrics: AttemptMetrics, config: dict[str, Any], ctx: ScoringContext) -> ScoringResult: ...
```
### Discovery and loading
- Plugins are discovered via the existing plugin manager. Each plugin exposes one or more entrypoints (e.g., `pkg.module:Evaluator`).
- `evaluator_plugin_id`/`evaluator_entrypoint` and `scoring_plugin_id`/`scoring_entrypoint` identify the target callables.
- Services load plugins lazily and cache handles with safe import guards.
### Execution flow
1) For `evaluator_type = 'custom'`, the `challenge-evaluator` node calls `evaluate_with_plugin` with `(goal, response, evaluator_config, ctx)`.
2) If `EvaluatorResult.passed` is true, set `challenge_succeeded = True` and persist `judge_rating`/`judge_feedback` if provided.
3) For `scoring_strategy = 'custom'`, call `score_with_plugin` with attempt metrics to compute `score`.
4) Persist `ChallengeAttempt` with plugin-derived fields.
### Frontend configuration
- Prompt Challenge panel
- Evaluation mode: Rules | Judging LLM | Custom Evaluator
- When Custom Evaluator is chosen:
- Plugin selector: lists available evaluator plugins by `plugin_id` and exposed entrypoints
- JSON config editor with schema-based validation (optional `$schema` per plugin)
- Scoring section
- Strategy: First | Fastest | Fewest Tokens | Highest Rating | Custom
- When Custom is chosen: plugin selector + JSON config editor
### Security & sandboxing
- Plugins run under server control with:
- Timeouts (default 5s) and memory ceilings; cancellation on overrun
- No network access by default (opt-in allowlist if ever needed)
- Sanitized inputs: secrets removed; only whitelisted variables passed
- Structured error mapping; no stack traces leaked to players
### Error handling & observability
- If plugin load or execution fails, treat as non-pass and record a generic failure reason.
- Emit structured logs/events with plugin identifiers and durations (no sensitive content).
- Surface minimal feedback to players; detailed diagnostics remain internal.
### Examples
Evaluator (substring with banned terms):
```python
class SimpleEvaluator:
def evaluate(self, goal, response, config, ctx):
required = config.get('must_contain', [])
banned = set(map(str.lower, config.get('banned', [])))
if any(w.lower() in response.lower() for w in banned):
return {'passed': False, 'feedback': 'Banned content detected', 'rating': 2}
if all(w.lower() in response.lower() for w in required):
return {'passed': True, 'feedback': 'Meets criteria', 'rating': 8}
return {'passed': False, 'feedback': 'Missing required signal', 'rating': 5}
```
Scorer (weighted combo):
```python
class WeightedScorer:
def score(self, metrics, config, ctx):
base = 0.0
if metrics.get('succeeded'):
base += config.get('success_bonus', 100)
rating = metrics.get('rating') or 0
elapsed = metrics.get('elapsed_ms') or 0
tokens = metrics.get('tokens_total') or 0
score = base + rating * config.get('rating_weight', 10) \
- (elapsed / 1000.0) * config.get('time_penalty', 1.0) \
- tokens * config.get('token_penalty', 0.01)
return {'score': max(score, 0.0)}
```
## Security & Privacy
- Never expose `secret_ref` or derived secrets to clients or node outputs.
- Redact configured `mask_variables` in logs and stored attempt details.
- Apply rate limiting using existing helpers to mitigate brute-force attempts.
- Store minimal details on failed attempts to reduce information leakage.
- Sanitize Markdown instructions to prevent XSS; allow a safe subset (links/images) with rel=noopener.
- Theme application is constrained to a whitelist of CSS variables and asset URLs served via signed URLs.
## Testing Plan
- Service unit tests
- `evaluate_outcome` for regex/contains (edge cases, unicode, multiline)
- `judge_with_llm` deterministic tests with mocked LLM returning structured payloads
- `record_attempt` scoring aggregation and sorting
- Node tests
- Given inputs, assert success/failure and resulting outputs
- Judging node: asserts `{ judge_passed, judge_rating, judge_feedback }` shape and thresholds
- When `challenge_id` present, attempts are written; when not, none are written
- API tests
- Console CRUD happy paths and permissions
- Web endpoints list/details/leaderboard
- Frontend
- Panel validation, serialization/deserialization of node config
- Judging panel: model selection, rubric template binding, threshold validation
- Node palette presence
- Challenge instructions: Markdown renderer sanitization, link and image handling
- Theming: verify CSS variable injection, dark/light modes, and fallback to defaults
- Collections UI: ordering, visibility filtering, collection leaderboard rendering
## Rollout
1. DB migrations: create `challenges`, `challenge_attempts` tables; add judging columns.
2. Backend: models, service, console/web controllers, workflow node, `NodeType` and node mapping registration.
3. Frontend: add block enum, node + panel components (Prompt Challenge, Judging LLM), node palette default, i18n entries.
4. QA: run `make lint`, `make type-check`, and unit tests; `pnpm lint` and tests for web.
5. Documentation: link this design from contributor docs as needed.
## Open Questions / Future Work
- Anti-cheat signals and anomaly detection.
- Custom evaluator/scoring plugin hooks with sandboxing.
- Team competitions and seasons.
- Per-challenge rate limits and cooldowns.
## Notifications
### Events
- `challenge_first_blood`: emitted when the first successful attempt occurs for a challenge
- `challenge_record_beaten`: emitted when a leaderboard record is surpassed under the active scoring strategy
- `team_pairing_completed`: emitted after each Red/Blue pairing is judged with per-team points
### Delivery channels
- In-app (console): add a section in the console UI for challenge events; poll or use server-sent events
- Email (optional): send via existing email task infra (e.g., Celery tasks)
- Webhook (optional): per-tenant webhook endpoint configured in workspace settings to receive challenge events
### Payloads
```json
{
"event": "challenge_record_beaten",
"challenge_id": "...",
"scoring_strategy": "highest_rating",
"previous_record": { "account_id": "...", "score": 95.2 },
"new_record": { "account_id": "...", "score": 96.8 },
"occurred_at": 1730000000000
}
```
Red/Blue pairing example:
```json
{
"event": "team_pairing_completed",
"red_blue_challenge_id": "...",
"pairing_id": "...",
"attack_submission_id": "...",
"defense_submission_id": "...",
"categories": { "CBRNE": true, "SA": false, "SH": true },
"judge_rating": 8,
"red_points": 4,
"blue_points": 2,
"occurred_at": 1730000000001
}
```
### Triggers in services
- After `record_attempt`, re-evaluate leaderboard head for the challenge against the prior head
- If the head changed and meets trigger criteria, enqueue notification tasks
- Respect player profile preferences (`notify_on_first_blood`, `notify_on_record_beaten`)
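The post-`record_attempt` trigger logic above can be sketched as a pure decision function (illustrative; leaderboard-head lookup and task enqueueing are assumed to live elsewhere):

```python
from typing import Optional


def check_record_events(
    prior_head: Optional[dict],
    new_head: dict,
    successes_before: int,
) -> list[str]:
    """Decide which notification events to emit after an attempt is recorded."""
    events = []
    # First successful attempt ever on this challenge.
    if successes_before == 0 and new_head.get("succeeded"):
        events.append("challenge_first_blood")
    # Leaderboard head changed hands under the active scoring strategy.
    if prior_head and new_head["account_id"] != prior_head["account_id"]:
        events.append("challenge_record_beaten")
    return events
```

Player profile preferences (`notify_on_first_blood`, `notify_on_record_beaten`) would then filter recipients before tasks are enqueued.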
### Player-facing feedback
- Immediate feedback comes from node outputs (e.g., `judge_feedback`, `judge_rating`)
- Aggregated notifications (record beaten, first blood) are async and opt-in per player preferences