History

Joey Yakimowich-Payne 8fd3c4bb64 feat: add challenge and red-blue competitions across API and web		2025-10-02 20:11:57 -06:00
..
__init__.py	feat: add challenge and red-blue competitions across API and web	2025-10-02 20:11:57 -06:00
README.md	feat: add challenge and red-blue competitions across API and web	2025-10-02 20:11:57 -06:00
weighted.py	feat: add challenge and red-blue competitions across API and web	2025-10-02 20:11:57 -06:00

README.md

Custom Scorer Plugins

This directory contains custom scorer plugins for challenge leaderboards.

Overview

Scorers compute numeric scores from challenge attempt metrics (tokens, time, rating, success) for ranking on leaderboards when scoring_strategy = 'custom'.

Built-in Scorers

WeightedScorer

Entrypoint: services.scorers.weighted:WeightedScorer

Computes a weighted score combining multiple metrics with configurable bonuses and penalties.

Formula:

score = success_bonus
        + (rating × rating_weight)
        - (elapsed_seconds × time_penalty)
        - (tokens × token_penalty)

Configuration:

success_bonus (float, default: 100): Base points for successful attempts
rating_weight (float, default: 10): Multiplier for judge rating (0-10)
time_penalty (float, default: 1.0): Penalty per second elapsed
token_penalty (float, default: 0.01): Penalty per token used

Example Configuration:

{
  "success_bonus": 100.0,
  "rating_weight": 10.0,
  "time_penalty": 1.0,
  "token_penalty": 0.01
}

Example Challenge Setup (via API):

{
  "name": "Advanced Prompt Challenge",
  "scoring_strategy": "custom",
  "scoring_plugin_id": "builtin.weighted_scorer",
  "scoring_entrypoint": "services.scorers.weighted:WeightedScorer",
  "scoring_config": {
    "success_bonus": 100.0,
    "rating_weight": 15.0,
    "time_penalty": 0.5,
    "token_penalty": 0.02
  }
}

Creating Custom Scorers

1. Implement the ScorerProtocol

Create a new file in this directory (e.g., custom.py):

from typing import Any
from services.challenge_scorer_protocol import AttemptMetrics, ScoringContext, ScoringResult

class MyCustomScorer:
    def score(self, metrics: AttemptMetrics, config: dict[str, Any], ctx: ScoringContext) -> ScoringResult:
        # Access metrics
        succeeded = metrics.get('succeeded', False)
        tokens = metrics.get('tokens_total', 0)
        elapsed_ms = metrics.get('elapsed_ms', 0)
        rating = metrics.get('rating', 0)

        # Access configuration
        multiplier = config.get('multiplier', 1.0)

        # Compute score
        score = (rating * multiplier) if succeeded else 0.0

        return {
            'score': score,
            'details': {  # optional
                'multiplier_used': multiplier
            }
        }

2. Register in Challenge

Set the challenge's scoring fields:

challenge.scoring_strategy = 'custom'
challenge.scoring_plugin_id = 'my_custom_scorer'
challenge.scoring_entrypoint = 'services.scorers.custom:MyCustomScorer'
challenge.scoring_config = {
    'multiplier': 2.0
}

3. Testing

Create tests in api/tests/unit_tests/services/ following the pattern in test_challenge_scorer_service.py.

Protocol Reference

Input Types

AttemptMetrics:

succeeded (bool): Whether the challenge was passed
tokens_total (int | None): Total tokens used
elapsed_ms (int | None): Time taken in milliseconds
rating (int | None): Judge rating (0-10)
created_at (int | None): Timestamp in epoch milliseconds

ScoringContext:

tenant_id (str): Tenant identifier
app_id (str): Application identifier
workflow_id (str): Workflow identifier
challenge_id (str): Challenge identifier
end_user_id (str | None): End user identifier (if available)
timeout_ms (int): Maximum execution time

Output Type

ScoringResult:

score (float, required): Computed numeric score
details (dict[str, Any] | None, optional): Additional scoring details

Error Handling

Scorers must return a dict with a score key
Exceptions are caught and logged; the attempt is recorded with score=None
Scorers are executed with a timeout (default: 5s)
Scorers should never return negative scores; use max(score, 0.0) to clamp

Best Practices

Keep it simple: Scoring should be fast and deterministic
Validate config: Check configuration values and provide defaults
Clamp scores: Ensure scores are non-negative
Document formula: Clearly explain how your scorer works
Test edge cases: Test with missing metrics, zeros, nulls

README.md Unescape Escape