Custom Scorer Plugins
This directory contains custom scorer plugins for challenge leaderboards.
Overview
A scorer computes a numeric score from a challenge attempt's metrics (tokens, time, rating, success). Leaderboards rank attempts by that score when the challenge sets scoring_strategy = 'custom'.
Built-in Scorers
WeightedScorer
Entrypoint: services.scorers.weighted:WeightedScorer
Computes a weighted score combining multiple metrics with configurable bonuses and penalties.
Formula:
score = success_bonus
+ (rating × rating_weight)
- (elapsed_seconds × time_penalty)
- (tokens × token_penalty)
Configuration:
- success_bonus (float, default: 100): Base points for successful attempts
- rating_weight (float, default: 10): Multiplier for judge rating (0-10)
- time_penalty (float, default: 1.0): Penalty per second elapsed
- token_penalty (float, default: 0.01): Penalty per token used
Example Configuration:
{
  "success_bonus": 100.0,
  "rating_weight": 10.0,
  "time_penalty": 1.0,
  "token_penalty": 0.01
}
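As a worked example of the formula, the sketch below evaluates it with the default configuration. It is not the actual WeightedScorer code. It assumes the bonus is added only when the attempt succeeded, that elapsed_seconds is derived from the elapsed_ms metric, and that the result is clamped at zero. The function name weighted_score is hypothetical.

```python
from typing import Any, Optional

# Defaults as listed in the configuration section above.
DEFAULTS = {
    "success_bonus": 100.0,
    "rating_weight": 10.0,
    "time_penalty": 1.0,
    "token_penalty": 0.01,
}


def weighted_score(metrics: dict[str, Any], config: Optional[dict[str, Any]] = None) -> float:
    """Hypothetical re-implementation of the documented formula."""
    cfg = {**DEFAULTS, **(config or {})}

    # Assumption: the bonus only applies to successful attempts.
    bonus = cfg["success_bonus"] if metrics.get("succeeded") else 0.0
    # Optional metrics may be None, so fall back to zero.
    rating = metrics.get("rating") or 0
    # Assumption: seconds are derived from the elapsed_ms metric.
    elapsed_seconds = (metrics.get("elapsed_ms") or 0) / 1000.0
    tokens = metrics.get("tokens_total") or 0

    raw = (
        bonus
        + rating * cfg["rating_weight"]
        - elapsed_seconds * cfg["time_penalty"]
        - tokens * cfg["token_penalty"]
    )
    # Assumption: negative results are clamped to zero.
    return max(raw, 0.0)


# 100 + (8 * 10) - (30 * 1.0) - (1500 * 0.01) = 135
example = weighted_score(
    {"succeeded": True, "rating": 8, "elapsed_ms": 30_000, "tokens_total": 1500}
)
```

With these defaults, a successful attempt rated 8 that took 30 seconds and used 1500 tokens scores 135.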
Example Challenge Setup (via API):
{
  "name": "Advanced Prompt Challenge",
  "scoring_strategy": "custom",
  "scoring_plugin_id": "builtin.weighted_scorer",
  "scoring_entrypoint": "services.scorers.weighted:WeightedScorer",
  "scoring_config": {
    "success_bonus": 100.0,
    "rating_weight": 15.0,
    "time_penalty": 0.5,
    "token_penalty": 0.02
  }
}
Creating Custom Scorers
1. Implement the ScorerProtocol
Create a new file in this directory (e.g., custom.py):
from typing import Any

from services.challenge_scorer_protocol import AttemptMetrics, ScoringContext, ScoringResult


class MyCustomScorer:
    def score(self, metrics: AttemptMetrics, config: dict[str, Any], ctx: ScoringContext) -> ScoringResult:
        # Access metrics; optional fields may be present but None, so fall back with `or`
        succeeded = metrics.get('succeeded', False)
        tokens = metrics.get('tokens_total') or 0
        elapsed_ms = metrics.get('elapsed_ms') or 0
        rating = metrics.get('rating') or 0

        # Access configuration
        multiplier = config.get('multiplier', 1.0)

        # Compute score (tokens and elapsed_ms are read above but unused in this minimal example)
        score = float(rating * multiplier) if succeeded else 0.0

        return {
            'score': score,
            'details': {  # optional
                'multiplier_used': multiplier
            }
        }
2. Register in Challenge
Set the challenge's scoring fields:
challenge.scoring_strategy = 'custom'
challenge.scoring_plugin_id = 'my_custom_scorer'
challenge.scoring_entrypoint = 'services.scorers.custom:MyCustomScorer'
challenge.scoring_config = {
    'multiplier': 2.0
}
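The scoring_entrypoint value follows a module.path:ClassName shape. The sketch below shows one way such a string could be resolved into a class with importlib. This README does not describe how the service actually loads scorers, so the load_entrypoint helper is hypothetical. A standard-library class stands in for a real scorer so the snippet runs anywhere.

```python
import importlib
from typing import Any


def load_entrypoint(entrypoint: str) -> Any:
    """Resolve a 'module.path:AttrName' string to the named attribute.

    Hypothetical helper; the real loader may behave differently.
    """
    module_path, sep, attr_name = entrypoint.partition(":")
    if not sep or not module_path or not attr_name:
        raise ValueError(f"expected 'module.path:ClassName', got {entrypoint!r}")
    module = importlib.import_module(module_path)
    return getattr(module, attr_name)


# Stand-in for 'services.scorers.custom:MyCustomScorer'.
ordered_dict_cls = load_entrypoint("collections:OrderedDict")
instance = ordered_dict_cls()
```

A typo in the module path raises ImportError and a typo in the class name raises AttributeError, so it is worth checking the entrypoint string before saving the challenge.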
3. Testing
Create tests in api/tests/unit_tests/services/ following the pattern in test_challenge_scorer_service.py.
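A minimal sketch of what such tests might look like. The layout of test_challenge_scorer_service.py is not shown here, so plain assert-based test functions are an assumption. The scorer below is a stand-in that mirrors the example from step 1 so the snippet runs on its own; in a real test you would import your scorer instead. The context values are made up.

```python
from typing import Any


class MyCustomScorer:
    """Stand-in copy of the example scorer (None ratings count as zero)."""

    def score(self, metrics: dict[str, Any], config: dict[str, Any], ctx: dict[str, Any]) -> dict[str, Any]:
        succeeded = metrics.get("succeeded", False)
        rating = metrics.get("rating") or 0
        multiplier = config.get("multiplier", 1.0)
        score = float(rating * multiplier) if succeeded else 0.0
        return {"score": score, "details": {"multiplier_used": multiplier}}


# Illustrative context; every value here is invented for the test.
CTX = {
    "tenant_id": "tenant-1",
    "app_id": "app-1",
    "workflow_id": "wf-1",
    "challenge_id": "ch-1",
    "end_user_id": None,
    "timeout_ms": 5000,
}


def test_success_uses_multiplier() -> None:
    result = MyCustomScorer().score({"succeeded": True, "rating": 4}, {"multiplier": 2.0}, CTX)
    assert result["score"] == 8.0
    assert result["details"]["multiplier_used"] == 2.0


def test_failure_scores_zero() -> None:
    result = MyCustomScorer().score({"succeeded": False, "rating": 9}, {"multiplier": 2.0}, CTX)
    assert result["score"] == 0.0


def test_missing_or_null_rating_scores_zero() -> None:
    scorer = MyCustomScorer()
    assert scorer.score({"succeeded": True}, {}, CTX)["score"] == 0.0
    assert scorer.score({"succeeded": True, "rating": None}, {}, CTX)["score"] == 0.0
```

The three cases cover the success path, the failure path, and missing or null metrics.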
Protocol Reference
Input Types
AttemptMetrics:
- succeeded (bool): Whether the challenge was passed
- tokens_total (int | None): Total tokens used
- elapsed_ms (int | None): Time taken in milliseconds
- rating (int | None): Judge rating (0-10)
- created_at (int | None): Timestamp in epoch milliseconds
ScoringContext:
- tenant_id (str): Tenant identifier
- app_id (str): Application identifier
- workflow_id (str): Workflow identifier
- challenge_id (str): Challenge identifier
- end_user_id (str | None): End user identifier (if available)
- timeout_ms (int): Maximum execution time in milliseconds
Output Type
ScoringResult:
- score (float, required): Computed numeric score
- details (dict[str, Any] | None, optional): Additional scoring details
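To make the field lists above concrete, here is a sketch of how these types could be written with typing.TypedDict and typing.Protocol. The real definitions live in services.challenge_scorer_protocol and may differ. Modelling them as TypedDicts is an assumption based on the example scorer calling metrics.get(...) and returning a plain dict. RatingOnlyScorer is a hypothetical scorer used only to exercise the types.

```python
from typing import Any, Optional, Protocol, TypedDict


class AttemptMetrics(TypedDict, total=False):
    succeeded: bool
    tokens_total: Optional[int]
    elapsed_ms: Optional[int]
    rating: Optional[int]      # judge rating, 0-10
    created_at: Optional[int]  # epoch milliseconds


class ScoringContext(TypedDict, total=False):
    tenant_id: str
    app_id: str
    workflow_id: str
    challenge_id: str
    end_user_id: Optional[str]
    timeout_ms: int


class _ScoringResultRequired(TypedDict):
    score: float


class ScoringResult(_ScoringResultRequired, total=False):
    # Optional key: scorers may omit it entirely.
    details: Optional[dict[str, Any]]


class ScorerProtocol(Protocol):
    def score(
        self, metrics: AttemptMetrics, config: dict[str, Any], ctx: ScoringContext
    ) -> ScoringResult: ...


class RatingOnlyScorer:
    """Hypothetical scorer: the score is simply the judge rating."""

    def score(
        self, metrics: AttemptMetrics, config: dict[str, Any], ctx: ScoringContext
    ) -> ScoringResult:
        return {"score": float(metrics.get("rating") or 0)}


scorer: ScorerProtocol = RatingOnlyScorer()
result = scorer.score({"succeeded": True, "rating": 7}, {}, {"challenge_id": "ch-1"})
```

Because the protocol is structural, a scorer class does not need to inherit from anything; it only needs a score method with a matching signature.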
Error Handling
- Scorers must return a dict with a score key
- Exceptions are caught and logged; the attempt is recorded with score=None
- Scorers are executed with a timeout (default: 5s)
- Scorers should never return negative scores; use max(score, 0.0) to clamp
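The rules above can be pictured as a wrapper around the scorer call. The sketch below is not the service's actual implementation: run_scorer_safely and the three demo scorers are hypothetical, and a single-worker thread pool is just one way to apply a timeout. Note that Python cannot forcibly stop a thread, so a timed-out scorer keeps running in the background until it returns.

```python
import logging
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout
from typing import Any, Optional

logger = logging.getLogger(__name__)


def run_scorer_safely(
    scorer: Any,
    metrics: dict[str, Any],
    config: dict[str, Any],
    ctx: dict[str, Any],
    timeout_s: float = 5.0,
) -> Optional[float]:
    """Return the score, or None on timeout, exception, or malformed result."""
    executor = ThreadPoolExecutor(max_workers=1)
    try:
        future = executor.submit(scorer.score, metrics, config, ctx)
        result = future.result(timeout=timeout_s)
        if not isinstance(result, dict) or "score" not in result:
            raise ValueError("scorer result must be a dict with a 'score' key")
        return float(result["score"])
    except FutureTimeout:
        logger.warning("scorer timed out after %.2fs", timeout_s)
        return None
    except Exception:
        logger.exception("scorer failed")
        return None
    finally:
        # Do not block on a scorer that is still running.
        executor.shutdown(wait=False)


class GoodScorer:
    def score(self, metrics, config, ctx):
        return {"score": 7.0}


class RaisingScorer:
    def score(self, metrics, config, ctx):
        raise RuntimeError("boom")


class SlowScorer:
    def score(self, metrics, config, ctx):
        time.sleep(0.2)
        return {"score": 1.0}


ok = run_scorer_safely(GoodScorer(), {}, {}, {})
failed = run_scorer_safely(RaisingScorer(), {}, {}, {})
timed_out = run_scorer_safely(SlowScorer(), {}, {}, {}, timeout_s=0.05)
```

Here ok holds the score, while failed and timed_out are both None, matching the score=None behaviour described above.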
Best Practices
- Keep it simple: Scoring should be fast and deterministic
- Validate config: Check configuration values and provide defaults
- Clamp scores: Ensure scores are non-negative
- Document formula: Clearly explain how your scorer works
- Test edge cases: Test with missing metrics, zeros, nulls
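As an illustration of the "Validate config" and "Clamp scores" items, the sketch below reads a numeric setting with a default and rejects unusable values. The helper name read_non_negative_float is hypothetical. Raising on bad input is one possible design choice: per the Error Handling section, the exception would be caught and the attempt recorded with score=None.

```python
import math
from typing import Any


def read_non_negative_float(config: dict[str, Any], key: str, default: float) -> float:
    """Return config[key] as a float, or the default when the key is absent."""
    value = config.get(key, default)
    # bool is a subclass of int, so reject it explicitly.
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        raise ValueError(f"{key} must be a number, got {type(value).__name__}")
    value = float(value)
    if not math.isfinite(value) or value < 0:
        raise ValueError(f"{key} must be finite and non-negative, got {value}")
    return value


config = {"multiplier": 2}
multiplier = read_non_negative_float(config, "multiplier", 1.0)
fallback = read_non_negative_float({}, "multiplier", 1.0)

# Clamp the final score so it is never negative.
raw_score = 3 * multiplier - 10
score = max(raw_score, 0.0)
```

Validating once at the top of score() keeps the scoring arithmetic itself free of type checks.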