# Custom Scorer Plugins

This directory contains custom scorer plugins for challenge leaderboards.

## Overview

Scorers compute numeric scores from challenge attempt metrics (tokens, time, rating, success) for ranking on leaderboards when `scoring_strategy = 'custom'`.

## Built-in Scorers

### WeightedScorer

**Entrypoint:** `services.scorers.weighted:WeightedScorer`

Computes a weighted score combining multiple metrics with configurable bonuses and penalties.

**Formula:**
```
score = success_bonus
      + (rating × rating_weight)
      - (elapsed_seconds × time_penalty)
      - (tokens × token_penalty)
```

**Configuration:**
- `success_bonus` (float, default: 100): Base points for successful attempts
- `rating_weight` (float, default: 10): Multiplier for judge rating (0-10)
- `time_penalty` (float, default: 1.0): Penalty per second elapsed
- `token_penalty` (float, default: 0.01): Penalty per token used

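As a rough illustration of the formula with the defaults listed above, the following sketch computes a score for a single attempt. It is not the actual `WeightedScorer` source: the function name `weighted_score`, the conversion from `elapsed_ms` to seconds, treating missing metrics as zero, and awarding `success_bonus` only when the attempt succeeded are all assumptions made for the example.

```python
from typing import Any, Optional


def weighted_score(
    succeeded: bool,
    rating: Optional[int],
    elapsed_ms: Optional[int],
    tokens_total: Optional[int],
    config: dict[str, Any],
) -> float:
    """Illustrative re-implementation of the documented formula (not the real class)."""
    success_bonus = float(config.get("success_bonus", 100.0))
    rating_weight = float(config.get("rating_weight", 10.0))
    time_penalty = float(config.get("time_penalty", 1.0))
    token_penalty = float(config.get("token_penalty", 0.01))

    # Assumption: missing metrics count as zero and elapsed time is converted to seconds.
    elapsed_seconds = (elapsed_ms or 0) / 1000.0
    # Assumption: the bonus applies only to successful attempts.
    bonus = success_bonus if succeeded else 0.0

    return (
        bonus
        + (rating or 0) * rating_weight
        - elapsed_seconds * time_penalty
        - (tokens_total or 0) * token_penalty
    )


# Successful attempt: rating 8, 12.5 s elapsed, 1500 tokens, default weights.
# By hand: 100 + 80 - 12.5 - 15.
example = weighted_score(True, 8, 12_500, 1_500, {})
print(example)
```

With these defaults, a slow or token-heavy attempt can push the raw value below zero, so a scorer built this way would still need to clamp its result.
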
**Example Configuration:**
```json
{
  "success_bonus": 100.0,
  "rating_weight": 10.0,
  "time_penalty": 1.0,
  "token_penalty": 0.01
}
```

**Example Challenge Setup (via API):**
```python
{
    "name": "Advanced Prompt Challenge",
    "scoring_strategy": "custom",
    "scoring_plugin_id": "builtin.weighted_scorer",
    "scoring_entrypoint": "services.scorers.weighted:WeightedScorer",
    "scoring_config": {
        "success_bonus": 100.0,
        "rating_weight": 15.0,
        "time_penalty": 0.5,
        "token_penalty": 0.02
    }
}
```

## Creating Custom Scorers

### 1. Implement the ScorerProtocol

Create a new file in this directory (e.g., `custom.py`):

```python
from typing import Any

from services.challenge_scorer_protocol import AttemptMetrics, ScoringContext, ScoringResult


class MyCustomScorer:
    def score(self, metrics: AttemptMetrics, config: dict[str, Any], ctx: ScoringContext) -> ScoringResult:
        # Access metrics. Optional fields may be present but None,
        # so fall back to 0 explicitly instead of relying on the .get() default.
        succeeded = metrics.get('succeeded', False)
        tokens = metrics.get('tokens_total') or 0
        elapsed_ms = metrics.get('elapsed_ms') or 0
        rating = metrics.get('rating') or 0

        # Access configuration
        multiplier = config.get('multiplier', 1.0)

        # Compute score
        score = float(rating * multiplier) if succeeded else 0.0

        return {
            'score': score,
            'details': {  # optional
                'multiplier_used': multiplier
            }
        }
```

### 2. Register in Challenge

Set the challenge's scoring fields:

```python
challenge.scoring_strategy = 'custom'
challenge.scoring_plugin_id = 'my_custom_scorer'
challenge.scoring_entrypoint = 'services.scorers.custom:MyCustomScorer'
challenge.scoring_config = {
    'multiplier': 2.0
}
```

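The `scoring_entrypoint` value uses a `module.path:ClassName` form. How the service resolves it is not documented here; one plausible approach, sketched below under that assumption, is to split on the colon and import the module with `importlib`. The helper name `load_entrypoint` is hypothetical, and the demo resolves a standard-library class so that the snippet runs outside the project.

```python
import importlib
from typing import Any


def load_entrypoint(entrypoint: str) -> Any:
    """Resolve a 'package.module:Attribute' string to the named object."""
    module_path, sep, attr_name = entrypoint.partition(":")
    if not sep or not module_path or not attr_name:
        raise ValueError(f"Expected 'module:Class', got {entrypoint!r}")
    module = importlib.import_module(module_path)
    return getattr(module, attr_name)


# Inside the project the string would be 'services.scorers.custom:MyCustomScorer';
# a standard-library target is used here so the example is self-contained.
cls = load_entrypoint("collections:OrderedDict")
instance = cls()
```

A typo in either half of the string surfaces as `ModuleNotFoundError` or `AttributeError`, which is worth checking when a newly registered scorer does not seem to run.
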
### 3. Testing

Create tests in `api/tests/unit_tests/services/` following the pattern in `test_challenge_scorer_service.py`.

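A test for the example scorer might look like the sketch below. The contents of `test_challenge_scorer_service.py` are not reproduced in this README, so this does not claim to follow its exact pattern. The scorer is redefined inline with plain `dict` types so the tests run without the project's imports, and the `CTX` values are placeholders.

```python
from typing import Any


class MyCustomScorer:
    """Inline copy of the example scorer, using plain dicts instead of project types."""

    def score(self, metrics: dict[str, Any], config: dict[str, Any], ctx: dict[str, Any]) -> dict[str, Any]:
        rating = metrics.get("rating") or 0
        multiplier = config.get("multiplier", 1.0)
        score = float(rating * multiplier) if metrics.get("succeeded", False) else 0.0
        return {"score": score, "details": {"multiplier_used": multiplier}}


# Placeholder context values; real identifiers come from the service.
CTX = {
    "tenant_id": "t1",
    "app_id": "a1",
    "workflow_id": "w1",
    "challenge_id": "c1",
    "end_user_id": None,
    "timeout_ms": 5000,
}


def test_successful_attempt_uses_multiplier():
    result = MyCustomScorer().score({"succeeded": True, "rating": 7}, {"multiplier": 2.0}, CTX)
    assert result["score"] == 14.0
    assert result["details"]["multiplier_used"] == 2.0


def test_failed_attempt_scores_zero():
    result = MyCustomScorer().score({"succeeded": False, "rating": 9}, {"multiplier": 2.0}, CTX)
    assert result["score"] == 0.0


def test_missing_or_null_rating_is_treated_as_zero():
    scorer = MyCustomScorer()
    assert scorer.score({"succeeded": True, "rating": None}, {}, CTX)["score"] == 0.0
    assert scorer.score({"succeeded": True}, {}, CTX)["score"] == 0.0
```

The functions follow pytest's `test_` naming convention, so a test runner would collect them without further setup.
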
## Protocol Reference

### Input Types

**AttemptMetrics:**
- `succeeded` (bool): Whether the challenge was passed
- `tokens_total` (int | None): Total tokens used
- `elapsed_ms` (int | None): Time taken in milliseconds
- `rating` (int | None): Judge rating (0-10)
- `created_at` (int | None): Timestamp in epoch milliseconds

**ScoringContext:**
- `tenant_id` (str): Tenant identifier
- `app_id` (str): Application identifier
- `workflow_id` (str): Workflow identifier
- `challenge_id` (str): Challenge identifier
- `end_user_id` (str | None): End user identifier (if available)
- `timeout_ms` (int): Maximum execution time

### Output Type

**ScoringResult:**
- `score` (float, required): Computed numeric score
- `details` (dict[str, Any] | None, optional): Additional scoring details

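For readers who prefer code to field lists, the types in this reference could be written roughly as follows. This is a sketch only: the real definitions live in `services.challenge_scorer_protocol` and may differ. Modelling them as `TypedDict` classes is an assumption based on the dict-style access (`metrics.get(...)`) used in the example scorer, and `_ScoringResultRequired` is a hypothetical helper used to mark `score` as the only required key.

```python
from typing import Any, Optional, Protocol, TypedDict


class AttemptMetrics(TypedDict, total=False):
    succeeded: bool
    tokens_total: Optional[int]
    elapsed_ms: Optional[int]
    rating: Optional[int]
    created_at: Optional[int]  # epoch milliseconds


class ScoringContext(TypedDict):
    tenant_id: str
    app_id: str
    workflow_id: str
    challenge_id: str
    end_user_id: Optional[str]
    timeout_ms: int


class _ScoringResultRequired(TypedDict):
    score: float


class ScoringResult(_ScoringResultRequired, total=False):
    details: Optional[dict[str, Any]]


class ScorerProtocol(Protocol):
    def score(
        self, metrics: AttemptMetrics, config: dict[str, Any], ctx: ScoringContext
    ) -> ScoringResult: ...


# A minimal valid result carries only the required key.
result: ScoringResult = {"score": 12.5}
```

Because `Protocol` uses structural typing, a scorer class would not need to inherit from `ScorerProtocol`; a `score` method with a matching signature is enough for a type checker.
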
## Error Handling

- Scorers must return a dict with a `score` key
- Exceptions are caught and logged; the attempt is recorded with `score=None`
- Scorers are executed with a timeout (default: 5s)
- Scorers should never return negative scores; use `max(score, 0.0)` to clamp

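To make the rules above concrete, here is a minimal sketch of a wrapper that enforces them. It is not the service's actual implementation: `run_scorer_safely` is a hypothetical name, and running the scorer in a worker thread via `concurrent.futures` is just one way to apply a timeout.

```python
import logging
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeoutError
from typing import Any, Callable, Optional

logger = logging.getLogger(__name__)


def run_scorer_safely(
    score_fn: Callable[[], dict[str, Any]],
    timeout_s: float = 5.0,
) -> Optional[float]:
    """Return the scorer's score, or None if it fails, times out, or is malformed."""
    executor = ThreadPoolExecutor(max_workers=1)
    try:
        result = executor.submit(score_fn).result(timeout=timeout_s)
        if not isinstance(result, dict) or "score" not in result:
            raise ValueError("scorer result must be a dict with a 'score' key")
        return float(result["score"])
    except FutureTimeoutError:
        logger.warning("scorer timed out after %.2fs", timeout_s)
        return None
    except Exception:
        logger.exception("scorer failed")
        return None
    finally:
        # Do not block on a scorer that is still running; its thread finishes on its own.
        executor.shutdown(wait=False)


def broken_scorer() -> dict[str, Any]:
    raise RuntimeError("boom")


def slow_scorer() -> dict[str, Any]:
    time.sleep(0.2)
    return {"score": 1.0}


ok = run_scorer_safely(lambda: {"score": 3})
failed = run_scorer_safely(broken_scorer)
timed_out = run_scorer_safely(slow_scorer, timeout_s=0.05)
```

A thread-based timeout stops waiting but cannot interrupt the scorer, which is one more reason to keep scoring logic short and free of blocking calls.
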
## Best Practices

1. **Keep it simple**: Scoring should be fast and deterministic
2. **Validate config**: Check configuration values and provide defaults
3. **Clamp scores**: Ensure scores are non-negative
4. **Document formula**: Clearly explain how your scorer works
5. **Test edge cases**: Test with missing metrics, zeros, nulls
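
Practices 2 and 3 can be covered with two small helpers, sketched below. The names `read_float` and `clamp_score` are hypothetical, and silently falling back to the default on a bad value is only one policy; a scorer could equally raise and let the service record `score=None`.

```python
import math
from typing import Any


def read_float(config: dict[str, Any], key: str, default: float, minimum: float = 0.0) -> float:
    """Read a numeric config value, falling back to the default when missing or invalid."""
    value = config.get(key, default)
    # bool is a subclass of int, so reject it explicitly.
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        return default
    value = float(value)
    if not math.isfinite(value) or value < minimum:
        return default
    return value


def clamp_score(score: float) -> float:
    """Ensure the final score is non-negative."""
    return max(score, 0.0)


config = {"multiplier": "2", "time_penalty": 0.5}
multiplier = read_float(config, "multiplier", default=1.0)      # string is rejected, default used
time_penalty = read_float(config, "time_penalty", default=1.0)  # valid value is kept
final = clamp_score(10 * multiplier - 30 * time_penalty)
```
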