# Custom Scorer Plugins
This directory contains custom scorer plugins for challenge leaderboards.
## Overview
Scorers compute numeric scores from challenge attempt metrics (tokens, time, rating, success) for ranking on leaderboards when `scoring_strategy = 'custom'`.
## Built-in Scorers
### WeightedScorer
**Entrypoint:** `services.scorers.weighted:WeightedScorer`
Computes a weighted score combining multiple metrics with configurable bonuses and penalties.
**Formula:**
```
score = success_bonus
        + (rating × rating_weight)
        - (elapsed_seconds × time_penalty)
        - (tokens × token_penalty)
```
**Configuration:**
- `success_bonus` (float, default: 100): Base points for successful attempts
- `rating_weight` (float, default: 10): Multiplier for judge rating (0-10)
- `time_penalty` (float, default: 1.0): Penalty per second elapsed
- `token_penalty` (float, default: 0.01): Penalty per token used
**Example Configuration:**
```json
{
  "success_bonus": 100.0,
  "rating_weight": 10.0,
  "time_penalty": 1.0,
  "token_penalty": 0.01
}
```
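With these default values, a successful attempt rated 8 that took 45 seconds and used 1,200 tokens would score:
```
score = 100 + (8 × 10) - (45 × 1.0) - (1200 × 0.01)
      = 100 + 80 - 45 - 12
      = 123
```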
**Example Challenge Setup (via API):**
```python
{
    "name": "Advanced Prompt Challenge",
    "scoring_strategy": "custom",
    "scoring_plugin_id": "builtin.weighted_scorer",
    "scoring_entrypoint": "services.scorers.weighted:WeightedScorer",
    "scoring_config": {
        "success_bonus": 100.0,
        "rating_weight": 15.0,
        "time_penalty": 0.5,
        "token_penalty": 0.02
    }
}
```
## Creating Custom Scorers
### 1. Implement the ScorerProtocol
Create a new file in this directory (e.g., `custom.py`):
```python
from typing import Any

from services.challenge_scorer_protocol import AttemptMetrics, ScoringContext, ScoringResult


class MyCustomScorer:
    def score(self, metrics: AttemptMetrics, config: dict[str, Any], ctx: ScoringContext) -> ScoringResult:
        # Access metrics
        succeeded = metrics.get('succeeded', False)
        tokens = metrics.get('tokens_total', 0)
        elapsed_ms = metrics.get('elapsed_ms', 0)
        rating = metrics.get('rating', 0)

        # Access configuration
        multiplier = config.get('multiplier', 1.0)

        # Compute score
        score = (rating * multiplier) if succeeded else 0.0

        return {
            'score': score,
            'details': {  # optional
                'multiplier_used': multiplier
            }
        }
```
### 2. Register in Challenge
Set the challenge's scoring fields:
```python
challenge.scoring_strategy = 'custom'
challenge.scoring_plugin_id = 'my_custom_scorer'
challenge.scoring_entrypoint = 'services.scorers.custom:MyCustomScorer'
challenge.scoring_config = {
    'multiplier': 2.0
}
```
### 3. Testing
Create tests in `api/tests/unit_tests/services/` following the pattern in `test_challenge_scorer_service.py`.
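As a starting point, here is a minimal sketch of a test for the `MyCustomScorer` example above. It assumes the metrics, config, and context arguments can be passed as plain dicts (the example scorer only reads them via `.get()` and ignores `ctx`); the actual fixtures in `test_challenge_scorer_service.py` may differ:
```python
from services.scorers.custom import MyCustomScorer


def test_successful_attempt_scores_rating_times_multiplier():
    scorer = MyCustomScorer()
    metrics = {'succeeded': True, 'tokens_total': 500, 'elapsed_ms': 30_000, 'rating': 8}

    result = scorer.score(metrics, config={'multiplier': 2.0}, ctx={})

    assert result['score'] == 16.0
    assert result['details'] == {'multiplier_used': 2.0}


def test_failed_attempt_scores_zero():
    scorer = MyCustomScorer()
    metrics = {'succeeded': False, 'tokens_total': 500, 'elapsed_ms': 30_000, 'rating': 8}

    assert scorer.score(metrics, config={}, ctx={})['score'] == 0.0
```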
## Protocol Reference
### Input Types
**AttemptMetrics:**
- `succeeded` (bool): Whether the challenge was passed
- `tokens_total` (int | None): Total tokens used
- `elapsed_ms` (int | None): Time taken in milliseconds
- `rating` (int | None): Judge rating (0-10)
- `created_at` (int | None): Timestamp in epoch milliseconds
**ScoringContext:**
- `tenant_id` (str): Tenant identifier
- `app_id` (str): Application identifier
- `workflow_id` (str): Workflow identifier
- `challenge_id` (str): Challenge identifier
- `end_user_id` (str | None): End user identifier (if available)
- `timeout_ms` (int): Maximum execution time in milliseconds
### Output Type
**ScoringResult:**
- `score` (float, required): Computed numeric score
- `details` (dict[str, Any] | None, optional): Additional scoring details
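For orientation, the input and output types above could be expressed roughly as the following TypedDicts. This is only a sketch; the real definitions live in `services/challenge_scorer_protocol.py` and should be imported, not redefined:
```python
from typing import Any, NotRequired, TypedDict  # NotRequired requires Python 3.11+


class AttemptMetrics(TypedDict):
    succeeded: bool
    tokens_total: int | None
    elapsed_ms: int | None
    rating: int | None
    created_at: int | None


class ScoringContext(TypedDict):
    tenant_id: str
    app_id: str
    workflow_id: str
    challenge_id: str
    end_user_id: str | None
    timeout_ms: int


class ScoringResult(TypedDict):
    score: float
    details: NotRequired[dict[str, Any] | None]
```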
## Error Handling
- Scorers must return a dict with a `score` key
- Exceptions are caught and logged; the attempt is recorded with `score=None`
- Scorers are executed with a timeout (default: 5s)
- Scorers should never return negative scores; use `max(score, 0.0)` to clamp
## Best Practices
1. **Keep it simple**: Scoring should be fast and deterministic
2. **Validate config**: Check configuration values and provide defaults
3. **Clamp scores**: Ensure scores are non-negative (see the sketch after this list)
4. **Document formula**: Clearly explain how your scorer works
5. **Test edge cases**: Test with missing metrics, zeros, nulls
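Putting several of these practices together, here is an illustrative sketch (the class and config names are made up for this example and are not part of the built-ins):
```python
from typing import Any

from services.challenge_scorer_protocol import AttemptMetrics, ScoringContext, ScoringResult


class DefensiveScorer:
    """Illustrative only: validates config, tolerates missing metrics, clamps the score."""

    def score(self, metrics: AttemptMetrics, config: dict[str, Any], ctx: ScoringContext) -> ScoringResult:
        # Validate config, falling back to a default on bad input
        try:
            rating_weight = float(config.get('rating_weight', 10.0))
        except (TypeError, ValueError):
            rating_weight = 10.0

        # Treat missing or None metrics as zero instead of raising
        rating = metrics.get('rating') or 0
        elapsed_ms = metrics.get('elapsed_ms') or 0

        score = 0.0
        if metrics.get('succeeded', False):
            score = rating * rating_weight - (elapsed_ms / 1000.0)

        # Clamp so the scorer never returns a negative score
        return {'score': max(score, 0.0), 'details': {'rating_weight': rating_weight}}
```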