# Custom Scorer Plugins

This directory contains custom scorer plugins for challenge leaderboards.

## Overview

Scorers compute numeric scores from challenge attempt metrics (tokens, time, rating, success) for ranking on leaderboards when `scoring_strategy = 'custom'`.

## Built-in Scorers

### WeightedScorer

**Entrypoint:** `services.scorers.weighted:WeightedScorer`

Computes a weighted score combining multiple metrics with configurable bonuses and penalties.

**Formula:**
```
score = success_bonus
        + (rating × rating_weight)
        - (elapsed_seconds × time_penalty)
        - (tokens × token_penalty)
```

**Configuration:**
- `success_bonus` (float, default: 100): Base points for successful attempts
- `rating_weight` (float, default: 10): Multiplier for judge rating (0-10)
- `time_penalty` (float, default: 1.0): Penalty per second elapsed
- `token_penalty` (float, default: 0.01): Penalty per token used

**Example Configuration:**
```json
{
  "success_bonus": 100.0,
  "rating_weight": 10.0,
  "time_penalty": 1.0,
  "token_penalty": 0.01
}
```

**Example Challenge Setup (via API):**
```python
{
  "name": "Advanced Prompt Challenge",
  "scoring_strategy": "custom",
  "scoring_plugin_id": "builtin.weighted_scorer",
  "scoring_entrypoint": "services.scorers.weighted:WeightedScorer",
  "scoring_config": {
    "success_bonus": 100.0,
    "rating_weight": 15.0,
    "time_penalty": 0.5,
    "token_penalty": 0.02
  }
}
```

## Creating Custom Scorers

### 1. Implement the ScorerProtocol

Create a new file in this directory (e.g., `custom.py`):

```python
from typing import Any
from services.challenge_scorer_protocol import AttemptMetrics, ScoringContext, ScoringResult

class MyCustomScorer:
    def score(self, metrics: AttemptMetrics, config: dict[str, Any], ctx: ScoringContext) -> ScoringResult:
        # Access metrics
        succeeded = metrics.get('succeeded', False)
        tokens = metrics.get('tokens_total', 0)
        elapsed_ms = metrics.get('elapsed_ms', 0)
        rating = metrics.get('rating', 0)

        # Access configuration
        multiplier = config.get('multiplier', 1.0)

        # Compute score
        score = (rating * multiplier) if succeeded else 0.0

        return {
            'score': score,
            'details': {  # optional
                'multiplier_used': multiplier
            }
        }
```

### 2. Register in Challenge

Set the challenge's scoring fields:

```python
challenge.scoring_strategy = 'custom'
challenge.scoring_plugin_id = 'my_custom_scorer'
challenge.scoring_entrypoint = 'services.scorers.custom:MyCustomScorer'
challenge.scoring_config = {
    'multiplier': 2.0
}
```

### 3. Testing

Create tests in `api/tests/unit_tests/services/` following the pattern in `test_challenge_scorer_service.py`.

## Protocol Reference

### Input Types

**AttemptMetrics:**
- `succeeded` (bool): Whether the challenge was passed
- `tokens_total` (int | None): Total tokens used
- `elapsed_ms` (int | None): Time taken in milliseconds
- `rating` (int | None): Judge rating (0-10)
- `created_at` (int | None): Timestamp in epoch milliseconds

**ScoringContext:**
- `tenant_id` (str): Tenant identifier
- `app_id` (str): Application identifier
- `workflow_id` (str): Workflow identifier
- `challenge_id` (str): Challenge identifier
- `end_user_id` (str | None): End user identifier (if available)
- `timeout_ms` (int): Maximum execution time

### Output Type

**ScoringResult:**
- `score` (float, required): Computed numeric score
- `details` (dict[str, Any] | None, optional): Additional scoring details

## Error Handling

- Scorers must return a dict with a `score` key
- Exceptions are caught and logged; the attempt is recorded with `score=None`
- Scorers are executed with a timeout (default: 5s)
- Scorers should never return negative scores; use `max(score, 0.0)` to clamp

## Best Practices

1. **Keep it simple**: Scoring should be fast and deterministic
2. **Validate config**: Check configuration values and provide defaults
3. **Clamp scores**: Ensure scores are non-negative
4. **Document formula**: Clearly explain how your scorer works
5. **Test edge cases**: Test with missing metrics, zeros, nulls