Custom Scorer Plugins

This directory contains custom scorer plugins for challenge leaderboards.

Overview

A scorer computes a numeric score from a challenge attempt's metrics (success, judge rating, elapsed time, token usage). Leaderboards rank attempts by this score when the challenge's scoring_strategy is 'custom'.

Built-in Scorers

WeightedScorer

Entrypoint: services.scorers.weighted:WeightedScorer

Computes a weighted score combining multiple metrics with configurable bonuses and penalties.

Formula:

score = success_bonus
        + (rating × rating_weight)
        - (elapsed_seconds × time_penalty)
        - (tokens × token_penalty)

Configuration:

  • success_bonus (float, default: 100): Base points for successful attempts
  • rating_weight (float, default: 10): Multiplier for judge rating (0-10)
  • time_penalty (float, default: 1.0): Penalty per second elapsed
  • token_penalty (float, default: 0.01): Penalty per token used

Example Configuration:

{
  "success_bonus": 100.0,
  "rating_weight": 10.0,
  "time_penalty": 1.0,
  "token_penalty": 0.01
}
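The formula and defaults above can be restated as a short Python sketch. This is an illustration only, not the code in weighted.py. It assumes that elapsed_seconds is elapsed_ms divided by 1000, that the bonus applies only to successful attempts, and that the result is clamped at zero; the function name weighted_score is hypothetical.

```python
from typing import Any, Optional

# Defaults copied from the configuration list above.
DEFAULTS = {
    "success_bonus": 100.0,
    "rating_weight": 10.0,
    "time_penalty": 1.0,
    "token_penalty": 0.01,
}


def weighted_score(metrics: dict[str, Any], config: Optional[dict[str, Any]] = None) -> float:
    """Illustrative restatement of the documented formula (hypothetical helper)."""
    cfg = {**DEFAULTS, **(config or {})}

    # Optional metrics may be present but None, so treat None as zero.
    rating = metrics.get("rating") or 0
    tokens = metrics.get("tokens_total") or 0
    # Assumption: elapsed_seconds is elapsed_ms converted to seconds.
    elapsed_seconds = (metrics.get("elapsed_ms") or 0) / 1000.0

    # Assumption: the bonus is granted only when the attempt succeeded.
    bonus = cfg["success_bonus"] if metrics.get("succeeded") else 0.0

    score = (
        bonus
        + rating * cfg["rating_weight"]
        - elapsed_seconds * cfg["time_penalty"]
        - tokens * cfg["token_penalty"]
    )
    # Assumption: clamp at zero so the score is never negative.
    return max(score, 0.0)


example = weighted_score(
    {"succeeded": True, "rating": 8, "elapsed_ms": 12_000, "tokens_total": 1_500}
)
print(example)  # 100 + 80 - 12 - 15 = 153.0
```

Under these assumptions, a successful attempt rated 8 that took 12 seconds and used 1,500 tokens scores 153 with the default configuration.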

Example Challenge Setup (via API):

{
  "name": "Advanced Prompt Challenge",
  "scoring_strategy": "custom",
  "scoring_plugin_id": "builtin.weighted_scorer",
  "scoring_entrypoint": "services.scorers.weighted:WeightedScorer",
  "scoring_config": {
    "success_bonus": 100.0,
    "rating_weight": 15.0,
    "time_penalty": 0.5,
    "token_penalty": 0.02
  }
}

Creating Custom Scorers

1. Implement the ScorerProtocol

Create a new file in this directory (e.g., custom.py):

from typing import Any

from services.challenge_scorer_protocol import AttemptMetrics, ScoringContext, ScoringResult


class MyCustomScorer:
    def score(self, metrics: AttemptMetrics, config: dict[str, Any], ctx: ScoringContext) -> ScoringResult:
        # Access metrics. Optional metrics may be present but set to None,
        # so fall back with `or` instead of relying on the .get() default.
        succeeded = bool(metrics.get('succeeded', False))
        tokens = metrics.get('tokens_total') or 0    # available for token-based penalties
        elapsed_ms = metrics.get('elapsed_ms') or 0  # available for time-based penalties
        rating = metrics.get('rating') or 0

        # Access configuration
        multiplier = float(config.get('multiplier', 1.0))

        # Compute score; clamp so it is never negative
        score = max(rating * multiplier, 0.0) if succeeded else 0.0

        return {
            'score': score,
            'details': {  # optional
                'multiplier_used': multiplier
            }
        }

2. Register in Challenge

Set the challenge's scoring fields:

challenge.scoring_strategy = 'custom'
challenge.scoring_plugin_id = 'my_custom_scorer'
challenge.scoring_entrypoint = 'services.scorers.custom:MyCustomScorer'
challenge.scoring_config = {
    'multiplier': 2.0
}
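The scoring_entrypoint value follows a 'module.path:ClassName' convention. How the service resolves it is not shown in this README; the sketch below is one common way to resolve such a string with importlib. The helper name load_entrypoint is hypothetical.

```python
import importlib
from typing import Any


def load_entrypoint(entrypoint: str) -> Any:
    """Resolve a 'package.module:Attribute' string to the object it names."""
    module_path, sep, attr_name = entrypoint.partition(":")
    if not sep or not module_path or not attr_name:
        raise ValueError(f"expected 'module:Class', got {entrypoint!r}")
    module = importlib.import_module(module_path)
    try:
        return getattr(module, attr_name)
    except AttributeError as exc:
        raise ValueError(f"{module_path!r} has no attribute {attr_name!r}") from exc


# Demonstrated with a standard-library class so the snippet runs anywhere.
# In this project the string would look like 'services.scorers.custom:MyCustomScorer'.
ordered_dict_cls = load_entrypoint("collections:OrderedDict")
instance = ordered_dict_cls()
```

A typo in either half of the string surfaces as an ImportError or ValueError at load time, so it is worth checking the entrypoint before saving the challenge.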

3. Testing

Create tests in api/tests/unit_tests/services/ following the pattern in test_challenge_scorer_service.py.
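As a starting point, a pytest-style test module might look like the sketch below. The layout of test_challenge_scorer_service.py is not reproduced here, so this is a generic pattern rather than a copy of it. RatingOnlyScorer is a hypothetical stand-in defined inline so the tests run on their own; in a real test you would import your scorer instead.

```python
from typing import Any


class RatingOnlyScorer:
    """Hypothetical stand-in scorer, defined inline so the tests are self-contained."""

    def score(self, metrics: dict[str, Any], config: dict[str, Any], ctx: dict[str, Any]) -> dict[str, Any]:
        rating = metrics.get("rating") or 0  # None and missing both become 0
        multiplier = float(config.get("multiplier", 1.0))
        score = rating * multiplier if metrics.get("succeeded") else 0.0
        return {"score": max(score, 0.0), "details": {"multiplier_used": multiplier}}


# Minimal context carrying the fields listed in the protocol reference.
CTX = {
    "tenant_id": "t",
    "app_id": "a",
    "workflow_id": "w",
    "challenge_id": "c",
    "end_user_id": None,
    "timeout_ms": 5000,
}


def test_success_applies_multiplier():
    result = RatingOnlyScorer().score({"succeeded": True, "rating": 7}, {"multiplier": 2.0}, CTX)
    assert result["score"] == 14.0
    assert result["details"]["multiplier_used"] == 2.0


def test_failure_scores_zero():
    result = RatingOnlyScorer().score({"succeeded": False, "rating": 10}, {}, CTX)
    assert result["score"] == 0.0


def test_missing_or_none_rating_is_treated_as_zero():
    scorer = RatingOnlyScorer()
    assert scorer.score({"succeeded": True}, {}, CTX)["score"] == 0.0
    assert scorer.score({"succeeded": True, "rating": None}, {}, CTX)["score"] == 0.0
```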

Protocol Reference

Input Types

AttemptMetrics:

  • succeeded (bool): Whether the challenge was passed
  • tokens_total (int | None): Total tokens used
  • elapsed_ms (int | None): Time taken in milliseconds
  • rating (int | None): Judge rating (0-10)
  • created_at (int | None): Timestamp in epoch milliseconds

ScoringContext:

  • tenant_id (str): Tenant identifier
  • app_id (str): Application identifier
  • workflow_id (str): Workflow identifier
  • challenge_id (str): Challenge identifier
  • end_user_id (str | None): End user identifier (if available)
  • timeout_ms (int): Maximum execution time for the scorer, in milliseconds

Output Type

ScoringResult:

  • score (float, required): Computed numeric score
  • details (dict[str, Any] | None, optional): Additional scoring details
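For orientation, the three types above could be modelled as TypedDicts along the following lines. This is a sketch inferred from the field lists and from the dict-style access in the example scorer; the actual definitions in services.challenge_scorer_protocol may differ.

```python
from typing import Any, Optional, Protocol, TypedDict


class AttemptMetrics(TypedDict, total=False):
    succeeded: bool
    tokens_total: Optional[int]
    elapsed_ms: Optional[int]
    rating: Optional[int]      # judge rating, 0-10
    created_at: Optional[int]  # epoch milliseconds


class ScoringContext(TypedDict):
    tenant_id: str
    app_id: str
    workflow_id: str
    challenge_id: str
    end_user_id: Optional[str]
    timeout_ms: int


class _ScoringResultRequired(TypedDict):
    score: float  # required


class ScoringResult(_ScoringResultRequired, total=False):
    details: Optional[dict[str, Any]]  # optional


class ScorerProtocol(Protocol):
    def score(self, metrics: AttemptMetrics, config: dict[str, Any], ctx: ScoringContext) -> ScoringResult:
        ...


# Example values that satisfy the shapes above.
metrics: AttemptMetrics = {"succeeded": True, "tokens_total": 900, "elapsed_ms": 4200, "rating": 7}
result: ScoringResult = {"score": 21.0, "details": {"multiplier_used": 3.0}}
```

Because TypedDicts are plain dicts at runtime, metrics.get('rating') works as in the example scorer, and a key that is present may still hold None.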

Error Handling

  • Scorers must return a dict with a score key
  • Exceptions are caught and logged; the attempt is recorded with score=None
  • Scorers are executed with a timeout (default: 5s)
  • Scorers should never return negative scores; use max(score, 0.0) to clamp
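The behaviour listed above can be pictured with a small wrapper. This is a sketch of the described semantics (timeout, caught exceptions, score=None on failure) built on concurrent.futures; it is not the service's actual implementation, and run_scorer_safely and the three demo scorers are hypothetical names.

```python
import logging
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout
from typing import Any, Optional

logger = logging.getLogger(__name__)


def run_scorer_safely(scorer: Any, metrics: dict, config: dict, ctx: dict,
                      timeout_s: float = 5.0) -> Optional[float]:
    """Return the score, or None on timeout, exception, or a malformed result."""
    executor = ThreadPoolExecutor(max_workers=1)
    try:
        future = executor.submit(scorer.score, metrics, config, ctx)
        result = future.result(timeout=timeout_s)
        if not isinstance(result, dict) or "score" not in result:
            raise ValueError("scorer must return a dict with a 'score' key")
        return float(result["score"])
    except FutureTimeout:
        logger.warning("scorer timed out after %.2fs", timeout_s)
        return None
    except Exception:
        logger.exception("scorer failed")
        return None
    finally:
        # Do not block the caller on a scorer that is still running.
        executor.shutdown(wait=False)


class GoodScorer:
    def score(self, metrics, config, ctx):
        return {"score": 42}


class BrokenScorer:
    def score(self, metrics, config, ctx):
        raise RuntimeError("boom")


class SlowScorer:
    def score(self, metrics, config, ctx):
        time.sleep(0.2)
        return {"score": 1.0}


good = run_scorer_safely(GoodScorer(), {}, {}, {})
broken = run_scorer_safely(BrokenScorer(), {}, {}, {})
slow = run_scorer_safely(SlowScorer(), {}, {}, {}, timeout_s=0.05)
```

Note that a thread-based timeout only stops waiting for the result; the scorer's thread keeps running until it returns, which is one more reason to keep scorers fast.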

Best Practices

  1. Keep it simple: Scoring should be fast and deterministic
  2. Validate config: Check configuration values and provide defaults
  3. Clamp scores: Ensure scores are non-negative
  4. Document formula: Clearly explain how your scorer works
  5. Test edge cases: Cover missing metrics, zero values, and None values
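For the "validate config" practice, a small helper can centralize the checks. The sketch below is one possible approach, and read_non_negative is a hypothetical name. It raises on invalid values; since scorer exceptions are caught and the attempt is recorded with score=None, you may prefer to fall back to the default instead.

```python
import math
from typing import Any


def read_non_negative(config: dict[str, Any], key: str, default: float) -> float:
    """Return config[key] as a finite, non-negative float, or the default if absent."""
    raw = config.get(key)
    if raw is None:
        return default
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(raw, bool) or not isinstance(raw, (int, float)):
        raise ValueError(f"{key} must be a number, got {type(raw).__name__}")
    value = float(raw)
    if not math.isfinite(value) or value < 0:
        raise ValueError(f"{key} must be finite and non-negative, got {value}")
    return value


config = {"multiplier": 2, "time_penalty": None}
multiplier = read_non_negative(config, "multiplier", 1.0)      # 2.0
time_penalty = read_non_negative(config, "time_penalty", 1.0)  # falls back to 1.0
```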