
Judge Protocol

Evaluates and compares agent submissions in competitive orchestration.


Protocol Definition

dynabots_core.protocols.judge.Judge

Bases: Protocol

Protocol for evaluating agent submissions.

Judges are used in competitive orchestration frameworks (like Orc!!) to determine which agent performed better on a given task.

Implementations can use various strategies:

  - LLM evaluation (ask another model to judge)
  - Metric-based evaluation (latency, accuracy, cost)
  - Consensus voting (multiple judges)
  - Domain-specific rules

Example implementation

class MetricsJudge:
    async def evaluate(self, task: str, submissions: list[Submission]) -> Verdict:
        scores = {}
        for sub in submissions:
            score = 0.0
            # Accuracy bonus
            if self._check_correctness(sub.result):
                score += 50
            # Speed bonus
            if sub.latency_ms and sub.latency_ms < 1000:
                score += 25
            # Cost efficiency bonus
            if sub.cost and sub.cost < 0.01:
                score += 25
            scores[sub.agent] = score

        winner = max(scores, key=scores.get)
        return Verdict(
            winner=winner,
            reasoning=f"Highest combined score: {scores[winner]}",
            scores=scores
        )
Source code in packages/core/dynabots_core/protocols/judge.py
@runtime_checkable
class Judge(Protocol):
    """
    Protocol for evaluating agent submissions.

    Judges are used in competitive orchestration frameworks (like Orc!!)
    to determine which agent performed better on a given task.

    Implementations can use various strategies:
    - LLM evaluation (ask another model to judge)
    - Metric-based evaluation (latency, accuracy, cost)
    - Consensus voting (multiple judges)
    - Domain-specific rules

    Example implementation:
        class MetricsJudge:
            async def evaluate(self, task: str, submissions: list[Submission]) -> Verdict:
                scores = {}
                for sub in submissions:
                    score = 0.0
                    # Accuracy bonus
                    if self._check_correctness(sub.result):
                        score += 50
                    # Speed bonus
                    if sub.latency_ms and sub.latency_ms < 1000:
                        score += 25
                    # Cost efficiency bonus
                    if sub.cost and sub.cost < 0.01:
                        score += 25
                    scores[sub.agent] = score

                winner = max(scores, key=scores.get)
                return Verdict(
                    winner=winner,
                    reasoning=f"Highest combined score: {scores[winner]}",
                    scores=scores
                )
    """

    async def evaluate(
        self,
        task: str,
        submissions: List[Submission],
    ) -> Verdict:
        """
        Evaluate multiple submissions and determine a winner.

        Args:
            task: The original task description.
            submissions: List of agent submissions to evaluate.

        Returns:
            Verdict indicating the winner and reasoning.

        Note:
            - Submissions should have at least 2 entries for meaningful comparison
            - If submissions are equivalent, return a tie (winner="tie")
            - Include detailed reasoning for transparency
        """
        ...

evaluate(task, submissions) async

Evaluate multiple submissions and determine a winner.

Parameters:

  task (str): The original task description. [required]
  submissions (List[Submission]): List of agent submissions to evaluate. [required]

Returns:

  Verdict: Verdict indicating the winner and reasoning.

Note
  • Submissions should have at least 2 entries for meaningful comparison
  • If submissions are equivalent, return a tie (winner="tie")
  • Include detailed reasoning for transparency
Source code in packages/core/dynabots_core/protocols/judge.py
async def evaluate(
    self,
    task: str,
    submissions: List[Submission],
) -> Verdict:
    """
    Evaluate multiple submissions and determine a winner.

    Args:
        task: The original task description.
        submissions: List of agent submissions to evaluate.

    Returns:
        Verdict indicating the winner and reasoning.

    Note:
        - Submissions should have at least 2 entries for meaningful comparison
        - If submissions are equivalent, return a tie (winner="tie")
        - Include detailed reasoning for transparency
    """
    ...

dynabots_core.protocols.judge.Verdict dataclass

The result of a judge's evaluation.

Attributes:

  winner (str): Name of the winning agent
  reasoning (str): Explanation of why this agent won
  scores (Dict[str, float]): Optional per-agent scores (agent_name -> score)
  confidence (float): Judge's confidence in the verdict (0.0-1.0)
  metadata (Dict[str, Any]): Additional evaluation metadata
  timestamp (datetime): When the verdict was rendered

Example

verdict = Verdict(
    winner="DataAgent",
    reasoning="DataAgent provided more complete results with proper error handling.",
    scores={"DataAgent": 0.85, "ReportAgent": 0.72},
    confidence=0.9
)

Source code in packages/core/dynabots_core/protocols/judge.py
@dataclass
class Verdict:
    """
    The result of a judge's evaluation.

    Attributes:
        winner: Name of the winning agent
        reasoning: Explanation of why this agent won
        scores: Optional per-agent scores (agent_name -> score)
        confidence: Judge's confidence in the verdict (0.0-1.0)
        metadata: Additional evaluation metadata
        timestamp: When the verdict was rendered

    Example:
        verdict = Verdict(
            winner="DataAgent",
            reasoning="DataAgent provided more complete results with proper error handling.",
            scores={"DataAgent": 0.85, "ReportAgent": 0.72},
            confidence=0.9
        )
    """

    winner: str
    reasoning: str
    scores: Dict[str, float] = field(default_factory=dict)
    confidence: float = 1.0
    metadata: Dict[str, Any] = field(default_factory=dict)
    timestamp: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

    @property
    def is_tie(self) -> bool:
        """Check if the verdict is a tie (no clear winner)."""
        return self.winner == "" or self.winner.lower() == "tie"

    def to_dict(self) -> Dict[str, Any]:
        """Convert to dictionary for logging/storage."""
        return {
            "winner": self.winner,
            "reasoning": self.reasoning,
            "scores": self.scores,
            "confidence": self.confidence,
            "metadata": self.metadata,
            "timestamp": self.timestamp.isoformat(),
            "is_tie": self.is_tie,
        }

is_tie property

Check if the verdict is a tie (no clear winner).

to_dict()

Convert to dictionary for logging/storage.

Source code in packages/core/dynabots_core/protocols/judge.py
def to_dict(self) -> Dict[str, Any]:
    """Convert to dictionary for logging/storage."""
    return {
        "winner": self.winner,
        "reasoning": self.reasoning,
        "scores": self.scores,
        "confidence": self.confidence,
        "metadata": self.metadata,
        "timestamp": self.timestamp.isoformat(),
        "is_tie": self.is_tie,
    }

dynabots_core.protocols.judge.Submission dataclass

An agent's submission for evaluation.

Attributes:

  agent (str): Name of the agent that produced this submission
  result (Any): The agent's output (TaskResult or raw data)
  latency_ms (Optional[int]): How long the agent took (milliseconds)
  cost (Optional[float]): Cost of producing this result (e.g., API costs)
  metadata (Dict[str, Any]): Additional submission metadata

Source code in packages/core/dynabots_core/protocols/judge.py
@dataclass
class Submission:
    """
    An agent's submission for evaluation.

    Attributes:
        agent: Name of the agent that produced this submission
        result: The agent's output (TaskResult or raw data)
        latency_ms: How long the agent took (milliseconds)
        cost: Cost of producing this result (e.g., API costs)
        metadata: Additional submission metadata
    """

    agent: str
    result: Any
    latency_ms: Optional[int] = None
    cost: Optional[float] = None
    metadata: Dict[str, Any] = field(default_factory=dict)

    def to_dict(self) -> Dict[str, Any]:
        """Convert to dictionary."""
        return {
            "agent": self.agent,
            "result": self.result if not hasattr(self.result, "to_dict") else self.result.to_dict(),
            "latency_ms": self.latency_ms,
            "cost": self.cost,
            "metadata": self.metadata,
        }

to_dict()

Convert to dictionary.

Source code in packages/core/dynabots_core/protocols/judge.py
def to_dict(self) -> Dict[str, Any]:
    """Convert to dictionary."""
    return {
        "agent": self.agent,
        "result": self.result if not hasattr(self.result, "to_dict") else self.result.to_dict(),
        "latency_ms": self.latency_ms,
        "cost": self.cost,
        "metadata": self.metadata,
    }

dynabots_core.protocols.judge.ScoringJudge

Bases: Protocol

Extended judge protocol that provides individual scores.

Use this when you need to score submissions independently, not just compare them against each other.

Source code in packages/core/dynabots_core/protocols/judge.py
@runtime_checkable
class ScoringJudge(Protocol):
    """
    Extended judge protocol that provides individual scores.

    Use this when you need to score submissions independently,
    not just compare them against each other.
    """

    async def score(self, task: str, submission: Submission) -> float:
        """
        Score a single submission (0.0 to 1.0).

        Args:
            task: The original task description.
            submission: Single agent submission to score.

        Returns:
            Score between 0.0 (worst) and 1.0 (best).
        """
        ...

    async def evaluate(
        self,
        task: str,
        submissions: List[Submission],
    ) -> Verdict:
        """Evaluate by scoring each submission and comparing."""
        ...

evaluate(task, submissions) async

Evaluate by scoring each submission and comparing.

Source code in packages/core/dynabots_core/protocols/judge.py
async def evaluate(
    self,
    task: str,
    submissions: List[Submission],
) -> Verdict:
    """Evaluate by scoring each submission and comparing."""
    ...

score(task, submission) async

Score a single submission (0.0 to 1.0).

Parameters:

  task (str): The original task description. [required]
  submission (Submission): Single agent submission to score. [required]

Returns:

  float: Score between 0.0 (worst) and 1.0 (best).

Source code in packages/core/dynabots_core/protocols/judge.py
async def score(self, task: str, submission: Submission) -> float:
    """
    Score a single submission (0.0 to 1.0).

    Args:
        task: The original task description.
        submission: Single agent submission to score.

    Returns:
        Score between 0.0 (worst) and 1.0 (best).
    """
    ...
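Unlike the plain Judge protocol, ScoringJudge has no worked example above. The sketch below is purely illustrative: `KeywordScoringJudge` and its keyword-overlap heuristic are hypothetical, and the `Submission`/`Verdict` dataclasses are minimal stand-ins mirroring the documented ones so the snippet runs standalone (import them from `dynabots_core.protocols.judge` in real code). The key idea is deriving `evaluate` from independent `score` calls:

```python
import asyncio
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional

# Stand-ins mirroring the documented dataclasses; in a real project,
# import Submission and Verdict from dynabots_core.protocols.judge.
@dataclass
class Submission:
    agent: str
    result: Any
    latency_ms: Optional[int] = None
    cost: Optional[float] = None
    metadata: Dict[str, Any] = field(default_factory=dict)

@dataclass
class Verdict:
    winner: str
    reasoning: str
    scores: Dict[str, float] = field(default_factory=dict)
    confidence: float = 1.0

class KeywordScoringJudge:
    """ScoringJudge sketch: score each submission independently, then compare."""

    async def score(self, task: str, submission: Submission) -> float:
        # Illustrative heuristic only: fraction of task words echoed in the result.
        task_words = set(task.lower().split())
        result_words = set(str(submission.result).lower().split())
        return len(task_words & result_words) / len(task_words) if task_words else 0.0

    async def evaluate(self, task: str, submissions: List[Submission]) -> Verdict:
        # evaluate() is defined entirely in terms of score().
        scores = {s.agent: await self.score(task, s) for s in submissions}
        if not scores:
            return Verdict(winner="tie", reasoning="No submissions to evaluate")
        winner = max(scores, key=scores.get)
        return Verdict(
            winner=winner,
            reasoning=f"Highest independent score: {scores[winner]:.2f}",
            scores=scores,
        )

verdict = asyncio.run(KeywordScoringJudge().evaluate(
    "summarize the report",
    [
        Submission(agent="A", result="the report was summarized"),
        Submission(agent="B", result="no idea"),
    ],
))
print(verdict.winner)
```

Because each score is computed in isolation, this style also lets you rank a single submission against an absolute bar, not just against its rivals.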

Simple Implementation

from typing import Any, Dict, List
from dynabots_core import Judge, Verdict
from dynabots_core.protocols.judge import Submission

class FastestWinsJudge:
    """Judge that awards the fastest submission."""

    async def evaluate(
        self,
        task: str,
        submissions: List[Submission],
    ) -> Verdict:
        """Winner is the fastest agent."""
        if not submissions:
            return Verdict(
                winner="none",
                reasoning="No submissions to evaluate"
            )

        # Sort by latency
        fastest = min(submissions, key=lambda s: s.latency_ms or float('inf'))

        return Verdict(
            winner=fastest.agent,
            reasoning=f"{fastest.agent} executed fastest ({fastest.latency_ms}ms)",
            scores={
                s.agent: 1.0 if s.agent == fastest.agent else 0.0
                for s in submissions
            },
            confidence=1.0,
        )

LLM-Based Judge

Use an LLM to evaluate submissions:

from dynabots_core import Judge, Verdict, LLMMessage
from dynabots_core.protocols.judge import Submission
from dynabots_core.providers import OllamaProvider
from typing import List

class LLMJudge:
    """Judge using an LLM to evaluate submissions."""

    def __init__(self, llm):
        self.llm = llm

    async def evaluate(
        self,
        task: str,
        submissions: List[Submission],
    ) -> Verdict:
        """Use LLM to evaluate submissions."""
        # Format submissions for LLM
        submissions_text = "\n\n".join([
            f"Submission from {s.agent}:\n{s.result}"
            for s in submissions
        ])

        prompt = f"""You are evaluating agent submissions for a task.

Task: {task}

Submissions:
{submissions_text}

Evaluate each submission on:
- Accuracy and correctness
- Completeness of response
- Clarity and helpfulness
- Efficiency (consider latency if provided)

Provide your evaluation as JSON:
{{
    "winner": "name of winning agent",
    "reasoning": "brief explanation",
    "scores": {{"agent1": 0.9, "agent2": 0.7}}
}}"""

        response = await self.llm.complete([
            LLMMessage(role="user", content=prompt)
        ])

        # Parse LLM response
        import json
        try:
            result = json.loads(response.content)
        except json.JSONDecodeError:
            # Fallback: pick first
            result = {
                "winner": submissions[0].agent if submissions else "unknown",
                "reasoning": "Could not parse LLM response",
                "scores": {}
            }

        return Verdict(
            winner=result["winner"],
            reasoning=result["reasoning"],
            scores=result.get("scores", {}),
            confidence=0.8,  # LLM judges have inherent uncertainty
        )

Metrics-Based Judge

Score submissions using objective metrics:

from dynabots_core import Judge, Verdict
from dynabots_core.protocols.judge import Submission
from typing import List

class MetricsJudge:
    """Judge based on objective metrics."""

    async def evaluate(
        self,
        task: str,
        submissions: List[Submission],
    ) -> Verdict:
        """Score submissions using metrics."""
        scores = {}

        for submission in submissions:
            score = 0.0

            # Accuracy (simulated check)
            if self._is_correct(submission.result):
                score += 50

            # Speed bonus (faster = better)
            if submission.latency_ms and submission.latency_ms < 1000:
                speed_bonus = (1000 - submission.latency_ms) / 1000 * 30
                score += speed_bonus

            # Cost efficiency (lower cost = better)
            if submission.cost and submission.cost < 0.1:
                cost_bonus = (0.1 - submission.cost) / 0.1 * 20
                score += cost_bonus

            scores[submission.agent] = score

        # Determine winner
        winner = max(scores, key=scores.get) if scores else "unknown"
        max_score = scores.get(winner, 0) if scores else 0

        return Verdict(
            winner=winner,
            reasoning=f"Highest composite score: {max_score:.1f}",
            scores={k: round(v, 1) for k, v in scores.items()},
            confidence=1.0,  # Metrics are deterministic
        )

    def _is_correct(self, result) -> bool:
        """Check if result is correct (domain-specific)."""
        # Implement your correctness check
        return True

Consensus Judge

Multiple judges vote:

from dynabots_core import Judge, Verdict
from dynabots_core.protocols.judge import Submission
from typing import List
from collections import Counter

class ConsensusJudge:
    """Multiple judges vote on winner."""

    def __init__(self, judges: List[Judge]):
        self.judges = judges

    async def evaluate(
        self,
        task: str,
        submissions: List[Submission],
    ) -> Verdict:
        """Collect verdicts from all judges."""
        verdicts = []

        for judge in self.judges:
            verdict = await judge.evaluate(task, submissions)
            verdicts.append(verdict)

        # Consensus: most common winner
        winners = [v.winner for v in verdicts]
        winner = Counter(winners).most_common(1)[0][0]

        # Average scores
        avg_scores = {}
        for agent in {s.agent for s in submissions}:
            agent_scores = [
                v.scores.get(agent, 0)
                for v in verdicts
                if v.scores
            ]
            if agent_scores:
                avg_scores[agent] = sum(agent_scores) / len(agent_scores)

        # Confidence based on consensus
        confidence = winners.count(winner) / len(verdicts) if verdicts else 0

        return Verdict(
            winner=winner,
            reasoning=f"{len(verdicts)} judges voted; {winner} wins ({confidence:.1%} consensus)",
            scores=avg_scores,
            confidence=confidence,
        )

Tie Handling

What if submissions are equivalent?

class TieAwareJudge:
    async def evaluate(self, task: str, submissions: List[Submission]) -> Verdict:
        # ... evaluation logic ...

        if performance_is_equal:
            return Verdict(
                winner="tie",  # Signal a tie
                reasoning="Both submissions equal in all metrics",
                scores={s.agent: 0.5 for s in submissions},
                confidence=1.0,
            )

In orchestration frameworks, a tie usually means the current leader stays (no succession).
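That "leader stays" rule can be sketched with the documented is_tie property. The `next_leader` helper below is hypothetical, and the `Verdict` dataclass is a stand-in reproducing the documented is_tie semantics so the snippet runs standalone:

```python
from dataclasses import dataclass

# Stand-in reproducing the documented is_tie semantics;
# import Verdict from dynabots_core in real code.
@dataclass
class Verdict:
    winner: str
    reasoning: str

    @property
    def is_tie(self) -> bool:
        return self.winner == "" or self.winner.lower() == "tie"

def next_leader(current_leader: str, verdict: Verdict) -> str:
    # Hypothetical succession rule: on a tie, the incumbent keeps the role.
    return current_leader if verdict.is_tie else verdict.winner

print(next_leader("DataAgent", Verdict(winner="tie", reasoning="Equal metrics")))
print(next_leader("DataAgent", Verdict(winner="ReportAgent", reasoning="Better results")))
```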


Verdict Properties

The result of evaluation:

verdict = await judge.evaluate(task, submissions)

# Core
print(verdict.winner)       # Winning agent name
print(verdict.reasoning)    # Why they won

# Metrics
print(verdict.scores)       # {"Agent1": 0.9, "Agent2": 0.7}
print(verdict.confidence)   # 0.0 to 1.0

# Metadata
print(verdict.metadata)     # Custom data
print(verdict.timestamp)    # When rendered

# Convenience
print(verdict.is_tie)       # Check if tie

Submission Properties

What judges receive:

submission = Submission(
    agent="DataAgent",
    result=TaskResult.success(...),
    latency_ms=250,           # How long it took
    cost=0.01,                # API cost
    metadata={"accuracy": 0.95}  # Custom data
)

print(submission.agent)       # Agent name
print(submission.result)      # TaskResult
print(submission.latency_ms)  # Execution time
print(submission.cost)        # Financial cost
print(submission.metadata)    # Additional data

Domain-Specific Judges

Create judges for your domain:

from dynabots_core import Judge, Verdict
from dynabots_core.protocols.judge import Submission
from typing import List

class CodeQualityJudge:
    """Judge code submissions on quality."""

    async def evaluate(
        self,
        task: str,
        submissions: List[Submission],
    ) -> Verdict:
        """Evaluate code quality."""
        scores = {}

        for submission in submissions:
            code = submission.result.data.get("code", "")

            quality = 0.0

            # Style
            if self._check_style(code):
                quality += 20

            # Tests
            if self._has_tests(code):
                quality += 30

            # Performance
            if self._is_performant(code):
                quality += 20

            # Documentation
            if self._has_docs(code):
                quality += 20

            # Correctness (if submitting results)
            if self._passes_tests(code):
                quality += 10

            scores[submission.agent] = quality

        winner = max(scores, key=scores.get) if scores else "unknown"

        return Verdict(
            winner=winner,
            reasoning=f"{winner} has best code quality ({scores.get(winner, 0):.0f}/100)",
            scores=scores,
            confidence=0.9,
        )

    def _check_style(self, code: str) -> bool:
        # Check PEP 8, etc.
        return True

    def _has_tests(self, code: str) -> bool:
        return "def test_" in code

    def _is_performant(self, code: str) -> bool:
        # Check time complexity, etc.
        return True

    def _has_docs(self, code: str) -> bool:
        return '"""' in code

    def _passes_tests(self, code: str) -> bool:
        # Run tests
        return True

Best Practices

  1. Clear reasoning: Always provide detailed reasoning; it is used for learning and debugging.
  2. Consistent scoring: Use the 0.0-1.0 range for scores so verdicts are easy to compare.
  3. Confidence: Include a confidence value in your verdict; reported uncertainty helps orchestration decide how much weight to give it.
  4. Metadata: Attach evaluation details in metadata for later analysis.
  5. Determinism: For testing, judges should be deterministic (or document their randomness).
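The determinism point can be checked with plain assertions: run the judge twice on identical inputs and require identical verdicts. The snippet below is a minimal sketch using a FastestWins-style judge and stand-in dataclasses so it runs standalone (use the real dynabots_core types in your tests):

```python
import asyncio
from dataclasses import dataclass
from typing import Any, List, Optional

# Stand-ins so the snippet runs standalone; use dynabots_core types in real tests.
@dataclass
class Submission:
    agent: str
    result: Any
    latency_ms: Optional[int] = None

@dataclass
class Verdict:
    winner: str
    reasoning: str

class FastestWinsJudge:
    async def evaluate(self, task: str, submissions: List[Submission]) -> Verdict:
        fastest = min(submissions, key=lambda s: s.latency_ms or float("inf"))
        return Verdict(winner=fastest.agent, reasoning=f"{fastest.latency_ms}ms")

async def main() -> None:
    subs = [Submission("A", "x", 120), Submission("B", "y", 90)]
    judge = FastestWinsJudge()
    first = await judge.evaluate("task", subs)
    second = await judge.evaluate("task", subs)
    # Deterministic judge: identical inputs must yield identical verdicts.
    assert first.winner == second.winner == "B"
    print(first.winner)

asyncio.run(main())
```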

See Also