Use Cases¶
Real-world applications of competitive multi-agent orchestration.
1. Model Selection (Which LLM?)¶
Problem: You have multiple LLM options (GPT-4, Claude, Llama). Which one is best for YOUR tasks?
Benchmarks are generic. You need empirical evidence on your specific use cases.
Solution¶
Run ORC's Model Showdown:
import asyncio

from orc import TheArena, Warrior, Elder
from orc.judges import LLMJudge
from dynabots_core.providers import OllamaProvider, OpenAIProvider, AnthropicProvider


async def main():
    # Judge model
    judge_llm = OllamaProvider(model="qwen2.5:72b")

    # Warriors using different models
    gpt4 = Warrior(
        name="GPT-4o",
        llm_client=OpenAIProvider(model="gpt-4o"),
        system_prompt="You are an expert...",
        domains=["analysis"],
    )
    claude = Warrior(
        name="Claude",
        llm_client=AnthropicProvider(model="claude-3-opus-20240229"),
        system_prompt="You are an expert...",
        domains=["analysis"],
    )
    mistral = Warrior(
        name="Mistral",
        llm_client=OllamaProvider(model="mistral:latest"),
        system_prompt="You are an expert...",
        domains=["analysis"],
    )

    elder = Elder(judge=LLMJudge(judge_llm, criteria=["accuracy", "clarity"]))
    arena = TheArena(
        warriors=[gpt4, claude, mistral],
        elder=elder,
        challenge_probability=0.9,
    )

    # Your real tasks
    tasks = [
        "Analyze customer feedback...",
        "Summarize financial report...",
        "Evaluate product requirements...",
    ]
    for task in tasks:
        result = await arena.battle(task)

    # Winner: clear choice for your domain
    leaderboard = arena.get_leaderboard("analysis")
    best_model = leaderboard[0]["agent"]
    print(f"Deploy: {best_model}")


asyncio.run(main())
Benefit¶
- Empirical — Based on your real tasks, not generic benchmarks
- Clear winner — Leaderboard shows the best model for YOU
- Cost-aware — Pick the cheapest model that's "good enough"
- Reproducible — Run quarterly to see if new models are better
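The cost-aware point can be made concrete. Here is a minimal sketch, independent of the ORC API, that picks the cheapest model whose win rate clears a "good enough" threshold; the price figures, threshold, and tallies are illustrative assumptions, not real rates or results:

```python
# Pick the cheapest model whose win rate is "good enough",
# given leaderboard-style entries and assumed per-1M-token prices.

def cheapest_good_enough(leaderboard, prices, min_win_rate=0.3):
    """Return the cheapest agent whose win rate meets the threshold."""
    qualified = [
        e for e in leaderboard
        if e["wins"] / max(e["wins"] + e["losses"], 1) >= min_win_rate
    ]
    if not qualified:
        return None
    return min(qualified, key=lambda e: prices[e["agent"]])["agent"]

# Illustrative numbers only (not real pricing or benchmark results)
leaderboard = [
    {"agent": "GPT-4o", "wins": 7, "losses": 3},
    {"agent": "Claude", "wins": 6, "losses": 4},
    {"agent": "Mistral", "wins": 4, "losses": 6},
]
prices = {"GPT-4o": 5.00, "Claude": 15.00, "Mistral": 0.25}

print(cheapest_good_enough(leaderboard, prices))  # Mistral (0.4 >= 0.3)
```

Raising `min_win_rate` trades cost for quality: at a 0.6 threshold only the two frontier models qualify, and the cheaper of those wins.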
2. Prompt Engineering (Which Prompt?)¶
Problem: Different system prompts produce different results. Which one works best for your tasks?
ORC lets you compare prompts head-to-head.
Solution¶
Same model, different prompts:
from orc import Warrior, Elder, TheArena
from orc.judges import LLMJudge
from dynabots_core.providers import OpenAIProvider

judge_llm = OpenAIProvider(model="gpt-4o")

# Same model, different prompts
analytical = Warrior(
    name="Analytical",
    llm_client=OpenAIProvider(model="gpt-4o"),
    system_prompt="""You are a data analyst. Focus on numbers and trends.
Be precise. Avoid speculation.""",
    domains=["analysis"],
)
creative = Warrior(
    name="Creative",
    llm_client=OpenAIProvider(model="gpt-4o"),
    system_prompt="""You are a creative analyst. Find novel insights.
Look for interesting patterns. Be imaginative.""",
    domains=["analysis"],
)
concise = Warrior(
    name="Concise",
    llm_client=OpenAIProvider(model="gpt-4o"),
    system_prompt="""You are a concise analyst. Be brief and direct.
Skip fluff. Deliver actionable insights fast.""",
    domains=["analysis"],
)

elder = Elder(judge=LLMJudge(judge_llm))
arena = TheArena(
    warriors=[analytical, creative, concise],
    elder=elder,
)

# Run on your tasks...
# Winner: best system prompt for your use case
Benefit¶
- Optimize cheaply — Trying new prompts costs far less than switching models
- Domain-specific — Find the prompt that works best for YOUR tasks
- Iterative — Refine the winning prompt, then test again
- Team-driven — When a teammate suggests a prompt, test it head-to-head
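One caution when reading the leaderboard: over a handful of tasks, a narrow win margin can be noise. A quick sanity check, independent of ORC, is a two-sided binomial test on head-to-head wins (stdlib only; the win/loss splits below are illustrative):

```python
from math import comb

def binomial_p_value(wins, losses):
    """Two-sided binomial test: could this win/loss split be a coin flip?"""
    n = wins + losses
    k = max(wins, losses)
    # Probability of a split at least this lopsided under p = 0.5
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# 7-3 over 10 tasks is not yet convincing evidence...
print(f"{binomial_p_value(7, 3):.3f}")   # 0.344
# ...but 21-9 over 30 tasks is much stronger.
print(f"{binomial_p_value(21, 9):.3f}")  # 0.043
```

In practice that means: if one prompt leads by a couple of wins, add more tasks before declaring a winner.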
3. Agent Routing (Self-Optimizing)¶
Problem: You have multiple agents (data team, code team, writing team). Which agent should handle each task?
Static routing is hard-coded. ORC's competitive system automatically routes to the best agent.
Solution¶
Multi-domain arena:
from orc import Warrior, Elder, TheArena
from orc.judges import LLMJudge

# Specialist agents
data_agent = Warrior(
    name="DataAgent",
    llm_client=...,
    system_prompt="You specialize in data analysis and SQL...",
    domains=["data_analysis", "sql", "metrics"],
)
code_agent = Warrior(
    name="CodeAgent",
    llm_client=...,
    system_prompt="You specialize in Python development...",
    domains=["backend", "python", "architecture"],
)
docs_agent = Warrior(
    name="DocsAgent",
    llm_client=...,
    system_prompt="You specialize in technical writing...",
    domains=["documentation", "copywriting", "communication"],
)

# Single judge
elder = Elder(judge=LLMJudge(...))

# Arena with multiple domains
arena = TheArena(
    warriors=[data_agent, code_agent, docs_agent],
    elder=elder,
)

# Incoming tasks
async def route_task(task_description):
    # Let the arena decide
    result = await arena.battle(task_description)
    # Winner is the best agent for this task
    return result.winner

# Over time, warchiefs emerge for each domain
data_warchief = arena.get_warchief("data_analysis")
code_warchief = arena.get_warchief("backend")
docs_warchief = arena.get_warchief("documentation")
Benefit¶
- No manual routing — Arena figures out the best agent automatically
- Adapts over time — If an agent improves, it naturally wins more domains
- Self-optimizing — Leadership changes as agents perform
- Fair competition — Every agent gets a chance to prove itself
4. Research (Emergent Behavior)¶
Problem: How do multiple agents interact? Can we study emergent hierarchies?
ORC provides a framework for multi-agent research.
Solution¶
import asyncio

from orc import Arena, ArenaConfig, MetricsJudge
from dynabots_core import Agent  # Implement custom agents


class ResearchAgent(Agent):
    """Custom agent for research."""

    def __init__(self, name, strategy):
        self.name = name
        self.strategy = strategy

    async def process_task(self, task, context=None):
        # Your research logic
        pass


async def main():
    # Create agents with different strategies
    agents = [
        ResearchAgent("Aggressive", strategy="always_challenge"),
        ResearchAgent("Conservative", strategy="reputation_based"),
        ResearchAgent("Patient", strategy="cooldown"),
        ResearchAgent("Specialist", strategy="specialist"),
    ]

    judge = MetricsJudge()
    arena = Arena(
        agents=agents,
        judge=judge,
        config=ArenaConfig(
            challenge_probability=0.5,
            max_consecutive_defenses=5,
        ),
    )

    # Run long trial
    tasks = ["Task A", "Task B", "Task C"] * 100  # 300 tasks
    for i, task in enumerate(tasks):
        result = await arena.process(task)

        # Track over time
        if i % 30 == 0:
            for domain in ["research"]:
                lb = arena.get_leaderboard(domain)
                print(f"Task {i}: Leaderboard")
                for entry in lb:
                    print(f"  {entry['agent']}: rep={entry['reputation']:.2f}")

    # Analyze emergent patterns
    print("\nFinal Leaderboard:")
    for entry in arena.get_leaderboard("research"):
        print(f"{entry['agent']}: {entry['reputation']:.3f} "
              f"(W:{entry['wins']} L:{entry['losses']})")


asyncio.run(main())
Research Questions¶
- Do aggressive agents succeed or burn out?
- What strategy wins over time?
- Does leadership concentration (Zipfian distribution) emerge?
- How does diversity affect system performance?
- What causes leadership transitions?
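The leadership-concentration question can be quantified straight from leaderboard data. A minimal sketch, in plain Python and independent of the Arena API, that computes each agent's share of all wins; the tallies are illustrative, not real trial results:

```python
def win_shares(leaderboard):
    """Fraction of all wins held by each agent, sorted descending."""
    total = sum(e["wins"] for e in leaderboard) or 1
    return sorted((e["wins"] / total for e in leaderboard), reverse=True)

def top_share(leaderboard):
    """Share of all wins held by the single top agent."""
    return win_shares(leaderboard)[0]

# Illustrative tallies after a long trial (not real results)
lb = [
    {"agent": "Aggressive", "wins": 150},
    {"agent": "Conservative", "wins": 90},
    {"agent": "Patient", "wins": 40},
    {"agent": "Specialist", "wins": 20},
]
print(win_shares(lb))  # [0.5, 0.3, 0.133..., 0.066...]
print(top_share(lb))   # 0.5 -> half of all wins held by one agent
```

If the ranked shares decay roughly like 1/rank across trials, that is the Zipfian concentration the question asks about; a flat profile instead suggests no stable hierarchy emerged.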
Benefit¶
- Novel insights — Observe multi-agent dynamics
- Configurable — Adjust parameters, re-run, compare
- Reproducible — Re-run the same code with different seeds and compare outcomes
- Publication-ready — Clear metrics, leaderboards, verdicts
5. Feature A/B Testing (Agile Decisions)¶
Problem: We built two versions of a feature. Which is better?
Instead of beta testing with users, pit the two versions against each other in ORC.
Solution¶
# Version A: Current implementation
current = Warrior(
    name="FeatureA-Current",
    llm_client=...,
    system_prompt="""Implement the feature using the current approach:
single database query, real-time updates.""",
    domains=["feature_implementation"],
)

# Version B: Proposed implementation
proposed = Warrior(
    name="FeatureB-Proposed",
    llm_client=...,
    system_prompt="""Implement the feature using the proposed approach:
caching layer, eventual consistency.""",
    domains=["feature_implementation"],
)

elder = Elder(judge=LLMJudge(
    judge_llm,  # your judge model provider
    criteria=[
        "Performance",
        "Maintainability",
        "User Experience",
        "Scalability",
    ],
))
arena = TheArena(warriors=[current, proposed], elder=elder)

# Test cases (user scenarios)
scenarios = [
    "100 concurrent users...",
    "10,000 user dataset...",
    "Mobile client use case...",
    "Offline then online scenario...",
]
for scenario in scenarios:
    result = await arena.battle(scenario)  # run inside an async context
    print(f"{scenario} -> Winner: {result.winner}")

# Leaderboard tells you which is better overall
winner = arena.get_leaderboard("feature_implementation")[0]["agent"]
print(f"Deploy: {winner}")
Benefit¶
- Quick decisions — Get an answer in minutes instead of weeks of beta testing
- Objective comparison — The judge scores both versions against the same criteria
- Cheap — Running LLM scenarios costs far less than a beta test
- Repeatable — Run again with new scenarios as requirements evolve
Which Use Case Is For You?¶
| Use Case | Goal | Tools | Effort |
|---|---|---|---|
| Model Selection | Find best LLM | Multiple LLM providers + LLMJudge | Low |
| Prompt Engineering | Find best prompt | One LLM + custom prompts | Low |
| Agent Routing | Auto-route to best agent | Multi-domain arena | Medium |
| Research | Study multi-agent dynamics | Custom agents + metrics | High |
| Feature A/B Testing | Compare implementations | Domain-specific agents | Medium |
General Recipe¶
- Identify the competition — What are you comparing? (Models, prompts, agents, implementations)
- Create Warriors — One for each option
- Create Elder judge — Aligned with your evaluation criteria
- Run trials — On your real tasks/scenarios
- Read leaderboard — Clear winner emerges
- Deploy or iterate — Act on the results
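The recipe itself is library-agnostic. As a hedged, self-contained sketch, here it is end to end with plain functions standing in for Warriors and a toy keyword-overlap score standing in for the Elder's judge (both stand-ins are illustrative assumptions, not ORC APIs):

```python
# Steps 1-6 of the recipe with stand-in components.

# Step 2: "warriors" are just candidate functions here
warriors = {
    "Verbose": lambda task: f"Detailed analysis of {task} with full background",
    "Concise": lambda task: f"{task}: key finding, action item",
}

# Step 3: a toy "judge" scoring how many desired keywords a response hits
def judge(response, keywords):
    return sum(1 for kw in keywords if kw in response)

# Step 4: run trials on your tasks
tasks = [("quarterly revenue", ["key finding", "action item"]),
         ("churn report", ["key finding", "action item"])]
wins = {name: 0 for name in warriors}
for task, keywords in tasks:
    scores = {name: judge(fn(task), keywords) for name, fn in warriors.items()}
    wins[max(scores, key=scores.get)] += 1

# Step 5: read the leaderboard
leaderboard = sorted(wins.items(), key=lambda kv: kv[1], reverse=True)
print(leaderboard)  # [('Concise', 2), ('Verbose', 0)]
```

Step 6 is then a one-line decision: deploy `leaderboard[0][0]`, or refine the losers and re-run. Swapping in real Warriors and an LLMJudge changes the components, not the shape of the loop.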
Next Steps¶
- Pick a use case above
- Follow the pattern
- Adapt for your domain
- See results in minutes
ORC makes multi-agent competition simple, fast, and objective.