Smart Routing
Pick the cheapest / fastest / highest-quality model that meets your constraints, automatically. You describe what you care about; we pick.
Why smart routing
StellarCloud offers several LLMs with different price-quality-latency trade-offs. Hardcoding a specific model in every agent locks you in:
- Cost goes up because you’re over-provisioning for simple tasks
- Quality suffers because you’re using a small model for hard queries
- You have to rewrite dozens of configs when a better model appears
Smart routing solves this by picking per-request based on your declared preferences.
How it works
Instead of pinning a specific model, you specify a routing policy ("model": "@route:cheap-first"). The router looks at your request, weighs it against your policy’s constraints, picks the model that best satisfies them, and returns the completion plus metadata about what was used.
Built-in policies
| Policy | Behaviour |
|---|---|
@route:cheap-first | Pick the cheapest model that can plausibly handle the task. Escalate only if quality gates fail. |
@route:fast-first | Pick the lowest-latency model. Trades quality for speed. |
@route:best | Always pick the highest-quality model. Use when cost is secondary. |
@route:balanced | Default policy. A sensible mid-tier choice unless the request needs frontier reasoning, long context, or strong multilingual handling. |
@route:code | Optimised for code generation and tool-calling. |
@route:long-context | Pick the model with the largest context window. |
Custom policies
Define your own routing rules. Dimensions you can constrain:
- Budget — max cost per request, max daily / monthly spend
- Latency — P50 / P95 target
- Quality floor — a minimum benchmark score (from an internal eval set)
- Context size — route by expected prompt length
- Language — route certain languages to specific models
- Model allowlist — restrict to a subset of models
- Fallback chain — if primary fails or exceeds SLO, try next
Fallbacks
Every routing decision has an implicit fallback chain. If the chosen model is down, rate-limited, or returns an error, the router tries the next eligible model. You always get a response (or a definitive error) — you never see transient provider outages.
Observability
The response metadata tells you:
- Which model was chosen
- Why (policy + which constraint dominated)
- Alternatives considered
- Estimated cost vs. actual
Dashboards show distribution of model usage over time — useful for cost optimization and for spotting cases where a policy is under-routing or over-routing.
When not to use smart routing
- Regulatory fixation — if compliance requires a specific model for audit reasons, hardcode it
- Reproducibility — for workflows where the exact model is part of the “recipe” (e.g. a reproducible research pipeline), pick explicitly
- Fine-grained benchmarking — while A/B testing a model for a specific task, pick explicitly
Evaluating models
For your own workloads, we provide an eval harness:
- Build a test set from your real queries
- Run all candidate models against it
- Score with a metric of your choice (exact match, LLM-as-judge, human labels)
- See which model wins on your specific workload
Use the results to tune your routing policy — or to justify hardcoding.
