StellarBase
Docs StellarCloud Smart Routing
StellarCloud

Smart Routing

Pick the cheapest / fastest / highest-quality model that meets your constraints, automatically. You describe what you care about; we pick.

Why smart routing

StellarCloud offers several LLMs with different price-quality-latency trade-offs. Hardcoding a specific model in every agent locks you in:

  • Cost goes up because you’re over-provisioning for simple tasks
  • Quality suffers because you’re using a small model for hard queries
  • You have to rewrite dozens of configs when a better model appears

Smart routing solves this by picking per-request based on your declared preferences.

How it works

Instead of pinning a specific model, you specify a routing policy ("model": "@route:cheap-first"). The router looks at your request, weighs it against your policy’s constraints, picks the model that best satisfies them, and returns the completion plus metadata about what was used.

Built-in policies

PolicyBehaviour
@route:cheap-firstPick the cheapest model that can plausibly handle the task. Escalate only if quality gates fail.
@route:fast-firstPick the lowest-latency model. Trades quality for speed.
@route:bestAlways pick the highest-quality model. Use when cost is secondary.
@route:balancedDefault policy. A sensible mid-tier choice unless the request needs frontier reasoning, long context, or strong multilingual handling.
@route:codeOptimised for code generation and tool-calling.
@route:long-contextPick the model with the largest context window.

Custom policies

Define your own routing rules. Dimensions you can constrain:

  • Budget — max cost per request, max daily / monthly spend
  • Latency — P50 / P95 target
  • Quality floor — a minimum benchmark score (from an internal eval set)
  • Context size — route by expected prompt length
  • Language — route certain languages to specific models
  • Model allowlist — restrict to a subset of models
  • Fallback chain — if primary fails or exceeds SLO, try next

Fallbacks

Every routing decision has an implicit fallback chain. If the chosen model is down, rate-limited, or returns an error, the router tries the next eligible model. You always get a response (or a definitive error) — you never see transient provider outages.

Observability

The response metadata tells you:

  • Which model was chosen
  • Why (policy + which constraint dominated)
  • Alternatives considered
  • Estimated cost vs. actual

Dashboards show distribution of model usage over time — useful for cost optimization and for spotting cases where a policy is under-routing or over-routing.

When not to use smart routing

  • Regulatory fixation — if compliance requires a specific model for audit reasons, hardcode it
  • Reproducibility — for workflows where the exact model is part of the “recipe” (e.g. a reproducible research pipeline), pick explicitly
  • Fine-grained benchmarking — while A/B testing a model for a specific task, pick explicitly

Evaluating models

For your own workloads, we provide an eval harness:

  • Build a test set from your real queries
  • Run all candidate models against it
  • Score with a metric of your choice (exact match, LLM-as-judge, human labels)
  • See which model wins on your specific workload

Use the results to tune your routing policy — or to justify hardcoding.

Related