Large Language Models

Open-weights generative models served from EU GPUs. Different sizes, different strengths — pick one explicitly, or let smart routing decide.

Available models

Per-token pricing is on the StellarCloud pricing page.

GPT-OSS 120B

Open-weights LLM released by OpenAI. Broad general-purpose capability, instruction-tuned, reliable for most workloads. Apache 2.0 licence. 128K context.

Best for: general question-answering, summarisation, most agent workflows, drafting
Latency: depends on output length; fastest of the three on short responses

Devstral 2

Mistral’s coding-focused model. Strong on software engineering tasks, tool use, agentic workflows with multi-step reasoning. Apache 2.0. 128K context.

Best for: code generation + review, technical writing, workflows that invoke many tools, structured output
Latency: moderate. Strong tool-calling compliance reduces retry overhead.

Qwen 3.5 397B A17B

Frontier-tier open-weights model. Multilingual reasoning, long context, top-end quality. 256K context.

Best for: complex reasoning, multilingual workloads, long-context analysis (full contracts, long papers), anywhere you’d use GPT-4
Latency: highest. The model is the largest; responses take longest.

Choosing between them

You care about…	Pick
Cheapest inference	GPT-OSS 120B
Fastest response on short prompts	GPT-OSS 120B
Best coding / tool-calling quality	Devstral 2
Long-context (> 128K tokens)	Qwen 3.5 397B
Best multilingual (EU languages)	Qwen 3.5 397B
Best reasoning on hard problems	Qwen 3.5 397B
Budget-balanced default	Smart routing

Context windows

A “context window” is the maximum combined size of input + output. Larger is not always better — longer contexts cost more and are slower. Typical guidance:

Short interactions (chat, Q&A): any model at its default context is fine
Document analysis (one contract, one paper): 128K is enough for 300+ pages
Corpus analysis (multiple documents, long conversations): 256K (Qwen 3.5) or retrieval-augmented

Because StellarBase agents do retrieval-augmented generation by default, you rarely need to pass huge context — the knowledge base serves the relevant passages and the model operates on a curated subset. Long-context models are still useful for specific workflows (summarising a full regulation, analysing a 500-page monograph).

Structured output

All three models support structured output (JSON Schema or Pydantic-like definitions). Devstral 2 has the strongest schema compliance — we recommend it for workflows that must produce strict outputs (CSV rows, database inserts, fields for external systems).

Streaming

Token-by-token streaming is supported on all three models. Default for chat-like workloads, optional for agent workflows where you want the full response before acting.

Tool use / function calling

All three support OpenAI-compatible tool calling. Devstral 2 has the most reliable tool-calling behaviour across complex multi-step workflows.

Fine-tuning

Not offered as a managed service for these models. For customisation, use:

System prompts — for behavioural tuning
Retrieval — for knowledge adaptation (almost always the right answer)
On-premise deployment — if you need full weight control, self-host the model in your infrastructure and fine-tune there

Commercial LLMs (GPT-4, Claude, Gemini)

Not hosted on StellarCloud — they’re proprietary and hosted by their vendors. You can still use them from StellarBase via StellarGate, which anonymizes your prompts before forwarding them to the vendor.

Smart Routing — let the platform choose
API Reference — endpoint details
StellarGate — using commercial LLMs with privacy