Large Language Models
Open-weights generative models served from EU GPUs. Different sizes, different strengths — pick one explicitly, or let smart routing decide.
Available models
GPT-OSS 120B
Open-weights LLM released by OpenAI. Broad general-purpose capability, instruction-tuned, reliable for most workloads. Apache 2.0 licence. 128K context.
- Best for: general question-answering, summarisation, most agent workflows, drafting
- Price: €0.20 / 1M input tokens · €0.80 / 1M output tokens
- Latency: depends on output length; fastest of the three on short responses
Devstral 2
Mistral’s coding-focused model. Strong on software engineering tasks, tool use, agentic workflows with multi-step reasoning. Apache 2.0. 128K context.
- Best for: code generation + review, technical writing, workflows that invoke many tools, structured output
- Price: €0.50 / 1M input tokens · €2.00 / 1M output tokens
- Latency: moderate. Strong tool-calling compliance reduces retry overhead.
Qwen 3.5 397B A17B
Frontier-tier open-weights model. Multilingual reasoning, long context, top-end quality. 256K context.
- Best for: complex reasoning, multilingual workloads, long-context analysis (full contracts, long papers), anywhere you’d use GPT-4
- Price: €0.70 / 1M input tokens · €3.80 / 1M output tokens
- Latency: highest. The model is the largest; responses take longest.
Choosing between them
| You care about… | Pick |
|---|---|
| Cheapest inference | GPT-OSS 120B |
| Fastest response on short prompts | GPT-OSS 120B |
| Best coding / tool-calling quality | Devstral 2 |
| Long-context (> 128K tokens) | Qwen 3.5 397B |
| Best multilingual (EU languages) | Qwen 3.5 397B |
| Best reasoning on hard problems | Qwen 3.5 397B |
| Budget-balanced default | Smart routing |
Context windows
A “context window” is the maximum combined size of input + output. Larger is not always better — longer contexts cost more and are slower. Typical guidance:
- Short interactions (chat, Q&A): any model at its default context is fine
- Document analysis (one contract, one paper): 128K is enough for 300+ pages
- Corpus analysis (multiple documents, long conversations): 256K (Qwen 3.5) or retrieval-augmented
Because StellarBase agents do retrieval-augmented generation by default, you rarely need to pass huge context — the knowledge base serves the relevant passages and the model operates on a curated subset. Long-context models are still useful for specific workflows (summarising a full regulation, analysing a 500-page monograph).
Structured output
All three models support structured output (JSON Schema or Pydantic-like definitions). Devstral 2 has the strongest schema compliance — we recommend it for workflows that must produce strict outputs (CSV rows, database inserts, fields for external systems).
Streaming
Token-by-token streaming is supported on all three models. Default for chat-like workloads, optional for agent workflows where you want the full response before acting.
Tool use / function calling
All three support OpenAI-compatible tool calling. Devstral 2 has the most reliable tool-calling behaviour across complex multi-step workflows.
Fine-tuning
Not currently offered as a managed service for these models. For customisation, use:
- System prompts — for behavioural tuning
- Retrieval — for knowledge adaptation (almost always the right answer)
- On-premise deployment — if you need full weight control, self-host the model in your infrastructure and fine-tune there
Commercial LLMs (GPT-4, Claude, Gemini)
Not hosted on StellarCloud — they’re proprietary and hosted by their vendors. You can still use them from StellarBase via StellarGate, which anonymizes your prompts before forwarding them to the vendor.
Related
- Smart Routing — let the platform choose
- API Reference — endpoint details
- StellarGate — using commercial LLMs with privacy
