Multilingual
European knowledge lives in 24+ languages. Your search shouldn't stop at English. StellarBase ingests, indexes, and retrieves across languages by default — no configuration required.
What “multilingual” means here
Three distinct capabilities, often conflated:
1. Ingestion in any language
Every document is processed regardless of its language. Language is detected automatically — you don’t have to tag sources. OCR, layout detection, and entity extraction all work across 24+ European languages out of the box.
2. Cross-lingual retrieval
A query in Czech finds relevant German, French, and Polish documents. This is not a translation layer — it’s a shared embedding space. Multilingual embedding models encode content from any language into the same vector space, so semantic matches work across the corpus regardless of the language they were written in.
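To make the shared-embedding-space idea concrete, here is a minimal sketch. The three vectors are hand-written toy values standing in for the output of a multilingual embedding model; they are an assumption for illustration, not output from the models StellarBase uses. The point is only that once all languages share one vector space, plain cosine similarity ranks a German contract above a French recipe for a Czech contract query, with no translation step.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors (assumed, NOT real model output). A real multilingual
# embedder would produce vectors with hundreds of dimensions.
TOY_EMBEDDINGS = {
    ("cs", "smlouva o dodávce energie"): [0.90, 0.10, 0.05],
    ("de", "Energieliefervertrag"): [0.88, 0.12, 0.07],
    ("fr", "recette de tarte aux pommes"): [0.05, 0.10, 0.95],
}

query = TOY_EMBEDDINGS[("cs", "smlouva o dodávce energie")]
german_contract = TOY_EMBEDDINGS[("de", "Energieliefervertrag")]
french_recipe = TOY_EMBEDDINGS[("fr", "recette de tarte aux pommes")]

# The Czech query is compared directly against documents in other languages.
sim_contract = cosine(query, german_contract)
sim_recipe = cosine(query, french_recipe)
print(sim_contract > sim_recipe)
```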
3. Inline translation for reading
When you click through to a document in a language you don’t speak, an inline translation appears alongside the original. The original passage is always preserved for citation — translation is a reading aid, not a replacement.
Supported languages
Full first-class support for the following, including OCR, entity extraction, cross-lingual embeddings, and translation:
| Region | Languages |
|---|---|
| Central + Eastern Europe | Czech, Slovak, Polish, Hungarian, Romanian, Bulgarian, Slovenian, Croatian, Serbian |
| Western Europe | English, German, French, Italian, Spanish, Portuguese, Dutch |
| Nordic | Swedish, Norwegian, Danish, Finnish |
| Baltic | Estonian, Latvian, Lithuanian |
| Other European | Greek, Maltese, Irish |
Lower-resource languages (handled but with potentially reduced accuracy): Icelandic, Luxembourgish, Welsh, Basque, Catalan, Galician.
Beyond the languages above: broader language support comes from the same embedding models (Arabic, Mandarin, Japanese, Hindi, Turkish, Hebrew, Russian, Ukrainian). OCR accuracy varies for non-Latin scripts.
How it works in practice
Search
When you type a query, StellarBase:
- Detects your query language
- Embeds the query using the multilingual embedder
- Retrieves the top-K passages from the entire corpus (any language)
- Presents results with language badges (de, fr, cs, etc.)
- Offers inline translation for hits written in a language other than your query language
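The retrieval, badge, and translation-offer steps above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the StellarBase implementation: `Passage`, `search`, and the result fields are hypothetical names, the vectors are toy values, and language detection and embedding are assumed to have already happened upstream.

```python
import heapq
import math
from dataclasses import dataclass

@dataclass
class Passage:
    text: str
    lang: str      # ISO 639-1 code, used as the language badge
    vector: list   # toy embedding (assumed, not real model output)

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def search(query_vector, query_lang, corpus, k=2):
    """Return the top-k passages from the whole corpus, whatever their language."""
    scored = ((cosine(query_vector, p.vector), p) for p in corpus)
    top = heapq.nlargest(k, scored, key=lambda pair: pair[0])
    return [
        {
            "text": p.text,
            "badge": p.lang,
            "score": round(score, 3),
            # Offer translation only when the hit is not in the query language.
            "offer_translation": p.lang != query_lang,
        }
        for score, p in top
    ]

CORPUS = [
    Passage("Der Vertrag endet am 31. Dezember.", "de", [0.90, 0.10, 0.00]),
    Passage("Smlouva končí 31. prosince.", "cs", [0.85, 0.15, 0.05]),
    Passage("La recette demande trois pommes.", "fr", [0.00, 0.20, 0.90]),
]

# A Czech query about a contract end date, already embedded upstream.
results = search([0.88, 0.12, 0.02], "cs", CORPUS, k=2)
for hit in results:
    print(hit["badge"], hit["offer_translation"], hit["text"])
```

With these toy vectors, the German and Czech contract passages are returned, and only the German hit is flagged for inline translation.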
Agents
Agents work in any language. A single agent can read Czech input, pull German source material, and produce an English summary, all in one turn. Citations preserve the original language; the agent’s output follows your language preference or the Base default.
Entity resolution
“Jana Nováková”, “J. Novák”, and “Novak, Jana” resolve to the same person — even across documents in different languages. The DSM engine normalizes transliteration, diacritics, and common variants.
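As a rough illustration of the diacritic and name-order part of this normalization, here is a minimal sketch using only the Python standard library. It is not the DSM engine; `strip_diacritics` and `name_key` are hypothetical helpers, and the (surname, first initial) key is an assumed, deliberately crude matching scheme.

```python
import unicodedata

def strip_diacritics(text):
    """Remove combining marks, e.g. 'Novák' -> 'Novak'.

    Letters without a Unicode decomposition (such as 'ł' or 'ø')
    are left unchanged and would need an explicit mapping.
    """
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def name_key(raw):
    """Return a crude (surname, first initial) key for a personal name."""
    text = strip_diacritics(raw).casefold().replace(".", " ")
    if "," in text:
        # "Surname, Given" order
        surname, _, given = text.partition(",")
    else:
        # "Given Surname" order: treat the last token as the surname
        parts = text.split()
        given, surname = " ".join(parts[:-1]), parts[-1]
    return (surname.strip(), given.strip()[:1])

print(name_key("J. Novák"))       # ('novak', 'j')
print(name_key("Novak, Jana"))    # ('novak', 'j')
print(name_key("Jana Nováková"))  # ('novakova', 'j')
```

The first two variants collapse to the same key. The third does not: linking the feminine form “Nováková” to “Novák” takes language-specific knowledge of surname forms, which this sketch does not attempt.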
Multilingual OCR
StellarOCR detects language per region within a document, so a page with English body text and a Latin footer, or a Czech patient record with German drug names, is handled correctly. See StellarOCR.
Translation tools
Two options depending on sensitivity:
- Local translation: open-source translation models run inside your deployment. Private, with no external calls.
- External translation: DeepL, Google Translate, or Amazon Translate via connector. Faster for low-resource languages, but sensitive content must pass through StellarGate anonymization first.
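The choice between the two options can be sketched as a small routing function. Everything here is an assumption for illustration: `TranslationRequest`, `translate`, and the stub backends are hypothetical names, and `fake_anonymize` merely stands in for the StellarGate step. The sketch shows only the policy: local by default, and anonymization before any external call on sensitive content.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class TranslationRequest:
    text: str
    target_lang: str
    sensitive: bool
    use_external: bool = False  # local translation is the default

def translate(
    request: TranslationRequest,
    local: Callable[[str, str], str],
    external: Callable[[str, str], str],
    anonymize: Callable[[str], str],
) -> str:
    if not request.use_external:
        # Nothing leaves the deployment.
        return local(request.text, request.target_lang)
    # Sensitive content is anonymized before the external call.
    text = anonymize(request.text) if request.sensitive else request.text
    return external(text, request.target_lang)

# Stub backends so the sketch runs; real connectors would replace these.
def fake_local(text, lang):
    return f"[local->{lang}] {text}"

def fake_external(text, lang):
    return f"[external->{lang}] {text}"

def fake_anonymize(text):
    return text.replace("Jana Nováková", "<PERSON>")

record = "Pacientka Jana Nováková"
private = translate(
    TranslationRequest(record, "en", sensitive=True),
    fake_local, fake_external, fake_anonymize,
)
masked = translate(
    TranslationRequest(record, "en", sensitive=True, use_external=True),
    fake_local, fake_external, fake_anonymize,
)
print(private)
print(masked)
```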
Localization of the UI
The StellarBase UI itself is available in Czech, English, German, French, Italian, Spanish, Polish, Slovak, Hungarian, and Dutch. Per-user language preference is respected. Additional UI languages are added based on customer demand.
Related
- StellarOCR — multilingual document processing
- Search — cross-lingual retrieval details
- Specialized Models — the embedding models used
