StellarCloud

Specialized Models

Small, fast, purpose-built models for language detection, entity recognition, lemmatization, embeddings, and re-ranking. These are the workhorses that power StellarBase internally — and they're all available as APIs.

Text processing

Language Detection

Identifies the language of a text across 1,000+ languages, including rare and low-resource ones. Lightweight, fast, CPU-only.

Use case: route multilingual corpora to the right downstream model; tag content in mixed-language pipelines
Accuracy: > 99% on well-formed text of 100+ characters
Unit: per 1M requests
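
The routing use case can be sketched roughly as follows. This is an illustration only, not the StellarCloud client: `detect_language` is a hypothetical stand-in for a call to the Language Detection API, and the return shape (language code plus confidence), the `"und"` bucket, and the 0.9 threshold are all assumptions.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Tuple

# Hypothetical detector signature: text -> (language code, confidence in [0, 1]).
Detector = Callable[[str], Tuple[str, float]]


def route_by_language(
    texts: Iterable[str],
    detect_language: Detector,
    min_confidence: float = 0.9,  # assumed threshold; tune per corpus
) -> Dict[str, List[str]]:
    """Group texts by detected language; low-confidence texts go to 'und'."""
    buckets: Dict[str, List[str]] = defaultdict(list)
    for text in texts:
        code, confidence = detect_language(text)
        key = code if confidence >= min_confidence else "und"
        buckets[key].append(text)
    return dict(buckets)


def fake_detector(text: str) -> Tuple[str, float]:
    """Toy stand-in for the API call, for demonstration only."""
    if len(text) < 10:
        return ("en", 0.5)  # short inputs are treated as uncertain
    return ("cs", 0.99) if ("ž" in text or "ř" in text) else ("en", 0.99)


routed = route_by_language(
    ["Příliš žluťoučký kůň", "The quick brown fox jumps", "ok"],
    fake_detector,
)
```

Keeping an explicit bucket for uncertain inputs fits the accuracy note above: very short strings are the ones most likely to be misrouted, so they are better handled separately than sent to the wrong downstream model.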

Lemmatization

Reduces inflected words to their base forms across 60+ languages. Useful for keyword search, topic modeling, and improving downstream NER performance.

Use case: search over inflected languages (Czech, Polish, Hungarian, Finnish)
Unit: per 1K documents

Zero-shot Named Entity Recognition

You specify the entity types you want (“person, medication, trial_id”) and the model finds them — no fine-tuning required.

Use case: custom entity extraction on domain-specific corpora
Strengths: flexible (any entity type via prompt), multilingual
Unit: per 1M tokens
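
As a rough sketch of what "specify the entity types you want" could look like in client code: the field names (`text`, `entity_types`, `entities`, `label`, `start`, `end`) and the response shape are assumptions made for illustration, not the documented schema, and the response here is hand-written rather than returned by the service.

```python
import json
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass(frozen=True)
class Entity:
    label: str
    text: str
    start: int  # character offset, inclusive
    end: int    # character offset, exclusive


def build_ner_request(text: str, entity_types: List[str]) -> str:
    """Serialize a request body. Field names are assumed, not the real schema."""
    if not entity_types:
        raise ValueError("at least one entity type is required")
    return json.dumps({"text": text, "entity_types": entity_types}, ensure_ascii=False)


def parse_ner_response(text: str, response: Dict[str, Any]) -> List[Entity]:
    """Turn a (hypothetical) response into Entity objects, validating offsets."""
    entities = []
    for item in response.get("entities", []):
        start, end = item["start"], item["end"]
        if not 0 <= start < end <= len(text):
            raise ValueError(f"offsets out of range: {start}-{end}")
        entities.append(Entity(item["label"], text[start:end], start, end))
    return entities


def span(text: str, label: str, surface: str) -> Dict[str, Any]:
    """Helper for building the hand-written example response below."""
    start = text.index(surface)
    return {"label": label, "start": start, "end": start + len(surface)}


text = "Dr. Nováková prescribed ibuprofen in trial NCT01234567."
body = build_ner_request(text, ["person", "medication", "trial_id"])

# Hand-written example response; the trial ID is a made-up placeholder.
fake_response = {
    "entities": [
        span(text, "person", "Nováková"),
        span(text, "medication", "ibuprofen"),
        span(text, "trial_id", "NCT01234567"),
    ]
}
entities = parse_ner_response(text, fake_response)
```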

Entity Linking

Takes detected entities and links them to canonical identifiers — Wikidata, your internal knowledge base, or any custom graph.

Use case: resolve “Dr. Nováková” to a specific person in your HR directory
Unit: per 1K documents
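
To make the input and output of linking concrete, here is a deliberately simple sketch: an exact-match lookup over normalized names. It is a toy stand-in, not how the Entity Linking model works; the directory contents, the `emp-0042` identifier, and the title list are invented for the example.

```python
import unicodedata
from typing import Dict, Optional

# Honorifics to ignore when matching; an assumed, incomplete list.
TITLES = {"dr", "dr.", "prof", "prof.", "mr", "mr.", "ms", "ms."}


def normalize_mention(mention: str) -> str:
    """Strip diacritics and titles, and casefold, so variants compare equal."""
    decomposed = unicodedata.normalize("NFKD", mention)
    no_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    tokens = [t for t in no_marks.casefold().split() if t not in TITLES]
    return " ".join(tokens)


def link_entity(mention: str, directory: Dict[str, str]) -> Optional[str]:
    """Return the canonical ID for a mention, or None if nothing matches."""
    return directory.get(normalize_mention(mention))


# Keys are pre-normalized names; the ID is a made-up placeholder.
hr_directory = {"novakova": "emp-0042"}

linked = link_entity("Dr. Nováková", hr_directory)
unlinked = link_entity("Dr. Svoboda", hr_directory)
```

Exact matching breaks down as soon as two people share a surname or a mention is abbreviated, which is the kind of ambiguity a dedicated linking model is meant to resolve.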

Document processing

StellarOCR

Composite document-processing engine: one endpoint that returns text, layout, tables, figures, and math formulas from any PDF or image. See the StellarOCR docs for details.

Unit: per 1K pages
Billing: flat rate regardless of how many internal models run

Embeddings

Multilingual Embeddings — Fast

Fast multilingual embeddings for large-scale workloads.

Use case: large-scale semantic search, clustering, duplicate detection
Unit: per 1M tokens
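
Once texts are embedded, duplicate detection and semantic search both reduce to comparing vectors. The sketch below uses cosine similarity on small hand-written vectors; real embeddings would come from the API and have many more dimensions, and the 0.95 threshold is an assumption to be tuned on your own data.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def find_near_duplicates(
    vectors: Sequence[Sequence[float]],
    threshold: float = 0.95,  # assumed cut-off
) -> List[Tuple[int, int]]:
    """Return index pairs (i, j), i < j, whose similarity meets the threshold."""
    pairs = []
    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            if cosine_similarity(vectors[i], vectors[j]) >= threshold:
                pairs.append((i, j))
    return pairs


# Toy 3-dimensional "embeddings": the first two point in almost the same direction.
vectors = [
    [1.0, 0.0, 0.0],
    [0.99, 0.05, 0.0],
    [0.0, 1.0, 0.0],
]
duplicates = find_near_duplicates(vectors)
```

The pairwise loop is quadratic in the number of vectors, so it only suits small batches; large-scale workloads would normally use an approximate nearest-neighbor index instead.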

Multilingual Embeddings — High-recall

Higher-quality multilingual embeddings where recall matters most.

Use case: retrieval for mission-critical workloads where a few extra points of recall matter
Unit: per 1M tokens

Long-context Embeddings

Multilingual embeddings tuned for long passages and whole-document retrieval.

Use case: embed an entire contract or paper as one vector
Unit: per 1M tokens

Image Embeddings

Visual embeddings that work directly on pixels — no text captions or labels needed.

Use case: image search, duplicate detection, cross-modal retrieval
Unit: per 1K images

Retrieval

Reranker

Given a query and a set of candidate passages, scores each passage for relevance to the query. The resulting ranking is sharper than one based on embedding similarity alone.

Use case: re-rank the top results from a retrieval step before handing to the LLM
Unit: per 1K searches
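
The re-ranking step in a retrieval pipeline can be sketched as below. The `score` callable is a hypothetical stand-in for a call to the Reranker, and `toy_overlap_score` is a crude word-overlap heuristic used only so the example runs on its own; it says nothing about how the actual model scores passages.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical scorer signature: (query, passage) -> relevance score.
Scorer = Callable[[str, str], float]


def rerank(
    query: str,
    candidates: Sequence[str],
    score: Scorer,
    top_k: int = 3,
) -> List[Tuple[str, float]]:
    """Score every candidate against the query and keep the top_k best."""
    scored = [(passage, score(query, passage)) for passage in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


def toy_overlap_score(query: str, passage: str) -> float:
    """Toy scorer: fraction of query words that also appear in the passage."""
    query_words = set(query.lower().split())
    passage_words = set(passage.lower().split())
    if not query_words:
        return 0.0
    return len(query_words & passage_words) / len(query_words)


# Candidates as they might come back from a first-stage retrieval step.
candidates = [
    "Invoices are issued monthly in EUR",
    "Model weights ship as signed data packs",
    "Air-gapped deployments update model weights on your schedule",
]
top = rerank("how do model weights update", candidates, toy_overlap_score, top_k=2)
```

Only the passages in `top` would then be passed to the LLM, which keeps the prompt short while preserving the most relevant context.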

Choosing

Task	Capability
Detect language	Language Detection
Normalize inflected words for search	Lemmatization
Extract entities (any type)	Zero-shot NER
Link entities to a KB	Entity Linking
Text → vector (fast, cheap)	Multilingual Embeddings — Fast
Text → vector (best recall)	Multilingual Embeddings — High-recall
Long documents → single vector	Long-context Embeddings
Image → vector	Image Embeddings
Re-rank search results	Reranker
Parse any document	StellarOCR

Pricing

Per-unit pricing in EUR, no tiers, no minimums. See the full pricing table.

Self-hosted

Every model on this page runs inside the StellarBase on-premise bundle. For air-gapped deployments, model weights ship as signed data packs and update on your schedule. See On-Premise.

Custom models

Bring your own: Hugging Face endpoints, REST APIs, gRPC services, or local checkpoints. Register a model once as a tool, and agents and workflows can call it. See Agents.
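
The "register once, call from anywhere" idea can be illustrated with a small in-process registry. Everything here is hypothetical: `ToolRegistry`, its methods, and `local_sentiment` are names invented for this sketch and do not reflect the actual StellarBase registration API, which is described in Agents.

```python
from typing import Any, Callable, Dict


class ToolRegistry:
    """Hypothetical in-process registry; not the StellarBase API."""

    def __init__(self) -> None:
        self._tools: Dict[str, Callable[..., Any]] = {}

    def register(self, name: str, fn: Callable[..., Any]) -> None:
        """Register a callable under a unique name."""
        if name in self._tools:
            raise ValueError(f"tool {name!r} is already registered")
        self._tools[name] = fn

    def call(self, name: str, **kwargs: Any) -> Any:
        """Invoke a registered tool by name with keyword arguments."""
        if name not in self._tools:
            raise KeyError(f"unknown tool {name!r}")
        return self._tools[name](**kwargs)


def local_sentiment(text: str) -> str:
    """Stand-in for a custom model, e.g. a local checkpoint behind a function."""
    return "positive" if "good" in text.lower() else "neutral"


registry = ToolRegistry()
registry.register("sentiment", local_sentiment)

# Any caller that holds the registry can now invoke the model by name.
result = registry.call("sentiment", text="Good recall on long documents")
```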