Specialized Models

Small, fast, purpose-built models for language detection, entity recognition, lemmatization, embeddings, and re-ranking. These are the workhorses that power StellarBase internally — and they're all available as APIs.

Text processing

Language Detection

Identifies the language of a text across 1,000+ languages, including rare and low-resource ones. Lightweight, fast, CPU-only.

  • Use case: route multilingual corpora to the right downstream model; tag content in mixed-language pipelines
  • Accuracy: > 99% on well-formed text of 100+ characters
  • Unit: per 1M requests
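
The routing use case above can be sketched as follows. This is a minimal illustration, not the StellarCloud client: the request and response format of the Language Detection API is not shown on this page, so the detector is passed in as a plain callable. The names `route_by_language`, `fake_detect`, and the `needs_review` bucket are hypothetical. The 100-character cutoff mirrors the accuracy figure quoted above.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List

# Hypothetical detector type: takes text, returns a language code such as "cs".
# In practice this would wrap a call to the Language Detection API.
Detector = Callable[[str], str]


def route_by_language(texts: Iterable[str], detect: Detector,
                      min_chars: int = 100) -> Dict[str, List[str]]:
    """Group texts by detected language; short texts go to a review bucket."""
    buckets: Dict[str, List[str]] = defaultdict(list)
    for text in texts:
        # The quoted accuracy applies to well-formed text of 100+ characters,
        # so shorter inputs are set aside instead of being routed blindly.
        if len(text) < min_chars:
            buckets["needs_review"].append(text)
        else:
            buckets[detect(text)].append(text)
    return dict(buckets)


def fake_detect(text: str) -> str:
    """Stand-in detector for this demo only; not a real language detector."""
    return "en" if " the " in text.lower() else "cs"


docs = [
    "Příliš žluťoučký kůň úpěl ďábelské ódy. " * 3,
    "The quick brown fox jumps over the lazy dog. " * 3,
    "Ahoj",
]
routed = route_by_language(docs, fake_detect)
print({lang: len(items) for lang, items in routed.items()})
```

Passing the detector as an argument keeps the routing logic testable offline and independent of any particular client library.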

Lemmatization

Reduces inflected words to their base forms across 60+ languages. Useful for keyword search, topic modeling, and improving downstream NER performance.

  • Use case: search over inflected languages (Czech, Polish, Hungarian, Finnish)
  • Unit: per 1K documents
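
To show why lemmatization helps search over inflected languages, here is a small sketch of an inverted index keyed on lemmas. It assumes a word-level `lemmatize` callable, which is a simplification: the actual API is billed per document and its interface is not described here. `TOY_LEMMAS` and `toy_lemmatize` are stand-ins covering a few Czech forms of "pes" (dog).

```python
from collections import defaultdict
from typing import Callable, Dict, List, Set

# Hypothetical lemmatizer type: inflected word -> base form.
Lemmatizer = Callable[[str], str]


def build_index(docs: List[str], lemmatize: Lemmatizer) -> Dict[str, Set[int]]:
    """Map each lemma to the set of document ids that contain it."""
    index: Dict[str, Set[int]] = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for word in text.lower().split():
            index[lemmatize(word)].add(doc_id)
    return dict(index)


def search(index: Dict[str, Set[int]], term: str,
           lemmatize: Lemmatizer) -> Set[int]:
    """Look up a query term by its lemma, so any inflection matches."""
    return index.get(lemmatize(term.lower()), set())


# Toy lemma table for the demo only; a real lemmatizer replaces this.
TOY_LEMMAS = {"psa": "pes", "psem": "pes", "psi": "pes"}


def toy_lemmatize(word: str) -> str:
    return TOY_LEMMAS.get(word, word)


docs = ["Vidím psa", "Jdu se psem", "Kočka spí"]
index = build_index(docs, toy_lemmatize)
hits = search(index, "pes", toy_lemmatize)
print(sorted(hits))
```

Without the lemma step, a query for "pes" would miss both documents, because neither contains that exact surface form.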

Zero-shot Named Entity Recognition

You specify the entity types you want (“person, medication, trial_id”) and the model finds them — no fine-tuning required.

  • Use case: custom entity extraction on domain-specific corpora
  • Strengths: flexibility (any entity type via prompt), multilingual
  • Unit: per 1M tokens

Entity Linking

Takes detected entities and links them to canonical identifiers — Wikidata, your internal knowledge base, or any custom graph.

  • Use case: resolve “Dr. Nováková” to a specific person in your HR directory
  • Unit: per 1K documents

Document processing

StellarOCR

Composite document-processing engine — one endpoint that returns text, layout, tables, figures, and math formulas from any PDF or image. Detailed in StellarOCR docs.

  • Unit: per 1K pages
  • Billing: flat rate regardless of how many internal models run

Embeddings

Multilingual Embeddings — Fast

Fast multilingual embeddings for large-scale workloads.

  • Use case: large-scale semantic search, clustering, duplicate detection
  • Unit: per 1M tokens
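
Semantic search and duplicate detection both reduce to comparing vectors. The sketch below uses cosine similarity, a common choice that this page does not confirm as the recommended metric. The three-dimensional vectors and the 0.95 threshold are illustrative assumptions; real embeddings have many more dimensions and the right threshold depends on the data.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query: Sequence[float], corpus: List[Sequence[float]],
          k: int = 2) -> List[Tuple[int, float]]:
    """Return (index, score) for the k corpus vectors closest to the query."""
    scored = [(i, cosine(query, vec)) for i, vec in enumerate(corpus)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


def near_duplicates(corpus: List[Sequence[float]],
                    threshold: float = 0.95) -> List[Tuple[int, int]]:
    """Return index pairs whose similarity meets the threshold (O(n^2))."""
    pairs = []
    for i in range(len(corpus)):
        for j in range(i + 1, len(corpus)):
            if cosine(corpus[i], corpus[j]) >= threshold:
                pairs.append((i, j))
    return pairs


# Made-up vectors standing in for embeddings returned by the API.
vectors = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
]
print(top_k([1.0, 0.0, 0.0], vectors))
print(near_duplicates(vectors))
```

The pairwise loop is fine for small sets; at the scale this model targets, an approximate nearest-neighbour index would normally replace it.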

Multilingual Embeddings — High-recall

Higher-quality multilingual embeddings where recall matters most.

  • Use case: retrieval for mission-critical workloads where a few extra points of recall matter
  • Unit: per 1M tokens

Long-context Embeddings

Multilingual embeddings tuned for long passages and whole-document retrieval.

  • Use case: embed an entire contract or paper as one vector
  • Unit: per 1M tokens

Image Embeddings

Visual embeddings that work directly on pixels — no text captions or labels needed.

  • Use case: image search, duplicate detection, cross-modal retrieval
  • Unit: per 1K images

Retrieval

Reranker

Given a query and a set of candidate passages, scores each passage for relevance to the query. This usually gives a more precise ordering than embedding similarity alone.

  • Use case: re-rank the top results from a retrieval step before handing to the LLM
  • Unit: per 1K searches
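
The retrieve-then-rerank step described above might look like the sketch below. The scorer is passed in as a callable because the Reranker API's interface is not documented on this page. `rerank`, `overlap_score`, and the sample passages are hypothetical, and the word-overlap scorer is only a toy stand-in for the real model.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical scorer type: (query, passage) -> relevance score.
# In practice this would wrap a call to the Reranker API.
Scorer = Callable[[str, str], float]


def rerank(query: str, candidates: Sequence[str], score: Scorer,
           keep: int = 3) -> List[Tuple[str, float]]:
    """Score every candidate against the query and keep the best few."""
    scored = [(passage, score(query, passage)) for passage in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:keep]


def overlap_score(query: str, passage: str) -> float:
    """Toy scorer for the demo: fraction of query words found in the passage."""
    q_words = set(query.lower().split())
    p_words = set(passage.lower().split())
    return len(q_words & p_words) / len(q_words) if q_words else 0.0


# Candidates as they might come back from a first-stage embedding search.
candidates = [
    "Invoices are archived after ten years.",
    "The notice period for the contract is three months.",
    "Contract templates live in the legal folder.",
]
best = rerank("contract notice period", candidates, overlap_score, keep=2)
for passage, score_value in best:
    print(f"{score_value:.2f}  {passage}")
```

Only the passages in `best` would then be handed to the LLM, which keeps the prompt short and focused.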

Choosing

Task                              Capability
Detect language                   Language Detection
Extract entities (any type)       Zero-shot NER
Link entities to a KB             Entity Linking
Text → vector (fast, cheap)       Multilingual Embeddings — Fast
Text → vector (best recall)       Multilingual Embeddings — High-recall
Long documents → single vector    Long-context Embeddings
Image → vector                    Image Embeddings
Re-rank search results            Reranker
Parse any document                StellarOCR

Pricing

Per-unit pricing in EUR, no tiers, no minimums. See the full pricing table.
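
Because each capability is billed in a different unit, a rough cost estimate has to divide usage by the unit size first. The sketch below uses the unit sizes listed on this page. It assumes partial units are prorated, which this page does not state, and the rate in the example is a made-up placeholder, not a real price.

```python
from decimal import Decimal

# Billing unit sizes as listed on this page.
UNIT_SIZE = {
    "language_detection": 1_000_000,  # per 1M requests
    "lemmatization": 1_000,           # per 1K documents
    "zero_shot_ner": 1_000_000,       # per 1M tokens
    "entity_linking": 1_000,          # per 1K documents
    "stellar_ocr": 1_000,             # per 1K pages
    "text_embeddings": 1_000_000,     # per 1M tokens
    "image_embeddings": 1_000,        # per 1K images
    "reranker": 1_000,                # per 1K searches
}


def estimate_cost(capability: str, usage: int,
                  rate_per_unit_eur: Decimal) -> Decimal:
    """Linear cost estimate: (usage / unit size) * rate per unit.

    Assumes partial units are prorated; check the pricing table for the
    actual rounding rule and the actual rates.
    """
    units = Decimal(usage) / Decimal(UNIT_SIZE[capability])
    return units * rate_per_unit_eur


# Placeholder rate of 4.00 EUR per 1M tokens, invented for this example.
cost = estimate_cost("zero_shot_ner", 2_500_000, Decimal("4.00"))
print(f"Estimated cost: {cost:.2f} EUR")
```

`Decimal` is used instead of floats so that currency amounts do not pick up binary rounding errors.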

Self-hosted

Every model on this page runs inside the StellarBase on-premise bundle. For air-gapped deployments, model weights ship as signed data packs and update on your schedule. See On-Premise.

Custom models

Bring your own models: HuggingFace endpoints, REST APIs, gRPC services, or local checkpoints. Register a model once as a tool, and agents and workflows can call it. See Agents.
