What We Detect

StellarGate detects 15+ entity categories automatically, out of the box. On top of that, you can add your own dictionaries and regex patterns for domain-specific terms. Every replacement is reversible.

Built-in categories

Category	Examples
People	Names, titles, roles, employee IDs
Contact	Emails, phone numbers, postal addresses
Financial	Amounts, account numbers, card numbers, IBAN, SWIFT
Identity	SSN, rodné číslo, passport, tax ID, driver licence
Medical	Patient IDs, diagnoses, conditions, medications
Organizations	Company names, departments, project codenames
Dates & times	Birth dates, transaction dates, appointments
Locations	Addresses, GPS coordinates, facility names
Network	URLs, IP addresses, MAC addresses, hostnames
Credentials	API keys, bearer tokens, passwords (if accidentally included)
Custom	Anything you define via regex or dictionary

How detection works

StellarGate combines deterministic rules with machine learning. Your allowlists, dictionaries, and regex patterns run as guarantees: if you’ve defined a term, it is always replaced, and the match is never left to a probability. Machine learning fills in the long tail: names and addresses that weren’t in any list, amounts in unusual formats, dates in local conventions.

Deterministic rules are the front line; machine learning is the safety net. Known terms always get replaced exactly the way you configured them, and the ML layer never overrides what you’ve explicitly defined.
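The layering described above can be sketched in a few lines. This is a minimal illustration, not StellarGate's implementation: the `Span` type, the `rule_spans` and `merge` helpers, and the hard-coded ML output are all hypothetical, and the "ML" spans here are simply supplied by hand to stand in for a model.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    start: int
    end: int
    category: str
    source: str  # "rule" or "ml"


def rule_spans(text, dictionary, patterns):
    """Deterministic layer: report every dictionary term and regex match."""
    spans = []
    for term, category in dictionary.items():
        for m in re.finditer(re.escape(term), text):
            spans.append(Span(m.start(), m.end(), category, "rule"))
    for pattern, category in patterns.items():
        for m in re.finditer(pattern, text):
            spans.append(Span(m.start(), m.end(), category, "rule"))
    return spans


def merge(rules, ml):
    """Keep all rule spans; keep an ML span only if it overlaps no rule span."""
    kept = list(rules)
    for s in ml:
        if not any(s.start < r.end and r.start < s.end for r in rules):
            kept.append(s)
    return sorted(kept, key=lambda s: s.start)


text = "Project Falcon budget was approved by Jana Dvořák."
rules = rule_spans(text, {"Falcon": "PROJECT"}, {})

# Stand-in for model output: one span that collides with the rule, one that does not.
name_start = text.index("Jana Dvořák")
ml = [
    Span(0, 14, "ORG", "ml"),
    Span(name_start, name_start + len("Jana Dvořák"), "PERSON", "ml"),
]
result = merge(rules, ml)
```

Here the ML guess that "Project Falcon" is an organization is discarded because it overlaps a dictionary hit, while the name that appears in no list is still picked up by the ML layer.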

Token format

Replacements look like [CATEGORY_N], where N is an index within that category for the request. Same entity → same token within a single request:

“John Smith” appears three times → all three become [PERSON_1]
“John Smith” and “Jane Doe” → [PERSON_1] and [PERSON_2]
Mapping is scoped to the request — the LLM can’t correlate tokens across calls

Custom categories get custom token names: [PROJECT], [CASE_ID], [GRAVE_ID] — whatever you configure.
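To make the token rules concrete, here is a small sketch of a request-scoped mapping. The `RequestMapping` class and its methods are hypothetical names used for illustration, and plain string replacement is a simplification of whatever span-based substitution the real system performs.

```python
from collections import defaultdict


class RequestMapping:
    """Holds the entity <-> token mapping for one request only."""

    def __init__(self):
        self._token_for = {}
        self._entity_for = {}
        self._counters = defaultdict(int)

    def token(self, category, entity):
        key = (category, entity)
        if key not in self._token_for:
            # First time this entity is seen: take the next index in its category.
            self._counters[category] += 1
            tok = f"[{category}_{self._counters[category]}]"
            self._token_for[key] = tok
            self._entity_for[tok] = entity
        return self._token_for[key]

    def mask(self, text, entities):
        # Assign indices in the order the entities were given...
        for category, entity in entities:
            self.token(category, entity)
        # ...then replace longer strings first so a short entity never splits a longer one.
        for category, entity in sorted(entities, key=lambda e: len(e[1]), reverse=True):
            text = text.replace(entity, self.token(category, entity))
        return text

    def unmask(self, text):
        # Reverse the replacement using the stored mapping.
        for tok, entity in self._entity_for.items():
            text = text.replace(tok, entity)
        return text


original = "John Smith met Jane Doe. John Smith left."
mapping = RequestMapping()
masked = mapping.mask(original, [("PERSON", "John Smith"), ("PERSON", "Jane Doe")])
restored = mapping.unmask(masked)
```

Because each request creates a fresh `RequestMapping`, a different person in the next request would also become [PERSON_1], which is why tokens cannot be correlated across calls.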

Multi-entity handling

Multi-word entities (e.g. addresses) are detected as a single unit and replaced with a single token. Nested entities are supported: “Dr. Novák, Charles University Hospital” is detected as one PERSON + one ORG, not as fragments.
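One way to picture "one unit, not fragments" is a post-processing step that drops any candidate span contained in a longer span of the same category. The sketch below assumes spans are `(start, end, category)` tuples; `span` and `drop_fragments` are hypothetical helpers, and the candidate list is written by hand rather than produced by a detector.

```python
def span(text, sub, category):
    """Build a (start, end, category) tuple for the first occurrence of sub."""
    start = text.index(sub)
    return (start, start + len(sub), category)


def drop_fragments(spans):
    """Remove spans that lie inside a longer span of the same category."""
    kept = []
    # Sort by start, longest first, so containers are always seen before their fragments.
    for s in sorted(spans, key=lambda s: (s[0], -(s[1] - s[0]))):
        contained = any(k[2] == s[2] and k[0] <= s[0] and s[1] <= k[1] for k in kept)
        if not contained:
            kept.append(s)
    return kept


text = "Dr. Novák, Charles University Hospital"
candidates = [
    span(text, "Novák", "PERSON"),
    span(text, "Dr. Novák", "PERSON"),
    span(text, "Charles University", "ORG"),
    span(text, "Hospital", "ORG"),
    span(text, "Charles University Hospital", "ORG"),
]
units = drop_fragments(candidates)
```

After filtering, only the full person span and the full organization span remain, so each would receive exactly one token.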

Precision and recall

Trade-offs depend on your policy:

Strict mode — err on the side of masking. Higher recall, occasional false positives (masking something that didn’t need to be). Recommended for high-regulation workloads.
Balanced mode (default) — good balance between masking truly sensitive data and preserving the prompt’s usefulness.
Permissive mode — only mask very high-confidence detections. For workloads where utility matters more than absolute privacy.

Per-model policies let you set different modes for different LLM destinations — strict for external providers, permissive for local models you trust.
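As a rough mental model, the three modes can be thought of as confidence thresholds applied to ML detections, with a policy table choosing the mode per destination. Everything in this sketch is an assumption made for illustration: the threshold values, the destination names, and the `should_mask` function are not taken from StellarGate's configuration.

```python
from enum import Enum


class Mode(Enum):
    STRICT = "strict"
    BALANCED = "balanced"
    PERMISSIVE = "permissive"


# Hypothetical thresholds: lower means more gets masked (higher recall).
THRESHOLDS = {Mode.STRICT: 0.30, Mode.BALANCED: 0.60, Mode.PERMISSIVE: 0.90}

# Hypothetical per-model policy: strict for external providers, permissive for a trusted local model.
POLICY = {"external-provider": Mode.STRICT, "local-llm": Mode.PERMISSIVE}


def mode_for(destination):
    """Destinations without an explicit policy fall back to the default, balanced mode."""
    return POLICY.get(destination, Mode.BALANCED)


def should_mask(confidence, destination, from_rule=False):
    """Rule hits are always masked; ML hits must clear the mode's threshold."""
    if from_rule:
        return True
    return confidence >= THRESHOLDS[mode_for(destination)]
```

With these assumed numbers, an ML detection at confidence 0.5 would be masked on its way to the external provider but passed through to the local model, while anything matched by a dictionary or regex is masked regardless of mode.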

Language coverage

Detection works across all languages StellarBase supports (see Multilingual). Czech rodné číslo, German Personalausweis, Polish PESEL, French SIRET — all recognized.

What detection doesn’t catch

Be realistic. StellarGate is a strong first line of defense, not a guarantee:

Indirect identifiers — “the 43-year-old from the Prague office” can identify a specific person without any name. Use your judgment about context.
Semantic leaks — asking “analyse the trial of [PERSON_1] vs [ORG_1] on 2024-03-14” can leak case identity through context to anyone who follows legal news.
Rare formats — a brand-new custom ID scheme your company invented last week. Add it as a dictionary entry or regex.
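For the rare-format case, a regex is usually enough. The example below assumes a made-up ID scheme of the form CX-YYYY-NNNNNN; the pattern and the `find_case_ids` helper are illustrative only, and how the pattern is registered with StellarGate is covered in Custom Dictionaries rather than shown here.

```python
import re

# Hypothetical in-house ID scheme: "CX-", a four-digit year, a dash, six digits.
CASE_ID = re.compile(r"\bCX-\d{4}-\d{6}\b")


def find_case_ids(text):
    """Return every substring that matches the custom ID pattern."""
    return [m.group(0) for m in CASE_ID.finditer(text)]


sample = "Ticket CX-2024-000123 replaces CX-2023-000999; CX-12 is not valid."
found = find_case_ids(sample)
```

The word boundaries keep the pattern from matching inside longer strings, and the fixed digit counts keep shorter look-alikes such as CX-12 from being masked by mistake.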

For maximum safety, combine StellarGate with Mode 3 self-hosted and a locally-hosted LLM.

Testing detection

A dry-run endpoint lets you paste sample text and see exactly what gets flagged, without making any external call. Use it whenever you add a new rule or dictionary. See Custom Dictionaries.

Custom Dictionaries
Anonymization — broader discussion of techniques