What We Detect
StellarGate detects 15+ entity categories out of the box. You can also add your own dictionaries and regex patterns for domain-specific terms. Every replacement is reversible.
Built-in categories
| Category | Examples |
|---|---|
| People | Names, titles, roles, employee IDs |
| Contact | Emails, phone numbers, postal addresses |
| Financial | Amounts, account numbers, card numbers, IBAN, SWIFT |
| Identity | SSN, rodné číslo, passport, tax ID, driver licence |
| Medical | Patient IDs, diagnoses, conditions, medications |
| Organizations | Company names, departments, project codenames |
| Dates & times | Birth dates, transaction dates, appointments |
| Locations | Addresses, GPS coordinates, facility names |
| Network | URLs, IP addresses, MAC addresses, hostnames |
| Credentials | API keys, bearer tokens, passwords (if accidentally included) |
| Custom | Anything you define via regex or dictionary |
How detection works
StellarGate combines deterministic rules with machine learning. Your allowlists, dictionaries, and regex patterns run as guarantees: if you’ve defined a term, it is always replaced, regardless of any confidence score. Machine learning fills in the long tail: names and addresses that weren’t in any list, amounts in unusual formats, dates in local conventions.
Deterministic rules are the front line; machine learning is the safety net. Known terms always get replaced exactly the way you configured them, and the ML layer never overrides what you’ve explicitly defined.
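The layering described above can be sketched roughly as follows. This is an illustration, not StellarGate's implementation: the `Span` type, `rule_spans`, `merge`, and the hard-coded ML candidates are all hypothetical names and data, and overlap resolution is assumed to be "rule spans win".

```python
import re
from dataclasses import dataclass

@dataclass(frozen=True)
class Span:
    start: int
    end: int
    category: str
    source: str  # "rule" or "ml"

def rule_spans(text, dictionary, patterns):
    """Deterministic layer: every dictionary term and regex match is returned."""
    spans = []
    for term, category in dictionary.items():
        for m in re.finditer(re.escape(term), text):
            spans.append(Span(m.start(), m.end(), category, "rule"))
    for pattern, category in patterns.items():
        for m in re.finditer(pattern, text):
            spans.append(Span(m.start(), m.end(), category, "rule"))
    return spans

def overlaps(a, b):
    return a.start < b.end and b.start < a.end

def merge(rules, ml_candidates):
    """Keep all rule spans; keep an ML span only if it overlaps no rule span."""
    kept = list(rules)
    for cand in ml_candidates:
        if not any(overlaps(cand, r) for r in rules):
            kept.append(cand)
    return sorted(kept, key=lambda s: s.start)

text = "Project Falcon budget approved by Jane Doe, ticket CASE-1042."
rules = rule_spans(text, {"Project Falcon": "PROJECT"}, {r"CASE-\d+": "CASE_ID"})

# Stand-in for ML output (hypothetical): one span that clashes with a rule,
# and one name that appears in no list.
name_start = text.index("Jane Doe")
ml = [
    Span(8, 14, "ORG", "ml"),  # "Falcon", overlaps the dictionary term
    Span(name_start, name_start + len("Jane Doe"), "PERSON", "ml"),
]

result = merge(rules, ml)
print([(s.category, s.source) for s in result])
```

In this sketch the ML guess for "Falcon" is discarded because a dictionary entry already covers it, while "Jane Doe" is kept because no rule claims that span.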
Token format
Replacements look like [CATEGORY_N], where N is an index within that category for the request. Same entity → same token within a single request:
- “John Smith” appears three times → all three become [PERSON_1]
- “John Smith” and “Jane Doe” → [PERSON_1] and [PERSON_2]
- Mapping is scoped to the request, so the LLM can’t correlate tokens across calls
Custom categories get custom token names: [PROJECT], [CASE_ID], [GRAVE_ID] — whatever you configure.
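A minimal sketch of request-scoped, reversible token mapping, assuming entities have already been detected. The `RequestTokenizer` class and its methods are hypothetical and only illustrate the behaviour described in this section.

```python
from collections import defaultdict

class RequestTokenizer:
    """Per-request mapping between entity text and [CATEGORY_N] tokens."""

    def __init__(self):
        self._counters = defaultdict(int)
        self._token_for = {}  # (category, value) -> token
        self._value_for = {}  # token -> original value

    def token(self, category, value):
        key = (category, value)
        if key not in self._token_for:
            self._counters[category] += 1
            tok = f"[{category}_{self._counters[category]}]"
            self._token_for[key] = tok
            self._value_for[tok] = value
        return self._token_for[key]

    def mask(self, text, entities):
        """entities: list of (value, category) pairs that were detected."""
        for value, category in entities:
            text = text.replace(value, self.token(category, value))
        return text

    def unmask(self, text):
        """Reverse the replacement using this request's mapping."""
        for tok, value in self._value_for.items():
            text = text.replace(tok, value)
        return text

req = RequestTokenizer()
prompt = "John Smith emailed Jane Doe. John Smith then called. Ask John Smith."
masked = req.mask(prompt, [("John Smith", "PERSON"), ("Jane Doe", "PERSON")])
print(masked)
print(req.unmask(masked) == prompt)

# A new request starts from an empty mapping, so indices restart.
other = RequestTokenizer()
print(other.mask("Jane Doe replied.", [("Jane Doe", "PERSON")]))
```

Because each request builds its own mapping, “Jane Doe” is [PERSON_2] in the first request and [PERSON_1] in the second, which is what prevents correlation across calls.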
Multi-entity handling
Multi-word entities (e.g. addresses) are detected as a single unit and replaced with one token. Nested entities are supported: “Dr. Novák, Charles University Hospital” is detected as one PERSON and one ORG, not as fragments.
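To illustrate whole-span replacement, here is a small sketch that swaps each detected span for a single token. The `replace_spans` helper and the span offsets are assumptions for illustration; replacing from right to left keeps earlier offsets valid while the string changes length.

```python
def replace_spans(text, spans):
    """spans: non-overlapping (start, end, token) triples."""
    # Work from the end of the string so earlier offsets are not shifted.
    for start, end, token in sorted(spans, reverse=True):
        text = text[:start] + token + text[end:]
    return text

text = "Dr. Novák, Charles University Hospital"
person = "Dr. Novák"
org = "Charles University Hospital"

spans = [
    (text.index(person), text.index(person) + len(person), "[PERSON_1]"),
    (text.index(org), text.index(org) + len(org), "[ORG_1]"),
]

masked = replace_spans(text, spans)
print(masked)  # [PERSON_1], [ORG_1]
```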
Precision and recall
Trade-offs depend on your policy:
- Strict mode — err on the side of masking. Higher recall, occasional false positives (masking something that didn’t need to be). Recommended for high-regulation workloads.
- Balanced mode (default) — good balance between masking truly sensitive data and preserving the prompt’s usefulness.
- Permissive mode — only mask very high-confidence detections. For workloads where utility matters more than absolute privacy.
Per-model policies let you set different modes for different LLM destinations — strict for external providers, permissive for local models you trust.
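One way to picture the modes is as confidence thresholds applied to ML detections, chosen per destination. The sketch below is an assumption about how such a policy could be modelled: the threshold values, destination names, and function names are hypothetical and are not StellarGate configuration keys.

```python
# Hypothetical thresholds: a lower threshold masks more (higher recall).
THRESHOLDS = {"strict": 0.30, "balanced": 0.60, "permissive": 0.90}

# Hypothetical per-destination policy; anything unlisted falls back to balanced.
POLICY_BY_DESTINATION = {
    "external-provider": "strict",
    "local-model": "permissive",
}

def mode_for(destination):
    return POLICY_BY_DESTINATION.get(destination, "balanced")

def should_mask(confidence, destination, from_rule=False):
    """Rule-based detections are always masked; ML detections use the threshold."""
    if from_rule:
        return True
    return confidence >= THRESHOLDS[mode_for(destination)]

# The same medium-confidence detection is treated differently per destination.
print(should_mask(0.5, "external-provider"))  # True
print(should_mask(0.5, "local-model"))        # False
```

Note that in this sketch the mode only affects ML detections; dictionary and regex matches are masked regardless of mode, consistent with the deterministic guarantee described earlier.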
Language coverage
Detection works across all languages StellarBase supports (see Multilingual). Czech rodné číslo, German Personalausweis, Polish PESEL, French SIRET — all recognized.
What detection doesn’t catch
Be realistic. StellarGate is a strong first line, not a guarantee:
- Indirect identifiers — “the 43-year-old from the Prague office” can identify a specific person without any name. Use your judgment about context.
- Semantic leaks — asking “analyse the trial of [PERSON_1] vs [ORG_1] on 2024-03-14” can leak case identity through context to anyone who follows legal news.
- Rare formats — a brand-new custom ID scheme your company invented last week. Add it as a dictionary entry or regex.
For maximum safety, combine StellarGate with Mode 3 self-hosted and a locally-hosted LLM.
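For the rare-format case above, a custom regex is usually enough. The following sketch uses Python's standard `re` module to check a pattern against sample text before registering it; the work-order ID scheme (`WO-2025-QX0042`) is invented for illustration, and how the pattern is registered with StellarGate is not shown here.

```python
import re

# Hypothetical in-house ID: "WO-", a four-digit year, "-", two capital letters, four digits.
WORK_ORDER = re.compile(r"\bWO-\d{4}-[A-Z]{2}\d{4}\b")

def find_ids(text):
    """Return every work-order ID found in the text."""
    return WORK_ORDER.findall(text)

print(find_ids("See WO-2025-QX0042 and WO-2025-QX0043 for details."))
print(find_ids("Unrelated text with WO-25-Q1 in it."))
```

Testing a pattern against both matching and non-matching samples before deploying it helps catch patterns that are too loose or too strict.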
Testing detection
A dry-run endpoint lets you paste sample text and see exactly what gets flagged, without making any external call. Use it whenever you add a new rule or dictionary. See Custom Dictionaries.
Related
- Custom Dictionaries
- Anonymization — broader discussion of techniques
