Anonymization
Sensitive data replaced with reversible tokens before it leaves your perimeter. The engine is [StellarGate](/docs/gate); this page covers the broader picture — when to use it, how it integrates, and the compliance implications.
When to anonymize
Two main scenarios:
Calling external LLMs
You want GPT-4, Claude, or Gemini. You can’t send real customer / patient / client data. Route through StellarGate — the LLM receives tokens instead of PII, the response is resolved before your app sees it.
Ingesting sensitive documents
Research interviews, patient records, legal depositions. Anonymize on ingest so that even internal searches and agents operate on tokenized content. Combined with a separate token vault accessible only to authorized users, this provides strong internal compartmentalization.
Two modes of use
Reversible tokenization (default)
Entities are replaced with tokens. A mapping between tokens and originals lives in an encrypted vault. Authorized users (or workflows) can resolve tokens back to originals. Most use cases want this — you need to present real data to end users eventually.
Irreversible anonymization
For research datasets that need to survive GDPR’s “right to be forgotten” indefinitely, irreversible anonymization replaces PII with generic placeholders and destroys the mapping. Once done, there’s no way back. Use with caution — this is permanent.
What gets detected
15+ entity categories out of the box. See StellarGate Detection for the full list.
Deterministic vs. probabilistic
A critical property: deterministic rules (dictionaries, regex) run before ML detection. If a term is in a dictionary, it’s replaced 100% of the time. ML fills in the long tail.
For compliance, dictionaries are what your auditor will ask about. ML is how we reach broad coverage.
Integration points
Within StellarBase
Anonymization is a built-in workflow node. Place it before any external LLM call; place it at ingest for sensitive sources. The platform provides sensible defaults — most customers don’t configure anonymization per-workflow.
As a standalone proxy
Route your OpenAI / Anthropic / Gemini SDK calls through StellarGate. No workflow configuration required; one URL change in your app. See Operating Modes.
In your own code
Call StellarGate directly via SDK for bespoke integrations. Useful when you’re not using StellarBase but want the anonymization layer.
GDPR implications
Pseudonymization (reversible tokenization with a separately-held key) is explicitly recommended by GDPR Article 32 as an appropriate technical measure. Anonymization (irreversible) takes data out of GDPR scope entirely — but the bar is high (“so that the natural person is not or no longer identifiable”).
StellarGate defaults to pseudonymization. For GDPR “anonymization” in the strict sense, combine token replacement with:
- Generalization of quasi-identifiers (age ranges instead of exact ages)
- k-anonymity enforcement across the dataset
- Permanent destruction of the mapping
We support these as extensions on request.
HIPAA-equivalent
For US healthcare partners, Safe Harbor method (removal of 18 specific identifiers) and Expert Determination are both supported via custom configuration. The 18 identifiers map directly onto StellarGate’s built-in categories plus a few additional patterns.
What anonymization doesn’t solve
- Quasi-identifiers — age + gender + ZIP code can re-identify an individual even without names. Use k-anonymity for datasets where this matters.
- Linkage attacks — a carefully-crafted prompt with unique context can identify an individual. For maximum safety, pair with on-prem inference.
- Metadata leaks — document timestamps, author fields, file paths can leak identity. Anonymize metadata too.
- Side-channel leaks — a rejected request, a visible error, or a timing difference can leak information. StellarGate mitigates but doesn’t eliminate.
