StellarGate

Custom Dictionaries

Deterministic replacement for domain-specific terms. Upload your client names, project codenames, internal IDs — those terms get masked 100% of the time, not 99.9%.

🛠 Planned for Q4 2026. Dictionary-based deterministic replacement, the dry-run UI, and versioning are scheduled for release in Q4 2026. ML-based detection (the fallback referenced on this page) is available today.

Why dictionaries?

ML detection is excellent for general entities (names, emails, addresses) but makes probabilistic decisions. When your compliance team asks “are you sure client X’s name gets masked?”, “99.9% of the time” is not the answer they want. A dictionary entry gives a guarantee.

Dictionaries also catch things no general-purpose model knows:

  • Your internal project codenames (Operation Aurora, Polaris v2)
  • Your client’s legal name variations (“ACME Group”, “ACME AG”, and “ACME” are all the same entity)
  • Your own employee directory
  • Terms unique to your domain — legal case IDs, medical trial IDs, archaeological codes

Dictionary structure

Each dictionary has:

  • Name — for organization (e.g. clients-vip, internal-codenames)
  • Entries — term → replacement token → optional category
  • Scope — which Bases / policies this dictionary applies to
  • Priority — when multiple dictionaries match, higher priority wins

Example

Term                    Replacement    Category
Aurora                  [PROJECT_1]    project
Polaris                 [PROJECT_2]    project
Dr. Nováková            [PERSON]       person
XY-2026-ALPHA           [CASE_ID]      case
Oddělení hematologie    [DEPT]         dept
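
To make the structure concrete, here is a minimal sketch of how a dictionary and its entries could be modelled. The class names, field names, and the Base identifier are assumptions for illustration only, not StellarGate's actual schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class Entry:
    term: str
    replacement: str
    category: Optional[str] = None  # category is optional


@dataclass
class Dictionary:
    name: str
    priority: int                                      # higher priority wins on overlap
    scope: List[str] = field(default_factory=list)     # Bases / policies it applies to
    entries: List[Entry] = field(default_factory=list)


internal_codenames = Dictionary(
    name="internal-codenames",
    priority=10,
    scope=["base-legal"],  # hypothetical Base identifier
    entries=[
        Entry("Aurora", "[PROJECT_1]", "project"),
        Entry("Polaris", "[PROJECT_2]", "project"),
        Entry("XY-2026-ALPHA", "[CASE_ID]", "case"),
    ],
)
```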

Creating a dictionary

Three paths:

Upload a file

Drag a CSV, JSON, or YAML file into the StellarGate admin UI. Columns are mapped automatically if your file has recognisable headers. Files with thousands of entries are supported.
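
The exact headers that StellarGate recognises are not listed here, so the following is only a sketch of how header-based column mapping could work for a CSV file. The alias sets and the function names `map_columns` and `parse_csv` are assumptions.

```python
import csv
import io

# Hypothetical header aliases; the real auto-mapping rules may differ.
HEADER_ALIASES = {
    "term": {"term", "name", "value"},
    "replacement": {"replacement", "token", "mask"},
    "category": {"category", "type"},
}


def map_columns(headers):
    """Map each canonical field to the matching column name in the file."""
    mapping = {}
    for header in headers:
        key = header.strip().lower()
        for canonical, aliases in HEADER_ALIASES.items():
            if key in aliases and canonical not in mapping:
                mapping[canonical] = header
    return mapping


def parse_csv(text):
    reader = csv.DictReader(io.StringIO(text))
    mapping = map_columns(reader.fieldnames or [])
    missing = {"term", "replacement"} - mapping.keys()
    if missing:
        raise ValueError(f"unrecognised headers, missing: {sorted(missing)}")
    category_col = mapping.get("category")
    rows = []
    for row in reader:
        category = (row.get(category_col) or "").strip() if category_col else ""
        rows.append({
            "term": row[mapping["term"]].strip(),
            "replacement": row[mapping["replacement"]].strip(),
            "category": category or None,  # an empty cell means "no category"
        })
    return rows


sample = "Term,Token,Type\nAurora,[PROJECT_1],project\nPolaris,[PROJECT_2],\n"
entries = parse_csv(sample)
```

A file whose headers cannot be mapped raises an error in this sketch rather than guessing, which is the safer behaviour for a masking rule set.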

Build manually

Click “Add entry” for a handful of terms. Useful when you’re seeding a new dictionary or adding ad-hoc entries.

Programmatic

The StellarGate admin API accepts dictionary CRUD. Synchronize with your HR / matter-management / project-management system so the dictionary stays current as your organization evolves.
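
The admin API's endpoints are not described on this page, so the sketch below stops short of any network call. It only shows the diffing step a sync job might run: compare the terms exported from a source system with the current dictionary and work out what to create, update, and delete. The function name `plan_sync` and the extra codenames are hypothetical.

```python
def plan_sync(current, desired):
    """Compare two term -> replacement maps and return the changes needed."""
    to_create = {t: r for t, r in desired.items() if t not in current}
    to_update = {t: r for t, r in desired.items()
                 if t in current and current[t] != r}
    to_delete = sorted(t for t in current if t not in desired)
    return to_create, to_update, to_delete


# What the dictionary holds now versus what the source system says it should hold.
current = {"Aurora": "[PROJECT_1]", "Polaris": "[PROJECT_2]", "Vega": "[PROJECT_3]"}
desired = {"Aurora": "[PROJECT_1]", "Polaris": "[PROJECT_9]", "Orion": "[PROJECT_4]"}

create, update, delete = plan_sync(current, desired)
```

Each of the three results would then be sent through whichever create, update, and delete operations the admin API exposes.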

Matching rules

Case sensitivity

Matching is case-insensitive by default, so “ACME” also matches “acme” and “Acme”. Case sensitivity can be enabled per entry.

Word boundaries

Matches respect word boundaries — “ACME” won’t match inside “ACMEWARE”. Disable per entry if you want substring matching.
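
The two per-entry switches above can be illustrated with Python's standard `re` module. This is a sketch of the described behaviour, not StellarGate's matcher; `compile_entry` and its parameters are hypothetical names.

```python
import re


def compile_entry(term, case_sensitive=False, whole_word=True):
    """Build a regex for one dictionary entry."""
    pattern = re.escape(term)  # treat the term as literal text
    if whole_word:
        # Lookarounds rather than \b, so terms that start or end
        # with punctuation are still bounded correctly.
        pattern = rf"(?<!\w){pattern}(?!\w)"
    flags = 0 if case_sensitive else re.IGNORECASE
    return re.compile(pattern, flags)


default = compile_entry("ACME")                      # case-insensitive, whole word
substring = compile_entry("ACME", whole_word=False)  # substring matching
strict = compile_entry("ACME", case_sensitive=True)  # case-sensitive entry

hits = {
    "default_lowercase": bool(default.search("contract with acme signed")),
    "default_inside_word": bool(default.search("ACMEWARE license")),
    "substring_inside_word": bool(substring.search("ACMEWARE license")),
    "strict_lowercase": bool(strict.search("contract with acme signed")),
}
```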

Variants & aliases

An entry can have multiple variants that all map to the same token:

  • “ACME Group”, “ACME AG”, “ACME” → all become [ORG_1]
  • “John Smith”, “J. Smith”, “Smith, John” → all become [PERSON_1]

This matters for consistency — if the LLM sees the same entity referenced multiple ways and they all map to the same token, its reasoning about that entity stays coherent.
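
As an illustration of variants collapsing to one token, the sketch below builds a single case-insensitive pattern from all variants and replaces each hit with its entry's token. Trying the longest variants first is a design choice of this sketch, so that “ACME Group” is not cut short by the shorter “ACME”. The names `VARIANTS` and `build_matcher` are hypothetical.

```python
import re

VARIANTS = {
    "[ORG_1]": ["ACME Group", "ACME AG", "ACME"],
    "[PERSON_1]": ["John Smith", "J. Smith", "Smith, John"],
}


def build_matcher(variants):
    # Lowercased variant -> token, for case-insensitive lookup after a match.
    lookup = {v.lower(): token for token, vs in variants.items() for v in vs}
    # Longest variants first so the alternation prefers the fullest form.
    ordered = sorted(lookup, key=len, reverse=True)
    pattern = re.compile(
        r"(?<!\w)(?:" + "|".join(re.escape(v) for v in ordered) + r")(?!\w)",
        re.IGNORECASE,
    )
    return lambda text: pattern.sub(lambda m: lookup[m.group(0).lower()], text)


mask = build_matcher(VARIANTS)
masked = mask("ACME Group hired J. Smith; Smith, John later left ACME.")
# masked == "[ORG_1] hired [PERSON_1]; [PERSON_1] later left [ORG_1]."
```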

Diacritics

Matching is accent-insensitive by default: “Novák” matches “Novak” and vice versa. This is useful for Czech, Slovak, Polish, Hungarian, and French names written with or without diacritics.
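
One common way to get accent-insensitive comparison is Unicode decomposition followed by removal of combining marks. The sketch below uses Python's standard `unicodedata` module; it shows the general technique, and StellarGate's own normalisation may differ. The helper name `fold` is hypothetical.

```python
import unicodedata


def fold(text):
    """Strip combining accents and case so 'Novák' and 'Novak' compare equal."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()


same = fold("Novák") == fold("Novak")
folded_name = fold("Dr. Nováková")          # "dr. novakova"
folded_dept = fold("Oddělení hematologie")  # "oddeleni hematologie"
# Letters with no Unicode decomposition, such as Polish "ł", pass through
# unchanged here and would need an explicit mapping table.
```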

Priorities

Dictionaries are evaluated in the priority order you configure. High-priority dictionaries (per-client, per-matter) override broader firm-wide ones. Allowlists are honoured first, and ML detection remains the safety net for anything no rule has matched.
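
The evaluation order can be sketched as a small lookup function. It assumes that an allowlisted term is left unmasked and that a larger priority number means higher priority; both are assumptions of this sketch, as are the dictionary names and the stand-in for ML detection.

```python
def resolve(term, allowlist, dictionaries, ml_detect):
    """Return the replacement token for a term, or None to leave it unmasked."""
    if term in allowlist:
        return None  # assumption: allowlisted terms are never masked
    # Highest priority first, so per-client rules beat firm-wide ones.
    for d in sorted(dictionaries, key=lambda d: d["priority"], reverse=True):
        if term in d["entries"]:
            return d["entries"][term]
    return ml_detect(term)  # safety net for anything no rule matched


dictionaries = [
    {"name": "firm-wide", "priority": 1,
     "entries": {"ACME": "[ORG]", "Aurora": "[PROJECT_1]"}},
    {"name": "client-acme", "priority": 10,
     "entries": {"ACME": "[CLIENT_1]"}},
]
allowlist = {"StellarGate"}


def fake_ml(term):
    # Stand-in for ML detection, for illustration only.
    return "[PERSON]" if term == "Jane Doe" else None


results = {t: resolve(t, allowlist, dictionaries, fake_ml)
           for t in ["ACME", "Aurora", "StellarGate", "Jane Doe"]}
```

Here “ACME” resolves to the per-client token because `client-acme` outranks `firm-wide`, while “Jane Doe” is caught only by the fallback.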

Testing

The dry-run tool lets you paste any text and see exactly what each dictionary entry catches. Use it before deploying a new dictionary to production. This is especially important for regex rules, which can produce unintended matches.
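
Until the dry-run UI is available, the idea can be approximated locally. The sketch below reports every match with its position instead of rewriting the text; it assumes the default matching rules (case-insensitive, whole word), and the function name `dry_run` is hypothetical.

```python
import re


def dry_run(text, entries):
    """List matches as (term, replacement, start, end) without changing the text."""
    hits = []
    for term, replacement in entries.items():
        pattern = re.compile(rf"(?<!\w){re.escape(term)}(?!\w)", re.IGNORECASE)
        for m in pattern.finditer(text):
            hits.append((term, replacement, m.start(), m.end()))
    return sorted(hits, key=lambda h: h[2])  # order by position in the text


text = "Aurora kickoff: Polaris team joins aurora."
report = dry_run(text, {"Aurora": "[PROJECT_1]", "Polaris": "[PROJECT_2]"})

for term, replacement, start, end in report:
    print(f"{text[start:end]!r} at {start}-{end} -> {replacement} (entry: {term})")
```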

Versioning

Every dictionary change is versioned. Roll back instantly if a new entry causes unexpected behaviour. The audit trail shows who added, modified, or deleted each entry, and when.
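
A simple way to picture versioning with an audit trail is a snapshot per change. The sketch below is an in-memory illustration under that assumption; the class and method names are hypothetical, and StellarGate's storage model is not described here.

```python
from datetime import datetime, timezone


class VersionedDictionary:
    """Keeps a full snapshot per change so any version can be restored."""

    def __init__(self):
        self.entries = {}   # term -> replacement
        self.history = []   # one audit record per change

    def _record(self, user, action):
        self.history.append({
            "version": len(self.history) + 1,
            "user": user,
            "action": action,
            "at": datetime.now(timezone.utc),
            "snapshot": dict(self.entries),
        })

    def put(self, user, term, replacement):
        action = "modified" if term in self.entries else "added"
        self.entries[term] = replacement
        self._record(user, f"{action} {term}")

    def delete(self, user, term):
        del self.entries[term]
        self._record(user, f"deleted {term}")

    def rollback(self, user, version):
        # Restoring is itself recorded, so the audit trail stays complete.
        self.entries = dict(self.history[version - 1]["snapshot"])
        self._record(user, f"rolled back to v{version}")


d = VersionedDictionary()
d.put("alice", "Aurora", "[PROJECT_1]")   # v1
d.put("bob", "Aurora", "[PROJECT_X]")     # v2, the change to undo
d.rollback("alice", 1)                    # v3 restores the v1 contents
```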

Sharing dictionaries across Bases

A dictionary can be attached to:

  • A single Base (most common)
  • A group of Bases (e.g. all corporate finance Bases)
  • The entire tenant (global dictionaries)
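
The three attachment levels can be sketched as a lookup that collects every dictionary applying to a given Base. The Base identifiers, group name, and the `finance-terms` dictionary are hypothetical, as is the function name.

```python
def dictionaries_for_base(base_id, groups_of, attachments):
    """Return the names of all dictionaries that apply to base_id."""
    groups = groups_of.get(base_id, set())
    applicable = []
    for name, (kind, target) in attachments.items():
        if (
            kind == "tenant"                             # global dictionary
            or (kind == "base" and target == base_id)    # attached to this Base
            or (kind == "group" and target in groups)    # attached to its group
        ):
            applicable.append(name)
    return sorted(applicable)


attachments = {
    "clients-vip": ("base", "base-mna"),
    "finance-terms": ("group", "corporate-finance"),
    "internal-codenames": ("tenant", None),
}
groups_of = {"base-mna": {"corporate-finance"}, "base-hr": set()}

for_mna = dictionaries_for_base("base-mna", groups_of, attachments)
for_hr = dictionaries_for_base("base-hr", groups_of, attachments)
```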

Edit and apply permissions are separate: a compliance officer may be the only person who can modify a dictionary, while many analysts can use it.

Size & performance

Dictionaries scale to enterprise-size term lists with no perceptible per-request overhead. For very large catalogues, split dictionaries by scope so that only the relevant rules are evaluated for each Base.
