AI Agents
A scoped AI specialist you configure. The agent's system prompt is where your domain expertise lives. Its tools, knowledge scope, and output schema make it predictable. Every answer is cited.
What an agent is
An agent is a configuration, not a fixed model. Five pieces define it:
| Piece | What it determines |
|---|---|
| Name + role | How users invoke it; what it presents itself as |
| System prompt | Behaviour, constraints, your playbook |
| Knowledge scope | Which collections it can read |
| Tool allowlist | What actions it can take |
| Output schema | Whether responses must follow a fixed structure |
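As a rough illustration, the five pieces in the table can be pictured as one configuration record. The `AgentConfig` class, its field names, and the example values below are assumptions made for this sketch; they are not the platform's actual schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AgentConfig:
    """Hypothetical container for the five pieces; not the platform's schema."""
    name: str                              # how users invoke the agent
    role: str                              # what it presents itself as
    system_prompt: str                     # behaviour, constraints, playbook
    knowledge_scope: tuple = ()            # collections it can read
    tool_allowlist: tuple = ()             # actions it can take
    output_schema: Optional[dict] = None   # None means free-form responses


# Example values are invented for illustration.
nda_reviewer = AgentConfig(
    name="nda-reviewer",
    role="NDA Reviewer",
    system_prompt="You are an NDA reviewer. Flag clauses that break the playbook.",
    knowledge_scope=("nda-templates",),
    tool_allowlist=("knowledge_base_search",),
    output_schema={"required": ["risk_level", "flag_text", "citation"]},
)
```

Treating the record as immutable (`frozen=True`) mirrors the idea that a change to an agent produces a new version rather than editing the old one in place.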
The system prompt is the asset
Most of your domain expertise (your firm’s playbook, your hospital’s protocols, your senior engineer’s heuristics) is encoded in the system prompt. It is the most valuable thing you create on the platform.
Good prompts share a structure:
- Role definition — “You are a [specific role].”
- Goal — what success looks like for one task
- Inputs — what the agent will receive
- Required outputs — exact fields, format, citation requirements
- Rules / heuristics — the rules from your playbook, named explicitly
- Edge cases — what to do when input is ambiguous, missing, conflicting
- Style — formal / informal, language, brevity
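The structure above can be sketched as a small template builder that refuses to produce a prompt when a section is missing. The section labels, the `build_system_prompt` helper, and the example NDA wording are all hypothetical; adapt them to your own playbook.

```python
SECTIONS = ("Role", "Goal", "Inputs", "Required outputs",
            "Rules", "Edge cases", "Style")


def build_system_prompt(parts: dict) -> str:
    """Join the seven sections in a fixed order; fail if any is empty."""
    missing = [s for s in SECTIONS if not parts.get(s, "").strip()]
    if missing:
        raise ValueError("missing prompt sections: " + ", ".join(missing))
    return "\n\n".join(f"{s}:\n{parts[s].strip()}" for s in SECTIONS)


# Invented example content for a single-purpose reviewer.
prompt = build_system_prompt({
    "Role": "You are an NDA reviewer for a commercial law firm.",
    "Goal": "Flag every clause that deviates from the firm's playbook.",
    "Inputs": "One NDA as plain text.",
    "Required outputs": "One row per flagged clause, each with a citation.",
    "Rules": "Rule 1: mutual confidentiality only. Rule 2: term of 3 years or less.",
    "Edge cases": "If the governing law is missing, flag it instead of guessing.",
    "Style": "Formal, English, brief.",
})
```

Failing loudly on a missing section is a design choice for this sketch: it makes an incomplete prompt a build error instead of a silent quality problem.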
Resist the urge to write one giant agent that does everything. Specialists outperform generalists. Three focused agents (NDA Reviewer, MSA Reviewer, Employment Reviewer) beat one “Contract Reviewer” that tries to be all three.
Knowledge scope
Limit each agent to the collections it actually needs. A research agent doesn’t need access to HR records. Scoping matters for:
- Quality — agents perform better on focused corpora
- Privacy — even authorized users may want strict data partitioning
- Cost — smaller scope means smaller retrieval payloads
- Speed — less to search
Scopes can be defined at the Base, collection, or document-tag level.
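A minimal sketch of how a three-level scope check could behave, assuming a document is readable when any one level grants access and that an empty scope grants nothing. The `Doc` and `Scope` classes and the collection names are hypothetical; the platform's real matching rules may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Doc:
    base: str
    collection: str
    tags: frozenset = frozenset()


@dataclass(frozen=True)
class Scope:
    """Hypothetical scope: a document is readable if any level grants it."""
    bases: frozenset = frozenset()
    collections: frozenset = frozenset()
    tags: frozenset = frozenset()

    def allows(self, doc: Doc) -> bool:
        return (
            doc.base in self.bases
            or doc.collection in self.collections
            or bool(doc.tags & self.tags)
        )


# A research agent scoped to one collection cannot see HR records.
research_scope = Scope(collections=frozenset({"oncology-papers"}))
paper = Doc("clinical", "oncology-papers", frozenset({"peer-reviewed"}))
hr_file = Doc("corporate", "hr-records", frozenset({"confidential"}))

readable = [d for d in (paper, hr_file) if research_scope.allows(d)]
```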
Tools
Tools are functions an agent can invoke. The platform ships with built-ins covering the basics — knowledge-base search, single-document search, full-document fetch, OCR on attached files, entity and graph lookup, and structured-output emission against a schema.
Beyond built-ins, register your own tools — REST endpoints, custom Python scripts, your own ML models, calendar APIs, anything addressable. See Zero-Trust for how tool allowlists work.
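To make the idea of registered tools plus a per-agent allowlist concrete, here is a toy registry. It is only a sketch: `ToolRegistry`, `ToolNotAllowed`, and the two example tools are invented for illustration and do not describe the platform's registration API or its Zero-Trust enforcement.

```python
from typing import Callable


class ToolNotAllowed(PermissionError):
    """Raised when an agent asks for a tool outside its allowlist."""


class ToolRegistry:
    def __init__(self) -> None:
        self._tools: dict = {}

    def register(self, name: str, fn: Callable) -> None:
        self._tools[name] = fn

    def invoke(self, name: str, allowlist: frozenset, **kwargs):
        # Check the allowlist before the lookup, so a denied call
        # never reaches the tool's code.
        if name not in allowlist:
            raise ToolNotAllowed(name)
        return self._tools[name](**kwargs)


registry = ToolRegistry()
registry.register("word_count", lambda text: len(text.split()))
registry.register("send_email", lambda to, body: f"sent to {to}")

# This agent may count words but may not send email.
reviewer_allowlist = frozenset({"word_count"})
count = registry.invoke("word_count", reviewer_allowlist, text="mutual NDA draft")
```

Calling `registry.invoke("send_email", reviewer_allowlist, ...)` raises `ToolNotAllowed`, even though the tool is registered.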
Output schema (optional but recommended)
For agents whose output feeds downstream systems (CSV exports, database inserts, workflow steps), define an output schema. The agent’s response must conform to it. Examples:
- Risk register row: id, type, counterparty, jurisdiction, risk_level, flag_text, citation
- Treatment option: regimen, evidence_grade, contraindications, citation_url
- Lit-review entry: paper_id, relevance_score, summary, methods, limitations
Schema enforcement uses native structured-output features of modern LLMs. The agent will retry up to N times if its output doesn’t conform; if it still doesn’t, the workflow flags the row for human review.
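The retry-then-flag behaviour can be sketched as a plain validation loop. This is an illustration only: the platform relies on native structured-output features, whereas the code below checks a JSON string by hand. The field list is taken from the risk register example; `conforms`, `run_with_retries`, and the `max_retries` parameter (standing in for the unspecified N) are assumptions.

```python
import json
from typing import Callable, Optional

# Fields from the risk register example; treating all as strings is an assumption.
REQUIRED_FIELDS = {
    "id": str, "type": str, "counterparty": str, "jurisdiction": str,
    "risk_level": str, "flag_text": str, "citation": str,
}


def conforms(raw: str) -> Optional[dict]:
    """Return the parsed row if it matches the schema exactly, else None."""
    try:
        row = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(row, dict) or set(row) != set(REQUIRED_FIELDS):
        return None
    if not all(isinstance(row[k], t) for k, t in REQUIRED_FIELDS.items()):
        return None
    return row


def run_with_retries(generate: Callable[[int], str], max_retries: int):
    """One initial attempt plus up to max_retries retries.

    Returns (row, needs_review); needs_review is True when every attempt failed.
    """
    for attempt in range(max_retries + 1):
        row = conforms(generate(attempt))
        if row is not None:
            return row, False
    return None, True


# Stand-in for an agent: the first attempt is malformed, the second is valid.
attempts = [
    "not json",
    json.dumps({"id": "R-1", "type": "NDA", "counterparty": "Acme",
                "jurisdiction": "NY", "risk_level": "high",
                "flag_text": "perpetual term", "citation": "clause 4"}),
]
row, needs_review = run_with_retries(lambda i: attempts[i], max_retries=2)
```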
Multi-agent workflows
Agents can call other agents. A “Tumor Board Coordinator” agent might invoke “Case Summarizer”, “Guideline Agent”, and “Literature Agent” in sequence, then synthesize their outputs into a final packet.
This composition is preferred over one mega-agent. Each component agent is testable in isolation; the orchestration is its own concern.
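A sketch of the coordinator pattern, assuming each agent can be modelled as a function from a case description to text. The stub agents below return canned strings; real ones would call an LLM with their own prompt and scope. `coordinate` and `build_packet` are hypothetical names.

```python
from typing import Callable


def coordinate(case: str, steps: list, synthesize: Callable) -> str:
    """Run each (name, agent) step in order, then merge the named outputs."""
    outputs = {}
    for name, agent in steps:
        outputs[name] = agent(case)
    return synthesize(outputs)


# Stub component agents, each testable on its own.
def case_summarizer(case: str) -> str:
    return f"summary of {case}"


def guideline_agent(case: str) -> str:
    return f"guidelines for {case}"


def literature_agent(case: str) -> str:
    return f"literature on {case}"


def build_packet(outputs: dict) -> str:
    # The synthesis step is kept separate from the component agents.
    return "\n".join(f"{name}: {text}" for name, text in outputs.items())


packet = coordinate(
    "case-017",
    [("Case Summarizer", case_summarizer),
     ("Guideline Agent", guideline_agent),
     ("Literature Agent", literature_agent)],
    build_packet,
)
```

Because the orchestration only sees names and callables, any component can be swapped for a test double without touching the others.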
Versioning
Every change to an agent’s prompt or configuration is versioned. Roll back instantly if a new version regresses. Compare two versions side-by-side on a held-out test set.
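One way to picture versioning with rollback is an append-only history with a movable "active" pointer. The `AgentHistory` class below is a hypothetical sketch of that idea and says nothing about how the platform stores versions.

```python
class AgentHistory:
    """Append-only prompt history; version numbers start at 1."""

    def __init__(self, initial_prompt: str) -> None:
        self._versions = [initial_prompt]
        self.active = 1

    def publish(self, prompt: str) -> int:
        """Store a new version, make it active, and return its number."""
        self._versions.append(prompt)
        self.active = len(self._versions)
        return self.active

    def rollback(self, version: int) -> None:
        """Point back at an earlier version without deleting later ones."""
        if not 1 <= version <= len(self._versions):
            raise ValueError(f"unknown version {version}")
        self.active = version

    @property
    def prompt(self) -> str:
        return self._versions[self.active - 1]


history = AgentHistory("v1: flag perpetual terms")
v2 = history.publish("v2: flag perpetual terms and missing governing law")
history.rollback(1)   # v2 regressed, so go back; v2 stays in the history
```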
Testing
Build a test set of representative inputs (5–50 cases). Run the agent against the set after every prompt change. Track quality with a simple rubric:
- Did it produce the expected output structure?
- Did it cite correctly?
- Did it follow the playbook rules?
- Did it flag edge cases appropriately?
Test sets keep prompt tuning from breaking cases that previously worked.
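A minimal regression harness along these lines might look as follows. It assumes an agent is a function returning a dict, and that each test case carries one check per rubric item. The rubric keys, `run_suite`, `regressions`, and both stub agents are invented for this sketch.

```python
RUBRIC = ("structure", "citation", "playbook", "edge_cases")


def run_suite(agent, cases):
    """Return the pass rate per rubric item across all cases."""
    passed = {item: 0 for item in RUBRIC}
    for case in cases:
        output = agent(case["input"])
        for item in RUBRIC:
            passed[item] += bool(case["checks"][item](output))
    return {item: passed[item] / len(cases) for item in RUBRIC}


def regressions(before, after):
    """Rubric items whose pass rate dropped between two runs."""
    return [item for item in RUBRIC if after[item] < before[item]]


def checks_for(expected_level):
    return {
        "structure": lambda out: {"risk_level", "citation"} <= set(out),
        "citation": lambda out: bool(out.get("citation")),
        "playbook": lambda out: out.get("risk_level") == expected_level,
        "edge_cases": lambda out: out.get("risk_level") in {"low", "high", "unknown"},
    }


cases = [
    {"input": "perpetual term", "checks": checks_for("high")},
    {"input": "two-year term", "checks": checks_for("low")},
]


def agent_v1(text):
    level = "high" if "perpetual" in text else "low"
    return {"risk_level": level, "citation": "clause 4"}


def agent_v2(text):
    # Simulates a prompt edit that accidentally dropped citations.
    level = "high" if "perpetual" in text else "low"
    return {"risk_level": level, "citation": ""}


baseline = run_suite(agent_v1, cases)
candidate = run_suite(agent_v2, cases)
broken = regressions(baseline, candidate)
```

Here `broken` names the rubric item that got worse, which is the signal to hold back the new prompt version.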
Cost & latency
Agent cost is dominated by LLM tokens. Tips:
- Tight knowledge scope reduces retrieval payload, which reduces context size
- Use smaller models (for example, GPT-OSS 120B rather than Qwen 3.5 397B) where quality permits
- Cache deterministic sub-results
- Use smart routing to pick the cheapest model that meets your quality bar
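Two of the tips above, routing to the cheapest adequate model and caching deterministic sub-results, can be sketched in a few lines. The model names, prices, and quality scores are made-up placeholders, and `pick_model` and `normalize_party` are hypothetical helpers; the platform's own routing may work differently.

```python
from dataclasses import dataclass
from functools import lru_cache


@dataclass(frozen=True)
class Model:
    name: str
    cost_per_1k_tokens: float   # placeholder number, not a real price
    quality: float              # score on your own test set, 0 to 1


def pick_model(models, quality_bar: float) -> Model:
    """Cheapest model whose measured quality meets the bar."""
    eligible = [m for m in models if m.quality >= quality_bar]
    if not eligible:
        raise ValueError(f"no model reaches quality {quality_bar}")
    return min(eligible, key=lambda m: m.cost_per_1k_tokens)


MODELS = [
    Model("small", 0.10, 0.82),
    Model("medium", 0.40, 0.90),
    Model("large", 1.20, 0.95),
]


@lru_cache(maxsize=256)
def normalize_party(raw_name: str) -> str:
    """A deterministic sub-step: same input, same output, so safe to cache."""
    return " ".join(raw_name.split()).upper()


chosen = pick_model(MODELS, quality_bar=0.85)
normalize_party("Acme  Corp")
normalize_party("Acme  Corp")   # second call is served from the cache
```

The quality scores are meant to come from your own test set, so the routing decision stays tied to the same rubric used for regression testing.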
Common mistakes
- Vague prompts — “review this and tell me if it’s good” → agent doesn’t know what good means
- Missing edge cases — agent confidently produces nonsense for inputs you didn’t anticipate
- No output schema — agents drift in format over time, breaking downstream parsers
- Too many tools — agents pick wrong tools when many are available; allowlist tightly
- No test set — every prompt edit becomes risky
Related
- Workflows — running agents on a schedule
- Chat — invoking agents conversationally
- LLMs — picking the right model for an agent
- Zero-Trust — tool security
