Engineering · Apr 5, 2026 · 8 min read

How the Dynamic Semantic Module Actually Works

A deep dive into the engine behind StellarBase's intelligence. How DSM builds understanding across your data without depending on any LLM.

What Is DSM?

The Dynamic Semantic Module is StellarBase’s proprietary engine for understanding data. Unlike traditional search or RAG systems that treat documents as isolated chunks of text, DSM builds a living knowledge graph that captures the relationships, entities, and semantic connections across your entire data corpus.

Think of it this way: when you read a company’s documents over months, you naturally build a mental model of how things connect. You know that the “Q3 report” relates to the “Q2 report,” that “Project Atlas” involves the engineering team, and that a specific contract references a specific client. DSM does the same thing — automatically, at scale, and without any LLM involved in the understanding process.

This is a critical distinction. DSM doesn’t call GPT-4 or Claude to understand your documents. It uses its own purpose-built models for entity extraction, relationship detection, and semantic analysis. The LLM only enters the picture when a user asks a question — and by then, DSM has already done the hard work of understanding the data.

The Three Phases

DSM processes every document through three sequential phases. Each phase builds on the output of the previous one, progressively enriching the knowledge graph.

Phase 1: Content Analysis

When a document enters the system, DSM first performs deep content analysis. This goes far beyond simple text extraction. The engine identifies:

  • Named entities — people, organizations, locations, products, dates, monetary values
  • Key concepts — topics, themes, and domain-specific terminology
  • Structural elements — headers, sections, tables, lists, and their hierarchy
  • Sentiment and intent — is this a proposal, a report, a complaint, a decision?
  • Temporal markers — when was this written, what time period does it cover, are there deadlines?

Content analysis runs locally using a pipeline of specialized models: one for NER, one for topic modeling, and one for structural parsing. Each model is small (under 500 MB) and optimized for throughput. No data leaves your infrastructure during this phase.
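To make the pipeline idea concrete, here is a minimal sketch of stages that each enrich one per-document record. It is not DSM's code: the `Analysis` record, the stage functions, and the regex rules are all hypothetical stand-ins for the specialized models described above.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Analysis:
    """Accumulates the output of each pipeline stage for one document."""
    text: str
    entities: list = field(default_factory=list)  # (surface form, type) pairs
    headers: list = field(default_factory=list)


def extract_money(doc):
    # Stand-in for an NER model: only finds dollar amounts such as "$2.4M".
    for match in re.finditer(r"\$\d+(?:\.\d+)?[KMB]?", doc.text):
        doc.entities.append((match.group(), "Money"))
    return doc


def extract_headers(doc):
    # Stand-in for a structural parser: treats lines ending in ":" as headers.
    doc.headers = [line.rstrip(":") for line in doc.text.splitlines()
                   if line.endswith(":")]
    return doc


# Hypothetical stage order; a real pipeline would hold model-backed stages.
PIPELINE = [extract_money, extract_headers]


def analyze(text):
    doc = Analysis(text)
    for stage in PIPELINE:
        doc = stage(doc)
    return doc


result = analyze("Revenue:\nARR reached $2.4M, up from $1.8M.")
print(result.entities)
print(result.headers)
```

Because every stage takes and returns the same record, stages can be added or swapped without touching the others, which is one plausible reason to structure local analysis this way.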

Phase 2: Relationship Discovery

Once entities and concepts are extracted from individual documents, DSM looks for connections across the entire corpus.

The relationship discovery engine uses a combination of exact matching, fuzzy matching, and semantic similarity to find connections that traditional search would miss. It identifies six types of relationships:

Semantic:    "cloud infrastructure" ↔ "AWS deployment guide"
Temporal:    "Q3 Report" → follows → "Q2 Report"
Structural:  Chapter 3 → part_of → "Architecture Manual"
Referential: Email → mentions → "Project Atlas"
Causal:      "Server migration" → caused → "Downtime incident"
Authorial:   "Design spec" → created_by → "Engineering team"
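The matching cascade can be sketched as follows, assuming exact matching is tried before fuzzy matching. The `match_entity` function, the normalization rule, and the 0.85 threshold are illustrative assumptions, not DSM's actual parameters. The semantic-similarity step is left out because it would need an embedding model.

```python
from difflib import SequenceMatcher


def normalize(name):
    # Lowercase, drop periods and commas, collapse whitespace.
    return " ".join(name.lower().replace(".", "").replace(",", "").split())


def match_entity(mention, known, fuzzy_threshold=0.85):
    """Return (canonical_name, method), or (None, None) if nothing matches."""
    norm = normalize(mention)

    # Step 1: exact match on the normalized form.
    for candidate in known:
        if normalize(candidate) == norm:
            return candidate, "exact"

    # Step 2: fuzzy match, keeping the best-scoring candidate.
    best, best_score = None, 0.0
    for candidate in known:
        score = SequenceMatcher(None, norm, normalize(candidate)).ratio()
        if score > best_score:
            best, best_score = candidate, score
    if best_score >= fuzzy_threshold:
        return best, "fuzzy"

    return None, None


KNOWN = ["ACME Corp", "Project Atlas"]
print(match_entity("acme corp.", KNOWN))
print(match_entity("Projct Atlas", KNOWN))
print(match_entity("Downtime incident", KNOWN))
```

Trying the cheapest check first means the more expensive comparisons only run for mentions that do not match exactly.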

Relationship discovery is incremental. When a new document enters the system, DSM doesn’t reprocess the entire corpus. It identifies entities in the new document and checks them against the existing graph, adding new nodes and edges in real time. In our internal benchmarks, a new document is fully integrated into the graph in under 200 milliseconds.
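A minimal sketch of an incremental update, assuming an index from each entity to the documents that mention it. The `IncrementalGraph` class and its edge labels are hypothetical. The point is that adding a document only touches the entries for that document's own entities.

```python
from collections import defaultdict


class IncrementalGraph:
    def __init__(self):
        self.edges = set()                      # (source, relation, target)
        self.docs_by_entity = defaultdict(set)  # entity -> documents mentioning it

    def add_document(self, doc_id, entities):
        """Link a document to its entities and to earlier documents that
        share one of them. Returns only the edges that are new."""
        new_edges = set()
        for entity in entities:
            new_edges.add((doc_id, "mentions", entity))
            # Only documents indexed under this entity are visited,
            # never the whole corpus.
            for other in self.docs_by_entity[entity]:
                if other != doc_id:
                    new_edges.add((doc_id, "relates_to", other))
            self.docs_by_entity[entity].add(doc_id)
        new_edges -= self.edges
        self.edges |= new_edges
        return new_edges


graph = IncrementalGraph()
first = graph.add_document("Q2 Report", ["ACME Corp"])
second = graph.add_document("Q3 Report", ["ACME Corp", "Enterprise Plan"])
print(sorted(second))
```

The cost of an update in this sketch depends on how many documents share the new document's entities, not on the total corpus size.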

Phase 3: Graph Construction

The final phase assembles all entities, concepts, and relationships into a queryable knowledge graph. This isn’t a static data structure — it’s a weighted, directed graph where edge weights represent connection strength and are continuously updated as new data arrives.

Document: "Q3 Revenue Report"
  ├── mentions: "ACME Corp" (Organization)
  ├── contains: "$2.4M ARR" (Metric)
  ├── references: "Enterprise Plan" (Product)
  ├── authored_by: "Sarah Chen" (Person)
  └── relates_to: "Q2 Revenue Report" (Document)
       ├── mentions: "ACME Corp" (Organization)
       ├── contains: "$1.8M ARR" (Metric)
       └── shows_trend: +33% growth

The graph supports multi-hop traversal, which means queries can follow chains of relationships to find answers that span multiple documents. Ask “How is ACME Corp doing?” and DSM can traverse from the company node to revenue reports, growth trends, team members, and active contracts — all in a single graph walk.
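Multi-hop traversal can be illustrated with a breadth-first walk over a small weighted, directed graph built from the example above. The edge weights, the `mentioned_in` inverse edge, and the `traverse` function are assumptions made for this sketch. DSM's storage format and traversal algorithm are not described here.

```python
from collections import deque

# node -> list of (relation, target, weight); weights are made-up values.
GRAPH = {
    "ACME Corp": [
        ("mentioned_in", "Q3 Revenue Report", 0.9),
        ("mentioned_in", "Q2 Revenue Report", 0.9),
    ],
    "Q3 Revenue Report": [
        ("contains", "$2.4M ARR", 1.0),
        ("references", "Enterprise Plan", 0.7),
        ("authored_by", "Sarah Chen", 1.0),
        ("relates_to", "Q2 Revenue Report", 0.8),
    ],
    "Q2 Revenue Report": [
        ("contains", "$1.8M ARR", 1.0),
    ],
}


def traverse(graph, start, max_hops=2, min_weight=0.5):
    """Breadth-first walk from `start`. Returns {node: hop distance},
    following only edges whose weight is at least `min_weight`."""
    seen = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if seen[node] == max_hops:
            continue  # do not expand beyond the hop limit
        for _relation, target, weight in graph.get(node, []):
            if weight >= min_weight and target not in seen:
                seen[target] = seen[node] + 1
                queue.append(target)
    return seen


reachable = traverse(GRAPH, "ACME Corp")
print(reachable)
```

Starting from the company node, two hops are enough to reach both ARR figures, which is what lets a single walk surface the Q2-to-Q3 trend.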

DSM vs. Traditional RAG

Retrieval-Augmented Generation has become the standard approach for grounding LLMs in enterprise data. But RAG has fundamental limitations that DSM was designed to overcome.

How RAG Works

Traditional RAG splits documents into chunks, embeds them as vectors, and retrieves the top-K most similar chunks for a given query. The retrieved chunks are then passed to an LLM as context. This works well for simple, direct questions but breaks down when the answer requires synthesis or reasoning across multiple documents.
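Top-K retrieval can be sketched in a few lines. The three-dimensional vectors below are hand-written toy values rather than real embeddings, and `top_k` is a hypothetical helper. A production system would use an embedding model and a vector index.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, chunks, k):
    """Return the text of the k chunks most similar to the query."""
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, c[1]), reverse=True)
    return [text for text, _vec in ranked[:k]]


# (chunk text, toy embedding)
CHUNKS = [
    ("ACME Corp ARR is $2.4M", [0.9, 0.1, 0.0]),
    ("ACME Corp ARR was $1.8M", [0.8, 0.2, 0.1]),
    ("Server migration schedule", [0.0, 0.1, 0.9]),
]

query = [1.0, 0.0, 0.0]
print(top_k(query, CHUNKS, k=1))
print(top_k(query, CHUNKS, k=2))
```

With `k=1` the second ARR chunk is cut off, so the LLM never sees the earlier figure and cannot report a trend.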

Where RAG Falls Short

  • No relationship awareness — RAG treats every chunk as independent. It doesn’t know that two chunks from different documents are about the same project or that one contradicts the other
  • Chunk boundary problems — important context often spans multiple chunks. When a table starts in one chunk and ends in another, the embedding captures neither accurately
  • No temporal understanding — RAG can’t distinguish between a Q2 report and a Q3 report unless the query explicitly mentions the quarter. It doesn’t understand that one supersedes the other
  • Recall ceiling — top-K retrieval means you only see K chunks, even if the answer requires information from 20 different documents
  • Embedding drift — the same concept expressed differently in two documents may have very different vector representations, causing retrieval misses

How DSM Solves These

// Traditional RAG
query: "How is ACME Corp doing?"
→ vector search → top-5 chunks → LLM summarizes
→ "ACME Corp revenue is $2.4M" (no context, no trend)

// DSM-powered query
query: "How is ACME Corp doing?"
→ graph traversal → connected nodes → full context
→ "ACME Corp ARR grew 33% from $1.8M to $2.4M between Q2-Q3,
   driven by Enterprise Plan adoption. Sarah Chen's team
   manages the account." (rich, connected answer)

DSM doesn’t replace vector search — it augments it. When a query comes in, DSM first identifies the relevant entities and concepts, then traverses the knowledge graph to gather connected context. This context is far richer than what top-K retrieval alone would return, and it includes relationship information that the LLM can use to generate more accurate, more complete answers.
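The query flow described above (identify entities, then gather connected context) can be sketched as follows. The edge list, the verbatim-substring entity spotting, and the `build_context` function are simplifying assumptions. They stand in for DSM's entity recognition and graph traversal, and the vector-search step is omitted.

```python
# (source, relation, target) triples taken from the examples in this post.
EDGES = [
    ("Q3 Revenue Report", "mentions", "ACME Corp"),
    ("Q3 Revenue Report", "contains", "$2.4M ARR"),
    ("Q2 Revenue Report", "mentions", "ACME Corp"),
    ("Q2 Revenue Report", "contains", "$1.8M ARR"),
    ("Design spec", "created_by", "Engineering team"),
]


def find_entities(query, edges):
    """Naive entity spotting: node names that appear verbatim in the query."""
    nodes = {name for source, _rel, target in edges for name in (source, target)}
    return sorted(name for name in nodes if name.lower() in query.lower())


def build_context(query, edges):
    """Collect every fact stated by documents that touch a query entity."""
    entities = find_entities(query, edges)
    docs = {source for source, _rel, target in edges if target in entities}
    return [f"{source} {rel} {target}"
            for source, rel, target in edges if source in docs]


context = build_context("How is ACME Corp doing?", EDGES)
for line in context:
    print(line)
```

The resulting lines would be passed to the LLM as context. Both ARR figures arrive together with the documents they came from, while the unrelated design spec is left out.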

Why No LLM Dependency for Understanding

A common question we get: “Why not just use GPT-4 to analyze documents and extract relationships?” There are three reasons:

  • Cost — processing a million documents through GPT-4 would cost tens of thousands of dollars. DSM’s local models process the same corpus for the cost of compute alone
  • Privacy — sending every document through an external LLM for analysis defeats the purpose of a self-hosted knowledge platform. DSM keeps all understanding local
  • Consistency — LLMs are non-deterministic. The same document processed twice might yield different entity extractions. DSM’s models are deterministic — same input, same output, every time

LLMs are excellent at generating natural language responses. They’re not the right tool for systematic, large-scale data understanding. DSM handles the understanding; the LLM handles the conversation.

Performance Characteristics

DSM is designed to handle enterprise-scale data. Here are the numbers from our internal benchmarks:

Benchmark: 1M documents, 50K unique entities

Indexing throughput:     ~2,400 documents/minute
Graph construction:      ~45 minutes for full corpus
Incremental update:      <200 ms per new document
Query (graph traversal): <15 ms avg, <50 ms p99
Memory footprint:        ~2.1 GB for full graph
Storage overhead:        ~12% of source document size
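A quick back-of-the-envelope calculation helps put these figures in context. It assumes the indexing throughput is sustained across the whole corpus and that graph construction is a separate step from indexing. The benchmark summary does not spell out either point, so treat the derived numbers as rough estimates.

```python
DOCS = 1_000_000
INDEX_RATE_PER_MIN = 2_400    # documents per minute, from the benchmark
INCREMENTAL_MAX_S = 0.200     # upper bound per document, from the benchmark
GRAPH_BYTES = 2.1e9           # ~2.1 GB, taking 1 GB as 10**9 bytes

# Time to index the full corpus at the quoted throughput.
indexing_hours = DOCS / INDEX_RATE_PER_MIN / 60

# Worst case if every document were instead added one at a time
# through the incremental path, with no parallelism.
incremental_hours = DOCS * INCREMENTAL_MAX_S / 3600

# Average in-memory graph footprint per document.
bytes_per_doc = GRAPH_BYTES / DOCS

print(f"Bulk indexing:        ~{indexing_hours:.1f} hours")
print(f"Serial incremental:   up to ~{incremental_hours:.1f} hours")
print(f"Graph memory per doc: ~{bytes_per_doc:.0f} bytes")
```

Under these assumptions, bulk indexing a million documents takes roughly seven hours. The incremental path is meant for a steady flow of new documents rather than for initial loads.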

The graph is stored in a custom format optimized for traversal speed. It supports concurrent reads and writes, so queries aren’t blocked while new documents are being indexed. The entire graph can be serialized and backed up without downtime.

For most organizations, DSM runs comfortably on a single node with 8 CPU cores and 16 GB of RAM. For larger deployments (10M+ documents), we support horizontal sharding across multiple nodes with automatic rebalancing.

DSM is the foundation that makes everything else in StellarBase possible — from accurate AI agents to intelligent search to automated workflows. If you want to see it in action, visit stellarbase.ai/platform or try the interactive demo.