Supported Formats

If it's a file, we read it. Office documents, scanned PDFs, photos, audio, video, structured data, code, and more are all parsed automatically by [StellarOCR](/docs/platform/ocr) and the ingestion pipeline. Sections marked "On the roadmap" describe planned support that is not yet available.

Office documents

| Format | What we extract |
| --- | --- |
| .pdf (born-digital) | Text, layout, tables, figures, embedded fonts |
| .pdf (scanned) | OCR’d text, layout reconstruction, table recognition |
| .docx, .doc | Text, headings, lists, tables, images, comments, tracked changes |
| .pptx, .ppt | Slide text, speaker notes, embedded images |
| .xlsx, .xls | Cells, formulas, sheet structure, charts |
| .odt, .ods, .odp | OpenDocument equivalents |
| .rtf | Rich-text content with formatting |
| .txt, .md, .rst | Plain text and markup, structure preserved |

Images

| Format | What we extract |
| --- | --- |
| .jpeg, .jpg, .png, .webp | OCR’d text, captions, EXIF metadata, visual embeddings |
| .tiff, .heic | Same as above, with multi-page handling for TIFF |
| .svg | Embedded text + structure |
| .bmp, .gif | Frames extracted, OCR per frame |

Audio

🛠 On the roadmap. Audio ingest and transcription are on the roadmap; no firm date yet.

| Format | What we extract |
| --- | --- |
| .mp3, .wav, .m4a, .flac, .ogg, .opus | Transcript with speaker diarization, language detection |
| .amr, .aac | Same as above; typical voice-memo formats |

Audio is transcribed using a local model in self-hosted deployments and via StellarCloud in managed mode. Speaker diarization is reliable for up to ~8 distinct speakers.

Video

🛠 On the roadmap. Video ingest, frame sampling, and per-frame OCR are on the roadmap; no firm date yet.

| Format | What we extract |
| --- | --- |
| .mp4, .mov, .mkv, .webm, .avi | Audio transcript, frame samples (1 frame / 5–30 sec, configurable), OCR on each sampled frame, scene detection |
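
The sampling rate above (one frame every 5–30 seconds) can be pictured with a small sketch. This is an illustration only, not the platform's implementation: the function name `sample_timestamps` and the parameter `interval_s` are hypothetical, and the sketch only computes which timestamps would be sampled.

```python
import math


def sample_timestamps(duration_s: float, interval_s: float = 10.0) -> list:
    """Return the timestamps (in seconds) at which one frame would be sampled.

    Hypothetical helper: the 5-30 second bounds mirror the configurable
    range quoted in the table, nothing more.
    """
    if not 5.0 <= interval_s <= 30.0:
        raise ValueError("interval_s must be between 5 and 30 seconds")
    if duration_s < 0:
        raise ValueError("duration_s must be non-negative")
    # One sample at t=0, then one per interval, staying strictly inside the video.
    count = math.ceil(duration_s / interval_s)
    return [i * interval_s for i in range(count)]


# A 60-second clip sampled every 15 seconds yields frames at 0, 15, 30, and 45 s.
print(sample_timestamps(60, 15))
```

A shorter interval gives denser OCR coverage at the cost of more frames to process.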

Email

🛠 On the roadmap. `.eml` / `.msg` / `.mbox` ingest is on the roadmap; no firm date yet.

| Format | What we extract |
| --- | --- |
| .eml | Headers, body (text + HTML), attachments (recursively parsed), inline images |
| .mbox | Same, exploded per message |
| .msg (Outlook) | Same, plus Outlook-specific properties |
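
To make the `.eml` row concrete, here is a minimal sketch of pulling headers, body, and attachment names out of a raw message with Python's standard `email` package. It is not the StellarBase parser; `summarize_eml` is a hypothetical name, and recursive parsing of the attachments themselves is left out.

```python
from email import message_from_bytes, policy


def summarize_eml(raw: bytes) -> dict:
    """Extract a few headers, the preferred body, and attachment filenames."""
    # policy.default returns EmailMessage objects with the modern API
    # (get_body, iter_attachments).
    msg = message_from_bytes(raw, policy=policy.default)
    # Prefer the plain-text body, fall back to HTML if that is all there is.
    body = msg.get_body(preferencelist=("plain", "html"))
    return {
        "subject": msg["subject"],
        "from": msg["from"],
        "body": body.get_content() if body is not None else "",
        "attachments": [part.get_filename() for part in msg.iter_attachments()],
    }
```

An `.mbox` file could be handled the same way by iterating over its messages with the standard `mailbox` module and passing each one through a function like this.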

Chat exports

| Format | Source |
| --- | --- |
| Slack export ZIP | Channels, threads, files |
| Teams export | Channels, chats, transcripts |
| WhatsApp export | Conversation logs (with consent) |
| Discord export | Servers, channels, threads |

Structured data

| Format | What we extract |
| --- | --- |
| .csv, .tsv | Headers, rows, with type inference |
| .json, .jsonl | Hierarchical structure preserved |
| .xml | Tree structure, attributes, namespaces |
| .yaml, .toml | Configuration-style data |
| .parquet | Columnar data, statistics |
| .feather, .arrow | Same |
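
As an illustration of what "type inference" for `.csv` and `.tsv` can mean, the sketch below picks the narrowest of `int`, `float`, or `str` that fits every non-empty value in a column. The rules and the names `infer_type` and `infer_csv_types` are assumptions made for this example; the platform's actual inference rules are not documented here.

```python
import csv
import io


def infer_type(values):
    """Return 'int', 'float', or 'str' for a column, or 'empty' if it has no values."""
    non_empty = [v for v in values if v.strip() != ""]
    if not non_empty:
        return "empty"
    # Try the narrowest type first; fall through on the first value that fails.
    for name, cast in (("int", int), ("float", float)):
        try:
            for v in non_empty:
                cast(v)
            return name
        except ValueError:
            continue
    return "str"


def infer_csv_types(text, delimiter=","):
    """Map each header to the inferred type of its column."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    header = next(reader)
    columns = [[] for _ in header]
    for row in reader:
        # Ignore cells beyond the header width in ragged rows.
        for i, cell in enumerate(row[: len(header)]):
            columns[i].append(cell)
    return {name: infer_type(col) for name, col in zip(header, columns)}


print(infer_csv_types("id,price,name\n1,2.5,apple\n2,3,pear\n"))
```

Passing `delimiter="\t"` covers `.tsv`. A production system would likely also recognise dates, booleans, and nulls.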

Code & logs

Any plain-text source file (Python, JS/TS, Go, Rust, Java, C/C++, SQL, R, MATLAB, shell, etc.) is parsed with syntax-aware chunking. Comments, docstrings, and function signatures are extracted as separate searchable units.
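
For Python sources, the idea of extracting comments, docstrings, and function signatures as separate units can be sketched with the standard `ast` and `tokenize` modules. This is only an approximation of the behaviour described above: `extract_units` and the shape of the returned dictionaries are hypothetical, and other languages would need their own parsers.

```python
import ast
import io
import tokenize


def extract_units(source: str) -> list:
    """Split Python source into signature, docstring, and comment units."""
    units = []
    # Signatures and docstrings come from the syntax tree.
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            # Positional parameters only, to keep the sketch short.
            args = ", ".join(a.arg for a in node.args.args)
            units.append({"kind": "signature", "text": f"def {node.name}({args})",
                          "line": node.lineno})
            doc = ast.get_docstring(node)
            if doc:
                units.append({"kind": "docstring", "text": doc, "line": node.lineno})
    # Comments are not part of the AST, so read them from the token stream.
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.COMMENT:
            units.append({"kind": "comment", "text": tok.string.lstrip("# "),
                          "line": tok.start[0]})
    return units
```

Each unit keeps its line number so a search hit can be traced back to the original file.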

Log files: standard formats (Apache, nginx, syslog, JSON logs, structured logs) are parsed into events with timestamp, severity, and message fields.
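
A minimal sketch of turning one log line into an event with timestamp, severity, and message fields follows. It assumes a line layout resembling nginx's error log (`YYYY/MM/DD HH:MM:SS [level] message`); the pattern and the name `parse_line` are illustrative, and real ingestion would need one pattern per supported format.

```python
import re
from datetime import datetime

# Assumed layout: "2024/01/15 10:23:45 [error] rest of the line"
LINE_PATTERN = re.compile(
    r"^(?P<ts>\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) "
    r"\[(?P<severity>\w+)\] "
    r"(?P<message>.*)$"
)


def parse_line(line: str):
    """Return an event dict, or None when the line does not match the layout."""
    match = LINE_PATTERN.match(line.strip())
    if match is None:
        return None
    return {
        "timestamp": datetime.strptime(match["ts"], "%Y/%m/%d %H:%M:%S"),
        "severity": match["severity"],
        "message": match["message"],
    }
```

Returning `None` for unmatched lines lets a caller fall back to another format's pattern or keep the line as raw text.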

Domain-specific

| Format | Industry |
| --- | --- |
| DICOM | Medical imaging: pixel data + structured metadata |
| HL7 v2 / FHIR | Healthcare records |
| STEP, IGES, DXF | CAD: text annotations + geometric structure |
| SHP, GeoJSON, KML | Geographic: features + attributes |
| BibTeX, RIS, EndNote XML | Reference-manager exports |
| EDIFACT, X12 | Business EDI |
| SWIFT messages | Financial transactions |

Compressed archives

🛠 On the roadmap. Auto-extraction of `.zip` / `.rar` / `.7z` and recursive processing of contents are on the roadmap; no firm date yet.

Auto-extracted and recursively processed: .zip, .tar, .tar.gz, .tar.bz2, .7z, .rar.
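
Recursive processing of archive contents can be sketched for `.zip` with Python's standard `zipfile` module. This is an assumption-laden illustration, not the platform's extractor: `walk_zip`, the `!/` path separator, and the `max_depth` guard are all hypothetical, and `.rar` / `.7z` would need third-party libraries.

```python
import io
import zipfile


def walk_zip(data: bytes, prefix: str = "", max_depth: int = 5):
    """Yield (path, content) for each file, descending into nested zip archives."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue
            payload = archive.read(info)
            path = f"{prefix}{info.filename}"
            # max_depth bounds the recursion so deeply nested archives cannot loop forever.
            if max_depth > 0 and zipfile.is_zipfile(io.BytesIO(payload)):
                yield from walk_zip(payload, prefix=path + "!/", max_depth=max_depth - 1)
            else:
                yield path, payload
```

Note that `zipfile.is_zipfile` also returns true for zip-based document formats such as `.docx` and `.xlsx`, so a real pipeline would check the file type before deciding to unpack.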

What we don’t (yet) support

  • Encrypted documents: provide the password during ingest, or decrypt before upload
  • DRM-protected formats: by design
  • Proprietary formats without published specs: request via support; many can be added on demand

Quality expectations

Quality of extraction depends on the format:

| Format | Typical quality |
| --- | --- |
| Born-digital PDFs, .docx, .xlsx | Near-perfect |
| High-quality scans (300+ DPI) | 95%+ character accuracy |
| Phone photos of documents | 90–95% with corrections |
| Audio (clear, single speaker) | 95%+ word accuracy |
| Audio (multi-speaker, noisy) | 80–90% |
| Handwriting (clear) | 85–90% |
| Handwriting (cursive) | 60–80% (variable) |

Confidence scores are exposed for downstream filtering. Low-confidence regions can be routed to human review.
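
Downstream filtering on confidence might look like the sketch below. The `Region` shape, the 0.0–1.0 score range, and the 0.85 threshold are assumptions for illustration; the actual field names and scale of the exposed scores are not specified here.

```python
from dataclasses import dataclass


@dataclass
class Region:
    text: str
    confidence: float  # assumed to lie in [0.0, 1.0]


def split_by_confidence(regions, threshold=0.85):
    """Separate regions to accept automatically from those needing human review."""
    accepted = [r for r in regions if r.confidence >= threshold]
    needs_review = [r for r in regions if r.confidence < threshold]
    return accepted, needs_review


regions = [Region("Invoice #1042", 0.98), Region("Tota1 due", 0.61)]
accepted, needs_review = split_by_confidence(regions)
print([r.text for r in needs_review])
```

The threshold is a trade-off: raising it sends more regions to review and lowers the error rate of what is accepted automatically.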
