Supported Formats

If it's a file, we read it. Office documents, scanned PDFs, photos, structured data, code, and more are parsed automatically by StellarOCR and the ingestion pipeline. Sections marked 🛠 (audio, video, email, and others) describe planned support that is not yet available.

Office documents

Format	What we extract
.pdf (born-digital)	Text, layout, tables, figures, embedded fonts
.pdf (scanned)	OCR’d text, layout reconstruction, table recognition
.docx, .doc	Text, headings, lists, tables, images, comments, tracked changes
.pptx, .ppt	Slide text, speaker notes, embedded images (planned for Q3 2026)
.xlsx, .xls	Cells, formulas, sheet structure, charts
.odt, .ods, .odp	OpenDocument equivalents — on the roadmap, no firm date
.rtf	Rich-text content with formatting
.txt, .md, .rst	Plain text and markup, structure preserved

Images

Format	What we extract
.jpeg, .jpg, .png, .webp	OCR’d text, captions, EXIF metadata, visual embeddings
.tiff, .heic	Same as above, with multi-page handling for TIFF
.svg	Embedded text + structure
.bmp, .gif	Frames extracted, OCR per frame

Audio

🛠 On the roadmap. Audio ingest and transcription are on the roadmap; no firm date yet.

Format	What we extract
.mp3, .wav, .m4a, .flac, .ogg, .opus	Transcript with speaker diarization, language detection
.amr, .aac	Same — typical voice memo formats

Audio is transcribed using a local model in self-hosted deployments and via StellarCloud in managed mode. Speaker diarization is reliable for up to about 8 distinct speakers.

Video

🛠 On the roadmap. Video ingest, frame sampling, and per-frame OCR are on the roadmap; no firm date yet.

Format	What we extract
.mp4, .mov, .mkv, .webm, .avi	Audio transcript, frame samples (1 frame / 5–30 sec, configurable), OCR on each sampled frame, scene detection
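
The sampling rate in the table (one frame every 5 to 30 seconds) can be pictured as a list of timestamps. The sketch below is illustrative only: `sample_timestamps` and its `interval_s` parameter are hypothetical names, and the real configuration option is not documented here.

```python
import math


def sample_timestamps(duration_s: float, interval_s: float = 10.0) -> list[float]:
    """Return the timestamps (in seconds) at which a frame would be sampled.

    Hypothetical helper: the 5-30 s bounds mirror the range quoted in the
    table above; they are an assumption about how the setting is validated.
    """
    if duration_s < 0:
        raise ValueError("duration_s must be non-negative")
    if not 5.0 <= interval_s <= 30.0:
        raise ValueError("interval_s must be between 5 and 30 seconds")
    # Multiply by an index rather than accumulating, to avoid float drift.
    count = math.ceil(duration_s / interval_s)
    return [i * interval_s for i in range(count)]
```

Under these assumptions, a 60-second clip sampled every 10 seconds yields six frames, each of which would then go through OCR.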

Email

🛠 On the roadmap. `.eml` / `.msg` / `.mbox` ingest is on the roadmap; no firm date yet.

Format	What we extract
.eml	Headers, body (text + HTML), attachments (recursively parsed), inline images
.mbox	Same, exploded per message
.msg (Outlook)	Same, plus Outlook-specific properties
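
To give a feel for what the `.eml` row covers, here is a minimal sketch that pulls headers, text and HTML bodies, and attachment metadata from a raw message using Python's standard `email` package. It is not the pipeline's implementation; `summarize_eml` and the shape of the returned dictionary are assumptions made for illustration.

```python
from email import message_from_bytes, policy
from email.message import EmailMessage


def summarize_eml(raw: bytes) -> dict:
    """Extract headers, bodies, and attachment metadata from one message."""
    # policy.default makes the parser return the modern EmailMessage API.
    msg: EmailMessage = message_from_bytes(raw, policy=policy.default)
    text_part = msg.get_body(preferencelist=("plain",))
    html_part = msg.get_body(preferencelist=("html",))
    return {
        "headers": {
            name: str(msg[name])
            for name in ("From", "To", "Subject", "Date")
            if msg[name] is not None
        },
        "text": text_part.get_content() if text_part else None,
        "html": html_part.get_content() if html_part else None,
        # Attachments are only listed here; recursive parsing would hand
        # each one back to the format detector.
        "attachments": [
            {"filename": part.get_filename(), "content_type": part.get_content_type()}
            for part in msg.iter_attachments()
        ],
    }
```

An `.mbox` file could be handled the same way by iterating over its messages with the standard `mailbox` module; `.msg` files need a third-party parser.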

Chat exports

🛠 On the roadmap. Chat-export ingestion (Slack / Teams / WhatsApp / Discord export archives) is on the roadmap; no firm date yet.

Format	Source
Slack export ZIP	Channels, threads, files
Teams export	Channels, chats, transcripts
WhatsApp export	Conversation logs (with consent)
Discord export	Servers, channels, threads

Structured data

Format	What we extract
.csv, .tsv	Headers, rows, with type inference
.json, .jsonl	Hierarchical structure preserved
.xml	Tree structure, attributes, namespaces
.yaml, .toml	Configuration-style data
.parquet	Columnar data, statistics — on the roadmap, no firm date
.feather, .arrow	Same — on the roadmap, no firm date
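
Type inference for `.csv` and `.tsv` files can be approximated as "pick the narrowest type that fits every value in a column". The sketch below shows that idea with the standard library; the function names and the type labels (`int`, `float`, `bool`, `string`) are assumptions, not the product's actual schema.

```python
import csv
import io


def infer_type(values: list[str]) -> str:
    """Return the narrowest type label that fits every non-empty value."""
    non_empty = [v.strip() for v in values if v.strip()]
    if not non_empty:
        return "empty"
    for label, caster in (("int", int), ("float", float)):
        try:
            for value in non_empty:
                caster(value)
            return label
        except ValueError:
            continue  # at least one value did not fit; try the next type
    if all(v.lower() in {"true", "false"} for v in non_empty):
        return "bool"
    return "string"


def infer_column_types(text: str, delimiter: str = ",") -> dict[str, str]:
    """Map each header name to its inferred type label."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    rows = list(reader)
    return {
        column: infer_type([row[column] or "" for row in rows])
        for column in reader.fieldnames or []
    }
```

Passing `delimiter="\t"` covers the `.tsv` case.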

Code & logs

Any plain-text source file (Python, JS/TS, Go, Rust, Java, C/C++, SQL, R, MATLAB, shell, etc.) is parsed with syntax-aware chunking. Comments, docstrings, and function signatures are extracted as separate searchable units.
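
For Python sources, the idea of extracting signatures, docstrings, and comments as separate units can be sketched with the standard `ast` and `tokenize` modules. This is an illustration of the technique, not the pipeline's chunker; `extract_units` and `extract_comments` are hypothetical names, and other languages would need their own parsers.

```python
import ast
import io
import tokenize


def extract_units(source: str) -> list[dict]:
    """Return one searchable unit per function definition."""
    units = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            # Only plain positional parameters are listed in this sketch.
            params = ", ".join(arg.arg for arg in node.args.args)
            units.append({
                "kind": "function",
                "signature": f"{node.name}({params})",
                "docstring": ast.get_docstring(node),
                "line": node.lineno,
            })
    return sorted(units, key=lambda unit: unit["line"])


def extract_comments(source: str) -> list[tuple[int, str]]:
    """Return (line number, text) for every comment token."""
    tokens = tokenize.generate_tokens(io.StringIO(source).readline)
    return [
        (tok.start[0], tok.string.lstrip("# ").rstrip())
        for tok in tokens
        if tok.type == tokenize.COMMENT
    ]
```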

Log files: standard formats (Apache, nginx, syslog, JSON logs, structured logs) are parsed into events with timestamp, severity, and message fields.
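
As a rough picture of what "parsed into events" means, the sketch below turns two kinds of lines into records with timestamp, severity, and message fields. The line layout matched by `LINE_RE` and the JSON keys (`timestamp`, `level`, `message`) are assumptions chosen for the example; real Apache, nginx, and syslog formats each need their own pattern.

```python
import json
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class LogEvent:
    timestamp: str
    severity: str
    message: str


# Assumed layout: "<timestamp> <SEVERITY> <free-text message>"
LINE_RE = re.compile(
    r"^(?P<timestamp>\S+)\s+"
    r"(?P<severity>DEBUG|INFO|WARN|WARNING|ERROR|CRITICAL)\s+"
    r"(?P<message>.*)$"
)


def parse_log_line(line: str) -> Optional[LogEvent]:
    """Parse one line into a LogEvent, or return None if it is unrecognized."""
    line = line.strip()
    if line.startswith("{"):
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            return None
        return LogEvent(
            timestamp=str(record.get("timestamp", "")),
            severity=str(record.get("level", "")).upper(),
            message=str(record.get("message", "")),
        )
    match = LINE_RE.match(line)
    return LogEvent(**match.groupdict()) if match else None
```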

Domain-specific

🛠 On the roadmap. Domain-specific format parsing (DICOM, HL7 / FHIR, CAD, geospatial, EDI, SWIFT…) is on the roadmap; no firm date yet.

Format	Industry
DICOM	Medical imaging — pixel data + structured metadata
HL7 v2 / FHIR	Healthcare records
STEP, IGES, DXF	CAD — text annotations + geometric structure
SHP, GeoJSON, KML	Geographic — features + attributes
BibTeX, RIS, EndNote XML	Reference-manager exports
EDIFACT, X12	Business EDI
SWIFT messages	Financial transactions

Compressed archives

🛠 On the roadmap. Auto-extraction of `.zip` / `.rar` / `.7z` and recursive processing of contents are on the roadmap; no firm date yet.

Archive types planned for auto-extraction and recursive processing: .zip, .tar, .tar.gz, .tar.bz2, .7z, .rar.
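
Recursive processing means that an archive inside an archive is opened in turn. A minimal sketch for the `.zip` case, using only the standard library, might look like the following; `walk_zip`, the `!/` path separator, and the `max_depth` guard are all assumptions for illustration. Formats such as `.7z` and `.rar` would need third-party libraries.

```python
import io
import zipfile
from typing import Iterator


def walk_zip(data: bytes, prefix: str = "", max_depth: int = 5) -> Iterator[tuple[str, bytes]]:
    """Yield (path, content) for every file, descending into nested .zip files."""
    if max_depth < 0:
        # Guard against deeply nested or malicious archives.
        raise ValueError("archive nesting exceeds max_depth")
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue
            payload = archive.read(info)
            path = f"{prefix}{info.filename}"
            # Check the extension first: .docx and .xlsx are ZIP containers
            # too, but should be parsed as documents, not unpacked.
            is_nested = info.filename.lower().endswith(".zip") and zipfile.is_zipfile(
                io.BytesIO(payload)
            )
            if is_nested:
                yield from walk_zip(payload, prefix=f"{path}!/", max_depth=max_depth - 1)
            else:
                yield path, payload
```

Each yielded file would then be routed to the parser for its own format.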

What we don’t (yet) support

Encrypted documents — provide the password during ingest, or decrypt before upload
DRM-protected formats — by design
Proprietary formats without published specs — request via support; many can be added on demand

Quality expectations

Extraction quality depends on the format and on the condition of the source:

Format	Typical quality
Born-digital PDFs, .docx, .xlsx	Near-perfect
High-quality scans (300+ DPI)	95%+ character accuracy
Phone photos of documents	90–95% with corrections
Audio (clear, single speaker)	95%+ word accuracy
Audio (multi-speaker, noisy)	80–90%
Handwriting (clear)	85–90%
Handwriting (cursive)	60–80% — variable
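
The table does not say how accuracy is measured. One common convention, assumed here, is character accuracy = 1 − (edit distance ÷ reference length), that is, one minus the character error rate. The sketch below implements that convention so the percentages can be compared against your own ground truth; it is not necessarily the metric behind the figures above.

```python
def edit_distance(reference: str, hypothesis: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (0 if ref_char == hyp_char else 1)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1]


def character_accuracy(reference: str, hypothesis: str) -> float:
    """Return 1 - CER, clamped to 0 when the output is worse than empty."""
    if not reference:
        return 1.0 if not hypothesis else 0.0
    return max(0.0, 1.0 - edit_distance(reference, hypothesis) / len(reference))
```

Word accuracy for transcripts can be computed the same way by splitting both strings into word lists first.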

Confidence scores are exposed for downstream filtering. Low-confidence regions can be routed to human review.
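
Downstream filtering on confidence could look roughly like this. The `Region` type, the `confidence` field name, its 0 to 1 scale, and the 0.9 threshold are all assumptions for the sketch; check the actual output schema before relying on any of them.

```python
from dataclasses import dataclass


@dataclass
class Region:
    text: str
    confidence: float  # assumed to lie in [0, 1]


def split_by_confidence(
    regions: list[Region], threshold: float = 0.9
) -> tuple[list[Region], list[Region]]:
    """Split regions into (accepted, needs_review) around the threshold."""
    accepted = [r for r in regions if r.confidence >= threshold]
    needs_review = [r for r in regions if r.confidence < threshold]
    return accepted, needs_review
```

The second list is what would be queued for human review.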

Related

StellarOCR: the engine behind document parsing
Data Sources: connectors that deliver these files
