Supported Formats
If it's a file, we read it. Office documents, scanned PDFs, photos, audio, video, structured data, code, and more are all parsed automatically by [StellarOCR](/docs/platform/ocr) and the ingestion pipeline. Sections marked "On the roadmap" describe planned support that is not available yet.
Office documents
| Format | What we extract |
|---|---|
| .pdf (born-digital) | Text, layout, tables, figures, embedded fonts |
| .pdf (scanned) | OCR'd text, layout reconstruction, table recognition |
| .docx, .doc | Text, headings, lists, tables, images, comments, tracked changes |
| .pptx, .ppt | Slide text, speaker notes, embedded images |
| .xlsx, .xls | Cells, formulas, sheet structure, charts |
| .odt, .ods, .odp | OpenDocument equivalents |
| .rtf | Rich-text content with formatting |
| .txt, .md, .rst | Plain text and markup, structure preserved |
Images
| Format | What we extract |
|---|---|
| .jpeg, .jpg, .png, .webp | OCR'd text, captions, EXIF metadata, visual embeddings |
| .tiff, .heic | Same as above, with multi-page handling for TIFF |
| .svg | Embedded text + structure |
| .bmp, .gif | Frames extracted, OCR per frame |
Audio
On the roadmap. Audio ingest and transcription are on the roadmap; no firm date yet.
| Format | What we extract |
|---|---|
| .mp3, .wav, .m4a, .flac, .ogg, .opus | Transcript with speaker diarization, language detection |
| .amr, .aac | Same as above (typical voice-memo formats) |
Audio is transcribed using a local model in self-hosted deployments and via StellarCloud in managed mode. Speaker diarization works for up to ~8 distinct speakers reliably.
Video
On the roadmap. Video ingest, frame sampling, and per-frame OCR are on the roadmap; no firm date yet.
| Format | What we extract |
|---|---|
| .mp4, .mov, .mkv, .webm, .avi | Audio transcript, frame samples (1 frame / 5–30 sec, configurable), OCR on each sampled frame, scene detection |
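The frame-sampling rate in the table can be pictured as a list of timestamps at a fixed interval. The sketch below is illustrative only: `sample_timestamps` and its parameters are hypothetical names, and treating 5 to 30 seconds as hard bounds on the interval is an assumption drawn from the table, not a documented setting.

```python
def sample_timestamps(duration_s: float, interval_s: float = 10.0) -> list[float]:
    """Return the timestamps (in seconds) at which one frame would be sampled."""
    # Assumption: the configurable interval is limited to the 5-30 second range.
    if not 5.0 <= interval_s <= 30.0:
        raise ValueError("interval_s must be between 5 and 30 seconds")
    if duration_s < 0:
        raise ValueError("duration_s must be non-negative")
    # One sample at t=0, then one every interval_s until the video ends.
    count = int(duration_s // interval_s) + 1
    return [i * interval_s for i in range(count)]
```

A shorter interval yields more frames to OCR, so the setting trades extraction coverage against processing cost.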
Email
On the roadmap. `.eml` / `.msg` / `.mbox` ingest is on the roadmap; no firm date yet.
| Format | What we extract |
|---|---|
| .eml | Headers, body (text + HTML), attachments (recursively parsed), inline images |
| .mbox | Same, exploded per message |
| .msg (Outlook) | Same, plus Outlook-specific properties |
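The fields listed for `.eml` correspond closely to what Python's standard `email` package exposes. The following is a minimal sketch of that breakdown, not the ingestion pipeline's implementation; `summarize_message` and `build_sample` are hypothetical helpers, and the attachment is only returned as raw bytes rather than parsed recursively.

```python
from email import message_from_bytes, policy
from email.message import EmailMessage


def summarize_message(raw: bytes) -> dict:
    """Split one raw .eml message into headers, bodies, and attachments."""
    msg = message_from_bytes(raw, policy=policy.default)
    text_part = msg.get_body(preferencelist=("plain",))
    html_part = msg.get_body(preferencelist=("html",))
    return {
        "headers": {
            "from": str(msg["From"]),
            "to": str(msg["To"]),
            "subject": str(msg["Subject"]),
        },
        "text": text_part.get_content() if text_part else None,
        "html": html_part.get_content() if html_part else None,
        # Each attachment could be fed back into the parser for its own format.
        "attachments": [
            {"filename": part.get_filename(), "payload": part.get_content()}
            for part in msg.iter_attachments()
        ],
    }


def build_sample() -> bytes:
    """Build a small message with one binary attachment for demonstration."""
    sample = EmailMessage()
    sample["From"] = "a@example.com"
    sample["To"] = "b@example.com"
    sample["Subject"] = "Quarterly report"
    sample.set_content("See attached.")
    sample.add_attachment(
        b"\x00\x01",
        maintype="application",
        subtype="octet-stream",
        filename="report.bin",
    )
    return sample.as_bytes()
```

For `.mbox`, the standard `mailbox.mbox` class iterates over messages one at a time, which matches the "exploded per message" behaviour described in the table.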
Chat exports
| Format | Source |
|---|---|
| Slack export ZIP | Channels, threads, files |
| Teams export | Channels, chats, transcripts |
| WhatsApp export | Conversation logs (with consent) |
| Discord export | Servers, channels, threads |
Structured data
| Format | What we extract |
|---|---|
| .csv, .tsv | Headers, rows, with type inference |
| .json, .jsonl | Hierarchical structure preserved |
| .xml | Tree structure, attributes, namespaces |
| .yaml, .toml | Configuration-style data |
| .parquet | Columnar data, statistics |
| .feather, .arrow | Same |
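Type inference for `.csv` and `.tsv` can be pictured as trying progressively looser types per column. This is a rough sketch under assumptions: the type names (`int`, `float`, `bool`, `string`) and the functions `infer_type` and `infer_schema` are illustrative and do not describe the actual inference rules used during ingest.

```python
import csv
import io


def infer_type(values: list[str]) -> str:
    """Return the narrowest type that fits every non-empty value in a column."""
    non_empty = [v.strip() for v in values if v.strip()]
    if not non_empty:
        return "empty"
    for name, caster in (("int", int), ("float", float)):
        try:
            for value in non_empty:
                caster(value)
            return name
        except ValueError:
            continue  # at least one value did not fit; try the next type
    if all(value.lower() in {"true", "false"} for value in non_empty):
        return "bool"
    return "string"


def infer_schema(text: str, delimiter: str = ",") -> dict[str, str]:
    """Map each header to an inferred type; pass delimiter='\\t' for TSV."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    rows = list(reader)
    return {
        name: infer_type([row[name] or "" for row in rows])
        for name in reader.fieldnames or []
    }
```

Empty cells are ignored when deciding a column's type, so a mostly numeric column with a few blanks still comes out numeric.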
Code & logs
Any plain-text source file (Python, JS/TS, Go, Rust, Java, C/C++, SQL, R, MATLAB, shell, etc.) is parsed with syntax-aware chunking. Comments, docstrings, and function signatures are extracted as separate searchable units.
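To illustrate what "separate searchable units" could look like, here is a small sketch for Python sources only, built on the standard `ast` and `tokenize` modules. It is an assumption about the general approach rather than the pipeline's own chunker, and `extract_units` and `extract_comments` are hypothetical names.

```python
import ast
import io
import tokenize


def extract_units(source: str) -> list[dict]:
    """Return one record per function or class: signature, docstring, line."""
    units = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            signature = f"{node.name}({ast.unparse(node.args)})"
        elif isinstance(node, ast.ClassDef):
            signature = node.name
        else:
            continue
        units.append(
            {
                "kind": type(node).__name__,
                "signature": signature,
                "docstring": ast.get_docstring(node),
                "lineno": node.lineno,
            }
        )
    return units


def extract_comments(source: str) -> list[str]:
    """Collect comments, which the AST discards, from the token stream."""
    tokens = tokenize.generate_tokens(io.StringIO(source).readline)
    return [tok.string for tok in tokens if tok.type == tokenize.COMMENT]


SAMPLE = '''
# Utility helpers.
def add(a, b=2):
    """Return the sum of a and b."""
    return a + b
'''
```

Chunking on function and class boundaries keeps each unit self-contained, which tends to work better for search than splitting on a fixed character count.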
Log files: standard formats (Apache, nginx, syslog, JSON logs, structured logs) are parsed into events with timestamp, severity, and message fields.
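As a rough picture of the event shape described above, the sketch below normalizes two kinds of lines into `timestamp`, `severity`, and `message`. The regular expression targets the common nginx error-log layout, and the JSON key names (`timestamp`, `level`, `message`) are assumptions; real logs vary, and `parse_line` is a hypothetical helper.

```python
import json
import re
from datetime import datetime
from typing import Optional

# Matches lines such as: 2024/05/01 12:00:00 [error] 1234#0: *1 open() failed
NGINX_ERROR = re.compile(
    r"^(?P<ts>\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) "
    r"\[(?P<severity>\w+)\] (?P<message>.*)$"
)


def parse_line(line: str) -> Optional[dict]:
    """Turn one log line into an event dict, or None if it is unrecognized."""
    line = line.strip()
    if line.startswith("{"):
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            return None
        # Assumed key names; JSON loggers differ in what they call these fields.
        return {
            "timestamp": record.get("timestamp"),
            "severity": record.get("level"),
            "message": record.get("message"),
        }
    match = NGINX_ERROR.match(line)
    if match:
        ts = datetime.strptime(match["ts"], "%Y/%m/%d %H:%M:%S")
        return {
            "timestamp": ts.isoformat(),
            "severity": match["severity"],
            "message": match["message"],
        }
    return None
```

Returning `None` for unrecognized lines lets a caller keep them as plain text instead of dropping them.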
Domain-specific
| Format | Industry |
|---|---|
| DICOM | Medical imaging: pixel data + structured metadata |
| HL7 v2 / FHIR | Healthcare records |
| STEP, IGES, DXF | CAD: text annotations + geometric structure |
| SHP, GeoJSON, KML | Geographic: features + attributes |
| BibTeX, RIS, EndNote XML | Reference-manager exports |
| EDIFACT, X12 | Business EDI |
| SWIFT messages | Financial transactions |
Compressed archives
On the roadmap. Auto-extraction of `.zip` / `.rar` / `.7z` and recursive processing of contents are on the roadmap; no firm date yet.
Auto-extracted and recursively processed: .zip, .tar, .tar.gz, .tar.bz2, .7z, .rar.
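Recursive processing means an archive inside an archive is opened in turn until only regular files remain. The sketch below shows the idea for ZIP files only, using the standard `zipfile` module; `walk_archive`, `make_zip`, the `!/` path separator, and the depth limit are all illustrative assumptions rather than documented behaviour.

```python
import io
import zipfile


def walk_archive(data: bytes, prefix: str = "", max_depth: int = 5):
    """Yield (path, bytes) for every regular file, descending into nested ZIPs."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue
            payload = archive.read(info)
            path = f"{prefix}{info.filename}"
            # The depth limit guards against deeply nested or malicious archives.
            if max_depth > 0 and zipfile.is_zipfile(io.BytesIO(payload)):
                yield from walk_archive(payload, f"{path}!/", max_depth - 1)
            else:
                yield path, payload


def make_zip(files: dict[str, bytes]) -> bytes:
    """Build an in-memory ZIP, used here only to demonstrate walk_archive."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for name, content in files.items():
            archive.writestr(name, content)
    return buffer.getvalue()
```

Note that `.docx`, `.xlsx`, and `.pptx` files are themselves ZIP containers, so a real pipeline would need to check the file type before deciding to descend. The `.tar` variants would be handled with the standard `tarfile` module in the same way.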
What we don't (yet) support
- Encrypted documents: provide the password during ingest, or decrypt before upload
- DRM-protected formats: by design
- Proprietary formats without published specs: request via support; many can be added on demand
Quality expectations
Extraction quality depends on the format and on the quality of the source material:
| Format | Typical quality |
|---|---|
| Born-digital PDFs, .docx, .xlsx | Near-perfect |
| High-quality scans (300+ DPI) | 95%+ character accuracy |
| Phone photos of documents | 90–95% with corrections |
| Audio (clear, single speaker) | 95%+ word accuracy |
| Audio (multi-speaker, noisy) | 80–90% |
| Handwriting (clear) | 85–90% |
| Handwriting (cursive) | 60–80%, variable |
Confidence scores are exposed for downstream filtering. Low-confidence regions can be routed to human review.
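A downstream filter on confidence might look like the sketch below. The `Region` shape, the 0 to 1 score range, and the 0.85 threshold are all assumptions made for illustration; the actual response schema is not described on this page.

```python
from dataclasses import dataclass


@dataclass
class Region:
    text: str
    confidence: float  # assumed to lie in [0.0, 1.0]


def route(regions: list[Region], threshold: float = 0.85):
    """Split regions into those accepted as-is and those sent to human review."""
    accepted = [r for r in regions if r.confidence >= threshold]
    review = [r for r in regions if r.confidence < threshold]
    return accepted, review


accepted, review = route(
    [Region("Invoice 1042", 0.98), Region("Tota1 due", 0.61)],
    threshold=0.85,
)
```

A sensible threshold depends on the source: cursive handwriting will fall below a cut-off tuned for born-digital PDFs far more often.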
Related
- StellarOCR: the engine behind document parsing
- Data Sources: connectors that deliver these files
