Document Parsing
PDF and image parsing with the parse-once pattern and the pluggable parser registry.
When a user attaches a file to a chat, Orchid extracts its text exactly once. That extracted text feeds both the LLM's immediate context and the RAG index — no second parse, no drift between the two.
The parse-once pattern
from orchid_ai.documents.pipeline import extract_text, ingest_document
# Step 1: extract text exactly once
text = await extract_text(file_bytes=raw, filename="report.pdf", vision_model="ollama/minicpm-v")
# Step 2a: prepend to the user query so the LLM sees the content immediately
augmented_query = f"Document content:\n{text}\n\nUser query: {user_query}"
# Step 2b: ingest into RAG — reuse the extracted text, no second parse
await ingest_document(
    file_bytes=raw,
    filename="report.pdf",
    scope=scope,
    writer=writer,
    pre_extracted_text=text,  # critical: avoids parsing twice
)
The pre_extracted_text parameter short-circuits the parse step inside ingest_document. This matters most for images: the vision model call is slow and non-deterministic, so a second call could return different text and leave the RAG index out of sync with what the LLM saw.
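To make the short-circuit concrete, here is a minimal, self-contained sketch. It does not use Orchid's real pipeline: CountingParser, ingest, and demo are hypothetical stand-ins, and the only behavior taken from the text above is that supplying pre_extracted_text skips the parse step.

```python
import asyncio
from typing import Optional, Tuple


class CountingParser:
    """Hypothetical stand-in for a slow, non-deterministic vision parser."""

    def __init__(self) -> None:
        self.calls = 0

    async def parse(self, file_bytes: bytes, filename: str) -> str:
        self.calls += 1
        # The call number is embedded to mimic output that differs per call.
        return f"extracted text (call {self.calls}) from {filename}"


async def ingest(
    parser: CountingParser,
    file_bytes: bytes,
    filename: str,
    pre_extracted_text: Optional[str] = None,
) -> str:
    """Return the text that would be indexed; parse only if none was supplied."""
    if pre_extracted_text is not None:
        return pre_extracted_text
    return await parser.parse(file_bytes, filename)


async def demo(reuse: bool) -> Tuple[str, str, int]:
    parser = CountingParser()
    seen_by_llm = await parser.parse(b"raw-bytes", "photo.png")
    indexed = await ingest(
        parser,
        b"raw-bytes",
        "photo.png",
        pre_extracted_text=seen_by_llm if reuse else None,
    )
    return seen_by_llm, indexed, parser.calls
```

Calling asyncio.run(demo(reuse=True)) yields identical text for the LLM and the index after a single parse, while reuse=False parses twice and the two texts diverge.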
The parser registry
Parsers are registered by file extension in PARSER_REGISTRY. Each parser implements async parse(file_bytes, filename) -> str.
| Extension | Parser | Notes |
|---|---|---|
| .pdf | PDFParser | PyMuPDF (fitz); per-page text extraction |
| .docx | DOCXParser | python-docx; paragraph extraction |
| .xlsx | XLSXParser | openpyxl; pipe-delimited rows per sheet |
| .csv | CSVParser | stdlib csv |
| .md, .txt | TextParser | Passthrough |
| .png, .jpg, .jpeg | ImageParser | LiteLLM vision model |
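Lookup by extension could work roughly as sketched below. This is an illustration, not Orchid's actual dispatch code: REGISTRY, resolve_parser, and parse_file are hypothetical names, and lower-casing the suffix and raising ValueError for unknown extensions are assumptions.

```python
import asyncio
from pathlib import Path
from typing import Dict


class TextParser:
    """Passthrough parser: decodes bytes as UTF-8."""

    async def parse(self, file_bytes: bytes, filename: str) -> str:
        return file_bytes.decode("utf-8")


# Hypothetical miniature registry mapping extension -> parser class.
REGISTRY: Dict[str, type] = {".md": TextParser, ".txt": TextParser}


def resolve_parser(filename: str):
    """Instantiate the parser registered for the file's extension."""
    ext = Path(filename).suffix.lower()  # assumption: case-insensitive match
    try:
        return REGISTRY[ext]()
    except KeyError:
        raise ValueError(f"no parser registered for {ext!r}") from None


async def parse_file(file_bytes: bytes, filename: str) -> str:
    parser = resolve_parser(filename)
    return await parser.parse(file_bytes, filename)
```

For example, asyncio.run(parse_file(b"hello", "NOTES.TXT")) resolves to TextParser and returns the decoded string, while an unregistered extension such as .zip raises ValueError.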
Adding a new format
from orchid_ai.documents.parsers import DocumentParser, PARSER_REGISTRY
class MarkdownParser(DocumentParser):
    async def parse(self, file_bytes: bytes, filename: str) -> str:
        return file_bytes.decode("utf-8")
# Assigning to an extension that is already registered replaces the existing entry
PARSER_REGISTRY[".md"] = MarkdownParser
For consumer projects, call register_parser(ext, cls) at startup rather than mutating PARSER_REGISTRY directly. This follows the Open/Closed Principle and keeps the framework's built-in defaults intact.
Chunking
After extraction, ingest_document splits text into overlapping chunks via RecursiveCharacterTextSplitter (defaults: chunk_size=1000, chunk_overlap=200, split on \n\n first). Each chunk is embedded and stored in the vector store with scope metadata — tenant_id, user_id, chat_id, and scope="chat_shared" — so it is visible to all agents in the same chat.
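The sketch below shows how overlap and scope metadata fit together. It is deliberately simplified: it uses a fixed sliding window instead of RecursiveCharacterTextSplitter, so it ignores the \n\n separator preference, and chunk_text, build_records, and the chunk_index field are hypothetical. Only the default sizes and the four metadata fields come from the text above.

```python
from typing import Dict, List


def chunk_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> List[str]:
    """Split text into fixed-size windows that share chunk_overlap characters."""
    if chunk_size <= 0 or not 0 <= chunk_overlap < chunk_size:
        raise ValueError("require chunk_size > 0 and 0 <= chunk_overlap < chunk_size")
    step = chunk_size - chunk_overlap
    chunks: List[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


def build_records(text: str, tenant_id: str, user_id: str, chat_id: str) -> List[Dict]:
    """Pair each chunk with the scope metadata stored alongside its embedding."""
    return [
        {
            "text": chunk,
            "metadata": {
                "tenant_id": tenant_id,
                "user_id": user_id,
                "chat_id": chat_id,
                "scope": "chat_shared",
                "chunk_index": index,  # hypothetical field, for illustration
            },
        }
        for index, chunk in enumerate(chunk_text(text))
    ]
```

With the defaults, a 2,500-character document produces three chunks starting at offsets 0, 800, and 1600, and each chunk repeats the last 200 characters of the one before it.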