Document Parsing

PDF and image parsing with the parse-once pattern and the pluggable parser registry.

When a user attaches a file to a chat, Orchid extracts its text exactly once. That extracted text feeds both the LLM's immediate context and the RAG index — no second parse, no drift between the two.

The parse-once pattern

from orchid_ai.documents.pipeline import extract_text, ingest_document

# Runs inside an async function; raw, user_query, scope, and writer come from the caller.
# Step 1: extract text exactly once
text = await extract_text(
    file_bytes=raw,
    filename="report.pdf",
    vision_model="ollama/minicpm-v",
)

# Step 2a: prepend to the user query so the LLM sees the content immediately
augmented_query = f"Document content:\n{text}\n\nUser query: {user_query}"

# Step 2b: ingest into RAG — reuse the extracted text, no second parse
await ingest_document(
    file_bytes=raw,
    filename="report.pdf",
    scope=scope,
    writer=writer,
    pre_extracted_text=text,   # ← critical: avoids parsing twice
)

The pre_extracted_text parameter short-circuits the parse step inside ingest_document. This matters most for images: the vision model call is slow and non-deterministic, so a second call could return different text and leave the RAG index out of sync with what the LLM saw.
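The short-circuit can be pictured with a minimal sketch. The names below (CountingParser, ingest) are hypothetical stand-ins, not Orchid's actual implementation; the only behavior taken from the text above is that a supplied pre_extracted_text skips the parse step.

```python
import asyncio
from typing import Optional


class CountingParser:
    """Hypothetical stand-in for a slow parser; counts how often it runs."""

    def __init__(self) -> None:
        self.calls = 0

    async def parse(self, file_bytes: bytes, filename: str) -> str:
        self.calls += 1
        return file_bytes.decode("utf-8")


async def ingest(
    parser: CountingParser,
    file_bytes: bytes,
    filename: str,
    pre_extracted_text: Optional[str] = None,
) -> str:
    """Hypothetical ingest step: parse only when no text was supplied."""
    if pre_extracted_text is not None:
        return pre_extracted_text  # reuse the caller's text, skip the parser
    return await parser.parse(file_bytes, filename)


async def demo() -> int:
    parser = CountingParser()
    raw = b"quarterly report"
    text = await parser.parse(raw, "report.txt")  # the single parse
    indexed = await ingest(parser, raw, "report.txt", pre_extracted_text=text)
    assert indexed == text  # LLM context and index see identical text
    return parser.calls


parse_calls = asyncio.run(demo())
```

In this sketch the parser runs once even though the text is consumed twice, which is the property the parse-once pattern is meant to guarantee.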

The parser registry

Parsers are registered by file extension in PARSER_REGISTRY. Each parser implements async parse(file_bytes, filename) -> str.

| Extension | Parser | Notes |
|---|---|---|
| .pdf | PDFParser | PyMuPDF (fitz): per-page text extraction |
| .docx | DOCXParser | python-docx: paragraph extraction |
| .xlsx | XLSXParser | openpyxl: pipe-delimited rows per sheet |
| .csv | CSVParser | stdlib csv |
| .md, .txt | TextParser | Passthrough |
| .png, .jpg, .jpeg | ImageParser | LiteLLM vision model |
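As a rough illustration of extension-based dispatch, the sketch below resolves a parser class from a filename. The registry contents, the resolve_parser helper, the case-insensitive matching, and the error raised for unknown extensions are all assumptions made for the example; the source only states that parsers are registered by file extension.

```python
import os
from typing import Dict, Type


class DocumentParser:
    """Hypothetical base class mirroring the parse(file_bytes, filename) contract."""

    async def parse(self, file_bytes: bytes, filename: str) -> str:
        raise NotImplementedError


class TextParser(DocumentParser):
    """Passthrough parser: decode the bytes and return them unchanged."""

    async def parse(self, file_bytes: bytes, filename: str) -> str:
        return file_bytes.decode("utf-8")


# Illustrative registry with two entries; not the framework's PARSER_REGISTRY.
REGISTRY: Dict[str, Type[DocumentParser]] = {
    ".md": TextParser,
    ".txt": TextParser,
}


def resolve_parser(filename: str) -> Type[DocumentParser]:
    """Pick a parser class by lowercased file extension (assumed lookup rule)."""
    ext = os.path.splitext(filename)[1].lower()
    try:
        return REGISTRY[ext]
    except KeyError:
        raise ValueError(f"No parser registered for {ext!r}") from None
```

Lowercasing the extension before the lookup means NOTES.TXT and notes.txt resolve to the same parser; whether Orchid normalizes case this way is not stated in the source.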

Adding a new format

from orchid_ai.documents.parsers import DocumentParser, PARSER_REGISTRY

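class MarkdownParser(DocumentParser):
    """Return Markdown source unchanged, decoded as UTF-8."""

    async def parse(self, file_bytes: bytes, filename: str) -> str:
        return file_bytes.decode("utf-8")

# Registers the class itself, not an instance. Note that .md already maps to
# the built-in TextParser, so this assignment replaces that default.
PARSER_REGISTRY[".md"] = MarkdownParser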

For consumer projects, call register_parser(ext, cls) at startup rather than mutating PARSER_REGISTRY directly — this follows the Open/Closed Principle and keeps the framework's built-in defaults intact.
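The source names register_parser(ext, cls) but does not show its body or import path, so the following is only a guess at what such a helper might do. The validation, the extension normalization, the local REGISTRY dict, and the LogParser class are all hypothetical.

```python
from typing import Dict, Type


class DocumentParser:
    """Hypothetical base class; stands in for orchid_ai's DocumentParser."""

    async def parse(self, file_bytes: bytes, filename: str) -> str:
        raise NotImplementedError


class LogParser(DocumentParser):
    """Example consumer-side parser for plain-text .log files."""

    async def parse(self, file_bytes: bytes, filename: str) -> str:
        return file_bytes.decode("utf-8", errors="replace")


# Local stand-in for the framework registry.
REGISTRY: Dict[str, Type[DocumentParser]] = {}


def register_parser(ext: str, cls: Type[DocumentParser]) -> None:
    """Assumed behavior: validate the class, normalize the extension, store it."""
    if not (isinstance(cls, type) and issubclass(cls, DocumentParser)):
        raise TypeError("cls must be a DocumentParser subclass")
    normalized = ext.lower() if ext.startswith(".") else f".{ext.lower()}"
    REGISTRY[normalized] = cls


register_parser("LOG", LogParser)  # call once at application startup
```

Routing registration through one function gives the framework a single place to validate input, which direct dictionary mutation cannot offer.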

Chunking

After extraction, ingest_document splits text into overlapping chunks via RecursiveCharacterTextSplitter (defaults: chunk_size=1000, chunk_overlap=200, split on \n\n first). Each chunk is embedded and stored in the vector store with scope metadata — tenant_id, user_id, chat_id, and scope="chat_shared" — so it is visible to all agents in the same chat.
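To make the overlap arithmetic concrete, here is a simplified stand-in: a fixed-size sliding window, not the separator-aware recursive splitting that RecursiveCharacterTextSplitter performs. The chunk_text and with_scope helpers and the record layout are hypothetical; only the default sizes and the metadata keys come from the paragraph above.

```python
from typing import Dict, List


def chunk_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> List[str]:
    """Fixed-size sliding window; each chunk starts chunk_size - chunk_overlap later."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    chunks: List[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


def with_scope(
    chunks: List[str], tenant_id: str, user_id: str, chat_id: str
) -> List[Dict[str, object]]:
    """Attach the scope metadata named in the text to every chunk (layout assumed)."""
    return [
        {
            "text": chunk,
            "metadata": {
                "tenant_id": tenant_id,
                "user_id": user_id,
                "chat_id": chat_id,
                "scope": "chat_shared",
            },
        }
        for chunk in chunks
    ]
```

With the defaults, consecutive windows start 800 characters apart, so a 2,500-character document yields three chunks and each neighbouring pair shares 200 characters.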