Hierarchical RAG

OrchidRAGScope, the 5-level hierarchy, retrieval strategies, ingestion pipelines, and the parse-once pattern.


Orchid's RAG subsystem is built around a five-level scope hierarchy and a pluggable strategy architecture. Every vector read and write carries an OrchidRAGScope — a frozen dataclass that encodes exactly which slice of the vector store a given operation is allowed to see or write to. Retrieval behaviour is controlled by the rag.retrieval.strategy field in agents.yaml, with five built-in strategies and an extension point for custom ones.

The scope hierarchy

__shared__                    ← visible to ALL tenants (root common data)
└── tenant_id
    └── user_id
        ├── scope="user"      ← visible across all of this user's chats
        └── chat_id
            ├── scope="chat_shared"   ← all agents in this chat
            └── scope="chat_agent"    ← private to one agent
                + agent_id

A retrieval query for a given scope sees every level above it: shared data, tenant data, user data, and chat data. This makes tenant-wide knowledge bases, user-uploaded files, and dynamically injected tool results all accessible through a single retrieval call — no manual filter logic required.
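As a rough illustration of the "sees every level above it" rule, the sketch below expands a scope into the list of levels a query could read. `ScopeSketch` and `visible_levels` are hypothetical stand-ins written for this page, not Orchid's implementation, and the `"tenant"` label is an assumption (the diagram only names `user`, `chat_shared`, and `chat_agent`).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScopeSketch:
    """Hypothetical stand-in for OrchidRAGScope; fields mirror the diagram."""
    tenant_id: str
    user_id: str = ""
    chat_id: str = ""
    agent_id: str = ""


def visible_levels(scope: ScopeSketch) -> list[dict]:
    """Return the hierarchy levels readable from this scope, root first."""
    levels = [
        {"tenant_id": "__shared__"},
        {"tenant_id": scope.tenant_id, "scope": "tenant"},  # label is an assumption
    ]
    if scope.user_id:
        base = {"tenant_id": scope.tenant_id, "user_id": scope.user_id}
        levels.append({**base, "scope": "user"})
        if scope.chat_id:
            chat = {**base, "chat_id": scope.chat_id}
            levels.append({**chat, "scope": "chat_shared"})
            if scope.agent_id:
                levels.append({**chat, "scope": "chat_agent", "agent_id": scope.agent_id})
    return levels
```

A scope with only `tenant_id` set yields two levels (shared and tenant), while a fully populated scope yields all five.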

OrchidRAGScope

from orchid_ai.rag.scopes import OrchidRAGScope

scope = OrchidRAGScope(
    tenant_id=auth.tenant_key,   # required
    user_id=auth.user_id,        # optional — omit for tenant-wide queries
    chat_id=state.get("chat_id", ""),
    agent_id=self.name,
)

docs = await self.fetch_rag_context(query, scope, namespace="learning", k=5)

OrchidRAGScope is a frozen dataclass living in orchid_ai/core/scopes.py. Because it has zero external dependencies, it can be constructed anywhere — agents, indexers, tests — without importing backend-specific code.

Retrieval strategies

Orchid ships five retrieval strategies, each implementing the OrchidRetrievalStrategy ABC. Strategies are stateless and selected per-agent via rag.retrieval.strategy. Unknown strategy names fall back to simple with a warning — a typo never crashes the agent.
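The fallback behaviour can be pictured as a small name-to-class registry. This is a minimal sketch under assumptions: `register_sketch`, `resolve_sketch`, and `SimpleSketch` are hypothetical names invented here, and Orchid's real registry may be structured differently.

```python
import logging

logger = logging.getLogger("strategy_registry_sketch")

_REGISTRY: dict[str, type] = {}


class SimpleSketch:
    """Placeholder for the default 'simple' strategy."""


def register_sketch(name: str, cls: type) -> None:
    """Make a strategy class selectable by name."""
    _REGISTRY[name] = cls


def resolve_sketch(name: str) -> object:
    """Instantiate the named strategy, falling back to 'simple' on a miss."""
    cls = _REGISTRY.get(name)
    if cls is None:
        logger.warning("Unknown retrieval strategy %r; falling back to 'simple'", name)
        cls = _REGISTRY["simple"]
    return cls()


register_sketch("simple", SimpleSketch)
```

With this shape, a misspelled name such as `"simpel"` resolves to the default strategy and emits a warning instead of raising.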

simple — single dense retrieval

The default. One reader.retrieve() call with the user query as-is. Fastest path, best for well-formed queries where the corpus contains a near-verbatim match.

rag:
  retrieval:
    strategy: simple

When to use: Baseline retrieval, low-latency requirements, queries that closely match corpus vocabulary.

multi_query — fan-out with paraphrases

The LLM generates N paraphrases of the original query (default 3), retrieves against each independently in parallel, then merges results by score with deduplication. Improves recall for ambiguous, informal, or multi-intent questions.

rag:
  retrieval:
    strategy: multi_query
    # multi_query:
    #   num_queries: 3      # default
    #   retrieval_timeout: 30.0

Without a chat_model the strategy degrades gracefully to single-query retrieval. External pre_strategy=False transformers compose orthogonally — pair with query_transformers: [hyde, decompose] for multi-query × HyDE fan-out.

When to use: Ambiguous queries, broad recall needs, user questions that may not match corpus terminology.
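The fan-out and merge step might look roughly like the following. It is a sketch, not the strategy's source: `fan_out` and the `(doc_id, score)` hit shape are assumptions, the paraphrases are taken as already generated, and applying the timeout to the whole batch is a guess at how `retrieval_timeout` behaves.

```python
import asyncio
from typing import Awaitable, Callable

Hit = tuple[str, float]  # (document id, similarity score)
Retriever = Callable[[str], Awaitable[list[Hit]]]


async def fan_out(
    queries: list[str], retrieve: Retriever, k: int, timeout: float = 30.0
) -> list[Hit]:
    """Retrieve for every query in parallel, then keep the best score per document."""
    batches = await asyncio.wait_for(
        asyncio.gather(*(retrieve(q) for q in queries)), timeout=timeout
    )
    best: dict[str, float] = {}
    for batch in batches:
        for doc_id, score in batch:
            # Deduplicate: a document found by several queries keeps its top score.
            if score > best.get(doc_id, float("-inf")):
                best[doc_id] = score
    ranked = sorted(best.items(), key=lambda item: item[1], reverse=True)
    return ranked[:k]
```

The same merge rule (highest score per document ID wins) is what the hyde strategy's description calls for, so a helper like this could serve both.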

hyde — hypothetical document embeddings

Generates N plausible answer paragraphs via the LLM, then retrieves documents similar to those hypothetical answers in addition to the original query. Results are merged by document ID (highest score wins). Effective when user vocabulary differs significantly from corpus vocabulary.

rag:
  retrieval:
    strategy: hyde
    hyde:
      n_hypothetical: 2     # default: 1

Like multi_query, HyDE composes with external transformers and degrades to dense-only when no chat model is available.

When to use: Domain-specific corpora where users ask questions in lay terms, conceptual queries where the answer structure matters more than keyword overlap.

hybrid — dense + sparse fusion

Issues two parallel retrievals — one against the dense embedding lane and one against the sparse/lexical lane (BM25 by default, SPLADE behind an optional extra) — then fuses rankings via Reciprocal Rank Fusion (RRF, default) or weighted-linear fusion. Each lane fetches k × lane_multiplier candidates so the fusion has enough headroom to surface dense-only or sparse-only matches.

rag:
  retrieval:
    strategy: hybrid
    hybrid:
      sparse_encoder: bm25       # or splade (optional extra)
      sparse_weight: 0.4         # for linear fusion
      fusion: rrf                # or linear
      rrf_k: 60                  # RRF smoothing constant

When the backend lacks sparse support, the strategy logs a warning and degrades to dense-only. Sparse encoder failures also fall back to the dense lane.

When to use: Mixed corpora with both semantic content (best for dense) and exact-match terms like product codes, IDs, or technical jargon (best for sparse).
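Reciprocal Rank Fusion scores each document as the sum of 1 / (rrf_k + rank) over the lanes that returned it, using 1-based ranks. The sketch below shows that formula on two ranked ID lists; `rrf_fuse` is a hypothetical helper, and the alphabetical tie-break is an assumption added only to make the ordering deterministic.

```python
def rrf_fuse(dense: list[str], sparse: list[str], k: int, rrf_k: int = 60) -> list[str]:
    """Fuse two ranked lists of document IDs with Reciprocal Rank Fusion."""
    scores: dict[str, float] = {}
    for ranking in (dense, sparse):
        for rank, doc_id in enumerate(ranking, start=1):
            # Documents present in both lanes accumulate two contributions.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (rrf_k + rank)
    # Highest fused score first; ties broken by ID (assumption).
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))[:k]
```

Because only ranks enter the formula, the dense and sparse lanes do not need comparable score scales, which is the usual reason to prefer RRF over weighted-linear fusion as a default.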

graph_rag — knowledge-graph-augmented retrieval

A four-step pipeline: (1) resolve seed entities from the query via graph_store.find_entities(), (2) walk the graph up to max_hops from every seed, (3) fetch text chunks via the standard vector lane, (4) serialise the visited sub-graph as a synthetic OrchidSearchResult and prepend it as context. When no graph_store is wired or no seed entities are found, the strategy falls back to SimpleRetrieval.

rag:
  retrieval:
    strategy: graph_rag
    graph:
      max_hops: 2
      fuse_with_vectors: true
      # relation_filter: ["related_to", "depends_on"]  # optional

The sub-graph is serialised as plain text (Entity (type) [-relation-> Other]) and injected as a synthetic document with source: graph_rag. The entity_serializer constructor kwarg accepts a custom serialiser for domain-specific formats (RDF turtle, JSON-LD, …).

When to use: Corpora with rich relational structure — product catalogs with dependencies, organizational hierarchies, compliance frameworks with cross-references.
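Steps (2) and (4) of the pipeline can be sketched with an in-memory edge list. Everything here is illustrative: `walk` and `serialise` are hypothetical helpers, the graph is treated as directed, and the real graph_store API is not shown in this page.

```python
from collections import deque

Edge = tuple[str, str, str]  # (source entity, relation, target entity)


def walk(edges: list[Edge], seeds: list[str], max_hops: int) -> list[Edge]:
    """Breadth-first walk collecting edges within max_hops of any seed."""
    outgoing: dict[str, list[Edge]] = {}
    for edge in edges:
        outgoing.setdefault(edge[0], []).append(edge)
    seen = set(seeds)
    queue = deque((seed, 0) for seed in seeds)
    visited: list[Edge] = []
    while queue:
        node, depth = queue.popleft()
        if depth == max_hops:
            continue  # hop budget spent; do not expand this node
        for edge in outgoing.get(node, []):
            visited.append(edge)
            if edge[2] not in seen:
                seen.add(edge[2])
                queue.append((edge[2], depth + 1))
    return visited


def serialise(visited: list[Edge], types: dict[str, str]) -> str:
    """Render edges in the 'Entity (type) [-relation-> Other]' text form."""
    return "\n".join(
        f"{src} ({types.get(src, 'unknown')}) [-{rel}-> {dst}]" for src, rel, dst in visited
    )
```

The resulting string is the kind of payload that would be wrapped in a synthetic document and prepended to the vector results.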

Custom strategies

Register at startup via register_retrieval_strategy() and reference by name in config:

from orchid_ai.rag.strategies import register_retrieval_strategy
from my_strategies import RecencyWeightedRetrieval

register_retrieval_strategy("recency_simple", RecencyWeightedRetrieval)

rag:
  retrieval:
    strategy: recency_simple   # registered at startup

See the RAG Strategies example for a working recency_simple strategy registered from a startup hook.

Ingestion strategies

While retrieval controls how documents are fetched, ingestion controls how documents are chunked and indexed. Four built-in ingestion strategies implement OrchidIngestionStrategy:

| Strategy | Behaviour | Best for |
| --- | --- | --- |
| recursive | RecursiveCharacterTextSplitter with configurable chunk size/overlap | General-purpose documents |
| semantic | Splits at semantic boundaries (paragraphs, sections) | Prose-heavy content |
| hierarchical | Parent-child chunking: large parent chunks with smaller child chunks for retrieval | Auto-merging retrieval patterns |
| headered | Preserves document headers/structure in chunk metadata | Structured documents with headings |

Set via rag.ingestion.strategy. Custom ingestion strategies register via register_ingestion_strategy().
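To make the parent-child idea concrete, here is a small character-based sketch. It is not the hierarchical strategy's code: `split_fixed`, `parent_child`, and the ID scheme are assumptions, and a real splitter would respect word and sentence boundaries.

```python
def split_fixed(text: str, size: int, overlap: int = 0) -> list[str]:
    """Cut text into windows of `size` characters that overlap by `overlap`."""
    if size <= overlap:
        raise ValueError("size must be larger than overlap")
    chunks: list[str] = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # this window already reaches the end of the text
        start += size - overlap
    return chunks


def parent_child(text: str, parent_size: int, child_size: int) -> list[dict]:
    """Produce small child chunks that each point back to a larger parent."""
    records = []
    for p_idx, parent in enumerate(split_fixed(text, parent_size)):
        for c_idx, child in enumerate(split_fixed(parent, child_size)):
            records.append(
                {"id": f"p{p_idx}-c{c_idx}", "parent_id": f"p{p_idx}", "text": child}
            )
    return records
```

Retrieval would match against the small child chunks and then use `parent_id` to pull in the surrounding parent text, which is the pattern the table calls auto-merging.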

Query transformers

Transformers modify the query before retrieval runs. They can fire at agent entry (pre_strategy=True) or internally within a fan-out strategy (pre_strategy=False):

| Transformer | Flag | Behaviour |
| --- | --- | --- |
| reformulate | pre_strategy=True | LLM rewrites the query for clarity |
| multi_query | pre_strategy=False | Generates N paraphrases |
| hyde | pre_strategy=False | Generates hypothetical answer paragraphs |
| decompose | pre_strategy=False | Breaks complex queries into sub-queries |

Pre-strategy transformers run once at agent entry via apply_pre_strategy(). Strategy-internal transformers fan out inside the retrieval strategy itself.
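One way to picture the two phases is shown below. This is a sketch under assumptions: `TransformerSketch` and `plan_queries` are hypothetical, each transformer is modelled as a plain function returning a list of queries, and the LLM calls are replaced by string operations.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class TransformerSketch:
    name: str
    pre_strategy: bool
    apply: Callable[[str], list[str]]


def plan_queries(query: str, transformers: list[TransformerSketch]) -> list[str]:
    """Rewrite once at entry, then fan out from the rewritten query."""
    rewritten = query
    for t in transformers:
        if t.pre_strategy:
            rewritten = t.apply(rewritten)[0]  # entry-phase transformers yield one query
    fan = [rewritten]
    for t in transformers:
        if not t.pre_strategy:
            fan.extend(t.apply(rewritten))  # strategy-internal transformers add variants
    return fan
```

For example, a lowercasing "reformulate" followed by a "decompose" that splits on " and " turns `"  Cats and Dogs "` into the rewritten query plus two sub-queries.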

Dynamic injection

After each tool call, GenericAgent optionally injects the result into the vector store. On the next turn the agent retrieves its own prior tool output alongside static knowledge, which gives it a short-term memory that persists across conversation turns within the same chat. Injection is controlled by inject_to_rag(), which is set per tool, not per agent.

Use the exclude_dynamic: true retrieval flag to keep dynamically-injected tool output out of the retrieval path.

Metadata filtering

OrchidVectorReader.retrieve(...) accepts an optional metadata_filters parameter that operates alongside the scope filter. The mini-language supports:

| Form | Meaning |
| --- | --- |
| {"status": "published"} | Exact match |
| {"language": ["en", "fr"]} | Match-any (OR within the field) |
| {"view_count": {"gte": 100}} | Range (gte/lte/gt/lt) |
| {"published_at": {"gte": "2026-01"}} | ISO-8601 strings → DatetimeRange |
| {"tags": {"contains": "alpha"}} | Substring / list-contains |
| {"deprecated": {"not": True}} | Negation (must_not) |

The Qdrant backend translates these via build_metadata_filter_clauses; the ChromaDB backend translates them into MongoDB-style where clauses ($eq, $in, $gte, etc.) via _translate_metadata_filter. Both backends infer payload index types when the agent didn't declare them explicitly.
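A simplified translation into ChromaDB-style where clauses could look like this. It is a sketch, not the real _translate_metadata_filter: `to_where` is a hypothetical name, the `contains` form and datetime handling are left out, and mapping `not` to `$ne` is an assumption.

```python
from typing import Any

_RANGE_OPS = {"gte": "$gte", "lte": "$lte", "gt": "$gt", "lt": "$lt"}


def to_where(filters: dict[str, Any]) -> dict[str, Any]:
    """Translate the filter mini-language into a MongoDB-style where clause."""
    clauses: list[dict[str, Any]] = []
    for field, cond in filters.items():
        if isinstance(cond, list):
            clauses.append({field: {"$in": cond}})  # match-any
        elif isinstance(cond, dict):
            for op, value in cond.items():
                if op in _RANGE_OPS:
                    clauses.append({field: {_RANGE_OPS[op]: value}})
                elif op == "not":
                    clauses.append({field: {"$ne": value}})
                else:
                    raise ValueError(f"operator not covered by this sketch: {op}")
        else:
            clauses.append({field: {"$eq": cond}})  # exact match
    if not clauses:
        return {}
    # A single condition stands alone; several are combined with $and.
    return clauses[0] if len(clauses) == 1 else {"$and": clauses}
```

Filters on different fields are combined with `$and`, so `{"language": ["en", "fr"], "view_count": {"gte": 100}}` becomes one `$in` clause and one `$gte` clause under a single `$and`.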

Vector store interfaces

The framework exposes three purpose-segregated ABCs:

| ABC | Who depends on it | Methods |
| --- | --- | --- |
| OrchidVectorReader | Agents (read-only) | retrieve(query, namespace, k, scope) |
| OrchidVectorWriter | Indexers (write-only) | upsert(documents, namespace), delete(ids, namespace) |
| OrchidVectorStoreAdmin | Admin operations | Collection management, scope promotion |

The framework ships two concrete backends:

  • QdrantRepository (rag/backends/qdrant.py) — full-featured: dense+sparse hybrid search, scope promotion, dynamic cache lookups. Requires a running Qdrant container.
  • ChromadbRepository (orchid_cli/rag/backends/chroma.py) — zero-infrastructure: embedded, on-disk storage via chromadb.PersistentClient. The orchid-cli default. Dense-only in v1 (no sparse vectors). Stores data at ~/.orchid/chroma/.

Both implement all three ABCs. Never import qdrant_client or chromadb outside of their respective backend modules — all access goes through these ABCs. (The ChromaDB backend lives in orchid-cli/ — the CLI registers it in VECTOR_BACKEND_REGISTRY at startup so the library never imports it directly.)
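The split into reader and writer interfaces can be illustrated with a toy backend. The class names below are hypothetical, the methods are synchronous for brevity (the real interfaces may be async), the admin interface is omitted, and keyword matching stands in for vector search.

```python
from abc import ABC, abstractmethod
from typing import Any


class ReaderSketch(ABC):
    """Read-only view: the only interface an agent would depend on."""

    @abstractmethod
    def retrieve(self, query: str, namespace: str, k: int, scope: Any) -> list[str]: ...


class WriterSketch(ABC):
    """Write-only view: the only interface an indexer would depend on."""

    @abstractmethod
    def upsert(self, documents: dict[str, str], namespace: str) -> None: ...

    @abstractmethod
    def delete(self, ids: list[str], namespace: str) -> None: ...


class InMemoryBackend(ReaderSketch, WriterSketch):
    """Toy backend implementing both views; the scope argument is ignored."""

    def __init__(self) -> None:
        self._data: dict[str, dict[str, str]] = {}

    def retrieve(self, query: str, namespace: str, k: int, scope: Any) -> list[str]:
        words = query.lower().split()
        docs = self._data.get(namespace, {}).values()
        return [text for text in docs if any(w in text.lower() for w in words)][:k]

    def upsert(self, documents: dict[str, str], namespace: str) -> None:
        self._data.setdefault(namespace, {}).update(documents)

    def delete(self, ids: list[str], namespace: str) -> None:
        for doc_id in ids:
            self._data.get(namespace, {}).pop(doc_id, None)
```

Code that only needs to read can be typed against `ReaderSketch`, so it never gains access to `upsert` or `delete` even though one object implements both.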

Local development with cli_rag

When running Docker-based examples locally via the CLI, add a cli_rag: section to orchid.yml to use ChromaDB instead of Qdrant:

rag:
  vector_backend: qdrant           # used by orchid-api (Docker)
  embedding_model: gemini/gemini-embedding-001

cli_rag:
  vector_backend: chroma           # used by orchid-cli (local, on-disk)
  embedding_model: ollama/nomic-embed-text

The CLI automatically uses cli_rag: when present, falling back to rag: otherwise. See the orchid-cli package page for details.
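The selection rule amounts to a one-line fallback. The function below is a hypothetical sketch of that rule operating on an already-parsed orchid.yml; it is not the CLI's actual loader.

```python
def select_rag_config(config: dict, running_in_cli: bool) -> dict:
    """Prefer cli_rag when running under the CLI and the section exists."""
    if running_in_cli and config.get("cli_rag"):
        return config["cli_rag"]
    return config["rag"]
```

With the example file above, the CLI would get the chroma settings while orchid-api keeps using the qdrant ones.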

Namespace conventions

Namespaces are vector store collection prefixes that separate different kinds of knowledge:

| Namespace | Contents |
| --- | --- |
| learning | Tenant knowledge base (documents, wikis) |
| notifications | Agent-generated structured data |
| uploads | User-uploaded files (scoped to the uploading user's chat) |

Set an agent's namespace via rag.namespace. Leave it empty to skip RAG retrieval.
