Hierarchical RAG
OrchidRAGScope, the 5-level hierarchy, retrieval strategies, ingestion pipelines, and the parse-once pattern.
Orchid's RAG subsystem is built around a five-level scope hierarchy and a pluggable strategy architecture. Every vector read and write carries an OrchidRAGScope, a frozen dataclass that encodes exactly which slice of the vector store a given operation is allowed to see or write to. Retrieval behaviour is controlled by the rag.retrieval.strategy field in agents.yaml, with five built-in strategies and an extension point for custom ones.
The scope hierarchy
__shared__                            ← visible to ALL tenants (root common data)
└── tenant_id
    └── user_id
        ├── scope="user"              ← visible across all of this user's chats
        └── chat_id
            ├── scope="chat_shared"   ← all agents in this chat
            └── scope="chat_agent"    ← private to one agent
                                        + agent_id

A retrieval query for a given scope sees every level above it: shared data, tenant data, user data, and chat data. This makes tenant-wide knowledge bases, user-uploaded files, and dynamically injected tool results all accessible through a single retrieval call, with no manual filter logic required.
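The visibility rule can be sketched in a few lines. This is a minimal illustration, not the framework's implementation: Scope is a hypothetical stand-in for OrchidRAGScope (its field names mirror the constructor arguments shown in this document), and visible_levels is an invented helper that lists which levels of the hierarchy a query would read.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scope:
    """Hypothetical stand-in for OrchidRAGScope; not the real class."""
    tenant_id: str
    user_id: str = ""
    chat_id: str = ""
    agent_id: str = ""


def visible_levels(scope: Scope) -> list[tuple[str, ...]]:
    """Return the hierarchy levels a query with this scope may read, root first."""
    levels: list[tuple[str, ...]] = [("__shared__",), (scope.tenant_id,)]
    if scope.user_id:
        levels.append((scope.tenant_id, scope.user_id, "user"))
        if scope.chat_id:
            chat = (scope.tenant_id, scope.user_id, scope.chat_id)
            levels.append(chat + ("chat_shared",))
            if scope.agent_id:
                levels.append(chat + ("chat_agent", scope.agent_id))
    return levels


full = visible_levels(Scope("acme", user_id="u1", chat_id="c1", agent_id="helper"))
tenant_only = visible_levels(Scope("acme"))
```

A fully populated scope yields all five levels, while a tenant-only scope yields just the shared and tenant levels, which matches the idea of omitting user_id for tenant-wide queries.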
OrchidRAGScope
from orchid_ai.rag.scopes import OrchidRAGScope

scope = OrchidRAGScope(
    tenant_id=auth.tenant_key,  # required
    user_id=auth.user_id,  # optional; omit for tenant-wide queries
    chat_id=state.get("chat_id", ""),
    agent_id=self.name,
)
docs = await self.fetch_rag_context(query, scope, namespace="learning", k=5)

OrchidRAGScope is a frozen dataclass living in orchid_ai/core/scopes.py. Because it has zero external dependencies, it can be constructed anywhere (agents, indexers, tests) without importing backend-specific code.
Retrieval strategies
Orchid ships five retrieval strategies, each implementing the OrchidRetrievalStrategy ABC. Strategies are stateless and selected per-agent via rag.retrieval.strategy. Unknown strategy names fall back to simple with a warning — a typo never crashes the agent.
simple — single dense retrieval
The default. One reader.retrieve() call with the user query as-is. Fastest path, best for well-formed queries where the corpus contains a near-verbatim match.
rag:
  retrieval:
    strategy: simple

When to use: Baseline retrieval, low-latency requirements, queries that closely match corpus vocabulary.
multi_query — fan-out with paraphrases
The LLM generates N paraphrases of the original query (default 3), retrieves against each independently in parallel, then merges results by score with deduplication. Improves recall for ambiguous, informal, or multi-intent questions.
rag:
  retrieval:
    strategy: multi_query
    # multi_query:
    #   num_queries: 3  # default
    #   retrieval_timeout: 30.0

Without a chat_model the strategy degrades gracefully to single-query retrieval. External pre_strategy=False transformers compose orthogonally: pair with query_transformers: [hyde, decompose] for multi-query × HyDE fan-out.
When to use: Ambiguous queries, broad recall needs, user questions that may not match corpus terminology.
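The fan-out and merge step can be sketched as follows. This is an illustration under stated assumptions, not Orchid's code: fan_out and fake_retrieve are hypothetical names, the paraphrases are supplied by hand instead of by an LLM, and "merge by score with deduplication" is interpreted as keeping each document's best score.

```python
import asyncio
from typing import Awaitable, Callable

Hit = tuple[str, float]  # (document id, similarity score)
Retriever = Callable[[str], Awaitable[list[Hit]]]


async def fan_out(queries: list[str], retrieve: Retriever, k: int = 5) -> list[Hit]:
    """Retrieve for every query concurrently, keep each document's best score."""
    per_query = await asyncio.gather(*(retrieve(q) for q in queries))
    best: dict[str, float] = {}
    for hits in per_query:
        for doc_id, score in hits:
            if score > best.get(doc_id, float("-inf")):
                best[doc_id] = score
    ranked = sorted(best.items(), key=lambda item: item[1], reverse=True)
    return ranked[:k]


async def fake_retrieve(query: str) -> list[Hit]:
    """Toy retriever standing in for a vector-store call (assumption)."""
    corpus = {
        "reset password": [("doc-a", 0.9), ("doc-b", 0.4)],
        "forgot login": [("doc-b", 0.7), ("doc-c", 0.6)],
    }
    return corpus.get(query, [])


merged = asyncio.run(fan_out(["reset password", "forgot login"], fake_retrieve, k=3))
```

Here doc-b is returned by both paraphrases but appears once in the merged list, carrying the higher of its two scores.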
hyde — hypothetical document embeddings
Generates N plausible answer paragraphs via the LLM, then retrieves documents similar to those hypothetical answers in addition to the original query. Results are merged by document ID (highest score wins). Effective when user vocabulary differs significantly from corpus vocabulary.
rag:
  retrieval:
    strategy: hyde
    hyde:
      n_hypothetical: 2  # default: 1

Like multi_query, HyDE composes with external transformers and degrades to dense-only when no chat model is available.
When to use: Domain-specific corpora where users ask questions in lay terms, conceptual queries where the answer structure matters more than keyword overlap.
hybrid — dense + sparse fusion
Issues two parallel retrievals — one against the dense embedding lane and one against the sparse/lexical lane (BM25 by default, SPLADE behind an optional extra) — then fuses rankings via Reciprocal Rank Fusion (RRF, default) or weighted-linear fusion. Each lane fetches k × lane_multiplier candidates so the fusion has enough headroom to surface dense-only or sparse-only matches.
rag:
  retrieval:
    strategy: hybrid
    hybrid:
      sparse_encoder: bm25  # or splade (optional extra)
      sparse_weight: 0.4  # for linear fusion
      fusion: rrf  # or linear
      rrf_k: 60  # RRF smoothing constant

When the backend lacks sparse support, the strategy logs a warning and degrades to dense-only. Sparse encoder failures also fall back to the dense lane.
When to use: Mixed corpora with both semantic content (best for dense) and exact-match terms like product codes, IDs, or technical jargon (best for sparse).
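Reciprocal Rank Fusion itself is compact enough to show in full. The sketch below uses the standard formula, where each document scores the sum of 1 / (rrf_k + rank) over every lane that returned it; the function name and the toy rankings are illustrative assumptions, not Orchid's internals.

```python
def reciprocal_rank_fusion(
    rankings: list[list[str]], rrf_k: int = 60, k: int = 5
) -> list[tuple[str, float]]:
    """Fuse ranked lists of document ids; ranks are 1-based."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (rrf_k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:k]


dense = ["doc-a", "doc-b", "doc-c"]   # order from the dense lane
sparse = ["doc-c", "doc-a", "doc-d"]  # order from the sparse lane
fused = reciprocal_rank_fusion([dense, sparse], rrf_k=60, k=3)
```

Because RRF uses only ranks, it does not need the dense and sparse scores to be on comparable scales, which is one common reason to prefer it over weighted-linear fusion as a default.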
graph_rag — knowledge-graph-augmented retrieval
A four-step pipeline: (1) resolve seed entities from the query via graph_store.find_entities(), (2) walk the graph up to max_hops from every seed, (3) fetch text chunks via the standard vector lane, (4) serialise the visited sub-graph as a synthetic OrchidSearchResult and prepend it as context. When no graph_store is wired or no seed entities are found, the strategy falls back to SimpleRetrieval.
rag:
  retrieval:
    strategy: graph_rag
    graph:
      max_hops: 2
      fuse_with_vectors: true
      # relation_filter: ["related_to", "depends_on"]  # optional

The sub-graph is serialised as plain text (Entity (type) [-relation-> Other]) and injected as a synthetic document with source: graph_rag. The entity_serializer constructor kwarg accepts a custom serialiser for domain-specific formats (RDF Turtle, JSON-LD, …).
When to use: Corpora with rich relational structure — product catalogs with dependencies, organizational hierarchies, compliance frameworks with cross-references.
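Steps (2) and (4) of the pipeline can be sketched with a bounded breadth-first walk over a toy in-memory graph. Everything here is hypothetical: EDGES, TYPES, walk, and serialise are invented for illustration, and the output format is one reading of the Entity (type) [-relation-> Other] pattern described in this section.

```python
from collections import deque

# Toy graph (assumption): entity -> list of (relation, neighbour)
EDGES = {
    "Billing API": [("depends_on", "Auth Service")],
    "Auth Service": [("depends_on", "User DB")],
    "User DB": [("related_to", "Backup Job")],
}
TYPES = {
    "Billing API": "service",
    "Auth Service": "service",
    "User DB": "database",
    "Backup Job": "job",
}


def walk(seeds, max_hops=2, relation_filter=None):
    """Breadth-first walk; returns visited edges as (source, relation, target)."""
    visited = set(seeds)
    queue = deque((seed, 0) for seed in seeds)
    triples = []
    while queue:
        node, depth = queue.popleft()
        if depth == max_hops:
            continue  # do not expand beyond the hop limit
        for relation, neighbour in EDGES.get(node, []):
            if relation_filter and relation not in relation_filter:
                continue
            triples.append((node, relation, neighbour))
            if neighbour not in visited:
                visited.add(neighbour)
                queue.append((neighbour, depth + 1))
    return triples


def serialise(triples):
    """Render each edge as 'Entity (type) [-relation-> Other]'."""
    return "\n".join(f"{s} ({TYPES[s]}) [-{r}-> {t}]" for s, r, t in triples)


context = serialise(walk(["Billing API"], max_hops=2))
```

With max_hops set to 2 the walk reaches User DB but does not expand it, so the Backup Job edge stays out of the serialised context.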
Custom strategies
Register at startup via register_retrieval_strategy() and reference by name in config:
from orchid_ai.rag.strategies import register_retrieval_strategy
from my_strategies import RecencyWeightedRetrieval

register_retrieval_strategy("recency_simple", RecencyWeightedRetrieval)

rag:
  retrieval:
    strategy: recency_simple  # registered at startup

See the RAG Strategies example for a working recency_simple strategy registered from a startup hook.
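The combination of name-based registration and the fallback to simple for unknown names can be pictured as a small registry. The sketch below is a self-contained approximation, assuming a plain dictionary lookup; register, resolve, and the two placeholder classes are hypothetical and do not reproduce register_retrieval_strategy().

```python
import logging

logger = logging.getLogger("rag.strategies")

_REGISTRY: dict[str, type] = {}


class SimpleRetrieval:
    """Placeholder for the default strategy."""


class RecencyWeightedRetrieval:
    """Placeholder for a custom strategy."""


def register(name: str, strategy_cls: type) -> None:
    """Make a strategy class selectable by name."""
    _REGISTRY[name] = strategy_cls


def resolve(name: str):
    """Instantiate the named strategy, falling back to 'simple' with a warning."""
    strategy_cls = _REGISTRY.get(name)
    if strategy_cls is None:
        logger.warning("Unknown retrieval strategy %r, using 'simple'", name)
        strategy_cls = _REGISTRY["simple"]
    return strategy_cls()


register("simple", SimpleRetrieval)
register("recency_simple", RecencyWeightedRetrieval)

chosen = resolve("recency_simple")
fallback = resolve("recency_simpel")  # typo: logs a warning, returns SimpleRetrieval
```

Registration has to happen before the configuration is resolved, which is why the documented pattern registers custom strategies at startup.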
Ingestion strategies
While retrieval controls how documents are fetched, ingestion controls how documents are chunked and indexed. Four built-in ingestion strategies implement OrchidIngestionStrategy:
| Strategy | Behaviour | Best for |
|---|---|---|
| recursive | RecursiveCharacterTextSplitter with configurable chunk size/overlap | General-purpose documents |
| semantic | Splits at semantic boundaries (paragraphs, sections) | Prose-heavy content |
| hierarchical | Parent-child chunking: large parent chunks with smaller child chunks for retrieval | Auto-merging retrieval patterns |
| headered | Preserves document headers/structure in chunk metadata | Structured documents with headings |
Set via rag.ingestion.strategy. Custom ingestion strategies register via register_ingestion_strategy().
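To make the hierarchical row concrete, here is a rough sketch of parent-child chunking. It assumes fixed-size character windows in place of a real text splitter, and the record layout (text, parent_id, parent_text) is invented for illustration; the built-in strategy may store the relationship differently.

```python
def split(text: str, size: int) -> list[str]:
    """Cut text into consecutive pieces of at most `size` characters."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def hierarchical_chunks(
    text: str, parent_size: int = 40, child_size: int = 10
) -> list[dict]:
    """Index small child chunks, each carrying a pointer to its larger parent."""
    records = []
    for parent_id, parent in enumerate(split(text, parent_size)):
        for child in split(parent, child_size):
            records.append(
                {"text": child, "parent_id": parent_id, "parent_text": parent}
            )
    return records


document = "x" * 100
records = hierarchical_chunks(document)
```

The small child chunks are what gets embedded and matched, while the parent text is what a retriever would hand to the model once enough children of the same parent match.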
Query transformers
Transformers modify the query before retrieval runs. They can fire at agent entry (pre_strategy=True) or internally within a fan-out strategy (pre_strategy=False):
| Transformer | Flag | Behaviour |
|---|---|---|
| reformulate | pre_strategy=True | LLM rewrites the query for clarity |
| multi_query | pre_strategy=False | Generates N paraphrases |
| hyde | pre_strategy=False | Generates hypothetical answer paragraphs |
| decompose | pre_strategy=False | Breaks complex queries into sub-queries |
Pre-strategy transformers run once at agent entry via apply_pre_strategy(). Strategy-internal transformers fan out inside the retrieval strategy itself.
Dynamic injection
After each tool call, GenericAgent optionally injects the result into the vector store. On the next turn the agent retrieves its own prior tool output alongside static knowledge, which gives it a short-term memory that survives across conversation turns within the same chat. Injection is controlled by inject_to_rag(), which applies per tool, not per agent.
Use the exclude_dynamic: true retrieval flag to keep dynamically-injected tool output out of the retrieval path.
Metadata filtering
OrchidVectorReader.retrieve(...) accepts an optional metadata_filters parameter that operates alongside the scope filter. The mini-language supports:
| Form | Meaning |
|---|---|
| {"status": "published"} | Exact match |
| {"language": ["en", "fr"]} | Match-any (OR within the field) |
| {"view_count": {"gte": 100}} | Range: gte/lte/gt/lt |
| {"published_at": {"gte": "2026-01"}} | ISO-8601 strings → DatetimeRange |
| {"tags": {"contains": "alpha"}} | Substring / list-contains |
| {"deprecated": {"not": True}} | Negation (must_not) |
The Qdrant backend translates these via build_metadata_filter_clauses; the ChromaDB backend translates them into MongoDB-style where clauses ($eq, $in, $gte, etc.) via _translate_metadata_filter. Both backends infer payload index types when the agent didn't declare them explicitly.
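As a rough picture of what such a translation involves, the sketch below maps a subset of the mini-language onto ChromaDB-style where clauses. It is not the framework's _translate_metadata_filter: the function name to_where is hypothetical, the contains form and date handling are deliberately left out, and multiple fields are assumed to combine with $and.

```python
RANGE_OPS = {"gte", "lte", "gt", "lt"}


def to_where(filters: dict) -> dict:
    """Translate exact, match-any, range, and negation filters."""
    clauses = []
    for field, condition in filters.items():
        if isinstance(condition, list):
            clauses.append({field: {"$in": condition}})  # match-any
        elif isinstance(condition, dict):
            for op, value in condition.items():
                if op in RANGE_OPS:
                    clauses.append({field: {f"${op}": value}})
                elif op == "not":
                    clauses.append({field: {"$ne": value}})
                else:
                    raise ValueError(f"unsupported operator in sketch: {op}")
        else:
            clauses.append({field: {"$eq": condition}})  # exact match
    if not clauses:
        return {}
    return clauses[0] if len(clauses) == 1 else {"$and": clauses}


where = to_where({"language": ["en", "fr"], "view_count": {"gte": 100}})
```

A single condition is returned bare and several conditions are wrapped in $and, since ChromaDB expects one operator at the top level of a where clause.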
Vector store interfaces
The framework exposes three purpose-segregated ABCs:
| ABC | Who depends on it | Methods |
|---|---|---|
| OrchidVectorReader | Agents (read-only) | retrieve(query, namespace, k, scope) |
| OrchidVectorWriter | Indexers (write-only) | upsert(documents, namespace), delete(ids, namespace) |
| OrchidVectorStoreAdmin | Admin operations | Collection management, scope promotion |
The framework ships two concrete backends:
QdrantRepository (rag/backends/qdrant.py): full-featured, with dense+sparse hybrid search, scope promotion, and dynamic cache lookups. Requires a running Qdrant container.
ChromadbRepository (orchid_cli/rag/backends/chroma.py): zero-infrastructure, with embedded, on-disk storage via chromadb.PersistentClient. The orchid-cli default. Dense-only in v1 (no sparse vectors). Stores data at ~/.orchid/chroma/.
Both implement all three ABCs. Never import qdrant_client or chromadb outside of their respective backend modules — all access goes through these ABCs. (The ChromaDB backend lives in orchid-cli/ — the CLI registers it in VECTOR_BACKEND_REGISTRY at startup so the library never imports it directly.)
Local development with cli_rag
When running Docker-based examples locally via the CLI, add a cli_rag: section to orchid.yml to use ChromaDB instead of Qdrant:
rag:
  vector_backend: qdrant  # used by orchid-api (Docker)
  embedding_model: gemini/gemini-embedding-001
cli_rag:
  vector_backend: chroma  # used by orchid-cli (local, on-disk)
  embedding_model: ollama/nomic-embed-text

The CLI automatically uses cli_rag: when present, falling back to rag: otherwise. See the orchid-cli package page for details.
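The selection rule amounts to a one-line preference, sketched here over an already-parsed configuration. select_rag_section and the running_in_cli flag are hypothetical names used only to illustrate the fallback; how the CLI actually detects its context is not covered here.

```python
def select_rag_section(config: dict, running_in_cli: bool) -> dict:
    """Prefer cli_rag for CLI runs when it is present; otherwise use rag."""
    if running_in_cli and config.get("cli_rag"):
        return config["cli_rag"]
    return config.get("rag", {})


config = {
    "rag": {
        "vector_backend": "qdrant",
        "embedding_model": "gemini/gemini-embedding-001",
    },
    "cli_rag": {
        "vector_backend": "chroma",
        "embedding_model": "ollama/nomic-embed-text",
    },
}

cli_backend = select_rag_section(config, running_in_cli=True)["vector_backend"]
api_backend = select_rag_section(config, running_in_cli=False)["vector_backend"]
```

Keeping both sections in one file lets the same example run against Qdrant in Docker and against ChromaDB locally without editing the configuration between runs.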
Namespace conventions
Namespaces are vector store collection prefixes that separate different kinds of knowledge:
| Namespace | Contents |
|---|---|
| learning | Tenant knowledge base (documents, wikis) |
| notifications | Agent-generated structured data |
| uploads | User-uploaded files (scoped to the uploading user's chat) |
Set an agent's namespace via rag.namespace. Leave it empty to skip RAG retrieval.