RAG Architecture for Industrial AI: A Technical Overview
MuVeraAI Technical Whitepaper Series | Phase 1 | Document P1-04
Version: 1.0 Date: January 2026 Classification: Technical Reference Audience: Technical Architects, AI/ML Engineers, Platform Engineers
Abstract
Retrieval-Augmented Generation (RAG) has emerged as the dominant paradigm for building knowledge-intensive AI applications that require factual grounding. While RAG systems have proven effective in general domains, industrial applications present unique challenges: complex multi-format documents, domain-specific terminology, safety-critical accuracy requirements, and the need for structured reasoning over equipment relationships.
This paper presents the MuVeraAI RAG architecture, a production system designed for industrial workforce training and decision support in data center operations. We describe a four-mode RAG pipeline that progressively adds sophistication based on query complexity: Simple (direct retrieval), Advanced (hybrid search with reranking), Agentic (multi-step reasoning with tool use), and Graph-Enhanced (knowledge graph integration). The architecture addresses industrial-specific challenges including semantic chunking for technical documents, domain-aware entity extraction, multiple fusion algorithms for hybrid search, and graph-based reasoning for equipment diagnostics.
Our implementation leverages a microservices architecture with nine specialized services orchestrated through an asynchronous pipeline. Key contributions include a hybrid search system combining vector similarity with keyword matching using four fusion algorithms (RRF, Weighted, Convex, DBSF), a five-model reranking pipeline with ensemble capabilities, and GraphRAG integration that enables multi-hop reasoning over equipment relationships. We detail our approach to hallucination mitigation, citation injection, and confidence-based escalation for safety-critical industrial contexts.
1. Introduction
1.1 The Case for RAG in Industrial AI
Large Language Models have demonstrated remarkable capabilities in understanding and generating human language. However, their application to industrial domains faces three fundamental limitations:
Knowledge Currency: LLMs are trained on static datasets with knowledge cutoffs that may exclude recent equipment models, updated procedures, or regulatory changes. A data center technician troubleshooting a 2025 chiller model cannot rely on knowledge frozen in 2023.
Factual Grounding: LLMs generate plausible-sounding but potentially incorrect information, a phenomenon colloquially termed "hallucination." In industrial contexts, suggesting an incorrect refrigerant charge or skipping a safety lockout step can result in equipment damage, injury, or death.
Organizational Knowledge: Tribal knowledge, site-specific procedures, and institutional memory exist in documents, maintenance logs, and expert minds rather than in public training data. Fine-tuning cannot efficiently capture this constantly evolving organizational knowledge.
Retrieval-Augmented Generation addresses these limitations by decoupling the knowledge store from the reasoning engine. The LLM's role shifts from knowledge repository to reasoning engine, synthesizing answers from retrieved evidence rather than parametric memory. This architectural separation enables knowledge updates without model retraining, traceable citations for verification, and integration of proprietary organizational knowledge.
1.2 Why Fine-Tuning Is Insufficient
Organizations often consider fine-tuning as an alternative to RAG for domain adaptation. While fine-tuning can improve model behavior and tone, it presents significant drawbacks for industrial knowledge applications:
Update Latency: Fine-tuning requires collecting training data, formatting it appropriately, running training jobs, and deploying updated models. This cycle typically takes days to weeks, making it unsuitable for frequently updated procedural knowledge.
Cost at Scale: Fine-tuning costs scale with both dataset size and update frequency. Organizations with thousands of documents requiring monthly updates face prohibitive training costs.
Catastrophic Forgetting: Fine-tuning on domain data can degrade performance on general capabilities, requiring careful balance between specialization and general reasoning.
No Attribution: Fine-tuned models cannot cite sources for their outputs, eliminating the ability to verify claims against authoritative documents.
RAG provides a more practical approach: new documents can be indexed in minutes, costs scale with storage rather than compute, general capabilities remain intact, and every claim can be traced to source documents.
1.3 Scope and Prerequisites
This paper targets technical architects evaluating RAG architectures and ML engineers implementing industrial AI systems. We assume familiarity with:
- Vector embeddings and similarity search
- Transformer-based language models
- Microservices architecture patterns
- Basic graph database concepts
The MuVeraAI platform focuses on data center operations, specifically HVAC/R (Heating, Ventilation, Air Conditioning, and Refrigeration) systems. Examples throughout this paper reflect this domain, though the architecture generalizes to other industrial contexts including manufacturing, energy, and facilities management.
2. RAG Fundamentals
2.1 The Core Concept
At its essence, RAG combines two distinct phases:
Retrieval Phase: Given a user query, find relevant documents or passages from a knowledge corpus. This typically involves encoding the query into a vector representation and finding documents with similar vectors.
Generation Phase: Given the query and retrieved context, generate a response using a language model. The model synthesizes information from multiple retrieved passages rather than relying solely on parametric knowledge.
The canonical RAG pipeline proceeds as follows:
User Query
     |
     v
[1. Embed Query] --> Query Vector
     |
     v
[2. Vector Search] --> Top-K Similar Documents
     |
     v
[3. Context Assembly] --> Formatted Context + Query
     |
     v
[4. LLM Generation] --> Response with Citations
     |
     v
Final Answer
This simple pipeline, while effective for straightforward queries, proves insufficient for industrial applications where queries may require multi-step reasoning, integration of structured data, or traversal of equipment relationships.
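To make the four stages concrete, here is a minimal end-to-end sketch. The toy corpus, the hand-written query vector, and the plain cosine function are illustrative stand-ins for a real embedding model and vector index:

```python
import math

# Hypothetical stand-ins: a production system would call an embedding
# model and a vector database instead of these toy structures.
KNOWLEDGE_BASE = [
    ("doc-1", [0.9, 0.1, 0.0], "Chiller PM requires isolating power first."),
    ("doc-2", [0.1, 0.8, 0.2], "CRAC filters should be replaced quarterly."),
    ("doc-3", [0.0, 0.2, 0.9], "Superheat is measured at the evaporator outlet."),
]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm if norm else 0.0

def retrieve(query_vector, top_k=2):
    # Step 2: vector search — rank corpus chunks by similarity to the query.
    ranked = sorted(KNOWLEDGE_BASE, key=lambda d: cosine(query_vector, d[1]), reverse=True)
    return ranked[:top_k]

def assemble_context(chunks):
    # Step 3: context assembly with numbered citation markers.
    return "\n\n".join(f"[{i+1}] ({doc_id}) {text}"
                       for i, (doc_id, _, text) in enumerate(chunks))

# Steps 1 and 4 (embedding and generation) are mocked here: the query
# vector is hand-written, and the prompt would be sent to an LLM.
query_vector = [0.85, 0.15, 0.05]
context = assemble_context(retrieve(query_vector))
prompt = f"Context:\n{context}\n\nQuestion: How do I start chiller PM?"
```

In a real deployment, steps 1 and 4 are the embedding model and LLM calls; everything between them is deterministic plumbing.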
2.2 Component Architecture
A production RAG system comprises multiple specialized components:
Document Processing Pipeline
- Ingestion: Accept documents in multiple formats (PDF, DOCX, HTML, plain text)
- Parsing: Extract text while preserving structure (headings, tables, lists)
- Chunking: Segment documents into retrieval units of appropriate size
- Enrichment: Extract entities, relationships, and metadata
Embedding Infrastructure
- Model Selection: Choose embedding models appropriate for the domain
- Batch Processing: Efficiently embed large document collections
- Index Management: Create, update, and query vector indices
- Dimension Optimization: Balance embedding quality against storage costs
Retrieval System
- Vector Search: Find semantically similar documents
- Keyword Search: Find exact term matches for technical terminology
- Hybrid Fusion: Combine multiple retrieval signals
- Reranking: Refine initial results using cross-encoder models
Generation Layer
- Prompt Construction: Format context and query for the LLM
- Model Routing: Select appropriate models based on query characteristics
- Response Streaming: Deliver partial responses for improved UX
- Citation Injection: Link claims to source passages
Each component requires careful design decisions that affect system quality, latency, and cost. The following sections detail our approach to each.
3. Industrial RAG Challenges
Industrial applications present challenges that distinguish them from general-purpose RAG systems. Understanding these challenges informs our architectural decisions.
3.1 Document Complexity
Industrial knowledge bases contain documents that defy naive processing approaches:
PDFs with Embedded Tables: Equipment specifications, troubleshooting matrices, and pressure-temperature charts encode critical information in tabular formats. Standard PDF extraction often corrupts table structure, converting rows into unrelated text fragments.
Original Table:
| Symptom | Likely Cause | Resolution |
|-------------------|--------------------|--------------------|
| High head pressure| Dirty condenser | Clean condenser |
| Low suction | Low refrigerant | Check for leaks |
Naive Extraction:
"Symptom Likely Cause Resolution High head pressure Dirty condenser
Clean condenser Low suction Low refrigerant Check for leaks"
The extracted text loses the relational structure essential for accurate retrieval and reasoning.
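One common mitigation, once a table has been parsed into rows, is to serialize each row as a self-contained statement that pairs every cell with its column header, so the relational structure survives chunking and retrieval. A minimal sketch (the header/rows input shape is an assumption about the upstream parser):

```python
def serialize_table_rows(header, rows):
    """Turn each table row into a standalone sentence pairing every
    cell with its column header, keeping the relations intact."""
    sentences = []
    for row in rows:
        parts = [f"{col}: {cell}" for col, cell in zip(header, row)]
        sentences.append("; ".join(parts) + ".")
    return sentences

header = ["Symptom", "Likely Cause", "Resolution"]
rows = [
    ["High head pressure", "Dirty condenser", "Clean condenser"],
    ["Low suction", "Low refrigerant", "Check for leaks"],
]
# Each row now reads as a complete diagnostic statement.
flat = serialize_table_rows(header, rows)
```

A query like "what causes high head pressure" now matches a chunk that carries its cause and resolution with it, rather than a stray cell fragment.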
Multi-Document Procedures: A complete work procedure may span multiple documents: a general maintenance checklist, equipment-specific addenda, site-specific safety requirements, and current work orders. Answering "How do I perform PM on Chiller 3?" requires synthesizing information across these sources.
Scanned Documents: Legacy documentation often exists only as scanned images. OCR introduces errors, particularly for technical terminology and model numbers. "R-410A" may become "R-4lOA" or "R-41DA", degrading retrieval accuracy.
Drawing and Diagram Annotations: P&ID (Piping and Instrumentation Diagrams) and electrical schematics contain text labels that require spatial reasoning to interpret correctly. The label "COMP-1" next to a symbol carries meaning only in relation to the diagram structure.
3.2 Query Understanding
Industrial queries present linguistic challenges absent from general domains:
Technical Terminology: Domain vocabulary includes abbreviations (CRAC, CRAH, TXV, EEV), model numbers (Liebert DS112A), refrigerant designations (R-410A, R-454B), and measurement units (psig, kPa, CFM). Embedding models trained on general text may fail to capture semantic relationships between technical terms.
Implicit Context: Technicians ask questions assuming shared context: "What's the superheat supposed to be?" implicitly references the specific equipment they're working on, the operating conditions, and the refrigerant type. A helpful system must either infer this context or request clarification.
Multi-Step Queries: Troubleshooting queries often require sequential reasoning: "The chiller is short-cycling and the discharge pressure is high. What should I check?" requires understanding symptom-cause relationships and diagnostic prioritization rather than simple document retrieval.
Ambiguity Resolution: Technical terms may have multiple meanings depending on context. "Discharge" might refer to discharge pressure (refrigeration), electrical discharge (safety), or patient discharge (in medical contexts). The system must disambiguate based on domain and query context.
3.3 Safety Requirements
Industrial AI systems operate in high-stakes environments where errors carry real consequences:
Hallucination Consequences: A hallucinated refrigerant charge recommendation could damage a compressor. A fabricated lockout procedure could result in electrocution. The cost of a confident but incorrect answer far exceeds the cost of admitting uncertainty.
Confidence Calibration: The system must accurately assess its own certainty. High-confidence answers with low actual accuracy are more dangerous than low-confidence answers, as users may not independently verify confident statements.
Escalation Paths: When the system lacks sufficient information or encounters queries outside its competence, it must escalate to human experts rather than guess. This requires mechanisms to detect knowledge boundaries and route appropriately.
Audit Trail: Regulated industries require the ability to review what information the system provided and why. Complete logging of queries, retrieved documents, and generated responses enables post-hoc analysis of any incidents.
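One lightweight way to satisfy the audit-trail requirement is an append-only record written at response time, one line per interaction. The field set and tamper-evidence digest below are illustrative, not a prescribed schema:

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone

@dataclass
class AuditRecord:
    """Append-only record of one RAG interaction, for post-hoc review."""
    query: str
    retrieved_ids: list   # ids of the chunks shown to the model
    response: str
    model: str
    timestamp: str = field(default_factory=lambda: datetime.now(timezone.utc).isoformat())

    def to_line(self):
        # One JSON line per interaction, plus a short digest so that
        # after-the-fact edits to the log are detectable.
        payload = json.dumps(asdict(self), sort_keys=True)
        digest = hashlib.sha256(payload.encode()).hexdigest()[:16]
        return f"{payload}\t{digest}"
```

Writing these lines to write-once storage gives reviewers the query, the exact evidence retrieved, and the generated answer for any incident.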
4. The MuVeraAI RAG Architecture
Our architecture addresses industrial challenges through a four-mode pipeline that adapts sophistication to query complexity. This section details each component and their interactions.
4.1 Four RAG Modes
The MuVeraAI platform implements four distinct RAG modes, selected based on query characteristics and user preferences:
Mode 1: Simple RAG
Query --> Embed --> Vector Search --> Top-K Chunks --> LLM --> Response
Simple mode provides the fastest response path for straightforward factual queries. It performs direct embedding of the query, retrieves the top-K most similar document chunks, and generates a response. This mode suits queries like "What is the refrigerant charge for a Carrier 30XA chiller?" where a single document section likely contains the answer.
Mode 2: Advanced RAG
Query --> Expand --> Hybrid Search --> Fusion --> Rerank --> LLM --> Response
Advanced mode adds query expansion, hybrid search (combining vector and keyword retrieval), result fusion, and neural reranking. This mode handles queries requiring multiple retrieval signals, such as "How do I troubleshoot high head pressure on a scroll compressor?" where both semantic similarity and exact term matching improve retrieval.
Mode 3: Agentic RAG
Query --> Plan --> [Tool Call --> Result]* --> Synthesize --> Response
Agentic mode enables multi-step reasoning through tool use. The LLM generates a plan, executes tools (search, calculation, lookup), observes results, and iterates until sufficient information is gathered. This mode handles complex queries like "Compare the energy efficiency of replacing our 20-year-old chillers versus retrofitting with VFDs" that require multiple searches and calculations.
Mode 4: Graph-Enhanced RAG
Query --> Entity Extract --> Graph Traverse --> Vector Search --> Fuse --> LLM --> Response
Graph-enhanced mode integrates knowledge graph traversal with vector retrieval. Entities mentioned in the query seed graph traversal to find related equipment, procedures, and diagnostic paths. This mode excels at queries involving relationships: "What components might fail if Pump-3A trips?" requires understanding equipment dependencies rather than just document similarity.
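As a rough illustration of how queries might be routed among the four modes, consider a keyword-cue classifier. The cue lists here are invented for the example and are deliberately simpler than any production routing logic:

```python
import re

# Illustrative cue patterns; a real router would use richer signals.
RELATIONSHIP_CUES = re.compile(
    r"\b(if .* (trips|fails)|depends on|connected to|downstream|upstream)\b", re.I)
MULTI_STEP_CUES = re.compile(
    r"\b(compare|versus|vs\.?|estimate|calculate|trade-?off)\b", re.I)
TROUBLESHOOT_CUES = re.compile(
    r"\b(troubleshoot|diagnos\w+|short.cycling|why is)\b", re.I)

def select_mode(query: str) -> str:
    if RELATIONSHIP_CUES.search(query):
        return "graph"      # relationship queries need graph traversal
    if MULTI_STEP_CUES.search(query):
        return "agentic"    # comparisons/calculations need multi-step tool use
    if TROUBLESHOOT_CUES.search(query):
        return "advanced"   # hybrid search + reranking for diagnostics
    return "simple"         # direct factual lookup
```

The example queries from each mode's description above route as expected under these cues.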
4.2 Document Processing Pipeline
Documents enter the system through a multi-stage processing pipeline:
Stage 1: Format Handling
The Document Processor service accepts multiple input formats:
SUPPORTED_FORMATS = {
    'application/pdf': process_pdf,                     # PDF with text and tables
    'application/vnd.openxmlformats...': process_docx,  # Word documents
    'application/vnd.ms-excel': process_xlsx,           # Excel spreadsheets
    'text/html': process_html,                          # HTML pages
    'text/plain': process_text,                         # Plain text
}
PDF processing deserves special attention due to its prevalence in industrial documentation. We employ a multi-strategy approach:
- Text Extraction: Extract text using PyPDF2 or pdfplumber
- Table Detection: Identify tables using tabular structure recognition
- Table Extraction: Extract tables preserving row/column relationships
- OCR Fallback: For scanned pages, apply Tesseract OCR
- Layout Analysis: Preserve heading hierarchy and section structure
Stage 2: Semantic Chunking
Naive fixed-size chunking ignores document structure, potentially splitting sentences mid-thought or separating procedure steps from their context. Our Semantic Chunker implements four strategies:
Fixed Chunking: Simple character/token-based splitting with overlap. Fastest but lowest quality.
def _fixed_chunking(self, text: str) -> List[Chunk]:
    chunks = []
    chunk_size = self.settings.max_chunk_size  # e.g., 512 tokens
    overlap = self.settings.overlap_size       # e.g., 50 tokens
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        chunk_text = text[start:end]
        # Avoid mid-word breaks
        if end < len(text) and not text[end].isspace():
            last_space = chunk_text.rfind(' ')
            if last_space > chunk_size // 2:
                end = start + last_space
                chunk_text = text[start:end]
        chunks.append(Chunk(content=chunk_text.strip(), ...))
        if end >= len(text):
            break  # final chunk emitted; stepping back by overlap would loop forever
        start = end - overlap
    return chunks
Recursive Chunking: Respects document structure by splitting on semantic boundaries (paragraph breaks, section headers, sentence ends) before falling back to smaller units.
SEPARATORS = [
    "\n\n\n",  # Multiple blank lines (major sections)
    "\n\n",    # Paragraph breaks
    "\n",      # Line breaks
    ". ",      # Sentence ends
    ", ",      # Clause boundaries
    " ",       # Word boundaries (last resort)
]
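The separator hierarchy can be applied with a greedy splitter: try the coarsest separator first and recurse into any piece that is still too large. A condensed sketch (chunk sizes in characters rather than tokens, for simplicity):

```python
SEPARATORS = ["\n\n\n", "\n\n", "\n", ". ", ", ", " "]

def recursive_split(text, max_size=200, separators=SEPARATORS):
    """Split on the coarsest boundary that applies, recursing into
    oversized pieces with progressively finer separators."""
    if len(text) <= max_size:
        return [text] if text.strip() else []
    if not separators:
        # Last resort: hard split at max_size.
        return [text[i:i + max_size] for i in range(0, len(text), max_size)]
    sep, rest = separators[0], separators[1:]
    pieces = [p for p in text.split(sep) if p.strip()]
    if len(pieces) <= 1:
        # This separator did not divide the text; try the next finer one.
        return recursive_split(text, max_size, rest)
    chunks = []
    for piece in pieces:
        chunks.extend(recursive_split(piece, max_size, rest))
    return chunks
```

A production splitter would additionally re-merge undersized neighbors and re-attach separators; this sketch shows only the descent through the hierarchy.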
Semantic Chunking: Uses embedding similarity to detect topic boundaries. Adjacent sentences with low similarity indicate a topic shift and chunking boundary.
Recursive-Semantic Hybrid: Applies recursive chunking first, then refines large chunks using semantic analysis. This balances structural preservation with topic coherence.
Our default configuration uses recursive-semantic chunking with:
- Maximum chunk size: 512 tokens
- Minimum chunk size: 100 tokens
- Overlap: 50 tokens
- Similarity threshold: 0.75
Stage 3: Entity Extraction
The Entity Extractor identifies domain-relevant entities using a hybrid approach combining spaCy NER with pattern matching and optional GLiNER zero-shot extraction.
Domain patterns capture industrial vocabulary:
TRADE_SKILLS_PATTERNS = {
    "equipment": [
        "HVAC", "compressor", "condenser", "evaporator", "chiller",
        "CRAC", "CRAH", "CDU", "cooling tower", "VFD", ...
    ],
    "symptom": [
        "noise", "vibration", "leak", "overheating", "short cycling",
        "high head pressure", "low suction pressure", ...
    ],
    "procedure": [
        "maintenance", "troubleshooting", "calibration", "evacuation",
        "refrigerant recovery", "PM", ...
    ],
    "measurement": [
        "superheat", "subcooling", "discharge pressure", "CFM", ...
    ],
    "refrigerant": [
        "R-410A", "R-22", "R-134a", "R-454B", ...
    ],
}
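The pattern-matching half of this extractor can be sketched as lookup over lowercased text; the production system layers spaCy NER and GLiNER on top, and the abbreviated pattern table below is illustrative:

```python
# Abbreviated pattern table for illustration only.
PATTERNS = {
    "equipment": ["compressor", "condenser", "chiller", "crac", "vfd"],
    "symptom": ["short cycling", "high head pressure", "vibration", "leak"],
    "refrigerant": ["r-410a", "r-22", "r-134a"],
}

def extract_entities(text):
    """Return (entity, type) pairs for every curated pattern found.
    Longer patterns are checked first so 'high head pressure' is
    preferred over any shorter overlapping term."""
    found, lowered = [], text.lower()
    for etype, terms in PATTERNS.items():
        for term in sorted(terms, key=len, reverse=True):
            if term in lowered:
                found.append((term, etype))
    return found
```

Plain substring matching is crude (it has no word boundaries); the point is only to show how curated vocabulary complements statistical NER.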
Extracted entities serve multiple purposes:
- Metadata enrichment for filtered search
- Seeds for knowledge graph traversal
- Query expansion candidates
- Relationship extraction anchors
Stage 4: Relationship Extraction
Beyond entities, we extract relationships between entities that appear in the same context. Pattern-based rules identify diagnostic relationships:
DIAGNOSTIC_PATTERNS = {
    "HAS_SYMPTOM": ["shows", "exhibits", "experiencing", "symptom of"],
    "CAUSED_BY": ["caused by", "due to", "result of", "because of"],
    "RESOLVED_BY": ["fixed by", "resolved by", "corrected by"],
    "REQUIRES_SKILL": ["requires", "needs", "prerequisite"],
    "APPLIES_TO": ["applies to", "for", "used with"],
}
Type-based inference supplements pattern matching:
TYPE_RELATIONS = {
    ("equipment", "symptom"): "HAS_SYMPTOM",
    ("symptom", "cause"): "CAUSED_BY",
    ("cause", "solution"): "RESOLVED_BY",
    ("procedure", "equipment"): "APPLIES_TO",
}
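Type-based inference amounts to: for every ordered pair of co-occurring entities, emit the relation the type table prescribes, if any. A sketch (the table is repeated for self-containment, and the (name, type) entity shape mirrors the extraction stage):

```python
TYPE_RELATIONS = {
    ("equipment", "symptom"): "HAS_SYMPTOM",
    ("symptom", "cause"): "CAUSED_BY",
    ("cause", "solution"): "RESOLVED_BY",
    ("procedure", "equipment"): "APPLIES_TO",
}

def infer_relations(entities):
    """entities: (name, type) pairs found in the same chunk.
    Returns (head, relation, tail) triples implied by the type table."""
    triples = []
    for head_name, head_type in entities:
        for tail_name, tail_type in entities:
            if head_name == tail_name:
                continue
            relation = TYPE_RELATIONS.get((head_type, tail_type))
            if relation:
                triples.append((head_name, relation, tail_name))
    return triples
```

Because only whitelisted type pairs produce triples, co-occurrence noise (equipment next to unrelated equipment, say) yields nothing.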
4.3 Retrieval Architecture
The retrieval system implements hybrid search combining vector similarity with keyword matching.
Vector Store: Qdrant
We selected Qdrant as our vector database for several reasons:
- Native support for multiple vector fields per document
- Efficient filtering on metadata fields
- Horizontal scaling with sharding
- Active development and community
Document chunks are indexed with their embeddings and metadata:
# Index structure
{
    "id": "chunk-uuid",
    "vector": [0.12, -0.34, ...],  # 384-1024 dimensions
    "payload": {
        "text": "The compressor discharge temperature...",
        "source": "carrier-30xa-manual.pdf",
        "page": 42,
        "section": "Troubleshooting",
        "entities": ["compressor", "discharge temperature"],
        "document_type": "maintenance_manual",
    }
}
Keyword Search: Meilisearch
Meilisearch provides typo-tolerant keyword search for exact term matching. This complements vector search by catching technical terms that embedding models may not handle well:
# Keyword search
results = meili.index("documents").search(
    "R-410A superheat adjustment",
    {
        "limit": 20,
        "showRankingScore": True,
        "filter": "document_type = 'procedure'"
    }
)
Hybrid Fusion Algorithms
Combining vector and keyword results requires careful score fusion. We implement four algorithms:
Reciprocal Rank Fusion (RRF): Combines results based on rank rather than score, making it robust to score distribution differences between retrievers.
def _rrf_fusion(self, vector_results, keyword_results, k=60):
    scores = {}
    for result in vector_results:
        scores[result.id] = scores.get(result.id, 0) + 1 / (k + result.rank)
    for result in keyword_results:
        scores[result.id] = scores.get(result.id, 0) + 1 / (k + result.rank)
    return sorted(scores.items(), key=lambda x: x[1], reverse=True)
Weighted Fusion: Normalizes scores and applies configurable weights to each retriever.
Convex Combination: A special case of weighted fusion where weights sum to 1, providing a convex blend of scores.
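Weighted fusion and its convex special case can be sketched as min-max normalization followed by a weighted sum. For self-containment this sketch takes plain `{doc_id: score}` dicts rather than result objects:

```python
def _minmax(results):
    """Normalize {doc_id: score} to [0, 1]; a degenerate set where all
    scores are equal maps every document to 0.0."""
    if not results:
        return {}
    lo, hi = min(results.values()), max(results.values())
    span = (hi - lo) or 1.0
    return {doc_id: (s - lo) / span for doc_id, s in results.items()}

def weighted_fusion(vector_results, keyword_results, w_vector=0.7, w_keyword=0.3):
    """Weighted fusion over normalized scores. When the weights sum to 1,
    as they do by default, this is exactly the convex-combination variant."""
    v, k = _minmax(vector_results), _minmax(keyword_results)
    fused = {}
    for doc_id, score in v.items():
        fused[doc_id] = fused.get(doc_id, 0.0) + w_vector * score
    for doc_id, score in k.items():
        fused[doc_id] = fused.get(doc_id, 0.0) + w_keyword * score
    return sorted(fused.items(), key=lambda x: x[1], reverse=True)
```

Normalization is what makes the weights meaningful: raw vector similarities (roughly 0-1) and raw keyword scores (unbounded) are not directly comparable.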
Distribution-Based Score Fusion (DBSF): Normalizes scores using z-score transformation to account for different score distributions.
def _dbsf_fusion(self, vector_results, keyword_results):
    # Z-score normalize each result set
    v_scores = [r.score for r in vector_results]
    v_mean, v_std = np.mean(v_scores), np.std(v_scores) or 1
    k_scores = [r.score for r in keyword_results]
    k_mean, k_std = np.mean(k_scores), np.std(k_scores) or 1
    scores = {}
    for r in vector_results:
        z = (r.score - v_mean) / v_std
        scores[r.id] = scores.get(r.id, 0) + z
    for r in keyword_results:
        z = (r.score - k_mean) / k_std
        scores[r.id] = scores.get(r.id, 0) + z
    return sorted(scores.items(), key=lambda x: x[1], reverse=True)
Our default configuration uses RRF with k=60, which provides robust fusion without requiring score calibration.
Query Expansion
The Query Expander enhances recall by adding related terms to the query. We implement multiple expansion methods:
Synonym Expansion: Adds domain-specific synonyms from a curated vocabulary.
DOMAIN_SYNONYMS = {
    "chiller": ["cooling unit", "refrigeration unit", "cooler"],
    "noise": ["sound", "humming", "buzzing", "rattling", "squealing"],
    "maintenance": ["service", "PM", "preventive maintenance"],
}
Abbreviation Expansion: Expands technical abbreviations.
ABBREVIATIONS = {
    "crac": "computer room air conditioner",
    "vfd": "variable frequency drive",
    "txv": "thermostatic expansion valve",
}
Embedding-Based Expansion: Finds semantically similar terms using vector similarity.
Pseudo-Relevance Feedback (PRF): Extracts terms from top initial results to expand the query.
Hybrid Expansion: Combines synonym and embedding expansion for maximum recall.
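Combining the synonym and abbreviation tables into an expander is straightforward. This sketch (using abbreviated copies of the tables above) appends expansions rather than replacing the original terms, so exact matches on the user's wording still score:

```python
DOMAIN_SYNONYMS = {
    "chiller": ["cooling unit", "refrigeration unit"],
    "maintenance": ["service", "preventive maintenance"],
}
ABBREVIATIONS = {
    "crac": "computer room air conditioner",
    "txv": "thermostatic expansion valve",
}

def expand_query(query):
    """Append synonym and abbreviation expansions for any term that
    appears in the query; the original query text is kept intact."""
    expansions = []
    lowered = query.lower()
    for term, synonyms in DOMAIN_SYNONYMS.items():
        if term in lowered:
            expansions.extend(synonyms)
    for abbr, full in ABBREVIATIONS.items():
        if abbr in lowered.split():  # whole-word match for abbreviations
            expansions.append(full)
    return query if not expansions else f"{query} ({'; '.join(expansions)})"
```

Embedding-based and PRF expansion plug into the same shape: each method contributes candidate terms, and the expander appends them to the query.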
4.4 Reranking Pipeline
Initial retrieval typically returns 50-100 candidates that require refinement. The Reranking Pipeline applies more expensive but more accurate models to reorder results:
Cross-Encoder Reranking
Cross-encoders jointly encode the query and document, enabling direct relevance scoring. We support multiple models:
RERANKER_MODELS = {
    "cross-encoder": "cross-encoder/ms-marco-MiniLM-L-6-v2",
    "cross-encoder-large": "cross-encoder/ms-marco-MiniLM-L-12-v2",
    "bge-reranker": "BAAI/bge-reranker-large",
}
The cross-encoder processes query-document pairs:
def _rerank_cross_encoder(self, query, documents, model_name="cross-encoder/ms-marco-MiniLM-L-6-v2"):
    cross_encoder = self._load_cross_encoder(model_name)
    # Create query-document pairs
    pairs = [[query, doc.text] for doc in documents]
    # Score all pairs
    scores = cross_encoder.predict(pairs, batch_size=32)
    # Sort by reranking score
    ranked = sorted(zip(documents, scores), key=lambda x: x[1], reverse=True)
    return ranked
Cohere Reranking
For production deployments, Cohere's API-based reranker offers strong performance without local GPU requirements:
async def _rerank_cohere(self, query, documents):
    response = self.cohere_client.rerank(
        query=query,
        documents=[doc.text for doc in documents],
        model="rerank-english-v2.0",
        top_n=len(documents)
    )
    return [(documents[r.index], r.relevance_score) for r in response.results]
FlashRank
FlashRank provides a lightweight, fast reranker suitable for edge deployment:
def _rerank_flashrank(self, query, documents):
    request = FlashRerankRequest(
        query=query,
        passages=[{"id": doc.id, "text": doc.text} for doc in documents]
    )
    return self.flashrank_model.rerank(request)
LLM-Based Reranking
For maximum accuracy on domain-specific content, we can use the LLM itself to score relevance:
async def _rerank_llm(self, query, documents):
    # Number each document so the model's score array can be aligned
    formatted_documents = "\n".join(f"[{i}] {doc.text}" for i, doc in enumerate(documents))
    prompt = f"""Rate the relevance of each document to the query (0-10).
Query: {query}
Documents:
{formatted_documents}
Respond with only a JSON array of scores: [8, 5, 9, ...]"""
    response = await self.llm.complete(prompt)
    scores = json.loads(response)
    return sorted(zip(documents, scores), key=lambda x: x[1], reverse=True)
Ensemble Reranking
Our ensemble approach combines multiple rerankers using RRF:
async def _rerank_ensemble(self, query, documents):
    cross_encoder_ranked = self._rerank_cross_encoder(query, documents)
    flashrank_ranked = self._rerank_flashrank(query, documents)
    # RRF fusion of rankings
    scores = {}
    for rank, (doc, _) in enumerate(cross_encoder_ranked):
        scores[doc.id] = 1 / (60 + rank)
    for rank, (doc, _) in enumerate(flashrank_ranked):
        scores[doc.id] = scores.get(doc.id, 0) + 1 / (60 + rank)
    return sorted(documents, key=lambda d: scores[d.id], reverse=True)
4.5 Knowledge Graph Integration
The knowledge graph captures structured relationships that complement unstructured document retrieval.
Neo4j Schema
Our graph schema models industrial domain entities:
// Node types
(:Equipment {name, type, manufacturer, model, location})
(:Symptom {name, description, severity})
(:Cause {name, description, probability})
(:Solution {name, description, difficulty, estimatedTime})
(:Procedure {name, type, steps, safetyRequirements})
(:Skill {name, level, prerequisites})
// Relationships
(equipment)-[:HAS_SYMPTOM]->(symptom)
(symptom)-[:CAUSED_BY]->(cause)
(cause)-[:RESOLVED_BY]->(solution)
(solution)-[:REQUIRES_SKILL]->(skill)
(procedure)-[:APPLIES_TO]->(equipment)
(skill)-[:PREREQUISITE_OF]->(skill)
GraphRAG Retrieval
The GraphRAG module implements graph-enhanced retrieval:
async def retrieve_graph_context(self, query, max_hops=2, max_entities=10):
    # 1. Extract entities from query
    entities = await self._extract_entities(query)
    # 2. BFS traversal from seed entities
    graph_entities, relationships = await self._traverse_graph(
        entities=entities,
        max_hops=max_hops,
        max_entities=max_entities
    )
    # 3. Get diagnostic paths for equipment/symptoms
    diagnostic_paths = await self._get_diagnostic_paths(entities)
    # 4. Find related procedures
    related_procedures = await self._get_related_procedures(entities)
    # 5. Build context text
    context_text = self._build_context_text(
        graph_entities, relationships, diagnostic_paths, related_procedures
    )
    return GraphContext(
        entities=graph_entities,
        relationships=relationships,
        diagnostic_paths=diagnostic_paths,
        related_procedures=related_procedures,
        context_text=context_text
    )
Hybrid Vector-Graph Retrieval
Graph-enhanced mode combines vector search with graph traversal:
async def hybrid_retrieve(self, query, collection, top_k=10, graph_weight=0.3):
    # Parallel retrieval
    vector_task = self._vector_retrieve(query, collection, top_k)
    graph_task = self.retrieve_graph_context(query)
    vector_results, graph_context = await asyncio.gather(vector_task, graph_task)
    # Boost vector results that mention graph entities
    graph_entity_names = {e.name.lower() for e in graph_context.entities}
    for result in vector_results:
        text_lower = result.text.lower()
        matches = sum(1 for name in graph_entity_names if name in text_lower)
        if matches > 0:
            result.score += graph_weight * matches * 0.1
            result.graph_aligned = True
    # Re-sort by boosted scores
    vector_results.sort(key=lambda x: x.score, reverse=True)
    return vector_results, graph_context
Diagnostic Reasoning Paths
For troubleshooting queries, the graph provides structured reasoning chains:
Equipment: Centrifugal Chiller
|
+-- HAS_SYMPTOM --> High Discharge Pressure
    |
    +-- CAUSED_BY --> Dirty Condenser (probability: 0.7)
    |   |
    |   +-- RESOLVED_BY --> Clean Condenser Coils
    |       |
    |       +-- REQUIRES_SKILL --> Basic HVAC Maintenance
    |
    +-- CAUSED_BY --> Non-Condensables (probability: 0.2)
        |
        +-- RESOLVED_BY --> Purge Non-Condensables
            |
            +-- REQUIRES_SKILL --> Refrigeration Certified
This structured knowledge complements unstructured document retrieval by providing explicit causal chains and skill requirements.
4.6 Generation and Grounding
The final stage synthesizes retrieved context into a coherent, grounded response.
LLM Routing with LiteLLM
We use LiteLLM as a unified interface to multiple LLM providers:
# Provider configuration
PROVIDERS = {
    "openai": ["gpt-4o", "gpt-4o-mini"],
    "anthropic": ["claude-sonnet-4-20250514", "claude-3-haiku"],
    "ollama": ["llama3.2:8b", "mistral:7b"],
}

async def _generate(self, messages, model=None, temperature=0.7):
    model = model or self.settings.default_model
    # client: a shared async HTTP client (e.g., httpx.AsyncClient)
    response = await client.post(
        f"{self.settings.litellm_url}/v1/chat/completions",
        json={
            "model": model,
            "messages": messages,
            "temperature": temperature,
            "max_tokens": self.settings.max_output_tokens,
        }
    )
    return response.json()["choices"][0]["message"]["content"]
Prompt Construction
The prompt combines system instructions, retrieved context, and the user query:
def _build_prompt(self, query, context, system_prompt=None):
    system = system_prompt or """You are an expert industrial technician assistant.
Answer questions based on the provided context. If the context doesn't contain
sufficient information, say so rather than guessing. Always cite your sources
using [1], [2], etc."""
    messages = [{"role": "system", "content": system}]
    user_content = f"""Context:
{context}
Question: {query}
Provide a comprehensive answer based on the context above. Cite sources."""
    messages.append({"role": "user", "content": user_content})
    return messages
Context Assembly with Citations
Retrieved chunks are formatted with citation markers for source attribution:
def _build_context(self, chunks, graph_context=""):
    context_parts = []
    for i, chunk in enumerate(chunks):
        source_info = f" [Source: {chunk.source}]" if chunk.source else ""
        graph_badge = " [Graph]" if chunk.metadata.get("graph_aligned") else ""
        context_parts.append(f"[{i+1}]{source_info}{graph_badge}\n{chunk.text}")
    context = "\n\n".join(context_parts)
    # Prepend graph context if available
    if graph_context:
        context = f"**Knowledge Graph Context:**\n{graph_context}\n\n---\n\n**Retrieved Documents:**\n{context}"
    return context
Hallucination Mitigation
We employ multiple strategies to reduce hallucination:
- Explicit Instructions: System prompts explicitly instruct the model to acknowledge uncertainty and avoid fabrication.
- Citation Requirement: Requiring citations forces the model to ground claims in retrieved evidence.
- Confidence Assessment: Post-generation analysis checks whether claims are supported by citations.
- Temperature Control: Lower temperatures (0.3-0.5) reduce creative but unfounded responses.
- Structured Output: For critical responses, we use structured output formats (JSON mode) that constrain generation.
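The citation-requirement and confidence-assessment strategies combine into a cheap post-generation check: flag sentences that assert something but carry no citation marker. A heuristic sketch (the sentence splitter and the hedge-detection pattern are deliberately crude assumptions):

```python
import re

CITATION = re.compile(r"\[\d+\]")

def uncited_sentences(response):
    """Return sentences that assert something but carry no [n] citation.
    Splitting on terminal punctuation is a rough heuristic."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", response) if s.strip()]
    flagged = []
    for sentence in sentences:
        if CITATION.search(sentence):
            continue
        # Skip hedges/refusals, which legitimately need no citation.
        if re.search(r"\b(not sure|insufficient|cannot find|unclear)\b", sentence, re.I):
            continue
        flagged.append(sentence)
    return flagged
```

Responses with a high ratio of flagged sentences can be routed to regeneration at lower temperature, or escalated rather than shown to the technician.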
5. Embedding Strategy
Embedding model selection significantly impacts retrieval quality. This section details our embedding architecture.
5.1 Multi-Model Support
The Embedding Service supports multiple embedding models to balance quality, latency, and cost:
MODEL_DIMENSIONS = {
    "all-MiniLM-L6-v2": 384,          # Fast, good baseline
    "all-mpnet-base-v2": 768,         # Better quality
    "intfloat/e5-large-v2": 1024,     # State-of-the-art
    "BAAI/bge-large-en-v1.5": 1024,   # Strong alternative
    "text-embedding-3-small": 1536,   # OpenAI API
    "text-embedding-3-large": 3072,   # OpenAI API (highest quality)
}
Model Selection Guidelines:
- Development/Testing: all-MiniLM-L6-v2 (fast, small, good enough for iteration)
- Production Baseline: all-mpnet-base-v2 (good quality/speed balance)
- High Accuracy: e5-large-v2 or bge-large-en-v1.5 (best open-source)
- Maximum Quality: text-embedding-3-large (best overall, requires API)
5.2 Instruction-Tuned Embeddings
Modern embedding models often benefit from instruction prefixes:
def _embed_local(self, texts, model, normalize=True):
    # E5 models expect a "query: " prefix
    if model == "intfloat/e5-large-v2":
        texts = [f"query: {t}" for t in texts]
    # BGE models expect an instruction prefix
    elif model == "BAAI/bge-large-en-v1.5":
        texts = [f"Represent this sentence for searching relevant passages: {t}"
                 for t in texts]
    return self.sentence_model.encode(texts, normalize_embeddings=normalize)
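Note that E5 is an asymmetric model: queries take the `query: ` prefix, while documents being indexed take a `passage: ` prefix. The snippet above applies the query prefix unconditionally; a sketch that distinguishes the two sides (helper name is illustrative):

```python
def e5_prefix(texts, is_query: bool):
    """Apply the E5 instruction prefix for the correct side of retrieval.

    E5 models are trained asymmetrically: 'query: ' for search queries,
    'passage: ' for the documents being indexed. Mixing them up works
    but measurably degrades retrieval quality.
    """
    prefix = "query: " if is_query else "passage: "
    return [f"{prefix}{t}" for t in texts]
```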
5.3 GPU Acceleration
Embedding generation benefits significantly from GPU acceleration:
def _get_device(self):
    if self.settings.use_gpu and torch.cuda.is_available():
        return "cuda"
    return "cpu"

# Model loading with device placement
self.model = SentenceTransformer(model_name, device=self._device)
Benchmark comparisons (10,000 text chunks, batch_size=32):
| Model    | CPU Time | GPU Time | Speedup |
|----------|----------|----------|---------|
| MiniLM   | 45s      | 3s       | 15x     |
| MPNet    | 120s     | 8s       | 15x     |
| E5-Large | 300s     | 18s      | 17x     |
5.4 Batch Processing
Efficient batch processing is essential for initial indexing of large document collections:
embeddings = model.encode(
    texts,
    batch_size=32,              # Tune for GPU memory
    normalize_embeddings=True,
    show_progress_bar=False,
    convert_to_numpy=True
)
For very large collections, we implement streaming batch processing to avoid memory exhaustion:
async def embed_collection(self, documents, batch_size=1000):
    for i in range(0, len(documents), batch_size):
        batch = documents[i:i + batch_size]
        embeddings = await self.embed(batch)
        await self.index_batch(batch, embeddings)
        # Yield control to prevent blocking the event loop
        await asyncio.sleep(0)
6. Evaluation Framework
Evaluating RAG systems requires metrics that assess both retrieval and generation quality. Unlike traditional information retrieval, RAG evaluation must consider the entire pipeline: were the right documents retrieved? Was the generation faithful to those documents? Did the answer actually help the user?
This section presents our comprehensive evaluation framework spanning offline benchmarks, online monitoring, and human evaluation protocols.
6.1 Retrieval Metrics
Mean Reciprocal Rank (MRR): Measures the average position of the first relevant result.
MRR = (1/N) * sum(1/rank_i)
Normalized Discounted Cumulative Gain (NDCG): Accounts for graded relevance with position-based discounting.
DCG = sum(relevance_i / log2(i + 1))
NDCG = DCG / IDCG # Normalized by ideal ranking
Precision@K and Recall@K: Fraction of retrieved documents that are relevant, and fraction of relevant documents that are retrieved.
Precision@K = |relevant in top K| / K
Recall@K = |relevant in top K| / |total relevant|
Hit Rate: Whether any relevant document appears in the top-K results.
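The four retrieval metrics above can be computed for a single query in a few lines. A minimal sketch (averaging over the full query set is left to the caller):

```python
def retrieval_metrics(ranked_ids, relevant_ids, k=5):
    """Compute MRR, Precision@K, Recall@K, and Hit Rate for one query.

    ranked_ids: document ids in retrieved order.
    relevant_ids: ground-truth relevant ids for the query.
    """
    relevant = set(relevant_ids)
    top_k = ranked_ids[:k]
    hits_in_k = sum(1 for d in top_k if d in relevant)

    # Reciprocal rank of the first relevant result anywhere in the ranking
    rr = 0.0
    for i, d in enumerate(ranked_ids, start=1):
        if d in relevant:
            rr = 1.0 / i
            break

    return {
        "mrr": rr,  # averaged across queries in practice
        "precision_at_k": hits_in_k / k,
        "recall_at_k": hits_in_k / len(relevant) if relevant else 0.0,
        "hit_rate": 1.0 if hits_in_k > 0 else 0.0,
    }
```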
6.2 Generation Metrics (RAGAS)
We integrate the RAGAS framework for RAG-specific evaluation:
Faithfulness: Measures whether generated claims are supported by the retrieved context.
def evaluate_faithfulness(answer, context):
    # Extract claims from the answer
    claims = extract_claims(answer)
    # Check each claim against the retrieved context
    supported = sum(1 for claim in claims if is_supported(claim, context))
    return supported / len(claims) if claims else 0.0
Answer Relevancy: Measures whether the answer addresses the question.
def evaluate_relevancy(question, answer):
    # Generate questions that the answer would address
    generated_questions = generate_questions_from_answer(answer)
    # Compute similarity to the original question
    similarities = [similarity(q, question) for q in generated_questions]
    return mean(similarities)
Context Precision: Measures whether retrieved documents are relevant to the question.
Context Recall: Measures whether all information needed for the answer was retrieved.
6.3 Domain-Specific Evaluation
Industrial applications require domain-specific evaluation beyond generic metrics:
Factual Accuracy: Expert review of technical correctness.
HVAC_FACTS_TEST_SET = [
    {
        "question": "What is the ideal superheat for R-410A?",
        "expected_range": "8-12 degrees Fahrenheit",
        "source": "ASHRAE Handbook"
    },
    # ...
]
Safety Compliance: Verify that safety-critical information is accurate.
Procedure Completeness: Ensure procedural answers include all required steps.
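A minimal harness for running the factual test set is sketched below. It assumes a `rag_query` callable that returns the answer string, and it uses deliberately strict verbatim matching of the expected value as a pass criterion; the production pipeline combines this with numeric-range parsing and expert review.

```python
def run_factual_eval(test_set, rag_query):
    """Score a factual QA test set against a RAG pipeline.

    An item passes when the expected value appears verbatim in the answer,
    a strict proxy that under-counts correct paraphrases on purpose.
    """
    if not test_set:
        return 0.0
    passed = 0
    for item in test_set:
        answer = rag_query(item["question"])
        if item["expected_range"].lower() in answer.lower():
            passed += 1
    return passed / len(test_set)
```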
6.4 Human Evaluation
Automated metrics provide scale but cannot fully assess answer quality for industrial applications. We supplement automated evaluation with structured human review:
Expert Review Protocol: Domain experts (certified HVAC technicians) review a sample of responses for:
- Technical accuracy: Are the facts correct?
- Safety completeness: Are relevant safety warnings included?
- Procedural correctness: Are steps in the right order with no omissions?
- Appropriate confidence: Does the system appropriately acknowledge uncertainty?
Inter-Rater Reliability: Multiple experts review the same responses to establish consistency. Cohen's kappa measures agreement levels, with recalibration when agreement drops below 0.7.
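For reference, Cohen's kappa for two raters can be computed directly from the agreement statistics, as in this self-contained sketch:

```python
def cohens_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two raters scoring the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement
    and p_e is the agreement expected by chance from each rater's marginals.
    """
    n = len(ratings_a)
    labels = set(ratings_a) | set(ratings_b)
    p_o = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    p_e = sum(
        (ratings_a.count(label) / n) * (ratings_b.count(label) / n)
        for label in labels
    )
    if p_e == 1.0:
        return 1.0  # both raters are constant and identical
    return (p_o - p_e) / (1 - p_e)
```

Values below the 0.7 recalibration threshold indicate that raters interpret the rubric differently and need alignment before their labels are trusted.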
Feedback Loop: Expert corrections are incorporated into the test set for regression testing and potentially into the knowledge base for future retrieval.
6.5 Continuous Evaluation
We implement continuous evaluation in production:
Comprehensive Logging: All queries, retrieved documents, generated responses, and latency metrics are logged to a centralized store. This enables post-hoc analysis of any production issue.
Statistical Sampling: A configurable percentage of queries (default 5%) are flagged for human review. Sampling is stratified by query complexity to ensure coverage across difficulty levels.
User Feedback Collection: Users can provide explicit feedback:
- Thumbs up/down on overall response quality
- Flag specific claims as incorrect
- Provide corrected information for training
Implicit Feedback Signals: We track behavioral signals that indicate satisfaction:
- Follow-up questions (suggests incomplete first answer)
- Time spent reading response
- Copy/paste actions (suggests useful content)
- Escalation to human expert (suggests insufficient AI response)
Regression Testing: A curated set of known-good query-response pairs is evaluated weekly. Any degradation in RAGAS scores or retrieval metrics triggers investigation.
A/B Testing Framework: New models, prompts, or retrieval configurations are deployed to a subset of traffic. Statistical significance testing determines whether changes improve key metrics before full rollout.
6.6 Evaluation Benchmarks
We maintain domain-specific benchmark datasets:
HVAC Factual QA (200 questions): Factual questions about HVAC systems with known correct answers from authoritative sources.
Troubleshooting Scenarios (50 scenarios): Multi-step diagnostic scenarios where the correct answer requires reasoning over symptoms, causes, and solutions.
Procedure Verification (30 procedures): Procedural questions where completeness and step ordering matter.
Safety Edge Cases (100 questions): Questions designed to elicit potentially dangerous advice, testing the system's safety guardrails.
Benchmark evaluation occurs weekly, with alerts when any metric drops more than 5% from baseline.
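The weekly alerting rule can be expressed as a small comparison over named metrics. A sketch, assuming metrics are kept as name-to-score dictionaries:

```python
def regression_alerts(baseline: dict, current: dict, tolerance: float = 0.05):
    """Flag any benchmark metric that dropped more than `tolerance`
    (relative) from its baseline, per the weekly evaluation policy."""
    alerts = []
    for name, base in baseline.items():
        cur = current.get(name)
        if cur is None or base == 0:
            continue  # missing or degenerate metric: handled elsewhere
        drop = (base - cur) / base
        if drop > tolerance:
            alerts.append((name, round(drop, 3)))
    return alerts
```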
7. Deployment Considerations
Production deployment of RAG systems requires attention to scalability, latency, cost optimization, and operational concerns. Industrial deployments face additional constraints including data residency requirements, network isolation, and integration with existing enterprise systems.
7.1 Scalability Architecture
Our microservices architecture enables independent scaling of components:
[Load Balancer]
|
v
[API Gateway] ---> [Rate Limiter]
|
+---> [RAG Orchestrator] (3 replicas)
| |
| +---> [Hybrid Search] (2 replicas)
| | |
| | +---> [Qdrant] (3 shards)
| | +---> [Meilisearch]
| |
| +---> [Reranking] (2 replicas, GPU)
| |
| +---> [LiteLLM] ---> [LLM Providers]
|
+---> [Document Processor] (2 replicas)
+---> [Embedding Service] (2 replicas, GPU)
+---> [Context Graph] ---> [Neo4j] (cluster)
Scaling Guidelines:
- Embedding Service: Scale based on indexing throughput needs
- Hybrid Search: Scale based on query volume
- Reranking: Scale based on query volume (GPU-intensive)
- RAG Orchestrator: Scale based on concurrent users
- Qdrant: Shard based on collection size
7.2 Latency Optimization
End-to-end latency targets for different modes:
| Mode           | Target P50 | Target P99 |
|----------------|------------|------------|
| Simple         | 1s         | 3s         |
| Advanced       | 2s         | 5s         |
| Graph-Enhanced | 3s         | 8s         |
| Agentic        | 5s         | 15s        |
Optimization Techniques:
- Parallel Retrieval: Execute vector and keyword search simultaneously.

  vector_task = self.vector_search(query, ...)
  keyword_task = self.keyword_search(query, ...)
  vector_results, keyword_results = await asyncio.gather(vector_task, keyword_task)

- Embedding Caching: Cache query embeddings to avoid redundant computation.
- Connection Pooling: Reuse HTTP connections to backend services.
- Response Streaming: Stream LLM output to reduce perceived latency.
- Speculative Execution: Begin LLM generation with partial context while reranking completes.
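The embedding-caching technique above can be sketched with a standard LRU cache keyed on query text. The wrapper below is illustrative (function names are our own, not a library API):

```python
from functools import lru_cache

def make_caching_embedder(embed_fn, maxsize=10_000):
    """Wrap an embedding function with an LRU cache keyed on query text,
    so repeated or popular queries skip the embedding model entirely."""
    @lru_cache(maxsize=maxsize)
    def cached(text: str):
        # Tuples are hashable and immutable, safe to share across callers
        return tuple(embed_fn(text))

    def embed(text: str):
        return list(cached(text))

    embed.cache_info = cached.cache_info  # expose hit/miss stats for metrics
    return embed
```

In production the cache would sit in Redis rather than process memory so that all orchestrator replicas share it.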
7.3 Edge Deployment
For offline or low-latency requirements, we support edge deployment:
Model Quantization: Reduce model size for edge devices.
# ONNX conversion with quantization
# ONNX conversion with dynamic quantization (onnxruntime.quantization)
def convert_to_onnx(model_name, output_path, quantize=True):
    model = SentenceTransformer(model_name)
    # Export step elided: the transformer must first be exported to
    # {output_path}/model.onnx (e.g., with the Hugging Face optimum
    # exporter); model.save() alone stores the PyTorch weights.
    model.save(output_path)
    if quantize:
        quantize_dynamic(
            f"{output_path}/model.onnx",
            f"{output_path}/model_quantized.onnx",
            weight_type=QuantType.QInt8
        )
Local LLM: Ollama provides local LLM inference for edge deployments.
Sync Protocol: Bidirectional synchronization keeps edge indices updated.
7.4 Cost Optimization
RAG systems incur costs across multiple dimensions: compute (embedding, reranking, LLM inference), storage (vector indices, document store), and external APIs (LLM providers). Careful optimization can reduce costs by an order of magnitude without sacrificing quality.
Embedding Cost Optimization:
- Use smaller embedding models (MiniLM-384 vs. E5-1024) for initial indexing; quality difference is often negligible for retrieval
- Batch embedding requests to maximize GPU utilization
- Cache query embeddings for frequently-asked questions
- Implement tiered embedding: quick MiniLM for initial filtering, high-quality E5 for final ranking
LLM Cost Optimization:
- Route simple queries to smaller, cheaper models (GPT-4o-mini vs. GPT-4o)
- Implement semantic caching: if a semantically similar query was recently answered, return the cached response
- Compress context: summarize long retrieved passages before sending to the LLM
- Use local models (Ollama with Llama 3) for development and non-critical queries
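The semantic-caching idea above can be sketched with a brute-force cosine scan over cached query embeddings. This is a minimal in-memory illustration; a production version would store entries in the vector database and add TTL eviction.

```python
import math

class SemanticCache:
    """Return a cached answer when a new query's embedding is within a
    cosine-similarity threshold of a previously answered query."""

    def __init__(self, threshold=0.95):
        self.threshold = threshold
        self.entries = []  # list of (embedding, answer) pairs

    @staticmethod
    def _cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb) if na and nb else 0.0

    def get(self, embedding):
        best_answer, best_sim = None, 0.0
        for emb, answer in self.entries:
            sim = self._cosine(embedding, emb)
            if sim > best_sim:
                best_answer, best_sim = answer, sim
        return best_answer if best_sim >= self.threshold else None

    def put(self, embedding, answer):
        self.entries.append((embedding, answer))
```

The threshold controls the precision/hit-rate trade-off: too low and users receive answers to subtly different questions, which is unacceptable in safety-critical contexts.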
Storage Cost Optimization:
- Implement vector quantization to reduce storage requirements (typically 4x compression with minimal quality loss)
- Use tiered storage: hot indices for recent/popular content, cold storage for archival documents
- Deduplicate similar chunks before indexing
- Implement TTL policies for outdated content
Cost Monitoring Dashboard: We track cost per query broken down by component:
| Component         | Cost per 1K Queries | Optimization Potential   |
|-------------------|---------------------|--------------------------|
| Embedding         | $0.02               | Use local models         |
| Vector Search     | $0.01               | Optimize index settings  |
| Reranking         | $0.05               | Reduce candidate count   |
| LLM (GPT-4o)      | $0.50               | Route to smaller models  |
| LLM (GPT-4o-mini) | $0.05               | Cache frequent queries   |
| Total             | $0.63               | Target: $0.10            |
7.5 Operational Concerns
Monitoring: Prometheus metrics for all services enable real-time observability and alerting.
# Metrics exposed by each service
REQUEST_LATENCY = Histogram('request_latency_seconds', 'Request latency')
REQUEST_COUNT = Counter('request_count', 'Request count', ['method', 'status'])
EMBEDDING_DIMENSION = Gauge('embedding_dimension', 'Embedding dimension')
RETRIEVAL_PRECISION = Gauge('retrieval_precision', 'Retrieval precision at K')
LLM_TOKEN_USAGE = Counter('llm_tokens_total', 'Total tokens used', ['model'])
Alerting: Critical alerts for service health and quality degradation:
- Service availability (any service down for >1 minute)
- Latency P99 exceeding SLA (>10s for Advanced mode)
- Error rate exceeding threshold (>1% of queries)
- Quality regression (RAGAS faithfulness <0.8)
- Cost spike (>2x daily average)
Logging: Structured logging with correlation IDs enables end-to-end request tracing. All services emit JSON logs with consistent fields:
{
  "timestamp": "2026-01-15T10:30:45Z",
  "level": "info",
  "service": "rag-orchestrator",
  "correlation_id": "req-abc123",
  "event": "query_completed",
  "mode": "advanced",
  "retrieval_time_ms": 245,
  "generation_time_ms": 1823,
  "chunks_retrieved": 10,
  "chunks_after_rerank": 5
}
Backup and Recovery:
- Vector indices: Daily full backup with hourly incremental
- Graph database: Continuous replication with point-in-time recovery
- Document store: Object versioning with 30-day retention
- Recovery time objective (RTO): 4 hours
- Recovery point objective (RPO): 1 hour
Security Considerations:
- All inter-service communication uses mTLS
- API authentication via JWT with short expiration (15 minutes)
- Rate limiting per tenant and per endpoint
- Query logging excludes PII; PII detection runs before logging
- Audit trail for all document access and modifications
8. Conclusion
8.1 Summary
This paper presented the MuVeraAI RAG architecture, a production system designed for industrial AI applications. Key contributions include:
- Four-Mode Pipeline: Progressive sophistication from simple retrieval to graph-enhanced reasoning, selected based on query complexity.
- Hybrid Search: Combination of vector similarity and keyword matching using four fusion algorithms, with domain-specific query expansion.
- Multi-Model Reranking: Five reranking options (cross-encoder, Cohere, FlashRank, LLM, ensemble) enabling quality/latency trade-offs.
- GraphRAG Integration: Knowledge graph traversal for structured reasoning over equipment relationships and diagnostic paths.
- Industrial Focus: Semantic chunking preserving document structure, domain-aware entity extraction, and safety-conscious generation.
8.2 Lessons Learned
Building production RAG systems taught us several lessons:
Chunking Matters More Than Expected: Poor chunking fragments context and degrades both retrieval and generation. Investing in structure-aware chunking pays dividends.
Hybrid Search Outperforms Pure Vector: Technical domains benefit from exact term matching. Pure vector search misses precise terminology.
Reranking Is Worth the Latency: The quality improvement from cross-encoder reranking justifies the additional latency for most use cases.
Graph Context Is Situationally Valuable: Knowledge graph integration helps significantly for relationship queries but adds overhead for simple factual queries.
Evaluation Is Continuous: One-time evaluation is insufficient. Production systems require continuous monitoring and periodic re-evaluation.
8.3 Trade-offs and Design Decisions
Throughout the development of this architecture, we made several deliberate trade-offs:
Microservices vs. Monolith: We chose a microservices architecture despite the operational complexity because it enables independent scaling, technology heterogeneity (Python for ML services, Node.js for real-time features), and team autonomy. The added latency from inter-service communication is mitigated through connection pooling and parallel execution.
Local vs. API-Based Embedding: We default to local embedding models (MiniLM, E5) rather than API-based embeddings (OpenAI) because industrial deployments often have strict data residency requirements. Local models also eliminate API costs for high-volume indexing. The trade-off is lower embedding quality compared to text-embedding-3-large.
Graph Database vs. Relational: Neo4j was chosen for the knowledge graph despite its operational complexity because graph traversal patterns (find all related equipment within 3 hops) are natural in Cypher but complex in SQL. The trade-off is another database to operate and keep synchronized.
Reranking vs. More Retrieval: We invest in compute-intensive reranking rather than retrieving more candidates because reranking improves precision at the top of results, which directly impacts generation quality. The latency cost (typically 100-300ms) is worthwhile for improved answer quality.
8.4 Future Directions
Our roadmap includes:
Advanced Agentic Capabilities: Multi-agent collaboration for complex troubleshooting scenarios. Multiple specialized agents (diagnostic agent, safety agent, procedure agent) will coordinate through a shared state machine to handle multi-faceted industrial queries.
Multimodal RAG: Integration of image understanding for equipment identification and diagram interpretation. Technicians will be able to photograph equipment and receive contextual information including relevant procedures and maintenance history.
Adaptive Mode Selection: Automatic selection of RAG mode based on query classification. A lightweight classifier will route simple factual queries to Simple mode while directing complex diagnostic queries to Graph-Enhanced mode, optimizing the quality/latency trade-off.
Personalized Retrieval: User-specific relevance based on role, certification level, and interaction history. A technician certified in refrigeration will receive more detailed refrigerant-related content, while a facility manager will see summary information.
Federated Learning: Privacy-preserving model improvement across multiple deployments. Local fine-tuning signals will be aggregated without sharing raw data, enabling continuous improvement while maintaining data isolation.
Real-Time Knowledge Updates: Event-driven document processing that automatically re-indexes modified documents and invalidates affected cache entries, reducing the latency between document updates and query relevance.
Structured Output Extraction: Enhanced generation modes that produce structured outputs (JSON schemas) for integration with downstream systems, enabling automated work order creation, parts ordering, and compliance documentation.
Confidence-Based Routing: Automatic escalation to human experts when the system's confidence falls below configurable thresholds, ensuring that uncertain queries receive appropriate human review rather than potentially incorrect AI responses.
Appendix A: Technology Stack Summary
Our RAG architecture leverages the following technology stack:
Programming Languages:
- Python 3.11 for all ML services (embedding, reranking, entity extraction)
- TypeScript for API Gateway and frontend integrations
- Cypher for Neo4j graph queries
- SQL for metadata and audit logging
ML Frameworks and Libraries:
- sentence-transformers for embedding generation
- PyTorch as the deep learning backend
- spaCy for NLP and named entity recognition
- GLiNER for zero-shot entity extraction
- tiktoken for accurate token counting
Databases and Storage:
- Qdrant for vector similarity search (self-hosted)
- Neo4j for knowledge graph storage and traversal
- Meilisearch for full-text keyword search
- PostgreSQL for relational data and audit logs
- Redis for caching and session management
- MinIO for document object storage (S3-compatible)
Infrastructure:
- Docker and Docker Compose for containerization
- Kubernetes for production orchestration
- Prometheus and Grafana for monitoring
- LiteLLM for LLM provider abstraction
External Services:
- OpenAI API for high-quality LLM inference
- Anthropic API for Claude models
- Cohere API for reranking (optional)
- Ollama for local LLM deployment
Appendix B: Service Architecture Diagram
User Query
  |
  v
API Gateway (Port 8080)
  |
  v
RAG Orchestrator (Port 8604)
  |
  +---> Hybrid Search (Port 8602) ----> Qdrant, Meilisearch
  +---> Reranking (Port 8603) --------> Cross-Encoder models
  +---> Context Graph (Port 8500) ----> Neo4j

Embedding Service (Port 8502)
  |
  v
Document Ingestion
  |
  +---> Document Processor (Port 8501)
  +---> Semantic Chunker (Port 8600)
  +---> Entity Extraction (Port 8601)
Service Interaction Patterns
The architecture employs several interaction patterns to ensure reliability and performance:
Synchronous Request-Response: Used for latency-sensitive operations like query embedding and vector search where the caller needs immediate results.
Asynchronous Processing: Document ingestion uses asynchronous patterns where the upload returns immediately with a job ID, and processing continues in the background with status updates available via polling or webhook.
Streaming: LLM generation uses server-sent events (SSE) to stream partial responses, improving perceived latency for users.
Circuit Breaker: Each service-to-service call implements circuit breaker patterns to prevent cascade failures when downstream services are unavailable.
Retry with Exponential Backoff: Transient failures trigger automatic retries with exponential backoff to handle temporary network issues or rate limiting.
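The retry pattern above, with full jitter to avoid synchronized retry storms across replicas, can be sketched as:

```python
import random
import time

def retry_with_backoff(fn, max_attempts=5, base_delay=0.1, max_delay=5.0,
                       sleep=time.sleep):
    """Retry `fn` on exception with exponential backoff plus full jitter.

    The delay before attempt n is uniform in
    [0, min(max_delay, base_delay * 2**n)], so concurrent callers spread
    their retries instead of hammering a recovering service in lockstep.
    The `sleep` parameter is injectable for testing.
    """
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the failure
            sleep(random.uniform(0, min(max_delay, base_delay * 2 ** attempt)))
```

In the real services this is combined with the circuit breaker: retries apply only to errors classified as transient (timeouts, 429s), never to permanent failures.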
Appendix C: Configuration Reference
Embedding Service Configuration
embedding:
  default_model: "all-MiniLM-L6-v2"
  batch_size: 32
  use_gpu: true
  cache_dir: "/models"
  normalize: true
Hybrid Search Configuration
hybrid_search:
  default_fusion: "rrf"
  rrf_k: 60
  vector_weight: 0.5
  fetch_multiplier: 3
  enable_query_expansion: true
  expansion_method: "hybrid"
Reranking Configuration
reranking:
  default_model: "cross-encoder"
  cross_encoder_model: "cross-encoder/ms-marco-MiniLM-L-6-v2"
  batch_size: 32
  use_gpu: true
  fallback_to_cross_encoder: true
RAG Orchestrator Configuration
rag:
  default_mode: "advanced"
  default_model: "gpt-4o-mini"
  max_output_tokens: 2048
  temperature: 0.7
  top_k: 10
  rerank_top_k: 5
  enable_streaming: true
Appendix D: API Reference
Query Endpoint
POST /api/v1/rag/query
Content-Type: application/json

{
  "query": "How do I troubleshoot high head pressure?",
  "mode": "advanced",
  "collection": "hvac_manuals",
  "top_k": 10,
  "rerank": true,
  "rerank_top_k": 5,
  "include_sources": true,
  "stream": false,
  "llm_model": "gpt-4o-mini",
  "temperature": 0.7
}
Response Format
{
  "answer": "High head pressure can be caused by several factors...",
  "query": "How do I troubleshoot high head pressure?",
  "mode": "advanced",
  "sources": [
    {
      "id": "chunk-uuid-1",
      "text": "High discharge pressure typically indicates...",
      "score": 0.92,
      "source": "carrier-30xa-troubleshooting.pdf",
      "metadata": {
        "page": 45,
        "section": "Troubleshooting"
      }
    }
  ],
  "llm_model": "gpt-4o-mini",
  "retrieval_time_ms": 245,
  "generation_time_ms": 1823,
  "total_time_ms": 2068,
  "tokens_used": 1456
}
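A minimal Python client sketch for this endpoint. The request-builder below simply mirrors the schema documented above; the gateway URL and HTTP library choice are deployment-specific and illustrative.

```python
def build_query_request(query: str, mode: str = "advanced", top_k: int = 10,
                        rerank: bool = True, rerank_top_k: int = 5,
                        collection: str = "hvac_manuals"):
    """Assemble a request body matching the query endpoint schema above."""
    return {
        "query": query,
        "mode": mode,
        "collection": collection,
        "top_k": top_k,
        "rerank": rerank,
        "rerank_top_k": rerank_top_k,
        "include_sources": True,
        "stream": False,
    }

body = build_query_request("How do I troubleshoot high head pressure?")
# POST `body` as JSON to /api/v1/rag/query on your gateway, e.g.:
#   requests.post(f"{gateway_url}/api/v1/rag/query", json=body, timeout=30)
```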
References
- Lewis, P., et al. "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks." NeurIPS 2020.
- Borgeaud, S., et al. "Improving Language Models by Retrieving from Trillions of Tokens." ICML 2022.
- Gao, L., et al. "Precise Zero-Shot Dense Retrieval without Relevance Labels." ACL 2023.
- Wang, L., et al. "Text Embeddings by Weakly-Supervised Contrastive Pre-training." arXiv 2022.
- Es, S., et al. "RAGAS: Automated Evaluation of Retrieval Augmented Generation." arXiv 2023.
- Cormack, G.V., et al. "Reciprocal Rank Fusion Outperforms Condorcet and Individual Rank Learning Methods." SIGIR 2009.
- Nogueira, R., and Cho, K. "Passage Re-ranking with BERT." arXiv 2019.
- ASHRAE. "ASHRAE Handbook: HVAC Applications." 2023.
Document Control
| Version | Date         | Author                  | Changes         |
|---------|--------------|-------------------------|-----------------|
| 1.0     | January 2026 | MuVeraAI Technical Team | Initial release |
Copyright 2026 MuVeraAI. All rights reserved.