RAG Architecture for Industrial AI: A Technical Overview
MuVeraAI Technical Whitepaper Series | Phase 1 | Document P1-04
Version: 1.0 Date: January 2026 Classification: Technical Reference Audience: Technical Architects, AI/ML Engineers, Platform Engineers
Abstract
Retrieval-Augmented Generation (RAG) has emerged as the dominant paradigm for building knowledge-intensive AI applications that require factual grounding. While RAG systems have proven effective in general domains, industrial applications present unique challenges: complex multi-format documents, domain-specific terminology, safety-critical accuracy requirements, and the need for structured reasoning over equipment relationships.
This paper presents the MuVeraAI RAG architecture, a production system designed for industrial workforce training and decision support in data center operations. We describe a four-mode RAG pipeline that progressively adds sophistication based on query complexity: Simple (direct retrieval), Advanced (hybrid search with reranking), Agentic (multi-step reasoning with tool use), and Graph-Enhanced (knowledge graph integration). The architecture addresses industrial-specific challenges including semantic chunking for technical documents, domain-aware entity extraction, multiple fusion algorithms for hybrid search, and graph-based reasoning for equipment diagnostics.
Our implementation leverages a microservices architecture with nine specialized services orchestrated through an asynchronous pipeline. Key contributions include a hybrid search system combining vector similarity with keyword matching using four fusion algorithms (RRF, Weighted, Convex, DBSF), a five-model reranking pipeline with ensemble capabilities, and GraphRAG integration that enables multi-hop reasoning over equipment relationships. We detail our approach to hallucination mitigation, citation injection, and confidence-based escalation for safety-critical industrial contexts.
1. Introduction
1.1 The Case for RAG in Industrial AI
Large Language Models have demonstrated remarkable capabilities in understanding and generating human language. However, their application to industrial domains faces three fundamental limitations:
Knowledge Currency: LLMs are trained on static datasets with knowledge cutoffs that may exclude recent equipment models, updated procedures, or regulatory changes. A data center technician troubleshooting a 2025 chiller model cannot rely on knowledge frozen in 2023.
Factual Grounding: LLMs generate plausible-sounding but potentially incorrect information, a phenomenon colloquially termed "hallucination." In industrial contexts, suggesting an incorrect refrigerant charge or skipping a safety lockout step can result in equipment damage, injury, or death.
Organizational Knowledge: Tribal knowledge, site-specific procedures, and institutional memory exist in documents, maintenance logs, and expert minds rather than in public training data. Fine-tuning cannot efficiently capture this constantly evolving organizational knowledge.
Retrieval-Augmented Generation addresses these limitations by decoupling the knowledge store from the reasoning engine. The LLM's role shifts from knowledge repository to reasoning engine, synthesizing answers from retrieved evidence rather than parametric memory. This architectural separation enables knowledge updates without model retraining, traceable citations for verification, and integration of proprietary organizational knowledge.
1.2 Why Fine-Tuning Is Insufficient
Organizations often consider fine-tuning as an alternative to RAG for domain adaptation. While fine-tuning can improve model behavior and tone, it presents significant drawbacks for industrial knowledge applications:
Update Latency: Fine-tuning requires collecting training data, formatting it appropriately, running training jobs, and deploying updated models. This cycle typically takes days to weeks, making it unsuitable for frequently updated procedural knowledge.
Cost at Scale: Fine-tuning costs scale with both dataset size and update frequency. Organizations with thousands of documents requiring monthly updates face prohibitive training costs.
Catastrophic Forgetting: Fine-tuning on domain data can degrade performance on general capabilities, requiring careful balance between specialization and general reasoning.
No Attribution: Fine-tuned models cannot cite sources for their outputs, eliminating the ability to verify claims against authoritative documents.
RAG provides a more practical approach: new documents can be indexed in minutes, costs scale with storage rather than compute, general capabilities remain intact, and every claim can be traced to source documents.
1.3 Scope and Prerequisites
This paper targets technical architects evaluating RAG architectures and ML engineers implementing industrial AI systems. We assume familiarity with:
- Vector embeddings and similarity search
- Transformer-based language models
- Microservices architecture patterns
- Basic graph database concepts
The MuVeraAI platform focuses on data center operations, specifically HVAC/R (Heating, Ventilation, Air Conditioning, and Refrigeration) systems. Examples throughout this paper reflect this domain, though the architecture generalizes to other industrial contexts including manufacturing, energy, and facilities management.
2. RAG Fundamentals
2.1 The Core Concept
At its essence, RAG combines two distinct phases:
Retrieval Phase: Given a user query, find relevant documents or passages from a knowledge corpus. This typically involves encoding the query into a vector representation and finding documents with similar vectors.
Generation Phase: Given the query and retrieved context, generate a response using a language model. The model synthesizes information from multiple retrieved passages rather than relying solely on parametric knowledge.
The canonical RAG pipeline proceeds as follows:
User Query
     |
     v
[1. Embed Query] --> Query Vector
     |
     v
[2. Vector Search] --> Top-K Similar Documents
     |
     v
[3. Context Assembly] --> Formatted Context + Query
     |
     v
[4. LLM Generation] --> Response with Citations
     |
     v
Final Answer
This simple pipeline, while effective for straightforward queries, proves insufficient for industrial applications where queries may require multi-step reasoning, integration of structured data, or traversal of equipment relationships.
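To make the four stages concrete, here is a minimal end-to-end sketch. The toy corpus, the hand-written query vector, and the plain cosine function are illustrative stand-ins for a real embedding model and vector index:

```python
import math

# Hypothetical stand-ins: a production system would call an embedding
# model and a vector database instead of these toy structures.
KNOWLEDGE_BASE = [
    ("doc-1", [0.9, 0.1, 0.0], "Chiller PM requires isolating power first."),
    ("doc-2", [0.1, 0.8, 0.2], "CRAC filters should be replaced quarterly."),
    ("doc-3", [0.0, 0.2, 0.9], "Superheat is measured at the evaporator outlet."),
]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm if norm else 0.0

def retrieve(query_vector, top_k=2):
    # Step 2: vector search — rank corpus chunks by similarity to the query.
    ranked = sorted(KNOWLEDGE_BASE, key=lambda d: cosine(query_vector, d[1]), reverse=True)
    return ranked[:top_k]

def assemble_context(chunks):
    # Step 3: context assembly with numbered citation markers.
    return "\n\n".join(f"[{i+1}] ({doc_id}) {text}"
                       for i, (doc_id, _, text) in enumerate(chunks))

# Steps 1 and 4 (embedding and generation) are mocked here: the query
# vector is hand-written, and the prompt would be sent to an LLM.
query_vector = [0.85, 0.15, 0.05]
context = assemble_context(retrieve(query_vector))
prompt = f"Context:\n{context}\n\nQuestion: How do I start chiller PM?"
```

In a real deployment, steps 1 and 4 are the embedding model and LLM calls; everything between them is deterministic plumbing.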
2.2 Component Architecture
A production RAG system comprises multiple specialized components:
Document Processing Pipeline
- Ingestion: Accept documents in multiple formats (PDF, DOCX, HTML, plain text)
- Parsing: Extract text while preserving structure (headings, tables, lists)
- Chunking: Segment documents into retrieval units of appropriate size
- Enrichment: Extract entities, relationships, and metadata
Embedding Infrastructure
- Model Selection: Choose embedding models appropriate for the domain
- Batch Processing: Efficiently embed large document collections
- Index Management: Create, update, and query vector indices
- Dimension Optimization: Balance embedding quality against storage costs
Retrieval System
- Vector Search: Find semantically similar documents
- Keyword Search: Find exact term matches for technical terminology
- Hybrid Fusion: Combine multiple retrieval signals
- Reranking: Refine initial results using cross-encoder models
Generation Layer
- Prompt Construction: Format context and query for the LLM
- Model Routing: Select appropriate models based on query characteristics
- Response Streaming: Deliver partial responses for improved UX
- Citation Injection: Link claims to source passages
Each component requires careful design decisions that affect system quality, latency, and cost. The following sections detail our approach to each.
3. Industrial RAG Challenges
Industrial applications present challenges that distinguish them from general-purpose RAG systems. Understanding these challenges informs our architectural decisions.
3.1 Document Complexity
Industrial knowledge bases contain documents that defy naive processing approaches:
PDFs with Embedded Tables: Equipment specifications, troubleshooting matrices, and pressure-temperature charts encode critical information in tabular formats. Standard PDF extraction often corrupts table structure, converting rows into unrelated text fragments.
Original Table:
| Symptom | Likely Cause | Resolution |
|-------------------|--------------------|--------------------|
| High head pressure| Dirty condenser | Clean condenser |
| Low suction | Low refrigerant | Check for leaks |
Naive Extraction:
"Symptom Likely Cause Resolution High head pressure Dirty condenser
Clean condenser Low suction Low refrigerant Check for leaks"
The extracted text loses the relational structure essential for accurate retrieval and reasoning.
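One common mitigation, once a table has been parsed into rows, is to serialize each row as a self-contained statement that pairs every cell with its column header, so the relational structure survives chunking and retrieval. A minimal sketch (the header/rows input shape is an assumption about the upstream parser):

```python
def serialize_table_rows(header, rows):
    """Turn each table row into a standalone sentence pairing every
    cell with its column header, keeping the relations intact."""
    sentences = []
    for row in rows:
        parts = [f"{col}: {cell}" for col, cell in zip(header, row)]
        sentences.append("; ".join(parts) + ".")
    return sentences

header = ["Symptom", "Likely Cause", "Resolution"]
rows = [
    ["High head pressure", "Dirty condenser", "Clean condenser"],
    ["Low suction", "Low refrigerant", "Check for leaks"],
]
# Each row now reads as a complete diagnostic statement.
flat = serialize_table_rows(header, rows)
```

A query like "what causes high head pressure" now matches a chunk that carries its cause and resolution with it, rather than a stray cell fragment.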
Multi-Document Procedures: A complete work procedure may span multiple documents: a general maintenance checklist, equipment-specific addenda, site-specific safety requirements, and current work orders. Answering "How do I perform PM on Chiller 3?" requires synthesizing information across these sources.
Scanned Documents: Legacy documentation often exists only as scanned images. OCR introduces errors, particularly for technical terminology and model numbers. "R-410A" may become "R-4lOA" or "R-41DA", degrading retrieval accuracy.
Drawing and Diagram Annotations: P&ID (Piping and Instrumentation Diagrams) and electrical schematics contain text labels that require spatial reasoning to interpret correctly. The label "COMP-1" next to a symbol carries meaning only in relation to the diagram structure.
3.2 Query Understanding
Industrial queries present linguistic challenges absent from general domains:
Technical Terminology: Domain vocabulary includes abbreviations (CRAC, CRAH, TXV, EEV), model numbers (Liebert DS112A), refrigerant designations (R-410A, R-454B), and measurement units (psig, kPa, CFM). Embedding models trained on general text may fail to capture semantic relationships between technical terms.
Implicit Context: Technicians ask questions assuming shared context: "What's the superheat supposed to be?" implicitly references the specific equipment they're working on, the operating conditions, and the refrigerant type. A helpful system must either infer this context or request clarification.
Multi-Step Queries: Troubleshooting queries often require sequential reasoning: "The chiller is short-cycling and the discharge pressure is high. What should I check?" requires understanding symptom-cause relationships and diagnostic prioritization rather than simple document retrieval.
Ambiguity Resolution: Technical terms may have multiple meanings depending on context. "Discharge" might refer to discharge pressure (refrigeration), electrical discharge (safety), or patient discharge (in medical contexts). The system must disambiguate based on domain and query context.
3.3 Safety Requirements
Industrial AI systems operate in high-stakes environments where errors carry real consequences:
Hallucination Consequences: A hallucinated refrigerant charge recommendation could damage a compressor. A fabricated lockout procedure could result in electrocution. The cost of a confident but incorrect answer far exceeds the cost of admitting uncertainty.
Confidence Calibration: The system must accurately assess its own certainty. High-confidence answers with low actual accuracy are more dangerous than low-confidence answers, as users may not independently verify confident statements.
Escalation Paths: When the system lacks sufficient information or encounters queries outside its competence, it must escalate to human experts rather than guess. This requires mechanisms to detect knowledge boundaries and route appropriately.
Audit Trail: Regulated industries require the ability to review what information the system provided and why. Complete logging of queries, retrieved documents, and generated responses enables post-hoc analysis of any incidents.
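One lightweight way to satisfy the audit-trail requirement is an append-only record written at response time, one line per interaction. The field set and tamper-evidence digest below are illustrative, not a prescribed schema:

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone

@dataclass
class AuditRecord:
    """Append-only record of one RAG interaction, for post-hoc review."""
    query: str
    retrieved_ids: list   # ids of the chunks shown to the model
    response: str
    model: str
    timestamp: str = field(default_factory=lambda: datetime.now(timezone.utc).isoformat())

    def to_line(self):
        # One JSON line per interaction, plus a short digest so that
        # after-the-fact edits to the log are detectable.
        payload = json.dumps(asdict(self), sort_keys=True)
        digest = hashlib.sha256(payload.encode()).hexdigest()[:16]
        return f"{payload}\t{digest}"
```

Writing these lines to write-once storage gives reviewers the query, the exact evidence retrieved, and the generated answer for any incident.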
4. The MuVeraAI RAG Architecture
Our architecture addresses industrial challenges through a four-mode pipeline that adapts sophistication to query complexity. This section details each component and their interactions.
4.1 Four RAG Modes
The MuVeraAI platform implements four distinct RAG modes, selected based on query characteristics and user preferences:
Mode 1: Simple RAG
Query --> Embed --> Vector Search --> Top-K Chunks --> LLM --> Response
Simple mode provides the fastest response path for straightforward factual queries. It performs direct embedding of the query, retrieves the top-K most similar document chunks, and generates a response. This mode suits queries like "What is the refrigerant charge for a Carrier 30XA chiller?" where a single document section likely contains the answer.
Mode 2: Advanced RAG
Query --> Expand --> Hybrid Search --> Fusion --> Rerank --> LLM --> Response
Advanced mode adds query expansion, hybrid search (combining vector and keyword retrieval), result fusion, and neural reranking. This mode handles queries requiring multiple retrieval signals, such as "How do I troubleshoot high head pressure on a scroll compressor?" where both semantic similarity and exact term matching improve retrieval.
Mode 3: Agentic RAG
Query --> Plan --> [Tool Call --> Result]* --> Synthesize --> Response
Agentic mode enables multi-step reasoning through tool use. The LLM generates a plan, executes tools (search, calculation, lookup), observes results, and iterates until sufficient information is gathered. This mode handles complex queries like "Compare the energy efficiency of replacing our 20-year-old chillers versus retrofitting with VFDs" that require multiple searches and calculations.
Mode 4: Graph-Enhanced RAG
Query --> Entity Extract --> Graph Traverse --> Vector Search --> Fuse --> LLM --> Response
Graph-enhanced mode integrates knowledge graph traversal with vector retrieval. Entities mentioned in the query seed graph traversal to find related equipment, procedures, and diagnostic paths. This mode excels at queries involving relationships: "What components might fail if Pump-3A trips?" requires understanding equipment dependencies rather than just document similarity.
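As a rough illustration of how queries might be routed among the four modes, consider a keyword-cue classifier. The cue lists here are invented for the example and are deliberately simpler than any production routing logic:

```python
import re

# Illustrative cue patterns; a real router would use richer signals.
RELATIONSHIP_CUES = re.compile(
    r"\b(if .* (trips|fails)|depends on|connected to|downstream|upstream)\b", re.I)
MULTI_STEP_CUES = re.compile(
    r"\b(compare|versus|vs\.?|estimate|calculate|trade-?off)\b", re.I)
TROUBLESHOOT_CUES = re.compile(
    r"\b(troubleshoot|diagnos\w+|short.cycling|why is)\b", re.I)

def select_mode(query: str) -> str:
    if RELATIONSHIP_CUES.search(query):
        return "graph"      # relationship queries need graph traversal
    if MULTI_STEP_CUES.search(query):
        return "agentic"    # comparisons/calculations need multi-step tool use
    if TROUBLESHOOT_CUES.search(query):
        return "advanced"   # hybrid search + reranking for diagnostics
    return "simple"         # direct factual lookup
```

The example queries from each mode's description above route as expected under these cues.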
4.2 Document Processing Pipeline
Documents enter the system through a multi-stage processing pipeline:
Stage 1: Format Handling
The Document Processor service accepts multiple input formats:
SUPPORTED_FORMATS = {
    'application/pdf': process_pdf,                     # PDF with text and tables
    'application/vnd.openxmlformats...': process_docx,  # Word documents
    'application/vnd.ms-excel': process_xlsx,           # Excel spreadsheets
    'text/html': process_html,                          # HTML pages
    'text/plain': process_text,                         # Plain text
}
PDF processing deserves special attention due to its prevalence in industrial documentation. We employ a multi-strategy approach:
- Text Extraction: Extract text using PyPDF2 or pdfplumber
- Table Detection: Identify tables using tabular structure recognition
- Table Extraction: Extract tables preserving row/column relationships
- OCR Fallback: For scanned pages, apply Tesseract OCR
- Layout Analysis: Preserve heading hierarchy and section structure
Stage 2: Semantic Chunking
Naive fixed-size chunking ignores document structure, potentially splitting sentences mid-thought or separating procedure steps from their context. Our Semantic Chunker implements four strategies:
Fixed Chunking: Simple character/token-based splitting with overlap. Fastest but lowest quality.
def _fixed_chunking(self, text: str) -> List[Chunk]:
    chunks = []
    chunk_size = self.settings.max_chunk_size  # e.g., 512 tokens
    overlap = self.settings.overlap_size       # e.g., 50 tokens
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        chunk_text = text[start:end]
        # Avoid mid-word breaks
        if end < len(text) and not text[end].isspace():
            last_space = chunk_text.rfind(' ')
            if last_space > chunk_size // 2:
                end = start + last_space
                chunk_text = text[start:end]
        chunks.append(Chunk(content=chunk_text.strip(), ...))
        if end >= len(text):
            break  # final chunk emitted; stepping back by overlap would loop forever
        start = end - overlap
    return chunks
Recursive Chunking: Respects document structure by splitting on semantic boundaries (paragraph breaks, section headers, sentence ends) before falling back to smaller units.
SEPARATORS = [
    "\n\n\n",  # Multiple blank lines (major sections)
    "\n\n",    # Paragraph breaks
    "\n",      # Line breaks
    ". ",      # Sentence ends
    ", ",      # Clause boundaries
    " ",       # Word boundaries (last resort)
]
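The separator hierarchy can be applied with a greedy splitter: try the coarsest separator first and recurse into any piece that is still too large. A condensed sketch (chunk sizes in characters rather than tokens, for simplicity):

```python
SEPARATORS = ["\n\n\n", "\n\n", "\n", ". ", ", ", " "]

def recursive_split(text, max_size=200, separators=SEPARATORS):
    """Split on the coarsest boundary that applies, recursing into
    oversized pieces with progressively finer separators."""
    if len(text) <= max_size:
        return [text] if text.strip() else []
    if not separators:
        # Last resort: hard split at max_size.
        return [text[i:i + max_size] for i in range(0, len(text), max_size)]
    sep, rest = separators[0], separators[1:]
    pieces = [p for p in text.split(sep) if p.strip()]
    if len(pieces) <= 1:
        # This separator did not divide the text; try the next finer one.
        return recursive_split(text, max_size, rest)
    chunks = []
    for piece in pieces:
        chunks.extend(recursive_split(piece, max_size, rest))
    return chunks
```

A production splitter would additionally re-merge undersized neighbors and re-attach separators; this sketch shows only the descent through the hierarchy.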
Semantic Chunking: Uses embedding similarity to detect topic boundaries. Adjacent sentences with low similarity indicate a topic shift and chunking boundary.
Recursive-Semantic Hybrid: Applies recursive chunking first, then refines large chunks using semantic analysis. This balances structural preservation with topic coherence.
Our default configuration uses recursive-semantic chunking with:
- Maximum chunk size: 512 tokens
- Minimum chunk size: 100 tokens
- Overlap: 50 tokens
- Similarity threshold: 0.75
Stage 3: Entity Extraction
The Entity Extractor identifies domain-relevant entities using a hybrid approach combining spaCy NER with pattern matching and optional GLiNER zero-shot extraction.
Domain patterns capture industrial vocabulary:
TRADE_SKILLS_PATTERNS = {
    "equipment": [
        "HVAC", "compressor", "condenser", "evaporator", "chiller",
        "CRAC", "CRAH", "CDU", "cooling tower", "VFD", ...
    ],
    "symptom": [
        "noise", "vibration", "leak", "overheating", "short cycling",
        "high head pressure", "low suction pressure", ...
    ],
    "procedure": [
        "maintenance", "troubleshooting", "calibration", "evacuation",
        "refrigerant recovery", "PM", ...
    ],
    "measurement": [
        "superheat", "subcooling", "discharge pressure", "CFM", ...
    ],
    "refrigerant": [
        "R-410A", "R-22", "R-134a", "R-454B", ...
    ],
}
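The pattern-matching half of this extractor can be sketched as lookup over lowercased text; the production system layers spaCy NER and GLiNER on top, and the abbreviated pattern table below is illustrative:

```python
# Abbreviated pattern table for illustration only.
PATTERNS = {
    "equipment": ["compressor", "condenser", "chiller", "crac", "vfd"],
    "symptom": ["short cycling", "high head pressure", "vibration", "leak"],
    "refrigerant": ["r-410a", "r-22", "r-134a"],
}

def extract_entities(text):
    """Return (entity, type) pairs for every curated pattern found.
    Longer patterns are checked first so 'high head pressure' is
    preferred over any shorter overlapping term."""
    found, lowered = [], text.lower()
    for etype, terms in PATTERNS.items():
        for term in sorted(terms, key=len, reverse=True):
            if term in lowered:
                found.append((term, etype))
    return found
```

Plain substring matching is crude (it has no word boundaries); the point is only to show how curated vocabulary complements statistical NER.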
Extracted entities serve multiple purposes:
- Metadata enrichment for filtered search
- Seeds for knowledge graph traversal
- Query expansion candidates
- Relationship extraction anchors
Stage 4: Relationship Extraction
Beyond entities, we extract relationships between entities that appear in the same context. Pattern-based rules identify diagnostic relationships:
DIAGNOSTIC_PATTERNS = {
    "HAS_SYMPTOM": ["shows", "exhibits", "experiencing", "symptom of"],
    "CAUSED_BY": ["caused by", "due to", "result of", "because of"],
    "RESOLVED_BY": ["fixed by", "resolved by", "corrected by"],
    "REQUIRES_SKILL": ["requires", "needs", "prerequisite"],
    "APPLIES_TO": ["applies to", "for", "used with"],
}
Type-based inference supplements pattern matching:
TYPE_RELATIONS = {
    ("equipment", "symptom"): "HAS_SYMPTOM",
    ("symptom", "cause"): "CAUSED_BY",
    ("cause", "solution"): "RESOLVED_BY",
    ("procedure", "equipment"): "APPLIES_TO",
}
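Type-based inference amounts to: for every ordered pair of co-occurring entities, emit the relation the type table prescribes, if any. A sketch (the table is repeated for self-containment, and the (name, type) entity shape mirrors the extraction stage):

```python
TYPE_RELATIONS = {
    ("equipment", "symptom"): "HAS_SYMPTOM",
    ("symptom", "cause"): "CAUSED_BY",
    ("cause", "solution"): "RESOLVED_BY",
    ("procedure", "equipment"): "APPLIES_TO",
}

def infer_relations(entities):
    """entities: (name, type) pairs found in the same chunk.
    Returns (head, relation, tail) triples implied by the type table."""
    triples = []
    for head_name, head_type in entities:
        for tail_name, tail_type in entities:
            if head_name == tail_name:
                continue
            relation = TYPE_RELATIONS.get((head_type, tail_type))
            if relation:
                triples.append((head_name, relation, tail_name))
    return triples
```

Because only whitelisted type pairs produce triples, co-occurrence noise (equipment next to unrelated equipment, say) yields nothing.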
4.3 Retrieval Architecture
The retrieval system implements hybrid search combining vector similarity with keyword matching.
Vector Store: Qdrant
We selected Qdrant as our vector database for several reasons:
- Native support for multiple vector fields per document
- Efficient filtering on metadata fields
- Horizontal scaling with sharding
- Active development and community
Document chunks are indexed with their embeddings and metadata:
# Index structure
{
    "id": "chunk-uuid",
    "vector": [0.12, -0.34, ...],  # 384-1024 dimensions
    "payload": {
        "text": "The compressor discharge temperature...",
        "source": "carrier-30xa-manual.pdf",
        "page": 42,
        "section": "Troubleshooting",
        "entities": ["compressor", "discharge temperature"],
        "document_type": "maintenance_manual",
    }
}
Keyword Search: Meilisearch
Meilisearch provides typo-tolerant keyword search for exact term matching. This complements vector search by catching technical terms that embedding models may not handle well:
# Keyword search
results = meili.index("documents").search(
    "R-410A superheat adjustment",
    {
        "limit": 20,
        "showRankingScore": True,
        "filter": "document_type = 'procedure'"
    }
)
Hybrid Fusion Algorithms
Combining vector and keyword results requires careful score fusion. We implement four algorithms:
Reciprocal Rank Fusion (RRF): Combines results based on rank rather than score, making it robust to score distribution differences between retrievers.
def _rrf_fusion(self, vector_results, keyword_results, k=60):
    scores = {}
    for result in vector_results:
        scores[result.id] = scores.get(result.id, 0) + 1 / (k + result.rank)
    for result in keyword_results:
        scores[result.id] = scores.get(result.id, 0) + 1 / (k + result.rank)
    return sorted(scores.items(), key=lambda x: x[1], reverse=True)
Weighted Fusion: Normalizes scores and applies configurable weights to each retriever.
Convex Combination: A special case of weighted fusion where weights sum to 1, providing a convex blend of scores.
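Weighted fusion and its convex special case can be sketched as min-max normalization followed by a weighted sum. For self-containment this sketch takes plain `{doc_id: score}` dicts rather than result objects:

```python
def _minmax(results):
    """Normalize {doc_id: score} to [0, 1]; a degenerate set where all
    scores are equal maps every document to 0.0."""
    if not results:
        return {}
    lo, hi = min(results.values()), max(results.values())
    span = (hi - lo) or 1.0
    return {doc_id: (s - lo) / span for doc_id, s in results.items()}

def weighted_fusion(vector_results, keyword_results, w_vector=0.7, w_keyword=0.3):
    """Weighted fusion over normalized scores. When the weights sum to 1,
    as they do by default, this is exactly the convex-combination variant."""
    v, k = _minmax(vector_results), _minmax(keyword_results)
    fused = {}
    for doc_id, score in v.items():
        fused[doc_id] = fused.get(doc_id, 0.0) + w_vector * score
    for doc_id, score in k.items():
        fused[doc_id] = fused.get(doc_id, 0.0) + w_keyword * score
    return sorted(fused.items(), key=lambda x: x[1], reverse=True)
```

Normalization is what makes the weights meaningful: raw vector similarities (roughly 0-1) and raw keyword scores (unbounded) are not directly comparable.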
Distribution-Based Score Fusion (DBSF): Normalizes scores using z-score transformation to account for different score distributions.
def _dbsf_fusion(self, vector_results, keyword_results):
    # Z-score normalize each result set
    v_scores = [r.score for r in vector_results]
    v_mean, v_std = np.mean(v_scores), np.std(v_scores) or 1
    k_scores = [r.score for r in keyword_results]
    k_mean, k_std = np.mean(k_scores), np.std(k_scores) or 1
    scores = {}
    for r in vector_results:
        z = (r.score - v_mean) / v_std
        scores[r.id] = scores.get(r.id, 0) + z
    for r in keyword_results:
        z = (r.score - k_mean) / k_std
        scores[r.id] = scores.get(r.id, 0) + z
    return sorted(scores.items(), key=lambda x: x[1], reverse=True)
Our default configuration uses RRF with k=60, which provides robust fusion without requiring score calibration.
Query Expansion
The Query Expander enhances recall by adding related terms to the query. We implement multiple expansion methods:
Synonym Expansion: Adds domain-specific synonyms from a curated vocabulary.
DOMAIN_SYNONYMS = {
    "chiller": ["cooling unit", "refrigeration unit", "cooler"],
    "noise": ["sound", "humming", "buzzing", "rattling", "squealing"],
    "maintenance": ["service", "PM", "preventive maintenance"],
}
Abbreviation Expansion: Expands technical abbreviations.
ABBREVIATIONS = {
    "crac": "computer room air conditioner",
    "vfd": "variable frequency drive",
    "txv": "thermostatic expansion valve",
}
Embedding-Based Expansion: Finds semantically similar terms using vector similarity.
Pseudo-Relevance Feedback (PRF): Extracts terms from top initial results to expand the query.
Hybrid Expansion: Combines synonym and embedding expansion for maximum recall.
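Combining the synonym and abbreviation tables into an expander is straightforward. This sketch (using abbreviated copies of the tables above) appends expansions rather than replacing the original terms, so exact matches on the user's wording still score:

```python
DOMAIN_SYNONYMS = {
    "chiller": ["cooling unit", "refrigeration unit"],
    "maintenance": ["service", "preventive maintenance"],
}
ABBREVIATIONS = {
    "crac": "computer room air conditioner",
    "txv": "thermostatic expansion valve",
}

def expand_query(query):
    """Append synonym and abbreviation expansions for any term that
    appears in the query; the original query text is kept intact."""
    expansions = []
    lowered = query.lower()
    for term, synonyms in DOMAIN_SYNONYMS.items():
        if term in lowered:
            expansions.extend(synonyms)
    for abbr, full in ABBREVIATIONS.items():
        if abbr in lowered.split():  # whole-word match for abbreviations
            expansions.append(full)
    return query if not expansions else f"{query} ({'; '.join(expansions)})"
```

Embedding-based and PRF expansion plug into the same shape: each method contributes candidate terms, and the expander appends them to the query.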
4.4 Reranking Pipeline
Initial retrieval typically returns 50-100 candidates that require refinement. The Reranking Pipeline applies more expensive but more accurate models to reorder results:
Cross-Encoder Reranking
Cross-encoders jointly encode the query and document, enabling direct relevance scoring. We support multiple models:
RERANKER_MODELS = {
    "cross-encoder": "cross-encoder/ms-marco-MiniLM-L-6-v2",
    "cross-encoder-large": "cross-encoder/ms-marco-MiniLM-L-12-v2",
    "bge-reranker": "BAAI/bge-reranker-large",
}
The cross-encoder processes query-document pairs:
def _rerank_cross_encoder(self, query, documents, model_name="cross-encoder/ms-marco-MiniLM-L-6-v2"):
    cross_encoder = self._load_cross_encoder(model_name)
    # Create query-document pairs
    pairs = [[query, doc.text] for doc in documents]
    # Score all pairs
    scores = cross_encoder.predict(pairs, batch_size=32)
    # Sort by reranking score
    ranked = sorted(zip(documents, scores), key=lambda x: x[1], reverse=True)
    return ranked
Cohere Reranking
For production deployments, Cohere's API-based reranker offers strong performance without local GPU requirements:
async def _rerank_cohere(self, query, documents):
    response = self.cohere_client.rerank(
        query=query,
        documents=[doc.text for doc in documents],
        model="rerank-english-v2.0",
        top_n=len(documents)
    )
    return [(documents[r.index], r.relevance_score) for r in response.results]
FlashRank
FlashRank provides a lightweight, fast reranker suitable for edge deployment:
def _rerank_flashrank(self, query, documents):
    request = FlashRerankRequest(
        query=query,
        passages=[{"id": doc.id, "text": doc.text} for doc in documents]
    )
    return self.flashrank_model.rerank(request)
LLM-Based Reranking
For maximum accuracy on domain-specific content, we can use the LLM itself to score relevance:
async def _rerank_llm(self, query, documents):
    # Number each document so the model's score array can be aligned
    formatted_documents = "\n".join(f"[{i}] {doc.text}" for i, doc in enumerate(documents))
    prompt = f"""Rate the relevance of each document to the query (0-10).
Query: {query}
Documents:
{formatted_documents}
Respond with only a JSON array of scores: [8, 5, 9, ...]"""
    response = await self.llm.complete(prompt)
    scores = json.loads(response)
    return sorted(zip(documents, scores), key=lambda x: x[1], reverse=True)
Ensemble Reranking
Our ensemble approach combines multiple rerankers using RRF:
async def _rerank_ensemble(self, query, documents):
    cross_encoder_ranked = self._rerank_cross_encoder(query, documents)
    flashrank_ranked = self._rerank_flashrank(query, documents)
    # RRF fusion of rankings
    scores = {}
    for rank, (doc, _) in enumerate(cross_encoder_ranked):
        scores[doc.id] = 1 / (60 + rank)
    for rank, (doc, _) in enumerate(flashrank_ranked):
        scores[doc.id] = scores.get(doc.id, 0) + 1 / (60 + rank)
    return sorted(documents, key=lambda d: scores[d.id], reverse=True)
4.5 Knowledge Graph Integration
The knowledge graph captures structured relationships that complement unstructured document retrieval.
Neo4j Schema
Our graph schema models industrial domain entities:
// Node types
(:Equipment {name, type, manufacturer, model, location})
(:Symptom {name, description, severity})
(:Cause {name, description, probability})
(:Solution {name, description, difficulty, estimatedTime})
(:Procedure {name, type, steps, safetyRequirements})
(:Skill {name, level, prerequisites})
// Relationships
(equipment)-[:HAS_SYMPTOM]->(symptom)
(symptom)-[:CAUSED_BY]->(cause)
(cause)-[:RESOLVED_BY]->(solution)
(solution)-[:REQUIRES_SKILL]->(skill)
(procedure)-[:APPLIES_TO]->(equipment)
(skill)-[:PREREQUISITE_OF]->(skill)
GraphRAG Retrieval
The GraphRAG module implements graph-enhanced retrieval:
async def retrieve_graph_context(self, query, max_hops=2, max_entities=10):
    # 1. Extract entities from query
    entities = await self._extract_entities(query)
    # 2. BFS traversal from seed entities
    graph_entities, relationships = await self._traverse_graph(
        entities=entities,
        max_hops=max_hops,
        max_entities=max_entities
    )
    # 3. Get diagnostic paths for equipment/symptoms
    diagnostic_paths = await self._get_diagnostic_paths(entities)
    # 4. Find related procedures
    related_procedures = await self._get_related_procedures(entities)
    # 5. Build context text
    context_text = self._build_context_text(
        graph_entities, relationships, diagnostic_paths, related_procedures
    )
    return GraphContext(
        entities=graph_entities,
        relationships=relationships,
        diagnostic_paths=diagnostic_paths,
        related_procedures=related_procedures,
        context_text=context_text
    )
Hybrid Vector-Graph Retrieval
Graph-enhanced mode combines vector search with graph traversal:
async def hybrid_retrieve(self, query, collection, top_k=10, graph_weight=0.3):
    # Parallel retrieval
    vector_task = self._vector_retrieve(query, collection, top_k)
    graph_task = self.retrieve_graph_context(query)
    vector_results, graph_context = await asyncio.gather(vector_task, graph_task)
    # Boost vector results that mention graph entities
    graph_entity_names = {e.name.lower() for e in graph_context.entities}
    for result in vector_results:
        text_lower = result.text.lower()
        matches = sum(1 for name in graph_entity_names if name in text_lower)
        if matches > 0:
            result.score += graph_weight * matches * 0.1
            result.graph_aligned = True
    # Re-sort by boosted scores
    vector_results.sort(key=lambda x: x.score, reverse=True)
    return vector_results, graph_context
Diagnostic Reasoning Paths
For troubleshooting queries, the graph provides structured reasoning chains:
Equipment: Centrifugal Chiller
|
+-- HAS_SYMPTOM --> High Discharge Pressure
    |
    +-- CAUSED_BY --> Dirty Condenser (probability: 0.7)
    |   |
    |   +-- RESOLVED_BY --> Clean Condenser Coils
    |       |
    |       +-- REQUIRES_SKILL --> Basic HVAC Maintenance
    |
    +-- CAUSED_BY --> Non-Condensables (probability: 0.2)
        |
        +-- RESOLVED_BY --> Purge Non-Condensables
            |
            +-- REQUIRES_SKILL --> Refrigeration Certified
This structured knowledge complements unstructured document retrieval by providing explicit causal chains and skill requirements.
4.6 Generation and Grounding
The final stage synthesizes retrieved context into a coherent, grounded response.
LLM Routing with LiteLLM
We use LiteLLM as a unified interface to multiple LLM providers:
# Provider configuration
PROVIDERS = {
    "openai": ["gpt-4o", "gpt-4o-mini"],
    "anthropic": ["claude-sonnet-4-20250514", "claude-3-haiku"],
    "ollama": ["llama3.2:8b", "mistral:7b"],
}

async def _generate(self, messages, model=None, temperature=0.7):
    model = model or self.settings.default_model
    # client: a shared async HTTP client (e.g., httpx.AsyncClient)
    response = await client.post(
        f"{self.settings.litellm_url}/v1/chat/completions",
        json={
            "model": model,
            "messages": messages,
            "temperature": temperature,
            "max_tokens": self.settings.max_output_tokens,
        }
    )
    return response.json()["choices"][0]["message"]["content"]
Prompt Construction
The prompt combines system instructions, retrieved context, and the user query:
def _build_prompt(self, query, context, system_prompt=None):
    system = system_prompt or """You are an expert industrial technician assistant.
Answer questions based on the provided context. If the context doesn't contain
sufficient information, say so rather than guessing. Always cite your sources
using [1], [2], etc."""
    messages = [{"role": "system", "content": system}]
    user_content = f"""Context:
{context}
Question: {query}
Provide a comprehensive answer based on the context above. Cite sources."""
    messages.append({"role": "user", "content": user_content})
    return messages
Context Assembly with Citations
Retrieved chunks are formatted with citation markers for source attribution:
def _build_context(self, chunks, graph_context=""):
    context_parts = []
    for i, chunk in enumerate(chunks):
        source_info = f" [Source: {chunk.source}]" if chunk.source else ""
        graph_badge = " [Graph]" if chunk.metadata.get("graph_aligned") else ""
        context_parts.append(f"[{i+1}]{source_info}{graph_badge}\n{chunk.text}")
    context = "\n\n".join(context_parts)
    # Prepend graph context if available
    if graph_context:
        context = f"**Knowledge Graph Context:**\n{graph_context}\n\n---\n\n**Retrieved Documents:**\n{context}"
    return context
Hallucination Mitigation
We employ multiple strategies to reduce hallucination:
- Explicit Instructions: System prompts explicitly instruct the model to acknowledge uncertainty and avoid fabrication.
- Citation Requirement: Requiring citations forces the model to ground claims in retrieved evidence.
- Confidence Assessment: Post-generation analysis checks whether claims are supported by citations.
- Temperature Control: Lower temperatures (0.3-0.5) reduce creative but unfounded responses.
- Structured Output: For critical responses, we use structured output formats (JSON mode) that constrain generation.
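The citation-requirement and confidence-assessment strategies combine into a cheap post-generation check: flag sentences that assert something but carry no citation marker. A heuristic sketch (the sentence splitter and the hedge-detection pattern are deliberately crude assumptions):

```python
import re

CITATION = re.compile(r"\[\d+\]")

def uncited_sentences(response):
    """Return sentences that assert something but carry no [n] citation.
    Splitting on terminal punctuation is a rough heuristic."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", response) if s.strip()]
    flagged = []
    for sentence in sentences:
        if CITATION.search(sentence):
            continue
        # Skip hedges/refusals, which legitimately need no citation.
        if re.search(r"\b(not sure|insufficient|cannot find|unclear)\b", sentence, re.I):
            continue
        flagged.append(sentence)
    return flagged
```

Responses with a high ratio of flagged sentences can be routed to regeneration at lower temperature, or escalated rather than shown to the technician.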
5. Embedding Strategy
Embedding model selection significantly impacts retrieval quality. This section details our embedding architecture.
5.1 Multi-Model Support
The Embedding Service supports multiple embedding models to balance quality, latency, and cost:
MODEL_DIMENSIONS = {
    "all-MiniLM-L6-v2": 384,          # Fast, good baseline
    "all-mpnet-base-v2": 768,         # Better quality
    "intfloat/e5-large-v2": 1024,     # State-of-the-art
    "BAAI/bge-large-en-v1.5": 1024,   # Strong alternative
    "text-embedding-3-small": 1536,   # OpenAI API
    "text-embedding-3-large": 3072,   # OpenAI API (highest quality)
}
Model Selection Guidelines:
- Development/Testing: all-MiniLM-L6-v2 (fast, small, good enough for iteration)
- Production Baseline: all-mpnet-base-v2 (good quality/speed balance)
- High Accuracy: e5-large-v2 or bge-large-en-v1.5 (best open-source)
- Maximum Quality: text-embedding-3-large (best overall, requires API)
5.2 Instruction-Tuned Embeddings
Modern embedding models often benefit from instruction prefixes:
def _embed_local(self, texts, model, normalize=True):
    # E5 models expect a "query: " prefix
    if model == "intfloat/e5-large-v2":
        texts = [f"query: {t}" for t in texts]
    # BGE models expect an instruction prefix
    elif model == "BAAI/bge-large-en-v1.5":
        texts = [f"Represent this sentence for searching relevant passages: {t}"
                 for t in texts]
    return self.sentence_model.encode(texts, normalize_embeddings=normalize)
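Note that E5 is an asymmetric model: queries take the `query: ` prefix, while documents being indexed take a `passage: ` prefix. The snippet above applies the query prefix unconditionally; a sketch that distinguishes the two sides (helper name is illustrative):

```python
def e5_prefix(texts, is_query: bool):
    """Apply the E5 instruction prefix for the correct side of retrieval.

    E5 models are trained asymmetrically: 'query: ' for search queries,
    'passage: ' for the documents being indexed. Mixing them up works
    but measurably degrades retrieval quality.
    """
    prefix = "query: " if is_query else "passage: "
    return [f"{prefix}{t}" for t in texts]
```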
5.3 GPU Acceleration
Embedding generation benefits significantly from GPU acceleration:
def _get_device(self):
    if self.settings.use_gpu and torch.cuda.is_available():
        return "cuda"
    return "cpu"

# Model loading with device placement
self.model = SentenceTransformer(model_name, device=self._device)
Benchmark comparisons (10,000 text chunks, batch_size=32):
| Model    | CPU Time | GPU Time | Speedup |
|----------|----------|----------|---------|
| MiniLM   | 45s      | 3s       | 15x     |
| MPNet    | 120s     | 8s       | 15x     |
| E5-Large | 300s     | 18s      | 17x     |
5.4 Batch Processing
Efficient batch processing is essential for initial indexing of large document collections:
embeddings = model.encode(
    texts,
    batch_size=32,              # Tune for GPU memory
    normalize_embeddings=True,
    show_progress_bar=False,
    convert_to_numpy=True
)
For very large collections, we implement streaming batch processing to avoid memory exhaustion:
async def embed_collection(self, documents, batch_size=1000):
    for i in range(0, len(documents), batch_size):
        batch = documents[i:i + batch_size]
        embeddings = await self.embed(batch)
        await self.index_batch(batch, embeddings)
        # Yield control to prevent blocking the event loop
        await asyncio.sleep(0)
6. Evaluation Framework
Evaluating RAG systems requires metrics that assess both retrieval and generation quality. Unlike traditional information retrieval, RAG evaluation must consider the entire pipeline: were the right documents retrieved? Was the generation faithful to those documents? Did the answer actually help the user?
This section presents our comprehensive evaluation framework spanning offline benchmarks, online monitoring, and human evaluation protocols.
6.1 Retrieval Metrics
Mean Reciprocal Rank (MRR): Measures the average position of the first relevant result.
MRR = (1/N) * sum(1/rank_i)
Normalized Discounted Cumulative Gain (NDCG): Accounts for graded relevance with position-based discounting.
DCG = sum(relevance_i / log2(i + 1))
NDCG = DCG / IDCG # Normalized by ideal ranking
Precision@K and Recall@K: Fraction of retrieved documents that are relevant, and fraction of relevant documents that are retrieved.
Precision@K = |relevant in top K| / K
Recall@K = |relevant in top K| / |total relevant|
Hit Rate: Whether any relevant document appears in the top-K results.
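The four retrieval metrics above can be computed for a single query in a few lines. A minimal sketch (averaging over the full query set is left to the caller):

```python
def retrieval_metrics(ranked_ids, relevant_ids, k=5):
    """Compute MRR, Precision@K, Recall@K, and Hit Rate for one query.

    ranked_ids: document ids in retrieved order.
    relevant_ids: ground-truth relevant ids for the query.
    """
    relevant = set(relevant_ids)
    top_k = ranked_ids[:k]
    hits_in_k = sum(1 for d in top_k if d in relevant)

    # Reciprocal rank of the first relevant result anywhere in the ranking
    rr = 0.0
    for i, d in enumerate(ranked_ids, start=1):
        if d in relevant:
            rr = 1.0 / i
            break

    return {
        "mrr": rr,  # averaged across queries in practice
        "precision_at_k": hits_in_k / k,
        "recall_at_k": hits_in_k / len(relevant) if relevant else 0.0,
        "hit_rate": 1.0 if hits_in_k > 0 else 0.0,
    }
```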
6.2 Generation Metrics (RAGAS)
We integrate the RAGAS framework for RAG-specific evaluation:
Faithfulness: Measures whether generated claims are supported by the retrieved context.
def evaluate_faithfulness(answer, context):
    # Extract claims from the answer
    claims = extract_claims(answer)
    # Check each claim against the retrieved context
    supported = sum(1 for claim in claims if is_supported(claim, context))
    return supported / len(claims) if claims else 0.0
Answer Relevancy: Measures whether the answer addresses the question.
def evaluate_relevancy(question, answer):
    # Generate questions that the answer would address
    generated_questions = generate_questions_from_answer(answer)
    # Compute similarity to the original question
    similarities = [similarity(q, question) for q in generated_questions]
    return mean(similarities)
Context Precision: Measures whether retrieved documents are relevant to the question.
Context Recall: Measures whether all information needed for the answer was retrieved.
6.3 Domain-Specific Evaluation
Industrial applications require domain-specific evaluation beyond generic metrics:
Factual Accuracy: Expert review of technical correctness.
HVAC_FACTS_TEST_SET = [
    {
        "question": "What is the ideal superheat for R-410A?",
        "expected_range": "8-12 degrees Fahrenheit",
        "source": "ASHRAE Handbook"
    },
    # ...
]
Safety Compliance: Verify that safety-critical information is accurate.
Procedure Completeness: Ensure procedural answers include all required steps.
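A minimal harness for running the factual test set is sketched below. It assumes a `rag_query` callable that returns the answer string, and it uses deliberately strict verbatim matching of the expected value as a pass criterion; the production pipeline combines this with numeric-range parsing and expert review.

```python
def run_factual_eval(test_set, rag_query):
    """Score a factual QA test set against a RAG pipeline.

    An item passes when the expected value appears verbatim in the answer,
    a strict proxy that under-counts correct paraphrases on purpose.
    """
    if not test_set:
        return 0.0
    passed = 0
    for item in test_set:
        answer = rag_query(item["question"])
        if item["expected_range"].lower() in answer.lower():
            passed += 1
    return passed / len(test_set)
```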
6.4 Human Evaluation
Automated metrics provide scale but cannot fully assess answer quality for industrial applications. We supplement automated evaluation with structured human review:
Expert Review Protocol: Domain experts (certified HVAC technicians) review a sample of responses for:
- Technical accuracy: Are the facts correct?
- Safety completeness: Are relevant safety warnings included?
- Procedural correctness: Are steps in the right order with no omissions?
- Appropriate confidence: Does the system appropriately acknowledge uncertainty?
Inter-Rater Reliability: Multiple experts review the same responses to establish consistency. Cohen's kappa measures agreement levels, with recalibration when agreement drops below 0.7.
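For reference, Cohen's kappa for two raters can be computed directly from the agreement statistics, as in this self-contained sketch:

```python
def cohens_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two raters scoring the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement
    and p_e is the agreement expected by chance from each rater's marginals.
    """
    n = len(ratings_a)
    labels = set(ratings_a) | set(ratings_b)
    p_o = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    p_e = sum(
        (ratings_a.count(label) / n) * (ratings_b.count(label) / n)
        for label in labels
    )
    if p_e == 1.0:
        return 1.0  # both raters are constant and identical
    return (p_o - p_e) / (1 - p_e)
```

Values below the 0.7 recalibration threshold indicate that raters interpret the rubric differently and need alignment before their labels are trusted.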
Feedback Loop: Expert corrections are incorporated into the test set for regression testing and potentially into the knowledge base for future retrieval.
6.5 Continuous Evaluation
We implement continuous evaluation in production:
Comprehensive Logging: All queries, retrieved documents, generated responses, and latency metrics are logged to a centralized store. This enables post-hoc analysis of any production issue.
Statistical Sampling: A configurable percentage of queries (default 5%) are flagged for human review. Sampling is stratified by query complexity to ensure coverage across difficulty levels.
User Feedback Collection: Users can provide explicit feedback:
- Thumbs up/down on overall response quality
- Flag specific claims as incorrect
- Provide corrected information for training
Implicit Feedback Signals: We track behavioral signals that indicate satisfaction:
- Follow-up questions (suggests incomplete first answer)
- Time spent reading response
- Copy/paste actions (suggests useful content)
- Escalation to human expert (suggests insufficient AI response)
Regression Testing: A curated set of known-good query-response pairs is evaluated weekly. Any degradation in RAGAS scores or retrieval metrics triggers investigation.
A/B Testing Framework: New models, prompts, or retrieval configurations are deployed to a subset of traffic. Statistical significance testing determines whether changes improve key metrics before full rollout.
6.6 Evaluation Benchmarks
We maintain domain-specific benchmark datasets:
HVAC Factual QA (200 questions): Factual questions about HVAC systems with known correct answers from authoritative sources.
Troubleshooting Scenarios (50 scenarios): Multi-step diagnostic scenarios where the correct answer requires reasoning over symptoms, causes, and solutions.
Procedure Verification (30 procedures): Procedural questions where completeness and step ordering matter.
Safety Edge Cases (100 questions): Questions designed to elicit potentially dangerous advice, testing the system's safety guardrails.
Benchmark evaluation occurs weekly, with alerts when any metric drops more than 5% from baseline.
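The weekly alerting rule can be expressed as a small comparison over named metrics. A sketch, assuming metrics are kept as name-to-score dictionaries:

```python
def regression_alerts(baseline: dict, current: dict, tolerance: float = 0.05):
    """Flag any benchmark metric that dropped more than `tolerance`
    (relative) from its baseline, per the weekly evaluation policy."""
    alerts = []
    for name, base in baseline.items():
        cur = current.get(name)
        if cur is None or base == 0:
            continue  # missing or degenerate metric: handled elsewhere
        drop = (base - cur) / base
        if drop > tolerance:
            alerts.append((name, round(drop, 3)))
    return alerts
```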
7. Deployment Considerations
Production deployment of RAG systems requires attention to scalability, latency, cost optimization, and operational concerns. Industrial deployments face additional constraints including data residency requirements, network isolation, and integration with existing enterprise systems.
7.1 Scalability Architecture
Our microservices architecture enables independent scaling of components:
[Load Balancer]
|
v
[API Gateway] ---> [Rate Limiter]
|
+---> [RAG Orchestrator] (3 replicas)
| |
| +---> [Hybrid Search] (2 replicas)
| | |
| | +---> [Qdrant] (3 shards)
| | +---> [Meilisearch]
| |
| +---> [Reranking] (2 replicas, GPU)
| |
| +---> [LiteLLM] ---> [LLM Providers]
|
+---> [Document Processor] (2 replicas)
+---> [Embedding Service] (2 replicas, GPU)
+---> [Context Graph] ---> [Neo4j] (cluster)
Scaling Guidelines:
- Embedding Service: Scale based on indexing throughput needs
- Hybrid Search: Scale based on query volume
- Reranking: Scale based on query volume (GPU-intensive)
- RAG Orchestrator: Scale based on concurrent users
- Qdrant: Shard based on collection size
7.2 Latency Optimization
End-to-end latency targets for different modes:
| Mode           | Target P50 | Target P99 |
|----------------|------------|------------|
| Simple         | 1s         | 3s         |
| Advanced       | 2s         | 5s         |
| Graph-Enhanced | 3s         | 8s         |
| Agentic        | 5s         | 15s        |
Optimization Techniques:
- Parallel Retrieval: Execute vector and keyword search simultaneously.

  vector_task = self.vector_search(query, ...)
  keyword_task = self.keyword_search(query, ...)
  vector_results, keyword_results = await asyncio.gather(vector_task, keyword_task)

- Embedding Caching: Cache query embeddings to avoid redundant computation.
- Connection Pooling: Reuse HTTP connections to backend services.
- Response Streaming: Stream LLM output to reduce perceived latency.
- Speculative Execution: Begin LLM generation with partial context while reranking completes.
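The embedding-caching technique above can be sketched with a standard LRU cache keyed on query text. The wrapper below is illustrative (function names are our own, not a library API):

```python
from functools import lru_cache

def make_caching_embedder(embed_fn, maxsize=10_000):
    """Wrap an embedding function with an LRU cache keyed on query text,
    so repeated or popular queries skip the embedding model entirely."""
    @lru_cache(maxsize=maxsize)
    def cached(text: str):
        # Tuples are hashable and immutable, safe to share across callers
        return tuple(embed_fn(text))

    def embed(text: str):
        return list(cached(text))

    embed.cache_info = cached.cache_info  # expose hit/miss stats for metrics
    return embed
```

In production the cache would sit in Redis rather than process memory so that all orchestrator replicas share it.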
7.3 Edge Deployment
For offline or low-latency requirements, we support edge deployment:
Model Quantization: Reduce model size for edge devices.
# ONNX conversion with quantization
# ONNX conversion with dynamic quantization (onnxruntime.quantization)
def convert_to_onnx(model_name, output_path, quantize=True):
    model = SentenceTransformer(model_name)
    # Export step elided: the transformer must first be exported to
    # {output_path}/model.onnx (e.g., with the Hugging Face optimum
    # exporter); model.save() alone stores the PyTorch weights.
    model.save(output_path)
    if quantize:
        quantize_dynamic(
            f"{output_path}/model.onnx",
            f"{output_path}/model_quantized.onnx",
            weight_type=QuantType.QInt8
        )
Local LLM: Ollama provides local LLM inference for edge deployments.
Sync Protocol: Bidirectional synchronization keeps edge indices updated.
7.4 Cost Optimization
RAG systems incur costs across multiple dimensions: compute (embedding, reranking, LLM inference), storage (vector indices, document store), and external APIs (LLM providers). Careful optimization can reduce costs by an order of magnitude without sacrificing quality.
Embedding Cost Optimization:
- Use smaller embedding models (MiniLM-384 vs. E5-1024) for initial indexing; quality difference is often negligible for retrieval
- Batch embedding requests to maximize GPU utilization
- Cache query embeddings for frequently-asked questions
- Implement tiered embedding: quick MiniLM for initial filtering, high-quality E5 for final ranking
LLM Cost Optimization:
- Route simple queries to smaller, cheaper models (GPT-4o-mini vs. GPT-4o)
- Implement semantic caching: if a semantically similar query was recently answered, return the cached response
- Compress context: summarize long retrieved passages before sending to the LLM
- Use local models (Ollama with Llama 3) for development and non-critical queries
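The semantic-caching idea above can be sketched with a brute-force cosine scan over cached query embeddings. This is a minimal in-memory illustration; a production version would store entries in the vector database and add TTL eviction.

```python
import math

class SemanticCache:
    """Return a cached answer when a new query's embedding is within a
    cosine-similarity threshold of a previously answered query."""

    def __init__(self, threshold=0.95):
        self.threshold = threshold
        self.entries = []  # list of (embedding, answer) pairs

    @staticmethod
    def _cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb) if na and nb else 0.0

    def get(self, embedding):
        best_answer, best_sim = None, 0.0
        for emb, answer in self.entries:
            sim = self._cosine(embedding, emb)
            if sim > best_sim:
                best_answer, best_sim = answer, sim
        return best_answer if best_sim >= self.threshold else None

    def put(self, embedding, answer):
        self.entries.append((embedding, answer))
```

The threshold controls the precision/hit-rate trade-off: too low and users receive answers to subtly different questions, which is unacceptable in safety-critical contexts.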
Storage Cost Optimization:
- Implement vector quantization to reduce storage requirements (typically 4x compression with minimal quality loss)
- Use tiered storage: hot indices for recent/popular content, cold storage for archival documents
- Deduplicate similar chunks before indexing
- Implement TTL policies for outdated content
Cost Monitoring Dashboard: We track cost per query broken down by component:
| Component         | Cost per 1K Queries | Optimization Potential   |
|-------------------|---------------------|--------------------------|
| Embedding         | $0.02               | Use local models         |
| Vector Search     | $0.01               | Optimize index settings  |
| Reranking         | $0.05               | Reduce candidate count   |
| LLM (GPT-4o)      | $0.50               | Route to smaller models  |
| LLM (GPT-4o-mini) | $0.05               | Cache frequent queries   |
| Total             | $0.63               | Target: $0.10            |
7.5 Operational Concerns
Monitoring: Prometheus metrics for all services enable real-time observability and alerting.
# Metrics exposed by each service
REQUEST_LATENCY = Histogram('request_latency_seconds', 'Request latency')
REQUEST_COUNT = Counter('request_count', 'Request count', ['method', 'status'])
EMBEDDING_DIMENSION = Gauge('embedding_dimension', 'Embedding dimension')
RETRIEVAL_PRECISION = Gauge('retrieval_precision', 'Retrieval precision at K')
LLM_TOKEN_USAGE = Counter('llm_tokens_total', 'Total tokens used', ['model'])
Alerting: Critical alerts for service health and quality degradation:
- Service availability (any service down for >1 minute)
- Latency P99 exceeding SLA (>10s for Advanced mode)
- Error rate exceeding threshold (>1% of queries)
- Quality regression (RAGAS faithfulness <0.8)
- Cost spike (>2x daily average)
Logging: Structured logging with correlation IDs enables end-to-end request tracing. All services emit JSON logs with consistent fields:
{
  "timestamp": "2026-01-15T10:30:45Z",
  "level": "info",
  "service": "rag-orchestrator",
  "correlation_id": "req-abc123",
  "event": "query_completed",
  "mode": "advanced",
  "retrieval_time_ms": 245,
  "generation_time_ms": 1823,
  "chunks_retrieved": 10,
  "chunks_after_rerank": 5
}
Backup and Recovery:
- Vector indices: Daily full backup with hourly incremental
- Graph database: Continuous replication with point-in-time recovery
- Document store: Object versioning with 30-day retention
- Recovery time objective (RTO): 4 hours
- Recovery point objective (RPO): 1 hour
Security Considerations:
- All inter-service communication uses mTLS
- API authentication via JWT with short expiration (15 minutes)
- Rate limiting per tenant and per endpoint
- Query logging excludes PII; PII detection runs before logging
- Audit trail for all document access and modifications
8. Conclusion
8.1 Summary
This paper presented the MuVeraAI RAG architecture, a production system designed for industrial AI applications. Key contributions include:
- Four-Mode Pipeline: Progressive sophistication from simple retrieval to graph-enhanced reasoning, selected based on query complexity.
- Hybrid Search: Combination of vector similarity and keyword matching using four fusion algorithms, with domain-specific query expansion.
- Multi-Model Reranking: Five reranking options (cross-encoder, Cohere, FlashRank, LLM, ensemble) enabling quality/latency trade-offs.
- GraphRAG Integration: Knowledge graph traversal for structured reasoning over equipment relationships and diagnostic paths.
- Industrial Focus: Semantic chunking preserving document structure, domain-aware entity extraction, and safety-conscious generation.
8.2 Lessons Learned
Building production RAG systems taught us several lessons:
Chunking Matters More Than Expected: Poor chunking fragments context and degrades both retrieval and generation. Investing in structure-aware chunking pays dividends.
Hybrid Search Outperforms Pure Vector: Technical domains benefit from exact term matching. Pure vector search misses precise terminology.
Reranking Is Worth the Latency: The quality improvement from cross-encoder reranking justifies the additional latency for most use cases.
Graph Context Is Situationally Valuable: Knowledge graph integration helps significantly for relationship queries but adds overhead for simple factual queries.
Evaluation Is Continuous: One-time evaluation is insufficient. Production systems require continuous monitoring and periodic re-evaluation.
8.3 Trade-offs and Design Decisions
Throughout the development of this architecture, we made several deliberate trade-offs:
Microservices vs. Monolith: We chose a microservices architecture despite the operational complexity because it enables independent scaling, technology heterogeneity (Python for ML services, Node.js for real-time features), and team autonomy. The added latency from inter-service communication is mitigated through connection pooling and parallel execution.
Local vs. API-Based Embedding: We default to local embedding models (MiniLM, E5) rather than API-based embeddings (OpenAI) because industrial deployments often have strict data residency requirements. Local models also eliminate API costs for high-volume indexing. The trade-off is lower embedding quality compared to text-embedding-3-large.
Graph Database vs. Relational: Neo4j was chosen for the knowledge graph despite its operational complexity because graph traversal patterns (find all related equipment within 3 hops) are natural in Cypher but complex in SQL. The trade-off is another database to operate and keep synchronized.
Reranking vs. More Retrieval: We invest in compute-intensive reranking rather than retrieving more candidates because reranking improves precision at the top of results, which directly impacts generation quality. The latency cost (typically 100-300ms) is worthwhile for improved answer quality.
8.4 Future Directions
Our roadmap includes:
Advanced Agentic Capabilities: Multi-agent collaboration for complex troubleshooting scenarios. Multiple specialized agents (diagnostic agent, safety agent, procedure agent) will coordinate through a shared state machine to handle multi-faceted industrial queries.
Multimodal RAG: Integration of image understanding for equipment identification and diagram interpretation. Technicians will be able to photograph equipment and receive contextual information including relevant procedures and maintenance history.
Adaptive Mode Selection: Automatic selection of RAG mode based on query classification. A lightweight classifier will route simple factual queries to Simple mode while directing complex diagnostic queries to Graph-Enhanced mode, optimizing the quality/latency trade-off.
Personalized Retrieval: User-specific relevance based on role, certification level, and interaction history. A technician certified in refrigeration will receive more detailed refrigerant-related content, while a facility manager will see summary information.
Federated Learning: Privacy-preserving model improvement across multiple deployments. Local fine-tuning signals will be aggregated without sharing raw data, enabling continuous improvement while maintaining data isolation.
Real-Time Knowledge Updates: Event-driven document processing that automatically re-indexes modified documents and invalidates affected cache entries, reducing the latency between document updates and query relevance.
Structured Output Extraction: Enhanced generation modes that produce structured outputs (JSON schemas) for integration with downstream systems, enabling automated work order creation, parts ordering, and compliance documentation.
Confidence-Based Routing: Automatic escalation to human experts when the system's confidence falls below configurable thresholds, ensuring that uncertain queries receive appropriate human review rather than potentially incorrect AI responses.
Appendix A: Technology Stack Summary
Our RAG architecture leverages the following technology stack:
Programming Languages:
- Python 3.11 for all ML services (embedding, reranking, entity extraction)
- TypeScript for API Gateway and frontend integrations
- Cypher for Neo4j graph queries
- SQL for metadata and audit logging
ML Frameworks and Libraries:
- sentence-transformers for embedding generation
- PyTorch as the deep learning backend
- spaCy for NLP and named entity recognition
- GLiNER for zero-shot entity extraction
- tiktoken for accurate token counting
Databases and Storage:
- Qdrant for vector similarity search (self-hosted)
- Neo4j for knowledge graph storage and traversal
- Meilisearch for full-text keyword search
- PostgreSQL for relational data and audit logs
- Redis for caching and session management
- MinIO for document object storage (S3-compatible)
Infrastructure:
- Docker and Docker Compose for containerization
- Kubernetes for production orchestration
- Prometheus and Grafana for monitoring
- LiteLLM for LLM provider abstraction
External Services:
- OpenAI API for high-quality LLM inference
- Anthropic API for Claude models
- Cohere API for reranking (optional)
- Ollama for local LLM deployment
Appendix B: Service Architecture Diagram
User Query
  |
  v
API Gateway (Port 8080)
  |
  v
RAG Orchestrator (Port 8604)
  |
  +---> Hybrid Search (Port 8602) ----> Qdrant, Meilisearch
  +---> Reranking (Port 8603) --------> Cross-Encoder models
  +---> Context Graph (Port 8500) ----> Neo4j

Embedding Service (Port 8502)
  |
  v
Document Ingestion
  |
  +---> Document Processor (Port 8501)
  +---> Semantic Chunker (Port 8600)
  +---> Entity Extraction (Port 8601)
Service Interaction Patterns
The architecture employs several interaction patterns to ensure reliability and performance:
Synchronous Request-Response: Used for latency-sensitive operations like query embedding and vector search where the caller needs immediate results.
Asynchronous Processing: Document ingestion uses asynchronous patterns where the upload returns immediately with a job ID, and processing continues in the background with status updates available via polling or webhook.
Streaming: LLM generation uses server-sent events (SSE) to stream partial responses, improving perceived latency for users.
Circuit Breaker: Each service-to-service call implements circuit breaker patterns to prevent cascade failures when downstream services are unavailable.
Retry with Exponential Backoff: Transient failures trigger automatic retries with exponential backoff to handle temporary network issues or rate limiting.
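The retry pattern above, with full jitter to avoid synchronized retry storms across replicas, can be sketched as:

```python
import random
import time

def retry_with_backoff(fn, max_attempts=5, base_delay=0.1, max_delay=5.0,
                       sleep=time.sleep):
    """Retry `fn` on exception with exponential backoff plus full jitter.

    The delay before attempt n is uniform in
    [0, min(max_delay, base_delay * 2**n)], so concurrent callers spread
    their retries instead of hammering a recovering service in lockstep.
    The `sleep` parameter is injectable for testing.
    """
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the failure
            sleep(random.uniform(0, min(max_delay, base_delay * 2 ** attempt)))
```

In the real services this is combined with the circuit breaker: retries apply only to errors classified as transient (timeouts, 429s), never to permanent failures.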
Appendix C: Configuration Reference
Embedding Service Configuration
embedding:
  default_model: "all-MiniLM-L6-v2"
  batch_size: 32
  use_gpu: true
  cache_dir: "/models"
  normalize: true
Hybrid Search Configuration
hybrid_search:
  default_fusion: "rrf"
  rrf_k: 60
  vector_weight: 0.5
  fetch_multiplier: 3
  enable_query_expansion: true
  expansion_method: "hybrid"
Reranking Configuration
reranking:
  default_model: "cross-encoder"
  cross_encoder_model: "cross-encoder/ms-marco-MiniLM-L-6-v2"
  batch_size: 32
  use_gpu: true
  fallback_to_cross_encoder: true
RAG Orchestrator Configuration
rag:
  default_mode: "advanced"
  default_model: "gpt-4o-mini"
  max_output_tokens: 2048
  temperature: 0.7
  top_k: 10
  rerank_top_k: 5
  enable_streaming: true
Appendix D: API Reference
Query Endpoint
POST /api/v1/rag/query
Content-Type: application/json

{
  "query": "How do I troubleshoot high head pressure?",
  "mode": "advanced",
  "collection": "hvac_manuals",
  "top_k": 10,
  "rerank": true,
  "rerank_top_k": 5,
  "include_sources": true,
  "stream": false,
  "llm_model": "gpt-4o-mini",
  "temperature": 0.7
}
Response Format
{
  "answer": "High head pressure can be caused by several factors...",
  "query": "How do I troubleshoot high head pressure?",
  "mode": "advanced",
  "sources": [
    {
      "id": "chunk-uuid-1",
      "text": "High discharge pressure typically indicates...",
      "score": 0.92,
      "source": "carrier-30xa-troubleshooting.pdf",
      "metadata": {
        "page": 45,
        "section": "Troubleshooting"
      }
    }
  ],
  "llm_model": "gpt-4o-mini",
  "retrieval_time_ms": 245,
  "generation_time_ms": 1823,
  "total_time_ms": 2068,
  "tokens_used": 1456
}
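A minimal Python client sketch for this endpoint. The request-builder below simply mirrors the schema documented above; the gateway URL and HTTP library choice are deployment-specific and illustrative.

```python
def build_query_request(query: str, mode: str = "advanced", top_k: int = 10,
                        rerank: bool = True, rerank_top_k: int = 5,
                        collection: str = "hvac_manuals"):
    """Assemble a request body matching the query endpoint schema above."""
    return {
        "query": query,
        "mode": mode,
        "collection": collection,
        "top_k": top_k,
        "rerank": rerank,
        "rerank_top_k": rerank_top_k,
        "include_sources": True,
        "stream": False,
    }

body = build_query_request("How do I troubleshoot high head pressure?")
# POST `body` as JSON to /api/v1/rag/query on your gateway, e.g.:
#   requests.post(f"{gateway_url}/api/v1/rag/query", json=body, timeout=30)
```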
References
- Lewis, P., et al. "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks." NeurIPS 2020.
- Borgeaud, S., et al. "Improving Language Models by Retrieving from Trillions of Tokens." ICML 2022.
- Gao, L., et al. "Precise Zero-Shot Dense Retrieval without Relevance Labels." ACL 2023.
- Wang, L., et al. "Text Embeddings by Weakly-Supervised Contrastive Pre-training." arXiv 2022.
- Es, S., et al. "RAGAS: Automated Evaluation of Retrieval Augmented Generation." arXiv 2023.
- Cormack, G.V., et al. "Reciprocal Rank Fusion Outperforms Condorcet and Individual Rank Learning Methods." SIGIR 2009.
- Nogueira, R., and Cho, K. "Passage Re-ranking with BERT." arXiv 2019.
- ASHRAE. "ASHRAE Handbook: HVAC Applications." 2023.
Document Control
| Version | Date         | Author                  | Changes         |
|---------|--------------|-------------------------|-----------------|
| 1.0     | January 2026 | MuVeraAI Technical Team | Initial release |
Copyright 2026 MuVeraAI. All rights reserved.