Author: Kamil Klepusewicz, Software Engineer


Deploying an internal AI chatbot is not a prompt engineering exercise – it is a distributed systems problem with data governance constraints. Most organizations reach a structural ceiling within 90 days of a prototype: retrieval quality degrades across document types, access control enforcement breaks at the retrieval layer, and inference costs become unauditable.

 

The root cause is almost always the same: a PoC architecture (typically a LangChain wrapper over a single vector store) was promoted to production without addressing the underlying data platform requirements.

 

A production-grade internal AI assistant requires a coherent architecture across five domains: data ingestion pipelines, chunking and embedding strategy, vector search with metadata filtering, LLM serving with governance, and observability.

 

Each of these is a first-class engineering concern. This guide walks through each layer – with specific reference to Databricks Mosaic AI and its native toolchain – so that architecture decisions are grounded in operational reality rather than vendor abstractions.

 

 

Step 0: Define the Architectural Scope Before Writing a Line of Code

 

The first decision is not which LLM to use. It is what data the chatbot needs to access, and under what access constraints.

 

For a regulated enterprise – pharma, finance, enterprise IT – this immediately surfaces three non-negotiable requirements:

 

  • Row- and column-level access control on retrieved documents (e.g., HR documents visible only to HR, financial summaries scoped by cost center)
  • Data lineage on every document chunk that reaches the context window
  • Audit logging of every inference request, including the retrieved sources

 

Architectures that bolt these concerns on post hoc accumulate technical debt rapidly. The design must start with governance as a first-class constraint.

 

Step 1: Build a Governed Data Ingestion Pipeline

 

The knowledge base for an internal AI chatbot is only as reliable as the ingestion pipeline that produces it. Document freshness, deduplication, and schema consistency must be enforced upstream of the vector store – not at query time.

 

Recommended approach on Databricks:

 

  • Use Delta Live Tables (DLT) to define declarative, incremental ingestion pipelines from source systems (SharePoint, Confluence, internal wikis, S3 buckets, ServiceNow).
  • Store raw documents in the Bronze layer of a Medallion architecture. Apply parsing, cleaning, and metadata extraction in the Silver layer. The Gold layer produces normalized, chunk-ready records.
  • All tables are registered in Unity Catalog, which provides a single governance plane: ownership, tagging, lineage tracking, and access policies that propagate downstream to vector search.

 

The critical property here is lineage continuity: every chunk in the vector index can be traced back to its source document, its ingestion timestamp, and the DLT pipeline version that produced it.
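
A minimal sketch of the Bronze-to-Silver portion of such a pipeline is shown below. The source volume path, table names, and columns are illustrative assumptions, not a prescribed layout.

```python
# Minimal DLT sketch: land raw documents (Bronze), then produce cleaned,
# metadata-enriched records (Silver). Paths and column names are assumptions.
import dlt
from pyspark.sql import functions as F

@dlt.table(comment="Raw documents ingested from source systems (Bronze)")
def documents_bronze():
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "binaryFile")
        .load("/Volumes/knowledge_base/raw/documents")  # hypothetical landing volume
    )

@dlt.table(comment="Parsed, metadata-enriched documents (Silver)")
def documents_silver():
    return (
        dlt.read_stream("documents_bronze")
        .withColumn("source_path", F.col("path"))
        .withColumn("ingested_at", F.current_timestamp())
    )
```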

 

Step 2: Design a Chunking and Embedding Strategy

 

Chunking is the most underspecified component in most internal AI chatbot architectures. Fixed-size chunking (e.g., 512 tokens with 64-token overlap) is a reasonable default for dense prose, but it fails structurally on tables, code blocks, regulatory policy documents, and multi-section PDFs.

 

Chunking strategy by document type:

 

  • Prose / Wiki pages – Recursive character splitting, 400–600 tokens
  • Structured tables – Row-level or semantic table chunking
  • Policy / regulatory docs – Section-boundary chunking with header hierarchy preserved
  • Code / runbooks – Function-level or block-level chunking
  • PDFs with mixed content – Native parsing via ai_parse_document (GA, supports PDF/DOCX/PPTX/images up to 500 pages); Unstructured.io as fallback for edge cases

 

The ai_parse_document SQL/Python function (GA as of April 2026) handles layout-aware extraction natively within Databricks – no external parsing service required for the majority of enterprise document types. It integrates directly into the DLT Silver transformation step, keeping the ingestion pipeline self-contained.
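
As a hedged sketch, the parsing call can live directly in the Silver transformation. The example below assumes the single-argument form of ai_parse_document is available in the workspace runtime and that the Bronze table exposes a binary content column; table and column names are placeholders.

```python
# Sketch: apply ai_parse_document during the Silver transformation.
# Table and column names are placeholders; the parsed output feeds chunking.
from pyspark.sql import functions as F

parsed_df = (
    spark.table("knowledge_base.bronze.documents_bronze")
    .withColumn("parsed", F.expr("ai_parse_document(content)"))
    .select("path", "parsed")
)
```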

 

Embeddings should be generated using Databricks Model Serving – either a hosted embedding model (e.g., bge-large-en-v1.5, text-embedding-3-large via the AI Gateway) or a fine-tuned domain-specific model. Embedding generation is run as a Spark UDF over the Silver-to-Gold transformation, making it horizontally scalable and idempotent.
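
A hedged sketch of that UDF pattern follows, assuming a pandas UDF that calls a Model Serving embedding endpoint; the endpoint name, table names, and response parsing are illustrative assumptions.

```python
# Sketch: generate embeddings by calling a Model Serving endpoint from a
# pandas UDF. Endpoint and table names are placeholders; the response is
# assumed to follow the OpenAI embeddings schema - adjust parsing as needed.
from typing import Iterator
import pandas as pd
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import ArrayType, FloatType
from mlflow.deployments import get_deploy_client

EMBEDDING_ENDPOINT = "databricks-bge-large-en"  # hypothetical endpoint name

@pandas_udf(ArrayType(FloatType()))
def embed(batches: Iterator[pd.Series]) -> Iterator[pd.Series]:
    client = get_deploy_client("databricks")
    for batch in batches:
        response = client.predict(
            endpoint=EMBEDDING_ENDPOINT,
            inputs={"input": batch.tolist()},
        )
        yield pd.Series([item["embedding"] for item in response["data"]])

gold_df = spark.table("knowledge_base.silver.chunks").withColumn(
    "embedding", embed("chunk_text")
)
```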

 

Step 3: Deploy a Vector Index with Unity Catalog Governance

 

Mosaic AI Vector Search is Databricks’ native managed vector store. Unlike externally hosted solutions (Pinecone, Weaviate, Qdrant), it stores vector indexes as Unity Catalog objects, which means:

 

  • Access control policies on the source Delta table are automatically inherited by the vector index
  • The index can be queried using the same personal access token or service principal that governs the rest of the lakehouse – no separate credential plane
  • Indexes are updated incrementally and automatically as the underlying Delta table changes (via Delta Change Data Feed)

 

The architectural consequence is significant: you do not need a separate synchronization job to keep the vector store in sync with the document corpus. The platform handles it as a first-class operation. This eliminates an entire category of operational overhead that plagues self-hosted vector store deployments.

 

Index configuration checklist:

 

  • Enable Direct Vector Access for low-latency retrieval in synchronous serving contexts
  • Configure metadata columns on the index (e.g., department, classification_level, document_type) to support pre-filter queries at retrieval time
  • Set embedding dimension to match the deployed embedding model – a mismatch will fail silently in several open-source vector stores, but raises a schema validation error in Unity Catalog
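
A hedged sketch of index creation with the Vector Search Python SDK, assuming self-managed embeddings written to the Gold table by the UDF from Step 2; the endpoint, catalog, and column names are placeholders.

```python
# Sketch: create a Delta Sync index over the Gold chunk table.
# Endpoint, catalog, schema, and column names are illustrative assumptions.
# The source table must have Change Data Feed enabled for incremental sync.
from databricks.vector_search.client import VectorSearchClient

vsc = VectorSearchClient()

index = vsc.create_delta_sync_index(
    endpoint_name="kb-vector-search",                 # hypothetical endpoint
    index_name="knowledge_base.gold.chunks_index",
    source_table_name="knowledge_base.gold.chunks",
    pipeline_type="TRIGGERED",                        # or "CONTINUOUS"
    primary_key="chunk_id",
    embedding_dimension=1024,            # must match the deployed embedding model
    embedding_vector_column="embedding", # self-managed embeddings from the Gold table
)
```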

 

Step 4: Implement Retrieval-Augmented Generation (RAG) with the Mosaic AI Stack

 

The retrieval pipeline for a production internal AI chatbot involves more than a single vector search call. A robust implementation includes query rewriting, hybrid retrieval (dense + sparse/BM25), re-ranking, and context assembly with source attribution.

 

Architectural Guidance: Struggling with retrieval quality, governance enforcement, or productionizing your RAG pipeline on Databricks? Explore our Enterprise RAG Implementation Service to see how Dateonic implements production-ready internal AI chatbot solutions – including Unity Catalog governance, AI Gateway observability, and end-to-end MLflow traceability.

 

Retrieval pipeline components:

 

  1. Query Rewriting – Use an LLM call (lightweight model, e.g., claude-haiku or llama-3-8b via AI Gateway) to rewrite ambiguous employee queries into retrieval-optimized forms. This is especially important for internal jargon and acronym-heavy queries.
  2. Hybrid Retrieval – Combine Mosaic AI Vector Search (semantic) with BM25 keyword matching on the Delta table. Reciprocal Rank Fusion (RRF) merges the two result lists.
  3. Re-ranking – Pass the top-N candidate chunks through a cross-encoder re-ranker (e.g., ms-marco-MiniLM) served on Databricks Model Serving to improve precision before context assembly.
  4. Context Assembly – Truncate assembled context to the LLM’s effective context window, preserving source metadata for citation rendering. Never truncate silently – surface context overflow as an observable metric.
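
Of these, the Reciprocal Rank Fusion merge is compact enough to show inline. A minimal sketch, using the commonly cited default of k = 60:

```python
# Minimal Reciprocal Rank Fusion: merge dense and keyword result lists by
# summing 1 / (k + rank) per document, then sort by the fused score.
from collections import defaultdict

def reciprocal_rank_fusion(result_lists: list[list[str]], k: int = 60) -> list[str]:
    scores: dict[str, float] = defaultdict(float)
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Usage: merge vector-search hits with BM25 hits (IDs are placeholders).
fused = reciprocal_rank_fusion([
    ["doc_12", "doc_03", "doc_44"],   # dense / semantic ranking
    ["doc_03", "doc_91", "doc_12"],   # BM25 keyword ranking
])
```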

 

The full pipeline is defined as an MLflow 3 PyFunc model and registered in Unity Catalog. This makes the entire chain – retrieval + re-ranking + generation – a versioned, deployable artifact with a stable API contract.
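
A hedged sketch of that packaging step, with placeholder logic standing in for the retrieval and generation calls; the catalog, schema, and model names are assumptions.

```python
# Sketch: wrap the RAG chain as a PyFunc model and register it in Unity Catalog.
# Catalog/schema/model names are illustrative; chain logic is a placeholder.
import mlflow

class RagChain(mlflow.pyfunc.PythonModel):
    def load_context(self, context):
        # Initialize the vector search index, re-ranker, and LLM clients here.
        pass

    def predict(self, context, model_input):
        # model_input: a DataFrame of user queries; return answers with citations.
        questions = model_input["question"]
        return [{"answer": "...", "sources": []} for _ in questions]

mlflow.set_registry_uri("databricks-uc")
with mlflow.start_run():
    mlflow.pyfunc.log_model(
        name="rag_chain",  # model name within the run (artifact_path in MLflow 2.x)
        python_model=RagChain(),
        registered_model_name="knowledge_base.gold.internal_chatbot_chain",
    )
```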

 

Step 5: Serve the LLM via Databricks AI Gateway

 

Databricks AI Gateway (part of Mosaic AI) acts as a unified proxy layer for all LLM calls. Its role in a production internal AI chatbot architecture is not optional:

 

  • Centralized rate limiting and cost allocation – route traffic by team, use case, or service principal; enforce token budgets
  • PII detection and redaction – apply input/output guardrails before prompts reach external LLM providers (critical for regulated industries)
  • Model fallback and routing – define primary/fallback model chains (e.g., gpt-4o → llama-3-70b on capacity failure)
  • Inference logging – every request/response pair is logged to a Delta table for downstream analysis, compliance audit, and RLHF data collection

 

As of April 2026, AI Gateway has been extended beyond LLM endpoints to govern MCP server interactions – enforcing access control, monitoring usage, and auditing activity across all MCP-connected tools in the workspace. For chatbot architectures that extend into agentic workflows (external API calls, tool use, data lookups), this makes AI Gateway the single governance plane for both LLM inference and tool execution.

 

AI Gateway decouples the application layer from the underlying LLM provider. Swapping from OpenAI to Azure OpenAI to a self-hosted Llama variant requires a configuration change – not a code change.
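
That decoupling is visible in the client code, which references only an endpoint name. A minimal sketch, assuming the OpenAI-compatible interface exposed by Databricks serving endpoints; the host, token, and route name are placeholders.

```python
# Sketch: the application targets a gateway route by name; the backing model
# is resolved by AI Gateway configuration. Host, token, and route are assumptions.
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["DATABRICKS_TOKEN"],
    base_url=f"{os.environ['DATABRICKS_HOST']}/serving-endpoints",
)

response = client.chat.completions.create(
    model="internal-chatbot-route",  # hypothetical AI Gateway route / endpoint name
    messages=[{"role": "user", "content": "How do I request a new laptop?"}],
)
print(response.choices[0].message.content)
```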

 

Step 6: Instrument the Full Stack with MLflow and Lakehouse Monitoring

 

An internal AI chatbot initiative fails operationally when the team cannot answer: "Is retrieval quality degrading? Is the LLM drifting? Which document types produce the most hallucinations?"

 

Observability stack on Databricks:

 

  • Vector Search Built-in Retrieval Evaluation – use the native Mosaic AI Vector Search retrieval quality evaluation (GA April 2026) as the first-pass diagnostic. It measures and compares relevance across search strategies (semantic, hybrid, filtered) directly on your indexed data – before escalating to more expensive LLM-as-Judge pipelines.
  • MLflow 3 Tracing – instrument every step of the RAG chain (query rewrite, vector search, re-rank, generation) with mlflow.trace(). MLflow 3 (redesigned for GenAI) captures inputs, outputs, intermediate steps, and tool calls with OpenTelemetry-compatible traces queryable in the MLflow UI and exportable to Delta for batch analysis.
  • Lakehouse Monitoring – apply Databricks Lakehouse Monitoring to the inference log table. Configure drift detection on retrieval score distributions and output token counts. Alert on statistical deviations.
  • LLM-as-Judge Evaluation – schedule a nightly Databricks Workflow that samples inference logs, runs an LLM-as-Judge evaluation (using mlflow.evaluate() with the answer_correctness and faithfulness metrics), and writes results to a monitoring Delta table.
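
For the tracing item above, a minimal sketch of how the chain stages can be decorated so each one appears as its own span; the function bodies are placeholders.

```python
# Sketch: trace each stage of the RAG chain so query rewriting, retrieval,
# and generation show up as separate spans. Bodies are placeholders.
import mlflow

@mlflow.trace(span_type="LLM")
def rewrite_query(query: str) -> str:
    return query  # placeholder: call the lightweight rewrite model here

@mlflow.trace(span_type="RETRIEVER")
def retrieve(query: str) -> list[dict]:
    return []     # placeholder: vector search + BM25 + RRF + re-ranking here

@mlflow.trace(span_type="LLM")
def generate(query: str, chunks: list[dict]) -> str:
    return "..."  # placeholder: AI Gateway chat completion call here

@mlflow.trace(span_type="CHAIN")
def answer(query: str) -> str:
    chunks = retrieve(rewrite_query(query))
    return generate(query, chunks)
```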

 

This closes the feedback loop: production behavior surfaces back into the engineering workflow as structured, queryable data.
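
For the nightly LLM-as-Judge job, a hedged sketch of the evaluation call, assuming sampled inference logs with question, response, reference answer, and retrieved context columns; the column names and the tiny inline sample are illustrative, and the judge model must be configured separately.

```python
# Sketch: nightly LLM-as-Judge evaluation over sampled inference logs.
# Column names and the inline sample are illustrative assumptions.
import mlflow
import pandas as pd
from mlflow.metrics.genai import answer_correctness, faithfulness

sampled_logs = pd.DataFrame({
    "question": ["How do I reset my VPN token?"],
    "response": ["Open the IT portal and request a token reset."],
    "reference_answer": ["Request a reset through the IT self-service portal."],
    "retrieved_context": ["VPN tokens are reset via the IT self-service portal."],
})

results = mlflow.evaluate(
    data=sampled_logs,
    predictions="response",
    targets="reference_answer",
    model_type="question-answering",
    extra_metrics=[answer_correctness(), faithfulness()],
    evaluator_config={"col_mapping": {"context": "retrieved_context"}},
)
eval_table = results.tables["eval_results_table"]  # write to a Delta table downstream
```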

 

Common Architectural Questions

 

How does Unity Catalog enforce access control on vector search results?

Unity Catalog propagates the access control list (ACL) of the source Delta table to the vector index at the metadata level. A service principal or user that lacks SELECT permission on the source table cannot query the derived vector index – the permission check is enforced at the platform API layer, not in application code.

 

For multi-tenant deployments, this is implemented using row-level security on the source table with metadata pre-filters on the vector index (e.g., WHERE department = current_user_department()), which are evaluated before candidate chunks are returned.
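
A hedged sketch of such a filtered query, assuming an index built with Databricks-managed embeddings (a self-managed index takes a query_vector instead of query_text) and a department value resolved from the caller's identity upstream; index, column, and filter names are placeholders.

```python
# Sketch: pre-filtered retrieval; only chunks matching the caller's department
# become candidates. Index name, columns, and filter values are assumptions.
from databricks.vector_search.client import VectorSearchClient

vsc = VectorSearchClient()
index = vsc.get_index(
    endpoint_name="kb-vector-search",
    index_name="knowledge_base.gold.chunks_index",
)

results = index.similarity_search(
    query_text="What is the parental leave policy?",
    columns=["chunk_id", "chunk_text", "source_path", "department"],
    filters={"department": "HR"},   # resolved from the caller's identity upstream
    num_results=5,
)
```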

 

What is the minimum viable architecture for an internal AI chatbot on Databricks?

The minimum production-viable stack consists of: a DLT ingestion pipeline into a Unity Catalog-registered Delta table, a Mosaic AI Vector Search index with incremental sync, a Model Serving endpoint for embeddings and LLM inference, and an AI Gateway route with inference logging enabled.

 

This four-component architecture can be operational in 2–4 weeks for a scoped document corpus. Hybrid retrieval, re-ranking, and LLM-as-Judge evaluation are strongly recommended for production but can be introduced in a second iteration.

 

How should PII be handled in an internal AI chatbot for regulated industries?

PII handling must occur at two control points: ingestion and inference. At ingestion, apply a PII detection classifier (e.g., Presidio or a fine-tuned NER model served on Databricks Model Serving) as a DLT transformation step; tag or redact sensitive fields before they reach the Gold layer or vector index.
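
As a sketch of the ingestion-time control point, a pandas UDF wrapping Presidio could be applied as a DLT transformation before records reach the Gold layer; the entity list and column names below are illustrative, not a compliance baseline.

```python
# Sketch: redact common PII entities in document text before the Gold layer.
# Entity list and thresholds are illustrative, not a compliance baseline.
import pandas as pd
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import StringType
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine

@pandas_udf(StringType())
def redact_pii(texts: pd.Series) -> pd.Series:
    analyzer = AnalyzerEngine()
    anonymizer = AnonymizerEngine()

    def redact(text: str) -> str:
        findings = analyzer.analyze(
            text=text,
            entities=["PERSON", "EMAIL_ADDRESS", "PHONE_NUMBER"],
            language="en",
        )
        return anonymizer.anonymize(text=text, analyzer_results=findings).text

    return texts.apply(redact)
```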

 

At inference, configure AI Gateway input/output guardrails to scan prompts and completions in real time. For Pharma and Finance, audit logs of all PII-adjacent inference requests must be retained in a compliance-scoped Delta table with appropriate data classification tags in Unity Catalog.

 

Conclusion and Next Steps

 

Building a production-grade internal AI chatbot is an exercise in layered systems engineering – not prompt design. The architectural shift is from ad-hoc LLM integrations toward a governed, observable, and incrementally maintainable data product. The Databricks Mosaic AI stack provides native primitives – Unity Catalog, Vector Search, AI Gateway, Model Serving, and MLflow 3 – that address the governance, scalability, and observability requirements of enterprise deployments without requiring a fragmented set of third-party tools.

 

The critical decisions – access control model, chunking strategy, retrieval pipeline design, LLM serving governance – must be made at the architecture phase, not retrofitted from a prototype. Organizations that treat the internal AI chatbot as an infrastructure project from day one will achieve production stability in weeks rather than months.

 

Ready to productionize your AI architecture? Contact Dateonic’s Engineering Team – we design and implement production-ready Enterprise RAG systems on Databricks for Fortune 500 organizations in regulated industries.