Introduction
In today’s complex data landscape, organizations rarely have all their data in one place. Data is scattered across various platforms – cloud data warehouses like Snowflake, on-premises databases, SaaS applications, and data lakes. This fragmentation creates significant challenges for data teams trying to derive comprehensive insights from their data assets. While migrating all data to a single platform is ideal in theory, it’s often impractical due to technical constraints, cost considerations, or organizational politics.
Enter Databricks Lakehouse Federation – a groundbreaking capability that allows organizations to query data across multiple external sources without having to move or copy it. As experienced Databricks consultants, we’ve seen how extending the Databricks lakehouse architecture to federate data across various systems enables organizations to achieve a unified view of their data estate while maintaining the flexibility to keep data in its original location when necessary.
This article explores how Lakehouse Federation works, when to use it, and practical steps to implement it in your organization with the help of Databricks consulting services.
Understanding Lakehouse Federation
What is Lakehouse Federation?
Lakehouse Federation is a capability within the Databricks Data Intelligence Platform that allows you to query data across external data sources – including cloud data warehouses, databases, and even other Databricks instances – without having to move or copy the data. This is accomplished by using Unity Catalog, which provides a unified metadata layer for all your data.
The key concept behind federation is the ability to create “foreign catalogs” that represent external data sources. Once a foreign catalog is established, users can query tables and views in these external systems using standard SQL, just as they would query any table in Databricks. Our team of Databricks specialists can help you configure this efficiently.
Key Benefits of Lakehouse Federation
Lakehouse Federation offers several significant benefits:
- Unified Data Access: Query data across your entire data estate from a single interface.
- Reduced Data Silos: Break down barriers between different data platforms and departments.
- Consistent Governance: Apply Unity Catalog’s security and governance capabilities uniformly across all data sources.
- Simplified Architecture: Avoid complex ETL pipelines that merely copy data between systems.
- Cost Optimization: Reduce storage costs by eliminating duplicate data copies.
- Accelerated Insights: Enable analysts to access all relevant data without waiting for data movement.
When to Use Lakehouse Federation vs. Data Ingestion
While Lakehouse Federation offers powerful capabilities, it’s important to recognize that it’s not a replacement for data ingestion in all scenarios. As Databricks consultants, we recommend the following guidelines to help you decide when to use each approach:
Use Federation When:
- You need to query data that changes infrequently
- You have read-only access requirements to source systems
- You need to join data across multiple systems for ad-hoc analysis
- The data volume is too large to cost-effectively duplicate
- Regulatory requirements mandate that data remains in certain systems
- You need a quick solution before implementing a more permanent data architecture
Use Data Ingestion (Copy Data to Databricks) When:
- You need to perform complex transformations or machine learning on the data
- You require high-performance analytics on large datasets
- The source system has performance constraints that could be impacted by federation queries
- You need to enable real-time or near real-time processing
- You want to implement a medallion architecture (Bronze, Silver, Gold layers)
- You’re building production data products that require predictable performance
In many real-world scenarios, organizations will use a hybrid approach – federating some data sources while ingesting others into Databricks, depending on specific use cases and requirements. Databricks consulting services can help determine the optimal approach for your specific data landscape.
Setting Up Lakehouse Federation
Let’s walk through the process of setting up Lakehouse Federation in Databricks, using an example with an Azure SQL Database. Our Databricks specialists regularly implement these setups for clients across various industries.
Prerequisites
Before you begin, ensure you have:
- A Databricks workspace enabled for Unity Catalog
- A Unity Catalog metastore attached to that workspace
- Compute to run federation queries: a cluster on Databricks Runtime 13.1 or above, or a pro or serverless SQL warehouse
- Network connectivity from that cluster or SQL warehouse to the target database
- The necessary permissions to create connections and foreign catalogs in Unity Catalog
Step 1: Set Up a Connection
The first step in the federation process is to establish a connection to your external data source. The connection stores the authentication and connectivity details needed to reach the external system. You can create it through the UI as described below, or directly in SQL (see the sketch after these steps).
From the Databricks interface:
- Expand the Catalog view in your Databricks workspace
- Navigate to the Connections section
- Click “Create Connection”
- Provide a name for your connection (e.g., “azure_sql_connection”)
- Select the connection type (e.g., “Azure SQL Database”)
- Enter the relevant authentication details:
  - Server hostname
  - Database name
  - Authentication method (SQL Login, Azure AD, etc.)
  - Username and password (or other credentials based on auth method)
- Test the connection to verify it works
- Save the connection
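If you prefer to script this step, the connection can also be created in SQL. Here is a minimal sketch for an Azure SQL Database, with placeholder host and credentials; in practice we recommend referencing Databricks secrets rather than embedding a password:

```sql
-- Create a Unity Catalog connection to an Azure SQL Database.
-- The host, user, and password values are placeholders.
CREATE CONNECTION azure_sql_connection TYPE sqlserver
OPTIONS (
  host 'myserver.database.windows.net',
  port '1433',
  user 'federation_reader',
  password 'REPLACE_WITH_SECRET'
);
```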
Step 2: Create a Foreign Catalog
Once you have a working connection, you can create a foreign catalog that will represent the external data source within Databricks:
- Navigate back to the Catalog view
- Click “Create Catalog”
- Provide the essential details:
  - Name your catalog (e.g., “adventure_works_catalog”)
  - Select “Foreign” as the Type
  - Choose the connection you created in Step 1
  - Specify the schema or database within the external system that you want to access
- Click “Create”
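The SQL equivalent, assuming the connection created above and a source database named AdventureWorksLT (swap in your own database name):

```sql
-- Expose the external database as a foreign catalog in Unity Catalog.
CREATE FOREIGN CATALOG adventure_works_catalog
USING CONNECTION azure_sql_connection
OPTIONS (database 'AdventureWorksLT');
```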
Step 3: Configure Access Permissions
After creating the foreign catalog, you’ll need to configure access permissions to control who can view and query the data:
- Navigate to your newly created foreign catalog
- Go to the „Permissions” tab
- Grant appropriate permissions to users, groups, or service principals
- Permissions can be granted at the catalog, schema, or table level, providing fine-grained access control
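In SQL, a typical read-only grant might look like the sketch below; the `analysts` group name is hypothetical:

```sql
-- Let the analysts group browse and query the foreign catalog.
GRANT USE CATALOG ON CATALOG adventure_works_catalog TO `analysts`;
GRANT USE SCHEMA  ON CATALOG adventure_works_catalog TO `analysts`;
GRANT SELECT      ON CATALOG adventure_works_catalog TO `analysts`;
```

Foreign catalogs are read-only, so SELECT-style grants are usually all that is needed.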
Step 4: Query Your Federated Data
With the foreign catalog set up and permissions configured, you can now query the external data:
```sql
-- Query data from the external Azure SQL Database
SELECT * FROM adventure_works_catalog.saleslt.customer LIMIT 10;

-- Join data across systems
SELECT
  c.CustomerID,
  c.FirstName,
  c.LastName,
  o.OrderNumber,
  o.TotalAmount
FROM adventure_works_catalog.saleslt.customer c
JOIN my_databricks_catalog.orders.order_facts o
  ON c.CustomerID = o.CustomerID
WHERE o.OrderDate > '2023-01-01';
```
The ability to seamlessly join data across systems is where the true power of Lakehouse Federation becomes apparent. Analysts can write standard SQL queries that combine data from multiple sources without worrying about the underlying system boundaries. Our Databricks consulting services can help optimize these queries for maximum performance.
Real-World Use Cases and Examples
Let’s explore some real-world scenarios where Lakehouse Federation provides significant value. These are examples our Databricks specialists have implemented for clients.
Use Case 1: Unified Customer Analytics for a Healthcare Provider
Scenario:
A healthcare organization has patient data in an on-premises SQL Server database that cannot be moved due to regulatory requirements. However, they want to combine this data with patient feedback and operational metrics stored in Databricks to create a comprehensive view of patient care.
Solution with Lakehouse Federation:
- Create a connection to the on-premises SQL Server database
- Establish a foreign catalog for the patient data
- Set up appropriate access controls that respect patient data privacy
- Create a dashboard that joins patient demographic data from SQL Server with feedback and operational metrics from Databricks
Outcome:
The healthcare organization gains a unified view of patient care quality without moving sensitive data, maintaining compliance with healthcare regulations while improving analytical capabilities.
Use Case 2: Cross-Department Analytics at a Financial Institution
Scenario:
A bank has different departments using different data systems: the trading desk uses Snowflake, the risk management team uses Databricks, and customer data resides in Oracle.
Solution with Lakehouse Federation:
- Create connections to Snowflake and Oracle
- Establish foreign catalogs for each external system
- Implement row-level security in Unity Catalog to ensure appropriate data access
- Build cross-departmental reports and dashboards that combine data from all three sources
Outcome:
The bank can now perform integrated analytics such as customer profitability analysis that combines trading activity, risk assessments, and customer relationship data, without creating redundant copies of data across systems.
Use Case 3: Legacy System Integration During Migration
Scenario:
A retailer is gradually migrating from legacy data warehouses to Databricks, but the process will take several months. In the meantime, they need to maintain business operations and reporting.
Solution with Lakehouse Federation:
- Create connections to legacy data warehouses
- Establish foreign catalogs for critical data sets
- Build new reports in Databricks that federate data from both migrated and yet-to-be-migrated sources
- Gradually replace federation with direct queries as data migration progresses
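A common cut-over pattern for that last step is to materialize a federated table into a managed Delta table once it is ready to migrate. A minimal sketch, with hypothetical catalog and table names:

```sql
-- Snapshot a table from the legacy warehouse (read via federation)
-- into a managed Delta table in the lakehouse.
CREATE OR REPLACE TABLE main.bronze.customers AS
SELECT * FROM legacy_dw_catalog.dbo.customers;
```

Reports can then be repointed from the foreign catalog to the managed table with no change to the query logic beyond the catalog name.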
Outcome:
The retailer maintains business continuity during migration while allowing analysts to start using Databricks immediately, creating a smooth transition path to the new platform. Our Databricks consultants typically recommend this approach for phased migrations.
Best Practices and Considerations
While Lakehouse Federation offers powerful capabilities, there are several important considerations and best practices to keep in mind. These recommendations come from our extensive experience as Databricks specialists:
Performance Optimization
Queries that span multiple systems may not perform as well as queries against data that resides directly in Databricks. To optimize performance:
- Limit Query Scope: Push as much filtering and aggregation to the source system as possible
- Monitor Query Performance: Track execution times and identify slow-performing queries
- Consider Caching Options: For frequently accessed data, consider creating materialized views (see the sketch after this list)
- Schedule During Off-Peak Hours: Run cross-system reports during times when source systems have less load
- Test Query Plans: Review query plans to ensure they’re optimized for federated access
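Two of these practices translate directly into SQL. For caching, a materialized view keeps a refreshable copy of federated data in the lakehouse (note that materialized views need a serverless SQL warehouse or a pipeline to refresh, and the names below are hypothetical); for query plans, EXPLAIN shows whether filters are being pushed down to the source:

```sql
-- Cache a frequently accessed federated table as a materialized view.
CREATE MATERIALIZED VIEW main.reporting.customer_cache AS
SELECT CustomerID, FirstName, LastName
FROM adventure_works_catalog.saleslt.customer;

-- Review the plan to confirm the filter is pushed down to the source.
EXPLAIN
SELECT * FROM adventure_works_catalog.saleslt.customer
WHERE CustomerID = 42;
```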
Security and Governance
Maintaining appropriate security controls is crucial when federating across systems:
- Implement Least Privilege: Grant only necessary permissions to users and services
- Audit Access Patterns: Regularly review who is accessing what data across systems
- Document Data Lineage: Maintain clear documentation of where data originates
- Consider Data Classification: Apply consistent data classification across all sources
- Implement Row-Level Security: Use Unity Catalog’s row-level security for sensitive data
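To illustrate the last point, Unity Catalog row filters attach a predicate function to a table. The function, table, and column names below are hypothetical, and it is worth verifying whether row filters can be applied directly to a given foreign table in your environment; a view over the foreign table is a common fallback:

```sql
-- Row filter: members of 'admins' see every row; others see EMEA only.
CREATE OR REPLACE FUNCTION main.security.emea_filter(region STRING)
RETURN IF(is_account_group_member('admins'), TRUE, region = 'EMEA');

-- Attach the filter to a table on its region column.
ALTER TABLE main.sales.transactions
SET ROW FILTER main.security.emea_filter ON (region);
```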
Operational Considerations
Ensuring reliable operation of federated systems requires attention to:
- Monitor Source System Availability: Implement alerts for when source systems become unavailable
- Handle Schema Changes: Develop procedures for when source systems change their schema
- Consider Network Reliability: Ensure reliable network connectivity between Databricks and source systems
- Test Failure Scenarios: Practice recovering from various failure modes
- Document Dependencies: Maintain clear documentation of system dependencies
Data Architecture Strategy
Consider how federation fits into your long-term data architecture:
- Define Clear Federation Criteria: Establish guidelines for when to federate vs. ingest data
- Create a Migration Roadmap: For critical data, plan if and when to migrate from federation to ingestion
- Balance Performance and Flexibility: Recognize the tradeoffs between keeping data in original sources vs. consolidating
- Consider Data Lifecycles: Implement appropriate archiving strategies for federated data
- Align with Cloud Strategy: Ensure federation approach aligns with broader cloud migration strategies
Lakehouse Federation in Action: Case Study
Let’s look at a real-world example of how an organization successfully implemented Lakehouse Federation to transform their analytics capabilities with the help of Databricks consulting services.
Powys Teaching Health Board (PTHB)
Powys Teaching Health Board (PTHB), which serves the largest county in Wales, faced a significant challenge: about 50% of the data they needed for patient-centric decisions was provided by neighboring organizations in varying formats. This fragmentation slowed their ability to connect data with patient care quality.
With the Databricks Data Intelligence Platform, PTHB implemented a federated lakehouse approach to unify their view of disparate data streams. Using Lakehouse Federation, they seamlessly connected to their on-premises SQL warehouse and remote BigQuery environment at NHS Wales, creating a single view of their data estate.
The impact was substantial:
- They modernized their data infrastructure in less than a year, far ahead of their initial 5-10 year estimate
- Processing efficiency increased by approximately 40% in building data pipelines for analytics
- Data analysts spent 65% less time answering questions from business users with the help of Databricks Assistant
As Jake Hammer, Chief Data Officer at PTHB, noted: “The adoption of Databricks has ensured that we can future-proof our data capabilities. It has transformed and modernized the way we work, and that has a direct impact on the quality of care delivered to our community.”
PTHB’s experience demonstrates how federation can serve as a catalyst for data modernization, allowing organizations to quickly realize value while establishing a foundation for long-term data strategy. This is a pattern our Databricks specialists have seen repeatedly across industries.
The Future of Data Federation
As data ecosystems continue to evolve, we can expect several trends to shape the future of data federation:
- AI-Powered Federation: Intelligent systems will automatically optimize when and how to federate vs. ingest data based on usage patterns
- Enhanced Cross-Platform Optimization: Advanced query optimization across multiple platforms will improve performance
- Expanded Connector Ecosystem: Support for an increasingly diverse set of data sources will broaden federation possibilities
- Real-Time Federation: Capabilities will extend to real-time data streams across platforms
- Semantic Federation: Federation will move beyond tables to include models, features, and other semantic elements
Conclusion
Lakehouse Federation represents a significant advancement in data architecture, allowing organizations to break down data silos without the time, cost, and complexity of full data migration. By enabling seamless querying across diverse data sources, Databricks has taken a major step toward the vision of a truly unified data platform.
However, federation is not a silver bullet for all data challenges. Organizations should approach it as part of a comprehensive data strategy, using federation where it makes sense while continuing to consolidate critical data assets directly into the lakehouse where appropriate. Our experienced team of Databricks consultants can help you make these strategic decisions.
As you embark on your federation journey, remember to balance the immediate benefits of unified access against long-term considerations of performance, governance, and scalability. With thoughtful implementation guided by Databricks consulting services, Lakehouse Federation can significantly accelerate your organization’s path to becoming truly data-driven, making the vast landscape of scattered data assets finally accessible for comprehensive analysis and insight generation.
The ability to query data where it resides – without the overhead of moving it – opens new possibilities for collaboration, analysis, and innovation. By embracing this capability, organizations can focus more on deriving value from their data and less on the mechanics of data movement and replication. In a world where data volumes continue to grow exponentially, this shift from “move and duplicate” to “connect and analyze” may well become the new standard for modern data architecture.
