Introduction
In today’s complex data landscape, organizations rarely have all their data in one place. Data is scattered across various platforms – cloud data warehouses like Snowflake, on-premises databases, SaaS applications, and data lakes. This fragmentation creates significant challenges for data teams trying to derive comprehensive insights from their data assets. While migrating all data to a single platform is ideal in theory, it’s often impractical due to technical constraints, cost considerations, or organizational politics.
Enter Databricks Lakehouse Federation – a groundbreaking capability that allows organizations to query data across multiple external sources without having to move or copy it. As experienced Databricks consultants, we’ve seen how extending the Databricks lakehouse architecture to federate data across various systems enables organizations to achieve a unified view of their data estate while maintaining the flexibility to keep data in its original location when necessary.
This article explores how Lakehouse Federation works, when to use it, and practical steps to implement it in your organization with the help of Databricks consulting services.
Understanding Lakehouse Federation
What is Lakehouse Federation?
Lakehouse Federation is a capability within the Databricks Data Intelligence Platform that allows you to query data across external data sources – including cloud data warehouses, databases, and even other Databricks instances – without having to move or copy the data. This is accomplished by using Unity Catalog, which provides a unified metadata layer for all your data.
The key concept behind federation is the ability to create “foreign catalogs” that represent external data sources. Once a foreign catalog is established, users can query tables and views in these external systems using standard SQL, just as they would query any table in Databricks. Our team of Databricks specialists can help you configure this efficiently.
Key Benefits of Lakehouse Federation
Lakehouse Federation offers several significant benefits:
- Unified Data Access: Query data across your entire data estate from a single interface.
- Reduced Data Silos: Break down barriers between different data platforms and departments.
- Consistent Governance: Apply Unity Catalog’s security and governance capabilities uniformly across all data sources.
- Simplified Architecture: Avoid complex ETL pipelines that merely copy data between systems.
- Cost Optimization: Reduce storage costs by eliminating duplicate data copies.
- Accelerated Insights: Enable analysts to access all relevant data without waiting for data movement.
When to Use Lakehouse Federation vs. Data Ingestion
While Lakehouse Federation offers powerful capabilities, it’s important to recognize that it’s not a replacement for data ingestion in all scenarios. As Databricks consultants, we recommend the following guidelines to help you decide when to use each approach:
Use Federation When:
- You need to query data that changes infrequently
- You have read-only access requirements to source systems
- You need to join data across multiple systems for ad-hoc analysis
- The data volume is too large to cost-effectively duplicate
- Regulatory requirements mandate that data remains in certain systems
- You need a quick solution before implementing a more permanent data architecture
Use Data Ingestion (Copy Data to Databricks) When:
- You need to perform complex transformations or machine learning on the data
- You require high-performance analytics on large datasets
- The source system has performance constraints that could be impacted by federation queries
- You need to enable real-time or near real-time processing
- You want to implement a medallion architecture (Bronze, Silver, Gold layers)
- You’re building production data products that require predictable performance
In many real-world scenarios, organizations will use a hybrid approach – federating some data sources while ingesting others into Databricks, depending on specific use cases and requirements. Databricks consulting services can help determine the optimal approach for your specific data landscape.
Setting Up Lakehouse Federation
Let’s walk through the process of setting up Lakehouse Federation in Databricks, using an example with an Azure SQL Database. Our Databricks specialists regularly implement these setups for clients across various industries.
Prerequisites
Before you begin, ensure you have:
- A Databricks workspace enabled for Unity Catalog
- A Unity Catalog metastore attached to that workspace
- Compute to run federation queries: a cluster on Databricks Runtime 13.1 or above, or a pro or serverless SQL warehouse
- Network connectivity from that cluster or SQL warehouse to the target database
- The necessary permissions to create connections and foreign catalogs in Unity Catalog
Step 1: Set Up a Connection
The first step in the federation process is to establish a connection to your external data source. The connection stores the authentication and connectivity details needed to reach the external system. You can create it through the UI as described below, or directly in SQL (see the sketch after these steps).
From the Databricks interface:
- Expand the Catalog view in your Databricks workspace
- Navigate to the Connections section
- Click “Create Connection”
- Provide a name for your connection (e.g., “azure_sql_connection”)
- Select the connection type (e.g., “Azure SQL Database”)
- Enter the relevant authentication details:
  - Server hostname
  - Database name
  - Authentication method (SQL Login, Azure AD, etc.)
  - Username and password (or other credentials based on auth method)
- Test the connection to verify it works
- Save the connection
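If you prefer to script this step, the connection can also be created in SQL. Here is a minimal sketch for an Azure SQL Database, with placeholder host and credentials; in practice we recommend referencing Databricks secrets rather than embedding a password:

```sql
-- Create a Unity Catalog connection to an Azure SQL Database.
-- The host, user, and password values are placeholders.
CREATE CONNECTION azure_sql_connection TYPE sqlserver
OPTIONS (
  host 'myserver.database.windows.net',
  port '1433',
  user 'federation_reader',
  password 'REPLACE_WITH_SECRET'
);
```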
Step 2: Create a Foreign Catalog
Once you have a working connection, you can create a foreign catalog that will represent the external data source within Databricks:
- Navigate back to the Catalog view
- Click “Create Catalog”
- Provide the essential details:
  - Name your catalog (e.g., “adventure_works_catalog”)
  - Select “Foreign” as the Type
  - Choose the connection you created in Step 1
  - Specify the schema or database within the external system that you want to access
- Click “Create”
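The SQL equivalent, assuming the connection created above and a source database named AdventureWorksLT (swap in your own database name):

```sql
-- Expose the external database as a foreign catalog in Unity Catalog.
CREATE FOREIGN CATALOG adventure_works_catalog
USING CONNECTION azure_sql_connection
OPTIONS (database 'AdventureWorksLT');
```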
Step 3: Configure Access Permissions
After creating the foreign catalog, you’ll need to configure access permissions to control who can view and query the data:
- Navigate to your newly created foreign catalog
- Go to the „Permissions” tab
- Grant appropriate permissions to users, groups, or service principals
- Permissions can be granted at the catalog, schema, or table level, providing fine-grained access control
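In SQL, a typical read-only grant might look like the sketch below; the `analysts` group name is hypothetical:

```sql
-- Let the analysts group browse and query the foreign catalog.
GRANT USE CATALOG ON CATALOG adventure_works_catalog TO `analysts`;
GRANT USE SCHEMA  ON CATALOG adventure_works_catalog TO `analysts`;
GRANT SELECT      ON CATALOG adventure_works_catalog TO `analysts`;
```

Foreign catalogs are read-only, so SELECT-style grants are usually all that is needed.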
Step 4: Query Your Federated Data
With the foreign catalog set up and permissions configured, you can now query the external data:
```sql
-- Query data from the external Azure SQL Database
SELECT * FROM adventure_works_catalog.saleslt.customer LIMIT 10;

-- Join data across systems
SELECT
  c.CustomerID,
  c.FirstName,
  c.LastName,
  o.OrderNumber,
  o.TotalAmount
FROM adventure_works_catalog.saleslt.customer c
JOIN my_databricks_catalog.orders.order_facts o
  ON c.CustomerID = o.CustomerID
WHERE o.OrderDate > '2023-01-01';
```
The ability to seamlessly join data across systems is where the true power of Lakehouse Federation becomes apparent. Analysts can write standard SQL queries that combine data from multiple sources without worrying about the underlying system boundaries. Our Databricks consulting services can help optimize these queries for maximum performance.
Real-World Use Cases and Examples
Let’s explore some real-world scenarios where Lakehouse Federation provides significant value. These are examples our Databricks specialists have implemented for clients.
Use Case 1: Unified Customer Analytics for a Healthcare Provider
Scenario:
A healthcare organization has patient data in an on-premises SQL Server database that cannot be moved due to regulatory requirements. However, they want to combine this data with patient feedback and operational metrics stored in Databricks to create a comprehensive view of patient care.
Solution with Lakehouse Federation:
- Create a connection to the on-premises SQL Server database
- Establish a foreign catalog for the patient data
- Set up appropriate access controls that respect patient data privacy
- Create a dashboard that joins patient demographic data from SQL Server with feedback and operational metrics from Databricks
Outcome:
The healthcare organization gains a unified view of patient care quality without moving sensitive data, maintaining compliance with healthcare regulations while improving analytical capabilities.
Use Case 2: Cross-Department Analytics at a Financial Institution
Scenario:
A bank has different departments using different data systems: the trading desk uses Snowflake, the risk management team uses Databricks, and customer data resides in Oracle.
Solution with Lakehouse Federation:
- Create connections to Snowflake and Oracle
- Establish foreign catalogs for each external system
- Implement row-level security in Unity Catalog to ensure appropriate data access
- Build cross-departmental reports and dashboards that combine data from all three sources
Outcome:
The bank can now perform integrated analytics such as customer profitability analysis that combines trading activity, risk assessments, and customer relationship data, without creating redundant copies of data across systems.
Use Case 3: Legacy System Integration During Migration
Scenario:
A retailer is gradually migrating from legacy data warehouses to Databricks, but the process will take several months. In the meantime, they need to maintain business operations and reporting.
Solution with Lakehouse Federation:
- Create connections to legacy data warehouses
- Establish foreign catalogs for critical data sets
- Build new reports in Databricks that federate data from both migrated and yet-to-be-migrated sources
- Gradually replace federation with direct queries as data migration progresses
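A common cut-over pattern for that last step is to materialize a federated table into a managed Delta table once it is ready to migrate. A minimal sketch, with hypothetical catalog and table names:

```sql
-- Snapshot a table from the legacy warehouse (read via federation)
-- into a managed Delta table in the lakehouse.
CREATE OR REPLACE TABLE main.bronze.customers AS
SELECT * FROM legacy_dw_catalog.dbo.customers;
```

Reports can then be repointed from the foreign catalog to the managed table with no change to the query logic beyond the catalog name.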
Outcome:
The retailer maintains business continuity during migration while allowing analysts to start using Databricks immediately, creating a smooth transition path to the new platform. Our Databricks consultants typically recommend this approach for phased migrations.
Best Practices and Considerations
While Lakehouse Federation offers powerful capabilities, there are several important considerations and best practices to keep in mind. These recommendations come from our extensive experience as Databricks specialists:
Performance Optimization
Queries that span multiple systems may not perform as well as queries against data that resides directly in Databricks. To optimize performance:
- Limit Query Scope: Push as much filtering and aggregation to the source system as possible
- Monitor Query Performance: Track execution times and identify slow-performing queries
- Consider Caching Options: For frequently accessed data, consider creating materialized views (see the sketch after this list)
- Schedule During Off-Peak Hours: Run cross-system reports during times when source systems have less load
- Test Query Plans: Review query plans to ensure they’re optimized for federated access
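Two of these practices translate directly into SQL. For caching, a materialized view keeps a refreshable copy of federated data in the lakehouse (note that materialized views need a serverless SQL warehouse or a pipeline to refresh, and the names below are hypothetical); for query plans, EXPLAIN shows whether filters are being pushed down to the source:

```sql
-- Cache a frequently accessed federated table as a materialized view.
CREATE MATERIALIZED VIEW main.reporting.customer_cache AS
SELECT CustomerID, FirstName, LastName
FROM adventure_works_catalog.saleslt.customer;

-- Review the plan to confirm the filter is pushed down to the source.
EXPLAIN
SELECT * FROM adventure_works_catalog.saleslt.customer
WHERE CustomerID = 42;
```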
Security and Governance
Maintaining appropriate security controls is crucial when federating across systems:
- Implement Least Privilege: Grant only necessary permissions to users and services
- Audit Access Patterns: Regularly review who is accessing what data across systems
- Document Data Lineage: Maintain clear documentation of where data originates
- Consider Data Classification: Apply consistent data classification across all sources
- Implement Row-Level Security: Use Unity Catalog’s row-level security for sensitive data
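To illustrate the last point, Unity Catalog row filters attach a predicate function to a table. The function, table, and column names below are hypothetical, and it is worth verifying whether row filters can be applied directly to a given foreign table in your environment; a view over the foreign table is a common fallback:

```sql
-- Row filter: members of 'admins' see every row; others see EMEA only.
CREATE OR REPLACE FUNCTION main.security.emea_filter(region STRING)
RETURN IF(is_account_group_member('admins'), TRUE, region = 'EMEA');

-- Attach the filter to a table on its region column.
ALTER TABLE main.sales.transactions
SET ROW FILTER main.security.emea_filter ON (region);
```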
Operational Considerations
Ensuring reliable operation of federated systems requires attention to:
- Monitor Source System Availability: Implement alerts for when source systems become unavailable
- Handle Schema Changes: Develop procedures for when source systems change their schema
- Consider Network Reliability: Ensure reliable network connectivity between Databricks and source systems
- Test Failure Scenarios: Practice recovering from various failure modes
- Document Dependencies: Maintain clear documentation of system dependencies
Data Architecture Strategy
Consider how federation fits into your long-term data architecture:
- Define Clear Federation Criteria: Establish guidelines for when to federate vs. ingest data
- Create a Migration Roadmap: For critical data, plan if and when to migrate from federation to ingestion
- Balance Performance and Flexibility: Recognize the tradeoffs between keeping data in original sources vs. consolidating
- Consider Data Lifecycles: Implement appropriate archiving strategies for federated data
- Align with Cloud Strategy: Ensure federation approach aligns with broader cloud migration strategies
Lakehouse Federation in Action: Case Study
Let’s look at a real-world example of how an organization successfully implemented Lakehouse Federation to transform their analytics capabilities with the help of Databricks consulting services.
Powys Teaching Health Board (PTHB)
Powys Teaching Health Board (PTHB), which serves the largest county in Wales, faced a significant challenge: about 50% of the data they needed for patient-centric decisions was provided by neighboring organizations in varying formats. This fragmentation slowed their ability to connect data with patient care quality.
With the Databricks Data Intelligence Platform, PTHB implemented a federated lakehouse approach to unify their view of disparate data streams. Using Lakehouse Federation, they seamlessly connected to their on-premises SQL warehouse and remote BigQuery environment at NHS Wales, creating a single view of their data estate.
The impact was substantial:
- They modernized their data infrastructure in less than a year, far ahead of their initial 5-10 year estimate
- Processing efficiency increased by approximately 40% in building data pipelines for analytics
- Data analysts spent 65% less time answering questions from business users with the help of Databricks Assistant
As Jake Hammer, Chief Data Officer at PTHB, noted: “The adoption of Databricks has ensured that we can future-proof our data capabilities. It has transformed and modernized the way we work, and that has a direct impact on the quality of care delivered to our community.”
PTHB’s experience demonstrates how federation can serve as a catalyst for data modernization, allowing organizations to quickly realize value while establishing a foundation for long-term data strategy. This is a pattern our Databricks specialists have seen repeatedly across industries.
The Future of Data Federation
As data ecosystems continue to evolve, we can expect several trends to shape the future of data federation:
- AI-Powered Federation: Intelligent systems will automatically optimize when and how to federate vs. ingest data based on usage patterns
- Enhanced Cross-Platform Optimization: Advanced query optimization across multiple platforms will improve performance
- Expanded Connector Ecosystem: Support for an increasingly diverse set of data sources will broaden federation possibilities
- Real-Time Federation: Capabilities will extend to real-time data streams across platforms
- Semantic Federation: Federation will move beyond tables to include models, features, and other semantic elements
Conclusion
Lakehouse Federation represents a significant advancement in data architecture, allowing organizations to break down data silos without the time, cost, and complexity of full data migration. By enabling seamless querying across diverse data sources, Databricks has taken a major step toward the vision of a truly unified data platform.
However, federation is not a silver bullet for all data challenges. Organizations should approach it as part of a comprehensive data strategy, using federation where it makes sense while continuing to consolidate critical data assets directly into the lakehouse where appropriate. Our experienced team of Databricks consultants can help you make these strategic decisions.
As you embark on your federation journey, remember to balance the immediate benefits of unified access against long-term considerations of performance, governance, and scalability. With thoughtful implementation guided by Databricks consulting services, Lakehouse Federation can significantly accelerate your organization’s path to becoming truly data-driven, making the vast landscape of scattered data assets finally accessible for comprehensive analysis and insight generation.
The ability to query data where it resides – without the overhead of moving it – opens new possibilities for collaboration, analysis, and innovation. By embracing this capability, organizations can focus more on deriving value from their data and less on the mechanics of data movement and replication. In a world where data volumes continue to grow exponentially, this shift from “move and duplicate” to “connect and analyze” may well become the new standard for modern data architecture.
