Introduction
Today’s businesses demand instant insights from their data. Traditional batch-driven data warehouses often introduce reporting lags of hours, or even days, which makes real-time, data-driven decision-making difficult. At Datagize, we’ve built a near-real-time data warehousing architecture on Azure that delivers 3–5 second latency from source databases to fact tables. In this post, we’ll walk you through the key components of our solution and show how we tackled performance, reliability, and cost without sacrificing maintainability or security.
Who Should Read This Post?
- Chief Data Officers (CDOs), CIOs, and Directors of BI/DW: Looking to modernize data platforms or enable real-time analytics.
- Data Architects and Enterprise Architects: Evaluating Azure services for high-speed data ingestion, transformation, and reporting.
- BI/Data Warehouse Managers: Wanting to understand how near-real-time data warehousing can be implemented at scale.
Key Takeaways
- Rapid Delivery: Datagize’s prebuilt Python components make near-real-time data pipelines easier and faster to implement.
- Scalable Azure Stack: Leveraging Azure SQL Database, Azure Functions, Event Hubs, and Stream Analytics for both low latency and resiliency.
- Cost-Effective & Flexible: Pay-as-you-go consumption model plus strategic tuning to keep overhead manageable.
Architecture Overview
The end-to-end data flow is built from the following components:
- Azure SQL Database (with CDC) – The source system uses Change Data Capture (CDC) on tables that need real-time syncing.
- Azure Logic Apps – A lightweight workflow that triggers our first Python-based Azure Function on a fixed recurrence (e.g., every 2 seconds).
- Azure Functions (CDC Reader) – A bespoke Python script pulls new or updated rows from the source database (using CDC or a sequence ID), then writes these events to Azure Event Hubs.
- Azure Event Hubs – Receives and temporarily buffers incoming data events.
- Azure Stream Analytics – Consumes events in near real-time and calls our second Azure Function.
- Azure Functions (Procedure Caller) – Another Python script that processes the events and calls a stored procedure in the target data warehouse.
- Azure SQL Data Warehouse – The final destination for fact and dimension tables, updated on a streaming basis, with specialized logic to handle asynchronous arrivals and early-arriving facts.
- Power BI – Consumes the latest data from the warehouse for dashboards, reports, and analytics.
High-Level Data Flow:
- Detect Changes: Azure SQL DB logs table changes via CDC.
- Pull Changes: An Azure Logic App triggers every 2 seconds, invoking the CDC Reader Function.
- Queue Events: The Function sends new/updated rows to Azure Event Hubs.
- Stream & Process: Azure Stream Analytics picks up the event stream and calls the second Python Function.
- Load Warehouse: The second Function executes a stored procedure in the Azure SQL Data Warehouse, updating facts and dimensions in near-real time.
- Analytics: Power BI taps into the warehouse for dashboards and reports.
Key Implementation Details
Change Data Capture Setup
We set up CDC at the table level in Azure SQL Database. This allows us to track inserts, updates, and deletes without intrusive changes to application logic. Alternatively, if a reliable timestamp or sequence column exists, it can serve as a simpler fallback.
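Enabling CDC is a one-time administrative step against the source database. The sketch below shows one way to do it from Python with pyodbc; the connection string, schema, and SalesOrder table are placeholders rather than our actual configuration.

```python
import pyodbc

# Placeholder connection string for the source Azure SQL Database.
SOURCE_CONN_STR = (
    "Driver={ODBC Driver 18 for SQL Server};"
    "Server=tcp:<source-server>.database.windows.net,1433;"
    "Database=<source-db>;Uid=<user>;Pwd=<password>;Encrypt=yes;"
)

with pyodbc.connect(SOURCE_CONN_STR, autocommit=True) as conn:
    cur = conn.cursor()
    # Enable CDC at the database level (skipped if it is already enabled).
    cur.execute(
        "IF (SELECT is_cdc_enabled FROM sys.databases WHERE name = DB_NAME()) = 0 "
        "EXEC sys.sp_cdc_enable_db"
    )
    # Enable CDC on the table to sync; dbo.SalesOrder is a hypothetical example.
    cur.execute("""
        EXEC sys.sp_cdc_enable_table
             @source_schema = N'dbo',
             @source_name   = N'SalesOrder',
             @role_name     = NULL,
             @supports_net_changes = 1
    """)
```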
Bespoke Python Functions
- CDC Reader Function
- Pulls incremental changes from source tables (using CDC or a custom sequence/timestamp).
- Packages these changes into event payloads and pushes them to Azure Event Hubs.
- Intelligent error handling, batching, and incremental read logic are part of Datagize’s “secret sauce” (a simplified sketch of this function follows this list).
- Procedure Caller Function
- Subscribes to streaming events from Azure Stream Analytics.
- Batches or processes row-by-row transactions as needed.
- Invokes a stored procedure in the Azure SQL Data Warehouse for the final load. The stored procedure manages fact/dimension updates, handles upserts, and addresses early-arriving facts (see the second sketch below).
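To make the CDC Reader pattern concrete, here is a heavily simplified sketch. It assumes a hypothetical dbo_SalesOrder capture instance, the azure-eventhub SDK, and an LSN watermark passed in by the caller (in production this watermark is persisted between invocations); our actual error handling, batching, and incremental-read logic are omitted.

```python
import json

import pyodbc
from azure.eventhub import EventData, EventHubProducerClient

# Placeholder connection settings.
SOURCE_CONN_STR = "Driver={ODBC Driver 18 for SQL Server};Server=<source>;Database=<db>;..."
EVENT_HUB_CONN_STR = "Endpoint=sb://<namespace>.servicebus.windows.net/;..."
EVENT_HUB_NAME = "cdc-changes"


def read_and_publish_changes(last_lsn: bytes) -> bytes:
    """Read CDC rows committed after last_lsn and publish them to Event Hubs."""
    with pyodbc.connect(SOURCE_CONN_STR) as conn:
        cur = conn.cursor()
        # Current upper bound of the CDC log. (On the very first run, the lower
        # bound would come from sys.fn_cdc_get_min_lsn('dbo_SalesOrder') instead.)
        max_lsn = cur.execute("SELECT sys.fn_cdc_get_max_lsn()").fetchval()
        from_lsn = cur.execute("SELECT sys.fn_cdc_increment_lsn(?)", last_lsn).fetchval()
        if max_lsn is None or from_lsn > max_lsn:
            return last_lsn  # nothing new since the last poll

        cur.execute(
            "SELECT * FROM cdc.fn_cdc_get_all_changes_dbo_SalesOrder(?, ?, N'all')",
            from_lsn, max_lsn)
        columns = [c[0] for c in cur.description]
        rows = [dict(zip(columns, r)) for r in cur.fetchall()]

    if rows:
        producer = EventHubProducerClient.from_connection_string(
            EVENT_HUB_CONN_STR, eventhub_name=EVENT_HUB_NAME)
        with producer:
            batch = producer.create_batch()
            for row in rows:
                # Production code splits into additional batches when this one fills up.
                batch.add(EventData(json.dumps(row, default=str)))
            producer.send_batch(batch)

    return max_lsn
```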
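The Procedure Caller can be sketched as an HTTP-triggered function that Stream Analytics invokes through its Azure Function output. The payload shape and the usp_merge_sales_fact procedure below are illustrative placeholders, not our production contract.

```python
import json
import logging

import azure.functions as func
import pyodbc

DW_CONN_STR = "Driver={ODBC Driver 18 for SQL Server};Server=<dw>;Database=<db>;..."


def main(req: func.HttpRequest) -> func.HttpResponse:
    """HTTP trigger called by the Stream Analytics output (binding config omitted)."""
    events = req.get_json()          # Stream Analytics posts a JSON array of events
    if not isinstance(events, list):
        events = [events]

    with pyodbc.connect(DW_CONN_STR) as conn:
        cur = conn.cursor()
        for event in events:
            # Hypothetical proc that resolves dimension keys, upserts the fact row,
            # and parks early-arriving facts until their dimensions catch up.
            cur.execute("EXEC dbo.usp_merge_sales_fact @payload = ?", json.dumps(event))
        conn.commit()

    logging.info("Processed %d event(s)", len(events))
    return func.HttpResponse(status_code=200)
```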
Performance Tuning (High-Level)
Although the specifics of our tuning remain proprietary, we rely on standard best practices like creating the right indexes, partitioning large tables, and carefully managing concurrency. These tactics help maintain 3–5 second latencies while handling real-world data volumes.
Latency Achievements and Monitoring
We measured our 3–5 second end-to-end performance by inserting or updating small test batches (around 44 rows) in the source. Each major step (CDC detection, event publication, streaming, and data warehouse insertion) was timestamped. By comparing logs, we confirmed that data typically arrived in the warehouse in under 5 seconds—even under varying loads.
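The arithmetic behind that measurement is straightforward: each stage logs a timestamp for the test batch, and the end-to-end latency is the difference between the first and last stamps. A minimal sketch with made-up timestamps:

```python
from datetime import datetime, timezone

# Hypothetical timestamps logged for one test batch at each pipeline stage.
stamps = {
    "source_commit":    datetime(2024, 1, 15, 12, 0, 0, 120000, tzinfo=timezone.utc),
    "event_published":  datetime(2024, 1, 15, 12, 0, 1, 480000, tzinfo=timezone.utc),
    "stream_output":    datetime(2024, 1, 15, 12, 0, 2, 910000, tzinfo=timezone.utc),
    "warehouse_commit": datetime(2024, 1, 15, 12, 0, 4, 200000, tzinfo=timezone.utc),
}

end_to_end = stamps["warehouse_commit"] - stamps["source_commit"]
print(f"End-to-end latency: {end_to_end.total_seconds():.2f} s")

# Per-hop breakdown shows which stage contributes the most delay.
ordered = list(stamps.items())
for (prev_name, prev_ts), (name, ts) in zip(ordered, ordered[1:]):
    print(f"{prev_name} -> {name}: {(ts - prev_ts).total_seconds():.2f} s")
```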
Monitoring
- We used Azure Monitor and Application Insights to track function invocations, event processing times, and throughput.
- Built-in Azure dashboards helped visualize average latency across the pipeline, alerting on potential bottlenecks.
Scalability and Resilience
Handling Spikes
Our architecture accounts for high-volume or bursty data through batching and concurrency. Azure Event Hubs and Stream Analytics can scale to handle large spikes, while the stored procedure approach in the data warehouse uses staging tables to handle large inserts efficiently.
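One way to realize the staging-table pattern from Python is to bulk-insert the burst with pyodbc's fast_executemany and then let a single stored-procedure call merge it into the warehouse tables. The staging table and procedure names below are placeholders.

```python
import pyodbc

DW_CONN_STR = "Driver={ODBC Driver 18 for SQL Server};Server=<dw>;Database=<db>;..."


def load_batch(rows):
    """Bulk-insert a burst of change rows into a staging table, then merge once."""
    with pyodbc.connect(DW_CONN_STR) as conn:
        cur = conn.cursor()
        cur.fast_executemany = True  # array-bound inserts instead of row-by-row round trips
        cur.executemany(
            "INSERT INTO stg.SalesOrderChanges (OrderId, CustomerId, Amount, ChangeType) "
            "VALUES (?, ?, ?, ?)",
            rows)
        # One set-based merge into the fact/dimension tables handles the whole burst.
        cur.execute("EXEC dbo.usp_merge_staged_sales_orders")
        conn.commit()
```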
Retries and Failures
Both Functions and Logic Apps can be configured with retry policies to handle transient errors. If a Function or a downstream dependency suffers a transient outage, events remain buffered in Event Hubs for its configured retention period, so processing can resume without data loss once the affected component recovers; Event Hubs itself is a replicated, highly available service.
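Beyond the platform-level retry policies, an application-level retry with exponential backoff around transient calls is a common complement. The helper below is a generic sketch, not Datagize's production code:

```python
import random
import time


def with_retries(operation, max_attempts=5, base_delay=0.5):
    """Retry a transient operation with exponential backoff and a little jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception:  # narrow this to transient error types in real code
            if attempt == max_attempts:
                raise
            delay = base_delay * (2 ** (attempt - 1)) + random.uniform(0, 0.1)
            time.sleep(delay)


# Usage: wrap the Event Hubs send or the stored-procedure call, e.g.
# with_retries(lambda: producer.send_batch(batch))
```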
Cost Management
Optimizing Azure Functions
Since Functions run on a consumption plan, cost is tied to execution time and frequency. By carefully setting polling intervals (in this case, every 2 seconds for near-real-time needs), we minimize unnecessary triggers. Also, optimizing the Python scripts reduces runtime and thus overall cost.
Other Services
- Event Hubs & Stream Analytics: Typically adds about 5–20% overhead on top of core SQL costs. With efficient scaling and batch processing, these services remain relatively cost-effective.
- Logic Apps: Minimal overhead given our lightweight approach (calls every 2 seconds).
- Azure SQL Costs: The main expense usually comes from source and target Azure SQL DB environments. Our real-time pipeline approach adds only a manageable layer of overhead on top.
Security and Governance
Our Azure SQL environments use encryption at rest by default. We can also encrypt data in transit for end-to-end protection. Standard Azure security features—like network restrictions and IP whitelisting—can be applied to Functions, Event Hubs, and Stream Analytics. While not the focus of this article, robust data governance and role-based access control are critical for any production environment, especially when multiple teams need different levels of access.
Future Roadmap
We plan to explore the following enhancements:
- Delta Lakehouse with Databricks
- Implementing a Delta Lake architecture adds an open, transactional storage layer that handles both structured and unstructured data.
- With Databricks, we can unify batch and streaming data, enabling more complex transformations and near-real-time analytics on a broader data set.
- Further Cost Optimization
- Exploring reserved capacity or other tiers for Event Hubs and Stream Analytics.
- Tweaking polling intervals and function runtime to balance real-time needs with cost efficiency.
- Enhanced Security
- Adding encryption in transit (TLS) for every service endpoint.
- Exploring advanced firewall/network rules for each service.
- Edge Cases & Complex Scenarios
- Continuous improvements to handle advanced use cases like multi-table transactions, referential integrity checks, and advanced data transformations.
Lessons Learned
- Preview Features: The built-in CDC (Preview) feature in Azure Data Factory (ADF) has cost inefficiencies and throttling limitations, which is part of why we opted for a bespoke CDC reader.
- Polling Intervals: Balancing near-real-time needs with cost overhead can be tricky—finding the right frequency is key.
- Proprietary Tuning: Our Python-based approach gave us more control and better performance than off-the-shelf solutions.
Conclusion
By combining Azure SQL CDC, Azure Functions, Event Hubs, and Stream Analytics with Datagize’s bespoke Python components, we’ve delivered a solution that enables near-real-time data warehousing with latencies as low as 3–5 seconds. This architecture proves that speed, flexibility, and cost-effectiveness can coexist with the right design choices.
Ready to transform your data platform? Contact Datagize to learn how we can accelerate your journey to near-real-time data analytics on Azure—and explore the possibilities of Delta Lakehouse and Microsoft Fabric in your environment.
About Datagize
Datagize specializes in building scalable, high-performance data solutions that drive actionable insights. Our team of experts has deep experience with cloud-native architectures, BI, analytics, and machine learning—empowering businesses to stay ahead in a data-driven world.