Datashift’s Post

12,091 followers

11mo

What if your data tools could think smarter, run faster, and evolve with you? Great energy and insights at the latest Analytics Engineering Belgium meetup! Here are a few highlights we can’t stop talking about:

🚢🚢 Our own colleague Nicolas Jonckheere took the stage together with Emiel Ackermann and Lars Cools to share some insider lessons from Port of Antwerp-Bruges's Masterclass in Migration: "1 dbt repo, 2 targets". It was great to hear directly from our Port of Antwerp team about their journey from SQL Server to Databricks. Their approach of using a single dbt repository to target two different data warehouses simultaneously is a brilliant strategy for a phased and controlled migration, ensuring business-critical reports remain operational throughout the transition.

✅ Smarter Data Testing with Mikkel Dengsøe from SYNQ. The second talk featured Mikkel, co-founder of SYNQ, who shared some powerful philosophies on data testing. He argued that the goal isn't just to add more tests, but to build a testing culture that people trust. When too many tests fail, it creates "alert fatigue," and the alerts get ignored. His key advice was to move beyond pass/fail and create a holistic quality score for your data products. It was impressive to see how these concepts are being leveraged directly in their tool, SYNQ.

🚀 The Future is Now (and it's Fast!): The dbt Fusion Engine. The final talk on the new features from dbt Labs by Bart van Delft was a showstopper. The dbt Fusion engine is set to revolutionize our workflows. It powers benefits for every dbt user, from Core to Cloud:

1️⃣ Blazing-Fast Performance (Core & Cloud): Rebuilt from the ground up in Rust, the new engine makes dbt project parsing and compilation incredibly fast.
2️⃣ A Game-Changing VS Code Extension (Core & Cloud): The new extension brings the power of Fusion directly into your editor. The best feature? Intelligent refactoring. Renaming a model and having all downstream dependencies update automatically is a massive win for productivity.
3️⃣ Intelligent Orchestration (dbt Cloud Exclusive): This is where it gets really powerful for dbt Cloud users. They announced a state awareness feature where dbt knows precisely what has changed in your project, down to the column level. It intelligently runs only the models impacted by code or source data changes. This means faster, more efficient runs and a direct reduction in data warehouse costs. No more wasted cycles on unnecessary builds!

Exciting times ahead for the dbt community. Follow us for more real stories, lessons from the field, and practical data know-how.

#dbt #AnalyticsEngineering #DataEngineering #PortOfAntwerp #Databricks #SYNQ #DataTesting #dbtFusion #dbtCloud #VSCode
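
For readers who want to picture what "runs only the models impacted by code or source data changes" means, here is a minimal sketch in plain Python. It is not how dbt Fusion is implemented: the model names and the `DOWNSTREAM` graph are hypothetical, and the sketch works at model level only, while the talk described awareness down to the column level.

```python
from collections import deque

# Hypothetical dependency graph (assumption): model -> models that read from it.
DOWNSTREAM = {
    "stg_orders": ["int_orders"],
    "stg_customers": ["int_orders", "dim_customers"],
    "int_orders": ["fct_revenue"],
    "dim_customers": [],
    "fct_revenue": [],
}

def impacted_models(changed, downstream):
    """Return the changed models plus everything downstream of them (breadth-first)."""
    order = list(dict.fromkeys(changed))  # drop duplicates, keep input order
    seen = set(order)
    queue = deque(order)
    while queue:
        node = queue.popleft()
        for child in downstream.get(node, []):
            if child not in seen:
                seen.add(child)
                order.append(child)
                queue.append(child)
    return order

# Only stg_orders changed, so stg_customers and dim_customers are skipped.
to_run = impacted_models(["stg_orders"], DOWNSTREAM)
print(to_run)  # ['stg_orders', 'int_orders', 'fct_revenue']
```

Everything outside the returned list can be left alone, which is where the saved warehouse cycles would come from.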

1 Comment

Mikkel Dengsøe 11mo

Great being there!


More Relevant Posts

Hugo Lu
7mo
Fivetran x dbt: Masterplan unfolding. Here's what you need to know. (Article in comments.)

Like many of you, my feed’s been buzzing with dbt/Fivetran merger chatter: speculation, hot takes, chaos. But beneath the noise lies something much bigger than simple consolidation. Here are the themes.

🔥 1. Fusion is here, aiming to be the new (open?) standard for data transformation.
dbt Labs is betting the house on Fusion. There is a race to create as many Fusion-conformant projects as possible. The idea is to create THE open standard for transformation, alongside THE open standard for the semantic layer. Some nice features:
✅ State-aware orchestration → 50%+ cost savings (dbt slashed their own Snowflake bills from $850K → $250K)
⚡ Rust-powered performance → lightning-fast parsing and instant impact analysis, probably the demo highlight at Coalesce
💡 Open engine access for partners → an aggressive play to dominate the transformation market. Want us to use more engines, like the "Multi engine stack" we've enabled with Orchestra

🧩 2. Open Data Infrastructure & Interoperability
The industry is waking up to modularity, but modularity comes at a price: stitching tools together. I've always known this, which is why Orchestra fits this piece perfectly vs. forcing data people to learn legacy frameworks like Airflow and spin up loads of unnecessary tooling (catalogs, lineage, quality and so on) that stops data people doing real work. This is not going to be a vendor lock-in situation where you HAVE to use (dbt and Fivetran) OR not at all (they already have an 80%+ customer overlap).

3. Iceberg and cost: this is driving a lot of interest.
Iceberg is small but growing exponentially. It typically shows up where people are not using Snowflake or Databricks but AWS Glue or Athena. Currently 1,000 out of 90,000 prod dbt projects are Iceberg. Data lakes are Fivetran's fastest-growing segment.

4. The Bigger Picture: This isn’t just about Fusion. It’s about reshaping the modern data stack. This is fundamentally a play to:
🧊 Attack Snowflake and Databricks: win market share by owning ELT orchestration
Own the standard for data transformation and the semantic layer. Create a critical mass around (not fully OS) dbt Fusion
Develop orchestration and metadata into the platform to help go after non-analyst personas

💥 The Stakes: This is VC playmaking at its finest. Grow fast. Create dependency. Force an acquisition. If they hit ~$800M ARR and keep pace, Snowflake or Databricks will buy. Otherwise another 2 years, cost-cutting and IPO.

Article in comments below.

#dbt #Fivetran #DataEngineering #DataTransformation #Snowflake #Databricks #DataStack #Orchestration #Fusion #getorchestra
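
A quick sanity check on the numbers quoted above, as a small Python sketch. It only uses the two figures from the post ($850K and $250K); the post does not say which period the bills cover, so none is assumed here.

```python
# Figures as quoted in the post; the billing period is not stated there.
before_usd = 850_000
after_usd = 250_000

savings_usd = before_usd - after_usd
savings_pct = savings_usd / before_usd * 100

print(f"saved ${savings_usd:,} ({savings_pct:.1f}%)")  # saved $600,000 (70.6%)
```

So the quoted drop is roughly a 71% reduction, which is consistent with the "50%+" headline claim.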

1 Comment
Niharika Ch
7mo
Data Quality and Governance: The Hidden Backbone of Every Data Platform

We often talk about performance, scalability, and dashboards, but the real strength of any data platform lies in something less visible: data quality and governance. Over time, I’ve learned that these aren’t afterthoughts; they’re what make every pipeline trustworthy. Here are a few lessons that stuck with me 👇

✅ Validate early, validate often
Schema checks, null validations, and business rule tests should live inside your ingestion logic, not after it.

⚙️ Go metadata-driven
Let configuration tables define your pipelines and rules. It’s the only way to scale without rewriting code.

🤖 Automate quality checks
Build Airflow/ADF steps that halt or alert when validation fails. Don’t rely on humans to catch every anomaly.

🔐 Governance = enablement, not restriction
Role-based access, encryption, and data lineage build trust and compliance without blocking access.

📊 Measure it
Track data freshness, failure rates, and incident MTTR to make quality tangible and actionable.

When you embed quality and governance into every layer (ingestion, transformation, and delivery), your data becomes something better than fast: it becomes reliable.

What are some ways you’ve embedded data quality or governance into your pipelines? Would love to hear your approach 👇

Medium Link: https://lnkd.in/gKsTZix8

#DataEngineering #DataGovernance #DataQuality #ETL #Azure #Databricks #CloudArchitecture #BigData #DataTrust

Data Quality and Governance: The Hidden Backbone of Every Scalable Data Platform medium.com
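
To make "validate early" and "go metadata-driven" concrete, here is a minimal sketch in plain Python. Everything in it is an assumption for illustration: the `RULES` list stands in for a configuration table, and the column names and checks are hypothetical, not taken from the post or the linked article.

```python
# Hypothetical rule config (assumption); in practice this might live in a configuration table.
RULES = [
    {"column": "order_id", "check": "not_null"},
    {"column": "amount", "check": "type", "expected": float},
    {"column": "amount", "check": "min", "expected": 0.0},
]

def validate(row, rules):
    """Return a list of human-readable failures for one ingested row."""
    failures = []
    for rule in rules:
        col, check = rule["column"], rule["check"]
        value = row.get(col)
        if check == "not_null" and value is None:
            failures.append(f"{col}: is null")
        elif check == "type" and not isinstance(value, rule["expected"]):
            failures.append(f"{col}: expected {rule['expected'].__name__}, got {type(value).__name__}")
        elif check == "min" and isinstance(value, (int, float)) and value < rule["expected"]:
            failures.append(f"{col}: {value} is below minimum {rule['expected']}")
    return failures

def ingest(rows, rules):
    """Split rows into accepted rows and (row, failures) pairs at ingestion time."""
    accepted, rejected = [], []
    for row in rows:
        failures = validate(row, rules)
        if failures:
            rejected.append((row, failures))
        else:
            accepted.append(row)
    return accepted, rejected

rows = [
    {"order_id": 1, "amount": 19.99},
    {"order_id": None, "amount": 5.0},
    {"order_id": 3, "amount": "12"},
    {"order_id": 4, "amount": -2.5},
]
accepted, rejected = ingest(rows, RULES)
print(len(accepted), len(rejected))  # 1 3
```

Adding a new check then means adding a row to the config rather than touching the pipeline code, and the size of `rejected` per run is one simple failure-rate metric to track.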
Dr. Richard E. Ewelle
7mo
Batch jobs still dominate. But speed is killing them, one event at a time.

Let’s set the record straight: Scheduled, batch-oriented processing has been the backbone of data systems for decades.
→ It’s simple.
→ It’s reliable.
→ It gets the job done.

A well-written Airflow DAG or cron job running nightly can move terabytes of data without breaking a sweat. And in many cases, it’s still the right choice.

But here’s the reality no one wants to admit: Today’s users don’t want to wait. They want dashboards to reflect changes instantly. They want machine learning models to react to new behavior. They want operational systems to make decisions based on "now", not last night’s snapshot. That’s where event-driven and incremental processing shine.

So how do they really compare?

1. Scheduled & Batch-Oriented Processing
Pros:
→ Simpler to build and maintain
→ Cost-effective during off-peak hours
→ Excellent for processing large volumes at once
Use Cases:
↳ Daily business reporting
↳ Periodic data warehouse ETL
↳ Historical data reprocessing

2. Event-Driven & Incremental Processing
Pros:
→ Lower latency for real-time use cases
→ Only process what changed = more efficient
→ More flexible, decoupled system architecture
Use Cases:
↳ Real-time fraud detection
↳ Syncing user profile updates
↳ Streaming ingestion into data lakes or ML models

So why isn’t everyone doing event-driven? Because it’s not free. Event-driven systems bring complexity:
→ More moving parts
→ Distributed state
→ Versioning challenges
→ Harder to debug

Here’s what works well:
→ Change Data Capture (CDC) to detect changes
→ Pub/Sub systems for async communication
→ Kafka for buffering and replay

Here’s where teams go wrong:
→ Triggering full loads on small updates
→ Ignoring event versioning and schema drift
→ Lack of observability and monitoring

Final thoughts: You don’t need to be “real-time” to be right. You need to be *just-in-time* for your use case. That’s the difference between overengineering and true architecture.

Are you already working with event-driven systems? What’s been your hardest lesson so far? I’d love to hear.

---------------
PS: I share real-world data engineering tips every week. If this post helped you see more clearly, follow me so you don’t miss the next one.
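
The "only process what changed" idea can be sketched with a simple high-water mark. Note the swap: this is a polling watermark on an `updated_at` column, a much simpler stand-in for log-based CDC, and the `SOURCE` rows and field names are hypothetical.

```python
from datetime import datetime

# Hypothetical source rows (assumption), each carrying an updated_at timestamp.
SOURCE = [
    {"id": 1, "updated_at": datetime(2025, 1, 1, 8, 0)},
    {"id": 2, "updated_at": datetime(2025, 1, 1, 9, 30)},
    {"id": 3, "updated_at": datetime(2025, 1, 2, 7, 15)},
]

def incremental_batch(rows, watermark):
    """Return rows changed after the watermark, plus the advanced watermark."""
    changed = [r for r in rows if r["updated_at"] > watermark]
    new_watermark = max((r["updated_at"] for r in changed), default=watermark)
    return changed, new_watermark

# First run picks up only rows newer than the stored watermark.
watermark = datetime(2025, 1, 1, 9, 0)
changed, watermark = incremental_batch(SOURCE, watermark)
print([r["id"] for r in changed])  # [2, 3]

# A second run with no new changes processes nothing instead of a full load.
changed_again, watermark = incremental_batch(SOURCE, watermark)
print(changed_again)  # []
```

The second call is the point: a small update (or none) leads to a small amount of work, which is the opposite of triggering a full load on every change.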
Sai Prateek Muddasani
8mo
🚀 Medallion Architecture in Databricks – The Backbone of Modern Data Lakehouses

The Medallion Architecture (Bronze → Silver → Gold) is a powerful design pattern that organizes data pipelines for quality, scalability, and analytics-readiness. It’s a best practice at the heart of Databricks Lakehouse implementations.

🥉 Bronze Layer – Raw & Immutable
Purpose: Ingest data from diverse sources (streaming events, APIs, relational DBs, IoT).
Traits: Stores raw, unvalidated data in its native format (JSON, CSV, Parquet, AVRO).
Benefits: Full data fidelity and auditability for future reprocessing.
Example: Sensor data from connected vehicles lands as-is in Delta Lake tables.

🥈 Silver Layer – Clean & Conformed
Purpose: Apply data quality and transformation rules.
Actions: Deduplication, schema enforcement, type casting, reference-data joins.
Outcome: A single source of truth ready for business logic or ML feature engineering.
Example: Cleansed sales transactions enriched with product and customer dimensions.

🥇 Gold Layer – Curated & Business-Ready
Purpose: Serve analytics, dashboards, and ML models with business metrics (KPIs).
Traits: Aggregated, denormalized tables optimized for fast BI/SQL queries.
Outcome: Consistent, trusted datasets for decision-makers.
Example: Daily revenue summaries feeding Power BI dashboards and forecasting models.

⚙️ Why Databricks Fits Perfectly
Databricks provides native building blocks to implement this layered design:
Delta Lake: ACID transactions, time travel, schema evolution.
Delta Live Tables (DLT): Declarative pipeline creation, data-quality expectations, auto-scaling.
Unity Catalog: Central governance, lineage, and fine-grained RBAC across all layers.
Workflows: Schedule or trigger complex multi-layer jobs with full observability.

💡 Business Impact
Data Reliability: Catch errors early in Bronze/Silver before analytics.
Agility: Evolve schemas and transformations independently.
Scalability: Handle streaming or batch loads across cloud providers.
Compliance: Audit trails and lineage support HIPAA, GDPR, SOC2, and other frameworks.

🔑 Pro Tips
Keep layers separate and immutable for reproducibility.
Use partitioning, Z-Ordering, and caching to optimize query performance.
Automate promotion from Bronze to Gold using DLT expectations and Workflows.

Takeaway: Medallion Architecture isn’t just a buzzword—it’s a blueprint for reliable, governed, and analytics-ready data. In Databricks, it transforms messy raw feeds into business gold at cloud scale.

#Databricks #MedallionArchitecture #DataEngineering #DeltaLake #DLT #DataQuality #Lakehouse #BigData #CloudData #DataGovernance #AnalyticsEngineering
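
The Bronze → Silver → Gold flow can be illustrated without any Databricks dependency. The sketch below uses plain Python lists and dicts in place of Delta tables; the sample records and the helper names `to_silver` and `to_gold` are hypothetical, and a real pipeline would more likely quarantine bad rows than drop them.

```python
from collections import defaultdict

# Bronze: raw records kept as-is (hypothetical sales events, everything still a string).
bronze = [
    {"order_id": "1", "day": "2025-01-01", "amount": "10.50"},
    {"order_id": "1", "day": "2025-01-01", "amount": "10.50"},  # duplicate delivery
    {"order_id": "2", "day": "2025-01-01", "amount": "4.50"},
    {"order_id": "3", "day": "2025-01-02", "amount": "oops"},   # fails type casting
    {"order_id": "4", "day": "2025-01-02", "amount": "20.00"},
]

def to_silver(rows):
    """Deduplicate on order_id and cast types; rows that fail casting are skipped."""
    seen, silver = set(), []
    for row in rows:
        if row["order_id"] in seen:
            continue
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # simplification: a real pipeline might route this to a quarantine table
        seen.add(row["order_id"])
        silver.append({"order_id": int(row["order_id"]), "day": row["day"], "amount": amount})
    return silver

def to_gold(rows):
    """Aggregate cleaned rows into a daily revenue summary."""
    revenue = defaultdict(float)
    for row in rows:
        revenue[row["day"]] += row["amount"]
    return dict(revenue)

silver = to_silver(bronze)
gold = to_gold(silver)
print(gold)  # {'2025-01-01': 15.0, '2025-01-02': 20.0}
```

Note that `bronze` is never modified: each layer is derived from the one before it, so the raw records stay available for reprocessing if the Silver rules change.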
2 Comments
Michał Lubasiński
7mo
Open Source + AI Redefining Data Platforms 🏗️

2025 marks a tipping point for enterprise data platforms, driven by AI and open source.

Enterprises are rebuilding data stacks by combining open-source frameworks like Apache Iceberg, Delta Lake, and dbt with AI-driven analytics powered by Databricks, Snowflake Cortex, and Google #VertexAI. The goal:
• Real-time insights for decision-making
• Lower TCO through flexible cloud-native architectures
• Faster experimentation with generative AI models directly on enterprise data lakes

According to Forbes, companies adopting these hybrid solutions report up to 35% lower infrastructure costs and 40% faster time-to-insight versus legacy proprietary stacks.

We’re also seeing partnerships between hyperscalers and open-source ecosystems (e.g., AWS supporting Iceberg, Snowflake embracing Python/AI models, and Databricks integrating Mosaic AI) to make platforms more interoperable and innovation-ready.

👉 Is your data platform AI-ready, or still bound by legacy tech debt?

Did you know that by 2026, over 65% of enterprise data workloads are projected to run on hybrid #opensource + #AI platforms, up from just 28% in 2023?

📎 Read more at Forbes: https://lnkd.in/d9SiZCuv

#AIinDataPlatforms, #OpenSourceInnovation, #EnterpriseAnalytics, #CloudNative, #DataGovernance, #HybridCloud

AI And Open Source Redefine Enterprise Data Platforms In 2025 social-www.forbes.com
Obinna Charles Ordi
7mo
Beyond Pipelines: The Next Era of Data Engineering

Not long ago, the most popular definition of what we do as data engineers was essentially designing, building, and maintaining data pipelines: making data accessible and usable for analytics, reporting, machine learning, and more. But in today’s fast-changing data ecosystem, that definition feels a bit narrow. With automation, AI, and new architectures reshaping how we work with data, it’s worth asking: Does the traditional definition of data engineering still hold true?

Today’s data engineers are no longer just pipeline builders; they’re system thinkers, automation architects, and AI enablers. The work now extends far beyond moving data from one point to another. It’s about designing intelligent, scalable systems that ensure data is not only available but also trusted, discoverable, and ready for real-time use.

As cloud platforms mature and tools like Airflow, dbt, and modern data stacks evolve, we’re seeing a shift from manual orchestration to event-driven, automated, and self-healing data workflows. The focus is moving from “how do we move data?” to “how do we make data work for us intelligently?”

The boundaries between data engineering, AI, and governance are blurring. The rise of generative AI and machine learning has made data quality, lineage, and context more critical than ever. Data is no longer just a byproduct of operations; it’s becoming a product in itself.

Forward-looking teams are embracing concepts like data mesh and data fabric, where ownership is decentralized and each domain treats its datasets as reliable, well-documented products. This shift demands not just technical skill, but a mindset change: data engineers must think like product managers, ensuring usability, reliability, and business value at every stage of the data lifecycle.

For the “old guard” of data engineering, this shift isn’t a threat; it’s an opportunity. The foundational skills that built today’s data systems remain invaluable; they just need to evolve. By embracing automation, cloud-native architectures, and AI-driven workflows, classical data engineers can position themselves as the bridge between traditional infrastructure and the intelligent data platforms of the future.

The next wave of opportunity lies in real-time data streaming, data observability, AI/ML infrastructure, and data governance automation. The future belongs to those who combine engineering discipline with strategic thinking: professionals who understand not only how to build data systems, but how to make them learn, adapt, and create value on their own.

What do you think would be a better, broader title that truly reflects everything Data Engineers do today?
Stefan Hege
7mo
This is a great read if you’re into how real-time data and AI are changing the game. Qlik Open Lakehouse, powered by AWS and Apache Iceberg, makes the Lakehouse model faster, smarter, and more open than ever. Love seeing how #Qlik is helping shape the future of data architecture.

Building Your Next-Gen Lakehouse with Qlik, AWS, and Apache Iceberg qlik.com