Do you want to build an advanced observability system for your Databricks platform? Check out this blog post for a complete guide on how to do it 👇 Key Highlights: • 📊 Queryable telemetry from jobs, pipelines, and clusters in system tables • 🔍 Identify jobs that produce unused data to reduce cost • ⏱ Detect missing timeouts and long runtimes before SLA breaches • 🛡 Spot legacy runtime versions for security and performance • 👥 Pull job owners instantly for faster remediation I've been using the new System Tables for Lakeflow jobs to turn platform telemetry into a single source of truth. Because the tables are read‑only and live in the system catalog, I can write ordinary SQL to pull job configs, task timelines, cost attribution and lineage. With a few joins I can surface jobs that write tables nobody reads, flag runs that exceed their expected duration, and list tasks still on deprecated runtimes. The results feed directly into the Lakeflow dashboard template, giving the whole engineering team a shared view of reliability, cost and hygiene. Since everything is queryable, audits and alerts become repeatable queries rather than ad‑hoc scripts. In practice this has reduced the time spent chasing 3 AM alerts and made ownership clear for every pipeline. Which observability signals have helped you catch issues before they impact SLAs? Tutorial - https://lnkd.in/dQwYciy7 #DatabricksAdministration #DataObservability #Lakeflow
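To make the "missing timeouts and long runtimes" checks concrete, here is a minimal sketch of the audit logic in plain Python. It assumes the rows have already been fetched from the system tables; the field names (`timeout_seconds`, `owner`, `start`, `end`), the sample jobs, and the one-hour threshold are all hypothetical, not the real system table schema.

```python
from datetime import datetime, timedelta

# Hypothetical rows, shaped as if already fetched from the system tables.
jobs = [
    {"job_id": 1, "name": "daily_sales", "owner": "ana@example.com", "timeout_seconds": 3600},
    {"job_id": 2, "name": "clickstream", "owner": "raj@example.com", "timeout_seconds": None},
]
runs = [
    {"job_id": 1, "start": datetime(2025, 1, 1, 2, 0), "end": datetime(2025, 1, 1, 2, 20)},
    {"job_id": 2, "start": datetime(2025, 1, 1, 3, 0), "end": datetime(2025, 1, 1, 5, 30)},
]


def audit(jobs, runs, max_runtime=timedelta(hours=1)):
    """Return one (job name, owner, issue) tuple per hygiene finding."""
    by_id = {job["job_id"]: job for job in jobs}
    findings = []
    # Check 1: jobs configured without any timeout.
    for job in jobs:
        if job["timeout_seconds"] is None:
            findings.append((job["name"], job["owner"], "missing timeout"))
    # Check 2: runs that took longer than the assumed threshold.
    for run in runs:
        if run["end"] - run["start"] > max_runtime:
            job = by_id[run["job_id"]]
            findings.append((job["name"], job["owner"], "long runtime"))
    return findings


findings = audit(jobs, runs)
for name, owner, issue in findings:
    print(f"{name}: {issue} (owner: {owner})")
```

In a real workspace the same two checks would more likely be written as SQL joins against the system catalog, with the owner column pulled in the same query so each finding already names who should fix it.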
Bartosz Gajda’s Post
More Relevant Posts
-
This post powerfully reframes “good” Databricks pipelines as those that can fail safely at 2 AM and recover automatically, not just run fast when everything is perfect. Manjot, you've done an excellent job here highlighting concrete resilience practices that many teams overlook while chasing performance tweaks. I especially like how you connect these to Databricks-native capabilities and still emphasize that true robustness must be engineered, not assumed from the platform.
Senior Data Engineer | Building Lakehouse & ETL/ELT Pipelines at Scale | AWS · Azure · Databricks · PySpark · Delta Lake · Unity Catalog · Python · SQL
🚀 “Your Databricks Pipeline is only as good as its failure strategy.” We talk a lot about performance in Databricks — faster clusters, Photon, partitioning, Z-ORDER, caching… But we don’t talk enough about what actually makes a pipeline production-grade: How it behaves at 2 AM when something breaks. The real question is: 👉 What happens next — automatically? In production Databricks data engineering, I now prioritize: ✅ Idempotent writes (no duplicates on rerun) ✅ Atomic loads with Delta Lake (transactional commits, safe MERGE patterns) ✅ Checkpointing & restartability (Structured Streaming / Auto Loader checkpoints) ✅ Control & audit tables (run_id, batch_id, source watermark, row counts, status) ✅ Retry-safe orchestration (Databricks Workflows task retries + clear dependencies) ✅ Partial failure handling (task-level isolation, “fail fast” vs “continue” by design) ✅ Alerting + observability (job notifications, SQL alerts, logs, metrics dashboards) Databricks gives us powerful building blocks: • Workflows (Jobs) + multi-task pipelines • Auto Loader + schema evolution options • Delta Lake: MERGE, OPTIMIZE, VACUUM, constraints • Time Travel + Change Data Feed (CDF) for recovery and reconciliation • Unity Catalog lineage + audit/system tables for traceability But resilience doesn’t happen automatically. It’s engineered. If a pipeline fails and you need manual cleanup before rerun… That’s not “bad luck” — that’s technical debt. 💬 How do you design for failure in your Databricks pipelines — what’s your go-to pattern (MERGE idempotency, audit tables, streaming checkpoints, DLT, etc.)? #DataEngineer #Data #Databricks #automation #DataEngineering
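As a rough illustration of the first and fourth items on that checklist (idempotent writes plus a control/audit table), here is a plain-Python sketch. It only imitates MERGE-style upsert semantics with a dictionary; the names `merge_batch`, `run_load`, `order_id`, and the audit fields are hypothetical and stand in for whatever a real Delta Lake MERGE and audit table would use.

```python
def merge_batch(target, batch, key="order_id"):
    """Upsert rows by key, so replaying the same batch leaves the target unchanged."""
    for row in batch:
        target[row[key]] = row  # matched key: update; unmatched key: insert
    return target


def run_load(target, audit_log, run_id, batch):
    """Load a batch once per run_id and record the outcome in the audit log."""
    already_done = any(
        entry["run_id"] == run_id and entry["status"] == "SUCCEEDED"
        for entry in audit_log
    )
    if already_done:
        return "SKIPPED"  # a retry of a finished run does nothing
    merge_batch(target, batch)
    audit_log.append({"run_id": run_id, "rows": len(batch), "status": "SUCCEEDED"})
    return "SUCCEEDED"


target, audit_log = {}, []
batch = [{"order_id": 1, "amount": 10}, {"order_id": 2, "amount": 25}]

first = run_load(target, audit_log, "run-001", batch)
retry = run_load(target, audit_log, "run-001", batch)  # simulated orchestrator retry
print(first, retry, len(target))
```

The two guards are deliberately redundant: even if the audit check were bypassed, the keyed upsert alone would still prevent duplicates on rerun.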
-
Do We Still Need Azure Data Factory in a Databricks Platform? For a long time, the architecture was simple: 🟢 Azure Data Factory = ingestion & orchestration 🟢 Databricks = transformation & analytics But in practice, this often meant: 🟢 Logic split across multiple tools 🟢 Increased operational overhead 🟢 Complex debugging and monitoring 🟢 Fragmented security and access control Today, the shift is clear. We’re moving toward a Databricks-centric data platform: 🟢 Ingestion via Auto Loader / Lakeflow 🟢 Orchestration via Databricks Workflows 🟢 Processing via Spark + Delta Lake 🟢 Governance via Unity Catalog 🟢 Security that is centralised, consistent, and easier to manage 🟢 Everything lives in one place. Why does security become simpler? When data stays within Databricks: • Centralised access control (single governance layer) • Unified data lineage & auditing • Less data movement and lower risk exposure • Fewer tools and smaller attack surface Security is not an add-on; it’s built into the platform. So, do we still need ADF? Not always. If your pipelines are fully within Databricks: 🟢 ADF becomes optional 🟢 Architecture becomes simpler, more secure, and easier to scale 🟢 We’re designing simpler, smarter, and more secure data platforms. #Databricks #AzureDataFactory #DataEngineering #Lakehouse #DataArchitecture #DataPlatform #DataSecurity #UnityCatalog #ModernDataStack #BigData #CloudData #Analytics Worth reading, link below: https://lnkd.in/gKiTzJwV
-
🚨 Databricks is quietly redefining Data Engineering with Lakeflow… For years, we’ve been doing this 👇 ✔ Writing custom PySpark ingestion jobs ✔ Managing CDC logic manually ✔ Using multiple tools like ADF, Airflow, Fivetran ✔ Handling failures, retries, and monitoring ourselves It worked… but it was complex, slow, and high maintenance. Now comes Lakeflow Connect (Databricks) - and it changes the game. 💡 What’s different? 👉 BEFORE Lakeflow Connect: • Heavy coding for ingestion • Custom incremental logic (CDC headaches) • Multiple tools + fragmented architecture • High operational overhead 👉 AFTER Lakeflow Connect: • Plug-and-play connectors (Databases, SaaS, Files) • Automatic incremental ingestion • Serverless + fully managed pipelines • Built-in monitoring & governance (Unity Catalog) 🔥 The real shift: We are moving from 👉 “Engineering pipelines” to 👉 “Configuring data products” This is not just a feature - it’s a paradigm shift in Data Engineering ⚠️ Reality check: Lakeflow Connect is still evolving, and many teams still use DLT, Auto Loader, ADF - but this is clearly the direction Databricks is heading. 📌 My takeaway: The future Data Engineer will spend less time writing ingestion code… and more time delivering value from data. What do you think? Are you still building ingestion pipelines manually or exploring Lakeflow? #Databricks #Lakeflow #DataEngineering #BigData #DataPlatform #Analytics #LakeflowConnect
-
⚠️ Our Databricks pipeline was running… but the data stopped moving. No job failures. No alerts. Files were still landing in storage. But downstream tables were not updating. This is one of the most dangerous situations in a streaming pipeline — when everything looks healthy but the pipeline silently falls behind. Last week while working with Databricks Auto Loader, I ran into this exact issue. What was happening? The Auto Loader stream was active, but micro-batches were not triggering for newly arriving files. After digging into the streaming query progress and checkpoint metadata, I found the real problems: 🔹 Stale checkpoint metadata The checkpoint location contained previous run state, causing the stream to skip newly arriving files. 🔹 Schema inference overhead Auto Loader kept re-checking schema changes which slowed down batch processing. 🔹 Trigger configuration issue The micro-batch trigger interval was not optimized for the file arrival rate. How I fixed it Instead of restarting everything blindly, I did three things: ✔ Cleaned and reset the checkpoint metadata carefully ✔ Added schema hints to control schema evolution ✔ Tuned the stream trigger interval for better batch execution After the changes, the stream started processing files consistently and downstream tables caught up. Lesson learned In streaming systems, a running pipeline doesn’t mean a healthy pipeline. If you're using Auto Loader, always monitor: • Streaming query progress • Batch duration • Checkpoint health • File ingestion rate Ignoring these can lead to silent data delays in production pipelines. I help teams design reliable, scalable Azure & Fabric data platforms by focusing on architecture, ownership, and long-term thinking. #AzureDatabricks #AutoLoader #StreamingData #DeltaLake #DataEngineering #AzureDataFactory #DataPlatform
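One way to catch the "running but not moving" case is to compare the newest file arrival against the last micro-batch that actually read rows. The sketch below does this in plain Python over progress events shaped like the JSON that Structured Streaming reports (`timestamp`, `batchId`, `numInputRows`). The sample events, the `newest_file_time` input, and the 15-minute lag threshold are assumptions for illustration.

```python
from datetime import datetime, timedelta


def last_ingest_time(progress_events):
    """Timestamp of the most recent micro-batch that actually read rows."""
    times = [
        datetime.strptime(p["timestamp"], "%Y-%m-%dT%H:%M:%S.%fZ")
        for p in progress_events
        if p["numInputRows"] > 0
    ]
    return max(times) if times else None


def is_silently_stalled(progress_events, newest_file_time, max_lag=timedelta(minutes=15)):
    """True when files keep landing but no batch has ingested rows recently."""
    last = last_ingest_time(progress_events)
    if last is None:
        return True  # files exist, yet the stream has never ingested anything
    return newest_file_time - last > max_lag


# Hypothetical progress history: the stream is "active" but reads nothing after 10:00.
events = [
    {"batchId": 41, "timestamp": "2025-01-01T10:00:00.000Z", "numInputRows": 500},
    {"batchId": 42, "timestamp": "2025-01-01T10:05:00.000Z", "numInputRows": 0},
    {"batchId": 43, "timestamp": "2025-01-01T10:30:00.000Z", "numInputRows": 0},
]

stalled = is_silently_stalled(events, newest_file_time=datetime(2025, 1, 1, 10, 40))
print("stalled" if stalled else "healthy")
```

A check like this could run on a schedule and raise an alert, which covers the gap left by job-failure notifications when the job never actually fails.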
-
#DailyDataDose ☕💁‍♀️ Day 68 Advanced Optimization in Databricks → Think Architecture, Not Tricks Optimizing Databricks is not about random tuning. It’s about mastering the full stack. At the core: Performance = Data Layout + Execution Engine + Compute + Memory + Workload Strategy Here’s the clean framework 👇 1. Data Foundation Design before scaling. • Delta optimization • Smart partitioning • File size control • Small files mitigation 2. Spark Execution Engine Optimize how queries run. • Join strategy • Shuffle tuning • Skew handling 3. Cluster & Compute Right sizing is everything. • Autoscaling • Instance selection • Photon • Job vs All-purpose clusters 4. Memory & Caching Cache with intention. • Delta cache • Persist levels • Memory tuning 5. Workload Strategy Architecture drives cost control. • Batch vs Streaming • BI vs ML isolation • SLA-driven design • Cost governance 💡 True optimization is not a command. It’s a system-level mindset. Performance and cost are two sides of the same architecture. #DailyDataDose #Databricks #DeltaLake #ApacheSpark #DataEngineering #DataArchitecture #BigData #DataPlatform #CloudEngineering #Lakehouse #PerformanceOptimization #DataGovernance #FinOps #DataStrategy #AnalyticsEngineering #TechLeadership
-
Recently, while working on a data pipeline project using Azure Data Factory, Event Hub, Databricks, and Snowflake, I ran into a challenge that many data engineers face. The requirement was to process near real-time event data coming from multiple sources. The initial issue: The incoming data stream from Event Hub was highly variable. Some batches contained incomplete records, and others created duplicate entries when pipelines retried after failures. This caused two major problems: • Data inconsistencies in Snowflake • Increased processing time in Databricks jobs To solve this, I implemented a few improvements: 1️⃣ Used Azure Data Factory orchestration to manage pipeline retries and dependencies more reliably 2️⃣ Added data validation and deduplication logic in Databricks using PySpark 3️⃣ Used Azure Functions to handle lightweight event preprocessing before data entered the main pipeline 4️⃣ Optimized Snowflake loading using staged files and controlled batch loads The result: ✔ Cleaner datasets ✔ Faster processing times ✔ A much more reliable end-to-end pipeline This experience reinforced something important: Data engineering isn’t just about moving data; it’s about designing systems that handle real-world imperfections in data streams. Always learning and improving with every project. #Azure #AzureDataFactory #Databricks #Snowflake #EventHub #AzureFunctions #DataEngineering #BigData #CloudEngineering #DataP
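The validation and deduplication step (item 2 above) can be sketched in plain Python as follows. This is not the PySpark code from the project; the required fields and the `event_id` key are hypothetical, and the logic is only meant to show the shape of the check.

```python
# Hypothetical required fields; a real pipeline would take these from its schema.
REQUIRED_FIELDS = ("event_id", "device_id", "ts")


def validate_and_dedupe(events):
    """Split events into clean rows (complete, first occurrence) and rejects."""
    seen_ids = set()
    clean, rejected = [], []
    for event in events:
        # Incomplete records go to a reject list instead of being silently dropped.
        if any(event.get(field) is None for field in REQUIRED_FIELDS):
            rejected.append(event)
            continue
        # Duplicates created by pipeline retries are skipped.
        if event["event_id"] in seen_ids:
            continue
        seen_ids.add(event["event_id"])
        clean.append(event)
    return clean, rejected


batch = [
    {"event_id": "a1", "device_id": "d1", "ts": 1},
    {"event_id": "a1", "device_id": "d1", "ts": 1},   # duplicate from a retry
    {"event_id": "a2", "device_id": None, "ts": 2},   # incomplete record
    {"event_id": "a3", "device_id": "d2", "ts": 3},
]

clean, rejected = validate_and_dedupe(batch)
print(len(clean), "clean,", len(rejected), "rejected")
```

In PySpark the same idea would roughly map to a null filter followed by `dropDuplicates` on the event key, with rejected rows written to a quarantine location for inspection.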
-
The hype around new data capabilities often ignores the reality of managing costs at scale. I saw this post from Anurag Sharma highlighting a common pain point: the Databricks bill that starts looking like a phone number because engineers are spinning up clusters just to debug basic transformations. The solution is a return to local development fundamentals. By using a local Spark setup in IDEs like IntelliJ or PyCharm combined with a strict Reader-Transformer-Writer template, his team moved 75% of development work off the cloud. What caught my eye in the architecture: The Reader forces early filtering. If you are not pushing predicates down or filtering at the source, you are just wasting memory and IO regardless of your cluster size. The Transformer uses the Spark transform function for modularity. This makes unit testing business logic significantly easier than dealing with massive, monolithic notebooks. The Writer then enforces columnar formats like Parquet or Delta to ensure downstream performance. From an architectural standpoint, this shortens the feedback loop. Waiting five minutes for a cluster to spin up just to find a syntax error is a productivity killer. Moving the heavy lifting to local environments and only using the cloud for final integration testing via the Databricks CLI is a win for both cost and developer experience. If you are running large Spark workloads, how are you balancing local development vs. cloud-native features? Do you find local Spark environments drift too much from production cluster configurations? Source here: https://lnkd.in/eAM2J66m #DataEngineering #ApacheSpark #Databricks #CloudArchitecture #CostOptimization
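The Reader-Transformer-Writer split can be sketched without Spark at all, which is also the point of the template: the business logic becomes small functions that are easy to unit test. The version below uses plain Python lists of dicts in place of DataFrames; the function names, sample rows, and the in-memory sink are hypothetical, and `transform` here only mimics the chaining style of Spark's `DataFrame.transform`.

```python
from functools import reduce


def read(rows, predicate):
    """Reader: filter as early as possible, before any other work is done."""
    return [row for row in rows if predicate(row)]


def with_total(rows):
    """Transformer step: add a derived column."""
    return [{**row, "total": row["qty"] * row["price"]} for row in rows]


def only_large(rows):
    """Transformer step: keep orders worth at least 100."""
    return [row for row in rows if row["total"] >= 100]


def transform(rows, *steps):
    """Apply small, individually testable steps in order."""
    return reduce(lambda acc, step: step(acc), steps, rows)


def write(rows, sink):
    """Writer: single place that decides the output target and format."""
    sink.extend(rows)
    return len(rows)


source = [
    {"region": "EU", "qty": 5, "price": 30},
    {"region": "EU", "qty": 1, "price": 20},
    {"region": "US", "qty": 9, "price": 50},
]
sink = []

eu_rows = read(source, lambda row: row["region"] == "EU")
written = write(transform(eu_rows, with_total, only_large), sink)
print(written, sink)
```

Because each step takes rows and returns rows, a unit test can call `with_total` or `only_large` directly on a three-row fixture, with no cluster and no notebook involved.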