Spark Processing 5TB Files: A Real-World Breakdown

3mo Edited

Ever wondered how Spark actually processes massive files like 5TB of data? 🤔 Let me break it down with a real-world scenario I recently worked through.

Picture this: You've got a 5TB file sitting in Databricks. Here's exactly what happens behind the scenes:

📂 The Breakdown:
Your 5 TB file gets split into 128MB chunks (default partition size)
That's roughly 40,000 partitions to process
Each partition = one task for Spark to handle

🖥️ The Cluster Setup:
Let's say you spin up a cluster with 10 nodes
Each node has 8 cores = 80 cores total
Spark can now process 80 partitions in parallel at once
Time to process all 40,000 partitions? 40,000 ÷ 80 = 500 waves of execution

⚡ File Type Matters:
Parquet or Delta? You're golden (columnar storage, compression, predicate pushdown)
CSV or JSON? Expect slower reads and more memory pressure
Partitioned by date/region? Even better: Spark skips irrelevant data chunks

🔧 Join Optimization Tips:
1️⃣ Broadcast small tables (< 10MB) to avoid shuffles
2️⃣ Use bucketing for repeated joins on the same keys
3️⃣ Partition both datasets on the join key before joining
4️⃣ Increase shuffle partitions for large joins (spark.sql.shuffle.partitions)

💡 The Reality Check:
More cores = faster processing, but only up to a point
40,000 partitions on 80 cores is manageable
Too few cores? You're waiting forever
Too many tiny partitions? Overhead kills performance

RAM plays a critical role in Spark performance, especially for shuffle-heavy workloads, joins, aggregations, and caching.

Each executor's memory is divided into:
Execution memory (for joins, shuffles, sorting)
Storage memory (for caching)

Simply increasing RAM does NOT always improve performance. For example:
❌ 1 executor → 32 cores + 64GB RAM
✅ 4 executors → 8 cores + 16GB RAM each

Balanced executors:
Improve parallelism
Reduce GC pressure
Improve fault tolerance
Utilize cluster resources better

For PySpark workloads or heavy shuffles, tuning spark.executor.memoryOverhead is also important to prevent out-of-memory errors.

Pro tip: Monitor your Spark UI to see if tasks are skewed or if you're hitting memory limits. That's where the real optimization begins.

Working with large-scale data? What's your biggest bottleneck: compute, memory, or shuffle operations?

#DataEngineering #Spark #Databricks #BigData #DistributedComputing #CloudComputing
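
The partition and wave arithmetic in the post can be checked with a few lines of plain Python. This is only a rough sketch: the helper name plan_waves is made up for illustration, it assumes binary units (1 TB = 1024^4 bytes), and it treats every partition as taking the same time, which real jobs rarely do.

```python
import math

TB = 1024 ** 4
MB = 1024 ** 2

def plan_waves(file_bytes, partition_bytes, nodes, cores_per_node):
    """Return (partitions, total_cores, waves) for a simple split-and-schedule model."""
    partitions = math.ceil(file_bytes / partition_bytes)  # one task per partition
    total_cores = nodes * cores_per_node                   # one running task per core
    waves = math.ceil(partitions / total_cores)            # rounds needed to run all tasks
    return partitions, total_cores, waves

partitions, cores, waves = plan_waves(5 * TB, 128 * MB, nodes=10, cores_per_node=8)
print(partitions, cores, waves)  # 40960 80 512
```

With binary units the exact figures come out as 40,960 partitions and 512 waves, which the post rounds to 40,000 and 500.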

8 Comments

Siddharth Panda 3mo

Enable AQE (Adaptive Query Execution), which can automatically take care of the following: 1. broadcasting 2. managing shuffle partitions 3. handling skew during join operations. Also, it's a good idea to have more workers with an appropriate number of cores each, since fewer workers with many cores can lead to heavy GC due to memory pressure. Repartition to ensure data is evenly distributed across executors, which avoids the network transfer and memory overhead caused by shuffling. If possible, run OPTIMIZE on your 5 TB dataset so you don't circle back to the small-file problem. Just thought I'd add these points as well 🙂.
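
For anyone who wants to try the AQE suggestion, here is a small sketch of the settings involved, written as a plain Python dict so it runs without a cluster. The four keys are standard Spark SQL settings; the helper to_submit_flags is a made-up name, and the 10MB threshold is just the value from the post, not a recommendation.

```python
# Spark SQL settings related to the three AQE points in the comment above.
aqe_conf = {
    "spark.sql.adaptive.enabled": "true",                     # turn AQE on
    "spark.sql.adaptive.coalescePartitions.enabled": "true",  # merge small shuffle partitions
    "spark.sql.adaptive.skewJoin.enabled": "true",            # split skewed join partitions
    "spark.sql.autoBroadcastJoinThreshold": str(10 * 1024 * 1024),  # 10MB, as in the post
}

def to_submit_flags(conf):
    """Render a conf dict as a list of --conf arguments (hypothetical helper)."""
    flags = []
    for key, value in conf.items():
        flags += ["--conf", f"{key}={value}"]
    return flags

print(" ".join(to_submit_flags(aqe_conf)))
```

Inside a running PySpark session the same pairs could be applied one by one with spark.conf.set(key, value).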

Ravi Mogha 3mo

But I think you missed one point: how much RAM would you choose for your executors, since the cores reside on the executors?

Eshita G. 3mo

Very interesting way to break down this topic.

Bikash D. 3mo

Well written 👏🏻

Sanket Chaudhari 2mo

Well explained!!

Saiyam Jain 3mo

Well explained!


More Relevant Posts

Sagar Suthar
2mo
Understanding Spark Partitioning Through a Small Hands-On Experiment Today, I explored how Spark partitions data and how that impacts parallel processing and storage layout in Databricks. Using a retail sales dataset (~8.5k rows), I experimented with three concepts that frequently come up in data engineering discussions: • runtime partitioning • storage partitioning • partition strategy design 1️⃣ Checking initial runtime partitions Using spark_partition_id(), I discovered that the dataset was initially loaded into a single partition, which means Spark would process the entire dataset sequentially. 2️⃣ Increasing parallelism with repartition() To improve distribution, I tested: sales_df.repartition(4, "Item_Type") This redistributed rows across partitions so Spark could process tasks in parallel. One key observation is that repartition() introduces a shuffle, so it should be used carefully for large datasets. 3️⃣ Reducing partitions with coalesce() After repartitioning, I reduced partitions using: sales_df.coalesce(2) coalesce() is useful when reducing partitions before writes because it avoids a full shuffle. Important learning: coalesce() cannot increase partitions — it only reduces them. 4️⃣ Storage partitioning experiments I tested different partition strategies using partitionBy(). • Partition by Outlet_Type → clean folder structure and efficient filtering • Partition by Item_Identifier → created many small folders/files due to high cardinality (inefficient design) • Partition by Outlet_Establishment_Year + Outlet_Type → produced a more balanced hierarchical layout This showed how partition column choice directly impacts storage structure and query performance. 5️⃣ Simple data skew observation I also explored how uneven data distribution can affect partitions: sales_df.groupBy("Item_Type").count().orderBy("count", ascending=False) Some categories contained significantly more rows than others. 
When partitions are built on skewed keys, Spark tasks can become unbalanced, causing slow tasks and uneven workloads. Key Takeaways:- • Spark datasets may initially load with very low parallelism • repartition() improves distribution but introduces shuffle • coalesce() efficiently reduces partitions before writes • partitionBy() controls storage layout in data lakes • High-cardinality columns can create too many small files • uneven data distribution can lead to data skew This experiment helped me better understand how partition design affects both Spark execution and data lake structure. #PySpark #Spark #Databricks #DataEngineering #BigData #Partitioning
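
The skew observation can be reproduced without Spark. The sketch below hash-partitions rows by key in plain Python; the category names and row counts are invented, and zlib.crc32 only stands in for whatever hash function Spark really uses.

```python
import zlib
from collections import Counter

def partition_for(key, num_partitions):
    """Deterministic stand-in for hash partitioning on a key."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions

def rows_per_partition(keys, num_partitions):
    """Count how many rows land in each partition."""
    counts = Counter(partition_for(k, num_partitions) for k in keys)
    return [counts.get(p, 0) for p in range(num_partitions)]

# Hypothetical skewed Item_Type column: one category dominates.
keys = ["Snacks"] * 700 + ["Dairy"] * 200 + ["Frozen"] * 100

sizes = rows_per_partition(keys, 4)
print(sizes)  # one partition holds at least the 700 "Snacks" rows
```

Every row with the same key lands in the same partition, so the task that owns the dominant key always ends up with most of the work, however many partitions are requested.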
Rajeev Kumar
2mo
🚨Your Databricks job is not slow. Your data layout might be. Many Data Engineers working on Databricks try to fix slow pipelines by: ⚙️ Increasing cluster size ⚙️ Adding more executors ⚙️ Tweaking Spark configs But they forget one critical thing… 👉 Data Layout. Let’s understand Partitioning vs Bucketing and why it can dramatically impact performance. 📊 Example Dataset | order_id | customer_id | country |order_date | amount | | 1 | 101 | India | 2025-01-10 | 500 | | 2 | 102 | USA | 2025-01-10 | 300 | | 3 | 103 | India | 2025-01-11 | 700 | | 4 | 104 | UK | 2025-01-11 | 200 | | 5 | 105 | India | 2025-01-12 | 400 1️⃣ The Problem Imagine your orders table is partitioned by country: 📂 orders/ 📁 country=India 📁 country=USA 📁 country=UK Looks optimized, right? 🤔 But inside country=India, there could still be millions of customers and huge files. When Spark reads that partition: ❌ Limited parallelism ❌ Large file processing ❌ Slower job execution 2️⃣ The Solution → Bucketing Now introduce bucketing on customer_id. 📂 orders/ 📁 country=India 📦 bucket_0 📦 bucket_1 📦 bucket_2 📦 bucket_3 Now Databricks executors can process these buckets in parallel ⚡ Example execution: 🧠 Executor 1 → bucket_0 🧠 Executor 2 → bucket_1 🧠 Executor 3 → bucket_2 🧠 Executor 4 → bucket_3 🎯 Result ✅ Better parallelism ✅ Faster processing ✅ Improved join performance 3️⃣ Where Bucketing Helps the Most Bucketing is especially powerful when: 🔹 Tables are frequently joined on the same key 🔹 Data size is very large (TB scale) 🔹 You want better distribution across executors Example query: SELECT * FROM orders o JOIN customers c ON o.customer_id = c.customer_id If both tables are bucketed on customer_id, the shuffle during joins can reduce significantly 🚀 4️⃣ Partitioning vs Bucketing 📂 Partitioning → Splits data into folders → Best for filtering (WHERE clause) 📦 Bucketing → Splits data into fixed files → Best for parallel processing & joins 💡 Using both together can dramatically improve Spark job performance. 
💡Final Thought Before increasing your Databricks cluster size, ask yourself: ❓Is my data layout optimized? Because sometimes the fastest pipeline improvement is not compute power. It’s how your data is stored. 🔥 #DataEngineering⚡ #Databricks🚀 #ApacheSpark📊 #BigData⚙️ #SparkOptimization🔄 #ETL 🏗️ #DataArchitecture
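
Why bucketing both tables on customer_id can cut the join shuffle is easier to see in a toy version. This is a sketch in plain Python, not Spark's implementation: zlib.crc32 stands in for the real bucket hash, the order rows come from the example table above, and the customer names are made up.

```python
import zlib

NUM_BUCKETS = 4

def bucket_of(customer_id):
    """Same function on both tables, so equal keys always share a bucket index."""
    return zlib.crc32(str(customer_id).encode("utf-8")) % NUM_BUCKETS

def bucketize(rows, key):
    buckets = [[] for _ in range(NUM_BUCKETS)]
    for row in rows:
        buckets[bucket_of(row[key])].append(row)
    return buckets

def bucketed_join(left_buckets, right_buckets, key):
    """Join bucket i only against bucket i; no row has to move between buckets."""
    joined = []
    for left, right in zip(left_buckets, right_buckets):
        lookup = {r[key]: r for r in right}
        for row in left:
            if row[key] in lookup:
                joined.append({**row, **lookup[row[key]]})
    return joined

orders = [
    {"order_id": 1, "customer_id": 101, "amount": 500},
    {"order_id": 2, "customer_id": 102, "amount": 300},
    {"order_id": 3, "customer_id": 103, "amount": 700},
    {"order_id": 4, "customer_id": 104, "amount": 200},
    {"order_id": 5, "customer_id": 105, "amount": 400},
]
# Hypothetical customers table.
customers = [{"customer_id": cid, "name": f"customer-{cid}"} for cid in range(101, 106)]

joined = bucketed_join(bucketize(orders, "customer_id"),
                       bucketize(customers, "customer_id"),
                       "customer_id")
print(len(joined))  # 5
```

The join only ever compares bucket i with bucket i, which is the property that lets an engine skip redistributing rows when both sides are bucketed the same way.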
Quratulain Afzal
3mo
🚨 Delta Lake vs Parquet — One is just a file. The other changes everything. A few years ago, most data platforms stored data like this: data.parquet data2.parquet data3.parquet Everything looked perfect… Until something went wrong. ❌ Pipeline failed halfway ❌ Duplicate data loaded ❌ No way to rollback ❌ No idea what changed And suddenly… Your “data lake” became a data swamp. That’s exactly the problem Delta Lake was built to solve. 📦 First, let’s understand Parquet Parquet is an incredibly efficient columnar file format. It is: ⚡ Fast 💾 Highly compressed 📊 Perfect for analytics This is why almost every modern data platform supports it. But Parquet has one major limitation: It is just a file. It does NOT know: • Who changed the data • When it was changed • How to recover previous versions No history. No transactions. No safety. ⚡ Now enter Delta Lake Delta Lake is not a replacement for Parquet. It is an enhancement. It adds something incredibly powerful: 🧠 A Transaction Log This changes everything. Now your data can: ✅ Update safely ✅ Delete safely ✅ Merge safely ✅ Track history ✅ Rollback mistakes Your data becomes reliable. 🔥 The Feature That Blows Everyone’s Mind: Time Travel Yes… Time Travel. You can query your data from the past. SELECT * FROM table VERSION AS OF 10 You can literally go back in time. Imagine fixing yesterday’s broken pipeline in seconds. 🧠 Simple Visual Difference PARQUET Just Files --------- data.parquet DELTA LAKE Files + Brain ------------- _delta_log data.parquet Delta Lake makes data intelligent. 💻 Real Difference in Code Writing Parquet: df.write.format("parquet").save("/data") Writing Delta: df.write.format("delta").save("/data") Looks similar. But capability difference is massive. Delta supports: UPDATE DELETE MERGE ROLLBACK Parquet does not. 
📈 Why Delta Lake is dominating in 2026 Delta Lake is now: ✔ Default in Microsoft Fabric ✔ Default in Azure Databricks ✔ Core of Lakehouse architecture ✔ Used in modern enterprise platforms Because modern data is: Not static Not simple Not forgiving It needs reliability. 🎯 The simplest explanation ever: Parquet = Storage Delta Lake = Storage + Reliability + History + Intelligence 🏆 Final Thought If your data is critical… Delta Lake is no longer optional. It’s essential. #DataEngineering #DeltaLake #Databricks #MicrosoftFabric #Lakehouse #BigData #Azure #DataEngineer #PySpark #ETL
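
The "files + transaction log" idea behind time travel can be pictured with a toy model. This is only a sketch under simplifying assumptions: each commit is a dict of added and removed file names, and the file names and the helper files_at_version are invented. The real _delta_log format carries much more than this.

```python
def files_at_version(log, version):
    """Replay commits 0..version to find which data files are live at that version."""
    live = set()
    for commit in log[: version + 1]:
        live -= set(commit.get("remove", []))
        live |= set(commit.get("add", []))
    return sorted(live)

# Hypothetical three-commit history: two appends, then a rewrite of the first file.
log = [
    {"add": ["part-0.parquet"]},
    {"add": ["part-1.parquet"]},
    {"remove": ["part-0.parquet"], "add": ["part-2.parquet"]},
]

print(files_at_version(log, 1))  # ['part-0.parquet', 'part-1.parquet']
print(files_at_version(log, 2))  # ['part-1.parquet', 'part-2.parquet']
```

Reading an older version just means replaying fewer commits, which only works while the old data files are still on disk.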

Anjali Kamboj
3mo
🔥 If You’re Not Reading Spark Physical Plans, You’re Guessing — Not Optimizing Most people say they “optimize Spark.” What they actually do: Increase cluster size Add executors Repartition randomly Hope it runs faster That’s not optimization. That’s scaling inefficiency. If you're using Spark in Microsoft Fabric, Azure Databricks, or Azure Synapse Analytics, real optimization starts here: df.explain("formatted") 🧠 1️⃣ Read the Physical Plan (Not Just Logical) Look for: Exchange → shuffle BroadcastHashJoin vs SortMergeJoin WholeStageCodegen AdaptiveSparkPlan Multiple Exchange nodes = shuffle tax. Shuffle is Spark’s most expensive operation: Disk spill Network IO Serialization GC pressure ⚡ 2️⃣ Control Shuffle Intentionally Default: spark.sql.shuffle.partitions = 200 That’s arbitrary. Too low → large partitions → OOM Too high → scheduler overhead Better: spark.conf.set("spark.sql.adaptive.enabled", "true") Let AQE coalesce partitions dynamically. 🚀 3️⃣ Broadcast > Sort Merge (When Valid) If one table is small: from pyspark.sql.functions import broadcast df_large.join(broadcast(df_small), "id") Avoid shuffling the large dataset. But monitor: spark.sql.autoBroadcastJoinThreshold 🧩 4️⃣ Partitioning Strategy Is Everything Bad: df.repartition(1000) Better: df.repartition("customer_id") Repartition before wide transformations. Use coalesce() only to reduce partitions. 🧨 5️⃣ The Silent Killer: Skew If one task runs 40 minutes while others finish in seconds — that’s skew. Fix with: Salting Pre-aggregation AQE skew handling In lakehouses using Delta Lake, bad partition columns often cause skew. 💾 6️⃣ File Size Matters Ideal Parquet file size: ~128MB–1GB Too many small files in Azure Data Lake Storage cause: Metadata overhead Slow listing Scheduler pressure Use compaction / OPTIMIZE. 🎯 Senior Spark Mindset Stop asking: “How big is the cluster?” Start asking: Where is the shuffle? Why this join strategy? Are partitions balanced? What does the physical plan show? 
Real optimization = understanding shuffle mechanics and execution plans. Cluster size hides inefficiency. Execution plans expose it. #Spark #DataEngineering #Azure #MicrosoftFabric #Lakehouse #PerformanceTuning
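
Salting, one of the skew fixes listed above, can be sketched in plain Python. The keys, counts, and helper names here are invented for illustration, and the salt is assigned round-robin to keep the output deterministic, whereas real jobs usually pick it at random.

```python
from collections import Counter

SALT_BUCKETS = 8  # assumed number of salt values

def salt_large_side(keys, salt_buckets=SALT_BUCKETS):
    """Append a salt suffix so one hot key becomes several distinct join keys."""
    return [f"{key}_{i % salt_buckets}" for i, key in enumerate(keys)]

def explode_small_side(rows, salt_buckets=SALT_BUCKETS):
    """Replicate each small-side row once per salt so every salted key still matches."""
    return {f"{key}_{s}": value
            for key, value in rows.items()
            for s in range(salt_buckets)}

# Hypothetical skewed join: 1000 rows for "IN", 8 rows for "UK".
large_keys = ["IN"] * 1000 + ["UK"] * 8
small_side = {"IN": "India", "UK": "United Kingdom"}

salted = salt_large_side(large_keys)
lookup = explode_small_side(small_side)
joined = [lookup[k] for k in salted]

per_key = Counter(salted)
print(max(per_key.values()))  # 125 rows per salted key instead of 1000 on one key
```

The trade-off is that the small side grows by a factor of SALT_BUCKETS, so this is only worth it when the skew is real.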
Gajarajan Jain
3mo
🔥 Change Data Feed (CDF) vs Time Travel in Databricks Two powerful features. Very different purposes. Frequently confused. After working with Databricks across multiple data environments, one misconception I still see, even in experienced teams, is mixing up Delta Time Travel and Change Data Feed (CDF). They sound related. They’re both part of Delta Lake. But architecturally, they serve completely different roles. Understanding this difference can significantly improve how you design scalable pipelines. 🕰 Time Travel = Historical State Reconstruction Time Travel allows you to query a table exactly as it existed at a previous version or timestamp. This is extremely powerful for: • Debugging unexpected metric changes • Reproducing financial reports • Supporting audit requests • Investigating data discrepancies If someone asks: “Why did revenue change from yesterday?” You can reconstruct yesterday’s exact dataset. But here’s the important distinction: Time Travel rebuilds the entire table snapshot. It does not isolate which rows changed. It answers: “What did the data look like?” It’s primarily a data trust and observability tool. 🔄 Change Data Feed (CDF) = Row-Level Change Propagation CDF captures the actual row-level changes between table versions: • Inserts • Updates • Deletes • Commit timestamps • Change types Instead of scanning the full table, downstream systems can process only what changed. This is critical for: • Incremental ETL • Warehouse synchronization • Efficient BI refresh • Near-real-time reporting • Event-driven architectures CDF answers: “What changed?” It’s a performance and scalability tool. ⚖️ The Architectural Difference Time Travel → Investigate the past CDF → Efficiently move forward Time Travel improves confidence. CDF improves efficiency. One strengthens governance. The other strengthens pipeline design. Mature Lakehouse platforms use both intentionally, not interchangeably. 
🚨 Why Many Teams Miss This From what I’ve observed: • CDF isn’t enabled by default • Teams rely on legacy full-refresh patterns • Analysts aren’t exposed to change-based processing • Time Travel gets mistaken as “good enough.” The result? Full table reloads. Unnecessary compute cost. Slow dashboards. Scalability bottlenecks. 🎯 The Real Shift Modern data platforms are moving from: “Can we rebuild everything nightly?” to “Can we process only what changed?” That shift is where CDF becomes foundational. 💬 Curious how others are designing pipelines: What pattern are you currently using? A️Full table reloads B️ Timestamp-based incremental logic C️ Delta Change Data Feed D️ Streaming + CDC architecture E️ Hybrid approach Please drop your answer below I'm always interested in seeing how teams are evolving their Lakehouse strategy.
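
The "what did the data look like?" versus "what changed?" distinction can be shown with two toy snapshots. This is only an illustration: the real Change Data Feed records changes when they are written rather than by diffing snapshots, and the keys, values, and helper name row_changes are made up.

```python
def row_changes(old, new):
    """Compare two {key: value} snapshots and list row-level changes."""
    changes = []
    for key in sorted(old.keys() | new.keys()):
        if key not in old:
            changes.append(("insert", key, new[key]))
        elif key not in new:
            changes.append(("delete", key, old[key]))
        elif old[key] != new[key]:
            changes.append(("update", key, new[key]))
    return changes

# Time Travel view: two full snapshots of a hypothetical order_id -> amount table.
v1 = {1: 500, 2: 300, 3: 700}
v2 = {1: 500, 2: 350, 4: 200}

# CDF-style view: only the rows that differ between the two versions.
print(row_changes(v1, v2))
# [('update', 2, 350), ('delete', 3, 700), ('insert', 4, 200)]
```

A downstream job that consumes the change list touches three rows here, while a full reload would reprocess every row in the table.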
Pooja Jain
2mo
Your data stack used to look like this: One system to store data. Another system to query it. Like keeping files in a warehouse… then hauling boxes to an office every time someone asked a question. It worked for a while. But every copy added cost, delays, and broken pipelines. That split defined data architecture for years. The old split • Databases: fast queries, strict structure, high cost • Data lakes: cheap storage, messy files, hard to query So teams copied data back and forth. Pipelines broke. Costs doubled. Everyone had a slightly different version of the truth. The simple idea What if the storage layer could behave like both? That’s the lakehouse. Think of it like turning a warehouse into a smart home. The data stays in one place, but different rooms serve different needs. What changed in practice? → Your raw JSON sits next to your clean tables → Data scientists and analysts read from the same source → ACID transactions on object storage (S3, ADLS) → Time travel without expensive snapshots → Schema enforcement without rigid structures Explore some powerful tools to start building your own lakehouse: • Apache Iceberg – Table format with time travel & schema evolution • Delta Lake – ACID transactions on data lakes • Apache Hudi – Real-time ingestion and upserts • LakeSoul – Rust-based lakehouse with streaming support • Nessie – Git-like catalog versioning Why AI teams care? → Training data, feature stores, embeddings, and batch analytics can live in the same place. → Less copying. Fewer pipelines. Fewer "which dataset is correct?" moments. Reality check? Performance still depends on design. Some workloads still fit warehouses better. But for many teams, the trade‑off between structure and flexibility is no longer necessary. One storage layer. Many ways to use it. Learn more about Lakehouse with these resources to get hands-on exposure: 1. 𝗣𝗮𝗰𝗸𝘁 𝗗𝗮𝘁𝗮𝗯𝗿𝗶𝗰𝗸𝘀 𝗟𝗮𝗸𝗲𝗵𝗼𝘂𝘀𝗲 by Will Girten - https://lnkd.in/gx6Hpt_y 2. 
𝗥𝗲𝗮𝗹𝘁𝗶𝗺𝗲 𝗦𝘁𝗿𝗲𝗮𝗺𝗶𝗻𝗴 𝘄𝗶𝘁𝗵 𝗗𝗮𝘁𝗮 𝗟𝗮𝗸𝗲𝗵𝗼𝘂𝘀𝗲 by Yusuf Ganiyu- https://lnkd.in/gN9_Bnb7 3. 𝗡𝗬𝗖 𝗧𝗮𝘅𝗶 𝗗𝗮𝘁𝗮 𝗟𝗮𝗸𝗲𝗵𝗼𝘂𝘀𝗲 𝗣𝗿𝗼𝗷𝗲𝗰𝘁 - https://lnkd.in/gjaJuiZp Curious to hear from other data engineers: What broke during your lakehouse adoption? What finally worked?
Mahesh Patil
2mo
For data buffs like me: DuckDB 1.5.0 shipped on March 9th, and here's what you can't miss... I read the release notes and I think many in the data community are excited for the right reasons.

The 17% TPC-H throughput improvement is the headline, and it's legitimate. The VARIANT type is genuinely useful for semi-structured data without an upfront schema commitment; it's DuckDB's answer to Snowflake VARIANT, but local and free. The GEOMETRY type (now built in instead of living in the spatial extension) with automatic shredding is a nice addition for spatial data workloads. These are all real improvements.

The buried feature that caught my eye is 𝐧𝐨𝐧-𝐛𝐥𝐨𝐜𝐤𝐢𝐧𝐠 𝐜𝐡𝐞𝐜𝐤𝐩𝐨𝐢𝐧𝐭𝐢𝐧𝐠. Before 1.5.0, DuckDB would pause reads and writes while it wrote its checkpoint to disk. For an embedded database inside a running application that users interact with in real time, that's a latency spike that shows up as timeouts. Non-blocking checkpointing means reads and writes continue concurrently while the checkpoint happens. No pause window. No "wait for the save to complete."

Here's why this is more than a performance note... I used DuckDB a few years ago and, coming back to it, I see a few interesting things happening. It started as a fast query tool: great for ad-hoc analytics on Parquet files, notebooks, local data exploration. But it's increasingly showing up 𝐢𝐧𝐬𝐢𝐝𝐞 applications, not alongside them:
↳ Embedded in Python data pipeline services as a local aggregation engine
↳ Inside Rust and Go services handling analytical queries without spinning up a separate database
↳ As the embedded query layer in new tooling that used to depend on Spark or Trino for even simple aggregations
↳ In serverless functions where startup time and binary size matter

𝐍𝐨𝐧-𝐛𝐥𝐨𝐜𝐤𝐢𝐧𝐠 𝐜𝐡𝐞𝐜𝐤𝐩𝐨𝐢𝐧𝐭𝐢𝐧𝐠 removes that blocking constraint when DuckDB runs as a library embedded in your service. Managed warehouses (Snowflake, BigQuery, Databricks) charge real money. DuckDB 1.5.0 ships that guarantee for free, in-process, with no external dependency. That's an architectural unlock, not a benchmark improvement.

My read of where DuckDB is heading:
1️⃣ 𝐄𝐦𝐛𝐞𝐝𝐝𝐞𝐝 𝐚𝐧𝐚𝐥𝐲𝐭𝐢𝐜𝐬 𝐥𝐚𝐲𝐞𝐫: it doesn't replace the warehouse, but becomes the local query engine
2️⃣ 𝐒𝐜𝐡𝐞𝐦𝐚-𝐟𝐥𝐞𝐱𝐢𝐛𝐥𝐞 𝐢𝐧𝐠𝐞𝐬𝐭𝐢𝐨𝐧 𝐭𝐨𝐨𝐥: VARIANT makes it a serious option for semi-structured data
3️⃣ 𝐏𝐫𝐨𝐝𝐮𝐜𝐭𝐢𝐨𝐧-𝐫𝐞𝐚𝐝𝐲 𝐢𝐧𝐟𝐫𝐚𝐬𝐭𝐫𝐮𝐜𝐭𝐮𝐫𝐞: from "interesting tool" to "buildable platform component"

I kept thinking "when do I reach for DuckDB instead of Spark et al?" The better question is: "what analytical workloads have I been running on heavyweights due to lack of credible embedded alternatives?" DuckDB 1.5.0 is the answer to more of those than before.

Are you using DuckDB? What's the use case, and what architectural decision did it unlock?

PS: The link to my Python notebook with these features is in the comments below.

#DuckDB #DataEngineering #Engineering #EmbeddedAnalytics #Cloud #DataPlatform
Kirk Broadhurst
2mo
We didn't choose Databricks for our purpose-built data platform. I'm not closing the door on Spark, but we'll wait until we actually need it before we introduce it. It's too expensive, complicated, and slow.

I wanted some numbers to back that up, so I spent a couple of hours running group-by benchmarks across DuckDB, Polars, and Spark. I built files of different sizes containing a few low-cardinality dimensions (one hundred values), a few high-cardinality dimensions (hundreds of thousands of values or more), and a handful of numeric columns, and then ran a selection of grouping queries across those dimensions. Pretty basic "database-like" stuff. I don't find these results surprising, but if you're new to this then it might be a shock.

We won't try to load data and cache. That's not a real-world scenario anyway: the typical distributed query pattern is scanning across large quantities of data in object storage. If you need everything running "hot" on a server, use something better like ClickHouse.

At 10M rows, DuckDB is 15x faster than Spark. Polars is 18x faster. At 100M rows, DuckDB is 4.5x and Polars is 3.5x faster. Spark is starting to catch up, but from a long way back. Even at 250M rows both DuckDB and Polars are 3.5x faster. At 1B rows, aggregating into 10M groups, Polars was 1.5x faster and DuckDB was 2.5x faster. Spark ran out of memory on the first attempt, so I gave it a second chance.

I'm not optimizing and tuning Spark for performance and stability. But that's the point: I don't want to, and I shouldn't have to. For typical analysis it just isn't necessary.

In a real scenario, like classic "datalakehouse" mode, you'll have a bunch of medium-sized objects in your data lake. Even if you have enormous tables, when partitioned appropriately you're querying a fraction of the data at any time. It doesn't matter if your table is 10 exabytes; if you're only looking at a few partitions at a time, then DuckDB is more than capable.
I've seen many teams shoot themselves in the foot with Spark. Unfortunately it's almost a default data processing tool, and we all just ignore its immense complexity to configure, monitor, and manage. I ran Spark for years back in the v1 days and it was good fun - I learnt a lot. But the big picture learning is that I rarely actually need it, and if that need arises later it's easy to bolt on.

Shah K
3mo
🚨 Databricks Data Engineer Associate - MASTER Failure Scenarios 1️⃣8️⃣ Cluster Autoscaling Fails During Skewed Joins Root Cause: Data skew overloads a single executor. Autoscaling adds nodes. Skew keeps hammering one node. Result: One executor OOM Others idle Job stalls Fix Strategy: Detect skew (spark.sql.adaptive.skewJoin.enabled=true) Salting Broadcast smaller table Repartition on high-cardinality keys Autoscaling ≠ skew solution. 1️⃣9️⃣ Unity Catalog External Tables Need Credentials Unity Catalog enforces secure access boundaries. To access cloud storage: Storage Credential External Location Grants No credential mapping → no access. Security by design. Not optional. 2️⃣0️⃣ MERGE INTO Creates Duplicates Cause: Non-unique merge condition. If target row matches multiple source rows → duplicates. Hard rule: Your merge key must be logically unique. Otherwise: Deduplicate source Aggregate before merge MERGE does exactly what you ask — even if it’s wrong. 2️⃣1️⃣ Why Shallow Clone is Instant Shallow clone copies: Metadata Transaction log references It does NOT copy data files. Data files are referenced from source table. That’s why: Instant creation Low storage cost Not safe if source deleted 2️⃣2️⃣ Streaming Checkpoint Grows Unlimited Cause: Stateful streaming + poor watermark design. If: No watermark Watermark too large Late data allowed indefinitely State store keeps growing. Impact: Driver memory pressure Checkpoint bloat Performance degradation Streaming requires lifecycle control. 2️⃣3️⃣ Auto Loader Stuck Listing Files Cause: Notification mode disabled. Directory listing on massive folder = bottleneck. Event-based ingestion is mandatory at scale. If you're scanning millions of files, your design is flawed. 2️⃣4️⃣ ZORDER Improves Only Selective Queries ZORDER clusters similar values. It benefits: Range filters Equality filters It does NOT: Reduce shuffle Optimize joins Replace partitioning Use it for predicate pruning only. 
2️⃣5️⃣ Broadcast Join Fails to Broadcast Spark disables broadcast when: Table stats missing Table larger than threshold Adaptive execution decides otherwise Fix: ANALYZE TABLE COMPUTE STATISTICS Adjust spark.sql.autoBroadcastJoinThreshold Use broadcast hint explicitly Spark doesn’t guess blindly. 2️⃣6️⃣ Delta VACUUM Failure Default retention = 7 days (168 hours). Safety check blocks recent file deletion. Override carefully: spark.databricks.delta.retentionDurationCheck.enabled=false Disable only when lineage is guaranteed. 2️⃣7️⃣ ACID Without Locks Delta uses: Optimistic concurrency control _delta_log transaction records Commit validated against latest version. Conflict → transaction aborted. No distributed locking required. 2️⃣8️⃣ Auto Loader vs COPY INTO Auto Loader: Incremental discovery Event notifications Maintains state COPY INTO: Full directory scan At petabyte scale → COPY INTO collapses.
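
The "deduplicate source" fix from the MERGE scenario above can be sketched in plain Python. The rows, column names, and both helpers are hypothetical, and merge_upsert is a simplified stand-in for an upsert, not how Delta's MERGE is implemented.

```python
def dedupe_latest(rows, key, order_by):
    """Keep one row per key: the one with the highest order_by value."""
    latest = {}
    for row in rows:
        k = row[key]
        if k not in latest or row[order_by] > latest[k][order_by]:
            latest[k] = row
    return list(latest.values())

def merge_upsert(target, source, key):
    """Toy upsert: update matching keys, insert new ones. Assumes source keys are unique."""
    merged = {row[key]: row for row in target}
    for row in source:
        merged[row[key]] = row
    return sorted(merged.values(), key=lambda r: r[key])

target = [{"id": 1, "amount": 500, "updated_at": 1}]
source = [
    {"id": 1, "amount": 550, "updated_at": 2},
    {"id": 1, "amount": 600, "updated_at": 3},  # same key twice in one batch
    {"id": 2, "amount": 300, "updated_at": 1},
]

clean_source = dedupe_latest(source, key="id", order_by="updated_at")
result = merge_upsert(target, clean_source, key="id")
print(result)
```

After deduplication each key appears once in the source, so the result has one row for id 1 (amount 600) and one for id 2.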
Debojit Basak
3mo
Built my first end-to-end data engineering pipeline on Databricks — and here's everything I learned. 🧵 🏗️ What I built: A full Medallion Architecture (Bronze → Silver → Gold) pipeline for an FMCG sports nutrition company, processing customer, product, pricing, and order data — all the way from raw CSVs to a production-ready analytics layer. ⚙️ The tech stack: → Databricks (PySpark + Delta Lake) → Unity Catalog with a dedicated fmcg catalog → Databricks Workflows (Jobs) for orchestration → Change Data Feed enabled across all tables → Delta MERGE (UPSERT) for idempotent writes 📐 The architecture: Four notebooks run in strict dependency order via a Workflow DAG: 1️⃣ dim_processing_customers 2️⃣ dim_processing_products 3️⃣ dim_processing_prices 4️⃣ fact_processing_orders Dimensions always load before facts. Products must exist before prices or orders can enrich with product_code. The DAG enforces this automatically. 🧹 The messiest part? The data quality. Real-world data is ugly. Here's what I had to fix before a single row hit Gold: ✅ City name typos — Bengaluruu, Hyderbad, NewDheli (a city mapping dictionary + allowed-list filter cleaned these up) ✅ Dates arriving in 4 different formats + weekday prefixes like "Tuesday, July 01, 2025" (regex strip + multi-format coalesce parsing) ✅ 'Protien' misspelled consistently across product names and categories (case-insensitive regex replace) ✅ Negative gross prices (flipped to absolute values) ✅ Missing city values confirmed with the business team and patched via a lookup join ✅ Non-numeric IDs replaced with a 999999 fallback to preserve downstream join integrity 13 distinct data quality issues handled across 5 notebooks. 📦 The incremental load challenge: The parent company needed monthly-level data, but our child data was at daily granularity. Simple enough, right? Wrong. The tricky part was late-arriving records. 
If a new daily order lands for a month that's already been aggregated and pushed to the parent — you can't just add the new quantity on top. That would double-count. The fix: identify the affected months from the new batch → re-fetch all daily records for those months from the child gold table → recalculate monthly totals from scratch → merge into parent. Always correct. Always complete. 📊 Final output: 19 Delta tables across Bronze, Silver, and Gold layers — feeding into a shared parent company analytics catalog. This project taught me that data engineering is 20% writing transformations and 80% understanding why your data is broken. #DataEngineering #Databricks #DeltaLake #PySpark #MedallionArchitecture #FMCG #Python #DataPipeline #Analytics #LearningInPublic
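
The late-arriving-record fix described above (find affected months, re-fetch their daily rows, recompute, merge) can be sketched in plain Python. The table contents, field names, and the helper refresh_monthly are invented for illustration; the original pipeline does this with Delta tables and MERGE.

```python
from collections import defaultdict

def month_of(date_str):
    """'2025-07-03' -> '2025-07'"""
    return date_str[:7]

def refresh_monthly(daily_gold, new_batch, monthly):
    """Recompute totals from scratch for every month touched by the new batch."""
    all_daily = daily_gold + new_batch
    affected = {month_of(r["date"]) for r in new_batch}
    totals = defaultdict(int)
    for r in all_daily:
        m = month_of(r["date"])
        if m in affected:
            totals[m] += r["qty"]
    refreshed = dict(monthly)
    refreshed.update(totals)  # overwrite affected months, leave the rest alone
    return refreshed

daily_gold = [
    {"date": "2025-07-01", "qty": 10},
    {"date": "2025-07-02", "qty": 5},
    {"date": "2025-08-01", "qty": 7},
]
monthly = {"2025-07": 15, "2025-08": 7}

# A late record for July arrives after July was already aggregated.
new_batch = [{"date": "2025-07-03", "qty": 4}]

print(refresh_monthly(daily_gold, new_batch, monthly))
# {'2025-07': 19, '2025-08': 7}
```

Because the affected month is rebuilt from all of its daily rows rather than incremented, the monthly total stays consistent with the daily data.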