Why data teams should consider Apache Iceberg

Just read a fun new post from Benjamin Rogojan (Seattle Data Guy) on why data teams shouldn’t blindly sprint toward Apache Iceberg. As always, Ben’s writing is provocative in the best way - challenging us to pause and ask why we’re reaching for the latest tech before we commit.

I'm going to counter Ben on this one: from firsthand experience, I’d argue that there is a clear place for it. One where it not only accelerates time-to-insight, but also helps reduce the need to pile more junk onto the dump.

In the analytics world, we’re constantly under pressure to answer questions fast. Stakeholders needed insights yesterday, so we chase whatever gets us to a solution quickest. Most of the time, those swift moves leave behind a heap of tech debt to clean up next year. But is Iceberg just another patch for deeper infrastructure issues …or does it actually have a practical home in your stack?

Take 6sense: in the 2022 calendar year alone, we went from two distinct warehouse environments to 5+. Why? M&A of 4 product companies and numerous new feature launches with evolving architecture needs. But more environments don’t change expectations. Everyone still wants centralized analytics.

A real question just two weeks after the acquisition of company #3: “Can we just see how usage of feature X in warehouse A + adoption of capability Y in environment B impacts customer health stored in warehouse C?”

Sounds easy, right? Just write a cross-environment query and wrap it up before lunch? In reality, you’re scrambling to stitch everything together, suddenly managing dozens (or hundreds) of EL jobs just to centralize data. At our peak, we had over 100 of these running nightly - each one draining compute, delaying other analytics pipelines, and straining SLA expectations.

Would Iceberg have solved all of that? No. Integrating acquired systems and legacy stacks is never simple. But had we adopted Iceberg earlier - with strong governance and a cross-functional design approach that prioritized analytics from the start - we could’ve avoided a lot of duplication and orchestration overhead.

Iceberg is NOT a silver bullet. But in the right context, it’s a powerful architectural lever for data democratization, and it can absolutely drive both efficiency and cost savings.

Curious to hear from others: Has your team adopted Iceberg? What made it worth it…or not?

Highly recommend giving Ben a follow if you work in data. The referenced post is in the comments!

Thanks for tagging and sharing my work! I do agree that Iceberg could help companies see faster time to value and remove friction if they find themselves porting data from Databricks to Snowflake and then to BigQuery. Like you, I've been brought into multiple M&A situations of the kind you described. Maybe you do get your 4 companies all on Iceberg! Then you buy another two and have to start again. That's less a reason against it and more a reality. I do think Paul Dudley also has a point that at a certain level of spend, it's likely far more expensive to wrangle and you don't get the benefits of Iceberg. But it can also be a distraction from actually trying to figure out business problems: the "Hey, I know we just migrated to Snowflake two years ago, but you know what would make things better? Iceberg!" mentality. So I do believe that on most points we're likely on the same page.

“rather than focusing on messier, more ambiguous challenges of understanding the business.” -> that’s what data teams should be focusing on. Dealing with the ambiguity is a people problem. Going back to the question you received after the acquisition, the strategy should be to identify up front the categories of questions executives need to answer and optimize the work for that. This is why I argue that focusing on the semantics provides higher leverage: you are optimizing for reusability (solve for the first integration and then you have a pattern for the second, third, and so on).
