Snowflake, Databricks and the Fight for Apache Iceberg Tables

The market for data lakes and data lakehouses is clearly being disrupted by open source software, given recent news from Databricks and Snowflake.
Jun 10th, 2024 9:46am
Over the past two weeks, San Francisco has hosted back-to-back data lake conferences from Snowflake and Databricks. Feature images by TNS.

SAN FRANCISCO — Last week, Snowflake announced it had adopted Apache Iceberg tables as a native format. Now customers can put their Snowflake data lakes into Iceberg, and even create external tables on a cloud provider of their choice, and have Snowflake manage them.
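To give a rough sense of what that native support looks like: a Snowflake-managed Iceberg table is declared much like an ordinary table, plus a pointer to an external volume on the customer's own cloud storage. The table, column, volume and location names below are hypothetical placeholders:

```sql
-- Sketch of a Snowflake-managed Iceberg table; "my_ext_volume"
-- and the base location are placeholder names.
CREATE ICEBERG TABLE analytics.events (
  event_id   BIGINT,
  user_id    BIGINT,
  event_time TIMESTAMP
)
  CATALOG = 'SNOWFLAKE'              -- Snowflake manages the table metadata
  EXTERNAL_VOLUME = 'my_ext_volume'  -- customer-owned cloud storage bucket
  BASE_LOCATION = 'events/';
```

Because the data lands in the customer's own bucket in the open Iceberg format, other Iceberg-aware engines can read the same files directly.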

In addition, Snowflake released Polaris, a catalog for Iceberg tables that can be called by any data processing engine that can read the format (Spark, Dremio, Snowflake).

With the catalog, using the engine of your choice, you can do joins across tables, gathering information that was heretofore much more difficult to obtain. Permissions, governing who can see what, are managed by the catalog itself. And shortly, you will be able to pull in metadata from other catalogs.
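Polaris exposes the Iceberg REST catalog API, so an outside engine such as Spark can be pointed at it with a few configuration properties. The catalog name, endpoint URI and table names below are illustrative, not taken from Snowflake's documentation:

```properties
# spark-defaults.conf (illustrative): register a Polaris endpoint
# as an Iceberg REST catalog named "polaris"
spark.sql.catalog.polaris=org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.polaris.type=rest
spark.sql.catalog.polaris.uri=https://polaris.example.com/api/catalog
```

```sql
-- A cross-table join through the shared catalog; access permissions
-- are enforced by the catalog rather than by the individual engine.
SELECT o.order_id, c.page_url
FROM polaris.sales.orders o
JOIN polaris.web.clicks c ON o.user_id = c.user_id;
```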

The company discussed these interoperability initiatives during its own user conference, the Snowflake Data Cloud Summit, held last week in San Francisco.

But the company was not alone in its eager adoption of Iceberg.

Also last week, chief Snowflake rival Databricks announced it had purchased Tabular, a provider of an Iceberg distribution founded by the three people who created the technology: Ryan Blue, Daniel Weeks and Jason Reid.

How did Apache Iceberg become the belle of the ball? Clearly, data lakes and data lakehouses are about to undergo a fundamental shift to open source.

Apache Iceberg Came from Netflix

“I think in this space, we have a classic customer who wants control of their solution.” – Snowflake’s Ron Ortloff

Iceberg grew out of the frustrations of Netflix engineers trying to scale their data operations, with existing file formats proving unreliable in distributed scenarios.

Netflix open sourced the project in 2018 and donated it to the Apache Software Foundation. Since then, Airbnb, Amazon Web Services, Alibaba, Expedia and others have contributed.

The advantage that Iceberg brings is that it allows data to be stored once — eliminating a whole mess of compliance and security issues around keeping copies of data in multiple places — and queried by any one of a number of Iceberg-compliant engines.

A large number of Iceberg distributions are available these days, from CelerData, ClickHouse, Cloudera, Dremio, Starburst and, of course, Tabular. Earlier this month, Microsoft announced that it would support Snowflake’s Iceberg tables on its own Microsoft Fabric, an analytics service on Azure.

Customers are very, very sensitive about lock-in these days, said Ron Ortloff, Snowflake’s senior product manager. “I think in this space, we have a classic customer who wants control of their solution,” he said in an interview with The New Stack. “So we want to give those customers a choice.”

Snowflake has traditionally been a company that manages a client’s data from the cloud, relieving the customer of the considerable burden of managing it themselves. So why risk the customer base with an offer to allow customers to manage their own data?

“We think there’s 100 to 200 times more data outside of Snowflake in data lakes that we can tap into with Iceberg,” Ortloff said. Rather than competing on exclusive custody of the data, the company sees itself competing on a “great platform experience,” especially as the stakes grow richer with more enterprises adopting AI in a big way.

“If we build great platform experiences, that data gravity is going to flow right there through,” he said.

A diagram showing how Polaris integrates with the rest of the Snowflake infrastructure. From Ron Ortloff’s presentation.

Databricks Solidifies Its Iceberg Expertise

Databricks’ acquisition of Tabular was indeed spurred by customer demand for better interoperability among data lake formats.

“This is a long journey, one that will likely take several years to achieve in those communities,” the company admitted in a blog post. To this end, Databricks has released Delta Lake UniForm, which allows tables to be read across Iceberg, Databricks’ own Delta Lake format and the Apache Hudi transactional data lake format.
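On the Databricks side, UniForm is switched on per table through Delta table properties, which tell the writer to maintain Iceberg-readable metadata alongside the Delta transaction log. A minimal sketch, with hypothetical table and column names, and property names as documented for UniForm:

```sql
-- Delta table that also publishes Iceberg metadata via UniForm.
CREATE TABLE sales.orders (
  order_id BIGINT,
  amount   DOUBLE
)
TBLPROPERTIES (
  'delta.enableIcebergCompatV2'          = 'true',
  'delta.universalFormat.enabledFormats' = 'iceberg'
);
```

An Iceberg-compliant engine can then read the same table through a catalog without the data being copied into a second format.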

Others have weighed in on the significance of the Databricks purchase in light of Snowflake’s activity.

“After storage and compute became decoupled, all of the layers from storage through analytics began to be similarly unbundled, a process currently taking place with tables,” wrote New Relic CTO Siva Padisetty in a statement. “Databricks seeks to match open-source parity with Iceberg, and Tabular is how they expect to achieve it.”

The competition will shift to which company can, in the open source format, process data most quickly and cost-effectively, with all the governance and security safeguards in place, Padisetty summarized.

Last week, we covered the news from the Snowflake event; this week, TNS will continue its coverage of the Iceberg wars as TNS data correspondent Andrew Brust covers Databricks’ Data + AI Summit, taking place this week in San Francisco. There we will hear more about Databricks’ own plans for the future of Iceberg.

TNS owner Insight Partners is an investor in: Databricks, Dremio.