From the course: Apache Iceberg: From Zero to Production Data Lakehouse

Modeling data into an Apache Iceberg table

The way we model our Apache Iceberg tables is extremely important to query performance, because the less data we read, the faster our query can run. Iceberg was built with this in mind and combines partition-based and column-metric-based file skipping. This means a query can skip a file whenever a filter can exclude it based on its partition value or any other column metric. Iceberg can also do this while planning the query, before ever touching the data files themselves. This is very important when working with cloud storage, where list operations and file I/O are expensive and slow.
In this video, I'll introduce you to the basics of Iceberg table modeling and the kinds of decisions you'll need to make for optimal performance. The important point to remember as we get started is that the less data we read, the faster a query will run.
Note that most of this video is Spark-specific, and every engine has its own syntax and options for creating Iceberg tables. There are analogs in just about every query language, but you should not expect to be able to cut and paste from my examples without a little translation.
To start with, let's take a look at the New York City taxi dataset and see what kind of modeling decisions we could make when adding this data to an Iceberg table, and what impact those decisions would have on query performance. In this case, the dataset exists in the Parquet file format, which means we won't have to deal with any schema translation to Iceberg. If it were in another file format, we would have to map data types to the correct Iceberg types when we ingest into the table. For example, a CSV file is by definition all string-type data. While we could import all columns as strings, we would probably want to change numeric data to integers or floats.
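To make the CSV point concrete, here is a minimal sketch in plain Python rather than Spark. The column names and the type mapping are assumptions made up for illustration, not the real taxi schema; in an engine you would typically express the same idea as casts in the select that feeds the table.

```python
import csv
import io

# Hypothetical sample: these column names are illustrative only.
RAW_CSV = """passenger_count,trip_distance,payment_type
1,2.5,card
3,10.75,cash
"""

# Assumed target type per column; anything not listed stays a string.
COLUMN_TYPES = {"passenger_count": int, "trip_distance": float}


def read_typed_rows(text):
    """Parse CSV text and cast each value to its declared column type."""
    rows = []
    for record in csv.DictReader(io.StringIO(text)):
        typed = {
            name: COLUMN_TYPES.get(name, str)(value)
            for name, value in record.items()
        }
        rows.append(typed)
    return rows


rows = read_typed_rows(RAW_CSV)
```

Every value comes out of the CSV reader as a string, so the numeric columns only become integers and floats because the mapping says so. Typed values are what allow numeric comparisons, and therefore numeric column metrics, to behave as expected later on.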
Iceberg as a table format supports all of the types you would expect from a relational database, as well as complex structs, lists, and maps, so this is usually a straightforward process. In v3, Iceberg also supports a semi-structured data type called variant, which is useful for ingesting from sources with a variable schema. However, variant lacks the performance of a strongly typed primitive unless it is paired with ancillary maintenance or a special write-time capability called shredding. Only certain engines support shredding, and they use different heuristics to choose what is shredded.
In later videos, we'll talk more about migrating data in place to Iceberg, but for demonstration purposes, we'll be using CREATE TABLE AS SELECT statements here to write completely new copies of the data into our table. This not only allows us to keep our experimental tables separate, but also lets us try different types of partitioning to restructure how the data is laid out in files.
Let's start by creating a table without any partitioning to walk through what an Iceberg table actually looks like. Looking into our object store, we'll see we've created two directories: a data folder with data files and a metadata folder with metadata files. Within the metadata folder, there are three flavors of metadata files for Iceberg v3 and below. First, there's a metadata.json file, which contains the table schema, partitioning information, and table properties such as target file size or compression codec. There's also a manifest list, which has a file name starting with snap and ending with .avro; it represents a single snapshot of the table and lists which manifests are part of that snapshot. Finally, there are manifest files, whose names contain m followed by a number and end with .avro; they list the data files that currently make up the table.
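The three layers just described can be pictured as a small tree. The sketch below is a simplified Python model, not Iceberg's actual file contents or API: the class names, field names, and file paths are all hypothetical, and the single property value is only illustrative.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Manifest:
    path: str              # a file whose name contains m<number> and ends in .avro
    data_files: List[str]  # data files tracked by this manifest


@dataclass
class ManifestList:
    path: str                  # a snap-....avro file, one per snapshot
    manifests: List[Manifest]  # manifests that belong to this snapshot


@dataclass
class TableMetadata:
    path: str                       # the metadata.json file
    schema: Dict[str, str]          # column name -> type
    properties: Dict[str, str]      # table properties
    current_snapshot: ManifestList  # simplified: only the current snapshot


def current_data_files(metadata: TableMetadata) -> List[str]:
    """Walk metadata.json -> manifest list -> manifests to collect data files."""
    files: List[str] = []
    for manifest in metadata.current_snapshot.manifests:
        files.extend(manifest.data_files)
    return files


# Hypothetical paths for a table with a single commit.
manifest = Manifest(
    "metadata/example-m0.avro",
    ["data/file-a.parquet", "data/file-b.parquet"],
)
snapshot = ManifestList("metadata/snap-example.avro", [manifest])
table = TableMetadata(
    "metadata/v1.metadata.json",
    {"trip_distance": "double"},
    {"write.target-file-size-bytes": "536870912"},
    snapshot,
)
live_files = current_data_files(table)
```

The point of the model is the direction of the pointers: starting from the metadata.json, a reader can reach every data file in the snapshot without ever listing a directory.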
Since we only have a single commit, our metadata.json points to this snapshot, which in turn points to this manifest, which contains one entry for every data file in our /data/ directory. Each entry also contains the column metrics and the partition value, if there was one, for that data file. Iceberg uses only the records in the metadata to plan query scans, so the naming of the directories is just for human observers. Iceberg itself will not use the directory structure and will never perform a list operation during query planning.
Since both past and current data files are present in the object store, a file being present in the directory does not necessarily mean that it is currently part of the table. To see what is actually current in the table, we use a special set of metadata tables, which are virtual tables built from the contents of the metadata.json, the manifest list, and the manifests we saw above. We can see one of these tables by executing a command like this.
This structure allows engines to efficiently access the data files within the table. The metrics stored within the metadata allow Iceberg to skip unnecessary data when planning a query, and this starts with partitioning. In the next video, we'll explain Iceberg's partitioning and why it's described as hidden partitioning.
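As a rough illustration of the metadata-only planning described in this video, the sketch below prunes files using per-file lower and upper bounds for a column. It is a plain Python model under stated assumptions: the entry structure, the file paths, the bounds, and the single "greater than" filter are all hypothetical, and a real engine handles many more predicate types and metrics.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class ManifestEntry:
    file_path: str
    # column name -> (lower bound, upper bound) recorded for that data file
    bounds: Dict[str, Tuple[float, float]]


def plan_scan(entries: List[ManifestEntry], column: str, min_value: float) -> List[str]:
    """Return files that might contain rows where column > min_value."""
    selected = []
    for entry in entries:
        column_bounds = entry.bounds.get(column)
        if column_bounds is None:
            # No metrics for this column: the file cannot be ruled out.
            selected.append(entry.file_path)
            continue
        _lower, upper = column_bounds
        if upper > min_value:
            # At least one row could match, so the file must be read.
            selected.append(entry.file_path)
    return selected


# Hypothetical manifest entries for three data files.
ENTRIES = [
    ManifestEntry("data/file-a.parquet", {"trip_distance": (0.1, 12.0)}),
    ManifestEntry("data/file-b.parquet", {"trip_distance": (0.5, 80.3)}),
    ManifestEntry("data/file-c.parquet", {"trip_distance": (60.0, 95.0)}),
]

# Equivalent in spirit to filtering on trip_distance > 50.
files_to_read = plan_scan(ENTRIES, "trip_distance", 50.0)
```

Here the first file is skipped without being opened, because its recorded upper bound shows that no row can satisfy the filter. Only the manifest entries are consulted, which is why no list operation or data file read is needed during planning.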
