From the course: Apache Iceberg for Data Analytics and Machine Learning
Why Apache Iceberg? - Apache Tutorial
(upbeat music) - Hi. Welcome to this tutorial on using Apache Iceberg with Dremio, Power BI, and Jupyter Notebooks for advanced analytics and machine learning. Today we'll explore how Dremio can transform files into the Apache Iceberg format and leverage this for powerful analysis. I'm Andrew Madson, an AI, machine learning, and analytics evangelist with Dremio. Before we dive into our hands-on tutorial, let me share a little bit more about why Apache Iceberg has become such a game changer in the world of analytics and machine learning.

Imagine you're building a complex machine learning pipeline. You've got data streaming in constantly, your feature requirements are evolving, and you need to ensure model reproducibility while maintaining performance. This is exactly where Iceberg shines.

What makes Iceberg particularly powerful is how it handles schema evolution. In traditional data lakes, changing your data structure is like remodeling a house while people are living in it: it's messy, and things often break. Iceberg changes this completely. You can add new features, remove columns you no longer need, or even modify data types, all while your existing queries and machine learning pipelines continue to run smoothly. For data scientists, this means you can experiment with new features without disrupting your production models.

One of my favorite features is Iceberg's time travel capability. It's like having version control for your data, similar to how Git works for code. This is enabled through Project Nessie, the Iceberg catalog. If you need to reproduce the exact results from a model you trained three months ago, no problem. Want to run A/B tests with consistent data? Easy. This feature is invaluable for model validation, regulatory compliance, and debugging model performance degradation. You can snapshot your training data, tag specific versions for production models, and even create branches for experimental features.
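As a rough illustration of the schema evolution and time travel described above, here is what those operations look like in Iceberg's Spark SQL dialect. This is a hedged sketch, not part of the course exercises: the table name `demo.features`, the column names, and the snapshot values are all hypothetical.

```sql
-- Schema evolution: metadata-only changes, no rewrite of existing data files
-- (table and column names are hypothetical)
ALTER TABLE demo.features ADD COLUMN session_length DOUBLE;
ALTER TABLE demo.features DROP COLUMN legacy_score;

-- Time travel: query the table as it existed at an earlier point
-- (timestamp and snapshot ID are placeholder values)
SELECT * FROM demo.features TIMESTAMP AS OF '2024-06-01 00:00:00';
SELECT * FROM demo.features VERSION AS OF 8744736658442914487;
```

Existing queries against `demo.features` keep running unchanged after the `ALTER TABLE` statements, which is the "experiment without disrupting production" point made above.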
When it comes to performance, Iceberg is like having a smart assistant that automatically organizes your data for optimal access. Its hidden partitioning system means you don't need to manually organize your data; it learns your query patterns and optimizes accordingly. This is particularly crucial for ML workflows, where you're frequently accessing subsets of data for training and validation. The system maintains statistics and indexes that make common ML operations, like feature-importance calculations and correlation analysis, lightning fast.

Governance and compliance capabilities in Iceberg are robust enough for even the most regulated industries. Every transformation is tracked, creating a clear audit trail that can trace any prediction back to its training data. You get column-level security for sensitive features and row-level filtering for data privacy, all while maintaining high performance. This is essential for teams working in finance, healthcare, or any industry where model explainability and data lineage are non-negotiable.

The scalability aspect of Iceberg is remarkable. It's like having a system that grows with you without requiring constant maintenance, which is incredibly helpful for both analytics and machine learning. Let's dive in.
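To make the hidden partitioning idea concrete, here is a hedged sketch in Iceberg's Spark SQL dialect (the `demo.events` table and its columns are hypothetical). You declare a partition transform once when creating the table; readers then filter on the raw `ts` column, and Iceberg prunes partitions for them with no separate partition column in their queries.

```sql
-- Hidden partitioning: partition by day, derived from the timestamp column
-- (table and column names are hypothetical)
CREATE TABLE demo.events (
  ts      TIMESTAMP,
  user_id BIGINT,
  payload STRING
) USING iceberg
PARTITIONED BY (days(ts));

-- Readers filter on ts directly; Iceberg prunes partitions automatically
SELECT count(*) FROM demo.events
WHERE ts >= TIMESTAMP '2024-06-01 00:00:00';
```

Contrast this with Hive-style layouts, where the query would have to filter on an explicit partition column (e.g. a derived date string) to get any pruning at all.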
Contents
- Why Apache Iceberg? (3m 44s)
- Setting up your project (2m 19s)
- Dremio login and overview (2m 42s)
- Adding sources and creating data products with SQL (15m 23s)
- Connecting to Power BI (4m 5s)
- Data analysis and machine learning with Jupyter Notebook (18m 57s)
- Conclusion (1m 21s)