From the course: Complete Guide to Data Lakes and Lakehouses

Dremio walkthrough

- Now we are ready to start running queries and connecting systems to analyze the data in our lakehouse. But before we do that, let me do a quick walkthrough of Dremio. Go back to your ports and open Dremio at port 9047. Remember, if your ports aren't running, you may need to restart the services using Docker Compose.
Let me start with the sources. As the name suggests, sources are connections to storage systems such as S3 or other types of object storage, as well as databases. Here are all the sources we can connect to Dremio. The datasets in a source are called physical datasets. A physical dataset can be a group of files in S3, like in our case, or a table in a database. Physical datasets are identified by the purple-blue icon. They have some distinguishing characteristics. For example, physical datasets are immutable. If data analysts want to curate data or build business logic, they will do it in virtual datasets built on top of physical datasets. We are going to explore this later. It is a best practice to limit access to physical datasets to a small group of data engineers or the Dremio cluster administrator. This is because physical datasets may contain confidential or personally identifying information.
Then we have virtual datasets, which are similar to table views, but with more features. They're built on top of one or more physical datasets or other virtual datasets. In our case, these gold datasets are built on top of the silver physical datasets. We created our virtual datasets through dbt by indicating we wanted the gold layer models to be materialized as views in our data platform. Virtual datasets are saved in our home space or a shared space. Having this layered approach enables you to create sets of reusable virtual datasets for multiple projects and maximize data sharing, data security, and performance.
And then a space is a location that organizes virtual datasets shared with other users. Spaces have several important characteristics. For example, they allow grouping datasets by a common theme such as a project, business unit, or geographical region. Each space can be configured for sharing and other privileges. Users will not see spaces for which they have no authorization.
And lastly, the home space is a private user work area for your physical and virtual datasets. You can curate data in the home space until it's ready to be shared by moving it to a shared space. Now that you are familiar with Dremio, let's build some virtual datasets.
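To make the layering concrete, here is a minimal sketch of how a virtual dataset in a shared space might be defined as a view over a physical dataset. It only builds the SQL text and does not connect to Dremio. The space name `analytics`, the view name `gold_orders`, the source path `lakehouse.silver.orders`, and the column names are all hypothetical placeholders, not names from the course project, and the helper functions are illustrative rather than part of any Dremio library.

```python
from typing import Sequence


def quote_path(*parts: str) -> str:
    """Join path parts with dots, wrapping each part in double quotes."""
    # Double any embedded double quote so the identifier stays valid.
    return ".".join('"' + part.replace('"', '""') + '"' for part in parts)


def create_view_sql(
    view_path: Sequence[str],
    source_path: Sequence[str],
    columns: Sequence[str],
) -> str:
    """Build a CREATE OR REPLACE VIEW statement over one source dataset."""
    column_list = ", ".join(quote_path(column) for column in columns)
    return (
        f"CREATE OR REPLACE VIEW {quote_path(*view_path)} "
        f"AS SELECT {column_list} FROM {quote_path(*source_path)}"
    )


if __name__ == "__main__":
    # Hypothetical names: a shared space "analytics" holding a gold view
    # built on a silver physical dataset in a source called "lakehouse".
    print(
        create_view_sql(
            view_path=("analytics", "gold_orders"),
            source_path=("lakehouse", "silver", "orders"),
            columns=("order_id", "total"),
        )
    )
```

The generated statement could then be pasted into the Dremio SQL editor, assuming a space and source with those names exist. Selecting only the needed columns in the view is one way to keep confidential fields in the physical dataset out of what analysts see.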
