From the course: Build a No-Code ETL Pipeline with Google BigQuery
How we will manage data - BigQuery Tutorial
- [Instructor] Before we go on to design our data transfer configuration, we need a clear plan for how we'll manage our data. I have already mentioned that our data follows a versioning model. In this example, on the 2nd of October, 2024, we get version 130 of our data. Let's assume that this version has 1,000 rows. That's not the real number, but I'm simplifying here. Then on the next day, we get version 131, which has 1,150 rows. So the second version contains the same data as before, plus 150 rows that are new. Now you can see that if I simply loaded all the data every day into my table, I would get a lot of duplicates. We need a better strategy to manage this.

Here's how we're going to do it. On day one, we're going to have version 130, and we are going to upload this into our Google Cloud Storage bucket. Next, we're going to take this file and make a copy of it in the same place, and we will call this copy latest.csv. From now on, latest will always point to the latest version of the data that we have in our folder. Next, we will configure BigQuery Data Transfer Service to transfer the contents of latest into our BigQuery table. So at the end of day one, we will have our original file here, and we will have latest.

Then on day two, we're going to add a new file with version 131. Once again, we're going to take the latest version of our data and copy it into latest.csv. After that, we will take the contents of latest and copy them into the BigQuery table. What's really important is that when we do this, we will not append our rows; we will overwrite all the rows in the table. This ensures that we always have the latest version of our data in the BigQuery table while avoiding duplicates. And at the same time, in our bucket folder over here, we are storing all the different versions. We are creating an archive of all the different versions of our data.
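The daily routine described above can be sketched in code. Here is a minimal Python simulation that models the bucket as a dictionary mapping object names to rows; the file names, version numbers, and row counts are illustrative, matching the example above:

```python
# Minimal simulation of the bucket-versioning routine described above.
# The "bucket" is modeled as a dict: object name -> list of rows.
bucket = {}

def upload_version(version: int, rows: list) -> None:
    """Upload a new versioned file, then copy it to latest.csv."""
    name = f"data_v{version}.csv"
    bucket[name] = rows
    # Copying to latest.csv overwrites any previous "latest".
    bucket["latest.csv"] = rows

# Day one: version 130 with 1,000 rows (row contents are placeholders).
upload_version(130, [f"row{i}" for i in range(1000)])
# Day two: version 131 repeats those rows and adds 150 new ones.
upload_version(131, [f"row{i}" for i in range(1150)])

# latest.csv now points at version 131...
print(len(bucket["latest.csv"]))  # 1150
# ...while the archive still holds every version.
print(sorted(bucket))  # ['data_v130.csv', 'data_v131.csv', 'latest.csv']
```

The transfer configuration only ever reads latest.csv, while the versioned files accumulate as an archive.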
Here you can see how this approach achieves our objectives. Finally, on day three, we will add yet another file, copy this file into latest, and then transfer latest into the BigQuery table. This approach also makes our life easier when writing the transfer configuration, because we can simply tell the transfer configuration to always point to the latest file. So every day, the transfer will get the data from the same file, and this file will contain up-to-date data.

Now, let's simulate this data load in practice. Here on the left, you can see the Cloud Storage bucket that we created before. And here on the right, you can see the two versions of my data that I have saved on my local computer. Let's pretend that it's October 2nd, and what I have here is the latest version of the data. What I want to do is upload this data into my bucket. So I will just drag the file, the upload will take a while, and now my file has been uploaded. Next, I want to create a copy and call it latest. To do that, I can click on these three dots, select Copy, and here I'm given the full name of the file, including the address showing where it lives in the bucket. What I need to do is change the last part of the name to say latest. I'm going to select Copy. You can see now that I have the latest file, and now you need to imagine BigQuery Data Transfer Service coming to this file, taking all the data, and putting it into my table.

Now, another day has passed, and I have a new version of my data. So once again, I want to upload this into my bucket. The newest file has been uploaded into my bucket, and I want to update latest to reflect this latest version. Once again, I will select Copy, and I will name my copy latest. Now, Cloud Storage asks me if I want to overwrite the previous version of latest, and yes, I want to do that. And now latest points to the latest version of my data.
And once again, BigQuery Data Transfer Service will get the data from latest and overwrite the table. The result is that I will have the latest data in BigQuery without any duplicates.
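The difference between appending and overwriting can be illustrated with a small Python sketch; the table is modeled as a plain list, and the row values and counts are placeholders matching the earlier example. In BigQuery load terms, overwriting corresponds to a WRITE_TRUNCATE rather than a WRITE_APPEND write disposition:

```python
# Day-one and day-two versions: version 131 contains all of
# version 130 plus 150 new rows, as in the example above.
v130 = [f"row{i}" for i in range(1000)]
v131 = [f"row{i}" for i in range(1150)]

# Appending each day's file would duplicate the overlapping rows.
appended = v130 + v131
print(len(appended))  # 2150
print(len(appended) - len(set(appended)))  # 1000 duplicate rows

# Overwriting the table with latest.csv keeps it duplicate-free.
table = v131  # WRITE_TRUNCATE-style load: replace all rows
print(len(table), len(set(table)))  # 1150 1150
```

Appending would re-load the 1,000 overlapping rows every day, while overwriting from latest leaves exactly one copy of each row.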