Standardization, handling duplicates, and missing values

From the course: ETL in Python and SQL

- [Tutor] In the last video, we discussed the importance of the transformation step in an ETL process. Now, let's transform the customer data from H+ Sport. To transform our data, we first need to extract it from the Excel file it currently resides in. Do you remember how? First things first, let's import pandas as pd and make sure it runs. Next, we import our customer data. So that's customers is equal to pd.read_excel. Let's copy the path for the file, which ends in .xlsx, and the sheet name is data. Let's run this to make sure we're good. Awesome, and we are. Now we can take a look at what customers looks like by running head on customers. Let's run this. Awesome, this is what it looks like. To begin our transformation, let's check for duplicates. Duplicates can exist in a couple of places. Sometimes it can be as simple as checking whether an ID appears multiple times, where the ID is unique for each customer. Other times…
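
The steps described so far can be sketched roughly as follows. This is a minimal illustration, not the course's actual notebook. The sheet name "data" comes from the transcript, but the workbook path is not given there, so `load_customers` takes it as an argument. The column names `customer_id` and `name`, the helper `duplicated_ids`, and the sample rows are all hypothetical stand-ins so the duplicate check can run without the Excel file.

```python
import pandas as pd


def load_customers(path: str) -> pd.DataFrame:
    """Read the customer sheet from an Excel workbook.

    The sheet name "data" is taken from the transcript; the path is
    supplied by the caller because the transcript does not state it.
    """
    return pd.read_excel(path, sheet_name="data")


def duplicated_ids(df: pd.DataFrame, id_column: str) -> pd.DataFrame:
    """Return every row whose ID value appears more than once."""
    # keep=False marks all occurrences of a repeated ID, not just the later ones.
    return df[df.duplicated(subset=[id_column], keep=False)]


# Hypothetical sample data standing in for the real customer sheet.
sample = pd.DataFrame(
    {
        "customer_id": [1, 2, 2, 3],
        "name": ["Ann", "Bob", "Bob", "Cy"],
    }
)

# Preview the first rows, as the tutor does with head().
print(sample.head())

# Rows that share an ID with at least one other row.
repeated = duplicated_ids(sample, "customer_id")
print(repeated)
```

With the real workbook, the same check would be applied to the frame returned by `load_customers`, using whatever the ID column is actually called in that sheet.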
