From the course: ETL in Python and SQL
Standardization, handling duplicates, and missing values
- [Tutor] In the last video, we discussed the importance of the transformation step in an ETL process. Now, let's transform the customers data from H+ Sport. To transform our data, first, we need to extract it from the Excel file it currently resides in. Do you remember how? First things first, let's import pandas as pd and make sure it runs. Next, we import our customers data. So that's customers is equal to pd.read_excel. Let's copy the path for this, or we can just copy it this way, .xlsx, and the sheet name is data. Let's run this to make sure we're good. Awesome, and we are. And now we can take a look at what customers looks like by running head on customers. Let's run this. Awesome, this is what it looks like. To begin our transformation, let's check for duplicates. Duplicates can exist in a couple of places, and sometimes it can be as simple as checking whether the ID appears multiple times, where the ID is unique for each customer. Other times…
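The extraction and ID-based duplicate check described so far can be sketched roughly as follows. This is a minimal illustration, not the instructor's notebook: the workbook path is not given in the transcript, so `EXCEL_PATH` appears only as a placeholder in a comment, and the column names `customer_id` and `name` and the sample rows are made up so the sketch runs on its own.

```python
import pandas as pd

# In the video the data is read from an Excel sheet named "data", along the lines of:
#   customers = pd.read_excel(EXCEL_PATH, sheet_name="data")
# EXCEL_PATH is a placeholder; the real path is not given in the transcript.
# A small made-up frame (hypothetical columns and values) stands in for it here.
customers = pd.DataFrame(
    {
        "customer_id": [1, 2, 2, 3],
        "name": ["Ana", "Ben", "Ben", "Cy"],
    }
)

# Peek at the first rows, as done in the video with head().
print(customers.head())

# Flag every row whose ID appears more than once.
# keep=False marks all copies, not just the later ones.
dup_mask = customers["customer_id"].duplicated(keep=False)
duplicate_rows = customers[dup_mask]
print(duplicate_rows)

# Count how many distinct IDs are repeated.
n_repeated_ids = customers.loc[dup_mask, "customer_id"].nunique()
print(n_repeated_ids)
```

Using `keep=False` shows every copy of a repeated ID side by side, which makes it easier to judge whether the rows are true duplicates before dropping anything.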
Contents
- Loading data from different sources (4m 1s)
- Extracting your data (2m 15s)
- Cleaning, preprocessing data, and data formatting (3m 52s)
- Standardization, handling duplicates, and missing values (6m 1s)
- Challenge: Extract and transform data using pandas (34s)
- Solution: Extract and transform data using pandas (3m 47s)