From the course: PySpark Essential Training: Introduction to Building Data Pipelines

Challenge: Essential data manipulation

(bright upbeat music) - [Instructor] Okay, it's your turn now to try this hands-on. Write some PySpark code that does some data cleansing and applies several of the methods you've learned before.

Step one, create two new dataframes called df_jan_2025 and df_feb_2025 from the corresponding data files.

Step two, create a new dataframe called df_2025_combined as a union of these two dataframes.

Step three, select only the following columns from this combined dataframe and rename them as indicated in the parentheses. Reassign the result to df_2025_combined. For example, take the tpep_pickup_datetime column and rename it to pu_datetime.

Step four, create a new dataframe called taxi_zones from the taxi_zone_lookup.csv file.

Step five, join the taxi_zones dataframe onto the df_2025_combined dataframe using the do_location_id and LocationID columns. Reassign the result of this join to df_2025_combined.

Step six, the final cleanup. Drop the superfluous LocationID, Zone, and…
