Python vs PySpark

I have written this article in collaboration with Himanshu Negi to give you a kick start on using PySpark for basic DataFrame preprocessing for machine learning models running on Spark clusters. By now, most of you have hit the limitations of working with big data in a Pandas DataFrame and are exploring the journey of working with Spark DataFrames. The pandas library is fine for preprocessing small data (under 100 MB), but on larger datasets (100 MB to multiple gigabytes) performance issues can make run-times much longer, or cause the code to fail entirely because of insufficient memory. This is where Spark comes into the picture for big data. We won't go deep into Spark here, but for starters: Spark is a general-purpose distributed data processing engine that can handle very large (100 gigabytes to multiple terabytes) datasets.

Previously, while working with small or medium datasets, performance was not an issue in Python. But as data grows day by day, you have to catch up with new technologies for faster, parallel processing of big data, and that is where PySpark comes in. PySpark is essentially the Python API for Spark.

While changing code from Python to PySpark, you might wonder how many changes are needed to convert it. If that question sounds familiar, here is a Jupyter notebook (written in Databricks) showing the syntax changes you will need to make when moving your code from Python to PySpark. Below are screenshots of simple data transformation jobs that show the difference between the two.

The following code is written in Databricks. You can use the Databricks Community Edition, which is free of cost.


Setup your cluster

Databricks is very easy to set up and use, just like a Jupyter notebook. Once you create your account, you can create your own single-node cluster: give the cluster a name, select your availability zone, and click on 'Create Cluster'.

Here is an example:



Get the data

You can use the Iris dataset, which is publicly available. Download the Iris.csv file and upload it to your Databricks database by browsing your local file system. Below is the link to get the dataset.

Click on the 'Create Table using UI' option, select the cluster you created, and click on 'Preview Table'. It will then give you some more options; choose them as appropriate for your data and click on 'Create Table'.


Once the data is uploaded, you're ready to create your own Workspace and start writing code. You can find the code here: Python_vs_PySpark



1. Read the Iris.csv file into a Pandas and a Spark DataFrame

# Pandas

import pandas as pd
pdf = pd.read_csv("/dbfs/FileStore/tables/Iris.csv")



# Spark DataFrame (the `spark` session is predefined in Databricks;
# the option key is inferSchema, not infer_schema)

sdf = spark.read \
  .options(inferSchema=True, header=True) \
  .csv("/FileStore/tables/Iris.csv")


2. Get top 5 rows

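Since the screenshot doesn't render here, a minimal sketch of both versions on a toy DataFrame (the PySpark line assumes the `sdf` DataFrame from step 1 and a live Spark session, so it is shown as a comment):

```python
import pandas as pd

pdf = pd.DataFrame({
    "SepalLengthCm": [5.1, 4.9, 4.7, 4.6, 5.0, 5.4],
    "Species": ["Iris-setosa"] * 6,
})

# Pandas: first 5 rows
top5 = pdf.head(5)
print(top5)

# PySpark equivalent:
# sdf.show(5)
```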

3. Getting column names and datatypes is similar in both Python and PySpark

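Roughly, on a toy frame (the PySpark attributes have the same names, but need a live session, so they are commented):

```python
import pandas as pd

pdf = pd.DataFrame({"Id": [1, 2], "SepalLengthCm": [5.1, 4.9]})

# Pandas: column names and per-column dtypes
cols = list(pdf.columns)
print(cols)
print(pdf.dtypes)

# PySpark:
# sdf.columns          # list of column names
# sdf.dtypes           # list of (name, type) tuples
```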


4. Renaming the columns

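A small sketch of the rename step (the PySpark version, commented out, assumes the `sdf` from step 1; it renames one column at a time):

```python
import pandas as pd

pdf = pd.DataFrame({"SepalLengthCm": [5.1], "SepalWidthCm": [3.5]})

# Pandas: rename via a mapping of old name -> new name
pdf = pdf.rename(columns={"SepalLengthCm": "sepal_length"})
print(pdf.columns.tolist())

# PySpark:
# sdf = sdf.withColumnRenamed("SepalLengthCm", "sepal_length")
```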


5. Get the shape of the DataFrame

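In sketch form: Pandas exposes `shape` directly, while Spark has no such attribute, so you combine `count()` with the column list (the Spark line is a comment since it needs a live session):

```python
import pandas as pd

pdf = pd.DataFrame({"SepalLengthCm": [5.1, 4.9, 4.7]})

# Pandas: shape is a (rows, columns) attribute
print(pdf.shape)

# PySpark: no .shape; build the tuple yourself
# (sdf.count(), len(sdf.columns))
```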


6. Filter the columns

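A minimal sketch of selecting a subset of columns, assuming that is what the screenshot showed (the PySpark line is a comment since it needs a live session):

```python
import pandas as pd

pdf = pd.DataFrame({
    "Id": [1, 2],
    "SepalLengthCm": [5.1, 4.9],
    "Species": ["Iris-setosa", "Iris-setosa"],
})

# Pandas: index with a list of column names
subset = pdf[["SepalLengthCm", "Species"]]
print(subset.columns.tolist())

# PySpark:
# sdf.select("SepalLengthCm", "Species")
```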


7. Adding a new column to the DataFrame

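Sketched on a toy frame with a hypothetical derived column (the PySpark version uses `withColumn` and is commented out since it needs a live session):

```python
import pandas as pd

pdf = pd.DataFrame({"SepalLengthCm": [5.1, 4.9]})

# Pandas: assign a derived column directly
pdf["SepalLengthMm"] = pdf["SepalLengthCm"] * 10

# PySpark:
# from pyspark.sql import functions as F
# sdf = sdf.withColumn("SepalLengthMm", F.col("SepalLengthCm") * 10)
```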


8. Get unique values inside a column

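A minimal sketch: Pandas returns the unique values as an array, while Spark expresses the same idea as `distinct()` on a one-column selection (commented, as it needs a live session):

```python
import pandas as pd

pdf = pd.DataFrame({"Species": ["Iris-setosa", "Iris-versicolor", "Iris-setosa"]})

# Pandas: unique values of a Series
uniques = pdf["Species"].unique()
print(uniques)

# PySpark:
# sdf.select("Species").distinct().show()
```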


9. Filling the null values

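Sketched with a constant fill value; `fillna` happens to have the same name on both sides (the Spark line is a comment since it needs a live session):

```python
import pandas as pd
import numpy as np

pdf = pd.DataFrame({"SepalLengthCm": [5.1, np.nan, 4.9]})

# Pandas: replace NaN with a constant
pdf = pdf.fillna(0)

# PySpark:
# sdf = sdf.fillna(0)      # equivalently sdf.na.fill(0)
```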


10. Drop a column

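In sketch form (the PySpark `drop` takes the column name directly; it is commented since it needs a live session):

```python
import pandas as pd

pdf = pd.DataFrame({"Id": [1, 2], "Species": ["Iris-setosa", "Iris-virginica"]})

# Pandas: drop along the column axis
pdf = pdf.drop(columns=["Id"])
print(pdf.columns.tolist())

# PySpark:
# sdf = sdf.drop("Id")
```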


11. Describe function

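A minimal sketch of the summary statistics call; in Spark, `describe()` returns another DataFrame, so you chain `.show()` to print it (commented, as it needs a live session):

```python
import pandas as pd

pdf = pd.DataFrame({"SepalLengthCm": [5.1, 4.9, 4.7]})

# Pandas: count, mean, std, min, quartiles, max per numeric column
stats = pdf.describe()
print(stats)

# PySpark:
# sdf.describe().show()
```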

