Python vs PySpark

I have written this article in collaboration with Himanshu Negi to give you a kick start on using PySpark for basic DataFrame preprocessing for machine learning models running on Spark clusters. By now, most of you have hit the limitations of working with big data in a Pandas DataFrame and are exploring the journey of working with Spark DataFrames. The pandas library is fine for preprocessing small data (under 100 MB), but on larger datasets (100 MB to multiple gigabytes) performance issues can make run-times much longer, or cause the code to fail entirely because of insufficient memory. This is where Spark comes into the picture for big data. We won't go deep into Spark here, but for starters: Spark is a general-purpose distributed data processing engine that can handle very large (100 gigabytes to multiple terabytes) datasets.

Previously, while working with small or medium datasets, performance was not an issue in Python. But as data grows day by day, you have to catch up with new technologies for faster, parallel processing of big data, and that is where PySpark comes in. PySpark is essentially the Python API for Spark.

While changing code from Python to PySpark, you might wonder how many changes are needed to convert it. If that question sounds familiar, here is a Jupyter notebook (written in Databricks) showing the syntax changes you will need to make when moving your code from Python to PySpark. Below are screenshots of simple data transformation jobs that show the difference between the two.

The following code is written in Databricks. You can use the Databricks Community Edition, which is free of cost.


Setup your cluster

Databricks is very easy to set up and use, just like a Jupyter notebook. Once you create your account, you can create your own single-node cluster: give the cluster a name, select your availability zone, and click on 'Create Cluster'.

Here is an example:



Get the data

You can use the Iris dataset, which is publicly available. Download the Iris.csv file and upload it to your Databricks database by browsing your local file system. Below is the link to get the dataset.

Click on the 'Create Table using UI' option, select the cluster you created, and click on 'Preview Table'. It will then give you some more options; choose them as appropriate for your data and click on 'Create Table'.


Once the data is uploaded, you're ready to create your own Workspace and start writing code. You can find the code here: Python_vs_PySpark



1. Read the Iris.csv file into a Pandas and a Spark DataFrame

# Pandas

import pandas as pd
pdf = pd.read_csv("/dbfs/FileStore/tables/Iris.csv")



# Spark DataFrame (the `spark` session is predefined in Databricks;
# the option key is inferSchema, not infer_schema)

sdf = spark.read \
  .options(inferSchema=True, header=True) \
  .csv("/FileStore/tables/Iris.csv")


2. Get top 5 rows

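Since the screenshot doesn't render here, a minimal sketch of both versions on a toy DataFrame (the PySpark line assumes the `sdf` DataFrame from step 1 and a live Spark session, so it is shown as a comment):

```python
import pandas as pd

pdf = pd.DataFrame({
    "SepalLengthCm": [5.1, 4.9, 4.7, 4.6, 5.0, 5.4],
    "Species": ["Iris-setosa"] * 6,
})

# Pandas: first 5 rows
top5 = pdf.head(5)
print(top5)

# PySpark equivalent:
# sdf.show(5)
```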

3. Getting column names and datatypes is similar in both Python and PySpark

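Roughly, on a toy frame (the PySpark attributes have the same names, but need a live session, so they are commented):

```python
import pandas as pd

pdf = pd.DataFrame({"Id": [1, 2], "SepalLengthCm": [5.1, 4.9]})

# Pandas: column names and per-column dtypes
cols = list(pdf.columns)
print(cols)
print(pdf.dtypes)

# PySpark:
# sdf.columns          # list of column names
# sdf.dtypes           # list of (name, type) tuples
```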


4. Renaming the columns

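A small sketch of the rename step (the PySpark version, commented out, assumes the `sdf` from step 1; it renames one column at a time):

```python
import pandas as pd

pdf = pd.DataFrame({"SepalLengthCm": [5.1], "SepalWidthCm": [3.5]})

# Pandas: rename via a mapping of old name -> new name
pdf = pdf.rename(columns={"SepalLengthCm": "sepal_length"})
print(pdf.columns.tolist())

# PySpark:
# sdf = sdf.withColumnRenamed("SepalLengthCm", "sepal_length")
```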


5. Get the shape of the DataFrame

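In sketch form: Pandas exposes `shape` directly, while Spark has no such attribute, so you combine `count()` with the column list (the Spark line is a comment since it needs a live session):

```python
import pandas as pd

pdf = pd.DataFrame({"SepalLengthCm": [5.1, 4.9, 4.7]})

# Pandas: shape is a (rows, columns) attribute
print(pdf.shape)

# PySpark: no .shape; build the tuple yourself
# (sdf.count(), len(sdf.columns))
```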


6. Filter the columns

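A minimal sketch of selecting a subset of columns, assuming that is what the screenshot showed (the PySpark line is a comment since it needs a live session):

```python
import pandas as pd

pdf = pd.DataFrame({
    "Id": [1, 2],
    "SepalLengthCm": [5.1, 4.9],
    "Species": ["Iris-setosa", "Iris-setosa"],
})

# Pandas: index with a list of column names
subset = pdf[["SepalLengthCm", "Species"]]
print(subset.columns.tolist())

# PySpark:
# sdf.select("SepalLengthCm", "Species")
```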


7. Adding a new column to the DataFrame

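Sketched on a toy frame with a hypothetical derived column (the PySpark version uses `withColumn` and is commented out since it needs a live session):

```python
import pandas as pd

pdf = pd.DataFrame({"SepalLengthCm": [5.1, 4.9]})

# Pandas: assign a derived column directly
pdf["SepalLengthMm"] = pdf["SepalLengthCm"] * 10

# PySpark:
# from pyspark.sql import functions as F
# sdf = sdf.withColumn("SepalLengthMm", F.col("SepalLengthCm") * 10)
```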


8. Get unique values inside a column

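A minimal sketch: Pandas returns the unique values as an array, while Spark expresses the same idea as `distinct()` on a one-column selection (commented, as it needs a live session):

```python
import pandas as pd

pdf = pd.DataFrame({"Species": ["Iris-setosa", "Iris-versicolor", "Iris-setosa"]})

# Pandas: unique values of a Series
uniques = pdf["Species"].unique()
print(uniques)

# PySpark:
# sdf.select("Species").distinct().show()
```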


9. Filling the null values

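Sketched with a constant fill value; `fillna` happens to have the same name on both sides (the Spark line is a comment since it needs a live session):

```python
import pandas as pd
import numpy as np

pdf = pd.DataFrame({"SepalLengthCm": [5.1, np.nan, 4.9]})

# Pandas: replace NaN with a constant
pdf = pdf.fillna(0)

# PySpark:
# sdf = sdf.fillna(0)      # equivalently sdf.na.fill(0)
```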


10. Drop a column

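In sketch form (the PySpark `drop` takes the column name directly; it is commented since it needs a live session):

```python
import pandas as pd

pdf = pd.DataFrame({"Id": [1, 2], "Species": ["Iris-setosa", "Iris-virginica"]})

# Pandas: drop along the column axis
pdf = pdf.drop(columns=["Id"])
print(pdf.columns.tolist())

# PySpark:
# sdf = sdf.drop("Id")
```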


11. Describe function

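A minimal sketch of the summary statistics call; in Spark, `describe()` returns another DataFrame, so you chain `.show()` to print it (commented, as it needs a live session):

```python
import pandas as pd

pdf = pd.DataFrame({"SepalLengthCm": [5.1, 4.9, 4.7]})

# Pandas: count, mean, std, min, quartiles, max per numeric column
stats = pdf.describe()
print(stats)

# PySpark:
# sdf.describe().show()
```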

