People have asked me several times how to set up a good, clean code organization for a Python project with PySpark. I didn't find a fully featured example project, so this is my attempt at one. It also includes a simple integration with Jupyter Notebook inside the project.
- https://mungingdata.com/pyspark/chaining-dataframe-transformations/
- https://medium.com/albert-franzi/the-spark-job-pattern-862bc518632a
- https://pawamoy.github.io/copier-poetry/
- https://drivendata.github.io/cookiecutter-data-science/#why-use-this-project-structure
All you need is the following configuration already installed:
- Git
- The project was tested with Python 3.10.18 managed by pyenv:
  - Use the `make pyenv` goal to launch the automated install of pyenv
- `JAVA_HOME` environment variable configured with a Java `JDK11`
- `SPARK_HOME` environment variable configured with the Spark `spark-3.5.6-bin-hadoop3` package
- `PYSPARK_PYTHON` environment variable configured with `"python3.10"`
- `PYSPARK_DRIVER_PYTHON` environment variable configured with `"python3.10"`
- Install Make to run the `Makefile`
- Why `Python 3.10`? Because `PySpark 3.5.6` doesn't seem to work with `Python 3.11` at the moment (I haven't tried with Python 3.12)
- pyenv prerequisites for Ubuntu. Check the prerequisites for your OS.
sudo apt-get update; sudo apt-get install make build-essential libssl-dev zlib1g-dev \
libbz2-dev libreadline-dev libsqlite3-dev wget curl llvm \
libncursesw5-dev xz-utils tk-dev libxml2-dev libxmlsec1-dev libffi-dev liblzma-dev
- `pyenv` installed and available in path (see pyenv installation with Prerequisites)
- Install Python 3.10 with pyenv on homebrew/linuxbrew
CONFIGURE_OPTS="--with-openssl=$(brew --prefix openssl)" pyenv install 3.10
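The environment variables listed in the prerequisites can be checked before launching anything. The following is a minimal sketch and not part of the project: the function name `missing_spark_env` is hypothetical, and it only verifies that each variable is set, not that it points to a valid install.

```python
import os
from typing import List, Mapping

# Variables named in the prerequisites above.
REQUIRED_VARS = (
    "JAVA_HOME",
    "SPARK_HOME",
    "PYSPARK_PYTHON",
    "PYSPARK_DRIVER_PYTHON",
)


def missing_spark_env(env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the required variables that are unset or empty in `env`."""
    return [name for name in REQUIRED_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_spark_env()
    if missing:
        print("Missing environment variables: " + ", ".join(missing))
    else:
        print("All Spark-related environment variables are set.")
```

Passing a plain dictionary instead of `os.environ` makes the check easy to exercise in a unit test.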
- Auto format via IDE: https://github.com/psf/black#pycharmintellij-idea
- [Optional] You could set up a pre-commit hook to enforce Black formatting before each commit: https://github.com/psf/black#version-control-integration
- Or remember to type `black .` to apply the Black formatting rules to all sources before committing
- Add integration with Jenkins; it will complain and tests will fail if Black formatting is not applied
- Add the same mypy options for VS Code in `Preferences: Open User Settings`
- Use the option to lint/format with black and flake8 on editor save in VS Code
Checked optional type with Mypy PEP 484
Configure Mypy to help with type annotations/hints in Python code. It's very useful for the IDE and for catching errors/bugs early.
- Install the mypy plugin for IntelliJ
- Adjust the plugin with the following options:
"--follow-imports=silent", "--show-column-numbers", "--ignore-missing-imports", "--disallow-untyped-defs", "--check-untyped-defs"
- Documentation: Type hints cheat sheet (Python 3)
- Add the same mypy options for VS Code in `Preferences: Open User Settings`
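To illustrate what the `--disallow-untyped-defs` option asks for, here is a minimal sketch. The function `normalize_title` is hypothetical and not taken from the project; it only shows a fully annotated definition next to an unannotated one.

```python
from typing import Optional


def normalize_title(title: Optional[str]) -> str:
    """Fully annotated, so it satisfies --disallow-untyped-defs."""
    # The None check narrows Optional[str] to str for the type checker.
    if title is None:
        return ""
    return title.strip().lower()


# A definition like the one below carries no annotations, so mypy run with
# --disallow-untyped-defs is expected to flag it as missing type annotations:
#
#     def normalize_title_untyped(title):
#         return title.strip().lower()
```

Without the `None` check, mypy would also be expected to complain that `Optional[str]` has no guaranteed `strip` method, which is the kind of bug the PEP 484 hints help catch early.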
- isort is the default in PyCharm
- isort with VS Code
- Lint/format/sort imports on save with VS Code in `Preferences: Open User Settings`:
{
    "editor.formatOnSave": true,
    "python.formatting.provider": "black",
    "[python]": {
        "editor.codeActionsOnSave": {
            "source.organizeImports": true
        }
    }
}
- isort configuration for PyCharm. See Set isort and black formatting code in pycharm
- You can use the `make lint` command to check the flake8/mypy rules and automatically apply black and isort formatting to the code with the previous configuration
isort .
- Show a way to handle a malformed JSON file such as `data/pubmed.json`
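One possible way to handle such a file can be sketched in plain Python (not PySpark). The text above does not say what is wrong with `data/pubmed.json`, so this example assumes the defect is a trailing comma before a closing bracket; the function name `load_lenient_json` is hypothetical.

```python
import json
import re
from typing import Any

# Matches a comma followed only by whitespace before a closing ] or }.
_TRAILING_COMMA = re.compile(r",(\s*[\]}])")


def load_lenient_json(text: str) -> Any:
    """Parse JSON, retrying once with trailing commas removed."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        # Caveat: this regex is naive and would also alter a string value
        # that happens to contain ",]" or ",}".
        repaired = _TRAILING_COMMA.sub(r"\1", text)
        return json.loads(repaired)
```

Valid input takes the fast path untouched, and input that is still invalid after the repair raises `json.JSONDecodeError` as usual, so other kinds of corruption are not silently hidden.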
- Create a poetry env with Python 3.10
poetry env use 3.10
- Install pyenv
make pyenv
- Install dependencies in the poetry env (virtualenv)
make deps
- Lint & Test
make build
- Lint, Test & Run
make run
- Run dev
make dev
- Build the binary/Python wheel
make dist
poetry run drugs_gen --help
Usage: drugs_gen [OPTIONS]

Options:
  -d, --drugs TEXT             Path to drugs.csv
  -p, --pubmed TEXT            Path to pubmed.csv
  -c, --clinicals_trials TEXT  Path to clinical_trials.csv
  -o, --output TEXT            Output path to result.json (e.g
                               /path/to/result.json)
  --help                       Show this message and exit.
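The option set shown in the help text can be mirrored with a small parser. This is a sketch only: it uses the standard-library `argparse` to stay dependency-free, which may not be what the project uses, and the function names `build_parser` and `parse_args` are hypothetical.

```python
import argparse
from typing import Optional, Sequence


def build_parser() -> argparse.ArgumentParser:
    """Declare the four options listed in the drugs_gen help text."""
    parser = argparse.ArgumentParser(prog="drugs_gen")
    parser.add_argument("-d", "--drugs", help="Path to drugs.csv")
    parser.add_argument("-p", "--pubmed", help="Path to pubmed.csv")
    parser.add_argument(
        "-c", "--clinicals_trials", help="Path to clinical_trials.csv"
    )
    parser.add_argument(
        "-o",
        "--output",
        help="Output path to result.json (e.g /path/to/result.json)",
    )
    return parser


def parse_args(argv: Optional[Sequence[str]] = None) -> argparse.Namespace:
    """Parse `argv`, or sys.argv[1:] when `argv` is None."""
    return build_parser().parse_args(argv)
```

Options that are not supplied come back as `None`, so the caller can decide whether to fall back to default paths.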
- Use `spark-submit` with the Python wheel file built by the `make dist` command in the `dist` folder.
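A sketch of how such a submission could be assembled is given below. Several things here are assumptions: the helper name `build_spark_submit_cmd`, the existence of a separate entry script, and passing the wheel through `--py-files`. The Spark documentation lists `.zip`, `.egg` and `.py` files for that flag; a wheel is a zip archive, so this generally works for pure-Python packages, but the project may submit its wheel differently.

```python
from pathlib import Path
from typing import List


def build_spark_submit_cmd(dist_dir: str, entry_script: str) -> List[str]:
    """Build a spark-submit command that ships the newest wheel in dist_dir."""
    # Sort by modification time so the most recent build comes last.
    wheels = sorted(
        Path(dist_dir).glob("*.whl"), key=lambda p: p.stat().st_mtime
    )
    if not wheels:
        raise FileNotFoundError(
            f"No wheel found in {dist_dir}; run `make dist` first"
        )
    # --py-files places the archive on the PYTHONPATH of the Python app.
    return ["spark-submit", "--py-files", str(wheels[-1]), entry_script]
```

The returned list can be handed to `subprocess.run(cmd, check=True)`; keeping command construction separate from execution makes it testable without a Spark install.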