[WIP] dlt offers an init command that will clone and inject any pipeline from this repository into your project, and set up the credentials and python dependencies. Please follow our docs.
Join our slack by following the invitation link
For people using the pipelines: technical-help channel
For contributors: dlt-contributors channel
python-dlt uses poetry to manage, build and version the package. It also uses make to automate tasks. To start, run
make install-poetry # installs poetry; run it outside of a virtualenv
then
make dev # installs all dependencies, including dev
Executing poetry shell and working in it is very convenient at this moment.
Use python 3.8 for development which is the lowest supported version for python-dlt. You'll need distutils and venv:
sudo apt-get install python3.8
sudo apt-get install python3.8-distutils
sudo apt install python3.8-venv
You may also use pyenv as poetry suggests.
python-dlt uses mypy and flake8 with several plugins for linting. We do not reorder imports or reformat code. To lint the code do make lint.
Code does not need to be typed, but it is better if it is: mypy is able to catch a lot of problems in the code. If your pipeline is typed, add a file named py.typed to the folder where your pipeline code is (see the chess pipeline for an example).
Input arguments of source and resource functions should be typed. That allows dlt to validate input arguments at runtime, tell which of them are secrets, and generate the secret and config files automatically.
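To illustrate why type hints make runtime validation possible, here is a minimal stdlib-only sketch. It is not how dlt is implemented: `check_arguments` and `chess_games` are hypothetical names made up for this example, and only plain classes such as `str` and `int` are checked.

```python
import inspect
from typing import get_type_hints


def check_arguments(func, **kwargs):
    """Raise TypeError if a keyword argument does not match its annotation."""
    hints = get_type_hints(func)
    # bind() also reports missing or unexpected arguments
    bound = inspect.signature(func).bind(**kwargs)
    for name, value in bound.arguments.items():
        expected = hints.get(name)
        # only plain classes (str, int, ...) are checked in this sketch
        if isinstance(expected, type) and not isinstance(value, expected):
            raise TypeError(
                f"{name} must be {expected.__name__}, got {type(value).__name__}"
            )
    return bound.arguments


# hypothetical source function; names and arguments are made up for illustration
def chess_games(username: str, max_games: int = 10):
    return []
```

Without the annotations on `chess_games` there would be nothing to check the passed values against, which is the point of typing the arguments.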
The linting step requires properly constructed python packages, so it will ask for __init__ files to be created. That can be automated with
./check-package.sh --fix
executed from the top repo folder.
- Create an issue that describes the pipeline or the problem being fixed
- Make a feature branch
- Follow the guidelines from Repository structure chapter
- Commit to that branch when you work. Please use descriptive commit names
- Make a PR to master branch
You can find the official dlt documentation at our docs site. This documentation is aimed at newcomers, who often are not professional programmers. In other words, it is a good way to get a first grasp of how to create a pipeline.
For contributors we have in-depth technical documentation that may not be polished but is much more comprehensive. The chapter on config and credentials is a must-read.
All repo code resides in the pipelines folder. Each pipeline has its own pipeline folder (ie. chess) where the dlt.source and dlt.resource functions are present. The internal organization of this folder is up to the contributor. For each pipeline there's also a script with example usages (ie. chess_pipeline.py). The intention is to show the user how the sources/resources may be called and to let the user copy the code from it.
- Create a folder (pipeline folder) with your pipeline name in pipelines. Place all your code in that folder.
- Place (decorated) source/resource functions in the main module named as pipeline folder (the __init__.py also works)
- Try to separate your code so that the part that you want people to hack stays in the main module and the rest goes to some helper modules.
- Create a demo/usage script with the name <pipeline_folder>_pipeline.py and place it in pipelines. Make it work with postgres or duckdb so it is easy to try it out
- Add pipeline specific dependencies as described below
- Place your tests in tests/<pipeline folder>. To run your tests you'll need to create test accounts, data sets, credentials etc. Talk to dlt team on slack. We may provide you with the required accounts and credentials.
- Add example credentials to this repo as described below.
- The pipeline must pass CI: linter and tests stage. If you created any accounts or credentials, this data must be shared either via this repo or as described later. We'll add it to our CI secrets
If pipeline requires additional dependencies that are not available in python-dlt they may be added as follows:
- Use poetry to add it to the group with the same name as pipeline. Example: chess pipeline uses python-chess to decode game moves. Dependency was added with poetry add -G chess python-chess
- Add requirements.txt file in pipeline folder and add the dependency there.
Use relative imports. Your code will be imported as source code and everything under pipeline folder must be self-contained and isolated. Example (from google_sheets)
from .helpers.data_processing import get_spreadsheet_id
from .helpers.api_calls import api_auth
from .helpers import api_calls
As mentioned above, the tech doc on config and credentials is a must-read.
All pipeline tests and usage/example scripts share the same config and credential files that are present in pipelines/.dlt.
This makes running locally much easier and dlt configuration is flexible enough to apply to many pipelines in one folder.
Please look at example.secrets.toml in .dlt folder on how to configure postgres, redshift and bigquery destination credentials. Those credentials are shared by all pipelines.
Then you can create your secrets.toml with the credentials you need. The duckdb and postgres destinations work locally and we suggest you use them for initial testing.
As explained in the technical docs, either the native form (ie. a database connection string) or the dictionary representation (a python dict with host, database, password etc.) can be used.
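To make the relation between the two forms concrete, here is a small sketch that splits a connection string into a dictionary using only the standard library. The key names and the example URL are assumptions made for illustration; consult the technical docs for the field names dlt actually expects.

```python
from urllib.parse import urlsplit


def connection_string_to_dict(conn_str):
    """Split a database URL into the fields of a dictionary representation."""
    parts = urlsplit(conn_str)
    return {
        # key names are illustrative, not taken from dlt
        "drivername": parts.scheme,
        "username": parts.username,
        "password": parts.password,
        "host": parts.hostname,
        "port": parts.port,
        "database": parts.path.lstrip("/"),
    }


# hypothetical local credentials
creds = connection_string_to_dict("postgresql://loader:secret@localhost:5432/dlt_data")
```

Both forms carry the same information; the native form is shorter, while the dictionary form lets you set each field separately.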
If you add a new pipeline that requires a secret value, please add a placeholder to example.secrets.toml. When adding the source config and secrets, please follow the section layout for sources. We have a lot of pipelines so we must use a precise section layout (up to module level):
[sources.<python module name where source and resources are placed>]
This way we can isolate credentials for each pipeline.
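As a sketch of what a placeholder following this layout looks like, the helper below renders a fragment for one source module. The helper, the `api_key` name and the placeholder text are all hypothetical; only the `[sources.<module>]` header comes from the layout described above.

```python
def secrets_placeholder(module_name, keys):
    """Render an example.secrets.toml fragment for one source module."""
    lines = [f"[sources.{module_name}]"]
    for key in keys:
        # placeholders only; real values go into your own secrets.toml
        lines.append(f'{key} = "<set me up>"')
    return "\n".join(lines)


fragment = secrets_placeholder("chess", ["api_key"])
print(fragment)
```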
Your working dir must be pipelines otherwise dlt will not find the .dlt folder with secrets.
- If you are contributing and want to test against redshift and bigquery, ping the dlt team on slack. You'll get a toml file fragment with the credentials that you can paste into your secrets.toml
- If you contributed a pipeline and created any credentials, test accounts, test dataset please include them in the tests or share them with dlt team so we can configure the CI job. If sharing is not possible please help us to reproduce your test cases so CI job will pass.
TBD, but it seems you need all destination and source credentials. Please ping us on slack and you'll obtain two toml fragments which need to be added to the forked repo as Repository Secrets:
- DESTINATIONS_SECRETS
- SOURCES_SECRETS
The reason for the structure above is to use dlt init command to let user add the pipelines to their own project. dlt init is able to add pipelines as pieces of code, not as dependencies, see explanation here: https://github.com/dlt-hub/python-dlt-init-template
Please read the detailed information on our distribution model
We use pytest for testing. Every test is running within a set of fixtures that provide the following environment (see conftest.py):
- they load secrets and config from pipelines/.dlt so the same values are used when you run your pipeline from command line and in tests
- they set the working directory for each pipeline to _storage folder and make sure it is empty before each test
- they drop all datasets from the destination after each test
- they run each test with the original environment variables so you can modify os.environ
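The last point can be pictured with a small context manager that snapshots the environment and restores it afterwards. This is a sketch of the general technique under the assumption that a plain snapshot and restore is enough; it is not the code from conftest.py, and `preserved_environ` is a hypothetical name.

```python
import os
from contextlib import contextmanager


@contextmanager
def preserved_environ():
    """Restore os.environ to its original content when the block exits."""
    saved = dict(os.environ)
    try:
        yield
    finally:
        # drop anything the block added, then put the original values back
        os.environ.clear()
        os.environ.update(saved)
```

Anything a test sets inside `with preserved_environ():` is gone once the block ends, so tests cannot leak settings into each other.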
Look at tests/chess/test_chess_pipeline.py for an example. The line
@pytest.mark.parametrize('destination_name', ALL_DESTINATIONS)
makes sure that each test runs against all destinations (as defined in the ALL_DESTINATIONS global variable).
The simplest possible test just creates pipeline and then issues a run on a source. More advanced test will use sql_client to check the data and access the schemas to check the table structure.
Your tests will be run both locally and on CI. It means that a few instances of your test may be executed in parallel and they will be sharing resources. A few simple rules make that possible.
- Always use full_refresh when creating pipelines in test. This will make sure that data is loaded into new schema/dataset. Fixtures in conftest.py will drop datasets created during load.
- When creating any fixtures for your tests, make sure that fixture is unique for your test instance.
If you create a database, schema or table, add a random suffix/prefix to its name and use that name in your test
If you create an account, ie. a user with a name, and this name is a unique identifier, also add a random suffix/prefix
- Cleanup after your fixtures - delete accounts, drop schemas and databases
- Add code to tests/utils.py only if this is helpful for all tests. Put your specific helpers in your own directory.
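One simple way to follow the random suffix rule is a tiny helper like the one below. It is a sketch with a hypothetical name, `unique_name`, and an arbitrarily chosen suffix length of eight hex characters.

```python
import uuid


def unique_name(base):
    """Append a short random suffix so parallel test runs do not collide."""
    return f"{base}_{uuid.uuid4().hex[:8]}"


# hypothetical schema name for one test instance
schema_name = unique_name("chess_test")
```

Generate the name once per test instance, use it everywhere in that test, and drop the object it names when the fixture is torn down.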
TBD.
- When developing, limit the destinations to local ie. duckdb by setting the environment variable:
ALL_DESTINATIONS='["duckdb"]' pytest tests/chess
There's also a make test-local command that will run all the tests on duckdb and postgres.
There's a compose file with a fully prepared postgres instance here
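The ALL_DESTINATIONS value in the command above is written as a JSON list. A test helper could read it roughly as follows. This is a sketch, not the repository's actual code: the function name and the default list are assumptions.

```python
import json
import os

# assumed default for this sketch; the repository may use a different list
DEFAULT_DESTINATIONS = ["duckdb", "postgres"]


def destinations_from_env(environ=os.environ):
    """Read the destination list from ALL_DESTINATIONS, given as a JSON list."""
    raw = environ.get("ALL_DESTINATIONS")
    if not raw:
        return list(DEFAULT_DESTINATIONS)
    parsed = json.loads(raw)
    if not isinstance(parsed, list) or not all(isinstance(d, str) for d in parsed):
        raise ValueError("ALL_DESTINATIONS must be a JSON list of strings")
    return parsed
```

Note the quoting in the shell command: the single quotes keep the inner double quotes intact so that the value stays valid JSON.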
We have CI on github actions. Workflows need a full set of credentials for sources and destinations to run. We put those as toml fragments in the following repository secrets:
- DESTINATIONS_SECRETS - fragment with all destination credentials
- SOURCES_SECRETS - fragment with all sources credentials
If you are contributing from a fork, ping us on slack to get those.
Selective running of tests is not yet implemented. When done we'll run only the tests for the pipelines that were modified by given PR.