Managing data workflows with Luigi
by Teemu Kurppa
www.teemukurppa.net
Metrics Monday at Custobar, Helsinki, 30.5.2016
I work at Oura (www.ouraring.com), the world's first wellness ring
Head of Software: Cloud & Mobile
teemu@ouraring.com
I’m an advisor at your host: Custobar, a customer analytics and marketing tool for retailers
Introducing
Data Workflows
gunzip -c /var/log/syslog.3.gz | grep -e UFW
Complex data workflow
Let’s analyse if the weather affects sleep quality:
• Get sleep data of all study participants
• Get location data of all study participants
• Fetch weather data for each day and location
• Fetch historical weather data for each location
• Calculate the difference from the average weather for each data point
• Do a statistical analysis over users and days, comparing weather data and sleep quality data
A lot can go wrong at each step, and rerunning takes time.
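The steps above form a dependency graph rather than a straight pipe. As a rough sketch using only the standard library, the order in which the steps may run can be derived from their dependencies. The step names are hypothetical labels for the bullets above, and the dependencies between them are an assumption, not something stated in the talk.

```python
from graphlib import TopologicalSorter

# Each step maps to the set of steps it depends on.
# Step names and edges are illustrative assumptions based on the bullets above.
WEATHER_STUDY = {
    "get_sleep_data": set(),
    "get_location_data": set(),
    "fetch_daily_weather": {"get_location_data"},
    "fetch_historical_weather": {"get_location_data"},
    "weather_difference": {"fetch_daily_weather", "fetch_historical_weather"},
    "statistical_analysis": {"get_sleep_data", "weather_difference"},
}


def execution_order(graph):
    """Return one valid order in which every step runs after its dependencies."""
    return list(TopologicalSorter(graph).static_order())


if __name__ == "__main__":
    for step in execution_order(WEATHER_STUDY):
        print(step)
```

A workflow tool keeps this graph explicit, so a failure in one step does not force the steps before it to run again.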
Case Custobar: ETL
Extract - Transform - Load
1. Extract: fetch custom sales.csv from SFTP
2. Transform: transform custom sales.csv to standard sales.json
3. Transform: validate and throw away invalid fields
4. Load: load valid sales data to database
Do this for millions of rows of initial data, and continue doing it every day, for
• products
• customers
• sales
Luigi
by Spotify
Data workflow tools
• Pinball by Pinterest
• Luigi by Spotify
• Airflow by Airbnb
Luigi Concepts
Luigi Concepts
Diagram: Tasks, Targets, Dependencies and Parameters
• Target: sql: Customers table
• Task: Get Changed Customers (company: Parameter, date: DateParameter)
• Target: file://data/customers.csv
• Task: Export Changed Customers to FTP (company: Parameter, date: DateParameter)
• Target: sftp://data/customers.csv
A task reads from input() and writes to output(); requires() links a task to the task it depends on.
Concepts: Target
Target
Target is simply something that exists or doesn’t exist
For example
• a file in a local file system
• a file in a remote file system
• a file in an Amazon S3 bucket
• a database row in a SQL database
Target
import luigi
from pymongo import MongoClient

class MongoTarget(luigi.Target):
    def __init__(self, database, collection, predicate):
        self.client = MongoClient()
        self.database = database
        self.collection = collection
        self.predicate = predicate

    def exists(self):
        # The target exists if at least one document matches the predicate
        db = self.client[self.database]
        one = db[self.collection].find_one(self.predicate)
        return one is not None
Target
Lots of ready-made targets in Luigi:
• local file
• HDFS file
• S3 key/value target
• SSH remote target
• SFTP remote target
• SQL table row target
• Amazon Redshift table row target
• Elasticsearch target
Concepts: Task
Task: basic structure
import luigi

class TransformDailySalesCSVtoJSON(luigi.Task):
    def requires(self): ...
    def run(self): ...
    def output(self): ...
Task: parameters
class TransformDailySalesCSVtoJSON(luigi.Task):
    date = luigi.DateParameter()

    def requires(self): ...
    def run(self): ...
    def output(self): ...
Task: requires
class TransformDailySalesCSVtoJSON(luigi.Task):
    date = luigi.DateParameter()

    def requires(self):
        return ImportDailyCSVFromSFTP(self.date)

    def run(self): ...
    def output(self): ...
Task: output
class TransformDailySalesCSVtoJSON(luigi.Task):
    date = luigi.DateParameter()

    def requires(self): ...
    def run(self): ...

    def output(self):
        path = '/d/sales_%s.json' % self.date.strftime('%Y%m%d')
        return luigi.LocalTarget(path)
Task: run
import json

class TransformDailySalesCSVtoJSON(luigi.Task):
    date = luigi.DateParameter()

    def requires(self): ...

    def run(self):
        # Note: the targets returned by input() and output() take care of atomicity
        with self.input().open('r') as infile:
            data = transform_csv_to_dict(infile)
        with self.output().open('w') as outfile:
            json.dump(data, outfile)

    def output(self): ...
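Atomicity matters here because a target that exists is treated as complete, so a half-written file must never appear under the final name. A minimal standard-library sketch of the general write-then-rename technique follows. It only illustrates the idea and is not Luigi's own code; `write_json_atomically` is a hypothetical helper name.

```python
import json
import os
import tempfile


def write_json_atomically(data, path):
    """Write data as JSON to path so that readers never see a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    # Write to a temporary file in the same directory (same filesystem),
    # so the final rename is a single atomic step.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as tmp_file:
            json.dump(data, tmp_file)
        os.replace(tmp_path, path)  # the final file appears only when complete
    except BaseException:
        # Clean up the temporary file if writing or renaming failed.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

If the process dies in the middle of the write, only the temporary file is left behind, so the final path still does not exist and the step is simply run again.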
Task
import json
import luigi

class TransformDailySalesCSVtoJSON(luigi.Task):
    date = luigi.DateParameter()

    def requires(self):
        return ImportDailyCSVFromSFTP(self.date)

    def run(self):
        with self.input().open('r') as infile:
            data = transform_csv_to_dict(infile)
        with self.output().open('w') as outfile:
            json.dump(data, outfile)

    def output(self):
        path = '/d/sales_%s.json' % self.date.strftime('%Y%m%d')
        return luigi.LocalTarget(path)
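The slides call `transform_csv_to_dict` without showing it. One possible shape for such a helper is sketched below, purely as an assumption: it returns a list with one dict per CSV row, and the column names in the sample are invented for illustration.

```python
import csv
import io


def transform_csv_to_dict(infile):
    """Read CSV with a header row and return a list with one dict per row."""
    reader = csv.DictReader(infile)
    return [dict(row) for row in reader]


if __name__ == "__main__":
    # Hypothetical sample data; real sales files will have other columns.
    sample = io.StringIO("sale_id,amount\n1,9.90\n2,14.50\n")
    print(transform_csv_to_dict(sample))
```

All values come back as strings, so any type conversion or validation would be a separate step, in line with the validate step in the ETL case.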
Tasks
Lots of ready-made tasks in Luigi:
• dump data to SQL table
• copy to Redshift Table
• run Hadoop job
• query Salesforce
• load Elasticsearch index
• …
Dependency patterns
Multiple dependencies
class TransformAllSales(luigi.Task):
    def requires(self):
        # Return all dependencies at once; a return inside the loop stops after the first one
        return [ImportInitialSaleFile(index=i) for i in range(1000)]

    def run(self): ...
    def output(self): ...
Dynamic dependencies
import glob

class LoadDailyAPIData(luigi.Task):
    date = luigi.DateParameter()

    def run(self):
        # Dependencies discovered at run time are yielded from run()
        for filepath in glob.glob('/d/api_data/*.json'):
            yield TransformDailyAPIData(filepath)
Wrapper task
class LoadAllDailyData(luigi.WrapperTask):
    date = luigi.DateParameter()

    # A wrapper task only declares dependencies, so it overrides requires(), not run()
    def requires(self):
        yield LoadDailyProducts(self.date)
        yield LoadDailyCustomers(self.date)
        yield LoadDailySales(self.date)
Why use data workflow tools?
1. Resume the data workflow after a failure
2. Parametrize and rerun tasks every day
3. Organise code with shared patterns
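Point 1 follows from targets: a step whose output already exists does not need to run again. Below is a toy standard-library model of that idea. It is not Luigi's scheduler, and all function names are made up for the sketch.

```python
import os
import tempfile


def run_step(name, output_path, action, log):
    """Run action only if its output file is missing, like a target check."""
    if os.path.exists(output_path):
        log.append("skip " + name)
        return
    action(output_path)
    log.append("run " + name)


def run_workflow(steps, log):
    """Run (name, output_path, action) steps in order, skipping finished ones."""
    for name, output_path, action in steps:
        run_step(name, output_path, action, log)


def write_text(text):
    """Return an action that writes text to the path it is given."""
    def action(path):
        with open(path, "w") as handle:
            handle.write(text)
    return action


def failing_action(path):
    """Simulate a step that breaks before producing its output."""
    raise RuntimeError("simulated failure")


if __name__ == "__main__":
    workdir = tempfile.mkdtemp()
    extract_out = os.path.join(workdir, "sales.csv")
    transform_out = os.path.join(workdir, "sales.json")
    log = []
    try:
        # First attempt: the transform step fails after extract has finished.
        run_workflow([("extract", extract_out, write_text("id\n1\n")),
                      ("transform", transform_out, failing_action)], log)
    except RuntimeError:
        log.append("failed transform")
    # Second attempt with the transform fixed: extract is skipped.
    run_workflow([("extract", extract_out, write_text("id\n1\n")),
                  ("transform", transform_out, write_text("[1]"))], log)
    print(log)
```

On the second attempt only the failed step does any work, which is the behaviour that makes rerunning a long workflow cheap.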
Thanks! Questions?
Custobar is hiring!
Approach Juha, Tatu or me to learn more
Follow @teemu on Twitter to stay in touch.
