Big Data [sorry] &
       Data Science:
What Does a Data Scientist Do?	


                                                                 Carlos Somohano	

                                                         Founder Data Science London	

                                                                         @ds_ldn	

                                                            datasciencelondon.org	





The Cloud and Big Data: HDInsight on Azure London 25/01/13
Man on the Moon – 1969
Man on the Moon – Small Data! 	


Computer Program	

          Apollo XI

              Man on the Moon	

Date: 1969

               Speed: 3,500 km/hour	

   Distance: 356,000 Km	

64 Kb, 2Kb RAM, Fortran	

   Weight: 13,500 kg	

      Never been there before	

Must work 1st time	

        Lots of complex data	

   Must return to Earth
Apollo XI, 1969	

    SkyDive Stratos, 2012	


       64 Kb	

            Tens of Gigabytes	





Think About It – We live in Crazy Times!
Big Data is not about Data Volume
What is Big Data? IT mumbo-jumbo	



  A fashionable term typically used by some IT
  vendors to remarket old-fashioned software &
  hardware
What is Big Data? The n-Vs	

        Volume …	

        Variety …	

        Velocity …	

        (add your own V here…)	


        So What?
Change! Water Cooler Chat	

We need to parallelize data operations but it’s too costly & complex …


The business can’t get access to all the relevant data, we need external data…	


We can’t match customer master data to live customer interactions…	


We can’t just force everything into a star-schema…	


These BI reports and charts don’t tell us anything we didn’t know…	


We are missing the ETL window, the data we needed didn’t arrive on time…	


We can’t predict with confidence if we can’t explore data & develop our own models
What is Big Data? Force of Change	



 Big Data forces you to change the way you collect,
 store, manage, analyze and visualize data
Crude Oil
Big Data = Crude Oil [not New Oil]	


Think of data as ‘crude oil.’


Big Data is about extracting the ‘crude oil,’
transporting it in ‘mega-tankers,’ siphoning it through
‘pipelines,’ and storing it in massive ‘silos’… 	


All ‘this’ is about IT Big Data… fine and well…	


… BUT
You need to refine the ‘crude oil’	


       Enter Data Science…
The Science [and Art] of… 	


	

Discovering what we don’t know from data	


	

Obtaining predictive, actionable insight from data	


	

Creating Data Products that have business impact now	


	

Communicating relevant business stories from data	


	

Building confidence in decisions that drive business value
Brief History of Data Science	

6th C BC - 1st C BC – The Greeks! Pyrrhonism, Skepticism & Empiricism…

1974 – Peter Naur @UoC Datalogy & Data Science

2001 – William S. Cleveland @CSU Data Science: An Action Plan…

2002 – Committee on Data for Science & Technology (CODATA)

2003 – Journal of Data Science 	

2009 – Jeff Hammerbacher @ Facebook What does a Data Scientist Do? 	

2010 – Drew Conway @NYU The Data Science Venn Diagram	

2010 – Hilary Mason & Chris Wiggins @Dataists

2010 – Mike Loukides @O’Reilly “What is Data Science?”

2011 – DJ Patil @LinkedIn data scientist vs. data analyst
Jeff Hammerbacher, 2009	

“... on any given day, a team member could author a
multistage processing pipeline in Python, 	

	

design a hypothesis test, perform a regression analysis
over data samples with R, 	

	

design and implement an algorithm for some data-
intensive product or service in Hadoop, or
communicate the results of our analyses to other
members of the organization.”
Mike Loukides, 2010	


Data science enables the creation of data
products.	

	

Whether... data is search terms, voice samples, or
product reviews,... users are in a feedback loop in
which they contribute to the products they use. 	

	

That's the beginning of data science.
Hilary Mason & Chris Wiggins, 2010


  Data science is clearly a blend of the hackers’ arts, statistics
  and machine learning...; 	

  	

  and the expertise in mathematics and the domain of the
  data for the analysis to be interpretable... 	

  	

  It requires creative decisions and open-mindedness in a
  scientific context.
Drew Conway, 2010
DJ Patil, 2011	

“We realized that as our organizations grew, we both had to figure out
what to call the people on our teams. “Business analyst” and “Data analyst”
seemed too limiting.

   	

The focus of our teams was to work on data applications that would have
an immediate and massive impact on the business. 	

	

The term that seemed to fit best was data scientist: those who use both
data and science to create something new.”
What is a Data Scientist?
The Duck-Billed Platypus





       The Data Scientist – Billed Platypus
The Platypus – Billed Data Scientist	

                                                   Machine Learning	

     Hacking	

                                                        Statistics	




                                                                          Math	

                                                    Visualization	

                Science	


   Programming	

                 Data Mining	



                    The Data Scientist – Billed Platypus
Josh Wills, 2012
Class DataScientist {	

 Is skeptical, curious. Has an inquisitive mind

 Knows Machine Learning, Statistics, Probability	

 Applies Scientific Method. Runs Experiments	

 Is good at Coding & Hacking

 Able to deal with IT Data Engineering	

 Knows how to build data products	

 Able to find answers to known unknowns	

 Tells relevant business stories from data	

 Has Domain Knowledge 	


}
What Does a Data Scientist Do?
10 Things [most] Data Scientists Do	

      1  Ask Good Questions. What is What… 	

           …we don’t know?	

           …we’d like to know?	

      2  Define and Test a Hypothesis. Run experiments

      3  Scoop, Scrape, Sink & Sample Business Relevant Data

      4  Munge and Wrestle Data. Tame Data	

      5  Explore Data, Discover Data Playfully. Discover unknowns.	

      6  Model Data. Model Algorithms.	

      7  Understand Data Relationships	

      8  Tell the Machine How to Learn from Data	

      9  Create Data Products that Deliver Actionable Insight 	

      10  Tell Relevant Business Stories from Data
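
Item 2 above (define and test a hypothesis, run experiments) can be made concrete with a small sketch. The example below is one possible approach, a two-sided permutation test on the difference of means, and is not taken from the talk. The function name, the experiment, and the order values are all hypothetical.

```python
import random

def permutation_test(group_a, group_b, n_permutations=10_000, seed=0):
    """Two-sided permutation test on the difference of group means."""
    rng = random.Random(seed)
    observed = abs(sum(group_a) / len(group_a) - sum(group_b) / len(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_permutations):
        # Under the null hypothesis the group labels are exchangeable.
        rng.shuffle(pooled)
        mean_a = sum(pooled[:n_a]) / n_a
        mean_b = sum(pooled[n_a:]) / (len(pooled) - n_a)
        if abs(mean_a - mean_b) >= observed:
            extreme += 1
    # Add-one smoothing keeps the estimated p-value strictly above zero.
    return (extreme + 1) / (n_permutations + 1)

# Hypothetical experiment: order values under an old and a new page layout.
control = [20.1, 22.4, 19.8, 21.0, 20.5, 23.1, 19.9, 21.7]
variant = [24.3, 25.1, 23.8, 26.0, 24.9, 25.5, 23.2, 26.4]
p_value = permutation_test(control, variant)
print(f"p-value: {p_value:.4f}")
```

A small p-value says the observed gap would be rare if the layout made no difference. It does not say how large or how valuable the effect is.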
[Sort of a] Data Scientist Toolkit	

   §  Java, R, Python… (bonus: Clojure, Haskell, Scala)	

   §  Hadoop, HDFS & MapReduce… (bonus: Spark, Storm)

   §  HBase, Pig & Hive… (bonus: Shark, Impala, Cascalog)

   §  ETL, Web scrapers, Flume, Sqoop… (bonus: Hume)

   §  SQL, RDBMS, DW, OLAP…

   §  Knime, Weka, RapidMiner… (bonus: SciPy, NumPy, scikit-learn, pandas)

   §  D3.js, Gephi, ggplot2, Tableau, Flare, Shiny…

   §  SPSS, Matlab, SAS… (the enterprise man)

   §  NoSQL, MongoDB, Couchbase, Cassandra…

   §  And Yes! … MS-Excel: the most used, most underrated DS tool
Foundations of Data Science
[Some] Data Science Principles	

    1    Socio-Technical Systems (STS) are complex!	

    2    Data is never at rest	

    3    Data is dirty, deal with it	

    4    SVoT = LOL!	

    5    Data munging & data wrestling  70% time

    6    Simplification. Reduction. Distillation	

    7    Curiosity. Empiricism. Skepticism
Knowns & Unknowns


There are known knowns. These are things we know
that we know. 	

There are known unknowns. That is to say, there are
things that we know we don't know.	

But there are also unknown unknowns. There are
things we don't know we don't know	

                                    Donald Rumsfeld
DIKUW FTW!	

  D                      I                      K                       U                      W

 Data              Information               Knowledge           Understanding              Wisdom


                                      PAST                                                   FUTURE

Data Engineer	

    Data Analyst	

                          Data Miner	

      Data Scientist	


        Raw                  What               How to                  Why                   When

    Numbers            Description            Experience          Cause & Effect           Prediction

     Letters             Context                 Tested                Proven             What’s best

     Symbols          Relationship             Instruction

                      Known Knowns                                 Known Unknowns       Unknown Unknowns

      Signals            Reports               Programs                models
Data Discovery	



                                      Data Analyst	



                                                        Data Scientist	





The new reality for Business Intelligence and Big Data, Applied Data Labs
Data Models vs. Algorithmic Models	

           Data Modeling	

                                  VS.	

          Algorithmic Modeling	


 Y ß F( X, random noise, parameters) 	

                                 Y ß 	

        Black Box	

         ß X	

                                                                                         Random Forests	





          We understand the world	

                                            We don’t understand the world	

    How well ‘my data model’ works	

                                       The world produces data in a black-box 	

    Statisticians, Data Analysts, Data Miners	

                            Data Scientists	

    Linear Regression	

                                                    Machine Learning, AI & Neural Nets

    Logistic Regression	

                                                  Random Forests, SVM, GBT	

    Known Distributions	

                                                  Unknown Multivariate Distributions	

    Confidence Intervals	

                                                  Iterative	

    Predictor Variables & Goodness of Fit

                                Predictive Accuracy	

     	

                                                                     	

     	


                                             “Statistical Modeling: The Two Cultures” Leo Breiman, 2001
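
The two cultures on this slide can be contrasted in a few lines. What follows is a rough sketch under stated assumptions and is not from the talk. The "data model" is a least-squares line with readable parameters. The "algorithmic model" is a k-nearest-neighbour predictor, swapped in as a simple stand-in for the random forests named on the slide. The synthetic sine-shaped data and all function names are hypothetical.

```python
import math

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    return my - b * mx, b

def knn_predict(xs, ys, x_new, k=3):
    """Average the targets of the k training points nearest to x_new."""
    nearest = sorted(zip(xs, ys), key=lambda p: abs(p[0] - x_new))[:k]
    return sum(y for _, y in nearest) / k

def mse(preds, targets):
    """Mean squared error between predictions and targets."""
    return sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(targets)

# Synthetic, deterministic data: here the "world" is nonlinear (an assumption).
train_x = [i * 0.25 for i in range(40)]
train_y = [math.sin(x) for x in train_x]
test_x = [i * 0.25 + 0.125 for i in range(39)]  # held-out points in between
test_y = [math.sin(x) for x in test_x]

a, b = fit_line(train_x, train_y)
line_mse = mse([a + b * x for x in test_x], test_y)
knn_mse = mse([knn_predict(train_x, train_y, x) for x in test_x], test_y)
print(f"line: a={a:.3f}, b={b:.3f}, held-out MSE={line_mse:.4f}")
print(f"k-NN: held-out MSE={knn_mse:.4f}")
```

The line is judged by how well its assumed form fits and what its parameters mean. The black box is judged only by predictive accuracy on data it has not seen.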
Learning from Data is Tricky	

      Statistical vs. Machine Learning	

      Supervised vs. Unsupervised Learning	

      Induction vs. Deduction	

      Sampling & Confidence Intervals

      Probability & Distribution

      Deviation & Variance

      Correlation vs. Causation	

      Causation & Prediction
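
To illustrate the sampling and confidence-interval item above, here is a minimal sketch of a percentile bootstrap for the mean. It is one common technique and is not claimed to be what the talk used. The function name and the sample values are hypothetical.

```python
import random
import statistics

def bootstrap_ci(sample, n_resamples=5_000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for the mean of `sample`."""
    rng = random.Random(seed)
    n = len(sample)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        statistics.mean(rng.choices(sample, k=n)) for _ in range(n_resamples)
    )
    lower = means[int((1 - level) / 2 * n_resamples)]
    upper = means[int((1 + level) / 2 * n_resamples) - 1]
    return lower, upper

# Hypothetical sample: minutes spent per visit by 12 users.
visits = [3.1, 4.7, 2.2, 8.9, 5.4, 3.8, 6.1, 2.9, 7.3, 4.0, 5.8, 3.5]
low, high = bootstrap_ci(visits)
print(f"mean = {statistics.mean(visits):.2f}, 95% CI = ({low:.2f}, {high:.2f})")
```

With only 12 observations the interval is wide, which is the point: a single average hides how little the sample pins down.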
More Data or Better Models?	

More Data Beats Better Algorithms, Omar Tawakol @BlueKai

	

Better Algorithms Beat More Data, Mark Torrance @RocketFuel	

	

More Data or Better Models, Xavier Amatriain @Netflix

	

On Chomsky & 2 Cultures of Statistical Learning, Peter Norvig @Google

	

Specialist Knowledge is Useless & Unhelpful, Jeremy Howard @Kaggle
Data Science Process – An approach
Data Science Process - I

      1  Known Unknowns? 	

      2  We’d like to know…?	

      3  Outcomes?	

      4  What Data?	

      5  Hypothesis?	




         The World 	

            Ingest Raw Data	

     Munge Data

           The Dataset	

Product Manufactured	

           Transactions	

        MapReduce	

            Independency?	

Goods shipped	

                  Web-Scraping	

        ETL, ELT	

             Correlation?	

Product purchased	

              Web-clicks & logs

   Data Wrangle 	

        Covariance?	

Phone Calls Made	

               Sensor Data	

         Data Cleansing	

       Causality?	

Energy Consumed 	

               Mobile Data	

         Data Jujitsu	

         Dimensionality?	

Fraud Committed	

                Docs, Emails, XLS	

   Dim Reduction	

        Missing Values?	

Repair Requested	

               Social Feeds, RSS	

   Sample	

               Relevant?	

System 	

                        Flume  Sink HDFS	

   Select, Join, Bind
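
Two of the dataset questions above ("Missing Values?" and "Correlation?") can be sketched together. The snippet is a minimal illustration and not the talk's own method. It drops incomplete rows, which is only one of several ways to handle gaps, then computes a Pearson correlation. The rows and function names are hypothetical.

```python
import math

def drop_missing(rows):
    """Keep only rows where every field is present (not None)."""
    return [row for row in rows if all(value is not None for value in row)]

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical rows: (daily temperature in °C, units sold); None marks a gap.
rows = [(14, 120), (17, 150), (None, 160), (21, 210),
        (25, None), (28, 300), (30, 310)]
clean = drop_missing(rows)
temps, units = zip(*clean)
r = pearson(temps, units)
print(f"{len(rows) - len(clean)} rows dropped, r = {r:.3f}")
```

A high r answers "Correlation?" only. The "Causality?" question on the same slide needs an experiment or a causal argument, not a bigger coefficient.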
Data Science Process - II	

The Dataset	

   Explore Data	

                 Represent Data	

                 Discover Data	

                                                                    Deliver Insight 	

                 Learn From Data	

              Data Product	

                                                                    Visualize Insight 	

                 Description & Inference

      Objectives	

                 Data & Algorithm Models

      Levers	

          Actionable	

                 Machine Learning	

             Modeling	

        Predictive	

                 Networks & Graphs

            Simulation	

      Immediate Impact	

                 Regression & Prediction

      Optimization	

    Business Value	

                 Classification & Clustering

   Visualization	

   Easy to explain	

                 Experiments & Iteration
What is a Data Product?
A Data Product Is… 	

… Curated and crafted from raw data	

… A result of exploration and iterations	

… A machine that learns from data 	

… An answer to known unknowns or unknown unknowns	

… A mechanism that triggers immediate business value	

… A probabilistic window of future events or behavior
Data Jiu-Jitsu	

                                      Data	


                                                    Jiu Jitsu Fight 	

                     $$$$	


                                                                     Data Product	

 Data Scientist	




Data Jiu-Jitsu: ability to turn big data into data products that generate immediate business value	

                                                                             (DJ Patil @LinkedIn)
Developing Data Products	



          Objectives	

                          Levers	

                           Data	

                         Models	


       What Outcome                       What Inputs Can                    What Data Can                     How the Levers
       Am I Trying to                     We Control?	

                     We Collect?	

                    Influence the
       Achieve?	

                                                                                             Objectives	





Adapted from “Designing Great Data Products. The Drivetrain Approach: A Four Step Approach to Building Data Products”	

 Jeremy Howard, Margit Zwemer, Mike Loukides, 2012
Objective-Based Data Products	

What Outcome Am I                                                                                                          Actionable
Trying to Achieve?	

                                                                                                      Outcome	



                            Data 	

                 Modeler	

                 Simulator	

                 Optimizer	



                                                 The Model Assembly Line 	





Adapted from “Designing Great Data Products. The Drivetrain Approach: A Four Step Approach to Building Data Products”	

 Jeremy Howard, Margit Zwemer, Mike Loukides, 2012
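
The Modeler, Simulator, Optimizer chain on this slide can be sketched as three tiny functions. This is a toy illustration under loud assumptions and is not the Drivetrain authors' implementation. The linear demand curve, the visitor count, and the price grid are all invented for the example.

```python
def modeler(price):
    """Hypothetical demand model: purchase probability falls linearly with price."""
    return max(0.0, 1.0 - price / 50.0)

def simulator(price, visitors=1_000):
    """Expected revenue if every visitor is shown this price (the lever)."""
    return visitors * modeler(price) * price

def optimizer(candidate_prices):
    """Pick the lever setting with the best simulated outcome (the objective)."""
    return max(candidate_prices, key=simulator)

best_price = optimizer(range(1, 50))
print(f"best price: {best_price}, expected revenue: {simulator(best_price):.0f}")
```

The split matters more than the arithmetic: the model predicts, the simulator asks "what if" for each lever setting, and the optimizer picks the action, so the output is a decision and not only a forecast.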
5 Great Data Products
Customer Lifecycle Value	

             Optimize CLV	

                         Product Recommendations	

                               Visualizer	




                            Data 	

                  Modeler	

                Simulator	

                 Optimizer	



                                  1  Products the customer may like	

                                  2  Price Elasticity	

                                  3  Probability of Purchase w/o Recommendation	

                                  4  Purchase Sequence	

                                  5  Causality Model	

                                  6  Patience Model	



Adapted from “Designing Great Data Products. The Drivetrain Approach: A Four Step Approach to Building Data Products”	

 Jeremy Howard, Margit Zwemer, Mike Loukides, 2012
Automated Fruits Procurement	

                                Confirm Purchase Orders	

                                In less than 2 hours	



                                Safety Stock levels?	

                                Demand vs Stock?	

                                Price vs. Demand?	

 12,000 stores	

               Anomalies?	

 300 Fruits	

                  Fruit Shortages?	

 Avg. Shelf life  3 days 	

   Fruit Write-offs?	



 Adapted from Blueyonder
Strawberries & the Weather


                                         No sales vs X,XXX sales predicted	

Why these huge stock write-offs?	





                                       A Predictive Model that calculates
                                       strawberry purchases based on	

                                          	

                                          Weather forecast	

   Sudden increase in temperature	

      Store temperature	

                                          Freezer sensor data	

                                          Remaining stock per shelf life

                                          Sales TPoS feeds	

                                          Web searches, social mentions 	


   Adapted from Blueyonder
Personalized Social Recommendations	


 Collaborative Filtering: Matching Skills to People	

             Prediction: Personalized Skills Recommendation	





 Adapted from “Developing Data Products” by Peter Skomoroch 5 Dec, 2012 Copyright LinkedIn
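
Collaborative filtering for skills can be sketched with set overlap. The example is a simplified, user-based variant using Jaccard similarity and is not a description of LinkedIn's actual system. The member names, skills, and function names are hypothetical.

```python
def jaccard(a, b):
    """Overlap between two skill sets: |intersection| / |union|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def recommend_skills(target, profiles, top_n=2):
    """Score skills the target lacks by the similarity of members who list them."""
    mine = profiles[target]
    scores = {}
    for member, skills in profiles.items():
        if member == target:
            continue
        similarity = jaccard(mine, skills)
        for skill in skills - mine:
            scores[skill] = scores.get(skill, 0.0) + similarity
    # Highest score first; ties are broken alphabetically for stable output.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [skill for skill, score in ranked[:top_n] if score > 0]

# Hypothetical member profiles.
profiles = {
    "ana": {"python", "statistics", "hadoop"},
    "ben": {"python", "statistics", "machine learning"},
    "chloe": {"python", "hadoop", "machine learning", "pig"},
    "dev": {"excel", "sql"},
}
print(recommend_skills("ana", profiles))
```

Members with no overlap contribute nothing, so recommendations come only from people who already look like the target.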
Colas: In Which US State Do I Invest Mktg. $?


            What the Business Analyst Sent	





                                                What the Data Scientist did…
The Great Pop vs. Soda Page	





              http://www.popvssoda.com/
Pop vs. Soda vs. Coke
Raw Data Will Drive Your Car
Interested in Data Science?	

Join our community	

    http://www.meetup.com/Data-Science-London/	

Follow us on Twitter 	

    @ds_ldn	

Check out our blog	

    	

http://datasciencelondon.org
Thanks for your time
