Slide 1

Slide 1 text

Managing Data Science in the Enterprise
Strata NYC, September 2018

Slide 2

Slide 2 text

© 2018 Domino Data Lab, Inc.
Who are we?
• Josh Poduska, Chief Data Scientist, Domino Data Lab
• Patrick Harrison, Director of Data Science, S&P Global Market Intelligence

Slide 3

Slide 3 text

Agenda
• Introduction and welcome
• Motivation: why this matters
• Common challenges to managing data science in the enterprise
• Guiding principles and framework
• Process
• Breakout Exercise: Project pre-flight checklist
• Break
• People
• Breakout Exercise: Team-building plan
• Managing Technology and X-Factors
• Summary

Slide 4

Slide 4 text

Data Science
Why is it different; why does this matter?

Slide 5

Slide 5 text

At the heart of data science lies an innocuous sounding thing…

Slide 6

Slide 6 text

called a model.

Slide 7

Slide 7 text


Slide 8

Slide 8 text


Slide 9

Slide 9 text


Slide 10

Slide 10 text

The implications of not becoming a Model-Driven Business are existential.

Slide 11

Slide 11 text

1. Breakthroughs open new revenue streams, expand into new markets, create and deliver new products.
2. Operational efficiency gains that compound through constant incremental improvement.

Slide 12

Slide 12 text

Jeff Bezos’s 2016 Annual Letter to Shareholders:
“At Amazon, we’ve been engaged in the practical application of machine learning for many years now. Some of this work is highly visible: our autonomous Prime Air delivery drones; the Amazon Go convenience store that uses machine vision to eliminate checkout lines; and Alexa, our cloud-based AI assistant. But much of what we do with machine learning happens beneath the surface. Machine learning drives our algorithms for demand forecasting, product search ranking, product and deals recommendations, merchandising placements, fraud detection, translations, and much more.”

Slide 13

Slide 13 text

90% of companies want to make data science an operational part of their business
And yet…
30% have 5+ models in production

Slide 14

Slide 14 text

Many organizations treat data science as a technical practice instead of an organizational capability.

Slide 15

Slide 15 text

© 2017 Domino Data Lab, Inc.
How most organizations operate
[Diagram. Roles shown: data science leaders; data scientists; risk and compliance; business stakeholders and IT system owners; business leaders and data science leaders; CIOs and IT leaders. Annotations: ad-hoc, one-off production; slow, unclear path; “black hole”; wild west of desktop data science.]

Slide 16

Slide 16 text

How model-driven organizations operate
[Diagram. Stages shown: Model Development; Validation & Review; Model Delivery/Deployment; Monitoring & Feedback. Roles shown: data scientists; data science leaders; risk and compliance leaders; business stakeholders and IT system owners; business leaders and data science leaders; CIOs and IT leaders.]

Slide 17

Slide 17 text

Barriers

Slide 18

Slide 18 text

Barrier 1: Data-Era Infrastructure Mentality

Data science demands new degrees of infrastructure flexibility and scalability.

Slide 19

Slide 19 text

Barrier 2: Garage Silos

Data scientists’ work is bespoke, ad hoc, and siloed.

Slide 20

Slide 20 text

Barrier 3: Broken Loops

Companies struggle to put models and model-backed products into production. When models do make it into production, companies struggle to measure their impact and drive subsequent improvement.

Slide 21

Slide 21 text

Barrier 4: Model Liability

Models built without proper checks and controls have the potential to do significant harm to a company’s profits, brand, and reputation.

Slide 22

Slide 22 text

Mindsets of the most effective data science organizations
• Speed of iteration > Big breakthroughs
• Tool agility > Any one tool
• Process and culture > Any one piece of technology
• Reusable knowledge > Producing an answer

Slide 23

Slide 23 text

A framework for managing data science as a capability
• People
  • Attract, hire, onboard, retain, and organize world-class talent
• Process
  • Deliver measurable, reliable, scalable outcomes
• Technology
  • Productivity and best practices to enable scale
• X-Factors
  • Managing model liability
  • Navigate organizational politics

Slide 24

Slide 24 text

Process

Slide 25

Slide 25 text

Process
• Deciding what we do
• Doing projects
• Wrapping up projects

Slide 26

Slide 26 text

Deciding what we do: engage the business
• Typical approach
  • Data -> Analysis -> Product Development -> KPI
  • Common pitfalls: scope creep, loss of stakeholder enthusiasm, no crisp measure of success
• Better method
  • Problem -> Relevant KPIs -> Product Requirements -> Analysis Necessary -> Data
  • Result: greater focus, lower risk
• Business process map
• Educate stakeholders on what is possible (avoid perception of magic)
• Allow all stakeholders to submit ideas
• Publish monthly to all stakeholders, re-prioritize at least quarterly

Slide 27

Slide 27 text

Project Prioritization
• Calculate Value at Stake
  • Order of magnitude value capture ($100k, $1mln, $10mln, etc.)
  • How much improvement is realistic?
• Estimate Effort
  • Order of magnitude cost estimation (1 hr, 1 day, 1 week, 1 quarter, 1 year)
• Forecast Risks
  • Barriers to adoption
  • Potential consequences of errors or performance degradation
[Chart: effort (low to high) plotted against value (low to high), with risk as a third dimension. High value at low effort: Do! Low value at high effort: Don’t!]
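
One way to turn the three estimates above into a rough ranking is sketched below. This is only an illustration under stated assumptions: the hours assigned to each effort bucket, the single `risk` number (a guessed probability that the value is not realized), the scoring formula, and the three projects are all hypothetical and do not come from the slides.

```python
from dataclasses import dataclass

# Assumed conversion of the order-of-magnitude effort buckets into working hours.
EFFORT_HOURS = {"1 hour": 1, "1 day": 8, "1 week": 40, "1 quarter": 520, "1 year": 2080}


@dataclass
class Project:
    name: str
    value_usd: int   # order-of-magnitude value at stake
    effort: str      # one of the EFFORT_HOURS buckets
    risk: float      # assumed probability (0 to 1) that the value is not realized

    def score(self) -> float:
        """Risk-discounted value per hour of effort (an assumed heuristic)."""
        return self.value_usd * (1 - self.risk) / EFFORT_HOURS[self.effort]


def prioritize(projects):
    """Return projects sorted from highest to lowest score."""
    return sorted(projects, key=lambda p: p.score(), reverse=True)


# Hypothetical portfolio, for illustration only.
projects = [
    Project("Project A", 1_000_000, "1 quarter", 0.5),
    Project("Project B", 100_000, "1 week", 0.2),
    Project("Project C", 10_000_000, "1 year", 0.9),
]

ranked = prioritize(projects)
for p in ranked:
    print(f"{p.name}: {p.score():,.0f} risk-adjusted USD per hour")
```

A score like this is only a conversation starter; the order-of-magnitude inputs matter far more than the formula.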

Slide 28

Slide 28 text

Prioritization Pitfalls
• Embark on never-ending science projects
• Overlook linkages between model insight and business action
• Focus on what’s easy or clever instead of what’s valuable
• Cost estimates fail to consider integration, maintenance, retraining

Slide 29

Slide 29 text

Project kick-off
• “We don’t fail because of the math… we fail because we don’t anticipate how the math will be used.”
• Time saved here pays 10x in development and 100x in prod
• “Product management” principles apply to data science projects just as much as engineering projects

Slide 30

Slide 30 text

Kickoff checklist
• Business case definition
• Stakeholder mapping
• Technology needs
• Data availability
• Prior art review
• Model delivery plan
• Success measures
• Compliance and regulatory checks

Slide 31

Slide 31 text

ROI Math Example

Factor | Project #1: Churn Prediction | Project #2: Fraud Classifier
Value at Stake | 100,000 customers * $1000 ARR * 10% current churn = $10mln problem | 50,000 applications * 1% fraud rate * $2000 avg. resolution cost = $1mln problem
Potential for Improvement | Low (already quite low churn) | High (not doing anything today, no headcount)
Dependencies | Enough support staff to act? | App dev team integration
Level of Effort | 1 month | 1 quarter
Risk of False Positive | Low (extra support) | High (bad customer experience)
Risk of False Negative | Medium (lost revenue) | High (more lost revenue)
Re-training Requirements | Medium (marketing mix changes slowly) | High (adversarial)
Change Management Requirements | Low (educate support team, currently use random / intuition) | High (modify real-time application flow)
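
The value-at-stake arithmetic for the two example projects can be checked with a few lines of code. This is a minimal sketch: the function name and its parameters are made up for illustration, and rates are passed as whole percentages so the arithmetic stays in exact integers.

```python
def value_at_stake(units: int, rate_percent: int, value_per_unit: int) -> int:
    """Annual size of the problem in dollars: affected units times value per unit."""
    return units * rate_percent * value_per_unit // 100


# Project #1: 100,000 customers * 10% current churn * $1000 ARR
churn_problem = value_at_stake(units=100_000, rate_percent=10, value_per_unit=1_000)

# Project #2: 50,000 applications * 1% fraud rate * $2000 avg. resolution cost
fraud_problem = value_at_stake(units=50_000, rate_percent=1, value_per_unit=2_000)

print(f"Churn prediction: ${churn_problem:,}")   # $10,000,000
print(f"Fraud classifier: ${fraud_problem:,}")   # $1,000,000
```

Both figures are the size of the whole problem, not the expected gain; the "Potential for Improvement" row decides how much of each is realistically capturable.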

Slide 32

Slide 32 text

Stakeholder Mapping
• Best Practices
  • Define responsible parties from each group: data science, business, DevOps, application dev, compliance, etc.
• Common Pitfalls
  • Lack empathy with goal of actual end user
  • Throw results “over the fence” to IT with no context

Slide 33

Slide 33 text

Technology Needs
• Best Practices
  • Consider opportunities to accelerate research
  • Identify dependencies early
• Common Pitfalls
  • “One size fits all” tooling
  • Underpowered infrastructure

Slide 34

Slide 34 text

Data Availability
• Best Practices
  • Leverage existing sources first to build baseline
  • Create synthetic data with realistic characteristics
  • Track engagement with datasets to automatically discover experts
• Common Pitfalls
  • Wait for “perfect” data
  • Buy external data without clear onboarding plan

Slide 35

Slide 35 text

Prior Art Review
• Best Practices
  • Review state of the art, internally and externally
• Common Pitfalls
  • Culture of “not invented here” (NIH)
  • Nose-to-the-ground mindsets
  • No single source of truth

Slide 36

Slide 36 text

Model Delivery Plan
• Best Practices
  • Design multiple mock-ups of different form factors
  • Designate approvers in advance (IT, DS, biz)
  • Create process flow to precisely show where model will impact
  • Consider agile approach
• Common Pitfalls
  • Fail to educate end-users
  • Over-engineer relative to the requirements of the use case

Slide 37

Slide 37 text

Success Measures
• Best Practices
  • Pre-emptively answer “how will we know if this worked?”
  • Frame in terms of business KPIs not statistical measures
  • Define needs for holdout groups, A/B testing, etc.
• Common Pitfalls
  • Not knowing when it is “good enough”
  • Fail to establish testing infrastructure and culture
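
As a rough illustration of a holdout comparison framed in terms of a business KPI, the sketch below compares retention between customers who received model-driven outreach and a holdout group that did not. The counts are invented for illustration, and the two-proportion z-test is one common choice for this kind of comparison, not something the slides prescribe.

```python
import math


def compare_groups(success_treated, n_treated, success_holdout, n_holdout):
    """Return (KPI lift, z statistic, two-sided p-value) for two proportions."""
    p_treated = success_treated / n_treated
    p_holdout = success_holdout / n_holdout
    # Pooled rate under the null hypothesis that the model made no difference.
    pooled = (success_treated + success_holdout) / (n_treated + n_holdout)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_treated + 1 / n_holdout))
    z = (p_treated - p_holdout) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return p_treated - p_holdout, z, p_value


# Hypothetical counts: customers retained out of 1,000 in each group.
lift, z, p_value = compare_groups(920, 1000, 900, 1000)
print(f"Retention lift: {lift:.1%}, z = {z:.2f}, p = {p_value:.3f}")
```

Agreeing in advance on the KPI, the group sizes, and the threshold for "good enough" is the part the checklist is really asking for; the statistics are the easy step.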

Slide 38

Slide 38 text

Risk Mitigation
• Best Practices
  • Consider consequences of errors (e.g., false positives / negatives)
  • State likely biases in training data
  • Track ongoing usage to prevent inappropriate consumers
• Common Pitfalls
  • Assume today’s absence of regulation will last
  • Conflate model interpretability with model provenance
  • Model misuse

Slide 39

Slide 39 text

Wrapping up projects
• Best Practices
  • Defend the scientific method
  • Store positive and negative results
  • Preserve synthesis, intermediate results, code, data, and environment
• Common Pitfalls
  • Repeated quiet failures
  • Old analysis doesn’t run
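
A lightweight way to preserve results alongside the data and environment that produced them is to write a small manifest when a project wraps up. The sketch below is one possible shape, using only the Python standard library; the function names and manifest fields are hypothetical, and a real setup would also record package versions and the code revision.

```python
import hashlib
import json
import platform
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(data_path: Path, outcome: str, summary: str) -> dict:
    """Collect what a future reader needs to rerun or reinterpret the work."""
    return {
        "outcome": outcome,  # "positive" or "negative"; both are worth storing
        "summary": summary,
        "data_file": data_path.name,
        "data_sha256": sha256_of(data_path),  # detects silent changes to the data
        "python_version": platform.python_version(),
        "platform": platform.platform(),
    }


def save_manifest(manifest: dict, out_path: Path) -> None:
    """Write the manifest as readable JSON next to the project outputs."""
    out_path.write_text(json.dumps(manifest, indent=2))
```

Calling `build_manifest` on the final dataset and saving the result next to the write-up means a null result is still discoverable later, and a changed data file is detectable by its hash.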

Slide 40

Slide 40 text

Group Exercise #1: Fill out a pre-flight checklist for one of your projects
• Spend 15 minutes filling out template
• Discuss in groups of five for 20 minutes

Slide 41

Slide 41 text

8 Factor Pre-Launch Checklist: Questions to Ask

Business Case
• What’s the desired outcome of the project, in terms of a business metric?
• What are the linkages from your project to impacting that ultimate business metric?
• What is the order-of-magnitude ROI and cost?

Stakeholders
• Lead DS, Proposed Validator, Data Engineers, Product Manager, Business Executive, Business End User (internal or external), Application Developer, DevOps Engineer, Compliance

Technology
• What compute hardware and software infrastructure do you anticipate being necessary?
• Would your project benefit from specialized or parallelized computing?

Data
• What relevant data exists today?
• Who are the subject matter experts?
• What other data would we potentially want to capture, create, or buy externally?

Prior Art
• Who has worked on this business topic before (internal and external)?
• Who are the relevant experts in the techniques I will likely use?

Model Delivery
• Who will consume this and what form factor will the final product take (report, app, API)?
• What dependencies or resources will you require to deliver work this way (e.g., IT)?
• What are other possible delivery mechanisms, especially ones that are lighter weight or easier to test first?
• What user training is necessary to ensure adoption?

Success Measure
• How will you know if it’s working as expected, or otherwise get feedback?
• What’s your “monitoring” plan, even if it’s manual and subjective?

Risk Mitigation
• How could this model be misused by end users?
• Any constraints on modeling approach (e.g., interpretability requirements)?

Slide 42

Slide 42 text

A note for early teams
• Don’t be overwhelmed into paralysis by complex process
• Look for low-hanging fruit to buy political capital for more headcount and risky projects
• Find senior sponsor
• Most important takeaway: engage the business as partners early and often

Slide 43

Slide 43 text

Process: The Data Science Lifecycle

Slide 44

Slide 44 text

Common pitfalls in the data science lifecycle
[Diagram: the data science lifecycle with four numbered pitfall markers, 1 through 4.]

Slide 45

Slide 45 text

Bottlenecks and pitfalls have quantified negative impact

1. Inconsistent Project Prioritization and Kickoff
• Challenges
  • Duplication of work wastes time and slows down progress
  • Inability to leverage past work and customize across locations
  • Scope creep and loss of stakeholder enthusiasm
• Business impact
  • X% of projects have little / no impact
  • Y number of weeks lost by employees identifying what projects have been done before and understanding that work

2. No Access to Technology On-Demand
• Challenges
  • Approval for required infrastructure takes weeks per project
  • Insufficient infrastructure prevents differentiated innovation
• Business impact
  • 4-6 week delays for resource requests, spread between approvals and implementation
  • X time wasted replacing Data Scientists

3. No Ability to Easily Deploy Results to Business
• Challenges
  • Data Scientists waste time on mundane tasks to expose models
  • Business stakeholders complain about lengthy delays to business value
• Business impact
  • Z time lost by employees setting up dashboard servers

4. Failure to Preserve Knowledge upon Completion
• Challenges
  • Lack of documentation and reproducibility of code hurts iteration
  • Projects just fade away, so null results aren’t known for future collaborators
• Business impact
  • Model iteration velocity slowed by average of 1m

Slide 46

Slide 46 text

People

Slide 47

Slide 47 text

Why focus on people?
• Talent gap commonly cited as obstacle to being model-driven
• Typical tenure <2 years with 3+ month ramp
• Overwhelmed by resumes, underwhelmed by output

Slide 48

Slide 48 text

Framework for People
• Attract – How to lure the best talent
• Assess – Hire systematically
• Train – Focus on mindset, not just skills
• Retain – Build community and mentorship
• Organize – Define optimal roles and structure

Slide 49

Slide 49 text

Attracting the best and brightest
• Best Practices
  • Have a differentiated offering and strategy
  • Advertise projects, not just the company
  • Offer modern tools and commitment to open source
• Common Pitfalls
  • Write unrealistic job descriptions
  • Seek PhDs when you need hackers (or vice versa)

Slide 50

Slide 50 text

Assessment
• Best Practices
  • Be systematic: identify required attributes, design assessments for each
  • Be analytical: track interviewer and interview type efficacy
  • Include EQ and non-technical assessments
  • Sell while assessing: simulate real work
• Common Pitfalls
  • Over-rely on tech screens
[Image: women’s sports team bench]

Slide 51

Slide 51 text

Training
• Best Practices
  • Reinforce mindsets, not just skills
  • Develop culture of reuse, compounding
  • Reward community-enhancing behavior
  • Provide “soft” skills training
• Common Pitfalls
  • “Not built here” mentality

Slide 52

Slide 52 text

Set expectations on time allocation upfront
• Best Practices
  • Emphasize listening to stakeholders
  • Compensate team on new and existing work, not just current projects
• Common Pitfalls
  • Employee churn from flawed expectations
Source: Max Shron, Warby Parker

Slide 53

Slide 53 text

Metrics of managing data science
• Best Practices
  • Share accountability with the business’s KPIs
  • Focus on iteration velocity
  • Systematically capture stakeholder feedback and engagement
• Common Pitfalls
  • Measure everyone but yourself
  • Over-index on any one project vs. factory performance

Slide 54

Slide 54 text

The many hats of data science

Role | Priorities
Data Scientist | Generating and communicating insights, understanding the strengths and risks
Data Storyteller | Creating engaging visual and narrative journeys for analytical solutions
Data Infrastructure Engineer | Building scalable pipelines and infrastructure that make it possible to meet the higher-level needs
Data Product Manager | Articulating the business problem, translating to day-to-day work, ensuring ongoing engagement
Business Stakeholder | Vetting the prioritization and ROI, providing ongoing feedback

Slide 55

Slide 55 text

Organizational Design Dilemmas
• Best Practices
  • Solve prioritization and delivery problems first
  • Bridge silos with cross-cutting platforms
• Common Pitfalls
  • Fail to evolve structure as org matures
  • Confine teams to ivory tower innovation labs

DECENTRALIZATION
• Pros
  • Stronger alignment with business processes and priorities
  • Easier change management
• Cons
  • Less technical knowledge compounding
  • Harder to codify best practices
  • Risk of IT governance issues

CENTRALIZATION
• Pros
  • Community and mentorship
  • Easier transparency for managers and IT
  • More passive technical knowledge sharing
• Cons
  • Isolation on data science island
  • Loss of credibility with business
  • Frustrated data scientists

Slide 56

Slide 56 text

What Org Design is Right For You?
[Diagram: four org designs showing where data science (DS) teams sit relative to IT and the lines of business (LoB1, LoB2, LoB3): Centralized Standalone; Centralized under IT/Eng; Federated; Hub-and-Spoke.]
• Prioritize stakeholder proximity early if use cases are internal
• Tie to engineering if primarily building model-driven, external-facing products
• Develop hub-and-spoke as you scale

Slide 57

Slide 57 text

Hiring and Ramping Plan Template: Questions to Ask

Attracting Talent
• What’s your differentiated value proposition for candidate data scientists? List three things that make the opportunity unique, that you think will resonate with your target candidate pool.
• What are 1-3 risks that might make the opportunity less appealing than competing opportunities? How can you mitigate or get ahead of them?

Hiring Process
• What are the three most important attributes for your candidates? What is your assessment plan for each?

Onboarding
• What outcomes need to have been achieved in the first 30, 60, and 90 days?
• What are the most important pieces of “tribal knowledge” your new hire needs to know, and how will she learn them? Examples include data sources, project methodologies, stakeholder dynamics, notable wins / losses, etc.

Retention and Management
• What skills do you hope this candidate develops over the first year?
• What metrics will determine success of this candidate after a year? Examples include certain business metrics, community contributions, number of insights produced, or project iteration velocity.

Slide 58

Slide 58 text

Group Exercise #2: Build your hiring and ramp plan
• Spend 15 minutes filling out template
• Discuss in groups of five for 20 minutes

Slide 59

Slide 59 text

Technology

Agility & Iteration
• Experimental agility
  • Tools / packages
  • Compute
• Deployment agility
  • Expose work back to the business quickly

Collaboration
• Shared context
• Discussion
• Knowledge Management (search & discovery)

Reproducibility & Reusability
• Code
• Data
• Results
• Environments

Slide 60

Slide 60 text

Strategy: incentivize best practices “bottom up”
• Data scientists: “I’m more productive!”
  • Test ideas faster
  • Deploy and share work easily
• Leaders: “Centralized work!”
  • Powerful collaboration features
  • Version control & reproducibility

Slide 61

Slide 61 text

How we approached this
• Decrease time to business impact:
  • Deploy models as APIs
  • Deploy apps (e.g., Shiny) & reports to non-technical stakeholders
  • Scheduled jobs for ETL, reporting, model retraining
• Entice data scientists with:
  • Vertically and horizontally scalable infrastructure
  • DevOps automation
  • Computational lab notebook to track results
• Centralizing work makes it possible to:
  • Find, reuse, reproduce, and discuss work

Slide 62

Slide 62 text

X-Factors

Slide 63

Slide 63 text

Model liability
• Problem emerges at later maturity
• Track and guardrail model usage
• Document risks and trade-offs made in flight, not post hoc
• Pre-emptively establish validation, monitoring, and compliance controls

Slide 64

Slide 64 text

Navigating organizational politics
• Educate executives on reality of probabilistic research
• Anticipate demands of procurement (ROI of aggregate project portfolio)
• Frame impacts of data science investment:
  • Out-compete peers
  • Increase operational efficiency
  • Reduce costs (headcount, etc.)
  • Reduce risk

Slide 65

Slide 65 text

Summary

Slide 66

Slide 66 text

Summary
• Data science success is not the sum of individual successes; it’s an organizational capability
• Alignment and partnership with the business is critical
• Process – Enforce a pre-flight checklist
• People – Develop hiring and onboarding plans
• Technology – Leverage technology to increase productivity and reinforce best-practice processes
• X-Factors – Navigate politics and risk

Slide 67

Slide 67 text

Struggling with your own lifecycle?
• Ask us about Domino’s Data Science Lifecycle and Value Assessment offerings
• Tailored analysis of existing processes, gaps, and tangible best practices
• Leverage our ROI analysis templates across your portfolio

Slide 68

Slide 68 text

Want to learn more?
• Check out this content for more information
  • The Practical Guide to Managing Data Science at Scale
  • Data Science Management Survey Report
• Stop by our booth #1403

Questions?