Slide 1

Slide 1 text

1 DATA SCIENCE: PAST, PRESENT & FUTURE. PyCon Colombia 2018

Slide 2

Slide 2 text

2 ABOUT ME. Timeline from 2011 to 2018 (PyCon Colombia 2018).
GRAD STUDENT (Energy): E.ON / RWTH AACHEN
PROCESS ENGINEER (Manufacturing): PROCTER & GAMBLE
BUSINESS ANALYST / CONSULTANT (Banking): BLUECAP - LA CAIXA
DATA SCIENTIST & PRODUCT MANAGER (Tech): ANACONDA
DATA SCIENCE CONSULTING (Professional Services): DATNA, LLC

Slide 3

Slide 3 text

3 DATA SCIENCE. ACADEMIA & RESEARCH INDUSTRY & ENTERPRISE COMMUNITY & OPEN SOURCE ME

Slide 4

Slide 4 text

4 N E X T DATA SCIENCE DEFINITIONS

Slide 5

Slide 5 text

DATA SCIENCE WORKFLOW
01 COLLECT: Gather, integrate and store data
02 UNDERSTAND: Explore, clean, transform, visualize
03 MODEL: Build and validate models
04 DEPLOY: Communicate and integrate into systems
Areas: Data Engineering & Architecture, Data Analytics & Insights, Data Modeling
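A minimal sketch of the four workflow steps (collect, understand, model, deploy) as one small Python pipeline. Everything here is an assumption made for illustration: the CSV content is hypothetical toy data, the function names simply mirror the step names, and the "model" is a one-variable least-squares line rather than anything the talk prescribes.

```python
import csv
import io
import statistics

# Hypothetical toy data; the third row is deliberately incomplete.
RAW_CSV = """temperature_c,energy_kwh
10,30
15,25
,27
20,20
25,15
"""


def collect(raw):
    """01 COLLECT: gather the raw records (here, parse an in-memory CSV)."""
    return list(csv.DictReader(io.StringIO(raw)))


def understand(rows):
    """02 UNDERSTAND: clean (drop incomplete rows) and transform text to numbers."""
    return [
        (float(r["temperature_c"]), float(r["energy_kwh"]))
        for r in rows
        if r["temperature_c"] and r["energy_kwh"]
    ]


def model(pairs):
    """03 MODEL: fit y = slope * x + intercept by ordinary least squares."""
    xs, ys = zip(*pairs)
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in pairs) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    return slope, mean_y - slope * mean_x


def deploy(params):
    """04 DEPLOY: expose the fitted model as a function other code can call."""
    slope, intercept = params
    return lambda x: slope * x + intercept


clean_pairs = understand(collect(RAW_CSV))
predict_energy = deploy(model(clean_pairs))
print(predict_energy(30))
```

In a real project each step would be far larger (databases, visual exploration, validation, an API), but the hand-offs between the steps keep the same shape.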

Slide 6

Slide 6 text

03 STEP Modeling. MACHINE LEARNING
Unsupervised learning (no labels). TASKS: Exploring (clustering, dimensionality reduction). APPLICATIONS: Market segmentation, Anomaly detection, Summarizing information. ALGORITHMS: K-means, Hierarchical clustering, PCA, t-SNE
Supervised learning (labels). TASKS: Predicting (classification, regression). APPLICATIONS: Spam detection, Object/face recognition, Recommender systems. ALGORITHMS: Logistic Regression, SVM, Decision trees, k-NN, Linear Regression, Neural Networks
Reinforcement learning (reward). TASKS: Decision making. APPLICATIONS: Robotics (make humanoid robot walk), Games (defeat Go champion), Finance (trading strategies). ALGORITHMS: Q-learning, Policy gradient, REINFORCE, Dyna, Dynamic programming, MCTS
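To make the labels / no-labels distinction concrete, here is a small sketch of two algorithms named on this slide: k-NN (supervised) and K-means (unsupervised). The points, the "ham"/"spam" labels and the function names are hypothetical. The K-means initialization (first k points) is a simplifying assumption chosen to keep the example deterministic; practical implementations usually use random restarts or k-means++.

```python
import math


def knn_predict(train_points, train_labels, query, k=3):
    """Supervised: majority vote among the k labeled points closest to the query."""
    nearest = sorted(
        range(len(train_points)),
        key=lambda i: math.dist(train_points[i], query),
    )[:k]
    votes = [train_labels[i] for i in nearest]
    return max(set(votes), key=votes.count)


def kmeans(points, k=2, iterations=10):
    """Unsupervised: assign each unlabeled point to one of k clusters."""
    centroids = list(points[:k])  # naive initialization (assumption, see lead-in)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[nearest].append(p)
        # Move each centroid to the mean of its cluster; keep it if the cluster is empty.
        centroids = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster))
            if cluster
            else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return [min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points]


# Hypothetical 2-D features for six messages.
points = [(1.0, 1.0), (5.0, 5.0), (1.2, 0.8), (5.2, 4.9), (0.8, 1.1), (4.8, 5.1)]
labels = ["ham", "spam", "ham", "spam", "ham", "spam"]

predicted_label = knn_predict(points, labels, query=(1.1, 0.9))  # uses the labels
cluster_ids = kmeans(points)                                     # ignores the labels
print(predicted_label, cluster_ids)
```

The supervised call needs `labels` to answer "which class is this?", while the unsupervised call only discovers that the points fall into two groups and leaves naming them to a person.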

Slide 7

Slide 7 text

03 STEP Modeling. MACHINE LEARNING: same tasks, applications and algorithms as on Slide 6 (unsupervised, supervised and reinforcement learning), now with the DEEP LEARNING label added.

Slide 8

Slide 8 text

8 DEFINITIONS. DATA SCIENCE, MACHINE LEARNING, DEEP LEARNING. MACHINE LEARNING ~ AI, DEEP LEARNING ~ AI, REINFORCEMENT LEARNING ~ AI

Slide 9

Slide 9 text

9 N E X T DATA SCIENCE: THE PAST

Slide 10

Slide 10 text

10 TERM: DATA SCIENCE. “It was coined in 2008 by one of us, D.J. Patil, and Jeff Hammerbacher, then the respective leads of data and analytics efforts at LinkedIn and Facebook.” OCTOBER, 2012. Source: https://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century

Slide 11

Slide 11 text

11 WE HAD ALREADY BEEN USING DATA. STATISTICS, OPERATIONS RESEARCH, BUSINESS INTELLIGENCE, DATA MINING, ANALYTICS, PROCESS ENGINEERING, QUANTITATIVE RESEARCH, OPTIMIZATION, REPORTS, DASHBOARDS, CRAWLING, OPEN DATA, DATA WAREHOUSE, INFORMATION RETRIEVAL

Slide 12

Slide 12 text

12 Source: http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram THE DATA SCIENTIST. SKILLS DATA TOOLS

Slide 13

Slide 13 text

13 THE DATA SCIENTIST. https://brohrer.github.io/imposter_syndrome.html If you are asking questions and using data to find answers, YOU ARE A DATA SCIENTIST. Period. ~ Brandon Rohrer

Slide 14

Slide 14 text

ANALYTICS IN THE ENTERPRISE. SAS + ORACLE + MATLAB + EXCEL. CHALLENGES: Transparency, Innovation, Reproducibility, Deployment

Slide 15

Slide 15 text

1994-2005: THE FOUNDATIONS. Python v1.0 (1994), SciPy (1998), IPython (2001), matplotlib (2001), NumPy (2005)

Slide 16

Slide 16 text

16 2006-2014: THE GROWTH. Scikit-learn (2007), Pandas (2008), Conda (2012), Jupyter (2014)

Slide 17

Slide 17 text

2014-2017: THE “ENTERPRISE” OPEN SOURCE.

Slide 18

Slide 18 text

18 Source: https://speakerdeck.com/jakevdp/pythons-data-science-stack-jsm-2016 BUILDING ON EACH OTHER’S WORK.

Slide 19

Slide 19 text

19 WE INCORPORATED DATA PRODUCTS IN OUR LIVES.
WHAT WE BUY: AMAZON RECOMMENDATIONS
WHAT INFORMATION WE CONSUME: GOOGLE SEARCH
WHAT SHOWS WE WATCH: NETFLIX
HOW WE NAVIGATE: GOOGLE MAPS
HOW WE CONNECT: FACEBOOK, LINKEDIN
In 2009, Netflix awarded the $1M Grand Prize

Slide 20

Slide 20 text

20 Every project started with a small first step.

Slide 21

Slide 21 text

21 N E X T DATA SCIENCE: THE PRESENT

Slide 22

Slide 22 text

22 GARTNER HYPE CYCLE. Deep Learning Machine Learning Autonomous Vehicles

Slide 23

Slide 23 text

23 A NEW PROFESSION & CAREER PATH.
Data Engineering & Architecture: Data Engineer, Sr. Data Engineer, Data Architect, Director of Data Engineering
Data Analytics & Insights: Data Analyst, Sr. Data Analyst, Analytics Manager, Director of Analytics
Data Science: Data Scientist, Sr. Data Scientist, Data Science Manager, Director of Data Science

Slide 24

Slide 24 text

24 DATA SCIENCE MATURITY. ECOSYSTEM, SKILLS, OPEN SOURCE, COMPUTE, DATA + ALGORITHMS

Slide 25

Slide 25 text

ECOSYSTEM. A mature product and vendor ecosystem to serve the early majority. M&A - CONSOLIDATION: TURI, YHAT, SENSE.IO, KAGGLE. IPOs: CLOUDERA, ALTERYX. Source: FirstMark Capital, Matt Turck, Jim Hao http://mattturck.com/big-data-landscape-2016-v18-final/

Slide 26

Slide 26 text

26 SKILLS. MIT - http://introtodeeplearning.com/

Slide 27

Slide 27 text

27 SKILLS. http://www.mastersindatascience.org/

Slide 28

Slide 28 text

28 OPEN SOURCE. DEEP LEARNING, IDEs, DATA MUNGING, DATA VIZ, MACHINE LEARNING, DATA WORKFLOWS, BIG DATA, NLP, STATISTICS

Slide 29

Slide 29 text

29 COMPUTE. GCP AZURE Data Science Collaboration ML / DL APIs GOOGLE CLOUD VISION API COMPUTER VISION API Build, train, deploy API AWS

Slide 30

Slide 30 text

30 DATA + ALGORITHMS.
IMAGES: http://www.image-net.org/
AUDIO: https://research.google.com/audioset/
WEBSITES: http://commoncrawl.org/
COMPETITIONS: https://www.kaggle.com/
DATA REPOSITORY: https://data.world/

Slide 31

Slide 31 text

ALPHAGO.
Oct. 2015 - Beats a human professional Go player (v. Fan)
Mar. 2016 - Beats Lee Sedol (9-dan professional) in a five-game match (v. Lee)
May 2017 - Beats Ke Jie, the world's top Go player (v. Master)
Oct. 2017 - AlphaGo Zero beats AlphaGo (v. Lee) 100-0 with an algorithm based solely on reinforcement learning, without human data.
Source: "Mastering the game of Go without human knowledge". Nature. 19 October 2017. Retrieved 19 October 2017.
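To illustrate what "based solely on reinforcement learning, without human data" means at the smallest possible scale, here is a sketch of tabular Q-learning, one of the algorithms listed on Slide 6. This is not AlphaGo Zero's algorithm, which is a much larger system; the corridor environment, the reward of 1 at the right end and all hyperparameters are assumptions made for illustration.

```python
import random


def train_q_learning(n_states=5, episodes=500, alpha=0.5, gamma=0.9,
                     epsilon=0.2, seed=0):
    """Tabular Q-learning on a toy corridor with states 0..n_states-1.

    The agent starts at state 0 and receives reward 1.0 only when it
    reaches the last state. It is never told which direction is correct.
    """
    rng = random.Random(seed)
    moves = (-1, +1)  # action 0 = left, action 1 = right
    goal = n_states - 1
    q = [[0.0, 0.0] for _ in range(n_states)]
    for _ in range(episodes):
        state = 0
        while state != goal:
            # Epsilon-greedy: mostly exploit the current estimates, sometimes explore.
            if rng.random() < epsilon:
                action = rng.randrange(2)
            else:
                action = 0 if q[state][0] > q[state][1] else 1
            next_state = min(max(state + moves[action], 0), goal)
            reward = 1.0 if next_state == goal else 0.0
            # Q-learning update: move the estimate toward reward + discounted best next value.
            target = reward + gamma * max(q[next_state])
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
    return q


q_table = train_q_learning()
greedy_policy = ["left" if left > right else "right" for left, right in q_table[:-1]]
print(greedy_policy)
```

The learned policy comes only from trial, error and reward. No example of a "correct" move is ever provided, which is the property the AlphaGo Zero result highlights.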

Slide 32

Slide 32 text

32 DOG VS . . . Source: https://imgur.com/a/K4RWn Chihuahua or muffin? Sharpei or towel?

Slide 33

Slide 33 text

33 Source: https://imgur.com/a/K4RWn Labradoodle or fried chicken? Sheepdog or mop?

Slide 34

Slide 34 text

34 REAL APPLICATIONS. Source: http://observer.com/2017/05/artificial-intelligence-can-stop-elephant-rhino-poaching-in-africa/ Source: https://www.nature.com/articles/nature21056

Slide 35

Slide 35 text

35 REAL APPLICATIONS. Source: https://cloud.google.com/blog/big-data/2016/08/how-a-japanese-cucumber-farmer-is-using-deep-learning-and-tensorflow

Slide 36

Slide 36 text

36 DATA PRODUCTS - DEVICES. HOME ASSISTANTS WEARABLES

Slide 37

Slide 37 text

37 N E X T DATA SCIENCE: THE FUTURE

Slide 38

Slide 38 text

38 AUTONOMOUS VEHICLES.

Slide 39

Slide 39 text

39 Source: https://research.fb.com/facebook-open-sources-detectron/ OPEN SOURCE DETECTRON.

Slide 40

Slide 40 text

40 CHALLENGES. SECURITY, ETHICS, INTERPRETABILITY, OPEN SOURCE SUSTAINABILITY

Slide 41

Slide 41 text

41 SECURITY. https://www.nytimes.com/2018/01/29/world/middleeast/strava-heat-map.html https://qz.com/1042852/using-a-fitness-app-taught-me-the-scary-truth-about-why-privacy-settings-are-a-feminist-issue/

Slide 42

Slide 42 text

42 SECURITY. HACKING AI. Source: https://steemit.com/security/@mrosenquist/researchers-hack-self-driving-cars-with-stickers-on-signs Source: https://www.theverge.com/2017/4/12/15271874/ai-adversarial-images-fooling-attacks-artificial-intelligence
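A minimal sketch of why small, targeted changes to an input can flip a model's prediction. It uses a hand-written linear classifier with hypothetical weights and features, not any model from the linked articles. The perturbation step follows the sign of the gradient, the idea behind the fast gradient sign method; for a linear score, the gradient with respect to the input is simply the weight vector.

```python
import math

# Hypothetical weights of a linear "stop sign vs. other" classifier.
WEIGHTS = [2.0, -1.5, 0.5, 1.0]
BIAS = -0.5


def score(features):
    """Linear decision score; positive means 'stop sign'."""
    return sum(w * x for w, x in zip(WEIGHTS, features)) + BIAS


def classify(features):
    return "stop sign" if score(features) > 0 else "other"


def adversarial(features, epsilon):
    """Shift each feature by at most epsilon in the direction that lowers the score."""
    return [x - epsilon * math.copysign(1.0, w) for w, x in zip(WEIGHTS, features)]


original = [0.6, 0.4, 0.5, 0.3]          # hypothetical input features
perturbed = adversarial(original, epsilon=0.2)

print(classify(original), score(original))
print(classify(perturbed), score(perturbed))
```

No single feature moves by more than 0.2, yet the score drops by 0.2 times the sum of the absolute weights, which is enough here to change the predicted class.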

Slide 43

Slide 43 text

43 ETHICS. DEEPFAKE.

Slide 44

Slide 44 text

44 ETHICS. BIAS. COMPAS predicts that black defendants have a higher risk of recidivism than they actually do, while white defendants are predicted to have a lower risk than they actually do. Source: https://www.nytimes.com/2017/10/26/opinion/algorithm-compas-sentencing-bias.html
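One way to check for the kind of disparity described above is to compare error rates per group: how often people who did not reoffend were flagged as high risk (false positives), and how often people who did reoffend were rated low risk (false negatives). The records below are synthetic toy data invented for this sketch, not COMPAS data, and the group names and function name are hypothetical.

```python
from collections import defaultdict

# Synthetic toy records, NOT real data: (group, predicted_high_risk, reoffended)
records = [
    ("A", True, False), ("A", True, False), ("A", False, False), ("A", False, False),
    ("A", True, True), ("A", True, True), ("A", True, True), ("A", False, True),
    ("B", True, False), ("B", False, False), ("B", False, False), ("B", False, False),
    ("B", True, True), ("B", True, True), ("B", False, True), ("B", False, True),
]


def error_rates(rows):
    """Return false positive and false negative rates for each group."""
    counts = defaultdict(lambda: {"fp": 0, "negatives": 0, "fn": 0, "positives": 0})
    for group, predicted_high_risk, reoffended in rows:
        c = counts[group]
        if reoffended:
            c["positives"] += 1
            c["fn"] += int(not predicted_high_risk)  # rated low risk but reoffended
        else:
            c["negatives"] += 1
            c["fp"] += int(predicted_high_risk)      # rated high risk but did not reoffend
    return {
        group: {
            "false_positive_rate": c["fp"] / c["negatives"],
            "false_negative_rate": c["fn"] / c["positives"],
        }
        for group, c in counts.items()
    }


rates = error_rates(records)
for group, r in sorted(rates.items()):
    print(group, r)
```

In this toy data the overall accuracy is the same for both groups, but group A is over-predicted and group B is under-predicted, which is why per-group error rates are worth reporting alongside a single accuracy number.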

Slide 45

Slide 45 text

45 [Machine-learned models] will learn what the data shows them, and then tell you what they’ve learned. They refuse to learn “the world as we wish it were”. The fact is that these biases do exist in our society, and they’re reflected in nearly any piece of data you look at. Source: https://medium.com/@yonatanzunger/asking-the-right-questions-about-ai-7ed2d9820c48 BIAS.

Slide 46

Slide 46 text

46 ETHICS INITIATIVE. Source: https://www.bloomberg.com/company/announcements/bloomberg-brighthive-data-democracy-launch-initiative-develop-data-science-code-ethics/ Source: https://medium.com/@dpatil/a-code-of-ethics-for-data-science-cda27d1fac1

Slide 47

Slide 47 text

47 Source: https://medium.com/@yonatanzunger/asking-the-right-questions-about-ai-7ed2d9820c48 INTERPRETABILITY. What people are good at, it turns out, isn’t explaining how they made decisions: it’s coming up with a reasonable-sounding explanation for their decision after the fact.

Slide 48

Slide 48 text

48 Source: https://github.com/marcotcr/lime INTERPRETABILITY. Source: https://arxiv.org/pdf/1311.2901.pdf Source: https://twitter.com/pmddomingos/status/956697536189800448

Slide 49

Slide 49 text

49 OPEN SOURCE SUSTAINABILITY. DEVELOPER BURN OUT, RATIO DEVELOPER - USER, ENTERPRISES PROFITING WITHOUT GIVING BACK, USERS WITH HIGH EXPECTATIONS. Source: https://www.numfocus.org/blog/matplotlib-lead-developer-explains-why-he-cant-fix-the-docs-but-you-can/

Slide 50

Slide 50 text

50 THANK YOU! @ch_doig