PyCon Colombia 2018

1 DATA SCIENCE: PAST, PRESENT & FUTURE. PyCon Colombia 2018

2 ABOUT ME. Timeline 2011-2018.
GRAD STUDENT, Energy: E.ON / RWTH AACHEN
PROCESS ENGINEER, Manufacturing: PROCTER & GAMBLE
BUSINESS ANALYST / CONSULTANT, Banking: BLUECAP - LACAIXA
DATA SCIENTIST & PRODUCT MANAGER, Tech: ANACONDA
DATA SCIENCE CONSULTING, Professional Services: DATNA, LLC

3 DATA SCIENCE. ACADEMIA & RESEARCH; INDUSTRY & ENTERPRISE; COMMUNITY & OPEN SOURCE; ME

4 NEXT: DATA SCIENCE DEFINITIONS

DATA SCIENCE WORKFLOW.
01 COLLECT: Gather, integrate and store data
02 UNDERSTAND: Explore, clean, transform, visualize
03 MODEL: Build and validate models
04 DEPLOY: Communicate and integrate into systems
Areas: Data Engineering & Architecture; Data Analytics & Insights; Data Modeling
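
The four workflow steps can be sketched end to end in a few lines of Python. This is only an illustration under stated assumptions: the CSV content, the column names (`spend`, `sales`) and the function names are hypothetical and do not come from the talk, and a least-squares line stands in for whatever model a real project would use.

```python
import csv
import io
import statistics

# Hypothetical raw data (not from the talk): one row is deliberately malformed.
RAW_CSV = """spend,sales
1,3
2,5
3,7
n/a,9
4,9
"""

def collect(raw_text):
    """COLLECT: gather the raw records into one place."""
    return list(csv.DictReader(io.StringIO(raw_text)))

def understand(records):
    """UNDERSTAND: clean and transform; rows that fail to parse are dropped."""
    clean = []
    for row in records:
        try:
            clean.append((float(row["spend"]), float(row["sales"])))
        except ValueError:
            continue
    return clean

def model(points):
    """MODEL: fit a least-squares line y = slope * x + intercept."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in points)
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    return slope, mean_y - slope * mean_x

def deploy(params):
    """DEPLOY: expose the fitted model as a callable other systems can use."""
    slope, intercept = params
    return lambda x: slope * x + intercept

predict = deploy(model(understand(collect(RAW_CSV))))
print(predict(5))
```

Each step only consumes the output of the previous one, which is why the steps can be owned by different roles without changing the overall flow.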

03 STEP Modeling. MACHINE LEARNING
Unsupervised learning (no labels). TASKS: Exploring (clustering, dimensionality reduction). APPLICATIONS: Market segmentation, Anomaly detection, Summarizing information. ALGORITHMS: K-means, Hierarchical clustering, PCA, T-SNE
Supervised learning (labels). TASKS: Predicting (Classification, Regression). APPLICATIONS: Spam detection, Object/face recognition, Recommender systems. ALGORITHMS: Logistic Regression, SVM, Decision trees, k-NN, Linear Regression, Neural Networks
Reinforcement learning (reward). TASKS: Decision making. APPLICATIONS: Robotics - Make Humanoid robot walk, Games - Defeat Go champion, Finance - Trading strategies. ALGORITHMS: Q-learning, Policy gradient, REINFORCE, Dyna, Dynamic programming, MCTS

03 STEP Modeling. Same MACHINE LEARNING overview as the previous slide, with DEEP LEARNING added.
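
The "labels" versus "no labels" distinction can be made concrete with two of the listed algorithms. The sketch below is a minimal, standard-library illustration, not code from the talk: the toy points, the labels and the function names are all hypothetical. k-NN uses the labels to predict; k-means never sees them and only groups the points.

```python
import math
from collections import Counter

# Hypothetical toy data: two well-separated groups of 2-D points.
POINTS = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5), (8.0, 8.0), (8.5, 9.0), (9.0, 8.5)]
LABELS = ["a", "a", "a", "b", "b", "b"]  # only the supervised method sees these

def knn_predict(train_points, train_labels, query, k=3):
    """Supervised: majority vote among the k labeled points closest to the query."""
    ranked = sorted(zip(train_points, train_labels),
                    key=lambda pair: math.dist(pair[0], query))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

def nearest_centroid(point, centroids):
    """Index of the centroid closest to the point."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))

def kmeans(points, centroids, iterations=10):
    """Unsupervised: alternate assignment and centroid update (Lloyd's algorithm)."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            clusters[nearest_centroid(p, centroids)].append(p)
        # An empty cluster keeps its previous centroid.
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, [nearest_centroid(p, centroids) for p in points]

print(knn_predict(POINTS, LABELS, (2.0, 2.0)))
print(kmeans(POINTS, [POINTS[0], POINTS[-1]])[1])
```

The initial centroids are passed in explicitly so the run is deterministic; real implementations usually pick them at random and restart several times.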

8 DEFINITIONS. MACHINE LEARNING, DATA SCIENCE, DEEP LEARNING. MACHINE LEARNING ~ AI; DEEP LEARNING ~ AI; REINFORCEMENT LEARNING ~ AI

9 NEXT: DATA SCIENCE: THE PAST

10 TERM: DATA SCIENCE. “It was coined in 2008 by one of us, D.J. Patil, and Jeff Hammerbacher, then the respective leads of data and analytics efforts at LinkedIn and Facebook.” OCTOBER, 2012. Source: https://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century

11 WE HAD ALREADY BEEN USING DATA. STATISTICS, OPERATIONS RESEARCH, BUSINESS INTELLIGENCE, DATA MINING, ANALYTICS, PROCESS ENGINEERING, QUANTITATIVE RESEARCH, OPTIMIZATION, REPORTS, DASHBOARDS, CRAWLING, OPEN DATA, DATA WAREHOUSE, INFORMATION RETRIEVAL

12 THE DATA SCIENTIST. SKILLS, DATA, TOOLS. Source: http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram

13 THE DATA SCIENTIST. “If you are asking questions and using data to find answers, YOU ARE A DATA SCIENTIST. Period.” ~ Brandon Rohrer. Source: https://brohrer.github.io/imposter_syndrome.html

ANALYTICS IN THE ENTERPRISE. SAS + ORACLE + MATLAB + EXCEL. CHALLENGES: Transparency, Innovation, Reproducibility, Deployment

1994-2005: THE FOUNDATIONS. Python v1.0 1994, SciPy 1998, IPython 2001, matplotlib 2001, NumPy 2005

16 2006-2014: THE GROWTH. Scikit-learn 2007, Pandas 2008, Conda 2012, Jupyter 2014

2014-2017: THE “ENTERPRISE” OPEN SOURCE.

18 BUILDING ON EACH OTHER’S WORK. Source: https://speakerdeck.com/jakevdp/pythons-data-science-stack-jsm-2016

19 WE INCORPORATED DATA PRODUCTS IN OUR LIVES.
WHAT WE BUY: AMAZON RECOMMENDATIONS
WHAT INFORMATION WE CONSUME: GOOGLE SEARCH
WHAT SHOWS WE WATCH: NETFLIX
HOW WE NAVIGATE: GOOGLE MAPS
HOW WE CONNECT: FACEBOOK, LINKEDIN
In 2009, Netflix awarded the $1M Grand Prize

20 Every project started with a small first step.

21 NEXT: DATA SCIENCE: THE PRESENT

22 GARTNER HYPE CYCLE. Deep Learning, Machine Learning, Autonomous Vehicles

23 A NEW PROFESSION & CAREER PATH.
Data Engineering & Architecture: Data Engineer, Sr. Data Engineer, Data Architect, Director of Data Engineering
Data Analytics & Insights: Data Analyst, Sr. Data Analyst, Analytics Manager, Director of Analytics
Data Science: Data Scientist, Sr. Data Scientist, Data Science Manager, Director of Data Science

24 DATA SCIENCE MATURITY. ECOSYSTEM, SKILLS, OPEN SOURCE, COMPUTE, DATA + ALGORITHMS

ECOSYSTEM. A mature product and vendor ecosystem to serve the early majority. M&A - CONSOLIDATION: TURI, YHAT, SENSE.IO, KAGGLE. IPOs: CLOUDERA, ALTERYX. Source: FirstMark Capital, Matt Turck, Jim Hao http://mattturck.com/big-data-landscape-2016-v18-final/

26 SKILLS. MIT - http://introtodeeplearning.com/

27 SKILLS. http://www.mastersindatascience.org/

28 OPEN SOURCE. DEEP LEARNING, IDEs, DATA MUNGING, DATA VIZ, MACHINE LEARNING, DATA WORKFLOWS, BIG DATA, NLP, STATISTICS

29 COMPUTE. AWS, GCP, AZURE. Data Science Collaboration. ML / DL APIs: GOOGLE CLOUD VISION API, COMPUTER VISION API. Build, train, deploy API.

30 DATA + ALGORITHMS.
IMAGES: http://www.image-net.org/
AUDIO: https://research.google.com/audioset/
WEBSITES: http://commoncrawl.org/
COMPETITIONS: https://www.kaggle.com/
DATA REPOSITORY: https://data.world/

ALPHA GO.
Oct. 2015 - Beats human professional Go player (v. Fan)
Mar. 2016 - Beats Lee Sedol (9-dan professional) in five-game match (v. Lee)
May 2017 - Beats Ke Jie, the world's top Go player (v. Master)
October 2017 - AlphaGo Zero beats AlphaGo (v. Lee) (100-0) with an algorithm based solely on reinforcement learning, without human data.
Source: "Mastering the game of Go without human knowledge". Nature. 19 October 2017. Retrieved 19 October 2017.

32 DOG VS ... Chihuahua or muffin? Sharpei or towel? Source: https://imgur.com/a/K4RWn

33 Labradoodle or fried chicken? Sheepdog or mop? Source: https://imgur.com/a/K4RWn

34 REAL APPLICATIONS. Source: http://observer.com/2017/05/artificial-intelligence-can-stop-elephant-rhino-poaching-in-africa/ Source: https://www.nature.com/articles/nature21056

35 REAL APPLICATIONS. Source: https://cloud.google.com/blog/big-data/2016/08/how-a-japanese-cucumber-farmer-is-using-deep-learning-and-tensorflow

36 DATA PRODUCTS - DEVICES. HOME ASSISTANTS, WEARABLES

37 NEXT: DATA SCIENCE: THE FUTURE

38 AUTONOMOUS VEHICLES.

39 OPEN SOURCE DETECTRON. Source: https://research.fb.com/facebook-open-sources-detectron/

40 CHALLENGES. SECURITY, OPEN SOURCE SUSTAINABILITY, ETHICS, INTERPRETABILITY

41 SECURITY. https://www.nytimes.com/2018/01/29/world/middleeast/strava-heat-map.html https://qz.com/1042852/using-a-fitness-app-taught-me-the-scary-truth-about-why-privacy-settings-are-a-feminist-issue/

42 SECURITY. HACKING AI. Source: https://steemit.com/security/@mrosenquist/researchers-hack-self-driving-cars-with-stickers-on-signs Source: https://www.theverge.com/2017/4/12/15271874/ai-adversarial-images-fooling-attacks-artificial-intelligence

43 ETHICS. DEEPFAKE.

44 ETHICS. BIAS. COMPAS predicts black defendants will have higher risks of recidivism than they actually do, while white defendants are predicted to have lower rates than they actually do. Source: https://www.nytimes.com/2017/10/26/opinion/algorithm-compas-sentencing-bias.html
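
One way to make a claim like this measurable is to compare error rates per group: how often people who did not reoffend were flagged as high risk (false positives), and how often people who did reoffend were flagged as low risk (false negatives). The sketch below is an assumption about how such a check could look, not the method used in the cited article, and the records are made up for illustration; they are not COMPAS data.

```python
def error_rates(records):
    """Return (false_positive_rate, false_negative_rate).

    Each record is a (predicted_high_risk, reoffended) pair of booleans.
    """
    false_pos = sum(1 for predicted, actual in records if predicted and not actual)
    false_neg = sum(1 for predicted, actual in records if not predicted and actual)
    negatives = sum(1 for _, actual in records if not actual)
    positives = sum(1 for _, actual in records if actual)
    return false_pos / negatives, false_neg / positives

# Made-up records, NOT COMPAS data: (predicted_high_risk, reoffended).
GROUP_A = [(True, False)] * 4 + [(False, False)] * 6 + [(True, True)] * 8 + [(False, True)] * 2
GROUP_B = [(True, False)] * 2 + [(False, False)] * 8 + [(True, True)] * 5 + [(False, True)] * 5

for name, records in (("group A", GROUP_A), ("group B", GROUP_B)):
    fpr, fnr = error_rates(records)
    print(f"{name}: false positive rate {fpr:.0%}, false negative rate {fnr:.0%}")
```

In this toy data, group A is over-predicted (higher false positive rate) and group B is under-predicted (higher false negative rate), which is the shape of disparity the slide describes.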

45 BIAS. [Machine-learned models] will learn what the data shows them, and then tell you what they’ve learned. They refuse to learn “the world as we wish it were”. The fact is that these biases do exist in our society, and they’re reflected in nearly any piece of data you look at. Source: https://medium.com/@yonatanzunger/asking-the-right-questions-about-ai-7ed2d9820c48

46 ETHICS INITIATIVE. Source: https://www.bloomberg.com/company/announcements/bloomberg-brighthive-data-democracy-launch-initiative-develop-data-science-code-ethics/ Source: https://medium.com/@dpatil/a-code-of-ethics-for-data-science-cda27d1fac1

47 INTERPRETABILITY. What people are good at, it turns out, isn’t explaining how they made decisions: it’s coming up with a reasonable-sounding explanation for their decision after the fact. Source: https://medium.com/@yonatanzunger/asking-the-right-questions-about-ai-7ed2d9820c48

48 INTERPRETABILITY. Source: https://github.com/marcotcr/lime Source: https://arxiv.org/pdf/1311.2901.pdf Source: https://twitter.com/pmddomingos/status/956697536189800448

49 OPEN SOURCE SUSTAINABILITY. DEVELOPER BURN OUT; RATIO DEVELOPER - USER; ENTERPRISES PROFITING WITHOUT GIVING BACK; USERS WITH HIGH EXPECTATIONS. Source: https://www.numfocus.org/blog/matplotlib-lead-developer-explains-why-he-cant-fix-the-docs-but-you-can/

50 THANK YOU! @ch_doig

More Decks by Christine Doig
