A Brief Introduction to Data Science, Machine Learning and the PyData Ecosystem

Presented at the PyData Cardiff meetup, Cardiff, April 2018.

https://www.meetup.com/PyData-Cardiff-Meetup/events/249298761/

---

Once confined to the corridors of academia, data science and machine learning are now having a massive impact on the world of business and beyond. Significant technological advances in the last 10 years alone, from the explosion of open source software to the big data revolution, have yielded what The Economist now calls the fourth industrial revolution.

In this talk we'll go beyond the hype and understand what data science really is, where it came from and where it's going. We'll demystify the field of machine learning, explain the terminology and see examples of how popular techniques are being used in practice. Finally, we'll see how you can get started by exploring the PyData ecosystem and provide concrete next steps towards a career in what the Harvard Business Review has dubbed "the sexiest job of the 21st century".

John Sandall

April 11, 2018

Transcript

  1. A Brief Introduction to Data Science, Machine Learning and the PyData Ecosystem. John Sandall (@john_sandall), Data Science & Engineering Consultant. Cardiff, 11th April 2018.
  2. AGENDA
    I. What is Data Science?
    II. The Last 10 Years
    III. Machine Learning 101
    IV. The PyData Ecosystem
    V. Tips For Success
  3. BREAK INTO DATA SCIENCE PART I. WHAT IS DATA SCIENCE?

  4. WHAT IS DATA SCIENCE?

  5. WHAT IS DATA SCIENCE?

  6. WHAT IS DATA SCIENCE?

  7. THE QUALITIES OF A DATA SCIENTIST (source: http://www.dataists.com/2010/09/the-data-science-venn-diagram/)
  8. WHAT IS DATA SCIENCE?
    - A set of tools & techniques used to extract useful information from data.
    - An interdisciplinary, problem-solving oriented subject.
    - The application of scientific techniques to practical problems.
    - A rapidly growing field.
  9. EARLY ADOPTERS OF DATA SCIENCE & MACHINE LEARNING / EARLY ADOPTERS OF DATA SCIENCE & ENGINEERING
  10. PART II. THE LAST 10 YEARS

  11. 2007: A PIVOTAL YEAR
    - iPhone released
    - Android launches
  12. 2007: FACEBOOK & TWITTER BOTH GO GLOBAL
  13. 2007: INFORMATION REVOLUTION
    - Kindle launches
    - IBM Watson created
  14. 2007: INFORMATION REVOLUTION
    - Kindle launches
    - IBM Watson created
    - IBM Watson wins the Jeopardy! TV gameshow in 2011
  15. 2007: THE OPEN SOURCE ECOSYSTEM ACCELERATES
  16. 2015: PYTHON & R BECOME ENTERPRISE FRIENDLY / 2007: R & PYTHON START TO BE ENTERPRISE FRIENDLY
  17. 2007: THE BIG DATA REVOLUTION BEGINS
  18. EVOLUTION OF DATA CREATION
    - 2011: Every two days we create more information than we did up until 2003 (around two exabytes).
    - 1 exabyte (EB) = 1000 petabytes (PB) = 1 billion gigabytes (GB)
  19. EVOLUTION OF DATA CREATION
    - 2014: Oracle estimates total data created annually now surpasses five zettabytes (ZB).
  20. EVOLUTION OF DATA CREATION
    - 2016: Data is growing at a 40 percent compound annual rate, now hitting over 10ZB annually.
  21. EVOLUTION OF DATA CREATION
    - Forecasts suggest annual data creation will hit nearly 45ZB by 2020.
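
The growth figures on slides 20 and 21 can be sanity-checked with a few lines of arithmetic. This is only a sketch: it assumes the 10ZB figure applies to 2016 and that the 40 percent rate holds unchanged until 2020, and the function name `project` is illustrative rather than anything from the talk.

```python
# Compound growth: value after n years = start * (1 + rate) ** n
START_YEAR = 2016
END_YEAR = 2020
START_ZB = 10.0      # slide 20: over 10ZB created annually (assumed to mean 2016)
ANNUAL_RATE = 0.40   # slide 20: 40 percent compound annual growth


def project(start, rate, years):
    """Return the value reached after compounding `rate` for `years` years."""
    return start * (1 + rate) ** years


years = END_YEAR - START_YEAR
projected_2020 = project(START_ZB, ANNUAL_RATE, years)

# Annual growth rate that would be needed to reach the 45ZB forecast on slide 21
required_rate = (45.0 / START_ZB) ** (1 / years) - 1

print(f"Projected for 2020 at 40% a year: {projected_2020:.1f}ZB")
print(f"Rate needed to reach 45ZB: {required_rate:.1%}")
```

Under these assumptions the projection lands somewhat below the 45ZB forecast, which suggests the forecast assumes slightly faster growth or a higher starting figure.
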
  22. WHERE IS DATA COMING FROM?
  23. DEFINING BIG DATA: WELCOME TO DATA OBESITY! (source: http://www.datasciencecentral.com/profiles/blogs/basic-understanding-of-big-data-what-is-this-and-how-it-is-going)
  24. BIG DATA: A CAUTIONARY TALE
  25. BIG DATA: A CAUTIONARY TALE
  26. OTHER APPLICATIONS OF DATA SCIENCE & MACHINE LEARNING
    1. Search engines
    2. Recommendation systems
    3. Image recognition
    4. Speech recognition
    5. Gaming
    6. Price comparison/optimisation
    7. Route planning (driving, airlines, social network virality!)
    8. Fraud / risk detection
    9. Logistics (deliveries of goods, of people, of data)
    10. Self-driving cars
    11. Robots & AI assistants
    12. ...
  27. PART III. MACHINE LEARNING

  28. YOU ARE HERE!

  29. WHAT IS MACHINE LEARNING?
    From Wikipedia: "Machine learning, a branch of artificial intelligence, is about the construction and study of systems that can learn from data." "The core of machine learning deals with representation and generalisation..."
    - representation: extracting structure from data
    - generalisation: making predictions from data
  30. TYPES OF MACHINE LEARNING PROBLEM
    - supervised: making predictions (generalisation)
    - unsupervised: extracting structure (representation)
  31. TYPES OF MACHINE LEARNING PROBLEM
    - supervised: making predictions
    - unsupervised: extracting structure
  32. TYPES OF MACHINE LEARNING PROBLEM
    - supervised: making predictions
    - unsupervised: extracting structure
  33. TYPES OF DATA
    - continuous (quantitative), e.g. height
    - categorical (qualitative), e.g. eye colour
  34. TYPES OF ML PROBLEMS
                  continuous             categorical
    supervised    regression             classification
    unsupervised  dimensional reduction  clustering
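
The two-by-two grid on slide 34 can be read as a lookup: pick the kind of learning and the kind of data, and the grid names the problem. A minimal sketch follows; the names `PROBLEM_TYPES` and `problem_type` are illustrative and not from the talk.

```python
# Keys: (kind of learning, kind of data). Values: the problem type from slide 34.
PROBLEM_TYPES = {
    ("supervised", "continuous"): "regression",
    ("supervised", "categorical"): "classification",
    ("unsupervised", "continuous"): "dimensional reduction",
    ("unsupervised", "categorical"): "clustering",
}


def problem_type(learning, data):
    """Look up the problem type for a kind of learning and a kind of data."""
    return PROBLEM_TYPES[(learning, data)]


# Predicting a quantity such as height is regression
print(problem_type("supervised", "continuous"))
# Grouping unlabelled items into categories is clustering
print(problem_type("unsupervised", "categorical"))
```
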
  35. REGRESSION EXAMPLE: PREDICTING IPHONE SALES
    - GDP
    - population
    - Gini
    - phone penetration %
    - GDP growth rate
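
To make the regression idea concrete, here is a minimal sketch of ordinary least squares with a single input, written in plain Python rather than scikit-learn so it runs anywhere. It uses only one of the slide's inputs (GDP), and every number is invented for illustration; none of it is real sales or economic data. The function name `fit_line` is likewise an assumption.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one input: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    return slope, mean_y - slope * mean_x


# Made-up illustrative figures, one pair per hypothetical country
gdp = [1.0, 2.0, 3.0, 4.0]
sales = [3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_line(gdp, sales)

# Predict sales for an unseen country with a hypothetical GDP of 5.0
predicted = slope * 5.0 + intercept
print(slope, intercept, predicted)
```

A real model would use several inputs at once (population, Gini, phone penetration and so on), which is where a library such as scikit-learn or statsmodels becomes worthwhile.
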
  36. CLASSIFICATION EXAMPLE: SPAM FILTERING
    - Bargain
    - $$$
    - 100% free
    - Act now!
    - All natural
    - As seen on
    - Satisfaction guaranteed
    - !!!
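
One common way to turn phrases like these into a classifier is naive Bayes over word counts. The sketch below is a toy version in plain Python, not the method shown in the talk: the training messages, the "spam"/"ham" labels and the function names are all made up for illustration, and a real filter would need far more data.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it on whitespace."""
    return text.lower().split()


def train(examples):
    """Count words per label and messages per label from (text, label) pairs."""
    word_counts = {}
    doc_counts = Counter()
    for text, label in examples:
        doc_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(tokenize(text))
    return word_counts, doc_counts


def predict(text, word_counts, doc_counts):
    """Return the label with the highest naive Bayes log-probability."""
    vocabulary = set()
    for counts in word_counts.values():
        vocabulary.update(counts)
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, -math.inf
    for label, counts in word_counts.items():
        score = math.log(doc_counts[label] / total_docs)  # prior for the label
        total_words = sum(counts.values())
        for word in tokenize(text):
            # Add-one smoothing so unseen words do not zero out the score
            score += math.log((counts[word] + 1) / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical training messages; spam examples reuse phrases from the slide
TRAINING = [
    ("bargain 100% free act now", "spam"),
    ("satisfaction guaranteed all natural", "spam"),
    ("as seen on tv act now", "spam"),
    ("meeting agenda for monday", "ham"),
    ("notes from the pydata meetup", "ham"),
    ("lunch on friday", "ham"),
]

word_counts, doc_counts = train(TRAINING)
print(predict("free bargain act now", word_counts, doc_counts))
print(predict("agenda for the meetup", word_counts, doc_counts))
```
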
  37. CLUSTERING EXAMPLE: USER LOCATIONS (longitude)

  38. CLUSTERING EXAMPLE: USER LOCATIONS (longitude, latitude, town)
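
As an illustration of clustering user locations, here is a small k-means sketch in plain Python. The talk does not say which algorithm was used, so k-means is an assumption, as are the coordinates (invented points roughly around two towns) and the choice of starting centroids. It also treats longitude and latitude as flat coordinates, which is only reasonable over short distances.

```python
def kmeans(points, centroids, iterations=10):
    """Plain k-means on (longitude, latitude) pairs with given starting centroids."""
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid
        clusters = [[] for _ in centroids]
        for lon, lat in points:
            distances = [(lon - c_lon) ** 2 + (lat - c_lat) ** 2
                         for c_lon, c_lat in centroids]
            clusters[distances.index(min(distances))].append((lon, lat))
        # Update step: move each centroid to the mean of its points
        centroids = [
            (sum(p[0] for p in cluster) / len(cluster),
             sum(p[1] for p in cluster) / len(cluster)) if cluster else old
            for cluster, old in zip(clusters, centroids)
        ]
    return centroids, clusters


# Hypothetical user locations as (longitude, latitude)
locations = [
    (-3.18, 51.48), (-3.20, 51.49), (-3.17, 51.47),   # one group of users
    (-2.59, 51.45), (-2.60, 51.46), (-2.58, 51.44),   # another group of users
]

# Start from one point in each apparent group (an arbitrary choice)
centroids, clusters = kmeans(locations, [locations[0], locations[3]])
for centre, members in zip(centroids, clusters):
    print(centre, len(members))
```

Each resulting centroid can then be labelled with the nearest town, which corresponds to the "town" label added on slide 38.
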
  39. DIMENSIONAL REDUCTION EXAMPLE: A STOCK INDEX
  40. DIMENSIONAL REDUCTION EXAMPLE: A STOCK INDEX
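
A stock index reduces many prices to a single number, which is why it works as an everyday example of dimensional reduction. The sketch below uses a fixed weighted average; the tickers, prices and weights are invented, and the function name `index_value` is illustrative. Techniques such as principal component analysis learn the weights from the data instead, which this sketch does not attempt.

```python
def index_value(prices, weights=None):
    """Collapse many prices into one number using a weighted average."""
    if weights is None:
        # Default to equal weighting across all stocks
        weights = {name: 1 / len(prices) for name in prices}
    return sum(prices[name] * weights[name] for name in prices)


# Hypothetical tickers and prices
prices = {"AAA": 100.0, "BBB": 50.0, "CCC": 150.0, "DDD": 100.0}

equal_weighted = index_value(prices)

# Hypothetical weights that favour two of the stocks (they sum to 1)
custom = {"AAA": 0.4, "BBB": 0.1, "CCC": 0.4, "DDD": 0.1}
custom_weighted = index_value(prices, custom)

print(equal_weighted, custom_weighted)
```
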
  41. PART IV. THE PYDATA ECOSYSTEM

  42. ...GET STARTED TONIGHT! WHY PYTHON?

  43. POWERED BY PYTHON

  44. POWERED BY PYTHON

  45. START AT PYDATA.ORG

  46. UPCOMING EVENTS

  47. MEETUPS

  48. DOWNLOADS & SPONSORED PROJECTS

  49. PACKAGES TO START WITH
    - pandas: manipulate data
    - SciPy/NumPy: scientific computing and numerical calculations
    - scikit-learn: machine learning
    - matplotlib/seaborn: data visualisation
    - spaCy/NLTK: natural language processing
    - statsmodels: statistical tests
    - Beautiful Soup: HTML/XML data & web scrapers
    - Jupyter: interactive programming environment
  50. MY MOST USED PACKAGES
    - pandas: manipulate data
    - SciPy/NumPy: scientific computing and numerical calculations
    - scikit-learn: machine learning
    - matplotlib/seaborn: data visualisation
    - spaCy/NLTK: natural language processing
    - statsmodels: statistical tests
    - Beautiful Soup: HTML/XML data & web scrapers
    - Jupyter: interactive programming environment
  51. JUPYTER NOTEBOOK: Jupyter Notebook is a web interface that lets us use formatting alongside our code.
  52. PART V. TIPS FOR SUCCESS

  53. NEW JOB
    - In April 2012 McKinsey predicted a shortage of 1.5 million data scientists
    - More and more companies are looking for people to unlock the value in their data
    - Rise in available positions
  54. SHORTAGE OF SKILLS
    - Many companies struggle to recruit in this area
    - Traditional analysts too focused on specific tools
    - Many programmers don’t have business experience
    - Because the field is new there are few people with leadership skills
  55. BOOKS

  56. MY TOP 3 BOOK RECOMMENDATIONS
  57. ONLINE COURSES
    - Machine Learning, Andrew Ng (Stanford)
    - Machine Learning, CalTech CS156
    - www.dataquest.io: Write code, work with data, build projects in your browser.
    - swirlstats.com: "Learn R, in R"
    - www.datacamp.com: "Learn data analysis from the comfort of your browser" (R, Python, DataViz)
  58. PODCASTS
    ‣ Data Skeptic (Kyle Polich, I ❤ the mini-explainer episodes!)
    ‣ Partially Derivative (light-hearted)
    ‣ Linear Digressions (Udacity)
    ‣ More or Less (Tim Harford & BBC Radio 4)
    ‣ O’Reilly Data Show (Ben Lorica, technical with more focus on data engineering)
    ‣ Planet Money (NPR, economics/data/finance: A/B testing, multiple comparisons)
    ‣ What's The Point (FiveThirtyEight, how data is changing our lives)
    ‣ Science Vs (Gimlet Media, new last summer, controversial issues + rigour)
  59. LONDON MEETUPS
    ‣ PyData London
    ‣ LondonR
    ‣ Data Science Meetup London
    ‣ Big Data London
    ‣ London Machine Learning Meetup
    ‣ Quantified Self
    ‣ Predictive Analytics London Meetup
    ‣ Data Visualization Meetup
    ‣ PyLadies London
    ‣ Women in Data
    ‣ Londata
    ‣ Data Science Journal Club

  60. BRISTOL MEETUPS!
    ‣ PyData Bristol
    ‣ Bristol Data Scientists
    ‣ Big Data Bristol
    ‣ South West Data Meetup
    ‣ Bath Machine Learning Meetup
    ‣ Bristol Digital Analytics Meetup
    ‣ SQL Bristol
    ‣ Cardiff R User Group
    ‣ Bristech
    ‣ South West Futurists
    ‣ CodeHub Bristol
    ‣ Bath: Hacked
    LONDON MEETUPS
    ‣ PyData London
    ‣ LondonR
    ‣ Data Science Meetup London
    ‣ Big Data London
    ‣ London Machine Learning Meetup
    ‣ Quantified Self
    ‣ Predictive Analytics London Meetup
    ‣ Data Visualization Meetup
    ‣ PyLadies London
    ‣ Women in Data
    ‣ Londata
    ‣ Data Science Journal Club

  61. HACKATHONS & DATADIVES
    ‣ DataKind
    ‣ NHS Hack
    ‣ Kaggle
    ‣ UK Hackathons & Jams Meetup
    ‣ StartupWeekend
    ‣ Code for Good
    ‣ Bath: Hacked ("We liberate data, and make useful things")
  62. FINAL THOUGHTS

  63. “BECOME A DATA SCIENTIST WITH THESE 4 WEIRD TIPS”: FOUR STEPS TO SUCCESS
    1. Learn to code: Python. R. Professional software engineering practices.
    2. Get statistical: Significance. Inference. Regression. Machine learning.
    3. Learn lean: Business skills. Startup methodology. Communication.
    4. Experience: Side projects. GitHub. Kaggle. Hackathons. Stand out.
  64. IF YOU DO NOTHING ELSE...

  65. IF YOU DO NOTHING ELSE... ...GET STARTED TONIGHT!
    ‣ Data Skeptic Podcast: dataskeptic.com
  66. CONCLUSION: FINAL THOUGHTS
    - Data science is a product of our time
    - Being a data scientist requires people and technical skills
    - We’re only getting started…
  67. THANK YOU @PyDataCardiff @john_sandall

  68. QUESTIONS? @PyDataCardiff @john_sandall