A Brief Introduction to Data Science, Machine Learning and the PyData Ecosystem

Presented at the PyData Cardiff meetup, Cardiff, April 2018.

https://www.meetup.com/PyData-Cardiff-Meetup/events/249298761/

---

Once confined to the corridors of academia, data science and machine learning are now having a massive impact on the world of business and beyond. Significant technological advances in the last 10 years alone, from the explosion of open source software to the big data revolution, have yielded what The Economist now calls the fourth industrial revolution.

In this talk we'll go beyond the hype and understand what data science really is, where it came from and where it's going. We'll demystify the field of machine learning, explain the terminology and see examples of how popular techniques are being used in practice. Finally, we'll see how you can get started by exploring the PyData ecosystem and provide concrete next steps towards a career in what the Harvard Business Review has dubbed "the sexiest job of the 21st century".

John Sandall

April 11, 2018

Transcript

  1. A Brief Introduction to Data Science, Machine Learning and the PyData Ecosystem
    Cardiff
    John Sandall
    Data Science & Engineering Consultant
    @john_sandall
    11th April 2018

  2. AGENDA
    I. What is Data Science?
    II. The Last 10 Years
    III. Machine Learning 101
    IV. The PyData Ecosystem
    V. Tips For Success

  3. BREAK INTO DATA SCIENCE
    PART I.
    WHAT IS DATA SCIENCE?

  4. WHAT IS DATA SCIENCE?

  5. WHAT IS DATA SCIENCE?

  6. WHAT IS DATA SCIENCE?

  7. THE QUALITIES OF A DATA SCIENTIST
    source: http://www.dataists.com/2010/09/the-data-science-venn-diagram/

  8. WHAT IS DATA SCIENCE?
    - A set of tools & techniques used to extract useful information from data.
    - An interdisciplinary, problem-solving oriented subject.
    - The application of scientific techniques to practical problems.
    - A rapidly growing field.

  9. EARLY ADOPTERS OF DATA SCIENCE & MACHINE LEARNING
    EARLY ADOPTERS OF DATA SCIENCE & ENGINEERING

  10. PART II.
    THE LAST 10 YEARS

  11. 2007: A PIVOTAL YEAR
    iPhone released
    Android launches

  12. 2007: FACEBOOK & TWITTER BOTH GO GLOBAL

  13. 2007: INFORMATION REVOLUTION
    Kindle launches
    IBM Watson created

  14. 2007: INFORMATION REVOLUTION
    Kindle launches
    IBM Watson created
    IBM Watson wins the Jeopardy! TV game show in 2011

  15. 2007: THE OPEN SOURCE ECOSYSTEM ACCELERATES

  16. 2007: R & PYTHON START TO BE ENTERPRISE FRIENDLY
    2015: PYTHON & R BECOME ENTERPRISE FRIENDLY

  17. 2007: THE BIG DATA REVOLUTION BEGINS

  18. EVOLUTION OF DATA CREATION
    2011: Every two days we create more information than we did up until 2003 (around two exabytes).
    1 exabyte (EB) = 1000 petabytes (PB) = 1 billion gigabytes (GB)

  19. EVOLUTION OF DATA CREATION
    2014: Oracle estimates total data created annually now surpasses five zettabytes.

  20. EVOLUTION OF DATA CREATION
    2016: Data is growing at a 40 percent compound annual rate, now hitting over 10ZB annually.

  21. EVOLUTION OF DATA CREATION
    Forecasts suggest annual data creation will hit nearly 45ZB by 2020.
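
The growth figures on the preceding slides can be sanity-checked with a few lines of Python. This is a minimal sketch, not part of the original deck: the function name `project_volume` is hypothetical, and the inputs (about 10 ZB created in 2016, growing at a 40 percent compound annual rate) are read straight off the slides.

```python
def project_volume(start_zb: float, annual_growth: float, years: int) -> float:
    """Compound a starting data volume forward at a fixed annual growth rate."""
    return start_zb * (1 + annual_growth) ** years


# Inputs taken from the slides: ~10 ZB in 2016, 40% compound annual growth.
projected_2020 = project_volume(start_zb=10, annual_growth=0.40, years=4)
print(f"Projected annual data creation in 2020: {projected_2020:.1f} ZB")
```

Starting from exactly 10 ZB this lands at roughly 38 ZB. Because the 2016 slide says "over 10ZB", a somewhat higher starting point would be consistent with the forecast of nearly 45ZB.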

  22. WHERE IS DATA COMING FROM?

  23. DEFINING BIG DATA
    WELCOME TO DATA OBESITY!
    http://www.datasciencecentral.com/profiles/blogs/basic-understanding-of-big-data-what-is-this-and-how-it-is-going

  24. BIG DATA: A CAUTIONARY TALE

  25. BIG DATA: A CAUTIONARY TALE

  26. OTHER APPLICATIONS OF DATA SCIENCE & MACHINE LEARNING
    1. Search engines
    2. Recommendation systems
    3. Image recognition
    4. Speech recognition
    5. Gaming
    6. Price comparison/optimisation
    7. Route planning (driving, airlines, social network virality!)
    8. Fraud / risk detection
    9. Logistics (deliveries of goods, of people, of data)
    10. Self-driving cars
    11. Robots & AI assistants
    12. ...

  27. PART III.
    MACHINE LEARNING

  28. YOU ARE HERE!

  29. WHAT IS MACHINE LEARNING?
    From Wikipedia:
    "Machine learning, a branch of artificial intelligence, is about the construction and study of systems that can learn from data."
    "The core of machine learning deals with representation and generalisation..."
    representation – extracting structure from data
    generalisation – making predictions from data

  30. TYPES OF MACHINE LEARNING PROBLEM
    supervised: making predictions (generalisation)
    unsupervised: extracting structure (representation)

  31. TYPES OF MACHINE LEARNING PROBLEM
    supervised: making predictions
    unsupervised: extracting structure

  32. TYPES OF MACHINE LEARNING PROBLEM
    supervised: making predictions
    unsupervised: extracting structure

  33. TYPES OF DATA
    continuous: quantitative, e.g. height
    categorical: qualitative, e.g. eye colour
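
The continuous/categorical distinction maps directly onto column types in pandas. The snippet below is a small sketch with made-up records; the column names `height_cm` and `eye_colour` are hypothetical and simply follow the slide's two examples.

```python
import pandas as pd

# Hypothetical toy records illustrating the two kinds of data on the slide.
people = pd.DataFrame({
    "height_cm": [162.5, 178.0, 171.2, 185.4],           # continuous / quantitative
    "eye_colour": ["brown", "blue", "green", "brown"],   # categorical / qualitative
})

# Tell pandas explicitly that eye colour takes one of a fixed set of labels.
people["eye_colour"] = people["eye_colour"].astype("category")

# Continuous data supports arithmetic summaries such as the mean...
mean_height = people["height_cm"].mean()

# ...whereas categorical data is summarised by counting each label.
colour_counts = people["eye_colour"].value_counts()

print(people.dtypes)
print(mean_height)
print(colour_counts)
```

Which kind of data the target variable is decides which family of methods applies: a continuous target suggests regression, a categorical one classification.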

  34. TYPES OF ML PROBLEMS
                    continuous              categorical
    supervised      regression              classification
    unsupervised    dimensional reduction   clustering

  35. BIG DATA
    REGRESSION EXAMPLE: PREDICTING IPHONE SALES
    GDP
    population
    Gini
    phone penetration %
    GDP growth rate
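
A regression problem like the one on this slide could be sketched with scikit-learn as follows. Everything here is illustrative: the data is randomly generated rather than real sales figures, the feature names are hypothetical stand-ins for the slide's inputs, and ordinary least squares is an assumption, since the deck does not say which model was used.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Hypothetical feature names mirroring the inputs listed on the slide.
feature_names = ["gdp", "population", "gini", "phone_penetration_pct", "gdp_growth_rate"]

# Synthetic, standardised features for 200 imaginary countries.
X = rng.normal(size=(200, len(feature_names)))

# Made-up "true" relationship plus a little noise, so there is something to recover.
true_weights = np.array([3.0, 2.0, -1.0, 4.0, 0.5])
y = X @ true_weights + rng.normal(scale=0.1, size=200)

# Supervised learning with a continuous target: fit, then inspect the fit.
model = LinearRegression().fit(X, y)
r_squared = model.score(X, y)

for name, coef in zip(feature_names, model.coef_):
    print(f"{name:>24}: {coef:+.2f}")
print(f"R^2 on the training data: {r_squared:.3f}")
```

On real data the fit would be evaluated on held-out examples rather than on the training set.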

  36. CLASSIFICATION EXAMPLE: SPAM FILTERING
    Bargain
    $$$
    100% free
    Act now!
    All natural
    As seen on
    Satisfaction guaranteed
    !!!
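
As a rough illustration of spam filtering as a classification task, the sketch below trains a bag-of-words naive Bayes classifier with scikit-learn. The messages are invented for the example (the spam ones reuse phrases from the slide), and naive Bayes is an assumption chosen for simplicity; the deck does not name a specific algorithm.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny hand-written training set; real filters learn from many thousands of emails.
messages = [
    "bargain 100% free act now",
    "satisfaction guaranteed $$$ act now",
    "all natural bargain as seen on tv",
    "free free free act now !!!",
    "meeting moved to friday afternoon",
    "can you review my pull request",
    "lunch at noon tomorrow",
    "notes from the pydata meetup",
]
labels = ["spam"] * 4 + ["ham"] * 4

# Word counts go in as features; the classifier predicts a categorical label.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(messages, labels)

predictions = model.predict([
    "act now for a free bargain",
    "review the meeting notes tomorrow",
])
print(list(predictions))
```

Note that the default tokenizer drops pure punctuation such as `$$$` and `!!!`, so a production filter would need features that capture those signals too.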

  37. CLUSTERING EXAMPLE: USER LOCATIONS
    longitude

  38. CLUSTERING EXAMPLE: USER LOCATIONS
    longitude
    latitude
    town
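
Grouping user locations into towns can be sketched with k-means clustering. This is an illustrative example only: the points are randomly generated around approximate coordinates for three towns, and both the choice of k-means and the value `n_clusters=3` are assumptions rather than anything stated in the deck.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)

# Approximate (longitude, latitude) centres, used only to generate fake users.
town_centres = {
    "Cardiff": (-3.18, 51.48),
    "Bristol": (-2.59, 51.45),
    "Bath": (-2.36, 51.38),
}

# 50 synthetic user locations scattered around each centre, stacked town by town.
points = np.vstack([
    rng.normal(loc=centre, scale=0.02, size=(50, 2))
    for centre in town_centres.values()
])

# Unsupervised learning: no town labels are given, only the coordinates.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(points)
labels = kmeans.labels_

print(kmeans.cluster_centers_.round(2))
```

The algorithm never sees the town names; it only recovers three groups, which a person then interprets as towns.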

  39. DIMENSIONAL REDUCTION EXAMPLE: A STOCK INDEX

  40. DIMENSIONAL REDUCTION EXAMPLE: A STOCK INDEX
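
One way to read the stock index example is that many individual price series are summarised by a single number. The sketch below uses principal component analysis (PCA) on synthetic returns to build such a summary. This is an assumption made for illustration: real indices are usually weighted averages (for example by market capitalisation), and the deck does not say PCA was the method shown.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
n_days, n_stocks = 250, 10

# Synthetic daily returns: each stock follows a shared "market" factor plus its own noise.
market = rng.normal(scale=0.01, size=n_days)
betas = rng.uniform(0.5, 1.5, size=n_stocks)
noise = rng.normal(scale=0.005, size=(n_days, n_stocks))
returns = market[:, None] * betas + noise

# Reduce ten columns to one: the first principal component acts as a crude index.
pca = PCA(n_components=1)
index = pca.fit_transform(returns).ravel()

explained = pca.explained_variance_ratio_[0]
correlation = abs(np.corrcoef(index, market)[0, 1])

print(f"Variance explained by one component: {explained:.0%}")
print(f"Correlation with the hidden market factor: {correlation:.2f}")
```

The sign of a principal component is arbitrary, which is why the correlation is reported as an absolute value.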

  41. PART IV.
    THE PYDATA ECOSYSTEM

  42. ...GET STARTED TONIGHT!
    WHY PYTHON?

  43. POWERED BY PYTHON

  44. POWERED BY PYTHON

  45. START AT PYDATA.ORG

  46. UPCOMING EVENTS

  47. MEETUPS

  48. DOWNLOADS & SPONSORED PROJECTS

  49. PACKAGES TO START WITH
    pandas: manipulate data
    SciPy/NumPy: scientific computing and numerical calculations
    scikit-learn: machine learning
    matplotlib/seaborn: data visualisation
    spaCy/NLTK: natural language processing
    statsmodels: statistical tests
    Beautiful Soup: HTML/XML data & web scrapers
    Jupyter: interactive programming environment

  50. MY MOST USED PACKAGES
    pandas: manipulate data
    SciPy/NumPy: scientific computing and numerical calculations
    scikit-learn: machine learning
    matplotlib/seaborn: data visualisation
    spaCy/NLTK: natural language processing
    statsmodels: statistical tests
    Beautiful Soup: HTML/XML data & web scrapers
    Jupyter: interactive programming environment

  51. JUPYTER NOTEBOOK
    Jupyter Notebook is a web interface that lets us use formatted text alongside our code.

  52. PART V.
    TIPS FOR SUCCESS

  53. NEW JOB
    - In April 2012 McKinsey predicted a shortage of 1.5 million data scientists
    - More and more companies are looking for people to unlock the value in their data
    - Rise in available positions

  54. SHORTAGE OF SKILLS
    - Many companies struggle to recruit in this area
    - Traditional analysts too focused on specific tools
    - Many programmers don’t have business experience
    - Because the field is new there are few people with leadership skills

  55. BOOKS

  56. MY TOP 3 BOOK RECOMMENDATIONS

  57. ONLINE COURSES
    Machine Learning, Andrew Ng (Stanford)
    Machine Learning, Caltech CS156
    www.dataquest.io: Write code, work with data, build projects in your browser.
    swirlstats.com: "Learn R, in R"
    www.datacamp.com: "Learn data analysis from the comfort of your browser" (R, Python, DataViz)

  58. PODCASTS
    ‣ Data Skeptic (Kyle Polich, I ❤ the mini-explainer episodes!)
    ‣ Partially Derivative (light-hearted)
    ‣ Linear Digressions (Udacity)
    ‣ More or Less (Tim Harford & BBC Radio 4)
    ‣ O’Reilly Data Show (Ben Lorica, technical with more focus on data engineering)
    ‣ Planet Money (NPR, economics/data/finance – A/B testing, multiple comparisons)
    ‣ What's The Point (FiveThirtyEight, how data is changing our lives)
    ‣ Science Vs (Gimlet Media, new last summer, controversial issues + rigour)

  59. LONDON MEETUPS
    ‣ PyData London
    ‣ LondonR
    ‣ Data Science Meetup London
    ‣ Big Data London
    ‣ London Machine Learning Meetup
    ‣ Quantified Self
    ‣ Predictive Analytics London Meetup
    ‣ Data Visualization Meetup
    ‣ PyLadies London
    ‣ Women in Data
    ‣ Londata
    ‣ Data Science Journal Club

  60. BRISTOL MEETUPS
    ‣ PyData Bristol
    ‣ Bristol Data Scientists
    ‣ Big Data Bristol
    ‣ South West Data Meetup
    ‣ Bath Machine Learning Meetup
    ‣ Bristol Digital Analytics Meetup
    ‣ SQL Bristol
    ‣ Cardiff R User Group
    ‣ Bristech
    ‣ South West Futurists
    ‣ CodeHub Bristol
    ‣ Bath: Hacked
    LONDON MEETUPS
    ‣ PyData London
    ‣ LondonR
    ‣ Data Science Meetup London
    ‣ Big Data London
    ‣ London Machine Learning Meetup
    ‣ Quantified Self
    ‣ Predictive Analytics London Meetup
    ‣ Data Visualization Meetup
    ‣ PyLadies London
    ‣ Women in Data
    ‣ Londata
    ‣ Data Science Journal Club

  61. HACKATHONS & DATADIVES
    ‣ DataKind
    ‣ NHS Hack
    ‣ Kaggle
    ‣ UK Hackathons & Jams Meetup
    ‣ StartupWeekend
    ‣ Code for Good
    ‣ Bath: Hacked
    "We liberate data, and make useful things"

  62. FINAL THOUGHTS

  63. “BECOME A DATA SCIENTIST WITH THESE 4 WEIRD TIPS”
    FOUR STEPS TO SUCCESS
    1. Learn to code
    Python. R. Professional software engineering practices.
    2. Get statistical
    Significance. Inference. Regression. Machine learning.
    3. Learn lean
    Business skills. Startup methodology. Communication.
    4. Experience
    Side projects. GitHub. Kaggle. Hackathons. Stand out.

  64. IF YOU DO NOTHING ELSE...

  65. IF YOU DO NOTHING ELSE... GET STARTED TONIGHT!
    ‣ Data Skeptic Podcast
    dataskeptic.com

  66. FINAL THOUGHTS
    - Data science is a product of our time
    - Being a data scientist requires people and technical skills
    - We’re only getting started…

  67. THANK YOU
    @PyDataCardiff @john_sandall

  68. QUESTIONS?
    @PyDataCardiff @john_sandall