Adding Jupyter Notebooks in your Data Analytics Toolbag

Lots of people use spreadsheets as pretty much their only tool for their work. This talk is about adding Jupyter Notebooks as well, to analyse the data collected with spreadsheets in a repeatable and verifiable way.

Harish Pillay

February 24, 2020

Transcript

  1. Data Analytics for fun and profit
    Harish Pillay
    24 February 2020
    [email protected], @harishpillay
    https://tinyurl.com/sygs4bm

  2.
    If the only tool you have is a hammer, everything
    looks like a nail.

  3.
    Spreadsheets
    1. Spreadsheets have been the natural habitat of all
    finance-related people for the last 30 years
    2. Lots of experience in using them (like pivot-tables for
    example)
    3. Data size is limited to the desktop - even the SaaS
    providers place constraints
    4. Scripting with spreadsheet macros - not via languages
    like Python
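To make the contrast with spreadsheet macros concrete, here is a minimal sketch of a pivot-table-style summary written in Python, the kind of cell one might run in a notebook. The data, the column names and the `pivot_sum` helper are all hypothetical and are not taken from the talk. Only the standard library is used, so it runs as-is.

```python
import csv
import io
from collections import defaultdict

# Hypothetical data, standing in for a sheet exported as CSV.
SHEET_CSV = """region,quarter,amount
North,Q1,120.50
North,Q2,99.50
South,Q1,80.00
South,Q2,110.25
"""


def pivot_sum(csv_text, row_key, value_key):
    """Sum value_key for each distinct row_key, like a one-dimensional pivot table."""
    totals = defaultdict(float)
    for row in csv.DictReader(io.StringIO(csv_text)):
        totals[row[row_key]] += float(row[value_key])
    return dict(totals)


totals_by_region = pivot_sum(SHEET_CSV, "region", "amount")
for region, total in sorted(totals_by_region.items()):
    print(f"{region}: {total:.2f}")
```

Because the whole calculation is plain text, it can be re-run on a fresh export, reviewed line by line and kept under version control.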

  4.
    Issues with spreadsheets
    1. Cascading Errors - one mistake, and it snowballs
    (https://www.teampay.co/insights/biggest-excel-mistakes-of-all-time/).
    2. Scalability - limited in size of sheets
    3. Performance - limited by the system that it is used on
    4. Testing - almost impossible to test the correctness of a sheet
    5. Traceability/Debugging - tiny changes can impact formulae and make
    things very difficult
    6. All Inclusive - The data and calculations are all contained within the
    spreadsheet file and run from a local computer.
    7. Operational Risk - All spreadsheets start as small/quick-fix calculations,
    but some turn into permanent enterprise-wide solutions that feed a
    number of business processes. The integrity of many financial,
    operational and regulatory processes is then threatened by a lack of
    visibility of the entire lineage.
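The testing and traceability points above can be illustrated with a short sketch. Once a calculation lives in a named function, it can be checked with assertions, and a bad cell stops the run instead of quietly cascading. The function names and sample values here are hypothetical and only meant as an illustration.

```python
def parse_amount(cell):
    """Convert a spreadsheet cell to a float, failing loudly on bad input."""
    text = str(cell).replace(",", "").strip()
    try:
        return float(text)
    except ValueError:
        raise ValueError(f"not a number: {cell!r}") from None


def column_total(cells):
    """Add up a column of cells after validating each one."""
    return sum(parse_amount(cell) for cell in cells)


def check_column_total():
    """A tiny repeatable test that can be re-run after every change."""
    # Thousands separators and stray spaces are tolerated.
    assert column_total(["1,200", "300.5", " 99.5 "]) == 1600.0
    # A non-numeric cell must raise instead of being silently skipped.
    try:
        column_total(["100", "n/a"])
    except ValueError:
        pass
    else:
        raise AssertionError("bad cell should have been rejected")


check_column_total()
print("all checks passed")
```

A spreadsheet formula would typically carry on with an error value or skip the text cell. Here the failure appears at the exact cell that caused it.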

  5.
    Do please use spreadsheets when
    1. Correctness and accuracy are not a priority
    2. Data is not too big (i.e. no need for scalability)
    3. No need for real-time updates
    4. Using them as scratch pad to quickly put a
    prototype together
    5. No need for long term maintenance.

  6.
    https://jupyter.org
    By Cameron Oelsen - https://github.com/jupyter/jupyter.github.io/blob/master/assets/main-logo.svg, BSD,
    https://commons.wikimedia.org/w/index.php?curid=68763478

  7.
    Why Jupyter?

  8.
    Why?
    1. To tap into modern techniques of data analysis via the browser
    2. Polyglot language support via kernels (language engines) - Python, R,
    Julia and 138 others - https://github.com/jupyter/jupyter/wiki/Jupyter-kernels
    3. Rapid innovation in software is driven by the tsunami of data being
    generated and stored. These need fundamentally different tools for
    analytics that have to be lightweight, fast, repeatable, scalable and low
    cost.
    4. A perfect storm of software packages - scikit-learn, SciPy,
    Matplotlib, pandas, NumPy, TensorFlow, PyTorch etc. - that have added
    significant value to the Jupyter ecosystem.

  9.
    Very short history of Jupyter
    2001: Fernando Perez started the IPython project (this is what happens
    when you get bored with your PhD work) as an “afternoon hack”.
    2014: Fernando announced the spin-off of IPython to become Jupyter. With
    this spin-off, along with modern web frameworks and lots of other
    open source initiatives, Jupyter (after 5 iterations) became a tool for
    data analytics.
    2018 onwards: Multiple providers of Jupyter notebooks: Jupyter
    Project, Kaggle, Colaboratory, mybinder.org, OpenShift
    (https://github.com/jupyter-on-openshift/jupyter-notebooks)

  10.
    Go ahead and try this:
    1. https://mybinder.org/v2/gh/jupyterlab/jupyterlab-demo/master?urlpath=lab/tree/demo
    2. https://colab.research.google.com
    3. https://kaggle.com
    Some tutorial resources:
    1. https://github.com/datacamp/datacamp-community-tutorials/
    2. https://datajournalism.com/
    3. Lots of courses on Coursera and edX.

  11.
    Comments?
    Harish Pillay
    [email protected]
    @harishpillay
    https://tinyurl.com/sygs4bm
