Adding Jupyter Notebooks in your Data Analytics Toolbag

Many people rely on spreadsheets as pretty much their only tool for work. This talk is about adding Jupyter Notebooks to the toolbag, so that data collected in spreadsheets can be analysed in a repeatable and verifiable way.

Harish Pillay

February 24, 2020

Transcript

  1. Data Analytics for fun and profit
    Harish Pillay, 24 February 2020
    [email protected], @harishpillay
    https://tinyurl.com/sygs4bm
  2. 2

  3. If the only tool you have is a hammer, everything looks like a nail.
  4. Spreadsheets
    1. Spreadsheets have been the natural habitat of all finance-related people for the last 30 years.
    2. People have lots of experience in using them (pivot tables, for example).
    3. Data size is limited to the desktop; even the SaaS providers place constraints.
    4. Scripting is done with spreadsheet macros, not via languages like Python.
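As a rough illustration of the pivot-table point above, here is a minimal sketch of a pivot-style summary written in plain Python, the kind of cell one might run in a notebook. The data, the column names (`region`, `quarter`, `amount`) and the helper `pivot_sum` are all hypothetical and stand in for a sheet exported as CSV; in practice a library such as Pandas would usually do this job.

```python
import csv
import io
from collections import defaultdict

# Hypothetical data standing in for a sheet exported as CSV.
SHEET_CSV = """region,quarter,amount
North,Q1,100
North,Q2,150
South,Q1,80
South,Q2,120
"""


def pivot_sum(csv_text, row_key, col_key, value_key):
    """Sum value_key for every (row_key, col_key) pair, like a pivot table."""
    table = defaultdict(lambda: defaultdict(float))
    for record in csv.DictReader(io.StringIO(csv_text)):
        table[record[row_key]][record[col_key]] += float(record[value_key])
    # Convert the nested defaultdicts into plain dicts for easy comparison.
    return {row: dict(cols) for row, cols in table.items()}


pivot = pivot_sum(SHEET_CSV, "region", "quarter", "amount")
for region, by_quarter in sorted(pivot.items()):
    print(region, by_quarter)
```

Because the summary is a function of the input text, rerunning the cell on a new export repeats exactly the same steps.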
  5. Issues with spreadsheets
    1. Cascading Errors - one mistake, and it snowballs (https://www.teampay.co/insights/biggest-excel-mistakes-of-all-time/).
    2. Scalability - sheets are limited in size.
    3. Performance - limited to the system the spreadsheet is used on.
    4. Testing - it is almost impossible to test the correctness of a sheet.
    5. Traceability/Debugging - tiny changes can impact formulae and make things very difficult.
    6. All Inclusive - the data and calculations are all contained within the spreadsheet file and run from a local computer.
    7. Operational Risk - all spreadsheets start as small, quick-fix calculations, but some turn into permanent enterprise-wide solutions that feed a number of business processes. The integrity of many financial, operational and regulatory processes is then threatened by the lack of visibility of the entire lineage.
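To make the testing and traceability points concrete, here is a minimal sketch of how a calculation that would normally live in a spreadsheet cell can be written as a small function and checked with assertions in a notebook. The expense data and the function names (`load_amounts`, `total`, `check_total`) are hypothetical, chosen only for illustration.

```python
import csv
import io

# Hypothetical expense data, as it might be exported from a spreadsheet.
EXPENSES_CSV = """item,amount
travel,1200.50
hotel,800.25
meals,199.25
"""


def load_amounts(csv_text):
    """Read the 'amount' column from CSV text as a list of floats."""
    return [float(row["amount"]) for row in csv.DictReader(io.StringIO(csv_text))]


def total(amounts):
    """The calculation under test: a plain sum of all amounts."""
    return sum(amounts)


def check_total():
    """Fail loudly if the data or the calculation changes unexpectedly."""
    amounts = load_amounts(EXPENSES_CSV)
    assert len(amounts) == 3, "unexpected number of rows"
    assert abs(total(amounts) - 2200.00) < 1e-9, "total does not match"
    return True


check_total()
```

If someone later edits the data or the formula, the check fails immediately instead of the error silently snowballing through dependent cells.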
  6. Do please use spreadsheets when:
    1. Correctness and accuracy are not a priority.
    2. The data is not too big (i.e. no need for scalability).
    3. There is no need for real-time updates.
    4. You are using them as a scratch pad to quickly put a prototype together.
    5. There is no need for long-term maintenance.
  7. Why?
    1. To tap into modern techniques of data analysis via the browser.
    2. Polyglot language support via kernels (language engines): Python, R, Julia and 138 others - https://github.com/jupyter/jupyter/wiki/Jupyter-kernels
    3. Rapid innovation in software is driven by the tsunami of data being generated and stored. This data needs fundamentally different tools for analytics, tools that have to be lightweight, fast, repeatable, scalable and low cost.
    4. A perfect storm of software packages (Scikit-learn, Scipy, Matplotlib, Pandas, Numpy, TensorFlow, PyTorch etc.) that have added significant value to the Jupyter ecosystem.
  8. Very short history of Jupyter
    2001: Fernando Perez started the IPython project (this is what happens when you get bored with your PhD work) as an “afternoon hack”.
    2014: Fernando announced the spin-off of IPython to become Jupyter. With this spin-off, the modern framework of the web and lots of other open source initiatives, Jupyter (after 5 iterations) became a tool for data analytics.
    2018 onwards: Multiple providers of Jupyter notebooks: Jupyter Project, Kaggle, Colaboratory, mybinder.org, OpenShift (https://github.com/jupyter-on-openshift/jupyter-notebooks)
  9. Go ahead and try this:
    1. https://mybinder.org/v2/gh/jupyterlab/jupyterlab-demo/master?urlpath=lab/tree/demo
    2. https://colab.research.google.com
    3. https://kaggle.com
    Some tutorial resources:
    1. https://github.com/datacamp/datacamp-community-tutorials/
    2. https://datajournalism.com/
    3. Lots of courses on Coursera and edX.