Slide 1

Data Analytics for fun and profit
Harish Pillay
24 February 2020
[email protected], @harishpillay
https://tinyurl.com/sygs4bm

Slide 2

Slide 3

If the only tool you have is a hammer, everything looks like a nail.

Slide 4

Spreadsheets
1. Spreadsheets have been the natural habitat of finance-related people for the last 30 years.
2. There is a lot of accumulated experience in using them (pivot tables, for example).
3. Data size is limited by the desktop - even the SaaS providers place constraints.
4. Scripting is done with spreadsheet macros, not with languages like Python (see the sketch after this list).
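As a contrast to point 4, here is a minimal sketch of a spreadsheet-style pivot table scripted in Python with Pandas instead of a macro; the column names and figures are made-up example data, not from the talk.

# A pivot table in Pandas instead of a spreadsheet macro.
import pandas as pd

df = pd.DataFrame({
    "region":  ["North", "North", "South", "South"],
    "product": ["A", "B", "A", "B"],
    "sales":   [100, 150, 200, 50],
})

# Sum of sales by region and product - the spreadsheet pivot-table equivalent.
print(pd.pivot_table(df, values="sales", index="region",
                     columns="product", aggfunc="sum"))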

Slide 5

Issues with spreadsheets
1. Cascading errors - one mistake snowballs (https://www.teampay.co/insights/biggest-excel-mistakes-of-all-time/).
2. Scalability - sheets are limited in size.
3. Performance - limited to the system the spreadsheet runs on.
4. Testing - it is almost impossible to test the correctness of a sheet (see the sketch after this list).
5. Traceability/debugging - tiny changes can affect formulae and make problems very hard to trace.
6. All-inclusive - the data and calculations are all contained within the spreadsheet file and run from a local computer.
7. Operational risk - all spreadsheets start as small, quick-fix calculations, but some turn into permanent enterprise-wide solutions feeding a number of business processes; the integrity of many financial, operational and regulatory processes is then threatened by the lack of visibility into the entire lineage.
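On the testing point, here is a minimal sketch of why a calculation kept in code is easier to test than a cell formula; compound_interest is a hypothetical example function, not something from the talk.

# A calculation expressed as a function can be checked automatically;
# a formula buried in a spreadsheet cell cannot.
def compound_interest(principal, rate, years):
    """Future value with annual compounding."""
    return principal * (1 + rate) ** years

def test_compound_interest():
    # 1000 at 5% for 2 years should grow to 1102.50.
    assert abs(compound_interest(1000, 0.05, 2) - 1102.50) < 1e-9

test_compound_interest()
print("all tests passed")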

Slide 6

Do please use spreadsheets when:
1. Correctness and accuracy are not a priority.
2. The data is not too big (i.e. no need for scalability).
3. There is no need for real-time updates.
4. You are using them as a scratch pad to quickly put a prototype together.
5. There is no need for long-term maintenance.

Slide 7

https://jupyter.org
Logo by Cameron Oelsen - https://github.com/jupyter/jupyter.github.io/blob/master/assets/main-logo.svg, BSD, https://commons.wikimedia.org/w/index.php?curid=68763478

Slide 8

Why Jupyter?

Slide 9

Why?
1. To tap into modern techniques of data analysis via the browser.
2. Polyglot language support via kernels (language engines) - Python, R, Julia and 138 others - https://github.com/jupyter/jupyter/wiki/Jupyter-kernels
3. Rapid innovation in software is driven by the tsunami of data being generated and stored. This data needs fundamentally different tools for analytics that have to be lightweight, fast, repeatable, scalable and low cost.
4. A perfect storm of software packages - Scikit-learn, SciPy, Matplotlib, Pandas, NumPy, TensorFlow, PyTorch, etc. - has added significant value to the Jupyter ecosystem (see the sketch after this list).
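To illustrate point 4, a minimal sketch of a typical notebook cell that uses NumPy, Pandas and Matplotlib together; the data is synthetic and only meant to show the packages cooperating.

# Synthetic data -> DataFrame -> quick plot, all inside one notebook cell.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

x = np.linspace(0, 10, 100)
df = pd.DataFrame({"x": x, "y": np.sin(x)})

df.plot(x="x", y="y", title="sin(x)")
plt.show()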

Slide 10

Very short history of Jupyter
2001: Fernando Perez started the IPython project as an "afternoon hack" (this is what happens when you get bored with your PhD work).
2014: Fernando announced the spin-off of IPython into Jupyter. With the modern framework of the web and lots of other open source initiatives, Jupyter (after 5 iterations) became a tool for data analytics.
2018 onwards: Multiple providers of Jupyter notebooks: Jupyter Project, Kaggle, Colaboratory, mybinder.org, OpenShift (https://github.com/jupyter-on-openshift/jupyter-notebooks)

Slide 11

Demo time

Slide 12

Go ahead and try these:
1. https://mybinder.org/v2/gh/jupyterlab/jupyterlab-demo/master?urlpath=lab/tree/demo
2. https://colab.research.google.com
3. https://kaggle.com

Some tutorial resources:
1. https://github.com/datacamp/datacamp-community-tutorials/
2. https://datajournalism.com/
3. Lots of courses on Coursera and edX.

Slide 13

Comments?
Harish Pillay
[email protected]
@harishpillay
https://tinyurl.com/sygs4bm