Data analysis and visualization skills are increasingly important in the new age of Large Language Models and generative AI. But how does a developer without years of Python or data science experience skill up rapidly on the tools and best practices needed to achieve project goals? This is where the right developer tooling, with a little AI assistance, can help.
In this talk, we'll go from identifying an open dataset to analyzing it for insights and visualizing relevant outcomes, all in 25 minutes - with just a GitHub account and an OpenAI endpoint.
Along the way, we'll introduce you to a series of developer tools that make your journey easier:
- Open Dataset: to "analyze" - from Kaggle, Hugging Face, or Azure
- Data Wrangler: to "sanitize" data - extension for Visual Studio Code
- Jupyter Notebook: to "record" process - for transferable learning
- GitHub Codespaces: to "pre-build" environment - for consistent reuse
- GitHub Copilot: to "explain/fix" code - for focused learning with AI help
- Microsoft LIDA: to "suggest" visualization goals and "build" charts - for building your intuition with AI help (see the sketch below)
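
To make the LIDA step concrete, here is a minimal sketch of how one might ask LIDA to summarize a dataset, suggest goals, and build a chart. It assumes `lida` is installed (`pip install lida`), an `OPENAI_API_KEY` environment variable points at your OpenAI endpoint, and `cars.csv` is a placeholder for whatever open dataset you picked; field names such as `goal.question` and `charts[0].code` follow the LIDA docs as I recall them and may differ across versions.

```python
# Minimal sketch: summarize a dataset with LIDA, suggest analysis goals,
# and generate a chart for the first goal. "cars.csv" is a placeholder.
from lida import Manager, TextGenerationConfig, llm

lida = Manager(text_gen=llm("openai"))  # reads OPENAI_API_KEY from the environment
config = TextGenerationConfig(n=1, temperature=0.2, use_cache=True)

summary = lida.summarize("cars.csv", textgen_config=config)  # describe the data
goals = lida.goals(summary, n=3, textgen_config=config)      # suggest visualization goals

for goal in goals:
    print(goal.question)  # questions the data could answer

charts = lida.visualize(summary=summary, goal=goals[0],
                        textgen_config=config, library="matplotlib")
print(charts[0].code)  # inspect the plotting code LIDA generated
```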
The talk comes with an associated repo that you can fork, then swap in your own dataset to extend or experiment with on your own later. By the end of the talk, you should have a sense of how to go from discovering a dataset to getting visual insights from it, using existing tools with a little AI assistance.