
# Introduction to Jupyter Notebooks & Data Analysis using Kaggle

Mon, Feb 18 @ Dogpatch Labs

Title:
Introduction to Jupyter Notebooks & Data Analysis using Kaggle by Leticia Portella

Short Description:
We will give an overview of what Jupyter Notebooks are and how they work, how to read and use a dataset based on a spreadsheet, and how we can explore the data using Pandas and Matplotlib. This is an introductory workshop for people interested in data analysis.

Setup before workshop
- A verified account on Kaggle (https://www.kaggle.com/). We will use the Titanic dataset (https://www.kaggle.com/sureshbhusare/titanic-dataset-from-kaggle) for exploratory analysis.

Leticia is an oceanographer who fell in love with programming. She is one of the hosts of the first Brazilian podcast on Data Science, Pizza de Dados, and is currently working with Project Jupyter through the Outreachy program. leportella.com

February 19, 2019

## Transcript

3. ### Why Kaggle? Kaggle is a place where you can find a lot of datasets, it already has most of the tools you'll need for a basic analysis installed, and it is a good place to see other people's code and build a portfolio

9. ### Notebooks are a place where you can create code, show graphs, document your methodologies and findings… all in a single place

14. ### Jupyter Shortcuts
- Ctrl + Enter = Run cell
- ESC + B = New cell below
- ESC + dd = Delete cell

16. ### Reading a document If you check the first cell, it will tell you that the documents are ready for you in ../input/. So, we can read the files with a Pandas function and the path of the file: df = pd.read_csv('../input/train.csv')
17. ### Dataframes Dataframes are similar to the structures you find in Excel. You have rows indicated by numbers and columns with names. You can check the first 5 rows of a data frame to see its basic structure: df.head()
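Since the Kaggle ../input/ files aren't available outside the notebook, a minimal sketch of reading a CSV and checking the first rows can use an in-memory file instead (the column values here are made up):

```python
import io
import pandas as pd

# Stand-in for ../input/train.csv: a tiny Titanic-like CSV in memory
csv_data = io.StringIO(
    "PassengerId,Survived,Pclass,Name,Age\n"
    "1,0,3,Braund,22\n"
    "2,1,1,Cumings,38\n"
    "3,1,3,Heikkinen,26\n"
)

# On Kaggle this would be: df = pd.read_csv('../input/train.csv')
df = pd.read_csv(csv_data)

print(df.head())  # first 5 rows (here, all 3)
```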

21. ### Dataframes You can check the structure of a dataframe, to get an idea of how many rows and columns it has: df.shape
22. ### Dataframes You can check the main statistical characteristics of the numerical columns of a data frame: df.describe()
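A quick sketch of both calls on a toy DataFrame (made-up values):

```python
import pandas as pd

df = pd.DataFrame({
    "Age": [22, 38, 26, 35],
    "Fare": [7.25, 71.28, 7.92, 53.10],
})

print(df.shape)        # a (rows, columns) tuple -> (4, 2)

stats = df.describe()  # count, mean, std, min, quartiles, max per numeric column
print(stats)
```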

24. ### Series You can select a single column of the data frame to work with. A column of a Dataframe is called a Series and has some special properties: df['Age']
25. ### Series You can also check the statistical characteristics of a Series: df['Age'].describe()
26. ### Series You can filter a series to see which rows have adults. This will return a Series of True and False: df['Age'] > 10
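The comparison produces a boolean Series that can also be used as a mask. A sketch with made-up ages:

```python
import pandas as pd

df = pd.DataFrame({"Age": [8, 38, 26, 5]})

is_older_than_10 = df["Age"] > 10   # Series of True/False
print(is_older_than_10.tolist())    # [False, True, True, False]

# The boolean Series can select matching rows
print(df[is_older_than_10])
```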
27. ### Series Series also have functions that help you quickly plot the data. We can, for instance, check the histogram of Ages: df['Age'].plot.hist()

30. ### Series We can count how many passengers were in each class: df['Pclass'].value_counts()
31. ### Series Since the result of value_counts is also a Series, we can store this value in a variable and use it to plot a pie chart :) passengers_per_class = df['Pclass'].value_counts() passengers_per_class.plot.pie()
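A sketch of counting and plotting, with made-up class values (headless backend, since there is no notebook display here):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend
import pandas as pd

df = pd.DataFrame({"Pclass": [3, 1, 3, 1, 3, 2]})

passengers_per_class = df["Pclass"].value_counts()  # also a Series
print(passengers_per_class)  # 3 -> 3, 1 -> 2, 2 -> 1

ax = passengers_per_class.plot.pie()
ax.figure.savefig("classes_pie.png")
```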
32. ### Exercise Plot a bar plot with the number of people that survived and didn't survive (column Survived)
33. ### Series Remember we could filter a series? We can use it to check our variables. Let's see which class survived the most: survived = df['Survived'] > 0 filtered_df = df[survived] passenger_per_class = filtered_df['Pclass'].value_counts() passenger_per_class.plot.pie()
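The filter-then-count steps above can be sketched on a toy DataFrame (made-up survival data):

```python
import pandas as pd

df = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1],
    "Pclass":   [3, 1, 1, 3, 2],
})

survived = df["Survived"] > 0        # boolean mask of survivors
filtered_df = df[survived]           # keep only surviving passengers
passenger_per_class = filtered_df["Pclass"].value_counts()
print(passenger_per_class)           # 1 -> 2, 2 -> 1
```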

35. ### Function Now if we pass an age to the function, it returns a label:
36. ### Series We can create a new column (Ageclass) using the column Age and this function :) df['Ageclass'] = df['Age'].apply(age_to_ageclass)
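The original `age_to_ageclass` definition isn't preserved in the transcript, so this sketch uses hypothetical age bands to show the apply pattern:

```python
import pandas as pd

# Hypothetical stand-in for the slide's age_to_ageclass --
# the actual age bands are an assumption.
def age_to_ageclass(age):
    if age < 13:
        return "child"
    if age < 60:
        return "adult"
    return "elder"

df = pd.DataFrame({"Age": [8, 38, 70]})
df["Ageclass"] = df["Age"].apply(age_to_ageclass)  # one label per row
print(df["Ageclass"].tolist())  # ['child', 'adult', 'elder']
```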
37. ### Exercise Now that we have classes for age, we can check which age class survived the most, the same way we did with Pclass :)
38. ### Dataframes We can group two columns to count
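The exact grouping code from the slide isn't preserved in the transcript (the line shown repeats the apply call from slide 36); a plausible sketch of grouping by two columns to count, assuming pandas groupby:

```python
import pandas as pd

df = pd.DataFrame({
    "Ageclass": ["adult", "adult", "child", "child", "adult"],
    "Survived": [0, 1, 1, 0, 1],
})

# Count passengers for each (Ageclass, Survived) pair
counts = df.groupby(["Ageclass", "Survived"]).size()
print(counts)
```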