# Introduction to Jupyter Notebooks & Data Analysis using Kaggle

Mon, Feb 18 @ Dogpatch Labs

Title:
Introduction to Jupyter Notebooks & Data Analysis using Kaggle by Leticia Portella

Short Description:
We will give an overview of what Jupyter Notebooks are and how they work, how to read and use a dataset based on a spreadsheet, and how to explore the data using Pandas and Matplotlib. This is an introductory workshop for people interested in data analysis.

Setup before workshop
- A verified Kaggle account (https://www.kaggle.com/). We will use the Titanic dataset (https://www.kaggle.com/sureshbhusare/titanic-dataset-from-kaggle) for exploratory analysis.

Leticia is an oceanographer who fell in love with programming. She is one of the hosts of Pizza de Dados, the first Brazilian podcast on data science, and is currently working with Project Jupyter through the Outreachy program. leportella.com February 19, 2019

## Transcript

1. Introduction to
Jupyter Notebooks
& Data Analysis
using Kaggle

2. LETICIA
PORTELLA
/in/leportella
@leportella
@leleportella
leportella.com

3. Why Kaggle?
Kaggle is a place where you can find a
lot of datasets. It already has most of the
tools you’ll need for a basic analysis
installed, and it is a good place to read
other people’s code and build a portfolio.

4. Choosing a dataset
https://www.kaggle.com/datasets

5. https://www.kaggle.com/vikichocolate/titanic-machine-learning-from-disaster

6. Notebooks are a place
where you can create code,
show graphs, document your
methodologies and findings…
all in a single place

7. You can write code
You can write text

8. And they work well together

9. Numbers on code cells indicate the
order in which each cell ran

10. Notebooks always display the value of the
last expression in a cell (even without a print statement)

11. Jupyter Shortcuts
Ctrl + Enter = Run cell
ESC + B = New cell below
ESC + dd = Delete cell

12. Run the first cell

If you check the first cell, it will tell you that
the files are ready for you in ../input/.
So we can read a file with a Pandas
function, passing it the path of the file.
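
The Pandas function for this is most likely `pd.read_csv`. The sketch below reads a tiny made-up CSV from memory so that it runs anywhere; the column values are invented, and the Kaggle file name mentioned in the comment is an assumption that should be checked against the output of the first cell.

```python
import io

import pandas as pd

# Tiny stand-in for the Titanic CSV so the sketch runs anywhere.
# The values below are made up for illustration.
csv_text = """PassengerId,Survived,Pclass,Age,Fare
1,0,3,22.0,7.25
2,1,1,38.0,71.28
3,1,3,26.0,7.92
"""

# On Kaggle you would pass a path instead, for example
# "../input/train.csv" (the file name is an assumption).
df = pd.read_csv(io.StringIO(csv_text))

print(df.columns.tolist())
```

`read_csv` accepts either a path or a file-like object, which is why the in-memory text works here.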

14. Dataframes
Dataframes are similar to what you find in
Excel spreadsheets. You have rows indicated by
numbers and columns with names. You can
check the first 5 rows of a dataframe to see
the basic structure:
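
The call for that check is not in the transcript; `DataFrame.head()` is the usual way to do it. A minimal sketch on made-up rows (the data is hypothetical, only the column names come from the slides):

```python
import pandas as pd

# Made-up rows that mimic a few Titanic columns.
df = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4, 5, 6, 7],
    "Survived": [0, 1, 1, 1, 0, 0, 0],
    "Pclass": [3, 1, 3, 1, 3, 3, 1],
})

# head() returns the first 5 rows by default; pass n for a different count.
first_rows = df.head()

print(first_rows)
```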

15. Dataframes

16. Dataframes
Dataframe columns: PassengerId, Survived, Pclass…

17. Dataframes
Dataframe rows: 0, 1, 2…

18. Dataframes
You can check the structure of a dataframe,
to get an idea of how many rows and
columns it has:
df.shape

19. Dataframes
You can check the main statistical
characteristics of the numerical columns of a
data frame
df.describe()
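
As a small illustration of `df.shape` and `df.describe()`, here is a sketch on a hypothetical four-row frame. The `Name` column is included only to show that `describe()` skips non-numeric columns by default.

```python
import pandas as pd

# Hypothetical data; only the Age and Fare column names come from the slides.
df = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],        # text column, skipped by describe()
    "Age": [20.0, 30.0, 40.0, 50.0],
    "Fare": [10.0, 10.0, 20.0, 40.0],
})

n_rows, n_cols = df.shape    # shape is an attribute, not a method call
summary = df.describe()      # count, mean, std, min, quartiles, max

print(n_rows, n_cols)
print(summary)
```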

20. Dataframes
df.describe()

21. Series
You can select a single column of the data
frame to work with. A column of a
Dataframe is called a Series and has some
special properties
df['Age']

22. Series
You can also check the statistical
characteristics of a Series
df['Age'].describe()

23. Series
You can filter a Series to see which rows have
passengers older than 10. This will return a
Series of True and False values.
df['Age'] > 10
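
To make the True/False result concrete, the sketch below builds the boolean Series on made-up ages and then uses it to keep only the matching rows. The names `is_older_than_10` and `older` are hypothetical.

```python
import pandas as pd

# Made-up ages for illustration.
df = pd.DataFrame({
    "Name": ["Ann", "Bob", "Cid", "Dee"],
    "Age": [4.0, 35.0, 10.0, 58.0],
})

# Comparing a Series with a number gives a Series of booleans
# that shares the index of the original dataframe.
is_older_than_10 = df["Age"] > 10

# Indexing the dataframe with that Series keeps only the True rows.
older = df[is_older_than_10]

print(is_older_than_10)
print(older)
```

Note that the comparison is strict, so the passenger aged exactly 10 is left out.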

24. Series
We can quickly plot a Series. We can, for instance,
check the histogram of ages.
df['Age'].plot.hist()
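
A slightly fuller version of the histogram call, as a sketch on made-up ages. It assumes Matplotlib is installed, which Pandas needs for plotting. The `bins` value and the labels are illustrative choices, not taken from the slides.

```python
import pandas as pd

# Made-up ages for illustration.
df = pd.DataFrame({"Age": [2, 8, 19, 22, 24, 29, 31, 35, 40, 47, 54, 63]})

# plot.hist() returns a Matplotlib Axes, which can be customized further.
ax = df["Age"].plot.hist(bins=8)
ax.set_xlabel("Age")
ax.set_title("Passenger ages")
```

In a notebook the figure is rendered below the cell automatically.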

25. Exercise
Plot the histogram of Fares

26. Series
We can quickly plot a Series. We can, for instance,
check the histogram of ages.
df['Age'].plot.hist()

27. Series
We can count how many passengers were on
each class
df['Pclass'].value_counts()
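
For reference, a sketch of what `value_counts()` returns, using six made-up passengers: a Series whose index holds the distinct values and whose values are the counts, sorted with the most frequent value first.

```python
import pandas as pd

# Made-up passenger classes for illustration.
df = pd.DataFrame({"Pclass": [3, 1, 3, 2, 3, 1]})

counts = df["Pclass"].value_counts()

print(counts)
```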

28. Series
Since the result of value_counts() is also a
Series, we can store it in a variable
and use it to plot a pie chart :)
passengers_per_class = df['Pclass'].value_counts()
passengers_per_class.plot.pie()

29. Exercise
Plot a bar plot with the number of people
that survived and didn’t survive
(Column Survived)

30. Series
Remember we could filter a Series? We can
use that to check out our variables. Let’s see
which class had the most survivors
survived = df['Survived'] > 0
filtered_df = df[survived]
passenger_per_class = filtered_df["Pclass"].value_counts()
passenger_per_class.plot.pie()
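
One caveat worth a sketch: counting survivors per class favors classes with more passengers. The made-up data below repeats the filter-and-count steps and then, as an addition not in the slides, compares them with the survival rate per class, computed as the mean of `Survived` within each `Pclass`.

```python
import pandas as pd

# Made-up data: class 3 has many passengers, class 1 only a few.
df = pd.DataFrame({
    "Survived": [1, 1, 1, 0, 0, 0, 0, 1, 1, 0],
    "Pclass":   [3, 3, 3, 3, 3, 3, 3, 1, 1, 1],
})

# Same steps as on the slide: filter the survivors, then count per class.
survived = df["Survived"] > 0
survivors_per_class = df[survived]["Pclass"].value_counts()

# Alternative view: share of each class that survived.
survival_rate = df.groupby("Pclass")["Survived"].mean()

print(survivors_per_class)
print(survival_rate)
```

Here class 3 has the most survivors in absolute numbers, while class 1 has the higher survival rate, so the two views can answer the question differently.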

31. Function
Let’s build a function that “breaks” the age
into classes:

32. Function
Now if we pass an age to the function it
returns a label:
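
The function body is not part of this transcript, so the following is only a sketch of what such a function might look like. The name `age_to_ageclass` comes from a later slide; the cut-off ages, the labels, and the handling of missing ages are all assumptions.

```python
import math


def age_to_ageclass(age):
    """Map a numeric age to a coarse label (cut-offs are illustrative)."""
    # The Titanic Age column has missing values, which Pandas stores as NaN.
    if age is None or math.isnan(age):
        return "unknown"
    if age < 12:
        return "child"
    if age < 18:
        return "teenager"
    if age < 60:
        return "adult"
    return "senior"


print(age_to_ageclass(8))
print(age_to_ageclass(35))
print(age_to_ageclass(float("nan")))
```

Checking for NaN first matters because every comparison with NaN is False, so a missing age would otherwise fall through to the last label.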

33. Series
We can create a new column (Ageclass)
using the Column Age and this function :)
df["Ageclass"] = df["Age"].apply(age_to_ageclass)

34. Exercise
Now that we have classes for age, we can check
which age class survived the most, the same way we
did with Pclass :)

35. Dataframes
We can group by two columns to count the rows in each combination
df["Ageclass"] = df["Age"].apply(age_to_ageclass)
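
The line above only creates the Ageclass column; the grouping step itself is not in the transcript. One way to do it, shown here as a sketch on made-up data with hypothetical labels, is `groupby` with two keys followed by `size()`.

```python
import pandas as pd

# Made-up data; the Ageclass labels are hypothetical.
df = pd.DataFrame({
    "Ageclass": ["child", "adult", "adult", "child", "adult", "adult"],
    "Survived": [1, 0, 1, 1, 0, 0],
})

# size() counts the rows in each (Ageclass, Survived) combination.
counts = df.groupby(["Ageclass", "Survived"]).size()

# unstack() turns the second key into columns; combinations that
# never occur are filled with 0 instead of NaN.
table = counts.unstack(fill_value=0)

print(counts)
print(table)
```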