Slide 1

Slide 1 text

Introduction to Jupyter Notebooks & Data Analysis using Kaggle

Slide 2

Slide 2 text

LETICIA PORTELLA /in/leportella @leportella @leleportella leportella.com pizzadedados.com

Slide 3

Slide 3 text

Why Kaggle? Kaggle hosts a large number of datasets, comes with most of the tools you'll need for a basic analysis already installed, and is a good place to read other people's code and build a portfolio.

Slide 4

Slide 4 text

Choosing a dataset https://www.kaggle.com/datasets

Slide 5

Slide 5 text

No content

Slide 6

Slide 6 text

https://www.kaggle.com/vikichocolate/titanic-machine-learning-from-disaster

Slide 7

Slide 7 text

No content

Slide 8

Slide 8 text

No content

Slide 9

Slide 9 text

Notebooks are a place where you can create code, show graphs, document your methodologies and findings… all in a single place

Slide 10

Slide 10 text

You can write code. You can write text.

Slide 11

Slide 11 text

And they work well together

Slide 12

Slide 12 text

Numbers on code cells indicate the order in which each cell ran

Slide 13

Slide 13 text

Notebooks always display the value of the last line of a cell (even without a print statement)

Slide 14

Slide 14 text

Jupyter Shortcuts: Ctrl + Enter = Run cell; Esc, then B = New cell below; Esc, then D D = Delete cell

Slide 15

Slide 15 text

Run the first cell

Slide 16

Slide 16 text

Reading a document If you check the first cell, it will tell you that the files are ready for you in ../input/. So we can read a file by passing its path to a pandas function: df = pd.read_csv('../input/train.csv')
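
The ../input/ path only exists inside a Kaggle notebook. As a minimal sketch of the same call that runs anywhere, the example below feeds pd.read_csv a small in-memory text instead of a file. The CSV rows are made up for illustration and are not the real train.csv.

```python
import io

import pandas as pd

# A tiny stand-in for train.csv (illustrative values, not the real file).
csv_text = """PassengerId,Survived,Pclass,Age,Fare
1,0,3,22,7.25
2,1,1,38,71.28
3,1,3,,7.92
"""

# pd.read_csv accepts a file path or any file-like object.
df = pd.read_csv(io.StringIO(csv_text))

print(df.columns.tolist())
print(len(df))
```

The empty Age field in the last row is read as a missing value (NaN), which is also how missing ages show up in the Titanic data.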

Slide 17

Slide 17 text

Dataframes Dataframes are similar to what you find in an Excel spreadsheet. You have rows indicated by numbers and columns with names. You can check the first 5 rows of a data frame to see the basic structure: df.head()
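
A short sketch of head() and its counterpart tail(), using a small hand-made data frame in place of the Titanic file (the values are assumptions for illustration only):

```python
import pandas as pd

# Hand-made sample with a few Titanic-like columns (illustrative values).
df = pd.DataFrame({
    "PassengerId": range(1, 9),
    "Survived": [0, 1, 1, 1, 0, 0, 0, 1],
    "Pclass": [3, 1, 3, 1, 3, 3, 1, 2],
})

first_five = df.head()    # 5 rows by default
first_two = df.head(2)    # any number of rows can be requested
last_three = df.tail(3)   # tail() does the same from the end

print(first_two)
```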

Slide 18

Slide 18 text

Dataframes

Slide 19

Slide 19 text

Dataframes Dataframe columns: PassengerId, Survived, Pclass…

Slide 20

Slide 20 text

Dataframes Dataframe rows: 0, 1, 2…

Slide 21

Slide 21 text

Dataframes You can check the shape of a dataframe to see how many rows and columns it has: df.shape
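
Note that shape is an attribute, not a method, so it takes no parentheses. A minimal sketch with a hand-made data frame (the values are illustrative):

```python
import pandas as pd

# Hand-made sample (illustrative values).
df = pd.DataFrame({
    "Survived": [0, 1, 1],
    "Pclass": [3, 1, 3],
    "Age": [22.0, 38.0, 26.0],
    "Fare": [7.25, 71.28, 7.92],
})

# shape is a (rows, columns) tuple, so it can be unpacked directly.
n_rows, n_cols = df.shape
print(f"{n_rows} rows, {n_cols} columns")
```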

Slide 22

Slide 22 text

Dataframes You can check the main summary statistics of the numerical columns of a data frame: df.describe()
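
A small sketch of what describe() returns, on a hand-made data frame (illustrative values). By default the text column is left out, and missing values are excluded from the count:

```python
import pandas as pd

# Hand-made sample: two numeric columns and one text column.
df = pd.DataFrame({
    "Age": [20.0, 30.0, 40.0, None],
    "Fare": [10.0, 20.0, 30.0, 40.0],
    "Name": ["A", "B", "C", "D"],
})

# The result is itself a data frame: one row per statistic,
# one column per numeric column of the original.
summary = df.describe()

print(summary.loc["count"])
print(summary.loc["mean"])
```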

Slide 23

Slide 23 text

Dataframes df.describe()

Slide 24

Slide 24 text

Series You can select a single column of the data frame to work with. A column of a Dataframe is called a Series and has some special properties: df['Age']
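
A minimal sketch of the difference between selecting one column and selecting several, using hand-made values:

```python
import pandas as pd

# Hand-made sample (illustrative values).
df = pd.DataFrame({
    "Age": [22.0, 38.0, 26.0],
    "Fare": [7.25, 71.28, 7.92],
})

ages = df["Age"]               # one column name gives a Series
two_cols = df[["Age", "Fare"]]  # a list of names gives a data frame

print(type(ages).__name__)
print(ages.mean(), ages.max())
```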

Slide 25

Slide 25 text

Series You can also check the statistical characteristics of a Series: df['Age'].describe()

Slide 26

Slide 26 text

Series You can compare a Series against a value, for example to see which passengers are older than 10. This returns a Series of True and False values: df['Age'] > 10
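
A short sketch of how such a True/False Series can be used, on hand-made values. Since True counts as 1, summing the Series gives the number of matching rows, and passing it back to the data frame keeps only those rows:

```python
import pandas as pd

# Hand-made sample (illustrative values).
df = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "Age": [4.0, 35.0, 10.0, 58.0],
})

older_than_10 = df["Age"] > 10

print(older_than_10.tolist())  # [False, True, False, True]
print(older_than_10.sum())     # 2

# Use the boolean Series to keep only the matching rows.
filtered = df[older_than_10]
print(filtered["Name"].tolist())  # ['B', 'D']
```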

Slide 27

Slide 27 text

Series Series also have methods that help you quickly plot their values. We can, for instance, check the histogram of Ages: df['Age'].plot.hist()
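
A histogram counts how many values fall into each of a set of equal-width bins. As a sketch of the numbers behind such a plot, and not the plotting call itself, value_counts with a bins argument gives those counts directly. The ages below are made up:

```python
import pandas as pd

# Made-up ages for illustration.
ages = pd.Series([2, 8, 15, 22, 24, 31, 38, 45, 60, 71], name="Age")

# Split the range of ages into 4 equal-width bins and count values per bin.
# sort=False keeps the bins in ascending order instead of sorting by count.
counts = ages.value_counts(bins=4, sort=False)

print(counts)
```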

Slide 28

Slide 28 text

Exercise Plot the histogram of Fares

Slide 29

Slide 29 text

Series Series also have methods that help you quickly plot their values. We can, for instance, check the histogram of Ages: df['Age'].plot.hist()

Slide 30

Slide 30 text

Series We can count how many passengers were in each class: df['Pclass'].value_counts()
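
A small sketch of value_counts and two common variations, on made-up class values. The result is sorted by count, largest first, unless you sort it by the class label yourself:

```python
import pandas as pd

# Made-up passenger classes for illustration.
pclass = pd.Series([3, 1, 3, 1, 3, 3, 2, 3], name="Pclass")

counts = pclass.value_counts()                # most frequent class first
shares = pclass.value_counts(normalize=True)  # fractions instead of counts
by_class = counts.sort_index()                # ordered by class: 1, 2, 3

print(counts)
print(shares.loc[3])
```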

Slide 31

Slide 31 text

Series Since the result of value_counts is also a Series, we can store it in a variable and use it to plot a pie chart :)
passengers_per_class = df['Pclass'].value_counts()
passengers_per_class.plot.pie()

Slide 32

Slide 32 text

Exercise Plot a bar chart of the number of people who survived and who didn't survive (column Survived)

Slide 33

Slide 33 text

Series Remember that we could filter a Series? We can use that to explore our variables. Let's see which class had the most survivors:
survived = df['Survived'] > 0
filtered_df = df[survived]
passenger_per_class = filtered_df["Pclass"].value_counts()
passenger_per_class.plot.pie()
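
Counting survivors per class depends on how many passengers each class had in the first place. A possible complement, sketched below on made-up data, is the survival rate per class: because Survived holds 0 and 1, its mean within each class is the fraction that survived. The groupby step is an addition here, not part of the slide's code.

```python
import pandas as pd

# Made-up data: class 3 has more passengers than the others.
df = pd.DataFrame({
    "Pclass":   [1, 1, 2, 2, 3, 3, 3, 3],
    "Survived": [1, 1, 1, 0, 1, 1, 1, 0],
})

# Same steps as the slide: filter the survivors, then count per class.
survivors_per_class = df[df["Survived"] > 0]["Pclass"].value_counts()

# Fraction of each class that survived (mean of a 0/1 column).
survival_rate = df.groupby("Pclass")["Survived"].mean()

print(survivors_per_class)
print(survival_rate)
```

In this sample, class 3 has the most survivors while class 1 has the highest survival rate, so the two views can tell different stories.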

Slide 34

Slide 34 text

Function Let's build a function that "breaks" the age into classes:

Slide 35

Slide 35 text

Function Now if we pass an age to the function it returns a label:
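
One possible version of such a function, followed by a few calls. Only the name age_to_ageclass comes from the slides. The cut-offs, the labels, and the handling of missing ages are assumptions chosen for illustration.

```python
import math


def age_to_ageclass(age):
    """Map an age in years to a coarse label (cut-offs are assumptions)."""
    # Missing ages arrive from pandas as NaN; give them their own label.
    if age is None or math.isnan(age):
        return "unknown"
    if age < 12:
        return "child"
    if age < 18:
        return "teenager"
    if age < 60:
        return "adult"
    return "senior"


print(age_to_ageclass(5))             # child
print(age_to_ageclass(15))            # teenager
print(age_to_ageclass(30))            # adult
print(age_to_ageclass(70))            # senior
print(age_to_ageclass(float("nan")))  # unknown
```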

Slide 36

Slide 36 text

Series We can create a new column (Ageclass) from the Age column and this function :) df["Ageclass"] = df["Age"].apply(age_to_ageclass)
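
As an alternative to apply with a hand-written function, pandas also offers pd.cut, which bins a numeric column in one call. This is a different technique from the one on the slide, sketched here with assumed bin edges and labels on made-up ages:

```python
import pandas as pd

# Made-up ages, including one missing value.
df = pd.DataFrame({"Age": [5.0, 15.0, 30.0, 70.0, None]})

# right=False makes each bin include its left edge: [0, 12), [12, 18), ...
# The edges and labels are assumptions for illustration.
df["Ageclass"] = pd.cut(
    df["Age"],
    bins=[0, 12, 18, 60, 120],
    right=False,
    labels=["child", "teenager", "adult", "senior"],
)

print(df)
```

Rows with a missing age get a missing Ageclass rather than a label.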

Slide 37

Slide 37 text

Exercise Now that we have classes for age, check which age group survived the most, the same way we did with Pclass :)

Slide 38

Slide 38 text

Dataframes We can group by two columns and count the rows in each group: df["Ageclass"] = df["Age"].apply(age_to_ageclass)
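
The line above creates the Ageclass column. The grouping step itself could be sketched as follows, assuming we group by Pclass and Ageclass (the choice of columns and the sample values are assumptions for illustration):

```python
import pandas as pd

# Made-up data with the Ageclass column already filled in.
df = pd.DataFrame({
    "Pclass":   [1, 1, 3, 3, 3, 3],
    "Ageclass": ["adult", "adult", "child", "adult", "adult", "child"],
    "Survived": [1, 0, 1, 0, 0, 1],
})

# Count the rows for every (Pclass, Ageclass) combination that occurs.
counts = df.groupby(["Pclass", "Ageclass"]).size()

# Reshape into a table: one row per class, one column per age class.
# Combinations that never occur are filled with 0.
table = counts.unstack(fill_value=0)

print(counts)
print(table)
```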