Introduction to Jupyter Notebooks & Data Analysis using Kaggle

PyLadies Dublin Feb 2019 Meetup
Mon, Feb 18 @ Dogpatch Labs
Event page: https://www.meetup.com/PyLadiesDublin/events/dclgvlyzdbzb/

Title:
Introduction to Jupyter Notebooks & Data Analysis using Kaggle by Leticia Portella

Short Description:
We will give an overview of what Jupyter Notebooks are and how they work, how to read and use a dataset based on a spreadsheet, and how to explore the data using Pandas and Matplotlib. This is an introductory workshop for people interested in data analysis.

Setup before workshop
- A verified account on Kaggle (https://www.kaggle.com/). We will use the Titanic dataset (https://www.kaggle.com/sureshbhusare/titanic-dataset-from-kaggle) for exploratory analysis.

About Leticia Portella
Leticia is an oceanographer who fell in love with programming. She is one of the hosts of the first Brazilian podcast on data science, Pizza de Dados, and is currently working with Project Jupyter through the Outreachy program. leportella.com

PyLadies Dublin

February 19, 2019

Transcript

  1. Introduction to
    Jupyter Notebooks
    & Data Analysis
    using Kaggle

  2. LETICIA
    PORTELLA
    /in/leportella
    @leportella
    @leleportella
    leportella.com
    pizzadedados.com

  3. Kaggle is a place where you can find a
     lot of datasets, it already has most of
     the tools you'll need for a basic analysis
     installed, and it is a good place to see
     other people's code and build a portfolio
     Why Kaggle?

  4. Choosing a dataset
    https://www.kaggle.com/datasets

  5. (image-only slide)

  6. https://www.kaggle.com/vikichocolate/
    titanic-machine-learning-from-disaster

  7. (image-only slide)

  8. (image-only slide)

  9. Notebooks are a place
    where you can create code,
    show graphs, document your
    methodologies and findings…
    all in a single place

  10. You can write code
    You can write text

  11. And they work well together

  12. Numbers on code cells indicate the
    order in which each cell ran

  13. Notebooks always display the value of the last line
     (even without a print statement)

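A quick illustration of this behaviour (plain Python, nothing Kaggle-specific):

```python
# In a plain Python script, only print() produces output:
x = 21 * 2
print(x)  # prints 42

# In a notebook cell, the value of the last line is displayed
# automatically, so ending a cell with just `x` also shows 42.
x
```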
  14. Jupyter Shortcuts
     Ctrl + Enter = Run cell
     Esc, then B = New cell below
     Esc, then D D = Delete cell

  15. Run the first cell

  16. Reading a document
     If you check the first cell, it will tell you that
     the documents are ready for you in ../input/.
     So we can read a file with a Pandas
     function and the path of the file:
     df = pd.read_csv('../input/train.csv')

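A self-contained sketch of the same call; here a small in-memory CSV stands in for the Kaggle file so the snippet runs anywhere:

```python
import io

import pandas as pd

# On Kaggle the path would be '../input/train.csv'; this tiny
# in-memory CSV (a subset of the real columns) stands in for it.
csv_text = "PassengerId,Survived,Pclass\n1,0,3\n2,1,1\n"
df = pd.read_csv(io.StringIO(csv_text))
```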
  17. Dataframes
     Dataframes are similar to what you find in
     an Excel spreadsheet. You have rows indicated by
     numbers and columns with names. You can
     check the first 5 rows of a dataframe to see
     the basic structure:
     df.head()

  18. Dataframes

  19. Dataframes
    Dataframe columns: PassengerId, Survived, Pclass…

  20. Dataframes
    Dataframe rows: 0, 1, 2…

  21. Dataframes
     You can check the structure of a dataframe
     to get an idea of how many rows and
     columns it has:
     df.shape

  22. Dataframes
     You can check the main statistical
     characteristics of the numerical columns of a
     dataframe:
     df.describe()

  23. Dataframes
    df.describe()

  24. Series
     You can select a single column of the
     dataframe to work with. A column of a
     Dataframe is called a Series and has some
     special properties
     df['Age']

  25. Series
    You can also check the statistical
    characteristics of a Series
    df['Age'].describe()

  26. Series
     You can filter a Series to see which rows
     contain adults. This will return a Series of
     True and False values.
     df['Age'] > 10

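A minimal sketch of how such a mask is built and used (the ages below are made up; the real values come from train.csv):

```python
import pandas as pd

# Made-up ages standing in for the Titanic 'Age' column
df = pd.DataFrame({"Age": [4, 22, 38, 8, 45]})

mask = df["Age"] > 10  # Series of True/False, one entry per row
adults = df[mask]      # keeps only the rows where the mask is True
```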
  27. Series
     And Series have functions that help you
     quickly plot them. We can, for instance,
     check the histogram of ages.
     df['Age'].plot.hist()

  28. Exercise
    Plot the histogram of Fares

  29. Series
     And Series have functions that help you
     quickly plot them. We can, for instance,
     check the histogram of ages.
     df['Age'].plot.hist()

  30. Series
     We can count how many passengers were in
     each class
     df['Pclass'].value_counts()

  31. Series
     Since the result of value_counts is also a
     Series, we can store it in a variable
     and use it to plot a pie chart :)
     passengers_per_class = df['Pclass'].value_counts()
     passengers_per_class.plot.pie()

  32. Exercise
    Plot a bar plot with the number of people
    that survived and didn’t survive
    (Column Survived)

  33. Series
     Remember we could filter a Series? We can
     use that to check our variables. Let's see
     which class survived the most
     survived = df['Survived'] > 0
     filtered_df = df[survived]
     passenger_per_class = filtered_df['Pclass'].value_counts()
     passenger_per_class.plot.pie()

  34. Function
     Let's build a function that "breaks" the age
     into classes:

  35. Function
    Now if we pass an age to the function it
    returns a label:

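The function itself was shown as an image in the deck, so the version below is an assumption: the name matches the `age_to_ageclass` used on the next slide, but the cut-offs and labels are illustrative.

```python
def age_to_ageclass(age):
    # Illustrative cut-offs; the original slide's values may differ
    if age < 13:
        return "child"
    elif age < 18:
        return "teenager"
    return "adult"

age_to_ageclass(8)   # returns "child"
age_to_ageclass(30)  # returns "adult"
```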
  36. Series
     We can create a new column (Ageclass)
     using the column Age and this function :)
     df["Ageclass"] = df["Age"].apply(age_to_ageclass)

  37. Exercise
     Now that we have classes for age, we can check
     which age class survived the most, the same
     way we did with Pclass :)

  38. Dataframes
     We can group two columns to count

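The grouping code itself was not captured in the transcript; the sketch below is one plausible way to do it (the `groupby` call is an assumption, the column names are from the slides, and the rows are made up):

```python
import pandas as pd

# Made-up rows standing in for the Titanic data with the new Ageclass column
df = pd.DataFrame({
    "Ageclass": ["child", "adult", "adult", "child", "adult"],
    "Survived": [1, 0, 1, 1, 0],
})

# Count passengers for each (Ageclass, Survived) combination
counts = df.groupby(["Ageclass", "Survived"]).size()
```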