Introduction to Jupyter Notebooks & Data Analysis using Kaggle

PyLadies Dublin Feb 2019 Meetup
Mon, Feb 18 @ Dogpatch Labs
Event page: https://www.meetup.com/PyLadiesDublin/events/dclgvlyzdbzb/

Title:
Introduction to Jupyter Notebooks & Data Analysis using Kaggle by Leticia Portella

Short Description:
We will give an overview of what Jupyter Notebooks are and how they work, how to read and use a dataset based on a spreadsheet, and how to explore the data using Pandas and Matplotlib. This is an introductory workshop for people interested in data analysis.

Setup before workshop
- A verified account on Kaggle (https://www.kaggle.com/). We will use the Titanic dataset (https://www.kaggle.com/sureshbhusare/titanic-dataset-from-kaggle) for exploratory analysis.

About Leticia Portella
Leticia is an oceanographer who fell in love with programming. She is one of the hosts of the first Brazilian podcast on data science, Pizza de Dados, and is currently working with Project Jupyter through the Outreachy program. leportella.com

PyLadies Dublin

February 19, 2019

Transcript

  1. Introduction to
    Jupyter Notebooks
    & Data Analysis
    using Kaggle

  2. LETICIA
    PORTELLA
    /in/leportella
    @leportella
    @leleportella
    leportella.com
    pizzadedados.com

  3. Kaggle is a place where you can find a
     lot of datasets, it already has most of
     the tools you'll need for a basic analysis
     installed, and it is a good place to see
     other people's code and build a portfolio
     Why Kaggle?

  4. Choosing a dataset
    https://www.kaggle.com/datasets

  5. (image-only slide)

  6. https://www.kaggle.com/vikichocolate/
    titanic-machine-learning-from-disaster

  7. (image-only slide)

  8. (image-only slide)

  9. Notebooks are a place
    where you can create code,
    show graphs, document your
    methodologies and findings…
    all in a single place

  10. You can write code
    You can write text

  11. And they work well together

  12. Numbers on code cells indicate the
    order in which each cell ran

  13. Notebooks always display the value of the last line
     (even without a print statement)

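A quick illustration of this behaviour (plain Python, nothing Kaggle-specific):

```python
# In a plain Python script, only print() produces output:
x = 21 * 2
print(x)  # prints 42

# In a notebook cell, the value of the last line is displayed
# automatically, so ending a cell with just `x` also shows 42.
x
```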
  14. Jupyter Shortcuts
     Ctrl + Enter = Run cell
     Esc, then B = New cell below
     Esc, then D D = Delete cell

  15. Run the first cell

  16. Reading a document
     If you check the first cell, it will tell you that
     the documents are ready for you in ../input/.
     So we can read a file with a Pandas
     function and the path of the file:
     df = pd.read_csv('../input/train.csv')

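A self-contained sketch of the same call; here a small in-memory CSV stands in for the Kaggle file so the snippet runs anywhere:

```python
import io

import pandas as pd

# On Kaggle the path would be '../input/train.csv'; this tiny
# in-memory CSV (a subset of the real columns) stands in for it.
csv_text = "PassengerId,Survived,Pclass\n1,0,3\n2,1,1\n"
df = pd.read_csv(io.StringIO(csv_text))
```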
  17. Dataframes
     Dataframes are similar to what you find in
     an Excel spreadsheet. You have rows indicated by
     numbers and columns with names. You can
     check the first 5 rows of a dataframe to see
     the basic structure:
     df.head()

  18. Dataframes

  19. Dataframes
    Dataframe columns: PassengerId, Survived, Pclass…

  20. Dataframes
    Dataframe rows: 0, 1, 2…

  21. Dataframes
     You can check the structure of a dataframe
     to get an idea of how many rows and
     columns it has:
     df.shape

  22. Dataframes
     You can check the main statistical
     characteristics of the numerical columns of a
     dataframe:
     df.describe()

  23. Dataframes
    df.describe()

  24. Series
     You can select a single column of the
     dataframe to work with. A column of a
     Dataframe is called a Series and has some
     special properties
     df['Age']

  25. Series
    You can also check the statistical
    characteristics of a Series
    df['Age'].describe()

  26. Series
     You can filter a Series to see which rows
     contain adults. This will return a Series of
     True and False values.
     df['Age'] > 10

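A minimal sketch of how such a mask is built and used (the ages below are made up; the real values come from train.csv):

```python
import pandas as pd

# Made-up ages standing in for the Titanic 'Age' column
df = pd.DataFrame({"Age": [4, 22, 38, 8, 45]})

mask = df["Age"] > 10  # Series of True/False, one entry per row
adults = df[mask]      # keeps only the rows where the mask is True
```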
  27. Series
     And Series have functions that help you
     quickly plot them. We can, for instance,
     check the histogram of ages.
     df['Age'].plot.hist()

  28. Exercise
    Plot the histogram of Fares

  29. Series
     And Series have functions that help you
     quickly plot them. We can, for instance,
     check the histogram of ages.
     df['Age'].plot.hist()

  30. Series
     We can count how many passengers were in
     each class
     df['Pclass'].value_counts()

  31. Series
     Since the result of value_counts is also a
     Series, we can store it in a variable
     and use it to plot a pie chart :)
     passengers_per_class = df['Pclass'].value_counts()
     passengers_per_class.plot.pie()

  32. Exercise
    Plot a bar plot with the number of people
    that survived and didn’t survive
    (Column Survived)

  33. Series
     Remember we could filter a Series? We can
     use that to check our variables. Let's see
     which class survived the most
     survived = df['Survived'] > 0
     filtered_df = df[survived]
     passenger_per_class = filtered_df['Pclass'].value_counts()
     passenger_per_class.plot.pie()

  34. Function
     Let's build a function that "breaks" the age
     into classes:

  35. Function
    Now if we pass an age to the function it
    returns a label:

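The function itself was shown as an image in the deck, so the version below is an assumption: the name matches the `age_to_ageclass` used on the next slide, but the cut-offs and labels are illustrative.

```python
def age_to_ageclass(age):
    # Illustrative cut-offs; the original slide's values may differ
    if age < 13:
        return "child"
    elif age < 18:
        return "teenager"
    return "adult"

age_to_ageclass(8)   # returns "child"
age_to_ageclass(30)  # returns "adult"
```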
  36. Series
     We can create a new column (Ageclass)
     using the column Age and this function :)
     df["Ageclass"] = df["Age"].apply(age_to_ageclass)

  37. Exercise
     Now that we have classes for age, we can check
     which age class survived the most, the same
     way we did with Pclass :)

  38. Dataframes
     We can group two columns to count

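The grouping code itself was not captured in the transcript; the sketch below is one plausible way to do it (the `groupby` call is an assumption, the column names are from the slides, and the rows are made up):

```python
import pandas as pd

# Made-up rows standing in for the Titanic data with the new Ageclass column
df = pd.DataFrame({
    "Ageclass": ["child", "adult", "adult", "child", "adult"],
    "Survived": [1, 0, 1, 1, 0],
})

# Count passengers for each (Ageclass, Survived) combination
counts = df.groupby(["Ageclass", "Survived"]).size()
```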