Slide 1

Slide 1 text

Data Analysis with Kotlin Notebook @antonarhipov

Slide 2

Slide 2 text

No content

Slide 3

Slide 3 text

Define objective

Slide 4

Slide 4 text

Define objective
Collect data

Slide 5

Slide 5 text

Define objective
Collect data
Clean the data

Slide 6

Slide 6 text

Define objective
Collect data
Clean data
Explore data

Slide 7

Slide 7 text

Define objective
Collect data
Clean data
Explore data
Analyze data

Slide 8

Slide 8 text

Define objective
Collect data
Clean data
Explore data
Analyze data
Interpret the results
Communicate findings
Implement decisions
Monitor decisions

Slide 9

Slide 9 text

Define objective
Collect data
Clean data
Explore data: Exploratory Data Analysis (EDA). This is what we are focusing on in this presentation.
Analyze data
Interpret the results
Communicate findings
Implement decisions
Monitor decisions

Slide 10

Slide 10 text

What does it take to hit the charts?
spotify-2023.csv
https://www.kaggle.com/datasets/nelgiriyewithana/top-spotify-songs-2023

Slide 11

Slide 11 text

No content

Slide 12

Slide 12 text

No content

Slide 13

Slide 13 text

Read data from a CSV file and create a DataFrame instance
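In a Kotlin Notebook cell, this step might look like the sketch below. It assumes the dataset was downloaded next to the notebook as spotify-2023.csv, and the variable name rawDf is hypothetical. Older Kotlin DataFrame releases expose the reader as readCSV, while newer ones also offer readCsv, so the call may need adjusting for your version.

```kotlin
%use dataframe

// Assumption: spotify-2023.csv sits in the notebook's working directory.
val rawDf = DataFrame.readCSV("spotify-2023.csv")

// The last expression of a cell is rendered as the cell output.
rawDf.head(5)
```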

Slide 14

Slide 14 text

Expose the variable from the cell

Slide 15

Slide 15 text

Get a glimpse into the schema

Slide 16

Slide 16 text

Looks like Taylor Swift is the most popular artist with 34 appearances

Slide 17

Slide 17 text

A nullable type indicates that the data isn't always present

Slide 18

Slide 18 text

The streams count is identified as String. This is clearly wrong in this case!

Slide 19

Slide 19 text

Aha! An invalid data entry!
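Why one bad row is enough to turn a numeric column into String can be shown with plain Kotlin: a column can be typed as a number only if every value parses. The sketch below uses made-up values and a hypothetical helper, findNonNumeric, and is not the code from the notebook.

```kotlin
// Returns the values that cannot be parsed as a whole number.
fun findNonNumeric(values: List<String>): List<String> =
    values.filter { it.toLongOrNull() == null }

fun main() {
    // Made-up sample: three valid counts and one corrupted entry.
    val streams = listOf("141381703", "133716286", "not-a-number", "800840817")
    println(findNonNumeric(streams)) // [not-a-number]
}
```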

Slide 20

Slide 20 text

This song isn't present in any charts. Looks like we can just ignore it

Slide 21

Slide 21 text

Adjust the structure for convenience

Slide 22

Slide 22 text

Filter out the invalid data

Slide 23

Slide 23 text

Replace column data type
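Filtering out the invalid row and then replacing the column type could be sketched as follows, using the String-based column API of Kotlin DataFrame in a notebook cell. The sample rows are made up and the column names are assumptions, so treat this as an illustration rather than the notebook's actual code.

```kotlin
%use dataframe

// Made-up sample rows; every value starts out as a String.
val sample = dataFrameOf("track_name", "streams")(
    "Song A", "1000",
    "Song B", "oops",
    "Song C", "2500"
)

val cleaned = sample
    // Keep only the rows whose streams value parses as a number.
    .filter { "streams"<String>().toLongOrNull() != null }
    // Replace the String column with a Long column.
    .convert("streams").to<Long>()

// Show the resulting column types.
cleaned.schema()
```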

Slide 24

Slide 24 text

Move the column to the left for convenience

Slide 25

Slide 25 text

Rename columns

Slide 26

Slide 26 text

Sort the data set
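The three tidy-up steps (move, rename, sort) chain naturally. A minimal sketch on made-up rows follows; the original column names are assumptions about the dataset headers. toLeft is the name used in DataFrame releases from around the time of this talk, and newer releases may prefer toStart.

```kotlin
%use dataframe

// Made-up sample rows with assumed original column names.
val sample = dataFrameOf("track_name", "artist(s)_name", "streams")(
    "Song A", "Artist X", 1000L,
    "Song B", "Artist Y", 2500L
)

val tidy = sample
    .move("streams").toLeft()                 // put the key metric first
    .rename("track_name").into("title")       // shorter, friendlier names
    .rename("artist(s)_name").into("artist")
    .sortByDesc("streams")                    // most streamed songs on top

tidy
```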

Slide 27

Slide 27 text

No content

Slide 28

Slide 28 text

Most popular artists by appearance
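Counting appearances per artist is a group-and-count operation. The sketch below uses made-up rows and assumes the columns were already renamed to title and artist.

```kotlin
%use dataframe

// Made-up sample: Artist X appears twice, Artist Y once.
val songs = dataFrameOf("title", "artist")(
    "Song A", "Artist X",
    "Song B", "Artist Y",
    "Song C", "Artist X"
)

val appearances = songs
    .groupBy("artist")
    .count()               // adds a "count" column holding the group sizes
    .sortByDesc("count")   // most frequent artists first

appearances
```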

Slide 29

Slide 29 text

Use Kandy library to visualize the data

Slide 30

Slide 30 text

Top 20 artists
plot is an extension function for DataFrame
Map the data to axes
Add some colors
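A bar chart along these lines might be written with Kandy as sketched below. The data frame, its values, the chart title, and the color range are all made up for illustration, and the exact styling used in the talk is not known from the slide text.

```kotlin
%use dataframe, kandy

// Made-up aggregate: artist name and number of appearances.
val topArtists = dataFrameOf("artist", "count")(
    "Artist X", 12,
    "Artist Y", 9,
    "Artist Z", 7
)

topArtists.plot {
    bars {
        // Map the data to axes.
        x("artist")
        y("count")
        // Add some colors: the bar fill follows the count value.
        fillColor("count") {
            scale = continuous(range = Color.YELLOW..Color.RED)
        }
    }
    layout.title = "Top artists by appearances"
}
```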

Slide 31

Slide 31 text

No content

Slide 32

Slide 32 text

Looks like you just need to be Taylor Swift to hit the charts ;)

Slide 33

Slide 33 text

Not in Top 20?

Slide 34

Slide 34 text

Not in Top 30?

Slide 35

Slide 35 text

Not even in Top 50?

Slide 36

Slide 36 text

OK, 1 appearance in Top 100

Slide 37

Slide 37 text

Let's take a look at the top songs

Slide 38

Slide 38 text

New column merging the values from the 'title' and the 'artist'
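Adding such a derived column could look like the following sketch. The new column name label and the sample rows are hypothetical.

```kotlin
%use dataframe

// Made-up sample rows.
val songs = dataFrameOf("title", "artist")(
    "Song A", "Artist X",
    "Song B", "Artist Y"
)

// Compute the new value row by row from the two existing columns.
val labeled = songs.add("label") {
    val title = "title"<String>()
    val artist = "artist"<String>()
    "$title - $artist"
}

labeled
```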

Slide 39

Slide 39 text

Evenly distribute the values between the color categories

Slide 40

Slide 40 text

Looks like C# key is the most popular one

Slide 41

Slide 41 text

Let's try finding correlations between the different attributes related to the streams count
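To make "correlation" concrete, here is the Pearson coefficient written in plain Kotlin. This is a sketch of the textbook formula on made-up numbers, not necessarily what the notebook computes. The result lies between -1 and 1, and values near 0 mean there is no linear relationship.

```kotlin
import kotlin.math.sqrt

// Pearson correlation coefficient of two equally sized samples.
// Returns NaN when one of the samples has zero variance.
fun pearson(xs: List<Double>, ys: List<Double>): Double {
    require(xs.size == ys.size && xs.size > 1) { "Need two samples of equal size" }
    val meanX = xs.average()
    val meanY = ys.average()
    var cov = 0.0
    var varX = 0.0
    var varY = 0.0
    for (i in xs.indices) {
        val dx = xs[i] - meanX
        val dy = ys[i] - meanY
        cov += dx * dy
        varX += dx * dx
        varY += dy * dy
    }
    return cov / sqrt(varX * varY)
}

fun main() {
    // Made-up values standing in for an attribute column and the streams column.
    val danceability = listOf(80.0, 71.0, 51.0, 55.0)
    val streams = listOf(141.0, 133.0, 140.0, 800.0)
    println(pearson(danceability, streams))
}
```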

Slide 42

Slide 42 text

Looks like there are some attributes that correlate better than the others

Slide 43

Slide 43 text

"Gather" - prepare the data for plotting
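Gathering reshapes a wide table into a long one, with one row per (row, attribute, value) triple, which is the shape a heatmap-style plot needs. The plain-Kotlin sketch below only illustrates that reshape. The Cell type, the gatherToLong helper, and the numbers are hypothetical and not part of the DataFrame API.

```kotlin
// One cell of the long format: which row, which attribute, what value.
data class Cell(val row: String, val attribute: String, val value: Double)

// Turns a wide table (row label -> attribute -> value) into long-format cells.
fun gatherToLong(wide: Map<String, Map<String, Double>>): List<Cell> =
    wide.flatMap { (row, attributes) ->
        attributes.map { (attribute, value) -> Cell(row, attribute, value) }
    }

fun main() {
    // Made-up correlation values, for illustration only.
    val correlations = mapOf(
        "streams" to mapOf("danceability" to -0.1, "energy" to 0.0),
        "bpm" to mapOf("danceability" to -0.15, "energy" to 0.04)
    )
    gatherToLong(correlations).forEach(::println)
}
```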

Slide 44

Slide 44 text

The slightly brighter spots are the points of interest

Slide 45

Slide 45 text

The slightly brighter spots are the points of interest.
The outcome: danceability is useful, and you need at least some energy to be popular.

Slide 46

Slide 46 text

Analyzing the BPM (tempo) of the song

Slide 47

Slide 47 text

The aggregation operation creates a new column that becomes statically accessible in the next cell

Slide 48

Slide 48 text

The static accessor was generated for the new column created by the aggregate operation

Slide 49

Slide 49 text

The static accessor was generated for the new column created by the aggregate operation.
It looks like the most popular songs fall within the 90 to 130 BPM range.
The overall median is 121 BPM.
The most popular BPM is 120, with 39 occurrences.
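The two statistics quoted here, the median and the most frequent value, can be computed with the Kotlin standard library alone. The sketch below runs on a made-up list of BPM values. The helpers median and mode are hypothetical names, not calls from the notebook.

```kotlin
// Median: the middle value, or the mean of the two middle values.
fun median(values: List<Int>): Double {
    require(values.isNotEmpty()) { "values must not be empty" }
    val sorted = values.sorted()
    val mid = sorted.size / 2
    return if (sorted.size % 2 == 1) sorted[mid].toDouble()
    else (sorted[mid - 1] + sorted[mid]) / 2.0
}

// Mode: the most frequent value paired with its number of occurrences.
fun mode(values: List<Int>): Pair<Int, Int> {
    val counts = values.groupingBy { it }.eachCount()
    val top = counts.maxByOrNull { it.value } ?: error("values must not be empty")
    return top.key to top.value
}

fun main() {
    // Made-up BPM values, not taken from the dataset.
    val bpms = listOf(120, 90, 120, 130, 125)
    println(median(bpms)) // 120.0
    println(mode(bpms))   // (120, 2)
}
```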

Slide 50

Slide 50 text

C# is the most popular key among the top songs on Spotify

Slide 51

Slide 51 text

Some (very naive) outcomes:
BPM matters: not too slow, not too fast; 120 is good.
The most popular key is C#. G, G#, and D are probably acceptable too.
Having at least some energy in the song is useful.

Slide 52

Slide 52 text

Resources:
Kotlin Notebook plugin: https://plugins.jetbrains.com/plugin/16340-kotlin-notebook
Kotlin DataFrame: https://kotlin.github.io/dataframe
Kotlin Kandy: https://kotlin.github.io/kandy
Demo: https://github.com/antonarhipov/kotlin-notebooks-demo/

Slide 53

Slide 53 text

https://speakerdeck.com/antonarhipov @antonarhipov https://github.com/antonarhipov