Define objective
Collect data
Clean data
Explore data
Slide 7
Slide 7 text
Define objective
Collect data
Clean data
Explore data
Analyze data
Slide 8
Slide 8 text
Define objective
Collect data
Clean data
Explore data
Analyze data
Interpret the results
Communicate findings
Implement decisions
Monitor decisions
Slide 9
Slide 9 text
Define objective
Collect data
Clean data
Analyze data
Interpret the results
Communicate findings
Implement decisions
Monitor decisions
Exploratory Data Analysis (EDA)
Explore data
This is what we are focusing
on in this presentation
Slide 10
Slide 10 text
What does it take
to hit the charts?
spotify-2023.csv
https://www.kaggle.com/datasets/nelgiriyewithana/top-spotify-songs-2023
Slide 11
Slide 11 text
No content
Slide 12
Slide 12 text
No content
Slide 13
Slide 13 text
Read data from the CSV file
and create a DataFrame instance
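To make the loading step concrete, here is a minimal sketch in plain Kotlin that turns CSV text into a list of rows keyed by column name. It is not the DataFrame library call used in the notebook; `parseCsvLine` and `readTable` are hypothetical helper names, and the parser assumes that a quoted field never spans several lines.

```kotlin
// Splits one CSV line into fields. Commas inside double quotes are kept,
// and a doubled quote ("") inside a quoted field becomes a single quote.
fun parseCsvLine(line: String): List<String> {
    val fields = mutableListOf<String>()
    val current = StringBuilder()
    var inQuotes = false
    var i = 0
    while (i < line.length) {
        val c = line[i]
        when {
            c == '"' && inQuotes && i + 1 < line.length && line[i + 1] == '"' -> {
                current.append('"')
                i++ // skip the second quote of the escaped pair
            }
            c == '"' -> inQuotes = !inQuotes
            c == ',' && !inQuotes -> {
                fields.add(current.toString())
                current.setLength(0)
            }
            else -> current.append(c)
        }
        i++
    }
    fields.add(current.toString())
    return fields
}

// Treats the first non-blank line as the header and maps every
// following line to a (column name -> raw string value) row.
fun readTable(csvText: String): List<Map<String, String>> {
    val lines = csvText.lines().filter { it.isNotBlank() }
    if (lines.isEmpty()) return emptyList()
    val header = parseCsvLine(lines.first())
    return lines.drop(1).map { header.zip(parseCsvLine(it)).toMap() }
}
```

Every value stays a String at this point, so numeric columns still need an explicit conversion later.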
Slide 14
Slide 14 text
Expose the variable
from the cell
Slide 15
Slide 15 text
Get a glimpse into the schema
Slide 16
Slide 16 text
Looks like Taylor Swift is the most popular
artist with 34 appearances
Slide 17
Slide 17 text
A nullable type indicates that the data isn't
always present
Slide 18
Slide 18 text
The streams count is identified as String.
This is clearly wrong in this case!
Slide 19
Slide 19 text
Aha! An invalid data entry!
Slide 20
Slide 20 text
This song isn't present in any charts. Looks like we
can just ignore it
Slide 21
Slide 21 text
Adjust the structure for
convenience
Slide 22
Slide 22 text
Filter out the invalid
data
Slide 23
Slide 23 text
Replace column
data type
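The two cleaning steps (drop the invalid entry, then change the column type) can be sketched in plain Kotlin as follows. This is an illustration of the idea, not the DataFrame API from the notebook; `RawSong`, `Song` and `cleanSongs` are hypothetical names, and a row counts as invalid here simply when its streams value does not parse as a whole number.

```kotlin
// Row as loaded: the streams count is still text.
data class RawSong(val title: String, val artist: String, val streams: String)

// Row after cleaning: the streams count is numeric.
data class Song(val title: String, val artist: String, val streams: Long)

// Keeps only the rows whose streams value parses as a Long and
// converts the column in the same pass.
fun cleanSongs(raw: List<RawSong>): List<Song> =
    raw.mapNotNull { r ->
        r.streams.trim().toLongOrNull()?.let { Song(r.title, r.artist, it) }
    }
```

Calling `cleanSongs` on a list that contains one row with a non-numeric streams value returns the list without that row.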
Slide 24
Slide 24 text
Move the column to the left for
convenience
Slide 25
Slide 25 text
Rename columns
Slide 26
Slide 26 text
Sort the data set
Slide 27
Slide 27 text
No content
Slide 28
Slide 28 text
Most popular artists by
appearance
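Counting appearances per artist is a group-and-count followed by a descending sort. A minimal sketch in plain Kotlin, assuming each row carries a single artist string that is counted as-is (whether the notebook splits multi-artist entries is not shown on the slide); `topArtists` is a hypothetical name.

```kotlin
// Counts how often each artist occurs and returns the n most frequent
// ones as (artist, appearances) pairs, most frequent first.
// Ties keep the order in which the artists were first seen.
fun topArtists(artists: List<String>, n: Int): List<Pair<String, Int>> =
    artists.groupingBy { it }
        .eachCount()
        .toList()
        .sortedByDescending { it.second }
        .take(n)
```

For example, `topArtists(names, 20)` would produce the input for a "Top 20 artists" bar chart.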
Slide 29
Slide 29 text
Use Kandy library to
visualize the data
Slide 30
Slide 30 text
Top 20 artists
Plot is an extension function for DataFrame
Map the data to axes
Add some colors
Slide 31
Slide 31 text
No content
Slide 32
Slide 32 text
Looks like you just need
to be Taylor Swift to hit the
charts ;)
Slide 33
Slide 33 text
Not in Top 20?
Slide 34
Slide 34 text
Not in Top 30?
Slide 35
Slide 35 text
Not even in Top 50?
Slide 36
Slide 36 text
OK, 1 appearance
in Top 100
Slide 37
Slide 37 text
Let's take a look at the
top songs
Slide 38
Slide 38 text
New column merging
the values from the
'title' and the 'artist'
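The merged column is just a label built from two existing values. A small plain-Kotlin sketch, assuming a "title (artist)" format for the label (the exact format used in the notebook is not visible on the slide); `Track`, `label` and `topSongLabels` are hypothetical names.

```kotlin
data class Track(val title: String, val artist: String, val streams: Long)

// Builds the merged label from the 'title' and 'artist' values.
fun label(track: Track): String = "${track.title} (${track.artist})"

// Returns the n most streamed tracks as (label, streams) pairs,
// which is the shape a bar chart of top songs needs.
fun topSongLabels(tracks: List<Track>, n: Int): List<Pair<String, Long>> =
    tracks.sortedByDescending { it.streams }
        .take(n)
        .map { label(it) to it.streams }
```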
Slide 39
Slide 39 text
Evenly distribute the
values between the
color categories
Slide 40
Slide 40 text
Looks like the C# key is the
most popular one
Slide 41
Slide 41 text
Let's try finding correlations between the different
attributes related to the streams count
Slide 42
Slide 42 text
Looks like there are some attributes that
correlate better than the others
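As a reminder of what "correlates better" means numerically, here is a minimal sketch of the Pearson correlation coefficient in plain Kotlin. Pearson is an assumption on my side: the slides do not say which correlation measure the notebook computes, and `pearson` is a hypothetical helper name.

```kotlin
import kotlin.math.sqrt

// Pearson correlation of two equally sized samples.
// Returns a value in [-1, 1]; the result is NaN when one of the
// samples has zero variance (all values identical).
fun pearson(xs: List<Double>, ys: List<Double>): Double {
    require(xs.size == ys.size && xs.size > 1) { "need two equally sized samples" }
    val meanX = xs.average()
    val meanY = ys.average()
    var cov = 0.0
    var varX = 0.0
    var varY = 0.0
    for (i in xs.indices) {
        val dx = xs[i] - meanX
        val dy = ys[i] - meanY
        cov += dx * dy
        varX += dx * dx
        varY += dy * dy
    }
    return cov / sqrt(varX * varY)
}
```

Values close to 1 or -1 indicate a strong linear relationship with the streams count, values near 0 a weak one.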
Slide 43
Slide 43 text
"Gather" - prepare the data for plotting
Slide 44
Slide 44 text
The slightly brighter spots are the points of
interest
Slide 45
Slide 45 text
The slightly brighter spots are the points of
interest
The outcome:
Danceability is useful
And you need to have at least some
energy levels to be popular
Slide 46
Slide 46 text
Analyzing the BPM (tempo) of the song
Slide 47
Slide 47 text
The aggregation operation creates a new
column that will be statically accessible in
the next cell
Slide 48
Slide 48 text
The static accessor was generated for the
new column created by the aggregate
operation
Slide 49
Slide 49 text
The static accessor was generated for the
new column created by the aggregate
operation
It looks like the most popular songs fall
within the 90 to 130 BPM range
The overall median is 121 BPM
The most popular BPM is 120, with 39 occurrences
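The three BPM figures above are a median, a mode and a range share. A minimal plain-Kotlin sketch of how such numbers can be computed; `median`, `mode` and `shareInRange` are hypothetical helper names and do not reproduce the aggregation code from the notebook.

```kotlin
// Median of a non-empty list; averages the two middle values
// when the list has an even number of elements.
fun median(values: List<Int>): Double {
    require(values.isNotEmpty()) { "median of an empty list is undefined" }
    val sorted = values.sorted()
    val mid = sorted.size / 2
    return if (sorted.size % 2 == 1) sorted[mid].toDouble()
    else (sorted[mid - 1] + sorted[mid]) / 2.0
}

// Most frequent value together with its number of occurrences,
// or null for an empty list.
fun mode(values: List<Int>): Pair<Int, Int>? =
    values.groupingBy { it }.eachCount().maxByOrNull { it.value }?.toPair()

// Fraction of values that fall inside the given inclusive range.
fun shareInRange(values: List<Int>, range: IntRange): Double =
    if (values.isEmpty()) 0.0 else values.count { it in range }.toDouble() / values.size
```

With the BPM column as input, `median(bpm)`, `mode(bpm)` and `shareInRange(bpm, 90..130)` give the kind of summary shown on the slide.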
Slide 50
Slide 50 text
C# is the most
popular key among
the top songs on
Spotify
Slide 51
Slide 51 text
Some (very naive) outcomes:
BPM matters: not too slow, not too fast, 120 is good
The most popular key is C#
G, G#, and D are probably acceptable too
Having at least some energy levels in the song is useful