Spotify Playlist Prediction
    Executive Summary

    As the summer begins, it is important to celebrate the great work of the team. As a consistent Spotify Premium user, it was quite exciting to work on a dance-themed playlist for a company's party in the northern hemisphere. In this report, I show the steps taken to analyze the Spotify dataset and choose the top 50 songs for the perfect party playlist.

    Overview of the dataset

    It is important to understand which countries make up the northern hemisphere before starting the analysis. According to World Population Review, several continents lie at least partly in the northern hemisphere: North America, the northern part of South America, and all of Europe, as well as the vast majority of Asia and about two-thirds of Africa. The dataset itself contains approximately 1,000 tracks per genre across 125 genres, with 20 columns and 3 null values.

    Data Quality Issues

    The Spotify dataset had the following data quality issues:

    • The dataset had 3 null values.
    • The dataset used different units across columns, as in the case of duration_ms (measured in milliseconds) and tempo (measured in beats per minute).

    Goals of the analysis

    The goals of the analysis include the following:

    • To interpret the model outcomes and use data-driven insights to curate an ultimate dance party playlist of the top 50 songs according to the model.

    Recommendations

    • I suggest that songs on the playlist should be chosen with the target audience in mind.
    • I recommend that when creating a playlist, other aspects that make up a song, such as the genre and the presence or absence of explicit words, should be taken into consideration.
    • I suggest that values should be recorded in consistent units to aid uniformity.

    💾 The Data

    You have assembled information on more than 125 genres of Spotify music tracks in a file called spotify.csv, with each genre containing approximately 1000 tracks. All tracks, from all time, have been taken into account without any time period limitations. However, the data collection was concluded in October 2022. Each row represents a track that has some audio features associated with it.

    • track_id: The Spotify ID number of the track.
    • artists: Names of the artists who performed the track, separated by a ; if there's more than one.
    • album_name: The name of the album that includes the track.
    • track_name: The name of the track.
    • popularity: A numerical value ranging from 0 to 100, with 100 being the highest popularity. It is calculated from the number of times the track has been played recently, with more recent plays contributing more to the score. Duplicate tracks are scored independently.
    • duration_ms: The length of the track, measured in milliseconds.
    • explicit: Indicates whether the track contains explicit lyrics. true means it does; false means it does not or it's unknown.
    • danceability: A score between 0.0 and 1.0 representing the track's suitability for dancing. It is calculated algorithmically from factors like tempo, rhythm stability, beat strength, and regularity.
    • energy: A score between 0.0 and 1.0 indicating the track's intensity and activity level. Energetic tracks tend to be fast, loud, and noisy.
    • key: The key the track is in. Integers map to pitches using standard pitch class notation, e.g. 0 = C, 1 = C♯/D♭, 2 = D, and so on. If no key was detected, the value is -1.
    • loudness: The overall loudness, measured in decibels (dB).
    • mode: The modality of the track, represented as 1 for major and 0 for minor.
    • speechiness: Measures the amount of spoken words in a track. A value close to 1.0 denotes speech-based content, while 0.33 to 0.66 indicates a mix of speech and music, like rap. Values below 0.33 usually indicate music and non-speech tracks.
    • acousticness: A confidence measure from 0.0 to 1.0, with 1.0 representing the highest confidence that the track is acoustic.
    • instrumentalness: An estimate of the likelihood that the track is instrumental. Non-lyrical sounds such as "ooh" and "aah" are considered instrumental, whereas rap or spoken word tracks are classified as "vocal". A value closer to 1.0 indicates a higher probability that the track lacks vocal content.
    • liveness: A measure of the probability that the track was performed live. Scores above 0.8 indicate a high likelihood of the track being live.
    • valence: A score from 0.0 to 1.0 representing the track's positiveness. High scores suggest a more positive or happier track.
    • tempo: The track's estimated tempo, measured in beats per minute (BPM).
    • time_signature: An estimate of the track's time signature (meter), a notational convention specifying how many beats are in each bar (or measure). Values range from 3 to 7, indicating time signatures of 3/4 to 7/4.
    • track_genre: The genre of the track.

    Source (data has been modified)

    1. Loading Data

    To start the analysis, I imported the Python libraries used throughout and loaded the Spotify dataset from data/spotify.csv.

    import pandas as pd
    import seaborn as sns
    import matplotlib.pyplot as plt

    # Load the Spotify dataset and preview the first rows
    spotify = pd.read_csv('data/spotify.csv')
    spotify.head()

    2. Data Cleaning and Exploration

    In this section, I checked the data for possible discrepancies such as duplicate values, null values, and unexpected unique values.

    spotify.info()

    From the above output, you can see that the dataset has 20 columns with 113,027 entries. We can also see that the columns come in different data types: floats, integers, and booleans.
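
    A duplicate check is not shown above; a minimal sketch of one, assuming the spotify DataFrame loaded earlier, could look like this:

    # Count exact duplicate rows (a sketch, not part of the original output)
    print(spotify.duplicated().sum())

    # The same track_id can appear under multiple genres; count those repeats too
    print(spotify['track_id'].duplicated().sum())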

    Null Values

    spotify.isnull().sum()

    From the above analysis, we can see that the dataset contains 3 null values, in the artists, album_name, and track_name columns. You can also see that the dataset has 20 columns in total. In the next step, I will remove the rows with null values, as they make up only a tiny fraction of the dataset.

    # Drop the rows containing null values and keep the result
    spotify = spotify.dropna()

    To carry out an accurate analysis, I double-checked the units of the values. To make the track length easier to interpret alongside tempo, which is recorded in beats per minute, I converted duration_ms from milliseconds to minutes.

    # Convert duration_ms from milliseconds to minutes
    spotify['duration_min'] = spotify['duration_ms'] / 60000
    spotify.head()

    3. Analyzing the factor 'danceability' in the dataset

    As we already know, the main goal of this analysis is to create a playlist of 50 dance-themed songs for a company's summer party. In this section, I narrowed the analysis down to the danceability factor and how it compares with other factors in the dataset. I calculated the minimum, maximum, and median values, as well as the correlation of danceability with other factors.
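
    A minimal sketch of these calculations, assuming the cleaned spotify DataFrame from the previous steps, might look like this:

    # Minimum, maximum, and median danceability scores
    print(spotify['danceability'].agg(['min', 'max', 'median']))

    # Correlation of danceability with the other numeric features
    numeric_features = spotify.select_dtypes(include='number')
    print(numeric_features.corr()['danceability'].sort_values(ascending=False))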

    Finding the song with the highest danceability score
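
    A minimal sketch of this lookup, assuming the same spotify DataFrame, could use idxmax to locate the top-scoring row:

    # Locate the row with the single highest danceability score
    top_track = spotify.loc[spotify['danceability'].idxmax()]
    print(top_track[['track_name', 'artists', 'track_genre', 'danceability']])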