💪 Challenge

Your task is to devise an analytically-backed, dance-themed playlist for the company's summer party. Your choices must be justified with a comprehensive report explaining your methodology and reasoning. Below are some suggestions on how you might want to start curating the playlist:

  • Use descriptive statistics and data visualization techniques to explore the audio features and understand their relationships.
  • Develop and apply a machine learning model that predicts a song's danceability.
  • Interpret the model outcomes and utilize your data-driven insights to curate your ultimate dance party playlist of the top 50 songs according to your model.

💾 The Data

You have assembled information on more than 125 genres of Spotify music tracks in a file called spotify.csv, with each genre containing approximately 1000 tracks. All tracks, from all time, have been taken into account without any time period limitations. However, the data collection was concluded in October 2022. Each row represents a track that has some audio features associated with it.

Column | Description
track_id | The Spotify ID number of the track.
artists | Names of the artists who performed the track, separated by a ; if there's more than one.
album_name | The name of the album that includes the track.
track_name | The name of the track.
popularity | A numerical value from 0 to 100, with 100 being the highest popularity. This is calculated based on the number of times the track has been played recently, with more recent plays contributing more to the score. Duplicate tracks are scored independently.
duration_ms | The length of the track, measured in milliseconds.
explicit | Indicates whether the track contains explicit lyrics. true means it does; false means it does not or it is unknown.
danceability | A score between 0.0 and 1.0 that represents the track's suitability for dancing. This is calculated by an algorithm and is determined by factors like tempo, rhythm stability, beat strength, and regularity.
energy | A score between 0.0 and 1.0 indicating the track's intensity and activity level. Energetic tracks tend to be fast, loud, and noisy.
key | The key the track is in. Integers map to pitches using standard pitch class notation, e.g., 0 = C, 1 = C♯/D♭, 2 = D, and so on. If no key was detected, the value is -1.
loudness | The overall loudness, measured in decibels (dB).
mode | The modality of a track, represented as 1 for major and 0 for minor.
speechiness | Measures the amount of spoken words in a track. A value close to 1.0 denotes speech-based content, while 0.33 to 0.66 indicates a mix of speech and music like rap. Values below 0.33 are usually music and non-speech tracks.
acousticness | A confidence measure from 0.0 to 1.0, with 1.0 representing the highest confidence that the track is acoustic.
instrumentalness | Estimates the likelihood of a track being instrumental. Non-lyrical sounds such as "ooh" and "aah" are considered instrumental, whereas rap or spoken word tracks are classified as "vocal". A value closer to 1.0 indicates a higher probability that the track lacks vocal content.
liveness | A measure of the probability that the track was performed live. Scores above 0.8 indicate a high likelihood of the track being live.
valence | A score from 0.0 to 1.0 representing the track's positiveness. High scores suggest a more positive or happier track.
tempo | The track's estimated tempo, measured in beats per minute (BPM).
time_signature | An estimate of the track's time signature (meter), which is a notational convention to specify how many beats are in each bar (or measure). The value ranges from 3 to 7, indicating time signatures of 3/4 to 7/4.
track_genre | The genre of the track.
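The integer encoding of key and the millisecond unit of duration_ms are easier to read once converted. The following is a minimal sketch of that conversion, assuming toy rows rather than the real spotify.csv; the helper key_to_name and the derived columns key_name and duration_min are hypothetical names introduced only for illustration.

```python
import pandas as pd

# Pitch-class names indexed by the integer stored in the `key` column
# (0 = C, 1 = C♯/D♭, 2 = D, ...), following the column description above.
PITCH_CLASSES = ["C", "C♯/D♭", "D", "D♯/E♭", "E", "F",
                 "F♯/G♭", "G", "G♯/A♭", "A", "A♯/B♭", "B"]


def key_to_name(key: int) -> str:
    """Return the pitch-class name for a key code, or 'unknown' for -1."""
    if key == -1:
        return "unknown"
    return PITCH_CLASSES[key]


# Toy rows for illustration only; these are not taken from the dataset.
toy = pd.DataFrame({
    "key": [0, 1, 2, 11, -1],
    "duration_ms": [210000, 185500, 240000, 199000, 60000],
})

# Hypothetical helper columns: readable key name and duration in minutes.
toy["key_name"] = toy["key"].map(key_to_name)
toy["duration_min"] = toy["duration_ms"] / 60_000

print(toy)
```

Treating -1 as its own "unknown" label keeps undetected keys from being confused with a real pitch class in later plots or group-bys.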

Source (data has been modified)

📜 Executive Summary:

This report provides a comprehensive analysis of a music dataset, focusing on danceability as a key attribute, and explores various aspects of data preprocessing, genre analysis, audio features, and machine learning modeling. The objective of this analysis is to gain insights into the factors that influence danceability and create a playlist of songs optimized for danceability.

Data Exploration: Initial data inspection using .info() and .describe() functions to understand dataset structure and statistics. Plotting visualizations to gain insights.

Data Pre-processing: Duplicate entries are removed for data accuracy. Missing values are imputed.

Genre Analysis: An exploration of the distribution of music genres in the dataset, providing insights for subsequent genre-based investigations, followed by an analysis of how different genres correlate with danceability scores to help identify high-danceability genres.

Feature Correlation: Visualization of feature correlations with danceability through a heatmap and histograms, offering insights into feature importance.

Clustering: K-Means clustering is used to cluster the genres based on song danceability and valence. The elbow method is used to find the optimal number of clusters.

ML Modeling: After validating the candidate ML models, a KNN Regressor is used to predict danceability scores from selected features. Feature selection, data splitting, and performance evaluation are carried out to obtain the best results.

Playlist Creation: A curated dance playlist is generated based on the analysis and modeling results, catering to dance enthusiasts.

Playlist Refining: The playlist is refined with a weight-based algorithm.
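As a rough illustration of the pre-processing step summarized above, the sketch below removes duplicates and imputes missing values on toy data. The summary does not say which key identifies a duplicate or which imputation strategy is used, so both choices here (track_id as the duplicate key, column median as the fill value) are assumptions, not the report's actual method.

```python
import numpy as np
import pandas as pd

# Toy rows for illustration only; "t1" appears twice on purpose.
toy = pd.DataFrame({
    "track_id": ["t1", "t1", "t2", "t3"],
    "danceability": [0.75, 0.75, np.nan, 0.25],
    "tempo": [120.0, 120.0, 100.0, np.nan],
})

# Assumption: a repeated track_id marks a duplicate; keep the first occurrence.
deduped = toy.drop_duplicates(subset="track_id", keep="first").reset_index(drop=True)

# Assumption: fill gaps in numeric columns with that column's median.
numeric_cols = deduped.select_dtypes(include="number").columns
cleaned = deduped.copy()
cleaned[numeric_cols] = cleaned[numeric_cols].fillna(cleaned[numeric_cols].median())

print(cleaned)
```

The median is used here only because it is robust to outliers; a mean or a genre-wise fill would be equally plausible readings of "imputed".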

πŸ‘©β€πŸ’» Reading Data

In this step, we read the necessary data for our analysis. The dataset was provided by the competition and originally sourced from Kaggle. For this task, let's import the pandas library and use it to unveil our dataset.

import pandas as pd
spotify = pd.read_csv('data/spotify.csv')
spotify
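Right after loading, it can help to confirm that the columns the analysis relies on are actually present. The sketch below shows one way to do that, assuming a small in-memory CSV as a stand-in for data/spotify.csv; the EXPECTED set is a hypothetical subset of the 20 documented columns.

```python
import io

import pandas as pd

# Stand-in for data/spotify.csv so the sketch runs without the real file.
csv_text = """track_id,track_name,danceability,explicit
a1,Song A,0.71,False
b2,Song B,0.35,True
"""

# Hypothetical subset of the columns we expect; the real file has 20.
EXPECTED = {"track_id", "track_name", "danceability", "explicit"}

df = pd.read_csv(io.StringIO(csv_text))

# Fail early with a clear message if any expected column is absent.
missing = EXPECTED - set(df.columns)
if missing:
    raise ValueError(f"missing columns: {sorted(missing)}")

print(df.shape)
```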

🔎 Exploring the Data

In this step, we will explore the dataset to gain insights and understand the structure of the data.

Let's start by examining the first few rows of the dataset.

spotify.head()

Looks like we have got a lot of da'ta'cing to do!

Using the .info() function

.info() is a pandas DataFrame method that prints a concise summary: the column names, the number of non-null values in each column, and each column's data type. It is a quick way to get an overview of the structure and content of a DataFrame.

spotify.info()


What do we know so far? 🤔

Dataset Size: The dataset contains a total of 113,027 entries or rows.

Columns: There are 20 columns in this dataset, each representing different attributes of music tracks.

Data Types:

  • Most columns contain numeric data types, including integers (int64) and floating-point numbers (float64).
  • The explicit column is represented as a boolean (bool) data type, indicating whether a track contains explicit content.
  • Several columns, such as track_id, artists, album_name, track_name, and track_genre, are of object (object) data type, which typically represents strings or categorical data.
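The same grouping by data type can be done programmatically, which is handy when choosing features later. This is a minimal sketch on a toy DataFrame (not the real dataset) using pandas' select_dtypes; the variable names are illustrative.

```python
import pandas as pd

# Toy frame mirroring the kinds of columns described above.
toy = pd.DataFrame({
    "track_name": ["Song A", "Song B"],
    "popularity": [55, 72],
    "danceability": [0.71, 0.35],
    "explicit": [False, True],
})

# Integers and floats; booleans are not matched by "number".
numeric_cols = toy.select_dtypes(include="number").columns.tolist()

# Boolean flags such as `explicit`.
bool_cols = toy.select_dtypes(include="bool").columns.tolist()

# Everything else, i.e. the string-like columns.
text_cols = toy.select_dtypes(exclude=["number", "bool"]).columns.tolist()

print(numeric_cols, bool_cols, text_cols)
```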

Using the .describe() function

.describe() is a pandas DataFrame method that returns descriptive statistics for the numerical columns: count, mean, standard deviation, minimum, maximum, and quartiles. It is a quick way to get an overview of the distribution of the numerical data in a DataFrame.

spotify.describe()
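The summary returned by .describe() is itself a DataFrame, so individual statistics can be pulled out by label and compared against the documented ranges. The sketch below assumes toy values, not the real dataset, and checks that danceability stays within 0.0 to 1.0.

```python
import pandas as pd

# Toy values for illustration only.
toy = pd.DataFrame({
    "danceability": [0.25, 0.5, 0.75, 1.0],
    "tempo": [90.0, 100.0, 120.0, 130.0],
})

# Rows of the summary are labeled count, mean, std, min, 25%, 50%, 75%, max.
summary = toy.describe()

# Pull single statistics by row and column label.
dance_min = summary.loc["min", "danceability"]
dance_max = summary.loc["max", "danceability"]

# Sanity check against the documented 0.0 to 1.0 range.
in_range = toy["danceability"].between(0.0, 1.0).all()

print(summary)
print(dance_min, dance_max, in_range)
```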
