Challenge
Your task is to devise an analytically-backed, dance-themed playlist for the company's summer party. Your choices must be justified with a comprehensive report explaining your methodology and reasoning. Below are some suggestions on how you might want to start curating the playlist:
- Use descriptive statistics and data visualization techniques to explore the audio features and understand their relationships.
- Develop and apply a machine learning model that predicts a song's danceability.
- Interpret the model outcomes and utilize your data-driven insights to curate your ultimate dance party playlist of the top 50 songs according to your model.
The Data
You have assembled information on more than 125 genres of Spotify music tracks in a file called spotify.csv, with each genre containing approximately 1000 tracks. All tracks, from all time, have been taken into account without any time period limitations. However, the data collection was concluded in October 2022.
Each row represents a track that has some audio features associated with it.
| Column | Description |
|---|---|
| track_id | The Spotify ID number of the track. |
| artists | Names of the artists who performed the track, separated by a ; if there's more than one. |
| album_name | The name of the album that includes the track. |
| track_name | The name of the track. |
| popularity | A numerical value ranging from 0 to 100, with 100 being the highest popularity. This is calculated based on the number of times the track has been played recently, with more recent plays contributing more to the score. Duplicate tracks are scored independently. |
| duration_ms | The length of the track, measured in milliseconds. |
| explicit | Indicates whether the track contains explicit lyrics. true means it does, false means it does not or it's unknown. |
| danceability | A score between 0.0 and 1.0 that represents the track's suitability for dancing. This is calculated by an algorithm and is determined by factors like tempo, rhythm stability, beat strength, and regularity. |
| energy | A score between 0.0 and 1.0 indicating the track's intensity and activity level. Energetic tracks tend to be fast, loud, and noisy. |
| key | The key the track is in. Integers map to pitches using standard pitch class notation, e.g. 0 = C, 1 = C♯/D♭, 2 = D, and so on. If no key was detected, the value is -1. |
| loudness | The overall loudness, measured in decibels (dB). |
| mode | The modality of a track, represented as 1 for major and 0 for minor. |
| speechiness | Measures the amount of spoken words in a track. A value close to 1.0 denotes speech-based content, while 0.33 to 0.66 indicates a mix of speech and music like rap. Values below 0.33 are usually music and non-speech tracks. |
| acousticness | A confidence measure ranging from 0.0 to 1.0, with 1.0 representing the highest confidence that the track is acoustic. |
| instrumentalness | Estimates the likelihood of a track being instrumental. Non-lyrical sounds such as "ooh" and "aah" are considered instrumental, whereas rap or spoken word tracks are classified as "vocal". A value closer to 1.0 indicates a higher probability that the track lacks vocal content. |
| liveness | A measure of the probability that the track was performed live. Scores above 0.8 indicate a high likelihood of the track being live. |
| valence | A score from 0.0 to 1.0 representing the track's positiveness. High scores suggest a more positive or happier track. |
| tempo | The track's estimated tempo, measured in beats per minute (BPM). |
| time_signature | An estimate of the track's time signature (meter), which is a notational convention to specify how many beats are in each bar (or measure). The time signature ranges from 3 to 7, indicating time signatures of 3/4 to 7/4. |
| track_genre | The genre of the track. |
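
The key and mode columns are easier to read once the integers are turned back into names. The following is a minimal sketch of that mapping, following the pitch class notation described in the table; the helper name key_name is hypothetical and not part of the dataset or of any library.

```python
# Pitch-class names for the integer `key` column (0 = C, 1 = C♯/D♭, 2 = D, ...).
PITCH_CLASSES = [
    "C", "C♯/D♭", "D", "D♯/E♭", "E", "F",
    "F♯/G♭", "G", "G♯/A♭", "A", "A♯/B♭", "B",
]


def key_name(key: int, mode: int) -> str:
    """Return a readable key such as 'A minor' (hypothetical helper).

    key: pitch class 0-11, or -1 when no key was detected.
    mode: 1 for major, 0 for minor.
    """
    if key == -1:
        return "unknown"
    if not 0 <= key < len(PITCH_CLASSES):
        raise ValueError(f"key must be -1 or 0-11, got {key}")
    quality = "major" if mode == 1 else "minor"
    return f"{PITCH_CLASSES[key]} {quality}"


print(key_name(9, 0))
print(key_name(-1, 1))
```

With a DataFrame loaded, the same helper could be applied row by row to build a readable column, at the cost of some speed on a dataset of this size.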
Source (data has been modified)
Executive Summary:
This report provides a comprehensive analysis of a music dataset, focusing on danceability as a key attribute, and explores various aspects of data preprocessing, genre analysis, audio features, and machine learning modeling. The objective of this analysis is to gain insights into the factors that influence danceability and create a playlist of songs optimized for danceability.
Data Exploration: Initial data inspection using the .info() and .describe() functions to understand the dataset's structure and statistics, followed by visualizations to gain insights.
Data Pre-processing: Duplicate entries are removed for data accuracy, and missing values are imputed.
Genre Analysis: An exploration of the distribution of music genres in the dataset, providing insights for subsequent genre-based investigations, and an analysis of how different genres correlate with danceability scores, helping identify high-danceability genres.
Feature Correlation: Visualization of feature correlations with danceability through a heatmap and histograms, offering insights into feature importance.
Clustering: K-Means clustering is used to cluster the genres, and the elbow method is used to find the optimal number of clusters based on song danceability and valence.
ML Modeling: After validating the candidate ML models, a KNN regressor is used to predict danceability scores based on selected features. Feature selection, data splitting, and performance evaluation are carried out to get the best results.
Playlist Creation: A curated dance playlist is generated based on the analysis and modeling results, catering to dance enthusiasts.
Playlist Refining: The playlist is refined with a weight-based algorithm.
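
To make the steps above concrete, here is a rough sketch of the pre-processing, modeling, and playlist steps on synthetic data. It is not the report's actual pipeline: the toy rows, the three chosen features, the median imputation, the scaling step, and n_neighbors=5 are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for spotify.csv (made-up values, not real tracks).
rng = np.random.default_rng(0)
n = 500
toy = pd.DataFrame({
    "track_id": [f"id_{i}" for i in range(n)],
    "energy": rng.uniform(0, 1, n),
    "valence": rng.uniform(0, 1, n),
    "tempo": rng.uniform(60, 180, n),
})
# Invented relationship so the model has something to learn.
noise = rng.normal(0, 0.05, n)
toy["danceability"] = (0.5 * toy["valence"] + 0.3 * toy["energy"] + noise).clip(0, 1)

# Inject the two data problems the summary mentions.
toy = pd.concat([toy, toy.head(20)], ignore_index=True)  # duplicate tracks
toy.loc[toy.index[:5], "tempo"] = np.nan                 # missing values

# Pre-processing: drop duplicate tracks, then impute missing tempo with the median.
clean = toy.drop_duplicates(subset="track_id").copy()
clean["tempo"] = clean["tempo"].fillna(clean["tempo"].median())

# Modeling: split, scale (KNN is distance-based), fit, and evaluate on held-out rows.
features = ["energy", "valence", "tempo"]
X, y = clean[features], clean["danceability"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
model = make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=5))
model.fit(X_train, y_train)
r2 = model.score(X_test, y_test)  # R^2 on the test split

# Playlist: rank every track by predicted danceability and keep the top 50.
clean["predicted_danceability"] = model.predict(X)
playlist = clean.nlargest(50, "predicted_danceability")
print(f"test R^2: {r2:.2f}, playlist size: {len(playlist)}")
```

Scaling is included because KNN relies on distances, so an unscaled tempo in BPM would otherwise dominate features that live in the 0 to 1 range.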
Reading Data
In this step, we read the necessary data for our analysis. The dataset was provided by the competition and originally sourced from Kaggle. For this task, let's import the pandas library and use it to unveil our dataset.
import pandas as pd
spotify = pd.read_csv('data/spotify.csv')
spotify
Exploring the Data
In this step, we will explore the dataset to gain insights and understand the structure of the data.
Let's start by examining the first few rows of the dataset.
spotify.head()
Looks like we have got a lot of da'ta'cing to do!
Using the .info() function
The .info() method in pandas provides a concise summary of a DataFrame: the column names, the number of non-null values in each column, and each column's data type.
Using the .info() function is a quick way to get an overview of the structure and content of a DataFrame.
spotify.info()
Recommendation
What do we know so far?
Dataset Size: The dataset contains a total of 113,027 entries or rows.
Columns: There are 20 columns in this dataset, each representing different attributes of music tracks.
Data Types:
- Most columns contain numeric data types, including integers (int64) and floating-point numbers (float64).
- The explicit column is represented as a boolean (bool) data type, indicating whether a track contains explicit content.
- Several columns, such as track_id, artists, album_name, track_name, and track_genre, are of object (object) data type, which typically represents strings or categorical data.
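
The observations above can also be checked programmatically rather than read off the .info() printout. Below is a small sketch on a toy DataFrame; the rows are invented, and only the column names come from the dataset description.

```python
import pandas as pd

# Tiny made-up frame that mirrors a few of the dataset's columns.
toy = pd.DataFrame({
    "track_id": ["a1", "b2", "c3"],
    "popularity": [73, 55, 12],
    "danceability": [0.68, 0.42, 0.81],
    "explicit": [False, True, False],
    "track_genre": ["acoustic", "dance", "dance"],
})

# Dataset size: number of rows and columns.
n_rows, n_cols = toy.shape

# How many columns there are of each data type.
dtype_counts = toy.dtypes.astype(str).value_counts()

# Numeric columns (ints and floats; booleans are not counted as numbers here).
numeric_cols = toy.select_dtypes(include="number").columns.tolist()

# Everything that is neither numeric nor boolean, i.e. the text columns.
text_cols = toy.select_dtypes(exclude=["number", "bool"]).columns.tolist()

# Missing values per column, and tracks per genre.
missing = toy.isna().sum()
genre_counts = toy["track_genre"].value_counts()

print(n_rows, n_cols)
print(numeric_cols, text_cols)
```

Running the same calls on spotify instead of toy would give the row count, the column count, and the split between numeric, boolean, and text columns in one pass.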
Using the .describe() function
The .describe() method in pandas provides descriptive statistics of a DataFrame: the count, mean, standard deviation, minimum, maximum, and quartiles of each numerical column.
Using the .describe() function is a quick way to get an overview of the distribution and summary statistics of the numerical data in a DataFrame.
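
As a small illustration of what .describe() reports, the sketch below runs it on a toy DataFrame with invented values and then uses the result to check a column against its documented 0.0 to 1.0 range.

```python
import pandas as pd

# Made-up values; only the column names come from the dataset description.
toy = pd.DataFrame({
    "danceability": [0.2, 0.4, 0.6, 0.8],
    "tempo": [90.0, 100.0, 120.0, 130.0],
    "track_genre": ["a", "b", "c", "d"],
})

# By default only the numeric columns are summarized; track_genre is left out.
summary = toy.describe()
print(summary)

# Individual statistics are looked up by row label and column name.
mean_dance = summary.loc["mean", "danceability"]
median_tempo = summary.loc["50%", "tempo"]

# Sanity check: danceability should stay inside its documented range.
in_range = (
    summary.loc["min", "danceability"] >= 0.0
    and summary.loc["max", "danceability"] <= 1.0
)
print(mean_dance, median_tempo, in_range)
```

Checking min and max this way is a cheap first test for out-of-range values before any plotting or modeling.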
spotify.describe()
Recommendation