Competition - Dance Party Songs

Which songs are most suitable for a dancing party?

📖 Background

It's that vibrant time of year again - Summer has arrived (for those of us in the Northern Hemisphere at least)! There's an energy in the air that inspires us to get up and move. In sync with this exuberance, your company has decided to host a dance party to celebrate. And you, with your unique blend of creativity and analytical expertise, have been entrusted with the crucial task of curating a dance-themed playlist that will set the perfect mood for this electrifying night. The question then arises - How can you identify the songs that would make the attendees dance their hearts out? This is where your coding skills come into play.

💾 The Data

You have assembled information on more than 125 genres of Spotify music tracks in a file called spotify.csv, with each genre containing approximately 1000 tracks. All tracks, from all time, have been taken into account without any time period limitations. However, the data collection was concluded in October 2022. Each row represents a track that has some audio features associated with it.

Column	Description
`track_id`	The Spotify ID number of the track.
`artists`	Names of the artists who performed the track, separated by a `;` if there's more than one.
`album_name`	The name of the album that includes the track.
`track_name`	The name of the track.
`popularity`	A numerical value from `0` to `100`, with `100` being the highest popularity. It is calculated from the number of times the track has been played recently, with more recent plays contributing more to the score. Duplicate tracks are scored independently.
`duration_ms`	The length of the track, measured in milliseconds.
`explicit`	Indicates whether the track contains explicit lyrics. `true` means it does, `false` means it does not or it's unknown.
`danceability`	A score between `0.0` and `1.0` that represents the track's suitability for dancing. It is calculated by an algorithm from factors like tempo, rhythm stability, beat strength, and regularity.
`energy`	A score between `0.0` and `1.0` indicating the track's intensity and activity level. Energetic tracks tend to be fast, loud, and noisy.
`key`	The key the track is in. Integers map to pitches using standard pitch class notation, e.g. `0 = C`, `1 = C♯/D♭`, `2 = D`, and so on. If no key was detected, the value is `-1`.
`loudness`	The overall loudness, measured in decibels (dB).
`mode`	The modality of a track, represented as `1` for major and `0` for minor.
`speechiness`	Measures the amount of spoken words in a track. A value close to `1.0` denotes speech-based content, while `0.33` to `0.66` indicates a mix of speech and music like rap. Values below `0.33` are usually music and non-speech tracks.
`acousticness`	A confidence measure from `0.0` to `1.0`, with `1.0` representing the highest confidence that the track is acoustic.
`instrumentalness`	Instrumentalness estimates the likelihood of a track being instrumental. Non-lyrical sounds such as "ooh" and "aah" are considered instrumental, whereas rap or spoken word tracks are classified as "vocal". A value closer to `1.0` indicates a higher probability that the track lacks vocal content.
`liveness`	A measure of the probability that the track was performed live. Scores above `0.8` indicate a high likelihood of the track being live.
`valence`	A score from `0.0` to `1.0` representing the track's positiveness. High scores suggest a more positive or happier track.
`tempo`	The track's estimated tempo, measured in beats per minute (BPM).
`time_signature`	An estimate of the track's time signature (meter), a notational convention that specifies how many beats are in each bar (or measure). The value ranges from `3` to `7`, indicating time signatures from `3/4` to `7/4`.
`track_genre`	The genre of the track.

Source (data has been modified)
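
The popularity description notes that duplicate tracks are scored independently, so the same track may appear more than once. The sketch below shows one possible cleaning step before analysis. It assumes that duplicates share a `track_id` and that keeping the most popular copy is acceptable. The helper name `clean_tracks` and the toy rows are hypothetical and not part of the dataset.

```python
import pandas as pd


def clean_tracks(df: pd.DataFrame) -> pd.DataFrame:
    """Drop repeated track_id rows and rows whose 0-1 scores are out of range."""
    # Assumption: when a track_id repeats, keep the copy with the highest popularity.
    deduped = (
        df.sort_values("popularity", ascending=False)
          .drop_duplicates(subset="track_id", keep="first")
    )
    # These columns are documented as lying between 0.0 and 1.0.
    unit_cols = ["danceability", "energy", "valence"]
    in_range = deduped[unit_cols].apply(lambda s: s.between(0.0, 1.0)).all(axis=1)
    return deduped[in_range].reset_index(drop=True)


# Made-up rows for illustration only (not real Spotify data).
toy = pd.DataFrame({
    "track_id": ["a", "a", "b", "c"],
    "popularity": [10, 40, 55, 20],
    "danceability": [0.7, 0.7, 0.5, 1.4],  # 1.4 is deliberately invalid
    "energy": [0.8, 0.8, 0.4, 0.6],
    "valence": [0.6, 0.6, 0.3, 0.5],
    "track_genre": ["pop", "dance", "rock", "jazz"],
})
cleaned = clean_tracks(toy)
print(cleaned[["track_id", "popularity", "track_genre"]])
```

Whether to deduplicate at all is a judgment call: keeping every copy preserves the per-genre labels, while dropping repeats stops one song from appearing several times in the final playlist.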

💪 Challenge

Your task is to devise an analytically-backed, dance-themed playlist for the company's summer party. Your choices must be justified with a comprehensive report explaining your methodology and reasoning. Below are some suggestions on how you might want to start curating the playlist:

- Use descriptive statistics and data visualization techniques to explore the audio features and understand their relationships.
- Develop and apply a machine learning model that predicts a song's danceability.
- Interpret the model outcomes and utilize your data-driven insights to curate your ultimate dance party playlist of the top 50 songs according to your model.

Descriptive Statistics and Data Visualization

import pandas as pd
spotify_data = pd.read_csv('data/spotify.csv')
spotify_data

import matplotlib.pyplot as plt
import seaborn as sns

# Select key audio features for exploration
audio_features = ['danceability', 'energy', 'tempo', 'valence', 'popularity']

# Generate summary statistics
summary_stats = spotify_data[audio_features].describe()

# Display summary statistics
summary_stats

# Visualize distributions of key audio features
plt.figure(figsize=(15, 10))
for i, feature in enumerate(audio_features, 1):
    plt.subplot(2, 3, i)
    sns.histplot(spotify_data[feature], bins=30, kde=True)
    plt.title(f'Distribution of {feature.capitalize()}')
plt.tight_layout()
plt.show()

# Pairplot to visualize relationships
sns.pairplot(spotify_data[audio_features])
plt.show()

Initial Analysis Summary

From the summary statistics and visualizations, we have the following observations:

Distribution:
- The danceability scores are generally well-distributed with most values between 0.4 and 0.9.
- Energy and tempo also show a wide range of values, indicating diversity in the tracks.
- Valence and popularity show more concentration towards the middle of their respective ranges.

Relationships:
- Pairplots suggest some positive relationships between danceability and energy, and between danceability and tempo.
- Valence (happiness) seems to have a slight positive relationship with danceability.
- There isn't a strong visible relationship between danceability and popularity.
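
Visual impressions such as "most values between 0.4 and 0.9" can be backed with numbers. The sketch below shows one way to do that with quantiles and an interval share. The helper `share_within` is hypothetical, and the demo series is synthetic, standing in for `spotify_data['danceability']`.

```python
import numpy as np
import pandas as pd


def share_within(series: pd.Series, low: float, high: float) -> float:
    """Fraction of non-missing values in the closed interval [low, high]."""
    values = series.dropna()
    return float(values.between(low, high).mean())


# Synthetic stand-in for the danceability column (illustration only).
rng = np.random.default_rng(0)
demo = pd.Series(rng.beta(5, 3, size=10_000), name="danceability")

# Quantiles summarize where the bulk of the distribution sits.
print(demo.quantile([0.05, 0.25, 0.50, 0.75, 0.95]))
print(f"Share in [0.4, 0.9]: {share_within(demo, 0.4, 0.9):.1%}")
```

On the real data, the same call would be `share_within(spotify_data['danceability'], 0.4, 0.9)`.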

Correlation Analysis

# Calculate the correlation matrix
correlation_matrix = spotify_data[audio_features].corr()

# Visualize the correlation matrix
plt.figure(figsize=(10, 6))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', vmin=-1, vmax=1)
plt.title('Correlation Matrix of Key Audio Features')
plt.show()

Correlation Analysis Summary

From the correlation matrix, we observe the following:

Danceability:

- Energy has a moderate positive correlation with danceability (around 0.55).
- Tempo shows a moderate positive correlation with danceability (around 0.33).
- Valence has a lower positive correlation with danceability (around 0.29).
- Popularity has a very weak positive correlation with danceability (around 0.09).

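These correlations suggest that energy, tempo, and valence are the features most strongly associated with danceability, with energy showing the strongest linear association. Correlation alone does not establish that any of these features drives danceability.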
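
Reading values off a heatmap is error-prone, so it can help to print the correlations with the target as a ranked list. The sketch below covers every numeric column rather than only the five selected features. The helper `rank_correlations` is hypothetical and the demo frame is made up. On the real data the call would be `rank_correlations(spotify_data)`.

```python
import pandas as pd


def rank_correlations(df: pd.DataFrame, target: str = "danceability") -> pd.Series:
    """Pearson correlation of each numeric column with the target, strongest first."""
    numeric = df.select_dtypes(include="number")
    corr = numeric.corr()[target].drop(target)
    # Order by absolute value so strong negative correlations are not buried.
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)


# Made-up numbers for illustration only.
demo = pd.DataFrame({
    "danceability": [1.0, 2.0, 3.0, 4.0],
    "a": [2.0, 4.0, 6.0, 8.0],  # perfectly correlated with the target
    "b": [1.0, 0.0, 1.0, 0.0],  # weaker, negative correlation
    "label": ["w", "x", "y", "z"],  # non-numeric columns are ignored
})
ranked = rank_correlations(demo)
print(ranked)
```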

Machine Learning Model

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Select features and target variable
features = ['energy', 'tempo', 'valence', 'popularity']
target = 'danceability'

# Split the data into training and testing sets
X = spotify_data[features]
y = spotify_data[target]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize and train the linear regression model
model = LinearRegression()
model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = model.predict(X_test)

# Evaluate model performance
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

# Display model performance metrics
model_performance = {
    'Mean Squared Error': mse,
    'R-squared': r2,
    # Pair each coefficient with its feature name so the output is readable
    'Coefficients': dict(zip(features, model.coef_)),
    'Intercept': model.intercept_
}

model_performance
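
The four features are measured on different scales (energy and valence from 0.0 to 1.0, popularity from 0 to 100, tempo in BPM), so raw coefficients are hard to compare. One common remedy is to standardize the features before fitting. The sketch below illustrates that idea. The helper `standardized_coefficients` is hypothetical, and the demo data is synthetic with made-up effect sizes, not the Spotify data.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def standardized_coefficients(X: pd.DataFrame, y: pd.Series) -> pd.Series:
    """Fit a linear model on z-scored features; return coefficients by magnitude."""
    pipeline = make_pipeline(StandardScaler(), LinearRegression())
    pipeline.fit(X, y)
    coefs = pd.Series(pipeline.named_steps["linearregression"].coef_, index=X.columns)
    # Each value is the change in y per one standard deviation of the feature.
    return coefs.reindex(coefs.abs().sort_values(ascending=False).index)


# Synthetic demo: tempo has a tiny per-BPM effect but a wide spread.
rng = np.random.default_rng(42)
n = 2_000
X_demo = pd.DataFrame({
    "energy": rng.uniform(0.0, 1.0, n),
    "tempo": rng.normal(120.0, 30.0, n),
})
y_demo = 0.05 * X_demo["energy"] + 0.002 * X_demo["tempo"] + rng.normal(0.0, 0.01, n)

raw = pd.Series(LinearRegression().fit(X_demo, y_demo).coef_, index=X_demo.columns)
std = standardized_coefficients(X_demo, y_demo)
print("Raw coefficients:\n", raw)
print("Standardized coefficients:\n", std)
```

In this synthetic example the raw energy coefficient is larger than the raw tempo coefficient, yet tempo has the larger standardized effect. On the real data the call would be `standardized_coefficients(X_train, y_train)`.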

Machine Learning Model Summary

The linear regression model provided the following insights:

Model Performance:
- Mean Squared Error (MSE): 0.023, the average squared difference between the predicted and actual danceability values. This corresponds to a root mean squared error of about 0.15 on the 0.0 to 1.0 danceability scale.
- R-squared: 0.235, suggesting that approximately 23.5% of the variance in danceability is explained by the model. This is relatively low, indicating that other factors not included in the model may also influence danceability.
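Coefficients:
- Energy: 0.023 (change in predicted danceability across energy's full 0.0 to 1.0 range)
- Tempo: -0.0006 per BPM (small per unit, but a 100 BPM difference shifts the prediction by about -0.06)
- Valence: 0.317 (the largest coefficient)
- Popularity: 0.0004 per popularity point (about 0.04 across the full 0 to 100 range)

The model indicates that valence (happiness) has the strongest positive association with danceability. The other raw coefficients are not directly comparable because the features are measured on different scales: energy and valence run from 0.0 to 1.0, popularity from 0 to 100, and tempo is in BPM. Over comparable spans, tempo and popularity shift the prediction by at least as much as energy's 0.023, so energy is not clearly the second most important feature. This ordering also differs from the pairwise correlations reported above, where energy ranked first.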
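
The challenge asks for a playlist of the top 50 songs according to the model. The sketch below shows one way that final step could look: score every track with the fitted model, sort by predicted danceability, and drop repeated songs. The helper `top_predicted_tracks` and the `predicted_danceability` column are hypothetical. Treating rows with the same `track_name` and `artists` as the same song is an assumption. The toy rows are made up.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression


def top_predicted_tracks(df: pd.DataFrame, model, features: list, n: int = 50) -> pd.DataFrame:
    """Return the n tracks with the highest model-predicted danceability."""
    scored = df.copy()
    scored["predicted_danceability"] = model.predict(scored[features])
    scored = scored.sort_values("predicted_danceability", ascending=False)
    # Assumption: identical track_name + artists means the same song.
    scored = scored.drop_duplicates(subset=["track_name", "artists"])
    return scored.head(n)


# Made-up rows for illustration only.
toy = pd.DataFrame({
    "track_name": ["A", "B", "C", "A"],
    "artists": ["x", "y", "z", "x"],
    "valence": [0.9, 0.2, 0.6, 0.9],
    "danceability": [0.9, 0.2, 0.6, 0.9],
})
toy_features = ["valence"]
toy_model = LinearRegression().fit(toy[toy_features], toy["danceability"])

playlist = top_predicted_tracks(toy, toy_model, toy_features, n=2)
print(playlist[["track_name", "artists", "predicted_danceability"]])
```

With the notebook's variables, the call would be `top_predicted_tracks(spotify_data, model, features, n=50)`. Because the model explains only about 23.5% of the variance, the predicted ranking is a rough guide, and comparing it with the actual `danceability` column is a sensible sanity check.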
