You are a product manager for a fitness studio and are interested in understanding the current demand for digital fitness classes. You plan to conduct a market analysis in Python to gauge demand and identify potential areas for growth of digital products and services.

The Data

You are provided with a number of CSV files in the "Files/data" folder, which offer international and national-level data on Google Trends keyword searches related to fitness and related products.

workout.csv

| Column | Description |
|--------|-------------|
| 'month' | Month when the data was measured. |
| 'workout_worldwide' | Index representing the popularity of the keyword 'workout', on a scale of 0 to 100. |

three_keywords.csv

| Column | Description |
|--------|-------------|
| 'month' | Month when the data was measured. |
| 'home_workout_worldwide' | Index representing the popularity of the keyword 'home workout', on a scale of 0 to 100. |
| 'gym_workout_worldwide' | Index representing the popularity of the keyword 'gym workout', on a scale of 0 to 100. |
| 'home_gym_worldwide' | Index representing the popularity of the keyword 'home gym', on a scale of 0 to 100. |

workout_geo.csv

| Column | Description |
|--------|-------------|
| 'country' | Country where the data was measured. |
| 'workout_2018_2023' | Index representing the popularity of the keyword 'workout' during the 5-year period (2018-2023). |

three_keywords_geo.csv

| Column | Description |
|--------|-------------|
| 'country' | Country where the data was measured. |
| 'home_workout_2018_2023' | Index representing the popularity of the keyword 'home workout' during the 5-year period (2018-2023). |
| 'gym_workout_2018_2023' | Index representing the popularity of the keyword 'gym workout' during the 5-year period (2018-2023). |
| 'home_gym_2018_2023' | Index representing the popularity of the keyword 'home gym' during the 5-year period (2018-2023). |

Let's start by analyzing 'workout.csv'. It contains the monthly evolution of an index measuring the popularity of the search term 'workout' over five years (2018-2023), on a scale of 0 to 100. The yearly mean does not vary much, with the exception of 2020, whose mean and standard deviation rose by almost 8 and 10 points respectively; otherwise the mean fluctuates around 56 points, +/- 4.

# Import the necessary libraries
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from statsmodels.tsa.stattools import acf
import seaborn as sns
import seaborn.objects as so
import datetime as dt

# Start coding here
workout_df = pd.read_csv('data/workout.csv')

# Convert the date column to datetime to easily analyze the data
# (assign the result; astype on its own returns a copy and discards it)
workout_df['month'] = pd.to_datetime(workout_df['month'])

# Extract the year
workout_df['year'] = workout_df['month'].dt.year

# Plot Time Series data
workout_df.plot(x='month',y='workout_worldwide', color='red', marker='x')
plt.title('Fig. 1 Popularity Search Index Time Series')
plt.show()
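Before summarizing by year, it can help to separate the long-run trend from the seasonal swings in the monthly series plotted above. A minimal sketch, using a synthetic series in place of the real data (a flat level of about 56 plus an invented 12-month swing, so it runs standalone): a centered 12-month rolling mean averages out exactly one seasonal cycle, leaving only the trend.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for workout_df['workout_worldwide']:
# a base level of 56 plus a pure 12-month seasonal oscillation
months = pd.date_range('2018-03-01', periods=60, freq='MS')
values = 56 + 10 * np.cos(2 * np.pi * np.arange(60) / 12)
series = pd.Series(values, index=months)

# A centered 12-month rolling mean spans one full seasonal cycle,
# so the seasonal component cancels and only the trend remains
trend = series.rolling(window=12, center=True).mean()

# With a purely seasonal series, every complete window averages to 56
print(trend.dropna().round(2).unique())
```

On the real data the same rolling mean would make the 2020 spike stand out against an otherwise stable level.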

# Compute yearly statistics
workout_mean = workout_df.groupby('year')['workout_worldwide'].mean()
workout_std = workout_df.groupby('year')['workout_worldwide'].std()

# Year with the highest mean popularity
year_str = str(workout_mean.idxmax())

# Covid values
print("Mean Popularity Index for 2020:", workout_mean[2020])
print("STD Popularity Index for 2020:", workout_std[2020])

# Compute the overall mean and std of the yearly means
print("Yearly PI Mean:", workout_mean.mean(), "+/-", workout_mean.std())
print("Peak interest year:", year_str)

#Plot yearly data distributions
sns.set_theme(style="whitegrid")
sns.violinplot(data=workout_df, x="year", y="workout_worldwide", fill=False)
plt.ylabel('Popularity Index')
plt.grid(True)
plt.title('Fig. 2. Yearly Distribution of Index Values')
plt.show()

The time series shows a degree of seasonality. To measure it, let's calculate its autocorrelation function (Fig. 3), which suggests that the seasonal cycle is one year, i.e. 12 months, as shown by the red dotted lines. In addition, there is another recurring event around the 9th month, November in this data set: the Popularity Index starts to increase and the autocorrelation function becomes positive. This suggests that a better launch date would be late November, when searches for 'workout' start to be in higher demand.

# Convert to numpy
workout_arr = workout_df['workout_worldwide'].to_numpy()

# Compute the normalized autocorrelation function
autocorr_normalized = acf(workout_arr, nlags=len(workout_arr) - 1)

# Plot the autocorrelation with a zero reference line
plt.plot(autocorr_normalized, 'go')
plt.axhline(y=0, color='b')

# Plot aid lines
plt.axvline(x=12, color='r', linestyle='--', label='12')
plt.axvline(x=24, color='r', linestyle='--', label='24')
plt.axvline(x=36, color='r', linestyle='--', label='36')
plt.axvline(x=48, color='r', linestyle='--', label='48')
plt.axvline(x=9, color='b', linestyle='--', label='9')
plt.axvline(x=22, color='b', linestyle='--', label='22')
plt.axvline(x=34, color='b', linestyle='--', label='34')
plt.axvline(x=46, color='b', linestyle='--', label='46')
plt.title("Fig 3. Autocorrelation")
plt.legend()
plt.xlabel('Number of months')
plt.ylabel('Autocorrelation')
plt.show()
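As a sanity check on the reading of Fig. 3, here is a self-contained sketch on a synthetic series with a known 12-month cycle (not the real data). The small `autocorr` helper reproduces the usual normalization of `acf` with plain NumPy, and on a pure 12-month oscillation the result is strongly positive at lag 12 and strongly negative at the anti-phase lag 6, which is exactly the pattern the red dotted lines mark in the figure.

```python
import numpy as np

def autocorr(x: np.ndarray, nlags: int) -> np.ndarray:
    """Normalized autocorrelation: lag-k covariance over lag-0 covariance."""
    x = x - x.mean()
    n = len(x)
    denom = np.dot(x, x)
    return np.array([np.dot(x[:n - k], x[k:]) / denom for k in range(nlags + 1)])

# Synthetic monthly index with an exact 12-month cycle
t = np.arange(60)
series = 56 + 10 * np.cos(2 * np.pi * t / 12)

ac = autocorr(series, nlags=48)

# In phase at one full year, out of phase at half a year
print(round(ac[12], 2), round(ac[6], 2))
```

On the real series the lag-12 peak is less clean, but the same in-phase/anti-phase contrast is what identifies the yearly cycle.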

Now let's dive into the geographical data.

import pandas as pd

# Load the geographical data from the CSV file
workout_geo_df = pd.read_csv('data/workout_geo.csv')

# Display the first few rows of the dataframe to verify the data
workout_geo_df.head()

# Find if there is any na value
total_is_na = workout_geo_df.isna().sum().sum()
print("Total na is ", total_is_na)

# Drop na values
workout_geo_clean_df = workout_geo_df.dropna()
print("Total na is ", workout_geo_clean_df.isna().sum().sum())

# Sort by values in descending order
workout_geo_clean_df = workout_geo_clean_df.sort_values(by='workout_2018_2023', ascending=False)
print(workout_geo_clean_df.head())

# Select the country with the highest index
top_country = workout_geo_clean_df.iloc[0, 0]
print("Country with the highest interest in workouts:", top_country)

# Compare the Malaysia and Philippines indexes
malaysia_index = workout_geo_clean_df.loc[workout_geo_clean_df['country'] == "Malaysia", 'workout_2018_2023'].iloc[0]
print("Malaysia index", malaysia_index)

philippines_index = workout_geo_clean_df.loc[workout_geo_clean_df['country'] == "Philippines", 'workout_2018_2023'].iloc[0]
print("Philippines index", philippines_index)

if philippines_index > malaysia_index:
    home_workout_geo = 'Philippines'
else:
    home_workout_geo = 'Malaysia'
print("Home workout", home_workout_geo)
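The pairwise if/else comparison above generalizes: given any list of candidate countries, pandas can pick the one with the highest index directly with `idxmax`. A minimal sketch on a toy geo frame (the values are invented; only the column names match workout_geo.csv):

```python
import pandas as pd

# Toy stand-in for workout_geo_clean_df (invented values)
geo = pd.DataFrame({
    'country': ['United States', 'Malaysia', 'Philippines'],
    'workout_2018_2023': [100, 60, 70],
})

def most_interested(df: pd.DataFrame, countries: list, col: str) -> str:
    """Return the country among `countries` with the highest value in `col`."""
    subset = df[df['country'].isin(countries)]
    return subset.loc[subset[col].idxmax(), 'country']

print(most_interested(geo, ['Malaysia', 'Philippines'], 'workout_2018_2023'))
# → Philippines (70 > 60 in this toy data)
```

This scales to any number of candidate markets without nesting further if/else branches.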


# Load the geographical data from the CSV file
three_keywords_df = pd.read_csv('data/three_keywords.csv')

# Convert the date column to datetime to easily analyze the data
three_keywords_df['month'] = pd.to_datetime(three_keywords_df['month'])

# Reshape the three keyword columns into long format
three_keywords_concat = pd.DataFrame()
cols = three_keywords_df.columns.to_list()
cols.remove('month')
for col in cols:
    temp = three_keywords_df[['month', col]].copy()
    temp.rename(columns={col: 'pop_index'}, inplace=True)
    # removesuffix, not rstrip: rstrip strips a character *set*, not a suffix
    temp['keywords'] = col.removesuffix('_worldwide')
    three_keywords_concat = pd.concat([three_keywords_concat, temp], axis=0, ignore_index=True)
# Extract the year
three_keywords_concat['year'] = three_keywords_concat['month'].dt.year
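The concatenation loop above is the wide-to-long reshape that pandas' built-in `melt` performs in one call. A self-contained sketch on a toy frame (two months of made-up values, with the same column names as three_keywords.csv) shows the equivalence:

```python
import pandas as pd

# Toy stand-in for three_keywords_df (invented values)
wide = pd.DataFrame({
    'month': pd.to_datetime(['2018-03-01', '2018-04-01']),
    'home_workout_worldwide': [20, 22],
    'gym_workout_worldwide': [40, 41],
    'home_gym_worldwide': [10, 12],
})

# melt produces one row per (month, keyword) pair, mirroring the loop
long = wide.melt(id_vars='month', var_name='keywords', value_name='pop_index')
long['keywords'] = long['keywords'].str.replace('_worldwide', '', regex=False)
long['year'] = long['month'].dt.year

print(long.shape)  # → (6, 4): 2 months x 3 keywords, 4 columns
```

`melt` avoids the repeated `pd.concat` inside the loop, which copies the accumulated frame on every iteration.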


#Plot yearly data distributions
sns.set_theme(style="whitegrid")
sns.violinplot(data=three_keywords_concat, x="year", y="pop_index", hue="keywords",fill=False)
plt.ylabel('Popularity Index')
plt.grid(True)
plt.title('Fig. 4. Yearly Distribution of Index Values')
plt.show()

# Extract yearly means to compare keywords
mean_df = three_keywords_concat.groupby(['year', 'keywords'])['pop_index'].mean()

# Display the first few rows of the series to verify the data
print(mean_df.head())

# Select the top keyword for a specific year explicitly, rather than
# relying on the overall sort order of the means
peak_covid = mean_df[2020].idxmax()
print("The most popular set of keywords during Covid was", peak_covid)
current = mean_df[2023].idxmax()
print("The most popular set of keywords during 2023 was", current)
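Picking the top keyword year by year also works for all years at once: on a Series indexed by (year, keywords), a grouped `idxmax` returns the full index tuple of each group's maximum. A sketch on a toy stand-in (the values are invented, just to show the pattern):

```python
import pandas as pd

# Toy stand-in for the (year, keywords) mean Series (invented values)
mean_by_year = pd.Series(
    [45.0, 25.0, 10.0, 14.0, 50.0, 18.0],
    index=pd.MultiIndex.from_product(
        [[2020, 2023], ['home_workout', 'gym_workout', 'home_gym']],
        names=['year', 'keywords'],
    ),
    name='pop_index',
)

# For every year at once: idxmax returns (year, keyword) tuples,
# from which we keep only the keyword
top_per_year = mean_by_year.groupby('year').idxmax().apply(lambda ix: ix[1])
print(top_per_year)
```

This yields one top keyword per year in a single pass, which is handy if the analysis is extended beyond 2020 and 2023.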