The project is defined for a product manager for a fitness studio based in Singapore and they are interested in understanding the types of digital products you should offer. They plan to conduct a market analysis in Python to understand how to place your digital fitness products in the regional market. A market analysis will allow them to identify strengths of our competitors, gauge demand, and create unique new digital products and services for potential users.
We are provided with a number of CSV files in the Files-"data" folder, which offer international data on Google Trends and YouTube keyword searches related to fitness and related products. Two helper functions have also been provided, read_file
and read_geo
, to help us process and visualize these CSV files for further analysis.
We'll use pandas
methods to explore this data and drive your product management insights.
# STARTER CODE - PLEASE DO NOT EDIT ANY CODE IN THIS CELL
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
sns.set(style='white', palette='Pastel2')
import os
def read_file(filepath, plot = True):
"""
Read a CSV file from a given filepath, convert it into a pandas DataFrame,
and return a processed DataFrame with three columns: 'week', 'region', and 'interest'. Generate a line plot using Seaborn to visualize the data. This corresponds to the first graphic (time series) returned by trends.google.com.
"""
file = pd.read_csv(filepath, header=1)
df = file.set_index('Week').stack().reset_index()
df.columns = ['week','region','interest']
df['week'] = pd.to_datetime(df['week'])
plt.figure(figsize=(8,3))
df = df[df['interest']!="<1"]
df['interest'] = df['interest'].astype(float)
if plot:
sns.lineplot(data = df, x= 'week', y= 'interest',hue='region')
return df
def read_geo(filepath, multi=False):
"""
Read a CSV file from a given filepath, convert it into a pandas DataFrame,
and return a processed DataFrame with two columns: 'country' and 'interest'. Generate a bar plot using Seaborn to visualize the data. This corresponds to the second graphic returned by trends.google.com. Use multi=False if only one keyword is being analyzed, and multi=True if more than one keyword is being analyzed.
"""
file = pd.read_csv(filepath, header=1)
if not multi:
file.columns = ['country', 'interest']
plt.figure(figsize=(8,4))
sns.barplot(data = file.dropna().iloc[:25,:], y = 'country', x='interest')
if multi:
plt.figure(figsize=(3,8))
file = file.set_index('Country').stack().reset_index()
file.columns = ['country','category','interest']
file['interest'] = pd.to_numeric(file['interest'].apply(lambda x: x[:-1]))
sns.barplot(data=file.dropna(), y = 'country', x='interest', hue='category')
file = file.sort_values(ascending=False,by='interest')
return file
# 1. Load data on global interest in fitness
workout = read_file('data/workout.csv')
# 2. Assess global interest in fitness
workout_by_month = workout.set_index('week').resample('MS').mean()
month_high = workout_by_month[workout_by_month['interest']==workout_by_month['interest'].max()]
month_str = str(month_high.index[0].date())
# 3. Compare interest in home workouts, gym workouts and home gyms
workout = read_file('data/three_keywords.csv') # This will create a lineplot
current = 'gym workout'
peak_covid = 'home workout'
# 4. Segment global interest by region
workout_global = read_geo('data/workout_global.csv')
top_25_countries = workout_global.head(25)
top_country = top_25_countries['country'].iloc[0]
# 5. Assessing regional demand for home workouts, gym workouts and home gyms
geo_categories = read_geo('data/geo_three_keywords.csv', multi=True)
MESA_countries = ["Philippines", "Singapore", "United Arab Emirates", "Qatar", "Kuwait", "Malaysia", "Sri Lanka", "India", "Pakistan", "Lebanon"]
MESA = geo_categories.loc[geo_categories.country.isin(MESA_countries), :]
# 6. Assess the split of interest by country and category
MESA.set_index(['country','category']).unstack()
top_home_workout_country = 'Philippines'
# 7. A deeper dive into two countries
read_file('data/yoga_zumba_sng.csv')
read_file('data/yoga_zumba_phl.csv')
pilot_content = ['yoga', 'zumba']