You are a product manager for a fitness studio based in Singapore and are interested in understanding the types of digital products you should offer. You plan to conduct a market analysis in Python to understand how to place your digital fitness products in the regional market. A market analysis will allow you to identify strengths of your competitors, gauge demand, and create unique new digital products and services for potential users.
You are provided with a number of CSV files in the "data" folder (under Files), which contain international Google Trends and YouTube keyword-search data on fitness and related products. Two helper functions, read_file and read_geo, have also been provided to help you process and visualize these CSV files for further analysis.
You'll use pandas methods to explore this data and drive your product management insights.
You can continue beyond the bounds of this project and also investigate in-person classes, local gyms, and online personal instructors!
# STARTER CODE - PLEASE DO NOT EDIT ANY CODE IN THIS CELL
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
sns.set(style='white', palette='Pastel2')
import os

def read_file(filepath, plot = True):
    """
    Read a CSV file from a given filepath, convert it into a pandas DataFrame,
    and return a processed DataFrame with three columns: 'week', 'region', and 'interest'.
    Generate a line plot using Seaborn to visualize the data. This corresponds to the
    first graphic (time series) returned by trends.google.com.
    """
    file = pd.read_csv(filepath, header=1)
    df = file.set_index('Week').stack().reset_index()
    df.columns = ['week','region','interest']
    df['week'] = pd.to_datetime(df['week'])
    plt.figure(figsize=(8,3))
    df = df[df['interest']!="<1"]
    df['interest'] = df['interest'].astype(float)

    if plot:
        sns.lineplot(data = df, x= 'week', y= 'interest',hue='region')
    return df

def read_geo(filepath, multi=False):
    """
    Read a CSV file from a given filepath, convert it into a pandas DataFrame,
    and return a processed DataFrame with two columns: 'country' and 'interest'.
    Generate a bar plot using Seaborn to visualize the data. This corresponds to the
    second graphic returned by trends.google.com. Use multi=False if only one keyword
    is being analyzed, and multi=True if more than one keyword is being analyzed.
    """
    file = pd.read_csv(filepath, header=1)

    if not multi:
        file.columns = ['country', 'interest']
        plt.figure(figsize=(8,4))
        sns.barplot(data = file.dropna().iloc[:25,:], y = 'country', x='interest')

    if multi:
        plt.figure(figsize=(3,8))
        file = file.set_index('Country').stack().reset_index()
        file.columns = ['country','category','interest']
        file['interest'] = pd.to_numeric(file['interest'].apply(lambda x: x[:-1]))
        sns.barplot(data=file.dropna(), y = 'country', x='interest', hue='category')

    file = file.sort_values(ascending=False,by='interest')
    return file
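To make the reshaping inside read_file easier to follow, here is a minimal sketch of the same wide-to-long steps without the plot. The inline CSV text, the keyword names, and the variable names (raw, wide, tidy) are invented for illustration and only approximate the layout of a Google Trends export.

```python
import io
import pandas as pd

# Hypothetical export: one title line, a header row, then one row per week.
raw = io.StringIO(
    "Category: All categories\n"
    "Week,home workout,gym workout\n"
    "2020-03-29,80,<1\n"
    "2020-04-05,100,12\n"
)

wide = pd.read_csv(raw, header=1)  # header=1 uses the second line as the header

# stack() turns each keyword column into rows: one row per (week, keyword) pair
tidy = wide.set_index("Week").stack().reset_index()
tidy.columns = ["week", "region", "interest"]
tidy["week"] = pd.to_datetime(tidy["week"])

# "<1" marks interest below 1; those rows are dropped before casting to float
tidy = tidy[tidy["interest"] != "<1"]
tidy["interest"] = tidy["interest"].astype(float)
print(tidy)
```

Note that the 'region' column ends up holding whatever the CSV's column headers were, so for a multi-keyword file it contains keyword names rather than geographic regions.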
workout = read_file('data/workout.csv')
Use the workout data to assess the month in which demand for fitness is highest across the world, on average, by resampling it by month into workout_by_month. Then create a string variable month_str containing the month with the highest interest from workout_by_month, in the format "yyyy-mm-dd".
workout.head()
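# Average only the numeric 'interest' column per calendar month; averaging the
# string 'region' column raises a TypeError in pandas 2.x
workout_by_month = workout.set_index('week')[['interest']].resample('MS').mean()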
month_high = workout_by_month[workout_by_month['interest']==workout_by_month['interest'].max()]
month_str = str(month_high.index[0].date())
highest_interest_month = workout_by_month[workout_by_month['interest'] == workout_by_month['interest'].max()]
highest_interest_month
month_str = highest_interest_month.index.strftime('%Y-%m-%d')[0]
month_str
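The monthly aggregation can be checked on a small example. This is a sketch on invented weekly values, not the project data; the names weekly, monthly, and peak_str are hypothetical. It uses idxmax as a shorter alternative to filtering on the maximum.

```python
import pandas as pd

# Synthetic weekly interest values, invented purely for illustration
weekly = pd.DataFrame({
    "week": pd.to_datetime(
        ["2020-03-22", "2020-03-29", "2020-04-05", "2020-04-12", "2020-05-03"]
    ),
    "interest": [60.0, 80.0, 100.0, 90.0, 70.0],
})

# 'MS' labels each bin with the first day of its month
monthly = weekly.set_index("week")["interest"].resample("MS").mean()

# idxmax returns the index label (a Timestamp) of the largest monthly mean
peak_str = monthly.idxmax().strftime("%Y-%m-%d")
print(peak_str)  # 2020-04-01
```

If several months tie for the maximum, idxmax returns only the first, whereas the boolean filter keeps every tied row.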
keyword = read_file('data/three_keywords.csv')
workout_global = read_geo('data/workout_global.csv')
top_25_countries = workout_global.head(25)
top_country = top_25_countries['country'].iloc[0]
keyword.head()
current = keyword[(keyword['week'] >= '2022-01-01')]
current
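# Keyword with the highest average interest during 2020; counting rows with
# value_counts() would only reflect how many weeks each keyword appears
peak_covid = keyword[keyword['week'].dt.year == 2020].groupby('region')['interest'].mean().idxmax()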
peak_covid
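One way to name the leading keyword in a given period is to average interest per keyword and take the largest. The sketch below applies that to invented numbers; the frame trend and the names top_2020 and top_recent are hypothetical, and the mean is an assumed choice of summary (a maximum or a sum would also be defensible).

```python
import pandas as pd

# Invented long-format data in the same shape read_file returns
trend = pd.DataFrame({
    "week": pd.to_datetime(
        ["2020-04-05", "2020-04-05", "2020-04-12", "2020-04-12",
         "2022-01-02", "2022-01-02"]
    ),
    "region": ["home workout", "gym workout"] * 3,
    "interest": [100.0, 15.0, 90.0, 20.0, 25.0, 60.0],
})

# Keyword with the highest mean interest during 2020
in_2020 = trend[trend["week"].dt.year == 2020]
top_2020 = in_2020.groupby("region")["interest"].mean().idxmax()

# Keyword with the highest mean interest from 2022 onward
since_2022 = trend[trend["week"] >= "2022-01-01"]
top_recent = since_2022.groupby("region")["interest"].mean().idxmax()

print(top_2020, "|", top_recent)  # home workout | gym workout
```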
geo_categories = read_geo('data/geo_three_keywords.csv', multi=True)
geo_categories
countries = ["Philippines", "Singapore", "United Arab Emirates", "Qatar", "Kuwait", "Lebanon", "Malaysia", "Sri Lanka", "India", "Pakistan"]
# Keep only the listed countries and renumber the rows in one step
MESA = geo_categories[geo_categories['country'].isin(countries)].reset_index(drop=True)
MESA
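A long table like this is easier to compare across countries once it is pivoted to one row per country and one column per keyword. The following is a sketch on invented values; geo, subset, by_country, and the keyword names are hypothetical stand-ins for the output of read_geo with multi=True.

```python
import pandas as pd

# Invented long-format data: one row per (country, keyword category)
geo = pd.DataFrame({
    "country": ["Singapore", "Singapore", "Malaysia", "Malaysia",
                "Canada", "Canada"],
    "category": ["home workout", "gym workout"] * 3,
    "interest": [30, 70, 45, 55, 50, 50],
})

# Keep only the countries of interest, as the isin filter does
subset = geo[geo["country"].isin(["Singapore", "Malaysia"])].reset_index(drop=True)

# One row per country, one column per keyword category
by_country = subset.pivot(index="country", columns="category", values="interest")

# Country in the subset with the highest share of 'home workout' interest
top_home = by_country["home workout"].idxmax()
print(by_country)
print(top_home)  # Malaysia
```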