Project: Data-Driven Product Management: Conducting a Market Analysis

You are a product manager for a fitness studio based in Singapore, and you want to understand which types of digital products you should offer. You already run successful local studios and have an established practice in Singapore. Now you want to understand where digital fitness products fit in your local market.

You would like to conduct a market analysis in Python to understand how to position your digital product in the regional market and which competing offerings already exist.

A market analysis will allow you to achieve several things. By identifying strengths of your competitors, you can gauge demand and create unique digital products and services. By identifying gaps in the market, you can find areas to offer a unique value proposition to potential users.

The sky is the limit for how you build on this beyond the project! Areas to investigate next include in-person classes, local gyms, local fitness classes, personal instructors, and even online personal instructors.

import os

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

sns.set(style='white', palette='Pastel2')

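def read_file(filepath, plot=True):
    """
    Read a CSV file from a given filepath, convert it into a pandas DataFrame,
    and return a processed long-format DataFrame with three columns: 'week',
    'region', and 'interest'. If plot is True, also generate a line plot using
    Seaborn to visualize the data. This corresponds to the first graphic
    (time series) returned by trends.google.com.
    """
    # header=1: the column names sit on the second non-blank line of the export
    file = pd.read_csv(filepath, header=1)
    # Reshape from one column per search term to one row per (week, term) pair
    df = file.set_index('Week').stack().reset_index()
    df.columns = ['week', 'region', 'interest']
    df['week'] = pd.to_datetime(df['week'])
    # Drop the "<1" placeholder values so the column can be cast to float;
    # .copy() avoids assigning into a filtered view of df
    df = df[df['interest'] != "<1"].copy()
    df['interest'] = df['interest'].astype(float)

    if plot:
        # Only create a figure when a plot is actually requested
        plt.figure(figsize=(8, 3))
        sns.lineplot(data=df, x='week', y='interest', hue='region')
    return df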
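
The reshaping inside read_file can be tried without a real export. The following is a minimal sketch, assuming the export has a category line, a blank line, and then one column per search term; the CSV text, its values, and the name `tidy` are invented for illustration and are not taken from the project data.

```python
import io

import pandas as pd

# Hypothetical miniature of a time-series export (all values invented)
raw = """Category: All categories

Week,home workout: (Worldwide),gym workout: (Worldwide)
2020-03-01,40,<1
2020-03-08,55,12
"""

# header=1 takes the column names from the second non-blank line
wide = pd.read_csv(io.StringIO(raw), header=1)

# Wide to long: one row per (week, search term) pair
tidy = wide.set_index('Week').stack().reset_index()
tidy.columns = ['week', 'region', 'interest']
tidy['week'] = pd.to_datetime(tidy['week'])

# Remove the "<1" placeholder, then cast the remaining values to float
tidy = tidy[tidy['interest'] != '<1'].copy()
tidy['interest'] = tidy['interest'].astype(float)
print(tidy)
```

Note that dropping "<1" rows removes those weeks for that term entirely; treating them as 0 or 0.5 would be an alternative choice with different averages.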

def read_geo(filepath, multi=False):
    """
    Read a CSV file from a given filepath, convert it into a pandas DataFrame,
    and generate a bar plot using Seaborn to visualize the data. This
    corresponds to the second graphic returned by trends.google.com. Use
    multi=False if only one keyword is being analyzed, and multi=True if more
    than one keyword is being analyzed. Returns a DataFrame sorted by
    descending interest with columns 'country' and 'interest' (plus
    'category' when multi=True).
    """
    file = pd.read_csv(filepath, header=1)

    if not multi:
        file.columns = ['country', 'interest']
        plt.figure(figsize=(8, 4))
        # Plot the first 25 rows that have an interest value
        sns.barplot(data=file.dropna().iloc[:25, :], y='country', x='interest')
    else:
        plt.figure(figsize=(3, 8))
        # Reshape to one row per (country, keyword) pair
        file = file.set_index('Country').stack().reset_index()
        file.columns = ['country', 'category', 'interest']
        # Drop the last character of each value (assumed to be a one-character
        # suffix such as '%') before converting to numbers
        file['interest'] = pd.to_numeric(file['interest'].apply(lambda x: x[:-1]))
        sns.barplot(data=file.dropna(), y='country', x='interest', hue='category')

    file = file.sort_values(ascending=False, by='interest')
    return file
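
The multi=True branch of read_geo can be illustrated on a small in-memory table. This is a sketch under assumptions: the CSV text, the percentage values, and the column labels are invented, and it uses `.str.rstrip('%')` in place of the slice `x[:-1]` because it leaves missing values untouched.

```python
import io

import pandas as pd

# Hypothetical miniature of a multi-keyword regional export (values invented)
raw = """Category: All categories

Country,home workout: (sample range),gym workout: (sample range)
Philippines,60%,40%
Singapore,35%,65%
"""

geo = pd.read_csv(io.StringIO(raw), header=1)

# One row per (country, keyword) pair
geo = geo.set_index('Country').stack().reset_index()
geo.columns = ['country', 'category', 'interest']

# Strip the trailing percent sign and convert to numbers
geo['interest'] = pd.to_numeric(geo['interest'].str.rstrip('%'))

# Highest interest first, as read_geo returns it
geo = geo.sort_values(by='interest', ascending=False)
print(geo)
```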

workout = read_file('data/workout.csv')
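# Average the weekly interest per calendar month ('M' is the month-end alias;
# newer pandas releases spell it 'ME' and warn about 'M')
workout_by_month = workout.set_index('week').resample('M').mean(numeric_only=True)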
month_high = workout_by_month.loc[workout_by_month['interest'].idxmax()]
print(month_high)
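
The monthly peak lookup above can be checked on toy data. The sketch below is one possible variant, not the notebook's exact method: it groups by calendar month with `dt.to_period('M')` instead of `resample`, which sidesteps the month-end alias difference between pandas releases. The weekly values are invented.

```python
import pandas as pd

# Invented weekly interest values spanning two months
weekly = pd.DataFrame({
    'week': pd.to_datetime(['2020-03-01', '2020-03-08',
                            '2020-04-05', '2020-04-12']),
    'interest': [50.0, 70.0, 90.0, 100.0],
})

# Mean interest per calendar month
monthly = weekly.groupby(weekly['week'].dt.to_period('M'))['interest'].mean()

# Month with the highest mean interest, and that mean
peak_month = monthly.idxmax()
peak_value = monthly.max()
print(peak_month, peak_value)
```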

hwo = read_file('data/home_workout_gym_workout_home_gym.csv')

# Find the label in the 'region' column with the highest mean interest between
# 2021 and 2023; split(':')[0] keeps only the text before the colon
current = hwo.query('week > "2021-01-01" & week < "2023-01-01"')\
    .groupby('region').mean(numeric_only=True)\
    .sort_values(ascending=False,by='interest').iloc[0].name.split(':')[0]

# Repeat for 2020 (peak COVID), again keeping only the text before the colon
peak_covid = hwo.query('week.dt.year == 2020')\
    .groupby('region').mean(numeric_only=True)\
    .sort_values(ascending=False,by='interest').iloc[0].name.split(':')[0]

print(current)
print(peak_covid)
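
The two lookups above follow the same pattern: filter by date, average per label, take the top label, and trim it at the colon. A minimal sketch with invented numbers follows; `hwo_demo` and the helper `top_term` are hypothetical names, and boolean masks stand in for `query`.

```python
import pandas as pd

# Invented long-format data in the shape read_file returns
hwo_demo = pd.DataFrame({
    'week': pd.to_datetime(['2020-04-05', '2020-04-05',
                            '2022-04-03', '2022-04-03']),
    'region': ['home workout: (Worldwide)', 'gym workout: (Worldwide)',
               'home workout: (Worldwide)', 'gym workout: (Worldwide)'],
    'interest': [90.0, 30.0, 25.0, 60.0],
})


def top_term(frame):
    """Return the label with the highest mean interest, cut at the colon."""
    return frame.groupby('region')['interest'].mean().idxmax().split(':')[0]


# Rows from 2020 only
in_2020 = hwo_demo[hwo_demo['week'].dt.year == 2020]

# Rows strictly between the two boundary dates
after = hwo_demo['week'] > pd.Timestamp('2021-01-01')
before = hwo_demo['week'] < pd.Timestamp('2023-01-01')
recent = hwo_demo[after & before]

peak_demo = top_term(in_2020)
current_demo = top_term(recent)
print(peak_demo, current_demo)
```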

workout_global = read_geo('data/workout_global.csv')

top_25_countries = workout_global.sort_values(ascending=False,by='interest').head(25).reset_index(drop=True)

some_country = ["Philippines", "Singapore", "United Arab Emirates", "Qatar", "Kuwait", "Malaysia", "Sri Lanka", "India", "Pakistan"]

# Print the countries in some_country that are not among the top 25 countries
print([country for country in some_country if country not in top_25_countries['country'].tolist()])

top_25_countries

geo_categories = read_geo('data/geo_home_workout_gym_workout_home_gym.csv', multi=True)

MESA = geo_categories[geo_categories['country'].isin(top_25_countries['country'])]
MESA.set_index(['country','category']).unstack()

# Get the country with the highest interest among the categories whose name includes "home workout"
top_home_country  = MESA.query('category.str.contains("home workout")')\
    .sort_values(ascending=False,by='interest').iloc[0].country

top_home_country

phl = read_file('data/yoga_workout_zumba_bodybuilding_weight_loss_phl.csv')
sng = read_file('data/yoga_workout_zumba_bodybuilding_weight_loss_sng.csv')

# Join phl and sng, drop all rows that contain the word "workout" in the region
# column, then keep the 2 regions with the highest mean interest
pilot_content = pd.concat([phl,sng])\
    .query('region.str.contains("workout") == False')\
    .groupby('region').mean(numeric_only=True)\
    .sort_values(ascending=False,by='interest').head(2).reset_index()

# Reduce the result to a plain list of the two region labels, using a regex to
# keep only the text before the colon
pilot_content = pilot_content['region'].str.extract(r'(.+):')[0].tolist()
pilot_content
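
The label clean-up in the last step can be seen in isolation. The sketch below uses invented labels in the assumed "keyword: (place)" form. Because each keyword appears once per country after the concat, the same keyword can show up twice, so the sketch also shows one optional way to de-duplicate while keeping order.

```python
import pandas as pd

# Invented labels in the assumed "keyword: (place)" form
labels = pd.Series(['yoga: (Philippines)',
                    'yoga: (Singapore)',
                    'zumba: (Philippines)'])

# Capture everything before the colon; column 0 holds the captured group
keywords = labels.str.extract(r'(.+):')[0].tolist()

# Optional: drop repeated keywords while preserving their order
unique_keywords = list(dict.fromkeys(keywords))
print(keywords, unique_keywords)
```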