Project: Data Driven Product Management: Conducting a Market Analysis

You are a product manager for a fitness studio based in Singapore and are interested in understanding the types of digital products you should offer. You plan to conduct a market analysis in Python to understand how to place your digital fitness products in the regional market. A market analysis will allow you to identify strengths of your competitors, gauge demand, and create unique new digital products and services for potential users.

You are provided with a number of CSV files in the Files-"data" folder, which offer international data on Google Trends and YouTube keyword searches related to fitness and related products. Two helper functions have also been provided, read_file and read_geo, to help you process and visualize these CSV files for further analysis.

You'll use pandas methods to explore this data and drive your product management insights.

You can continue beyond the bounds of this project and also investigate in-person classes, local gyms, and online personal instructors!

# STARTER CODE - PLEASE DO NOT EDIT ANY CODE IN THIS CELL

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
sns.set(style='white', palette='Pastel2')
import os

def read_file(filepath, plot = True):
    """
    Read a CSV file from a given filepath, convert it into a pandas DataFrame,
    and return a processed DataFrame with three columns: 'week', 'region', and 'interest'. Generate a line plot using Seaborn to visualize the data. This corresponds to the first graphic (time series) returned by trends.google.com. 
    """
    file = pd.read_csv(filepath, header=1)
    df = file.set_index('Week').stack().reset_index()
    df.columns = ['week','region','interest']
    df['week'] = pd.to_datetime(df['week'])
    plt.figure(figsize=(8,3))
    df = df[df['interest']!="<1"]
    df['interest'] = df['interest'].astype(float)

    if plot:
        sns.lineplot(data = df, x= 'week', y= 'interest',hue='region')
    return df

def read_geo(filepath, multi=False):
    """
    Read a CSV file from a given filepath, convert it into a pandas DataFrame,
    and return a processed DataFrame with two columns: 'country' and 'interest'. Generate a bar plot using Seaborn to visualize the data. This corresponds to the second graphic returned by trends.google.com. Use multi=False if only one keyword is being analyzed, and multi=True if more than one keyword is being analyzed.
    """
    file = pd.read_csv(filepath, header=1)

    if not multi:
        file.columns = ['country', 'interest']
        plt.figure(figsize=(8,4))
        sns.barplot(data = file.dropna().iloc[:25,:], y = 'country', x='interest')

    if multi:
        plt.figure(figsize=(3,8))
        file = file.set_index('Country').stack().reset_index()
        file.columns = ['country','category','interest']
        file['interest'] = pd.to_numeric(file['interest'].apply(lambda x: x[:-1]))
        sns.barplot(data=file.dropna(), y = 'country', x='interest', hue='category')

    file = file.sort_values(ascending=False,by='interest')
    return file

# 1. Read workout data from a CSV file
workout = read_file('data/workout.csv')  

# 2. Assess global interest in fitness by resampling the data to a monthly frequency and calculating the mean
workout_by_month = workout.set_index('week').resample('MS').mean()  

# Find the month(s) with the highest average interest in fitness
month_high = workout_by_month[workout_by_month['interest']==workout_by_month['interest'].max()]

# Convert the index (month) of the highest interest to a string in 'YYYY-MM-DD' format
month_str = str(month_high.index[0].date())

# 3. Compare interest in home workouts, gym workouts, and home gyms by reading the relevant data
workout = read_file('data/three_keywords.csv') 
current = 'gym workout'  # Current interest keyword
peak_covid = 'home workout'  # Interest keyword during the peak of COVID-19

# 4. Segment global interest by region by reading geographical data
workout_global = read_geo('data/workout_global.csv')

# Extract the top 25 rows from the DataFrame to identify top countries interested in workouts
top_25_countries = workout_global.head(25)

# Retrieve the country name from the first row of the top 25 countries DataFrame
top_country = top_25_countries['country'].iloc[0]

# 5. Assess regional demand for home workouts, gym workouts, and home gyms by reading geographical data with multiple categories
geo_categories = read_geo('data/geo_three_keywords.csv', multi=True)

# Define a list of countries in the Middle East and South Asia (MESA) with high interest in workout-related activities
MESA_countries = ["Philippines", "Singapore", "United Arab Emirates", "Qatar", "Kuwait", "Malaysia", "Sri Lanka", "India", "Pakistan", "Lebanon"]

# Filter the DataFrame to include only the rows corresponding to the MESA countries
MESA = geo_categories.loc[geo_categories.country.isin(MESA_countries), :]

# 6. Assess the split of interest by country and category in the MESA region
MESA.set_index(['country','category']).unstack()  
top_home_workout_country = 'Philippines'  # Country with the highest interest in home workouts

# 7. A deeper dive into two countries' interests in yoga and zumba by reading their respective data
read_file('data/yoga_zumba_sng.csv')  # Data for Singapore
read_file('data/yoga_zumba_phl.csv')  # Data for the Philippines
pilot_content = ['yoga', 'zumba']  # Keywords for pilot content