Market Analysis (Python)

Conducting a Market Analysis

0. The Goal

A fitness studio based in Singapore is interested in understanding the types of digital products they should offer. They have asked me to perform an analysis that will help them

  • understand how to place their digital fitness products in the regional market,
  • identify strengths of their competitors,
  • gauge demand, and
  • create unique new digital products and services for potential users.

To accomplish this, I will conduct a market analysis in Python using international data, provided by the company, on Google Trends and YouTube keyword searches related to fitness and adjacent products.

1. Load and validate data on global interest in workouts

I start by assessing global demand for working out. To do this, I will create helper functions "read_file" and "read_geo", then call read_file to load the workout.csv file into a DataFrame.

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
sns.set(style='white', palette='Pastel2')

def read_file(filepath, plot=True):
    """
    Read a CSV file from a given filepath, convert it into a DataFrame, and return a
    processed DataFrame with three columns: 'week', 'region', and 'interest'. 
    Generate a line plot using Seaborn to visualize the data. 
    This corresponds to the first graphic (time series) returned by trends.google.com. 
    """
    # Track how many times the function has been called so each plot gets a figure number
    if not hasattr(read_file, "count"):
        read_file.count = 1
    else:
        read_file.count += 1

    file = pd.read_csv(filepath, header=1)
    df = file.set_index('Week').stack().reset_index()
    df.columns = ['week', 'region', 'interest']
    df['week'] = pd.to_datetime(df['week'])
    # Google Trends reports very low interest as "<1"; drop those rows before casting to float
    df = df[df['interest'] != "<1"]
    df['interest'] = df['interest'].astype(float)

    if plot:
        plt.figure(figsize=(8, 3))
        plt.title(f"Figure {read_file.count}: Google Trends search ratio over time")
        sns.lineplot(data=df, x='week', y='interest', hue='region')
    return df

def read_geo(filepath, multi=False):
    """
    Read a CSV file from a given filepath, convert it into a pandas DataFrame,
    and return a processed DataFrame with two columns: 'country' and 'interest'. 
    Generate a bar plot using Seaborn to visualize the data. 
    This corresponds to the second graphic returned by trends.google.com. 
    Use multi=False if only one keyword is being analyzed, and 
    multi=True if more than one keyword is being analyzed.
    """
    file = pd.read_csv(filepath, header=1)

    if not multi:
        file.columns = ['country', 'interest']
        plt.figure(figsize=(8, 4))
        # Plot the top 25 countries by interest
        sns.barplot(data=file.dropna().iloc[:25, :], y='country', x='interest')
    else:
        plt.figure(figsize=(3, 8))
        # Reshape to long format: one row per (country, keyword) pair
        file = file.set_index('Country').stack().reset_index()
        file.columns = ['country', 'category', 'interest']
        # Values arrive as percentage strings (e.g. "45%"); strip the sign and convert
        file = file.dropna()
        file['interest'] = pd.to_numeric(file['interest'].str[:-1])
        sns.barplot(data=file, y='country', x='interest', hue='category')

    file = file.sort_values(by='interest', ascending=False)
    return file
df = read_file("data/workout.csv")
dataset_name = "workout.csv"
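
Before interpreting the plot, a quick inspection confirms the processed layout; a minimal sketch whose exact output depends on the file contents:

print(df.head())                           # expected columns: week, region, interest
print(df['week'].min(), df['week'].max())  # the period covered by the data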

From Figure 1, we can see that demand is cyclical, peaking each January, likely driven by New Year's resolutions. Demand is lowest in December, likely due to the holiday season. There is also an outlier peak in March 2020, likely caused by the COVID-19 pandemic, when people forced to stay home turned to Google Search for ways to exercise.
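
The seasonality claim is easy to check numerically; a minimal sketch, assuming df is the DataFrame returned above:

# Average interest by calendar month; January should rank highest, December lowest
monthly = df.groupby(df['week'].dt.month)['interest'].mean()
print(monthly.round(1).sort_values(ascending=False))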

Let's check whether the dataset has any duplicate, outlier, or missing values.

import pandas as pd
import scipy.stats as stats

def generate_report(df, dataset_name):
    num_duplicate_rows = df.duplicated().sum()
    report = f"The dataset in {dataset_name} has {df.shape[0]} rows, "
    report += f"of which {num_duplicate_rows} are duplicates.\n"

    report += "\n\nThe data is observed via {} columns:\n".format(df.shape[1])

    for column in df.columns:
        report += get_column_info(df, column)

    report += additional_data_validation_tasks(df)

    return report

def get_column_info(df, column_name):
    num_unique_values = df[column_name].nunique()
    num_missing_values = df[column_name].isnull().sum()

    info = f"- {column_name}: contains {num_unique_values} unique values and {num_missing_values} missing values.\n"

    return info

def additional_data_validation_tasks(df):
    # Outlier Detection Task
    outlier_status = detect_outliers(df)

    # Skewness and Kurtosis Task
    skewness_kurtosis_status = calculate_skewness_kurtosis(df)
    
    return outlier_status + skewness_kurtosis_status

def detect_outliers(df):
    # Flag numeric columns containing values outside 1.5 * IQR of the quartiles
    outlier_columns = []
    for col in df.select_dtypes(include=['float64']).columns:
        q1, q3 = df[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        is_outlier = (df[col] < q1 - 1.5 * iqr) | (df[col] > q3 + 1.5 * iqr)
        if is_outlier.any():
            outlier_columns.append(col)

    outlier_status = f"\nThe following columns contain outliers: {', '.join(outlier_columns)}\n"

    return outlier_status

def calculate_skewness_kurtosis(df):
    skewness_kurtosis_status = ""
    
    for col in df.select_dtypes(include=['float64']).columns:
        skewness = stats.skew(df[col].dropna())
        kurtosis = stats.kurtosis(df[col].dropna())
        skewness_kurtosis_status += f"{col}: Skewness = {skewness:.2f}, Kurtosis = {kurtosis:.2f}\n"
    
    return skewness_kurtosis_status

# Call the generate_report function to generate the report
report_text = generate_report(df, dataset_name)

# Print the report
print(report_text)

The outlier interest values seen in Figure 1 raise its kurtosis to 6.49. Since scipy.stats.kurtosis reports excess kurtosis, a normal distribution would score 0, so this is a substantial departure.

Such excess kurtosis indicates that the data does not follow a normal distribution, an assumption behind many parametric statistical tests, so it is worth keeping in mind for further analysis.
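
To quantify the departure from normality, SciPy's D'Agostino K² test, which combines skewness and kurtosis into one statistic, can be applied to the interest values; a minimal sketch, assuming df from above:

from scipy import stats

# Null hypothesis: the sample comes from a normal distribution
stat, p_value = stats.normaltest(df['interest'].dropna())
print(f"K^2 statistic = {stat:.2f}, p-value = {p_value:.4f}")
# A p-value below 0.05 rejects normality for the interest values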

2. Assess global interest in fitness

Next, I'll load the file three_keywords.csv into a DataFrame using the same function as before. This file tracks global interest in three keywords.

keywords = read_file('data/three_keywords.csv')

After visually assessing the plot, I'll create two variables (and cross-check them programmatically below):

  1. current, the keyword that generated the most interest from 2022 to 2023.
  2. peak_covid, the keyword that generated the most interest during 2020.
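
A minimal sketch of that cross-check, assuming keywords is the DataFrame returned above (read_file stores the keyword series names in its 'region' column, so the printed names may carry a suffix such as "(Worldwide)"):

recent = keywords[keywords['week'] >= pd.Timestamp('2022-01-01')]
covid = keywords[keywords['week'].dt.year == 2020]

# Mean interest per keyword over each window; idxmax returns the top keyword
print(recent.groupby('region')['interest'].mean().idxmax())
print(covid.groupby('region')['interest'].mean().idxmax())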
current = 'workout'
peak_covid = 'home workout'