Skip to content
From Views to Value: Leveraging YouTube Data for E-Learning Success
  • AI Chat
  • Code
  • Report
  • Enhance your brand using YouTube

    You're a data scientist at a global marketing agency that helps some of the world's largest companies enhance their online presence.


    Your project is to identify the most effective YouTube videos to promote your clients’ brands; dive deep and discover who really connects with audiences through innovative content analysis.

    Outline of steps taken

    1. Import libraries, data, and helper functions
    2. Data Preprocessing and Cleaning: fix outliers, feature engineering, remap categories, and throw out information where necessary
    3. Exploratory Data Analysis: understand the underlying relationships of both Video Stats and Comments
    4. Model Building
    5. Strategic recommendations for E-Learning Collaboration

    Let me briefly explain the purpose of each step

    Import Libaries, Data, and Helper Functions

    • We need to perform a significant number of steps and these are the instruments of our analysis

    Data Processing and Cleaning

    • "Video Stats" dataset is riddled with errors and null values; it also contains a large segment of non-English titles only add noise to our analysis and model building. I throw out all entries that are not English language
    • Secondly, I incorporate a new feature, "engagement rate," to measure the video's popularity. I settle on a definition of engagement rate as equal to the ratio of the sum of likes and 1.5 times the comments over the total number of views.

    Exploratory Data Analysis

    • For Video Stats, we must answer the following questions at minimum:
      • How does a video's engagement rate vary across industry?
      • How are videos distributed across time?
      • Which industries have the highest view count, likes, and comments?
      • Are there specific keywords that generate higher engagement?
      • How do engagement metrics vary over time? Are there noticeable trends or patterns?
      • What are the characteristics of the top-performing videos in each industry?
    • For Comments, we must answer the following questions at minimum:
      • What is the overall comment distribution?
      • How does comment sentiment vary by Industry?
      • Is there a relationship between comment sentiment and engagement rate?

    For both datasets, it was crucial to keep in mind that the EDA was in service of the model building. I did not assume that the datasets were representative of YouTube video population. If a video observation failed to meet the model building requirements, it was filtered out prior to visualization.

    Model Building

    • We will pare down our datasets to their core features
      • Using the following features: 'Industry', 'Title_Clean', 'Avg Views', 'Avg Comments', and 'Avg Likes'
      • Our target variable is 'Engagement Rate'
    • So, our model takes the above mentioned features and predicts what the video's average engagement rate should be.
    • The data will be churned into a pipeline that is fed to an XGBoost regression estimator

    Strategic recommendations for E-Learning Collaboration

    • Using our newly built XGBoost model, we will feed the dataset back into the model and predict the engagement rate for each video
    • We will filter the existing dataset based on temporal and keywords, then return the videos with the highest model-predicted engagement rate

    Let me define some important concepts and assumptions

    How should we measure audience connection?

    The dataset supplies us with three qualitative features: views, likes, and comments. A view a precondition for a like or a comment: someone who has viewed the video may not necessarily like or comment on it. A video that deeply connects with a user is more likely to have comments and likes. Because of the user effort it takes to write a comment is greater than the effort to leave a comment, I have increased the weight of a comment as a means of signaling higher engagement. I therefore define a video's engagement rate using the following formula:

    I will use this metric later to build my model. A high engagement rate indicates strong viewer interaction and interest.

    How valid is our dataset for the purposes of model building?

    • I am assuming that the target of our model building exercise is an English speaking audience and therefore have excluded all non-English entries from consideration during exploratory data analysis and model building. In fact there is a signficant number of observations in both the Video Stats and Comments datasets that are non-English.
    • In addition, we will discover that the column feature 'keyword' is poorly encoded. There are especially egregious instances were a keyword does not reflect its true category (particularly for music videos). I manually re-encoded this data.

    How did I remap categories?

    • To improve our understanding of video brands and categories, as well as their temporal trends and engagement metrics, I re-encoded the original keyword column using my best judgment. Where I thought there was topic overlap, I consolidated keywords by topic, e.g. Nintendo, X-Box, Minecraft are all considered Video Games.
    Original KeywordRemapped Keyword
    computer scienceCoding
    cubesRubix Cube
    data scienceData Science
    game developmentCoding
    gamingVideo Games
    machine learningData Science
    minecraftVideo Games
    mrbeastMr Beast
    nintendoVideo Games
    xboxVideo Games

    In addition, I found multiple instances were music artists were present in keyword categories unrelated to music. These instances were manually remapped.

    How did I normalize view, comment, and like counts across time?

    By definition, a video's view count starts at zero and only increments positively. A video's view count today can only be greater than or equal to its view count yesterday. A video that has been published for several years has the benefit of many more days' public exposure compared to a video that has only been published for a few days. To normalize this, I calculated the number of days since published, which is defined as the difference between the max published date in the dataset and the respective video's dataset. This shows the number of days each video has been 'live' on site. Then I took the respective metric - comment, view, or like - divided it by the number of days published.

    The dataset did not provide better alternatives to this logic: for instance, I could not apply a weighted average metric that put more weigh on recent views.

    How did I deal with the datasets' text features?

    Every video observation had a title (otherwise I would have thrown it out). I performed standard text cleaning of the title, including lowercase conversion, punctuation removal, stopword removal, and lemmatization. This was applied to both the title and the comment features. During the model building stage, I employed TF-IDF analysis to the text features, using a maximum of the 50 most common words as features for my model building. This was a subjective decision.

    What about comment sentiment?

    For the comments dataset, I also calculated the comment length and the comment sentiment of each observation separately. The original dataset only categorized comment sentiment into three categories: zero, one, and two. I was not satisfied with this granularity. Rather than rely on the dataset's provided sentiment score, I used the TextBlob library instead because it yielded higher sensitivity to comment sentiment. Why TextBlob? Because it was good enough for the task of delivering an estimated average comment sentiment for each video.

    For reference the TextBlob library normalizes comment sentiment on a -1 to 1 scale, the former signifying a 'negative' comment and the latter signifying a 'positive' comment. I considered any comment with a TextBlob sentiment value less than -0.2 to be 'negative'; any comment with a sentiment value greater than 0.2 to be 'positive'; and everything in between to be neural.

    How did you build the model?

    I isolated the key features from the dataset, which as mentioned were 'Industry', 'Title_Clean', 'Avg Views', 'Avg Comments', and 'Avg Likes' for each video. These features were used to predict the engagement rate of a video. As this is a regression problem, I evaluated multiple machine learning algorithms and ulimately settled on the XGBoost model because it returned the lowest mean squared error value. I did not include the model evaluation testing part of this analysis in the workbook. In addition, I did not perform any hyperparameter tuning as I consider then model good enough to meet the requirements of the business objective as is.

    What was your final result?

    Ultimately, the fitted XGBoost estimator predicted the engagement rate of each of the video observations categorized as Data Science under my revised keyword mapping. I was selective on what qualified as a strategic partnership candidate video.

    Specifically, I selected the top-3 videos with the highest predicted engagement that were:

    • Categorized as 'Data Science'
    • Had an Engagement Rate greater than 4.00 (for reference, the median engagement rate was 3.26)
    • Had an average comment sentiment greater than 0.2, signifying positive comments
    • Was published on or after January 1, 2021 (I am biased against older videos in the dataset)

    My Recommendations

    Choice 1:

    • Video: DATA SCIENTIST CAREER STORY (how I became a data scientist & revealing my salaries)
    • Video ID: A2v4_SFd9Cc
    • Published: 2022-02-20 00:00:00
    • Predicted Engagement: 5.98
    • Why: This video has a high predicted engagement above the median, positive average comment sentiment, is relatively recent (Feb. 2022), covers a theme (career building) that overlaps with our goal of an e-learning platform. The model suggests that this video has potential to attract more viewership.

    Choice 2:

    • Video: Live Day 1- Introduction To Machine Learning Algorithms For Data Science
    • Video ID: 4UJelID_ICw
    • Published: 2022-02-01 00:00:00
    • Predicted Engagement: 5.86
    • Why: This video has a high predicted engagement above the median, a very positive average comment sentiment (far above the average of other data science videos), is relatively recent (Feb. 2022), covers a theme (introduction to the machine learning algorithms) that overlaps with our goal of an e-learning platform. The model suggests that this video has potential to attract more viewership and would be a positive addition.

    Choice 3:

    • Video: Data science roadmap: What skills you should learn first?
    • Video ID: ylOILe-Sc-w
    • Published: 2021-08-06 00:00:00
    • Predicted Engagement: 4.99
    • Why: This video has a high predicted engagement above the median, a positive average comment sentiment, is less recent (August 2021), but covers the objective theme (giving an overview of machine learning skillsets) that overlaps with our goal of an e-learning platform. This video has a large number of views and the model suggests that this video has potential to attract more viewership in partnership.

    Let's now take a look at the analysis below...

    1) Import Libraries, Data, and Helper Functions


    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt
    import seaborn as sns
    import as px
    from scipy.stats import ttest_ind
    import re
    import string
    from datetime import datetime
    #import spacy
    #nlp = spacy.load("en_core_web_sm")
    import nltk
    from nltk.corpus import stopwords
    from nltk.tokenize import word_tokenize
    from nltk.stem import WordNetLemmatizer
    from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
    from sklearn.preprocessing import StandardScaler, OneHotEncoder
    from sklearn.decomposition import LatentDirichletAllocation, PCA
    from sklearn.model_selection import train_test_split, cross_validate
    from sklearn.pipeline import make_pipeline
    from sklearn.cluster import KMeans
    from sklearn.linear_model import LinearRegression, Lasso, Ridge
    from sklearn.compose import make_column_transformer, make_column_selector
    from sklearn.metrics import mean_squared_error
    from xgboost import XGBRegressor
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.feature_selection import VarianceThreshold
    from wordcloud import WordCloud
    from collections import Counter
    from textblob import TextBlob
    #import langid
    #from langdetect import detect, DetectorFactory
    #from langdetect.lang_detect_exception import LangDetectException
    import statsmodels.api as sm
    import gensim
    from gensim import corpora
    from gensim.models.coherencemodel import CoherenceModel
    # Download necessary NLTK data'omw-1.4')'punkt')'stopwords')'wordnet')
    pd.set_option('display.float_format', '{:,.2f}'.format)
    # Setting up visualization aesthetics

    Helper functions

    def clean_text(text):
        if type(text) != str:
            text = str(text)
        # Convert to lowercase
        text = text.lower()
        # Remove punctuation and special characters
        text = text.translate(str.maketrans('', '', string.punctuation))
        # Tokenize text
        tokens = word_tokenize(text)
        # Remove stopwords
        stop_words = set(stopwords.words('english'))
        tokens = [word for word in tokens if word not in stop_words]
        # Lemmatize tokens
        lemmatizer = WordNetLemmatizer()
        tokens = [lemmatizer.lemmatize(word) for word in tokens]
        tokens = [t for t in tokens if t not in string.punctuation]
        # Join tokens back into a single string for easier analysis later
        cleaned_text = ' '.join(tokens)
        return cleaned_text
    # Function to detect language using langdetect
    def detect_language_langdetect(text):
            return detect(text)
        except LangDetectException:
            return "unknown"
    # Function to detect language using langid
    def detect_language_langid(text):
        lang, _ = langid.classify(text)
        return lang
    # Combined function to detect language using both methods
    def detect_language(text):
        lang1 = detect_language_langdetect(text)
        lang2 = detect_language_langid(text)
        if lang1 == 'en' or lang2 == 'en':
            return 'en'
            return 'unknown'
    def get_sentiment_category(polarity):
        if polarity < -0.2:
            return 'Negative'
        elif polarity < 0.2:
            return 'Neutral'
            return 'Positive'
    def get_comment_length(text):
        if type(text) != str:
            text = str(text)
        return len(text)
    def get_comment_type(text):
        return type(text) == str
    # Define the function to calculate custom engagement rate
    def calculate_engagement_rate(likes, comments, views, like_weight=1, comment_weight=1.5):
        if views == 0:  # Avoid division by zero
            return 0
        return ((like_weight * likes + comment_weight * comments) / views) * 100
    def get_sort_index(df, column):
        industry_medians = df.groupby(column)['Engagement Rate'].median().sort_values(ascending = False)
        sorted_order = industry_medians.index.tolist()
        return sorted_order
    def plot_median_engagement_rate_by(df, column):
        tmp = df.groupby(column)['Engagement Rate'].median().sort_values(ascending = False).reset_index()
        tmp = tmp.melt(id_vars = [column])
        tmp.columns = [column, 'Metric', 'Value']
        tmp2 = tmp[tmp['Metric'] == 'Engagement Rate'].sort_values(by = 'Value', ascending = False)
        sns.barplot(data = tmp2, y = column, x = 'Value')
    def filter_outliers(df, category_col, value_col):
        Filters outliers from the DataFrame based on the category of the item.
            df (pd.DataFrame): The input DataFrame.
            category_col (str): The name of the column containing category information.
            value_col (str): The name of the column containing the values to be filtered.
            pd.DataFrame: The DataFrame with outliers removed.
        def remove_outliers(group):
            q1 = group[value_col].quantile(0.25)
            q3 = group[value_col].quantile(0.75)
            iqr = q3 - q1
            lower_bound = q1 - 1.5 * iqr
            upper_bound = q3 + 1.5 * iqr
            return group[(group[value_col] >= lower_bound) & (group[value_col] <= upper_bound)]
        return df.groupby(category_col).apply(remove_outliers).reset_index(drop=True)
    # Function to filter out tokens that are purely numeric
    def filter_numeric_tokens(tokens):
        return [str(token) for token in tokens if not re.match(r'^\d+$|^.{1}$', str(token))]
    # Function to get the most common tokens for each category
    def most_common_tokens_by_category(df, category_col, tokens_col, top_n=3):
        # Group by the category column
        grouped = df.groupby(category_col)
        # Dictionary to hold results
        common_tokens = {}
        # Iterate over each group
        for category, group in grouped:
            all_tokens = []
            # Collect all tokens from the tokens column
            for tokens in group[tokens_col]:
                filtered_tokens = filter_numeric_tokens(tokens)
            # Count the tokens
            token_counts = Counter(all_tokens)
            # Get the most common tokens
            common_tokens[category] = token_counts.most_common(top_n)
        return common_tokens
    def extract_keywords(title, keywords):
        found_keywords = [keyword for keyword in keywords if keyword in title]
        return found_keywords if found_keywords else ['other']
    def get_sentiment(text):
        if type(text) != str:
        textblob = TextBlob(text)
        return textblob.sentiment[0]


    source = 'efficient'
    if source == 'original':
        # Video stats for final version
        print('original source')
        video_stats = pd.read_csv('data/videos_stats.csv', parse_dates=['Published At'], dayfirst=True)
        comments = pd.read_csv("data/comments.csv")
        # Video stats for efficiency
        print('efficient source')
        video_stats = pd.read_excel('data/video_stats_ENGLISH.xlsx')
        comments = pd.read_excel("data/comments_ENGLISH.xlsx")
        comments.columns = ['Video ID','Comment','Comment_Likes','Comment_Sentiment']
    # Definitions
    engagement_metrics = ['Views', 'Likes', 'Comments']
    original_shape = video_stats.shape
    print(f"{original_shape[0] - video_stats[['Title', 'Published At', 'Keyword', 'Likes', 'Comments','Views']].drop_duplicates().shape[0]} duplicate rows")

    2) Data Preprocessing & Cleaning

    Filtering Poorly Encoded Data

    Video Stats

    # Replace negative values with NaN and then handle missing values by filling with the median (robust to outliers)
    video_stats.loc[video_stats['Likes'] < 0, 'Likes'] = None
    video_stats.loc[video_stats['Comments'] < 0, 'Comments'] = None
    video_stats.loc[video_stats['Views'] < 0, 'Views'] = None
    # Fill missing/negative values with zero
    video_stats['Likes'].fillna(0, inplace=True)
    video_stats['Comments'].fillna(0, inplace=True)
    video_stats['Views'].fillna(0, inplace=True)
    # Recheck the dataframe to ensure cleaning is complete, negative_values



    Filtering Values