From Views to Value: Leveraging YouTube Data for E-Learning Success
    Enhance your brand using YouTube

    You're a data scientist at a global marketing agency that helps some of the world's largest companies enhance their online presence.

    Objective

    Your project is to identify the most effective YouTube videos to promote your clients’ brands; dive deep and discover who really connects with audiences through innovative content analysis.

    Outline of steps taken

    1. Import libraries, data, and helper functions
    2. Data Preprocessing and Cleaning: fix outliers, engineer features, remap categories, and discard data where necessary
    3. Exploratory Data Analysis: understand the underlying relationships of both Video Stats and Comments
    4. Model Building
    5. Strategic recommendations for E-Learning Collaboration

    Let me briefly explain the purpose of each step

    Import Libraries, Data, and Helper Functions

    • We need to perform a significant number of steps, and these are the instruments of our analysis

    Data Preprocessing and Cleaning

    • The "Video Stats" dataset is riddled with errors and null values; it also contains a large segment of non-English titles that only add noise to our analysis and model building. I throw out all entries that are not in English.
    • Secondly, I incorporate a new feature, "engagement rate," to measure a video's popularity. I define engagement rate as the sum of likes and 1.5 times the comments, divided by the total number of views (sketched below).
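
    A minimal sketch of these two cleaning steps, using the detect_language and calculate_engagement_rate helpers defined later in the notebook (the 'Language' column name is my own placeholder):

    # Sketch: keep only English-language titles, then add the engagement rate feature.
    # Uses the detect_language and calculate_engagement_rate helpers defined below.
    video_stats['Language'] = video_stats['Title'].apply(detect_language)
    video_stats = video_stats[video_stats['Language'] == 'en'].copy()

    video_stats['Engagement Rate'] = video_stats.apply(
        lambda row: calculate_engagement_rate(row['Likes'], row['Comments'], row['Views']),
        axis=1,
    )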

    Exploratory Data Analysis

    • For Video Stats, we must answer the following questions at minimum:
      • How does a video's engagement rate vary across industry?
      • How are videos distributed across time?
      • Which industries have the highest view count, likes, and comments?
      • Are there specific keywords that generate higher engagement?
      • How do engagement metrics vary over time? Are there noticeable trends or patterns?
      • What are the characteristics of the top-performing videos in each industry?
    • For Comments, we must answer the following questions at minimum:
      • What is the overall comment distribution?
      • How does comment sentiment vary by Industry?
      • Is there a relationship between comment sentiment and engagement rate?

    For both datasets, it was crucial to keep in mind that the EDA was in service of the model building. I did not assume that the datasets were representative of the YouTube video population. If a video observation failed to meet the model-building requirements, it was filtered out prior to visualization.

    Model Building

    • We will pare down our datasets to their core features
      • Using the following features: 'Industry', 'Title_Clean', 'Avg Views', 'Avg Comments', and 'Avg Likes'
      • Our target variable is 'Engagement Rate'
    • So, our model takes the above-mentioned features and predicts what a video's average engagement rate should be.
    • The data is passed through a preprocessing pipeline and fed to an XGBoost regression estimator

    Strategic recommendations for E-Learning Collaboration

    • Using our newly built XGBoost model, we will feed the dataset back into the model and predict the engagement rate for each video
    • We will filter the existing dataset on temporal and keyword criteria, then return the videos with the highest model-predicted engagement rate

    Let me define some important concepts and assumptions

    How should we measure audience connection?

    The dataset supplies us with three quantitative features: views, likes, and comments. A view is a precondition for a like or a comment: someone who has viewed a video may not necessarily like or comment on it. A video that deeply connects with a user is more likely to attract comments and likes. Because the user effort required to write a comment is greater than the effort required to leave a like, I have increased the weight of a comment as a signal of higher engagement. I therefore define a video's engagement rate using the following formula:
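
    Engagement Rate = (Likes + 1.5 × Comments) / Views × 100

    (expressed as a percentage, with a weight of 1 for likes and 1.5 for comments, matching the calculate_engagement_rate helper defined below)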

    I will use this metric later to build my model. A high engagement rate indicates strong viewer interaction and interest.

    How valid is our dataset for the purposes of model building?

    • I am assuming that the target of our model-building exercise is an English-speaking audience, and I have therefore excluded all non-English entries from consideration during exploratory data analysis and model building. In fact, there is a significant number of observations in both the Video Stats and Comments datasets that are non-English.
    • In addition, we will discover that the 'keyword' column is poorly encoded. There are especially egregious instances where a keyword does not reflect its true category (particularly for music videos). I manually re-encoded this data.

    How did I remap categories?

    • To improve our understanding of video brands and categories, as well as their temporal trends and engagement metrics, I re-encoded the original keyword column using my best judgment. Where I thought there was topic overlap, I consolidated keywords by topic, e.g., Nintendo, Xbox, and Minecraft are all considered Video Games.
    Original Keyword → Remapped Keyword
    animals → Nature
    apple → Tech
    asmr → ASMR
    bed → Leisure
    biology → Science
    business → Business
    chess → Chess
    cnn → News
    computer science → Coding
    crypto → Crypto
    cubes → Rubix Cube
    data science → Data Science
    education → Education
    finance → Finance
    food → Food
    game development → Coding
    gaming → Video Games
    google → Tech
    history → History
    how-to → Tutorials
    interview → Interviews
    literature → Literature
    lofi → Music
    machine learning → Data Science
    marvel → Movies
    mathchemistry → Science
    minecraft → Video Games
    movies → Movies
    mrbeast → Mr Beast
    mukbang → Food
    music → Music
    news → News
    nintendo → Video Games
    physics → Science
    reaction → Reaction
    sat → SAT
    sports → Sports
    tech → Tech
    trolling → Trolling
    tutorial → Tutorials
    xbox → Video Games

    In addition, I found multiple instances where music artists were present in keyword categories unrelated to music. These instances were manually remapped.
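
    A minimal sketch of how such a remapping can be applied with pandas. The dictionary below is abbreviated to a few rows from the table above, and I write the result into an 'Industry' column because that is the categorical feature name used later in the model-building step; the notebook's actual column naming may differ.

    # Sketch: remap the original 'Keyword' column using the table above (abbreviated here).
    keyword_map = {
        'animals': 'Nature',
        'apple': 'Tech',
        'machine learning': 'Data Science',
        'minecraft': 'Video Games',
        'xbox': 'Video Games',
    }
    video_stats['Industry'] = video_stats['Keyword'].map(keyword_map).fillna(video_stats['Keyword'])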

    How did I normalize view, comment, and like counts across time?

    By definition, a video's view count starts at zero and only increments positively. A video's view count today can only be greater than or equal to its view count yesterday. A video that has been published for several years has the benefit of many more days of public exposure compared to a video that has only been published for a few days. To normalize this, I calculated the number of days since published, defined as the difference between the latest published date in the dataset and the respective video's published date. This shows the number of days each video has been 'live' on the site. I then divided each respective metric (comments, views, or likes) by the number of days since published.

    The dataset did not provide better alternatives to this logic: for instance, I could not apply a weighted-average metric that put more weight on recent views.
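
    A sketch of this normalization, assuming the averaged columns are named to match the 'Avg Views', 'Avg Likes', and 'Avg Comments' features referenced later; the 'Days Since Published' column name and the same-day guard are my own choices:

    # Sketch: days since published, then per-day (average) engagement metrics.
    max_date = video_stats['Published At'].max()
    video_stats['Days Since Published'] = (max_date - video_stats['Published At']).dt.days

    # Assumption: treat same-day uploads as having been live for one day to avoid division by zero.
    days_live = video_stats['Days Since Published'].clip(lower=1)
    for metric in ['Views', 'Likes', 'Comments']:
        video_stats[f'Avg {metric}'] = video_stats[metric] / days_live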

    How did I deal with the datasets' text features?

    Every video observation had a title (otherwise I would have thrown it out). I performed standard text cleaning of the title, including lowercase conversion, punctuation removal, stopword removal, and lemmatization. This was applied to both the title and the comment features. During the model-building stage, I applied TF-IDF to the text features, using at most the 50 most common words as features for my model. This was a subjective decision.
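
    A sketch of how the cleaning and vectorization fit together; clean_text is the helper defined below, the TF-IDF cap of 50 comes from the note above, and the 'Comment_Clean' column name is my own placeholder:

    # Sketch: clean titles and comments, then vectorize cleaned titles with TF-IDF.
    video_stats['Title_Clean'] = video_stats['Title'].apply(clean_text)
    comments['Comment_Clean'] = comments['Comment'].apply(clean_text)

    tfidf = TfidfVectorizer(max_features=50)  # keep the 50 most common terms, per the note above
    title_features = tfidf.fit_transform(video_stats['Title_Clean'])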

    What about comment sentiment?

    For the comments dataset, I also calculated the comment length and the comment sentiment of each observation separately. The original dataset only categorized comment sentiment into three categories: zero, one, and two. I was not satisfied with this granularity. Rather than rely on the dataset's provided sentiment score, I used the TextBlob library because it yielded higher sensitivity to comment sentiment. Why TextBlob? Because it was good enough for the task of delivering an estimated average comment sentiment for each video.

    For reference, the TextBlob library normalizes comment sentiment on a -1 to 1 scale, the former signifying a 'negative' comment and the latter signifying a 'positive' comment. I considered any comment with a TextBlob sentiment value less than -0.2 to be 'negative', any comment with a sentiment value greater than 0.2 to be 'positive', and everything in between to be neutral.
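
    A sketch of the per-comment scoring and bucketing, using the get_sentiment and get_sentiment_category helpers defined below; the 'Polarity', 'Sentiment Category', and 'Avg Comment Sentiment' names are my own placeholders:

    # Sketch: TextBlob polarity per comment, sentiment buckets, and an average sentiment per video.
    comments['Polarity'] = comments['Comment'].apply(get_sentiment)
    comments['Sentiment Category'] = comments['Polarity'].apply(get_sentiment_category)
    avg_sentiment = comments.groupby('Video ID')['Polarity'].mean().rename('Avg Comment Sentiment')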

    How did I build the model?

    I isolated the key features from the dataset, which as mentioned were 'Industry', 'Title_Clean', 'Avg Views', 'Avg Comments', and 'Avg Likes' for each video. These features were used to predict the engagement rate of a video. As this is a regression problem, I evaluated multiple machine learning algorithms and ultimately settled on the XGBoost model because it returned the lowest mean squared error. I did not include the model evaluation and testing portion of this analysis in the workbook. In addition, I did not perform any hyperparameter tuning, as I consider the model good enough to meet the requirements of the business objective as is.
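
    The modeling step roughly follows the sketch below. The feature list, TF-IDF cap, XGBoost estimator, and mean-squared-error metric come from the analysis itself; the specific preprocessing choices (one-hot encoding, scaling, an 80/20 split) are assumptions on my part, and all names used here are imported in the Libraries cell.

    # Sketch: preprocessing pipeline feeding an XGBoost regressor, evaluated on mean squared error.
    # video_stats is assumed to already contain the engineered columns from the steps above.
    features = ['Industry', 'Title_Clean', 'Avg Views', 'Avg Comments', 'Avg Likes']
    target = 'Engagement Rate'

    X, y = video_stats[features], video_stats[target]

    preprocessor = make_column_transformer(
        (OneHotEncoder(handle_unknown='ignore'), ['Industry']),
        (TfidfVectorizer(max_features=50), 'Title_Clean'),
        (StandardScaler(), ['Avg Views', 'Avg Comments', 'Avg Likes']),
    )
    pipeline = make_pipeline(preprocessor, XGBRegressor(random_state=42))

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    pipeline.fit(X_train, y_train)
    print(f"Test MSE: {mean_squared_error(y_test, pipeline.predict(X_test)):,.4f}")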

    What was my final result?

    Ultimately, the fitted XGBoost estimator predicted the engagement rate of each of the video observations categorized as Data Science under my revised keyword mapping. I was selective about what qualified as a strategic partnership candidate video.

    Specifically, I selected the top three videos with the highest predicted engagement that met all of the following criteria (a sketch of this filtering follows the list):

    • Were categorized as 'Data Science'
    • Had an Engagement Rate greater than 4.00 (for reference, the median engagement rate was 3.26)
    • Had an average comment sentiment greater than 0.2, signifying positive comments
    • Were published on or after January 1, 2021 (I am biased against older videos in the dataset)
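
    A sketch of this selection step, reusing the fitted pipeline from the model-building sketch and the avg_sentiment series from the sentiment sketch above; the 'Predicted Engagement' and 'Avg Comment Sentiment' column names are my own placeholders:

    # Sketch: score every video, attach average comment sentiment, apply the four criteria,
    # then keep the top three by predicted engagement.
    video_stats['Predicted Engagement'] = pipeline.predict(video_stats[features])
    video_stats = video_stats.merge(avg_sentiment.reset_index(), on='Video ID', how='left')

    candidates = video_stats[
        (video_stats['Industry'] == 'Data Science')
        & (video_stats['Engagement Rate'] > 4.00)
        & (video_stats['Avg Comment Sentiment'] > 0.2)
        & (video_stats['Published At'] >= '2021-01-01')
    ]
    top_3 = candidates.sort_values('Predicted Engagement', ascending=False).head(3)
    print(top_3[['Title', 'Video ID', 'Published At', 'Predicted Engagement']])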

    My Recommendations

    Choice 1:

    • Video: DATA SCIENTIST CAREER STORY (how I became a data scientist & revealing my salaries)
    • Video ID: A2v4_SFd9Cc
    • Published: 2022-02-20
    • Predicted Engagement: 5.98
    • Why: This video has a predicted engagement well above the median, a positive average comment sentiment, is relatively recent (Feb. 2022), and covers a theme (career building) that overlaps with our goal of an e-learning platform. The model suggests that this video has the potential to attract more viewership.

    Choice 2:

    • Video: Live Day 1- Introduction To Machine Learning Algorithms For Data Science
    • Video ID: 4UJelID_ICw
    • Published: 2022-02-01
    • Predicted Engagement: 5.86
    • Why: This video has a predicted engagement well above the median, a very positive average comment sentiment (far above the average of other data science videos), is relatively recent (Feb. 2022), and covers a theme (an introduction to machine learning algorithms) that overlaps with our goal of an e-learning platform. The model suggests that this video has the potential to attract more viewership and would be a positive addition.

    Choice 3:

    • Video: Data science roadmap: What skills you should learn first?
    • Video ID: ylOILe-Sc-w
    • Published: 2021-08-06
    • Predicted Engagement: 4.99
    • Why: This video has a predicted engagement above the median and a positive average comment sentiment. It is less recent (August 2021), but covers the objective theme (an overview of machine learning skill sets) that overlaps with our goal of an e-learning platform. This video has a large number of views, and the model suggests it has the potential to attract more viewership in partnership.

    Let's now take a look at the analysis below...

    1) Import Libraries, Data, and Helper Functions

    Libraries

    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt
    import seaborn as sns
    import plotly.express as px
    
    from scipy.stats import ttest_ind
    
    import re
    import string
    from datetime import datetime
    
    #import spacy
    #nlp = spacy.load("en_core_web_sm")
    
    import nltk
    from nltk.corpus import stopwords
    from nltk.tokenize import word_tokenize
    from nltk.stem import WordNetLemmatizer
    
    from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
    from sklearn.preprocessing import StandardScaler, OneHotEncoder
    from sklearn.decomposition import LatentDirichletAllocation, PCA
    from sklearn.model_selection import train_test_split, cross_validate
    from sklearn.pipeline import make_pipeline
    from sklearn.cluster import KMeans
    from sklearn.linear_model import LinearRegression, Lasso, Ridge
    from sklearn.compose import make_column_transformer, make_column_selector
    from sklearn.metrics import mean_squared_error
    from xgboost import XGBRegressor
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.feature_selection import VarianceThreshold
    
    from wordcloud import WordCloud
    from collections import Counter
    
    from textblob import TextBlob
    
    
    # NOTE: the language-detection helper functions below rely on these imports;
    # uncomment them when running the 'original' (unfiltered) data source path.
    #import langid
    #from langdetect import detect, DetectorFactory
    #from langdetect.lang_detect_exception import LangDetectException
    
    import statsmodels.api as sm
    
    import gensim
    from gensim import corpora
    from gensim.models.coherencemodel import CoherenceModel
    
    # Download necessary NLTK data
    nltk.download('omw-1.4')
    nltk.download('punkt')
    nltk.download('stopwords')
    nltk.download('wordnet')
    
    pd.set_option('display.float_format', '{:,.2f}'.format)
    
    # Setting up visualization aesthetics
    sns.set(style="whitegrid")

    Helper functions

    def clean_text(text):
    
        if type(text) != str:
            text = str(text)
        
        # Convert to lowercase
        text = text.lower()
        
        # Remove punctuation and special characters
        text = text.translate(str.maketrans('', '', string.punctuation))
        
        # Tokenize text
        tokens = word_tokenize(text)
        
        # Remove stopwords
        stop_words = set(stopwords.words('english'))
        tokens = [word for word in tokens if word not in stop_words]
        
        # Lemmatize tokens
        lemmatizer = WordNetLemmatizer()
        tokens = [lemmatizer.lemmatize(word) for word in tokens]
    
        tokens = [t for t in tokens if t not in string.punctuation]
        
        # Join tokens back into a single string for easier analysis later
        cleaned_text = ' '.join(tokens)
        
        return cleaned_text
    
    # Function to detect language using langdetect
    def detect_language_langdetect(text):
        try:
            return detect(text)
        except LangDetectException:
            return "unknown"
    
    # Function to detect language using langid
    def detect_language_langid(text):
        lang, _ = langid.classify(text)
        return lang
    
    # Combined function to detect language using both methods
    def detect_language(text):
        lang1 = detect_language_langdetect(text)
        lang2 = detect_language_langid(text)
        if lang1 == 'en' or lang2 == 'en':
            return 'en'
        else:
            return 'unknown'
    
    def get_sentiment_category(polarity):
        if polarity < -0.2:
            return 'Negative'
        elif polarity < 0.2:
            return 'Neutral'
        else:
            return 'Positive'
    
    def get_comment_length(text):
        if type(text) != str:
            text = str(text)
        return len(text)
    
    def get_comment_type(text):
        return type(text) == str
    
    # Define the function to calculate custom engagement rate
    def calculate_engagement_rate(likes, comments, views, like_weight=1, comment_weight=1.5):
        if views == 0:  # Avoid division by zero
            return 0
        return ((like_weight * likes + comment_weight * comments) / views) * 100
    
    
    def get_sort_index(df, column):
        industry_medians = df.groupby(column)['Engagement Rate'].median().sort_values(ascending = False)
        sorted_order = industry_medians.index.tolist()
        return sorted_order
    
    
    def plot_median_engagement_rate_by(df, column):
        tmp = df.groupby(column)['Engagement Rate'].median().sort_values(ascending = False).reset_index()
    
        tmp = tmp.melt(id_vars = [column])
        tmp.columns = [column, 'Metric', 'Value']
        
        tmp2 = tmp[tmp['Metric'] == 'Engagement Rate'].sort_values(by = 'Value', ascending = False)
        sns.barplot(data = tmp2, y = column, x = 'Value')
    
    
    def filter_outliers(df, category_col, value_col):
        """
        Filters outliers from the DataFrame based on the category of the item.
        
        Parameters:
            df (pd.DataFrame): The input DataFrame.
            category_col (str): The name of the column containing category information.
            value_col (str): The name of the column containing the values to be filtered.
            
        Returns:
            pd.DataFrame: The DataFrame with outliers removed.
        """
        def remove_outliers(group):
            q1 = group[value_col].quantile(0.25)
            q3 = group[value_col].quantile(0.75)
            iqr = q3 - q1
            lower_bound = q1 - 1.5 * iqr
            upper_bound = q3 + 1.5 * iqr
            return group[(group[value_col] >= lower_bound) & (group[value_col] <= upper_bound)]
        
        return df.groupby(category_col).apply(remove_outliers).reset_index(drop=True)
    
    
    # Function to filter out tokens that are purely numeric or only a single character
    def filter_numeric_tokens(tokens):
        return [str(token) for token in tokens if not re.match(r'^\d+$|^.{1}$', str(token))]
    
    # Function to get the most common tokens for each category
    def most_common_tokens_by_category(df, category_col, tokens_col, top_n=3):
        # Group by the category column
        grouped = df.groupby(category_col)
        
        # Dictionary to hold results
        common_tokens = {}
        
        # Iterate over each group
        for category, group in grouped:
            all_tokens = []
            # Collect all tokens from the tokens column
            for tokens in group[tokens_col]:
                filtered_tokens = filter_numeric_tokens(tokens)
                all_tokens.extend(filtered_tokens)
            # Count the tokens
            token_counts = Counter(all_tokens)
            # Get the most common tokens
            common_tokens[category] = token_counts.most_common(top_n)
        
        return common_tokens
    
    
    def extract_keywords(title, keywords):
        found_keywords = [keyword for keyword in keywords if keyword in title]
        return found_keywords if found_keywords else ['other']
    
    def get_sentiment(text):
        if type(text) != str:
            text = str(text)
        
        # TextBlob sentiment is (polarity, subjectivity); return the polarity in [-1, 1]
        textblob = TextBlob(text)
        return textblob.sentiment[0]

    Data

    source = 'efficient'
    
    if source == 'original':
        # Video stats for final version
        print('original source')
        video_stats = pd.read_csv('data/videos_stats.csv', parse_dates=['Published At'], dayfirst=True)
        comments = pd.read_csv("data/comments.csv")
    
    else:
        # Video stats for efficiency
        print('efficient source')
        video_stats = pd.read_excel('data/video_stats_ENGLISH.xlsx')
        comments = pd.read_excel("data/comments_ENGLISH.xlsx")
        comments.columns = ['Video ID','Comment','Comment_Likes','Comment_Sentiment']
    
    # Definitions
    engagement_metrics = ['Views', 'Likes', 'Comments']
    
    original_shape = video_stats.shape
    print(original_shape)
    print(f"{original_shape[0] - video_stats[['Title', 'Published At', 'Keyword', 'Likes', 'Comments','Views']].drop_duplicates().shape[0]} duplicate rows")

    2) Data Preprocessing & Cleaning

    Filtering Poorly Encoded Data

    Video Stats

    # Replace negative values with NaN so they can be handled as missing values
    video_stats.loc[video_stats['Likes'] < 0, 'Likes'] = None
    video_stats.loc[video_stats['Comments'] < 0, 'Comments'] = None
    video_stats.loc[video_stats['Views'] < 0, 'Views'] = None
    
    # Fill missing/negative values with zero
    video_stats['Likes'].fillna(0, inplace=True)
    video_stats['Comments'].fillna(0, inplace=True)
    video_stats['Views'].fillna(0, inplace=True)
    
    # Recheck the dataframe to ensure cleaning is complete
    #video_stats.info(), negative_values
    print(video_stats.shape)

    Comments

    comments.isnull().sum()

    Filtering Values