Skip to content

Data Analysis of Sabine Hossenfelder YT' videos

This is Part 1 - steps I've taken to go from a xls file of raw data (copied and pasted from SH YT'channels) to a clean dataframe

Part 2 is EDA

Part 3 is my attempt at modeling the data

Excel file contains data scraped (copy&paste) from Sabine Hossenfelder YT channel. https://www.youtube.com/@SabineHossenfelder I need to perform some data cleaning and reformatting, since my file is just two 1-columns sheets, containing

  • the duration of video (rows no. 1, 4, 7 )
  • the title (rows no. 2, 5, 8...)
  • string consisting in number of views (number + K or number + M) and
  • time ago (number + day | days | week | weeks | month | months | year | years ago) (rows no. 3, 6, 9...)

The goal is to create a Data Frame with

  1. index ------------------------------------------ -- int
  2. title --------------------------------------------- -- string
  3. duration_min ------------------------------- -- int
  4. views -------------------------------------------- - int
  5. time ago-in days -------------------------- -- int
  6. date #need to find out how -------- -- date
import pandas as pd
import numpy as np
raw_videos = pd.read_excel('Sabine (1).xlsx')
raw_videos.head()
video_list = list(raw_videos['Unnamed: 0'])
from datetime import datetime, timedelta
def get_duration(content):
  '''
  content - string
  returns: duration in min
  ignores seconds converst number of hours and minutes into minutes
  '''
  duration = 0
  content = content.replace('NOW PLAYING', '')
  d_list= content.split(':') # splits string into hours, minutes and seconds
  if len(d_list)<2:
    return 0
  if len(d_list)<3:
    return int(d_list[0])

  duration = int(d_list[0])*60 + int(d_list[1])
  return duration




def get_views(content):
  '''
    content: string containing the number of views as numberK or numberM
    return number of views in int
  '''
  frags = content.split(' views')
  if 'K' in frags[0]:
    views = float(frags[0].replace('K', ''))
    views = views *1000
  if 'M' in frags[0]:
    views =float(frags[0].replace('M', ''))
    views = views *1_000_000

  ago = frags[1].split()
  ago_str = ' '.join(ago)
  return views, ago_str

def decluter(raw_list):
  '''
  raw_list: list of data string
  returns: df of clean data
  '''
  durations = []
  views = []
  agos = []
  titles = []
  vid_idx = 0
  for index, content in enumerate(raw_list):
    if index%3 == 0:
      duration = get_duration(content)
      durations.append(duration)
    if index%3 == 1:
      titles.append(content)
    if index%3 == 2:
      no_views, ago = get_views(content)
      views.append(no_views)
      agos.append(ago)
  d = {'title': titles, 'duration': durations, 'views': views, 'ago': agos}
  df = pd.DataFrame(data=d)
  return df
clean_vids = decluter(video_list)
clean_vids.head()
clean_vids.dtypes
clean_vids = clean_vids.astype({'views': int})
clean_vids.dtypes