Data Cleaning for Sabine Hossenfelder's YT vids DA (Part 1)

Data Analysis of Sabine Hossenfelder YT' videos

This is Part 1 - steps I've taken to go from a xls file of raw data (copied and pasted from SH YT'channels) to a clean dataframe

Part 2 is EDA

Part 3 is my attempt at modeling the data

Excel file contains data scraped (copy&paste) from Sabine Hossenfelder YT channel. https://www.youtube.com/@SabineHossenfelder I need to perform some data cleaning and reformatting, since my file is just two 1-columns sheets, containing

the duration of video (rows no. 1, 4, 7 )
the title (rows no. 2, 5, 8...)
string consisting in number of views (number + K or number + M) and
time ago (number + day | days | week | weeks | month | months | year | years ago) (rows no. 3, 6, 9...)

The goal is to create a Data Frame with

index ------------------------------------------ -- int
title --------------------------------------------- -- string
duration_min ------------------------------- -- int
views -------------------------------------------- - int
time ago-in days -------------------------- -- int
date #need to find out how -------- -- date

import pandas as pd
import numpy as np

raw_videos = pd.read_excel('Sabine (1).xlsx')

raw_videos.head()

video_list = list(raw_videos['Unnamed: 0'])

from datetime import datetime, timedelta

def get_duration(content):
  '''
  content - string
  returns: duration in min
  ignores seconds converst number of hours and minutes into minutes
  '''
  duration = 0
  content = content.replace('NOW PLAYING', '')
  d_list= content.split(':') # splits string into hours, minutes and seconds
  if len(d_list)<2:
    return 0
  if len(d_list)<3:
    return int(d_list[0])

  duration = int(d_list[0])*60 + int(d_list[1])
  return duration

def get_views(content):
  '''
    content: string containing the number of views as numberK or numberM
    return number of views in int
  '''
  frags = content.split(' views')
  if 'K' in frags[0]:
    views = float(frags[0].replace('K', ''))
    views = views *1000
  if 'M' in frags[0]:
    views =float(frags[0].replace('M', ''))
    views = views *1_000_000

  ago = frags[1].split()
  ago_str = ' '.join(ago)
  return views, ago_str


def decluter(raw_list):
  '''
  raw_list: list of data string
  returns: df of clean data
  '''
  durations = []
  views = []
  agos = []
  titles = []
  vid_idx = 0
  for index, content in enumerate(raw_list):
    if index%3 == 0:
      duration = get_duration(content)
      durations.append(duration)
    if index%3 == 1:
      titles.append(content)
    if index%3 == 2:
      no_views, ago = get_views(content)
      views.append(no_views)
      agos.append(ago)
  d = {'title': titles, 'duration': durations, 'views': views, 'ago': agos}
  df = pd.DataFrame(data=d)
  return df

clean_vids = decluter(video_list)

clean_vids.head()

clean_vids.dtypes

clean_vids = clean_vids.astype({'views': int})

clean_vids.dtypes

‌
‌
‌

Data Cleaning for Sabine Hossenfelder's YT vids DA (Part 1)

.mfe-app-workspace-kj242g{position:absolute;top:-8px;}.mfe-app-workspace-11ezf91{display:inline-block;}.mfe-app-workspace-11ezf91:hover .Anchor__copyLink{visibility:visible;}Data Analysis of Sabine Hossenfelder YT' videos

This is Part 1 - steps I've taken to go from a xls file of raw data (copied and pasted from SH YT'channels) to a clean dataframe

Part 2 is EDA

Part 3 is my attempt at modeling the data

Data Analysis of Sabine Hossenfelder YT' videos