Skip to content
Data Cleaning for Sabine Hossenfelder's YT vids DA (Part 1)
Data Analysis of Sabine Hossenfelder YT' videos
This is Part 1 - steps I've taken to go from a xls file of raw data (copied and pasted from SH YT'channels) to a clean dataframe
Part 2 is EDA
Part 3 is my attempt at modeling the data
Excel file contains data scraped (copy&paste) from Sabine Hossenfelder YT channel. https://www.youtube.com/@SabineHossenfelder I need to perform some data cleaning and reformatting, since my file is just two 1-columns sheets, containing
- the duration of video (rows no. 1, 4, 7 )
- the title (rows no. 2, 5, 8...)
- string consisting in number of views (number + K or number + M) and
- time ago (number + day | days | week | weeks | month | months | year | years ago) (rows no. 3, 6, 9...)
The goal is to create a Data Frame with
- index ------------------------------------------ -- int
- title --------------------------------------------- -- string
- duration_min ------------------------------- -- int
- views -------------------------------------------- - int
- time ago-in days -------------------------- -- int
- date #need to find out how -------- -- date
import pandas as pd
import numpy as npraw_videos = pd.read_excel('Sabine (1).xlsx')
raw_videos.head()video_list = list(raw_videos['Unnamed: 0'])from datetime import datetime, timedeltadef get_duration(content):
'''
content - string
returns: duration in min
ignores seconds converst number of hours and minutes into minutes
'''
duration = 0
content = content.replace('NOW PLAYING', '')
d_list= content.split(':') # splits string into hours, minutes and seconds
if len(d_list)<2:
return 0
if len(d_list)<3:
return int(d_list[0])
duration = int(d_list[0])*60 + int(d_list[1])
return duration
def get_views(content):
'''
content: string containing the number of views as numberK or numberM
return number of views in int
'''
frags = content.split(' views')
if 'K' in frags[0]:
views = float(frags[0].replace('K', ''))
views = views *1000
if 'M' in frags[0]:
views =float(frags[0].replace('M', ''))
views = views *1_000_000
ago = frags[1].split()
ago_str = ' '.join(ago)
return views, ago_str
def decluter(raw_list):
'''
raw_list: list of data string
returns: df of clean data
'''
durations = []
views = []
agos = []
titles = []
vid_idx = 0
for index, content in enumerate(raw_list):
if index%3 == 0:
duration = get_duration(content)
durations.append(duration)
if index%3 == 1:
titles.append(content)
if index%3 == 2:
no_views, ago = get_views(content)
views.append(no_views)
agos.append(ago)
d = {'title': titles, 'duration': durations, 'views': views, 'ago': agos}
df = pd.DataFrame(data=d)
return df
clean_vids = decluter(video_list)clean_vids.head()clean_vids.dtypesclean_vids = clean_vids.astype({'views': int})clean_vids.dtypes