1. Welcome!
The Office! What started as a British mockumentary series about office culture in 2001 has since spawned ten other variants across the world, including an Israeli version (2010-13), a Hindi version (2019-), and even a French Canadian variant (2006-2007). Of all these iterations (including the original), the American series has been the longest-running, spanning 201 episodes over nine seasons.
In this notebook, we will look at a dataset of The Office episodes and try to understand how the popularity and quality of the series varied over time. To do so, we will use the dataset datasets/office_episodes.csv, which was downloaded from Kaggle.
This dataset contains information on a variety of characteristics of each episode. In detail, these are:
- episode_number: Canonical episode number.
- season: Season in which the episode appeared.
- episode_title: Title of the episode.
- description: Description of the episode.
- ratings: Average IMDb rating.
- votes: Number of votes.
- viewership_mil: Number of US viewers in millions.
- duration: Duration in number of minutes.
- release_date: Airdate.
- guest_stars: Guest stars in the episode (if any).
- director: Director of the episode.
- writers: Writers of the episode.
- has_guests: True/False column for whether the episode contained guest stars.
- scaled_ratings: The ratings scaled from 0 (worst-reviewed) to 1 (best-reviewed).
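The two derived columns, scaled_ratings and has_guests, can be reproduced from the raw columns. The following is a minimal sketch, assuming that scaled_ratings is a plain min-max normalization of ratings and that has_guests simply marks rows where guest_stars is not empty; the dataset's authors may have computed them differently. The toy values and the name toy_df are made up for illustration.

```python
import pandas as pd

# Toy data for illustration only; these values are not taken from office_episodes.csv
toy_df = pd.DataFrame({
    "ratings": [6.5, 8.0, 9.5],
    "guest_stars": [None, "Guest One", None],
})

# Assumed min-max normalization: the lowest rating maps to 0, the highest to 1
r_min, r_max = toy_df["ratings"].min(), toy_df["ratings"].max()
toy_df["scaled_ratings"] = (toy_df["ratings"] - r_min) / (r_max - r_min)

# Assumed rule: an episode has guests when guest_stars is not missing
toy_df["has_guests"] = toy_df["guest_stars"].notna()

print(toy_df)
```

With this definition, the thresholds 0.25, 0.5, and 0.75 split the observed rating range into four equal-width bands.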
# Use this cell to begin your analysis, and add as many as you would like!
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from pandas import Series, DataFrame

office_df = pd.read_csv("datasets/office_episodes.csv")
office_df.head()

# Set up the color category for plotting
cols = []
for ind, row in office_df.iterrows():
    if row['scaled_ratings'] < 0.25:
        cols.append('red')
    elif row['scaled_ratings'] < 0.5:
        cols.append('orange')
    elif row['scaled_ratings'] < 0.75:
        cols.append('lightgreen')
    else:
        cols.append('darkgreen')
# Set marker sizes so that episodes with guest stars stand out
sizes = []
for ind, row in office_df.iterrows():
    if row['has_guests'] == False:
        sizes.append(25)
    else:
        sizes.append(250)
#adding these to the dataframe
office_df['colors'] = cols
office_df['sizes'] = sizes
# Split the data into episodes with and without guest stars
non_guest_df = office_df[office_df['has_guests'] == False]
guest_df = office_df[office_df['has_guests'] == True]
#plotting the graph
fig = plt.figure()
plt.scatter(x=non_guest_df['episode_number'],
            y=non_guest_df['viewership_mil'],
            c=non_guest_df['colors'],
            s=non_guest_df['sizes'])
plt.scatter(x=guest_df['episode_number'],
            y=guest_df['viewership_mil'],
            c=guest_df['colors'],
            s=guest_df['sizes'],
            marker='*')
plt.title("Popularity, Quality, and Guest Appearances on the Office")
plt.xlabel("Episode Number")
plt.ylabel("Viewership (Millions)")
plt.show()

# List the guest stars of episodes with more than 20 million viewers and pick one of them
office_df[office_df['viewership_mil'] > 20]['guest_stars']
top_star = 'Cloris Leachman'
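The 20-million cutoff is hardcoded and the name is typed in by hand. As a hedged alternative, the sketch below looks up the most-watched row directly with idxmax and reads the guest list from it. It assumes guest_stars holds a comma-separated string; the toy values and the names toy_df, most_watched_idx, and guest_list are made up for illustration.

```python
import pandas as pd

# Toy data for illustration only; these values are not taken from office_episodes.csv
toy_df = pd.DataFrame({
    "episode_title": ["Episode A", "Episode B", "Episode C"],
    "viewership_mil": [7.5, 21.0, 9.2],
    "guest_stars": [None, "Guest One, Guest Two", "Guest Three"],
})

# Row label of the episode with the highest viewership
most_watched_idx = toy_df["viewership_mil"].idxmax()

# Assumption: guest names are stored as one comma-separated string
raw_guests = toy_df.loc[most_watched_idx, "guest_stars"]
guest_list = [name.strip() for name in raw_guests.split(",")]

# Any guest from the most-watched episode qualifies; take the first one
top_star = guest_list[0]
print(guest_list)
```

This avoids picking a threshold, but it would fail if the most-watched episode had no guest stars, so a check for a missing value may be needed on real data.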