1. Welcome!

The Office! What started as a British mockumentary series about office culture in 2001 has since spawned ten other variants across the world, including an Israeli version (2010-13), a Hindi version (2019-), and even a French Canadian variant (2006-2007). Of all these iterations (including the original), the American series has been the longest-running, spanning 201 episodes over nine seasons.

In this notebook, we will look at a dataset of The Office episodes and try to understand how the popularity and quality of the series varied over time. To do so, we will use the dataset datasets/office_episodes.csv, which was downloaded from Kaggle.

This dataset contains information on a variety of characteristics of each episode. In detail, these are:

datasets/office_episodes.csv
  • episode_number: Canonical episode number.
  • season: Season in which the episode appeared.
  • episode_title: Title of the episode.
  • description: Description of the episode.
  • ratings: Average IMDB rating.
  • votes: Number of votes.
  • viewership_mil: Number of US viewers in millions.
  • duration: Duration in number of minutes.
  • release_date: Airdate.
  • guest_stars: Guest stars in the episode (if any).
  • director: Director of the episode.
  • writers: Writers of the episode.
  • has_guests: True/False column for whether the episode contained guest stars.
  • scaled_ratings: The ratings scaled from 0 (worst-reviewed) to 1 (best-reviewed).
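The scaled_ratings column is described as running from 0 (worst-reviewed) to 1 (best-reviewed). One common way to obtain such a column is min-max scaling. The sketch below assumes that technique and uses made-up ratings; the dataset's author may have computed the column differently.

```python
import pandas as pd

# Hypothetical ratings, for illustration only (not taken from the dataset).
ratings = pd.Series([7.5, 8.0, 9.5, 6.5])

# Min-max scaling: the lowest rating maps to 0 and the highest to 1.
scaled = (ratings - ratings.min()) / (ratings.max() - ratings.min())
print(scaled.tolist())
```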

First, we import the packages we are going to use.

import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

We load our data and check the first rows.

the_office_df = pd.read_csv("datasets/office_episodes.csv")
the_office_df.head()

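Here we customize the features of the plot: marker size depends on whether an episode had guest stars, and marker color depends on its scaled rating.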

# Marker size: 25 for every episode, plus 225 (250 in total) if it had guest stars.
np_base_size = np.array([25] * the_office_df.shape[0])
np_size_increase = np.array(the_office_df["has_guests"]) * np.full_like(np_base_size, 225)
np_size = np_base_size + np_size_increase
np_scaled_ratings = np.array(the_office_df["scaled_ratings"])


def color_selection(a):
    """Map a scaled rating in [0, 1] to a color name, one color per quarter of the range."""
    if a < 0.25:
        return "red"
    elif a < 0.5:
        return "orange"
    elif a < 0.75:
        return "lightgreen"
    else:
        return "darkgreen"


# One color per episode, stored as a new column.
the_office_df["colors"] = the_office_df["scaled_ratings"].apply(color_selection)
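As an alternative to building the size array by hand and applying a function row by row, the same sizes and colors can be computed with vectorized calls. This is only a sketch on a small made-up frame (demo_df and its values are hypothetical); it uses np.where for the sizes and pd.cut for the color bins.

```python
import numpy as np
import pandas as pd

# Small hypothetical frame with the two columns used above.
demo_df = pd.DataFrame({
    "has_guests": [True, False, False, True],
    "scaled_ratings": [0.10, 0.30, 0.60, 0.90],
})

# 250 for episodes with guests, 25 otherwise (the same values as 25 + 225).
sizes = np.where(demo_df["has_guests"], 250, 25)

# right=False makes each bin closed on the left, matching the `<` comparisons.
colors = pd.cut(
    demo_df["scaled_ratings"],
    bins=[-np.inf, 0.25, 0.5, 0.75, np.inf],
    labels=["red", "orange", "lightgreen", "darkgreen"],
    right=False,
).astype(str)

print(sizes.tolist())
print(colors.tolist())
```

One difference to keep in mind: a missing rating falls through to "darkgreen" in the if/elif version, whereas pd.cut leaves it unbinned, so the two approaches only agree when scaled_ratings has no missing values.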

Now we plot the figure required by the instructions.

plt.rcParams['figure.figsize'] = [11, 7]
fig = plt.figure()
plt.scatter(the_office_df["episode_number"], the_office_df["viewership_mil"], s = np_size, c = the_office_df["colors"], marker = "*")
plt.title("Popularity, Quality, and Guest Appearances on the Office")
plt.xlabel("Episode Number")
plt.ylabel("Viewership (Millions)")
plt.show()
the_office_df.sort_values("viewership_mil", ascending = False).head()
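The final sort lists the most-watched episodes. If only the single most-watched episode is needed, idxmax can select its row directly. The sketch below runs on a made-up frame (the titles, viewership figures, and guest names are placeholders, not values from the dataset).

```python
import pandas as pd

# Hypothetical episodes, for illustration only.
demo_df = pd.DataFrame({
    "episode_title": ["Episode A", "Episode B", "Episode C"],
    "viewership_mil": [7.4, 22.9, 9.0],
    "guest_stars": [None, "Guest One, Guest Two", None],
})

# idxmax returns the index label of the row with the highest viewership.
top_row = demo_df.loc[demo_df["viewership_mil"].idxmax()]
print(top_row["episode_title"], top_row["guest_stars"])
```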