Investigating Netflix Movies and Guest Stars in The Office

1. Welcome!


The Office! What started as a British mockumentary series about office culture in 2001 has since spawned ten other variants across the world, including an Israeli version (2010-13), a Hindi version (2019-), and even a French Canadian variant (2006-2007). Of all these iterations (including the original), the American series has been the longest-running, spanning 201 episodes over nine seasons.

In this notebook, we look at a dataset of The Office episodes and try to understand how the popularity and quality of the series varied over time. To do so, we use the dataset datasets/office_episodes.csv, which was downloaded from Kaggle.

This dataset contains information on a variety of characteristics of each episode. In detail, these are:

datasets/office_episodes.csv

episode_number: Canonical episode number.
season: Season in which the episode appeared.
episode_title: Title of the episode.
description: Description of the episode.
ratings: Average IMDb rating.
votes: Number of votes.
viewership_mil: Number of US viewers in millions.
duration: Duration in number of minutes.
release_date: Airdate.
guest_stars: Guest stars in the episode (if any).
director: Director of the episode.
writers: Writers of the episode.
has_guests: True/False column for whether the episode contained guest stars.
scaled_ratings: The ratings scaled from 0 (worst-reviewed) to 1 (best-reviewed).

# Import the libraries used throughout the notebook
import pandas as pd
import matplotlib.pyplot as plt

plt.rcParams['figure.figsize'] = [11, 7]

# Reading the data
office_df = pd.read_csv('datasets/office_episodes.csv', parse_dates=['release_date'])

office_df.head()

Adding colors according to the scaled ratings

# Map each episode's scaled rating to a color (red = worst, dark green = best)
cls = []

for ind, row in office_df.iterrows():
    if row['scaled_ratings'] < 0.25:
        cls.append('red')
    elif row['scaled_ratings'] < 0.50:
        cls.append('orange')
    elif row['scaled_ratings'] < 0.75:
        cls.append('lightgreen')
    else:
        cls.append('darkgreen')
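
The same color bands can also be computed without an explicit loop. The following is a minimal sketch using pd.cut, not the approach used in this notebook; toy_df and its values are made up so the block runs without the CSV, and the same call should work on office_df['scaled_ratings'] assuming that column has no missing values.

```python
import pandas as pd

# Toy stand-in for office_df; the values are made up for illustration.
toy_df = pd.DataFrame(
    {"scaled_ratings": [0.0, 0.10, 0.25, 0.49, 0.50, 0.74, 0.75, 1.0]}
)

# Left-closed bins [-inf, 0.25), [0.25, 0.50), [0.50, 0.75), [0.75, inf)
# mirror the strict "<" comparisons in the loop.
bins = [-float("inf"), 0.25, 0.50, 0.75, float("inf")]
labels = ["red", "orange", "lightgreen", "darkgreen"]

toy_df["colors"] = pd.cut(
    toy_df["scaled_ratings"], bins=bins, labels=labels, right=False
).astype(str)

print(toy_df["colors"].tolist())
```

Setting right=False matters here: with the default right-closed bins, a rating of exactly 0.25 would be colored red instead of orange.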

Adding sizes according to guest appearances in the episode

# Give episodes with guest stars a larger marker
size = []

for ind, row in office_df.iterrows():
    if row['has_guests']:
        size.append(250)
    else:
        size.append(25)

office_df['colors'] = cls
office_df['size'] = size

office_df.info()
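
As an alternative to the size loop, the marker sizes can be derived in one vectorized step. This is a small sketch with numpy.where, assuming has_guests is a boolean column; toy_df is a made-up stand-in for office_df.

```python
import numpy as np
import pandas as pd

# Toy stand-in for office_df; the values are made up for illustration.
toy_df = pd.DataFrame({"has_guests": [True, False, False, True]})

# 250 where the episode has guest stars, 25 otherwise
toy_df["size"] = np.where(toy_df["has_guests"], 250, 25)

print(toy_df)
```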

Subsetting episodes according to guest appearances

no_guest_df = office_df[office_df["has_guests"] == False]
guest_df = office_df[office_df["has_guests"] == True]
print(no_guest_df)

Plotting a scatter plot to compare quality (ratings) and popularity (viewership).

fig = plt.figure()

plt.scatter(office_df['episode_number'],
            office_df['viewership_mil'],
            c=cls,
            s=size)
plt.title("Popularity, Quality, and Guest Appearances on the Office")
plt.xlabel("Episode Number")
plt.ylabel("Viewership (Millions)")
plt.show()

1. Episodes in the early seasons have mixed ratings: high, low, and mid-level.

2. Over time, ratings got worse.

3. The lowest ratings appear in the later seasons.

4. The last few episodes did not draw higher viewership, but they received strong ratings.

5. Viewership dropped in the later seasons.

6. The most popular episode of The Office also has a high rating.
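
The observations above are read off the plot by eye. One way to check them numerically is to average ratings and viewership per season. The sketch below assumes the season, scaled_ratings, and viewership_mil columns described earlier; season_summary is a hypothetical helper, and toy_df holds made-up numbers so the block runs without the CSV.

```python
import pandas as pd

# Toy stand-in for office_df; the numbers are made up for illustration.
toy_df = pd.DataFrame({
    "season": [1, 1, 2, 2],
    "scaled_ratings": [0.5, 0.7, 0.2, 0.4],
    "viewership_mil": [10.0, 8.0, 5.0, 3.0],
})


def season_summary(df):
    """Return the mean scaled rating and mean viewership for each season."""
    return df.groupby("season")[["scaled_ratings", "viewership_mil"]].mean()


summary = season_summary(toy_df)
print(summary)
```

Calling season_summary(office_df) on the real data should show whether average ratings and viewership actually decline in the later seasons.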

More about guest appearances.

plt.style.use('fivethirtyeight')

# Episodes without guest stars: default round markers
plt.scatter(no_guest_df['episode_number'],
            no_guest_df['viewership_mil'],
            c=no_guest_df['colors'],
            s=no_guest_df['size'])

# Episodes with guest stars: star markers
plt.scatter(guest_df['episode_number'],
            guest_df['viewership_mil'],
            c=guest_df['colors'],
            s=guest_df['size'],
            marker='*')

plt.title("Popularity, Quality, and Guest Appearances on the Office")
plt.xlabel("Episode Number")
plt.ylabel("Viewership (Millions)")
plt.show()

The guest star in the most popular episode of The Office
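
One possible way to answer this is to locate the row with the highest viewership_mil and read its guest_stars value. The sketch below is an illustration, not the notebook's own solution: top_episode_guests is a hypothetical helper, the toy rows and guest names are made up, and it assumes guest_stars holds a comma-separated string (or a missing value when there are no guests).

```python
import pandas as pd

# Toy stand-in for office_df; titles, numbers, and names are made up.
toy_df = pd.DataFrame({
    "episode_title": ["Episode A", "Episode B", "Episode C"],
    "viewership_mil": [7.5, 12.0, 9.1],
    "guest_stars": [None, "Guest One, Guest Two", "Guest Three"],
})


def top_episode_guests(df):
    """Return the guest stars of the most-viewed episode as a list of names."""
    # idxmax gives the index label of the row with the highest viewership
    top_row = df.loc[df["viewership_mil"].idxmax()]
    guests = top_row["guest_stars"]
    if pd.isna(guests):
        return []
    # Assumption: multiple guests are separated by commas
    return [name.strip() for name in guests.split(",")]


print(top_episode_guests(toy_df))
```

With the real data, top_episode_guests(office_df) would return the guests of the most-watched episode, and any one of them answers the question.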