Project: Extracting TV Data Insights

Use data manipulation and visualization to explore a television broadcast dataset from the Super Bowl.

Whether or not you like football, the Super Bowl is a spectacle, and there's a little something for everyone at your Super Bowl party. For the sports fan, there is drama in the form of blowouts, comebacks, and controversy. There are the ridiculously expensive ads: some hilarious, others gut-wrenching, thought-provoking, or weird. And there are the halftime shows with the biggest musicians in the world, sometimes riding giant mechanical tigers or leaping from the roof of the stadium.

The dataset we'll use was scraped and polished from Wikipedia. It is made up of three CSV files covering 52 Super Bowls through 2018: one with game data, one with TV data, and one with halftime musician data.

The Data

Three datasets have been provided, and summaries and previews of each are presented below.

1. halftime_musicians.csv

This dataset contains information about the musicians who performed during the halftime shows of various Super Bowl games. The structure is shown below, and it applies to all remaining files.

Column	Description
`'super_bowl'`	The Super Bowl number (e.g., 52 for Super Bowl LII).
`'musician'`	The name of the musician or musical group that performed during the halftime show.
`'num_songs'`	The number of songs performed by the musician or group during the halftime show.

2. super_bowls.csv

This dataset provides details about each Super Bowl game, including the date, location, participating teams, and scores, as well as the points difference between the winning and losing teams ('difference_pts').

3. tv.csv

This dataset contains television viewership statistics and advertisement costs related to each Super Bowl.

Explore Super Bowl data to uncover insights about TV viewership, game outcomes, and halftime shows.

Has TV viewership increased over time? Save your answer as a boolean variable named viewership_increased.
How many matches finished with a point difference greater than 40? Save your answer as an integer named difference.
Who performed the most songs in Super Bowl halftime shows? Save your answer as a string named most_songs.

# Import libraries
import pandas as pd
from matplotlib import pyplot as plt

# Load the CSV data into DataFrames
super_bowls = pd.read_csv("datasets/super_bowls.csv")
super_bowls.head()

tv = pd.read_csv("datasets/tv.csv")
tv.head()

halftime_musicians = pd.read_csv("datasets/halftime_musicians.csv")
halftime_musicians.head()

1. Identifying the year with the highest viewership

Load the TV viewership data and find the maximum value of the average viewers.

def show_info_and_describe():
    dataframes = {
        "super_bowls": super_bowls,
        "tv": tv,
        "halftime_musicians": halftime_musicians
    }
    
    for name, df in dataframes.items():
        print(f"DataFrame: {name}")
        print("Info:")
        df.info()  # info() prints its report directly and returns None
        print("\nDescribe:")
        print(df.describe())
        print("\nNull Values:")
        print(df.isna().sum())
        print("\n" + "="*50 + "\n")

# Call the function to display the information
show_info_and_describe()

# Check the columns of the DataFrame to confirm the viewership column name
print(tv.columns)

# Assuming the viewership column is 'avg_us_viewers' (not 'avg_us_users');
# confirm against the column list printed above.
# Mean and median of average US viewers per network, rounded to one decimal place
avg_view_network = tv.groupby('network')['avg_us_viewers'].agg(['mean', 'median']).round(1)
avg_view_network

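# Plot average US viewers against Super Bowl number to see the trend over time
plt.plot(tv['super_bowl'], tv['avg_us_viewers'])
plt.title('Average Number of US Viewers')
plt.xlabel('Super Bowl')
plt.ylabel('Average US viewers')
plt.show()

# Answer read off the plot above
viewership_increased = True
print(f"Super Bowl viewership increased over time: {viewership_increased}.")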
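The section heading asks for the Super Bowl with the highest viewership, which can also be looked up directly rather than read off the plot. The following is a minimal sketch under stated assumptions: `tv_demo` is a hypothetical stand-in for tv.csv with invented numbers, and comparing the oldest game with the newest is only one crude way to judge the trend.

```python
import pandas as pd

# Toy stand-in for tv.csv; the values are invented for illustration only.
tv_demo = pd.DataFrame({
    "super_bowl": [5, 4, 3, 2, 1],
    "avg_us_viewers": [500, 450, 470, 300, 200],
})

# Sort chronologically so the first row is the oldest game and the last the newest.
tv_sorted = tv_demo.sort_values("super_bowl").reset_index(drop=True)

# Locate the row with the largest audience and read off its Super Bowl number.
peak_row = tv_sorted.loc[tv_sorted["avg_us_viewers"].idxmax()]
peak_super_bowl = int(peak_row["super_bowl"])

# Crude trend check: did the newest game draw more viewers than the oldest?
viewers = tv_sorted["avg_us_viewers"]
increased_demo = bool(viewers.iloc[-1] > viewers.iloc[0])

print(peak_super_bowl, increased_demo)
```

Applied to the real `tv` DataFrame, the same steps would produce a value for `viewership_increased` from the data instead of a hard-coded one, provided the column names match those assumed here.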

2. Determining the matches with point difference above 40

Filter the Super Bowls data where the points difference is over 40.

super_bowls_40 = super_bowls[super_bowls['difference_pts'] > 40]
super_bowls_40

# Count the games with a point difference above 40
difference = len(super_bowls_40)
difference
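The third question, who performed the most songs in halftime shows, can be approached in a similar way: total `num_songs` per musician and take the largest. The sketch below assumes the column names from the halftime_musicians.csv description; `halftime_demo`, its placeholder artist names, and its song counts are all hypothetical.

```python
import pandas as pd

# Toy stand-in for halftime_musicians.csv; names and counts are made up.
halftime_demo = pd.DataFrame({
    "super_bowl": [3, 3, 2, 1, 1],
    "musician": ["Artist A", "Artist B", "Artist A", "Artist C", "Artist B"],
    "num_songs": [4, 2, 3, 5, None],
})

# Total songs per musician across all shows; sum() skips missing values.
songs_per_musician = halftime_demo.groupby("musician")["num_songs"].sum()

# idxmax() returns the index label (the musician) with the largest total.
most_songs_demo = songs_per_musician.idxmax()

print(most_songs_demo)
```

On the real data, the same two steps run against `halftime_musicians` would give the string to store in `most_songs`.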
