Data Scientist Professional Practical Exam Submission
Dataset Overview
The dataset describes movies and TV shows in the Netflix catalog, one title per row. Its structure and the missing values in specific columns are summarized below.
Structure:
- Total Rows: 8,807
- Total Columns: 12
Missing Values:
- Director: 2,634 missing values
- Cast: 825 missing values
- Country: 831 missing values
- Date Added: 10 missing values
- Rating: 4 missing values
- Duration: 3 missing values
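Counts like the ones above are typically obtained with a per-column null tally. The following is a minimal sketch on a small made-up frame; the `toy` data and the `missing_summary` helper are illustrative assumptions and not part of the original notebook. On the real data, the frame loaded from netflix_titles.csv would be passed instead.

```python
import numpy as np
import pandas as pd


def missing_summary(frame: pd.DataFrame) -> pd.Series:
    """Count missing values per column, largest first, skipping complete columns."""
    counts = frame.isna().sum()
    return counts[counts > 0].sort_values(ascending=False)


# Hypothetical stand-in for the real dataset
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "director": ["X", np.nan, np.nan, "Y"],
    "country": ["US", "IN", np.nan, "UK"],
})

summary = missing_summary(toy)
print(toy.shape)  # (rows, columns)
print(summary)
```

Columns without gaps are dropped from the summary so that the output only lists fields that need attention during cleaning.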
Data Loading and Initial Exploration
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, roc_curve, auc
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
# Load the dataset
df = pd.read_csv('netflix_titles.csv')
# Display the dataframe
df.head()
This cell initializes the project by importing the necessary libraries and loading the Netflix dataset. The code includes imports for:
- Data manipulation tools (pandas, numpy)
- Visualization libraries (matplotlib, seaborn)
- Machine learning algorithms and utilities from scikit-learn
The final output displays the first five rows of the dataset, showing the structure of the data with columns including show_id, type, title, director, cast, country, and other content metadata. The dataset appears to contain a mix of movies and TV shows with their associated details. This initial exploration provides a foundation for understanding what kind of data we're working with before proceeding with preprocessing and analysis.
Data Preprocessing and Cleaning
# Convert 'date_added' to datetime format
df['date_added'] = pd.to_datetime(df['date_added'], errors='coerce')
# Extract the leading number from 'duration' (minutes for movies, season count for TV shows)
df['duration_num'] = df['duration'].str.extract(r'(\d+)', expand=False).astype(float)
# Fill missing values before mapping
if not df['type'].mode().empty:
    # Assign the result back instead of using inplace=True on a column selection
    df['type'] = df['type'].fillna(df['type'].mode()[0])
df.fillna({
    'director': 'Unknown',
    'cast': 'Unknown',
    'country': 'Unknown',
    'date_added': df['date_added'].mode()[0] if not df['date_added'].mode().empty else pd.Timestamp('1970-01-01'),
    'rating': df['rating'].mode()[0] if not df['rating'].mode().empty else 'Unknown',
    'duration_num': df['duration_num'].median() if not df['duration_num'].isna().all() else 0
}, inplace=True)
# Convert categorical data to numerical
df['type'] = df['type'].map({'Movie': 0, 'TV Show': 1})
# Check cleaned data
df.info()
df.head()
Description:
This cell focuses on data cleaning and preprocessing, implementing several transformations:
- The 'date_added' column is converted to datetime format to enable time-based analysis
- A new 'duration_num' column is created by extracting numerical values from the 'duration' column
- Missing values are handled systematically:
  - Category fields (director, cast, country) are filled with 'Unknown'
  - Date fields use the mode value or a default timestamp
  - Rating uses the mode value
  - Numerical duration uses the median value
- The 'type' column is encoded numerically: Movies as 0 and TV Shows as 1
The output confirms the cleaning was successful - all columns now have complete data (non-null count matches the total entries), and the datatypes have been properly converted. The first five rows show the transformed data with the new numerical type values and the extracted duration numbers. This preprocessing step is crucial as it prepares the data for both visualization and machine learning modeling by addressing missing values and ensuring appropriate data formats.
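The cleaning steps described above can be sketched on a small made-up frame. The `clean_titles` helper and the `toy` rows are illustrative assumptions, not the notebook's own code, and only a subset of the columns is shown.

```python
import pandas as pd


def clean_titles(frame: pd.DataFrame) -> pd.DataFrame:
    """Apply a reduced version of the cleaning steps to a copy of the frame."""
    out = frame.copy()
    # Leading integer of strings such as "90 min" or "2 Seasons"
    out["duration_num"] = (
        out["duration"].str.extract(r"(\d+)", expand=False).astype(float)
    )
    # Placeholder label for missing categorical text
    out["director"] = out["director"].fillna("Unknown")
    # Median imputation for the numeric duration
    out["duration_num"] = out["duration_num"].fillna(out["duration_num"].median())
    # Numeric encoding of the content type
    out["type"] = out["type"].map({"Movie": 0, "TV Show": 1})
    return out


# Hypothetical rows standing in for the real dataset
toy = pd.DataFrame({
    "type": ["Movie", "TV Show", "Movie", "Movie"],
    "director": ["X", None, "Y", None],
    "duration": ["90 min", "2 Seasons", None, "100 min"],
})

cleaned = clean_titles(toy)
print(cleaned)
```

Note that a single median is computed over minutes and season counts together, so the imputed value depends on the mix of movies and TV shows. Imputing per content type would be one alternative design choice.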
Basic Exploratory Data Visualization
import matplotlib.pyplot as plt
import seaborn as sns
# Step 1: Single-variable visualizations
# Plot 1: Count of content type (Movies vs. TV Shows)
plt.figure(figsize=(6, 4))
sns.countplot(x=df['type'], palette='viridis')
plt.title("Count of Movies vs TV Shows")
plt.xlabel("Type")
plt.ylabel("Count")
plt.show()
# Plot 2: Distribution of release years
plt.figure(figsize=(10, 5))
sns.histplot(df['release_year'], bins=30, kde=True, color='blue')
plt.title("Distribution of Release Years")
plt.xlabel("Release Year")
plt.ylabel("Frequency")
plt.show()
# Step 2: Multi-variable visualization
# Plot 3: Content distribution by rating and type
plt.figure(figsize=(12, 6))
sns.countplot(y=df['rating'], hue=df['type'], order=df['rating'].value_counts().index, palette='coolwarm')
plt.title("Distribution of Ratings by Content Type")
plt.xlabel("Count")
plt.ylabel("Rating")
plt.legend(title="Type")
plt.show()
Description:
This cell creates three essential visualizations to explore the dataset's fundamental characteristics:
- Content Type Distribution: The first plot shows that Movies (Type 0) significantly outnumber TV Shows (Type 1) in the dataset. The bar chart indicates approximately 6,000 movies compared to about 2,800 TV shows, revealing that Netflix's catalog is weighted toward movies.
- Release Year Distribution: The histogram displays a strongly left-skewed distribution: the vast majority of content was released in recent years (2015-2020), with a long, thin tail toward earlier decades. The sharp peak around 2018-2020 indicates Netflix's focus on acquiring/producing contemporary content, with very little representation from earlier decades.
- Rating Distribution by Content Type: This horizontal bar chart breaks down content by rating categories and type. Key insights include:
  - TV-MA (Mature Audience) is the most common rating for both movies and TV shows
  - TV-14 and TV-PG have significant representation
  - Movies have more diversity across rating categories than TV shows
  - Some ratings like G (General Audience) have minimal representation
These visualizations provide a foundation for understanding the dataset's composition and distribution across key variables, highlighting Netflix's content strategy that emphasizes recent releases and mature audience programming.
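The numbers behind these three plots can also be read directly from the data, which is a useful check on what the charts appear to show. The sketch below uses a small made-up frame (the `toy` values are assumptions for illustration only): `value_counts` gives the bar heights of the first plot, `Series.skew` quantifies the direction of the release-year skew, and `pd.crosstab` gives the rating-by-type counts.

```python
import pandas as pd

# Hypothetical rows standing in for the real dataset
toy = pd.DataFrame({
    "type": ["Movie", "Movie", "TV Show", "Movie", "TV Show", "Movie"],
    "rating": ["TV-MA", "TV-14", "TV-MA", "PG", "TV-MA", "TV-MA"],
    "release_year": [1990, 2016, 2018, 2019, 2020, 2020],
})

# Plot 1: number of titles per content type
type_counts = toy["type"].value_counts()

# Plot 2: a negative skewness means a long tail toward older years
year_skew = toy["release_year"].skew()

# Plot 3: titles per rating, split by content type
rating_by_type = pd.crosstab(toy["rating"], toy["type"])

print(type_counts)
print(f"release_year skewness: {year_skew:.2f}")
print(rating_by_type)
```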
Advanced Duration Analysis Visualization
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# Set the style and color palette
plt.style.use('seaborn-v0_8')
colors = {"Movie": "#1f77b4", "TV Show": "#ff7f0e"}
# Reload the raw data so 'type' keeps its string labels ('Movie' / 'TV Show')
df = pd.read_csv('netflix_titles.csv')
# Data preprocessing: leading number of 'duration' (minutes or season count)
df['duration_num'] = df['duration'].str.extract(r'(\d+)', expand=False).astype(float)
# Fill missing values (assign back instead of inplace=True on a column selection)
df['type'] = df['type'].fillna(df['type'].mode()[0] if not df['type'].mode().empty else 'Unknown')
df['duration_num'] = df['duration_num'].fillna(df['duration_num'].median() if not df['duration_num'].isna().all() else 0)
# Create two visualizations: boxplot and histograms
fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(12, 12), height_ratios=[1, 1.2])
# 1. Enhanced Boxplot with log scale
# Fix the box order so the median annotations below line up with the boxes
type_order = ['Movie', 'TV Show']
sns.boxplot(x='type', y='duration_num', data=df, order=type_order, palette=colors, ax=ax1)
ax1.set_yscale('log')
ax1.set_title('Duration Distribution by Content Type (Log Scale)', pad=20, fontsize=14)
ax1.set_xlabel('Content Type', fontsize=12)
ax1.set_ylabel('Duration (log scale)', fontsize=12)
# Add median annotations to boxplot (same order as the boxes)
medians = df.groupby('type')['duration_num'].median().reindex(type_order)
for i, median in enumerate(medians):
    ax1.text(i, median, f'Median: {median:.0f}', horizontalalignment='center',
             verticalalignment='bottom', fontsize=10, color='black', fontweight='bold')
# Add grid for better readability
ax1.grid(True, axis='y', linestyle='--', alpha=0.7)
# 2. Overlaid histograms for Movies and TV Shows
# Both are drawn on the second subplot; TV Shows use a twin y-axis
movies = df[df['type'] == 'Movie']['duration_num']
tv_shows = df[df['type'] == 'TV Show']['duration_num']
# Movies on the primary y-axis
sns.histplot(data=movies, bins=30, color=colors['Movie'], kde=True, ax=ax2)
ax2.set_title('Distribution of Movie Durations (in minutes)', fontsize=12)
ax2.set_xlabel('Duration (minutes)', fontsize=10)
ax2.set_ylabel('Count', fontsize=10)
# Add movie statistics
movie_stats = f'Median: {movies.median():.0f}\nMean: {movies.mean():.0f}\nStd: {movies.std():.0f}'
ax2.text(0.95, 0.95, movie_stats, transform=ax2.transAxes,
         verticalalignment='top', horizontalalignment='right',
         bbox=dict(boxstyle='round', facecolor='white', alpha=0.8))
# Add a secondary axis for TV Shows
ax3 = ax2.twinx()
sns.histplot(data=tv_shows, bins=20, color=colors['TV Show'], kde=True, ax=ax3)
ax3.set_ylabel('Count (TV Shows)', fontsize=10)
# Add TV show statistics
tv_stats = f'Median: {tv_shows.median():.1f}\nMean: {tv_shows.mean():.1f}\nStd: {tv_shows.std():.1f}'
ax3.text(0.05, 0.95, tv_stats, transform=ax3.transAxes,
         verticalalignment='top', horizontalalignment='left',
         bbox=dict(boxstyle='round', facecolor='white', alpha=0.8))
# Add legend (the histograms carry no labels, so build the handles explicitly)
from matplotlib.patches import Patch
legend_handles = [Patch(color=colors['Movie'], label='Movies'),
                  Patch(color=colors['TV Show'], label='TV Shows')]
# Placed at the top center to stay clear of the two statistics boxes
ax3.legend(handles=legend_handles, loc='upper center')
plt.tight_layout()
plt.show()
Description:
This cell provides advanced visualization of content duration patterns, revealing fundamental differences between movies and TV shows:
- Duration Distribution Boxplot (Log Scale):
  - The log-scale boxplot clearly illustrates the distinct duration patterns of the two content types
  - Movies have a median duration of 98 minutes, with most falling between 80-120 minutes
  - TV shows have a median duration of 1 season, with outliers extending to 10+ seasons
  - Using a logarithmic scale effectively displays both distributions despite their different magnitudes
- Detailed Duration Histograms:
  - Movies Distribution (Blue): Shows a bell-shaped curve centered around 90-100 minutes (the standard feature film length)
    - Median: 98 minutes
    - Mean: 100 minutes
    - Standard Deviation: 28 minutes
    - The distribution includes some longer films (150+ minutes) but few extremely short or long outliers
  - TV Shows Distribution (Orange): Reveals a heavily right-skewed distribution
    - Median: 1 season
    - Mean: 1.8 seasons
    - Standard Deviation: 1.6 seasons
    - Most TV content consists of single-season shows, with a rapid decline in frequency as season count increases
The dual-visualization approach effectively communicates both the central tendencies and the full distribution shapes. The log-scale boxplot highlights the significant scale difference between movie durations (minutes) and TV show durations (seasons), while the detailed histograms provide precise insights into the distribution patterns within each content type.
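The per-type statistics quoted above (median, mean, standard deviation) are the kind of summary a grouped aggregation produces. The following is a minimal sketch on made-up rows; the `toy` values are assumptions for illustration and do not reproduce the figures from the real dataset.

```python
import pandas as pd

# Hypothetical rows: movie durations in minutes, TV show durations in seasons
toy = pd.DataFrame({
    "type": ["Movie", "Movie", "Movie", "TV Show", "TV Show", "TV Show"],
    "duration": ["90 min", "100 min", "110 min", "1 Season", "1 Season", "4 Seasons"],
})

# Leading number of the duration string, as in the preprocessing step
toy["duration_num"] = toy["duration"].str.extract(r"(\d+)", expand=False).astype(float)

# One row per content type; units differ between rows (minutes vs. seasons)
stats = toy.groupby("type")["duration_num"].agg(["median", "mean", "std"])
print(stats.round(2))
```

Keeping the two content types in separate rows avoids mixing minutes and seasons in a single statistic, which is the same reason the boxplot uses a log scale and the histograms use separate y-axes.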
Multi-variable Relationships Analysis