Project: Investigate a Dataset - TMDb movie data

Table of Contents

  • Introduction
  • Data Wrangling
  • Exploratory Data Analysis
  • Conclusions

Introduction

Dataset Description

In this project, I analyze the TMDb (The Movie Database) movie dataset, which contains information collected on a range of movies. Each movie has the following properties: the budget and adjusted budget give the amount spent on production, and the revenue and adjusted revenue give the total income from the movie. Other properties include a basic overview, the director, genres, release date and year, cast, a popularity score (popularity), the average rating (vote_average), the total number of votes (vote_count), and the production companies behind the movie.

In particular, I am interested in how a movie's budget and popularity relate to its revenue.

Questions for Analysis

I want to know:

  • which year had the highest number of movie releases
  • how the popularity, genre, and budget of a movie relate to its revenue
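
As a rough sketch of how these two questions could be approached with pandas, the example below uses a small made-up table in place of the real data. The column names `release_year`, `popularity`, `budget` and `revenue` are assumptions about how the dataset labels these properties, and the numbers are invented purely for illustration. Genre is categorical, so it is left out of the correlation and would need a separate grouping step.

```python
import pandas as pd

# Made-up stand-in for the TMDb data (values are illustrative only;
# the column names are assumed, not confirmed from the real file).
toy = pd.DataFrame({
    "release_year": [2014, 2015, 2015, 2013, 2015, 2014],
    "popularity": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "budget": [10, 20, 30, 40, 50, 60],
    "revenue": [15, 45, 50, 90, 100, 130],
})

# Question 1: count movies per release year and pick the year with the most.
movies_per_year = toy["release_year"].value_counts()
busiest_year = movies_per_year.idxmax()
print("Year with the most movies:", busiest_year)

# Question 2: Pearson correlation of each numeric column with revenue.
correlations = toy[["popularity", "budget", "revenue"]].corr()["revenue"]
print(correlations)
```

A correlation close to 1 or -1 suggests a strong linear relationship with revenue, while a value near 0 suggests a weak one; it does not by itself show that one property causes the other.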

# Import the necessary packages:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

Data Wrangling

In this section, I will wrangle the dataset to improve the quality of the visualizations and of what they communicate.

General Properties of data wrangling

  • Gathering data
  • Assessing data
  • Cleaning data

Gathering data

# Load the data and print the first few rows to inspect it:
df = pd.read_csv('tmdb-movies.csv')
df.head(5)

The output above shows the first 5 rows of the dataset and all the columns.

Assessing data

# Check for null or missing data
df.info()

The output above shows that some columns in the dataset contain missing values.
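
To see exactly which columns have gaps and how many, one option is to count the nulls per column directly. The sketch below runs on a small made-up table so it is self-contained; the column names and values are illustrative assumptions, and on the real data the same two lines would be applied to `df`.

```python
import pandas as pd

# Made-up stand-in for the TMDb data (values are illustrative only).
toy = pd.DataFrame({
    "original_title": ["A", "B", "C", "D"],
    "director": ["X", None, "Y", None],
    "genres": ["Drama", "Action", None, "Comedy"],
    "budget": [100, 200, 300, 400],
})

# Count missing values per column, then keep only the columns that have any,
# sorted from most to fewest missing values.
missing_counts = toy.isnull().sum()
missing_counts = missing_counts[missing_counts > 0].sort_values(ascending=False)
print(missing_counts)
```

Unlike `df.info()`, which reports non-null counts, this lists the number of missing entries directly, which makes it easier to decide which columns need cleaning.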

df.describe()

The output above shows descriptive statistics for each numeric column, e.g. the count, the mean, etc.

df.shape  # returns the number of rows (first value) and columns (second value) of the dataset
df.dtypes  # returns the data type of each column
# To learn more about a column with the object dtype, check the type of one of its values.
# For example, the following returns str for original_title:
type(df['original_title'][0])
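
The same check can be repeated for every column at once. The sketch below is one possible way to do it, again on a small made-up table; the `release_date` column name and all values are assumptions for illustration. It looks up the Python type of the first value in each column.

```python
import pandas as pd

# Made-up stand-in for the TMDb data (values are illustrative only).
toy = pd.DataFrame({
    "original_title": ["A", "B"],
    "release_date": ["6/9/15", "5/13/15"],
    "budget": [150000000, 110000000],
})

# Map each column name to the type name of its first value.
element_types = {col: type(toy[col].iloc[0]).__name__ for col in toy.columns}
print(element_types)
```

A date column that reports `str` here is stored as plain text, so it would need converting (for example with `pd.to_datetime`) before any date-based analysis.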