Project: Investigate a Dataset - TMDb movie data
Table of Contents
- Introduction
- Data Wrangling
- Exploratory Data Analysis
- Conclusions
Introduction
Dataset Description
In this project, I will analyze The Movie Database (TMDb) movie dataset, which contains information collected about various movies. Each movie is described by several properties. The budget and adjusted budget indicate the amount spent on production. The revenue and adjusted revenue indicate the total income earned by the movie. Other properties include a basic overview, the director, genres, release date and year, cast, a popularity score (popularity), the average rating (vote_average), the total number of votes (vote_count), and the production companies behind the movie.
In particular, I am interested in how the budget and popularity of movies relate to their revenue.
Questions for Analysis
I want to know:
- which year had the highest number of movies
- how the popularity, genre and budget of movies relate to their revenue
# Importing the necessary packages:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
Data Wrangling
In this section, I will wrangle the dataset to improve the quality of the visualizations and the communication of results.
Steps in data wrangling
- Gathering data
- Assessing data
- Cleaning data
Gathering data
# Loading the data and printing out a few lines to inspect the data:
df = pd.read_csv('tmdb-movies.csv')
df.head(5)
The output above shows the first 5 rows of the dataset and all the columns.
Assessing data
# Checking for null or missing data
df.info()
In the output above, several columns have fewer non-null entries than the total number of rows, which means the dataset contains missing values.
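Reading missing values off the df.info() output means comparing counts by eye. A minimal sketch of a more direct check is shown here, assuming a small made-up DataFrame (`sample`, with hypothetical titles and directors) in place of the real tmdb-movies.csv data:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for tmdb-movies.csv (values are made up).
sample = pd.DataFrame({
    "original_title": ["Movie A", "Movie B", "Movie C"],
    "director": ["Director X", np.nan, "Director Z"],
    "genres": ["Action|Drama", np.nan, np.nan],
})

# Count the missing values in each column.
missing_counts = sample.isnull().sum()

# Keep only the columns that have at least one missing value,
# with the most affected column first.
missing_counts = missing_counts[missing_counts > 0].sort_values(ascending=False)
print(missing_counts)
```

On the real dataset, the same two lines could be applied to df instead of sample.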
df.describe()
The output above shows descriptive statistics for each numeric column, e.g. the count, the mean, the standard deviation, and the minimum and maximum values.
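By default, describe() summarizes only the numeric columns. The sketch below shows one way to summarize text columns as well, assuming a small made-up DataFrame (`sample`) and not the real dataset. The column values are hypothetical.

```python
import pandas as pd

# Hypothetical toy data (values are made up for illustration).
sample = pd.DataFrame({
    "budget": [100, 200, 300],
    # dtype=object is set explicitly so the column is treated as an object column.
    "genres": pd.Series(["Drama", "Drama", "Comedy"], dtype=object),
})

# Default behaviour: only numeric columns are summarized.
numeric_summary = sample.describe()

# include="object" summarizes object columns instead:
# count, number of unique values, most frequent value (top) and its frequency (freq).
text_summary = sample.describe(include="object")

print(numeric_summary)
print(text_summary)
```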
df.shape  # Returns a tuple: (number of rows, number of columns)
df.dtypes  # Returns the data type of each column
# The object dtype can hold any Python object. For example, checking the
# first value of original_title shows that it is a str:
type(df['original_title'][0])
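Checking only the first value can hide mixed types, since an object column may hold strings alongside NaN (a float) where data is missing. A minimal sketch that tallies the Python types in a column is shown here, assuming a small made-up DataFrame (`sample`) in place of the real data:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (values are made up); dtype=object mirrors
# how text columns appeared in df.dtypes.
sample = pd.DataFrame({
    "original_title": ["Movie A", "Movie B", "Movie C"],
    "director": ["Director X", np.nan, "Director Z"],
}, dtype=object)

# For each column, count how many values there are of each Python type.
type_counts = {
    col: sample[col].map(lambda value: type(value).__name__).value_counts().to_dict()
    for col in sample.columns
}
print(type_counts)
```

A column that reports more than one type, such as str and float, usually contains missing values stored as NaN.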