
Step into the world of cinema

In which decade did films see the most success?

... Find this out and more in this project!

Data source: the cinema schema from DataCamp's Course Database

Useful info:

  1. IMDB score is the average rating, on a scale of 1 to 10, given by users of the Internet Movie Database (IMDB) to movies and TV shows.
  2. Click "Code" on the panel at the top right of this page to see the SQL behind each step.

Summary

  • The success of films has not increased with each new decade. In fact, the 2000s saw more success than the 2010s!
  • The USA, UK, Canada and France were among the countries with the highest gross earnings over the past century. To split the data into quartiles by gross earnings, this project used an SQL technique called 'paging' (sketched below).
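
As a quick preview of that technique, 'paging' splits rows into equal-sized, numbered buckets using a window function. A minimal sketch of the idea with NTILE (the actual query in the analysis section may differ):

-- sketch: NTILE(4) assigns each film to a quartile by gross earnings
SELECT
    title,
    country,
    gross,
    NTILE(4) OVER (ORDER BY gross) AS gross_quartile
FROM 'films.csv'
WHERE gross IS NOT NULL;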

Contents

Use the menu button at the top left of the page to navigate to the different sections.

Cleaning

1. Checking for repeated rows
2. Dealing with nulls

Analysis

1.0 Gross earnings
2.0 Correlation between film duration and IMDB score
2.5 Correlation between gross earnings and IMDB score
3.0 Countries releasing most movies
4.0 Most successful decade

Cleaning

We will work with two tables. The films table holds information on each movie, including release year, language and country. The reviews table holds review data such as IMDB score and Facebook likes.
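
The two tables link through the film's id (the reviews table carries a film_id column). A quick sketch of the relationship, assuming the reviews data is read from a 'reviews.csv' file in the same way as the films data:

-- sketch: joining reviews to films on the film id
SELECT f.title, f.release_year, r.imdb_score
FROM 'films.csv' AS f
INNER JOIN 'reviews.csv' AS r
    ON r.film_id = f.id
LIMIT 5;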

1. Checking for repeated rows

In the films table the number of unique movie titles is less than the number of rows. This may be due to duplicates.

I used Excel's CONCAT function to create a helper column that merges all the values in each row.

Then I used conditional formatting to highlight duplicates in that column and removed the duplicate rows based on it.

121 duplicates were removed and 4848 rows remain (headers included).

The same method was used for the reviews table. No duplicates were found and 4969 rows remain (headers included).
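
For reference, the same duplicate check could be expressed in SQL instead of Excel (a sketch that groups on every descriptive column, so only fully identical rows count as duplicates):

-- sketch: rows that appear more than once across all descriptive columns
SELECT title, release_year, country, duration, language, gross,
       COUNT(*) AS copies
FROM 'films.csv'
GROUP BY title, release_year, country, duration, language, gross
HAVING COUNT(*) > 1;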

2. Dealing with nulls

We need the following columns from the films table:

  • id
  • title - 0 nulls
  • release_year - 42 nulls
  • country - 2 nulls
  • duration - 13 nulls
  • language - 11 nulls
  • gross - 791 nulls

We need the following columns from the reviews table:

  • id
  • film_id
  • imdb_score - 0 nulls

Excluding gross, the column with the most nulls (release_year, with 42) accounts for only 0.87% of all the data, which can be considered insignificant. For this reason, I decided to remove the rows containing nulls, but not for the gross column, as it has a much higher proportion of nulls (791, about 16%). Excel was used to filter the table for nulls, and these rows were removed.
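
These counts can be verified in SQL, since COUNT(column) skips nulls while COUNT(*) counts every row (a sketch against the films table):

-- sketch: nulls per column (COUNT(col) ignores nulls)
SELECT
    COUNT(*) - COUNT(title)        AS title_nulls,
    COUNT(*) - COUNT(release_year) AS release_year_nulls,
    COUNT(*) - COUNT(country)      AS country_nulls,
    COUNT(*) - COUNT(duration)     AS duration_nulls,
    COUNT(*) - COUNT(language)     AS language_nulls,
    COUNT(*) - COUNT(gross)        AS gross_nulls
FROM 'films.csv';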

For nulls in the gross earnings column, I will replace each null with the average gross of films from the same country and year. For example, in 1933 the USA released two films: one had a value for gross earnings and the other didn't (see the "gross" column below).

I will create an extra column called "coalesced_gross" that keeps the original gross where it exists and otherwise takes the average gross of other films from the same year and country. See below.

Let's look at the release year, country and gross columns from the films table...

-- original table, gross by yr and country
SELECT release_year, country, gross
FROM 'films.csv'
ORDER BY release_year, country;

We can see from the result set that some films that had no data for gross earning were the only film from that country and year. In these cases, we cannot find an average of films from the same country and year to replace the missing gross value.

-- imputing average gross values to replace nulls
SELECT
    f1.release_year,
    f1.country,
    COALESCE(f1.gross,
             (SELECT AVG(f2.gross)
              FROM 'films.csv' AS f2
              WHERE f2.release_year = f1.release_year
                AND f2.country = f1.country)) AS coalesced_gross
FROM 'films.csv' AS f1
ORDER BY f1.release_year, f1.country;
-- Finding the number of nulls where an average cannot be calculated
WITH CTE AS (
    SELECT
        f1.release_year,
        f1.country,
        COALESCE(f1.gross,
                 (SELECT AVG(f2.gross)
                  FROM 'films.csv' AS f2
                  WHERE f2.release_year = f1.release_year
                    AND f2.country = f1.country)) AS coalesced_gross
    FROM 'films.csv' AS f1
)
SELECT COUNT(*) AS null_count
FROM CTE
WHERE coalesced_gross IS NULL;

Only 129 films had null gross values AND were the only ones from their country and year. This makes up just 2.7% of the data. Should we leave these out? Or could we calculate a global average using all years?
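
If we went with the second option, the fallback could be chained inside the same COALESCE, since COALESCE returns its first non-null argument (a sketch only, not necessarily the imputation used for the final figures):

-- sketch: fall back from the same-year-and-country average
-- to the country's all-time average, then to the overall average
SELECT
    f1.release_year,
    f1.country,
    COALESCE(f1.gross,
             (SELECT AVG(f2.gross)
              FROM 'films.csv' AS f2
              WHERE f2.release_year = f1.release_year
                AND f2.country = f1.country),
             (SELECT AVG(f3.gross)
              FROM 'films.csv' AS f3
              WHERE f3.country = f1.country),
             (SELECT AVG(gross) FROM 'films.csv')) AS coalesced_gross
FROM 'films.csv' AS f1;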

To decide, let's take a closer look at gross earnings...

-- gross by year
SELECT release_year, gross
FROM 'films.csv'
ORDER BY release_year;

Gross earnings seem to have noticeably higher values in the 21st century compared to the 20th century.

Let's zoom in further and calculate some averages to see the difference a bit better...
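
For instance, the average gross per decade could be computed along these lines (a sketch of the next step, deriving the decade by zeroing the year's last digit):

-- sketch: average gross earnings per decade
SELECT
    release_year - (release_year % 10) AS decade,
    AVG(gross) AS avg_gross
FROM 'films.csv'
WHERE release_year IS NOT NULL
GROUP BY decade
ORDER BY decade;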