Analyzing Streaming Service Content in SQL
Welcome to your webinar workspace! You can follow along as we analyze the data in a SQL database and visualize the results.
To set up your integration, create a PostgreSQL integration with the following credentials:
- Integration Name: Streaming Codealong
- Hostname: workspacedemodb.datacamp.com
- Database: streaming
- Username: streaming_codealong
- Password: streaming_codealong
Note: This is the solution workspace, and you will need to connect your own integration (or setup one using the credentials above) to use it!
Exploring our data
Let's start by checking out the data we will be working with. We can start with the amazon
, hulu
, netflix
, and disney
tables.
SELECT *
FROM amazon
SELECT *
FROM hulu
SELECT *
FROM netflix
SELECT *
FROM disney
We can also inspect the genres
table, which is different from the other tables.
SELECT *
FROM genres
Preparing our data
Joining the different tables
Our data appears to mostly have the same column names. So we can join the data with a series of UNION
s, which will append each table to the previous one.
We use UNION ALL
to preserve any possible duplicate rows, as we will want to count entries if they appear in multiple services.
SELECT *
FROM amazon
UNION ALL
SELECT *
FROM hulu
UNION ALL
SELECT *
FROM netflix
UNION ALL
SELECT *
FROM disney
One problem with the above approach is that we lose out on the streaming service information. So let's repeat our query, but add in the required info!
SELECT *, 'amazon' AS service
FROM amazon
UNION
SELECT *, 'hulu' AS service
FROM hulu
UNION
SELECT *, 'netflix' AS service
FROM netflix
UNION
SELECT *, 'disney' AS service
FROM disney
Great! But we have one more table that might prove useful. Let's add in the genre information with a join.
To do this, we will need to use a Common Table Expression, or CTE.
WITH service_data AS (
SELECT *, 'amazon' AS service
FROM amazon
UNION
SELECT *, 'hulu' AS service
FROM hulu
UNION
SELECT *, 'netflix' AS service
FROM netflix
UNION
SELECT *, 'disney' AS service
FROM disney
)
SELECT *
FROM service_data AS sd
LEFT JOIN genres AS g
ON sd.title = g.film
Inspecting missing data
It looks like we are missing some values in the age
and imdb
columns. We will also check the rotten_tomatoes
column because we may use it later. Let's see how extensive this problem is.
To calculate the null values per column, we will use a combination of SUM() and CASE WHEN (Invalid URL) to count the number of null values.