Project: Analyzing Unicorn Companies

Did you know that the average return from investing in stocks is 10% per year (not accounting for inflation)? But who wants to be average?!

You have been asked to support an investment firm by analyzing trends in high-growth companies. They are interested in understanding which industries are producing the highest valuations and the rate at which new high-value companies are emerging. Providing them with this information gives them a competitive insight as to industry trends and how they should structure their portfolio looking forward.

You have been given access to their unicorns database, which contains the following tables:

dates

Column	Description
`company_id`	A unique ID for the company.
`date_joined`	The date that the company became a unicorn.
`year_founded`	The year that the company was founded.

funding

Column	Description
`company_id`	A unique ID for the company.
`valuation`	Company value in US dollars.
`funding`	The amount of funding raised in US dollars.
`select_investors`	A list of key investors in the company.

industries

Column	Description
`company_id`	A unique ID for the company.
`industry`	The industry that the company operates in.

companies

Column	Description
`company_id`	A unique ID for the company.
`company`	The name of the company.
`city`	The city where the company is headquartered.
`country`	The country where the company is headquartered.
`continent`	The continent where the company is headquartered.

The output

Your query should return a table in the following format:

industry	year	num_unicorns	average_valuation_billions
industry1	2021	---	---
industry2	2020	---	---
industry3	2019	---	---
industry1	2021	---	---
industry2	2020	---	---
industry3	2019	---	---
industry1	2021	---	---
industry2	2020	---	---
industry3	2019	---	---

Where industry1, industry2, and industry3 are the three top-performing industries.

The Work

Step 1 - Looking at each column from the tables

Despite the table schema already provided above, it's also good practice to look at how the actual data on the table look. To do that, we showed the first 5 rows of each tables by selecting all columns and limiting the output to 5.

DataFrameas

df4

variable

-- Showing the dates table data
SELECT * FROM dates LIMIT 5;

DataFrameas

df5

variable

-- Showing the funding table data
SELECT * FROM funding LIMIT 5;

DataFrameas

df6

variable

-- Showing the industries table data
SELECT * FROM industries LIMIT 5;

DataFrameas

df7

variable

-- Showing the companies table data
SELECT * FROM companies LIMIT 5;

-- note that there is a null data in city column. Since we did'nt use the city column, it was left as is

Step 2 - Get the industry rank based on the total number of new unicorns from 2019 - 2021

To do that, two table were merged, the industries table to get the name of the industries and the dates table to filter the analysis just for the newcomers in 2019, 2020, and 2021. To rank the Industry, rank() window function is used over the number of unique id in each industries.

note : this query will act as a CTE in the full query that will be done in the later step. For this step, we do this query independently to show the result of this query.

DataFrameas

df8

variable

SELECT industry, 
	COUNT(DISTINCT company_id) AS newcomer_number,
	RANK() OVER(ORDER BY COUNT(DISTINCT company_id) DESC)
FROM dates
JOIN industries
	USING(company_id)
WHERE EXTRACT(year FROM date_joined) IN (2019,2020,2021)
GROUP BY industry

Step 3 - Get the average valuations and newcomers number of each industry in each year from 2019 - 2021.

In order to do that, three table were merged using company_id column. The industries table is used to get the industry names, the funding table is used to get the average valuations, and the dates table is used to filter the years used for analysis to only in 2019, 2020, 2021

note : this query will act as a CTE in the full query that will be done in the later step. For this step, we do this query independently to show the result of this query.

DataFrameas

df9

variable

SELECT  industry,
		d.year,
		ROUND(AVG(f.valuation),2) AS avg_valuation_billions,
		COUNT(i.company_id) as newcomers
FROM industries AS i
JOIN (
	SELECT company_id, valuation/1000000000 AS valuation
	FROM funding
) AS f
USING(company_id)
JOIN (
	SELECT company_id, EXTRACT(year FROM date_joined) AS year
	FROM dates
) AS d
USING(company_id)
WHERE d.year IN (2019,2020,2021)
GROUP BY industry, d.year

Step 4 - recreate the previous two queries as CTE and join them to match the desired output

Using WITH statement, the previous queries are converted to CTE. The query on step on is named best_industry, while the query on the step 2 is named industry_data.

Then, the two CTE's are joined, with the best_industry CTE for filtering the best three industries and the industry_data providing the data of each best industries from 2019 - 2021.

DataFrameas

df1

variable

WITH best_industry AS(
	SELECT industry, 
	COUNT(DISTINCT company_id) AS newcomer_number,
	RANK() OVER(ORDER BY COUNT(DISTINCT company_id) DESC)
	FROM dates
	JOIN industries
	USING(company_id)
	WHERE EXTRACT(year FROM date_joined) IN (2019,2020,2021)
	GROUP BY industry), 
industry_data AS(
	SELECT  industry,
		d.year,
		ROUND(AVG(f.valuation),2) AS avg_valuation_billions,
		COUNT(i.company_id) as newcomers
	FROM industries AS i
	JOIN (
		SELECT company_id, valuation/1000000000 AS valuation
		FROM funding
	) AS f
	USING(company_id)
	JOIN (
		SELECT company_id, EXTRACT(year FROM date_joined) AS year
		FROM dates
	) AS d
	USING(company_id)
	WHERE d.year IN (2019,2020,2021)
	GROUP BY industry, d.year
)

SELECT industry,
		year,
		newcomers AS num_unicorns,
		avg_valuation_billions
FROM industry_data
JOIN best_industry
USING(industry)
WHERE rank IN(1,2,3)
ORDER BY rank, year DESC, num_unicorns DESC;

Project: Analyzing Unicorn Companies

.mfe-app-workspace-kj242g{position:absolute;top:-8px;}.mfe-app-workspace-11ezf91{display:inline-block;}.mfe-app-workspace-11ezf91:hover .Anchor__copyLink{visibility:visible;}dates

funding

industries

companies

The output

The Work

Step 1 - Looking at each column from the tables

Step 2 - Get the industry rank based on the total number of new unicorns from 2019 - 2021

Step 3 - Get the average valuations and newcomers number of each industry in each year from 2019 - 2021.

Step 4 - recreate the previous two queries as CTE and join them to match the desired output

dates