Skip to content

Did you know that the average return from investing in stocks is 10% per year (not accounting for inflation)? But who wants to be average?!

You have been asked to support an investment firm by analyzing trends in high-growth companies. They are interested in understanding which industries are producing the highest valuations and the rate at which new high-value companies are emerging. Providing them with this information gives them a competitive insight as to industry trends and how they should structure their portfolio looking forward.

You have been given access to their unicorns database, which contains the following tables:

dates

ColumnDescription
company_idA unique ID for the company.
date_joinedThe date that the company became a unicorn.
year_foundedThe year that the company was founded.

funding

ColumnDescription
company_idA unique ID for the company.
valuationCompany value in US dollars.
fundingThe amount of funding raised in US dollars.
select_investorsA list of key investors in the company.

industries

ColumnDescription
company_idA unique ID for the company.
industryThe industry that the company operates in.

companies

ColumnDescription
company_idA unique ID for the company.
companyThe name of the company.
cityThe city where the company is headquartered.
countryThe country where the company is headquartered.
continentThe continent where the company is headquartered.

The output

Your query should return a table in the following format:

industryyearnum_unicornsaverage_valuation_billions
industry12021------
industry22020------
industry32019------
industry12021------
industry22020------
industry32019------
industry12021------
industry22020------
industry32019------

Where industry1, industry2, and industry3 are the three top-performing industries.

The Work

Step 1 - Looking at each column from the tables

Despite the table schema already provided above, it's also good practice to look at how the actual data on the table look. To do that, we showed the first 5 rows of each tables by selecting all columns and limiting the output to 5.

Spinner
DataFrameas
df4
variable
-- Showing the dates table data
SELECT * FROM dates LIMIT 5;
Spinner
DataFrameas
df5
variable
-- Showing the funding table data
SELECT * FROM funding LIMIT 5;
Spinner
DataFrameas
df6
variable
-- Showing the industries table data
SELECT * FROM industries LIMIT 5;
Spinner
DataFrameas
df7
variable
-- Showing the companies table data
SELECT * FROM companies LIMIT 5;

-- note that there is a null data in city column. Since we did'nt use the city column, it was left as is

Step 2 - Get the industry rank based on the total number of new unicorns from 2019 - 2021

To do that, two table were merged, the industries table to get the name of the industries and the dates table to filter the analysis just for the newcomers in 2019, 2020, and 2021. To rank the Industry, rank() window function is used over the number of unique id in each industries.

note : this query will act as a CTE in the full query that will be done in the later step. For this step, we do this query independently to show the result of this query.

Spinner
DataFrameas
df8
variable
SELECT industry, 
	COUNT(DISTINCT company_id) AS newcomer_number,
	RANK() OVER(ORDER BY COUNT(DISTINCT company_id) DESC)
FROM dates
JOIN industries
	USING(company_id)
WHERE EXTRACT(year FROM date_joined) IN (2019,2020,2021)
GROUP BY industry

Step 3 - Get the average valuations and newcomers number of each industry in each year from 2019 - 2021.

In order to do that, three table were merged using company_id column. The industries table is used to get the industry names, the funding table is used to get the average valuations, and the dates table is used to filter the years used for analysis to only in 2019, 2020, 2021

note : this query will act as a CTE in the full query that will be done in the later step. For this step, we do this query independently to show the result of this query.

Spinner
DataFrameas
df9
variable
SELECT  industry,
		d.year,
		ROUND(AVG(f.valuation),2) AS avg_valuation_billions,
		COUNT(i.company_id) as newcomers
FROM industries AS i
JOIN (
	SELECT company_id, valuation/1000000000 AS valuation
	FROM funding
) AS f
USING(company_id)
JOIN (
	SELECT company_id, EXTRACT(year FROM date_joined) AS year
	FROM dates
) AS d
USING(company_id)
WHERE d.year IN (2019,2020,2021)
GROUP BY industry, d.year

Step 4 - recreate the previous two queries as CTE and join them to match the desired output

Using WITH statement, the previous queries are converted to CTE. The query on step on is named best_industry, while the query on the step 2 is named industry_data.

Then, the two CTE's are joined, with the best_industry CTE for filtering the best three industries and the industry_data providing the data of each best industries from 2019 - 2021.

Spinner
DataFrameas
df1
variable
WITH best_industry AS(
	SELECT industry, 
	COUNT(DISTINCT company_id) AS newcomer_number,
	RANK() OVER(ORDER BY COUNT(DISTINCT company_id) DESC)
	FROM dates
	JOIN industries
	USING(company_id)
	WHERE EXTRACT(year FROM date_joined) IN (2019,2020,2021)
	GROUP BY industry), 
industry_data AS(
	SELECT  industry,
		d.year,
		ROUND(AVG(f.valuation),2) AS avg_valuation_billions,
		COUNT(i.company_id) as newcomers
	FROM industries AS i
	JOIN (
		SELECT company_id, valuation/1000000000 AS valuation
		FROM funding
	) AS f
	USING(company_id)
	JOIN (
		SELECT company_id, EXTRACT(year FROM date_joined) AS year
		FROM dates
	) AS d
	USING(company_id)
	WHERE d.year IN (2019,2020,2021)
	GROUP BY industry, d.year
)

SELECT industry,
		year,
		newcomers AS num_unicorns,
		avg_valuation_billions
FROM industry_data
JOIN best_industry
USING(industry)
WHERE rank IN(1,2,3)
ORDER BY rank, year DESC, num_unicorns DESC;