Using CTEs and JOINS to Analyze Unicorn Companies in PostgreSQL
Introduction
In today’s competitive investment landscape, identifying emerging high-growth companies, particularly unicorns (startups valued at over $1 billion), is crucial for maximizing returns. As part of an initiative to support the firm in analyzing industry trends, the goal of this project was to showcase advanced data analysis skills using SQL queries. The aim was to identify industries that have been producing the highest valuations and to determine the rate at which new high-value companies are emerging.
This was achieved by leveraging advanced SQL techniques, specifically SQL joins and Common Table Expressions (CTEs), to manipulate and analyze the data, providing valuable insights into the performance of high-growth industries. By completing this project, the goal was successfully realized, demonstrating the power of SQL in addressing complex data analysis tasks.
Data Overview
The following datasets were provided:
dates
| Column | Description |
|---|---|
company_id | A unique ID for the company. |
date_joined | The date that the company became a unicorn. |
year_founded | The year that the company was founded. |
funding
| Column | Description |
|---|---|
company_id | A unique ID for the company. |
valuation | Company value in US dollars. |
funding | The amount of funding raised in US dollars. |
select_investors | A list of key investors in the company. |
industries
| Column | Description |
|---|---|
company_id | A unique ID for the company. |
industry | The industry that the company operates in. |
companies
| Column | Description |
|---|---|
company_id | A unique ID for the company. |
company | The name of the company. |
city | The city where the company is headquartered. |
country | The country where the company is headquartered. |
continent | The continent where the company is headquartered. |
Objective
The primary objective of this project was to use SQL techniques to analyze trends in high-growth companies, specifically unicorns, by performing the following tasks:
Identifying the Top 3 Best-Performing Industries: Determine which three industries have created the highest number of new unicorns in the years 2019, 2020, and 2021 combined.
Detailed Unicorn Analysis: For the top 3 industries, extract and analyze the following data:
-The number of unicorns created in each of the years (2019, 2020, 2021).
-The year each unicorn became a unicorn.
-The average valuation of unicorns in these industries, rounded to two decimal places and converted into billions of dollars.
Data Presentation
The output was to be presented in the following format:
| industry | year | num_unicorns | average_valuation_billions |
|---|---|---|---|
| industry1 | 2021 | --- | --- |
| industry2 | 2020 | --- | --- |
| industry3 | 2019 | --- | --- |
| industry1 | 2021 | --- | --- |
| industry2 | 2020 | --- | --- |
| industry3 | 2019 | --- | --- |
| industry1 | 2021 | --- | --- |
| industry2 | 2020 | --- | --- |
| industry3 | 2019 | --- | --- |
Where industry1, industry2, and industry3 are the three top-performing industries.
Key SQL Techniques Employed
SQL Joins:
SQL joins were essential for linking the industries, dates, and funding tables. These joins allowed the extraction of data across different attributes, such as company ID, industry, year of joining, and valuation.
Common Table Expressions (CTEs):
Two key CTEs were used to structure the analysis:
Top Industries CTE: This CTE identified the top 3 industries based on the total number of unicorns created.
Yearly Unicorn Analysis CTE: This CTE provided detailed unicorn data, including the number of unicorns and their average valuations per year, for each of the top industries.
Aggregation and Calculations:
SQL aggregation functions like COUNT() and AVG() were used to compute the total number of unicorns and the average valuations for each industry. The valuation data was then converted into billions using SQL’s arithmetic operations.
--The goal is to first identify the three best-performing industries based on the number of new unicorns created in 2019, 2020, and 2021 combined.
--From those industries (1), find the number of unicorns within these industries (2), the year that they became a unicorn (3), and their average valuation, converted to billions of dollars and rounded to two decimal places (4).
--Finish the query by returning a table containing industry, year, num_unicorns, and average_valuation_billions. For readability, the firm asked to sort results by year and number of unicorns, both in descending order.
--SOLUTION
-- Creating CTE from joining industries and dates tables and filtering for top 3 industries in the year 2019, 2020 and 2021
WITH top_industries AS
(
SELECT i.industry,
COUNT(i.*)
FROM industries AS i
INNER JOIN dates AS d
ON i.company_id = d.company_id
WHERE EXTRACT(year FROM d.date_joined) in ('2019', '2020', '2021')
GROUP BY industry
ORDER BY count DESC
LIMIT 3
),
-- Yearly ranking for each year based on average valuation
yearly_rankings AS
(
SELECT COUNT(i.*) AS num_unicorns,
i.industry,
EXTRACT(year FROM d.date_joined) AS year,
AVG(f.valuation) AS average_valuation
FROM industries AS i
INNER JOIN dates AS d
ON i.company_id = d.company_id
INNER JOIN funding AS f
ON d.company_id = f.company_id
GROUP BY industry, year
)
SELECT industry,
year,
num_unicorns,
ROUND(AVG(average_valuation / 1000000000), 2) AS average_valuation_billions
FROM yearly_rankings
WHERE year in ('2019', '2020', '2021')
AND industry in (SELECT industry
FROM top_industries)
GROUP BY industry, num_unicorns, year
ORDER BY year DESC, num_unicorns DESCResults and Insights
The execution of the above SQL queries successfully achieved the project’s goal of providing valuable insights for the investment firm. The results provided a table that includes the following key metrics:
Top 3 Industries by Unicorn Creation: The industries producing the highest number of unicorns in 2019, 2020, and 2021 were identified, giving the firm a clear view of which sectors are the most dynamic.
Unicorn Trends by Year: The table shows the number of unicorns created each year in the selected industries, along with their average valuations. This helps the firm understand the evolving trends in each sector and anticipate future growth.
Valuation Insights: The average valuation of unicorns in these top industries was calculated and displayed in billions, making it easy to assess the financial potential of companies within these sectors.
Conclusion
This project effectively demonstrated advanced data analysis skills using SQL, focusing on data cleaning, aggregation, and manipulation through the use of joins and CTEs. By applying these techniques, we were able to identify the industries generating the highest growth in terms of unicorn creation and valuation. These insights are crucial for the firm to make data-driven investment decisions, particularly in high-growth sectors.
The successful completion of this analysis showcases the power of SQL in tackling complex business challenges, providing the firm with actionable insights into current market trends. This is essential for shaping a competitive and profitable investment portfolio in the future.