Skip to content

Unicorn Companies

Scenario: A unicorn company is a privately held company with a current valuation of over $1 billion USD. This dataset consists of unicorn companies and startups across the globe as of November 2021, including country of origin, sector, select investors, and valuation of each unicorn.


Methodology

A - prepare R

  • Data Collection -- provided dataset is in a unicorn_companies.csv file.
  • Data Cleaning -- clean the dataset by handling missing values, removing duplicates and validating the data.
  • Data Transformation -- perform any necessary data transformations, such as converting data types, correcting typos, creating new variables, or aggregating data, to facilitate analysis.

B - analyze SQL

  • Valuation Analysis -- analyze the valuation distribution of unicorn companies and identify any patterns or trends. This could involve calculating summary statistics, creating histograms or box plots to compare valuations across sectors or countries.
  • Sector Performance -- analyze the performance of different sectors by comparing their average valuations and growth rates. Determine which sectors have the highest valuations or the most significant changes over time.
  • Investor Impact -- investigate the influence of select investors on unicorn companies' valuations. Analyze if the presence of specific investors correlates with higher valuations.
  • Country Comparison -- compare the unicorn company landscape across different countries. Analyze the distribution of sectors, valuations, or investor involvement to identify any country-specific patterns or trends.

C - visualize Power BI

  • Exploratory Data Analysis -- conduct exploratory data analysis to understand the characteristics and distributions of the variables in the dataset. Utilize descriptive statistics, and appropriate visualizations to gain insights into the data.
  • Sector Analysis -- visualize the distribution of unicorn companies across different sectors. This could involve creating cahrts, or treemaps to show the relative representation of sectors.
  • Geographic Analysis -- visualize the geographic distribution of unicorn companies by country of origin. This can be done using geographical visualizations to highlight the countries with the highest number of unicorn companies.
  • Investor Analysis -- explore the involvement of select investors in unicorn companies. Visualize the most active investors, analyze their portfolio distribution.

Useful Links

  • LinkedIn
  • Portfolio

A - Prepare

Data Collection

The data set is provided by CB Insights, it consists of one table: unicorn_companies.csv. Table contains 6 columns and 917 records.

  • Company - name of a company, considered as id column as each company name is unique
  • Valuation - market valuation of a company (in $B), numerical with decimals
  • Date Added - date when company reached $1 billion valuation and was added into the list, datetime
  • Country - country from which company comes from, categorical
  • Category - industry in which the company operates, categorical
  • Select Investors - list of the companys' selected investors, categorical

Sample of the Unicorn Companies table is shown below.

suppressPackageStartupMessages(library(tidyverse))
df <- read_csv('data/unicorn_companies.csv', show_col_types = FALSE)
df

Data Preparation

We have spaces in some column names, it is better to rename them to avoid technical difficulties later. Next, let's make sure that our data complies with the description given in section above, this includes checking for any missing values, data types across columns.

Issues found

  • there are no missing values in a table
  • column company have 1 duplicate value which is Bolt. This company name appears in USA and Estonia, but those are two independent companies from different industries
  • columns valuation, date_added have wrong data types
  • column country have 2 typos "Indonesia,", and "United States,"
  • column category have typos in: "Artificial intelligence", and "Finttech"
# Renaming columns
df <- df %>%
	rename(
		'company' = "Company",
		'valuation' = 'Valuation ($B)',
		'date_added' = 'Date Added',
		'country' = 'Country',
		'category' = 'Category',
		'select_investors' = 'Select Investors'
	)
head(df)
# Data validation
# Checking data types and missing values
paste('Missing values:', is_null(df))
paste('Unique companies:', n_distinct(df$company))
paste('Valuation data type:', typeof(df$valuation))
paste('Date added data type:', typeof(df$date_added))
paste('Number of countries', n_distinct(df$country))
paste('Number of categories', n_distinct(df$category))

# Investigating duplicate in `company` column
df1 <- df %>% select(company) %>% mutate(d = duplicated(company)) %>% filter(d == 1)
paste('Duplicated company:', df1$company)

df2 <- df %>% filter(company == 'Bolt')
df2

# Checking `country` column for typos
df3 <- df %>% select(country) %>% mutate(d = duplicated(country)) %>% filter(d == 0) %>% arrange(country)
df3

# Checking column `category` for typos
df4 <- df %>% select(category) %>% mutate(d = duplicated(category)) %>% filter(d == 0) %>% arrange(category)
df4

remove(df1, df2, df3, df4)

Data Cleaning & Transformation

Clean the dataset by handling missing values, removing duplicates and validating the data; perform any necessary data transformations, such as converting data types, correcting typos, creating new variables, or aggregating data, to facilitate analysis.

Based on findings from previous step, the following transformations were made:

  • valuation column have been changed into numeric data type
  • date added have been changed into date data type
  • country, category typos have been replaced with proper values
# Data cleaning
library(lubridate)

df5 <- df %>%
	mutate(
		valuation = substr(valuation, 2, 10000), # Extracting numerical value from valuation column (to get rid of "$" sign)
		valuation = as.numeric(valuation), # Converting string numerical value into numeric data type
		date_added = mdy(date_added), # Converting date_added into date-time type
		country = str_replace(country, ",", ""), # Removing commas from countries
		category = str_replace(category, "Finttech", "Fintech"), # Correcting typos in category records
		category = str_replace(category, "Artificial Intelligence", "Artificial intelligence"),
		company_id = paste(date_added, "--", company) # Creating ID column for future analysis
	)
df5

# Checking data types for 'valuation' and 'date_added' columns
paste("Valuation data type: ", typeof(df5$valuation))
paste("Date_added data type: ", typeof(df5$date_added))
paste("Investors data type: ", typeof(df5$select_investors))

# Checking changes in 'country' and 'category' columns
unique(df5$country)
unique(df5$category)

# Saving clean data frame as new CSV file
write_csv(df5, "data/unicorn_companies_clean.csv")

Summary of Part A

The data preparation and cleaning phase in this project involved thorough processing of the dataset to ensure that the data was accurate, consistent, and ready for analysis. This crucial step set the foundation for reliable and meaningful results during subsequent stages of the project. In this part of the project we have explored our data set, validated data types and records, cleaned and prepared it for further analysis.


B - Analyze

In this section of the project we will use SQL and clean data created in previous step to perform:

  • Valuation Analysis -- analyze the valuation distribution of unicorn companies and identify any patterns or trends. This could involve calculating summary statistics, creating histograms or box plots to compare valuations across countries.
  • Sector Performance -- analyze the performance of different sectors by comparing their average valuations or growth rates. Determine which sectors have the highest valuations or the most significant changes over time.
  • Investor Impact -- investigate the influence of select investors on unicorn companies' valuations. Analyze if the presence of specific investors correlates with higher valuations.
  • Country Comparison -- compare the unicorn company landscape across different countries. Analyze the distribution of sectors, valuations, number of comapnies, or investor involvement to identify any country-specific patterns or trends.
Spinner
Data frameas
cte
variable
		-- Creating longer set of data that lists all investors for each company as well as geographical regions in order to group aggreage calcuations by investor name and region

SELECT
	*,
	unnest(string_to_array(select_investors, ', ')) AS investors,
	CASE WHEN country IN ('Sweden', 'United Kingdom', 'Belgium', 'Estonia', 'Lithuania', 'France', 'Austria', 'Ireland', 'Netherlands', 'Switzerland', 'Luxembourg', 'Finland', 'Denmark', 'Norway', 'Spain', 'Czech Republic', 'Croatia', 'Turkey', 'Germany') THEN 'Europe'
		WHEN country IN ('United States', 'Brazil', 'Canada', 'Mexico', 'Colombia', 'Argentina', 'Chile', 'Bermuda', 'Santa Clara') THEN 'Americas'
		WHEN country IN ('China', 'Hong Kong', 'India', 'Singapore', 'Indonesia', 'Israel', 'Vietnam', 'United Arab Emirates', 'Japan', 'Philippines', 'South Korea', 'Malaysia', 'Thailand') THEN 'Asia'
		WHEN country IN ('Nigeria', 'South Africa', 'Senegal') THEN 'Africa'
		WHEN country IN ('Australia') THEN 'Oceania'
	END AS region,
	CASE WHEN category IN ('Fintech') THEN 'Financial Technology (Fintech)'
			WHEN category IN ('Supply chain, logistics, & delivery', 'Auto & transportation') THEN 'Transportation & Logistics'
			WHEN category IN ('Edtech') THEN 'Education Technology (Edtech)'
			WHEN category IN ('E-commerce & direct-to-consumer', 'Consumer & retail') THEN 'E-commerce & Retail'
			WHEN category IN ('Hardware') THEN 'Hardware & Electronics'
			WHEN category IN ('Health') THEN 'Healthcare & Medical Technology'
			WHEN category IN ('Travel') THEN 'Travel & Tourism'
			WHEN category IN ('Artificial intelligence', 'Internet software & services', 'Data management & analytics', 'Cybersecurity', 'Mobile & telecommunications') THEN 'Technology & Software'
			ELSE 'Other'
		END AS industry
FROM 'data/unicorn_companies_clean.csv';

Valuation Analysis

In this part we are analyzing the valuation distribution across countries. Valuation analysis was performed by calculating descriptive statisctics measures, grouped by country. Calculation query with results as well as bar chart that shows distribution across regions are presented below.


  • Summary of valuation stats:
    • Total valuation of companies on a list is $3,036.82B
    • Overall average valuation is $3.31B
  • In terms of average valuation by country:
    • The best performer is the Sweden with $14.52B,
    • Croatia performs the worst showing only $1B of average valuation, however it is important to note, that Croatia have only one company at the list
  • Top 3 countries by total valuation are:
    • United States $1,579.26B | 52% of total
    • China $569.48B | 18.75%
    • India $158.3B | 5.21%
  • Valuation distribution across regions as follows:
    • Americas ~56% of total valuation
    • Asia is at ~29.6%
    • Europe accounts for ~12.3%
    • Oceania ~1.6%
    • Africa <1%
Spinner
Data frameas
b11
variable
		-- Preparing summary table
WITH prep AS (
	SELECT
		DISTINCT company_id,
		country,
		region,
		valuation
	FROM cte)

		-- Calcualtions for valuation analysis
SELECT
	country, -- selecting countries as anchor for grouping
	region,
	SUM(valuation) AS 'val_$B', -- calculating total, min, max, avg valuation for each country
	ROUND(SUM(valuation / (SELECT SUM(valuation) FROM prep) * 100.00), 3) AS distribution_prcnt,
	MIN(valuation) AS min_val,
	MAX(valuation) AS max_val,
	ROUND(AVG(valuation), 2) AS avg_val,
	(SELECT SUM(valuation) FROM prep) AS 'total_val_$B', -- listing total valuation and avg valuation of total
	(SELECT ROUND(AVG(valuation), 2) FROM prep) AS 'total_avg_val_$B'
FROM prep
GROUP BY 1, 2 -- grouping data by country
ORDER BY SUM(valuation) DESC; -- sorting results by country's total valuation in descending order

Sector (Industry) Performance

In this section we analyze the performance of different industries by comparing their average valuations and growth rates. And determining which industries have the highest valuations or the most significant changes over time.
Our data set contains records starting 2007 for some categories, but 2015 is the year when we have recors for all categories, so in order to compare growth rates across all industries we will narrow the scope of a research for a time period since 2015 till the end of the data in 2021.


  • Leading industies in terms of valuation are:
    • Technology & Software $1,090.23B | 36% of total valuation amount
    • Fintech $736.39B | 24%
    • E-commerce & Retail $347.25B | 11%
  • The most significant changes over time (2021/2015 valuation)
    • Transportation & Logistics increase in market valuation +2,707.5%
    • Edtech +1,157.5%
    • Technology & Software +1,148.84%
    • Fintech +891.59%
Spinner
Data frameas
b21
variable
		-- Pulling data from main table
WITH temp AS (
	SELECT
		DISTINCT company_id,
		category,
		valuation,
		date_added,
		industry
	FROM cte),

		-- Summarizing data by industry
summary AS (
	SELECT
		industry,
		DATE_TRUNC('year', date_added) AS year,
		SUM(valuation) AS sum_valuation,
		RANK() OVER(PARTITION BY industry ORDER BY year) AS r
	FROM temp
	GROUP BY 1, 2
	ORDER BY 1, 2)

		-- Industry valuation
SELECT
	industry,
	SUM(sum_valuation) AS industry_valuation,
	ROUND(SUM(sum_valuation) / (SELECT SUM(sum_valuation) FROM summary), 2) AS share
FROM summary
GROUP BY 1
ORDER BY 2 DESC;