Skip to content

Project

Identifying Trends Amongst the Most Successful Startups in the US

Source

DataCamp's "Unicorn Companies" PostgreSQL database

Objective

Analyze data for US-based "Unicorn Companies" (startup companies valued over $1 billion) and identify:

  • The most valuable startups in the US
  • The most common US cities where "Unicorn Companies" are based out of
  • The most common US industries "Unicorn Companies" specialize in
  • Most common investors for American "Unicorn Companies"

1. Write a query that returns all American "Unicorn Companies" along with their city, industry, investors, funding, and valuation.

Spinner
Queryas
unicorns_usa
variable
SELECT 
	companies.company_id, 
	company, city, industry, select_investors,
    ROUND((funding/1000000000.0),1) AS funding_billions, 
    ROUND((valuation/1000000000.0),1) AS valuation_billions 
FROM companies
INNER JOIN industries ON companies.company_id = industries.company_id
INNER JOIN funding ON companies.company_id = funding.company_id
WHERE country = 'United States';

2. Identify and visualize top 10 most valuable startups in the US

Spinner
DataFrameas
usa_top_startups
variable
SELECT 
	company, 
	valuation_billions,
	RANK() OVER (ORDER BY valuation_billions DESC) AS rank
FROM unicorns_usa
ORDER BY valuation_billions DESC
LIMIT 11;
#import packages
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

#define ranking column modifier (adds a 'T-' to the number if it's tied with another entry)
def rank_mod(df):
    df['rank'] = df['rank'].astype(str)
    dup_ranks = df.groupby('rank').filter(lambda x: len(x) > 1)
    for index, row in dup_ranks.iterrows():
        df.loc[index, 'rank'] = 'T-' + row['rank']

#create label column
rank_mod(usa_top_startups)
usa_top_startups['label'] = usa_top_startups['rank'] + ". " + usa_top_startups['company']


#create bar graph
startups_vis = sns.barplot(data=usa_top_startups, y='label',x='valuation_billions', color='forestgreen').set(title='Top 10 Most Valuable Startups in US', xlabel='Valuation ($ Billions)', ylabel='')
plt.show()

3. Identify and visualize top 10 most popular US cities for "Unicorn Companies"

Spinner
DataFrameas
usa_top_cities
variable
SELECT 
	city,
	count(company_id) AS num_companies, 
	RANK() OVER (ORDER BY count(company_id) DESC) AS rank
FROM unicorns_usa
GROUP BY city 
ORDER BY num_companies DESC
LIMIT 11;
#create label column
rank_mod(usa_top_cities)
usa_top_cities['label'] = usa_top_cities['rank'] + ". " + usa_top_cities['city']

#create bar graph
cities_vis = sns.barplot(data=usa_top_cities, y='label',x='num_companies', color = 'forestgreen').set(title='Top 10 US Cities with Most Startups Valued Over $1B', xlabel='# of Companies', ylabel='')
plt.show()

4. Identify and visualize top 10 most poular US industries for "Unicorn Companies"

Spinner
DataFrameas
usa_top_industries
variable
SELECT 
	industry,
	count(company_id) AS num_companies, 
	RANK() OVER (ORDER BY count(company_id) DESC) AS rank
FROM unicorns_usa
GROUP BY industry 
ORDER BY num_companies DESC
LIMIT 10;
#create label column
rank_mod(usa_top_industries)
usa_top_industries['label'] = usa_top_industries['rank'] + ". " + usa_top_industries['industry']

#create bar graph
cities_vis = sns.barplot(data=usa_top_industries, y='label',x='num_companies', color = 'forestgreen').set(title='Top 10 US Industries with Most Startups Valued Over $1B', xlabel='# of Companies', ylabel='')
plt.show()

5. Identify and visualize top 10 investors who have invested in the most American "Unicorn Companies"

Spinner
DataFrameas
usa_investors
variable
SELECT company, select_investors
FROM unicorns_usa;
#Split comma-separated investors in select_investors column and stack into new df
investors_series = usa_investors['select_investors'].str.split(', ').apply(pd.Series).stack()

#count occurrences of each investor
investor_counts = investors_series.value_counts().rename_axis('investor').reset_index(name='num_companies')

#create ranking column
investor_counts['rank'] = investor_counts['num_companies'].rank(ascending=False)
investor_counts['rank'] = (investor_counts['rank'] - (investor_counts['rank'] % 1)).astype(int)

#create top 10 investors table
usa_top_investors = investor_counts[0:11]
print(usa_top_investors)