Skip to content
Identifying Trends Amongst Most Successful Startups in the US
Project
Identifying Trends Amongst the Most Successful Startups in the US
Source
DataCamp's "Unicorn Companies" PostgreSQL database
Objective
Analyze data for US-based "Unicorn Companies" (startup companies valued over $1 billion) and identify:
- The most valuable startups in the US
- The most common US cities where "Unicorn Companies" are based out of
- The most common US industries "Unicorn Companies" specialize in
- Most common investors for American "Unicorn Companies"
1. Write a query that returns all American "Unicorn Companies" along with their city, industry, investors, funding, and valuation.
Queryas
unicorns_usa
variable
SELECT
companies.company_id,
company, city, industry, select_investors,
ROUND((funding/1000000000.0),1) AS funding_billions,
ROUND((valuation/1000000000.0),1) AS valuation_billions
FROM companies
INNER JOIN industries ON companies.company_id = industries.company_id
INNER JOIN funding ON companies.company_id = funding.company_id
WHERE country = 'United States';
2. Identify and visualize top 10 most valuable startups in the US
DataFrameas
usa_top_startups
variable
SELECT
company,
valuation_billions,
RANK() OVER (ORDER BY valuation_billions DESC) AS rank
FROM unicorns_usa
ORDER BY valuation_billions DESC
LIMIT 11;
#import packages
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
#define ranking column modifier (adds a 'T-' to the number if it's tied with another entry)
def rank_mod(df):
df['rank'] = df['rank'].astype(str)
dup_ranks = df.groupby('rank').filter(lambda x: len(x) > 1)
for index, row in dup_ranks.iterrows():
df.loc[index, 'rank'] = 'T-' + row['rank']
#create label column
rank_mod(usa_top_startups)
usa_top_startups['label'] = usa_top_startups['rank'] + ". " + usa_top_startups['company']
#create bar graph
startups_vis = sns.barplot(data=usa_top_startups, y='label',x='valuation_billions', color='forestgreen').set(title='Top 10 Most Valuable Startups in US', xlabel='Valuation ($ Billions)', ylabel='')
plt.show()
3. Identify and visualize top 10 most popular US cities for "Unicorn Companies"
DataFrameas
usa_top_cities
variable
SELECT
city,
count(company_id) AS num_companies,
RANK() OVER (ORDER BY count(company_id) DESC) AS rank
FROM unicorns_usa
GROUP BY city
ORDER BY num_companies DESC
LIMIT 11;
#create label column
rank_mod(usa_top_cities)
usa_top_cities['label'] = usa_top_cities['rank'] + ". " + usa_top_cities['city']
#create bar graph
cities_vis = sns.barplot(data=usa_top_cities, y='label',x='num_companies', color = 'forestgreen').set(title='Top 10 US Cities with Most Startups Valued Over $1B', xlabel='# of Companies', ylabel='')
plt.show()
4. Identify and visualize top 10 most poular US industries for "Unicorn Companies"
DataFrameas
usa_top_industries
variable
SELECT
industry,
count(company_id) AS num_companies,
RANK() OVER (ORDER BY count(company_id) DESC) AS rank
FROM unicorns_usa
GROUP BY industry
ORDER BY num_companies DESC
LIMIT 10;
#create label column
rank_mod(usa_top_industries)
usa_top_industries['label'] = usa_top_industries['rank'] + ". " + usa_top_industries['industry']
#create bar graph
cities_vis = sns.barplot(data=usa_top_industries, y='label',x='num_companies', color = 'forestgreen').set(title='Top 10 US Industries with Most Startups Valued Over $1B', xlabel='# of Companies', ylabel='')
plt.show()
5. Identify and visualize top 10 investors who have invested in the most American "Unicorn Companies"
DataFrameas
usa_investors
variable
SELECT company, select_investors
FROM unicorns_usa;
#Split comma-separated investors in select_investors column and stack into new df
investors_series = usa_investors['select_investors'].str.split(', ').apply(pd.Series).stack()
#count occurrences of each investor
investor_counts = investors_series.value_counts().rename_axis('investor').reset_index(name='num_companies')
#create ranking column
investor_counts['rank'] = investor_counts['num_companies'].rank(ascending=False)
investor_counts['rank'] = (investor_counts['rank'] - (investor_counts['rank'] % 1)).astype(int)
#create top 10 investors table
usa_top_investors = investor_counts[0:11]
print(usa_top_investors)