Immigration Data
Let's look at legal immigration patterns in the USA. I downloaded two datasets from the Department of Homeland Security website: 1. the number of individuals who obtained 'legal resident status' each year from 1820 to 2022, and 2. the number of individuals who became naturalized citizens each year from 1907 to 2022. The population data comes from the macrotrends website and goes back to 1950.
I want to see whether these figures increase or decrease over time on an absolute basis. I will also look at them on a relative basis, using year over year change.
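As a minimal sketch of the absolute versus relative view, the snippet below computes both kinds of change on a made-up table. The numbers and the column names `abs_change` and `yoy_pct` are assumptions for illustration, not values from the DHS data.

```python
import pandas as pd

# Made-up numbers for illustration only; not taken from the DHS data.
toy = pd.DataFrame({
    "Year": [2000, 2001, 2002, 2003],
    "Number": [100, 110, 99, 99],
})

# Absolute change from the previous year.
toy["abs_change"] = toy["Number"].diff()

# Relative (year over year) change, as a percentage of the previous year.
toy["yoy_pct"] = toy["Number"].pct_change() * 100

print(toy)
```

The first row has no previous year, so both change columns are NaN there.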
Lastly, I will fit an autoregressive linear regression (with one lag) on annual legal residents, in order to make a 5-year prediction of incoming immigrants.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
residents = pd.read_excel('immigration_data.xlsx', sheet_name='Permanent Residence')
citizens = pd.read_excel('immigration_data.xlsx', sheet_name='Naturalization')
population = pd.read_excel('us_population_data.xlsx')
Observe the dataset.
residents.info()
residents.head()
citizens.info()
citizens.head()
Clean the data. I want the year columns to have the 'int' data type. The years in the citizens dataframe also have unwanted characters in some rows, so I keep only the first four characters of each year before converting.
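As a hedged illustration of this kind of cleanup, the sketch below runs a regex-based variant on a made-up column. The raw values are hypothetical (the real file's stray characters may differ), and `str.extract` is swapped in here as an alternative to slicing the first four characters.

```python
import pandas as pd

# Hypothetical raw values; the real file's stray characters may differ.
toy = pd.DataFrame({"year": ["1907", "1976 1", "2020*", 2021]})

# Cast to string first so integer cells are handled too, then keep
# the first run of four digits and convert to int.
toy["year"] = (
    toy["year"].astype(str).str.extract(r"(\d{4})", expand=False).astype(int)
)

print(toy["year"].tolist())
```

Casting to string up front avoids the case where a mixed column holds both integers and strings.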
residents.loc[residents['Year'].str.len() >= 4, 'Year'] = residents['Year'].str[:4]
residents['Year'] = residents['Year'].astype('int')
residents.info()
residents
citizens.loc[citizens['year'].str.len() >= 4, 'year'] = citizens['year'].str[:4]
citizens['year'] = citizens['year'].astype('int')
citizens.info()
citizens
Legal Residents per year
The y axis is in millions.
fig, ax = plt.subplots(figsize=(10, 6))
sns.lineplot(data=residents, x='Year', y='Number', ax=ax)
sns.regplot(data=residents, x='Year', y='Number', scatter=False, ax=ax)
ax.set_title('Immigrants Obtaining Legal Resident Status per Year')
ax.spines[['top', 'right']].set_visible(False)
plt.show()
print("Correlation between time and immigrants: ", residents['Year'].corr(residents['Number']).round(3))
Naturalized Citizens per year
The y axis is in millions.
fig, ax = plt.subplots(figsize=(10, 6))
sns.lineplot(data=citizens, x='year', y='accepted', ax=ax)
sns.regplot(data=citizens, x='year', y='accepted', scatter=False, ax=ax)
ax.set_title('Immigrants Obtaining Naturalization Status per Year')
ax.spines[['top', 'right']].set_visible(False)
plt.show()
print("Correlation between time and naturalized citizens: ", citizens['year'].corr(citizens['accepted']).round(3))
residents_recent = residents[residents['Year'] > 2000]
residents_recent
Legal Residents since 2001
fig, ax = plt.subplots(figsize=(10, 6))
sns.lineplot(data=residents_recent, x='Year', y='Number', ax=ax)
sns.regplot(data=residents_recent, x='Year', y='Number', scatter=False, ax=ax)
ax.set_title('Immigrants Obtaining Legal Resident Status Since 2001 (in millions)')
ax.spines[['top', 'right']].set_visible(False)
plt.show()
print("Correlation between time and immigrants since 2001: ", residents_recent['Year'].corr(residents_recent['Number']).round(3))
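The printed correlation and the straight trend line in each plot are closely related: for a simple linear fit, the slope equals the Pearson correlation scaled by the ratio of the two standard deviations, so both always share the same sign. The sketch below checks this on made-up values; the arrays are assumptions for illustration, not the DHS figures.

```python
import numpy as np

# Made-up values for illustration only.
years = np.array([2001, 2002, 2003, 2004, 2005], dtype=float)
counts = np.array([1.0, 1.1, 0.7, 1.2, 1.3])

# Pearson correlation between time and the series.
r = np.corrcoef(years, counts)[0, 1]

# Slope of the least-squares line (what a linear trend line draws).
slope, intercept = np.polyfit(years, counts, deg=1)

# The slope equals r scaled by the ratio of standard deviations.
slope_from_r = r * counts.std() / years.std()

print(round(r, 3), round(slope, 3), round(slope_from_r, 3))
```

A correlation near zero therefore means a nearly flat trend line, whatever the year-to-year swings look like.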