Predicting the Status of H-1B Visa Applications
Machine Learning is an application of data science and artificial intelligence that gives systems the ability to learn and improve from experience automatically, without being explicitly programmed. It uses a set of algorithms that can access data and learn from it on their own. In this tutorial, you'll use Python and XGBoost to predict the final case status of a visa application. You will cover the following steps:
- Exploratory data analysis
- Feature engineering and feature extraction
- Training on a dataset with the XGboost algorithm
- Using the trained model to predict the case status of the visa application
Loading the Libraries
A Python library is a collection of functions and methods that lets you perform many actions without writing the code yourself. You need to download, install and import the required libraries accordingly.
NumPy, under its alias np, is the fundamental package for scientific computing with Python. It contains, among other things, a powerful N-dimensional array object, sophisticated (broadcasting) functions, tools for integrating C/C++, and useful linear algebra and random number capabilities. The next important library, pandas, under its alias pd, is used for importing CSV files. It is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for Python. The scikit-learn (sklearn) library provides the machine learning algorithms, the statistics library supplies statistical functions such as mode(), re is used for regular expressions, and xgboost provides the XGBoost classifier, XGBClassifier().
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report
from statistics import mode
import re
from xgboost import XGBClassifier
The H-1B Visa and the Dataset
This dataset is available on Kaggle.
It contains five years' worth of H-1B petition data, with approximately 3 million records overall. The columns in the dataset include case status, employer name, worksite coordinates, job title, prevailing wage, occupation code, and year filed.
The Office of Foreign Labor Certification (OFLC) generates program data, including data about H-1B visas. The disclosure data is updated annually and is available online. The H-1B visa is a highly sought-after non-immigrant visa that permits foreign workers in specialty occupations to enter the country. H-1B visas are a category of employment-based, non-immigrant visas for temporary foreign workers in the United States. For a foreign national to apply for an H-1B visa, a US employer must offer them a job and submit an H-1B petition to US immigration authorities. It is also the most common visa status applied for and held by international students once they complete college or higher education and begin working in a full-time position.
The first step of the H-1B application process is for the US employer to file the H-1B petition on behalf of the foreign worker. In the second step, the prevailing and actual wages are confirmed by the State Employment Security Agency; if the prevailing wage exceeds the offer made by the prospective employer, a wage determination is sought. The third step is to file the Labor Condition Application. The next step is to prepare the petition and file it at the proper USCIS office. Processing times for H-1B petitions vary from location to location; if you would like your petition expedited, you may opt for premium processing. The final step is to check the status of your H-1B petition by entering your receipt number; once USCIS has your application on file, they will update your status in their system.
The dataset consists of more than 2 million data points.
The data labels are divided into 6 classes: WITHDRAWN, INVALIDATED, UNASSIGNED, CERTIFIED-WITHDRAWN, DENIED, and PENDING QUALITY AND COMPLIANCE REVIEW – UNASSIGNED.
You should have the following information for each sample:
- CASE_ID – A unique case ID for every application.
- CASE_STATUS – The target variable, consisting of 6 different classes or values. The status is associated with the last significant event.
- EMPLOYER_NAME – Name of the employer submitting the application.
- SOC_NAME – Occupational name associated with the occupational code.
- JOB_TITLE – Title of the job.
- FULL_TIME_POSITION – Whether the position is full time or not.
- PREVAILING_WAGE – The prevailing wage for a job position, defined as the average wage paid to similarly employed workers in the requested occupation in the area of intended employment.
- YEAR – Year in which the H-1B petition was filed.
- WORKSITE – City and state of the foreign worker's intended area of employment.
- lon – Longitude of the worksite.
- lat – Latitude of the worksite.
Loading the Dataset
The first step is to load the dataset into an object. pandas' .read_csv() method loads a CSV file into a DataFrame; below, the dataset is loaded into an object called df.
df = pd.read_csv('C:/Users/djhanj/Downloads/h1b_TRAIN.csv')
Understanding the Data
Once the data is available, it is always advisable to explore the dataset and gather as much information about it as possible. You need to be sure that the data you have loaded is in a properly structured format and that all the variables or features are loaded correctly.

Use the .info() method on the df object to check the data information. It shows that the data is stored in a DataFrame, with 1 variable in integer format, 4 as floats and 6 as objects. The .head() method shows the first 5 rows of the data, which gives you a fair idea of the dataset. The .describe() method reports the count, mean, standard deviation, min, max and quartiles of all the integer and float variables.
df.info()
df.head()
df.describe()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2251844 entries, 0 to 2251843
Data columns (total 11 columns):
CASE_ID int64
CASE_STATUS object
EMPLOYER_NAME object
SOC_NAME object
JOB_TITLE object
FULL_TIME_POSITION object
PREVAILING_WAGE float64
YEAR float64
WORKSITE object
lon float64
lat float64
dtypes: float64(4), int64(1), object(6)
memory usage: 189.0+ MB
| | CASE_ID | PREVAILING_WAGE | YEAR | lon | lat |
| --- | --- | --- | --- | --- | --- |
count | 2.251844e+06 | 2.251779e+06 | 2.251836e+06 | 2.171398e+06 | 2.171398e+06 |
mean | 1.501788e+06 | 1.498183e+05 | 2.013854e+03 | -9.213684e+01 | 3.815967e+01 |
std | 8.666892e+05 | 5.808395e+06 | 1.680675e+00 | 1.963492e+01 | 4.671896e+00 |
min | 1.000000e+00 | 0.000000e+00 | 2.011000e+03 | -1.578583e+02 | 1.343719e+01 |
25% | 7.512768e+05 | 5.439200e+04 | 2.012000e+03 | -1.119261e+02 | 3.416536e+01 |
50% | 1.502439e+06 | 6.502100e+04 | 2.014000e+03 | -8.623793e+01 | 3.910312e+01 |
75% | 2.252488e+06 | 8.143200e+04 | 2.015000e+03 | -7.551381e+01 | 4.088399e+01 |
max | 3.002457e+06 | 6.997607e+09 | 2.016000e+03 | 1.457298e+02 | 6.483778e+01 |
This dataset has 11 columns, one of which is the target variable, in this case CASE_STATUS. So the data has 1 target variable and 10 independent or explanatory variables. You should definitely check the classes of the target variable: use the .unique() method on the CASE_STATUS column of df to list the unique classes.
This is a classification problem. That means you need to predict the class of the case status.
df['CASE_STATUS'].unique()
df.head(10)
Feature engineering and data preprocessing
Note that the target variable contains 6 different classes:
- Certified
- Certified Withdrawn
- Rejected
- Invalidated
- Pending Quality and Compliance Review
- Denied
Now, depending on the business problem, you have to decide whether to treat this as multi-class or binary classification. For binary classification, you classify each petition as either Certified or Denied, so the first thing to do is convert the remaining classes into one of those two. Rejected and Invalidated are both cases of a denied visa, so convert those statuses to Denied. Pending Quality and Compliance Review cases are most likely to end up denied as well, so change them to Denied too. Certified-Withdrawn belongs to the Certified class, because the visa was certified and the applicant later decided to withdraw the file. In the section below, the classes are converted to either Certified or Denied.
The next class is Withdrawn. Since it is hard to predict the Withdrawn cases from this dataset, you can simply remove that class, i.e. drop all the rows belonging to it. Another reason to remove the Withdrawn class is that it makes up less than 1% of the entire dataset, which means the model would probably not have classified it accurately anyway.
This is an imbalanced dataset with multiple target classes, and the main objective is to estimate the likelihood of being granted an H-1B visa, so you should keep only the Certified and Denied cases. The target variable will then have only 2 classes: Certified and Denied.
import warnings
warnings.filterwarnings("ignore")
df.CASE_STATUS[df['CASE_STATUS']=='REJECTED'] = 'DENIED'
df.CASE_STATUS[df['CASE_STATUS']=='INVALIDATED'] = 'DENIED'
df.CASE_STATUS[df['CASE_STATUS']=='PENDING QUALITY AND COMPLIANCE REVIEW - UNASSIGNED'] = 'DENIED'
df.CASE_STATUS[df['CASE_STATUS']=='CERTIFIED-WITHDRAWN'] = 'CERTIFIED'
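The chained assignments above work with warnings silenced, but the same consolidation can be written with a single replace() call, which avoids the SettingWithCopyWarning altogether. A minimal equivalent sketch:

## Equivalent consolidation of the rare statuses using a mapping dict
status_map = {'REJECTED': 'DENIED',
              'INVALIDATED': 'DENIED',
              'PENDING QUALITY AND COMPLIANCE REVIEW - UNASSIGNED': 'DENIED',
              'CERTIFIED-WITHDRAWN': 'CERTIFIED'}
df['CASE_STATUS'] = df['CASE_STATUS'].replace(status_map)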
Next, drop the rows with the Withdrawn status and check the counts of the Certified and Denied classes in the dataset.
##Drop rows with withdrawn
df.EMPLOYER_NAME.describe()
df = df.drop(df[df.CASE_STATUS == 'WITHDRAWN'].index)
## Storing non null in df w.r.t. case status
df = df[df['CASE_STATUS'].notnull()]
print(df['CASE_STATUS'].value_counts())
CERTIFIED 2114025
DENIED 70606
Name: CASE_STATUS, dtype: int64
The Denied class is only about 3.2% of the total dataset, which means you now have approximately 96.8% Certified cases. This shows that the dataset is highly imbalanced. One of the problems with imbalanced datasets is that the model is biased towards the class with the higher occurrence; in this case, the model would be biased towards Certified. Although there are different techniques to handle imbalanced classification, you won't need them in this tutorial.
print(70606/(70606+2114025))
0.03231941687177377
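You can get the same class share in one line: value_counts(normalize=True) returns proportions instead of raw counts.

## Proportion of each class; DENIED should come out around 3.2%
print(df['CASE_STATUS'].value_counts(normalize=True))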
Treating missing and NA values
This dataset is not clean and contains many missing values, so treating them is an important step. The simplest option is to remove the affected rows, but you should avoid that here because you would lose information. Let's look step by step at how the missing values are treated:
##check count of NAN
count_nan = len(df) - df.count()
print(count_nan)
CASE_ID 0
CASE_STATUS 0
EMPLOYER_NAME 11
SOC_NAME 12725
JOB_TITLE 6
FULL_TIME_POSITION 1
PREVAILING_WAGE 41
YEAR 0
WORKSITE 0
lon 77164
lat 77164
dtype: int64
If you look at the number of missing values for EMPLOYER_NAME, you can fill those 11 missing values with the mode, which is the most frequently occurring value. This is done below.
## Filling na in employer name with mode
df['EMPLOYER_NAME'] = df['EMPLOYER_NAME'].fillna(df['EMPLOYER_NAME'].mode()[0])
The assert statement is used below. When Python encounters an assert statement, it evaluates the accompanying expression, which should be true. If the expression is false, Python raises an AssertionError exception, using the optional argument expression, if one is given, as the argument to AssertionError.
assert pd.notnull(df['EMPLOYER_NAME']).all().all()
The next feature is PREVAILING_WAGE. Most petitions have a wage between 40k and 80k US dollars, but there are some petitions with wages of more than $500k and some with 0 dollars. Since there are very few such cases, they should be treated as outliers and capped at the 2nd and 98th percentiles.
##to check the percentile in wages
print(np.nanpercentile(df.PREVAILING_WAGE,98))
df.PREVAILING_WAGE.median()
138703.0
65000.0
After capping the wages at the 2nd and 98th percentiles, the mean and median are very similar: the median, as shown above, is about 65K and the mean about 68K. Finally, you can replace the NA values with the mean; using the median would work equally well since the two values are so close.
## replacing min and max with 2 and 98 percentile
df.loc[df.PREVAILING_WAGE < 34029, 'PREVAILING_WAGE']= 34029
df.loc[df['PREVAILING_WAGE'] > 138703, 'PREVAILING_WAGE']= 138703
df.PREVAILING_WAGE.fillna(df.PREVAILING_WAGE.mean(), inplace = True)
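Rather than hard-coding 34029 and 138703, you could also derive the caps from the data itself. A sketch of the equivalent steps using np.nanpercentile and clip (the exact values may differ slightly from the rounded ones above):

## Derive the 2nd and 98th percentile caps and clip the wages to them
low_cap, high_cap = np.nanpercentile(df.PREVAILING_WAGE, [2, 98])
df['PREVAILING_WAGE'] = df['PREVAILING_WAGE'].clip(lower=low_cap, upper=high_cap)
df['PREVAILING_WAGE'] = df['PREVAILING_WAGE'].fillna(df['PREVAILING_WAGE'].mean())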
For the JOB_TITLE, FULL_TIME_POSITION and SOC_NAME columns, you can fill the missing values with the mode (the most frequently occurring value). There is no need for more elaborate imputation techniques here because the percentage of missing values is very low. This is done below.
## Filling na in JOB_TITLE and FULL_TIME_POSITION with mode
df['JOB_TITLE'] = df['JOB_TITLE'].fillna(df['JOB_TITLE'].mode()[0])
df['FULL_TIME_POSITION'] = df['FULL_TIME_POSITION'].fillna(df['FULL_TIME_POSITION'].mode()[0])
df['SOC_NAME'] = df['SOC_NAME'].fillna(df['SOC_NAME'].mode()[0])
The next feature is FULL_TIME_POSITION: Y indicates that the petition is for a full-time role and N indicates a part-time role. Around 85% of the jobs applied for are full-time positions. Below are two different ways to calculate the same figure.
foo1 = df['FULL_TIME_POSITION']=='Y'
foo2 = df['CASE_STATUS']=='CERTIFIED'
print(len(df[foo1])/len(df))
fooy = df.FULL_TIME_POSITION[df['FULL_TIME_POSITION']=='Y'].count()
foox = df.CASE_STATUS[df['CASE_STATUS']=='CERTIFIED'].count()
print(fooy/df.FULL_TIME_POSITION.count())
0.8589569588639913
0.858956958864
Dropping the lat and lon Columns
While building the model you can drop the lat and lon columns for now, as their information is covered by the WORKSITE column.

Use the drop() method on the df DataFrame with the column name. Note that you need to specify the axis as well: axis 0 stands for rows and axis 1 for columns.
# Dropping lat and lon columns
df = df.drop('lat', axis = 1)
df = df.drop('lon', axis = 1)
Feature Creation
Now, it is not possible to use each and every column as-is when building a model, but at the same time these columns contain information that you should extract.

EMPLOYER_NAME contains the names of the employers, and there are a lot of unique employers. It is the company that submits the application for its employee. You cannot use EMPLOYER_NAME directly in the model because it has too many unique string values or categories: more than 500 employers, each acting as a factor or level. It is not advisable to use that many factors in a single column.
The top 5 companies submitting applications for their employees are Infosys, TCS, Wipro, Deloitte and IBM. However, if a university submits an application, it is more likely to be accepted.
So, the question is, how can you extract some information from this feature?
Well, you can create a new feature called NEW_EMPLOYER: if EMPLOYER_NAME contains the string 'university' (for instance, if a US university is filing the petition, the employee has a better chance of approval), then NEW_EMPLOYER is set to the value 'university'.

It is pretty easy to create a new empty column, as shown below.
df['NEW_EMPLOYER'] = np.nan
df.shape
(2184631, 10)
Note that there is a catch when checking for the 'University' string in EMPLOYER_NAME. What if the string is uppercase in some rows and lowercase in others? If you match only the lowercase 'university', you will miss the uppercase 'UNIVERSITY', and vice versa. So, to map and check the string correctly, you should first convert all the values to the same case, either lowercase or uppercase.

In the code below, all the values of EMPLOYER_NAME are converted to lowercase using str.lower(). With every value in the same case, it becomes easy to find the rows whose employer name contains 'university'. All such rows get the value 'university' in the NEW_EMPLOYER column; the remaining rows are filled with 'non university'.
warnings.filterwarnings("ignore")
df['EMPLOYER_NAME'] = df['EMPLOYER_NAME'].str.lower()
df.NEW_EMPLOYER[df['EMPLOYER_NAME'].str.contains('university')] = 'university'
df['NEW_EMPLOYER']= df.NEW_EMPLOYER.replace(np.nan, 'non university', regex=True)
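The same feature can also be built in a single step with np.where, which skips the intermediate NaN column; a brief sketch:

## One-step alternative: label university employers, everything else 'non university'
df['NEW_EMPLOYER'] = np.where(df['EMPLOYER_NAME'].str.contains('university', na=False),
                              'university', 'non university')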
The next variable is SOC_NAME, which contains the occupation name. There are a lot of distinct values associated with SOC_NAME, so you can create a new feature, OCCUPATION, that maps each SOC_NAME value to a broader occupation group for the applicant. For example, computer, programmer and software all map to computer occupations. This covers the top 80% of the occupations; the minor and remaining occupations are categorized as Others.
# Creating occupation and mapping the values
warnings.filterwarnings("ignore")
df['OCCUPATION'] = np.nan
df['SOC_NAME'] = df['SOC_NAME'].str.lower()
df.OCCUPATION[df['SOC_NAME'].str.contains('computer|programmer')] = 'computer occupations'
df.OCCUPATION[df['SOC_NAME'].str.contains('software|web developer')] = 'computer occupations'
df.OCCUPATION[df['SOC_NAME'].str.contains('database')] = 'computer occupations'
df.OCCUPATION[df['SOC_NAME'].str.contains('math|statistic')] = 'Mathematical Occupations'
df.OCCUPATION[df['SOC_NAME'].str.contains('predictive model|stats')] = 'Mathematical Occupations'
df.OCCUPATION[df['SOC_NAME'].str.contains('teacher|linguist')] = 'Education Occupations'
df.OCCUPATION[df['SOC_NAME'].str.contains('professor|teach')] = 'Education Occupations'
df.OCCUPATION[df['SOC_NAME'].str.contains('school principal')] = 'Education Occupations'
df.OCCUPATION[df['SOC_NAME'].str.contains('medical|doctor')] = 'Medical Occupations'
df.OCCUPATION[df['SOC_NAME'].str.contains('physician|dentist')] = 'Medical Occupations'
df.OCCUPATION[df['SOC_NAME'].str.contains('health|physical therapist')] = 'Medical Occupations'
df.OCCUPATION[df['SOC_NAME'].str.contains('surgeon|nurse')] = 'Medical Occupations'
df.OCCUPATION[df['SOC_NAME'].str.contains('psychiatr')] = 'Medical Occupations'
df.OCCUPATION[df['SOC_NAME'].str.contains('chemist|physicist')] = 'Advance Sciences'
df.OCCUPATION[df['SOC_NAME'].str.contains('biology|scientist')] = 'Advance Sciences'
df.OCCUPATION[df['SOC_NAME'].str.contains('biologi|clinical research')] = 'Advance Sciences'
df.OCCUPATION[df['SOC_NAME'].str.contains('public relation|manage')] = 'Management Occupation'
df.OCCUPATION[df['SOC_NAME'].str.contains('management|operation')] = 'Management Occupation'
df.OCCUPATION[df['SOC_NAME'].str.contains('chief|plan')] = 'Management Occupation'
df.OCCUPATION[df['SOC_NAME'].str.contains('executive')] = 'Management Occupation'
df.OCCUPATION[df['SOC_NAME'].str.contains('advertis|marketing')] = 'Marketing Occupation'
df.OCCUPATION[df['SOC_NAME'].str.contains('promotion|market research')] = 'Marketing Occupation'
df.OCCUPATION[df['SOC_NAME'].str.contains('business|business analyst')] = 'Business Occupation'
df.OCCUPATION[df['SOC_NAME'].str.contains('business systems analyst')] = 'Business Occupation'
df.OCCUPATION[df['SOC_NAME'].str.contains('accountant|finance')] = 'Financial Occupation'
df.OCCUPATION[df['SOC_NAME'].str.contains('financial')] = 'Financial Occupation'
df.OCCUPATION[df['SOC_NAME'].str.contains('engineer|architect')] = 'Architecture & Engineering'
df.OCCUPATION[df['SOC_NAME'].str.contains('surveyor|carto')] = 'Architecture & Engineering'
df.OCCUPATION[df['SOC_NAME'].str.contains('technician|drafter')] = 'Architecture & Engineering'
df.OCCUPATION[df['SOC_NAME'].str.contains('information security|information tech')] = 'Architecture & Engineering'
df['OCCUPATION']= df.OCCUPATION.replace(np.nan, 'Others', regex=True)
Since visa outcomes depend heavily on the worksite state, you should split the state information out of the WORKSITE variable.
## Splitting city and state and capturing state in another variable
df['state'] = df.WORKSITE.str.split('\s+').str[-1]
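One caveat of splitting on whitespace is that two-word states such as NEW JERSEY or NEW YORK are truncated to their last word (you will see JERSEY in the df.head() output below). If you prefer the full state name, here is a sketch that splits on the comma instead, assuming the WORKSITE values follow the 'CITY, STATE' pattern seen in this dataset; the rest of the tutorial keeps the whitespace split, so the outputs below show the truncated names.

## Alternative: keep the full state name by splitting on the comma
df['state'] = df.WORKSITE.str.split(',').str[-1].str.strip()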
California has the highest number of petitions, cementing its place as the base of choice for IT operations, followed by Texas, New York and New Jersey.
print(df.head())
CASE_ID CASE_STATUS EMPLOYER_NAME \
0 1 CERTIFIED university of michigan
1 2 CERTIFIED goodman networks, inc.
2 3 CERTIFIED ports america group, inc.
3 4 CERTIFIED gates corporation, a wholly-owned subsidiary o...
4 6 CERTIFIED burger king corporation
SOC_NAME \
0 biochemists and biophysicists
1 chief executives
2 chief executives
3 chief executives
4 chief executives
JOB_TITLE FULL_TIME_POSITION \
0 POSTDOCTORAL RESEARCH FELLOW N
1 CHIEF OPERATING OFFICER Y
2 CHIEF PROCESS OFFICER Y
3 REGIONAL PRESIDEN, AMERICAS Y
4 EXECUTIVE V P, GLOBAL DEVELOPMENT AND PRESIDEN... Y
PREVAILING_WAGE YEAR WORKSITE NEW_EMPLOYER \
0 36067.0 2016.0 ANN ARBOR, MICHIGAN university
1 138703.0 2016.0 PLANO, TEXAS non university
2 138703.0 2016.0 JERSEY CITY, NEW JERSEY non university
3 138703.0 2016.0 DENVER, COLORADO non university
4 138703.0 2016.0 MIAMI, FLORIDA non university
OCCUPATION state
0 Advance Sciences MICHIGAN
1 Management Occupation TEXAS
2 Management Occupation JERSEY
3 Management Occupation COLORADO
4 Management Occupation FLORIDA
Now, in order to calculate probabilities, you need to convert the target classes to binary, i.e. 0 and 1. You could keep the CERTIFIED and DENIED labels, but then you would not be able to calculate probabilities and the AUC score directly, so map them to 0 and 1.
from sklearn import preprocessing
class_mapping = {'CERTIFIED':0, 'DENIED':1}
df["CASE_STATUS"] = df["CASE_STATUS"].map(class_mapping)
print(df.head())
CASE_ID CASE_STATUS EMPLOYER_NAME \
0 1 0 university of michigan
1 2 0 goodman networks, inc.
2 3 0 ports america group, inc.
3 4 0 gates corporation, a wholly-owned subsidiary o...
4 6 0 burger king corporation
SOC_NAME \
0 biochemists and biophysicists
1 chief executives
2 chief executives
3 chief executives
4 chief executives
JOB_TITLE FULL_TIME_POSITION \
0 POSTDOCTORAL RESEARCH FELLOW N
1 CHIEF OPERATING OFFICER Y
2 CHIEF PROCESS OFFICER Y
3 REGIONAL PRESIDEN, AMERICAS Y
4 EXECUTIVE V P, GLOBAL DEVELOPMENT AND PRESIDEN... Y
PREVAILING_WAGE YEAR WORKSITE NEW_EMPLOYER \
0 36067.0 2016.0 ANN ARBOR, MICHIGAN university
1 138703.0 2016.0 PLANO, TEXAS non university
2 138703.0 2016.0 JERSEY CITY, NEW JERSEY non university
3 138703.0 2016.0 DENVER, COLORADO non university
4 138703.0 2016.0 MIAMI, FLORIDA non university
OCCUPATION state
0 Advance Sciences MICHIGAN
1 Management Occupation TEXAS
2 Management Occupation JERSEY
3 Management Occupation COLORADO
4 Management Occupation FLORIDA
For JOB_TITLE, you could apply a numerical transformation. There are 238,176 unique job titles in the dataset, as shown below.
test1 = pd.Series(df['JOB_TITLE'].ravel()).unique()
print(pd.DataFrame(test1))
0
0 POSTDOCTORAL RESEARCH FELLOW
1 CHIEF OPERATING OFFICER
2 CHIEF PROCESS OFFICER
3 REGIONAL PRESIDEN, AMERICAS
4 EXECUTIVE V P, GLOBAL DEVELOPMENT AND PRESIDEN...
5 PRESIDENT
6 VICE PRESIDENT AND CHIEF HUMAN RESOURCES OFFICER
7 TREASURER AND COO
8 CHIEF COMMERCIAL OFFICER
9 BOARD MEMBER
10 CEO
11 PRESIDENT, NORTHEAST REGION
12 CHIEF OPERATING OFFICER (COO)
13 GENERAL MANAGER, OPERATIONS
14 CHIEF EXECUTIVE OFFICER
15 CHIEF BUSINESS OFFICER
16 EXECUTIVE DIRECTOR
17 VICE PRESIDENT, BUSINESS DEVELOPMENT
18 HEAD OF US SALES
19 VICE PRESIDENT OF OPERATIONS
20 VICE PRESIDENT, FINANCE AND OPERATIONS ANALYSIS
21 ACCOUNT DIRECTOR
22 TECHNICAL DIRECTOR
23 SVP BUSINESS OPERATIONS AND DEVELOPMENT
24 CHIEF FINANCIAL OFFICER
25 CHIEF MEDICAL OFFICER
26 ASSISTANT VICE PRESIDENT, BUSINESS DEVELOPMENT
27 VICE PRESIDENT OF QUALITY
28 PRESIDENT & CHIEF EXECUTIVE OFFICER
29 CHIEF EXECUTIVE OFFICER AND RESEARCH SCIENTIST
... ...
238146 LOGISTIC AND SALES MANAGER
238147 MEDICAL DIRECTOR, HQ MEDICAL IMMUNOLOGY - EARL...
238148 PHYSICAL THERAPIST/FACILITY REHAB DIRECTOR
238149 BIZTEK DEVELOPER
238150 SALES ANALYST - RETAIL CHANNEL
238151 ASSISTANT WOMEN'S DESIGNER
238152 ASSAY DEVELOPMENT ASSOCIATE
238153 ADMINISTRATIVE SUPPORT
238154 PROGRAM MANAGEMENT ANALYST - SAP
238155 RESEARCH ASSOCIATE/ CRIMINAL JUSTICE
238156 SENIOR MOBILE APPLICATIONS DEVELOPER-CLIENT DE...
238157 ASSOCIATE, IT RISK & BUSINESS ANALYST (LEAD)
238158 INDUSTRY LEAD
238159 SYSTEMS AND DATA ANALYST - TECHNICAL PROJECT M...
238160 ADMINISTRATIVE ANALYST 1
238161 POWER DEVICE PRODUCT DEVELOPMENT ENGINEER
238162 SECURITIES ANALYST, EQUITIES DERIVATIVES
238163 CONSULTANT, SYSTEMS ANALYST
238164 ENGLISH LANGUAGE EDUCATION TEACHER
238165 STAFF SYSTEMS ENGINEER (SOFTWARE)
238166 SR. SOFTWARE ENGINEER
238167 COMMUNICATIONS SPECIALIST (PARTNER DEVELOPMENT)
238168 MANAGER, SITE MERCHANDISING - HEALTH & SPORTS
238169 DIRECTOR EDUCATIONAL PROGRAMS
238170 REGISTERED NURSE (ER)
238171 FOOT AND ANKLE SURGEON
238172 DFT MODELING ENGINEER
238173 SENIOR SOFTWARE ENGINEER (LTE PROTOCOL STACK)
238174 SALES AND LOGISTIC MANAGER
238175 APPRAISERS, REAL ESTATE
[238176 rows x 1 columns]
Since you have now generated new features from the variables below, you can drop the originals. The drop runs horizontally across columns, so axis=1.
# dropping these columns
df = df.drop('EMPLOYER_NAME', axis = 1)
df = df.drop('SOC_NAME', axis = 1)
df = df.drop('JOB_TITLE', axis = 1)
df = df.drop('WORKSITE', axis = 1)
df = df.drop('CASE_ID', axis = 1)
df1 = df.copy()
Changing dtype to category: Before moving on to the modeling part, you should check the data types of the variables. A few of them should be treated as categories or factors, but they are currently stored as object strings.
So, you have to change the data type of these variables from object to category, as these are categorical features.
df1[['CASE_STATUS', 'FULL_TIME_POSITION', 'YEAR','NEW_EMPLOYER','OCCUPATION','state']] = df1[['CASE_STATUS', 'FULL_TIME_POSITION', 'YEAR','NEW_EMPLOYER','OCCUPATION','state']].apply(lambda x: x.astype('category'))
df1.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2184631 entries, 0 to 2251800
Data columns (total 7 columns):
CASE_STATUS category
FULL_TIME_POSITION category
PREVAILING_WAGE float64
YEAR category
NEW_EMPLOYER category
OCCUPATION category
state category
dtypes: category(6), float64(1)
memory usage: 45.8 MB
Splitting Data in Training and Test Sets
It's standard practice to split the dataset into a training set and a test set: you fit and train the model on the training set, and then predict and check its accuracy on the held-out test set.
X = df.drop('CASE_STATUS', axis=1)
y = df.CASE_STATUS
seed = 7
test_size = 0.40
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=seed)
X_train.columns
Index(['FULL_TIME_POSITION', 'PREVAILING_WAGE', 'YEAR', 'NEW_EMPLOYER',
'OCCUPATION', 'state'],
dtype='object')
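Because the Denied class is so rare, you may also want to preserve the class ratio in both splits. A sketch of a stratified split with the same seed and test size as above:

## Stratified split keeps the Certified/Denied ratio identical in train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size,
                                                    random_state=seed, stratify=y)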
Check if there are any null values in the training set. There should not be any.
print(X_train.isnull().sum())
FULL_TIME_POSITION 0
PREVAILING_WAGE 0
YEAR 0
NEW_EMPLOYER 0
OCCUPATION 0
state 0
dtype: int64
Encode X_train and X_test to get them ready for XGBoost, as it only works on numeric data. The pd.get_dummies() function encodes the categorical values as indicator columns: it creates one column per category and puts a 1 wherever that value is present and a 0 where it is not. You should definitely print X_train_encode below to inspect the result.
X_train_encode = pd.get_dummies(X_train)
X_test_encode = pd.get_dummies(X_test)
y_train.head()
1688702 0
1808423 0
1550318 0
1495419 0
294782 0
Name: CASE_STATUS, dtype: int64
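Note that calling pd.get_dummies() on the train and test sets separately can produce mismatched columns if a category appears in only one of them. A defensive sketch that aligns the two encoded frames, filling any missing dummy columns with 0:

## Align the dummy columns of train and test, filling missing ones with 0
X_train_encode, X_test_encode = X_train_encode.align(X_test_encode, join='left',
                                                     axis=1, fill_value=0)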
XGBoost
XGBoost is short for “Extreme Gradient Boosting”. It is used here for a supervised learning problem, where you use training data (with multiple features) x(i) to predict a target variable y(i).
It is an implementation of gradient boosted decision trees designed for speed and performance.
Boosting is an ensemble method that seeks to create a strong classifier (model) based on “weak” classifiers. In this context, weak and strong refer to a measure of how correlated the learners are to the actual target variable. By adding models on top of each other iteratively, the errors of the previous model are corrected by the next predictor, until the training data is accurately predicted or reproduced by the model.
Now, gradient boosting also comprises an ensemble method that sequentially adds predictors and corrects previous models. However, instead of assigning different weights to the classifiers after every iteration, this method fits the new model to new residuals of the previous prediction and then minimizes the loss when adding the latest prediction.
So, in the end, you are updating your model using gradient descent and hence the name, gradient boosting.
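To make the residual-correction idea concrete, here is a minimal two-stage sketch using plain decision trees on synthetic data (the toy names such as X_toy are illustrative only and not part of the tutorial's dataset; XGBoost adds regularization, shrinkage and much more on top of this idea):

## A toy illustration of boosting: the second tree is fit to the residuals of the first
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X_toy = rng.uniform(0, 10, size=(200, 1))
y_toy = np.sin(X_toy).ravel() + rng.normal(0, 0.1, 200)

tree1 = DecisionTreeRegressor(max_depth=2).fit(X_toy, y_toy)
residuals = y_toy - tree1.predict(X_toy)          # errors of the first stage
tree2 = DecisionTreeRegressor(max_depth=2).fit(X_toy, residuals)

learning_rate = 0.5
boosted_pred = tree1.predict(X_toy) + learning_rate * tree2.predict(X_toy)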
To learn more about XGBoost, take a look at our Extreme Gradient Boosting course.
XGBoost can be installed easily using pip
pip install xgboost
To use the xgboost package, keep these things in mind:
- Convert categorical variables into numeric ones using one hot encoding
- For classification, if the dependent variable belongs to the class factor, convert it to numeric
With the categorical variables already one-hot encoded, .as_matrix() converts the DataFrame to a plain NumPy matrix, as shown below. (In recent pandas versions, .as_matrix() has been removed in favour of .values or .to_numpy().)
train_X = X_train_encode.as_matrix()
train_y = y_train.as_matrix()
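If you are on a recent pandas release where .as_matrix() is no longer available, .to_numpy() produces the same NumPy arrays:

## Equivalent conversion on newer pandas versions
train_X = X_train_encode.to_numpy()
train_y = y_train.to_numpy()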
The XGBoost model for classification is called XGBClassifier(). You can create it and fit it to your training dataset. The classifier below is created with max_features set to 'sqrt': the number of features to consider when looking for the best split is the square root of the total number of features, so if n_features is 100, max_features would be 10. (Strictly speaking, max_features comes from scikit-learn's gradient boosting API rather than being a native XGBoost parameter; XGBoost controls column sampling through colsample_bytree and colsample_bylevel.)
import xgboost
gbm=xgboost.XGBClassifier(max_features='sqrt', subsample=0.8, random_state=10)
Using GridSearchCV() to tune hyperparameters:

- GridSearchCV() implements a fit and a score method. It also implements predict, predict_proba, decision_function, transform and inverse_transform if they are implemented in the estimator used.
- The parameters of the estimator are optimized by cross-validated grid search over a parameter grid.
- You can use n_estimators values of 10 and 100, and learning_rate values of 0.1, 0.01 and 0.5, as is done below.
- n_estimators: the number of boosting stages to perform. Gradient boosting is fairly robust to over-fitting, so a large number usually results in better performance.
- learning_rate: a problem with gradient boosted decision trees is that they are quick to learn and can overfit the training data. One effective way to slow down learning is to apply a learning rate, also called shrinkage. Gradient boosting involves creating and adding trees to the model sequentially, with new trees created to correct the residual errors of the existing sequence, so the model can quickly fit and then overfit the training dataset. Weighting the corrections made by new trees slows this learning down; the weighting factor is called the shrinkage factor or the learning rate, depending on the literature or the tool. It is common to use small values in the range of 0.1 to 0.3, as well as values less than 0.1.
- Based on the gbm classifier's accuracy, the grid search will create 3 different folds and choose the best learning_rate and n_estimators.
from sklearn.model_selection import GridSearchCV
parameters = [{'n_estimators': [10, 100]},
{'learning_rate': [0.1, 0.01, 0.5]}]
grid_search = GridSearchCV(estimator = gbm, param_grid = parameters, scoring='accuracy', cv = 3, n_jobs=-1)
grid_search = grid_search.fit(train_X, train_y)
After fitting GridSearchCV() on the training set, it got a score of about 97% accuracy with a learning rate of 0.5:
warnings.filterwarnings("ignore")
grid_search.grid_scores_, grid_search.best_params_, grid_search.best_score_
([mean: 0.96764, std: 0.00000, params: {'n_estimators': 10},
mean: 0.96764, std: 0.00000, params: {'n_estimators': 100},
mean: 0.96764, std: 0.00000, params: {'learning_rate': 0.1},
mean: 0.96764, std: 0.00000, params: {'learning_rate': 0.01},
mean: 0.96765, std: 0.00006, params: {'learning_rate': 0.5}],
{'learning_rate': 0.5},
0.96764669532140457)
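Note that grid_scores_ has been removed in newer scikit-learn releases; the same information lives in cv_results_. A sketch for recent versions:

## Inspecting the grid search results on newer scikit-learn versions
results = pd.DataFrame(grid_search.cv_results_)
print(results[['params', 'mean_test_score', 'std_test_score']])
print(grid_search.best_params_, grid_search.best_score_)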
The code below gives you the best estimator for your model. You should try GridSearchCV() several times with different values of learning_rate and n_estimators; you will get accuracy scores for the different hyperparameter combinations and can then choose the best one.
grid_search.best_estimator_
XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
colsample_bytree=1, gamma=0, learning_rate=0.5, max_delta_step=0,
max_depth=3, max_features='sqrt', min_child_weight=1, missing=None,
n_estimators=100, n_jobs=1, nthread=None,
objective='binary:logistic', random_state=10, reg_alpha=0,
reg_lambda=1, scale_pos_weight=1, seed=None, silent=True,
subsample=0.8)
Now fit the final model using this best estimator.
gbm=xgboost.XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
colsample_bytree=1, gamma=0, learning_rate=0.5, max_delta_step=0,
max_depth=3, max_features='sqrt', min_child_weight=1, missing=None,
n_estimators=100, n_jobs=1, nthread=None,
objective='binary:logistic', random_state=10, reg_alpha=0,
reg_lambda=1, scale_pos_weight=1, seed=None, silent=True,
subsample=0.8).fit(train_X, train_y)
Perform predictions on the test set with gbm.predict.
y_pred = gbm.predict(X_test_encode.as_matrix())
You can now print the confusion matrix and classification report.
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
[[845548 118]
[ 28057 130]]
precision recall f1-score support
0 0.97 1.00 0.98 845666
1 0.52 0.00 0.01 28187
avg / total 0.95 0.97 0.95 873853
The overall accuracy of the model is about 96.8%, with Certified having a precision of 97% and Denied a precision of 52%. Keep in mind that this is a highly imbalanced dataset, with the minority class making up only about 3% of the samples, so the result is heavily biased towards the majority class, CERTIFIED. There are many techniques to deal with this, such as undersampling or oversampling; one you can try is SMOTE (Synthetic Minority Over-sampling Technique).

You can now look at the AUC score and ROC curve for your predictions. Because only about 3% of all applications were Denied, the model is strongly biased towards the majority Certified class, and the ROC curve shows that only about 50% of the area lies under it.
from sklearn.metrics import roc_auc_score
roc_auc_score(y_test, y_pred)
0.50223626010451605
from sklearn import metrics
import matplotlib.pyplot as plt
fpr_xg, tpr_xg, thresholds = metrics.roc_curve(y_test, y_pred)
print(metrics.auc(fpr_xg, tpr_xg))
auc_xgb = np.trapz(tpr_xg,fpr_xg)
plt.plot(fpr_xg,tpr_xg,label=" auc="+str(auc_xgb))
plt.legend(loc=4)
plt.show()
0.502236260105
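The AUC above is computed from hard 0/1 predictions, which is part of why it barely beats 0.5. Scoring with the predicted probability of the Denied class usually gives a more informative ROC curve; a sketch:

## Use predicted probabilities of the positive (Denied) class for the ROC/AUC
y_proba = gbm.predict_proba(X_test_encode.as_matrix())[:, 1]
print(roc_auc_score(y_test, y_proba))
fpr_p, tpr_p, _ = metrics.roc_curve(y_test, y_proba)
plt.plot(fpr_p, tpr_p, label="auc=" + str(metrics.auc(fpr_p, tpr_p)))
plt.legend(loc=4)
plt.show()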
Finally, you can save the model to disk so you don't have to train it again next time. This is done using the pickle module in Python.
import pickle
"""Saving the Model"""
XGB_Model_h1b = 'XGB_Model_h1b.sav'
pickle.dump(gbm, open(XGB_Model_h1b, 'wb'))
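Loading the model back later is the mirror image of saving it; a quick sketch:

## Load the saved model from disk and reuse it for predictions
loaded_model = pickle.load(open(XGB_Model_h1b, 'rb'))
print(loaded_model.predict(X_test_encode.as_matrix())[:10])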
Conclusion
The most important part of building a model is the feature engineering and selection process. You should extract the maximum amount of information from the features to make your model more robust and accurate. Skill in feature selection and extraction comes with time and experience, and there can be several ways to deal with the information available in a dataset.
There are a lot of different ways to make your model learn, and the learning algorithm you choose should give you the best results. You can also try different learning algorithms and ensemble them to make your model more robust. A/B testing could be done in production so that you know which model performs better. Go ahead, hit the code, and try different methods. Happy coding!