Predicting the Status of H-1B Visa Applications

Learn how you can predict the status of an H-1B visa application with Machine Learning in Python.

Machine Learning is an application of data science and artificial intelligence that gives systems the ability to automatically learn and improve from experience without being explicitly programmed. It uses algorithms that can access data and use it to learn for themselves. In this tutorial, you'll use Python and XGBoost to predict the final case status of a visa application.

Loading the Libraries

A Python library is a collection of functions and methods that lets you perform many actions without writing the code yourself. You need to download, install and import the libraries accordingly.

NumPy, under its alias np, is the fundamental package for scientific computing with Python. It contains, among other things, a powerful N-dimensional array object, sophisticated (broadcasting) functions, tools for integrating C/C++, and useful linear algebra and random number capabilities. The next important library, pandas, under its alias pd, is used for importing csv files. It is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language. Next, the scikit-learn (sklearn) library provides the machine learning algorithms. The standard-library statistics module supplies statistical functions such as mode(), re is used for regular expressions, and xgboost provides the XGBoost classifier, XGBClassifier().

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report
from statistics import mode
import re
from xgboost import XGBClassifier

H1B VISA and Dataset

This dataset is available on Kaggle.

It contains five years' worth of H-1B petition data, with approximately 3 million records overall. The columns in the dataset include case status, employer name, worksite coordinates, job title, prevailing wage, occupation code, and year filed.

The Office of Foreign Labor Certification (OFLC) generates program data, including data about H-1B visas. The disclosure data is updated annually and is available online. The H-1B visa is a highly sought-after non-immigrant visa that permits foreign workers in specialty occupations to enter the country. H-1B visas are a category of employment-based, non-immigrant visas for temporary foreign workers in the United States. For a foreign national to apply for an H-1B visa, a US employer must offer them a job and submit an H-1B petition to the US immigration department. This is also the most common visa status applied for and held by international students once they complete college or higher education and begin working in a full-time position.

The first step of the H-1B application process is for the US employer to file the H-1B petition on behalf of the foreign worker. In the second step, the prevailing and actual wages are confirmed by the State Employment Security Agency. If the prevailing wage exceeds the offer made by the prospective employer, a wage determination will be sought. The third step of the application process is to file the Labor Condition Application. The next step is to prepare the petition and file it at the proper USCIS office. Processing times for H-1B petitions vary from location to location; if you would like your petition expedited, you may opt for premium processing. The final step is to check the status of your H-1B petition by entering your receipt number. Once USCIS has your application on file, they will update your status in their system.

The dataset consists of more than 2 million data points. The data labels are divided into 7 classes: CERTIFIED, CERTIFIED-WITHDRAWN, DENIED, WITHDRAWN, REJECTED, INVALIDATED, and PENDING QUALITY AND COMPLIANCE REVIEW – UNASSIGNED. You should have the following information for each sample:

  1. CASE_ID – A unique case ID for every applicant.
  2. CASE_STATUS – The target variable, with 7 different classes or values. The status is associated with the last significant event.
  3. EMPLOYER_NAME – Name of the employer submitting the application.
  4. SOC_NAME – Occupational name associated with an occupational code.
  5. JOB_TITLE – Title of the job.
  6. FULL_TIME_POSITION – Whether the position is full time or not.
  7. PREVAILING_WAGE – The average wage paid to similarly employed workers in the requested occupation in the area of intended employment.
  8. YEAR – Year in which the H-1B petition was filed.
  9. WORKSITE – City and state of the foreign worker's intended area of employment.
  10. lon – Longitude of the worksite.
  11. lat – Latitude of the worksite.

Loading the Dataset

The first step is to load the dataset into an object. .read_csv() is the pandas method for loading a csv file. Below, the dataset is loaded into an object named df.

df = pd.read_csv('C:/Users/djhanj/Downloads/h1b_TRAIN.csv')

Understanding the Data

When you have the data available, it is always advisable to explore it and gather all the information it holds. You need to be sure that the data is loaded in a properly structured format and that all the variables or features are loaded correctly.

You might want to use the .info() method on the df object to check the data information. It shows that the data is stored in a DataFrame, with 1 variable in integer format, 4 variables as floats and 6 variables in object format.

Next, try the .head() method on the object to check the first 5 rows of the data and get a fair idea of the dataset.

The .describe() method will tell you the count, mean, standard deviation, min, quartiles (including the median) and max of all the integer and float variables.
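These three calls can be sketched on a tiny illustrative frame (the column names mirror the dataset; the values are made up):

```python
import pandas as pd

df = pd.DataFrame({'CASE_ID': [1, 2, 3],
                   'CASE_STATUS': ['CERTIFIED', 'DENIED', 'CERTIFIED'],
                   'PREVAILING_WAGE': [60000.0, 72000.0, 55000.0]})

df.info()               # dtypes and non-null counts per column
print(df.head())        # first rows of the data
stats = df.describe()   # count, mean, std, min, quartiles, max for numeric columns
print(stats.loc['mean', 'PREVAILING_WAGE'])
```

On the real data you would call the same three methods directly on the df loaded from the csv.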

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2251844 entries, 0 to 2251843
Data columns (total 11 columns):
CASE_ID               int64
CASE_STATUS           object
EMPLOYER_NAME         object
SOC_NAME              object
JOB_TITLE             object
PREVAILING_WAGE       float64
YEAR                  float64
WORKSITE              object
lon                   float64
lat                   float64
dtypes: float64(4), int64(1), object(6)
memory usage: 189.0+ MB
        CASE_ID  PREVAILING_WAGE  YEAR  lon  lat
count 2.251844e+06 2.251779e+06 2.251836e+06 2.171398e+06 2.171398e+06
mean 1.501788e+06 1.498183e+05 2.013854e+03 -9.213684e+01 3.815967e+01
std 8.666892e+05 5.808395e+06 1.680675e+00 1.963492e+01 4.671896e+00
min 1.000000e+00 0.000000e+00 2.011000e+03 -1.578583e+02 1.343719e+01
25% 7.512768e+05 5.439200e+04 2.012000e+03 -1.119261e+02 3.416536e+01
50% 1.502439e+06 6.502100e+04 2.014000e+03 -8.623793e+01 3.910312e+01
75% 2.252488e+06 8.143200e+04 2.015000e+03 -7.551381e+01 4.088399e+01
max 3.002457e+06 6.997607e+09 2.016000e+03 1.457298e+02 6.483778e+01

This dataset has 11 columns, one of which is the target variable, in this case CASE_STATUS. So, the data has 1 target variable and 10 independent or explanatory variables. You should definitely check the classes of the target variable: use the .unique() method on the CASE_STATUS column of df to list them.

This is a classification problem: you need to predict the class of the case status.
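Listing the target classes can be sketched on a toy frame (the column name matches the dataset; the rows are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'CASE_STATUS': ['CERTIFIED', 'DENIED', 'CERTIFIED',
                                   'WITHDRAWN', 'CERTIFIED-WITHDRAWN']})

# Distinct classes of the target variable, in order of first appearance
print(df['CASE_STATUS'].unique())
# ['CERTIFIED' 'DENIED' 'WITHDRAWN' 'CERTIFIED-WITHDRAWN']
```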


Feature engineering and data preprocessing

Note that the target variable contains 7 different classes:

  1. Certified
  2. Certified Withdrawn
  3. Withdrawn
  4. Rejected
  5. Invalidated
  6. Pending Quality and Compliance Review – Unassigned
  7. Denied

Now, depending on the business problem, you should decide whether to go for multi-class or binary classification. For binary classification, you would classify each application as either Certified or Denied, so the first thing to do is to convert the remaining classes into one of those two. Here, Rejected and Invalidated are both cases of a denied visa, so those case statuses should be converted to Denied. Pending Quality and Compliance Review is most likely to end in rejection, so it should be changed to Denied as well. Certified Withdrawn counts as Certified, because the visa was certified before the applicant decided to withdraw the file. In the section below, classes have been converted to either Certified or Denied.
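The conversion described above can be sketched like this (a minimal example on a toy frame; the column and status names match the dataset, everything else is illustrative):

```python
import pandas as pd

df = pd.DataFrame({'CASE_STATUS': [
    'CERTIFIED', 'CERTIFIED-WITHDRAWN', 'REJECTED', 'INVALIDATED',
    'PENDING QUALITY AND COMPLIANCE REVIEW - UNASSIGNED', 'DENIED']})

# Certified Withdrawn counts as Certified
df.loc[df.CASE_STATUS == 'CERTIFIED-WITHDRAWN', 'CASE_STATUS'] = 'CERTIFIED'

# Rejected, Invalidated and Pending all collapse into Denied
df.loc[df.CASE_STATUS.isin(
    ['REJECTED', 'INVALIDATED',
     'PENDING QUALITY AND COMPLIANCE REVIEW - UNASSIGNED']),
    'CASE_STATUS'] = 'DENIED'

print(df.CASE_STATUS.unique())   # ['CERTIFIED' 'DENIED']
```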

The Withdrawn class is hard to predict from this dataset, so you can simply remove it, which means all rows belonging to the Withdrawn class will be removed from the dataset. Another reason to remove it is that it makes up less than 1% of the entire dataset, so the model would probably not have classified it accurately anyway.

This is an imbalanced dataset with multiple target classes, and the main objective is to find the likelihood of being eligible for an H-1B visa, so you should keep only the Certified and Denied cases. The target variable will then have exactly 2 classes: Certified and Denied.


Drop the rows with a Withdrawn status and the rows where the case status is null, then check the counts of the Certified and Denied classes in the dataset:

##Drop rows with withdrawn
df = df.drop(df[df.CASE_STATUS == 'WITHDRAWN'].index)

## Storing non null in df w.r.t. case status
df = df[df['CASE_STATUS'].notnull()]
print(df['CASE_STATUS'].value_counts())
CERTIFIED    2114025
DENIED         70606
Name: CASE_STATUS, dtype: int64

The Denied class is only 3.2% of the total dataset, which means you now have approximately 96.8% Certified cases. This shows that the dataset is highly imbalanced. One of the problems with imbalanced datasets is that the model is biased towards the class with the higher occurrence; in this case, the model would be biased towards Certified. Although there are different techniques for handling imbalanced classification problems, you won't need them in this tutorial.
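The class shares can be computed with value_counts(normalize=True); here is a sketch on a toy stand-in for df['CASE_STATUS'] (the counts are illustrative, chosen to mirror the 3.2% figure):

```python
import pandas as pd

# Toy status column standing in for df['CASE_STATUS'] after cleaning
status = pd.Series(['CERTIFIED'] * 968 + ['DENIED'] * 32)

# Share of each class, as percentages
shares = status.value_counts(normalize=True) * 100
print(round(shares['DENIED'], 1))   # 3.2
```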


Treating missing and NA values

This dataset is not clean and contains many missing values, so treating them is a crucial step. The simplest approach would be to remove the affected rows, but you should not do that, as you would lose information. Let's walk through how the missing values are treated, step by step:

##check count of NaN values per column
count_nan = len(df) - df.count()
print(count_nan)
CASE_ID                   0
CASE_STATUS               0
EMPLOYER_NAME            11
SOC_NAME              12725
JOB_TITLE                 6
YEAR                      0
WORKSITE                  0
lon                   77164
lat                   77164
dtype: int64

Looking at the number of missing values, EMPLOYER_NAME has only 11, so you can fill those with the mode, which is the most frequently occurring value. This is done below.

## Filling na in employer name with mode
df['EMPLOYER_NAME'] = df['EMPLOYER_NAME'].fillna(df['EMPLOYER_NAME'].mode()[0])

The assert statement is used below. When Python encounters an assert statement, it evaluates the accompanying expression, which is hopefully true. If the expression is false, Python raises an AssertionError exception, using the assert's optional second argument as the error message.

assert pd.notnull(df['EMPLOYER_NAME']).all().all()

The next feature is PREVAILING_WAGE. You will find that most of the petitions have a wage in the range of 40k to 80k US dollars. There are certain petitions with wages above $500k, and some with 0 dollars. Since there are very few such cases, they should be treated as outliers by capping them at the 2nd and 98th percentiles.

##to check the 2nd and 98th percentiles of the wages
print(np.nanpercentile(df.PREVAILING_WAGE, 2))   # ~34029
print(np.nanpercentile(df.PREVAILING_WAGE, 98))  # ~138703
After capping the wages at the 2nd and 98th percentiles, the mean and median are very close: the median is about 65k and the mean about 68k. Finally, you can replace the NA values with the mean; the median would work equally well here, since the two values are so similar.

## replacing min and max with 2 and 98 percentile
df.loc[df.PREVAILING_WAGE < 34029, 'PREVAILING_WAGE']= 34029
df.loc[df['PREVAILING_WAGE'] > 138703, 'PREVAILING_WAGE']= 138703
df.PREVAILING_WAGE.fillna(df.PREVAILING_WAGE.mean(), inplace = True)

For the JOB_TITLE, FULL_TIME_POSITION and SOC_NAME columns, you can fill the missing values with the mode (the most frequently occurring value). There is no need for more sophisticated imputation techniques here, as the percentage of missing values is very low. This is done below.

## Filling na in JOB_TITLE, FULL_TIME_POSITION and SOC_NAME with the mode
df['JOB_TITLE'] = df['JOB_TITLE'].fillna(df['JOB_TITLE'].mode()[0])
df['FULL_TIME_POSITION'] = df['FULL_TIME_POSITION'].fillna(df['FULL_TIME_POSITION'].mode()[0])
df['SOC_NAME'] = df['SOC_NAME'].fillna(df['SOC_NAME'].mode()[0])

The next feature is FULL_TIME_POSITION: Y indicates a full-time role and N indicates a part-time role. Around 85% of the jobs applied for are full-time jobs. Below are two different ways to count the same thing.

foo1 = df['FULL_TIME_POSITION']=='Y'
foo2 = df['CASE_STATUS']=='CERTIFIED'

fooy = df.FULL_TIME_POSITION[df['FULL_TIME_POSITION']=='Y'].count()
foox = df.CASE_STATUS[df['CASE_STATUS']=='CERTIFIED'].count()

Dropping lat and lon Columns

While building the model you can drop the lat and lon columns for now, as their information will be captured by the WORKSITE column. Use the drop method on the df DataFrame with the column name, and note that you need to specify the axis as well: axis 0 stands for rows and axis 1 stands for columns.

# Dropping lat and lon columns
df = df.drop('lat', axis = 1)
df = df.drop('lon', axis = 1)

Feature Creation

Now, it is not possible to use each and every column as-is when building a model, but at the same time these columns hold information that you should extract.

EMPLOYER_NAME contains the names of the employers, and there are a lot of unique ones. The employer is the company that submits the application for its employee. You cannot use EMPLOYER_NAME directly in the model because it has too many unique string values or categories: more than 500 employers. These employers act as factors or levels, and it is not advisable to use that many factors in a single column.

The top 5 companies submitting the application for their employees are Infosys, TCS, Wipro, Deloitte and IBM. However, if any University is submitting an application then it is more likely to be accepted.

So, the question is, how can you extract some information from this feature?

Well, you can probably create a new feature called NEW_EMPLOYER based on whether the employer name contains the string 'University': if a US university files a visa petition, it has a higher chance of approval for the employee.

So, if the EMPLOYER_NAME contains 'University', then NEW_EMPLOYER contains the university value.

It is pretty easy to create a new empty column, as below.

df['NEW_EMPLOYER'] = np.nan
print(df.shape)
(2184631, 10)

Note that there is a catch when checking for the University string in EMPLOYER_NAME.

What if the University string is in uppercase in some rows and lowercase in others? If you match the lowercase 'university', you miss the uppercase 'UNIVERSITY', and vice versa. So, to map and check the University string correctly, you should first convert all the strings to the same case, either lowercase or uppercase.

In the code below, all the values of EMPLOYER_NAME are converted to lowercase using str.lower().

Now, all the values in EMPLOYER_NAME are in the same case, and it will be very easy to find the rows containing University as an Employer name.

All the strings in EMPLOYER_NAME containing the keyword university will have 'university' as value in the NEW_EMPLOYER column. All the remaining empty rows will be filled with 'non university'.


df['EMPLOYER_NAME'] = df['EMPLOYER_NAME'].str.lower()
df.loc[df['EMPLOYER_NAME'].str.contains('university'), 'NEW_EMPLOYER'] = 'university'
df['NEW_EMPLOYER'] = df.NEW_EMPLOYER.replace(np.nan, 'non university', regex=True)

Now, the next variable is SOC_NAME, which contains the occupation name. There are a lot of distinct values in SOC_NAME, so you might want to create a new feature, OCCUPATION, that holds the applicant's broad occupation by mapping it from the SOC_NAME value. For example, computer, programmer and software are all computer occupations. This will cover the top 80% of the occupations; minor and remaining occupations are categorized as Others.

# Creating OCCUPATION and mapping the values
# (str.contains takes a single regex pattern, so alternative keywords are
#  joined with '|'; patterns are lowercase to match the lowered SOC_NAME)

df['OCCUPATION'] = np.nan
df['SOC_NAME'] = df['SOC_NAME'].str.lower()
df.loc[df['SOC_NAME'].str.contains('computer|programmer|software|web developer|database'), 'OCCUPATION'] = 'computer occupations'
df.loc[df['SOC_NAME'].str.contains('math|statistic|predictive model|stats'), 'OCCUPATION'] = 'Mathematical Occupations'
df.loc[df['SOC_NAME'].str.contains('teacher|linguist|professor|teach|school principal'), 'OCCUPATION'] = 'Education Occupations'
df.loc[df['SOC_NAME'].str.contains('medical|doctor|physician|dentist|health|physical therapist|surgeon|nurse|psychiatr'), 'OCCUPATION'] = 'Medical Occupations'
df.loc[df['SOC_NAME'].str.contains('chemist|physicist|biology|biologi|scientist|clinical research'), 'OCCUPATION'] = 'Advance Sciences'
df.loc[df['SOC_NAME'].str.contains('public relation|manage|management|operation|chief|plan|executive'), 'OCCUPATION'] = 'Management Occupation'
df.loc[df['SOC_NAME'].str.contains('advertis|marketing|promotion|market research'), 'OCCUPATION'] = 'Marketing Occupation'
df.loc[df['SOC_NAME'].str.contains('business|business analyst|business systems analyst'), 'OCCUPATION'] = 'Business Occupation'
df.loc[df['SOC_NAME'].str.contains('accountant|finance|financial'), 'OCCUPATION'] = 'Financial Occupation'
df.loc[df['SOC_NAME'].str.contains('engineer|architect|surveyor|carto|technician|drafter|information security|information tech'), 'OCCUPATION'] = 'Architecture & Engineering'
df['OCCUPATION'] = df.OCCUPATION.replace(np.nan, 'Others', regex=True)

Since visa applications depend heavily on the worksite's state, you should split the state information out of the WORKSITE variable.

## Splitting city and state and capturing the state in another variable
## (split on the comma so two-word states stay intact; splitting on
##  whitespace and keeping the last token, as in the sample output below,
##  would truncate NEW JERSEY to JERSEY and merge NORTH/SOUTH CAROLINA)
df['state'] = df.WORKSITE.str.split(',').str[-1].str.strip()

California has the highest number of petitions, confirming its place as the base of choice for IT operations, followed by Texas, New York and New Jersey.

   CASE_ID CASE_STATUS                                      EMPLOYER_NAME  \
0        1   CERTIFIED                             university of michigan   
1        2   CERTIFIED                             goodman networks, inc.   
2        3   CERTIFIED                          ports america group, inc.   
3        4   CERTIFIED  gates corporation, a wholly-owned subsidiary o...   
4        6   CERTIFIED                            burger king corporation   

                        SOC_NAME  \
0  biochemists and biophysicists   
1               chief executives   
2               chief executives   
3               chief executives   
4               chief executives   

                                           JOB_TITLE FULL_TIME_POSITION  \
0                       POSTDOCTORAL RESEARCH FELLOW                  N   
1                            CHIEF OPERATING OFFICER                  Y   
2                              CHIEF PROCESS OFFICER                  Y   
3                        REGIONAL PRESIDEN, AMERICAS                  Y   

   PREVAILING_WAGE    YEAR                 WORKSITE    NEW_EMPLOYER  \
0          36067.0  2016.0      ANN ARBOR, MICHIGAN      university   
1         138703.0  2016.0             PLANO, TEXAS  non university   
2         138703.0  2016.0  JERSEY CITY, NEW JERSEY  non university   
3         138703.0  2016.0         DENVER, COLORADO  non university   
4         138703.0  2016.0           MIAMI, FLORIDA  non university   

              OCCUPATION     state  
0       Advance Sciences  MICHIGAN  
1  Management Occupation     TEXAS  
2  Management Occupation    JERSEY  
3  Management Occupation  COLORADO  
4  Management Occupation   FLORIDA  

Now, in order to calculate probabilities, you need to convert the target classes to binary, i.e. 0 and 1. You could keep the CERTIFIED and DENIED strings, but then you wouldn't be able to calculate probabilities and the AUC score, so map them to 0 and 1.

from sklearn import preprocessing
class_mapping = {'CERTIFIED':0, 'DENIED':1}
df["CASE_STATUS"] = df["CASE_STATUS"].map(class_mapping)
   CASE_ID  CASE_STATUS                                      EMPLOYER_NAME  \
0        1            0                             university of michigan   
1        2            0                             goodman networks, inc.   
2        3            0                          ports america group, inc.   
3        4            0  gates corporation, a wholly-owned subsidiary o...   
4        6            0                            burger king corporation   

                        SOC_NAME  \
0  biochemists and biophysicists   
1               chief executives   
2               chief executives   
3               chief executives   
4               chief executives   

                                           JOB_TITLE FULL_TIME_POSITION  \
0                       POSTDOCTORAL RESEARCH FELLOW                  N   
1                            CHIEF OPERATING OFFICER                  Y   
2                              CHIEF PROCESS OFFICER                  Y   
3                        REGIONAL PRESIDEN, AMERICAS                  Y   

   PREVAILING_WAGE    YEAR                 WORKSITE    NEW_EMPLOYER  \
0          36067.0  2016.0      ANN ARBOR, MICHIGAN      university   
1         138703.0  2016.0             PLANO, TEXAS  non university   
2         138703.0  2016.0  JERSEY CITY, NEW JERSEY  non university   
3         138703.0  2016.0         DENVER, COLORADO  non university   
4         138703.0  2016.0           MIAMI, FLORIDA  non university   

              OCCUPATION     state  
0       Advance Sciences  MICHIGAN  
1  Management Occupation     TEXAS  
2  Management Occupation    JERSEY  
3  Management Occupation  COLORADO  
4  Management Occupation   FLORIDA  

For JOB_TITLE, you could do a numerical transformation.

There are 238176 unique job titles in the dataset, as shown below.

test1 = pd.Series(df['JOB_TITLE'].ravel()).unique()
0                            POSTDOCTORAL RESEARCH FELLOW
1                                 CHIEF OPERATING OFFICER
2                                   CHIEF PROCESS OFFICER
3                             REGIONAL PRESIDEN, AMERICAS
5                                               PRESIDENT
7                                       TREASURER AND COO
8                                CHIEF COMMERCIAL OFFICER
9                                            BOARD MEMBER
10                                                    CEO
11                            PRESIDENT, NORTHEAST REGION
12                          CHIEF OPERATING OFFICER (COO)
13                            GENERAL MANAGER, OPERATIONS
14                                CHIEF EXECUTIVE OFFICER
15                                 CHIEF BUSINESS OFFICER
16                                     EXECUTIVE DIRECTOR
18                                       HEAD OF US SALES
19                           VICE PRESIDENT OF OPERATIONS
21                                       ACCOUNT DIRECTOR
22                                     TECHNICAL DIRECTOR
24                                CHIEF FINANCIAL OFFICER
25                                  CHIEF MEDICAL OFFICER
27                              VICE PRESIDENT OF QUALITY
...                                                   ...
238146                         LOGISTIC AND SALES MANAGER
238149                                   BIZTEK DEVELOPER
238150                     SALES ANALYST - RETAIL CHANNEL
238151                         ASSISTANT WOMEN'S DESIGNER
238152                        ASSAY DEVELOPMENT ASSOCIATE
238153                             ADMINISTRATIVE SUPPORT
238154                   PROGRAM MANAGEMENT ANALYST - SAP
238158                                      INDUSTRY LEAD
238160                           ADMINISTRATIVE ANALYST 1
238163                        CONSULTANT, SYSTEMS ANALYST
238165                  STAFF SYSTEMS ENGINEER (SOFTWARE)
238166                             SR. SOFTWARE  ENGINEER
238169                     DIRECTOR  EDUCATIONAL PROGRAMS
238170                              REGISTERED NURSE (ER)
238171                             FOOT AND ANKLE SURGEON
238172                              DFT MODELING ENGINEER
238174                         SALES AND LOGISTIC MANAGER
238175                            APPRAISERS, REAL ESTATE

[238176 rows x 1 columns]

Since you have now generated new features from the variables below, you can drop the originals. Remember that dropping columns means operating along axis = 1.

# dropping these columns
df = df.drop('EMPLOYER_NAME', axis = 1)
df = df.drop('SOC_NAME', axis = 1)
df = df.drop('JOB_TITLE', axis = 1)
df = df.drop('WORKSITE', axis = 1)
df = df.drop('CASE_ID', axis = 1)
df1 = df.copy()

Changing dtype to category: Before moving on to the modeling part, you should definitely check the data types of the variables. Here, a few variables should be treated as categories or factors, but they are currently in object (string) format.

So, you have to change the data type of these variables from object to category, as these are categorical features.

df1[['CASE_STATUS', 'FULL_TIME_POSITION', 'YEAR','NEW_EMPLOYER','OCCUPATION','state']] = df1[['CASE_STATUS', 'FULL_TIME_POSITION', 'YEAR','NEW_EMPLOYER','OCCUPATION','state']].apply(lambda x: x.astype('category'))
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2184631 entries, 0 to 2251800
Data columns (total 7 columns):
CASE_STATUS           category
FULL_TIME_POSITION    category
PREVAILING_WAGE       float64
YEAR                  category
NEW_EMPLOYER          category
OCCUPATION            category
state                 category
dtypes: category(6), float64(1)
memory usage: 45.8 MB

Splitting Data in Training and Test Sets

It's a standard practice to split the dataset into a training and testing set. The reason behind this is that you should fit and train your model using the training set, and then finally predict and check your accuracy on the test set.

X = df.drop('CASE_STATUS', axis=1)
y = df['CASE_STATUS']

seed = 7
test_size = 0.40
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=seed)

Check if there are any null values in the training set. There should not be any.

X_train.isnull().sum()
YEAR                  0
NEW_EMPLOYER          0
OCCUPATION            0
state                 0
dtype: int64

Encode X_train and X_test to get them ready for XGBoost, as it only works on numeric data. The pd.get_dummies() function is used to encode the categorical values: it creates one indicator column per category and maps 1 wherever the value is present and 0 where it is not. You should definitely try printing X_train_encode below to check the result.

X_train_encode = pd.get_dummies(X_train)
X_test_encode = pd.get_dummies(X_test)
y_train.head()
1688702    0
1808423    0
1550318    0
1495419    0
294782     0
Name: CASE_STATUS, dtype: int64


XGBoost is short for “Extreme Gradient Boosting”, a technique for supervised learning problems, where you use the training data (with multiple features) x(i) to predict a target variable y(i).

It is an implementation of gradient boosted decision trees designed for speed and performance.

Boosting is an ensemble method that seeks to create a strong classifier (model) based on “weak” classifiers. In this context, weak and strong refer to a measure of how correlated the learners are to the actual target variable. By adding models on top of each other iteratively, the errors of the previous model are corrected by the next predictor, until the training data is accurately predicted or reproduced by the model.

Now, gradient boosting also comprises an ensemble method that sequentially adds predictors and corrects previous models. However, instead of assigning different weights to the classifiers after every iteration, this method fits the new model to new residuals of the previous prediction and then minimizes the loss when adding the latest prediction.

So, in the end, you are updating your model using gradient descent and hence the name, gradient boosting.
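The residual-fitting idea can be made concrete with a minimal from-scratch sketch of one gradient-boosting step for squared error (using scikit-learn's DecisionTreeRegressor as the weak learner; the data, tree depth and learning rate below are all illustrative choices, not XGBoost's actual internals):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = X[:, 0] ** 2 + rng.normal(scale=0.1, size=200)

# Stage 1: fit a shallow tree directly to the target
tree1 = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, y)
pred = tree1.predict(X)

# Stage 2: fit the next tree to the residuals (the negative gradient of
# squared error) and add it in, scaled by a learning rate (shrinkage)
learning_rate = 0.5
residuals = y - pred
tree2 = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, residuals)
pred_boosted = pred + learning_rate * tree2.predict(X)

mse_before = np.mean((y - pred) ** 2)
mse_after = np.mean((y - pred_boosted) ** 2)
print(mse_before > mse_after)   # the second tree reduces training error
```

XGBoost repeats this add-a-tree-to-the-residuals step many times, with extra regularization and a much faster implementation.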

To learn more about XGBoost, take a look at our Extreme Gradient Boosting course.

XGBoost can be installed easily using pip:

  • pip install xgboost

To use the xgboost package, keep these things in mind:

  • Convert categorical variables into numeric ones using one hot encoding
  • For classification, if the dependent variable belongs to the class factor, convert it to numeric
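As a small illustration of the first point, pd.get_dummies() turns each category into its own 0/1 column (the toy frame below is illustrative; the column names mirror the dataset):

```python
import pandas as pd

toy = pd.DataFrame({'FULL_TIME_POSITION': ['Y', 'N', 'Y'],
                    'PREVAILING_WAGE': [60000.0, 72000.0, 55000.0]})

# Numeric columns pass through unchanged; each category gets an indicator column
encoded = pd.get_dummies(toy)
print(list(encoded.columns))
# ['PREVAILING_WAGE', 'FULL_TIME_POSITION_N', 'FULL_TIME_POSITION_Y']
```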

With the encoding done, as_matrix() converts the DataFrames into plain NumPy matrices, which is the format the booster consumes. You can convert the datasets as shown below (note that as_matrix() has been deprecated and removed in recent pandas; .values or .to_numpy() do the same job).

train_X = X_train_encode.as_matrix()
train_y = y_train.as_matrix()

The XGBoost model for classification is called XGBClassifier(). You can create it and fit it to your training dataset. Below it is created with max_features set to 'sqrt': max_features is the number of features to consider when looking for the best split, so if n_features is 100, then max_features would be 10.

import xgboost
gbm=xgboost.XGBClassifier(max_features='sqrt', subsample=0.8, random_state=10)

Using GridSearchCV() to tune hyperparameters:

  1. GridSearchCV() implements a fit and a score method. It also implements predict, predict_proba, decision_function, transform and inverse_transform if they are implemented in the estimator used.
  2. The parameters of the estimator used to apply these methods are optimized by cross-validated grid-search over a parameter grid.
  3. You can use n_estimators as 1, 10 and 100, and learning_rate as 0.1, 0.01 and 0.5. The same is used below.
  4. n_estimators : The number of boosting stages to perform. Gradient boosting is fairly robust to over-fitting, so a large number usually results in better performance.
  5. learning_rate : A problem with gradient boosted decision trees is that they are quick to learn and overfit training data. One effective way to slow down learning in the gradient boosting model is to use a learning rate, also called shrinkage. Gradient boosting involves creating and adding trees to the model sequentially. New trees are created to correct the residual errors in the predictions from the existing sequence of trees. The effect is that the model can quickly fit, then overfit the training dataset. A technique to slow down the learning is to apply a weighting factor for the corrections by new trees when added to the model. This weighting is called the shrinkage factor or the learning rate, depending on the literature or the tool. It is common to have small values in the range of 0.1 to 0.3, as well as values less than 1.
  6. Based on the gbm classifier's accuracy, GridSearchCV() will run 3 cross-validation folds and choose the best learning_rate and n_estimators.

from sklearn.model_selection import GridSearchCV
parameters = [{'n_estimators': [10, 100]},
              {'learning_rate': [0.1, 0.01, 0.5]}]
grid_search = GridSearchCV(estimator = gbm, param_grid = parameters, scoring='accuracy', cv = 3, n_jobs=-1)
grid_search = grid_search.fit(train_X, train_y)
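The list-of-dicts grid above searches each parameter separately rather than jointly. To see the difference, here is a pure-Python sketch (no scikit-learn required) of how the two grid shapes expand into candidate settings:

```python
from itertools import product

def expand_grid(param_grid):
    """Expand a scikit-learn-style parameter grid into candidate settings.

    A list of dicts is searched dict-by-dict (parameters in different
    dicts are never combined); a single dict yields the full cross
    product of its values.
    """
    if isinstance(param_grid, dict):
        param_grid = [param_grid]
    candidates = []
    for grid in param_grid:
        keys = sorted(grid)
        for values in product(*(grid[k] for k in keys)):
            candidates.append(dict(zip(keys, values)))
    return candidates

separate = expand_grid([{'n_estimators': [10, 100]},
                        {'learning_rate': [0.1, 0.01, 0.5]}])
combined = expand_grid({'n_estimators': [10, 100],
                        'learning_rate': [0.1, 0.01, 0.5]})
print(len(separate))  # 5 candidates: 2 + 3
print(len(combined))  # 6 candidates: 2 * 3
```

With the separate dicts, learning_rate is never varied together with n_estimators; combine them in one dict if you want the full 2 × 3 search.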

After fitting GridSearchCV() on the training set, the best configuration scored about 97% accuracy with a learning rate of 0.5. (grid_scores_ is deprecated in newer versions of scikit-learn; use cv_results_ instead.)

grid_search.grid_scores_, grid_search.best_params_, grid_search.best_score_
([mean: 0.96764, std: 0.00000, params: {'n_estimators': 10},
  mean: 0.96764, std: 0.00000, params: {'n_estimators': 100},
  mean: 0.96764, std: 0.00000, params: {'learning_rate': 0.1},
  mean: 0.96764, std: 0.00000, params: {'learning_rate': 0.01},
  mean: 0.96765, std: 0.00006, params: {'learning_rate': 0.5}],
 {'learning_rate': 0.5},
 0.96765)

The grid_search.best_estimator_ attribute gives you the best estimator for your model, shown below. Try GridSearchCV() several times with different values of learning_rate and n_estimators; you will get accuracy scores for the different combinations of hyperparameters and can then choose the best one.

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=1, gamma=0, learning_rate=0.5, max_delta_step=0,
       max_depth=3, max_features='sqrt', min_child_weight=1, missing=None,
       n_estimators=100, n_jobs=1, nthread=None,
       objective='binary:logistic', random_state=10, reg_alpha=0,
       reg_lambda=1, scale_pos_weight=1, seed=None, silent=True,
       subsample=0.8)

Now you should use the same estimator to finally fit the model as this is the best estimator for the model.

gbm=xgboost.XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=1, gamma=0, learning_rate=0.5, max_delta_step=0,
       max_depth=3, max_features='sqrt', min_child_weight=1, missing=None,
       n_estimators=100, n_jobs=1, nthread=None,
       objective='binary:logistic', random_state=10, reg_alpha=0,
       reg_lambda=1, scale_pos_weight=1, seed=None, silent=True,
       subsample=0.8).fit(train_X, train_y)

Perform predictions on the test set with gbm.predict.

y_pred = gbm.predict(X_test_encode.as_matrix())

You can now print the confusion matrix and classification report.

print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
[[845548    118]
 [ 28057    130]]
             precision    recall  f1-score   support

          0       0.97      1.00      0.98    845666
          1       0.52      0.00      0.01     28187

avg / total       0.95      0.97      0.95    873853
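To see where the headline numbers come from, you can recompute the precision, recall and accuracy by hand from the confusion matrix above (pure Python, no scikit-learn needed):

```python
# Recomputing the report from the confusion matrix printed above.
# Rows are true labels, columns are predicted labels.
tn, fp = 845548, 118   # true class 0 (Certified)
fn, tp = 28057, 130    # true class 1 (Denied)

total = tn + fp + fn + tp
accuracy = (tn + tp) / total
precision_denied = tp / (tp + fp)  # of predicted Denied, how many were right
recall_denied = tp / (tp + fn)     # of actual Denied, how many were found

print(round(accuracy, 4))          # 0.9678
print(round(precision_denied, 2))  # 0.52
print(round(recall_denied, 4))     # 0.0046
```

The near-zero recall for Denied is the clearest symptom of the class imbalance discussed below: the model finds only 130 of the 28,187 denied applications.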

The overall accuracy of the model is about 96.8%, with Certified having a precision of 97% and Denied a precision of only 52%. Note that this is a highly imbalanced dataset: the minority class makes up just 3% of the samples, so the result is heavily biased towards the majority class, Certified. Many techniques are available to deal with this, such as undersampling and oversampling; one you can try is SMOTE (Synthetic Minority Over-sampling Technique).

You can now look at the AUC score and ROC curve for your predictions. Because the dataset is highly imbalanced, with only 3% of applications Denied, the model is heavily biased towards the majority class, Certified. The ROC graph shows that only about 50% of the area lies under the curve, meaning the model does little better than chance at separating the two classes.

from sklearn.metrics import roc_auc_score
roc_auc_score(y_test, y_pred)
from sklearn import metrics
import matplotlib.pyplot as plt

fpr_xg, tpr_xg, thresholds = metrics.roc_curve(y_test, y_pred)
print(metrics.auc(fpr_xg, tpr_xg))
auc_xgb = np.trapz(tpr_xg, fpr_xg)
plt.plot(fpr_xg, tpr_xg, label="auc=" + str(auc_xgb))
plt.legend()
plt.show()
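np.trapz integrates the ROC curve with the trapezoidal rule. The same computation can be sketched in plain Python to make clear what the AUC number actually measures:

```python
def trapezoid_auc(fpr, tpr):
    """Area under the ROC curve via the trapezoidal rule.

    fpr and tpr are parallel lists of points along the curve,
    ordered by increasing false-positive rate.
    """
    area = 0.0
    for i in range(1, len(fpr)):
        width = fpr[i] - fpr[i - 1]
        avg_height = (tpr[i] + tpr[i - 1]) / 2
        area += width * avg_height
    return area

# A perfect classifier: TPR jumps to 1 before any false positives appear.
print(trapezoid_auc([0.0, 0.0, 1.0], [0.0, 1.0, 1.0]))  # 1.0
# The chance-level diagonal, roughly where this model sits.
print(trapezoid_auc([0.0, 1.0], [0.0, 1.0]))  # 0.5
```

An AUC of 0.5 corresponds to the diagonal line in the ROC plot: the classifier ranks a random Denied case above a random Certified case no more often than a coin flip.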

Finally, you can save the model on your disk so you don't have to run it again next time. This is done using the Pickle module in Python.

import pickle

"""Saving the Model"""
XGB_Model_h1b = 'XGB_Model_h1b.sav'
pickle.dump(gbm, open(XGB_Model_h1b, 'wb'))
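Loading the model back later is the mirror image of saving it. A minimal roundtrip sketch using a plain Python object as a stand-in for the fitted model (the same pickle calls work for gbm):

```python
import os
import pickle
import tempfile

# Stand-in for the trained model: any picklable Python object works.
model = {"name": "XGB_Model_h1b", "learning_rate": 0.5}

path = os.path.join(tempfile.gettempdir(), "XGB_Model_h1b.sav")
with open(path, "wb") as f:   # save: "wb" = write binary
    pickle.dump(model, f)
with open(path, "rb") as f:   # load it back: "rb" = read binary
    restored = pickle.load(f)

print(restored == model)  # True
os.remove(path)
```

In a later session you would open the .sav file with pickle.load and call .predict() on the restored classifier without retraining.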


The most important part of building a model is the feature engineering and selection process. You should extract as much information as possible from the features to make your model robust and accurate. Skill at feature selection and extraction comes with time and experience, and there can be several ways to deal with the information available in a dataset.

There are lots of different ways to make your model learn, and the learning algorithm should give you the best results. You can also try different learning algorithms and then ensemble them to make your model more robust. A/B testing could also be done in production so that you know which model performs better. Go ahead, try the code and experiment with different methods. Happy coding!