
Exploring H-1B Data with R: Part 3

Learn how to geocode locations and map them with R

DataCamp's blog "Can Data Help Your H-1B Visa Application" presented you with some of the results of an analysis of H-1B data over the years. Now, it's time to get started yourself! Ted Kwartler will guide you through this with a series of R tutorials. 

We’re back with our third post investigating H-1B data. If you want to learn how to programmatically collect and perform exploratory data analysis (EDA), check out the first post here.

In this final installment, you will learn how to geocode locations into latitude and longitude coordinates using an API. Then you will create a map of the data. Next, you will identify the top H-1B software developer employers and compare their salaries by year.

I hope these views help you see where to apply, how much to negotiate for, and which regions of the country combine high application volume with good salaries for software developers.

I chose software developers because these are among the most common job titles in our data set. Plus, I have a few friends in need of sponsorship who are technically minded. Hopefully this code and the visuals can help them hone their employment efforts!

Get the Data

You could review the original H-1B posts here to web scrape the data like I did. If you do not have the patience, I suggest heading to my GitHub repo to get the CSV file.

The first thing to do is set up your R session with the correct packages. The bit64 package lets data.table work efficiently on vectors of 64-bit integers.

The data.table package is once again used since it is so efficient. Next, lubridate is loaded in case you want to perform date-related operations. After that, pbapply gives your R session progress bars when using apply functions. The jsonlite package is used to parse the geocoded API responses from Bing Maps. Lastly, plyr is a great package for manipulating data.

The next two lines set global options: the first gets rid of scientific notation and the second increases the number of decimal places printed to 15. You often need many decimal places when working with latitude and longitude. Using set.seed() ensures your visuals will look similar to this post's by selecting a specific seed instead of a random one.

library(bit64)
library(data.table)
library(lubridate)
library(pbapply)
library(jsonlite)
library(plyr)
options(scipen=999, digits=15)
set.seed(1234)

The next step is to read in the CSV data from the repository. The fread() function is the "fast and friendly file finagler", which quickly reads in the data while providing a progress update. In my experience, it is the best way to import large data frames.

The next lines of code apply ymd() to the date-related vectors.

If you followed along from the previous posts, the dates would already be in the right format, but reading the CSV back in with fread() means you need to correct the vector types again.

h1b.data<-fread('h1b_data_clean.csv')
h1b.data$start_date<-ymd(h1b.data$start_date)
h1b.data$submit_date<-ymd(h1b.data$submit_date)
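
If you want to confirm the conversion, a quick optional check: ymd() returns Date objects, so the class of each vector should now be "Date".

class(h1b.data$start_date)
# [1] "Date"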

Now that your data is in, let's identify the most frequent job titles. To create h1b.jobs, change the $job_title vector to a factor. Next, use table() to tally the unique job titles and nest the result in as.data.frame() so it can be easily manipulated. The third line orders the new data frame by the new $Freq vector. Since the default behavior of order() is to arrange values in ascending order, you can review the 6 most frequent job titles with tail().

h1b.jobs<-as.factor(h1b.data$job_title)
h1b.jobs<-as.data.frame(table(h1b.jobs))
h1b.jobs<-h1b.jobs[order(h1b.jobs$Freq),]
tail(h1b.jobs)

Reviewing h1b.jobs, you can pick out the jobs you are interested in.

Here, you create jobs by combining "SOFTWARE DEVELOPER", "COMPUTER PROGRAMMER" and "SOFTWARE ENGINEER" into a single search pattern.

Do you notice how each title is separated, without spaces, by the pipe character? The "|" is above your enter key and represents "OR" in a regular expression. So you are going to subset your h1b.data by "A OR B OR C", e.g. "A|B|C".

jobs<-c('SOFTWARE DEVELOPER|COMPUTER PROGRAMMER|SOFTWARE ENGINEER')
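
As a quick sanity check of the pattern (on two made-up job titles), grepl() returns TRUE whenever any of the three titles appears in a string:

grepl(jobs, c('SENIOR SOFTWARE ENGINEER','ACCOUNTANT'))
# [1]  TRUE FALSE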

The subset() function will keep a row of data whenever a logical statement is TRUE. Use grepl() to get that logical outcome, passing in jobs as the pattern to search for within $job_title. Now h1b.devs has ~262K rows instead of h1b.data's ~1.8M.

h1b.devs<-subset(h1b.data,grepl(jobs,h1b.data$job_title)==T)

Geocoding with Bing Maps

To create a map, you will need lat/lon coordinates. There are multiple services for looking up locations, such as Google Maps.

However, I like to use Bing Maps since the limit is 25,000 lookups per day. To use this code, sign up here for a Bing Maps API key. You could look up all locations row by row, but you would reach your Bing limit quickly and have to wait to finish.

Instead, I suggest looking up each unique location and then joining the information later to save time.

To find the unique values in a character vector, convert $location with as.factor() and then use levels() to extract the unique levels.

In this data set, the Software Developer H-1B Applications with 262,659 rows come from only 1,763 unique locations.

all.locs<-levels(as.factor(h1b.devs$location))

Some towns, like New York, contain a space that needs to be encoded to work properly in a URL. The URLencode() function will fill in spaces to make "new%20york". In this code, URLencode() is applied to the unique location vector, all.locs, and the result is unlisted back into a character vector.

all.locs<-unlist(pblapply(all.locs,URLencode))
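
As an aside, here is what URLencode() does to a single made-up value; the space becomes %20 so the location can be embedded in a request URL:

URLencode('NEW YORK')
# [1] "NEW%20YORK"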

After signing up, paste your API key in between the quotes to make bing.key. Create the location URLs using paste0(). Pass in the base section of the URL, then the all.locs vector, some more URL information and lastly your API key string. R will recycle all the individual address parts as paste0() works through all.locs.

bing.key<-'BING_MAPS_API_KEY'
bing.urls<-paste0('http://dev.virtualearth.net/REST/v1/Locations?q=',all.locs,'&maxResults=1&key=', bing.key)

Next you need a function so you do not have to write code for each URL with location information.

The bing.get() function accepts an individual URL. Using fromJSON(), the API response is collected and parsed.

Then, the lat/lon values are extracted. If the API provided a result, the lat and lon are selected from within the response; if the API couldn't find the location, NA is returned instead. This guards against API responses without lat/lon values so your code doesn't break.

Lastly, the lat/lon values are organized into a small data frame.

bing.get<-function(bing.url){
  # Query the Bing Maps API and parse the JSON response
  bing.resp<-fromJSON(bing.url, simplifyDataFrame = T)
  # If the API found the location, pull out the lat/lon; otherwise use NA
  if(bing.resp$resourceSets$estimatedTotal > 0){
    lat<-bing.resp$resourceSets$resources[[1]]$point$coordinates[[1]][1]
    lon<-bing.resp$resourceSets$resources[[1]]$point$coordinates[[1]][2]
  } else {
    lat<-'NA'
    lon<-'NA'
  }
  # Organize the coordinates into a small data frame
  lat.lon<-data.frame(lat=lat,lon=lon)
  return(lat.lon)
}
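
If you are worried about one failed request stopping the entire lookup, an optional approach (not part of the original workflow) is a small hypothetical wrapper, bing.get.safe(), that uses tryCatch() to return NA coordinates on an error; you could pass it to pblapply() below in place of bing.get:

# Hypothetical wrapper: return NA coordinates instead of stopping on an error
bing.get.safe<-function(bing.url){
  tryCatch(bing.get(bing.url),
           error=function(e) data.frame(lat=NA,lon=NA))
}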

To test the function, apply it to the first element of bing.urls. If the code works, your R console will print the lat/lon for the first location.

bing.get(bing.urls[1])

Now that you have verified the function works and the URLs have been created correctly, you can apply it to the entire bing.urls vector. Using the progress bar version of lapply(), pblapply(), pass in the Bing URLs and the bing.get function. This will return a list of 1,763 data frames, each with $lat and $lon columns.

A fast way to combine the data frames is with rbindlist(), which will bind the list elements row-wise to create a single data frame of lat/lon values.

all.coords<-pblapply(bing.urls,bing.get)
coords.df<-rbindlist(all.coords)
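
Since there were 1,763 unique locations and each lookup returned a lat and a lon, a quick dimension check confirms the shape of the result:

dim(coords.df)
# [1] 1763    2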

In order to join the original Software Developer H-1B visa application data to the location data you need a common value. This code adds $location using the same code to extract unique locations as before. Then $lat and $lon are changed from character to numeric values. The final data frame can be examined with head().

coords.df$location<-levels(as.factor(h1b.devs$location))
coords.df$lat<-as.numeric(as.character(coords.df$lat))
coords.df$lon<-as.numeric(as.character(coords.df$lon))
head(coords.df)

Next you need to add the lat/lon values to the h1b.devs data. Using join() treats coords.df as a lookup table: every time one of the locations in h1b.devs is identified in the coords.df data frame, the additional $lat and $lon values are added. To check the number of NAs returned from the API, use is.na() along with sum().

h1b.devs<- join(h1b.devs,coords.df,by='location')
sum(is.na(h1b.devs$lat))
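
If any lookups came back as NA, you may optionally want to drop those rows before mapping, for example:

# Optional: keep only the rows that received lat/lon values from the API
h1b.devs<-subset(h1b.devs,!is.na(h1b.devs$lat))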

The h1b.devs object contains all H-1B Visa applications for Software Developers along with lat/lon for the job location.

Mapping H-1B Software Developer Information

You can construct a map now that the Software Developer H-1B Data has the appropriate lat and lon. To start, let’s create a static map with unique locations only.

The code below references the 1763 unique lat/lon pairs to demonstrate software developer “hotspots”. For example, San Francisco and San José are distinct locations but using lat/lon you can see how the locations cluster on the map.

For this code, add ggplot2 and ggthemes. These are popular grammar-of-graphics plotting libraries:

library(ggplot2)
library(ggthemes)

Start by loading the state map information. Using map_data() with “state” will create a data frame with coordinates and plotting information. This is used as the base layer of the US map.

state.map <- map_data("state")

Next, create the base ggplot. The polygon layer declares the state.map data and maps x and y to long and lat respectively. This part of the code is separated so that you can see how the layers are added and changed in the next part.

ggplot() + geom_polygon(data = state.map, aes(x=long,y=lat,group=group), 
               colour = "grey", fill = NA)

This is the base USA Map layer.

On top of this layer you will add points with geom_point(). In this layer you declare the coords.df object with lon and lat. You make the dots semi-transparent with alpha and specify the color "red". Next, you limit the plot to center it on the US, add a title and a theme, and finally remove a number of plot elements.

  ggplot() + 
  geom_polygon(data = state.map, aes(x=long,y=lat,group=group), 
               colour = "grey", fill = NA) + 
  geom_point(data=coords.df, aes(x=lon, y=lat), alpha=0.05, col="red")  +
  xlim(-125,-65) + ylim(25,50) + 
  ggtitle("Certified H1B Visas") +theme_gdocs() +
  theme(panel.border = element_blank(),
        axis.text = element_blank(),
        line = element_blank(),
        axis.title = element_blank())

The plot is a static image showing clusters of H-1B locations:

Software Developer H-1B Application Locations

It’s better to have a dynamic map so you can interact with the data more freely. The leaflet JavaScript library has an R package for creating HTML-based interactive maps. Load the leaflet library along with htmltools to get started.

library(leaflet)
library(htmltools)

Since the plotting of the points is done client side with JavaScript, it can be time-consuming. As a result, you can sample the data. Sampling a data.table can be done while indexing, using sample() with .N (the number of rows) and the sample size, 1000. Next, use jitter() to slightly adjust each lat/lon value in case there are duplicates.

leaf.sample<-h1b.devs[sample(.N,1000),]
leaf.sample$lat<-jitter(leaf.sample$lat)
leaf.sample$lon<-jitter(leaf.sample$lon) 

An easy way to use leaflet is with the %>% operator.

Confusingly, this is also called the pipe!

This time the pipe operator, %>%, forces an object forward into the next function. This makes code concise and easy to follow.

Here, the leaf.sample data frame is passed into the leaflet() function. The object is forwarded into addTiles() to get map graphics and that again is forwarded into addMarkers(). This could be done with 3 lines of code but only takes one with %>%.

Within the addMarkers() function a popup is added so that you can click on a marker to see the employer.

leaflet(leaf.sample)%>% addTiles() %>% addMarkers(popup = ~htmlEscape(employer))
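
For comparison, here is a sketch of the same map without the pipe; each step is assigned to a throwaway object (called m here) and passed to the next function explicitly.

# Equivalent code without %>%: each call takes the map object as its first argument
m<-leaflet(leaf.sample)
m<-addTiles(m)
m<-addMarkers(m, popup = ~htmlEscape(employer))
m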

A screen shot of the leaflet map with a marker popup:

To create a more interesting map, you need to add more information. Color code the markers by the H-1B application $start_yr. Calling palette() will print the 8 default colors that are named in R. You can use this vector to color code your markers. Create pal using colorNumeric() applied to the palette information and the $start_yr vector.

palette()
pal<-colorNumeric(palette(),domain=h1b.data$start_yr)

Like before, pass in leaf.sample and start piping forward.

Instead of the default markers, use addCircleMarkers(). Within the function, the radius is determined by the salary divided by 10,000.

Another way to add popup information is with paste(). Separating elements with an HTML line break <br> ensures that the popup is legible. Lastly, the color is selected by referencing the pal object.

leaflet(leaf.sample) %>% addTiles() %>%
  addCircleMarkers(
    radius = ~base_salary/10000, 
    popup = paste('Employer:', leaf.sample$employer, '<br>',
            'Job Title:', leaf.sample$job_title, '<br>',
            'Salary:', leaf.sample$base_salary, '<br>',
            'Year:', leaf.sample$start_yr,'<br>',
            'Case Status',leaf.sample$case_status),
    color = ~pal(start_yr), stroke = F, fillOpacity = 0.5)

This time, the map is more interesting as you navigate around. Colors are related to years, size of the marker is related to salary and the popups contain more information.

This is a screenshot of the resulting plot:


Make sure to also check out the leaflet map that visualizes the certified visas for the top five H-1B jobs over the years. You can find it here.

Top Companies Hiring H-1B Devs by Year

You may want to explore top software developer salaries by employer over time. Let’s focus on “CERTIFIED” applications using subset because these applications resulted in actual US jobs.

dev.certs<-subset(h1b.devs, h1b.devs$case_status=='CERTIFIED')

Use aggregate to get summary stats by multiple groupings. In this example, aggregate splits the $base_salary into groups by $submit_yr and $employer, then computes the mean, and returns a data frame.

avg.salary<-aggregate(dev.certs$base_salary, by=list(dev.certs$submit_yr,dev.certs$employer), FUN=mean) 

After a while, the summary statistics for the salary will be calculated for each employer for each year.
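
Note that aggregate() keeps its default output names: the grouping columns are Group.1 (the submit year) and Group.2 (the employer), and the mean salary lands in a column called x. You can peek at a few rows with head():

head(avg.salary)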

The next step is to identify the top H-1B Software Developer employers.

This code should look familiar by now since you used it to create h1b.jobs when working with $job_title. After calling head(), you should know the top 6 developer employers.

top.emp<-as.data.frame(table(dev.certs$employer))
top.emp<-top.emp[order(top.emp$Freq, decreasing=T),]
head(top.emp)

Once you know the top employers, you can use grep to return the row positions in avg.salary that match the employer name. This can be used while indexing avg.salary to leave only the employer in question.

goog<-avg.salary[grep('google', avg.salary$Group.2,ignore.case=T),]
msft<-avg.salary[grep('microsoft', avg.salary$Group.2,ignore.case=T),]
tata<-avg.salary[grep('TATA CONSULTANCY SERVICES LIMITED', avg.salary$Group.2,ignore.case=T),]
acc<-avg.salary[grep('accenture', avg.salary$Group.2,ignore.case=T),] 

Reviewing the data shows that Google's H-1B Software Developer salaries are stable, while Microsoft's have been increasing year over year. Both tech giants pay far better than the other top H-1B software developer employers, Tata Consultancy and Accenture, although Tata's average developer salary jumped in 2016.

| Submit Year | Google   | Microsoft | Tata Consulting | Accenture |
|-------------|----------|-----------|-----------------|-----------|
| 2012 | $129,462 |           | $64,707         | $65,800   |
| 2013 | $129,399 | $112,823  | $65,125         | $66,301   |
| 2014 | $127,520 | $124,508  | $66,417         | $70,997   |
| 2015 | $130,450 | $125,104  | $68,263         | $73,670   |
| 2016 | $133,450 | $129,304  | $89,438         | $72,680   |

Conclusion

I hope you enjoyed learning about H-1B Visa applications. Along the way, you learned some basic EDA functions and ways to explore data including time, location, frequency and summary statistics. H-1B data is very rich and there is probably a lot more that can be done to extract insights from the data set. You can always explore different professions, or drill down into specific employers or regions.

In the end, I hope this helped you learn more R and possibly have a successful H-1B Visa application!

Are you interested in doing more EDA with R? Consider taking our Exploratory Data Analysis and Exploratory Data Analysis in R: Case Study courses.