
How can you determine which programming languages and technologies are most widely used? Which languages are gaining or losing popularity, helping you decide where to focus your efforts?

One excellent data source is Stack Overflow, a programming question-and-answer site with more than 16 million questions on programming topics. Each Stack Overflow question is tagged with a label identifying its topic or technology. By counting the number of questions related to each technology, you can estimate the popularity of different programming languages.

In this project, you will use data from the Stack Exchange Data Explorer to examine how the relative popularity of R, Python, Java, and JavaScript has changed over time.

You'll work with a dataset containing one observation per tag per year, including the number of questions for that tag and the total number of questions that year.

stack_overflow_data.csv

Column         Description
year           The year the question was asked (2008-2020)
tag            A word or phrase that describes the topic of the question, such as the programming language
num_questions  The number of questions with a certain tag in that year
year_total     The total number of questions asked in that year
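
Before working with the real file, it can help to see this layout on a toy example. The sketch below builds a small hypothetical tibble with the same four columns (the counts are invented for illustration and are not taken from stack_overflow_data.csv). It checks the two properties described above, one row per tag per year and a single year_total value per year, and then computes each tag's share of that year's questions.

```r
library(dplyr)

# Hypothetical rows in the same layout as stack_overflow_data.csv.
# The counts are invented for illustration only.
toy <- tibble(
  year          = c(2019, 2019, 2020, 2020),
  tag           = c("r", "python", "r", "python"),
  num_questions = c(50, 200, 60, 300),
  year_total    = c(1000, 1000, 1200, 1200)
)

# One observation per tag per year: no (year, tag) pair should appear twice
duplicate_pairs <- toy %>%
  count(year, tag) %>%
  filter(n > 1)

# year_total should take exactly one value within each year
totals_per_year <- toy %>%
  group_by(year) %>%
  summarize(n_totals = n_distinct(year_total))

# Share of each year's questions carrying each tag, in percent
toy_pct <- toy %>%
  mutate(percentage = num_questions / year_total * 100)
toy_pct
```

The same two checks can be run on the loaded dataset by swapping in its name for `toy`. An empty `duplicate_pairs` and an `n_totals` of 1 in every row would suggest the file matches the description.
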
# Load necessary packages
library(readr)
library(dplyr)
library(ggplot2)
# Load the dataset
data <- read_csv("stack_overflow_data.csv")
data
# Questions tagged R each year, as a percentage of all questions asked that year
r_over_time <- data %>%
  mutate(percentage = (num_questions / year_total) * 100) %>%
  filter(tag == "r")
r_over_time

# A line plot of percentage over time
ggplot(r_over_time, aes(x = year, y = percentage)) +
  geom_line()

# Percentage of the total number of questions asked in 2020 that had the R tag
r_2020 <- r_over_time %>%
  filter(year == 2020)
r_2020_select <- r_2020 %>% select(percentage)

# Save as a numeric variable
r_percentage <- r_2020_select$percentage
r_percentage

# The five most asked-about tags from 2015 to 2020
# (the data ends in 2020, so a lower bound on year is enough)
sorted_tags <- data %>%
  filter(year >= 2015) %>%
  group_by(tag) %>%
  summarize(tag_total = sum(num_questions)) %>%
  arrange(desc(tag_total))

highest_tags <- head(sorted_tags$tag, n = 5)
highest_tags

# Visualize
data_subset <- data %>%
  mutate(percentage = (num_questions / year_total) * 100) %>%
  filter(tag %in% highest_tags, year >= 2015)
data_subset

ggplot(data_subset, aes(x = year, y = percentage, color = tag)) +
  geom_line()

# Which tag experienced the largest year-over-year increase in its percentage of questions?
# Percentage of questions for each tag per year, using the dataset's own
# year_total column (the same denominator as in the steps above). Summing
# num_questions across tags would count a question once per tag it carries.
data_perc_year <- data %>%
  mutate(percentage = (num_questions / year_total) * 100)
data_perc_year

# Ratio of each tag's percentage to its percentage in the previous year
# (NA for a tag's first year, which has no previous row)
tag_ratios_filtered <- data_perc_year %>% 
  arrange(tag, year) %>% 
  group_by(tag) %>% 
  mutate(ratio = percentage / lag(percentage)) %>% 
  ungroup()
tag_ratios_filtered 

# Find the tag with the highest ratio increase
highest_ratios <- tag_ratios_filtered %>% 
  slice_max(ratio, n = 1)
highest_ratios

highest_ratio_tag <- highest_ratios$tag
highest_ratio_tag
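
To see what the ratio step does, here is a minimal sketch on a hypothetical table (the tags `a` and `b` and their percentages are invented for illustration and are not results from the dataset). It shows that `lag()` leaves each tag's first year without a ratio. As one possible variation on the code above, it drops those rows explicitly and uses `with_ties = FALSE` so that exactly one row comes back.

```r
library(dplyr)

# Hypothetical percentages, invented for illustration only
toy <- tibble(
  tag        = c("a", "a", "a", "b", "b", "b"),
  year       = c(2018, 2019, 2020, 2018, 2019, 2020),
  percentage = c(1, 2, 3, 4, 2, 8)
)

toy_ratios <- toy %>%
  arrange(tag, year) %>%
  group_by(tag) %>%
  # lag() looks one row back within the tag; the first year has no predecessor
  mutate(ratio = percentage / lag(percentage)) %>%
  ungroup()

# Drop the undefined first-year ratios, then keep the single largest ratio
toy_top <- toy_ratios %>%
  filter(!is.na(ratio)) %>%
  slice_max(ratio, n = 1, with_ties = FALSE)
toy_top
```

A ratio above 1 means the tag's share of questions grew relative to the previous year, and a ratio below 1 means it shrank. Because the ratio is relative, a tag starting from a very small percentage can top the ranking with only a modest absolute increase.
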