Project: Analyze the Popularity of Programming Languages

How can you determine which programming languages and technologies are most widely used? Which ones are gaining or losing popularity? Answering these questions can help you decide where to focus your efforts.

One excellent data source is Stack Overflow, a programming question-and-answer site with more than 16 million questions on programming topics. Each Stack Overflow question is tagged with a label identifying its topic or technology. By counting the number of questions related to each technology, you can estimate the popularity of different programming languages.

In this project, you will use data from the Stack Exchange Data Explorer to examine how the relative popularity of R, Python, Java, and JavaScript has changed over time.

You'll work with a dataset containing one observation per tag per year, including the number of questions for that tag and the total number of questions that year.

stack_overflow_data.csv

Column	Description
`year`	The year the question was asked (2008-2020)
`tag`	A word or phrase that describes the topic of the question
`num_questions`	The number of questions with a certain tag in that year
`year_total`	The total number of questions asked in that year
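
To make the column layout concrete, here is a minimal sketch of how a tag's share of a year's questions can be derived from `num_questions` and `year_total`. The data frame `toy` and every number in it are made up for illustration; they are not values from stack_overflow_data.csv.

```r
library(dplyr)

# Invented counts with the same four columns as the dataset (illustration only)
toy <- data.frame(
  year = c(2019, 2019, 2020, 2020),
  tag = c("r", "python", "r", "python"),
  num_questions = c(50, 200, 60, 300),
  year_total = c(1000, 1000, 1200, 1200)
)

# Share of each year's questions carrying the tag, expressed as a percentage
toy_share <- toy %>%
  mutate(percentage = num_questions / year_total * 100)

print(toy_share)
```

Dividing by `year_total` rather than comparing raw counts keeps years comparable even when the overall number of questions changes.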

# Load necessary packages
library(readr)
library(dplyr)
library(ggplot2)

# Load the dataset
data <- read_csv("stack_overflow_data.csv")

# Question 1: Has R been growing or shrinking over time?

# Add a fraction column: each tag's share of that year's questions, as a percentage (0-100)
data_fraction <- data %>%
  mutate(fraction = num_questions / year_total * 100)

# Filter for R tags
r_over_time <- data_fraction %>%
  filter(tag == "r")

print(r_over_time)

# Bonus: create a line plot of fraction over time
# ggplot(r_over_time) +
#   geom_line(aes(x = year, y = fraction))
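
Printing the table answers Question 1 only by eye. One possible way to state the direction explicitly is to compare the first and last year, sketched below. This assumes a simple endpoint comparison is good enough; `toy_r` and its values are hypothetical stand-ins for `r_over_time`.

```r
library(dplyr)

# Hypothetical yearly percentages for one tag (illustration only)
toy_r <- data.frame(
  year = c(2018, 2019, 2020),
  fraction = c(2.0, 2.5, 3.0)
)

# Compare the share in the last year with the share in the first year
trend <- toy_r %>%
  arrange(year) %>%
  summarize(
    first_year = first(year),
    last_year = last(year),
    change = last(fraction) - first(fraction)
  )

# Turn the sign of the change into a label
direction <- if (trend$change > 0) {
  "growing"
} else if (trend$change < 0) {
  "shrinking"
} else {
  "flat"
}

print(trend)
print(direction)
```

Replacing `toy_r` with `r_over_time` would apply the same comparison to the real data. An endpoint comparison ignores what happens in between, so the line plot remains a useful check.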

# Question 2: What fraction of the total number of questions asked in 2020 had the R tag?

# Filter for the R tag in 2020 (compare year against a number, not a string)
r_tag_2020 <- data_fraction %>%
  filter(tag == "r", year == 2020)

# Select the fraction column
r_selected <- r_tag_2020 %>% select(fraction)

# Save as a numeric variable; fraction was multiplied by 100 above, so this is a percentage
r_percentage <- r_selected$fraction

# Question 3: What were the five most asked-about tags between 2015-2020?

# Find total number of questions for each tag in the period 2015-2020
sorted_tags <- data %>%
  filter(year >= 2015) %>% 
  group_by(tag) %>% 
  summarize(tag_total = sum(num_questions)) %>% 
  arrange(desc(tag_total))

# Get the five largest tags
highest_tags <- head(sorted_tags$tag, n = 5)

print(highest_tags)

# Filter for the five largest tags
data_subset <- data_fraction %>%
  filter(tag %in% highest_tags, year >= 2015)

# Plot tags over time on a line plot using color to represent tag
ggplot(data_subset, aes(x = year, y = fraction, color = tag)) +
  geom_line()
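
The introduction singles out R, Python, Java, and JavaScript. The same filter-and-plot pattern could be pointed at those four tags, as in the sketch below. It assumes the tags are stored in lowercase as "r", "python", "java", and "javascript"; the data frame `toy_fraction` and its values are invented for illustration.

```r
library(dplyr)
library(ggplot2)

# Invented percentages for a few tags (illustration only)
toy_fraction <- data.frame(
  year = c(2019, 2020, 2019, 2020, 2019, 2020),
  tag = c("r", "r", "python", "python", "c#", "c#"),
  fraction = c(2, 3, 9, 10, 5, 4)
)

# Tag names are an assumption about how the dataset spells them
languages <- c("r", "python", "java", "javascript")

# Keep only the rows for the languages of interest
language_subset <- toy_fraction %>%
  filter(tag %in% languages)

# Build one line per language; call print(language_plot) to draw it
language_plot <- ggplot(language_subset, aes(x = year, y = fraction, color = tag)) +
  geom_line()
```

Swapping `toy_fraction` for `data_fraction` would run the comparison on the real data, provided the tag spellings match.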

# Question 4: Which tag experienced the largest year-over-year increase in its fraction of questions?

# Calculate the fraction of questions for each tag per year.
# Use the year_total column supplied with the dataset; summing num_questions
# across tags would count a question once for every tag it carries.
data_fraction_year <- data %>%
  mutate(fraction = num_questions / year_total)

# Calculate the ratio of the fraction of questions for each tag compared to the previous year
tag_ratios_filtered <- data_fraction_year %>% 
  arrange(tag, year) %>% 
  group_by(tag) %>% 
  mutate(ratio = fraction / lag(fraction)) %>% 
  ungroup()

# Find the tag with the largest year-over-year ratio
# (a tag's first year has no previous year, so its ratio is NA)
highest_ratios <- tag_ratios_filtered %>% 
  slice_max(ratio, n = 1)

highest_ratio_tag <- highest_ratios$tag

# Print the results
print(highest_ratio_tag)
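
The `lag()` step is easier to follow on a small example. The sketch below uses two hypothetical tags, "a" and "b", with invented percentages. It shows that each tag's first year gets an NA ratio and that `slice_max()` returns the whole row, so the year of the largest jump is available alongside the tag.

```r
library(dplyr)

# Invented percentages for two hypothetical tags (illustration only)
toy <- data.frame(
  tag = c("a", "a", "a", "b", "b", "b"),
  year = c(2018, 2019, 2020, 2018, 2019, 2020),
  fraction = c(10, 20, 30, 5, 5, 20)
)

# Ratio of each year's fraction to the previous year's, computed within each tag
toy_ratios <- toy %>%
  arrange(tag, year) %>%
  group_by(tag) %>%
  mutate(ratio = fraction / lag(fraction)) %>%
  ungroup()

# Row with the largest ratio; rows with an NA ratio are only picked as a last resort
top_increase <- toy_ratios %>%
  slice_max(ratio, n = 1)

print(toy_ratios)
print(top_increase)
```

Because `slice_max()` keeps ties by default, more than one row can come back when several tags share the same maximum ratio.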