How can you determine which programming languages and technologies are most widely used? Which languages are gaining or losing popularity, and how can that help you decide where to focus your efforts?
One excellent data source is Stack Overflow, a programming question-and-answer site with more than 16 million questions on programming topics. Each Stack Overflow question is tagged with a label identifying its topic or technology. By counting the number of questions related to each technology, you can estimate the popularity of different programming languages.
In this project, you will use data from the Stack Exchange Data Explorer to examine how the relative popularity of R, Python, Java, and JavaScript has changed over time.
You'll work with a dataset containing one observation per tag per year, including the number of questions for that tag and the total number of questions that year.
stack_overflow_data.csv
| Column | Description |
|---|---|
| year | The year the question was asked (2008-2020) |
| tag | A word or phrase that describes the topic of the question, such as the programming language |
| num_questions | The number of questions with a certain tag in that year |
| year_total | The total number of questions asked in that year |
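
To make the column layout concrete, here is a minimal sketch that builds a tiny table with the same four columns and computes each tag's share of the yearly total. The data frame `toy` and every number in it are invented for illustration and are not taken from stack_overflow_data.csv.

```r
library(dplyr)

# Hypothetical toy data with the same columns as stack_overflow_data.csv.
# All counts are made up purely for illustration.
toy <- data.frame(
  year          = c(2019, 2019, 2020, 2020),
  tag           = c("r", "python", "r", "python"),
  num_questions = c(50, 200, 60, 240),
  year_total    = c(1000, 1000, 1200, 1200),
  stringsAsFactors = FALSE
)

# Share of each tag among all questions asked that year, in percent
toy_pct <- toy %>%
  mutate(percentage = 100 * num_questions / year_total)

toy_pct
```

Because `year_total` is repeated on every row for a given year, the percentage can be computed row by row without any grouping.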
# Load necessary packages
library(readr)
library(dplyr)
library(ggplot2)

# Load the dataset
data <- read_csv("stack_overflow_data.csv")

# View the dataset
head(data)

Let's begin by adding a percentage column that gives each tag's share of all questions asked that year, and then keeping only the rows tagged r, using the mutate() and filter() functions from the dplyr package.
r_over_time <- data %>%
  mutate(percentage = 100 * num_questions / year_total) %>%
  filter(tag == "r")
head(r_over_time, 4)

Let's visualize this in a plot below.
# Plot the yearly percentage of questions tagged r as a single line
ggplot(r_over_time, aes(x = year, y = percentage)) +
  geom_line() +
  theme_classic() +
  labs(title = "Percentage of Questions Tagged with R Per Year", x = "Year",
       y = "Percentage of Questions", caption = "Source: Stack Overflow")

Next, we filter for the year 2020 and extract the percentage of R questions with the code below.
r_percentage <- r_over_time %>%
  filter(tag == "r", year == 2020) %>%
  select(percentage) %>%
  pull()
r_percentage

Thus, the percentage of R questions in 2020 is 96.6%.
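
The filter-then-pull pattern can be tried on a small table first. The sketch below uses a hypothetical data frame `toy_r` with invented percentages, and it passes the column name straight to pull(), which is a shorter equivalent of select() followed by pull().

```r
library(dplyr)

# Hypothetical yearly percentages for a single tag (values are invented)
toy_r <- data.frame(
  year       = c(2018, 2019, 2020),
  tag        = "r",
  percentage = c(4.0, 4.5, 5.0),
  stringsAsFactors = FALSE
)

# Keep the 2020 row, then extract the percentage column as a plain number
toy_r_2020 <- toy_r %>%
  filter(year == 2020) %>%
  pull(percentage)

toy_r_2020
```

pull() returns a vector rather than a data frame, which is convenient when the value is needed in later calculations or in text.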
Next, we find the tags with the highest total number of questions asked from 2015 to 2020. After filtering to those years, we group the data by tag. The summarize() and sum() functions compute the total number of questions for each tag, and the arrange() function sorts the result in descending order.
highest_tags <- data %>%
  filter(year >= 2015 & year <= 2020) %>%
  group_by(tag) %>%
  summarize(highest_total_quest = sum(num_questions)) %>%
  arrange(desc(highest_total_quest))
head(highest_tags, 10)
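
As a quick check of the grouping logic, the following sketch runs the same filter, group, and sum steps on a small invented table and keeps the top two tags. It uses slice_max(), which assumes dplyr 1.0.0 or later, in place of arrange() plus head(). The data frame `toy` and its counts are hypothetical.

```r
library(dplyr)

# Hypothetical counts; the 2014 row should be excluded by the year filter
toy <- data.frame(
  year          = c(2014, 2015, 2016, 2015, 2016, 2015, 2016),
  tag           = c("java", "java", "java", "python", "python", "r", "r"),
  num_questions = c(900, 100, 120, 150, 180, 40, 50),
  stringsAsFactors = FALSE
)

toy_top <- toy %>%
  filter(year >= 2015, year <= 2020) %>%             # keep 2015-2020 only
  group_by(tag) %>%
  summarize(total_questions = sum(num_questions)) %>% # one row per tag
  slice_max(total_questions, n = 2)                   # two largest totals

toy_top
```

Note that slice_max() keeps tied rows by default, so it can return more than `n` rows when totals are equal, whereas head() always truncates to exactly `n`.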