How can you determine which programming languages and technologies are most widely used? Which languages are gaining or losing popularity, helping you decide where to focus your efforts?
One excellent data source is Stack Overflow, a programming question-and-answer site with more than 16 million questions on programming topics. Each Stack Overflow question is tagged with a label identifying its topic or technology. By counting the number of questions related to each technology, you can estimate the popularity of different programming languages.
In this project, you will use data from the Stack Exchange Data Explorer to examine how the relative popularity of R, Python, Java, and JavaScript has changed over time.
You'll work with a dataset containing one observation per tag per year, including the number of questions for that tag and the total number of questions that year.
stack_overflow_data.csv
Column | Description |
---|---|
year | The year the question was asked (2008-2020) |
tag | A word or phrase that describes the topic of the question, such as the programming language |
num_questions | The number of questions with a certain tag in that year |
year_total | The total number of questions asked in that year |
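To make the two count columns concrete, here is a minimal sketch of how a tag's yearly share can be derived from them. The rows in `toy` are made-up numbers for illustration only and are not taken from stack_overflow_data.csv; the names `toy` and `toy_share` are likewise hypothetical.

```r
library(dplyr)

# Made-up rows for illustration only (NOT real Stack Overflow counts)
toy <- data.frame(
  year          = c(2019, 2019, 2020, 2020),
  tag           = c("r", "python", "r", "python"),
  num_questions = c(50, 200, 80, 500),
  year_total    = c(1000, 1000, 2000, 2000)
)

# Share of each year's questions that carry a given tag, as a percentage
toy_share <- toy %>%
  mutate(percentage = num_questions / year_total * 100)

print(toy_share)
```

Dividing by `year_total` rather than comparing raw counts matters because the overall number of questions changes from year to year, so a tag can gain questions while still losing share.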
Coding skills are now part of many job requirements, so both early-career and seasoned professionals across different fields are expected to have at least a basic knowledge of programming. However, choosing a first programming language can be difficult because there are so many options, both free and paid.
Although the choice is hard, knowing which languages people ask about and use most can help you decide where to focus your attention.
In this analysis, I explore data from Stack Overflow to determine the popularity of several programming languages from 2015 to 2020, which can inform the decisions you make along your programming journey.
From the analysis, I found that the five most popular tags were 'JavaScript', 'Python', 'Java', 'Android', and 'C#' (Figure 2), ranked by the total number of questions carrying each tag from 2015 to 2020. Note that Android is a platform rather than a programming language; it appears because the ranking covers every tag in the dataset.
Relative to other tags, R's share of questions grew steadily from 2008 to 2020, reaching almost 1% of the total number of questions (Figure 1).
If you have read to this point, thank you very much for your time. I welcome any contributions and comments.
# Load necessary packages
library(readr)
library(dplyr)
library(ggplot2)
# Load the dataset
data <- read_csv("stack_overflow_data.csv")
# Inspect the structure and first rows of the data before analyzing
# the share of each year's questions that were tagged R.
glimpse(data)
head(data)
# Percentage of each year's questions tagged R (Figure 1)
# After filtering to one tag there is one row per year, so no grouping is needed
r_over_time <- data %>%
  filter(tag == "r") %>%
  mutate(percentage = num_questions / year_total * 100) %>%
  arrange(desc(percentage))
print(r_over_time)
# Plot the R share by year
ggplot(r_over_time, aes(x = year, y = percentage)) +
  geom_line() +
  ggtitle("R Growth Across the Years")
# Percentage of questions asked in 2020 that were tagged R
r_percentage <- data %>%
  filter(year == 2020, tag == "r") %>%
  mutate(percentage = num_questions / year_total * 100) %>%
  pull(percentage) # Already a numeric vector, so no conversion is needed
print(r_percentage)
# Names of the five most popular tags from 2015 to 2020
highest_tags <- data %>%
  filter(year >= 2015 & year <= 2020) %>%
  group_by(tag) %>%
  summarize(total_year = sum(num_questions)) %>%
  arrange(desc(total_year)) %>%
  head(n = 5) %>%
  pull(tag) # Pull just the tag column into a vector
# Ensure it's a character vector
highest_tags <- as.character(highest_tags)
# Print the highest tags
highest_tags
# Question totals for the five most popular tags from 2015 to 2020
# (this overwrites the character vector above with a data frame for plotting)
highest_tags <- data %>%
  filter(year >= 2015 & year <= 2020) %>%
  group_by(tag) %>%
  summarize(total_year = sum(num_questions)) %>%
  arrange(desc(total_year)) %>%
  head(n = 5)
# Plot the totals (Figure 2)
ggplot(highest_tags, aes(x = reorder(tag, -total_year), y = total_year, fill = tag)) +
  geom_col() +
  ggtitle("POPULAR PROGRAMMING LANG FROM 2015-2020") +
  xlab("Programming Language") +
  ylab("Total Questions") +
  theme_minimal()
# Print the table of top tags and their totals
highest_tags
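
The project description mentions comparing R, Python, Java, and JavaScript over time, which the code above does not yet do. The following is only a sketch of one way to approach it: `tag_share_over_time` is a hypothetical helper that is not part of the original analysis, the rows in `toy` are made-up numbers used solely so the snippet runs on its own, and the lowercase tag spellings are an assumption based on the `tag == "r"` filter used earlier.

```r
library(dplyr)
library(ggplot2)

# Hypothetical helper: percentage share per year for a chosen set of tags
tag_share_over_time <- function(df, tags) {
  df %>%
    filter(tag %in% tags) %>%
    mutate(percentage = num_questions / year_total * 100) %>%
    arrange(tag, year)
}

# Made-up rows for illustration only (NOT real Stack Overflow counts)
toy <- data.frame(
  year          = c(2019, 2019, 2019, 2020, 2020, 2020),
  tag           = c("r", "python", "css", "r", "python", "css"),
  num_questions = c(50, 200, 100, 80, 500, 150),
  year_total    = c(1000, 1000, 1000, 2000, 2000, 2000)
)

shares <- tag_share_over_time(toy, c("r", "python"))

# One line per tag, so the trends can be compared on the same axes
p <- ggplot(shares, aes(x = year, y = percentage, color = tag)) +
  geom_line() +
  xlab("Year") +
  ylab("Share of questions (%)")
print(p)
```

With the real dataset loaded as `data`, the call would look like `tag_share_over_time(data, c("r", "python", "java", "javascript"))`, assuming the tags are stored in lowercase.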