Internet News and Consumer Engagement


Context

This dataset (source) consists of news articles collected from September 3, 2019 until November 4, 2019, enriched with Facebook engagement data such as the number of shares, comments, and reactions. It was originally created to predict the popularity of an article before publication. However, there is a lot more you can analyze; take a look at some suggestions at the end of this template.

Load packages

library(skimr)
library(tidyverse)
library(ggplot2)
library(lubridate)
library(ggpubr)
library(kableExtra)
library(tidytext)
library(textdata)
library(igraph)
library(ggraph)
library(treemap)
library(circlize)
library(viridis)
library(hrbrthemes)
library(ggalluvial)
library(wordcloud)
library(patchwork)
library(tm)
library(SnowballC)

Load your Data

# Read the compressed CSV; treat publisher identifiers as categorical factors
articles <- readr::read_csv('data/news_articles.csv.gz')
articles$source_id <- as.factor(articles$source_id)
articles$source_name <- as.factor(articles$source_name)

# Summarise the data, dropping the percentile and completeness columns for brevity
skim(articles) %>%
  select(-(numeric.p0:numeric.p100)) %>%
  select(-complete_rate)

Understand your data

Variable                          Description
source_id                         publisher unique identifier
source_name                       human-readable publisher name
author                            article author
title                             article headline
description                       article short description
url                               article URL from the publisher website
url_to_image                      URL to the main image associated with the article
published_at                      exact date and time the article was published
content                           unformatted content of the article, truncated to 260 characters
top_article                       value indicating whether the article was listed as a top article on the publisher website
engagement_reaction_count         count of user reactions to Facebook posts involving the article URL
engagement_comment_count          count of user comments on Facebook posts involving the article URL
engagement_share_count            count of user shares of Facebook posts involving the article URL
engagement_comment_plugin_count   count of user comments via the Facebook comment plugin on the article website
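
The four engagement columns are the raw material for any popularity analysis. As a quick illustration, they can be collapsed into a single score per article (a minimal sketch; total_engagement is an illustrative name, not a column in the dataset):

# Collapse the four Facebook engagement counts into one score per article;
# coalesce() treats a missing count as zero instead of wiping out the sum
articles_scored <- articles %>%
  mutate(total_engagement = coalesce(engagement_reaction_count, 0) +
                            coalesce(engagement_comment_count, 0) +
                            coalesce(engagement_share_count, 0) +
                            coalesce(engagement_comment_plugin_count, 0))

# Peek at the most-engaged articles
articles_scored %>%
  arrange(desc(total_engagement)) %>%
  select(source_name, title, total_engagement) %>%
  head(5)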

Who are the publishers?

  # List the distinct publishers in the dataset
  distinct.publishers <- articles %>%
                            distinct(source_name)
  distinct.publishers
  

To get acquainted with the data, the publishers are plotted against their total number of published articles in the figure below. The only odd actor is publisher 460.0: it has just one published article and very little information about it in the dataset.
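
To confirm this, the stray publisher can be inspected directly (a quick sketch, assuming the publisher appears in source_name under the literal label "460.0"):

# Inspect the lone article attributed to publisher 460.0
articles %>%
  filter(source_name == "460.0") %>%
  select(source_name, author, title, published_at)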

  
  # Count articles per publisher and order the bars for the circular barplot
  publishers <- articles %>%
    count(source_name, sort = FALSE) %>%
    rename(Publisher = source_name) %>%
    mutate(Publisher = fct_reorder(Publisher, n, .desc = TRUE))

  ggplot(publishers, aes(x = Publisher, y = n, fill = Publisher)) +
    geom_bar(stat = "identity", alpha = 0.5) +
    # The negative lower limit opens a hole at the centre of the circle
    ylim(-400, 1600) +
    labs(caption = "Number of publications by publisher") +
    theme_minimal() +
    theme(
      axis.text = element_blank(),
      axis.title = element_blank(),
      panel.grid = element_blank(),
      plot.margin = unit(rep(-1, 4), "cm")
    ) +
    # Wrap the bars around a circle and label each with its article count
    coord_polar(start = 0) +
    geom_text(data = publishers, aes(x = Publisher, label = n))
    
    

How have they published over time?

  # Show the rows with a missing publication date as a formatted table
  kbl(articles %>%
        filter(is.na(published_at)) %>%
        select(source_name, published_at),
      booktabs = TRUE, centering = TRUE, align = 'c') %>%
    kable_classic(full_width = FALSE, html_font = "Cambria")


Below, you can see how the publishers have been publishing over time. They are divided into three classes according to their number of publications: group A comprises the top six publishers, followed by groups B and C in descending order of output. Publisher 460.0 has no publication date, so it was excluded from the analysis. Rather than hard-coding the group membership, the split could also be derived from the article counts, as the sketch after this paragraph shows.
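
A minimal sketch of that data-driven alternative (the cut-offs at ranks 6 and 9 mirror the manual grouping used below; publisher_rank is an illustrative name):

# Rank publishers by total article count and cut the ranking into groups A/B/C
publisher_rank <- articles %>%
  drop_na(published_at) %>%
  count(source_name, name = "total_articles") %>%
  arrange(desc(total_articles)) %>%
  mutate(rank = row_number(),
         group = case_when(rank <= 6 ~ "A",
                           rank <= 9 ~ "B",
                           TRUE      ~ "C"))

publisher_rank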


  # Daily article counts per publisher (rows without a date are dropped)
  publishers_over_date <- articles %>%
                  select(source_name, published_at) %>%
                  rename(Publisher = source_name) %>%
                  group_by(Publisher, date = as.Date(published_at)) %>%
                  summarise(total = n()) %>%
                  drop_na() %>%
                  arrange(desc(total))

  
  A <- publishers_over_date %>%
       filter(Publisher %in% c('Reuters','BBC News','The Irish Times','ABC News','CNN',
                               'Business Insider'))
  
  B <- publishers_over_date %>%
       filter(Publisher %in% c('The New York Times','CBS News','Newsweek'))
  
  C <- publishers_over_date %>%
       filter(Publisher %in% c('Al Jazeera English','The Wall Street Journal', 'ESPN'))
  

  # One panel per group, sharing the same y scale for comparability
  pA <- ggplot(A, aes(x = date, y = total, color = Publisher)) +
    geom_line() +
    geom_point() +
    ylim(0, 100) +
    theme(legend.position = "bottom", legend.direction = "horizontal")

  pB <- ggplot(B, aes(x = date, y = total, color = Publisher)) +
    geom_line() +
    geom_point() +
    ylim(0, 100) +
    theme(legend.position = "bottom", legend.direction = "horizontal")

  pC <- ggplot(C, aes(x = date, y = total, color = Publisher)) +
    geom_line() +
    geom_point() +
    ylim(0, 100) +
    theme(legend.position = "bottom", legend.direction = "horizontal")
  
  # figure <- ggarrange(pA, pB, pC, labels = c("A","B","C"),
  #                      ncol = 1,heights = c(1.1,1,1),hjust = -43) 
  # 
  # figure
  
  pA
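
Only panel A is rendered above. Since patchwork is already loaded, the three panels could also be stacked into one figure, much like the commented-out ggarrange call (a sketch):

# Stack the three group panels vertically with patchwork
pA / pB / pC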
  

From the figure above, one can see that Reuters maintains a roughly constant rate of article production, while the other publishers show drops in output during two periods.
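
That impression can be checked numerically, for instance with the coefficient of variation of each publisher's daily article counts (a sketch; a lower value means a steadier publishing rate):

# Coefficient of variation of daily output per publisher
publishers_over_date %>%
  group_by(Publisher) %>%
  summarise(mean_daily = mean(total),
            cv = sd(total) / mean(total)) %>%
  arrange(cv)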