Internet News and Consumer Engagement

Ready to put your coding skills to the test? Join us for our Workspace Competition.
For more information, visit datacamp.com/workspacecompetition

Context

This dataset (source) consists of data about news articles collected from Sept. 3, 2019 until Nov. 4, 2019. It was then enriched with Facebook engagement data, such as the number of shares, comments, and reactions. It was originally created to predict the popularity of an article before publication. However, there is a lot more you can analyze; take a look at the suggestions at the end of this template.

Load packages

library(skimr)
library(tidyverse)

Load your Data

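# Read the compressed CSV of news articles
articles <- readr::read_csv('data/news_articles.csv.gz')
# Treat the publisher identifier and name as categorical variables
articles$source_id <- as.factor(articles$source_id)
articles$source_name <- as.factor(articles$source_name)
# Summarize every column, dropping the numeric quantile and completeness columns
skim(articles) %>%
  select(-(numeric.p0:numeric.p100)) %>%
  select(-(complete_rate))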

Understand your data

| Variable | Description |
|---|---|
| source_id | publisher unique identifier |
| source_name | human-readable publisher name |
| author | article author |
| title | article headline |
| description | article short description |
| url | article URL from publisher website |
| url_to_image | URL to main image associated with the article |
| published_at | exact time and date of publishing the article |
| content | unformatted content of the article truncated to 260 characters |
| top_article | value indicating if article was listed as a top article on publisher website |
| engagement_reaction_count | users reactions count for posts on Facebook involving article URL |
| engagement_comment_count | users comments count for posts on Facebook involving article URL |
| engagement_share_count | users shares count for posts on Facebook involving article URL |
| engagement_comment_plugin_count | users comments count for Facebook comment plugin on article website |
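As a starting point, the four engagement columns described above can be combined into a single score and summarized per publisher. The sketch below runs on a small made-up data frame, `toy_articles`, rather than the real file. The name `total_engagement`, the unweighted sum, and the presence of missing values are assumptions for illustration, not something the dataset defines.

```r
library(dplyr)

# Toy stand-in for `articles`; all values are made up for illustration.
toy_articles <- data.frame(
  source_name = c("Publisher A", "Publisher A", "Publisher B"),
  engagement_reaction_count = c(10, NA, 5),
  engagement_comment_count = c(2, NA, 1),
  engagement_share_count = c(3, NA, 0),
  engagement_comment_plugin_count = c(0, NA, 4),
  stringsAsFactors = FALSE
)

engagement_by_source <- toy_articles %>%
  # Unweighted sum of the four counts; a row with any missing count stays NA.
  mutate(
    total_engagement = engagement_reaction_count +
      engagement_comment_count +
      engagement_share_count +
      engagement_comment_plugin_count
  ) %>%
  group_by(source_name) %>%
  summarise(
    n_articles = n(),
    n_with_engagement = sum(!is.na(total_engagement)),
    median_engagement = median(total_engagement, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  # Publishers with the highest typical engagement first
  arrange(desc(median_engagement))

print(engagement_by_source)
```

Replacing `toy_articles` with `articles` should apply the same summary to the full dataset. The median is used here instead of the mean because engagement counts are often heavily skewed, which is worth checking against the `skim()` output.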

Now you can start to explore this dataset with the chance to win incredible prizes! Can't think of where to start? Try your hand at these suggestions:

  • Extract useful insights and visualize them in the most interesting way possible.
  • Categorize the articles into different categories based on, for example, sentiment.
  • Cluster the news articles, authors or publishers based on, for example, topic.
  • Make a title generator based on data such as content, description, etc.
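To make the sentiment suggestion above more concrete, here is a minimal sketch of a word-count approach to labelling headlines. The word lists `positive_words` and `negative_words`, the helper `score_title()`, and the example headlines are all hypothetical and far too small for real use; a proper analysis would more likely rely on an established sentiment lexicon or a trained model.

```r
library(dplyr)

# Hypothetical mini-lexicon, for illustration only.
positive_words <- c("win", "growth", "success")
negative_words <- c("crisis", "loss", "fail")

# Made-up headlines standing in for the `title` column of `articles`.
toy_titles <- data.frame(
  title = c(
    "Markets see growth after big win",
    "Crisis deepens as talks fail",
    "Council meets on Tuesday"
  ),
  stringsAsFactors = FALSE
)

# Score = number of positive words minus number of negative words.
score_title <- function(title) {
  words <- strsplit(tolower(title), "[^a-z]+")[[1]]
  sum(words %in% positive_words) - sum(words %in% negative_words)
}

sentiment <- toy_titles %>%
  mutate(
    score = vapply(title, score_title, integer(1), USE.NAMES = FALSE),
    category = case_when(
      score > 0 ~ "positive",
      score < 0 ~ "negative",
      TRUE ~ "neutral"
    )
  )

print(sentiment)
```

The resulting `category` column could then be compared against the engagement counts, for example to see whether negative headlines tend to attract more shares.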

Judging Criteria

Each category is listed with its weightage, followed by the details that will be assessed.

Analysis (30%)
  • Documentation on the goal and what was included in the analysis
  • How the question was approached
  • Visualisation tools and techniques utilized

Results (30%)
  • How the derived results relate to the problem chosen
  • The ability to trigger potential further analysis

Creativity (40%)
  • How "out of the box" the analysis conducted is
  • Whether the publication is properly motivated and adds value