Skip to content

Welcome to New York City, one of the most-visited cities in the world. There are many Airbnb listings in New York City to meet the high demand for temporary lodging for travelers, which can be anywhere between a few nights to many months. In this project, I'm going to take you on a journey, as we take a closer look at the New York Airbnb market by combining data from multiple file types like .csv, .tsv, and .xlsx (Excel files).

Note that CSV, TSV, and Excel files are three common formats for storing data. Three files containing data on 2019 Airbnb listings are available to us:

data/airbnb_price.csv This is a CSV file containing data on Airbnb listing prices and locations.

  • listing_id: unique identifier of listing
  • price: nightly listing price in USD
  • nbhood_full: name of borough and neighborhood where listing is located

data/airbnb_room_type.xlsx This is an Excel file containing data on Airbnb listing descriptions and room types.

  • listing_id: unique identifier of listing
  • description: listing description
  • room_type: Airbnb has three types of rooms: shared rooms, private rooms, and entire homes/apartments

data/airbnb_last_review.tsv This is a TSV file containing data on Airbnb host names and review dates.

  • listing_id: unique identifier of listing
  • host_name: name of listing host
  • last_review: date when the listing was last reviewed

Now I'm going to load the necessary packages to run my commands swiftly. Then I am going to import the 3 different datasets into variables. My first line of action would be to clean the numeric data, which is the price column to remove non-numeric characters and convert to numeric characters. Then I'll clean all other important columns, like the room type column, to ensure consistency across all rows. Follow me as I draw out insights from the datasets.

# The necessary packages has been loaded
suppressMessages(library(dplyr)) 
options(readr.show_types = FALSE) 
library(readr)
library(readxl)
library(stringr)
library(lubridate) # Added this line to load the lubridate package
library(dplyr)
library(ggplot2)

# Let's begin coding ..
# I read the data from the files into variables
price_data <- read_csv("data/airbnb_price.csv")
room_type_data <- read_excel("data/airbnb_room_type.xlsx")
review_data <- read_tsv("data/airbnb_last_review.tsv")

# I'm cleaning the price column: to remove non-numeric characters and convert to numeric characters
price_data <- price_data %>%
  mutate(price = as.numeric(str_replace_all(price, "[^\\d.]", "")))


# Now I'm cleaning the room_type column to ensure consistency
room_type_data <- room_type_data %>%
  mutate(room_type = str_trim(str_to_lower(room_type)))


# I'm converting the review dates column to Date type for easy manipulation
review_data <- review_data %>%
  mutate(last_review = mdy(last_review))


# I found the earliest and most recent review dates
first_reviewed <- min(review_data$last_review, na.rm = TRUE)
last_reviewed <- max(review_data$last_review, na.rm = TRUE) 


# Let's filter to count the number of room types and save in variables
nb_private_rooms <- room_type_data %>%
  filter(room_type == "private room") %>%
  nrow()
nb_shared_rooms <- room_type_data %>%
  filter(room_type == "shared room") %>%
  nrow()
nb_entire_home_apt <- room_type_data %>%
  filter(room_type == "entire home/apt") %>%
  nrow()
# the count of all the room types 
print(paste("private room:", nb_private_rooms))
print(paste("shared room:", nb_shared_rooms))
print(paste("entire home/apt:", nb_entire_home_apt))

# Now let's calculate the average listing price (rounded to the nearest penny)
avg_price <- round(mean(price_data$price, na.rm = TRUE), 2)


# Combined the values into one tibble
review_dates <- tibble(
  first_reviewed = first_reviewed,
  last_reviewed = last_reviewed,
  nb_private_rooms = nb_private_rooms,
  avg_price = avg_price
)

# Final answer
print(review_dates)

Price Distribution by Room Type

I'm going to be examining pricing differences between shared, private, and entire-home listings.

# price dist by room type
price_by_type <- price_data %>%
  inner_join(room_type_data, by = "listing_id") %>%
  group_by(room_type) %>%
  summarise(
    count = n(),
    avg_price = round(mean(price, na.rm = TRUE), 2),
    median_price = round(median(price, na.rm = TRUE), 2),
    min_price = min(price, na.rm = TRUE),
    max_price = max(price, na.rm = TRUE)
  )

print("Summary stats of room type prices ->")
print(price_by_type)

Price by Neighborhood (Top 10 Most Expensive/Popular)

This is going to highlight high-earning or high-activity neighborhoods

price_by_nbhood <- price_data %>%
  group_by(nbhood_full) %>%
  summarise(
    avg_price = round(mean(price, na.rm = TRUE), 2),
    count = n()
  ) %>%
  arrange(desc(avg_price)) %>%
  slice_head(n = 10)

print(price_by_nbhood)

Room Types by Neighborhood

This shows which parts of NYC offer mostly entire homes vs. private or shared spaces.


room_by_nbhood <- room_type_data %>%
  inner_join(price_data, by = "listing_id") %>%
  group_by(nbhood_full, room_type) %>%
  summarise(count = n()) %>%
  arrange(nbhood_full, desc(count))

print(room_by_nbhood)