Welcome to New York City, one of the most-visited cities in the world. There are many Airbnb listings in New York City to meet the high demand for temporary lodging for travelers, which can be anywhere between a few nights to many months. In this project, I'm going to take you on a journey, as we take a closer look at the New York Airbnb market by combining data from multiple file types like .csv
, .tsv
, and .xlsx
(Excel files).
Note that CSV, TSV, and Excel files are three common formats for storing data. Three files containing data on 2019 Airbnb listings are available to us:
data/airbnb_price.csv This is a CSV file containing data on Airbnb listing prices and locations.
listing_id
: unique identifier of listingprice
: nightly listing price in USDnbhood_full
: name of borough and neighborhood where listing is located
data/airbnb_room_type.xlsx This is an Excel file containing data on Airbnb listing descriptions and room types.
listing_id
: unique identifier of listingdescription
: listing descriptionroom_type
: Airbnb has three types of rooms: shared rooms, private rooms, and entire homes/apartments
data/airbnb_last_review.tsv This is a TSV file containing data on Airbnb host names and review dates.
listing_id
: unique identifier of listinghost_name
: name of listing hostlast_review
: date when the listing was last reviewed
Now I'm going to load the necessary packages to run my commands swiftly. Then I am going to import the 3 different datasets into variables. My first line of action would be to clean the numeric data, which is the price column to remove non-numeric characters and convert to numeric characters. Then I'll clean all other important columns, like the room type column, to ensure consistency across all rows. Follow me as I draw out insights from the datasets.
# The necessary packages has been loaded
suppressMessages(library(dplyr))
options(readr.show_types = FALSE)
library(readr)
library(readxl)
library(stringr)
library(lubridate) # Added this line to load the lubridate package
library(dplyr)
library(ggplot2)
# Let's begin coding ..
# I read the data from the files into variables
price_data <- read_csv("data/airbnb_price.csv")
room_type_data <- read_excel("data/airbnb_room_type.xlsx")
review_data <- read_tsv("data/airbnb_last_review.tsv")
# I'm cleaning the price column: to remove non-numeric characters and convert to numeric characters
price_data <- price_data %>%
mutate(price = as.numeric(str_replace_all(price, "[^\\d.]", "")))
# Now I'm cleaning the room_type column to ensure consistency
room_type_data <- room_type_data %>%
mutate(room_type = str_trim(str_to_lower(room_type)))
# I'm converting the review dates column to Date type for easy manipulation
review_data <- review_data %>%
mutate(last_review = mdy(last_review))
# I found the earliest and most recent review dates
first_reviewed <- min(review_data$last_review, na.rm = TRUE)
last_reviewed <- max(review_data$last_review, na.rm = TRUE)
# Let's filter to count the number of room types and save in variables
nb_private_rooms <- room_type_data %>%
filter(room_type == "private room") %>%
nrow()
nb_shared_rooms <- room_type_data %>%
filter(room_type == "shared room") %>%
nrow()
nb_entire_home_apt <- room_type_data %>%
filter(room_type == "entire home/apt") %>%
nrow()
# the count of all the room types
print(paste("private room:", nb_private_rooms))
print(paste("shared room:", nb_shared_rooms))
print(paste("entire home/apt:", nb_entire_home_apt))
# Now let's calculate the average listing price (rounded to the nearest penny)
avg_price <- round(mean(price_data$price, na.rm = TRUE), 2)
# Combined the values into one tibble
review_dates <- tibble(
first_reviewed = first_reviewed,
last_reviewed = last_reviewed,
nb_private_rooms = nb_private_rooms,
avg_price = avg_price
)
# Final answer
print(review_dates)
Price Distribution by Room Type
I'm going to be examining pricing differences between shared, private, and entire-home listings.
# price dist by room type
price_by_type <- price_data %>%
inner_join(room_type_data, by = "listing_id") %>%
group_by(room_type) %>%
summarise(
count = n(),
avg_price = round(mean(price, na.rm = TRUE), 2),
median_price = round(median(price, na.rm = TRUE), 2),
min_price = min(price, na.rm = TRUE),
max_price = max(price, na.rm = TRUE)
)
print("Summary stats of room type prices ->")
print(price_by_type)
Price by Neighborhood (Top 10 Most Expensive/Popular)
This is going to highlight high-earning or high-activity neighborhoods
price_by_nbhood <- price_data %>%
group_by(nbhood_full) %>%
summarise(
avg_price = round(mean(price, na.rm = TRUE), 2),
count = n()
) %>%
arrange(desc(avg_price)) %>%
slice_head(n = 10)
print(price_by_nbhood)
Room Types by Neighborhood
This shows which parts of NYC offer mostly entire homes vs. private or shared spaces.
room_by_nbhood <- room_type_data %>%
inner_join(price_data, by = "listing_id") %>%
group_by(nbhood_full, room_type) %>%
summarise(count = n()) %>%
arrange(nbhood_full, desc(count))
print(room_by_nbhood)