Skip to content

Welcome to New York City, one of the most-visited cities in the world. There are many Airbnb listings in New York City to meet the high demand for temporary lodging for travelers, which can be anywhere between a few nights to many months. In this project, we will take a closer look at the New York Airbnb market by combining data from multiple file types like .csv, .tsv, and .xlsx.

Recall that CSV, TSV, and Excel files are three common formats for storing data. Three files containing data on 2019 Airbnb listings are available to you:

data/airbnb_price.csv This is a CSV file containing data on Airbnb listing prices and locations.

  • listing_id: unique identifier of listing
  • price: nightly listing price in USD
  • nbhood_full: name of borough and neighborhood where listing is located

data/airbnb_room_type.xlsx This is an Excel file containing data on Airbnb listing descriptions and room types.

  • listing_id: unique identifier of listing
  • description: listing description
  • room_type: Airbnb has three types of rooms: shared rooms, private rooms, and entire homes/apartments

data/airbnb_last_review.tsv This is a TSV file containing data on Airbnb host names and review dates.

  • listing_id: unique identifier of listing
  • host_name: name of listing host
  • last_review: date when the listing was last reviewed

Project Summary

In this project, we explore the New York City Airbnb market by analyzing data from multiple sources and file formats, including CSV, TSV, and Excel files. The data covers various aspects of Airbnb listings, such as prices, locations, room types, descriptions, host names, and review dates.

By combining and analyzing these datasets, we aim to gain insights into the pricing, availability, and characteristics of Airbnb listings in New York City.

This analysis can help travelers make informed decisions and provide hosts with valuable market insights.

# Import necessary packages
import pandas as pd
import numpy as np

# Begin coding here ...
# Use as many cells as you like
# Define file paths for the datasets
price_file = "data/airbnb_price.csv"
room_file = "data/airbnb_room_type.xlsx"
review_file = "data/airbnb_last_review.tsv"

# Read the CSV file containing price data into a pandas DataFrame
price_data = pd.read_csv(price_file)

# Read the TSV file containing review data into a pandas DataFrame
# Specify tab separator for TSV files using the 'sep' parameter
review_data = pd.read_csv(review_file, sep="\t")

# Read the XLSX file containing room type data into a pandas DataFrame
# Use the 'read_excel' function from pandas to handle Excel files
room_data = pd.read_excel(room_file)
# Define the common identifier column for merging DataFrames
merge_column = 'listing_id'

# Merge the price_data and room_data DataFrames on the 'listing_id' column
# This will combine rows with matching 'listing_id' values from both DataFrames
merged_data = price_data.merge(room_data, on=merge_column)

# Further merge the resulting DataFrame with the review_data DataFrame on the 'listing_id' column
# This will add columns from review_data to the already merged DataFrame where 'listing_id' matches
merged_data = merged_data.merge(review_data, on=merge_column)

# Display the first few rows of the merged DataFrame to inspect the result
# This is useful for verifying that the merge operation was successful and the data looks as expected
print(merged_data.head())  # Print the first few rows
# Define the date format to be used for parsing the 'last_review' column
# The format is specified as Month Day Year (e.g., May 21 2019)
date_format = "%B %d %Y"

# Convert the 'last_review' column in the merged_data DataFrame to datetime format
# This ensures that the 'last_review' column is correctly interpreted as dates
merged_data['last_review'] = pd.to_datetime(merged_data['last_review'], format=date_format)
# Find the oldest review date in the 'last_review' column of the merged_data DataFrame
# The 'min' function returns the earliest date in the column
oldest_review = merged_data['last_review'].min()

# Find the newest review date in the 'last_review' column of the merged_data DataFrame
# The 'max' function returns the latest date in the column
newest_review = merged_data['last_review'].max()

# Display the oldest review date
print(oldest_review)

# Display the newest review date
print(newest_review)
# Define a dictionary of replacements for standardizing room type values
# Keys are the original values, and values are the standardized replacements
replacements = {
    'Entire home/apt': 'Entire Home/Apt',
    'entire home/apt': 'Entire Home/Apt',
    'ENTIRE HOME/APT': 'Entire Home/Apt',
    'private room': 'Private Room',
    'Private room': 'Private Room',
    'PRIVATE ROOM': 'Private Room',
    'shared room': 'Shared Room',
    'Shared room': 'Shared Room',
    'SHARED ROOM': 'Shared Room'
}

# Replace the values in the 'room_type' column of the merged_data DataFrame
# The 'replace' method is used with the replacements dictionary to standardize the room type values
merged_data['room_type'] = merged_data['room_type'].replace(replacements)
# Filter for private rooms (case-insensitive)
# Create a boolean mask where 'room_type' is 'Private Room'
private_rooms_filter = merged_data['room_type'] == 'Private Room'

# Apply the filter to the merged_data DataFrame to get only the rows with 'Private Room'
filtered_data = merged_data[private_rooms_filter]

# Count the number of private rooms
# The shape attribute returns a tuple where the first element is the number of rows
num_private_rooms = filtered_data.shape[0]

# Display the count of private rooms
# The print statement outputs the count of private rooms
print('Count of Private Rooms: ')
print(num_private_rooms)
# Remove the word 'dollars' from the 'price' column in the merged_data DataFrame
merged_data['price'] = merged_data['price'].str.replace(' dollars', '', regex=True)

# Convert the cleaned 'price' column to numeric (float) type
merged_data['price'] = pd.to_numeric(merged_data['price'])
# Convert the 'price' column in the merged_data DataFrame to numeric (float) type
merged_data['price'] = merged_data['price'].astype(float)
# Calculate the average price of listings
# The 'mean' method computes the average of the 'price' column in the merged_data DataFrame
average_price_listing = merged_data['price'].mean()
# Round the average price to two decimal places for better readability
average_price_listing = round(average_price_listing, 2)

# Create a dictionary to hold the data for the new DataFrame
# The keys are the column names and the values are lists containing the data for each column
data = {
  "first_reviewed": [oldest_review],  # The date of the oldest review
  "last_reviewed": [newest_review],   # The date of the most recent review
  "nb_private_rooms": [num_private_rooms],  # The number of private rooms
  "avg_price": [average_price_listing]  # The average price of listings, rounded to two decimal places
}

# Create a new DataFrame from the dictionary
# The index is set to [0] to indicate that this DataFrame has a single row
review_dates = pd.DataFrame(data, index=[0])

# Print the DataFrame to verify its contents
print("Review dates DataFrame:")
print(review_dates)