Skip to content

Welcome to New York City, one of the most-visited cities in the world. There are many Airbnb listings in New York City to meet the high demand for temporary lodging for travelers, which can be anywhere between a few nights to many months. In this project, we will take a closer look at the New York Airbnb market by combining data from multiple file types like .csv, .tsv, and .xlsx.

Recall that CSV, TSV, and Excel files are three common formats for storing data. Three files containing data on 2019 Airbnb listings are available to you:

data/airbnb_price.csv This is a CSV file containing data on Airbnb listing prices and locations.

  • listing_id: unique identifier of listing
  • price: nightly listing price in USD
  • nbhood_full: name of borough and neighborhood where listing is located

data/airbnb_room_type.xlsx This is an Excel file containing data on Airbnb listing descriptions and room types.

  • listing_id: unique identifier of listing
  • description: listing description
  • room_type: Airbnb has three types of rooms: shared rooms, private rooms, and entire homes/apartments

data/airbnb_last_review.tsv This is a TSV file containing data on Airbnb host names and review dates.

  • listing_id: unique identifier of listing
  • host_name: name of listing host
  • last_review: date when the listing was last reviewed
# Import necessary packages
import pandas as pd
import numpy as np

# 1. - Loading the data:

# Load the CSV file
price_df = pd.read_csv('data/airbnb_price.csv')

# Load the Excel file
room_type_df = pd.read_excel('data/airbnb_room_type.xlsx')

# Load the TSV file
review_df = pd.read_csv('data/airbnb_last_review.tsv', sep='\t')

# 2. - Merging the three DataFrames:

# Merge the DataFrames based on the 'listing_id' column
merged_df = price_df.merge(room_type_df, on='listing_id').merge(review_df, on='listing_id')

# 3. - Determining the earliest and most recent review dates:

# Convert the 'last_review' column to datetime format
merged_df['last_review'] = pd.to_datetime(merged_df['last_review'], errors='coerce')

# Find the earliest and most recent review dates
first_reviewed = merged_df['last_review'].min()
last_reviewed = merged_df['last_review'].max()

# 4. - Finding how many listings are private rooms:

# Standardize the values in the 'room_type' column to lowercase
merged_df['room_type'] = merged_df['room_type'].str.lower()

# Count the number of private room listings
nb_private_rooms = merged_df[merged_df['room_type'] == 'private room'].shape[0]

# 5. - Finding the average price of listings

# Remove the dollar sign and convert the 'price' column to float
merged_df['price'] = merged_df['price'].replace('[^\d.]', '', regex=True).astype(float)

# Calculate the average price and round to two decimal places
avg_price = round(merged_df['price'].mean(), 2)

# 6. - Creating a DataFrame with the four solution values

# Create the final DataFrame with the required information
review_dates = pd.DataFrame({
    'first_reviewed': [first_reviewed],
    'last_reviewed': [last_reviewed],
    'nb_private_rooms': [nb_private_rooms],
    'avg_price': [avg_price]
})

# Display the final DataFrame
print(review_dates)

Project Conclusion

In this project, we analyzed Airbnb listing data in New York City to gain insights into the short-term rental market, specifically focusing on private rooms. The data was sourced from three different file formats (.csv, .xlsx, .tsv), and we successfully combined and processed these datasets to extract meaningful information.

Key Findings:

  • Earliest Review Date: 2019-01-01
  • Most Recent Review Date: 2019-07-09
  • Number of Private Room Listings: 11,356
  • Average Listing Price: $141.78

By merging the data from different sources, we were able to provide a comprehensive overview of the Airbnb market trends. This analysis helps in understanding the distribution of private rooms, their pricing, and review patterns over a specific period.

Summary of the Process:

  1. Data Loading: We used pandas to load data from .csv, .xlsx, and .tsv files.
  2. Data Merging: The datasets were merged into a single DataFrame based on the 'listing_id' column.
  3. Date Processing: The 'last_review' dates were converted to a datetime format to identify the earliest and most recent reviews.
  4. Data Cleaning: We standardized the 'room_type' column to count the number of private room listings accurately.
  5. Price Calculation: We cleaned the 'price' data by removing non-numeric characters and calculating the average price.
  6. Result Compilation: A final DataFrame was created to summarize the insights.

This project demonstrates the importance of data cleaning, integration, and analysis in deriving actionable insights from disparate data sources. The significant decrease in memory usage and the extraction of critical market trends provide valuable information for real estate stakeholders looking to understand and leverage the Airbnb market in New York City.

This analysis not only enhances our understanding of the Airbnb market but also showcases the application of various data science techniques in real-world scenarios. The ability to handle and process multiple data formats efficiently is crucial in today's data-driven decision-making processes.