Welcome to New York City, one of the most-visited cities in the world. There are many Airbnb listings in New York City to meet the high demand for temporary lodging for travelers, which can be anywhere between a few nights to many months. In this notebook, we will take a closer look at the New York Airbnb market by combining data from multiple file types like .csv
, .tsv
, and .xlsx
.
Recall that CSV, TSV, and Excel files are three common formats for storing data. Three files containing data on 2019 Airbnb listings are available to you:
data/airbnb_price.csv
listing_id
: unique identifier of listingprice
: nightly listing price in USDnbhood_full
: name of borough and neighborhood where listing is located
data/airbnb_room_type.xlsx This is an Excel file containing data on Airbnb listing descriptions and room types.
listing_id
: unique identifier of listingdescription
: listing descriptionroom_type
: Airbnb has three types of rooms: shared rooms, private rooms, and entire homes/apartments
data/airbnb_last_review.tsv This is a TSV file containing data on Airbnb host names and review dates.
listing_id
: unique identifier of listinghost_name
: name of listing hostlast_review
: date when the listing was last reviewed
Our goals are to convert untidy data into appropriate formats to analyze, and answer key questions including:
- What is the average price, per night, of an Airbnb listing in NYC?
- How does the average price of an Airbnb listing, per month, compare to the private rental market?
- How many adverts are for private rooms?
- How do Airbnb listing prices compare across the five NYC boroughs?
import numpy as np
import pandas as pd
import datetime as dt
# Step 1. Importing the data
# Load airbnb_price.csv, prices
prices = pd.read_csv("data/airbnb_price.csv")
# Load airbnb_room_type.xlsx, xls
xls = pd.ExcelFile("data/airbnb_room_type.xlsx")
# Parse the first sheet from xls, room_types
room_types = xls.parse(0)
# Load airbnb_last_review.tsv, reviews
reviews = pd.read_csv("data/airbnb_last_review.tsv", sep="\t")
# Step 2. Cleaning the price column
# Remove whitespace and string characters from prices column
prices["price"] = prices["price"].str.replace(" dollars", "")
# Convert prices column to numeric datatype
prices["price"] = pd.to_numeric(prices["price"])
# Step 3. Calculating average price
# Subset prices for listings costing $0, free_listings
free_listings = prices["price"] == 0
# Update prices by removing all free listings from prices
prices = prices.loc[~free_listings]
# Calculate the average price, avg_price
avg_price = round(prices["price"].mean(), 2)
# Step 4. Comparing costs to the private rental market
# Add a new column to the prices DataFrame, price_per_month
prices["price_per_month"] = prices["price"] * 365 / 12
# Calculate average_price_per_month
average_price_per_month = round(prices["price_per_month"].mean(), 2)
difference = round((average_price_per_month - 3100),2)
# Step 5. Cleaning the room_type column
# Convert the room_type column to lowercase
room_types["room_type"] = room_types["room_type"].str.lower()
# Update the room_type column to category data type
room_types["room_type"] = room_types["room_type"].astype("category")
# Create the variable room_frequencies
room_frequencies = room_types["room_type"].value_counts()
# Step 6. What timeframe are we working with?
# Change the data type of the last_review column to datetime
reviews["last_review"] = pd.to_datetime(reviews["last_review"])
# Create first_reviewed, the earliest review date
first_reviewed = reviews["last_review"].dt.date.min()
# Create last_reviewed, the most recent review date
last_reviewed = reviews["last_review"].dt.date.max()
# Step 7. Joining the DataFrames
# Merge prices and room_types to create rooms_and_prices
rooms_and_prices = prices.merge(room_types, how="outer", on="listing_id")
# Merge rooms_and_prices with the reviews DataFrame to create airbnb_merged
airbnb_merged = rooms_and_prices.merge(reviews, how="outer", on="listing_id")
# Drop missing values from airbnb_merged
airbnb_merged.dropna(inplace=True)
# Step 8. Analyzing listing prices by NYC borough
# Extract information from the nbhood_full column and store as a new column, borough
airbnb_merged["borough"] = airbnb_merged["nbhood_full"].str.partition(",")[0]
# Group by borough and calculate summary statistics
boroughs = airbnb_merged.groupby("borough")["price"].agg(["sum", "mean", "median", "count"])
# Round boroughs to 2 decimal places, and sort by mean in descending order
boroughs = boroughs.round(2).sort_values("mean", ascending=False)
# Step 9. Price range by borough
# Create labels for the price range, label_names
label_names = ["Budget", "Average", "Expensive", "Extravagant"]
# Create the label ranges, ranges
ranges = [0, 69, 175, 350, np.inf]
# Insert new column, price_range, into DataFrame
airbnb_merged["price_range"] = pd.cut(airbnb_merged["price"], bins=ranges, labels=label_names)
# Calculate occurence frequencies for each label, prices_by_borough
prices_by_borough = airbnb_merged.groupby(["borough", "price_range"])["price_range"].count()
# Step 10. Storing the final result
solution = {'avg_price':avg_price,
'average_price_per_month': average_price_per_month,
'difference':difference,
'room_frequencies':room_frequencies,
'first_reviewed': first_reviewed,
'last_reviewed': last_reviewed,
'prices_by_borough':prices_by_borough}
print(solution)