There are many Airbnb listings in New York City to meet the high demand for temporary lodging for travelers, which can be anywhere between a few nights to many months. As a consultant working for a real estate start-up, you have collected Airbnb listing data from various sources to investigate the short-term rental market in New York. You'll analyze this data to provide insights on private rooms to the real estate company.
Three files containing data on 2019 Airbnb listings are available to you:
data/airbnb_price.csv This is a CSV file containing data on Airbnb listing prices and locations.
listing_id: unique identifier of listingprice: nightly listing price in USDnbhood_full: name of borough and neighborhood where listing is located
data/airbnb_room_type.xlsx This is an Excel file containing data on Airbnb listing descriptions and room types.
listing_id: unique identifier of listingdescription: listing descriptionroom_type: Airbnb has three types of rooms: shared rooms, private rooms, and entire homes/apartments
data/airbnb_last_review.tsv This is a TSV file containing data on Airbnb host names and review dates.
listing_id: unique identifier of listinghost_name: name of listing hostlast_review: date when the listing was last reviewed
import pandas as pd
import numpy as np
# load in all the data-sets as individual dfs (initially)
price_df = pd.read_csv('data/airbnb_price.csv')
roomtype_df = pd.read_excel('data/airbnb_room_type.xlsx')
lastreview_df = pd.read_csv('data/airbnb_last_review.tsv', sep='\t')Though implied from the project instruction, I want to check that the 'listing_ID' is essentially an intersection.
# check the listing_id is an intersection between the three datasets
if all('listing_id' in df.columns for df in [price_df, roomtype_df, lastreview_df]):
id_sets = [set(df['listing_id']) for df in [price_df, roomtype_df, lastreview_df]]
shared_ids = set.intersection(*id_sets)
if shared_ids:
print(f"Checked. {len(shared_ids)} IDs are found in all three DataFrames.")
else:
print("No IDs are shared across all three DataFrames.")
else:
print("One or more DataFrames are missing the 'listing_id' column.")# merge the dfs together using listing_id as the intersection, we should expect 7 columns
wide_df = (
price_df
.merge(roomtype_df, on='listing_id', how='inner')
.merge(lastreview_df, on='listing_id', how='inner')
)
wide_df.info()wide_df.head()What are the dates of the earliest and most recent reviews?
Store these values as two separate variables with your preferred names.
earliest_review = wide_df['last_review'].min()
earliest_reviewmostrecent_review = wide_df['last_review'].max()
mostrecent_reviewThis is incorrect, looks like it's sorting the last_review column alphabetically which tells me I should of made sure the data-types were cleaned and prepped...
wide_df.head()# prices are currently all string - '### dollars' - convert to int32
wide_df['price'] = wide_df['price'].astype(str)
wide_df['price'] = (
wide_df['price']
.str.replace(' dollars', '', regex=False)
.astype('int32')
)
# convert last_review into datetime
wide_df['last_review'] = pd.to_datetime(wide_df['last_review'])Initially when I checked this, I was confused as to why the str dtypes were still appearing as objects. This is where I learned about pandas more recent use adoption of the 'string' type, which appears to be more a preference in current workflows...
# convert all text/object columns into str
wide_df[['nbhood_full', 'description', 'room_type', 'host_name']] = wide_df[['nbhood_full', 'description', 'room_type', 'host_name']].astype('string')
# make room type values consistent casing
wide_df['room_type'] = wide_df['room_type'].str.lower()
# check my moves...
wide_df.info()Let's try again with the earliest and most recent dates