Data Importing and Cleaning: An Airbnb Market in NYC

This project uses three datasets about the New York Airbnb market:

data/airbnb_price.csv has data on prices, boroughs and neighborhoods where listings are located.

data/airbnb_room_type.xlsx is an excel file coontaining listing descriptions and types, like a shared apartment.

data/airbnb_last_review.tsv is a TSV file with host names and review dates.

The datasets can be linked via the listing_id column.

The goal of this project is import different file types, clean data and do a cursory EDA.

# Import necessary packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Import datasets
df_prices = pd.read_csv("data/airbnb_price.csv") 
df_xlsx = pd.ExcelFile("data/airbnb_room_type.xlsx")
df_types = df_xlsx.parse(df_xlsx.sheet_names[0])
df_reviews = pd.read_csv('data/airbnb_last_review.tsv',sep='\t')

# exploring df_prices
df_prices.info()
df_prices.head()

# Checking missing values and duplicates
df_prices.isna().sum() # No missing values
df_prices.duplicated(subset='listing_id').any() # No duplicates

# The price column is of string type and contains the word 'dollars'. Fix itpe of the price column from string to integer.

df_prices['price'] = df_prices['price'].str.replace(' dollars', '')
df_prices['price'] = df_prices['price'].astype(int)

# A typical entry of the column nbhood_full looks like this: Manhattan, Murray Hill. It's better to store these two pieces of information in separate columns and drop the original column

df_prices[['borough','nbhood']] = df_prices['nbhood_full'].str.split(',',n=1 ,expand=True)
df_prices = df_prices.drop('nbhood_full',axis=1)

df_types['room_type'].value_counts()

# Checking missing values and duplicates
df_types.isna().sum() # 10 missing values in the description column
df_types.duplicated(subset='listing_id').any() # No duplicates.

# exploring df_types
df_types.info()

# Examine a random sample of the data
df_types.sample(n=100, random_state=1)

# Fix inconsitent spellings in the room_type column
df_types['room_type'] = df_types['room_type'].str.lower()
df_types['room_type'].value_counts()

# exploring df_reviews
df_reviews.head()
df_reviews.info()

# Checking missing values and duplicates
df_reviews.isna().sum() # 8 missing values in the host_name column
df_reviews.duplicated(subset='listing_id').any() # No duplicates.

# Merge df_prices and df_types
df = df_prices.merge(df_types,on='listing_id',how='inner')
df = df.drop('description',axis=1)

assert df['listing_id'].nunique() == df_prices['listing_id'].nunique() == df_types['listing_id'].nunique() # Make sure all the listing_ids are included in the new dataframe

# A quick look at the values distribution of the price column
df['price'].describe()
sns.set_style('darkgrid')
plt.figure(figsize=(5, 2))
sns.histplot(data=df, x='price',bins=50)
plt.show()

‌
‌
‌