Welcome to New York City, one of the most-visited cities in the world. There are many Airbnb listings in New York City to meet the high demand for temporary lodging for travelers, which can be anywhere between a few nights to many months. In this project, we will take a closer look at the New York Airbnb market by combining data from multiple file types like .csv
, .tsv
, and .xlsx
.
Recall that CSV, TSV, and Excel files are three common formats for storing data. Three files containing data on 2019 Airbnb listings are available to you:
data/airbnb_price.csv This is a CSV file containing data on Airbnb listing prices and locations.
listing_id
: unique identifier of listingprice
: nightly listing price in USDnbhood_full
: name of borough and neighborhood where listing is located
data/airbnb_room_type.xlsx This is an Excel file containing data on Airbnb listing descriptions and room types.
listing_id
: unique identifier of listingdescription
: listing descriptionroom_type
: Airbnb has three types of rooms: shared rooms, private rooms, and entire homes/apartments
data/airbnb_last_review.tsv This is a TSV file containing data on Airbnb host names and review dates.
listing_id
: unique identifier of listinghost_name
: name of listing hostlast_review
: date when the listing was last reviewed
1. Loading and checking the data
# Import libraries
import numpy as np
import pandas as pd
# Loading datasets and storing them in dataframes
airbnb_price = pd.read_csv('data/airbnb_price.csv')
airbnb_room_type = pd.read_excel('data/airbnb_room_type.xlsx')
airbnb_last_review = pd.read_csv('data/airbnb_last_review.tsv', sep='\t')
# Checking overall information in the 3 loaded dataframes
# Checking the airbnb_price information
print(airbnb_price.info())
print(airbnb_price['price'].describe())
print('Duplicated values: ', airbnb_price.duplicated().any())
print(airbnb_price.head())
- The 'price' column needs to be cleaned because it is not in the right data type and 'dollar' strings have to be removed before altering the data type to 'float'.
- No cleaning task is needed on the 'nbhood_full' column
# Checking the airbnb_room_type information
print(airbnb_room_type.info())
print('Duplicated values: ', airbnb_room_type.duplicated().any())
print(airbnb_room_type.head())
print(airbnb_room_type[airbnb_room_type['description'].isna()])
print(airbnb_room_type['room_type'].value_counts())
- The 'room_type' column is inconsistent because of the typo
- The 'description' column has some missing values, but it does not have any impact on the analysis result, so the cleaning task is not needed on this column
# Checking the airbnb_last_review information
print(airbnb_last_review.info())
print('Duplicated values: ', airbnb_last_review.duplicated().any())
print(airbnb_last_review[airbnb_last_review['host_name'].isna()])
print(airbnb_last_review.head(5))
- The 'host_name' have some missing values but it does not impact on the analysis result, so no cleaning task is needed on this column.
- The 'last_review' column is not in the right data type, it should be in 'datetime' type.
2. Cleaning the 3 loaded datasets
# Cleaning the 'airbnb_price' dataset, the 'price' column, data type
airbnb_price['price'] = airbnb_price['price'].str.strip(' dollars')
airbnb_price['price'] = airbnb_price['price'].astype('float64')
print(airbnb_price['price'].value_counts())
print(airbnb_price['price'].dtype)
# Cleaning the 'airbnb_room_type' dataset, the 'room_type' column, typo
airbnb_room_type['room_type'] = airbnb_room_type['room_type'].str.lower()
print(airbnb_room_type['room_type'].value_counts())
# Cleaning the 'airbnb_last_review' dataset, the 'last_review' column, data type
airbnb_last_review['last_review'] = pd.to_datetime(airbnb_last_review['last_review'])
print(airbnb_last_review['last_review'].head(10))
print(airbnb_last_review['last_review'].dtype)
3. Merging the 3 cleaned datasets into one dataframe