Welcome to New York City, one of the most-visited cities in the world. There are many Airbnb listings in New York City to meet the high demand for temporary lodging for travelers, with stays ranging from a few nights to many months. In this project, we will take a closer look at the New York Airbnb market by combining data from multiple file types: .csv, .tsv, and .xlsx.
Recall that CSV, TSV, and Excel files are three common formats for storing data. Three files containing data on 2019 Airbnb listings are available to you:
data/airbnb_price.csv is a CSV file containing data on Airbnb listing prices and locations.
- listing_id: unique identifier of listing
- price: nightly listing price in USD
- nbhood_full: name of borough and neighborhood where listing is located
data/airbnb_room_type.xlsx is an Excel file containing data on Airbnb listing descriptions and room types.
- listing_id: unique identifier of listing
- description: listing description
- room_type: Airbnb has three types of rooms: shared rooms, private rooms, and entire homes/apartments
data/airbnb_last_review.tsv is a TSV file containing data on Airbnb host names and review dates.
- listing_id: unique identifier of listing
- host_name: name of listing host
- last_review: date when the listing was last reviewed
# Import necessary packages
import pandas as pd

# Load each file with the reader that matches its format
csv_data = pd.read_csv('data/airbnb_price.csv')
tsv_data = pd.read_csv('data/airbnb_last_review.tsv', sep='\t')
xlsx_data = pd.read_excel('data/airbnb_room_type.xlsx')
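# Optional sanity check (a minimal sketch): confirm each frame carries the
# columns described in the data dictionary above before merging.
print(csv_data.columns.tolist())   # expected: listing_id, price, nbhood_full
print(tsv_data.columns.tolist())   # expected: listing_id, host_name, last_review
print(xlsx_data.columns.tolist())  # expected: listing_id, description, room_type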
# Merge the three DataFrames on the shared listing_id key
combined_data = pd.merge(csv_data, tsv_data, on='listing_id', how='outer')
combined_data = pd.merge(combined_data, xlsx_data, on='listing_id', how='outer')
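# Optional check: after the two outer merges there should be one row per
# listing_id; if the two counts below differ, some listings appear more than once.
print(len(combined_data), combined_data['listing_id'].nunique())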
# Parse the last_review strings into datetimes, then find the earliest and latest review dates
combined_data['last_review_date'] = pd.to_datetime(combined_data['last_review'], format='%B %d %Y')
first_reviewed = combined_data['last_review_date'].min()
last_reviewed = combined_data['last_review_date'].max()
# How many of the listings are private rooms?
combined_data['room_type'] = combined_data['room_type'].str.lower()
private_room_count = combined_data[combined_data['room_type'] == 'private room'].shape[0]
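# Optional: after lowercasing, only the three documented room types
# (shared rooms, private rooms, entire homes/apartments) should remain.
print(combined_data['room_type'].value_counts(dropna=False))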
# What is the average listing price?
combined_data['price_clean'] = combined_data['price'].str.replace(' dollars', '').astype(float)
avg_price = combined_data['price_clean'].mean()
# Collect the key results in a one-row summary DataFrame
review_dates = pd.DataFrame({
    'first_reviewed': [first_reviewed],
    'last_reviewed': [last_reviewed],
    'nb_private_rooms': [private_room_count],
    'avg_price': [round(avg_price, 2)]
})
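Finally, displaying the one-row summary shows the earliest and latest review dates, the number of private rooms, and the rounded average nightly price in a single place:
print(review_dates)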
import pandas as pd
import re
import matplotlib.pyplot as plt
import seaborn as sns
from collections import Counter
# Load the CSV file, skipping malformed lines instead of raising an error
data = pd.read_csv('test_sample.csv', on_bad_lines='skip', engine='python')
# Preview the data
print(data.head())
data.info()
# Check for missing values in each column
print(data.isna().sum())
# Drop rows with missing values in the key columns
columns_to_check = ['text_original', 'likes_count', 'created_time']
data.dropna(subset=columns_to_check, inplace=True)
print(data.isna().sum())
data.head()
# Drop columns that are not useful for this analysis
data = data.drop(columns=['text_additional', 'shares_count', 'comments_count', 'views_count'])
# Extract hashtags from a specific column
data['hashtags'] = data['text_original'].apply(lambda x: re.findall(r'#\w+', x))
print(data)
# Function to extract hashtags from a text field
def extract_hashtags(text):
    if pd.isnull(text):
        return []
    return re.findall(r'#\w+', text)
# Extract hashtags again using the helper, filling missing text with an empty string first
data['Hashtags'] = data['text_original'].fillna('')
data['Hashtags'] = data['Hashtags'].apply(extract_hashtags)
print(data)
# Flatten the list of hashtags and count frequency
all_hashtags = [hashtag.lower() for hashtags in data['Hashtags'] for hashtag in hashtags]
hashtag_counts = pd.Series(all_hashtags).value_counts()
# Display the top 10 hashtags
print("Top 10 Hashtags:")
print(hashtag_counts.head(10))
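# Equivalent tally using collections.Counter (imported above); most_common(10)
# returns the same information as value_counts().head(10), as (hashtag, count) pairs.
print(Counter(all_hashtags).most_common(10))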
# Visualize the top 10 hashtags
top_n = 10
top_hashtags = hashtag_counts.head(top_n)
plt.figure(figsize=(12, 6))
sns.barplot(x=top_hashtags.values, y=top_hashtags.index, palette='viridis')
plt.title(f"Top {top_n} Hashtags")
plt.xlabel("Frequency")
plt.ylabel("Hashtag")
plt.show()
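To keep a copy of the chart, a plt.savefig call can be added just before plt.show(); the filename below is only an illustration.
plt.savefig('top_hashtags.png', dpi=150, bbox_inches='tight')
plt.show()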