Introduction
In this project we will be importing, cleaning, and analyzing the data from 2023 and buidling a predictive model for 2024 revenue based on booked nights.
Summary:
1. Data Distribution and Trends:
The average amount spent by guests and the average number of nights stayed were calculated, providing a baseline for understanding guest spending behavior. Most guests spent €400 or less, indicating a concentration of short-term, budget-conscious bookings.
2. Spending Patterns:
Guests were categorized into different spending groups, revealing how guest expenditure is distributed. This segmentation helps in understanding different customer profiles. The histogram of the amount spent showed a skewed distribution, suggesting that while most guests spend lower amounts, there is a significant number of higher-value bookings.
3. Guest-Specific Insights:
The analysis identified the guests who spent the most and the least, offering a glimpse into the range of customer spending on the platform.
4. Correlation Between Spending and Nights Stayed:
A moderate positive correlation (0.68) was found between the amount spent and the number of nights stayed, indicating that longer stays tend to correspond with higher expenditures.
5.Predictive Modeling for 2024:
A linear regression model predicted 2024 revenue based on the number of nights booked. The model's prediction indicated a range of expected revenue with a 95% confidence interval, providing a quantified forecast for 2024.
6. Scenario Analysis:
The scenario analysis with 1000 simulations offered a probabilistic view of potential revenue outcomes, enhancing the robustness of the forecast.
7. Evaluation of Model Performance:
The model's performance was assessed using the mean squared error, ensuring the reliability of the predictions.
Task:
- Import the data from airbnb2023.xlsx,
- Clean and analyise the data
- Make a predictive model for 2024. revenue and nights booked.
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt# Read the Excel file and store it in the 'airbnb' dataframe
airbnb = pd.read_excel("Airbnb2023.xlsx", index_col=0)
# Display the contents of the 'airbnb' dataframe
display(airbnb)
# Drop the columns 'Unnamed: 9' and 'PDV' from the 'airbnb' dataframe
airbnb = airbnb.drop(["Unnamed: 9","PDV"], axis=1)
airbnb = airbnb.dropna()
# Display information about the airbnb dataframe
airbnb.info()
# Display the updated contents of the 'airbnb' dataframe
airbnb.head()# Calculate descriptive statistics for the "Amount" column in the airbnb dataframe
airbnb_amount = airbnb["Amount"].describe()
airbnb_amount# Calculate the descriptive statistics for the "Nights" column in the airbnb dataframe
airbnb_nights = airbnb["Nights"].describe()
airbnb_nights# Calculate the total amount in the airbnb dataframe
total_amount = airbnb["Amount"].sum()
# Calculate the total nights in the airbnb dataframe
total_nights = airbnb["Nights"].sum()
total_amount, total_nights# Divide guests into groups based on the amount they spent
airbnb['Amount Spent €'] = pd.cut(airbnb['Amount'], bins=[200,400,600,800,1000, np.inf], labels=['0-200', '200-400', '400-600', '600-800', '1000+'])
# Count the number of guests in each group
guests_per_group = pd.DataFrame(airbnb['Amount Spent €'].value_counts())
# Rename the index column to "Amount Spent" and the second column to "Guests"
guests_per_group.columns = ["Guests number"]
guests_per_group.index.name = "Amount Spent €"
# Display the number of guests in each group
print(guests_per_group)
# Create a bar plot to visualize the number of guests in each group
plt.figure(figsize=(10, 6))
sns.barplot(x=guests_per_group.index, y=guests_per_group["Guests number"])
plt.xlabel("Amount Spent €")
plt.ylabel("Number of Guests")
plt.title("Number of Guests in Each Amount Spent Group")
plt.show()# Plotting a histogram of the 'Amount' column
plt.figure(figsize=(10, 6))
plt.hist(airbnb['Amount'], bins=5, color='mediumseagreen')
plt.xlabel('Amount')
plt.ylabel('Frequency')
# Add a vertical line for the mean value of the 'Amount' column
plt.axvline(airbnb['Amount'].mean(), color='red', linestyle='dashed', linewidth=2)
# Add a vertical line for the median value of the 'Amount' column
plt.axvline(airbnb['Amount'].median(), color='blue', linestyle='dashed', linewidth=2)
# Add a legend for the mean and median lines
plt.legend(['Mean', 'Median'])
# Display the plot
plt.show()- Most guests spent 400 or less € during their stay
# Group the data by guest and calculate the sum of amount
guest_data = airbnb.groupby('Guest').agg({'Amount': 'sum', 'Nights': 'sum'})
# Convert the guest names to integers
guest_data.index = guest_data.index.astype(str)
# Create a figure and axis
fig, ax1 = plt.subplots(figsize=(20, 10))
# Plot the guest data as a bar chart with explicit x-axis ticks starting from 0
x_ticks = range(len(guest_data))
ax1.bar(x_ticks, guest_data['Amount'], color='mediumseagreen', tick_label=guest_data.index)
ax1.set_ylabel('Amount')
# Set the x-axis labels to be rotated
ax1.set_xticks(x_ticks)
ax1.set_xticklabels(guest_data.index, rotation=60, ha="right")
# Create a second y-axis for nights stayed
ax2 = ax1.twinx()
ax2.plot(x_ticks, guest_data['Nights'], marker='o', color='red')
ax2.set_ylabel('Nights')
# Set the title
plt.title('Amount Spent and Nights Stayed by Guest')
# Adjust the layout to prevent overlapping labels
plt.tight_layout()
plt.show()The guest that spent the most money:
# Find the guest who spent the most money
most_spent_guest = guest_data['Amount'].idxmax()
# Get the amount spent by the guest
most_spent_amount = guest_data.loc[most_spent_guest, 'Amount']
most_spent_guest, most_spent_amountThe guest that spent the least money: