
DataCamp Certification Case Study

Project Brief

You have been hired by Inn the Neighborhood, an online platform that allows people to rent out their properties for short stays. Currently, the webpage for renters has a conversion rate of 2%. This means that most people leave the platform without signing up.

The product manager would like to increase this conversion rate. They are interested in developing an application to help people estimate the money they could earn renting out their living space. They hope that this would make people more likely to sign up.

The company has provided you with a dataset that includes details about each property rented, as well as the price charged per night. They want to avoid estimating prices that are more than 25 dollars off of the actual price, as this may discourage people.

You will need to present your findings in two formats:

  • You must submit a written report summarising your analysis to your manager. As a data science manager, your manager has a strong technical background and wants to understand what you have done and why.
  • You will then need to share your findings with the product manager in a 10 minute presentation. The product manager has no data science background but is familiar with basic data related terminology.

The data you will use for this analysis can be accessed here: "data/rentals.csv"
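The 25 dollar tolerance in the brief can be turned into an evaluation metric for whichever model is eventually chosen. The sketch below is one possible reading of that requirement, not part of the original analysis: the function name `share_within_tolerance` and the sample numbers are made up for illustration.

```python
import numpy as np

def share_within_tolerance(y_true, y_pred, tolerance=25.0):
    """Return the fraction of predictions whose absolute error is at most `tolerance`."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred) <= tolerance))

# Hypothetical nightly prices and model estimates (not from rentals.csv)
actual = [100, 150, 80, 300]
predicted = [110, 190, 75, 280]

# Absolute errors are 10, 40, 5 and 20, so three of four estimates are within 25 dollars
print(share_within_tolerance(actual, predicted))
```

A metric like this speaks directly to the product manager's concern, whereas a standard regression metric such as RMSE only summarises the average size of the errors.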

Initial Exploration

#Importing libraries and the data
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

rentals_raw = pd.read_csv("data/rentals.csv")

rentals_raw.head()

#Feature exploration
print("Property Types:",rentals_raw["property_type"].unique())
print("Room Types:",rentals_raw["room_type"].unique())
print("Statistics:\n", rentals_raw[["bathrooms","bedrooms","minimum_nights"]].describe(percentiles=[.5]))

The id column appears to hold a unique identifier for each property. Latitude and longitude are continuous features that hold the coordinates of the property. Property type and room type are categorical features. Bathrooms and bedrooms are numeric features, but because they take only a small number of distinct values, they could also be treated as categorical. Both contain NaNs. Given the context, a missing count is assumed to mean zero, so NaNs will be converted to zeros. Minimum nights is similar; however, its maximum value suggests that there are outliers.
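Before filling missing values, it can help to check how many there are in each column. The following is a minimal sketch on a made-up toy frame (the values are not from rentals.csv); it also shows how the fill could be restricted to the columns where zero is a sensible default.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the rentals data; the values are invented for illustration
toy = pd.DataFrame({
    "bathrooms": [1.0, np.nan, 2.0, 1.5],
    "bedrooms": [1.0, 0.0, np.nan, np.nan],
    "minimum_nights": [1, 2, 30, 1000],
})

# Count missing values per column before deciding how to handle them
missing_counts = toy.isna().sum()
print(missing_counts)

# Fill only the columns where a missing value is assumed to mean "none"
filled = toy.fillna({"bathrooms": 0, "bedrooms": 0})
print(filled)
```

Passing a dictionary to `fillna` leaves other columns untouched, which avoids silently writing zeros into a column such as property type.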

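#Outliers
#Drop every row whose coordinates appear more than once (keep=False flags all copies)
coord_mask = rentals_raw.duplicated(subset=["latitude", "longitude"], keep=False)
rentals_raw = rentals_raw[~coord_mask]

#Drop listings whose minimum stay exceeds one year
rentals_raw = rentals_raw[rentals_raw["minimum_nights"] <= 365]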
rentals_raw.info()

There were four instances of duplicate coordinates, and three instances of minimum nights above one year. These are removed from the data as outliers.
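The effect of `keep=False` is easy to miss: it flags every copy of a duplicated coordinate pair, not just the later ones. A small sketch on invented rows (the ids and coordinates are hypothetical) illustrates both filters:

```python
import pandas as pd

# Invented rows: ids 1 and 2 share coordinates, id 4 has an implausible minimum stay
toy = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "latitude": [37.77, 37.77, 37.80, 37.75],
    "longitude": [-122.42, -122.42, -122.41, -122.45],
    "minimum_nights": [2, 3, 1, 1000],
})

# keep=False marks all rows that share coordinates with another row
dup_mask = toy.duplicated(subset=["latitude", "longitude"], keep=False)

# Minimum stays longer than one year are treated as outliers
long_stay_mask = toy["minimum_nights"] > 365

cleaned = toy[~dup_mask & ~long_stay_mask]
print(cleaned)
```

With the default `keep="first"`, the first listing of each duplicated pair would have been kept instead of removed.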

#Formatting
rentals_raw.fillna(0,inplace=True)

rentals_raw["price"] = rentals_raw["price"].str.strip("$").str.replace(",", "").astype(float)
rentals = rentals_raw.set_index("id")
print(rentals.shape)

The index is set to id to facilitate row identification. Missing values are filled with zeros, and the price column is converted from strings to floats. The rentals dataframe will be used from now on. There are 7 features, 1 target variable, and 8100 instances.
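The price conversion can be checked in isolation. The sketch below assumes the raw prices are strings with a leading dollar sign and comma thousands separators, which is what the cleaning code implies; the sample values are made up.

```python
import pandas as pd

# Hypothetical raw price strings in the assumed "$1,250.00" format
raw_prices = pd.Series(["$99.00", "$1,250.00", "$45.50"])

# Remove dollar signs and thousands separators in one pass, then cast to float
prices = raw_prices.str.replace(r"[$,]", "", regex=True).astype(float)
print(prices.tolist())
```

Using the built-in `float` rather than `np.float` matters here, because the `np.float` alias was removed in recent NumPy releases.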

Exploratory Data Analysis

The goal is to create a model that can predict price from the 7 features. Price is continuous, so a regression model will be used. Before modeling, we look at the distributions of the features and their relationships with the target variable. The plots are restricted to properties priced below $1500, because a few listings have extreme prices.

#Continuous Features
plt.rcParams["figure.figsize"] = (20,10)
sns.set_palette("colorblind")

fig, ax = plt.subplots(1,2)
sns.boxplot(x="bedrooms", y="price", data=rentals[rentals["price"]<1500], ax = ax[0])
ax[0].set(xlabel='Bedrooms', ylabel='Price', title="Prices based on Bedroom Number")

sns.boxplot(x="bathrooms", y="price", data=rentals[rentals["price"]<1500], ax = ax[1])
ax[1].set(xlabel='Bathrooms', ylabel='Price', title="Prices based on Bathroom Number")
plt.show()

Both the number of bedrooms and the number of bathrooms appear to be positively correlated with price, albeit with considerable variance. One hypothesis is that more extravagant properties have more bedrooms and bathrooms, and that more extravagant properties are pricier. Roughly linear relationships of this kind can be captured well by a linear regression. It is worth noting that properties with one or two bedrooms or bathrooms show more outliers than the rest. This is most likely a consequence of how many properties fall into these groups: the more listings a group contains, the more likely it is to include extreme prices.
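The visual impression from the boxplots can be backed up with a few numbers. The following is a rough sketch on invented data, not a result from rentals.csv: it computes the median price per bedroom count, the Pearson correlation, and a least-squares line.

```python
import numpy as np
import pandas as pd

# Invented bedroom counts and prices, for illustration only
toy = pd.DataFrame({
    "bedrooms": [0, 1, 1, 2, 2, 3, 3, 4],
    "price": [80, 100, 120, 150, 170, 210, 230, 300],
})

# Median price for each bedroom count, mirroring the centre line of each box
median_by_bedrooms = toy.groupby("bedrooms")["price"].median()
print(median_by_bedrooms)

# Pearson correlation measures the strength of the linear relationship
pearson_r = toy["bedrooms"].corr(toy["price"])
print(pearson_r)

# Degree-1 polynomial fit: slope is the estimated price change per extra bedroom
slope, intercept = np.polyfit(toy["bedrooms"], toy["price"], deg=1)
print(slope, intercept)
```

On the real data, the same three quantities would indicate whether a straight line is a reasonable first approximation before fitting a full model.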

#Categorical Features
plt.rcParams["figure.figsize"] = (20,10)
sns.set_palette("colorblind")

fig, ax = plt.subplots(1,2)
sns.boxplot(y="property_type", x="price", data=rentals[rentals["price"]<1500], ax = ax[0],
            order = rentals[rentals["price"]<1500].groupby("property_type")["price"].median().sort_values().index)
ax[0].set(xlabel='Prices', ylabel='Property Type', title="Prices based on Property Type")

sns.boxplot(x="room_type", y="price", data=rentals[rentals["price"]<1500], ax = ax[1],
            order = rentals[rentals["price"]<1500].groupby("room_type")["price"].median().sort_values().index)
ax[1].set(xlabel='Room Type', ylabel='Prices', title="Prices based on Room Type")
plt.show()

These categorical features have no inherent order. The property type and room type boxplots are ordered top to bottom and left to right, respectively, by ascending median price. We see a similar trend as above: a less extravagant RV or hostel is cheaper than a resort or hotel. However, the large number of property types might make these trends difficult to capture with a linear regression. A decision tree is an alternative model.
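One reason many property types are awkward for a linear regression is that unordered categories are usually one-hot encoded, which adds one column per category. The sketch below uses invented category names (they may not match the labels in rentals.csv) to show how quickly the feature count grows.

```python
import pandas as pd

# Invented listings; category labels are placeholders, not taken from the dataset
toy = pd.DataFrame({
    "property_type": ["Apartment", "House", "Hostel", "Apartment", "Resort"],
    "room_type": ["Entire home/apt", "Private room", "Shared room",
                  "Private room", "Entire home/apt"],
    "bedrooms": [1, 2, 1, 1, 3],
})

# One indicator column per category; numeric columns pass through unchanged
encoded = pd.get_dummies(toy, columns=["property_type", "room_type"])
print(encoded.columns.tolist())
print(encoded.shape)
```

Here three original columns become eight. A linear model then needs a separate coefficient for every rare property type, whereas a decision tree only splits on the categories that actually help separate prices.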