Buenos-Aires_Real_Estate: Predicting Price with Size
# Imported libraries
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.utils.validation import check_is_fitted
Import data
df = pd.read_csv("buenos-aires-real-estate-1.csv")
df.head()
Task 2.1.1: Write a function named wrangle that takes a file path as an argument and returns a DataFrame.
For this project, we want to build a model for apartments in Buenos Aires proper ("Capital Federal") that cost less than $400,000. Looking at the first five rows of our DataFrame, we can already see that there are properties that fall outside those parameters. So our first cleaning task is to remove those observations from our dataset. Since we're using a function to import and clean our data, we'll need to make changes there.
def wrangle(filepath):
    # Read CSV file into DataFrame
    df = pd.read_csv(filepath)
    # Subset to properties in "Capital Federal"
    mask_ba = df["place_with_parent_names"].str.contains("Capital Federal")
    # Subset to apartments
    mask_apt = df["property_type"] == "apartment"
    # Subset to properties where "price_aprox_usd" < 400,000
    mask_price = df["price_aprox_usd"] < 400_000
    # Apply all three masks
    df = df[mask_ba & mask_apt & mask_price]
    # Remove outliers by "surface_covered_in_m2": keep the 10th to 90th percentile
    low, high = df["surface_covered_in_m2"].quantile([0.1, 0.9])
    mask_area = df["surface_covered_in_m2"].between(low, high)
    df = df[mask_area]
    return df
Task 2.1.2: Use your wrangle function to create a DataFrame df from the CSV file data/buenos-aires-real-estate-1.csv.
df = wrangle("buenos-aires-real-estate-1.csv")
print("df shape:", df.shape)
df.head()
Task 2.1.4: Create a histogram of "surface_covered_in_m2". Make sure that the x-axis has the label "Area [sq meters]" and the plot has the title "Distribution of Apartment Sizes".
plt.hist(df["surface_covered_in_m2"])
plt.xlabel("Area [sq meters]")
plt.title("Distribution of Apartment Sizes")
plt.show();
Yikes! When you see a histogram like the one above, it suggests that there are outliers in your dataset. This can affect model performance, especially in the sorts of linear models we'll learn about in this project. To confirm, let's look at the summary statistics for the "surface_covered_in_m2" feature.
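As a rough illustration of how summary statistics can confirm what the histogram suggests, the sketch below uses a small made-up Series (the values are hypothetical, not taken from the Buenos Aires dataset). It compares the mean with the median and then trims to the 10th through 90th percentiles, the same approach the wrangle function takes.

```python
import pandas as pd

# Hypothetical apartment areas in square meters; the last two values are
# made-up outliers, not rows from the Buenos Aires dataset.
areas = pd.Series(
    [35, 40, 42, 45, 50, 55, 60, 65, 70, 80, 2500, 6000],
    name="surface_covered_in_m2",
)

stats = areas.describe()
print(stats)

# A mean far above the median (the "50%" row) hints that a few extreme
# values are pulling the distribution to the right.
print("mean > median:", stats["mean"] > stats["50%"])

# Keep only values between the 10th and 90th percentiles, mirroring the
# trimming step in wrangle.
low, high = areas.quantile([0.1, 0.9])
trimmed = areas[areas.between(low, high)]
print("rows before:", len(areas), "rows after:", len(trimmed))
```

Note that quantile trimming removes values from both tails, so some small but legitimate apartments are dropped along with the extreme ones.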
Task 2.1.5: Calculate the summary statistics for df using the describe method.
# Summary Statistics
df["surface_covered_in_m2"].describe()
Task 2.1.7: Create a scatter plot that shows price ("price_aprox_usd") vs area ("surface_covered_in_m2") in our dataset. Make sure to label your x-axis "Area [sq meters]" and your y-axis "Price [USD]".
plt.scatter(x=df["surface_covered_in_m2"], y=df["price_aprox_usd"])
plt.xlabel("Area [sq meters]")
plt.ylabel("Price [USD]")
plt.title("Buenos Aires: Price vs Area")
plt.show();
Split
A key part in any model-building project is separating your target (the thing you want to predict) from your features (the information your model will use to make its predictions). Since this is our first model, we'll use just one feature: apartment size.
Task 2.1.8: Create the feature matrix named X_train, which you'll use to train your model. It should contain one feature only: ["surface_covered_in_m2"]. Remember that your feature matrix should always be two-dimensional.
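One possible way to build a two-dimensional feature matrix is to select the column with a list of names rather than a single name. The sketch below uses a tiny hypothetical DataFrame in place of the wrangled data (the values are invented for illustration) to show the difference in shape.

```python
import pandas as pd

# Hypothetical stand-in for the wrangled DataFrame; values are invented.
df = pd.DataFrame(
    {
        "surface_covered_in_m2": [40.0, 55.0, 70.0],
        "price_aprox_usd": [90_000.0, 120_000.0, 150_000.0],
    }
)

features = ["surface_covered_in_m2"]

# Selecting with a list of column names returns a DataFrame, which is 2-D.
X_train = df[features]
print(X_train.shape)

# Selecting with a single column name returns a Series, which is 1-D.
as_series = df["surface_covered_in_m2"]
print(as_series.shape)
```

scikit-learn estimators expect the feature matrix to have shape (n_samples, n_features), which is why the list-based selection is the safer habit even with a single feature.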