Buenos_Aires_Real_Estate: Predicting Price with Size, Location, and Neighborhood
# Imported libraries
import glob
import pandas as pd
import seaborn as sns
from sklearn.preprocessing import OneHotEncoder
from ipywidgets import Dropdown, FloatSlider, IntSlider, interact
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_absolute_error
from sklearn.pipeline import make_pipeline

Import data
Task 2.1.1: Write a function named wrangle that takes a file path as an argument and returns a DataFrame.
def wrangle(filepath):
    # Read CSV file
    df = pd.read_csv(filepath)

    # Subset data: apartments in "Capital Federal" priced under $400,000
    mask_ba = df["place_with_parent_names"].str.contains("Capital Federal")
    mask_apt = df["property_type"] == "apartment"
    mask_price = df["price_aprox_usd"] < 400_000
    df = df[mask_ba & mask_apt & mask_price]

    # Subset data: remove "surface_covered_in_m2" outliers outside the 10th–90th percentiles
    low, high = df["surface_covered_in_m2"].quantile([0.1, 0.9])
    mask_area = df["surface_covered_in_m2"].between(low, high)
    df = df[mask_area]

    # Split "lat-lon" column into separate "lat" and "lon" columns
    df[["lat", "lon"]] = df["lat-lon"].str.split(",", expand=True).astype(float)
    df.drop(columns="lat-lon", inplace=True)

    # Extract neighborhood name from "place_with_parent_names"
    df["neighborhood"] = df["place_with_parent_names"].str.split("|", expand=True)[3]
    df.drop(columns="place_with_parent_names", inplace=True)

    # Drop features with a high proportion of NaN values
    df.drop(columns=["floor", "expenses"], inplace=True)

    # Drop low- and high-cardinality categorical variables
    df.drop(columns=["operation", "property_type", "currency", "properati_url"], inplace=True)

    # Drop leaky columns
    df.drop(
        columns=[
            "price",
            "price_aprox_local_currency",
            "price_per_m2",
            "price_usd_per_m2",
        ],
        inplace=True,
    )

    # Drop columns with multicollinearity
    df.drop(columns=["surface_total_in_m2", "rooms"], inplace=True)

    return df

Use glob to create a list that contains the filenames for all the Buenos Aires real estate CSV files in the directory. Assign this list to the variable name files.
files = glob.glob("buenos-aires-real-estate-*.csv")
files

Use your wrangle function in a for loop to create a list named frames. The list should contain the cleaned DataFrames created from the CSV filenames you collected in files.
frames = []
for file in files:
    df = wrangle(file)
    # print(df.shape)
    frames.append(df)

Use pd.concat to concatenate the items in frames into a single DataFrame df. Make sure you set the ignore_index argument to True.
df = pd.concat(frames, ignore_index=True)
print(df.info())
df.head()

Explore
Calculate the number of unique values for each non-numeric feature in df
df.select_dtypes("object").nunique()

Plot a correlation heatmap of the remaining numerical features in df. Since "price_aprox_usd" will be your target, you don't need to include it in your heatmap.
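The code cell for this task did not survive the export. A minimal sketch of one way to do it, assuming df is the wrangled DataFrame from the steps above (a small synthetic DataFrame with the same numeric columns stands in for it here):

```python
import pandas as pd
import seaborn as sns

# Synthetic stand-in for the wrangled df (illustrative values only)
df = pd.DataFrame({
    "price_aprox_usd": [120_000, 95_000, 210_000, 150_000],
    "surface_covered_in_m2": [50, 38, 90, 62],
    "lat": [-34.60, -34.61, -34.58, -34.59],
    "lon": [-58.38, -58.40, -58.42, -58.37],
})

# Keep numeric features, drop the target, and compute pairwise correlations
corr = df.select_dtypes("number").drop(columns="price_aprox_usd").corr()

# Render the correlation matrix as a heatmap
sns.heatmap(corr)
```

Dropping "price_aprox_usd" before calling .corr() keeps the heatmap focused on feature-to-feature relationships, which is what you inspect when checking for multicollinearity.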