Buenos_Aires_Real_Estate: Predicting Price with Size, Location, and Neighborhood

# Imported libraries
import glob
import pandas as pd
import seaborn as sns

from sklearn.preprocessing import OneHotEncoder
from ipywidgets import Dropdown, FloatSlider, IntSlider, interact
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression, Ridge 
from sklearn.metrics import mean_absolute_error
from sklearn.pipeline import make_pipeline

Import data

Task 2.1.1: Write a function named wrangle that takes a file path as an argument and returns a DataFrame.

def wrangle(filepath):
    # Read CSV file
    df = pd.read_csv(filepath)

    # Subset data: Apartments in "Capital Federal", less than 400,000
    mask_ba = df["place_with_parent_names"].str.contains("Capital Federal")
    mask_apt = df["property_type"] == "apartment"
    mask_price = df["price_aprox_usd"] < 400_000
    df = df[mask_ba & mask_apt & mask_price]

    # Subset data: Remove outliers for "surface_covered_in_m2"
    low, high = df["surface_covered_in_m2"].quantile([0.1, 0.9])
    mask_area = df["surface_covered_in_m2"].between(low, high)
    df = df[mask_area]

    # Split "lat-lon" column
    df[["lat", "lon"]] = df["lat-lon"].str.split(",", expand=True).astype(float)
    df.drop(columns="lat-lon", inplace=True)

    # Get place name
    df["neighborhood"] = df["place_with_parent_names"].str.split("|", expand=True)[3]
    df.drop(columns="place_with_parent_names", inplace=True)

    # Drop features with high NAN
    df.drop(columns=["floor", "expenses"], inplace=True)
    
    # Drop low- and high-cardinality categorical variables
    df.drop(columns=["operation", "property_type", "currency", "properati_url"], inplace=True)
    
    # Drop leaky columns
    df.drop(columns=['price',
                     'price_aprox_local_currency',
                     'price_per_m2',
                     'price_usd_per_m2'], 
            inplace=True)
    
    # Drop columns with multicollinearity
    df.drop(columns=["surface_total_in_m2", "rooms"], inplace=True)
    return df

Use glob to create a list that contains the filenames for all the Buenos Aires real estate CSV files in the directory. Assign this list to the variable name files.

files = glob.glob("buenos-aires-real-estate-*.csv")
files

Use your wrangle function in a for loop to create a list named frames. The list should contain the cleaned DataFrames created from the CSV filenames you collected in files.

frames = []

for file in files:
    df = wrangle(file)
    #print(df.shape)
    frames.append(df)

Use pd.concat to concatenate the items in frames into a single DataFrame df. Make sure you set the ignore_index argument to True.

df = pd.concat(frames, ignore_index=True)
print(df.info())
df.head()

Explore

Calculate the number of unique values for each non-numeric feature in df.

df.select_dtypes("object").nunique()

Plot a correlation heatmap of the remaining numerical features in df. Since "price_aprox_usd" will be your target, you don't need to include it in your heatmap.
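The notebook stops before this step's code; here is a minimal sketch of one common approach with seaborn. The small DataFrame below is an illustrative stand-in for the wrangled df built above (the real one comes from wrangle and pd.concat):

```python
import pandas as pd
import seaborn as sns

# Illustrative stand-in for the wrangled DataFrame (the real df is built above)
df = pd.DataFrame({
    "price_aprox_usd": [100_000, 150_000, 200_000, 120_000],
    "surface_covered_in_m2": [40.0, 55.0, 70.0, 45.0],
    "lat": [-34.60, -34.61, -34.62, -34.59],
    "lon": [-58.38, -58.40, -58.42, -58.37],
})

# Correlation among numeric features, excluding the target "price_aprox_usd"
corr = df.select_dtypes("number").drop(columns="price_aprox_usd").corr()

# Render the heatmap; annot=True prints each coefficient in its cell
sns.heatmap(corr, annot=True, cmap="coolwarm")
```

Dropping the target before calling .corr() keeps the heatmap focused on feature-to-feature relationships, which is what matters when checking for multicollinearity.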