House Price Prediction Using Linear Regression and Random Forest
This project builds a machine learning model to predict house prices from features such as square meters and number of bedrooms. A synthetic dataset is generated to simulate real-world housing data. Two regression models, Linear Regression and Random Forest, are implemented and compared, with performance evaluated using Root Mean Squared Error (RMSE).
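For reference, RMSE is the square root of the average squared prediction error, RMSE = sqrt(mean((y_actual - y_predicted)^2)), so it is reported in the same units as the target (dollars here).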
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
# Create synthetic dataset
np.random.seed(0)
n = 200
square_meters = np.random.normal(100, 15, n)
num_bedrooms = np.random.randint(1, 5, n)
# True generating rule: $3,000 per square meter + $10,000 per bedroom + Gaussian noise
price = 3000 * square_meters + 10000 * num_bedrooms + np.random.normal(0, 20000, n)
# Put data in a DataFrame
df = pd.DataFrame({
'square_meters': square_meters,
'num_bedrooms': num_bedrooms,
'price': price
})
# Train-test split
X = df[['square_meters', 'num_bedrooms']]
y = df['price']
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
# Train model
model = LinearRegression()
model.fit(X_train, y_train)
# Evaluate
y_pred = model.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print(f"RMSE: ${rmse:,.0f}")
print("Coefficients:", model.coef_)
print("Intercept:", model.intercept_)
# Plot predictions: red points are over-predictions, blue are under-predictions
colors = np.where(y_pred > y_test, 'red', 'blue')
plt.scatter(y_test, y_pred, c=colors)
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'k--', label='Perfect prediction')
plt.xlabel("Actual Price")
plt.ylabel("Predicted Price")
plt.title("Actual vs Predicted Prices")
plt.legend()
plt.show()
# Random Forest model
from sklearn.ensemble import RandomForestRegressor
rf = RandomForestRegressor(random_state=0)  # fixed seed for reproducible results
rf.fit(X_train, y_train)
rf_pred = rf.predict(X_test)
rf_rmse = np.sqrt(mean_squared_error(y_test, rf_pred))
print("Random Forest RMSE:", rf_rmse)
# Comparing both models
print(f"Linear Regression RMSE: ${rmse:,.0f}")
print(f"Random Forest RMSE: ${rf_rmse:,.0f}")
The Linear Regression model achieved an RMSE of $20,450, slightly better than the Random Forest model at $20,734. This edge is expected here, since the synthetic prices were generated by a linear rule plus noise. While both models show similar performance on this dataset, further improvements could come from feature engineering, hyperparameter tuning, or a larger dataset.
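As an illustration of the hyperparameter tuning mentioned above, here is a minimal sketch using scikit-learn's GridSearchCV; the parameter grid values are assumptions, not tuned choices:
# Hyperparameter tuning sketch (illustrative grid; values are assumptions)
from sklearn.model_selection import GridSearchCV
param_grid = {
    'n_estimators': [100, 300],   # number of trees
    'max_depth': [None, 5, 10],   # depth limit per tree
}
search = GridSearchCV(RandomForestRegressor(random_state=0), param_grid,
                      scoring='neg_root_mean_squared_error', cv=5)
search.fit(X_train, y_train)
print("Best parameters:", search.best_params_)
tuned_rmse = np.sqrt(mean_squared_error(y_test, search.best_estimator_.predict(X_test)))
print(f"Tuned Random Forest RMSE: ${tuned_rmse:,.0f}")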