Seminar Data Science for Economics Final Assignment Combined with Kaggle Competition: House Prices - Advanced Regression Techniques
A Comprehensive Introductory Study of Machine Learning Methods and Applications
Weiyuan Liu, MSc Economics (2022-2023)
Introduction
This report thoroughly investigates a dataset sourced from one of the most renowned online machine learning competition platforms, Kaggle, and explores machine learning techniques, including unsupervised learning, supervised learning, and deep learning, with the overarching goal of achieving precise predictions. The competition, titled "House Prices - Advanced Regression Techniques" (https://www.kaggle.com/competitions/house-prices-advanced-regression-techniques), serves as the foundation for this study. The competition's singular objective is to maximize model accuracy by minimizing the Root-Mean-Squared-Error (RMSE) between the logarithm of the predicted value and the logarithm of the observed sale price (a short code sketch of this metric follows this introduction). This objective aligns with the stated focus of applied predictive modeling: to optimize prediction accuracy (Kuhn & Johnson, 2013, p. 1). The majority of the models covered in our course fall into this category, with the exception of Bayesian models, which extend toward causal analysis; I therefore exclude Bayesian models from this study. This report is structured into three primary sections.
First, the report presents a comprehensive sequence of steps encompassing exploratory data analysis (EDA), rigorous data cleaning, and careful data preparation for model implementation. The EDA primarily focuses on data visualization, presenting the distribution of the target variable, panels of the distributions of the numeric variables, and a correlation diagram. The data cleaning procedures address several critical aspects, including ensuring the uniformity of variables, identifying missing values, and imputing them. Data preparation involves two key transformations: normalization of the numeric variables and one-hot encoding of the categorical variables. Both supervised and unsupervised learning methodologies are already employed at this stage: filling missing numeric values relies on a K-Nearest Neighbors (KNN) model, and the interactions among numeric and categorical variables are illustrated with intuitive t-Distributed Stochastic Neighbor Embedding (t-SNE) plots.
Second, the report presents the construction of an array of supervised learning models, beginning with five basic learners, K-Nearest Neighbors (KNN), Ridge, Lasso, Support Vector Machine (SVM) and Decision Tree, followed by two widely acknowledged and powerful ensemble learning techniques, Random Forest (RF) and Extreme Gradient Boosting (XGBoost). After an initial assessment of model performance, Ridge, RF and XGBoost are selected for hyperparameter tuning to improve their predictive capabilities. The section concludes with an evaluation and comparison of the three selected models, and a detailed examination of the feature importances extracted from XGBoost.
Third, the report introduces a deep learning model, using a Neural Network to further improve predictive accuracy. To conclude, the predictions generated by the selected Ridge, Random Forest, XGBoost, and Neural Network models are submitted to the competition. The final scores, alongside the best position attained on the competition leaderboard, are presented at the end.
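To make the evaluation criterion concrete, the minimal sketch below computes the metric on toy numbers; the function name and the example prices are illustrative and are not part of the competition code.
# A minimal sketch of the competition metric: RMSE between log predictions and log actual prices
# (toy numbers for illustration only)
import numpy as np

def rmse_of_logs(y_true, y_pred):
    return np.sqrt(np.mean((np.log(y_pred) - np.log(y_true)) ** 2))

print(rmse_of_logs(np.array([200000., 150000., 320000.]),
                   np.array([210000., 140000., 300000.])))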
Choice of the Dataset
The assignment suggests assembling a dataset from an economic organization; this report deviates from economic data. I chose this dataset because it is clearly task oriented and its variables are generated by the same underlying data generating process. Deciding which economic variables to collect usually requires a macroeconomic model, and randomly gathering data from websites resembles data dredging, which simply increases the risk of false positives.
Economic data are mostly time series, which I avoid because they introduce several complications. First, the train-test split works differently for time series: more recent data cannot be used to predict the past, so in evaluation procedures such as cross validation the standard approach is an expanding window, in which each successive fold adds the data from the subsequent period to the training set, and this adds complexity. Second, some important economic datasets are small. For example, some countries publish GDP growth rates or unemployment rates on a monthly or quarterly basis; even with a time span of twenty years there are fewer than a hundred records. Moreover, general economic conditions change so swiftly that data from five or ten years ago carry much less information about the present and future (for example, the quantitative easing era displays very different trends than the pre-financial-crisis era). In general, predictive modeling favors more data.
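To illustrate the expanding-window procedure described above, the sketch below uses scikit-learn's TimeSeriesSplit on a made-up series of eighty quarterly observations; the data and the number of splits are purely illustrative and are not used elsewhere in this report.
# A minimal sketch of expanding-window cross validation for time series (illustrative only)
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

quarterly_series = np.arange(80).reshape(-1, 1)  # e.g. twenty years of quarterly records
for train_idx, test_idx in TimeSeriesSplit(n_splits=5).split(quarterly_series):
    # each successive fold appends the next block of periods to the training window
    print('train: {}-{}, test: {}-{}'.format(train_idx.min(), train_idx.max(), test_idx.min(), test_idx.max()))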
Moreover, time series exhibit correlations but not necessarily causation, and investigating the difference usually requires considerably more time. Quoting an example from The Effect: An Introduction to Research Design and Causality (Huntington-Klein, 2021, chapter 6.1), more people wear shorts when people eat more ice cream, but the first phenomenon does not cause the second. This complexity would make the analysis much harder and raise questions about the predictive power of the features. For example, the Phillips curve suggests a relationship between unemployment rates and inflation rates, yet the correlations and causal relationships underlying this proposition have remained subjects of persistent debate. Each point mentioned above truly requires a substantial amount of statistical, econometric and programming work to check and validate (that is economists' job, anyway). To circumvent those issues while still showcasing my programming skills and understanding of machine learning and deep learning, the chosen dataset is a very good fit. Other good choices would be datasets from the UC Irvine repository, MNIST, etc.
# Interacting with an API is a common way for data scientists to obtain data. However, downloading requires a token (a json file) with account information, which has already expired. The datasets are provided when submitting the assignment.
# !pip install kaggle
# !mv kaggle.json ~/.kaggle/
# !chmod 600 /home/repl/.kaggle/kaggle.json
# !~/.local/bin/kaggle competitions download -c house-prices-advanced-regression-techniques
# !unzip house-prices-advanced-regression-techniques.zip
# Search available files in the directory and display file names
import os
for dirname, _, filenames in os.walk('.'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
# This report uses scikit-learn and xgboost for preprocessing, the train-test split, and building machine learning models; Keras for deep learning; and seaborn and plotly for visualization
# !pip install plotly
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer, SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.manifold import TSNE
from sklearn.model_selection import train_test_split, cross_val_score, KFold
from sklearn.metrics import mean_squared_error
from sklearn.neighbors import KNeighborsRegressor
from sklearn.linear_model import Ridge, Lasso
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor
from xgboost import XGBRegressor
from sklearn.ensemble import RandomForestRegressor
from tensorflow.keras.layers import Dense
from tensorflow.keras.models import Sequential
import tensorflow as tf
import plotly.express as px
# Macros for plot
%matplotlib inline
sns.set_theme(style = 'darkgrid',palette='bright')
Data
There are two datasets. The training dataset has features and a target, while the test dataset only has features. The competition requires building models on the training dataset, making predictions on the test dataset, and submitting those predictions. They are compared with the unseen target values on the server, and that is how the scores are calculated (a brief sketch of the submission format follows the data loading below).
Features are house characteristics such as area in square feet, house type, house condition, etc. The target variable is the sale price, and the models should use the house characteristics to produce accurate predictions of it. The datasets consist of two types of data, numeric data and string data. Numeric data include continuous values such as area, and discrete values such as ratings and years. The string data are essentially categorical variables.
The training dataset has shape 1460 by 80, and the test dataset has shape 1459 by 79 (because it lacks the target variable). This is a workable size on the university's 16GB RAM and 8 vCPU cloud computing platform, although some programs, such as simulations, are still very slow. Larger datasets could be handled with PySpark, but that is beyond the scope of this report and the assignment.
train = pd.read_csv('./train.csv', index_col='Id')
test = pd.read_csv('./test.csv',index_col='Id')
print(train.info())
print(test.info())
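As a preview of the submission format mentioned above, the lines below sketch how a submission file would be assembled once a model is fitted; fitted_model and test_processed are placeholders for objects built in later sections, so the code is left commented out in the same fashion as the API commands earlier.
# Sketch of the submission format (placeholders only; models and preprocessing come later):
# preds_log = fitted_model.predict(test_processed)  # hypothetical fitted model and preprocessed test features
# submission = pd.DataFrame({'Id': test.index, 'SalePrice': np.exp(preds_log)})  # undo the log transform of the target
# submission.to_csv('submission.csv', index=False)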
Exploratory Data Analysis and Data Preparation
As Andrew Ng puts it, "Coming up with features is difficult, time-consuming, and requires expert knowledge. 'Applied machine learning' is basically feature engineering". Prevailing views on the internet suggest data scientists spend 80% of their time on data preparation, and I must confess that is the case in this report: I roughly estimate I spent more than 50% of my time on data visualization (part of the EDA) and on cleaning and shaping the data, the latter also being called preprocessing or feature engineering. This process is indispensable since it alleviates issues such as the curse of dimensionality, and it turns out to be the most crucial part for improving model performance. Readers may find there are only a few code blocks, but they have been cleaned, tidied and condensed to present efficient Python code. The final results based on the current feature engineering scheme are better than those of earlier iterations; however, there is still much room for improvement. I saw a group that refined another group's feature extraction techniques achieve a better score than my best, even with their very first model, a simple Random Forest with default parameter settings. It should be clearly understood, though, that feature engineering requires a lot of effort and task-specific knowledge and is characterized by diminishing returns. Given the scope of this assignment, this criterion is fulfilled more than satisfactorily.
In my view, data visualization is very important for data scientists. I also spent a lot of time creating illustrative and good-looking graphs.
# All columns until the last one are features (X variables). The last column is the target (y variable).
feature = train.iloc[:,:-1]
target = train.iloc[:,-1]
target = np.log(target) # log-transform the target so model errors correspond to the competition metric
Combined_data = pd.concat([feature,test],axis = 0) # I combine the two datasets so I can preprocess them simultaneously
# Visualization of the target variable: a histogram and a kernel density estimate
fig,ax = plt.subplots(figsize = (4,4))
sns.histplot(x=target,bins = 20,ax= ax,kde=True).set(ylabel = None, xlabel = 'Log SalePrice')
ax.lines[0].set_color('navy')
plt.show()
# Using pandas methods to select numeric variables, string variables and missing values
numeric = Combined_data.select_dtypes('number').columns
print('Number of numeric columns is {}'.format(numeric.shape[0]),'\n')
missing_num = Combined_data[numeric].isna().sum()
print('There are {} columns with missing values. Columns with missing values and number of missing values are:\n{}'.format(len(missing_num[missing_num > 0]),missing_num[missing_num > 0]),'\n')
missing_num_names = missing_num[missing_num > 0].keys()
strings = Combined_data.select_dtypes('object').columns
print('Number of string columns is {}'.format(strings.shape[0]),'\n')
missing_str = Combined_data[strings].isna().sum()
print('There are {} columns with missing values. Columns with missing values and number of missing values are:\n{}'.format(len(missing_str[missing_str > 0]),missing_str[missing_str > 0]))
# Here I define a function to plot the panel (4 by 9 histograms) below, because I need to plot this panel twice
def EDA_4by9_graph(features):
    fig, axes = plt.subplots(4,9,figsize=(24,8))
    axes = axes.flatten() # Flatten the multidimensional array of axes objects into one dimension. Explaining this would also require explaining the fig and axes objects above (which differ from the simple plotting procedure taught in the lecture) and, in turn, object-oriented programming in Python; I assume readers (graders) understand it
    for i, value in enumerate(numeric):
        ax = sns.histplot(x=features[value], ax=axes[i], bins=20, kde=True)
        ax.set_ylabel(None)
        ax.lines[0].set_color('navy')
    fig.text(0.0,0.5,'Count',rotation='vertical',horizontalalignment='center',verticalalignment='center',fontsize = 15)
    return fig, axes
# EDA
fig, axes = EDA_4by9_graph(Combined_data)
plt.tight_layout()
# This is a heatmap showing the correlations among numeric variables. It is a fundamental preprocessing step, as it prepares for correlation-based analysis such as principal component analysis (PCA). Decorrelation is also an important part of feature engineering; I don't implement it in the pipeline here because it is also time consuming, but a brief sketch follows the heatmap.
corr = Combined_data[numeric].corr()
mask = np.triu(corr)
fig2,ax2 = plt.subplots(figsize=(24, 8))
heat = sns.heatmap(corr,annot=False,cmap = 'Blues',mask=mask, xticklabels=True, yticklabels=True,ax = ax2).set(xlabel = 'Numeric features', ylabel = 'Numeric features', title = 'Correlation Between Numeric Features')
plt.tight_layout()
plt.show()
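As noted in the comment above, decorrelation is not implemented in the main pipeline; the sketch below shows one way PCA could be used to decorrelate the standardized numeric columns. The mean imputation and the 95% variance threshold are arbitrary choices for illustration only.
# A minimal sketch of PCA-based decorrelation (not part of the main pipeline)
from sklearn.decomposition import PCA

numeric_filled = Combined_data[numeric].fillna(Combined_data[numeric].mean())  # crude mean imputation just so PCA can run
numeric_scaled = StandardScaler().fit_transform(numeric_filled)
pca = PCA(n_components=0.95)  # keep enough components to explain 95% of the variance
components = pca.fit_transform(numeric_scaled)
print('Decorrelated components retained:', components.shape[1])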