Predicting concrete strength with linear regression, my old friend

Can you predict the strength of concrete?

📖 Background

You work in the civil engineering department of a major university. You are part of a project testing the strength of concrete samples.

Concrete is the most widely used building material in the world. It is a mix of cement and water with gravel and sand. It can also include other materials like fly ash, blast furnace slag, and additives.

The compressive strength of concrete depends on its components and its age, so your team is testing different combinations of ingredients at different time intervals.

The project leader asked you to find a simple way to estimate strength so that students can predict how a particular sample is expected to perform.

💾 The data

The team has already tested more than a thousand samples:

Compressive strength data:

"cement" - Portland cement in kg/m³
"slag" - Blast furnace slag in kg/m³
"fly_ash" - Fly ash in kg/m³
"water" - Water in liters/m³
"superplasticizer" - Superplasticizer additive in kg/m³
"coarse_aggregate" - Coarse aggregate (gravel) in kg/m³
"fine_aggregate" - Fine aggregate (sand) in kg/m³
"age" - Age of the sample in days
"strength" - Concrete compressive strength in megapascals (MPa)

Acknowledgments: I-Cheng Yeh, "Modeling of strength of high-performance concrete using artificial neural networks," Cement and Concrete Research, Vol. 28, No. 12, pp. 1797-1808 (1998).

💪 Challenge

Provide your project leader with a formula that estimates the compressive strength. Include:

The average strength of the concrete samples at 1, 7, 14, and 28 days of age.
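The coefficients $\beta_0, \beta_1, \ldots, \beta_8$ to use in the following formula:

$$
\text{strength} = \beta_0 + \beta_1 \cdot \text{cement} + \beta_2 \cdot \text{slag} + \beta_3 \cdot \text{fly\_ash} + \beta_4 \cdot \text{water} + \beta_5 \cdot \text{superplasticizer} + \beta_6 \cdot \text{coarse\_aggregate} + \beta_7 \cdot \text{fine\_aggregate} + \beta_8 \cdot \text{age}
$$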
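A linear formula of this kind is usually fitted by ordinary least squares. The sketch below is only an illustration of that idea, not the model fitted in this notebook: it uses synthetic data with three made-up features, and the names `true_beta`, `X_design`, and `predict` are assumptions introduced for the example. With the real data, `sm.OLS` together with `sm.add_constant` from statsmodels (imported below) would do the same job and also report standard errors.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the real data: 200 samples, 3 made-up features.
n_samples = 200
X = rng.uniform(0.0, 10.0, size=(n_samples, 3))

# Illustrative "true" coefficients: intercept first, then one slope per feature.
true_beta = np.array([5.0, 2.0, -1.0, 0.5])
noise = rng.normal(0.0, 0.05, size=n_samples)
y = true_beta[0] + X @ true_beta[1:] + noise

# Prepend a column of ones so the first fitted coefficient is the intercept.
X_design = np.column_stack([np.ones(n_samples), X])

# Least-squares solution of X_design @ beta ≈ y.
beta_hat, *_ = np.linalg.lstsq(X_design, y, rcond=None)


def predict(features, beta):
    """Apply the linear formula: beta[0] + sum(beta[i] * feature[i])."""
    features = np.atleast_2d(features)
    return beta[0] + features @ beta[1:]


print("estimated coefficients:", np.round(beta_hat, 3))
print("prediction for [1, 1, 1]:", predict([1.0, 1.0, 1.0], beta_hat))
```

The fitted `beta_hat` plays the role of the coefficients in the formula: the first entry is the intercept and the remaining entries multiply the features in column order.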

Imports

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
from pyod.models.iforest import IForest
from statsmodels.stats.outliers_influence import variance_inflation_factor
from scipy.stats.mstats import winsorize
from statsmodels.stats.diagnostic import het_breuschpagan

from sklearn.model_selection import train_test_split
#from sklearn.linear_model import LinearRegression
#from sklearn.metrics import mean_squared_error, r2_score

sns.set_style('whitegrid')

df = pd.read_csv('data/concrete_data.csv')
df.head()

# Count rows before and after de-duplication to report how many were removed
n_rows_before = df.shape[0]
df = df.drop_duplicates()
print(f"Dropped {n_rows_before - df.shape[0]} duplicate rows.")

EDA

df.describe()

plt.figure(figsize = (16, 6))
sns.boxplot(data = pd.melt(df), y='variable', x = 'value')
plt.tight_layout()

Estimate average strength of the concrete samples at 1, 7, 14, and 28 days of age.

indices = [1, 7, 14, 28]
sdf = df.groupby('age')['strength'].mean()
sdf = sdf[sdf.index.isin(indices)].reset_index()
sdf
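A group mean is easier to interpret next to the number of samples behind it, and filtering with `isin` silently omits any requested age that has no samples. The sketch below shows one way to surface both, using a small made-up table rather than the concrete data; `toy`, `ages_of_interest`, and `summary` are hypothetical names for this example.

```python
import pandas as pd

# Made-up rows for illustration only; not values from the concrete data set.
toy = pd.DataFrame({
    "age": [1, 1, 7, 7, 7, 28, 28, 90],
    "strength": [10.0, 12.0, 20.0, 22.0, 24.0, 35.0, 37.0, 45.0],
})

ages_of_interest = [1, 7, 14, 28]

summary = (
    toy[toy["age"].isin(ages_of_interest)]
    .groupby("age")["strength"]
    .agg(["mean", "count"])        # average strength and sample count per age
    .reindex(ages_of_interest)     # requested ages with no samples appear as NaN rows
)
print(summary)
```

In this toy table, age 14 has no samples, so it shows up as a row of missing values instead of disappearing from the result.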

plt.figure(figsize = (16, 4))

sns.barplot(data = sdf, x = 'age', y = 'strength')
plt.tight_layout()

Correlation

features = ['cement', 'slag', 'fly_ash', 'water', 'superplasticizer', 'coarse_aggregate', 'fine_aggregate', 'age']
target = 'strength'

plt.figure(figsize = (16, 6))
sns.heatmap(df.corr(), annot = True, linewidths=2)
plt.tight_layout()
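The heatmap shows every pairwise correlation, but for the challenge the column that matters most is the one for the target. A minimal sketch of pulling out that column and ranking the features by absolute correlation follows; it runs on a made-up table (`toy`) with only three features, so the numbers say nothing about the real data.

```python
import pandas as pd

# Made-up rows for illustration only; not values from the concrete data set.
toy = pd.DataFrame({
    "cement": [100.0, 200.0, 300.0, 400.0, 500.0],
    "water": [220.0, 200.0, 190.0, 170.0, 150.0],
    "age": [28, 7, 90, 3, 28],
    "strength": [12.0, 21.0, 33.0, 38.0, 52.0],
})
target = "strength"

# Pearson correlation of each feature with the target, strongest first.
target_corr = (
    toy.corr()[target]
    .drop(target)                                       # the target correlates 1.0 with itself
    .sort_values(key=lambda s: s.abs(), ascending=False)
)
print(target_corr)
```

Applied to the notebook's `df`, the same expression would give one correlation per entry in `features`. Pearson correlation only captures linear association, so a weak value does not rule out a nonlinear effect.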
