Competition - predicting concrete strength

Can you predict the strength of concrete?

📖 Background

You work in the civil engineering department of a major university. You are part of a project testing the strength of concrete samples.

Concrete is the most widely used building material in the world. It is a mix of cement and water with gravel and sand. It can also include other materials like fly ash, blast furnace slag, and additives.

The compressive strength of concrete is a function of components and age, so your team is testing different combinations of ingredients at different time intervals.

The project leader asked you to find a simple way to estimate strength so that students can predict how a particular sample is expected to perform.

💾 The data

The team has already tested more than a thousand samples:

Compressive strength data:

"cement" - Portland cement in kg/m3
"slag" - Blast furnace slag in kg/m3
"fly_ash" - Fly ash in kg/m3
"water" - Water in liters/m3
"superplasticizer" - Superplasticizer additive in kg/m3
"coarse_aggregate" - Coarse aggregate (gravel) in kg/m3
"fine_aggregate" - Fine aggregate (sand) in kg/m3
"age" - Age of the sample in days
"strength" - Concrete compressive strength in megapascals (MPa)

Acknowledgments: I-Cheng Yeh, "Modeling of strength of high-performance concrete using artificial neural networks," Cement and Concrete Research, Vol. 28, No. 12, pp. 1797-1808 (1998).

import pandas as pd
df = pd.read_csv('data/concrete_data.csv')
df.head()

💪 Challenge

Provide your project leader with a formula that estimates the compressive strength. Include:

The average strength of the concrete samples at 1, 7, 14, and 28 days of age.
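The coefficients $\beta_0$, $\beta_1$, ..., $\beta_8$ to use in the following formula:

$$\text{strength} = \beta_0 + \beta_1 \cdot \text{cement} + \beta_2 \cdot \text{slag} + \beta_3 \cdot \text{fly\_ash} + \beta_4 \cdot \text{water} + \beta_5 \cdot \text{superplasticizer} + \beta_6 \cdot \text{coarse\_aggregate} + \beta_7 \cdot \text{fine\_aggregate} + \beta_8 \cdot \text{age}$$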

🧑‍⚖️ Judging criteria

This is a community-based competition. The top 5 most upvoted entries will win.

The winners will receive DataCamp merchandise.

✅ Checklist before publishing

Rename your workspace to make it descriptive of your work. N.B. you should leave the notebook name as notebook.ipynb.
Remove redundant cells like the judging criteria, so the workbook is focused on your work.
Check that all the cells run without error.

# Libraries used in the analysis below
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns
import statsmodels.api as sm

# Load the concrete test results and display them
df = pd.read_csv('data/concrete_data.csv')
df

# Maximum, minimum, and mean of every column
df.agg(['max', 'min', 'mean'])
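
The challenge also asks for the average strength at 1, 7, 14, and 28 days of age. A minimal sketch of that calculation follows. The `sample` frame is an invented stand-in for `df` so that the snippet runs on its own, and `target_ages` and `mean_by_age` are illustrative names, not part of the original notebook.

```python
import pandas as pd

# Hypothetical stand-in for data/concrete_data.csv. The values are invented
# purely so that this snippet is self-contained.
sample = pd.DataFrame({
    "age": [1, 1, 7, 7, 14, 28, 28, 90],
    "strength": [8.0, 10.0, 20.0, 24.0, 30.0, 35.0, 39.0, 45.0],
})

# Ages requested in the challenge
target_ages = [1, 7, 14, 28]

# Keep only the requested ages, then average strength within each age
mean_by_age = (
    sample[sample["age"].isin(target_ages)]
    .groupby("age")["strength"]
    .mean()
    .reindex(target_ages)  # fixed order; NaN if an age has no samples
)
print(mean_by_age)
```

On the real data, replacing `sample` with `df` should give the four averages. The `reindex` call leaves a NaN for any requested age that has no samples.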

# Predictors (mix components and age) and the response
X = df[["cement", "slag", "fly_ash", "water", "superplasticizer",
        "coarse_aggregate", "fine_aggregate", "age"]]
Y = df["strength"]

# Kitchen-sink model: all eight predictors plus an intercept
X2 = sm.add_constant(X)
model = sm.OLS(Y, X2)
model_res = model.fit()
model_res.summary()


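# Kitchen-sink model without an intercept.
# Note: statsmodels reports an uncentered R-squared for this fit, so it is
# not directly comparable with the R-squared of the model that has a constant.
model2 = sm.OLS(Y, X)
model2_res = model2.fit()
model2_res.summary()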

# Checking for collinearity: pairwise scatter plots of the predictors
sns.pairplot(X)

# Pairwise correlations, rounded to two decimals
df.corr().round(2)

# Residuals of the model with a constant
sns.boxplot(x=model_res.resid, showmeans=True)
plt.show()
