## Can you predict the strength of concrete?

### 📖 Background

You work in the civil engineering department of a major university. You are part of a project testing the strength of concrete samples.

Concrete is the most widely used building material in the world. It is a mix of cement and water with gravel and sand. It can also include other materials like fly ash, blast furnace slag, and additives.

The compressive strength of concrete is a function of components and age, so your team is testing different combinations of ingredients at different time intervals.

The project leader asked you to find a simple way to estimate strength so that students can predict how a particular sample is expected to perform.

### 💾 The data

The team has already tested more than a thousand samples:

##### Compressive strength data:

- "cement" - Portland cement in kg/m³
- "slag" - Blast furnace slag in kg/m³
- "fly_ash" - Fly ash in kg/m³
- "water" - Water in liters/m³
- "superplasticizer" - Superplasticizer additive in kg/m³
- "coarse_aggregate" - Coarse aggregate (gravel) in kg/m³
- "fine_aggregate" - Fine aggregate (sand) in kg/m³
- "age" - Age of the sample in days
- "strength" - Concrete compressive strength in megapascals (MPa)

*Acknowledgments: I-Cheng Yeh, "Modeling of strength of high-performance concrete using artificial neural networks," Cement and Concrete Research, Vol. 28, No. 12, pp. 1797-1808 (1998).*

### 💪 Challenge

Provide your project leader with a formula that estimates the compressive strength. Include:

- The average strength of the concrete samples at 1, 7, 14, and 28 days of age.
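- The coefficients $\beta_{0}$, $\beta_{1}$, ..., $\beta_{8}$ to use in the following formula:

$$
\begin{aligned}
\text{Concrete Strength} = {} & \beta_{0} + \beta_{1} \cdot \text{cement} + \beta_{2} \cdot \text{slag} + \beta_{3} \cdot \text{fly ash} + \beta_{4} \cdot \text{water} \\
& + \beta_{5} \cdot \text{superplasticizer} + \beta_{6} \cdot \text{coarse aggregate} + \beta_{7} \cdot \text{fine aggregate} + \beta_{8} \cdot \text{age}
\end{aligned}
$$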
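One way to obtain coefficients for a linear strength formula is ordinary least squares. The sketch below is only an illustration under stated assumptions: `fit_linear_formula`, `predict_strength`, and `FEATURES` are hypothetical names, it uses NumPy's least-squares solver as a stand-in for whatever fitting routine the analysis ends up using, and the numbers are synthetic, not the concrete data.

```python
import numpy as np

# Column order assumed to match the dataset description above.
FEATURES = ["cement", "slag", "fly_ash", "water", "superplasticizer",
            "coarse_aggregate", "fine_aggregate", "age"]


def fit_linear_formula(X, y):
    """Return [b0, b1, ..., bk] minimizing the squared error of y ~ b0 + X @ b."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    design = np.column_stack([np.ones(len(X)), X])  # prepend the intercept column
    coefs, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coefs


def predict_strength(coefs, sample):
    """Apply the formula to one sample given as a dict keyed by feature name."""
    return coefs[0] + sum(b * sample[name] for b, name in zip(coefs[1:], FEATURES))


# Synthetic, made-up data purely to exercise the functions (NOT the concrete dataset).
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 100, size=(200, len(FEATURES)))
true_coefs = np.array([5.0, 0.1, 0.08, 0.05, -0.2, 0.3, 0.01, 0.02, 0.1])
y_demo = true_coefs[0] + X_demo @ true_coefs[1:]  # noiseless target

demo_coefs = fit_linear_formula(X_demo, y_demo)
demo_sample = dict(zip(FEATURES, X_demo[0]))
demo_prediction = predict_strength(demo_coefs, demo_sample)
```

Because the synthetic target has no noise, the fitted coefficients should match `true_coefs` up to floating-point error. On real measurements the fit would only approximate the underlying relationship.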

## Imports

```
# Data handling and plotting
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Modeling, diagnostics, and outlier handling
import statsmodels.api as sm
from pyod.models.iforest import IForest
from statsmodels.stats.outliers_influence import variance_inflation_factor
from scipy.stats.mstats import winsorize
from statsmodels.stats.diagnostic import het_breuschpagan
from sklearn.model_selection import train_test_split
#from sklearn.linear_model import LinearRegression
#from sklearn.metrics import mean_squared_error, r2_score

sns.set_style('whitegrid')

# Load the compressive strength data and preview the first rows
df = pd.read_csv('data/concrete_data.csv')
df.head()
```

```
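# Remove exact duplicate rows and report how many were dropped
original_shape = df.shape[0]  # row count before deduplication
df = df.drop_duplicates()
print(f"Dropped {original_shape - df.shape[0]} duplicate rows.")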
```

## EDA

`df.describe()`

```
# One boxplot per column; melt reshapes the frame to long format (variable, value)
plt.figure(figsize = (16, 6))
sns.boxplot(data = pd.melt(df), y='variable', x = 'value')
plt.tight_layout()
```
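The boxplot flags points beyond the whiskers visually. A numeric counterpart can be sketched with the common 1.5 × IQR rule, which is also seaborn's default whisker length. `iqr_outlier_counts` is a hypothetical helper and the toy frame is invented for illustration, so treat this as a rough cross-check and not as the outlier treatment used in this analysis.

```python
import pandas as pd


def iqr_outlier_counts(frame, k=1.5):
    """Count values per numeric column outside [Q1 - k*IQR, Q3 + k*IQR]."""
    numeric = frame.select_dtypes("number")
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    # Comparisons align the per-column bounds with the frame's columns.
    outside = (numeric < q1 - k * iqr) | (numeric > q3 + k * iqr)
    return outside.sum()


# Toy values invented for illustration (not taken from the concrete dataset).
toy = pd.DataFrame({
    "age": [1, 3, 7, 14, 28, 365],
    "water": [180, 182, 185, 186, 190, 192],
})
counts = iqr_outlier_counts(toy)
```

Calling `iqr_outlier_counts(df)` on the real data would give one count per column, which is easier to compare across features than a plot in which columns with very different scales share one axis.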

### Estimate the average strength of the concrete samples at 1, 7, 14, and 28 days of age

```
# Mean strength per age, restricted to the ages requested in the challenge
indices = [1, 7, 14, 28]
sdf = df.groupby('age')['strength'].mean()
sdf = sdf[sdf.index.isin(indices)].reset_index()
sdf
```
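Filtering with `isin` silently omits any requested age that has no samples. As a hedged alternative, the sketch below uses `reindex`, which keeps every requested age and marks missing ones as `NaN`. `mean_strength_at` is a hypothetical helper and the toy values are invented.

```python
import pandas as pd


def mean_strength_at(frame, ages):
    """Mean strength per requested age; ages absent from the data appear as NaN."""
    return frame.groupby("age")["strength"].mean().reindex(ages)


# Toy values invented for illustration; age 14 is deliberately missing.
toy = pd.DataFrame({
    "age": [1, 1, 7, 28, 28],
    "strength": [10.0, 12.0, 20.0, 35.0, 37.0],
})
result = mean_strength_at(toy, [1, 7, 14, 28])
```

Here the entry for age 14 is `NaN`, which makes a gap in the data visible instead of quietly shortening the table.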

```
plt.figure(figsize = (16, 4))
sns.barplot(data = sdf, x = 'age', y = 'strength')
plt.tight_layout()
```

### Correlation

```
# Predictor columns and the response column
features = ['cement', 'slag', 'fly_ash', 'water', 'superplasticizer', 'coarse_aggregate', 'fine_aggregate', 'age']
target = 'strength'
```

```
# Pairwise Pearson correlations between all columns, annotated with their values
plt.figure(figsize = (16, 6))
sns.heatmap(df.corr(), annot = True, linewidths=2)
plt.tight_layout()
```
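To read the heatmap with the target in mind, it can help to list each feature's correlation with strength, ordered by absolute value. The sketch below assumes all columns are numeric. `rank_by_correlation` is a hypothetical helper and the toy frame is invented for illustration.

```python
import pandas as pd


def rank_by_correlation(frame, target):
    """Pearson correlation of each column with `target`, strongest (by |r|) first."""
    corr = frame.corr()[target].drop(target)
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)


# Toy values invented for illustration (not taken from the concrete dataset).
toy = pd.DataFrame({
    "cement": [100, 200, 300, 400, 500],
    "water": [200, 190, 195, 180, 185],
    "strength": [10, 20, 30, 40, 50],
})
ranking = rank_by_correlation(toy, "strength")
```

With the notebook's data this would be called as `rank_by_correlation(df, target)`. Pearson correlation only captures linear association, so a low value does not rule out a nonlinear effect.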