Predicting concrete strength with linear regression, my old friend
  • Can you predict the strength of concrete?

    📖 Background

    You work in the civil engineering department of a major university. You are part of a project testing the strength of concrete samples.

    Concrete is the most widely used building material in the world. It is a mix of cement and water with gravel and sand. It can also include other materials like fly ash, blast furnace slag, and additives.

    The compressive strength of concrete is a function of components and age, so your team is testing different combinations of ingredients at different time intervals.

    The project leader asked you to find a simple way to estimate strength so that students can predict how a particular sample is expected to perform.

    💾 The data

    The team has already tested more than a thousand samples (source):

    Compressive strength data:
    • "cement" - Portland cement in kg/m3
    • "slag" - Blast furnace slag in kg/m3
    • "fly_ash" - Fly ash in kg/m3
    • "water" - Water in liters/m3
    • "superplasticizer" - Superplasticizer additive in kg/m3
    • "coarse_aggregate" - Coarse aggregate (gravel) in kg/m3
    • "fine_aggregate" - Fine aggregate (sand) in kg/m3
    • "age" - Age of the sample in days
    • "strength" - Concrete compressive strength in megapascals (MPa)

    Acknowledgments: I-Cheng Yeh, "Modeling of strength of high-performance concrete using artificial neural networks," Cement and Concrete Research, Vol. 28, No. 12, pp. 1797-1808 (1998).

    💪 Challenge

    Provide your project leader with a formula that estimates the compressive strength. Include:

    1. The average strength of the concrete samples at 1, 7, 14, and 28 days of age.
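    2. The coefficients $\beta_0, \beta_1, \ldots, \beta_8$ to use in the following formula:

       $$\text{strength} = \beta_0 + \beta_1 \cdot \text{cement} + \beta_2 \cdot \text{slag} + \beta_3 \cdot \text{fly\_ash} + \beta_4 \cdot \text{water} + \beta_5 \cdot \text{superplasticizer} + \beta_6 \cdot \text{coarse\_aggregate} + \beta_7 \cdot \text{fine\_aggregate} + \beta_8 \cdot \text{age}$$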
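To make the requested formula concrete, here is a minimal sketch of how such coefficients would be used once they are known. It assumes the formula is linear, with one intercept plus one coefficient per column in the data description. The function name `predict_strength` and all coefficient values are hypothetical and chosen only so the arithmetic is easy to check by hand; they are not fitted to the concrete data.

```python
# Column names taken from the data description above.
FEATURES = ["cement", "slag", "fly_ash", "water", "superplasticizer",
            "coarse_aggregate", "fine_aggregate", "age"]


def predict_strength(intercept, coefs, sample):
    """Evaluate intercept + sum(coef_i * feature_i) for one sample (result in MPa)."""
    return intercept + sum(coefs[name] * sample[name] for name in FEATURES)


# Hypothetical coefficients (NOT fitted values): all zero except three.
example_coefs = {name: 0.0 for name in FEATURES}
example_coefs["cement"] = 0.125
example_coefs["water"] = -0.25
example_coefs["age"] = 0.5

# Hypothetical mix design, invented for illustration.
example_sample = {
    "cement": 300.0, "slag": 0.0, "fly_ash": 0.0, "water": 180.0,
    "superplasticizer": 0.0, "coarse_aggregate": 1000.0,
    "fine_aggregate": 800.0, "age": 28.0,
}

# 10 + 0.125*300 - 0.25*180 + 0.5*28
estimate = predict_strength(10.0, example_coefs, example_sample)
print(estimate)
```

With fitted coefficients in place of the made-up ones, the same function would give students a one-line strength estimate for any mix.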

    Imports

    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt
    import seaborn as sns
    import statsmodels.api as sm
    from pyod.models.iforest import IForest
    from statsmodels.stats.outliers_influence import variance_inflation_factor
    from scipy.stats.mstats import winsorize
    from statsmodels.stats.diagnostic import het_breuschpagan
    
    from sklearn.model_selection import train_test_split
    #from sklearn.linear_model import LinearRegression
    #from sklearn.metrics import mean_squared_error, r2_score
    
    sns.set_style('whitegrid')
    
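    # Load the data and preview the first rows
    df = pd.read_csv('data/concrete_data.csv')
    df.head()
    # Drop exact duplicate rows and report how many were removed
    original_shape = df.shape[0]
    df = df.drop_duplicates()
    print(f"Dropped {original_shape - df.shape[0]} duplicate rows.")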

    EDA

    df.describe()
    plt.figure(figsize = (16, 6))
    # Melt to long format so each column gets its own box on a shared value axis
    sns.boxplot(data = pd.melt(df), y='variable', x = 'value')
    plt.tight_layout()

    Estimate average strength of the concrete samples at 1, 7, 14, and 28 days of age.

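    # Ages (in days) requested in the challenge
    indices = [1, 7, 14, 28]
    # Mean strength per age, then keep only the requested ages
    sdf = df.groupby('age')['strength'].mean()
    sdf = sdf[sdf.index.isin(indices)].reset_index()
    sdf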
    plt.figure(figsize = (16, 4))
    
    sns.barplot(data = sdf, x = 'age', y = 'strength')
    plt.tight_layout()
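A mean is only as trustworthy as the number of samples behind it, and filtering with `isin` silently skips any requested age that has no samples. As a sketch of one way to surface both, the snippet below reports the mean and the sample count per age and uses `reindex` so a missing age appears as a NaN row. The `toy` frame is invented for illustration and is not the concrete dataset.

```python
import pandas as pd

# Toy data, invented for illustration only.
toy = pd.DataFrame({
    "age": [1, 1, 7, 7, 7, 28, 90],
    "strength": [5.0, 7.0, 20.0, 22.0, 24.0, 40.0, 50.0],
})

requested_ages = [1, 7, 14, 28]

summary = (
    toy.groupby("age")["strength"]
       .agg(mean_strength="mean", n_samples="count")
       .reindex(requested_ages)  # ages with no samples become NaN rows
)
print(summary)
```

In the toy data, age 14 has no samples, so its row is NaN instead of disappearing from the table.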

    Correlation

    features = ['cement', 'slag', 'fly_ash', 'water', 'superplasticizer', 'coarse_aggregate', 'fine_aggregate', 'age']
    target = 'strength'
    plt.figure(figsize = (16, 6))
    sns.heatmap(df.corr(), annot = True, linewidths=2)
    plt.tight_layout()
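A heatmap shows every pairwise correlation, but for choosing predictors the column for the target is usually what matters. As a sketch, the snippet below pulls out that column and orders the features by the absolute size of their correlation with the target. The `toy` frame and its values are invented for illustration; the variable `target` mirrors the one defined above.

```python
import pandas as pd

# Toy data, invented for illustration only.
toy = pd.DataFrame({
    "cement": [100.0, 200.0, 300.0, 400.0],
    "water": [200.0, 190.0, 170.0, 160.0],
    "age": [7.0, 28.0, 7.0, 28.0],
    "strength": [10.0, 22.0, 28.0, 41.0],
})
target = "strength"

# Pearson correlation of each feature with the target (target itself removed)
corr_with_target = toy.corr()[target].drop(target)

# Order by absolute value so strong negative correlations are not buried
order = corr_with_target.abs().sort_values(ascending=False).index
ranked = corr_with_target.reindex(order)
print(ranked)
```

Sorting on the absolute value keeps the sign visible in the output while still ranking a strongly negative feature ahead of a weakly positive one.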