Concrete Strength Predictor
  • What is Concrete?

    Concrete is the most widely used building material in the world. It is a mix of cement and water with gravel and sand. It can also include other materials such as fly ash, blast furnace slag, and additives. Concrete has been used since the time of the ancient Romans and has gone through several modifications over the centuries. Such modifications come from statistical analysis of the mix ratio and the resulting concrete strength. In this notebook, we analyze the experimental results of just over a thousand concrete samples, with the aim of developing a model that can predict the strength of concrete from its mix using the coefficients we obtain.

    Let's start by taking a look at our dataframe.

    import pandas as pd
    import matplotlib.pyplot as plt
    import seaborn as sns
    import numpy as np
    conc = pd.read_csv('data/concrete_data.csv')
    display(conc)

    The dataframe records mixes of cement, slag, fly ash, water, superplasticizer, coarse aggregate, and fine aggregate in different proportions. Each mix was tested at a given number of days, recorded in the age column, to obtain the value recorded in the strength column.

    Checking for dataframe issues

    Let us check for dataframe issues such as missing or duplicated values, and clean the data if necessary.

    conc.info()
    display(conc.describe())
    # Show the duplicated rows before dropping them.
    display(conc[conc.duplicated()])
    conc.drop_duplicates(inplace=True)
    print(conc.shape)

    The dataframe had some duplicated rows. Removing them reduced the number of rows from 1030 to 1005.
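
    The drop from 1030 to 1005 rows corresponds to 25 rows flagged as duplicates. As a minimal sketch of how that flagging behaves, the example below uses a small made-up table (the values are illustrative and are not rows from concrete_data.csv): `duplicated()` marks only the repeats of an earlier row, and `drop_duplicates()` keeps the first copy.

```python
import pandas as pd

# Toy mix table with made-up values (not taken from concrete_data.csv).
toy = pd.DataFrame({
    "cement":   [300.0, 300.0, 250.0, 300.0],
    "water":    [180.0, 180.0, 170.0, 180.0],
    "age":      [28, 28, 7, 28],
    "strength": [40.0, 40.0, 25.0, 41.0],
})

# duplicated() flags each repeat of an earlier row; the first copy is not flagged.
dup_mask = toy.duplicated()
n_dups = int(dup_mask.sum())

# drop_duplicates() keeps the first copy of every repeated row.
deduped = toy.drop_duplicates()

print(f"rows before: {len(toy)}, flagged: {n_dups}, rows after: {len(deduped)}")
```

    Note that the last row differs from the first only in strength, so it is not treated as a duplicate: rows must match in every column to be flagged.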

    What gives Concrete its strength?

    Several combinations of variables contribute to the strength of concrete. There is an old saying, "age like fine wine", which means the older you get, the better you become. Let's plot concrete strength against age to put the theory to the test.

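    # Mean strength at each test age.
    conc_age_group = conc.groupby('age')['strength'].mean().round(2).reset_index()
    display(conc_age_group)
    fig, ax = plt.subplots()
    # Second-order polynomial fit; ci=None skips the bootstrap confidence band.
    sns.regplot(data=conc_age_group, x='age', y='strength', order=2, ci=None, ax=ax)
    plt.show()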

    On average, the strength of the concrete increased from day 1 to day 365, but the strength gain flattened from day 56 and in some cases decreased. The decrease might be due to other factors, which we will look into later.
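
    The curve in the plot above is a second-order polynomial fit, which seaborn estimates with `numpy.polyfit` when `order` is greater than 1. The sketch below reproduces that kind of fit on hypothetical mean strengths (the numbers are invented for illustration and are not the notebook's results) and reads the flattening off the sign of the quadratic term.

```python
import numpy as np

# Hypothetical mean strengths by age (illustrative only, not the notebook's results).
age = np.array([3, 7, 14, 28, 56, 90, 180, 365], dtype=float)
strength = np.array([18.0, 25.0, 30.0, 36.0, 50.0, 48.0, 45.0, 43.0])

# Least-squares fit of strength ~ a*age**2 + b*age + c.
a, b, c = np.polyfit(age, strength, deg=2)
fitted = np.polyval([a, b, c], age)

# A negative quadratic coefficient means the fitted curve bends downward,
# i.e. the strength gain tapers off with age.
print(f"a = {a:.6f}, b = {b:.4f}, c = {c:.2f}")

# Age at which the fitted parabola peaks (vertex of the quadratic).
peak_age = -b / (2 * a)
print(f"fitted curve peaks near day {peak_age:.0f}")
```

    A quadratic is only a rough summary here: it forces a symmetric rise and fall around the peak, so a dip in the fitted curve at late ages can be partly an artifact of the curve's shape rather than evidence that strength dropped.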

    Age distribution of the concrete samples

    The concrete samples were tested at ages ranging from 1 to 365 days. Let's visualize the age distribution of the samples.

    # Mean strength and number of samples at each test age.
    age_summary = (
        conc.groupby('age')['strength']
        .agg(['mean', 'count'])
        .round(2)
        .reset_index()
        .rename(columns={'mean': 'Average Strength'})
    )
    display(age_summary.sort_values(by='count', ascending=False))
    fig, ax = plt.subplots()
    sns.countplot(x='age', data=conc, ax=ax)
    plt.title("Distribution of concrete age groups")
    plt.xlabel("Age (Days)")
    plt.ylabel("Count of samples")
    plt.show()

    From the table and graph above, we can see that most of the samples fall in the 1-100 day age range, with average strength generally increasing as the concrete gets older.
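
    One way to put a number on "most of the samples" is to bin the ages and count each band. The sketch below does this with `pd.cut` on a hypothetical list of test ages. Both the ages and the bin edges are assumptions chosen for illustration, not values from the dataset.

```python
import pandas as pd

# Hypothetical test ages in days (illustrative only).
ages = pd.Series([3, 7, 7, 14, 28, 28, 28, 56, 90, 100, 180, 365], name="age")

# Assumed bin edges; pd.cut intervals are right-closed, so day 100 falls in "1-100".
bins = [0, 100, 200, 365]
labels = ["1-100", "101-200", "201-365"]
age_band = pd.cut(ages, bins=bins, labels=labels)

# Number of samples per band, in band order.
band_counts = age_band.value_counts().sort_index()
print(band_counts)

# Fraction of samples tested at 100 days or earlier.
share_under_100 = (ages <= 100).mean()
print(f"share tested at <= 100 days: {share_under_100:.0%}")
```

    On the real dataframe, the same idea would apply to `conc['age']` in place of the toy series.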

    Base Regression Model

    Let us define our target variable (y) and features (X), then train our first regression model.

    
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LinearRegression

    # Features are every column except the last; the target is the last column (strength).
    X = conc.iloc[:, 0:-1]
    y = pd.DataFrame(conc.iloc[:, -1])
    # Hold out 20% for testing, stratified by age so both sets cover the same test ages.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42, stratify=X.age)
    reg = LinearRegression()
    reg.fit(X_train, y_train)
    y_pred = reg.predict(X_test)
    # score() returns R^2 on the test set; round() works whether it is a float or a NumPy scalar.
    print("The R^2 of the model on the test set is:", round(reg.score(X_test, y_test), 2))
    coef = pd.DataFrame({'Materials': list(X.columns), 'Coef': reg.coef_.flatten()})
    display(coef.style.background_gradient(cmap="PRGn"))
    fig, ax = plt.subplots()
    # plt.scatter(y_test,y_pred,alpha=0.7, edgecolors="k")
    sns.barplot(data=coef, x='Materials', y='Coef', ax=ax)
    plt.xticks(rotation=45)
    plt.show()

    The model has an R-squared value of 0.62, meaning it explains only about 62% of the variance in strength on the test set. Of the features, superplasticizer has the highest coefficient and water the lowest.
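
    Raw linear-regression coefficients depend on each feature's units, so the largest coefficient is not necessarily the most influential feature. The sketch below illustrates this on synthetic data. The two features borrow column names only as placeholders, and the numbers are generated, not taken from the dataset. It fits the same model on raw and standardized features: the ranking of the coefficients flips, while R-squared stays the same.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 500

# Two synthetic features on very different scales (names are placeholders).
cement = rng.normal(300.0, 100.0, n)         # wide numeric range
superplasticizer = rng.normal(6.0, 2.0, n)   # narrow numeric range
X = np.column_stack([cement, superplasticizer])

# Synthetic target: small per-unit effect for cement, large for superplasticizer.
y = 0.1 * cement + 1.0 * superplasticizer + rng.normal(0.0, 1.0, n)

# Fit on raw features: coefficients are "strength per unit of feature".
raw = LinearRegression().fit(X, y)

# Fit on standardized features: coefficients are "strength per standard deviation".
X_std = StandardScaler().fit_transform(X)
std = LinearRegression().fit(X_std, y)

print("raw coefficients:         ", raw.coef_.round(3))
print("standardized coefficients:", std.coef_.round(3))
print("R^2 raw vs standardized:  ", raw.score(X, y), std.score(X_std, y))
```

    In this toy setup the superplasticizer placeholder has the larger raw coefficient, yet the cement placeholder has the larger standardized one, because it varies over a much wider range. Whether the same reversal happens for the concrete data would need to be checked by standardizing `X_train` before fitting.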