Predicting industrial machine downtime (L3)

Predicting Industrial Machine Downtime: Level 3

📖 Background

You work for a manufacturer of high-precision metal components used in aerospace, automotive, and medical device applications. Your company operates three different machines on its shop floor that produce different-sized components, so minimizing the downtime of these machines is vital for meeting production deadlines.

Your team wants to use a data-driven approach to predict machine downtime, so that maintenance can be planned proactively rather than in reaction to machine failure. To support this, your company has been collecting operational data for over a year, along with a record of whether each machine was down at the time of each reading.

In this third level, you're going to develop a predictive model that could be combined with real-time operational data to detect likely machine failure. This level is aimed at advanced learners. If you want to challenge yourself a bit less, check out the other levels!

💾 The data

The company has stored the machine operating data in a single table, available in 'data/machine_downtime.csv'.

Each row in the table represents the operational data for a single machine on a given day:

"Date" - the date the reading was taken on.
"Machine_ID" - the unique identifier of the machine being read.
"Assembly_Line_No" - the unique identifier of the assembly line the machine is located on.
"Hydraulic_Pressure(bar)", "Coolant_Pressure(bar)", and "Air_System_Pressure(bar)" - pressure measurements at different points in the machine.
"Coolant_Temperature", "Hydraulic_Oil_Temperature", and "Spindle_Bearing_Temperature" - temperature measurements (in Celsius) at different points in the machine.
"Spindle_Vibration", "Tool_Vibration", and "Spindle_Speed(RPM)" - vibration (measured in micrometers) and rotational speed measurements for the spindle and tool.
"Voltage(volts)" - the voltage supplied to the machine.
"Torque(Nm)" - the torque being generated by the machine.
"Cutting(kN)" - the cutting force of the tool.
"Downtime" - an indicator of whether the machine was down or not on the given day.

Competition Goals 💪

Train and evaluate a predictive model to predict machine failure.
Which dataset features are the strongest predictors of machine failure?
Are your predictions more accurate if you model each machine separately?

Import Data/Libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# statsmodels
from statsmodels.stats.outliers_influence import variance_inflation_factor
import statsmodels.api as sm

# sklearn
from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.svm import SVC

import shap

from tqdm import tqdm
import time

data = pd.read_csv('data/machine_downtime.csv', parse_dates=['Date'], dayfirst=False)
data = data.drop('Machine_ID', axis = 1)

downtime_map = {
    'Machine_Failure': 1,
    'No_Machine_Failure': 0
}
data['Downtime'] = data['Downtime'].map(downtime_map)
data.head()
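One thing worth checking after the label mapping: `Series.map` with a dictionary returns NaN for any value that is not a key, and the later `dropna()` call would then silently discard those rows. The sketch below shows the behaviour on a small made-up Series; the misspelled label `'machine_failure'` is a hypothetical example and not a value known to be in the dataset.

```python
import pandas as pd

# Made-up labels; the lowercase variant is a hypothetical typo for illustration.
labels = pd.Series(['Machine_Failure', 'No_Machine_Failure', 'machine_failure'])

downtime_map = {
    'Machine_Failure': 1,
    'No_Machine_Failure': 0,
}

# Values missing from the dictionary are mapped to NaN rather than raising an error.
encoded = labels.map(downtime_map)

# List the raw labels that failed to map, so they can be inspected before any dropna().
unmapped = labels[encoded.isna()].unique().tolist()
print(unmapped)
```

Running the same check against `data['Downtime']` before calling `dropna()` would show whether any rows are being lost because of unexpected label spellings.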

Review Dataset

data.info()

data = data.dropna()
data.shape

# Check the class balance: the two Downtime classes are roughly balanced
data['Downtime'].value_counts(normalize = True)

df = data.copy()

# One-hot encode the Assembly_Line_No feature
df = pd.get_dummies(df, columns=['Assembly_Line_No'], prefix='Assembly_Line', dtype='int', drop_first=True)

# Drop the Date column and the Downtime target so that df holds only predictor features
df.drop(columns=['Date', 'Downtime'], inplace=True)

df = df.dropna()

df.shape
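To make the effect of `drop_first=True` concrete, here is a minimal sketch on a toy frame. The line labels (`'Line-A'` and so on) and the pressure values are placeholders chosen for illustration, not values taken from the dataset.

```python
import pandas as pd

# Hypothetical toy frame; the line labels are placeholders, not dataset values.
toy = pd.DataFrame({
    'Hydraulic_Pressure(bar)': [71.0, 125.3, 98.6],
    'Assembly_Line_No': ['Line-A', 'Line-B', 'Line-C'],
})

encoded = pd.get_dummies(
    toy,
    columns=['Assembly_Line_No'],
    prefix='Assembly_Line',
    dtype='int',
    drop_first=True,
)

# The first category ('Line-A') becomes the baseline:
# it is represented by zeros in both remaining dummy columns.
print(encoded)
```

With three categories, only two dummy columns remain. Dropping the first level avoids the perfectly collinear set of dummies that would otherwise inflate VIF values in a model with an intercept.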

Dropping Features to Reduce Multicollinearity in Logistic Regression Model

df = data.copy()

# One-hot encode the Assembly_Line_No feature
df = pd.get_dummies(df, columns=['Assembly_Line_No'], prefix='Assembly_Line', dtype='int', drop_first=True)

# Drop Date, the Downtime target, and the features that showed high VIF
df.drop(columns=['Date', 'Downtime', 'Air_System_Pressure(bar)', 'Hydraulic_Oil_Temperature', 'Spindle_Bearing_Temperature', 'Voltage(volts)', 'Spindle_Speed(RPM)', 'Coolant_Pressure(bar)', 'Torque(Nm)', 'Cutting(kN)', 'Tool_Vibration'], inplace=True)

print(df.shape)

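# Calculate VIF for each feature
# Note: variance_inflation_factor does not add an intercept itself,
# so these values come from regressions without a constant term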
vif_data = pd.DataFrame()
vif_data['Feature'] = df.columns
vif_data['VIF'] = [variance_inflation_factor(df.values, i) for i in range(df.shape[1])]
vif_data = vif_data.sort_values('VIF', ascending = False)

# Display the VIF values
print(vif_data)

After removing the features with high VIF ('Air_System_Pressure(bar)', 'Hydraulic_Oil_Temperature', 'Spindle_Bearing_Temperature', 'Voltage(volts)', 'Spindle_Speed(RPM)', 'Coolant_Pressure(bar)', 'Torque(Nm)', 'Cutting(kN)', and 'Tool_Vibration'), the dataset has only 5 features left for training the logistic regression model.
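The feature removal described here can also be automated. The following is a minimal sketch, not the procedure used in this notebook: it repeatedly drops the feature with the highest VIF until every remaining VIF is at or below a threshold. The helper names `vif_table` and `prune_by_vif`, the threshold of 5 (a common rule of thumb), and the synthetic data are all assumptions made for illustration. The VIFs are read from the diagonal of the inverse correlation matrix, which corresponds to a VIF computed with an intercept, so the numbers may differ from those produced by calling `variance_inflation_factor` on a matrix without a constant column.

```python
import numpy as np
import pandas as pd


def vif_table(features: pd.DataFrame) -> pd.Series:
    """VIF per column, read from the diagonal of the inverse correlation matrix."""
    corr = np.corrcoef(features.to_numpy(dtype=float), rowvar=False)
    return pd.Series(np.diag(np.linalg.inv(corr)), index=features.columns)


def prune_by_vif(features: pd.DataFrame, threshold: float = 5.0) -> pd.DataFrame:
    """Drop the highest-VIF column until all VIFs are <= threshold (assumed cutoff)."""
    kept = features.copy()
    while kept.shape[1] > 1:
        vifs = vif_table(kept)
        if vifs.max() <= threshold:
            break
        kept = kept.drop(columns=vifs.idxmax())
    return kept


# Synthetic data for illustration only: x2 is almost a copy of x1, x3 is independent.
rng = np.random.default_rng(0)
n = 500
x1 = rng.normal(size=n)
toy = pd.DataFrame({
    'x1': x1,
    'x2': x1 + rng.normal(scale=0.05, size=n),
    'x3': rng.normal(size=n),
})

kept = prune_by_vif(toy)
print(vif_table(toy).round(1))
print(list(kept.columns))
```

Dropping one feature at a time matters because VIFs are interdependent: removing one member of a collinear pair usually brings the other member's VIF back down, so removing every high-VIF feature in a single pass can discard more information than necessary.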
