
Can we predict machine failures before they happen?

Using machine learning to predict mechanical failure

Image credit: H.O. Penn

Creator Notes:

  • Use light mode view for optimal reading.
  • Plots are interactive, allowing you to explore the data in detail.

Executive Summary

  • LightGBM with RobustScaler achieved 98.9% accuracy in predicting machine failures.
  • Mechanical stress indicators (torque, cutting force, and pressure) are the most important predictors of failure.
  • A single model might be effective for all machines due to consistent failure patterns across them.
  • Investigate systematic issues causing failures (e.g., maintenance practices, environment).
  • Continuously monitor and evaluate the deployed LightGBM model's performance.

I. Background

Manufacturing downtime costs our aerospace and medical components facility both time and money. With three specialized machines producing different sized components, any unexpected failure disrupts production schedules and impacts delivery deadlines. Our current reactive maintenance approach means we fix problems after they occur — leading to longer downtimes and missed deadlines.

II. Objectives

This project aims to predict machine failures before they happen, enabling our maintenance team to plan repairs proactively. We will develop a predictive model using a year's worth of operational data to identify early warning signs of potential failures. Our key goals are identifying the most reliable failure indicators and determining whether machine-specific models perform better than a general approach.

III. The data

Our dataset contains daily operational measurements from three production machines spanning one year. Each record includes 13 sensor measurements — from hydraulic pressure to spindle vibration — along with machine identifiers and downtime status. The data comes from critical systems including cooling, hydraulics, and cutting mechanisms. All measurements follow standardized units and are recorded at consistent daily intervals.

Column Name | Description | Unit | Significance
Date | Daily timestamp of readings | YYYY-MM-DD | Tracks temporal patterns and maintenance history
Machine_ID | Unique machine identifier | Text | Enables machine-specific analysis and comparisons
Assembly_Line_No | Production line location | Integer | Maps physical layout and workflow dependencies
Hydraulic_Pressure | Hydraulic system pressure | bar | Indicates fluid power system health
Coolant_Pressure | Cooling system pressure | bar | Monitors heat dissipation efficiency
Air_System_Pressure | Pneumatic system pressure | bar | Reflects compressed air system status
Coolant_Temperature | Cooling system temperature | Celsius | Tracks thermal management effectiveness
Hydraulic_Oil_Temperature | Hydraulic fluid temperature | Celsius | Indicates system stress and oil condition
Spindle_Bearing_Temperature | Bearing temperature | Celsius | Monitors critical component health
Spindle_Vibration | Spindle oscillation | micrometers | Detects mechanical imbalances
Tool_Vibration | Cutting tool movement | micrometers | Indicates tool wear and stability
Spindle_Speed | Rotational velocity | RPM | Measures cutting performance
Voltage | Electrical input | volts | Monitors power supply stability
Torque | Rotational force | Nm | Indicates mechanical load
Cutting | Cutting force | kN | Measures material removal effort
Downtime | Operational status | Boolean | Records machine availability

The company has stored the machine operating data in a single table, available in 'data/machine_downtime.csv'.

# Data manipulation and analysis
import pandas as pd
import numpy as np

# Visualization libraries
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
import plotly.figure_factory as ff
from plotly.subplots import make_subplots

# Scikit-learn imports
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler, OneHotEncoder
from sklearn.impute import SimpleImputer, KNNImputer

# Metrics and evaluation
from sklearn.metrics import (
   accuracy_score, precision_score, recall_score, f1_score,
   roc_curve, precision_recall_curve, auc, roc_auc_score,
   confusion_matrix, classification_report
)

# Models
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier 
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier

# Configuration settings
import warnings
warnings.filterwarnings('ignore')

# Load the dataset and display the first 10 rows
data = pd.read_csv('data/machine_downtime.csv')
display(data.head(10))
# Display information about the dataset including the data types and non-null counts for each column
data.info()

IV. EDA Summary

For a comprehensive understanding of the data, readers are encouraged to review my previous report, "How do machines behave before downtime?", which provides a detailed exploratory data analysis. The key findings from that report that inform our current modeling approach are summarized below.

Correlation Analysis:

  • Sensor measurements are largely uncorrelated (all pairwise correlation coefficients < 0.25)
  • All variables retained for modeling due to their independent predictive potential
  • Temporal features include day_of_week and is_weekday indicators
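
A quick way to verify the low-correlation claim is to scan the absolute pairwise correlations between the numeric sensor columns. A minimal sketch, assuming data already carries the snake_case column names introduced in Section V:

# Sketch: confirm that no sensor pair exceeds the 0.25 correlation threshold.
# Assumes the snake_case column names from Section V.
numeric_cols = data.select_dtypes(include='number').columns.drop('downtime', errors='ignore')
corr = data[numeric_cols].corr().abs()

# Mask the diagonal and lower triangle so each pair is counted once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
strongest = upper.stack().sort_values(ascending=False)
print(strongest.head(10))  # expected: all values below 0.25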

Temporal Patterns:

  • Peak failure incidents observed in March-April 2022
  • Weekday failures occur 3x more frequently than weekend failures
  • Strong correlation between production schedules and machine reliability
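
The weekday effect is straightforward to reproduce with a simple aggregation. A minimal sketch, assuming the binary downtime encoding and the is_weekday flag created in Section V:

# Sketch: compare failure counts and rates on weekdays vs. weekends.
# Assumes `downtime` is encoded as 0/1 and `is_weekday` exists (Section V).
failure_summary = (data.groupby('is_weekday')['downtime']
                       .agg(failures='sum', days='count'))
failure_summary['failure_rate'] = failure_summary['failures'] / failure_summary['days']
print(failure_summary)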

Highly Predictive Variables:

  • Hydraulic pressure (bimodal distribution)
  • Tool vibration measurements
  • Spindle speed readings
  • Cutting force metrics
  • Torque values
  • Coolant pressure levels
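
The bimodal shape of hydraulic pressure, the first bullet above, is easy to confirm visually. A minimal Plotly sketch, where hydraulic_pressure_bar assumes the snake_case renaming from Section V:

# Sketch: visualize the bimodal hydraulic pressure distribution by downtime status.
# `hydraulic_pressure_bar` assumes the Section V column renaming.
fig = px.histogram(
    data,
    x='hydraulic_pressure_bar',
    color='downtime',       # overlay failure vs. non-failure days
    nbins=50,
    barmode='overlay',
    opacity=0.6,
    title='Hydraulic pressure distribution by downtime status',
)
fig.show()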

Preprocessing Strategy:

  1. Missing Value Treatment:

    • Mean imputation for normal distributions
    • KNN imputation for bimodal distributions
  2. Outlier Handling:

    • Retain outliers as they represent valid operational states
    • Evaluate multiple scaling methods
    • RobustScaler anticipated as the optimal choice given the retained outliers (see the sketch after this list)
  3. Class Balance:

    • No class imbalance treatment needed for 'downtime' variable
  4. Feature Engineering:

    • Initial approach without feature engineering due to low multicollinearity
    • Will reassess based on baseline model performance
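
A sketch of what steps 1 and 2 above might look like in code. The column groupings below are illustrative assumptions, not the final lists, and every transformer is fit on the training split only (created in Section V) to avoid leakage:

# Sketch of the imputation and scaling plan (steps 1-2 above).
# Column groupings are assumptions for illustration; fit on training data only.
mean_cols = ['coolant_pressure_bar', 'torque_nm']        # assumed roughly normal
bimodal_cols = ['hydraulic_pressure_bar']                # assumed bimodal

mean_imputer = SimpleImputer(strategy='mean')
knn_imputer = KNNImputer(n_neighbors=5)
scaler = RobustScaler()  # robust to the outliers we chose to retain

X_train[mean_cols] = mean_imputer.fit_transform(X_train[mean_cols])
X_train[bimodal_cols] = knn_imputer.fit_transform(X_train[bimodal_cols])
X_train[mean_cols + bimodal_cols] = scaler.fit_transform(X_train[mean_cols + bimodal_cols])

# Apply the already-fitted transformers to the test split (transform only)
X_test[mean_cols] = mean_imputer.transform(X_test[mean_cols])
X_test[bimodal_cols] = knn_imputer.transform(X_test[bimodal_cols])
X_test[mean_cols + bimodal_cols] = scaler.transform(X_test[mean_cols + bimodal_cols])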

Modeling Approach:

  1. Baseline: Logistic Regression
  2. Advanced Models:
    • Gradient Boosting (primary candidate for non-linear relationships)
    • Random Forest (alternative for handling outliers and interactions)

This structured approach will guide our model development and evaluation process.
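
Since all of these candidates are already imported above, they can be compared side by side with 5-fold cross-validation on the training split. A minimal sketch; the settings are illustrative defaults, not the final configuration:

# Sketch: compare baseline and advanced candidates via 5-fold cross-validation.
# Assumes X_train/y_train are fully numeric (imputed, encoded, and scaled).
candidates = {
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'Random Forest': RandomForestClassifier(random_state=1),
    'XGBoost': XGBClassifier(eval_metric='logloss', random_state=1),
    'LightGBM': LGBMClassifier(random_state=1),
}

for name, model in candidates.items():
    scores = cross_val_score(model, X_train, y_train, cv=5, scoring='f1')
    print(f"{name}: F1 = {scores.mean():.3f} (+/- {scores.std():.3f})")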

V. Data Preparation

1. Feature Preparation

Before preprocessing, the dataset was prepared by standardizing column names to snake_case for consistency, expanding the date into temporal features (e.g., day of week) that may improve the model, and mapping the target variable to a binary encoding.

# Replace parentheses with underscores and convert to lowercase
data.columns = [col.replace('(', '_').replace(')', '').lower() for col in data.columns]

# Convert to datetime to use .dt accessor
data['date'] = pd.to_datetime(data['date'])

# Extract day of the week
data['day_of_week'] = data['date'].dt.strftime('%A')

# Create a boolean flag for weekdays (Monday-Friday)
data['is_weekday'] = data['date'].dt.weekday < 5

# Define ordered categories for days
day_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']

# Convert day to ordered categorical variable
data['day_of_week'] = pd.Categorical(data['day_of_week'], categories=day_order, ordered=True)

# Convert the target variable to binary
data['downtime'] = data['downtime'].map({
    'Machine_Failure': 1,
    'No_Machine_Failure': 0})
# Check encoding
print("Target value counts:")
print(data['downtime'].value_counts())
print("\nTarget unique values:")
print(data['downtime'].unique())

As observed, our target class distribution is balanced.

2. Train-Test Split

With only 2,500 records, we opt for a simple train-test split rather than a three-way train/validation/test split. We allocate 80% (2,000 samples) for training and 20% (500 samples) for testing. Holding out a separate validation set would cut noticeably into our already limited training data, so model selection will instead rely on cross-validation within the training set.

TARGET = 'downtime'
TEST_SIZE = 0.2
RANDOM_STATE = 1

# Separate features (X) and target (y)
X = data.drop(['date', TARGET], axis=1) # Drop date and target
y = data[TARGET] 

# Print dataset overview
print("Dataset Overview:")
print(f"Total samples: {X.shape[0]}")
print(f"Features: {X.shape[1]}")
print("\nTarget distribution:")
print(y.value_counts(normalize=True))

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=TEST_SIZE,
    random_state=RANDOM_STATE,
    stratify=y
)

print("\nClass Distribution:")
print(f"  Training set:\n{pd.Series(y_train).value_counts(normalize=True).to_string()}") #to_string for better formatting
print(f"  Test set:\n{pd.Series(y_test).value_counts(normalize=True).to_string()}") #to_string for better formatting