Can we predict machine failures before they happen?
Using machine learning to predict mechanical failure
Image from H.O.Penn
Creator Notes:
- Use light-mode view for optimal reading.
- Plots are interactive, allowing you to explore the data in detail.
Executive Summary
- LightGBM with RobustScaler achieved high accuracy (98.9%) in predicting machine failures.
- Mechanical stress indicators (torque, cutting force, pressure) are the most important factors for predicting failures.
- A single model might be effective for all machines due to consistent failure patterns across them.
- Investigate systematic issues causing failures (e.g., maintenance practices, environment).
- Continuously monitor and evaluate the deployed LightGBM model's performance.
I. Background
Manufacturing downtime costs our aerospace and medical components facility both time and money. With three specialized machines producing different sized components, any unexpected failure disrupts production schedules and impacts delivery deadlines. Our current reactive maintenance approach means we fix problems after they occur — leading to longer downtimes and missed deadlines.
II. Objectives
This project aims to predict machine failures before they happen, enabling our maintenance team to plan repairs proactively. We will develop a predictive model using a year's worth of operational data to identify early warning signs of potential failures. Our key goals are identifying the most reliable failure indicators and determining whether machine-specific models perform better than a general approach.
III. The data
Our dataset contains daily operational measurements from three production machines spanning one year. Each record includes 13 sensor measurements — from hydraulic pressure to spindle vibration — along with machine identifiers and downtime status. The data comes from critical systems including cooling, hydraulics, and cutting mechanisms. All measurements follow standardized units and are recorded at consistent daily intervals.
| Column Name | Description | Unit | Significance |
|---|---|---|---|
| Date | Daily timestamp of readings | YYYY-MM-DD | Tracks temporal patterns and maintenance history |
| Machine_ID | Unique machine identifier | Text | Enables machine-specific analysis and comparisons |
| Assembly_Line_No | Production line location | Integer | Maps physical layout and workflow dependencies |
| Hydraulic_Pressure | Hydraulic system pressure | bar | Indicates fluid power system health |
| Coolant_Pressure | Cooling system pressure | bar | Monitors heat dissipation efficiency |
| Air_System_Pressure | Pneumatic system pressure | bar | Reflects compressed air system status |
| Coolant_Temperature | Cooling system temperature | Celsius | Tracks thermal management effectiveness |
| Hydraulic_Oil_Temperature | Hydraulic fluid temperature | Celsius | Indicates system stress and oil condition |
| Spindle_Bearing_Temperature | Bearing temperature | Celsius | Monitors critical component health |
| Spindle_Vibration | Spindle oscillation | micrometers | Detects mechanical imbalances |
| Tool_Vibration | Cutting tool movement | micrometers | Indicates tool wear and stability |
| Spindle_Speed | Rotational velocity | RPM | Measures cutting performance |
| Voltage | Electrical input | volts | Monitors power supply stability |
| Torque | Rotational force | Nm | Indicates mechanical load |
| Cutting | Cutting force on the tool | kN | Measures material removal effort |
| Downtime | Operational status | Boolean | Records machine availability |
The company has stored the machine operating data in a single table, available in 'data/machine_downtime.csv'.
# Data manipulation and analysis
import pandas as pd
import numpy as np
# Visualization libraries
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
import plotly.figure_factory as ff
from plotly.subplots import make_subplots
# Scikit-learn imports
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler, OneHotEncoder
from sklearn.impute import SimpleImputer, KNNImputer
# Metrics and evaluation
from sklearn.metrics import (
accuracy_score, precision_score, recall_score, f1_score,
roc_curve, precision_recall_curve, auc, roc_auc_score,
confusion_matrix, classification_report
)
# Models
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
# Configuration settings
import warnings
warnings.filterwarnings('ignore')
# Load the dataset and display the first 10 rows
data = pd.read_csv('data/machine_downtime.csv')
display(data.head(10))
# Display dataset information, including data types and non-null counts per column
data.info()
IV. EDA Summary
For a comprehensive understanding of the data, readers are encouraged to review my previous report, *How do machines behave before downtime?*, which provides the detailed exploratory data analysis. The key findings below inform our current modeling approach:
Correlation Analysis:
- Sensor measurements are largely independent of one another (all pairwise correlation coefficients below 0.25; reproduced in the sketch after this list)
- All variables retained for modeling due to their independent predictive potential
- Temporal features include day_of_week and is_weekday indicators
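As a rough illustration (a sketch, not the original report's code), the pairwise correlation check can be reproduced along these lines once the data is loaded:
# Pairwise correlations between the numeric sensor columns
numeric_cols = data.select_dtypes(include='number').columns
corr = data[numeric_cols].corr()
# List any pair exceeding the 0.25 threshold noted above (excluding self-correlations)
high = (corr.abs() > 0.25) & (corr.abs() < 1.0)
print(corr.where(high).stack())
# Interactive heatmap, consistent with the plotly-based figures in this report
fig = px.imshow(corr.round(2), text_auto=True, color_continuous_scale='RdBu_r', zmin=-1, zmax=1)
fig.show()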
Temporal Patterns:
- Peak failure incidents observed in March-April 2022
- Weekday failures occur roughly 3x more frequently than weekend failures (see the sketch after this list)
- Strong correlation between production schedules and machine reliability
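For context, the weekday-versus-weekend comparison can be approximated as follows (a sketch; 'Date' and 'Downtime' are the raw column names from the data dictionary above, before the snake_case cleaning applied in Section V):
# Compare failure rates on weekdays versus weekends
tmp = data.copy()
tmp['Date'] = pd.to_datetime(tmp['Date'])
tmp['is_weekday'] = tmp['Date'].dt.weekday < 5  # Monday=0 ... Sunday=6
failures = tmp[tmp['Downtime'] == 'Machine_Failure']
# Share of records that are failures, per weekday/weekend group
rate = failures['is_weekday'].value_counts() / tmp['is_weekday'].value_counts()
print(rate)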
High-Predictive Variables:
- Hydraulic pressure (bimodal distribution; illustrated in the sketch after this list)
- Tool vibration measurements
- Spindle speed readings
- Cutting force metrics
- Torque values
- Coolant pressure levels
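To illustrate why hydraulic pressure stands out, a minimal plotting sketch (it assumes 'Hydraulic_Pressure(bar)' is the raw column name; adjust if the CSV differs):
# Overlay the hydraulic pressure distribution by downtime status
fig = px.histogram(
    data, x='Hydraulic_Pressure(bar)', color='Downtime',
    barmode='overlay', nbins=50, opacity=0.6,
    title='Hydraulic pressure by downtime status'
)
fig.show()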
Preprocessing Strategy:
- Missing Value Treatment (see the sketch after this list):
  - Mean imputation for normally distributed variables
  - KNN imputation for bimodal distributions
- Outlier Handling:
  - Retain outliers, as they represent valid operational states
  - Evaluate multiple scaling methods
  - RobustScaler anticipated as the optimal choice given the presence of outliers
- Class Balance:
  - No class imbalance treatment needed for the 'downtime' variable
- Feature Engineering:
  - Initial approach without feature engineering, given the low multicollinearity
  - Will reassess based on baseline model performance
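To make this plan concrete, here is a minimal sketch of the intended imputation-and-scaling pipeline (the column names are illustrative placeholders, not the dataset's actual names; in practice the pipeline is fitted on the training split created in Section V):
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
# Illustrative column groups -- replace with the real snake_case names
normal_cols = ['coolant_pressure_bar', 'torque_nm']   # roughly normal -> mean imputation
bimodal_cols = ['hydraulic_pressure_bar']             # bimodal -> KNN imputation
preprocess = ColumnTransformer([
    ('normal', Pipeline([('impute', SimpleImputer(strategy='mean')),
                         ('scale', RobustScaler())]), normal_cols),
    ('bimodal', Pipeline([('impute', KNNImputer(n_neighbors=5)),
                          ('scale', RobustScaler())]), bimodal_cols),
])
# RobustScaler centers on the median and scales by the IQR, so the retained
# outliers do not distort the scaling the way they would with StandardScaler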
Modeling Approach:
- Baseline: Logistic Regression
- Advanced models:
  - Gradient Boosting (primary candidate for capturing non-linear relationships)
  - Random Forest (alternative for handling outliers and feature interactions)
This structured approach will guide our model development and evaluation process.
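As a preview of the evaluation loop (a sketch only; it assumes X_train and y_train from the split in Section V, with features already numerically encoded and preprocessed), the candidates can be compared with cross-validation:
# Cross-validated comparison of the candidate models on the training split
candidates = {
    'Logistic Regression (baseline)': LogisticRegression(max_iter=1000),
    'Random Forest': RandomForestClassifier(random_state=1),
    'Gradient Boosting (LightGBM)': LGBMClassifier(random_state=1),
}
for name, model in candidates.items():
    scores = cross_val_score(model, X_train, y_train, cv=5, scoring='f1')
    print(f'{name}: F1 = {scores.mean():.3f} (+/- {scores.std():.3f})')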
V. Data Preparation
1. Initial Cleaning and Feature Extraction
Before preprocessing, the dataset was prepared by standardizing column names to snake_case for consistency, expanding the date into temporal features (e.g., day of week) that may aid the model, and mapping the target variable to binary values.
# Replace parentheses with underscores and convert to lowercase
data.columns = [col.replace('(', '_').replace(')', '').lower() for col in data.columns]
# Convert to datetime to use .dt accessor
data['date'] = pd.to_datetime(data['date'])
# Extract day of the week
data['day_of_week'] = data['date'].dt.strftime('%A')
# Create boolean column to check if is weekday
data['is_weekday'] = data['date'].dt.weekday < 5
# Define ordered categories for days
day_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
# Convert day to ordered categorical variable
data['day_of_week'] = pd.Categorical(data['day_of_week'], categories=day_order, ordered=True)
# Convert the target variable to binary
data['downtime'] = data['downtime'].map({
'Machine_Failure': 1,
'No_Machine_Failure': 0})
# Check the encoding
print("Target value counts:")
print(data['downtime'].value_counts())
print("\nTarget unique values:")
print(data['downtime'].unique())
As observed, our target class distribution is balanced.
2. Train-Test Split
With only 2,500 records, we opt for a simple train-test split: 80% (2,000 samples) for training and 20% (500 samples) for testing. Carving out a separate validation set would shrink an already small training set, so model selection will instead rely on cross-validation within the training data.
TARGET = 'downtime'
TEST_SIZE = 0.2
RANDOM_STATE = 1
# Separate features (X) and target (y)
X = data.drop(['date', TARGET], axis=1) # Drop date and target
y = data[TARGET]
# Print dataset overview
print("Dataset Overview:")
print(f"Total samples: {X.shape[0]}")
print(f"Features: {X.shape[1]}")
print("\nTarget distribution:")
print(y.value_counts(normalize=True))
# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
X, y,
test_size=TEST_SIZE,
random_state=RANDOM_STATE,
stratify=y
)
print("\nClass Distribution:")
print(f" Training set:\n{pd.Series(y_train).value_counts(normalize=True).to_string()}") #to_string for better formatting
print(f" Test set:\n{pd.Series(y_test).value_counts(normalize=True).to_string()}") #to_string for better formatting