Edge-Enabled Predictive Maintenance: Real-Time Insights from Machine Learning Models
Background
You work for a manufacturer of high-precision metal components used in aerospace, automotive, and medical device applications. Your company operates three different machines on its shop floor, each producing different-sized components, so minimizing the downtime of these machines is vital for meeting production deadlines.
Your team wants to use a data-driven approach to predicting machine downtime, so maintenance can be planned proactively rather than performed in reaction to machine failure. To support this, your company has been collecting operational data for over a year, along with a record of whether each machine was down on each of those days.
In this third level, you're going to develop a predictive model that could be combined with real-time operational data to detect likely machine failure. This level is aimed towards advanced learners. If you want to challenge yourself a bit less, check out the other levels!
The data
The company has stored the machine operating data in a single table, available in 'data/machine_downtime.csv'.
Each row in the table represents the operational data for a single machine on a given day:
"Date"- the date the reading was taken on."Machine_ID"- the unique identifier of the machine being read."Assembly_Line_No"- the unique identifier of the assembly line the machine is located on."Hydraulic_Pressure(bar)","Coolant_Pressure(bar)", and"Air_System_Pressure(bar)"- pressure measurements at different points in the machine."Coolant_Temperature","Hydraulic_Oil_Temperature", and"Spindle_Bearing_Temperature"- temperature measurements (in Celsius) at different points in the machine."Spindle_Vibration","Tool_Vibration", and"Spindle_Speed(RPM)"- vibration (measured in micrometers) and rotational speed measurements for the spindle and tool."Voltage(volts)"- the voltage supplied to the machine."Torque(Nm)"- the torque being generated by the machine."Cutting(KN)"- the cutting force of the tool."Downtime"- an indicator of whether the machine was down or not on the given day.
import pandas as pd
downtime = pd.read_csv('data/machine_downtime.csv')
downtime.head()
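Before going further, it can help to confirm that the file actually contains the columns described in the data dictionary above. The following is a small illustrative check against the downtime DataFrame just loaded; the expected_columns list simply restates the data dictionary and is not part of the original project code.
# Illustrative schema check: confirm the loaded file matches the data dictionary above
expected_columns = [
    "Date", "Machine_ID", "Assembly_Line_No",
    "Hydraulic_Pressure(bar)", "Coolant_Pressure(bar)", "Air_System_Pressure(bar)",
    "Coolant_Temperature", "Hydraulic_Oil_Temperature", "Spindle_Bearing_Temperature",
    "Spindle_Vibration", "Tool_Vibration", "Spindle_Speed(RPM)",
    "Voltage(volts)", "Torque(Nm)", "Cutting(kN)", "Downtime",
]
missing_cols = set(expected_columns) - set(downtime.columns)
unexpected_cols = set(downtime.columns) - set(expected_columns)
print("Missing columns:", missing_cols or "none")
print("Unexpected columns:", unexpected_cols or "none")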
import pandas as pd
# Load the uploaded dataset
file_path = 'data/machine_downtime.csv'
data = pd.read_csv(file_path)
# Check for missing values
missing_values = data.isnull().sum()
# Check for outliers (basic statistics for numerical columns)
numerical_columns = data.select_dtypes(include=['float64', 'int64']).columns
outliers_summary = data[numerical_columns].describe()
# Convert 'Date' to datetime
data['Date'] = pd.to_datetime(data['Date'], errors='coerce')
# Convert 'Downtime' to a binary categorical variable if not already
data['Downtime'] = data['Downtime'].astype('category')
missing_values, outliers_summary, data.dtypes.head()
Missing Values:
- Hydraulic_Pressure(bar): 10 missing values.
- Coolant_Pressure(bar): 19 missing values.
- Air_System_Pressure(bar): 17 missing values.
- Other columns, such as Coolant_Temperature, Torque(Nm), and Cutting(kN), also contain missing values.
Outliers:
Potential outliers are indicated by extreme minimum and maximum values in some columns (quantified further in the sketch after this summary):
- Hydraulic_Pressure(bar): min = -14.32 (likely invalid, since pressure values should not be negative).
- Spindle_Speed(RPM): min = 0, which might indicate a non-operational state.
- Spindle_Vibration: min = -0.461 (negative vibration is likely incorrect).
Data Types:
- Date: successfully converted to datetime64[ns].
- Downtime: converted to a binary categorical variable.
- Other numerical and categorical columns are appropriately typed.
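Before fixing these issues, a quick interquartile-range (IQR) check can quantify how many readings fall outside a plausible range for each sensor. This is a minimal sketch of such a check, assuming the data and numerical_columns variables defined above; it is illustrative and not part of the original cleaning pipeline.
# Illustrative IQR-based outlier count per numerical column (not part of the original pipeline)
q1 = data[numerical_columns].quantile(0.25)
q3 = data[numerical_columns].quantile(0.75)
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
# Count readings outside the 1.5 * IQR fences for each sensor column
outlier_counts = (
    (data[numerical_columns] < lower_fence) | (data[numerical_columns] > upper_fence)
).sum()
print(outlier_counts.sort_values(ascending=False))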
Handle Missing Values and Outliers
# Handling Missing Values
# Fill missing values in numerical columns with the median
data[numerical_columns] = data[numerical_columns].apply(lambda x: x.fillna(x.median()))
# Handling Outliers
# Define rules for handling invalid negative values
columns_with_nonnegative_values = [
"Hydraulic_Pressure(bar)",
"Coolant_Pressure(bar)",
"Air_System_Pressure(bar)",
"Spindle_Speed(RPM)",
"Spindle_Vibration",
"Tool_Vibration",
]
# Replace negative values with 0 in the specified columns (these readings cannot be negative)
data[columns_with_nonnegative_values] = data[columns_with_nonnegative_values].clip(lower=0)
# Calculate the 99th percentile for each numerical column
percentile_99 = data[numerical_columns].quantile(0.99)
# Apply clipping for each column individually using the 99th percentile
for column in numerical_columns:
    data[column] = data[column].clip(upper=percentile_99[column])
# Verify changes: Check for remaining missing values and updated statistics
remaining_missing_values = data.isnull().sum()
outliers_summary_updated = data[numerical_columns].describe()
remaining_missing_values, outliers_summary_updated
Exploratory Data Analysis (EDA)
import matplotlib.pyplot as plt
import seaborn as sns
# 1. Distribution of numerical features
plt.figure(figsize=(16, 12))
for i, column in enumerate(numerical_columns, 1):
    plt.subplot(4, 4, i)
    sns.histplot(data[column], kde=True, bins=30, color='blue')
    plt.title(f"Distribution of {column}")
    plt.xlabel(column)
    plt.ylabel("Frequency")
plt.tight_layout()
plt.show()
# 2. Relationship with Downtime
plt.figure(figsize=(16, 12))
for i, column in enumerate(numerical_columns, 1):
    plt.subplot(4, 4, i)
    sns.boxplot(x=data['Downtime'], y=data[column], palette="Set2")
    plt.title(f"{column} vs Downtime")
    plt.xlabel("Downtime")
    plt.ylabel(column)
plt.tight_layout()
plt.show()
# 3. Correlation Heatmap
correlation_matrix = data[numerical_columns].corr()
plt.figure(figsize=(12, 8))
sns.heatmap(correlation_matrix, annot=True, fmt=".2f", cmap="coolwarm", cbar=True)
plt.title("Correlation Heatmap")
plt.show()
# Convert 'Downtime' to numerical for aggregation
# Note: cat.codes assigns codes in alphabetical category order, so check
# data['Downtime'].cat.categories to confirm that code 1 marks a downtime event,
# and remap if it does not.
data['Downtime_numeric'] = data['Downtime'].cat.codes
# Time-Series Analysis (Downtime over Time)
downtime_over_time = data.groupby('Date')['Downtime_numeric'].sum()
plt.figure(figsize=(12, 6))
downtime_over_time.plot()
plt.title("Downtime Over Time")
plt.xlabel("Date")
plt.ylabel("Downtime Count")
plt.grid(True)
plt.show()
Observations from EDA:
- Feature Distributions: Most numerical features have unimodal distributions with varying degrees of skewness. Some features, such as Spindle_Speed(RPM) and Voltage(volts), show distinct peaks, indicating specific operational ranges.
- Relationship with Downtime: Certain features, like Spindle_Vibration and Torque(Nm), show clear differences in distribution between downtime and no-downtime days. Features like Hydraulic_Pressure(bar) have overlapping distributions, suggesting limited predictive power.
- Correlation Heatmap: The pressure and temperature features are moderately correlated with one another, which may indicate potential multicollinearity. Some features show little correlation with Downtime, implying their contribution to failure may not be linear.
- Time-Series Analysis: Downtime occurrences vary over time, with certain periods exhibiting spikes in failures. This indicates potential temporal patterns that could be explored further (a sketch after this list digs into this a little more).
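One way to probe those temporal patterns further is to compare the downtime rate per machine across calendar months. The following is a minimal sketch of such a follow-up, assuming the data DataFrame and the Downtime_numeric column created above; it is illustrative and not part of the original analysis.
# Illustrative follow-up: mean Downtime_numeric per machine and calendar month
# (verify which Downtime label maps to code 1 before reading this as a failure rate)
monthly_rate = (
    data.assign(month=data['Date'].dt.to_period('M'))
        .groupby(['month', 'Machine_ID'])['Downtime_numeric']
        .mean()
        .unstack('Machine_ID')
)
monthly_rate.plot(figsize=(12, 6), marker='o')
plt.title("Monthly Downtime Rate by Machine")
plt.xlabel("Month")
plt.ylabel("Mean Downtime_numeric")
plt.grid(True)
plt.show()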
Feature Engineering and Model Preparation
from sklearn.model_selection import train_test_split
# Feature Engineering
# Add 7-day rolling averages for selected columns
# Sort by machine and date first so each rolling window follows chronological order
data = data.sort_values(["Machine_ID", "Date"]).reset_index(drop=True)
rolling_columns = ["Hydraulic_Pressure(bar)", "Coolant_Temperature", "Torque(Nm)"]
for col in rolling_columns:
    data[f"{col}_rolling_mean"] = data.groupby("Machine_ID")[col].transform(
        lambda x: x.rolling(window=7, min_periods=1).mean()
    )
# Interaction term: product of hydraulic pressure and coolant temperature
data["Pressure_Temperature_Interaction"] = (
    data["Hydraulic_Pressure(bar)"] * data["Coolant_Temperature"]
)
# Define features (X) and target (y)
target = "Downtime_numeric"
features = [col for col in data.columns if col not in ["Date", "Machine_ID", "Downtime", "Downtime_numeric"]]
X = data[features]
y = data[target]
# Split data into training, validation, and test sets (70%-15%-15% split)
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42, stratify=y_temp)
# Verify splits
X_train.shape, X_val.shape, X_test.shape, y_train.shape, y_val.shape, y_test.shape
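Since the split is stratified on the target, it is worth confirming that each subset ended up with similar class proportions. A minimal check along those lines, assuming the y_train, y_val, and y_test variables above (illustrative, not part of the original code):
# Illustrative check that stratified splitting preserved the target's class proportions
for name, labels in [("train", y_train), ("validation", y_val), ("test", y_test)]:
    proportions = labels.value_counts(normalize=True).round(3).to_dict()
    print(f"{name}: n={len(labels)}, class proportions: {proportions}")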
Training
# Import necessary libraries
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
# Function to evaluate model performance
def evaluate_model(y_true, y_pred, model_name):
    print(f"=== Evaluation Metrics for {model_name} ===")
    print(f"Accuracy: {accuracy_score(y_true, y_pred):.4f}")
    print(f"Precision: {precision_score(y_true, y_pred):.4f}")
    print(f"Recall: {recall_score(y_true, y_pred):.4f}")
    print(f"F1 Score: {f1_score(y_true, y_pred):.4f}")
    print(classification_report(y_true, y_pred))
    print("=" * 50)
# Preprocess the data
# Ensure categorical columns are encoded
categorical_columns = ["Assembly_Line_No"]
encoder = OneHotEncoder(sparse_output=False, drop='first') # Use 'sparse_output' for newer versions
encoded_categorical = encoder.fit_transform(data[categorical_columns])
encoded_columns = pd.DataFrame(
    encoded_categorical,
    columns=encoder.get_feature_names_out(categorical_columns),
    index=data.index,
)
# Drop original categorical columns and merge encoded ones
X = X.drop(columns=categorical_columns)
X = pd.concat([X, encoded_columns], axis=1)
# Split data into training, validation, and test sets
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42, stratify=y_temp)
# Initialize models
random_forest = RandomForestClassifier(random_state=42)
xgboost = XGBClassifier(eval_metric="logloss", random_state=42)  # 'use_label_encoder' is deprecated in recent XGBoost releases and no longer needed
stacking_meta_model = LogisticRegression(random_state=42)
# Train Random Forest
print("Training Random Forest...")
random_forest.fit(X_train, y_train)
rf_preds_val = random_forest.predict(X_val)
# Train XGBoost
print("Training XGBoost...")
xgboost.fit(X_train, y_train)
xgb_preds_val = xgboost.predict(X_val)
# Stacking (Hybrid Model)
print("Training Hybrid Model...")
# Note: the meta-model is fit on base-model probabilities for the same rows the base models
# were trained on, which risks optimistic leakage; out-of-fold predictions would be safer.
stacking_train = np.column_stack([
    random_forest.predict_proba(X_train)[:, 1],
    xgboost.predict_proba(X_train)[:, 1],
])
stacking_val = np.column_stack([
    random_forest.predict_proba(X_val)[:, 1],
    xgboost.predict_proba(X_val)[:, 1],
])
stacking_meta_model.fit(stacking_train, y_train)
stacking_preds_val = stacking_meta_model.predict(stacking_val)
# Evaluate models
evaluate_model(y_val, rf_preds_val, "Random Forest")
evaluate_model(y_val, xgb_preds_val, "XGBoost")
evaluate_model(y_val, stacking_preds_val, "Hybrid Model")
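To connect this back to the edge-deployment goal, a natural final step is to evaluate the chosen model once on the held-out test set and then serialize it so an edge device can score incoming sensor readings. The sketch below uses joblib for this; the choice of the Random Forest model and the file name downtime_model.joblib are illustrative assumptions rather than part of the original workflow.
# Illustrative final step (not part of the original workflow): test-set evaluation and model export
import joblib
rf_preds_test = random_forest.predict(X_test)
evaluate_model(y_test, rf_preds_test, "Random Forest (test set)")
# Persist the fitted model so an edge device can load it and score new readings
joblib.dump(random_forest, "downtime_model.joblib")
# On the edge device: load the model and score a single engineered reading
loaded_model = joblib.load("downtime_model.joblib")
new_reading = X_test.iloc[[0]]  # stand-in for one freshly engineered sensor reading
positive_class_probability = loaded_model.predict_proba(new_reading)[:, 1][0]
print(f"Predicted probability of the positive Downtime class: {positive_class_probability:.3f}")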