Competition - Exploratory data analysis of an industrial machine downtime dataset 🔧 - Level 1

Predicting Industrial Machine Downtime: Level 1

📖 Background

You work for a manufacturer of high-precision metal components used in aerospace, automotive, and medical device applications. Your company operates three different machines on its shop floor that produce different sized components, so minimizing the downtime of these machines is vital for meeting production deadlines.

Your team wants to use a data-driven approach to predicting machine downtime, so that maintenance can be planned proactively rather than in reaction to machine failure. To support this, your company has been collecting operational data for over a year, along with whether each machine was down at those times.

In this first level, you're going to explore and describe the data. This level is aimed towards beginners. If you want to challenge yourself a bit more, check out level two!

💾 The data

The company has stored the machine operating data in a single table, available in 'data/machine_downtime.csv'.

Each row in the table represents the operational data for a single machine on a given day:

"Date" - the date the reading was taken on.
"Machine_ID" - the unique identifier of the machine being read.
"Assembly_Line_No" - the unique identifier of the assembly line the machine is located on.
"Hydraulic_Pressure(bar)", "Coolant_Pressure(bar)", and "Air_System_Pressure(bar)" - pressure measurements at different points in the machine.
"Coolant_Temperature", "Hydraulic_Oil_Temperature", and "Spindle_Bearing_Temperature" - temperature measurements (in Celsius) at different points in the machine.
"Spindle_Vibration", "Tool_Vibration", and "Spindle_Speed(RPM)" - vibration (measured in micrometers) and rotational speed measurements for the spindle and tool.
"Voltage(volts)" - the voltage supplied to the machine.
"Torque(Nm)" - the torque being generated by the machine.
"Cutting(KN)" - the cutting force of the tool.
"Downtime" - an indicator of whether the machine was down or not on the given day.

Executive summary and recommendations

Because proactive maintenance matters in many areas, our manufacturing company has decided to act on it. Machine operating data were recorded and sent for analysis. Here is what came out:

The dataset initially contained 2500 readings; 2378 records remained after cleaning.
The readings were taken over a period of approximately 7 months, from 2021-11-24 to 2022-06-19.
The average torque across machines is 25.20 Nm.
Shopfloor-L1 is the assembly line with the most readings, both with and without machine failure.
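The figures above can be derived with a few pandas calls. The following is a minimal sketch on a small invented table (the `toy` frame and its values are assumptions for illustration, not rows from 'data/machine_downtime.csv'); the date format in the real file may also differ and need an explicit `format=` argument.

```python
import pandas as pd

# Toy stand-in for the real table; all values here are invented.
toy = pd.DataFrame({
    "Date": ["2021-11-24", "2022-01-10", "2022-06-19", "2022-03-05"],
    "Assembly_Line_No": ["Shopfloor-L1", "Shopfloor-L1", "Shopfloor-L2", "Shopfloor-L3"],
    "Torque(Nm)": [20.0, 30.0, 25.0, 25.0],
    "Downtime": ["Machine_Failure", "No_Machine_Failure",
                 "Machine_Failure", "No_Machine_Failure"],
})
toy["Date"] = pd.to_datetime(toy["Date"])

n_readings = len(toy)                                   # number of records
first_day, last_day = toy["Date"].min(), toy["Date"].max()  # period covered
mean_torque = toy["Torque(Nm)"].mean()                  # average torque

# Readings per assembly line, split by downtime status
readings_per_line = pd.crosstab(toy["Assembly_Line_No"], toy["Downtime"])
busiest_line = toy["Assembly_Line_No"].value_counts().idxmax()

print(n_readings, first_day.date(), last_day.date(), mean_torque, busiest_line)
print(readings_per_line)
```

Running the same calls on the cleaned DataFrame instead of `toy` would give the counts, date range, and average reported in the summary.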

Recommendations

Here are some recommendations:

The dataset is made up of 12 numeric fields and 4 non-numeric fields. Because the missing values are concentrated in the numeric fields, they could be replaced by each field's median, since the means are affected by outliers.
Further analysis could also examine the relationships between two or more variables, to inform the choice of variables for a downtime model.
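Both recommendations can be sketched briefly. The example below uses an invented `toy` frame with one deliberate outlier to show why the median is the safer fill value, then computes pairwise correlations between the numeric fields. It is an illustration under those assumptions, not the analysis performed in this notebook.

```python
import numpy as np
import pandas as pd

# Invented values; 250.0 is a deliberate outlier in the torque column.
toy = pd.DataFrame({
    "Torque(Nm)": [24.0, 25.0, np.nan, 26.0, 250.0],
    "Voltage(volts)": [340.0, np.nan, 350.0, 345.0, 355.0],
    "Downtime": ["No_Machine_Failure", "No_Machine_Failure", "Machine_Failure",
                 "No_Machine_Failure", "Machine_Failure"],
})

numeric_cols = toy.select_dtypes(include="number").columns

# The outlier pulls the mean far from the typical value; the median is not affected.
torque_mean = toy["Torque(Nm)"].mean()      # 81.25
torque_median = toy["Torque(Nm)"].median()  # 25.5

# Fill each numeric column with its own median (fillna aligns on column names).
imputed = toy.copy()
imputed[numeric_cols] = imputed[numeric_cols].fillna(toy[numeric_cols].median())

# Pairwise Pearson correlations between the numeric fields
correlations = imputed[numeric_cols].corr()
print(imputed)
print(correlations)
```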

import pandas as pd
downtime = pd.read_csv('data/machine_downtime.csv')
downtime.head()

Getting to know the dataset

downtime.info()

downtime.describe()

# drop_duplicates() returns a new DataFrame, so assign the result to keep it
downtime = downtime.drop_duplicates()

Explore negative and zero values for some variables in detail

downtime[(downtime['Hydraulic_Pressure(bar)'] <= 0) | (downtime['Spindle_Vibration'] <= 0) | 
         (downtime['Spindle_Speed(RPM)'] <= 0) | (downtime['Torque(Nm)'] <= 0)]

Cleaning the dataset

There are negative or zero values for some operational measures in rows where Downtime is No_Machine_Failure. Such readings are not plausible for a running machine, so those rows need to be removed.

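# Flag rows where a measure that should be positive is zero or negative
invalid = ((downtime['Hydraulic_Pressure(bar)'] <= 0) | (downtime['Spindle_Vibration'] <= 0) |
           (downtime['Spindle_Speed(RPM)'] <= 0) | (downtime['Torque(Nm)'] <= 0))
# Drop flagged rows unless the machine was recorded as failed
downtime_clean = downtime[~(invalid & (downtime['Downtime'] != 'Machine_Failure'))]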

downtime_clean.describe()

downtime_clean.isna().mean() * 100

There are missing values only in numeric fields, and the proportion of NaN values for each numeric variable is less than 1%. Further cleaning can be done by simple imputation using the mean or median, or by dropping the rows with missing values from the dataset. Here, the records with missing values were dropped.
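The drop step itself is not shown in the cells above. A minimal sketch of it, on an invented frame (`toy_clean` and `toy_no_na` are hypothetical names, and the values are made up), might look like this:

```python
import numpy as np
import pandas as pd

# Invented stand-in for downtime_clean, with two incomplete rows.
toy_clean = pd.DataFrame({
    "Torque(Nm)": [24.0, np.nan, 26.0, 25.0],
    "Coolant_Temperature": [18.5, 19.0, np.nan, 20.1],
    "Downtime": ["No_Machine_Failure", "Machine_Failure",
                 "No_Machine_Failure", "Machine_Failure"],
})

# Share of missing values per column, in percent
missing_pct = toy_clean.isna().mean() * 100

# Drop every row that has at least one missing value, then renumber the index
toy_no_na = toy_clean.dropna().reset_index(drop=True)

print(missing_pct)
print(len(toy_clean), "->", len(toy_no_na))
```

With under 1% missing per column, dropping rows costs little data; with a larger share of missing values, imputation would usually preserve more of the sample.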
