Cyber threats are a growing concern for organizations worldwide. These threats take many forms, including malware, phishing, and denial-of-service (DoS) attacks, which can compromise sensitive information and disrupt operations. The increasing sophistication and frequency of these attacks make it imperative for organizations to adopt advanced security measures. Traditional threat detection methods often fall short because they cannot adapt to new and evolving threats. This is where deep learning models come into play.

Deep learning models can analyze vast amounts of data and identify patterns that may not be immediately obvious to human analysts. By leveraging these models, organizations can proactively detect and mitigate cyber threats, safeguarding their sensitive information and ensuring operational continuity.

As a cybersecurity analyst, your task is to identify and mitigate these threats. In this project, you will design and implement a deep learning model to detect cyber threats. The BETH dataset simulates real-world logs, providing a rich source of information for training and testing your model. The data has already been preprocessed, and a target label, sus_label, indicates whether each event is malicious (1) or benign (0).

By successfully developing this model, you will contribute to enhancing cybersecurity measures and protecting organizations from potentially devastating cyber attacks.

The Data

  • processId: The unique identifier for the process that generated the event - int64
  • threadId: ID for the thread spawning the log - int64
  • parentProcessId: Label for the process spawning this log - int64
  • userId: ID of the user spawning the log - int64
  • mountNamespace: Mounting restrictions the process log works within - int64
  • argsNum: Number of arguments passed to the event - int64
  • returnValue: Value returned from the event log (usually 0) - int64
  • sus_label: Binary label marking the event as suspicious (1) or benign (0) - int64

More information on the dataset: BETH dataset.

# Make sure to run this cell to install torchmetrics. If you cannot install torchmetrics with pip, you can use sklearn instead.
!pip install torchmetrics
# Torchmetrics is a library that works with PyTorch to simplify calculating performance metrics for machine learning models. It provides pre-built implementations of common metrics such as accuracy, precision, recall, and F1 score, so you can evaluate and monitor model performance during training and validation without writing the metric code from scratch.
# Import required libraries
import pandas as pd  # data manipulation
from sklearn.preprocessing import StandardScaler  # feature scaling
import torch
import torch.nn as nn
import torch.nn.functional as F  # conventional alias for the functional API
from torch.utils.data import DataLoader, TensorDataset  # mini-batch utilities
import torch.optim as optim
from torchmetrics import Accuracy
# from sklearn.metrics import accuracy_score  # uncomment to use sklearn instead
# Load preprocessed data
train_df = pd.read_csv('labelled_train.csv')
test_df = pd.read_csv('labelled_test.csv')
val_df = pd.read_csv('labelled_validation.csv')

# View the first 5 rows of training set
train_df.head()
# A log is a record of events or activities that happen within a system, application, or process. In computing, logs capture important information like errors, user actions, system processes, or security events. Logs are used to track what happens in a system so that issues can be identified, monitored, or fixed later.

# For example, if a program crashes or detects suspicious activity, it writes details about that event in a log. In cybersecurity, logs help in understanding potential security threats by recording system events.
  • processId: A special number that shows which process (or task) created the log.
  • threadId: A number showing which part of the process (thread) created the log.
  • parentProcessId: The number of the process that started the one you're looking at.
  • userId: The ID of the person who started the process.
  • mountNamespace: Rules about what files the process can access.
  • argsNum: The number of inputs given to the process when it started.
  • returnValue: The result of the process (usually 0 if it worked fine).
  • sus_label: A marker that tells if the event is suspicious (1 = suspicious, 0 = safe).
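Before modeling, it is worth sanity-checking the column types and the balance of the target classes, since a heavy class imbalance would change how we read accuracy later. A minimal sketch using the train_df already loaded above:

# Check column types and target class balance (sketch, not part of the original notebook)
print(train_df.dtypes)                       # each column should be int64, as described above
print(train_df['sus_label'].value_counts())  # how many benign (0) vs. suspicious (1) events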
# We separate features and labels so the model learns patterns from the features to correctly predict the label: malicious (harmful) or benign (safe).

# The reason to drop the sus_label is to remove the target (output) from the features, so the model only learns from the input data (features) without being influenced by the labels during training.

features = train_df.drop('sus_label', axis=1)  # drop the target from the features
features.head(5)

label = train_df['sus_label']

# Standardize the features: zero mean and unit variance per column
scaler = StandardScaler()
scaled_features = scaler.fit_transform(features)

print(scaled_features)

# Convert the scaled features and labels to PyTorch tensors
torch_features = torch.tensor(scaled_features, dtype=torch.float32)
torch_label = torch.tensor(label.values, dtype=torch.float32)

num_features = torch_features.shape[1]  # number of columns (input features)
print(num_features)
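Note that the scaler is fitted on the training features only. To avoid data leakage, the validation and test sets should be standardized with the same fitted scaler using transform, not fit_transform, so they share the training-set statistics. A sketch, assuming val_df and test_df have the same columns as train_df (the tensor names below are assumptions introduced here for the final evaluation):

# Reuse the training-set scaling statistics for the held-out splits (sketch)
torch_val_features = torch.tensor(scaler.transform(val_df.drop('sus_label', axis=1)), dtype=torch.float32)
torch_val_label = torch.tensor(val_df['sus_label'].values, dtype=torch.float32)
torch_test_features = torch.tensor(scaler.transform(test_df.drop('sus_label', axis=1)), dtype=torch.float32)
torch_test_label = torch.tensor(test_df['sus_label'].values, dtype=torch.float32)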

# The input layer needs one neuron per input feature, so the first Linear layer takes num_features inputs.
model = nn.Sequential(
    nn.Linear(num_features, 10),  # input layer -> hidden layer with 10 neurons
    nn.ReLU(),                    # non-linear activation
    nn.Linear(10, 1),             # hidden layer -> single output neuron
    nn.Sigmoid()                  # map the output to a probability in (0, 1)
)
criterion = nn.BCELoss()  # binary cross-entropy, matching the sigmoid output
optimizer = optim.Adam(model.parameters(), lr=0.001)
epochs = 50

for epoch in range(epochs):

    model.train()
    
    # Forward pass: compute the predicted output
    outputs = model(torch_features)
    
    # Reshape outputs to match the target size
    outputs = outputs.view(-1)
    
    # Calculate loss
    loss = criterion(outputs, torch_label)
    
    # Backward pass: compute gradients
    optimizer.zero_grad()  # Clear previous gradients
    loss.backward()        # Backpropagation to compute gradients
    
    # Update model weights
    optimizer.step()
    
    print(f'Epoch {epoch+1}/{epochs}, Loss: {loss.item()}')
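The loop above is full-batch gradient descent: each epoch passes the entire training set through the model at once. The DataLoader and TensorDataset imported earlier are typically used to train in mini-batches instead, which scales better to large log datasets. A minimal alternative sketch (the batch size of 64 is an arbitrary choice):

# Mini-batch alternative to the full-batch loop above (sketch)
train_dataset = TensorDataset(torch_features, torch_label)
train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True)

for epoch in range(epochs):
    model.train()
    for batch_features, batch_label in train_loader:
        batch_outputs = model(batch_features).view(-1)   # forward pass on one mini-batch
        batch_loss = criterion(batch_outputs, batch_label)
        optimizer.zero_grad()
        batch_loss.backward()
        optimizer.step()
    print(f'Epoch {epoch+1}/{epochs}, Loss: {batch_loss.item()}')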
from torchmetrics import Accuracy

# Initialize the accuracy metric for binary classification
accuracy_metric = Accuracy(task="binary").to(torch.device('cpu'))

epochs = 50

for epoch in range(epochs):
    model.train()
    
    # Forward pass: compute the predicted output
    outputs = model(torch_features)
    
    # Reshape outputs to match the target size
    outputs = outputs.view(-1)
    
    # Calculate loss
    loss = criterion(outputs, torch_label)
    
    # Backward pass: compute gradients
    optimizer.zero_grad()  # Clear previous gradients
    loss.backward()        # Backpropagation to compute gradients
    
    # Update model weights
    optimizer.step()
    
    # Switch to evaluation mode to calculate accuracy
    model.eval()
    with torch.no_grad():
        predictions = (outputs > 0.5).float()  # Convert outputs to binary predictions (0 or 1)
        
        # Calculate accuracy on the training set using torchmetrics
        train_accuracy = accuracy_metric(predictions, torch_label)
        
        # Convert tensor to float using .item()
        train_accuracy = train_accuracy.item()
    
    print(f'Epoch {epoch+1}/{epochs}, Loss: {loss.item()}, Training Accuracy: {train_accuracy * 100:.2f}%')
    
    # Reset the accuracy metric for the next epoch
    accuracy_metric.reset()
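Training accuracy alone can be optimistic, so the model should finally be checked on the held-out validation set. A sketch that reuses the torch_val_features and torch_val_label tensors prepared in the scaling step above (assumed names), with the sklearn accuracy_score fallback shown commented out:

# Evaluate once on the validation split (sketch)
model.eval()
with torch.no_grad():
    val_outputs = model(torch_val_features).view(-1)
    val_predictions = (val_outputs > 0.5).float()

val_accuracy = accuracy_metric(val_predictions, torch_val_label).item()
accuracy_metric.reset()
print(f'Validation Accuracy: {val_accuracy * 100:.2f}%')

# sklearn fallback, if torchmetrics is unavailable:
# from sklearn.metrics import accuracy_score
# val_accuracy = accuracy_score(torch_val_label.numpy(), val_predictions.numpy())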