Project: Detecting Cybersecurity Threats using Deep Learning
    Cyber threats are a growing concern for organizations worldwide. These threats take many forms, including malware, phishing, and denial-of-service (DoS) attacks, which compromise sensitive information and disrupt operations. The increasing sophistication and frequency of these attacks make it imperative for organizations to adopt advanced security measures. Traditional threat detection methods often fall short because they cannot adapt to new and evolving threats. This is where deep learning models come into play.

    Deep learning models can analyze vast amounts of data and identify patterns that may not be immediately obvious to human analysts. By leveraging these models, organizations can proactively detect and mitigate cyber threats, safeguarding their sensitive information and ensuring operational continuity.

    As a cybersecurity analyst, you identify and mitigate these threats. In this project, you will design and implement a deep learning model to detect cyber threats. The BETH dataset simulates real-world logs, providing a rich source of information for training and testing your model. The data has already undergone preprocessing, and we have a target label, sus_label, indicating whether an event is malicious (1) or benign (0).

    By successfully developing this model, you will contribute to enhancing cybersecurity measures and protecting organizations from potentially devastating cyber attacks.

    The Data

    Column           Description
    processId        The unique identifier for the process that generated the event - int64
    threadId         ID for the thread spawning the log - int64
    parentProcessId  Label for the process spawning this log - int64
    userId           ID of the user spawning the log
    mountNamespace   Mounting restrictions the process log works within - int64
    argsNum          Number of arguments passed to the event - int64
    returnValue      Value returned from the event log (usually 0) - int64
    sus_label        Binary label marking the event as suspicious (1) or benign (0) - int64

    More information on the dataset: BETH dataset
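
    To confirm that the preprocessed files actually match this schema, a quick pandas check can help. This is a minimal sketch, assuming the labelled_train.csv file loaded later in the project is available:

    import pandas as pd

    # Sanity check: column dtypes and class balance of the training file described above
    check_df = pd.read_csv('labelled_train.csv')
    print(check_df.dtypes)                        # expected: int64 columns as listed in the table
    print(check_df['sus_label'].value_counts())   # counts of benign (0) vs suspicious (1) events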

    # Make sure to run this cell to use torchmetrics. If pip install is not available for torchmetrics, you can use sklearn instead.
    !pip install torchmetrics
    # Import required libraries
    import pandas as pd
    from sklearn.preprocessing import StandardScaler
    import torch
    import torch.nn as nn
    import torch.nn.functional as functional
    from torch.utils.data import DataLoader, TensorDataset
    import torch.optim as optim
    from torchmetrics import Accuracy
    # from sklearn.metrics import accuracy_score  # uncomment to use sklearn
    # Load preprocessed data
    train_df = pd.read_csv('labelled_train.csv')
    test_df = pd.read_csv('labelled_test.csv')
    val_df = pd.read_csv('labelled_validation.csv')
    
    # View the first 5 rows of training set
    train_df.head()
    # Start coding here
    # Use as many cells as you need
    
    # Separate features and labels for training, testing, and validation sets
    
    X_train = train_df.drop('sus_label', axis=1).values
    y_train = train_df['sus_label'].values
    X_test = test_df.drop('sus_label', axis=1).values
    y_test = test_df['sus_label'].values
    X_val = val_df.drop('sus_label', axis=1).values
    y_val = val_df['sus_label'].values
    
    # Initialize the scaler, fit it on the training data, and transform the training data
    
    scaler = StandardScaler()
    X_train = scaler.fit_transform(X_train)
    
    # Transform the test and validation data using the fitted scaler
    X_test = scaler.transform(X_test)
    X_val = scaler.transform(X_val)
    
    # Convert the numpy arrays to PyTorch tensors
    X_train_tensor = torch.tensor(X_train, dtype=torch.float32)
    y_train_tensor = torch.tensor(y_train, dtype=torch.float32).view(-1, 1)
    X_test_tensor = torch.tensor(X_test, dtype=torch.float32)
    y_test_tensor = torch.tensor(y_test, dtype=torch.float32).view(-1, 1)
    X_val_tensor = torch.tensor(X_val, dtype=torch.float32)
    y_val_tensor = torch.tensor(y_val, dtype=torch.float32).view(-1, 1)
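
    The DataLoader and TensorDataset imports above are not used by the full-batch loop below; if mini-batch training is preferred, a small sketch like this could wrap the training tensors (the batch size of 256 is an arbitrary assumption):

    # Optional: mini-batch loading (not used in the full-batch loop below)
    from torch.utils.data import DataLoader, TensorDataset

    train_dataset = TensorDataset(X_train_tensor, y_train_tensor)
    train_loader = DataLoader(train_dataset, batch_size=256, shuffle=True)  # batch size chosen arbitrarily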
    
    # Define the model using nn.Sequential
    model = nn.Sequential(
        nn.Linear(X_train.shape[1], 128),  # First fully connected layer
        nn.ReLU(),  # ReLU activation
        nn.Linear(128, 64),  # Second fully connected layer
        nn.ReLU(),  # ReLU activation
        nn.Linear(64, 1),  # Third fully connected layer
        nn.Sigmoid()  # Sigmoid activation for binary classification
    )
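
    For readability, the same architecture could also be written as an nn.Module subclass; this is only an equivalent sketch (the class name ThreatDetector is illustrative, not part of the project):

    # Equivalent architecture as an nn.Module subclass (illustrative only)
    class ThreatDetector(nn.Module):
        def __init__(self, n_features):
            super().__init__()
            self.fc1 = nn.Linear(n_features, 128)
            self.fc2 = nn.Linear(128, 64)
            self.fc3 = nn.Linear(64, 1)

        def forward(self, x):
            x = functional.relu(self.fc1(x))
            x = functional.relu(self.fc2(x))
            return torch.sigmoid(self.fc3(x))

    # model = ThreatDetector(X_train.shape[1])  # drop-in replacement for the nn.Sequential above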
    
    # Initialize the loss function and optimizer
    criterion = nn.BCELoss()  # binary cross-entropy, matching the single Sigmoid output
    
    optimizer = optim.SGD(model.parameters(), lr=1e-3, weight_decay=1e-4)
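
    A common, numerically more stable alternative is to drop the final Sigmoid from the model and train on raw logits with BCEWithLogitsLoss; a minimal sketch, not required for the project:

    # Alternative setup: raw logits + BCEWithLogitsLoss (applies the sigmoid internally)
    logit_model = nn.Sequential(
        nn.Linear(X_train.shape[1], 128),
        nn.ReLU(),
        nn.Linear(128, 64),
        nn.ReLU(),
        nn.Linear(64, 1)  # no Sigmoid here
    )
    logit_criterion = nn.BCEWithLogitsLoss()
    # Predictions would then be torch.sigmoid(logit_model(x)).round()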
    
    # Training loop
    num_epoch = 30
    for epoch in range(num_epoch):
        model.train()  # Set the model to training mode
        optimizer.zero_grad()  # Clear the gradients
        outputs = model(X_train_tensor)  # Forward pass: compute the model output
        loss = criterion(outputs, y_train_tensor)  # Compute the loss
        loss.backward()  # Backward pass: compute the gradients
        optimizer.step()  # Update the model parameters
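
    To check that the loss is actually decreasing, the loop above can be extended to report training and validation loss each epoch; a sketch under the same full-batch setup:

    # Optional: the same loop with per-epoch loss reporting
    for epoch in range(num_epoch):
        model.train()
        optimizer.zero_grad()
        outputs = model(X_train_tensor)
        loss = criterion(outputs, y_train_tensor)
        loss.backward()
        optimizer.step()

        model.eval()
        with torch.no_grad():
            val_loss = criterion(model(X_val_tensor), y_val_tensor)
        print(f"Epoch {epoch + 1}/{num_epoch} - train loss: {loss.item():.4f} - val loss: {val_loss.item():.4f}")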
    
        
    # Model Evaluation
    model.eval()  # Set the model to evaluation mode
    with torch.no_grad():  # Disable gradient calculation for efficiency
        y_predict_train = model(X_train_tensor).round()  # Predict on training data and round the outputs
        y_predict_test = model(X_test_tensor).round()  # Predict on test data and round the outputs
        y_predict_val = model(X_val_tensor).round()  # Predict on validation data and round the outputs
    
    # Calculate accuracy using torchmetrics
    accuracy = Accuracy(task="binary")
    
    train_accuracy = accuracy(y_predict_train, y_train_tensor.int())
    test_accuracy = accuracy(y_predict_test, y_test_tensor.int())
    val_accuracy = accuracy(y_predict_val, y_val_tensor.int())
    
    # Extract plain Python numbers from the metric tensors
    train_accuracy = train_accuracy.item()
    test_accuracy = test_accuracy.item()
    val_accuracy = val_accuracy.item()
    
    print("Training accuracy: {0}".format(train_accuracy))
    print("Validation accuracy: {0}".format(val_accuracy))
    print("Testing accuracy: {0}".format(test_accuracy))
    
    from sklearn.metrics import accuracy_score
    
    # Alternative: calculate the same accuracies with sklearn (fallback if torchmetrics is unavailable)
    train_accuracy = accuracy_score(y_train_tensor, y_predict_train)
    val_accuracy = accuracy_score(y_val_tensor, y_predict_val)
    test_accuracy = accuracy_score(y_test_tensor, y_predict_test)
    
    print("Training accuracy: {0}".format(train_accuracy))
    print("Validation accuracy: {0}".format(val_accuracy))
    print("Testing accuracy: {0}".format(test_accuracy))