Phishing emails

Exploratory Data Analysis (EDA) of Phishing Emails

This workbook performs an exploratory data analysis (EDA) of the Phishing_Email.csv file. The goal is to understand the dataset's structure, its content, and any patterns that distinguish the email types. We start by loading and inspecting the data, then check for missing values, and finally train a baseline classifier on the email text.

import pandas as pd

# Load the dataset
file_path = 'Phishing_Email.csv'
phishing_data = pd.read_csv(file_path)

# Display the first few rows of the dataframe
display(phishing_data.head())

# Display the summary of the dataframe including the data types and non-null counts
phishing_data.info()

import matplotlib.pyplot as plt
import seaborn as sns

# Set the aesthetic style of the plots
sns.set_style('whitegrid')

# Generate summary statistics
phishing_data.describe()

# Check for missing values
missing_values = phishing_data.isnull().sum()
missing_values[missing_values > 0]
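
Beyond missing values, two quick checks often help at this stage: how balanced the classes are, and whether email length differs by class. The sketch below runs on a small invented DataFrame rather than the real file. The column names `Email Text` and `Email Type` come from this workbook, but the rows and the label strings (`Phishing Email`, `Safe Email`) are assumptions made only for illustration.

```python
import pandas as pd

# Toy stand-in for Phishing_Email.csv. Column names match this workbook;
# the rows and label strings are invented for illustration.
toy = pd.DataFrame({
    "Email Text": [
        "Verify your account now",
        None,
        "Lunch at noon?",
        "Claim your prize today",
    ],
    "Email Type": ["Phishing Email", "Safe Email", "Safe Email", "Phishing Email"],
})

# Number of emails per class
class_counts = toy["Email Type"].value_counts()

# Character length of each email, treating missing text as empty
text_length = toy["Email Text"].fillna("").str.len()

# Average length per class
mean_length_by_type = text_length.groupby(toy["Email Type"]).mean()

print(class_counts)
print(mean_length_by_type)
```

On the real data, the same three lines could be applied to `phishing_data` directly. A strongly skewed class count would suggest using a stratified split and looking past plain accuracy.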

# Import the libraries needed for modeling
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelEncoder

# Reload the dataset
file_path = 'Phishing_Email.csv'
phishing_data = pd.read_csv(file_path)

# Fill missing values in Email Text with empty string
phishing_data['Email Text'] = phishing_data['Email Text'].fillna('')

# Convert Email Type to numeric
label_encoder = LabelEncoder()
phishing_data['Email Type'] = label_encoder.fit_transform(phishing_data['Email Type'])

# Split the data into features and target variable
X = phishing_data['Email Text']
y = phishing_data['Email Type']

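# Split the raw text into training and testing sets before vectorizing
X_train_text, X_test_text, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit TF-IDF on the training text only, so the test emails do not
# influence the vocabulary or the IDF weights
tfidf_vectorizer = TfidfVectorizer()
X_train = tfidf_vectorizer.fit_transform(X_train_text)
X_test = tfidf_vectorizer.transform(X_test_text)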

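# Initialize and train the Random Forest classifier
# (random_state makes the run reproducible; n_jobs=-1 uses all CPU cores)
rf_classifier = RandomForestClassifier(n_estimators=1000, max_depth=50, random_state=42, n_jobs=-1)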
rf_classifier.fit(X_train, y_train)

# Make predictions on the test data
y_pred = rf_classifier.predict(X_test)

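# Calculate the accuracy and other metrics of the model
# (the result is stored as `report` so the imported classification_report function is not overwritten)
accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred)

print(f"Accuracy: {accuracy:.4f}")
print(report)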
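
Because `LabelEncoder` replaces the email types with integers, the metrics above are reported for classes `0` and `1`. A minimal sketch of how the original names could be restored, and how a confusion matrix could be added, follows. It uses invented labels and predictions, and the label strings are assumptions rather than values confirmed from the file.

```python
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.preprocessing import LabelEncoder

# Invented labels for illustration; the real strings in the CSV may differ.
raw_labels = ["Safe Email", "Phishing Email", "Safe Email", "Phishing Email", "Safe Email"]

# LabelEncoder sorts classes alphabetically, so here
# "Phishing Email" becomes 0 and "Safe Email" becomes 1.
encoder = LabelEncoder()
y_true = encoder.fit_transform(raw_labels)

# Hypothetical model predictions, with one safe email misclassified as phishing
y_hat = [1, 0, 0, 0, 1]

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_hat)

# target_names maps the integer classes back to readable names
named_report = classification_report(y_true, y_hat, target_names=encoder.classes_)

print(cm)
print(named_report)
```

In this workbook, the equivalent call would pass `target_names=label_encoder.classes_` together with `y_test` and `y_pred`.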