Phishing emails
Exploratory Data Analysis (EDA) of Phishing Emails

    We perform an exploratory data analysis (EDA) of the Phishing_Email.csv file to understand the dataset's structure and content and to surface any patterns in it. We start by loading the data, inspect it for missing values, and then train a baseline classifier on the email text.

    import pandas as pd
    # Load the dataset
    file_path = 'Phishing_Email.csv'
    phishing_data = pd.read_csv(file_path)
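    # Display the first few rows of the dataframe
    phishing_data.head()
    # Display the summary of the dataframe including the data types and non-null counts
    phishing_data.info()
    import matplotlib.pyplot as plt
    import seaborn as sns
    # Set the aesthetic style of the plots
    sns.set_style('whitegrid')  # assumed style; the original call was not preserved
    # Generate summary statistics
    phishing_data.describe()
    # Check for missing values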
    missing_values = phishing_data.isnull().sum()
    missing_values[missing_values > 0]
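
Beyond missing values, two quick checks are usually worthwhile on a text-classification dataset like this one: the class balance and the distribution of message lengths per class. The sketch below assumes the column names `Email Text` and `Email Type` used later in this notebook. The small `sample` frame and its label strings are made-up stand-ins so the snippet runs on its own, and `summarize_emails` is a hypothetical helper, not part of the original analysis.

```python
import pandas as pd

# Made-up stand-in for phishing_data; swap in the real dataframe when running the notebook.
sample = pd.DataFrame({
    "Email Text": [
        "Verify your account now to avoid suspension",
        "Meeting moved to 3pm, see updated agenda attached",
        None,  # a missing body, the kind of row the missing-value check looks for
        "You won a prize, click here to claim",
    ],
    # Label strings are assumptions for illustration.
    "Email Type": ["Phishing Email", "Safe Email", "Safe Email", "Phishing Email"],
})


def summarize_emails(df):
    """Return class proportions and per-class text-length statistics."""
    # Share of each label; a strong imbalance would make plain accuracy misleading.
    class_share = df["Email Type"].value_counts(normalize=True)
    # Character length of each email; missing text counts as length 0.
    lengths = df["Email Text"].fillna("").str.len()
    length_stats = lengths.groupby(df["Email Type"]).describe()
    return class_share, length_stats


class_share, length_stats = summarize_emails(sample)
print(class_share)
print(length_stats[["count", "mean", "max"]])
```

Calling `summarize_emails(phishing_data)` on the real dataframe would give the same two tables for the full dataset. If one class dominates, the per-class precision and recall in the classification report are more informative than accuracy alone.
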
    # Import necessary libraries
    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import accuracy_score, classification_report
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.preprocessing import LabelEncoder
    # Reload the dataset
    file_path = 'Phishing_Email.csv'
    phishing_data = pd.read_csv(file_path)
    # Fill missing values in Email Text with empty string
    phishing_data['Email Text'] = phishing_data['Email Text'].fillna('')
    # Convert Email Type to numeric
    label_encoder = LabelEncoder()
    phishing_data['Email Type'] = label_encoder.fit_transform(phishing_data['Email Type'])
    # Split the data into features and target variable
    X = phishing_data['Email Text']
    y = phishing_data['Email Type']
    # Convert Email Text to numeric using TfidfVectorizer
    tfidf_vectorizer = TfidfVectorizer()
    X_tfidf = tfidf_vectorizer.fit_transform(X)
    # Split the data into training and testing sets
    X_train, X_test, y_train, y_test = train_test_split(X_tfidf, y, test_size=0.2, random_state=42)
    # Initialize and train the Random Forest classifier
    rf_classifier = RandomForestClassifier(n_estimators=1000, max_depth=50)
    rf_classifier.fit(X_train, y_train)
    # Make predictions on the test data
    y_pred = rf_classifier.predict(X_test)
    # Calculate the accuracy and other metrics of the model
    accuracy = accuracy_score(y_test, y_pred)
    # Store the report under a new name so the imported classification_report function is not shadowed
    report = classification_report(y_test, y_pred)
    accuracy, report
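
One caveat about the workflow above: the TF-IDF vocabulary and IDF weights are fitted on the full dataset before the train/test split, so the test emails influence the features the model is trained on. A common alternative is to split the raw text first and wrap the vectorizer and classifier in a scikit-learn `Pipeline`, so that fitting only touches the training portion. The following is a minimal sketch of that idea, not the original notebook's method. The toy corpus, the 0/1 label encoding, the smaller forest (`n_estimators=50`), and the fixed `random_state` are all assumptions chosen so the example runs quickly on its own.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Made-up toy corpus standing in for phishing_data['Email Text'] and ['Email Type'].
texts = [
    "verify your account now", "urgent: confirm your password",
    "claim your prize today", "your account is suspended, click here",
    "lunch at noon tomorrow?", "minutes from the project meeting",
    "see the attached quarterly report", "can you review my pull request",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]  # 1 = phishing, 0 = safe (assumed encoding)

# Split the raw text first so the vectorizer never sees the test emails.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)

# The pipeline fits TF-IDF on the training text only, then trains the forest on it.
model = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("rf", RandomForestClassifier(n_estimators=50, max_depth=50, random_state=42)),
])
model.fit(X_train, y_train)

# predict() reuses the training vocabulary to transform the unseen test emails.
y_pred = model.predict(X_test)
print(accuracy_score(y_test, y_pred))
```

With the real data, `texts` and `labels` would be the `X` and `y` series defined earlier. Note also that `LabelEncoder` assigns integers in sorted order of the class names, so inspecting `label_encoder.classes_` shows which integer in the report corresponds to which email type.
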