Peace and Security Pillar: Language of Peace

Analysis and Classification of Hate Speech Tweets

Introduction

This project aims to analyze and classify tweets containing hate speech. The primary objectives are to preprocess the data, extract relevant features, train initial models, perform hyperparameter tuning, and evaluate the final models' performance. The following steps outline the methodology and results obtained.

About the data

The dataset used in this analysis is sourced from Kaggle and contains tweets labeled as hate speech, offensive language, or neither. The dataset can be found here.

Dataset Description

  • File Name: labeled_data.csv
  • Columns:
    • count: The total number of annotators who labeled the tweet.
    • hate_speech: Number of annotators who labeled the tweet as hate speech.
    • offensive_language: Number of annotators who labeled the tweet as offensive language.
    • neither: Number of annotators who labeled the tweet as neither hate speech nor offensive language.
    • class: Class label (0 - Hate Speech, 1 - Offensive Language, 2 - Neither).
    • tweet: The text content of the tweet.
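
To make the column semantics concrete, the sketch below builds a tiny frame with the same columns and checks whether class matches the category that received the most annotator votes. That class is the majority vote is an assumption here, and the rows are made up for illustration rather than taken from labeled_data.csv.

```python
import pandas as pd

# Hypothetical rows mimicking the labeled_data.csv schema (not real data).
toy = pd.DataFrame({
    "count": [3, 3, 6],
    "hate_speech": [0, 2, 1],
    "offensive_language": [3, 1, 1],
    "neither": [0, 0, 4],
    "class": [1, 0, 2],
    "tweet": ["example one", "example two", "example three"],
})

vote_columns = ["hate_speech", "offensive_language", "neither"]

# Position of the column with the most votes: 0, 1, or 2,
# which lines up with the class codes listed above.
majority = toy[vote_columns].to_numpy().argmax(axis=1)

# Assumption being checked: class equals the majority-vote category.
class_matches_majority = bool((majority == toy["class"].to_numpy()).all())
print(class_matches_majority)
```

Running the same check on the real file would show how often the assumption holds; ties between categories would need a separate rule.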

Data Preprocessing

Introduction

In this section, we clean and preprocess the tweets by removing URLs and special characters and converting the text to lowercase. This step puts the data in a suitable format for feature extraction and model training.
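
The cleaning steps described above could look roughly like the following. This is a minimal sketch, not the notebook's actual code: the function name clean_tweet and the exact regular expressions are assumptions, and "special characters" is taken to mean anything other than letters, digits, and whitespace.

```python
import re

# Patterns are illustrative assumptions, not the notebook's own.
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")
NON_ALNUM_PATTERN = re.compile(r"[^a-z0-9\s]")
WHITESPACE_PATTERN = re.compile(r"\s+")


def clean_tweet(text: str) -> str:
    """Lowercase a tweet, then strip URLs and special characters."""
    text = text.lower()
    text = URL_PATTERN.sub(" ", text)        # drop links
    text = NON_ALNUM_PATTERN.sub(" ", text)  # drop punctuation, @, #, emoji
    return WHITESPACE_PATTERN.sub(" ", text).strip()


print(clean_tweet("RT @user: Check THIS out!! https://t.co/abc123 #wow"))
# prints: rt user check this out wow
```

On a DataFrame column, the function can be applied row by row, for example with df['tweet'].map(clean_tweet).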

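import pandas as pd

# Load the labeled tweets.
df = pd.read_csv('labeled_data.csv')
# create_unique is a helper whose definition is not shown in this section.
create_unique(df)
# Drop the first column (presumably the row index saved in the CSV).
df = df.iloc[:, 1:]
df
# Number of tweets per class label.
class_distribution = df['class'].value_counts()
class_distribution
# One randomly sampled tweet per class.
example_tweets = df.groupby('class').apply(lambda x: x.sample(1)).reset_index(drop=True)[['class', 'tweet']]
example_tweets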
# Placeholder tables for demonstrating the display calls.
# The demo_ prefix keeps them from overwriting the real results computed above.
import pandas as pd
from IPython.display import display

demo_summary = pd.DataFrame({'Statistic': ['Mean', 'Median', 'Mode'], 'Value': [10, 5, 3]})
demo_class_distribution = pd.DataFrame({'Class': ['A', 'B', 'C'], 'Count': [50, 30, 20]})
demo_example_tweets = pd.DataFrame({'Class': ['A', 'B', 'C'], 'Tweet': ['Tweet A', 'Tweet B', 'Tweet C']})

# Render each table in the notebook output.
display(demo_summary)
display(demo_class_distribution)
display(demo_example_tweets)

Show common words in hate speech.
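
The notebook's own code for this step is not shown, so the following is only a sketch of one way to count frequent words. The function name top_words, the stop-word list, and the sample texts are all hypothetical; in practice the input would be the cleaned tweets with class 0.

```python
import re
from collections import Counter

# Hypothetical cleaned tweets standing in for the rows with class == 0.
hate_tweets = [
    "placeholder word alpha and word beta",
    "another placeholder with word alpha",
    "the word gamma and the word alpha",
]

# Tiny illustrative stop-word list; a real analysis would use a fuller one.
STOP_WORDS = {"and", "the", "with", "another"}


def top_words(texts, n=3):
    """Return the n most frequent non-stop-word tokens across texts."""
    counts = Counter()
    for text in texts:
        tokens = re.findall(r"[a-z0-9']+", text.lower())
        counts.update(t for t in tokens if t not in STOP_WORDS)
    return counts.most_common(n)


print(top_words(hate_tweets))
# prints: [('word', 5), ('alpha', 3), ('placeholder', 2)]
```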


Distribution Of Hate Speech Counts
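
Since the original cell is not shown, here is a hedged sketch of how such a distribution might be tabulated with pandas. It assumes the quantity of interest is the hate_speech column (annotator votes per tweet), and the values are invented for illustration.

```python
import pandas as pd

# Hypothetical annotator vote counts (not real data from labeled_data.csv).
toy = pd.DataFrame({"hate_speech": [0, 0, 1, 0, 2, 1, 0, 3]})

# How many tweets received 0, 1, 2, ... hate-speech votes.
hate_count_distribution = toy["hate_speech"].value_counts().sort_index()
print(hate_count_distribution)
```

With matplotlib installed, hate_count_distribution.plot(kind='bar') would turn the table into a bar chart.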
