Peace and Security Pillar: Language of Peace
Analysis and Classification of Hate Speech Tweets
Introduction
This project aims to analyze and classify tweets containing hate speech. The primary objectives are to preprocess the data, extract relevant features, train initial models, perform hyperparameter tuning, and evaluate the final models' performance. The following steps outline the methodology and results obtained.
About the data
The dataset used in this analysis is sourced from Kaggle and contains tweets labeled as hate speech, offensive language, or neither. The dataset can be found here.
Dataset Description
- File Name: labeled_data.csv
- Columns:
  - count: Total number of annotators who labeled the tweet.
  - hate_speech: Number of annotators who labeled the tweet as hate speech.
  - offensive_language: Number of annotators who labeled the tweet as offensive language.
  - neither: Number of annotators who labeled the tweet as neither hate speech nor offensive language.
  - class: Class label (0 - Hate Speech, 1 - Offensive Language, 2 - Neither).
  - tweet: The text content of the tweet.
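The relationship between the annotator-vote columns and the class codes can be illustrated with a small sketch. The rows below are made-up values, not taken from labeled_data.csv, and the idea that class corresponds to the category with the most votes is an assumption rather than something the column descriptions state.

```python
import pandas as pd

# Hypothetical rows for illustration only; not taken from labeled_data.csv.
votes = pd.DataFrame({
    "hate_speech":        [2, 0, 0],
    "offensive_language": [1, 3, 0],
    "neither":            [0, 0, 3],
})

# Map each vote column to the class code listed in the column descriptions.
label_codes = {"hate_speech": 0, "offensive_language": 1, "neither": 2}

# Assumption: the class label is the category with the most annotator votes.
vote_columns = list(label_codes)
votes["majority_class"] = votes[vote_columns].idxmax(axis=1).map(label_codes)
print(votes["majority_class"].tolist())
```

Running the same comparison against the real class column would show whether the majority-vote assumption holds for every row.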
Data Preprocessing
Introduction
In this section, we clean and preprocess the tweets by removing URLs and special characters and converting the text to lowercase. This step puts the data in a suitable format for feature extraction and model training.
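The cleaning steps described here can be sketched with the standard re module. This is a minimal illustration, not the notebook's own code: the function name clean_tweet is hypothetical, and treating every character outside a-z, 0-9, and whitespace as "special" is an assumption.

```python
import re

def clean_tweet(text: str) -> str:
    """Lowercase a tweet and strip URLs and special characters."""
    text = text.lower()
    # Drop URLs (http://, https://, or www. prefixes).
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)
    # Assumption: anything other than a-z, 0-9, or whitespace is "special".
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    # Collapse the runs of whitespace left behind by the removals.
    return re.sub(r"\s+", " ", text).strip()

print(clean_tweet("RT @user: Check THIS out!! https://t.co/abc123 #wow"))
```

Applied to a DataFrame, the function could be mapped over the tweet column, for example with df['tweet'].apply(clean_tweet).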
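import pandas as pd

# Load the labeled tweets
df = pd.read_csv('labeled_data.csv')
# create_unique is a helper that is not defined in this section
create_unique(df)
# Drop the first column
df = df.iloc[:, 1:]
df
# Number of tweets per class
class_distribution = df['class'].value_counts()
class_distribution
# One randomly sampled tweet per class
example_tweets = df.groupby('class').sample(n=1)[['class', 'tweet']].reset_index(drop=True)
example_tweets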
import pandas as pd
from IPython.display import display

# Placeholder frames for demonstration only: the values below are not computed
# from labeled_data.csv, and the last two overwrite the class_distribution and
# example_tweets frames built above
summary = pd.DataFrame({'Statistic': ['Mean', 'Median', 'Mode'], 'Value': [10, 5, 3]})
class_distribution = pd.DataFrame({'Class': ['A', 'B', 'C'], 'Count': [50, 30, 20]})
example_tweets = pd.DataFrame({'Class': ['A', 'B', 'C'], 'Tweet': ['Tweet A', 'Tweet B', 'Tweet C']})

# Display the dataframes
display(summary)
display(class_distribution)
display(example_tweets)
Show common words in hate speech.
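The code for this cell is not shown, so the following is only a rough sketch of one way to count frequent words for a single class. The rows, the stop-word list, and the function name common_words are all hypothetical placeholders.

```python
from collections import Counter

# Hypothetical (class, cleaned tweet) pairs; real rows would come from labeled_data.csv.
rows = [
    (0, "example hateful words here"),
    (0, "more hateful words"),
    (2, "a neutral tweet"),
]

# Small illustrative stop-word list; a fuller list would be used in practice.
stop_words = {"a", "the", "here", "more"}

def common_words(rows, target_class, top_n=3):
    """Return the most frequent non-stop-words among tweets of one class."""
    counts = Counter(
        word
        for label, text in rows
        if label == target_class
        for word in text.split()
        if word not in stop_words
    )
    return counts.most_common(top_n)

print(common_words(rows, target_class=0))
```

Filtering on class 0 restricts the count to tweets labeled as hate speech, following the class codes in the dataset description.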
Distribution Of Hate Speech Counts
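The code for this cell is not shown either. As a hedged sketch, and assuming the title refers to the hate_speech column (the number of annotators who chose hate speech for each tweet), the distribution could be tabulated as follows. The frame df_demo and its values are made up for illustration.

```python
import pandas as pd

# Hypothetical annotator counts for illustration; the real values would come
# from the hate_speech column of labeled_data.csv.
df_demo = pd.DataFrame({"hate_speech": [0, 0, 1, 0, 2, 1, 0, 3]})

# Number of tweets for each hate_speech vote count, ordered by vote count.
distribution = df_demo["hate_speech"].value_counts().sort_index()
print(distribution)
```

With matplotlib installed, distribution.plot(kind="bar") would turn the same table into a bar chart.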