Data4Good Case Challenge
Background
AI is transforming education but also introduces risks like misinformation. This competition focuses on detecting factuality in AI-generated educational content.
Executive Summary: Complete Analysis Report
1. Problem Statement
Develop a classification system to detect factuality in AI-generated educational content by categorizing answers into three classes:
- Factual: Correct answers
- Contradiction: Incorrect/contradictory answers
- Irrelevant: Answers unrelated to the question
2. Dataset Overview
- Training Set: 21,021 examples from
data/train.json - Test Set: 2,000 examples from
data/test.json - Features: Question, Context (passage), Answer, Type (label)
- Imbalance Ratio: 9.84:1 (Factual class dominates at 90.4%)
3. Key Findings from Exploratory Data Analysis
Class Distribution:
- Factual: 19,005 examples (90.4%)
- Irrelevant: 1,086 examples (5.2%)
- Contradiction: 930 examples (4.4%)
Missing Data Patterns:
- 10,498 examples (49.9%) have empty context fields
- Empty contexts appear across all three classes
- No missing values in question or answer fields
- Critical Implementation Detail: Our
word_overlap_ratiofunction returns 0.0 for empty contexts, effectively helping flag these rows as "Irrelevant" or "Contradiction" when a factual answer should have had a context to reference
Text Length Analysis:
- Question length: avg 66 characters (12 words)
- Context length: avg 295 characters when present (48 words)
- Answer length: avg 77 characters (15 words)
- Contradiction answers tend to be shorter than factual answers
Word Overlap Analysis (Critical Feature):
- Factual answers: High context overlap (0.619) - answers closely reference context
- Contradiction answers: Moderate overlap (0.352) - some context reference but contradictory
- Irrelevant answers: Low overlap (0.212) - minimal connection to context
Negative Keyword Patterns (Critical for Contradiction Detection):
- Contradiction answers contain significantly more negative keywords (not, never, rather, instead, etc.)
- These linguistic markers are essential signals for distinguishing contradictions from factual statements
- Words like "not", "rather", "instead" flip factual statements into contradictions
Entity Patterns:
- Factual answers contain more numbers (dates, statistics)
- Capitalized words (proper nouns) appear frequently across all classes
- Year mentions are common in factual content
4. Modeling Approach
Feature Engineering: Created 9 engineered features:
- Question length (characters)
- Context length (characters)
- Answer length (characters)
- Question word count
- Context word count
- Answer word count
- Answer-question overlap ratio
- Answer-context overlap ratio (strongest predictor)
- Negative keyword count - critical for contradiction detection
Text Vectorization:
- TF-IDF with 5,000 features
- N-grams: unigrams, bigrams, trigrams (1-3)
- Min document frequency: 2
- Max document frequency: 95%
- Sublinear TF scaling applied
- Custom stop words: Preserves "not" and other negative keywords that are critical for contradiction detection
Model Selection:
- Algorithm: XGBoost Classifier
- Rationale: Handles imbalanced data well, feature importance, fast training
- Hyperparameters:
- n_estimators: 200
- max_depth: 6
- learning_rate: 0.1
- subsample: 0.8
- colsample_bytree: 0.8
Handling Class Imbalance:
- Applied balanced class weights (9.84:1 imbalance)
- Minority classes (contradiction, irrelevant) weighted higher during training
5. Model Interpretability & Explainability
Feature Importance Analysis:
- XGBoost native feature importance ranking
- Top features: answer_context_overlap, negative_keywords, specific n-grams
- Engineered features dominate top importance rankings
Key Interpretability Findings:
- Answer-context overlap is the single strongest predictor
- Negative keywords strongly push predictions toward "Contradiction"
- Low overlap + generic language signals "Irrelevant"
- High overlap + domain terms signals "Factual"
- Model decisions align with human intuition and domain knowledge
6. Results
Training Configuration:
- Total features: 5,009 (5,000 TF-IDF + 9 engineered including negative keywords)
- Training samples: 21,021
- Model trained on full dataset for final submission
Validation Performance (Competition-Aligned Metrics):
- Macro-F1 Score: 0.8719 - Evaluates balanced performance across all 3 classes
- Balanced Accuracy: 0.8831 - Average recall per class (aligns with 33.3% per-class weighting)
- Per-Class F1 Scores:
- Contradiction: 0.7386
- Factual: 0.9685
- Irrelevant: 0.9085
Test Set Predictions:
- Factual: 1,657 predictions (82.8%)
- Irrelevant: 209 predictions (10.4%)
- Contradiction: 134 predictions (6.7%)
- Output saved to
data/test_predictions.json
7. Key Insights
- Context overlap is the strongest signal for factuality detection
- Negative keywords are critical for contradiction detection (not, rather, instead, never)
- Class imbalance required careful handling with sample weights
- Text patterns differ significantly between classes (length, overlap, entities, negations)
- Custom TF-IDF preprocessing preserves contradiction signals that standard stop word removal would eliminate
- Model interpretability confirms features work as intended with no spurious patterns
- Ensemble method (XGBoost) effectively combines multiple weak signals
8. Methodology Strengths
- Comprehensive EDA with 11+ visualizations including negative keyword analysis
- Feature engineering based on domain insights and contradiction linguistics
- Competition-specific improvements:
- Negative keyword feature engineering
- Custom stop word list preserving contradiction signals
- Macro-F1 and Balanced Accuracy metrics
- Feature importance analysis for model transparency
- Proper handling of class imbalance
- Combined text and numerical features
- Full dataset utilization for final model
- Reproducible pipeline with clear documentation
- Model explainability demonstrating trustworthy predictions
The Data
QA dataset where we need to classify answers as:
- Factual: answer is correct
- Contradiction: answer is incorrect
- Irrelevant: answer has nothing to do with the question
Training: 21,021 examples in data/train.json
Test: 2,000 examples in data/test.json
Loading the data
import pandas as pd
import json
data_path = "data/train.json"
with open(data_path, 'r', encoding='utf-8') as f:
data = json.load(f)
train_df = pd.DataFrame(data)
train_df.head(50)EDA
import warnings
warnings.filterwarnings('ignore')
# basic info
print(f"Shape: {train_df.shape}")
print(f"Rows: {train_df.shape[0]:,}")
print(f"\nColumns:")
print(train_df.dtypes)
print("\nFirst few:")
train_df.head(3)import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
sns.set_style("whitegrid")
plt.rcParams['figure.figsize'] = (14, 6)
# class distribution
class_counts = train_df['type'].value_counts()
class_percentages = train_df['type'].value_counts(normalize=True) * 100
print("Class distribution:")
for cls, count in class_counts.items():
print(f"{cls}: {count:,} ({class_percentages[cls]:.2f}%)")
# check imbalance
max_class = class_counts.max()
min_class = class_counts.min()
imbalance_ratio = max_class / min_class
print(f"\nImbalance ratio: {imbalance_ratio:.2f}")
fig, axes = plt.subplots(1, 2, figsize=(16, 5))
# bar plot
axes[0].bar(class_counts.index, class_counts.values, color=['#2E86AB', '#A23B72', '#F18F01'], alpha=0.8)
axes[0].set_xlabel('Type')
axes[0].set_ylabel('Count')
axes[0].set_title('Class Distribution')
for i, (cls, count) in enumerate(class_counts.items()):
axes[0].text(i, count + 200, f'{count:,}\n({class_percentages[cls]:.1f}%)', ha='center')
# pie chart
axes[1].pie(class_counts.values, labels=class_counts.index, autopct='%1.1f%%',
colors=['#2E86AB', '#A23B72', '#F18F01'], startangle=90)
axes[1].set_title('Proportions')
plt.tight_layout()
plt.show()# check for missing data
print("Null values:")
print(train_df.isnull().sum())
print("\nEmpty strings:")
for col in ['question', 'context', 'answer']:
empty_count = (train_df[col] == '').sum()
print(f"{col}: {empty_count:,}")
# empty context by class
print("\nEmpty context by class:")
print(train_df[train_df['context'] == ''].groupby('type').size())
# visualize
fig, axes = plt.subplots(1, 2, figsize=(16, 5))
has_context = train_df.groupby('type')['context'].apply(lambda x: (x != '').sum())
no_context = train_df.groupby('type')['context'].apply(lambda x: (x == '').sum())
x = np.arange(len(has_context))
width = 0.35
axes[0].bar(x - width/2, has_context.values, width, label='Has Context', color='#06A77D', alpha=0.8)
axes[0].bar(x + width/2, no_context.values, width, label='No Context', color='#D62246', alpha=0.8)
axes[0].set_xlabel('Type')
axes[0].set_ylabel('Count')
axes[0].set_title('Context Availability')
axes[0].set_xticks(x)
axes[0].set_xticklabels(has_context.index)
axes[0].legend()
context_pct = (train_df['context'] != '').mean() * 100
no_context_pct = (train_df['context'] == '').mean() * 100
axes[1].pie([context_pct, no_context_pct], labels=['Has Context', 'No Context'],
autopct='%1.1f%%', colors=['#06A77D', '#D62246'], startangle=90)
axes[1].set_title('Overall Context')
plt.tight_layout()
plt.show()# text lengths
train_df['question_length'] = train_df['question'].str.len()
train_df['context_length'] = train_df['context'].str.len()
train_df['answer_length'] = train_df['answer'].str.len()
train_df['question_words'] = train_df['question'].str.split().str.len()
train_df['context_words'] = train_df['context'].str.split().str.len()
train_df['answer_words'] = train_df['answer'].str.split().str.len()
print("Text length stats:")
for col in ['question', 'context', 'answer']:
length_col = f'{col}_length'
print(f"\n{col}:")
print(train_df[length_col].describe())
# by class
print("\nAverage length by class:")
length_by_class = train_df.groupby('type')[['question_length', 'context_length', 'answer_length']].mean()
print(length_by_class.round(1))
# plots
fig, axes = plt.subplots(2, 3, figsize=(18, 10))
columns = ['question_length', 'context_length', 'answer_length']
titles = ['Question Length', 'Context Length', 'Answer Length']
for idx, (col, title) in enumerate(zip(columns, titles)):
# histograms
for answer_type in train_df['type'].unique():
if pd.notna(answer_type) and answer_type != '':
data = train_df[train_df['type'] == answer_type][col]
axes[0, idx].hist(data, alpha=0.5, label=answer_type, bins=50)
axes[0, idx].set_xlabel('Length')
axes[0, idx].set_ylabel('Frequency')
axes[0, idx].set_title(title)
axes[0, idx].legend()
# box plots
data_to_plot = [train_df[train_df['type'] == t][col].values for t in train_df['type'].unique() if t != '']
labels = [t for t in train_df['type'].unique() if t != '']
bp = axes[1, idx].boxplot(data_to_plot, labels=labels, patch_artist=True)
colors = ['#2E86AB', '#A23B72', '#F18F01']
for patch, color in zip(bp['boxes'], colors[:len(labels)]):
patch.set_facecolor(color)
patch.set_alpha(0.7)
axes[1, idx].set_ylabel('Length')
axes[1, idx].set_title(f'{title} by Class')
plt.tight_layout()
plt.show()# look at some examples
for answer_type in ['factual', 'contradiction', 'irrelevant']:
examples = train_df[train_df['type'] == answer_type].head(2)
print(f"\n{answer_type.upper()} examples:")
print("="*60)
for idx, row in examples.iterrows():
print(f"\nQuestion: {row['question']}")
print(f"Context: {row['context'][:200]}..." if len(row['context']) > 200 else f"Context: {row['context']}")
print(f"Answer: {row['answer']}")
print("-"*60)from collections import Counter
import re
def get_top_words(text_series, n=15, min_length=3):
all_text = ' '.join(text_series.fillna('').astype(str).values)
words = re.findall(r'\b[a-zA-Z]+\b', all_text.lower())
# IMPORTANT: Removed 'not' from stop words - it's critical for detecting contradictions
stop_words = {'the', 'is', 'in', 'and', 'to', 'of', 'a', 'for', 'was', 'on', 'that', 'with',
'as', 'by', 'at', 'from', 'are', 'an', 'be', 'or', 'has', 'had', 'have',
'this', 'it', 'its', 'which', 'their', 'were', 'been', 'they'}
words = [w for w in words if len(w) >= min_length and w not in stop_words]
return Counter(words).most_common(n)
print("Top words in answers:")
fig, axes = plt.subplots(1, 3, figsize=(20, 6))
for idx, answer_type in enumerate(['factual', 'contradiction', 'irrelevant']):
answers = train_df[train_df['type'] == answer_type]['answer']
top_words = get_top_words(answers, n=15)
print(f"\n{answer_type}:")
for word, count in top_words[:10]:
print(f" {word}: {count:,}")
if top_words:
words, counts = zip(*top_words)
colors_map = {'factual': '#2E86AB', 'contradiction': '#A23B72', 'irrelevant': '#F18F01'}
axes[idx].barh(range(len(words)), counts, color=colors_map[answer_type], alpha=0.8)
axes[idx].set_yticks(range(len(words)))
axes[idx].set_yticklabels(words)
axes[idx].set_xlabel('Frequency')
axes[idx].set_title(f'Top Words - {answer_type}')
axes[idx].invert_yaxis()
plt.tight_layout()
plt.show()# Analyze negative keywords - critical for contradiction detection
negative_keywords = ['not', 'never', 'no', 'neither', 'nor', 'cannot', 'rather', 'instead',
'however', 'but', 'although', 'despite', 'incorrect', 'false', 'wrong']
def count_negative_keywords(text):
text_lower = str(text).lower()
words = re.findall(r'\b\w+\b', text_lower)
return sum(1 for word in words if word in negative_keywords)
print("Negative keyword analysis by class:")
print("-" * 50)
for answer_type in ['factual', 'contradiction', 'irrelevant']:
subset = train_df[train_df['type'] == answer_type]
neg_counts = subset['answer'].apply(count_negative_keywords)
total_with_neg = (neg_counts > 0).sum()
avg_neg_per_answer = neg_counts.mean()
print(f"\n{answer_type.upper()}:")
print(f" Answers with negative keywords: {total_with_neg:,} ({total_with_neg/len(subset)*100:.1f}%)")
print(f" Average negative keywords per answer: {avg_neg_per_answer:.3f}")
# Visualize
fig, axes = plt.subplots(1, 2, figsize=(16, 6))
# Bar chart - percentage with negative keywords
neg_presence = []
class_labels = ['factual', 'contradiction', 'irrelevant']
for answer_type in class_labels:
subset = train_df[train_df['type'] == answer_type]
if len(subset) > 0:
neg_counts = subset['answer'].apply(count_negative_keywords)
pct_with_neg = (neg_counts > 0).sum() / len(subset) * 100
neg_presence.append(pct_with_neg)
else:
neg_presence.append(0)
colors = ['#2E86AB', '#A23B72', '#F18F01']
axes[0].bar(['Factual', 'Contradiction', 'Irrelevant'], neg_presence, color=colors, alpha=0.8)
axes[0].set_ylabel('Percentage (%)')
axes[0].set_title('Answers with Negative Keywords')
axes[0].set_ylim(0, max(neg_presence) * 1.2 if max(neg_presence) > 0 else 1)
for i, v in enumerate(neg_presence):
axes[0].text(i, v + 1, f'{v:.1f}%', ha='center', fontweight='bold')
# Histogram - distribution of negative keyword counts
for answer_type, color, display_name in zip(class_labels, colors, ['Factual', 'Contradiction', 'Irrelevant']):
subset = train_df[train_df['type'] == answer_type]
if len(subset) > 0:
neg_counts = subset['answer'].apply(count_negative_keywords)
axes[1].hist(neg_counts, alpha=0.6, label=display_name, bins=range(0, 6), color=color)
axes[1].set_xlabel('Number of Negative Keywords')
axes[1].set_ylabel('Frequency')
axes[1].set_title('Distribution of Negative Keywords')
axes[1].legend()
plt.tight_layout()
plt.show()
print("\n" + "="*50)
print("KEY INSIGHT: Contradiction answers contain more negative keywords!")
print("This will be used as a feature in the model.")
print("="*50)# simple entity extraction
def extract_entities(text):
entities = {
'numbers': re.findall(r'\b\d+(?:,\d{3})*(?:\.\d+)?(?:\s?%|percent)?\b', str(text)),
'years': re.findall(r'\b(?:19|20)\d{2}\b', str(text)),
'capitalized': re.findall(r'\b[A-Z][a-z]+(?:\s+[A-Z][a-z]+)*\b', str(text))
}
return entities
print("Entity presence by class:")
entity_stats = []
for answer_type in ['factual', 'contradiction', 'irrelevant']:
subset = train_df[train_df['type'] == answer_type]
num_count = sum(subset['answer'].apply(lambda x: len(extract_entities(x)['numbers']) > 0))
year_count = sum(subset['answer'].apply(lambda x: len(extract_entities(x)['years']) > 0))
cap_count = sum(subset['answer'].apply(lambda x: len(extract_entities(x)['capitalized']) > 0))
entity_stats.append({
'Type': answer_type,
'With Numbers (%)': (num_count / len(subset)) * 100,
'With Years (%)': (year_count / len(subset)) * 100,
'With Capitalized (%)': (cap_count / len(subset)) * 100
})
print(f"\n{answer_type}:")
print(f" Numbers: {num_count:,} ({(num_count/len(subset)*100):.1f}%)")
print(f" Years: {year_count:,} ({(year_count/len(subset)*100):.1f}%)")
print(f" Capitalized: {cap_count:,} ({(cap_count/len(subset)*100):.1f}%)")
entity_df = pd.DataFrame(entity_stats)
fig, ax = plt.subplots(figsize=(12, 6))
x = np.arange(len(entity_df))
width = 0.25
bars1 = ax.bar(x - width, entity_df['With Numbers (%)'], width, label='Numbers', color='#2E86AB', alpha=0.8)
bars2 = ax.bar(x, entity_df['With Years (%)'], width, label='Years', color='#A23B72', alpha=0.8)
bars3 = ax.bar(x + width, entity_df['With Capitalized (%)'], width, label='Capitalized', color='#F18F01', alpha=0.8)
ax.set_xlabel('Type')
ax.set_ylabel('Percentage')
ax.set_title('Entity Presence')
ax.set_xticks(x)
ax.set_xticklabels(entity_df['Type'])
ax.legend()
plt.tight_layout()
plt.show()from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
# word overlap
def word_overlap_ratio(text1, text2):
"""Calculate word overlap ratio between two texts.
Returns 0.0 for empty contexts (49.9% of examples), which helps flag
Irrelevant/Contradiction cases where factual answers need context."""
if not text1 or not text2 or text1 == '' or text2 == '':
return 0.0
words1 = set(str(text1).lower().split())
words2 = set(str(text2).lower().split())
if len(words1) == 0:
return 0.0
overlap = len(words1.intersection(words2))
return overlap / len(words1)
print("Computing overlap...")
train_df['answer_question_overlap'] = train_df.apply(
lambda row: word_overlap_ratio(row['answer'], row['question']), axis=1
)
train_df['answer_context_overlap'] = train_df.apply(
lambda row: word_overlap_ratio(row['answer'], row['context']), axis=1
)
print("\nAverage overlap by class:")
overlap_by_class = train_df.groupby('type')[['answer_question_overlap', 'answer_context_overlap']].mean()
print(overlap_by_class.round(3))
# plots
fig, axes = plt.subplots(2, 2, figsize=(16, 10))
# answer-question overlap
for answer_type in ['Factual', 'Contradiction', 'Irrelevant']:
data = train_df[train_df['type'] == answer_type]['answer_question_overlap']
axes[0, 0].hist(data, alpha=0.6, label=answer_type, bins=30)
axes[0, 0].set_xlabel('Answer-Question Overlap')
axes[0, 0].set_ylabel('Frequency')
axes[0, 0].set_title('Answer-Question Overlap')
axes[0, 0].legend()
# answer-context overlap
for answer_type in ['Factual', 'Contradiction', 'Irrelevant']:
data = train_df[train_df['type'] == answer_type]['answer_context_overlap']
axes[0, 1].hist(data, alpha=0.6, label=answer_type, bins=30)
axes[0, 1].set_xlabel('Answer-Context Overlap')
axes[0, 1].set_ylabel('Frequency')
axes[0, 1].set_title('Answer-Context Overlap')
axes[0, 1].legend()
# box plots
data_aq = [train_df[train_df['type'] == t]['answer_question_overlap'].values for t in ['Factual', 'Contradiction', 'Irrelevant']]
bp1 = axes[1, 0].boxplot(data_aq, labels=['Factual', 'Contradiction', 'Irrelevant'], patch_artist=True)
colors = ['#2E86AB', '#A23B72', '#F18F01']
for patch, color in zip(bp1['boxes'], colors):
patch.set_facecolor(color)
patch.set_alpha(0.7)
axes[1, 0].set_ylabel('Overlap')
axes[1, 0].set_title('Answer-Question by Class')
data_ac = [train_df[train_df['type'] == t]['answer_context_overlap'].values for t in ['Factual', 'Contradiction', 'Irrelevant']]
bp2 = axes[1, 1].boxplot(data_ac, labels=['Factual', 'Contradiction', 'Irrelevant'], patch_artist=True)
for patch, color in zip(bp2['boxes'], colors):
patch.set_facecolor(color)
patch.set_alpha(0.7)
axes[1, 1].set_ylabel('Overlap')
axes[1, 1].set_title('Answer-Context by Class')
plt.show()
plt.tight_layout()