Skip to content
0

Data4Good Case Challenge

Background

AI is transforming education but also introduces risks like misinformation. This competition focuses on detecting factuality in AI-generated educational content.

Executive Summary: Complete Analysis Report

1. Problem Statement

Develop a classification system to detect factuality in AI-generated educational content by categorizing answers into three classes:

  • Factual: Correct answers
  • Contradiction: Incorrect/contradictory answers
  • Irrelevant: Answers unrelated to the question

2. Dataset Overview

  • Training Set: 21,021 examples from data/train.json
  • Test Set: 2,000 examples from data/test.json
  • Features: Question, Context (passage), Answer, Type (label)
  • Imbalance Ratio: 9.84:1 (Factual class dominates at 90.4%)

3. Key Findings from Exploratory Data Analysis

Class Distribution:

  • Factual: 19,005 examples (90.4%)
  • Irrelevant: 1,086 examples (5.2%)
  • Contradiction: 930 examples (4.4%)

Missing Data Patterns:

  • 10,498 examples (49.9%) have empty context fields
  • Empty contexts appear across all three classes
  • No missing values in question or answer fields
  • Critical Implementation Detail: Our word_overlap_ratio function returns 0.0 for empty contexts, effectively helping flag these rows as "Irrelevant" or "Contradiction" when a factual answer should have had a context to reference

Text Length Analysis:

  • Question length: avg 66 characters (12 words)
  • Context length: avg 295 characters when present (48 words)
  • Answer length: avg 77 characters (15 words)
  • Contradiction answers tend to be shorter than factual answers

Word Overlap Analysis (Critical Feature):

  • Factual answers: High context overlap (0.619) - answers closely reference context
  • Contradiction answers: Moderate overlap (0.352) - some context reference but contradictory
  • Irrelevant answers: Low overlap (0.212) - minimal connection to context

Negative Keyword Patterns (Critical for Contradiction Detection):

  • Contradiction answers contain significantly more negative keywords (not, never, rather, instead, etc.)
  • These linguistic markers are essential signals for distinguishing contradictions from factual statements
  • Words like "not", "rather", "instead" flip factual statements into contradictions

Entity Patterns:

  • Factual answers contain more numbers (dates, statistics)
  • Capitalized words (proper nouns) appear frequently across all classes
  • Year mentions are common in factual content

4. Modeling Approach

Feature Engineering: Created 9 engineered features:

  1. Question length (characters)
  2. Context length (characters)
  3. Answer length (characters)
  4. Question word count
  5. Context word count
  6. Answer word count
  7. Answer-question overlap ratio
  8. Answer-context overlap ratio (strongest predictor)
  9. Negative keyword count - critical for contradiction detection

Text Vectorization:

  • TF-IDF with 5,000 features
  • N-grams: unigrams, bigrams, trigrams (1-3)
  • Min document frequency: 2
  • Max document frequency: 95%
  • Sublinear TF scaling applied
  • Custom stop words: Preserves "not" and other negative keywords that are critical for contradiction detection

Model Selection:

  • Algorithm: XGBoost Classifier
  • Rationale: Handles imbalanced data well, feature importance, fast training
  • Hyperparameters:
    • n_estimators: 200
    • max_depth: 6
    • learning_rate: 0.1
    • subsample: 0.8
    • colsample_bytree: 0.8

Handling Class Imbalance:

  • Applied balanced class weights (9.84:1 imbalance)
  • Minority classes (contradiction, irrelevant) weighted higher during training

5. Model Interpretability & Explainability

Feature Importance Analysis:

  • XGBoost native feature importance ranking
  • Top features: answer_context_overlap, negative_keywords, specific n-grams
  • Engineered features dominate top importance rankings

Key Interpretability Findings:

  1. Answer-context overlap is the single strongest predictor
  2. Negative keywords strongly push predictions toward "Contradiction"
  3. Low overlap + generic language signals "Irrelevant"
  4. High overlap + domain terms signals "Factual"
  5. Model decisions align with human intuition and domain knowledge

6. Results

Training Configuration:

  • Total features: 5,009 (5,000 TF-IDF + 9 engineered including negative keywords)
  • Training samples: 21,021
  • Model trained on full dataset for final submission

Validation Performance (Competition-Aligned Metrics):

  • Macro-F1 Score: 0.8719 - Evaluates balanced performance across all 3 classes
  • Balanced Accuracy: 0.8831 - Average recall per class (aligns with 33.3% per-class weighting)
  • Per-Class F1 Scores:
    • Contradiction: 0.7386
    • Factual: 0.9685
    • Irrelevant: 0.9085

Test Set Predictions:

  • Factual: 1,657 predictions (82.8%)
  • Irrelevant: 209 predictions (10.4%)
  • Contradiction: 134 predictions (6.7%)
  • Output saved to data/test_predictions.json

7. Key Insights

  1. Context overlap is the strongest signal for factuality detection
  2. Negative keywords are critical for contradiction detection (not, rather, instead, never)
  3. Class imbalance required careful handling with sample weights
  4. Text patterns differ significantly between classes (length, overlap, entities, negations)
  5. Custom TF-IDF preprocessing preserves contradiction signals that standard stop word removal would eliminate
  6. Model interpretability confirms features work as intended with no spurious patterns
  7. Ensemble method (XGBoost) effectively combines multiple weak signals

8. Methodology Strengths

  • Comprehensive EDA with 11+ visualizations including negative keyword analysis
  • Feature engineering based on domain insights and contradiction linguistics
  • Competition-specific improvements:
    • Negative keyword feature engineering
    • Custom stop word list preserving contradiction signals
    • Macro-F1 and Balanced Accuracy metrics
    • Feature importance analysis for model transparency
  • Proper handling of class imbalance
  • Combined text and numerical features
  • Full dataset utilization for final model
  • Reproducible pipeline with clear documentation
  • Model explainability demonstrating trustworthy predictions

The Data

QA dataset where we need to classify answers as:

  • Factual: answer is correct
  • Contradiction: answer is incorrect
  • Irrelevant: answer has nothing to do with the question

Training: 21,021 examples in data/train.json Test: 2,000 examples in data/test.json

Loading the data

import pandas as pd
import json

data_path = "data/train.json"
with open(data_path, 'r', encoding='utf-8') as f:
    data = json.load(f)

train_df = pd.DataFrame(data)
train_df.head(50)

EDA

import warnings
warnings.filterwarnings('ignore')

# basic info
print(f"Shape: {train_df.shape}")
print(f"Rows: {train_df.shape[0]:,}")
print(f"\nColumns:")
print(train_df.dtypes)
print("\nFirst few:")
train_df.head(3)
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

sns.set_style("whitegrid")
plt.rcParams['figure.figsize'] = (14, 6)

# class distribution
class_counts = train_df['type'].value_counts()
class_percentages = train_df['type'].value_counts(normalize=True) * 100

print("Class distribution:")
for cls, count in class_counts.items():
    print(f"{cls}: {count:,} ({class_percentages[cls]:.2f}%)")

# check imbalance
max_class = class_counts.max()
min_class = class_counts.min()
imbalance_ratio = max_class / min_class
print(f"\nImbalance ratio: {imbalance_ratio:.2f}")

fig, axes = plt.subplots(1, 2, figsize=(16, 5))

# bar plot
axes[0].bar(class_counts.index, class_counts.values, color=['#2E86AB', '#A23B72', '#F18F01'], alpha=0.8)
axes[0].set_xlabel('Type')
axes[0].set_ylabel('Count')
axes[0].set_title('Class Distribution')
for i, (cls, count) in enumerate(class_counts.items()):
    axes[0].text(i, count + 200, f'{count:,}\n({class_percentages[cls]:.1f}%)', ha='center')

# pie chart
axes[1].pie(class_counts.values, labels=class_counts.index, autopct='%1.1f%%', 
           colors=['#2E86AB', '#A23B72', '#F18F01'], startangle=90)
axes[1].set_title('Proportions')

plt.tight_layout()
plt.show()
# check for missing data
print("Null values:")
print(train_df.isnull().sum())

print("\nEmpty strings:")
for col in ['question', 'context', 'answer']:
    empty_count = (train_df[col] == '').sum()
    print(f"{col}: {empty_count:,}")

# empty context by class
print("\nEmpty context by class:")
print(train_df[train_df['context'] == ''].groupby('type').size())

# visualize
fig, axes = plt.subplots(1, 2, figsize=(16, 5))

has_context = train_df.groupby('type')['context'].apply(lambda x: (x != '').sum())
no_context = train_df.groupby('type')['context'].apply(lambda x: (x == '').sum())

x = np.arange(len(has_context))
width = 0.35

axes[0].bar(x - width/2, has_context.values, width, label='Has Context', color='#06A77D', alpha=0.8)
axes[0].bar(x + width/2, no_context.values, width, label='No Context', color='#D62246', alpha=0.8)
axes[0].set_xlabel('Type')
axes[0].set_ylabel('Count')
axes[0].set_title('Context Availability')
axes[0].set_xticks(x)
axes[0].set_xticklabels(has_context.index)
axes[0].legend()

context_pct = (train_df['context'] != '').mean() * 100
no_context_pct = (train_df['context'] == '').mean() * 100

axes[1].pie([context_pct, no_context_pct], labels=['Has Context', 'No Context'], 
           autopct='%1.1f%%', colors=['#06A77D', '#D62246'], startangle=90)
axes[1].set_title('Overall Context')

plt.tight_layout()
plt.show()
# text lengths
train_df['question_length'] = train_df['question'].str.len()
train_df['context_length'] = train_df['context'].str.len()
train_df['answer_length'] = train_df['answer'].str.len()

train_df['question_words'] = train_df['question'].str.split().str.len()
train_df['context_words'] = train_df['context'].str.split().str.len()
train_df['answer_words'] = train_df['answer'].str.split().str.len()

print("Text length stats:")
for col in ['question', 'context', 'answer']:
    length_col = f'{col}_length'
    print(f"\n{col}:")
    print(train_df[length_col].describe())

# by class
print("\nAverage length by class:")
length_by_class = train_df.groupby('type')[['question_length', 'context_length', 'answer_length']].mean()
print(length_by_class.round(1))

# plots
fig, axes = plt.subplots(2, 3, figsize=(18, 10))

columns = ['question_length', 'context_length', 'answer_length']
titles = ['Question Length', 'Context Length', 'Answer Length']

for idx, (col, title) in enumerate(zip(columns, titles)):
    # histograms
    for answer_type in train_df['type'].unique():
        if pd.notna(answer_type) and answer_type != '':
            data = train_df[train_df['type'] == answer_type][col]
            axes[0, idx].hist(data, alpha=0.5, label=answer_type, bins=50)
    
    axes[0, idx].set_xlabel('Length')
    axes[0, idx].set_ylabel('Frequency')
    axes[0, idx].set_title(title)
    axes[0, idx].legend()
    
    # box plots
    data_to_plot = [train_df[train_df['type'] == t][col].values for t in train_df['type'].unique() if t != '']
    labels = [t for t in train_df['type'].unique() if t != '']
    
    bp = axes[1, idx].boxplot(data_to_plot, labels=labels, patch_artist=True)
    
    colors = ['#2E86AB', '#A23B72', '#F18F01']
    for patch, color in zip(bp['boxes'], colors[:len(labels)]):
        patch.set_facecolor(color)
        patch.set_alpha(0.7)
    
    axes[1, idx].set_ylabel('Length')
    axes[1, idx].set_title(f'{title} by Class')

plt.tight_layout()
plt.show()
# look at some examples
for answer_type in ['factual', 'contradiction', 'irrelevant']:
    examples = train_df[train_df['type'] == answer_type].head(2)
    
    print(f"\n{answer_type.upper()} examples:")
    print("="*60)
    
    for idx, row in examples.iterrows():
        print(f"\nQuestion: {row['question']}")
        print(f"Context: {row['context'][:200]}..." if len(row['context']) > 200 else f"Context: {row['context']}")
        print(f"Answer: {row['answer']}")
        print("-"*60)
from collections import Counter
import re

def get_top_words(text_series, n=15, min_length=3):
    all_text = ' '.join(text_series.fillna('').astype(str).values)
    words = re.findall(r'\b[a-zA-Z]+\b', all_text.lower())
    # IMPORTANT: Removed 'not' from stop words - it's critical for detecting contradictions
    stop_words = {'the', 'is', 'in', 'and', 'to', 'of', 'a', 'for', 'was', 'on', 'that', 'with', 
                  'as', 'by', 'at', 'from', 'are', 'an', 'be', 'or', 'has', 'had', 'have',
                  'this', 'it', 'its', 'which', 'their', 'were', 'been', 'they'}
    words = [w for w in words if len(w) >= min_length and w not in stop_words]
    return Counter(words).most_common(n)

print("Top words in answers:")

fig, axes = plt.subplots(1, 3, figsize=(20, 6))

for idx, answer_type in enumerate(['factual', 'contradiction', 'irrelevant']):
    answers = train_df[train_df['type'] == answer_type]['answer']
    top_words = get_top_words(answers, n=15)
    
    print(f"\n{answer_type}:")
    for word, count in top_words[:10]:
        print(f"  {word}: {count:,}")
    
    if top_words:
        words, counts = zip(*top_words)
        colors_map = {'factual': '#2E86AB', 'contradiction': '#A23B72', 'irrelevant': '#F18F01'}
        axes[idx].barh(range(len(words)), counts, color=colors_map[answer_type], alpha=0.8)
        axes[idx].set_yticks(range(len(words)))
        axes[idx].set_yticklabels(words)
        axes[idx].set_xlabel('Frequency')
        axes[idx].set_title(f'Top Words - {answer_type}')
        axes[idx].invert_yaxis()

plt.tight_layout()
plt.show()
# Analyze negative keywords - critical for contradiction detection
negative_keywords = ['not', 'never', 'no', 'neither', 'nor', 'cannot', 'rather', 'instead', 
                     'however', 'but', 'although', 'despite', 'incorrect', 'false', 'wrong']

def count_negative_keywords(text):
    text_lower = str(text).lower()
    words = re.findall(r'\b\w+\b', text_lower)
    return sum(1 for word in words if word in negative_keywords)

print("Negative keyword analysis by class:")
print("-" * 50)

for answer_type in ['factual', 'contradiction', 'irrelevant']:
    subset = train_df[train_df['type'] == answer_type]
    neg_counts = subset['answer'].apply(count_negative_keywords)
    
    total_with_neg = (neg_counts > 0).sum()
    avg_neg_per_answer = neg_counts.mean()
    
    print(f"\n{answer_type.upper()}:")
    print(f"  Answers with negative keywords: {total_with_neg:,} ({total_with_neg/len(subset)*100:.1f}%)")
    print(f"  Average negative keywords per answer: {avg_neg_per_answer:.3f}")

# Visualize
fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# Bar chart - percentage with negative keywords
neg_presence = []
class_labels = ['factual', 'contradiction', 'irrelevant']
for answer_type in class_labels:
    subset = train_df[train_df['type'] == answer_type]
    if len(subset) > 0:
        neg_counts = subset['answer'].apply(count_negative_keywords)
        pct_with_neg = (neg_counts > 0).sum() / len(subset) * 100
        neg_presence.append(pct_with_neg)
    else:
        neg_presence.append(0)

colors = ['#2E86AB', '#A23B72', '#F18F01']
axes[0].bar(['Factual', 'Contradiction', 'Irrelevant'], neg_presence, color=colors, alpha=0.8)
axes[0].set_ylabel('Percentage (%)')
axes[0].set_title('Answers with Negative Keywords')
axes[0].set_ylim(0, max(neg_presence) * 1.2 if max(neg_presence) > 0 else 1)
for i, v in enumerate(neg_presence):
    axes[0].text(i, v + 1, f'{v:.1f}%', ha='center', fontweight='bold')

# Histogram - distribution of negative keyword counts
for answer_type, color, display_name in zip(class_labels, colors, ['Factual', 'Contradiction', 'Irrelevant']):
    subset = train_df[train_df['type'] == answer_type]
    if len(subset) > 0:
        neg_counts = subset['answer'].apply(count_negative_keywords)
        axes[1].hist(neg_counts, alpha=0.6, label=display_name, bins=range(0, 6), color=color)

axes[1].set_xlabel('Number of Negative Keywords')
axes[1].set_ylabel('Frequency')
axes[1].set_title('Distribution of Negative Keywords')
axes[1].legend()

plt.tight_layout()
plt.show()

print("\n" + "="*50)
print("KEY INSIGHT: Contradiction answers contain more negative keywords!")
print("This will be used as a feature in the model.")
print("="*50)
# simple entity extraction
def extract_entities(text):
    entities = {
        'numbers': re.findall(r'\b\d+(?:,\d{3})*(?:\.\d+)?(?:\s?%|percent)?\b', str(text)),
        'years': re.findall(r'\b(?:19|20)\d{2}\b', str(text)),
        'capitalized': re.findall(r'\b[A-Z][a-z]+(?:\s+[A-Z][a-z]+)*\b', str(text))
    }
    return entities

print("Entity presence by class:")

entity_stats = []

for answer_type in ['factual', 'contradiction', 'irrelevant']:
    subset = train_df[train_df['type'] == answer_type]
    
    num_count = sum(subset['answer'].apply(lambda x: len(extract_entities(x)['numbers']) > 0))
    year_count = sum(subset['answer'].apply(lambda x: len(extract_entities(x)['years']) > 0))
    cap_count = sum(subset['answer'].apply(lambda x: len(extract_entities(x)['capitalized']) > 0))
    
    entity_stats.append({
        'Type': answer_type,
        'With Numbers (%)': (num_count / len(subset)) * 100,
        'With Years (%)': (year_count / len(subset)) * 100,
        'With Capitalized (%)': (cap_count / len(subset)) * 100
    })
    
    print(f"\n{answer_type}:")
    print(f"  Numbers: {num_count:,} ({(num_count/len(subset)*100):.1f}%)")
    print(f"  Years: {year_count:,} ({(year_count/len(subset)*100):.1f}%)")
    print(f"  Capitalized: {cap_count:,} ({(cap_count/len(subset)*100):.1f}%)")

entity_df = pd.DataFrame(entity_stats)

fig, ax = plt.subplots(figsize=(12, 6))
x = np.arange(len(entity_df))
width = 0.25

bars1 = ax.bar(x - width, entity_df['With Numbers (%)'], width, label='Numbers', color='#2E86AB', alpha=0.8)
bars2 = ax.bar(x, entity_df['With Years (%)'], width, label='Years', color='#A23B72', alpha=0.8)
bars3 = ax.bar(x + width, entity_df['With Capitalized (%)'], width, label='Capitalized', color='#F18F01', alpha=0.8)

ax.set_xlabel('Type')
ax.set_ylabel('Percentage')
ax.set_title('Entity Presence')
ax.set_xticks(x)
ax.set_xticklabels(entity_df['Type'])
ax.legend()

plt.tight_layout()
plt.show()
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# word overlap
def word_overlap_ratio(text1, text2):
    """Calculate word overlap ratio between two texts.
    Returns 0.0 for empty contexts (49.9% of examples), which helps flag
    Irrelevant/Contradiction cases where factual answers need context."""
    if not text1 or not text2 or text1 == '' or text2 == '':
        return 0.0
    words1 = set(str(text1).lower().split())
    words2 = set(str(text2).lower().split())
    if len(words1) == 0:
        return 0.0
    overlap = len(words1.intersection(words2))
    return overlap / len(words1)

print("Computing overlap...")

train_df['answer_question_overlap'] = train_df.apply(
    lambda row: word_overlap_ratio(row['answer'], row['question']), axis=1
)

train_df['answer_context_overlap'] = train_df.apply(
    lambda row: word_overlap_ratio(row['answer'], row['context']), axis=1
)

print("\nAverage overlap by class:")
overlap_by_class = train_df.groupby('type')[['answer_question_overlap', 'answer_context_overlap']].mean()
print(overlap_by_class.round(3))

# plots
fig, axes = plt.subplots(2, 2, figsize=(16, 10))

# answer-question overlap
for answer_type in ['Factual', 'Contradiction', 'Irrelevant']:
    data = train_df[train_df['type'] == answer_type]['answer_question_overlap']
    axes[0, 0].hist(data, alpha=0.6, label=answer_type, bins=30)

axes[0, 0].set_xlabel('Answer-Question Overlap')
axes[0, 0].set_ylabel('Frequency')
axes[0, 0].set_title('Answer-Question Overlap')
axes[0, 0].legend()

# answer-context overlap
for answer_type in ['Factual', 'Contradiction', 'Irrelevant']:
    data = train_df[train_df['type'] == answer_type]['answer_context_overlap']
    axes[0, 1].hist(data, alpha=0.6, label=answer_type, bins=30)

axes[0, 1].set_xlabel('Answer-Context Overlap')
axes[0, 1].set_ylabel('Frequency')
axes[0, 1].set_title('Answer-Context Overlap')
axes[0, 1].legend()

# box plots
data_aq = [train_df[train_df['type'] == t]['answer_question_overlap'].values for t in ['Factual', 'Contradiction', 'Irrelevant']]
bp1 = axes[1, 0].boxplot(data_aq, labels=['Factual', 'Contradiction', 'Irrelevant'], patch_artist=True)

colors = ['#2E86AB', '#A23B72', '#F18F01']
for patch, color in zip(bp1['boxes'], colors):
    patch.set_facecolor(color)
    patch.set_alpha(0.7)

axes[1, 0].set_ylabel('Overlap')
axes[1, 0].set_title('Answer-Question by Class')

data_ac = [train_df[train_df['type'] == t]['answer_context_overlap'].values for t in ['Factual', 'Contradiction', 'Irrelevant']]
bp2 = axes[1, 1].boxplot(data_ac, labels=['Factual', 'Contradiction', 'Irrelevant'], patch_artist=True)

for patch, color in zip(bp2['boxes'], colors):
    patch.set_facecolor(color)
    patch.set_alpha(0.7)

axes[1, 1].set_ylabel('Overlap')
axes[1, 1].set_title('Answer-Context by Class')

plt.show()
plt.tight_layout()