Dog Matchmaker: Find your perfect pup
A playful, data-driven chatbot prototype that recommends the top 3 real-world dog breeds based on your lifestyle and personality.
This notebook tells a clear story: we explore breed traits, explain a transparent matching method, and provide a friendly interactive demo that produces ranked breed recommendations with images. The code cells are runnable end-to-end and read images from a relative data/images/ folder (place the downloaded Dog-Breeds-Dataset inside a data/images/ directory next to this notebook).
Highlights
- Clean, reproducible matching function with tunable weights.
- Mapping between breed names and image folders.
- Ready visual summary for each recommendation showing why the breed fits.
1 - Problem & approach
People bring different lives to dog ownership: city apartment dwellers, active runners, families with children, allergy-prone households. Our goal is to translate a short, friendly conversation into a ranked set of practical breed recommendations supported by data.
Key design choices:
- Use core, interpretable traits (energy, trainability, shedding, kid-friendliness, apartment adaptability).
- Score breeds with a weighted, normalized system so each preference meaningfully influences the rank.
- Provide short, human-readable explanations for each match so recommendations are actionable.
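The weighted, normalized scoring described in the design choices above can be sketched as follows. This is a minimal illustration under stated assumptions, not the notebook's actual matching function: the name `score_breeds`, the `prefs` and `weights` dictionaries, and the closeness formula (1 minus the absolute gap on the 1-5 scale divided by 4) are all hypothetical choices.

```python
import pandas as pd

def score_breeds(breeds, prefs, weights):
    """Rank breeds by weighted closeness to the preferred trait levels.

    Hypothetical sketch: every trait is assumed to be scored on a 1-5 scale.
    """
    # Normalize weights so the final score stays in [0, 1].
    total_weight = sum(weights[trait] for trait in prefs)
    score = pd.Series(0.0, index=breeds.index)
    for trait, target in prefs.items():
        # Closeness is 1.0 for an exact match and 0.0 for the maximum gap of 4.
        closeness = 1 - (breeds[trait] - target).abs() / 4
        score += weights[trait] / total_weight * closeness
    return breeds.assign(Score=score).sort_values('Score', ascending=False)

# Tiny demo frame reusing three rows of the notebook's fallback data.
demo = pd.DataFrame({
    'Breed': ['Poodle', 'Labrador Retriever', 'French Bulldog'],
    'Energy Level': [3, 5, 2],
    'Shedding Level': [1, 5, 2],
})
prefs = {'Energy Level': 2, 'Shedding Level': 1}      # desired trait levels
weights = {'Energy Level': 1.0, 'Shedding Level': 2.0}  # shedding matters twice as much
ranked = score_breeds(demo, prefs, weights)
print(ranked[['Breed', 'Score']])
```

Dividing each weight by the total keeps scores comparable when a user answers only some questions, which is one way to make each preference influence the rank in proportion to its weight.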
!pip install -r requirements.txt
print("All requirements installed")
# Basic imports and configuration
import pandas as pd
import numpy as np
from pathlib import Path
import json
import re, unicodedata, difflib
from IPython.display import display, Markdown, HTML
import matplotlib.pyplot as plt
# Notebook paths (relative)
DATA_DIR = Path('data')
IMAGES_DIR = DATA_DIR / 'images'  # place Dog-Breeds-Dataset here: data/images/<breed folder>/*.jpg
BREED_CSV = DATA_DIR / 'breed_traits.csv'
TRAIT_CSV = DATA_DIR / 'trait_description.csv'
MAPPING_JSON = DATA_DIR / 'breed_to_folder.json'
# small helper
def md(s): display(Markdown(s))
print('Ready: data paths set to:', BREED_CSV, TRAIT_CSV, 'images folder ->', IMAGES_DIR)
# Load breed_traits and trait_description (fall back to a small demo if files are not present)
def load_datasets(breed_csv=BREED_CSV, trait_csv=TRAIT_CSV):
    if breed_csv.exists():
        breeds = pd.read_csv(breed_csv)
    else:
        # small demo dataset to ensure the notebook runs for presentation
        # (it contains only a subset of the trait columns used in the EDA below)
        breeds = pd.DataFrame([
            {'Breed':'Poodle','Energy Level':3,'Trainability Level':5,'Good With Young Children':4,'Shedding Level':1,'Coat Grooming Frequency':4,'Good For Apartment':4},
            {'Breed':'Labrador Retriever','Energy Level':5,'Trainability Level':5,'Good With Young Children':5,'Shedding Level':5,'Coat Grooming Frequency':2,'Good For Apartment':2},
            {'Breed':'French Bulldog','Energy Level':2,'Trainability Level':3,'Good With Young Children':4,'Shedding Level':2,'Coat Grooming Frequency':2,'Good For Apartment':5},
            {'Breed':'Border Collie','Energy Level':5,'Trainability Level':5,'Good With Young Children':3,'Shedding Level':3,'Coat Grooming Frequency':3,'Good For Apartment':1},
            {'Breed':'Bichon Frise','Energy Level':3,'Trainability Level':4,'Good With Young Children':4,'Shedding Level':1,'Coat Grooming Frequency':4,'Good For Apartment':5},
        ])
    if trait_csv.exists():
        traits = pd.read_csv(trait_csv)
    else:
        traits = pd.DataFrame([{'Trait':'Energy Level','Trait_1':'Low energy', 'Trait_5':'Very high energy', 'Description':'Typical daily energy needs.'}])
    return breeds, traits

breeds_df, traits_df = load_datasets()
display(breeds_df.head(10))
2 - Advanced EDA: Comprehensive Data Exploration
This section examines the distributions of breed traits, their correlations, and key pairwise relationships to inform our matching algorithm.
# Advanced EDA: Comprehensive analysis
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
# Set style
sns.set_style("whitegrid")
plt.rcParams['figure.dpi'] = 100
# Prepare numeric columns
numeric_cols = ['Affectionate With Family', 'Good With Young Children', 'Good With Other Dogs',
                'Shedding Level', 'Coat Grooming Frequency', 'Drooling Level',
                'Openness To Strangers', 'Playfulness Level', 'Watchdog/Protective Nature',
                'Adaptability Level', 'Trainability Level', 'Energy Level', 'Barking Level',
                'Mental Stimulation Needs']
# Convert to numeric
breeds_numeric = breeds_df.copy()
for col in numeric_cols:
    if col in breeds_numeric.columns:
        breeds_numeric[col] = pd.to_numeric(breeds_numeric[col], errors='coerce')
# Keep only the trait columns that exist (the demo fallback lacks some) and drop incomplete rows
present_cols = [col for col in numeric_cols if col in breeds_numeric.columns]
breeds_analysis = breeds_numeric[present_cols].dropna()
print("Dataset Overview:")
print(f" Total breeds: {len(breeds_df)}")
print(f" Complete records: {len(breeds_analysis)}")
print(f" Features analyzed: {len(present_cols)}")
print()
# 1. Statistical Summary
md("### 1. Statistical Summary")
display(breeds_analysis.describe().round(2))
# 2. Correlation Heatmap
md("### 2. Feature Correlation Heatmap")
# Calculate correlation matrix
corr_matrix = breeds_analysis.corr()
# Create heatmap
plt.figure(figsize=(14, 12))
mask = np.triu(np.ones_like(corr_matrix, dtype=bool)) # Mask upper triangle
sns.heatmap(corr_matrix, mask=mask, annot=True, fmt='.2f', cmap='coolwarm',
            center=0, square=True, linewidths=1, cbar_kws={"shrink": 0.8},
            vmin=-1, vmax=1)
plt.title('Correlation Matrix of Dog Breed Traits', fontsize=16, fontweight='bold', pad=20)
plt.tight_layout()
plt.show()
# Highlight strong correlations
md("**Strong Correlations (>0.5 or <-0.5):**")
strong_corr = []
# Walk the upper triangle so each pair is reported once
for i in range(len(corr_matrix.columns)):
    for j in range(i+1, len(corr_matrix.columns)):
        val = corr_matrix.iloc[i, j]
        if abs(val) > 0.5:
            strong_corr.append((corr_matrix.columns[i], corr_matrix.columns[j], val))
strong_corr.sort(key=lambda x: abs(x[2]), reverse=True)
for feat1, feat2, corr_val in strong_corr[:10]:
    print(f" {feat1} <-> {feat2}: {corr_val:.2f}")
# 3. Distribution Plots for All Features
md("### 3. Distribution of All Traits")
n_cols = 3
n_rows = (len(numeric_cols) + n_cols - 1) // n_cols
fig, axes = plt.subplots(n_rows, n_cols, figsize=(15, 4*n_rows))
axes = axes.flatten()
for idx, col in enumerate(numeric_cols):
    if col in breeds_analysis.columns:
        values = breeds_analysis[col].dropna()
        axes[idx].hist(values, bins=5, alpha=0.7, color=sns.color_palette("husl", len(numeric_cols))[idx], edgecolor='black')
        axes[idx].set_title(f'{col}\n(Mean: {values.mean():.2f}, Std: {values.std():.2f})', fontsize=10)
        axes[idx].set_xlabel('Score (1-5)')
        axes[idx].set_ylabel('Count')
        axes[idx].axvline(values.mean(), color='red', linestyle='--', linewidth=2, label=f'Mean: {values.mean():.2f}')
        axes[idx].legend(fontsize=8)
    else:
        axes[idx].axis('off')
# Hide extra subplots
for idx in range(len(numeric_cols), len(axes)):
    axes[idx].axis('off')
plt.suptitle('Distribution of All Breed Traits', fontsize=16, fontweight='bold', y=1.02)
plt.tight_layout()
plt.show()
# 4. Key Relationships: Scatter Plots
md("### 4. Key Feature Relationships")
key_pairs = [
    ('Energy Level', 'Mental Stimulation Needs', 'Energy vs Mental Stimulation'),
    ('Trainability Level', 'Energy Level', 'Trainability vs Energy'),
    ('Shedding Level', 'Coat Grooming Frequency', 'Shedding vs Grooming'),
    ('Good With Young Children', 'Playfulness Level', 'Kid-Friendly vs Playfulness'),
    ('Barking Level', 'Watchdog/Protective Nature', 'Barking vs Protective Nature'),
    ('Adaptability Level', 'Openness To Strangers', 'Adaptability vs Openness')
]
fig, axes = plt.subplots(2, 3, figsize=(18, 12))
axes = axes.flatten()
for idx, (x_col, y_col, title) in enumerate(key_pairs):
    if x_col in breeds_analysis.columns and y_col in breeds_analysis.columns:
        x_vals = breeds_analysis[x_col]
        y_vals = breeds_analysis[y_col]
        # Scatter plot
        axes[idx].scatter(x_vals, y_vals, alpha=0.6, s=50)
        # Add regression line (degree-1 least-squares fit)
        z = np.polyfit(x_vals, y_vals, 1)
        p = np.poly1d(z)
        axes[idx].plot(x_vals, p(x_vals), "r--", alpha=0.8, linewidth=2)
        # Calculate correlation
        corr = x_vals.corr(y_vals)
        axes[idx].set_title(f'{title}\n(r={corr:.2f})', fontsize=11, fontweight='bold')
        axes[idx].set_xlabel(x_col, fontsize=9)
        axes[idx].set_ylabel(y_col, fontsize=9)
        axes[idx].grid(True, alpha=0.3)
plt.suptitle('Key Feature Relationships', fontsize=16, fontweight='bold', y=1.02)
plt.tight_layout()
plt.show()
# 5. Box Plots: Variability Analysis
md("### 5. Variability Analysis (Box Plots)")
# Select key features for box plot
key_features = ['Energy Level', 'Trainability Level', 'Shedding Level',
                'Barking Level', 'Playfulness Level', 'Good With Young Children']
fig, axes = plt.subplots(2, 3, figsize=(18, 10))
axes = axes.flatten()
for idx, col in enumerate(key_features):
    if col in breeds_analysis.columns:
        data = breeds_analysis[col].dropna()
        bp = axes[idx].boxplot([data], patch_artist=True)
        # Label the tick directly; the boxplot `labels` keyword is deprecated in recent Matplotlib
        axes[idx].set_xticklabels([col])
        bp['boxes'][0].set_facecolor(sns.color_palette("Set2")[idx])
        axes[idx].set_title(f'{col}\n(IQR: {data.quantile(0.75) - data.quantile(0.25):.2f})',
                            fontsize=11, fontweight='bold')
        axes[idx].set_ylabel('Score (1-5)')
        axes[idx].grid(True, alpha=0.3, axis='y')
        # Add mean marker
        axes[idx].axhline(data.mean(), color='red', linestyle='--', linewidth=2,
                          label=f'Mean: {data.mean():.2f}')
        axes[idx].legend(fontsize=9)
plt.suptitle('Variability Analysis: Key Traits', fontsize=16, fontweight='bold', y=1.02)
plt.tight_layout()
plt.show()
# 6. Top/Bottom Breeds Analysis
md("### 6. Extreme Breeds Analysis")
# Find breeds with extreme values
extreme_analysis = breeds_df.copy()
# Add composite scores
extreme_analysis['High Energy Score'] = pd.to_numeric(extreme_analysis['Energy Level'], errors='coerce')
extreme_analysis['High Trainability'] = pd.to_numeric(extreme_analysis['Trainability Level'], errors='coerce')
extreme_analysis['Low Shedding'] = 6 - pd.to_numeric(extreme_analysis['Shedding Level'], errors='coerce')
extreme_analysis['Low Barking'] = 6 - pd.to_numeric(extreme_analysis['Barking Level'], errors='coerce')
extreme_analysis['Kid Friendly'] = pd.to_numeric(extreme_analysis['Good With Young Children'], errors='coerce')
# Composite apartment-friendly score: mean of low shedding, low barking, and inverted energy
extreme_analysis['Apartment Score'] = (
    extreme_analysis['Low Shedding'] +
    extreme_analysis['Low Barking'] +
    (6 - extreme_analysis['High Energy Score'])
) / 3
print("Top 5 Breeds by Category:\n")
# Each entry maps a label to (column, largest); every column is oriented so that higher is better
categories = {
    'Highest Energy': ('High Energy Score', True),
    'Most Trainable': ('High Trainability', True),
    'Lowest Shedding': ('Low Shedding', True),
    'Quietest': ('Low Barking', True),
    'Most Kid-Friendly': ('Kid Friendly', True),
    'Best for Apartments': ('Apartment Score', True)
}
for category, (col, largest) in categories.items():
    top = extreme_analysis.nlargest(5, col) if largest else extreme_analysis.nsmallest(5, col)
    print(f" {category}:")
    for _, row in top.iterrows():
        print(f"   - {row['Breed']} ({row[col]:.2f})")
    print()
# 7. Feature Importance for Matching
md("### 7. Feature Variance Analysis")
# Calculate coefficient of variation (CV = std / mean) for each feature
cv_data = []
for col in numeric_cols:
    if col in breeds_analysis.columns:
        values = breeds_analysis[col].dropna()
        cv = values.std() / values.mean() if values.mean() != 0 else 0
        cv_data.append({
            'Feature': col,
            'Mean': values.mean(),
            'Std': values.std(),
            'CV': cv,
            'Range': values.max() - values.min()
        })
cv_df = pd.DataFrame(cv_data).sort_values('CV', ascending=False)
# Visualize
fig, axes = plt.subplots(1, 2, figsize=(16, 6))
# Coefficient of Variation
axes[0].barh(range(len(cv_df)), cv_df['CV'], color=sns.color_palette("muted"))
axes[0].set_yticks(range(len(cv_df)))
axes[0].set_yticklabels(cv_df['Feature'], fontsize=9)
axes[0].set_xlabel('Coefficient of Variation', fontsize=11)
axes[0].set_title('Feature Variability (Higher = More Diverse)', fontsize=12, fontweight='bold')
axes[0].grid(True, alpha=0.3, axis='x')
# Range
axes[1].barh(range(len(cv_df)), cv_df['Range'], color=sns.color_palette("pastel"))
axes[1].set_yticks(range(len(cv_df)))
axes[1].set_yticklabels(cv_df['Feature'], fontsize=9)
axes[1].set_xlabel('Score Range', fontsize=11)
axes[1].set_title('Feature Range (Max - Min)', fontsize=12, fontweight='bold')
axes[1].grid(True, alpha=0.3, axis='x')
plt.tight_layout()
plt.show()
md("**Insights:**")
md("- Features with high CV vary more across breeds, so they are better for distinguishing breeds")
md("- Features with low CV are similar across breeds, so they are less useful for matching")
print("\nTop 5 Most Variable Features (Best for Matching):")
for _, row in cv_df.head(5).iterrows():
    print(f" {row['Feature']}: CV={row['CV']:.3f}, Range={row['Range']:.1f}")
# 8. Summary Statistics Table
md("### 8. Comprehensive Summary Statistics")
summary_stats = breeds_analysis.describe().T
summary_stats['CV'] = summary_stats['std'] / summary_stats['mean']
summary_stats['IQR'] = summary_stats['75%'] - summary_stats['25%']
summary_stats = summary_stats[['mean', 'std', 'min', '25%', '50%', '75%', 'max', 'CV', 'IQR']]
summary_stats.columns = ['Mean', 'Std', 'Min', 'Q1', 'Median', 'Q3', 'Max', 'CV', 'IQR']
display(summary_stats.round(2))
md("**Key Takeaways:**")
md("1. **Energy Level** and **Mental Stimulation** are highly correlated: active breeds need mental exercise")
md("2. **Shedding** and **Grooming** are moderately correlated: low-shedding breeds often need more frequent grooming")
md("3. **Trainability** correlates with **Energy**: high-energy breeds are often easier to train")
md("4. **Kid-friendliness** correlates with **Playfulness**: playful breeds tend to be good with children")
md("5. Most features show good variability (CV > 0.2), making them useful for breed differentiation")
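The takeaways above are written as fixed strings, so they do not update if the underlying data changes. One way to check such claims against the data is sketched below. The helper `pair_correlations` and the toy frame are hypothetical and only illustrate the idea; in this notebook the frame passed in would be `breeds_analysis`.

```python
import pandas as pd

def pair_correlations(df, pairs):
    """Return Pearson r for each (x, y) column pair that exists in df."""
    rows = []
    for x, y in pairs:
        # Skip pairs whose columns are missing (e.g. in a reduced demo dataset).
        if x in df.columns and y in df.columns:
            rows.append({'x': x, 'y': y, 'r': df[x].corr(df[y])})
    return pd.DataFrame(rows, columns=['x', 'y', 'r'])

# Toy data for illustration only; these are not real breed scores.
toy = pd.DataFrame({
    'Energy Level': [1, 2, 3, 4, 5],
    'Mental Stimulation Needs': [1, 2, 3, 4, 5],
    'Shedding Level': [5, 4, 3, 2, 1],
})
claims = [
    ('Energy Level', 'Mental Stimulation Needs'),
    ('Energy Level', 'Shedding Level'),
    ('Shedding Level', 'Coat Grooming Frequency'),  # column absent in toy data, so skipped
]
result = pair_correlations(toy, claims)
print(result)
```

Printing the sign and size of `r` next to each takeaway makes it easy to spot a claim that the current dataset does not support.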
3 - Robust folder name normalization
Image folders in the supplied dataset use lowercase names with a trailing " dog" token (e.g. labrador retriever dog). Below is a normalization helper that converts breed names into the folder naming style and a mapping builder that links CSV breed names to actual folders.
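As a rough illustration of the idea, the sketch below normalizes a breed name to the folder style and matches it against a list of folder names. It is not the notebook's own helper: the names `normalize_breed_name` and `build_mapping`, the fuzzy-match cutoff of 0.8, and the rule that names already ending in "dog" get no extra token are all assumptions made for this example.

```python
import difflib
import re
import unicodedata

def normalize_breed_name(name):
    """Convert a breed name to the assumed folder style: lowercase ASCII plus ' dog'."""
    # Strip accents, e.g. 'é' becomes 'e'.
    ascii_name = unicodedata.normalize('NFKD', name).encode('ascii', 'ignore').decode('ascii')
    # Lowercase and collapse anything that is not a letter or digit into single spaces.
    cleaned = re.sub(r'[^a-z0-9]+', ' ', ascii_name.lower()).strip()
    # Assumption: names that already end in the word 'dog' are not given a second token.
    return cleaned if cleaned.endswith(' dog') else f'{cleaned} dog'

def build_mapping(breed_names, folder_names, cutoff=0.8):
    """Map each breed name to its closest folder name, or None when nothing is close enough."""
    mapping = {}
    for breed in breed_names:
        target = normalize_breed_name(breed)
        matches = difflib.get_close_matches(target, folder_names, n=1, cutoff=cutoff)
        mapping[breed] = matches[0] if matches else None
    return mapping

# Hypothetical folder listing; in practice this would come from the images directory.
folders = ['labrador retriever dog', 'french bulldog dog', 'poodle dog']
mapping = build_mapping(['Labrador Retriever', 'Bichon Frisé'], folders)
print(normalize_breed_name('Labrador Retriever'))
print(mapping)
```

Returning `None` for unmatched breeds, rather than the nearest weak match, keeps a wrong photo from being shown next to a recommendation.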