Children's Motor Performance
📖 Background
Measuring children's physical abilities helps in understanding growth and development, and helps sports talent scouts identify gifted individuals. A common measure of physical ability is the Motor Performance Index.
An athletics talent scout has hired you to find insights in a dataset to assist their search for the next generation of track and field stars.
💾 The data
The dataset is a slightly cleaned version of a dataset described in the article "Kids motor performances datasets" from the Data in Brief journal.
The dataset consists of a single CSV file, data/motor-performance.csv.
Each row represents a seven-year-old Malaysian child.
Four properties of motor skills were recorded.
- POWER (cm): Distance of a two-footed standing jump.
- SPEED (sec): Time taken to sprint 20m.
- FLEXIBILITY (cm): Distance reached forward in a sitting position.
- COORDINATION (no.): Number of catches of a ball, out of ten.
Full details of these metrics are described in sections 2.2 to 2.5 of the linked article.
Attributes of the children are included.
- STATE: The Malaysian state where the child resides.
- RESIDENTIAL: Whether the child lives in a rural or urban area.
- GENDER: The child's gender, Female or Male.
- AGE: The child's age in years.
- WEIGHT (kg): The child's bodyweight in kg.
- HEIGHT (CM): The child's height in cm.
- BMI (kg/m2): The child's body mass index (weight in kg divided by height in meters squared).
- CLASS (BMI): Categorization of the BMI: "SEVERE THINNESS", "THINNESS", "NORMAL", "OVERWEIGHT", "OBESITY".
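The BMI definition in the list above (weight in kg divided by height in meters squared) can be checked with a short sketch. The measurements below are made up for illustration, and the column names are assumed to match the labels in the list exactly, which may not hold for the actual CSV header.

```python
import pandas as pd

# Made-up measurements for illustration only; not taken from the dataset.
children = pd.DataFrame({
    "WEIGHT (kg)": [20.0, 25.0, 32.0],
    "HEIGHT (CM)": [120.0, 125.0, 128.0],
})

# BMI = weight in kg / (height in m)^2.
# Height is recorded in cm, so convert to meters first.
height_m = children["HEIGHT (CM)"] / 100
children["BMI (kg/m2)"] = (children["WEIGHT (kg)"] / height_m ** 2).round(2)

print(children)
```

Recomputing BMI this way and comparing it with the recorded BMI column is one possible consistency check on the height and weight values before modelling.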
import pandas as pd
motor_performance = pd.read_csv("data/motor-performance.csv")
motor_performance
💪 Challenge
Explore the dataset to understand how the attributes of the children affect the motor skills, and the relationship between the four motor skills. Your published notebook should contain a short report on the motor skills, including summary statistics, visualizations, statistical models, and text describing any insights you found.
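As a minimal sketch of the kind of exploration the challenge asks for, the snippet below computes a per-group summary of motor skills and the pairwise correlation between two of them. The values are invented for illustration, and the column names are assumed to match the labels in the data description.

```python
import pandas as pd

# Toy values invented for illustration; replace with the real dataset.
toy = pd.DataFrame({
    "GENDER": ["F", "M", "F", "M", "F", "M"],
    "POWER (cm)": [95, 110, 100, 120, 90, 115],
    "SPEED (sec)": [5.2, 4.8, 5.0, 4.5, 5.4, 4.7],
})

skills = ["POWER (cm)", "SPEED (sec)"]

# Mean of each motor skill, split by a child attribute.
by_gender = toy.groupby("GENDER")[skills].mean()

# Pairwise Pearson correlation between the motor skills.
skill_corr = toy[skills].corr()

print(by_gender)
print(skill_corr)
```

In this toy data a longer jump goes with a shorter sprint time, so the correlation between POWER and SPEED is negative. Whether the real data shows the same pattern is for the analysis to establish.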
🧑‍⚖️ Judging criteria
The publications will be graded as follows:
- [20%] Technical approach
  - Is the approach technically sound?
  - Is the code high quality?
- [20%] Visualizations
  - Are the visualizations suitable?
  - Can clear insights be gleaned from the visualizations?
- [30%] Storytelling
  - Does the data underpin the narrative?
  - Does the narrative read coherently?
  - Is the narrative detailed but concise?
- [30%] Insights and recommendations
  - How clear are the insights and recommendations?
  - Are the insights relevant to the domain?
  - Are limitations of the analysis recognized?
In the event that multiple submissions have an equally high score, the publication with the most upvotes wins.
📘 Rules
To be eligible to win, you must:
- Submit your response before the deadline. All responses must be submitted in English.
Entrants must be:
- 18+ years old.
- Allowed to take part in a skill-based competition from their country.
Entrants cannot:
- Be in a country currently sanctioned by the U.S. government.
✅ Checklist before publishing
- Rename your workspace to make it descriptive of your work. N.B. you should leave the notebook name as notebook.ipynb.
- Remove redundant cells like the judging criteria, so the workbook is focused on your work.
- Check that all the cells run without error.
⌛️ Time is ticking. Good luck!
Created by Mehmet Alper ŞAHİN
Contents
Libraries
Exploratory Data Analysis
- Drop Duplicates
- Distribution of Categorical Values
- Dummy Variables for Categorical Values
- Outlier Detection & Remove Outliers
- Correlation of Independent Variables
- Conformity to Normal Distribution
- Shapiro-Wilk Test
- Kolmogorov-Smirnov Test
- Normal Transform
Multiple Regression
Clustering
- Normalization
- Elbow Method to determine the number of clusters
- Non-Hierarchical procedures
- K-means
- Hierarchical procedures
- Linkage Methods
- Variance Methods (Ward's method)
Decision Tree
Random Forest
Conclusion and Recommendations
Libraries:
import pandas as pd
import numpy as np
from scipy import stats
import statsmodels.api as sm
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
from sklearn.cluster import KMeans
from yellowbrick.cluster import KElbowVisualizer
from sklearn.metrics import silhouette_score
import scipy as sp
from scipy.cluster.hierarchy import linkage, dendrogram
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, roc_auc_score, precision_score, f1_score, recall_score, confusion_matrix, ConfusionMatrixDisplay
from warnings import filterwarnings
filterwarnings('ignore')
Exploratory Data Analysis:
raw_data = pd.read_csv('data/motor-performance.csv')
raw_data.head(10)
# Record the row count, drop exact duplicate rows, then report how many were removed
shape_first = raw_data.shape[0]
raw_data.drop_duplicates(inplace=True)
print(f"{shape_first - raw_data.shape[0]} duplicate row(s) removed from the raw data.")
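The duplicate removal above can be sketched on a toy frame to show what counts as a duplicate. The values are invented for illustration; `duplicated()` flags every row that is identical to an earlier row across all columns.

```python
import pandas as pd

# Toy frame with one exact duplicate row; values are invented.
toy = pd.DataFrame({
    "GENDER": ["F", "M", "F"],
    "AGE": [7, 7, 7],
    "POWER (cm)": [100, 110, 100],
})

# Rows identical to an earlier row in every column.
n_duplicates = int(toy.duplicated().sum())

# Keep the first occurrence of each row and renumber the index.
deduplicated = toy.drop_duplicates().reset_index(drop=True)

print(f"{n_duplicates} duplicate row(s) removed; {len(deduplicated)} rows remain.")
```

The columns listed in the data description include no child identifier, so two identical rows could in principle be two different children with the same measurements. Dropping them is a judgment call worth noting as a limitation.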