Identifying Traits of Sports Talent in Malaysian Children Through Motor Performance
Background
Evaluating children's physical abilities is crucial for gaining insight into their growth and development, as well as for recognizing potential talent in sports. One common metric for this assessment is the Motor Performance Index (MPI), which measures different aspects of a child's motor skills.
Objectives
The primary objective of this report is to analyze datasets related to children's motor performance using summary statistics, visualizations, statistical models, and narratives. Specifically, it aims to:
- Explore the demographic profile and characteristics of the sample.
- Understand the relationship between the four motor skills.
- Explain how the children's attributes affect their motor skills.
Data Used
The dataset used in the analysis is a slightly cleaned version of the dataset described in the article "Kids motor performances datasets", published in the journal Data in Brief. It consists of a single CSV file in which each row represents a seven-year-old Malaysian child. The following lists describe its variables:
Four properties of motor skills were recorded.
- POWER (cm): Distance of a two-footed standing jump.
- SPEED (sec): Time taken to sprint 20 m.
- FLEXIBILITY (cm): Distance reached forward in a sitting position.
- COORDINATION (no.): Number of catches of a ball, out of ten.
Attributes of the children are included.
- STATE: The Malaysian state where the child resides.
- RESIDENTIAL: Whether the child lives in a rural or urban area.
- GENDER: The child's gender, Female or Male.
- AGE: The child's age in years.
- WEIGHT (kg): The child's bodyweight in kg.
- HEIGHT (cm): The child's height in cm.
- BMI (kg/m2): The child's body mass index (weight in kg divided by height in meters squared).
- CLASS (BMI): Categorization of the BMI: "SEVERE THINNESS", "THINNESS", "NORMAL", "OVERWEIGHT", "OBESITY".
(Full details of these metrics are described in sections 2.2 to 2.5 of the linked article.)
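The BMI definition above can be sketched in a few lines of base R. This is only an illustration, not the dataset authors' code: the helper name `bmi` is hypothetical, and the inputs are the sample means reported later in this report (22.21 kg and 118.26 cm).

```r
# BMI = weight (kg) / height (m)^2. HEIGHT is recorded in cm, so convert first.
# `bmi` is a hypothetical helper name used only for this illustration.
bmi <- function(weight_kg, height_cm) {
  height_m <- height_cm / 100
  weight_kg / height_m^2
}

# Illustrative inputs: the sample mean weight and mean height from this report.
bmi_at_means <- bmi(22.21, 118.26)
round(bmi_at_means, 2)  # about 15.88
```

Because BMI is a nonlinear function of height, the BMI computed at the mean weight and height need not equal the mean of the per-child BMI values.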
## ---------- Pre-installed Packages, Dataset, and Vectors for Variable Names
# Load pre-installed, required packages
suppressPackageStartupMessages(library(tidyverse))
suppressPackageStartupMessages(library(dplyr))
suppressPackageStartupMessages(library(ggplot2))
suppressPackageStartupMessages(library(scales))
suppressPackageStartupMessages(library(readr))
# Read the data set from the CSV file
motor_performance <- read_csv("data/motor-performance.csv", show_col_types = FALSE)
# Make a character vector for all 8 attributes
attributes <- c("STATE",
"RESIDENTIAL",
"GENDER",
"AGE",
"WEIGHT (kg)",
"HEIGHT (cm)",
"BMI (kg/m2)",
"CLASS (BMI)")
# Make a character vector for all 4 motor skills
motor_skills <- c("POWER (cm)",
"SPEED (sec)",
"FLEXIBILITY (cm)",
"COORDINATION (no.)")
# Make a character vector for all numerical variables
num_vars <- c("AGE",
"WEIGHT (kg)",
"HEIGHT (cm)",
"BMI (kg/m2)",
"POWER (cm)",
"SPEED (sec)",
"FLEXIBILITY (cm)",
"COORDINATION (no.)")
Results & Discussion
Descriptive Analysis
The following information describes the demographic profile and characteristics of the sample, which comprises 1,998 seven-year-old children who attend national primary regional schools and participate in Malaysia's physical fitness test (SEGAK).
Numerical Variables
- As expected, the mean age of the children is around 7, with a standard deviation of 0.05.
- The mean weight is 22.21 kg, with a standard deviation of 5.41.
- The mean height is 118.26 cm, with a standard deviation of 5.97.
- The mean body mass index (BMI) is 15.77 (kg/m2), with a standard deviation of 3.06.
- The mean distance of a two-footed standing jump is 96.20 cm, with a standard deviation of 17.59.
- The mean time taken to sprint 20 m is 5.16 sec, with a standard deviation of 0.71.
- The mean distance reached forward in a sitting position is 26.26 cm, with a standard deviation of 4.93.
- Out of ten, the mean number of ball catches is about 4, with a standard deviation of about 3.
- The boxplots below suggest that all numerical variables are roughly symmetric about their medians.
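The summary statistics listed above, and the visual impression of symmetry, can be cross-checked numerically. The following is a minimal base R sketch on hypothetical toy data, not the report's own code. The data frame `toy`, the helpers `skewness` and `summarise_var`, and the use of a moment-based skewness coefficient are all assumptions added for illustration. The report itself relies on boxplots only.

```r
# Hypothetical toy data standing in for two columns of motor_performance.
toy <- data.frame(
  `POWER (cm)`  = c(80, 90, 95, 100, 110),
  `SPEED (sec)` = c(4.5, 5.0, 5.1, 5.3, 7.0),
  check.names = FALSE
)

# Moment-based sample skewness: 0 for perfectly symmetric data,
# positive when the right tail is longer.
skewness <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

# The five statistics used in the report, plus skewness, for one variable.
summarise_var <- function(x) {
  c(Mean = mean(x), `Std. Dev.` = sd(x), `Min.` = min(x),
    Median = median(x), `Max.` = max(x), Skewness = skewness(x))
}

# One row per variable, one column per statistic.
toy_stats <- as.data.frame(t(sapply(toy, summarise_var)))
round(toy_stats, 2)
```

On the real data, the same call would take `motor_performance[num_vars]` in place of `toy`. A mean close to the median and a skewness near zero would support the reading of the boxplots.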
## ---------- Descriptive Analysis
## ----- Numerical Variables
# Subset numerical variable columns
stacked_num_vars <- stack(
motor_performance %>%
dplyr::select(all_of(num_vars))
) %>%
rename(Variable = ind) %>%
mutate(Type = ifelse(Variable %in% c("AGE","WEIGHT (kg)","HEIGHT (cm)","BMI (kg/m2)"),
"Attribute", "Motor skill"))
# Summary statistics for numerical variables
sum_stats <- data.frame(Variable = num_vars) %>%
bind_cols(as.data.frame(t(motor_performance %>%
summarise_at(num_vars, list(mean)) %>%
bind_rows(motor_performance %>%
summarise_at(num_vars, list(sd)), motor_performance %>%
summarise_at(num_vars, list(min)),
motor_performance %>%
summarise_at(num_vars, list(median)),
motor_performance %>%
summarise_at(num_vars, list(max)))
)) %>%
rename(Mean = V1,
`Std. Dev.` = V2,
`Min.` = V3,
`Median` = V4,
`Max.` = V5))
rownames(sum_stats) <- 1: nrow(sum_stats)
# Boxplots for numerical variables
boxplots <- ggplot(stacked_num_vars, aes(x = Variable, y = values, fill=Type)) +
geom_boxplot(width = 0.75) +
theme(legend.position = "top",
legend.justification=0.48,
legend.key.size = unit(7, 'mm'),
legend.text = element_text(margin = margin(r = 10, unit = "pt"),
size = 8.5,
color = "#65707C",
family="sans serif"),
legend.title = element_text(color = "#65707C",
face = "bold",
size = 9,
family="sans serif"),
legend.key = element_rect(fill = NA),
axis.title = element_text(color = "#65707C",
face = "bold",
size = 8.5,
family="sans serif"),
axis.text = element_text(color = "#65707C",
size = 8,
family="sans serif"),
axis.line = element_line(colour = "grey",
linewidth = 0.5),
panel.grid.major = element_line(color = "grey",
linetype="dashed",
linewidth=0.25),
panel.background = element_blank(),
panel.border = element_rect(color="grey40",
fill=NA),
panel.spacing = unit(2, "lines"),
plot.title = element_text(color = "#65707C",
hjust = 0.5,
face = "bold",
size= 11,
family = "sans serif")) +
labs(x = "\nVariable \n(unit)\n", y = "", fill = "Type: ") +
ggtitle("\n Fig. 1: Box Plots of the Numerical Attributes and Motor Skills ") +
scale_x_discrete(labels=c("AGE",
"WEIGHT \n(kg)",
"HEIGHT \n(cm)",
"BMI \n(kg/m2)",
"POWER \n(cm)",
"SPEED \n(sec)",
"FLEXIBILITY \n(cm)",
"COORDINATION \n(no.)")) +
scale_y_continuous(expand = c(0.01, 0),
limits = c(0, 175),
breaks = seq(0, 175, by = 25)) +
scale_fill_manual(values = c('#025C70',
'#007E6C'))
# Save ggplot data
dat <- ggplot_build(boxplots)$data[[1]]
# Reformat boxplots' median line
final_boxplots <- boxplots + geom_segment(data=dat, aes(x=xmin,
xend=xmax,
y=middle-.15,
yend=middle-.15),
color="grey75",
linewidth=0.5,
inherit.aes = FALSE)
Categorical Variables
- The five Malaysian states with the most children in the sample are:
  1. Selangor - 349 (17.5%)
  2. Johor - 241 (12.1%)
  3. Sabah - 202 (10.1%)
  4. Sarawak - 199 (10.0%)
  5. Perak - 166 (8.3%)
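A ranking like the one above comes from counting rows per category and dividing by the sample size. The base R sketch below shows the idea on hypothetical toy data. The names `toy` and `state_summary` are assumptions for illustration, and the denominator is taken from `nrow()` rather than a hard-coded total, so the percentages stay correct if rows are added or filtered.

```r
# Hypothetical toy data standing in for the STATE column of motor_performance.
toy <- data.frame(STATE = c("Selangor", "Johor", "Selangor", "Sabah",
                            "Selangor", "Johor"))

# Count rows per state, largest first.
counts <- sort(table(toy$STATE), decreasing = TRUE)

# Divide by the actual number of rows rather than a hard-coded total.
state_summary <- data.frame(
  STATE = names(counts),
  n = as.integer(counts),
  Percentage = sprintf("%.1f%%", 100 * as.integer(counts) / nrow(toy))
)

# Keep only the most represented states (top 2 here; the report lists 5).
head(state_summary, 2)
```

The same pattern applies to RESIDENTIAL, GENDER, and CLASS (BMI) by swapping the column passed to `table()`.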
# Count children per state
state_counts <- motor_performance %>%
count(STATE, sort = TRUE) %>%
mutate(Percentage = label_percent(accuracy=0.01)(n/1998)) %>%
rename(`Number of Children` = n)
# Count children per residential
residential_counts <- motor_performance %>%
count(RESIDENTIAL, sort = TRUE) %>%
mutate(Percentage = label_percent(accuracy=0.01)(n/1998)) %>%
rename(`Number of Children` = n)
# Count children per gender
gender_counts <- motor_performance %>%
count(GENDER, sort = TRUE) %>%
mutate(Percentage = label_percent(accuracy=0.01)(n/1998)) %>%
rename(`Number of Children` = n)
# Count children per BMI class
bmi_class_counts <- motor_performance %>%
count(`CLASS (BMI)`, sort = TRUE) %>%
mutate(Percentage = label_percent(accuracy=0.01)(n/1998)) %>%
rename(`Number of Children` = n)
state_counts
residential_counts
gender_counts
bmi_class_counts
## ----- Categorical Variables
# Count children per state
state_counts <- motor_performance %>%
count(STATE, sort = TRUE) %>%
mutate(Percentage = label_percent(accuracy=0.01)(n/1998)) %>%
rename(`Number of Children` = n)
state_counts
- The majority of the sampled children, 52.7% (1,052), are urban residents.
# Count children per residential
residential_counts <- motor_performance %>%
count(RESIDENTIAL, sort = TRUE) %>%
mutate(proportion = n/1998, Attribute = "RESIDENTIAL") %>%
rename(`Number of children` = n)
residential_counts
# Count children per residential
residential_counts <- motor_performance %>%
count(RESIDENTIAL, sort = TRUE) %>%
mutate(proportion = n/1998, Attribute = "RESIDENTIAL",
Percentage = label_percent(accuracy=0.01)(proportion),
lab.ypos = cumsum(proportion) - 0.6*proportion) %>%
rename(`Number of children` = n)
# Create a pie chart for the RESIDENTIAL variable
pie_chart_for_residential <- ggplot(residential_counts, aes(x = "", y = proportion, fill = RESIDENTIAL)) +
geom_bar(width = 1, stat = "identity", color = "grey", linewidth=0.75) +
coord_polar("y", start = 0)+
geom_text(aes(y = lab.ypos,
label = paste(label_percent(accuracy=0.01)(proportion),
"\n (", prettyNum(`Number of children`,
big.mark=","),")",
sep="")), color = "white", size = 6)+
scale_fill_manual(values = c("#31688E", "#65C899")) +
ggtitle("\n Fig. 3: Pie Graph of the Distribution of Children per Residential Area \n") +
theme(legend.position = "top",
legend.justification=0.48,
legend.direction="horizontal",
legend.key.size = unit(0, 'pt'),
legend.key = element_rect(fill = NA),
legend.text = element_text(margin = margin(r = 5, unit = "pt"),
color = "#65707C",
family="sans serif"),
legend.title = element_text(color = "#65707C",
size = 9,
face = "bold",
family="sans serif"),
axis.title = element_blank(),
axis.text = element_blank(),
axis.line = element_blank(),
axis.ticks = element_blank(),
panel.grid.major = element_line(color = "grey",
linetype="dashed",
linewidth=0.25),
panel.background = element_blank(),
plot.title = element_text(color = "#65707C",
hjust = 0.5,
face = "bold",
size= 11,
family = "sans serif")) +
labs(fill="RESIDENTIAL: ")+
guides(fill = guide_legend(override.aes = list(
shape = 15,
size = 6)))
- The sample is equally distributed between the male and female gender groups.