Estimating the probability of being born a woman: Case Study from Spain.
## Background

In this article we test whether the probability of being born a woman is 50%, using birth data from Spain across its 17 regions from 1975 to 2020. Along the way we will import and clean the data, explore it, and apply statistical inference.

## Introduction

A few days ago, I was reading a really nice book called Understanding Probability: Chance Rules in Everyday Life (Tijms, 2021), where the author introduces basic statistical concepts using approachable explanations and nice examples. In section 5.7.2 of the book, about creating confidence intervals for probabilities, one example he brings up is the common belief that there is a 50 percent chance of being born either a man or a woman. Using a sample of 585,609 births from the Netherlands during the years 1989, 1990 and 1991, he estimates the probability of being born a woman as 48.86%. Moreover, he builds a 95% confidence interval around this estimate, (0.4873, 0.4899), which doesn't contain 0.5, suggesting that the true probability of being a woman at birth is not 50%.

In this article, I want to test the same hypothesis using a dataset from Spain's National Institute of Statistics (Instituto Nacional de Estadística in Spanish), with a much bigger sample size (a total of 41,802,854 births) and across 17 different regions of the country. Let's begin!
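
To make the Netherlands example concrete, here is a minimal sketch that rebuilds the interval from the two figures quoted above. It assumes the usual normal-approximation (Wald) interval for a proportion; the book may construct its interval differently, and the variable names are my own.

```r
# Wald (normal-approximation) confidence interval for a proportion.
# p_hat and n are the figures quoted from Tijms (2021); the choice of
# the Wald formula is an assumption, not necessarily the book's method.
p_hat <- 0.4886    # estimated probability of being born a woman
n     <- 585609    # number of births in the Dutch sample

# Standard error of the sample proportion
se <- sqrt(p_hat * (1 - p_hat) / n)

# Critical value for a two-sided 95% interval (about 1.96)
z <- qnorm(0.975)

ci <- c(lower = p_hat - z * se, upper = p_hat + z * se)
round(ci, 4)
```

Rounded to four decimals this gives (0.4873, 0.4899), the interval quoted from the book. Because the sample is so large, the interval is only about a quarter of a percentage point wide, which is why 0.5 falls well outside it.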

## Exploratory Analysis and Cleaning

Let's explore what we have:

```r
# Installing necessary packages
install.packages("nortest")

# Importing packages
library(tidyverse)
library(broom)
library(nortest)

# Importing data
locale = locale(encoding = "utf8"))
```

The glossary of this data frame (which is in Spanish) is as follows:

| Variable | Translation | Type | Description |
|----------|-------------|------|-------------|
| Sexo | Gender | character | The gender of the newborns |
| Periodo | Period | double | The year when someone was born |
| Total | Total | double | Number of births in the given region, gender and period |

```r
# Brief exploration
summary(nacidos_serie_raw)
glimpse(nacidos_serie_raw)
```

```r
# Distinct categories of the character variables
nacidos_serie_raw %>%
  select_if(is.character) %>%
  map(unique)
```

The numbers next to the region names serve no useful purpose for our study and can be confusing, especially when we sort regions by sample size, as we will in the following sections. Let's remove them:

```r
nacidos_serie <- nacidos_serie_raw %>%
  # Strip away numbers from CCAA names
```
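
As an illustration of this kind of cleaning step, here is a minimal sketch on toy data. The column name `region`, the sample values and the regular expression are assumptions made for the example; the real data frame may use a different column name and number format.

```r
library(dplyr)
library(stringr)

# Toy data: the `region` column and its values are hypothetical stand-ins
# for the numbered region names in the real dataset.
toy <- tibble(region = c("13 Madrid, Comunidad de", "17 Rioja, La", "Extranjero"))

toy_clean <- toy %>%
  # Drop any leading digits and the whitespace that follows them
  mutate(region = str_remove(region, "^\\d+\\s*"))

toy_clean$region
```

Anchoring the pattern with `^` means only a number at the start of the name is removed, so names without a prefix pass through unchanged.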

What are the biggest and the smallest values in the dataset?

```r
# Ten biggest values
nacidos_serie %>%
  arrange(desc(Total)) %>%
  head(10)
```

```r
# Ten smallest values
nacidos_serie %>%
  arrange(Total) %>%
  head(10)
```

Among the smallest values, we can see that in 1995 the region Extranjero (which refers to people born abroad) recorded a total of only 2 births. This is probably a recording issue and can safely be treated as an outlier, so we will remove it from the dataset:

```r
# Let's remove the observation from Extranjero in 1995 just by
# filtering for more than 30 births
nacidos_serie <- nacidos_serie %>%
  filter(Total > 30)

# Check the smallest values again
nacidos_serie %>%
  arrange(Total) %>%
  head(10)
```

```r
# How big is each region in terms of births?