Statistical Studies

This is an assignment from the statistics course I have been taking on lumenlearning.com, for the 11th (final) module.

Background: Risk Factors for Low Birth Weight

Low birth weight is an outcome that has been of concern to physicians for years. This is due to the fact that infant mortality rates and birth defect rates are very high for babies with low birth weight. A woman's behavior during pregnancy (including diet, smoking habits, and obtaining prenatal care) can greatly alter her chances of carrying the baby to term and, consequently, of delivering a baby of normal birth weight.

In this exercise, we will use a 1986 study (Hosmer and Lemeshow (2000), Applied Logistic Regression: Second Edition) in which data were collected from 189 women (of whom 59 had low birth weight infants) at the Baystate Medical Center in Springfield, MA (an academic, research, and teaching hospital that serves as the western campus of Tufts University School of Medicine and is the only Level 1 trauma center in western Massachusetts). The goal of the study was to identify risk factors associated with giving birth to a low birth weight baby.

Questions

Q1. Do the data provide evidence that the occurrence of low birth weight is significantly related to whether or not the mother smoked during pregnancy?

Q2. Do the results of the study provide significant evidence that the race of the mother is a factor in the occurrence of low birth weight?

Q3. Are there significant differences in age between mothers who gave birth to low weight babies and those whose baby's weight was normal?

Data Dictionary

Variable	Explanation
LOW	Low birth weight (0=No (birth weight >= 2500 g), 1=Yes (birth weight < 2500 g))
AGE	Age of mother (in years)
LWT	Weight of mother at the last menstrual period (in pounds)
RACE	Race of mother (1=White, 2=Black, 3=Other)
SMOKE	Smoking status during pregnancy (0=No, 1=Yes)
PTL	History of premature labor (0=None, 1=One, etc.)
HT	History of hypertension (0=No, 1=Yes)
UI	Presence of uterine irritability (0=No, 1=Yes)
FTV	Number of physician visits during the first trimester
BWT	The actual birth weight (in grams)

Data Validation (stage 1)

# importing packages
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import t
import pingouin

First, let's take a look at the table.

df = pd.read_csv('low_birth_weight.csv', index_col=None)
df

Format

Second, check the data types of the columns.

df.info()

The LOW, RACE, SMOKE, HT and UI columns are loaded as numbers, although their values are category codes rather than quantities. Let's convert them to strings and check the result.

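# Treat the coded columns as labels rather than numbers
coded_cols = ['LOW', 'RACE', 'SMOKE', 'HT', 'UI']
df[coded_cols] = df[coded_cols].astype(str)

# df.info() prints its report and returns None, so no print() is needed
df.info()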
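
As an aside, pandas also has a dedicated category dtype, which can be an alternative to plain strings for coded variables. The sketch below is not part of the original assignment. It runs on a small made-up frame (`toy_df`, with hypothetical rows) rather than on the study data, and the label mapping `race_labels` is simply copied from the data dictionary.

```python
import pandas as pd

# Toy rows for illustration only (not taken from the study data).
toy_df = pd.DataFrame({
    "LOW": [0, 1, 0, 1],
    "RACE": [1, 2, 3, 1],
    "SMOKE": [0, 1, 1, 0],
    "AGE": [19, 33, 20, 21],
})

coded_cols = ["LOW", "RACE", "SMOKE"]

# Convert the numeric codes to the pandas "category" dtype.
toy_df = toy_df.astype({col: "category" for col in coded_cols})

# Optionally attach readable labels from the data dictionary.
race_labels = {1: "White", 2: "Black", 3: "Other"}
toy_df["RACE"] = toy_df["RACE"].cat.rename_categories(race_labels)

print(toy_df.dtypes)
```

With readable labels attached, frequency tables and plots show "White", "Black" and "Other" instead of bare codes, while AGE stays numeric.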

Missing data

Third, identify missing values.

print("Number of missing values by column:")
print(df.isna().sum(), end="\n\n")
print("Proportion of missing values by column in %:")
print(df.isna().sum() * 100 / len(df))
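
The two printouts can also be combined into one summary table. This is a minimal sketch on a made-up frame (`toy_df`, hypothetical values with one deliberately missing entry), not on the study data. The column names `n_missing` and `pct_missing` are chosen here for illustration.

```python
import numpy as np
import pandas as pd

# Toy rows for illustration only; one AGE value is deliberately missing.
toy_df = pd.DataFrame({
    "AGE": [19, np.nan, 20, 21],
    "LWT": [182, 155, 105, 108],
})

# isna().mean() is the share of missing entries per column.
missing_summary = pd.DataFrame({
    "n_missing": toy_df.isna().sum(),
    "pct_missing": toy_df.isna().mean() * 100,
})

print(missing_summary)
```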

Great, we have no missing values.

Data uniqueness

Let's also check the number of unique values in each column.

print(df.nunique())
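
Unique-value counts can be taken one step further by comparing the observed codes with the data dictionary. The sketch below is an optional extra check, not part of the original assignment. It uses a made-up frame (`toy_df`, hypothetical rows) and assumes the codes are still numeric; if the columns have already been converted to strings, the allowed codes would need to be strings as well. The names `allowed_codes`, `unexpected` and `low_matches_bwt` are illustrative.

```python
import pandas as pd

# Toy rows for illustration only (not taken from the study data).
toy_df = pd.DataFrame({
    "LOW": [0, 1, 0, 1],
    "RACE": [1, 2, 3, 1],
    "SMOKE": [0, 1, 1, 0],
    "BWT": [2523, 2410, 2557, 1928],
})

# Allowed codes, copied from the data dictionary.
allowed_codes = {"LOW": {0, 1}, "RACE": {1, 2, 3}, "SMOKE": {0, 1}}

# Codes present in the data but absent from the dictionary
# (every set is empty when the data agree with the dictionary).
unexpected = {
    col: set(toy_df[col].unique()) - codes
    for col, codes in allowed_codes.items()
}
print(unexpected)

# By definition, LOW should be 1 exactly when BWT is below 2500 g.
low_matches_bwt = ((toy_df["BWT"] < 2500).astype(int) == toy_df["LOW"]).all()
print(low_matches_bwt)
```

The second check relies only on the definition of LOW in the data dictionary, so a mismatch would point to a coding error rather than to anything about the study itself.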
