This is an assignment for the 11th (final) module of the statistics course I have been taking on lumenlearning.com.
Background: Risk Factors for Low Birth Weight
Low birth weight is an outcome that has been of concern to physicians for years. This is due to the fact that infant mortality rates and birth defect rates are very high for babies with low birth weight. A woman's behavior during pregnancy (including diet, smoking habits, and obtaining prenatal care) can greatly alter her chances of carrying the baby to term and, consequently, of delivering a baby of normal birth weight.
In this exercise, we will use a 1986 study (Hosmer and Lemeshow (2000), Applied Logistic Regression: Second Edition) in which data were collected from 189 women (of whom 59 had low birth weight infants) at the Baystate Medical Center in Springfield, MA (an academic, research, and teaching hospital that serves as the western campus of Tufts University School of Medicine and is the only Level 1 trauma center in western Massachusetts). The goal of the study was to identify risk factors associated with giving birth to a low birth weight baby.
Questions
Q1. Do the data provide evidence that the occurrence of low birth weight is significantly related to whether or not the mother smoked during pregnancy?
Q2. Do the results of the study provide significant evidence that the race of the mother is a factor in the occurrence of low birth weight?
Q3. Are there significant differences in age between mothers who gave birth to low weight babies and those whose baby's weight was normal?
Data Dictionary
| Variable | Explanation |
|---|---|
| LOW | Low birth weight (0=No (birth weight >= 2500 g), 1=Yes (birth weight < 2500 g)) |
| AGE | Age of mother (in years) |
| LWT | Weight of mother at the last menstrual period (in pounds) |
| RACE | Race of mother (1=White, 2=Black, 3=Other) |
| SMOKE | Smoking status during pregnancy (0=No, 1=Yes) |
| PTL | History of premature labor (0=None, 1=One, etc.) |
| HT | History of hypertension (0=No, 1=Yes) |
| UI | Presence of uterine irritability (0=No, 1=Yes) |
| FTV | Number of physician visits during the first trimester |
| BWT | The actual birth weight (in grams) |
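The coding scheme above can be written down as a plain Python lookup, which is handy for labelling tables and plots. This is only a minimal sketch: it assumes the codes are stored as integers exactly as listed in the table, and both the `CODE_LABELS` dictionary and the three-row `sample` frame are made up for illustration (they are not the study data).

```python
import pandas as pd

# Labels copied from the data dictionary; the name CODE_LABELS is illustrative.
CODE_LABELS = {
    "LOW": {0: "No", 1: "Yes"},
    "RACE": {1: "White", 2: "Black", 3: "Other"},
    "SMOKE": {0: "No", 1: "Yes"},
    "HT": {0: "No", 1: "Yes"},
    "UI": {0: "No", 1: "Yes"},
}

# Made-up rows, only to demonstrate the mapping (not the study data).
sample = pd.DataFrame({"LOW": [0, 1, 0], "RACE": [1, 3, 2], "SMOKE": [1, 0, 0]})

# Replace each integer code with its label, column by column.
labelled = sample.copy()
for column, labels in CODE_LABELS.items():
    if column in labelled.columns:
        labelled[column] = labelled[column].map(labels)

print(labelled)
```

If the codes are stored as strings instead, the dictionary keys would need to be strings too (`"0"`, `"1"`, and so on).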
Data Validation (stage 1)
# importing packages
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import t
import pingouin
First, let's take a look at the table.
df = pd.read_csv('low_birth_weight.csv', index_col=None)
df
Format
Second, check the column formats.
df.info()
The formats of the LOW, RACE, SMOKE, HT and UI columns don't match their values: these columns hold category codes rather than quantities. Let's change them and check the result.
df[['LOW', 'RACE', 'SMOKE', 'HT', 'UI']] = df[['LOW', 'RACE', 'SMOKE', 'HT', 'UI']].astype(str)
df.info()  # info() prints its report itself and returns None, so no print() is needed
Missing data
Third, identify missing values.
print("Number of missing values by columns:")
print(df.isna().sum(), end = "\n\n")
print("Proportion of missing values by columns in %:")
print(df.isna().sum() * 100 / len(df))
Great, we have no missing values.
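For comparison, here is what the same check reports when gaps are present. This is a small sketch on a made-up frame: the `toy` frame and its values are hypothetical and are not taken from the study.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with deliberate gaps (not the study data).
toy = pd.DataFrame({
    "AGE": [19, np.nan, 25, 31],
    "LWT": [182, 155, np.nan, np.nan],
    "SMOKE": [0, 1, 1, 0],
})

# Same two calculations as above: counts, then percentages of missing values.
missing_counts = toy.isna().sum()
missing_percent = missing_counts * 100 / len(toy)

# Put both side by side for easier reading.
summary = pd.DataFrame({"count": missing_counts, "percent": missing_percent})
print(summary)
```

Columns with a non-zero count would need a decision (drop the rows, impute, or leave as is) before any testing.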
Data uniqueness
Let's also check how many unique values each column has.
print(df.nunique())
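Beyond counting, the distinct values themselves can be compared with the codes listed in the data dictionary. Below is a minimal sketch, assuming the categorical columns have already been converted to strings as above. The names `EXPECTED_CODES` and `unexpected_codes`, and the toy frame, are hypothetical and only illustrate the idea.

```python
import pandas as pd

# Allowed codes per column, as strings; taken from the data dictionary.
EXPECTED_CODES = {
    "LOW": {"0", "1"},
    "RACE": {"1", "2", "3"},
    "SMOKE": {"0", "1"},
    "HT": {"0", "1"},
    "UI": {"0", "1"},
}


def unexpected_codes(frame, expected):
    """Return {column: set of values not listed in `expected`}."""
    problems = {}
    for column, allowed in expected.items():
        if column not in frame.columns:
            continue  # skip columns the frame does not have
        extra = set(frame[column].dropna().unique()) - allowed
        if extra:
            problems[column] = extra
    return problems


# Made-up rows with one deliberately invalid RACE code (not the study data).
toy = pd.DataFrame({
    "LOW": ["0", "1", "1"],
    "RACE": ["1", "4", "2"],
    "SMOKE": ["0", "0", "1"],
})

result = unexpected_codes(toy, EXPECTED_CODES)
print(result)
```

On the real data this would be called as `unexpected_codes(df, EXPECTED_CODES)`; an empty dictionary would mean every code is accounted for.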