Exploring the data
It is important to know your data, and Python is an efficient way to get this step started. I used the pandas package to import a CSV file into a DataFrame and used dtypes to list the data type of every column.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
df = pd.read_csv('diabetic_data.csv')
print(df)
Data Type
df.dtypes
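Once the dtypes are known, it can help to split the columns into numeric and non-numeric groups, since the two are usually explored differently. The sketch below uses a small made-up frame (`toy`) as a stand-in for the real file; its values are invented for illustration only.

```python
import pandas as pd

# Toy stand-in for diabetic_data.csv; the values are invented for illustration.
toy = pd.DataFrame({
    "race": ["Caucasian", "?", "AfricanAmerican"],
    "time_in_hospital": [1, 3, 8],
    "num_medications": [18, 13, 16],
})

# select_dtypes filters columns by dtype family; "number" covers ints and floats.
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
text_cols = toy.select_dtypes(exclude="number").columns.tolist()

print("numeric:", numeric_cols)
print("non-numeric:", text_cols)
```

On the real data, the same two calls on df would give the column lists to feed into histograms (numeric) or value counts (non-numeric).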
Missing Data
This dataset marks missing values with '?', so to check how many values are missing in each column, count the '?' entries in every text column.
for col in df.columns:
    if df[col].dtype == object:
        print(col, df[col][df[col] == '?'].count())
There are missing values in the 'race', 'weight', 'payer_code', 'medical_specialty', 'diag_1', 'diag_2', and 'diag_3' columns. Depending on what we are going to explore, we can drop some of these columns as needed.
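To decide which columns are worth dropping, it helps to see the missing share next to the raw count. The following is a minimal sketch, not the method used above: `summarize_missing` is a hypothetical helper, and the `toy` frame is invented data standing in for the real file. It assumes '?' is the only placeholder for missing values.

```python
import pandas as pd

def summarize_missing(df, placeholder="?"):
    """Per-column missing count and percent, treating `placeholder` as missing."""
    # A cell counts as missing if it is NaN or equals the placeholder string.
    is_missing = df.isna() | df.eq(placeholder)
    counts = is_missing.sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(df) * 100).round(1),
    })
    # Keep only columns that actually have missing values, worst first.
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)

# Toy stand-in for diabetic_data.csv; the values are invented for illustration.
toy = pd.DataFrame({
    "race": ["Caucasian", "?", "AfricanAmerican", "Caucasian"],
    "weight": ["?", "?", "?", "[75-100)"],
    "time_in_hospital": [1, 3, 8, 2],
})

missing_summary = summarize_missing(toy)
print(missing_summary)
```

A column that is mostly '?' is a natural candidate for df.drop(columns=[...]), while a column with only a few missing entries may be better kept and filtered row by row.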
Analyzing the data in different ways
Hospitals don't want patients staying longer than needed: shorter stays open up beds and avoid unnecessary spending on patients who could be at home. Our health care data analyst boss wants to know what the distribution of time spent in the hospital looks like. They're also curious to know whether the majority of patients stay less than 7 days. Once patients stay over 7 days, the hospital wants to be sure these are truly acute cases.
One of the best and easiest ways to show the distribution of a numerical column is a histogram.
plt.hist(df.time_in_hospital, bins = 30, color = 'skyblue', edgecolor = 'black')
plt.show()
As we can see, most patients stay less than 7 days.
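The histogram answers the question visually, but the boss's "majority under 7 days" question can also be checked with a single number. Here is a minimal sketch; `share_under` is a hypothetical helper and `toy_stays` is invented data, not values from the real file.

```python
import pandas as pd

def share_under(stays, threshold_days=7):
    """Fraction of stays strictly shorter than `threshold_days`."""
    # The comparison yields booleans; their mean is the share of True values.
    return (stays < threshold_days).mean()

# Toy lengths of stay in days; invented for illustration.
toy_stays = pd.Series([1, 2, 3, 3, 4, 5, 6, 7, 9, 12])

share = share_under(toy_stays)
print(f"{share:.0%} of toy stays are under 7 days")
```

On the real data, the equivalent call would be share_under(df.time_in_hospital); a result above 0.5 would confirm that the majority of patients stay less than 7 days.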