Health care project

Exploring the data

It is important to know your data, and Python is an efficient way to get this step started. I used the pandas package to import a CSV file into a DataFrame and used dtypes to find out the data type of each column.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv('diabetic_data.csv')
print(df)


Data Type

df.dtypes
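Once the data types are known, it can help to split the columns into numeric and non-numeric groups, since the two are usually explored differently. The following is a minimal sketch on a small made-up frame (the rows are invented for illustration, and only the column names come from this dataset); `select_dtypes` is a standard pandas method.

```python
import pandas as pd

# Toy stand-in for the real file; these rows are invented for illustration.
toy = pd.DataFrame({
    "race": ["Caucasian", "?", "AfricanAmerican"],
    "weight": ["?", "?", "[75-100)"],
    "time_in_hospital": [1, 3, 2],
})

# Numeric columns can be summarised or plotted directly.
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
# Everything else is treated here as text/categorical.
text_cols = toy.select_dtypes(exclude="number").columns.tolist()

print("numeric:", numeric_cols)
print("text:", text_cols)
```

With the real data, replacing toy with df should give the same kind of split.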

The same file can also be queried from a SQL cell, with the result stored as the DataFrame variable df4:

SELECT * FROM 'diabetic_data.csv'

Missing Data

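To check how many missing values there are in each column, count the '?' entries, which this dataset uses as its missing-value marker.

# Count the '?' placeholders in every text (object) column
for col in df.columns:
    if df[col].dtype == object:
        print(col, (df[col] == '?').sum())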


There are missing values in the 'race', 'weight', 'payer_code', 'medical_specialty', 'diag_1', 'diag_2', and 'diag_3' columns. Depending on what we are going to explore, we can drop some of these columns as needed.
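One possible way to decide which columns to drop is to look at the fraction of '?' entries per column and remove the ones above some cutoff. The sketch below is only an illustration: the helper names, the 40% threshold, and the toy rows are all assumptions, not part of the original analysis.

```python
import pandas as pd

def missing_fraction(frame, marker="?"):
    """Fraction of entries equal to the marker in each column, highest first."""
    return (frame == marker).mean().sort_values(ascending=False)

def drop_sparse_columns(frame, marker="?", threshold=0.4):
    """Drop columns whose marker fraction exceeds the (assumed) threshold.

    Remaining marker entries are turned into NaN so pandas treats them as missing.
    """
    fractions = missing_fraction(frame, marker)
    to_drop = fractions[fractions > threshold].index.tolist()
    cleaned = frame.mask(frame == marker).drop(columns=to_drop)
    return cleaned, to_drop

# Toy data, invented for illustration only.
toy = pd.DataFrame({
    "race": ["Caucasian", "?", "AfricanAmerican", "Caucasian"],
    "weight": ["?", "?", "?", "[75-100)"],
    "time_in_hospital": [1, 3, 2, 8],
})

cleaned, dropped = drop_sparse_columns(toy)
print("dropped:", dropped)
print(cleaned)
```

The right threshold depends on the question being asked, so it is worth checking the output of missing_fraction(df) before dropping anything from the real data.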

Analyzing the data in different ways

Hospitals don't want patients staying longer than needed: shorter stays free up beds and avoid spending money on patients who could be at home. Our health care data analyst boss wants to know what the distribution of time spent in the hospital looks like, and whether the majority of patients stay less than 7 days. For patients who stay longer than 7 days, the hospital wants to be sure they are genuinely acute cases.

One of the best and easiest ways to show the distribution of a numerical column is a histogram.

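# Histogram of length of stay, in days
plt.hist(df['time_in_hospital'], bins=30, color='skyblue', edgecolor='black')
plt.xlabel('Time in hospital (days)')
plt.ylabel('Number of records')
plt.show()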

As we can see, most people stay less than 7 days.
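The histogram answers the question visually; the same question can also be answered with a number. A minimal sketch, assuming a hypothetical helper share_below and a handful of made-up stay lengths in place of the real column:

```python
import pandas as pd

def share_below(days, cutoff=7):
    """Fraction of stays strictly shorter than the cutoff (in days)."""
    days = pd.Series(days)
    return float((days < cutoff).mean())

# Made-up stay lengths, for illustration only.
toy_days = pd.Series([1, 2, 3, 3, 4, 5, 6, 7, 9, 14])

share = share_below(toy_days)
print(f"{share:.0%} of the toy stays are shorter than 7 days")
```

With the real data, share_below(df['time_in_hospital']) would give the proportion of records with a stay under 7 days; a value above 0.5 would confirm that the majority stay less than a week.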

