
Course Notes: Dealing with Missing Data in Python

In Python, there are several ways to handle missing data in a dataset. Some common methods include:

  1. Dropping missing values: This involves removing rows or columns that contain missing data. The dropna() function in pandas can be used for this purpose.

  2. Filling missing values: Instead of removing missing data, you can replace it with a specific value. The fillna() function in pandas fills missing values with a constant or a per-column statistic; the separate interpolate() function estimates them from neighboring values.

  3. Imputing missing values: Imputation involves estimating missing values based on the available data. This can be done using statistical methods such as mean, median, or mode imputation.

  4. Using machine learning algorithms: Another approach is to use machine learning algorithms to predict missing values based on other features in the dataset. This can be done using techniques like regression or k-nearest neighbors.
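
The first three methods in the list above can be sketched as follows. This is a minimal illustration on a made-up DataFrame; the column names `a` and `b` and their values are assumptions, not part of the course datasets.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): two numeric columns with gaps.
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 4.0],
    "b": [10.0, 20.0, np.nan, 40.0],
})

# 1. Dropping: keep only the rows that have no missing values.
dropped = toy.dropna()

# 2. Filling: replace every missing value with a constant.
filled_constant = toy.fillna(0)

# Interpolation: estimate each gap linearly from the neighbouring rows.
interpolated = toy.interpolate()

# 3. Imputing: replace missing values with the mean of their column.
imputed_mean = toy.fillna(toy.mean())

print(dropped, filled_constant, interpolated, imputed_mean, sep="\n\n")
```

Dropping shrinks the table, while the other three keep every row and differ only in the value that lands in each gap.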

It is important to carefully consider the nature of the missing data and the specific requirements of your analysis before choosing a method to handle missing data.

Remember to always handle missing data appropriately to avoid biased or inaccurate results in your analysis.
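
The fourth method, predicting missing values from other features, can be sketched with a simple linear regression. This is an illustration only: the toy columns `x` and `y` and the choice of `numpy.polyfit` as the model are assumptions, and a real dataset would usually call for a more careful model.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): y depends linearly on x, with two gaps in y.
toy = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "y": [2.0, 4.0, np.nan, 8.0, np.nan, 12.0],
})

# Rows where the target column is observed.
known = toy["y"].notna()

# Fit y = slope * x + intercept on the observed rows only.
slope, intercept = np.polyfit(toy.loc[known, "x"], toy.loc[known, "y"], deg=1)

# Predict y for the rows where it is missing and write the predictions back.
imputed = toy.copy()
imputed.loc[~known, "y"] = slope * toy.loc[~known, "x"] + intercept

print(imputed)
```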

1-) Missing Data Analysis

--->

  • Identifying missing values in the data sets.
  • Examining the distribution and rate of missing data.
  • Identifying missing data types (NaN, Null, etc.).
# Importing the necessary libraries
import pandas as pd

#load dataset
df = pd.read_csv("datasets/air-quality.csv")
df0 = pd.read_csv("datasets/pima-indians-diabetes%20data.csv")
print("First Data Set (df):")
print(df.head(5))

print("\nSecond Data Set (df0):")
print(df0.head(5))
# info() prints its summary directly and returns None, so it is not wrapped in print()
print("First Data Set (df):")
df.info()

print("\nSecond Data Set (df0):")
df0.info()

The first cell is a markdown cell that provides an introduction to the topic of dealing with missing data in Python. It explains four common methods for handling missing data: dropping missing values, filling missing values, imputing missing values, and using machine learning algorithms. It also emphasizes the importance of carefully considering the nature of the missing data and the specific requirements of the analysis.

The second cell is a code cell that imports pandas and loads two datasets: "air-quality.csv" into a dataframe named df, and the Pima Indians diabetes data into a dataframe named df0.

The third cell is a code cell that displays the first 5 rows of each dataframe using the head() function.

The fourth cell is a code cell that summarizes each dataframe using the info() function. It lists the column names, data types, and non-null counts; a non-null count below the total number of rows indicates missing values in that column.
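
The bullets for this section mention the count and rate of missing data, which the cells above do not compute directly. A minimal sketch, assuming a small made-up DataFrame in place of the loaded files (the values are illustrative only):

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical) standing in for a loaded dataset.
toy = pd.DataFrame({
    "Ozone": [41.0, np.nan, 12.0, np.nan],
    "Wind": [7.4, 8.0, np.nan, 11.5],
    "Temp": [67, 72, 74, 62],
})

# Number of missing values in each column.
missing_count = toy.isna().sum()

# Share of missing values in each column, as a percentage.
missing_percent = toy.isna().mean() * 100

summary = pd.DataFrame({"missing": missing_count, "percent": missing_percent})
print(summary)
```

If a file marks missing entries with a placeholder string instead of leaving them empty, the `na_values` parameter of `pd.read_csv` can convert those placeholders to NaN on load.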

2-) Missing Data Filling

--->

  • Methods used to fill in missing data (mean, median, mode, etc.).
  • Steps and results of filling missing values in the first data set.
  • Steps and results of filling missing values in the second data set.
# Filling missing values with appropriate methods
# For 'Ozone' and 'Solar', we'll use the mean of the columns
# For 'Wind', we'll use the median of the column
# For 'Temp', we'll use the mode of the column

print("First Data Set (df):")

# Calculate mean, median, and mode
ozone_mean = df['Ozone'].mean()
solar_mean = df['Solar'].mean()
wind_median = df['Wind'].median()
temp_mode = df['Temp'].mode()[0]

# Fill missing values

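# Assign the result back: calling fillna(..., inplace=True) on a selected
# column is deprecated in recent pandas and may not update df itself
df['Ozone'] = df['Ozone'].fillna(ozone_mean)
df['Solar'] = df['Solar'].fillna(solar_mean)
df['Wind'] = df['Wind'].fillna(wind_median)
df['Temp'] = df['Temp'].fillna(temp_mode)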

print(df.head())

print("\nSecond Data Set (df0):")

# Calculate mean, median, and mode for df0
pregnant_mean = df0['Pregnant'].mean()
glucose_mean = df0['Glucose'].mean()
diastolic_bp_median = df0['Diastolic_BP'].median()
skin_fold_mode = df0['Skin_Fold'].mode()[0]

# Fill missing values for df0
# Assign the result back instead of using inplace=True on a selected column
df0['Pregnant'] = df0['Pregnant'].fillna(pregnant_mean)
df0['Glucose'] = df0['Glucose'].fillna(glucose_mean)
df0['Diastolic_BP'] = df0['Diastolic_BP'].fillna(diastolic_bp_median)
df0['Skin_Fold'] = df0['Skin_Fold'].fillna(skin_fold_mode)
df0['Serum_Insulin'] = df0['Serum_Insulin'].fillna(df0['Serum_Insulin'].median())
df0['BMI'] = df0['BMI'].fillna(df0['BMI'].mean())
df0['Diabetes_Pedigree'] = df0['Diabetes_Pedigree'].fillna(df0['Diabetes_Pedigree'].median())

print(df0.head())
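
The mean is pulled toward extreme values while the median is not, which is one reason to pick a different statistic for different columns. The sketch below shows that difference and wraps the per-column choice in a small function. `fill_by_strategy` is a hypothetical helper written for this illustration, not a pandas function, and the toy columns are made up.

```python
import numpy as np
import pandas as pd


def fill_by_strategy(frame, strategies):
    """Return a copy of `frame` with each listed column filled by its statistic.

    `strategies` maps a column name to "mean", "median", or "mode".
    Hypothetical helper for illustration; not part of pandas.
    """
    result = frame.copy()
    for column, strategy in strategies.items():
        if strategy == "mean":
            value = result[column].mean()
        elif strategy == "median":
            value = result[column].median()
        elif strategy == "mode":
            value = result[column].mode()[0]  # first mode if there are ties
        else:
            raise ValueError(f"unknown strategy: {strategy!r}")
        result[column] = result[column].fillna(value)
    return result


# Toy data (hypothetical): "skewed" contains one extreme value.
toy = pd.DataFrame({
    "skewed": [1.0, 2.0, 3.0, 100.0, np.nan],
    "label": [5.0, 5.0, 7.0, np.nan, 5.0],
})

print("mean:", toy["skewed"].mean(), "median:", toy["skewed"].median())

filled = fill_by_strategy(toy, {"skewed": "median", "label": "mode"})
print(filled)
```

Because the helper works on a copy, the original frame keeps its missing values and the two versions can be compared afterwards.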

3-) Data Discovery and Inspection

--->

  • General statistical properties of the initial data set (mean, median, standard deviation, etc.).
  • General statistical properties of the second data set.
  • Examining the relationships between variables with correlation analysis.
  • Visualization of data distribution with boxplots and histograms.

1-) Correlation and Heatmap matrix

# Perform further analysis or data manipulation on the dataframe
print("\nFirst Data Set (df):")

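# Calculate the correlation matrix over the numeric columns only;
# recent pandas versions raise an error if text or date columns are included
correlation_matrix = df.corr(numeric_only=True)

# Display the correlation matrix
print(correlation_matrix)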

# 1. Explore the data further by visualizing the correlations between variables
import seaborn as sns
import matplotlib.pyplot as plt

# Plotting a heatmap of the correlation matrix
plt.figure(figsize=(8, 6))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm')
plt.title('Correlation Matrix')
plt.show()
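
A heatmap can be hard to read when there are many columns. As a complement, the correlation matrix can be flattened into a ranked list of variable pairs. This is a minimal sketch on made-up data; the columns `a`, `b`, and `c` are assumptions chosen so that the ranking is easy to check by eye.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): "b" rises with "a", "c" tends to fall as "a" rises.
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.1, 3.9, 6.2, 8.1, 9.8],
    "c": [5.0, 3.0, 4.0, 1.0, 2.0],
})

corr = toy.corr()

# Keep only the cells above the diagonal so each pair appears exactly once.
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(upper).stack().dropna()

# Order the pairs by the strength of the relationship, ignoring its sign.
ranked = pairs.sort_values(key=lambda s: s.abs(), ascending=False)
print(ranked)
```
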

1.1-) Visualization of data distribution with boxplots and histograms.

# 2. Analyze the relationship between 'Ozone' and other variables
# Scatter plot of 'Ozone' vs 'Solar'
plt.figure(figsize=(8, 6))
plt.scatter(df['Solar'], df['Ozone'])
plt.xlabel('Solar')
plt.ylabel('Ozone')
plt.title('Ozone vs Solar')
plt.show()

# Scatter plot of 'Ozone' vs 'Wind'
plt.figure(figsize=(8, 6))
plt.scatter(df['Wind'], df['Ozone'])
plt.xlabel('Wind')
plt.ylabel('Ozone')
plt.title('Ozone vs Wind')
plt.show()

# Scatter plot of 'Ozone' vs 'Temp'
plt.figure(figsize=(8, 6))
plt.scatter(df['Temp'], df['Ozone'])
plt.xlabel('Temp')
plt.ylabel('Ozone')
plt.title('Ozone vs Temp')
plt.show()

# 3. Perform statistical analysis on the variables
# Only the last bare expression in a cell is displayed, so each result is printed

# Calculate descriptive statistics for 'Ozone'
ozone_stats = df['Ozone'].describe()
print(ozone_stats)

# Calculate descriptive statistics for 'Solar'
solar_stats = df['Solar'].describe()
print(solar_stats)

# Calculate descriptive statistics for 'Wind'
wind_stats = df['Wind'].describe()
print(wind_stats)

# Calculate descriptive statistics for 'Temp'
temp_stats = df['Temp'].describe()
print(temp_stats)
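
The heading for this subsection also mentions histograms. A minimal sketch of one, assuming a short made-up series in place of a real column (the values and the bin count of 5 are illustrative choices):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Toy data (hypothetical) standing in for one numeric column.
values = pd.Series([12.0, 18.0, 23.0, 23.0, 31.0, 45.0, np.nan, 97.0], name="Ozone")

# Drop missing entries explicitly so the counts cover only observed values.
observed = values.dropna()

plt.figure(figsize=(8, 6))
counts, bin_edges, _ = plt.hist(observed, bins=5, edgecolor="black")
plt.xlabel("Ozone")
plt.ylabel("Count")
plt.title("Histogram of Ozone")
plt.show()
```

Where a boxplot summarizes a column in five numbers, the histogram shows the shape of the distribution, such as skew or more than one peak.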

# 4. Identify outliers in the data
# Boxplot of 'Ozone'
plt.figure(figsize=(8, 6))
sns.boxplot(x=df['Ozone'])  # pass x= explicitly so the box is horizontal in every seaborn version
plt.xlabel('Ozone')
plt.title('Boxplot of Ozone')
plt.show()

# Boxplot of 'Solar'
plt.figure(figsize=(8, 6))
sns.boxplot(x=df['Solar'])  # pass x= explicitly so the box is horizontal in every seaborn version
plt.xlabel('Solar')
plt.title('Boxplot of Solar')
plt.show()

# Boxplot of 'Wind'
plt.figure(figsize=(8, 6))
sns.boxplot(x=df['Wind'])  # pass x= explicitly so the box is horizontal in every seaborn version
plt.xlabel('Wind')
plt.title('Boxplot of Wind')
plt.show()

# Boxplot of 'Temp'
plt.figure(figsize=(8, 6))
sns.boxplot(x=df['Temp'])  # pass x= explicitly so the box is horizontal in every seaborn version
plt.xlabel('Temp')
plt.title('Boxplot of Temp')
plt.show()
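
The points drawn beyond the whiskers of a boxplot can also be listed numerically. The sketch below applies the 1.5 × IQR rule, which is the default whisker length in these boxplots. `iqr_outliers` is a hypothetical helper written for this illustration, and the sample values are made up.

```python
import pandas as pd


def iqr_outliers(series, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles.

    Hypothetical helper for illustration; k=1.5 matches the default
    whisker length of a boxplot.
    """
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return series[(series < lower) | (series > upper)]


# Toy data (hypothetical): one value sits far from the rest.
values = pd.Series([10, 12, 13, 14, 15, 16, 18, 60])
outliers = iqr_outliers(values)
print(outliers)
```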

2-) Correlation and Heatmap matrix for Second Data Set