As more animals face endangerment, understanding environmental impacts is key. Conservation groups analyze habitat and survival data to identify beneficial or harmful patterns. This helps target efforts like habitat restoration or policy changes. If certain environments prove better for wildlife, efforts are concentrated there. Acting as a conservation data scientist, you'll leverage data to inform conservation decisions by identifying the area where these efforts are needed most urgently.

This project is about both factor analysis and survival analysis. Factor analysis is useful in market research to identify underlying customer preferences, in human resources to understand employee satisfaction drivers, and in finance to assess investment risks by identifying underlying factors. Survival analysis is applied in real-life business contexts like customer churn prediction, analyzing the lifespan of products in the market, and estimating the time to failure for machinery in manufacturing settings.

The factor_data.csv and survival_data.csv files contain the data used to analyze how environmental factors influence wildlife populations.

factor_data.csv includes metrics on environmental and biological variables across various habitats:

| Variable | Description |
| --- | --- |
| AirQuality | Index of air quality in the habitat |
| Temperature | Scaled average temperature in the habitat |
| DeforestationRate | Scaled rate of deforestation in the area |
| SpeciesDiversity | Average number of different species in the habitat |
| ReproductiveRates | Scaled average reproductive rates of species in habitat |

survival_data.csv contains survival details of individual animals or populations, linking back to the environmental conditions:

| Variable | Description |
| --- | --- |
| Survival_Time | Time until the event (death) occurs |
| Censuring_Status | Event observed (1) or censored (0) |
| Habitat | Five different types of wildlife habitats |

Import Libraries

!pip install factor-analyzer lifelines
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from factor_analyzer import FactorAnalyzer
from lifelines import KaplanMeierFitter
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import FactorAnalysis

Factor Analysis

# Load the factor_data.csv
factor_data = pd.read_csv("factor_data.csv")
print(factor_data)
cor_factor_data = factor_data.corr(method='pearson')

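# Identify which variable is most strongly correlated with SpeciesDiversity.
# Drop SpeciesDiversity itself (its self-correlation is always 1) and rank by
# absolute value so that strong negative correlations are also considered.
species_correlation = cor_factor_data['SpeciesDiversity'].drop('SpeciesDiversity')
most_impactful_factor = species_correlation.abs().idxmax()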

print(cor_factor_data)
print(f"Most impactful factor: {most_impactful_factor}")
cor_factor_data = factor_data.corr()

# 2. Make a scree plot to determine the number of factors
fa = FactorAnalyzer(rotation=None)
fa.fit(factor_data)

# Get eigenvalues (explained variance)
eigenvalues, _ = fa.get_eigenvalues()

# Plot the scree plot
plt.figure(figsize=(8, 6))
plt.plot(np.arange(1, len(eigenvalues) + 1), eigenvalues, 'bo-', linewidth=2)
plt.title('Scree Plot')
plt.xlabel('Factors')
plt.ylabel('Eigenvalue')
plt.xticks(np.arange(1, len(eigenvalues) + 1))
plt.grid(True)
plt.show()
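
A scree plot is usually read by eye, but a common rule of thumb (the Kaiser criterion) is to keep the factors whose eigenvalue is greater than 1. The sketch below shows that rule on a small, made-up correlation matrix; the helper name `kaiser_n_factors` and the matrix `toy_corr` are illustrative assumptions, not values taken from factor_data.csv, and the rule is only one heuristic among several.

```python
import numpy as np

def kaiser_n_factors(corr):
    """Count eigenvalues of a correlation matrix that exceed 1 (Kaiser criterion)."""
    # eigvalsh is meant for symmetric matrices such as a correlation matrix
    eigenvalues = np.linalg.eigvalsh(np.asarray(corr, dtype=float))
    return int(np.sum(eigenvalues > 1.0))

# Made-up correlation matrix for illustration only: three variables that move
# together and two others that move together, with no link between the groups.
toy_corr = np.array([
    [1.0, 0.6, 0.6, 0.0, 0.0],
    [0.6, 1.0, 0.6, 0.0, 0.0],
    [0.6, 0.6, 1.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 1.0, 0.5],
    [0.0, 0.0, 0.0, 0.5, 1.0],
])

n_factors = kaplan = kaiser_n_factors(toy_corr)
print(n_factors)
```

On the project data, the same helper could be called as `kaiser_n_factors(factor_data.corr().to_numpy())` and compared against what the scree plot suggests before fixing the number of factors.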

num_factors = 2  

fa = FactorAnalyzer(n_factors=num_factors, rotation='varimax')  # Varimax rotation is commonly used
fa.fit(factor_data)

efa_model = fa

print(f"Number of factors: {num_factors}")
print("Factor Loadings:")
print(fa.loadings_)

# # Factor Analysis ; which environmental factors are most important in influencing wildlife survival.

# # for better factor analysis
# scaler = StandardScaler()
# factor_data_scaled = scaler.fit_transform(factor_data)

# fa = FactorAnalysis(n_components=2)  # to Adjust the number of factors based on explained variance
# factor_scores = fa.fit_transform(factor_data_scaled)

# # Factor 1: "Environmental Stress" (Air Quality, Temperature, Deforestation Rate)
# # Factor 2: "Biodiversity Health" (Species Diversity, Reproductive Rates)

# factor_df = pd.DataFrame(factor_scores, columns=["Factor1", "Factor2"])
# factor_df

# High Environmental Stress Factor → More pollution, extreme temperatures, high deforestation → Worse for wildlife survival.
# High Biodiversity Health Factor → More species, better reproductive rates → Better for wildlife survival.
# # Merge with original factor data
# factor_data = pd.concat([factor_data, factor_df], axis=1)
# factor_data
# # Compute correlation matrix
# cor_factor_data = factor_data.corr()

# # Identify the variable with the strongest correlation to SpeciesDiversity
# species_correlation = cor_factor_data["SpeciesDiversity"].drop("SpeciesDiversity")
# most_impactful_factor = species_correlation.abs().idxmax()

# # Print results
# print("Correlation Matrix:\n", cor_factor_data)
# print(f"The most impactful environmental factor is: {most_impactful_factor}")
# # how much each variable contributes to each factor
# # Adjust the columns to match the shape of fa.loadings_
# factor_loadings = pd.DataFrame(fa.loadings_, columns=factor_data.columns[:5], index=["Environmental Stress", "Biodiversity Health"])

# plt.figure(figsize=(8, 5))
# sns.heatmap(factor_loadings, annot=True, cmap="coolwarm", center=0)
# plt.title("Factor Loadings Heatmap")
# plt.show()

Survival Analysis
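
As a minimal sketch of what a Kaplan-Meier (product-limit) estimate computes from a survival time and a censoring flag, the hand-rolled function below walks through the calculation step by step. It is not the lifelines implementation; the function name `kaplan_meier` and the toy `durations` and `observed` values are assumptions made up for illustration, not rows from survival_data.csv.

```python
def kaplan_meier(durations, observed):
    """Return [(time, survival_probability)] at each distinct observed event time.

    durations: time until death or censoring for each animal
    observed:  1 if the death was observed, 0 if the record was censored
    """
    # Only times with at least one observed death change the estimate
    event_times = sorted({t for t, e in zip(durations, observed) if e == 1})
    survival = 1.0
    curve = []
    for t in event_times:
        # Animals still under observation just before time t
        at_risk = sum(1 for d in durations if d >= t)
        # Observed deaths exactly at time t (censored records are excluded)
        deaths = sum(1 for d, e in zip(durations, observed) if d == t and e == 1)
        # Multiply by the conditional probability of surviving past t
        survival *= 1 - deaths / at_risk
        curve.append((t, survival))
    return curve

# Made-up toy data: five animals, two of them censored
durations = [2, 3, 3, 5, 8]
observed = [1, 1, 0, 1, 0]

curve = kaplan_meier(durations, observed)
for time, probability in curve:
    print(time, round(probability, 3))
```

Censored animals still count as "at risk" until their censoring time, which is why they lower the denominator later without ever counting as a death. With the imported `KaplanMeierFitter`, the equivalent step would likely be one `fit(durations, event_observed=...)` call per habitat, assuming the column names listed in the table above.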
