Prodigy-ML-04
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
import plotly.express as px
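# Load the data: read each CSV, then join the frames side by side (axis=1).
# Rows are aligned by index, so shorter files leave NaN in their columns.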
file_paths = ['./0.csv', './1.csv', './2.csv', './3.csv']
combined_df = pd.concat([pd.read_csv(file) for file in file_paths], axis=1)
# Handle missing values by filling each column's NaNs with that column's mean
combined_df.fillna(combined_df.mean(), inplace=True)
# Standardize the data
scaler = StandardScaler()
scaled_data = scaler.fit_transform(combined_df)
# Perform PCA
pca = PCA(n_components=2)
principal_components = pca.fit_transform(scaled_data)
# Create a DataFrame with the principal components
pca_df = pd.DataFrame(data=principal_components, columns=['PC1', 'PC2'])
# Perform K-Means clustering (n_init is set explicitly because its default
# differs between scikit-learn versions)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
clusters = kmeans.fit_predict(pca_df)
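# Add cluster labels to the DataFrame as strings so Plotly treats them as
# discrete categories rather than a continuous color scale
pca_df['Cluster'] = clusters.astype(str)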
# Plot the clusters
fig = px.scatter(pca_df, x='PC1', y='PC2', color='Cluster', title='K-Means Clustering on PCA Components')
fig.show()
# Notebook content: the same analysis written out as an nbformat 4 JSON document
notebook_content = '''
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Data Analysis and Clustering\\n",
"\\n",
"## Introduction\\n",
"This notebook performs Principal Component Analysis (PCA) and K-Means clustering on a combined dataset from multiple CSV files.\\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Data Loading\\n",
"The data is loaded from four CSV files and combined into a single DataFrame.\\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\\n",
"file_paths = ['./0.csv', './1.csv', './2.csv', './3.csv']\\n",
"combined_df = pd.concat([pd.read_csv(file) for file in file_paths], axis=1)\\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Data Preprocessing\\n",
"Handle missing values and standardize the data before performing PCA.\\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.preprocessing import StandardScaler\\n",
"\\n",
"# Handle missing values\\n",
"combined_df.fillna(combined_df.mean(), inplace=True)\\n",
"\\n",
"# Standardize the data\\n",
"scaler = StandardScaler()\\n",
"scaled_data = scaler.fit_transform(combined_df)\\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Principal Component Analysis (PCA)\\n",
"PCA is performed to reduce the dimensionality of the data and visualize the first two principal components.\\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.decomposition import PCA\\n",
"\\n",
"# Perform PCA\\n",
"pca = PCA(n_components=2)\\n",
"principal_components = pca.fit_transform(scaled_data)\\n",
"\\n",
"# Create a DataFrame with the principal components\\n",
"pca_df = pd.DataFrame(data=principal_components, columns=['PC1', 'PC2'])\\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## K-Means Clustering\\n",
"K-Means clustering is applied to the principal components to identify clusters within the data.\\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.cluster import KMeans\\n",
"\\n",
"# Perform K-Means clustering\\n",
"kmeans = KMeans(n_clusters=3, random_state=42)\\n",
"clusters = kmeans.fit_predict(pca_df)\\n",
"\\n",
"# Add cluster labels to the DataFrame\\n",
"pca_df['Cluster'] = clusters\\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Visualization\\n",
"The clusters are visualized using a scatter plot of the first two principal components.\\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import plotly.express as px\\n",
"\\n",
"# Plot the clusters\\n",
"fig = px.scatter(pca_df, x='PC1', y='PC2', color='Cluster', title='K-Means Clustering on PCA Components')\\n",
"fig.show()\\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Conclusion\\n",
"K-Means partitions the data into three clusters in the space of the first two principal components. Inspect the scatter plot to judge how well separated they are.\\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Additional Visualizations\\n",
"To further understand the data, we can visualize the distribution of each principal component and the explained variance ratio.\\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Explained variance ratio\\n",
"explained_variance = pca.explained_variance_ratio_\\n",
"fig = px.bar(x=['PC1', 'PC2'], y=explained_variance, title='Explained Variance Ratio')\\n",
"fig.show()\\n",
"\\n",
"# Distribution of PC1\\n",
"fig = px.histogram(pca_df, x='PC1', title='Distribution of PC1')\\n",
"fig.show()\\n",
"\\n",
"# Distribution of PC2\\n",
"fig = px.histogram(pca_df, x='PC2', title='Distribution of PC2')\\n",
"fig.show()\\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.5"
}
},
"nbformat": 4,
"nbformat_minor": 4
}
'''
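
# A minimal sketch of the save step that the string above is prepared for.
# Assumptions (not part of the original script): the helper name
# save_notebook and the output file name in the example call are hypothetical.
# The text is parsed with the standard json module first, so malformed JSON
# fails before any file is written.
import json
from pathlib import Path


def save_notebook(content, path):
    """Validate notebook JSON text and write it to path; return the Path."""
    notebook = json.loads(content)  # raises json.JSONDecodeError if invalid
    if "cells" not in notebook:
        raise ValueError("notebook JSON has no 'cells' list")
    out_path = Path(path)
    out_path.write_text(json.dumps(notebook, indent=1), encoding="utf-8")
    return out_path


# Example call with a hypothetical file name:
# save_notebook(notebook_content, "pca_kmeans_analysis.ipynb")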