Prodigy-ML-04 — DataLab

import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
import plotly.express as px

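# Load the data
# axis=1 places the four files side by side (column-wise), aligned on the row
# index; rows missing from shorter files become NaN and are filled in below.
file_paths = ['./0.csv', './1.csv', './2.csv', './3.csv']
combined_df = pd.concat([pd.read_csv(file) for file in file_paths], axis=1)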

# Handle missing values
combined_df.fillna(combined_df.mean(), inplace=True)

# Standardize the data
scaler = StandardScaler()
scaled_data = scaler.fit_transform(combined_df)

# Perform PCA
pca = PCA(n_components=2)
principal_components = pca.fit_transform(scaled_data)

# Create a DataFrame with the principal components
pca_df = pd.DataFrame(data=principal_components, columns=['PC1', 'PC2'])

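# Perform K-Means clustering
# n_init is set explicitly because its default differs between scikit-learn versions
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
clusters = kmeans.fit_predict(pca_df)

# Add cluster labels to the DataFrame as strings so Plotly uses discrete colors
pca_df['Cluster'] = clusters.astype(str)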

# Plot the clusters
fig = px.scatter(pca_df, x='PC1', y='PC2', color='Cluster', title='K-Means Clustering on PCA Components')
fig.show()
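
# A possible sanity check on the choice of n_clusters=3 above: compare the mean
# silhouette score for several candidate values of k. This is only a sketch;
# silhouette_by_k and its k_values default are hypothetical names that are not
# part of the original analysis.
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score


def silhouette_by_k(points, k_values=range(2, 7), random_state=42):
    """Return {k: mean silhouette score} for K-Means fitted on points."""
    scores = {}
    for k in k_values:
        model = KMeans(n_clusters=k, n_init=10, random_state=random_state)
        labels = model.fit_predict(points)
        scores[k] = silhouette_score(points, labels)
    return scores


# Example use on the PCA projection (higher scores mean better-separated clusters):
# scores = silhouette_by_k(principal_components)
# best_k = max(scores, key=scores.get)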

# Notebook version of this analysis, stored as an nbformat 4 JSON string
notebook_content = '''
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Data Analysis and Clustering\\n",
    "\\n",
    "## Introduction\\n",
    "This notebook performs Principal Component Analysis (PCA) and K-Means clustering on a combined dataset from multiple CSV files.\\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Data Loading\\n",
    "The data is loaded from four CSV files and concatenated column-wise (axis=1) into a single DataFrame.\\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\\n",
    "file_paths = ['./0.csv', './1.csv', './2.csv', './3.csv']\\n",
    "combined_df = pd.concat([pd.read_csv(file) for file in file_paths], axis=1)\\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Data Preprocessing\\n",
    "Handle missing values and standardize the data before performing PCA.\\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.preprocessing import StandardScaler\\n",
    "\\n",
    "# Handle missing values\\n",
    "combined_df.fillna(combined_df.mean(), inplace=True)\\n",
    "\\n",
    "# Standardize the data\\n",
    "scaler = StandardScaler()\\n",
    "scaled_data = scaler.fit_transform(combined_df)\\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Principal Component Analysis (PCA)\\n",
    "PCA projects the standardized data onto its first two principal components, reducing it to two dimensions for plotting.\\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.decomposition import PCA\\n",
    "\\n",
    "# Perform PCA\\n",
    "pca = PCA(n_components=2)\\n",
    "principal_components = pca.fit_transform(scaled_data)\\n",
    "\\n",
    "# Create a DataFrame with the principal components\\n",
    "pca_df = pd.DataFrame(data=principal_components, columns=['PC1', 'PC2'])\\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## K-Means Clustering\\n",
    "K-Means clustering is applied to the principal components to identify clusters within the data.\\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.cluster import KMeans\\n",
    "\\n",
    "# Perform K-Means clustering\\n",
    "kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)\\n",
    "clusters = kmeans.fit_predict(pca_df)\\n",
    "\\n",
    "# Add cluster labels to the DataFrame as strings so Plotly uses discrete colors\\n",
    "pca_df['Cluster'] = clusters.astype(str)\\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Visualization\\n",
    "The clusters are visualized using a scatter plot of the first two principal components.\\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import plotly.express as px\\n",
    "\\n",
    "# Plot the clusters\\n",
    "fig = px.scatter(pca_df, x='PC1', y='PC2', color='Cluster', title='K-Means Clustering on PCA Components')\\n",
    "fig.show()\\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Conclusion\\n",
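    "K-Means assigns every point in the PC1/PC2 projection to one of three clusters, and the scatter plot shows how well separated they are. K-Means always returns the requested number of clusters, so the grouping alone does not confirm underlying structure in the data.\\n"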
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Additional Visualizations\\n",
    "To further understand the data, we can visualize the distribution of each principal component and the explained variance ratio.\\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Explained variance ratio\\n",
    "explained_variance = pca.explained_variance_ratio_\\n",
    "fig = px.bar(x=['PC1', 'PC2'], y=explained_variance, title='Explained Variance Ratio')\\n",
    "fig.show()\\n",
    "\\n",
    "# Distribution of PC1\\n",
    "fig = px.histogram(pca_df, x='PC1', title='Distribution of PC1')\\n",
    "fig.show()\\n",
    "\\n",
    "# Distribution of PC2\\n",
    "fig = px.histogram(pca_df, x='PC2', title='Distribution of PC2')\\n",
    "fig.show()\\n"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
'''
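
# The string above is never written to disk by this script. A minimal sketch of
# saving it follows; the helper name save_notebook and the output file name in
# the example are assumptions, not part of the original code.
import json


def save_notebook(content, path):
    """Validate content as JSON and write it to path as a notebook file."""
    notebook = json.loads(content)  # raises json.JSONDecodeError if the string is malformed
    with open(path, 'w', encoding='utf-8') as f:
        json.dump(notebook, f, indent=1)
    return notebook


# Example use (hypothetical output path):
# save_notebook(notebook_content, 'analysis.ipynb')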