Known Website Data Analytics
As they compete for market share in the beauty sector, large corporations increasingly rely on signals such as search engine trends, geolocation of customer concentrations, and personalised product recommendations based on social media image analysis. Companies also gather real-time product pricing so they can apply dynamic pricing strategies, offering competitive bargains to value shoppers or tailored prices for specific demographics, particularly in emerging markets where price is often the deciding factor for budget-conscious customers.
Import Required Libraries
# Install packages not available by default in the notebook environment
!pip install pingouin
!pip install statsmodels

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import missingno as msno
import pingouin as pg
import plotly.express as px
import scipy.stats as stats
import sklearn as sk
from wordcloud import WordCloud
import statsmodels.stats.multitest as sm
Data Preprocessing
The data set consists of 9,168 entries across 11 columns, none of which contain missing values. Numerical attributes such as 'Product Rating', 'Number of reviews', 'Number of people liked product' and 'Price' show plausible summary statistics, and the column data types appear appropriate.
# Read data set
website = pd.read_csv("Known_Website_data.csv")
website.head(15)
Overview
# Examine the general structure of the data set
print("General structure of the data set:")
print(website.info())
# Detect incorrect, contradictory or abnormal values in the data set
print("\nBasic statistics of numeric variables in the data set:")
print(website.describe())
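The claim that no columns contain missing values can be checked directly; a minimal sketch using isnull().sum() together with the missingno library imported above (the expected all-zero counts are an assumption based on the preprocessing summary):
# Count missing values per column; according to the summary above, all counts should be zero
print("\nMissing values per column:")
print(website.isnull().sum())
# Visual confirmation of completeness with missingno
msno.matrix(website)
plt.show()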
Visualisations
Visualisations such as scatter plots and pair plots help in understanding the data. A scatter plot of product rating against number of reviews gives insight into customer interaction, and a boxplot of price by category is a good way to inspect outliers; both are sketched after the code below.
# Visualisations to see the distributions and relationships of variables in the data set.
sns.histplot(website["Price"], kde=False, bins=20) # show the distribution of product prices (distplot is deprecated in recent seaborn versions)
sns.pairplot(website[["Product Rating", "Number of reviews", "Number of people liked product", "Price"]]) # show relationships between numeric variables
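The scatter plot and boxplot mentioned above are not included in the original code; a minimal sketch using the same column names:
# Scatter plot: product rating vs. number of reviews (customer interaction)
sns.scatterplot(x="Product Rating", y="Number of reviews", data=website)
plt.show()
# Boxplot: price distribution per category, for outlier analysis
sns.boxplot(x="Price", y="Category of Product", data=website)
plt.show()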
Analysis of Categorical Variables
# To see the frequencies and effects of categorical variables in the data set
website["Category of Product"].value_counts() # show frequencies of product categories
sns.barplot(x="Price", y="Category of Product", data=website) # show average prices of product categories
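The WordCloud import above is not used elsewhere in this section; a hedged sketch applying it to the category frequencies already computed (weighting the cloud by value_counts is an assumption about the intended use):
# Word cloud weighted by how often each product category appears (assumed use of the WordCloud import)
category_counts = website["Category of Product"].value_counts()
wc = WordCloud(width=800, height=400, background_color="white").generate_from_frequencies(category_counts.to_dict())
plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
plt.show()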
Model Development, Training and Evaluation
Correlation and Regression Analysis
Various regression models (Random Forest, Gradient Boosting, Decision Tree, Linear Regression, Lasso, Ridge, SVR, KNN, AdaBoost) were applied to predict product prices. Random Forest achieves the highest R-squared value, indicating the best performance among the models. Linear Regression performs poorly, with an extremely low R-squared value, and some models such as AdaBoost produce negative R-squared values, suggesting they are not well suited to this data set. The code below assumes the training and test splits (X_train, X_test, y_train, y_test) have already been prepared; a sketch of that preparation is given first.
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor, AdaBoostRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LinearRegression, Lasso, Ridge
from sklearn.svm import SVR
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import r2_score
# Models
rf_model = RandomForestRegressor(random_state=42)
gb_model = GradientBoostingRegressor(random_state=42)
tree_model = DecisionTreeRegressor(random_state=42)
lr_model = LinearRegression()
lasso_model = Lasso()
ridge_model = Ridge()
svr_model = SVR()
knn_model = KNeighborsRegressor()
ada_model = AdaBoostRegressor(random_state=42)
# Training of models
rf_model.fit(X_train, y_train)
gb_model.fit(X_train, y_train)
tree_model.fit(X_train, y_train)
lr_model.fit(X_train, y_train)
lasso_model.fit(X_train, y_train)
ridge_model.fit(X_train, y_train)
svr_model.fit(X_train, y_train)
knn_model.fit(X_train, y_train)
ada_model.fit(X_train, y_train)
# Predictions on the test set
y_pred_rf = rf_model.predict(X_test)
y_pred_gb = gb_model.predict(X_test)
y_pred_tree = tree_model.predict(X_test)
y_pred_lr = lr_model.predict(X_test)
y_pred_lasso = lasso_model.predict(X_test)
y_pred_ridge = ridge_model.predict(X_test)
y_pred_svr = svr_model.predict(X_test)
y_pred_knn = knn_model.predict(X_test)
y_pred_ada = ada_model.predict(X_test)
# Evaluation of R-squared values
r2_rf = r2_score(y_test, y_pred_rf)
r2_gb = r2_score(y_test, y_pred_gb)
r2_tree = r2_score(y_test, y_pred_tree)
r2_lr = r2_score(y_test, y_pred_lr)
r2_lasso = r2_score(y_test, y_pred_lasso)
r2_ridge = r2_score(y_test, y_pred_ridge)
r2_svr = r2_score(y_test, y_pred_svr)
r2_knn = r2_score(y_test, y_pred_knn)
r2_ada = r2_score(y_test, y_pred_ada)
# Print the R-squared results
print("Random Forest Model R-square Value:", r2_rf)
print("Gradient Boosting Model R-square Value:", r2_gb)
print("Decision Tree Model R-square Value:", r2_tree)
print("Linear Regression Model R-square Value:", r2_lr)
print("Lasso Model R-square Value:", r2_lasso)
print("Ridge Model R-square Value:", r2_ridge)
print("SVR Model R-square Value:", r2_svr)
print("KNN Model R-square Value:", r2_knn)
print("AdaBoost Model R-square Value:", r2_ada)