Statistical Analysis

What is Statistical Analysis?

Statistical analysis is the process of collecting, analyzing, and interpreting data to uncover patterns, trends, and relationships. It helps answer key questions like:

Is a new marketing campaign increasing sales?
Do two variables have a meaningful correlation?
Can we predict future trends based on historical data?

What makes data science, science? The answer is statistics. Today, we dive into Descriptive Statistics.

As Jack Reacher (just Reacher) famously said: “In an investigation, assumptions kill.” So, how do we kill assumptions in data? Descriptive statistics is the answer.

What is Descriptive Statistics?

Descriptive statistics summarize and organize data to reveal its main features through numerical measures, tables, and graphs. They don’t predict or infer; they simply describe the data at hand, using measures such as:

Mean, median, mode
Standard deviation, variance, range
Skewness, kurtosis

It answers: What is the data telling us?
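To see why it helps to look at more than one of these measures, here is a minimal sketch using only Python's standard `statistics` module. The numbers are made up for illustration (they are not the retail data used later); the point is that a single unusual day pulls the mean far more than the median.

```python
import statistics

# Hypothetical daily sales for a quiet week (illustrative values only)
typical_week = [100, 102, 98, 101, 99]

# The same week with one hypothetical promotion day added
week_with_spike = typical_week + [500]

mean_typical = statistics.mean(typical_week)
median_typical = statistics.median(typical_week)

mean_spike = statistics.mean(week_with_spike)
median_spike = statistics.median(week_with_spike)

print(f"Typical week -> mean: {mean_typical:.2f}, median: {median_typical:.2f}")
print(f"With spike   -> mean: {mean_spike:.2f}, median: {median_spike:.2f}")
```

When the mean and median disagree this much, that is usually a hint to check for outliers or a skewed distribution before trusting "the average".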

Use Case: A retail store wants to analyze its daily sales over a month to understand performance, trends, and outliers. Descriptive statistics summarize the data and provide insights into sales, variability, and distribution.

Key Parameters:
✔️ Mean & Median: What’s the average revenue?
✔️ Mode: Which sales figure occurred most often?
✔️ Standard Deviation & Range: How much do sales fluctuate?
✔️ Skewness & Kurtosis: Is the sales distribution symmetric, or is it lopsided with heavy tails?
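Skewness and kurtosis are the least intuitive items on this list, so a hand-rolled version may help. The sketch below computes both from central moments with plain Python. The helper name `shape_stats` and the two small datasets are hypothetical; the formulas are the population (biased) versions, which should correspond to the defaults of `scipy.stats.skew` and `scipy.stats.kurtosis` (excess kurtosis, where a normal distribution scores 0).

```python
def shape_stats(values):
    """Return (skewness, excess kurtosis) using population central moments."""
    n = len(values)
    mean = sum(values) / n
    # k-th central moment: the average of (x - mean) ** k
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    m4 = sum((x - mean) ** 4 for x in values) / n
    skewness = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3
    return skewness, excess_kurtosis

# Hypothetical datasets, chosen only to show the sign of the skewness
symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 1, 2, 10]

skew_sym, kurt_sym = shape_stats(symmetric)
skew_right, kurt_right = shape_stats(right_skewed)

print(f"Symmetric data:    skewness={skew_sym:.2f}, excess kurtosis={kurt_sym:.2f}")
print(f"Right-skewed data: skewness={skew_right:.2f}, excess kurtosis={kurt_right:.2f}")
```

A skewness near zero suggests a roughly symmetric distribution; a positive value means a longer tail on the high side, such as a few unusually strong sales days.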

import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt 
from scipy import stats

# Simulated daily sales data for 30 days
sales_data = [238, 258, 226, 270, 268, 318, 295, 288, 278, 265, 308, 340, 320, 298, 275, 285, 295, 325, 330, 310, 350, 360, 348, 376, 390, 410, 428, 395, 385, 378]

# Convert the list to a pandas DataFrame
df = pd.DataFrame(sales_data, columns = ['Daily Sales'])

mean_sales = np.mean(sales_data)
print(mean_sales)
median_sales = np.median(sales_data)
print(median_sales)

# Handle the mode: report it only if some value actually repeats
mode_result = stats.mode(sales_data, keepdims=True)
mode_sales = mode_result.mode[0] if mode_result.count[0] > 1 else "No mode"
print(mode_sales)

std_dev_sales = np.std(sales_data, ddof=1) 
range_sales = np.ptp(sales_data) 
iqr_sales = np.percentile(sales_data, 75) - np.percentile(sales_data, 25) 
skewness  = stats.skew(sales_data) 
kurtosis  = stats.kurtosis(sales_data)

# Display results
print(f"Mean Sales: ${mean_sales:.2f}")
print(f"Median Sales: ${median_sales:.2f}")
print(f"Mode Sales: {mode_sales}")
print(f"Standard Deviation: ${std_dev_sales:.2f}")
print(f"Range: ${range_sales}")
print(f"Interquartile Range (IQR): ${iqr_sales:.2f}")
print(f"Skewness: {skewness:.2f}")
print(f"Kurtosis: {kurtosis:.2f}")

# Visualization: histogram of daily sales
plt.figure(figsize=(18,5))
# alpha is the bar opacity and must lie between 0 and 1
plt.hist(sales_data, bins=7, color='skyblue', edgecolor='black', alpha=0.7)
plt.axvline(mean_sales, color='red', linestyle='dashed', linewidth=2, label="Mean") 
plt.axvline(median_sales, color='green', linestyle='dashed', linewidth=2, label="Median") 
plt.title("Daily Sales Distribution") 
plt.xlabel("Sales ($)") 
plt.ylabel("Frequency") 
plt.legend() 
plt.show()
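The use case also mentions outliers, which the script above does not check for explicitly. One common rule of thumb (Tukey's 1.5 × IQR fences) can be sketched with the standard library alone. The helper `iqr_outliers` and the $900 "spike day" are hypothetical additions for illustration; the sales figures are copied from the listing above.

```python
import statistics

# Daily sales copied from the listing above
sales_data = [238, 258, 226, 270, 268, 318, 295, 288, 278, 265,
              308, 340, 320, 298, 275, 285, 295, 325, 330, 310,
              350, 360, 348, 376, 390, 410, 428, 395, 385, 378]

def iqr_outliers(values, k=1.5):
    """Return the values that fall outside [Q1 - k*IQR, Q3 + k*IQR]."""
    # method="inclusive" interpolates linearly between data points,
    # which should match np.percentile's default behaviour
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

outliers = iqr_outliers(sales_data)
print("Outliers in the original data:", outliers)

# Hypothetical spike day (not part of the original data) to show the rule firing
outliers_with_spike = iqr_outliers(sales_data + [900])
print("Outliers after adding a $900 day:", outliers_with_spike)
```

On the simulated month, no day falls outside the fences, which is consistent with the fairly compact histogram. The 1.5 multiplier is a convention rather than a law, so it is worth adjusting `k` to suit how strict the check should be.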