Introduction to Statistics in Python
Run the hidden code cell below to import the data used in this course.
# There are two main types of statistics: descriptive and inferential.
# Descriptive statistics summarize and describe the main features of a dataset.
# Examples of descriptive statistics include measures of central tendency (mean, median, mode), measures of variability (range, variance, standard deviation), and measures of shape (skewness, kurtosis).
# Inferential statistics use sample data to make inferences about a larger population.
# Examples of inferential statistics include hypothesis testing, confidence intervals, and regression analysis.
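To make the distinction concrete, here is a minimal sketch that computes descriptive statistics for a small sample and then an inferential 95% confidence interval for the population mean. The sample values are made up for illustration, and the t-based interval is only one common way to build such an interval.

```python
import numpy as np
from scipy import stats

# Made-up sample of 8 measurements (an assumption for illustration).
sample = np.array([4.8, 5.1, 5.0, 4.9, 5.3, 5.2, 4.7, 5.0])

# Descriptive: summarize the sample itself.
sample_mean = sample.mean()
sample_std = sample.std(ddof=1)  # sample standard deviation

# Inferential: 95% confidence interval for the population mean,
# based on the t distribution with n - 1 degrees of freedom.
ci_low, ci_high = stats.t.interval(
    0.95, len(sample) - 1, loc=sample_mean, scale=stats.sem(sample)
)

print(sample_mean, sample_std)
print(ci_low, ci_high)
```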
# Examples of numeric data
age = 25
height = 1.75
income = 50000.00
# Examples of categorical data
gender = "male"
marital_status = "married"
education_level = "bachelor's degree"
# Examples of continuous numeric data
temperature = 25.5
weight = 68.2
height = 1.75
# Examples of discrete numeric data
number_of_children = 2
number_of_cars = 1
number_of_pets = 3
# Examples of nominal data
color = "red"
fruit = "apple"
city = "New York"
# Examples of ordinal data
education_level = "bachelor's degree"
job_level = "manager"
rating = "excellent"
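In pandas, the nominal/ordinal distinction can be expressed with categorical data. The sketch below assumes a hypothetical rating scale ("poor" < "good" < "excellent"); the values and the ordering are illustrative, not taken from the course data.

```python
import pandas as pd

# Hypothetical survey responses; the category order is an assumption.
ratings = pd.Series(["good", "excellent", "poor", "good"])
rating_levels = ["poor", "good", "excellent"]

# Ordinal: the categories have a meaningful order.
ordinal = pd.Categorical(ratings, categories=rating_levels, ordered=True)

# Nominal: the categories have no inherent order.
colors = pd.Categorical(["red", "blue", "red"])

# Ordering makes comparisons such as max() meaningful for ordinal data.
highest_rating = ordinal.max()
print(ordinal.ordered, colors.ordered, highest_rating)
```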
### Mean, median, and mode are measures of central tendency used in descriptive statistics.
# Mean is the average value of a dataset. It is calculated by summing all the values in the dataset and dividing by the number of values.
deals["amount"].mean()
# Median is the middle value of a dataset. It is calculated by sorting the values in the dataset and selecting the middle value. If there is an even number of values, the median is the average of the two middle values.
happiness["gdp_per_cap"].median()
# Mode is the most common value in a dataset. It is the value that appears most frequently.
food["food_category"].mode()
# Pandas provides built-in Series methods to calculate mean, median, and mode.
deals["num_users"].mean()
happiness["life_exp"].median()
food["consumption"].mode()
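One detail worth knowing: the pandas mode() method returns a Series rather than a single value, because several values can tie for most frequent. A small sketch with made-up category labels:

```python
import pandas as pd

# Made-up labels in which two values tie for most frequent.
categories = pd.Series(["dairy", "beef", "dairy", "beef", "rice"])

# mode() returns every tied value, sorted.
modes = categories.mode()
print(modes.tolist())

# Take the first entry if a single value is needed.
first_mode = modes.iloc[0]
```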
The mean is more sensitive to extreme values than the median, so the median is usually the better choice when the data contain outliers.
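A quick sketch of this effect, using made-up salary figures (in thousands): adding a single extreme value moves the mean a lot and the median only slightly.

```python
import numpy as np

# Made-up salaries in thousands (an assumption for illustration).
salaries = np.array([30, 32, 35, 38, 40])
with_outlier = np.append(salaries, 400)  # add one extreme value

mean_before, median_before = salaries.mean(), np.median(salaries)
mean_after, median_after = with_outlier.mean(), np.median(with_outlier)

print(mean_before, median_before)  # 35.0 35.0
print(mean_after, median_after)    # mean jumps to about 95.8, median is 36.5
```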
# Skewness is a measure of the asymmetry of a probability distribution.
# A distribution is said to be skewed if it is not symmetric.
# If the tail of the distribution is longer on the right side, it is said to be positively skewed.
# If the tail of the distribution is longer on the left side, it is said to be negatively skewed.
# Skewness can be calculated using the skew() function from the scipy.stats library.
from scipy.stats import skew
# Calculate the skewness of the "amount" column in the "deals" dataframe
skew(deals["amount"])
# A positive skewness value indicates a longer tail on the right side, while a negative skewness value indicates a longer tail on the left side.
The mean is pulled in the direction of the skew:
- lower than the median in left-skewed data
- higher than the median in right-skewed data
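The relationship between skew direction and the mean can be checked on a small made-up dataset. Here the left-skewed data is simply the mirror image of the right-skewed data, which is an illustrative construction rather than real course data.

```python
import numpy as np
from scipy.stats import skew

# Made-up data with a long right tail (one large value).
right_skewed = np.array([1, 2, 2, 3, 3, 3, 4, 20])
# Mirror it to get a long left tail.
left_skewed = -right_skewed

right_skew, left_skew = skew(right_skewed), skew(left_skewed)

# Right skew: mean (4.75) is above the median (3.0).
print(right_skew, right_skewed.mean(), np.median(right_skewed))
# Left skew: mean (-4.75) is below the median (-3.0).
print(left_skew, left_skewed.mean(), np.median(left_skewed))
```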
# Variance is a measure of how spread out a set of data is. It is calculated as the average of the squared differences from the mean.
# Let's say we have a list of numbers:
numbers = [2, 4, 6, 8, 10]
# We can calculate the mean of these numbers:
mean = sum(numbers) / len(numbers)
# Next, we can calculate the squared differences from the mean:
squared_diffs = [(x - mean) ** 2 for x in numbers]
# Finally, we can calculate the variance as the average of the squared differences:
variance = sum(squared_diffs) / len(squared_diffs)
# We can also use the numpy library. By default np.var() divides by n (population variance), which matches the manual calculation above:
import numpy as np
variance_np = np.var(numbers)
Important: the larger the variance, the more spread out the data are around the mean.
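Note that libraries differ in their default divisor. As a sketch using the same five numbers: np.var() divides by n unless ddof is set, while the pandas var() method divides by n - 1 (sample variance) by default.

```python
import numpy as np
import pandas as pd

numbers = [2, 4, 6, 8, 10]

# Population variance: divide by n (NumPy default, ddof=0).
population_var = np.var(numbers)

# Sample variance: divide by n - 1 (ddof=1).
sample_var = np.var(numbers, ddof=1)

# pandas uses ddof=1 by default, so it matches the sample variance.
pandas_var = pd.Series(numbers).var()

print(population_var, sample_var, pandas_var)  # 8.0 10.0 10.0
```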
# Standard deviation is a measure of how spread out a set of data is from the mean. It is calculated as the square root of the variance.
# Let's say we have a list of numbers:
numbers = [2, 4, 6, 8, 10]
# We can calculate the mean of these numbers:
mean = sum(numbers) / len(numbers)
# Next, we can calculate the squared differences from the mean:
squared_diffs = [(x - mean) ** 2 for x in numbers]
# Then, we can calculate the variance as the average of the squared differences:
variance = sum(squared_diffs) / len(squared_diffs)
# Finally, the standard deviation is the square root of the variance:
std_dev = variance ** 0.5
# We can also use the numpy library to calculate variance and standard deviation:
import numpy as np
variance_np = np.var(numbers)
std_dev_np = np.std(numbers)
# High values for variance and standard deviation indicate that the data is more spread out from the mean, and there is more variability in the data.
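To illustrate, the sketch below compares two made-up datasets with the same mean but different spread. Unlike the variance, the standard deviation is expressed in the same units as the data.

```python
import numpy as np

# Two made-up datasets with the same mean (6.0) but different spread.
narrow = np.array([5, 6, 6, 6, 7])
wide = np.array([2, 4, 6, 8, 10])

narrow_std = np.std(narrow)  # values sit close to the mean
wide_std = np.std(wide)      # values sit further from the mean

print(narrow.mean(), wide.mean())
print(narrow_std, wide_std)
```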
# Quartiles are values that divide a dataset into four equal parts. The first quartile (Q1) is the value that separates the lowest 25% of the data from the rest, the second quartile (Q2) is the median, and the third quartile (Q3) is the value that separates the highest 25% of the data from the rest.
# Let's say we have a list of numbers:
numbers = [2, 4, 6, 8, 10, 12, 14, 16, 18, 20]
# We can use the numpy library to calculate quartiles:
import numpy as np
q1 = np.percentile(numbers, 25)
q2 = np.percentile(numbers, 50)
q3 = np.percentile(numbers, 75)
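The three quartiles can also be computed in a single call with np.quantile(), which takes fractions between 0 and 1 instead of percentages. This sketch reuses the same ten numbers and relies on NumPy's default linear interpolation between data points.

```python
import numpy as np

numbers = [2, 4, 6, 8, 10, 12, 14, 16, 18, 20]

# One call returns Q1, Q2 (the median), and Q3.
quartiles = np.quantile(numbers, [0.25, 0.5, 0.75])
print(quartiles)  # [ 6.5 11.  15.5]
```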
# Quintiles are values that divide a dataset into five equal parts. The first quintile is the value that separates the lowest 20% of the data from the rest, the second quintile separates the lowest 40% of the data from the rest, and so on. The 100th percentile is simply the maximum value.
# We can use the numpy library to calculate quintiles.
# They are named quintile_* so they do not overwrite the quartiles q1, q2, q3 computed above:
quintile_1 = np.percentile(numbers, 20)
quintile_2 = np.percentile(numbers, 40)
quintile_3 = np.percentile(numbers, 60)
quintile_4 = np.percentile(numbers, 80)
quintile_5 = np.percentile(numbers, 100)
# Deciles are values that divide a dataset into ten equal parts. The first decile (D1) is the value that separates the lowest 10% of the data from the rest, the second decile (D2) is the value that separates the lowest 20% of the data from the rest, and so on. The tenth decile (D10) is the maximum value.
# We can use the numpy library to calculate deciles:
d1 = np.percentile(numbers, 10)
d2 = np.percentile(numbers, 20)
d3 = np.percentile(numbers, 30)
d4 = np.percentile(numbers, 40)
d5 = np.percentile(numbers, 50)
d6 = np.percentile(numbers, 60)
d7 = np.percentile(numbers, 70)
d8 = np.percentile(numbers, 80)
d9 = np.percentile(numbers, 90)
d10 = np.percentile(numbers, 100)
q1
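Quartiles, quintiles, and deciles follow the same pattern, so the repeated np.percentile() calls can be generalized. The helper below is a hypothetical convenience function, not part of NumPy. It uses np.linspace() to generate evenly spaced fractions and returns all cut points, including the minimum and maximum.

```python
import numpy as np

def quantile_cut_points(values, n_parts):
    """Return the n_parts + 1 boundaries that split values into n_parts groups.

    Hypothetical helper for illustration; the first boundary is the
    minimum and the last is the maximum.
    """
    fractions = np.linspace(0, 1, n_parts + 1)
    return np.quantile(values, fractions)

numbers = [2, 4, 6, 8, 10, 12, 14, 16, 18, 20]

quartile_points = quantile_cut_points(numbers, 4)   # 5 boundaries
quintile_points = quantile_cut_points(numbers, 5)   # 6 boundaries
decile_points = quantile_cut_points(numbers, 10)    # 11 boundaries

print(quartile_points)
print(decile_points)
```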