My Python workspace

Data Scientist Study Notes: Preparation for the Exam

Exploratory Analysis and Statistical Experimentation in Python

Metrics calculation, characteristics and relationships between features report

Measures of center

Measures of center identify a typical (central) value of the data. If the data is more or less evenly spread without many outliers (for example, normally distributed), the mean is a good choice. However, outliers can pull the mean away from the center of the data. For example, if we have 10 workers and 9 of them earn $500 while one earns $50,000, the mean is $5,450, which is definitely not the center of our data. In those cases it is better to use the median or the mode, which are less affected by outliers.

Mean - average value
Median - the middle value when data is in order
Mode - most common value

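import statistics

data = [220, 100, 190, 180, 250, 190, 240, 180, 140, 180, 190]

# Mean calculation: sum of the values divided by their count
mean = statistics.mean(data)

# Median calculation: middle value of the sorted data
median = statistics.median(data)

# Mode calculation: most common value
# Note: 180 and 190 both occur three times here; on Python 3.8+
# statistics.mode returns the first one encountered (190), while
# statistics.multimode(data) returns every mode.
mode = statistics.mode(data)

print("Mean =", mean)
print("Median =", median)
print("Mode =", mode)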
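
The salary example from the start of this section can be checked the same way. This is a small sketch using the same figures (nine workers earning 500 and one earning 50000); the variable names are illustrative and not part of the original notes.

```python
import statistics

# Salaries from the example: nine workers at 500, one at 50000
salaries = [500] * 9 + [50000]

salary_mean = statistics.mean(salaries)      # pulled up by the single outlier
salary_median = statistics.median(salaries)  # stays at the typical salary
salary_mode = statistics.mode(salaries)      # most common salary

print("Mean =", salary_mean)
print("Median =", salary_median)
print("Mode =", salary_mode)
```

The mean comes out at 5450, while the median and the mode both stay at 500, which is why they describe the center better here.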

Measures of spread

Measures of spread describe how similar or varied the observed values of a particular variable (data item) are. Measures of spread include the range, quartiles and the interquartile range, variance and standard deviation.

Range - difference between the smallest value and the largest value in a dataset (Max - Min)
Quartiles - values that divide an ordered dataset into four equal parts; Q1, Q2 (the median) and Q3 are the cut points at 25%, 50% and 75%. A dataset may also be divided into quintiles (five equal parts) or deciles (ten equal parts).
IQR - the difference between the upper (Q3, 75%) and lower (Q1, 25%) quartiles; it describes the middle 50% of values when ordered from lowest to highest. The IQR is often seen as a better measure of spread than the range because it is far less affected by outliers.
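
The three measures above can be sketched with the standard library, reusing the same data list as in the measures of center example. This assumes Python 3.8 or newer, where statistics.quantiles is available; its default "exclusive" method may give slightly different quartiles than other tools such as NumPy or pandas.

```python
import statistics

data = [220, 100, 190, 180, 250, 190, 240, 180, 140, 180, 190]

# Range: largest value minus smallest value
data_range = max(data) - min(data)

# Quartile cut points Q1, Q2 (the median) and Q3.
# n=4 splits the data into four groups, so three cut points are returned.
q1, q2, q3 = statistics.quantiles(data, n=4)

# Interquartile range: spread of the middle 50% of the values
iqr = q3 - q1

print("Range =", data_range)
print("Q1 =", q1, "Q2 =", q2, "Q3 =", q3)
print("IQR =", iqr)
```

For this data the range is 150 but the IQR is only 40: the range is driven by the extreme values 100 and 250, while the IQR ignores them.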

The variance and the standard deviation measure the spread of the data around the mean. They summarise how close each observed value is to the mean. In a dataset with a small spread, all values are close to the mean, so the variance and standard deviation are small. In a more dispersed dataset, values lie further from the mean, so the variance and standard deviation are larger. The smaller the variance and standard deviation, the better the mean represents the whole dataset. In the extreme case where all values are the same, the variance and standard deviation are zero. For normally distributed data, the standard deviation also lets us calculate confidence intervals: about 68% of the values lie within one standard deviation either side of the mean, and about 95% lie within two standard deviations.
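
The 68% and 95% figures can be checked numerically. This is a minimal sketch using statistics.NormalDist (Python 3.8+) on a standard normal distribution; the variable names are illustrative.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1
standard_normal = NormalDist(mu=0, sigma=1)

# Probability of a value falling within k standard deviations of the mean
within_one = standard_normal.cdf(1) - standard_normal.cdf(-1)
within_two = standard_normal.cdf(2) - standard_normal.cdf(-2)

print(f"Within 1 standard deviation: {within_one:.1%}")
print(f"Within 2 standard deviations: {within_two:.1%}")
```

The results are roughly 68.3% and 95.4%, matching the rule of thumb. The same proportions hold for any normal distribution, whatever its mean and standard deviation.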

The standard deviation is the square root of the variance. The standard deviation of a population is represented by σ, and the standard deviation of a sample is represented by s.
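
As a rough illustration, the standard library provides both versions. The population functions (pvariance, pstdev) divide by n, while the sample functions (variance, stdev) divide by n - 1. The sketch below reuses the data list from the earlier examples; which version is appropriate depends on whether the data is the whole population or only a sample of it.

```python
import statistics

data = [220, 100, 190, 180, 250, 190, 240, 180, 140, 180, 190]

# Population versions (divide by n): the data is the entire population
population_variance = statistics.pvariance(data)
population_std = statistics.pstdev(data)   # sigma

# Sample versions (divide by n - 1): the data is a sample of a larger population
sample_variance = statistics.variance(data)
sample_std = statistics.stdev(data)        # s

print("Population variance =", population_variance)
print("Population standard deviation =", population_std)
print("Sample variance =", sample_variance)
print("Sample standard deviation =", sample_std)

# When all values are the same there is no spread at all
constant_std = statistics.pstdev([5, 5, 5, 5])
print("Standard deviation of identical values =", constant_std)
```

The sample values are slightly larger than the population values because of the smaller divisor.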

Standard deviation
Variance