Data Scientist Study Notes: Preparation for the Exam
Exploratory Analysis and Statistical Experimentation in Python
Metrics calculation, characteristics and relationships between features report
Measures of center
Measures of center identify the typical value of a dataset. If the data is spread fairly evenly around its center with few outliers (for example, approximately normally distributed), the mean is a good choice. Outliers, however, pull the mean away from the bulk of the data. For example, if 9 of 10 workers earn $500 and one earns $50,000, the mean is $5,450, which is clearly not representative of the group. In such cases it is better to use the median or the mode, which are less affected by outliers.
- Mean - average value
- Median - the middle value when data is in order
- Mode - most common value
import statistics
data = [220, 100, 190, 180, 250, 190, 240, 180, 140, 180, 190]
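# Mean calculation
mean = statistics.mean(data)
# Median calculation
median = statistics.median(data)
# Mode calculation (180 and 190 each occur three times; in Python 3.8+
# statistics.mode returns the first one encountered, older versions raise StatisticsError)
mode = statistics.mode(data)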
print("Mean = ", mean)
print("Median = ", median)
print("Mode = ", mode)
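The salary example from the start of this section can be checked the same way. This is a minimal sketch: the `salaries` list is a hypothetical payroll built from the figures in the text (nine workers at $500, one at $50,000), and the variable names are illustrative only.

```python
import statistics

# Hypothetical payroll from the example: nine workers earn $500, one earns $50,000
salaries = [500] * 9 + [50000]

mean_salary = statistics.mean(salaries)      # pulled upward by the single outlier
median_salary = statistics.median(salaries)  # middle of the ordered values
mode_salary = statistics.mode(salaries)      # most frequent value

print("Mean =", mean_salary)
print("Median =", median_salary)
print("Mode =", mode_salary)
```

The mean lands far above what nine of the ten workers earn, while the median and mode both stay at the typical salary.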
Measures of spread
Measures of spread describe how similar or varied the observed values of a particular variable (data item) are. Measures of spread include the range, quartiles and the interquartile range, variance and standard deviation.
- Range - difference between the smallest value and the largest value in a dataset (Max - Min)
- Quartiles - the three cut points (Q1 = 25%, Q2 = 50%, Q3 = 75%) that divide an ordered dataset into four equal parts; Q2 is the median. A dataset may also be divided into quintiles (five equal parts) or deciles (ten equal parts).
- IQR - the difference between the upper (Q3, 75%) and lower (Q1, 25%) quartiles; it describes the spread of the middle 50% of values when ordered from lowest to highest. The IQR is often seen as a better measure of spread than the range because it is far less affected by outliers.
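To illustrate how the range and the IQR react to an outlier, here is a small sketch using the standard-library `statistics.quantiles` function (Python 3.8+, default "exclusive" method). The helper names `value_range` and `iqr` and the extra value 5000 are assumptions made for illustration, and other quartile conventions would give slightly different numbers.

```python
import statistics


def value_range(values):
    """Difference between the largest and smallest value."""
    return max(values) - min(values)


def iqr(values):
    """Interquartile range: Q3 - Q1, using statistics.quantiles defaults."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    return q3 - q1


data = [220, 100, 190, 180, 250, 190, 240, 180, 140, 180, 190]
with_outlier = data + [5000]  # hypothetical extreme value added for comparison

print("Range:", value_range(data), "->", value_range(with_outlier))
print("IQR:  ", iqr(data), "->", iqr(with_outlier))
```

The range jumps by the full size of the outlier, whereas the IQR shifts only slightly because it depends on the middle of the ordered data.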
The variance and the standard deviation are measures of the spread of the data around the mean. They summarise how close each observed data value is to the mean value. In datasets with a small spread all values are very close to the mean, resulting in a small variance and standard deviation. Where a dataset is more dispersed, values are spread further away from the mean, leading to a larger variance and standard deviation. The smaller the variance and standard deviation, the more the mean value is indicative of the whole dataset. In the extreme case where all values of a dataset are the same, the standard deviation and variance are zero.
For normally distributed data, the standard deviation can be used to construct confidence intervals. In a normal distribution, about 68% of the values lie within one standard deviation either side of the mean and about 95% of the values lie within two standard deviations of the mean.
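The 68% and 95% figures can be checked empirically. The sketch below simulates normally distributed values with `random.gauss`; the mean of 100, standard deviation of 15, sample size and seed are arbitrary assumptions, and `share_within` is a hypothetical helper.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

# Simulated, normally distributed values (mean 100 and SD 15 are arbitrary choices)
values = [random.gauss(100, 15) for _ in range(20000)]

mean = statistics.fmean(values)
sd = statistics.pstdev(values)


def share_within(k):
    """Fraction of values within k standard deviations of the mean."""
    inside = sum(1 for v in values if abs(v - mean) <= k * sd)
    return inside / len(values)


print("Within 1 SD:", round(share_within(1), 3))
print("Within 2 SD:", round(share_within(2), 3))
```

With a sample this large the two shares should come out close to 0.68 and 0.95, though the exact digits depend on the random draw.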
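The standard deviation is the square root of the variance. The standard deviation for a population is represented by σ, and the standard deviation for a sample is represented by s.
- Standard deviation - the square root of the variance, expressed in the same units as the data
- Variance - the average of the squared deviations from the mean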
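The difference between the population and sample versions can be sketched by hand. The example assumes the usual convention that the population variance divides the sum of squared deviations by n while the sample variance divides by n - 1; the variable names are illustrative only.

```python
import math
import statistics

data = [220, 100, 190, 180, 250, 190, 240, 180, 140, 180, 190]
n = len(data)
mean = statistics.mean(data)

# Sum of squared deviations from the mean
squared_deviations = sum((x - mean) ** 2 for x in data)

population_variance = squared_deviations / n     # sigma squared: divide by n
sample_variance = squared_deviations / (n - 1)   # s squared: divide by n - 1

population_std = math.sqrt(population_variance)  # sigma
sample_std = math.sqrt(sample_variance)          # s

print("Population variance / std:", population_variance, population_std)
print("Sample variance / std:    ", sample_variance, sample_std)
```

The standard library offers the same quantities directly: `statistics.pvariance` and `statistics.pstdev` for a population, `statistics.variance` and `statistics.stdev` for a sample.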
import statistics
data = [220, 100, 190, 180, 250, 190, 240, 180, 140, 180, 190]
# Range calculation
range0 = max(data) - min(data)
# Quartiles calculation (statistics.quantiles requires Python 3.8+)
q1, q2, q3 = statistics.quantiles(data, n=4)
# Interquartile range calculation
iqr = q3 - q1
# Sample variance calculation (divides by n - 1)
variance = statistics.variance(data)
# Sample standard deviation calculation (square root of the variance)
std = statistics.stdev(data)
print("Range = ", range0)
print("Quartiles = ", (q1, q2, q3))
print("IQR = ", iqr)
print("Variance = ", variance)
print("STD = ", std)