So what is statistics? The practice and study of collecting and analyzing data. Statistics has two main branches. Descriptive or summary statistics are used to describe or summarize our data, while inferential statistics involve using samples to draw conclusions about the population they represent.
Statistics requires asking specific, measurable questions.
Types of data:
Numeric / quantitative - 2 sub-groups:
- Continuous - measured on a continuous scale (e.g. stock prices)
- Interval or count data - measured in whole numbers (e.g. cups of coffee per day)
Categorical / qualitative - 2 sub-groups:
- Nominal - unordered categories (e.g. eye color)
- Ordinal - categories are ordered (e.g. survey answers ranging from strongly disagree to strongly agree)
Summary statistics
Center measures - Mean (average), Median (the middle value of the sorted data; with an even number of values, add the two values closest to the 50% mark and divide by 2), Mode (the most frequent value, found by counting the occurrences of each value).
Mean, Median - numeric data. Mode - categorical data.
Histograms are great for summarizing numeric data.
Outlier - When one value is substantially different from the others, we call it an outlier. An outlier pulls the mean towards it, while the median is less affected. This is because the mean calculation adds up all values, so extreme values affect the result, whereas the median just looks at the middle value. Therefore, when data is not symmetrical, it is best to use the median to describe the data's typical value.
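A minimal sketch of the three center measures and of how an outlier moves the mean more than the median. The sleep-hours numbers are made up for illustration, and Python's built-in statistics module is just one convenient way to compute them.

```python
import statistics

# Hypothetical sample: hours of sleep for seven people (no outlier)
sleep = [6, 7, 7, 7, 8, 8, 9]

mean_clean = statistics.mean(sleep)      # sum of values / number of values
median_clean = statistics.median(sleep)  # middle value of the sorted data
mode_clean = statistics.mode(sleep)      # most frequent value

# Add one value that is far from the rest (an outlier)
with_outlier = sleep + [30]

mean_out = statistics.mean(with_outlier)      # pulled up towards 30
median_out = statistics.median(with_outlier)  # average of the two middle values

print(f"no outlier:   mean={mean_clean:.2f}, median={median_clean}, mode={mode_clean}")
print(f"with outlier: mean={mean_out:.2f}, median={median_out}")
```

Here the mean moves from about 7.43 to 10.25, while the median only moves from 7 to 7.5.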
Measures of spread - how far apart the data points are (narrow vs. wide histogram)
1. Range - max minus min
2. Variance - the average squared distance of each data point from the mean
The standard deviation is the square root of the variance.
3. Quartiles - the values that split the sorted data into four equal parts
4. Interquartile range (IQR) - the distance between the first and third quartiles
Note - a wider spread of values along the x-axis means a larger standard deviation
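The spread measures listed above can be sketched with the standard library. The data is made up, the population formulas (pvariance, pstdev) are an assumption on my part, and quartile values depend on the interpolation method chosen, so other tools may report slightly different quartiles.

```python
import statistics

# Hypothetical data set
data = [2, 4, 4, 4, 5, 5, 7, 9]

data_range = max(data) - min(data)     # range = max - min

# Population variance: average squared distance from the mean
variance = statistics.pvariance(data)

# Standard deviation: square root of the variance
std_dev = statistics.pstdev(data)

# Quartiles: three cut points that split the sorted data into four parts.
# method="inclusive" is one of several conventions (an assumption here).
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")

iqr = q3 - q1                          # interquartile range

print(data_range, variance, std_dev, (q1, q2, q3), iqr)
```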
Probabilities
Between 0-100%
Unconditional probability
Independent events or sampling with replacement - the chosen value is returned to the 'box' of possibilities and can be chosen again (same odds)
So the two events are independent.
Conditional probability
Dependent events or sampling without replacement - the value we pulled out is not put back.
These are two events where the probability of the second event is affected by the outcome of the first event.
Context or subject-matter expertise is critical for evaluating the probability of two dependent events.
A Venn diagram can be used to describe two dependent events.
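A small worked example of sampling with and without replacement. The box of 3 red and 2 blue marbles is hypothetical; exact fractions are used so the odds are easy to read.

```python
from fractions import Fraction

# Hypothetical box: 3 red marbles and 2 blue marbles
red, blue = 3, 2
total = red + blue

# Probability that the first draw is red
p_first_red = Fraction(red, total)                    # 3/5

# With replacement: the marble goes back, so the second draw has the same odds
p_two_red_with = p_first_red * Fraction(red, total)   # 3/5 * 3/5 = 9/25

# Without replacement: one red marble is gone, so the odds change.
# This is the conditional probability P(second red | first red).
p_second_red_given_first_red = Fraction(red - 1, total - 1)      # 2/4 = 1/2
p_two_red_without = p_first_red * p_second_red_given_first_red   # 3/10

print(p_two_red_with, p_two_red_without)
```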
Probability Distributions
Mean (expected value) of a probability distribution - each value of x is multiplied by its corresponding probability, and the products are added.
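As an illustration of that calculation, here is the mean of a fair six-sided die (the die is my own example, not from the notes):

```python
from fractions import Fraction

# Probability distribution of a fair die: each face has probability 1/6
die = {face: Fraction(1, 6) for face in range(1, 7)}

# Multiply each value by its probability and add up the products
expected_value = sum(x * p for x, p in die.items())

print(expected_value)  # 7/2, i.e. 3.5
```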
Probability distributions help us quantify risk and are the basis of hypothesis testing.
A discrete distribution is a probability distribution over individually countable outcomes or interval (count) data, such as 1, 2, 3, yes, no, true, or false.
Continuous data distribution
A continuous uniform distribution - every value in the range is equally likely
Bimodal distribution - two values occur most frequently, so the graph has two peaks
Normal distribution - bell shape curve. Very common.
Binomial distribution - probability distribution of the number of successes in a sequence of independent events (independence is required), each producing a binary outcome. It is a discrete distribution since we are working with countable outcomes. For example - the number of heads in a sequence of coin flips.
Discrete distribution - a random variable is discrete if it has a finite number of possible outcomes, or a countable number (e.g. the integers are infinite, but can be counted).
Parameters: n = total number of events (trials) performed, p = probability of success on each trial (between 0 and 1). The probabilities of all possible outcomes add up to 1.
Adding up the bar heights (y-axis values) over a range of outcomes gives the probability of that range of events.
Largest probability vs. probability of success
Largest probability - look at the graph and find the value with the largest probability (y-axis). The probability of success is the parameter p, the chance of success on a single trial.
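A rough sketch of the binomial ideas above, assuming n = 10 flips of a fair coin (p = 0.5). These numbers are my own choice. binomial_pmf is a helper written for this note using the standard binomial formula, not a library function.

```python
from math import comb

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

n, p = 10, 0.5  # assumed: 10 coin flips, probability of heads 0.5

# Height of each bar in the distribution plot, for k = 0..n heads
pmf = [binomial_pmf(k, n, p) for k in range(n + 1)]

p_exactly_7 = pmf[7]         # probability of exactly 7 heads
p_at_most_7 = sum(pmf[:8])   # add the bars for 0..7 heads

# The outcome with the largest probability (tallest bar)
most_likely = max(range(n + 1), key=lambda k: pmf[k])

print(p_exactly_7, p_at_most_7, most_likely)
```

Note the difference: the most likely number of heads is 5 (the tallest bar), while the probability of success p stays 0.5 for every single flip.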
The normal distribution (bell curve)
Properties:
- Symmetrical
- Area under the curve = 1 (the total probability, just as the binomial probabilities sum to 1)
- Probability never hits 0! (even if it looks like the tail ends)
- Described by its mean and standard deviation
- Uses numerical values
- Lots of real-world data resembles a normal distribution
- When testing a hypothesis, a normal distribution is required for many statistical tests, such as comparing the mean of a sample to the population it represents
"The 68%, 95%, 99.7% rule":
- 68% of values fall within 1 standard deviation of the mean
- 95% of values fall within 2 standard deviations of the mean
- 99.7% of values fall within 3 standard deviations of the mean
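The 68%, 95%, 99.7% rule can be checked numerically. This sketch assumes a normal distribution with mean 170 and standard deviation 10 (made-up values, e.g. heights in cm); the percentages are the same for any mean and standard deviation.

```python
from statistics import NormalDist

mu, sigma = 170, 10           # assumed mean and standard deviation
dist = NormalDist(mu=mu, sigma=sigma)

def within(k):
    """Share of values within k standard deviations of the mean."""
    return dist.cdf(mu + k * sigma) - dist.cdf(mu - k * sigma)

for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {within(k):.4f}")
```

The three results come out at roughly 0.68, 0.95 and 0.997.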
Data distribution
- Skewness - "positively skewed" or "right-skewed" (long tail on the right), "negatively skewed" or "left-skewed" (long tail on the left). Common in real-world data.
- Kurtosis - occurrence of extreme values in a distribution - 3 types:
- Positive kurtosis (red), leptokurtic - a large peak around the mean and a smaller standard deviation
- Normal distribution (blue), mesokurtic
- Negative kurtosis (green), platykurtic - a distribution with a lower peak and a larger standard deviation
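One common way to put numbers on skewness and kurtosis is through standardized moments. The sketch below assumes the simple population-moment formulas and reports kurtosis relative to the normal distribution (so 0 = mesokurtic). Other estimators exist and give slightly different values. Both helper functions and all data sets are my own illustrations.

```python
import statistics

def skewness(data):
    """Average cubed z-score: > 0 right-skewed, < 0 left-skewed."""
    m = statistics.fmean(data)
    s = statistics.pstdev(data)
    return sum(((x - m) / s) ** 3 for x in data) / len(data)

def excess_kurtosis(data):
    """Average z-score to the 4th power, minus 3 (the normal distribution's value)."""
    m = statistics.fmean(data)
    s = statistics.pstdev(data)
    return sum(((x - m) / s) ** 4 for x in data) / len(data) - 3

symmetric = [1, 2, 2, 3, 3, 3, 4, 4, 5]   # mirror image around 3
right_skewed = [1, 1, 1, 2, 2, 3, 10]     # long tail on the right
flat = [1, 2, 3, 4, 5]                    # no peak, no extreme values

print(skewness(symmetric))       # approximately 0
print(skewness(right_skewed))    # positive
print(excess_kurtosis(flat))     # negative (platykurtic)
```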
The Central Limit Theorem (CLT) - the sampling distribution of a mean becomes closer to the normal distribution as the size of the sample increases.
- The CLT only applies when samples are taken randomly and are independent.
- The CLT is relevant whether the underlying data follows a discrete uniform, continuous uniform, or binomial distribution.
- As a rule of thumb, a sample size of at least 30 is required for the CLT to apply.
- The CLT can also apply to the sampling distribution of the standard deviation.
- The CLT applies to proportions. If we take 1000 samples of die rolls and plot the proportion of fours rolled in each sample, the distribution resembles a normal distribution centered around 0.17, since there is a 1/6 chance of rolling a four. The sample proportions settling close to 1/6 is the law of large numbers at work.
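A quick simulation of the die example. The notes do not say how many rolls go into each sample, so 30 rolls per sample is an assumption, as is the fixed random seed (used only to make the run repeatable).

```python
import random
import statistics

random.seed(42)  # assumed seed, only for repeatability

def proportion_of_fours(n_rolls):
    """Roll a fair die n_rolls times and return the share of fours."""
    rolls = [random.randint(1, 6) for _ in range(n_rolls)]
    return rolls.count(4) / n_rolls

# 1000 samples, each of 30 rolls (sample size is an assumption)
sample_proportions = [proportion_of_fours(30) for _ in range(1000)]

center = statistics.fmean(sample_proportions)  # should land near 1/6
spread = statistics.stdev(sample_proportions)  # width of the sampling distribution

print(f"center = {center:.3f}, spread = {spread:.3f}")
```

Plotting sample_proportions as a histogram should give a roughly bell-shaped curve around 1/6.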
CLT vs. the law of large numbers:
- CLT - as the number of sample summary statistics calculated increases, their distribution more closely resembles a normal distribution. The law of large numbers makes no such claim about the shape of a distribution.
- The CLT generally applies if the sample size is 30 or more. In the law of large numbers, as the sample size increases, the sample mean gets closer to the value of the population mean.
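To see the law of large numbers on its own, compare the mean of a small sample with the mean of a very large one. The sample sizes and the seed below are arbitrary choices for illustration. The population mean of a fair die is 3.5.

```python
import random

random.seed(0)  # assumed seed, only for repeatability

def mean_of_rolls(n_rolls):
    """Mean of n_rolls fair die rolls."""
    return sum(random.randint(1, 6) for _ in range(n_rolls)) / n_rolls

population_mean = 3.5

small_sample_mean = mean_of_rolls(10)        # can be far from 3.5
large_sample_mean = mean_of_rolls(100_000)   # should be very close to 3.5

print(small_sample_mean, large_sample_mean)
```

This says nothing about the shape of any distribution, only that a single sample mean closes in on the population mean as the sample grows.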
The CLT also comes in handy when we have a huge population and don't have the time or resources to collect data on everyone. Instead, we can collect smaller samples and create a sampling distribution to estimate summary statistics.
Example: It would be impossible to find out the Type 2 Diabetes status of every adult in the USA, so an appropriate approach is to take lots of small samples across several locations and use the sampling distribution to calculate the percentage diagnosed.
The Poisson Distribution
Common in our day-to-day lives. It gives the probability of some number of events occurring over a fixed period of time, where the time between events is random. For example, the number of animals adopted from an animal shelter each week is a Poisson process - we may know that on average there are eight adoptions per week, but the time between adoptions can differ randomly. We can calculate the probability of at least five animals being adopted per week. The Poisson distribution is described by a value called lambda, which represents the average number of events per time period. Lambda = average number of animals adopted per week = the expected value of the distribution. The distribution's peak is at or near lambda.
The CLT still applies.
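The animal shelter example can be worked through with the standard Poisson formula, using lambda = 8 adoptions per week from the notes. poisson_pmf is a helper written for this note, not a library function.

```python
from math import exp, factorial

def poisson_pmf(k, lam):
    """Probability of exactly k events when the average is lam per period."""
    return lam**k * exp(-lam) / factorial(k)

lam = 8  # average adoptions per week

p_exactly_5 = poisson_pmf(5, lam)

# "At least 5" = 1 minus the probability of 0, 1, 2, 3 or 4 adoptions
p_at_least_5 = 1 - sum(poisson_pmf(k, lam) for k in range(5))

print(f"P(exactly 5)  = {p_exactly_5:.4f}")
print(f"P(at least 5) = {p_at_least_5:.4f}")
```

With lambda = 8, the chance of at least five adoptions in a week comes out at about 0.90.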
Normal vs. Binomial vs. Poisson Distributions
Normal distribution
Properties of a normal distribution:
- Continuous distribution!
- The mean, mode and median are all equal.
- The curve is symmetric around the center (i.e. around the mean, μ).
- Exactly half of the values are to the left of center and exactly half of the values are to the right.
- The total area under the curve is 1.
Properties of a binomial distribution:
- Discrete variables!
- Each trial has a binary outcome (one of the two outcomes is labeled a ‘success’).
- The probability of success is known and constant over all trials.
- The number of trials is specified.
- The trials are independent. That is, the outcome of one trial doesn’t affect the outcome of successive trials.
Properties of a Poisson distribution:
- Discrete variables!
- Used to determine the probability of the number of events occurring over a specified time or space.
Examples of events over space or time:
- number of cells in a specified volume of fluid
- number of calls/hour to a help line
- number of emergency room beds filled / 24 hours
Like the binomial distribution and the normal distribution, there are many Poisson distributions.
Each Poisson distribution is specified by the average rate at which the event occurs. The rate is denoted λ (‘lambda’, the Greek letter ‘L’). It is the only parameter of the Poisson distribution.