
Probability and distributions

What are the chances?

Measuring chance

  • Probability - the number of ways an event can happen divided by the total number of possible outcomes.
  • Probability is always between 0% and 100%. If the probability of an event is 0%, it's impossible; if it's 100%, it will certainly happen.

Sampling with replacement

  • The sample is placed back into the selection pool and can be chosen again.
  • e.g. with four salespeople's names in a box, if Brian's name is selected for a meeting in the morning and then placed back, the probability that Brian is picked for a meeting in the afternoon remains 25% (1 out of 4).

Independent probability

  • Independent probability - probability of an event does not change based on the outcome of a previous event.

Online retail sales

  • Product Type - category of the product sold in that order.
  • Net Quantity - number of products sold in the order.
  • Gross Sales - number of dollars generated for the order.
  • Discounts - dollar value deducted from the sale.
  • Returns - number of dollars given back to the customer due to returned items.
  • Net Sales - total amount of dollars generated by the order after factoring in discounts and returns.

Probability of an order for a jewelry product

  • To find the probability of the next order being for a jewelry product we divide the number of orders for jewelry products by the total number of orders. There were 1767 orders, of which 210 were for jewelry products, so we divide 210 by 1767. The probability of the next order being for a jewelry product is just under 12%.
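
A quick check of this arithmetic in Python, using the order counts from the example above:

```python
# Probability = ways the event can happen / total possible outcomes
jewelry_orders = 210
total_orders = 1767

p_jewelry = jewelry_orders / total_orders
print(round(p_jewelry, 4))  # 0.1188, i.e. just under 12%
```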

Conditional probability

Multiple meetings

  • Brian has been selected to attend a potential client meeting, so his name is no longer in the box. However, we now have another potential client who wants to meet at the same time, so we need to pick another salesperson. Brian is unavailable, so we'll pick between the remaining three. This is called sampling without replacement since we aren't replacing the name we already pulled out. This time, Claire is picked. The probability of this is one out of three or about 33%.

Dependent events

  • Outcome of the first event changes the probability of the second.
  • In general, when sampling without replacement, each pick is dependent on the outcomes of earlier picks.

Conditional probability

  • Method used to calculate the chances of dependent events because the probability of one event is conditional on the outcome of another.
  • Context or subject-matter expertise is critical when using conditional probability.

Venn diagrams

  • Technique used to display the possible outcomes of multiple events, and the overlap where both events can occur.
  • Where the two events are dependent, a Venn diagram's overlap will change based on the results of the first event.

Conditional probability formula

  • The probability of event A, given event B, is equal to the probability of both events occurring together, divided by the probability of event B: P(A | B) = P(A and B) / P(B).
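  • e.g. using the salesperson example above: P(Brian picked first) = 1/4, and P(Brian first and Claire second) = 1/4 × 1/3 = 1/12, so P(Claire second | Brian first) = (1/12) / (1/4) = 1/3, matching the 33% found by direct counting.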

Discrete distributions

Probability distribution

  • Consider rolling a standard, six-sided die. There are six possible outcomes and each has a one-sixth chance of occurring. This is an example of a probability distribution.
  • Used in hypothesis testing to understand whether results may have occurred by chance
  • Probability distribution - describes the probability of each possible outcome in a scenario.
  • Mean - Expected value of a distribution. Calculated by multiplying each value by its probability and adding everything together.

Visualizing a probability distribution

  • We can visualize a probability distribution using a histogram, where each bar represents an outcome, and each bar's height represents the probability of that outcome.

Probability = area

  • We can calculate probabilities of different outcomes by taking areas of the probability distribution.
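
A minimal sketch of these ideas in Python, using the die example from above:

```python
# Fair six-sided die: each of the six outcomes has probability 1/6
outcomes = [1, 2, 3, 4, 5, 6]
probs = [1 / 6] * 6

# Mean (expected value): multiply each value by its probability and add everything together
mean = sum(v * p for v, p in zip(outcomes, probs))
print(mean)  # 3.5

# Probability = area: P(roll of 5 or higher) is the combined area of the bars for 5 and 6
p_at_least_5 = sum(p for v, p in zip(outcomes, probs) if v >= 5)
print(round(p_at_least_5, 3))  # 0.333
```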

Discrete probability distributions

  • Involve discrete, countable outcomes, such as count data
  • Discrete uniform distribution - when all outcomes have the same probability

Law of large numbers

  • Law of large numbers - if we increase the size of the sample then its mean will approach the theoretical mean.
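
A small simulation of the law of large numbers, assuming NumPy is available (the die's theoretical mean is 3.5):

```python
import numpy as np

rng = np.random.default_rng(42)

# As the sample size grows, the sample mean approaches the theoretical mean of 3.5
for n in [10, 1_000, 100_000]:
    rolls = rng.integers(1, 7, size=n)  # n rolls of a fair six-sided die
    print(n, rolls.mean())
```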

Continuous distributions

  • Involve continuous data
  • Continuous uniform distribution - when all outcomes have the same probability

Probability = area

  • Just like with discrete distributions, we can take the area to calculate probability.
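
A sketch with SciPy, assuming a hypothetical wait time that is equally likely to be anywhere from 0 to 12 minutes:

```python
from scipy.stats import uniform

# uniform takes loc (lower bound) and scale (width of the interval)
p_wait_7_or_less = uniform.cdf(7, loc=0, scale=12)   # area from 0 up to 7
p_wait_over_7 = 1 - uniform.cdf(7, loc=0, scale=12)  # the remaining area

print(round(p_wait_7_or_less, 3))  # 0.583
print(round(p_wait_over_7, 3))     # 0.417
```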

Bimodal distribution

  • Distribution in which two values occur most frequently, i.e. it has two modes.
  • e.g. book prices, with different typical values occurring depending on whether the book is a paperback or a hardback.

The normal distribution

  • Has a peak in the middle, with symmetrical slopes as values move away from the center in either direction.
  • Often described as a bell-shaped curve.
  • e.g. blood pressure, retirement age

Total area = 1

  • Regardless of the shape of the distribution, the area beneath it must always equal one, as the area covers 100% of possible outcomes.

More Distributions and the Central Limit Theorem

The binomial distribution

Binary outcomes

  • Two possible values can occur.
  • Outcomes are represented as a one or a zero, a success or a failure, or a win or a loss.

Binomial distribution

  • Describes the probability of the number of successes in a sequence of independent events. For example, it can tell us the probability of getting some number of heads in a sequence of coin flips.
  • The binomial distribution can be described using two parameters, n and p. n represents the total number of events being performed, and p is the probability of success, in this case, heads.

Probability of 7 or fewer heads

  • As with other distributions, we can calculate the probability of outcomes by adding together the area.

Probability of 8 or more heads

  • Likewise, to calculate the probability of eight or more heads, we can subtract the probability of seven or fewer heads from the total probability, or one.
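
A sketch of both calculations with SciPy, assuming 10 flips of a fair coin (n = 10, p = 0.5):

```python
from scipy.stats import binom

n, p = 10, 0.5  # 10 independent flips of a fair coin

# P(7 or fewer heads): the cumulative area up to and including 7
print(round(binom.cdf(7, n, p), 3))      # 0.945

# P(8 or more heads): total probability (1) minus P(7 or fewer heads)
print(round(1 - binom.cdf(7, n, p), 3))  # 0.055
```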

Expected value

Expected value = n × p

  • The expected value of the binomial distribution can be calculated by multiplying n by p.
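  • e.g. for the coin-flip example, n = 10 and p = 0.5, so the expected number of heads is 10 × 0.5 = 5.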

Independence

  • In order for the binomial distribution to apply, each event must be independent, so the outcome of one event shouldn't have an effect on the next.
  • But if we're sampling without replacement, the probabilities for the second event are different due to the outcome of the first event. Since these events aren't independent, we can't calculate accurate probabilities for this situation using the binomial distribution.

General applications

  • Clinical trials measuring effectiveness of a drug, where the outcome is whether the drug worked or not
  • Betting on the result of a sports match, where the bettor can either win or lose

The normal distribution

  • A continuous probability distribution.
  • It's one of the most important probability distributions we'll learn about since several statistical methods rely on it, and it applies to more real-world situations than the distributions we've covered so far.
  • Its shape is commonly referred to as a bell curve.
  • Properties:
    1. It's symmetrical, so the left side is a mirror image of the right.
    2. Just like any probability distribution, the area beneath the curve equals one.
    3. The probability never hits zero, even if it looks like it at the tail ends.
  • The normal distribution is described by its mean and standard deviation.
  • For the normal distribution, 68% of the area is within one standard deviation of the mean, 95% falls within two standard deviations, and 99.7% falls within three standard deviations. This is sometimes called the 68-95-99.7 rule.
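
The 68-95-99.7 rule can be checked with SciPy; here is a sketch assuming a standard normal distribution (mean 0, standard deviation 1):

```python
from scipy.stats import norm

mean, sd = 0, 1  # standard normal distribution

# Area within 1, 2, and 3 standard deviations of the mean
for k in [1, 2, 3]:
    area = norm.cdf(mean + k * sd, mean, sd) - norm.cdf(mean - k * sd, mean, sd)
    print(f"within {k} sd: {area:.3f}")  # ~0.683, ~0.954, ~0.997
```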

Why is the normal distribution important?

  • Lots of real-world data closely resembles the normal distribution.
  • Many statistical tests used in hypothesis testing, such as comparing the mean of a sample to the population it represents, require the data to follow a normal distribution.

Skewness

  • Describes the direction that the data tails off.
  • Positively skewed / right-skewed - the tail is on the right, where the larger positive values are.
  • Negatively skewed / left-skewed - peaks on the right and tails off to the left.

Kurtosis

  • Way of describing the occurrence of extreme values in a distribution.
  • Positive kurtosis / leptokurtic - characterized by a large peak around the mean and smaller standard deviation
  • Mesokurtic distribution - term used to describe the normal distribution
  • Negative kurtosis / platykurtic - distribution with a lower peak and larger standard deviation

The central limit theorem

  • A distribution of a summary statistic, such as the mean, is called a sampling distribution; the distribution of many sample means is the sampling distribution of the sample mean.
  • With a large number of samples (e.g. one million sample means), the shape settles into a consistent bell curve. This phenomenon is known as the central limit theorem (CLT), which states that a sampling distribution will approach a normal distribution as the size of the samples increases.
  • It's important to note that the central limit theorem only applies when samples are taken randomly and are independent, for example, randomly picking sales deals with replacement.
  • Generally, a sample size of at least 30 is required for the central limit theorem to apply.
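
A minimal simulation of the CLT, sampling die rolls with replacement (NumPy assumed):

```python
import numpy as np

rng = np.random.default_rng(0)

# Take many independent samples of 30 die rolls each and record each sample's mean
sample_means = [rng.integers(1, 7, size=30).mean() for _ in range(10_000)]

# The sampling distribution of the means is approximately normal,
# centered near the die's theoretical mean of 3.5
print(round(np.mean(sample_means), 2))  # ~3.5
```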

Benefits of the central limit theorem

  • The central limit theorem also comes in handy when we have a huge population and don't have the time or resources to collect data on everyone. Instead, we can collect smaller samples and create a sampling distribution to estimate summary statistics.

The Poisson distribution

Poisson processes

  • A process where the average number of events in a given time period is known, but the time or space between events is random.
  • For example, the number of animals adopted from an animal shelter each week is a Poisson process - we may know that on average there are eight adoptions per week, but the time between adoptions can differ randomly. Other examples would be the number of people arriving at a restaurant each hour, or the number of visits to a company's website in a day.

Poisson distribution

  • Describes the probability of some number of events happening over a fixed period of time.
  • We can use the Poisson distribution to calculate the probability of at least five animals getting adopted in a week, the probability of 12 people arriving at a restaurant in an hour, or the probability of fewer than 200 visits to a company's website in a day.

Lambda (λ)

  • The Poisson distribution is described by a value called lambda, which represents the average number of events per time period.
  • Lambda changes the shape of the distribution.
  • No matter what, the distribution's peak is always at (or immediately below) its lambda value.
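
Using the animal shelter example (lambda = 8 adoptions per week), a sketch with SciPy:

```python
from scipy.stats import poisson

lam = 8  # average number of adoptions per week

# P(exactly 5 adoptions in a week)
print(round(poisson.pmf(5, lam), 3))      # 0.092

# P(at least 5 adoptions) = 1 - P(4 or fewer)
print(round(1 - poisson.cdf(4, lam), 3))  # 0.900
```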

Central limit theorem still applies!

  • Just like other distributions, if we have a large number of samples and calculate the mean for each, then the distribution of sample means of a Poisson distribution looks like the normal distribution!

Correlation and Hypothesis Testing

Hypothesis testing

  • Hypothesis testing is a group of theories, methods, and techniques used to compare populations.
  • Routinely used in many industries:
    • Can a change in price lead to increased revenue?
    • Will changing a website address result in increased traffic?
    • Is a medication effective in treatment of a health condition?

History

  • Well-established discipline
  • Early origins can be traced to the 18th century when the analysis of birth records showed that each birth has a slightly larger probability of being male than female

Null hypothesis

  • In hypothesis testing, we always start with an assumption that no difference exists between the populations. We do this to reduce the risk of introducing any bias into our testing. This is called the null hypothesis.
  • We can expand the birth-ratio example to look at vitamin C supplements. Our null hypothesis could be that there is no difference in the male to female birth ratio between women who do and do not take vitamin C supplements.
  • We then create an alternative hypothesis, which typically takes one of two forms:
    • There is a difference in male and female births among women taking vitamin C supplements versus those who do not.
    • The direction of the difference is stated, for example, that the population taking vitamin C supplements has more female births than the population not taking them.

Hypothesis testing workflow

  1. First, we decide on populations we want to analyze the difference between, in this case adult women using or not using vitamin C supplements.
  2. Then, we develop null and alternative hypotheses, that births are equally likely to be male or female in both populations, or that babies are more likely to be female in women taking vitamin C supplements.
  3. Now we collect our sample data. Specifically, we collect gender status of babies born in both populations.
  4. We then perform statistical tests on the sample data.
  5. Finally, we use the results to draw conclusions about the population that the sample represents.

How much data do we need?

  • Apply central limit theorem
  • Look at peer-reviewed research on similar hypothesis tests to find out how large the samples were. This can then serve as a benchmark.

Independent and dependent variables

  • A note on terminology: in hypothesis testing, we define the data in terms of the difference we expect to observe in the alternative hypothesis.
  • Independent variable - data we expect will not be affected by other data. For the vitamin C and birth ratio hypothesis test, this is vitamin C supplementation, which is independent of the male to female birth ratio.
  • Dependent variable - data we expect to be affected by other values. In the alternative hypothesis, we propose that the birth ratio will be affected by vitamin C supplementation, so it is dependent on vitamin C.
  • These terms are commonly used when describing the results of hypothesis tests, as well as when visualizing results, such as on a scatter plot, where the independent variable always goes on the x-axis and the dependent variable on the y-axis.

Experiments

Experiments, treatment, and control

  • Experiments are a subset of hypothesis testing that involves performing statistical tests on sample data to draw conclusions about a population.
  • Used to draw product insights and drive improvements to commercial performance.
  • Experiments aim to answer: What is the effect of the treatment on the response?
    • Where the treatment refers to the independent variable, and the response to the dependent variable.

Example: Advertising as a treatment

  • What is the effect of an advertisement on the number of products purchased?
    • Treatment: advertisement
    • Response: number of products purchased
  • Visualizing the results using a bar plot suggests the treatment may have been effective in increasing the number of products purchased.

Controlled experiments

  • Participants are randomly assigned to either the treatment group or the control group.
    • Treatment group sees the advertisement
    • Control group does not see the advertisement
  • Groups should be comparable to avoid introducing bias
  • If groups are not comparable, this could lead to drawing incorrect conclusions.

The gold standard of experiments

  • The gold standard, or ideal experiment, will eliminate as much bias as possible.

Randomization

  • Participants are assigned to the treatment or control group randomly, not based on any of their characteristics.
  • Choosing randomly helps ensure that the groups are comparable.
  • Known as a randomized controlled trial.

Blinding

  • Participants don't know whether they're in the treatment or control group.
  • Ensures that the effect of the treatment is due to the treatment itself, not the idea of getting the treatment.
  • Known as a blind trial.

Double-blind randomized controlled trial

  • The person administering the treatment or running the experiment also doesn't know whether they're administering the actual treatment or a placebo.
  • This protects against bias in the response as well as in the analysis of the results.
  • These different tools all boil down to the same principle: the fewer the opportunities for bias to creep into an experiment, the more reliably we can conclude whether the treatment affects the response.

Randomized Controlled Trials vs. A/B testing

Randomized Controlled Trials

  • Can have multiple treatment groups
  • Popular in science and clinical research

A/B Testing

  • A randomized controlled trial split evenly into exactly two groups (treatment and control)
  • Popular in marketing and engineering

Correlation

  • One way to measure relationships

Pearson correlation coefficient

  • Often referred to as the correlation coefficient
  • Developed by Karl Pearson and published in 1896
  • Quantifies the strength of a relationship between two variables, producing a value between -1 and 1.
  • Number - corresponds to the strength of the relationship between the variables
  • Sign - corresponds to the direction of the relationship
    • Positive - as x increases, y increases
    • Negative - as x increases, y decreases
  • Can only be used for linear relationships
  • Linear - proportionate changes between dependent and independent variables
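
A sketch computing the coefficient with SciPy on made-up data (the variables here are hypothetical):

```python
from scipy.stats import pearsonr

# Hypothetical data: hours studied (x) and exam scores (y)
x = [1, 2, 3, 4, 5, 6]
y = [52, 57, 61, 68, 72, 79]

r, p_value = pearsonr(x, y)
print(round(r, 3))  # close to 1: a strong positive linear relationship
```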

Values = strength of the relationship

  • 0.99 - near-perfect or very strong relationship
  • 0.75 - moderate relationship
  • 0.2 - weak relationship
  • close to 0 - no relationship
  • Correlation does not equal causation.

Confounding variables

  • When looking at relationships among data, it is important to ask what else might be affecting the values.
  • e.g. the cost of a bottle of water is typically higher in locations with stronger economies, which may also offer better access to high-quality healthcare. So perhaps life expectancy is not affected by the cost of a bottle of water; it is actually affected by the strength of the economy. Here the economy is a confounding variable: something that affects the data we are analyzing but was not accounted for when assessing the relationship between the variables.

Interpreting hypothesis test results

p-value

  • Probability of achieving a result at least as extreme as the one observed, assuming the null hypothesis is true.
  • We can visualize the p-value for two sample mean distributions as the total area that overlaps between them.

Significance level (α)

  • To reduce the risk of drawing a false conclusion
    • Set a probability threshold for rejecting the null hypothesis
  • Known as α or the significance level.
  • Decided before data collection to minimize bias:
    • Otherwise, analysts could choose whichever α serves their interests.
  • A typical threshold is 0.05
    • e.g. accepting a 5% chance of wrongly concluding that Chicago residents live longer than Bangkok residents when they do not.
  • If p ≤ α, reject the null hypothesis.
  • These results are said to be statistically significant.
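
A sketch of this decision rule, using SciPy's independent two-sample t-test on hypothetical samples (the 0.05 threshold follows the notes above):

```python
from scipy.stats import ttest_ind

alpha = 0.05  # significance level, chosen before collecting data

# Hypothetical samples from two populations (e.g. lifespans in two cities)
group_a = [78, 82, 75, 80, 79, 83, 77, 81]
group_b = [74, 76, 73, 75, 77, 72, 78, 74]

t_stat, p_value = ttest_ind(group_a, group_b)

if p_value <= alpha:
    print(f"p = {p_value:.4f} <= {alpha}: reject the null hypothesis")
else:
    print(f"p = {p_value:.4f} > {alpha}: fail to reject the null hypothesis")
```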

Type I/II error

  • Type I error - wrongly rejecting the null hypothesis when it's actually true
  • Type II error - wrongly failing to reject the null hypothesis when it's actually false