Hypothesis Testing in Machine Learning

In this tutorial, you'll learn about the basics of Hypothesis Testing and its relevance in Machine Learning.

Jan 4, 2019 · 4 min read

Hypothesis testing is the process of drawing inferences or conclusions about an overall population by running statistical tests on a sample. The same kind of inference can be drawn for different machine learning models with a T-test, which I will discuss in this tutorial.

To draw inferences, we have to make some assumptions, which lead to two terms used in hypothesis testing.

Null hypothesis: The default assumption that there is no anomaly, pattern, or real effect, so the data behave as originally assumed.
Alternate hypothesis: Contrary to the null hypothesis, it states that the observation is the result of a real effect.

P value

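The p-value is the probability of observing a result at least as extreme as the one in the sample, assuming the null hypothesis is true. It is read as evidence against the null hypothesis: the smaller the p-value, the stronger the evidence. In machine learning algorithms, it is used to judge the significance of the predictors towards the target.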

Generally, we set the level of significance at 5%, although this choice is debated in some cases. If you have strong prior knowledge about how your data behave, you can choose a different level of significance.

If the p-value for an independent variable in a machine learning model is less than 0.05, the variable is considered significant. Its relationship with the target is unlikely to be due to chance, so it is useful and can be learned by the machine learning algorithm. If the p-value is above the chosen level, we cannot reject the null hypothesis for that variable.

The steps involved in hypothesis testing are as follows:

State a null hypothesis. In machine learning algorithms, we usually assume that there is no relationship between the target and the independent variable.
Collect a sample.
Calculate the test statistic.
Decide whether to reject or fail to reject the null hypothesis.

Calculating test or T statistics

To calculate the T statistic, let's create a scenario.

Suppose a company that makes shipping containers claims that each container weighs exactly 1000 kg, no less and no more. Such a claim looks shady, so we gather data and create a sample.

After gathering a sample of 30 containers, we find that the average container weight is 990 kg, with a standard deviation of 12.5 kg.

So, calculating the test statistic:

T = (Sample Mean - Claim) / (Standard Deviation / Sample Size^(1/2))

After putting in all the numbers, this is (990 - 1000) / (12.5 / 30^(1/2)) = -4.3818.
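The arithmetic above can be checked with a few lines of Python. This is only a minimal sketch that uses the numbers from the container example. The variable names are illustrative and do not come from any library.

```python
import math

# Numbers taken from the container example above.
claimed_mean = 1000.0   # weight claimed by the company, in kg
sample_mean = 990.0     # average weight of the sampled containers, in kg
sample_std = 12.5       # sample standard deviation, in kg
sample_size = 30        # number of containers weighed

# Standard error of the mean: how much the sample mean is expected to vary.
standard_error = sample_std / math.sqrt(sample_size)

# T statistic: distance between the sample mean and the claim, in standard errors.
t_statistic = (sample_mean - claimed_mean) / standard_error

print(round(t_statistic, 4))  # -4.3818
```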

Now we look up the critical t value for a 0.05 level of significance and the degrees of freedom.

Note: Degrees of Freedom = Sample Size - 1, which is 29 here.

From the T table, the one-tailed critical value is -1.699.

Comparing the two, the calculated T statistic (-4.3818) is less than the critical value (-1.699) at the desired level of significance. So we can reject the claim made by the company.

You can calculate the critical t value using the stats.t.ppf() function from the stats module of the SciPy library.
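As a rough sketch of that lookup, the snippet below calls stats.t.ppf() for the container example and, as an extra step not covered in the text, uses stats.t.cdf() to get a one-sided p-value. It assumes SciPy is installed and treats the test as one-tailed (lower tail), which matches the T-table value used here.

```python
from scipy import stats

t_statistic = -4.3818        # T statistic from the container example
degrees_of_freedom = 30 - 1  # sample size - 1
alpha = 0.05                 # level of significance

# Lower-tail critical value: 5% of the t distribution lies below this point.
t_critical = float(stats.t.ppf(alpha, degrees_of_freedom))

# One-sided p-value: probability of a T statistic this low or lower
# if the null hypothesis (mean weight is 1000 kg) were true.
p_value = float(stats.t.cdf(t_statistic, degrees_of_freedom))

print(round(t_critical, 3))      # -1.699
print(t_statistic < t_critical)  # True, so the claim is rejected
print(p_value < alpha)           # True, the same decision via the p-value
```

Comparing the statistic with the critical value and comparing the p-value with the level of significance are two views of the same decision, so they always agree.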

Errors

Hypothesis testing is done on a sample of data rather than the entire population, because data for the whole population is usually unavailable. Because inferences are drawn from sample data, hypothesis testing can lead to errors, which are classified into two types:

Type I Error: In this error, we reject the null hypothesis when it is true.
Type II Error: In this error, we fail to reject the null hypothesis when it is false.
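To make the Type I error concrete, here is a small simulation sketch. It assumes, purely for illustration, that container weights are normally distributed with a true mean of exactly 1000 kg, so the null hypothesis is true and every rejection is a Type I error. The constants and the trial count are illustrative choices, not part of the original example.

```python
import math
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

TRUE_MEAN = 1000.0   # the null hypothesis is true in this simulation
TRUE_STD = 12.5      # assumed population standard deviation, in kg
SAMPLE_SIZE = 30
T_CRITICAL = -1.699  # one-tailed critical value for 0.05 and 29 degrees of freedom
N_TRIALS = 10_000

false_rejections = 0
for _ in range(N_TRIALS):
    # Draw a fresh sample of containers from the "honest" population.
    sample = [random.gauss(TRUE_MEAN, TRUE_STD) for _ in range(SAMPLE_SIZE)]
    mean = statistics.fmean(sample)
    std = statistics.stdev(sample)
    t_stat = (mean - TRUE_MEAN) / (std / math.sqrt(SAMPLE_SIZE))
    # Rejecting here is a mistake, because the claim is actually true.
    if t_stat < T_CRITICAL:
        false_rejections += 1

type_i_rate = false_rejections / N_TRIALS
print(f"Estimated Type I error rate: {type_i_rate:.3f}")
```

The estimated rate should land close to the 5% level of significance, which is exactly the share of Type I errors we agree to tolerate when choosing that level.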

Other Approaches

There are many other approaches to hypothesis testing with two models. One is to build two models on the available features: one model uses all the features and the other uses one fewer. This lets us test the significance of individual features. However, inter-dependency between features affects such simple methods.

In regression problems, we generally follow the p-value rule: features whose p-values exceed the significance level are removed, iteratively improving the model.
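The p-value rule for regression can be sketched as follows. This is a hedged illustration on synthetic data: the names feature_a and feature_b are hypothetical, the target is built to depend on feature_a only, and the p-values come from a plain ordinary least squares fit computed with NumPy and SciPy rather than from any particular modeling library.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_samples = 200

# Synthetic data (an assumption for illustration): the target depends on
# feature_a only, while feature_b is pure noise.
feature_a = rng.normal(size=n_samples)
feature_b = rng.normal(size=n_samples)
target = 3.0 * feature_a + rng.normal(size=n_samples)

# Design matrix with an intercept column.
X = np.column_stack([np.ones(n_samples), feature_a, feature_b])
names = ["intercept", "feature_a", "feature_b"]

# Ordinary least squares fit.
coef, _, _, _ = np.linalg.lstsq(X, target, rcond=None)
residuals = target - X @ coef
dof = n_samples - X.shape[1]  # residual degrees of freedom

# Standard errors from the estimated residual variance.
residual_variance = residuals @ residuals / dof
std_errors = np.sqrt(residual_variance * np.diag(np.linalg.inv(X.T @ X)))

# Two-sided p-values for the null hypothesis "this coefficient is zero".
t_values = coef / std_errors
p_values = 2 * stats.t.sf(np.abs(t_values), dof)

# Report the features only; the intercept is normally kept regardless.
for name, p in zip(names[1:], p_values[1:]):
    verdict = "keep" if p < 0.05 else "candidate for removal"
    print(f"{name}: p = {p:.4f} ({verdict})")
```

In an iterative version, the feature with the largest p-value above the level of significance would be dropped and the model refitted, repeating until every remaining feature passes.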

Each algorithm has its own approaches for testing hypotheses about different features.

If you would like to learn more about the fundamentals of Bayesian inference, take DataCamp's Fundamentals of Bayesian Data Analysis in R course.

Check out our Machine Learning Basics tutorial.
