Hypothesis Testing in Machine Learning

In this tutorial, you'll learn about the basics of Hypothesis Testing and its relevance in Machine Learning.
Jan 2019  · 4 min read

Hypothesis testing is the process of drawing inferences or conclusions about an overall population by conducting statistical tests on a sample. The same kind of inference can be drawn for machine learning models using a t-test, which I will discuss in this tutorial.

To draw inferences, we have to make some assumptions, and these lead to two terms used in hypothesis testing.

  • Null hypothesis: The default assumption that there is no anomaly, pattern, or real effect in the data, so the observation is consistent with the assumption made.

  • Alternate hypothesis: Contrary to the null hypothesis, it states that the observation is the result of a real effect.

P value

The p-value measures the strength of the evidence against the null hypothesis: it is the probability of observing a result at least as extreme as the one in the sample, assuming the null hypothesis is true. It is compared against a chosen level of significance. In machine learning algorithms, it is used to judge the significance of the predictors towards the target.

Generally, we set the level of significance at 5% (0.05), but this choice is open to discussion in some cases. If you have strong prior knowledge about how your data behaves, you can choose a different level of significance.

If the p-value for an independent variable in a machine learning model is less than 0.05, the variable is considered significant: its relationship with the target is unlikely to be due to chance alone, so it is useful and can be learned by the machine learning algorithm.
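As a rough illustration of checking a predictor's p-value against the 0.05 level, the sketch below uses scipy.stats.linregress, which reports a two-sided p-value for the hypothesis that the slope is zero. The data is synthetic and the names related, unrelated, and predictor_p_value are hypothetical, chosen only for this example.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

n = 200
target = rng.normal(size=n)
related = 2.0 * target + rng.normal(size=n)  # synthetic predictor tied to the target
unrelated = rng.normal(size=n)               # synthetic predictor that is pure noise

ALPHA = 0.05  # level of significance used in the text


def predictor_p_value(predictor, target):
    """Two-sided p-value for 'the slope between predictor and target is zero'."""
    return stats.linregress(predictor, target).pvalue


for name, predictor in [("related", related), ("unrelated", unrelated)]:
    p = predictor_p_value(predictor, target)
    verdict = "significant" if p < ALPHA else "not significant"
    print(f"{name}: p-value = {p:.4g} ({verdict} at alpha = {ALPHA})")
```

A single-predictor regression is only one way to obtain such a p-value; other models report predictor significance differently.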

The steps involved in hypothesis testing are as follows:

  • Assume a null hypothesis. In machine learning, we usually assume that there is no relationship between the target and the independent variable.

  • Collect a sample

  • Calculate the test statistic

  • Decide whether to reject or fail to reject the null hypothesis

Calculating test or T statistics

To calculate the t statistic, let's create a scenario.

Suppose there is a shipping container manufacturer that claims each container weighs exactly 1000 kg, no less and no more. Such claims look shady, so we proceed with gathering data and creating a sample.

After gathering a sample of 30 containers, we find that the average container weight is 990 kg, with a standard deviation of 12.5 kg.

The test statistic is calculated as:

t = (Sample mean - Claimed mean) / (Standard deviation / sqrt(Sample size))

Plugging in the numbers gives t = (990 - 1000) / (12.5 / sqrt(30)) ≈ -4.3818.
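The arithmetic can be checked with a few lines of Python. This is a minimal sketch using only the standard library; the function name t_statistic is a hypothetical helper, and the inputs are the container figures from the scenario.

```python
import math


def t_statistic(sample_mean, claimed_mean, sample_std, sample_size):
    """One-sample t statistic: distance from the claim in standard-error units."""
    standard_error = sample_std / math.sqrt(sample_size)
    return (sample_mean - claimed_mean) / standard_error


# Figures from the shipping container scenario
t_value = t_statistic(sample_mean=990, claimed_mean=1000, sample_std=12.5, sample_size=30)
print(round(t_value, 4))  # -4.3818
```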

Now we look up the critical t value for a 0.05 significance level and the sample's degrees of freedom.

Note: Degrees of freedom = Sample size - 1 (here, 30 - 1 = 29)

From the t table, the one-tailed (lower-tail) critical value is -1.699.

Comparing the two, the calculated statistic (-4.3818) is less than the critical value (-1.699) at the desired level of significance, so it falls in the rejection region and we can reject the claim made.

You can calculate the critical t value using the stats.t.ppf() function from the stats module of the SciPy library.
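A short sketch of that lookup is shown here, assuming the same one-tailed setup as the t table value (0.05 significance, 29 degrees of freedom). The one-tailed p-value computed with stats.t.cdf is an addition for illustration and is not part of the original walkthrough.

```python
from scipy import stats

alpha = 0.05          # level of significance
sample_size = 30
df = sample_size - 1  # degrees of freedom

# Lower-tail critical value: the point with 5% of the t distribution below it
critical_value = stats.t.ppf(alpha, df)
print(round(critical_value, 3))  # -1.699

t_value = -4.3818  # test statistic from the container example

# One-tailed p-value: probability of a statistic this low or lower under the null
p_value = stats.t.cdf(t_value, df)

reject = t_value < critical_value
print("Reject the claim" if reject else "Fail to reject the claim")
```

Because the claim is "no less and no more", a two-tailed test would also be a reasonable choice; in that case the critical value would come from stats.t.ppf(alpha / 2, df) and the comparison would use the absolute value of the statistic.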

Errors

Hypothesis testing is done on a sample of data rather than the entire population, because data for the whole population is usually not available. Since inferences are drawn from sample data, hypothesis testing can lead to errors, which fall into two types:

  • Type I Error: We reject the null hypothesis when it is true.

  • Type II Error: We fail to reject the null hypothesis when it is false.
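To make the Type I error concrete, the simulation below repeatedly draws samples from a population where the null hypothesis really is true (mean weight of 1000 kg) and counts how often the test rejects it anyway. It is only a sketch: the normal population, the seed, and the number of trials are assumptions for illustration, and the critical value is the t table figure from the container example.

```python
import math
import random
import statistics

random.seed(42)  # fixed seed so the sketch is repeatable

CLAIMED_MEAN = 1000.0    # the null hypothesis is true in this simulation
POPULATION_STD = 12.5    # assumed population spread
SAMPLE_SIZE = 30
CRITICAL_VALUE = -1.699  # one-tailed t table value for alpha = 0.05, 29 degrees of freedom


def rejects_null(sample):
    """Run the one-tailed t-test on a sample and report whether it rejects the claim."""
    mean = statistics.mean(sample)
    std = statistics.stdev(sample)  # sample standard deviation
    t = (mean - CLAIMED_MEAN) / (std / math.sqrt(len(sample)))
    return t < CRITICAL_VALUE


trials = 10_000
false_rejections = sum(
    rejects_null([random.gauss(CLAIMED_MEAN, POPULATION_STD) for _ in range(SAMPLE_SIZE)])
    for _ in range(trials)
)

type_i_rate = false_rejections / trials
print(f"Estimated Type I error rate: {type_i_rate:.3f}")
```

The estimated rate should land close to the chosen level of significance, since that level is exactly the probability of a Type I error the test is designed to tolerate.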

Other Approaches

Many other approaches exist for hypothesis testing between two models. One is to build two models on the available features: one model uses all the features and the other uses one feature less. Comparing them lets us test the significance of individual features. However, inter-dependency between features affects such simple methods.
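One way to sketch the "all features versus one less" comparison is a partial F-test between two nested linear regression models. This is a standard technique swapped in for illustration, not necessarily the method the article has in mind; the synthetic data and the helper names residual_sum_of_squares and dropped_feature_p_value are hypothetical.

```python
import numpy as np
from scipy import stats


def residual_sum_of_squares(X, y):
    """Fit ordinary least squares with an intercept and return the residual sum of squares."""
    design = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ coef
    return float(residuals @ residuals)


def dropped_feature_p_value(X, y, column):
    """Partial F-test p-value comparing the full model with the model lacking one column."""
    n, k = X.shape
    rss_full = residual_sum_of_squares(X, y)
    rss_reduced = residual_sum_of_squares(np.delete(X, column, axis=1), y)
    df_resid = n - (k + 1)  # observations minus fitted parameters (k slopes + intercept)
    f_stat = (rss_reduced - rss_full) / (rss_full / df_resid)
    return float(stats.f.sf(f_stat, 1, df_resid))


rng = np.random.default_rng(1)  # fixed seed so the sketch is repeatable
X = rng.normal(size=(200, 3))
y = 3.0 * X[:, 0] + rng.normal(size=200)  # only column 0 drives the synthetic target

for column in range(X.shape[1]):
    p = dropped_feature_p_value(X, y, column)
    print(f"feature {column}: p-value = {p:.4g}")
```

A small p-value means the model loses a meaningful amount of fit when that feature is dropped. The synthetic features here are independent, so the inter-dependency caveat does not show up in this example.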

In regression problems, we generally follow the p-value rule: features whose p-values exceed the significance level are removed, thus iteratively improving the model.

Each algorithm has its own approaches for testing hypotheses about different features.

If you would like to learn more about the fundamentals of Bayesian inference, take DataCamp's Fundamentals of Bayesian Data Analysis in R course.

