
Polynomial Regression: From Straight Lines to Curves

Explore how polynomial regression helps model nonlinear relationships and improve prediction accuracy in real-world datasets.
March 23, 2026 · 12 min read

When your data curves, it doesn’t make sense to use a straight line to estimate new data points. By doing so, you’ll end up with a model that misses the pattern, has high residuals, and poor predictions. Real-world data rarely behaves linearly, whether you're modeling how drug dosage affects response, how temperature impacts material stress, or how asset prices move over time.

Polynomial regression fixes this by extending linear regression to fit curves instead of straight lines. Just add a few higher-degree terms - x², x³ - and your model can keep track of the actual shape of your data.

In this article, I'll cover what polynomial regression is, the math behind it, how to implement it in Python, and how to avoid the trap most people fall into: overfitting.

If you’re new to the concept of machine learning, read our Essentials of Linear Regression in Python tutorial first.

What Is Polynomial Regression?

Polynomial regression is the algorithm you reach for when a straight line can't describe your data.

Linear regression models the relationship between variables as a straight line. That works when the relationship actually is linear - but most real-world data isn't. Think of how a car's braking distance changes with speed, or how a plant's growth rate responds to fertilizer. These relationships curve. A straight line won't fit them well, no matter what you do.

Polynomial regression extends linear regression by adding higher-degree terms to the equation. Instead of fitting y = b0 + b1x, you fit something like y = b0 + b1x + b2x² + b3x³. The degree of the polynomial - that n in "nth-degree" - controls how many bends the curve can make.

In short and plain English, here's the key difference between the two:

  • Linear regression: Fits a straight line. One coefficient per feature, one degree of freedom in the curve.

  • Polynomial regression: Fits a curve. Each additional term (x², x³, ...) gives the model more flexibility to follow the shape of the data.

Linear versus polynomial regression


Below the surface, polynomial regression is still a linear model. "Linear" here refers to how the model treats its coefficients, not the shape of the curve it produces. You're adding new features (x², x³) and fitting a linear equation on top of them.

So when do you actually use it?

Go with polynomial regression when your residual plot from a linear model shows a pattern - that's a sign the relationship isn't linear. It's also a great fit when you have domain knowledge suggesting a curved relationship, like in physics, biology, or economics.

The tradeoff is that higher-degree polynomials can get unstable. A degree-2 or degree-3 polynomial can handle most real-world curves, but when you go higher, you're likely fitting noise rather than signal.

Why Use Polynomial Regression?

Most real-world relationships between variables aren't linear.

A straight line might get close, but "close" isn't good enough when you're predicting anything sensitive. If the relationship in the data bends, a linear model will consistently miss that bend.

Polynomial regression does a better job by letting the model curve. Instead of forcing a straight line through your data, you're fitting a curve that can follow the shape of the relationship.

Here are some areas in different lines of business where it makes a real difference:

  • Biology and medicine: Dose-response relationships are rarely linear. A low dose might have little effect, a medium dose works well, and a high dose causes side effects. That S-shaped curve needs a polynomial model to capture it
  • Engineering: Stress-strain relationships in materials, aerodynamic drag, and thermal expansion all follow nonlinear patterns that polynomial regression models well
  • Finance: Asset returns, option pricing, and demand curves often show diminishing or accelerating effects that a straight line can't represent
  • Machine learning pipelines: Polynomial features are a quick way to add nonlinearity to a linear model without switching to a more complex algorithm

The common thread in all these cases is the same: the relationship between your input and output changes at different values of x. Linear regression assumes that change is constant. Polynomial regression doesn't.

That said, polynomial regression isn't a silver bullet.

It works best when you have domain knowledge suggesting a curved relationship, or when your residual plot clearly shows a pattern a straight line can't fix. Go for it with a specific problem in mind - not just because your linear model's R² isn't high enough.

Understanding the Math Behind Polynomial Regression

Knowing the basic math behind polynomial regression will help you understand it better.

Polynomial terms

In linear regression, your model looks like this:

y = b0 + b1x

That's one input variable, one coefficient, one straight line. Polynomial regression extends this by adding higher-degree terms:

y = b0 + b1x + b2x² + b3x³ + ... + bnxⁿ

Each new term - x², x³, and so on - gives the model one more "bend" to work with. A degree-2 polynomial can fit a single bend. A degree-3 polynomial can change direction up to twice. The degree n controls how flexible the model is.

The underlying algorithm remains the same. You're just adding new features: x² is treated as its own input variable, same as x. The model is still fitting a linear equation - just on top of transformed features.

Least squares estimation

Fitting a polynomial regression model works the same way as linear regression - that is, with least squares estimation.

The idea is to find the coefficients that minimize the sum of squared residuals:

SSR = Σ (yi - ŷi)²

Each squared difference is a residual - the gap between what the model predicts and what was observed. Squaring them makes sure negative and positive errors don't cancel each other out, and penalizes large errors more than small ones.
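As a quick sanity check, the objective is easy to compute by hand (the toy numbers below are made up):

```python
import numpy as np

y_true = np.array([2.0, 4.1, 6.3])   # observed values
y_pred = np.array([2.2, 4.0, 6.0])   # model predictions

residuals = y_true - y_pred
ssr = np.sum(residuals ** 2)         # (-0.2)^2 + 0.1^2 + 0.3^2
print(round(ssr, 4))                 # 0.14
```

Notice how the 0.3 residual contributes nine times as much to the sum as the 0.1 residual - that's the squaring at work.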

In practice, your library handles this for you. But knowing that least squares is the objective helps you understand why outliers hurt polynomial models so much - a single large residual gets squared and pulls the coefficients in its direction.
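Under the hood it's just a linear solve on the transformed features. Here's a sketch that builds the design matrix [1, x, x²] by hand and solves it with numpy - the synthetic curve 0.6x² - x is an assumption for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)
x = np.linspace(-3, 3, 30)
y = 0.6 * x**2 - x + rng.normal(0, 0.3, 30)

# Design matrix [1, x, x^2]: x^2 is just another column
X = np.column_stack([np.ones_like(x), x, x**2])

# Ordinary least squares on the transformed features
coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)
print(coeffs)  # roughly [b0 ~ 0, b1 ~ -1.0, b2 ~ 0.6]
```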

Interpreting coefficients

In linear regression, b1 has a simple interpretation: for every one-unit increase in x, y changes by b1.

Polynomial regression is a bit more involved. When your model includes b1x + b2x², the effect of x on y depends on the current value of x - you can't read b2 in isolation and draw a conclusion. The slope of the curve is constantly changing, which you can see by taking the derivative with respect to x:

dy/dx = b1 + 2b2x

The slope itself is a function of x. That means the impact of a one-unit change in x is different at every point on the curve.
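For example, with hypothetical coefficients b1 = -1 and b2 = 0.6, evaluating b1 + 2·b2·x at a few points makes this obvious:

```python
b1, b2 = -1.0, 0.6  # hypothetical fitted coefficients

for x0 in [-2.0, 0.0, 2.0]:
    slope = b1 + 2 * b2 * x0
    print(f"x = {x0:+.1f} -> slope = {slope:+.2f}")
```

The curve falls steeply at x = -2, is still falling at x = 0, and is rising by x = 2 - all from the same two coefficients.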

This is why you shouldn't try to interpret individual coefficients in a polynomial model. Instead, look at the curve as a whole. Plot your predictions against your data.

Applications of Polynomial Regression in Data Science

Polynomial regression shows up across a lot of fields because curved relationships are everywhere in real data.

Finance

Financial data rarely moves in straight lines.

Asset prices, revenue growth, and demand curves all tend to accelerate, decelerate, or reverse direction depending on market conditions. A linear model assumes a constant rate of change, which is almost never true. Polynomial regression lets you model these shifts - for example, how consumer demand drops off slowly at first, then sharply as prices rise past a certain point.

It's also handy for trend analysis over time. When you're fitting a curve to historical price data or modeling how a metric grows during different phases of a business cycle, a degree-2 or degree-3 polynomial often estimates the shape much better than a straight line.
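A minimal sketch (with made-up quarterly revenue figures) comparing a straight line against a quadratic trend via numpy's polyfit:

```python
import numpy as np

# Hypothetical quarterly revenue (in $M), accelerating over time
t = np.arange(12)
revenue = np.array([10.0, 10.5, 11.2, 12.1, 13.3, 14.8,
                    16.6, 18.7, 21.1, 23.8, 26.8, 30.1])

linear = np.polyfit(t, revenue, 1)
quadratic = np.polyfit(t, revenue, 2)

# In-sample residual sum of squares for each fit
rss_lin = np.sum((revenue - np.polyval(linear, t)) ** 2)
rss_quad = np.sum((revenue - np.polyval(quadratic, t)) ** 2)
print(f"Linear RSS: {rss_lin:.2f} | Quadratic RSS: {rss_quad:.2f}")
```

On accelerating data like this, the quadratic fit leaves far smaller residuals than the line.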

Engineering

Physical processes are some of the best examples of nonlinear relationships.

Stress and strain in materials, fluid dynamics, thermal expansion, and aerodynamic drag all follow curves, not lines. Many of the governing equations in physics are polynomial by nature. Polynomial regression gives you a data-driven way to fit those curves when you have measurements but no clean, closed-form equation.

A good example is drag force, which increases with the square of velocity. A linear model will underestimate drag at high speeds, and a degree-2 polynomial will correctly fit the relationship.
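A quick illustration on synthetic data (the drag constant k is chosen arbitrarily): fit both degrees to F = k·v² measurements, then extrapolate to a higher speed:

```python
import numpy as np

k = 0.3                      # arbitrary lumped drag constant
v = np.linspace(5, 50, 20)   # measured speeds
F = k * v**2                 # drag grows with the square of velocity

lin = np.polyfit(v, F, 1)
quad = np.polyfit(v, F, 2)

v_hi = 60.0
print(np.polyval(lin, v_hi))   # linear fit underestimates at high speed
print(np.polyval(quad, v_hi))  # close to k * 60^2 = 1080
```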

Machine Learning

In machine learning, polynomial regression is often used as a feature engineering technique rather than a standalone model.

By adding polynomial terms - x², x³, interaction terms - to your feature set, you give a linear model the ability to fit nonlinear patterns without switching to a more complex algorithm. This is a common first step when your linear model underfits and you want to add flexibility before reaching for something like a decision tree or neural network.

It's also useful as a baseline model.

Before training a more complex model, fitting a polynomial regression tells you how much of the variance a simple curve can explain. If a degree-3 polynomial already gets you most of the way there, you might not need anything more complex.

How to Choose the Right Degree for Polynomial Regression

Picking the degree of your polynomial is one of the most important decisions you'll make. If you get it wrong in either direction, you’ll end up with a less accurate model.

Luckily, a few lines of Python code are enough to get the work done.

Overfitting vs. underfitting

Underfitting happens when your degree is too low. A degree-1 polynomial on curved data will miss the pattern - high bias, poor predictions, and a model that performs badly on both training and new data.

Overfitting is the opposite problem, and it's more dangerous because it looks good at first. A high-degree polynomial can go through every data point in your training set with near-zero error. But the model is just memorizing the noise. It would fall apart on new data.

You can see this by comparing training error versus test error across degrees:

import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

np.random.seed(42)

# Generate sample data
x = np.linspace(-3, 3, 80).reshape(-1, 1)
y = 0.6 * x.ravel()**2 - x.ravel() + np.random.normal(0, 0.6, 80)

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25)

for deg in [1, 2, 12]:
    poly = PolynomialFeatures(deg)
    model = LinearRegression()
    model.fit(poly.fit_transform(x_train), y_train)

    train_err = mean_squared_error(y_train, model.predict(poly.transform(x_train)))
    test_err  = mean_squared_error(y_test,  model.predict(poly.transform(x_test)))

    print(f"Degree {deg:>2} | Train MSE: {train_err:.4f} | Test MSE: {test_err:.4f}")

MSE on different degrees


Or, presented visually:

Data fit with different polynomial degrees


Degree 1 shows high error on both sets - that's underfitting. Degree 2 is well-balanced. Degree 12 has lower training error but a much higher test error - that's overfitting.

Cross-validation

The right way to find the best degree is cross-validation - specifically, k-fold cross-validation.

The idea is to split your data into k subsets, train on k-1 of them, and test on the one you held out, repeating until every subset has been the test set once. Average the error across all folds, do this for each candidate degree, and pick the degree with the lowest average test error.

Implementation is much simpler than explanation:

import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
import matplotlib.pyplot as plt

np.random.seed(42)

# Generate sample data
x = np.linspace(-3, 3, 80).reshape(-1, 1)
y = 0.6 * x.ravel()**2 - x.ravel() + np.random.normal(0, 0.6, 80)

# Test degrees 1 through 10
degrees = range(1, 11)
mean_errors = []

for deg in degrees:
    model = make_pipeline(PolynomialFeatures(deg), LinearRegression())
    scores = cross_val_score(model, x, y, cv=5, scoring="neg_mean_squared_error")
    mean_errors.append(-scores.mean())
    print(f"Degree {deg:>2} | Mean error: {-scores.mean():.4f}")

best_degree = np.argmin(mean_errors) + 1
print(f"Best degree: {best_degree}")

Degree error comparison


Or, represented visually:

Cross-validation error comparison


The CV error drops as you add useful polynomial terms, then climbs again as the model starts overfitting.

When two degrees give similar CV error, pick the lower one. A simpler model that performs just as well is always the better choice.

Key Considerations and Limitations

There are a couple of ways polynomial regression can lead you to wrong conclusions. Let’s go over them now.

Sensitivity to outliers

Outliers affect polynomial regression more than linear regression.

Least squares squares each residual before summing them. A single data point far from the trend contributes a disproportionately large error, and the model will bend its curve to reduce that error - even if it means distorting the fit everywhere else.

This effect gets worse as the degree increases. A high-degree polynomial has enough flexibility to chase an outlier, which pulls the curve away from the bulk of your data to fit one bad point.

A way to get around this is to clean your data before fitting. Plot your data, identify outliers, and decide whether they represent real signal or noise. If they're noise - measurement errors, data entry mistakes, corrupted records - remove them. If they're real, consider a more outlier-resistant fitting method like RANSAC or Huber regression.
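As a sketch, here's the robust route on synthetic data with one injected outlier - HuberRegressor on top of polynomial features (the data-generating curve 0.6x² - x is an assumption for illustration):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import HuberRegressor

rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 50).reshape(-1, 1)
y = 0.6 * x.ravel()**2 - x.ravel() + rng.normal(0, 0.3, 50)
y[10] += 25.0  # a single corrupted measurement

# Huber loss down-weights the outlier instead of squaring it
model = make_pipeline(PolynomialFeatures(2, include_bias=False),
                      HuberRegressor(max_iter=200))
model.fit(x, y)

print(model.predict(np.array([[0.0]])))  # close to the true value of 0
```

An ordinary least squares fit on the same data would be pulled noticeably toward the corrupted point.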

Overfitting

Every time you add a polynomial term, you give the model more flexibility. At some point, that flexibility stops helping, and the model starts fitting the random noise in your training data instead of the actual pattern. The result is a curve that does well on training data but falls apart on new data.

The tricky part is that overfitting is invisible if you only look at training error. A degree-10 polynomial will almost always have lower training MSE than a degree-2 polynomial. That doesn't mean it's a better model.

Here’s how you should approach this:

  • Start low: Try degree 2 or 3 before going higher. Most real-world curves don't need more than that
  • Always evaluate on held-out data: Training error alone tells you nothing about generalization
  • Cross-validation: Use it to find the degree where test error stops improving
  • Watch the curve: If your polynomial does something wild outside the range of your training data, that's a sign it's overfitting

Polynomial regression works best when you have a good reason to expect a curved relationship, and you can keep the degree low enough.

Alternatives to Polynomial Regression

Polynomial regression isn't always the right tool - some of these alternatives might be a better fit for you, no pun intended.

Splines

Splines solve the problem of global instability.

When you fit a degree-10 polynomial, every coefficient is influenced by every data point. A change in one region of your data affects the curve everywhere else. Splines avoid this by splitting your data into segments and fitting a separate low-degree polynomial to each one. The segments are joined at points called knots, with constraints that keep the overall curve smooth at the joins.

The result is a curve that's flexible where it needs to be and stable everywhere else.

In Python, scipy and scikit-learn both have solid spline implementations:

from scipy.interpolate import UnivariateSpline

# x must be 1-D and sorted in increasing order; k=3 gives cubic segments
spline = UnivariateSpline(x, y, k=3)
y_pred = spline(x_new)

Spline versus a high-degree polynomial


To reiterate, go with splines when your data has different behavior in different regions, or when a single polynomial curve can't capture the shape without going to a high degree.

Support Vector Regression

Support Vector Regression (SVR) takes a different approach.

Instead of minimizing squared error across all points, it looks for a function that keeps as many points as possible within a defined margin of error, and ignores errors smaller than that margin. This makes it less sensitive to outliers than polynomial regression.

The connection to polynomial regression comes through the kernel trick. SVR with a polynomial kernel can fit nonlinear relationships similar to polynomial regression - but with better generalization and more control over the fit via regularization parameters.

from sklearn.svm import SVR

model = SVR(kernel="poly", degree=3, C=1.0, epsilon=0.1)
model.fit(x_train, y_train)

SVR versus a high-degree polynomial


SVR is a good choice when your data has outliers you can't remove, when you need more control over the bias-variance tradeoff, or when polynomial regression keeps overfitting despite cross-validation.

Conclusion

In this article, I’ve shown you how polynomial regression extends linear regression to fit curves, how least squares estimation finds the best coefficients, and why interpreting those coefficients individually doesn't tell you much.

The degree you choose matters more than anything else. Too low leads to underfitting, and too high to overfitting. Cross-validation gives you an objective way to find the sweet spot. And if polynomial regression isn't the right fit, splines and SVR are solid alternatives worth knowing.

The best way to build intuition for all of this is to use it on your own data. Pick a dataset where you suspect a nonlinear relationship, fit a linear model first, plot the residuals, and see what polynomial regression does differently. Read our guide to Non-Linear Models and Insights Using R to see this pipeline in practice.


Author
Dario Radečić
Senior Data Scientist based in Croatia. Top Tech Writer with over 700 articles published, generating more than 10M views. Author of the book Machine Learning Automation with TPOT.

Polynomial Regression FAQs

What is polynomial regression, and when should I use it?

Polynomial regression is an extension of linear regression that fits a curved line to your data instead of a straight one. You use it when the relationship between your input and output variables isn't linear - for example, when a residual plot from a linear model shows a pattern. It's a good first step before reaching for more complex models like decision trees or neural networks.

How is polynomial regression different from linear regression?

Linear regression fits a straight line by modeling the relationship as y = b0 + b1x. Polynomial regression adds higher-degree terms like x² and x³, giving the model enough flexibility to follow curves in the data. Both use the same least squares estimation - polynomial regression just treats those extra terms as additional input features.

What are the biggest risks of using polynomial regression?

The two main risks are overfitting and sensitivity to outliers. A high-degree polynomial can memorize the noise in your training data and perform poorly on new data. Outliers are dangerous because least squares squares each residual, meaning a single bad data point can pull the curve away from the rest of your data.

How do I choose the right degree for my polynomial regression model?

Start with degree 2 or 3 - most real-world curved relationships don't need more than that. From there, use k-fold cross-validation to compare average test error across candidate degrees and pick the one where test error stops improving. When two degrees give similar results, always go with the lower one.

When should I use splines or SVR instead of polynomial regression?

Go with splines when your data behaves differently in different regions, or when a single polynomial keeps producing unstable curves at the edges of your data range. SVR with a polynomial kernel is a better choice when outliers are unavoidable and you need a model that won't bend itself out of shape to accommodate them. Both alternatives give you more control over the fit at the cost of some interpretability.
