Overfitting and Curve-Fitting

Overfitting is the core problem of retail backtesting. It happens when a strategy's parameters are tuned, deliberately or not, to fit historical noise rather than persistent signal. The result is a backtest that looks compelling and live performance that falls well short of it.

What overfitting looks like

A curve-fit strategy typically shows: extremely high in-sample Sharpe (often above 3), very few losing months in the backtest period, tight stop-losses that “just avoided” major drawdowns, and a parameter sensitivity map where performance degrades sharply as parameters deviate from their optimised values.

The last point is diagnostic. A robust strategy is relatively insensitive to small parameter changes. A curve-fit strategy is optimised to a specific combination of parameters that happened to work on this particular dataset. For example, changing the moving-average length by 2 might drop the Sharpe from 2.4 to 0.6.

Why it happens

Financial time series are noisy, with a low signal-to-noise ratio. A daily strategy on 5 years of data has roughly 1,260 observations (about 252 trading days per year). If you optimise over 10 parameters, each with 10 candidate values, you've defined a search space of 10^10, or 10 billion, combinations. Even searching a small fraction of this space is enough to find patterns that are pure noise.
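The effect can be seen in a small simulation. The sketch below is illustrative only: it assumes Gaussian daily returns with zero mean and 1% volatility, and the function names (annualised_sharpe, best_of_noise) are hypothetical. Every simulated "strategy" has no edge at all, yet the best Sharpe found rises as more of them are tried.

```python
import math
import random
import statistics

TRADING_DAYS = 252  # assumed trading days per year


def annualised_sharpe(returns):
    """Annualised Sharpe ratio of daily returns (risk-free rate assumed zero)."""
    mean = statistics.fmean(returns)
    stdev = statistics.stdev(returns)
    return mean / stdev * math.sqrt(TRADING_DAYS)


def best_of_noise(n_trials, n_days=5 * TRADING_DAYS, seed=0):
    """Best Sharpe among n_trials strategies whose returns are pure noise."""
    rng = random.Random(seed)
    best = -math.inf
    for _ in range(n_trials):
        # Zero-mean returns: by construction there is no signal to find.
        returns = [rng.gauss(0.0, 0.01) for _ in range(n_days)]
        best = max(best, annualised_sharpe(returns))
    return best


print("observations:", 5 * TRADING_DAYS)
print("grid size:", 10 ** 10)
for n in (1, 10, 100):
    print(n, "trials -> best Sharpe", round(best_of_noise(n), 2))
```

Because the seed is fixed, each larger run contains the smaller runs' trials, so the best Sharpe can only stay level or rise as the trial count grows.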

The number of effective free parameters (degrees of freedom) in a strategy isn't just the number you explicitly optimise. Every time you look at results, adjust a rule, change an exit condition, or decide to test a different market, you're consuming degrees of freedom. The full trial count includes every strategy variation you've ever tested on this data, not just the ones in the current optimisation run.

The most dangerous form of overfitting is unconscious: a trader looks at a chart, notices a pattern, backtests it, sees good results, and concludes they have found an edge. Every time you glance at historical price data and form a hypothesis, you are using degrees of freedom that the classical Sharpe ratio calculation doesn't account for.

Symptoms to check

Parameter sensitivity: Run the same strategy with parameters ±20% from optimal. If Sharpe degrades by more than 50%, the optimum is a spike rather than a plateau. Robust strategies produce plateau-shaped parameter surfaces.
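A minimal sketch of this check follows. It assumes you can supply a callable that maps a parameter dictionary to a Sharpe ratio; the name sharpe_fn, the function sensitivity_report, and the two toy surfaces are hypothetical and stand in for a real backtest. The callable is responsible for rounding integer parameters such as lookback lengths.

```python
def sensitivity_report(sharpe_fn, optimal, rel_shift=0.20, max_drop=0.50):
    """Re-evaluate a strategy with each parameter shifted by +/- rel_shift.

    sharpe_fn: hypothetical callable, parameter dict -> Sharpe ratio.
    optimal:   dict of optimised parameter values.
    Returns (is_plateau, worst_drop, drops) where each drop is the
    fractional loss of Sharpe relative to the optimum.
    """
    base = sharpe_fn(optimal)
    if base <= 0:
        raise ValueError("baseline Sharpe must be positive")
    drops = {}
    for name, value in optimal.items():
        for sign in (-1, 1):
            shifted = dict(optimal)
            shifted[name] = value * (1 + sign * rel_shift)
            drops[(name, sign)] = (base - sharpe_fn(shifted)) / base
    worst = max(drops.values())
    return worst <= max_drop, worst, drops


# Toy surfaces standing in for real backtests (illustrative only).
def plateau(p):
    return 1.5 - 0.0005 * (p["length"] - 50) ** 2


def spike(p):
    return 2.4 if abs(p["length"] - 50) < 1 else 0.6


print(sensitivity_report(plateau, {"length": 50})[:2])
print(sensitivity_report(spike, {"length": 50})[:2])
```

This version shifts one parameter at a time. Shifting several together would probe the surface more thoroughly, at the cost of more backtest runs, which themselves count as trials.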

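Walk-forward efficiency: WFE is commonly defined as out-of-sample performance divided by in-sample performance for each walk-forward window. A WFE consistently below 0.5 across multiple windows suggests the parameters are capturing in-sample-specific patterns.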
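As a sketch, assuming WFE is taken as the out-of-sample metric divided by the in-sample metric for each window (a common definition, not necessarily the exact one a given tool uses), the check might look like this. The function names and the sample numbers are hypothetical.

```python
def walk_forward_efficiency(in_sample, out_of_sample):
    """WFE for one window: out-of-sample metric over in-sample metric."""
    if in_sample <= 0:
        raise ValueError("in-sample metric must be positive for WFE to be meaningful")
    return out_of_sample / in_sample


def wfe_summary(windows, threshold=0.5):
    """Return per-window WFEs and the fraction of windows below threshold.

    windows: iterable of (in_sample, out_of_sample) pairs, e.g. annualised
    returns or Sharpe ratios measured the same way in both segments.
    """
    wfes = [walk_forward_efficiency(i, o) for i, o in windows]
    below = sum(w < threshold for w in wfes)
    return wfes, below / len(wfes)


# Made-up numbers for three windows, purely for illustration.
wfes, fraction_low = wfe_summary([(2.0, 0.6), (1.8, 0.45), (2.2, 1.32)])
print([round(w, 2) for w in wfes], round(fraction_low, 2))
```

How many low windows count as "consistently" is a judgment call; the fraction is returned so that threshold can be set explicitly.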

DSR tier: After accounting for the number of trials, does the result still show statistical significance? A “Noise” tier from the Deflated Sharpe Ratio (DSR) with a large number of trials N is the quantitative diagnosis.
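For reference, the sketch below follows the published Deflated Sharpe Ratio formulation of Bailey and López de Prado. It is not necessarily how Kestrel Signal computes its value, and the mapping from the resulting probability to named tiers is not shown because the thresholds are not given here. The function and parameter names are hypothetical. All Sharpe inputs are per-period (not annualised), and sharpe_variance is the variance of the per-period Sharpe estimates across the trials.

```python
import math
from statistics import NormalDist

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant


def expected_max_sharpe(n_trials, sharpe_variance):
    """Expected best Sharpe across n_trials when the true Sharpe is zero."""
    if n_trials < 2:
        raise ValueError("n_trials must be at least 2")
    norm = NormalDist()
    a = norm.inv_cdf(1 - 1 / n_trials)
    b = norm.inv_cdf(1 - 1 / (n_trials * math.e))
    return math.sqrt(sharpe_variance) * ((1 - EULER_GAMMA) * a + EULER_GAMMA * b)


def deflated_sharpe_ratio(sharpe, n_obs, n_trials, sharpe_variance,
                          skew=0.0, kurtosis=3.0):
    """Probability that the observed Sharpe exceeds the noise benchmark.

    sharpe:   observed per-period Sharpe of the selected strategy.
    n_obs:    number of return observations.
    skew, kurtosis: moments of the strategy's returns (normal: 0 and 3).
    """
    sr0 = expected_max_sharpe(n_trials, sharpe_variance)
    denom = math.sqrt(1 - skew * sharpe + (kurtosis - 1) / 4 * sharpe ** 2)
    z = (sharpe - sr0) * math.sqrt(n_obs - 1) / denom
    return NormalDist().cdf(z)


# Same daily Sharpe of 0.1 on 1,260 observations, different trial counts.
for n in (10, 100, 1000):
    print(n, round(deflated_sharpe_ratio(0.1, 1260, n, 1 / 1260), 3))
```

The observed Sharpe is identical in every row; only the number of trials changes, and the probability that the result is more than noise falls as that number grows.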

Trade count: Strategies with very few trades are more susceptible to overfitting because each trade has a disproportionate impact on the Sharpe calculation. The fewer the trades, the noisier the Sharpe estimate and the easier it is to fit.
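One way to put a number on this is the standard approximation for the sampling error of a Sharpe estimate, which assumes independent, normally distributed returns. In the sketch below, sharpe is the per-trade Sharpe (mean trade return divided by its standard deviation, not annualised) and the function name is hypothetical.

```python
import math


def sharpe_standard_error(sharpe, n):
    """Approximate standard error of a Sharpe estimate from n observations.

    Assumes independent, normally distributed returns; real trade returns
    are usually fatter-tailed, so treat this as a lower bound on the noise.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    return math.sqrt((1 + 0.5 * sharpe ** 2) / n)


for n_trades in (30, 100, 1000):
    print(n_trades, "trades -> standard error", round(sharpe_standard_error(0.2, n_trades), 3))
```

The error shrinks only with the square root of the trade count, so halving it takes roughly four times as many trades.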

What you can't fix with more data

It's tempting to think that a longer backtest cures overfitting. It helps, but doesn't eliminate the problem. If you've tested 200 strategy variations on 20 years of data, the DSR adjustment for N = 200 trials will still deflate the Sharpe significantly. The amount of data available is fixed; the number of trials you can run is bounded only by your patience.

The only real cure is structural: commit to a hypothesis before looking at results, hold out data you haven't touched, keep an honest trial log, and apply the DSR correction rigorously. Kestrel Signal's result hash and DSR computation on every run are designed to make this discipline easier to maintain.
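An honest trial log can be very simple. The sketch below is a hypothetical in-memory illustration, not Kestrel Signal's result hash or storage format: it fingerprints each configuration with SHA-256 and counts distinct configurations per dataset, which is the N that feeds the DSR correction. All names (config_hash, TrialLog, data_id) are assumptions made for the example.

```python
import hashlib
import json


def config_hash(hypothesis, params, data_id):
    """Stable fingerprint of one strategy variation on one dataset."""
    canonical = json.dumps(
        {"hypothesis": hypothesis, "params": params, "data_id": data_id},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class TrialLog:
    """Append-only record of every variation tested (in memory for brevity)."""

    def __init__(self):
        self._records = []

    def add(self, hypothesis, params, data_id, sharpe):
        record = {
            "hash": config_hash(hypothesis, params, data_id),
            "hypothesis": hypothesis,
            "params": params,
            "data_id": data_id,
            "sharpe": sharpe,
        }
        self._records.append(record)
        return record

    def trial_count(self, data_id):
        """Distinct variations tested on this dataset: the N for the DSR."""
        return len({r["hash"] for r in self._records if r["data_id"] == data_id})


log = TrialLog()
log.add("trend persists after breakout", {"length": 50}, "SPY-daily", 1.1)
log.add("trend persists after breakout", {"length": 60}, "SPY-daily", 0.9)
log.add("trend persists after breakout", {"length": 50}, "SPY-daily", 1.1)  # re-run
print(log.trial_count("SPY-daily"))
```

Writing the hypothesis into the record before the backtest runs is the part that enforces the discipline; the hash only makes it harder to lose track of what was already tried.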

Regularisation

In machine learning, regularisation penalises model complexity to prevent overfitting. Systematic trading has analogues: prefer fewer parameters, wider parameter ranges, and strategies that work across multiple instruments and timeframes without re-optimisation. A strategy that works similarly on SPY, QQQ, and GLD with the same parameters is more likely to generalise than one tuned tightly to a single instrument.