Regime-Conditional Strategy Evaluation
Pooled performance metrics hide the conditional distributions that matter; regime-conditional evaluation separates strategy quality from historical regime-mix luck.
A backtest aggregated across a decade of mixed market conditions tells you almost nothing actionable. A strategy with a 1.4 Sharpe ratio computed over 2015-2024 might have generated all of its returns during two specific volatility regimes and bled capital everywhere else. Pooling regimes hides the conditional distribution that actually matters: how the strategy behaves given the regime you are about to trade through. Regime-conditional evaluation reframes performance as a set of state-dependent estimates rather than a single number.
The Pooling Problem
Unconditional metrics treat all observations as exchangeable draws from a stationary distribution. Markets are not stationary. When you compute a single Sharpe ratio across regimes with structurally different return-generating processes, you are averaging over distributions with different means, variances, and tail behaviors. The result is an estimate of a blended quantity that describes none of the environments the strategy will actually trade in.
Consider a trend-following system evaluated from 2010 to 2024. The aggregate metrics smear together the low-volatility QE era, the 2018 vol shock, the 2020 dislocation, and the 2022 rate regime. Each of these is a different sampling environment. The pooled mean return is a weighted average of the per-regime means, and the weights are simply the fraction of the sample window each regime happened to occupy.
Defining Regimes Without Overfitting
Regime definitions must be specified before evaluation, not discovered by inspecting returns. Common ex-ante partitions include realized volatility terciles, yield curve states, dispersion measures, or trend strength indicators. The discipline is that the regime label at time t depends only on information available at time t, and the partition rule is fixed before any performance calculation.
A useful sanity check: compute regime labels using only data up to t-1, then verify that regime transitions are persistent enough to be tradeable. If your regime flips every three days, you have not identified a regime — you have identified noise. Persistence in the range of weeks to months is typical for meaningful market states.
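The lookback-only rule and the persistence check can be sketched as follows. This is a minimal illustration, not a prescribed implementation: the function names, the 20-bar window, and the fixed volatility thresholds are all hypothetical, and fixed thresholds stand in for the terciles mentioned above to keep the example short.

```python
import statistics


def lookback_vol_labels(returns, window=20, low=0.008, high=0.015):
    """Label bar t from returns[t-window:t] only, i.e. data up to t-1.

    `window`, `low`, and `high` are hypothetical values; whatever rule is
    used must be fixed before any strategy performance is inspected.
    """
    labels = []
    for t in range(len(returns)):
        if t < window:
            labels.append(None)  # not enough history to label this bar
            continue
        vol = statistics.pstdev(returns[t - window:t])  # excludes returns[t]
        if vol < low:
            labels.append("low")
        elif vol < high:
            labels.append("mid")
        else:
            labels.append("high")
    return labels


def mean_run_length(labels):
    """Average number of consecutive labeled bars spent in one regime."""
    runs = []
    current, length = None, 0
    for lab in labels:
        if lab is None:
            continue  # skip the unlabeled warm-up bars
        if lab == current:
            length += 1
        else:
            if length:
                runs.append(length)
            current, length = lab, 1
    if length:
        runs.append(length)
    return sum(runs) / len(runs) if runs else 0.0
```

A mean run length of only a few bars suggests the rule is tracking noise; on daily data, values in the tens of bars or more are closer to the weeks-to-months persistence described above.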
Conditional Performance Estimators
Once regimes R_1, R_2, ..., R_k are defined, compute performance metrics within each subset. The conditional Sharpe under regime i is the ratio of mean excess return to standard deviation, restricted to periods where the regime indicator equals i.
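A minimal sketch of the per-regime calculation, assuming a list of per-bar excess returns and a parallel list of regime labels. The function name and the default annualization factor of 252 are illustrative assumptions, not part of any particular library.

```python
import math
import statistics
from collections import defaultdict


def conditional_sharpes(excess_returns, labels, periods_per_year=252):
    """Sharpe ratio of excess returns restricted to each regime label.

    `periods_per_year=252` assumes daily bars; pass 1 for a per-period
    (non-annualized) Sharpe.
    """
    buckets = defaultdict(list)
    for r, lab in zip(excess_returns, labels):
        if lab is not None:  # ignore bars with no regime label
            buckets[lab].append(r)

    out = {}
    for lab, rs in buckets.items():
        if len(rs) < 2:
            continue  # a sample standard deviation needs two points
        sd = statistics.stdev(rs)
        if sd == 0:
            continue  # Sharpe is undefined for zero volatility
        sharpe = statistics.mean(rs) / sd * math.sqrt(periods_per_year)
        out[lab] = {"n": len(rs), "sharpe": sharpe}
    return out
```

Returning the observation count next to each Sharpe keeps the sample size visible wherever the estimate is reported.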
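The pooled Sharpe relates to the conditional Sharpes through the law of total variance, not as a simple weighted average. Write p_i for the fraction of periods spent in regime R_i, \mu_i and \sigma_i for the conditional mean and standard deviation of excess returns in that regime, and \mu = \sum_i p_i \mu_i for the pooled mean. The relationship is:

$$
\sigma^2_{\text{pooled}} = \sum_{i=1}^{k} p_i \, \sigma_i^2 + \sum_{i=1}^{k} p_i \, (\mu_i - \mu)^2,
\qquad
SR_{\text{pooled}} = \frac{\mu}{\sigma_{\text{pooled}}}
$$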
The second term — variance of conditional means — inflates the denominator of the pooled Sharpe whenever conditional expected returns differ across regimes. This is why a strategy with strong conditional performance in two regimes and weak performance in a third often shows a mediocre pooled Sharpe: the cross-regime dispersion in means itself contributes to the variance.
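The decomposition can be checked numerically: total variance equals the probability-weighted average of within-regime variances plus the probability-weighted variance of the regime means. The sketch below uses population variances so the identity holds exactly; the function name is hypothetical.

```python
import statistics


def variance_decomposition(returns, labels):
    """Return (within, between, total) population variances.

    within  = sum_i p_i * var_i              (average conditional variance)
    between = sum_i p_i * (mean_i - mean)^2  (variance of conditional means)
    """
    n = len(returns)
    grand_mean = statistics.fmean(returns)
    groups = {}
    for r, lab in zip(returns, labels):
        groups.setdefault(lab, []).append(r)

    within = sum(len(g) / n * statistics.pvariance(g) for g in groups.values())
    between = sum(
        len(g) / n * (statistics.fmean(g) - grand_mean) ** 2
        for g in groups.values()
    )
    return within, between, statistics.pvariance(returns)


# Toy data: two regimes with equal variance but opposite mean returns.
within, between, total = variance_decomposition(
    [0.01, 0.03, -0.02, -0.04], ["a", "a", "b", "b"]
)
```

In this toy sample the between-regime term is larger than the within-regime term, so most of the pooled variance comes from the gap in conditional means rather than from volatility inside either regime.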
Sample Size and Confidence
Regime conditioning shrinks your effective sample size. A ten-year daily backtest has roughly 2,500 observations, but if a regime occupies 20% of the period, you have 500 observations to estimate its conditional Sharpe. For roughly independent returns, the standard error of a Sharpe ratio estimate is approximately sqrt((1 + SR²/2) / n), with SR measured at the same frequency as the n observations, so confidence intervals widen quickly.
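Report conditional Sharpes with their standard errors. Plugged directly into the formula, a conditional Sharpe of 1.8 over 200 observations has a standard error near 0.11, which is informative. The same point estimate over 40 observations carries a standard error near 0.26 and should be treated as a hypothesis rather than a result. Both figures assume the 1.8 is measured per observation. If it is an annualized figure computed from daily returns, the standard error must be annualized as well, and at 200 daily observations it is roughly 1.1. Regimes that occur rarely in your sample produce estimates that look precise but are not.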
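The standard-error approximation is easy to wrap in a helper. The sketch below assumes roughly independent returns, and both function names are hypothetical. The second function handles the common case of an annualized Sharpe estimated from daily bars by converting to per-period units, applying the formula, and scaling back.

```python
import math


def sharpe_standard_error(sr_per_period, n):
    """Approximate SE of a Sharpe ratio: sqrt((1 + SR^2 / 2) / n).

    `sr_per_period` must be measured at the same frequency as the n
    observations (for example, a daily Sharpe with n daily returns).
    """
    return math.sqrt((1.0 + 0.5 * sr_per_period ** 2) / n)


def annualized_sharpe_standard_error(sr_annual, n, periods_per_year=252):
    """SE of an annualized Sharpe estimated from n per-period returns.

    `periods_per_year=252` assumes daily data.
    """
    scale = math.sqrt(periods_per_year)
    return sharpe_standard_error(sr_annual / scale, n) * scale


se_200 = sharpe_standard_error(1.8, 200)
se_40 = sharpe_standard_error(1.8, 40)
se_200_annualized = annualized_sharpe_standard_error(1.8, 200)
```

The gap between `se_200` and `se_200_annualized` shows why the units of the Sharpe ratio need to be stated next to any reported standard error.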
Operationalizing the Framework
In Kestrel Signal, regime-conditional evaluation means three concrete practices. First, tag every bar in your dataset with a regime label computed from a lookback-only rule. Second, produce per-regime equity curves, drawdown statistics, and turnover metrics alongside the pooled summary. Third, stress-test position sizing assumptions against the worst observed regime, not the average.
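As an illustration of the second and third practices, the sketch below computes a maximum drawdown per regime and picks out the worst one. It is not Kestrel Signal's API; the function names are hypothetical. It also stitches together all bars sharing a label, even when they are not contiguous in time, which is a simplification.

```python
def max_drawdown(returns):
    """Largest peak-to-trough loss of the compounded equity curve (<= 0)."""
    equity, peak, worst = 1.0, 1.0, 0.0
    for r in returns:
        equity *= 1.0 + r
        peak = max(peak, equity)
        worst = min(worst, equity / peak - 1.0)
    return worst


def per_regime_drawdowns(returns, labels):
    """Max drawdown of the equity curve built from each regime's bars only."""
    by_regime = {}
    for r, lab in zip(returns, labels):
        if lab is not None:  # ignore bars with no regime label
            by_regime.setdefault(lab, []).append(r)
    return {lab: max_drawdown(rs) for lab, rs in by_regime.items()}


# Toy data: regime "b" carries all of the losses.
drawdowns = per_regime_drawdowns([0.1, -0.5, 0.2, -0.1], ["a", "b", "a", "b"])
worst_regime = min(drawdowns, key=drawdowns.get)
```

Position sizing can then be stress-tested against `drawdowns[worst_regime]` rather than against the pooled drawdown.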
The output is not a single performance number but a vector of conditional estimates. Allocation decisions then incorporate beliefs about regime probabilities going forward — explicitly, rather than implicitly through the historical regime mix baked into pooled numbers. This separates strategy quality from regime-mix luck, which is the actual goal of evaluation.
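One way to make the regime-mix assumption explicit is to recombine the conditional estimates under a chosen set of regime probabilities, using the same total-variance logic that links pooled and conditional Sharpes. The sketch below is illustrative only: the function name, the input layout, and the numbers are all made up for the example.

```python
import math


def mixture_sharpe(cond, probs):
    """Per-period Sharpe implied by conditional (mean, std) pairs and
    a set of regime probabilities.

    cond:  {regime: (mean, std)} conditional excess-return estimates
    probs: {regime: probability}; must sum to 1
    """
    if abs(sum(probs.values()) - 1.0) > 1e-9:
        raise ValueError("regime probabilities must sum to 1")
    mu = sum(probs[k] * cond[k][0] for k in cond)
    # Total variance = average conditional variance
    #                  + variance of conditional means.
    var = sum(probs[k] * (cond[k][1] ** 2 + (cond[k][0] - mu) ** 2) for k in cond)
    return mu / math.sqrt(var)


# Hypothetical conditional estimates: (mean, std) of per-bar excess returns.
cond = {"calm": (0.001, 0.01), "stressed": (-0.001, 0.02)}

historical_mix = {"calm": 0.8, "stressed": 0.2}  # what the sample contained
forward_view = {"calm": 0.5, "stressed": 0.5}    # an explicit forward belief

sr_hist = mixture_sharpe(cond, historical_mix)
sr_forward = mixture_sharpe(cond, forward_view)
```

With identical conditional estimates, the implied Sharpe changes only because the regime probabilities change, which is exactly the regime-mix effect that a pooled number leaves implicit.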