Reading a backtest: Sharpe, drawdown, and the metrics that matter

Total return tells you almost nothing. Here is how to read backtest output correctly using Sharpe, Sortino, max drawdown, CAGR, and out-of-sample tests.

A strategy returns 340% over two years. Is that good? It is impossible to say without knowing how it got there. Did it compound steadily, or did it ride one enormous concentrated bet in January and spend the rest of the time in a 60% drawdown? Total return, taken alone, is nearly useless for evaluating a strategy. The metrics that actually matter describe the path — how much risk was taken, how painful the losing periods were, and whether the gains were achieved in a way that is repeatable.

Sharpe ratio: return per unit of risk

The Sharpe ratio is the most widely used risk-adjusted return metric. The formula is:

Sharpe = (mean_return - risk_free_rate) / stddev_of_returns

In crypto, the risk-free rate is often set to zero for simplicity, since the asset class has no true risk-free benchmark. The more important detail is the annualization factor. Sharpe is calculated on a per-period return series and then scaled to annual:

Annualized Sharpe = period_Sharpe × sqrt(periods_per_year)

For crypto, that is sqrt(365) if you are using daily returns, or sqrt(365 × 24) if using hourly. The 252-day factor is the equity-market convention and counts only trading days. Applied to a market that never closes, it understates annualized volatility and scales the Sharpe by sqrt(252) instead of sqrt(365), which is about 17% too low. Always use 365 for daily crypto data.
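The annualization step can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `annualized_sharpe`, its parameters, and the sample returns are all assumptions made up for the example, and the risk-free rate is expressed per period.

```python
import math
import statistics


def annualized_sharpe(returns, periods_per_year=365, risk_free_rate=0.0):
    """Annualized Sharpe from per-period simple returns.

    risk_free_rate is per period, at the same frequency as `returns`.
    """
    excess = [r - risk_free_rate for r in returns]
    mean = statistics.fmean(excess)
    std = statistics.stdev(excess)  # sample standard deviation
    if std == 0:
        raise ValueError("returns have zero variance; Sharpe is undefined")
    return (mean / std) * math.sqrt(periods_per_year)


# Hypothetical daily returns, for illustration only.
daily = [0.012, -0.008, 0.004, 0.015, -0.011, 0.007, 0.002, -0.003]

sharpe_365 = annualized_sharpe(daily, periods_per_year=365)
sharpe_252 = annualized_sharpe(daily, periods_per_year=252)

# The two figures differ only by the scaling factor sqrt(365 / 252).
print(sharpe_365 / sharpe_252)
```

The ratio printed at the end is roughly 1.2, which is why a Sharpe quoted with the 252 convention cannot be compared directly with one quoted with 365.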

Rough benchmarks for what "good" looks like in practice:

  • Sharpe below 0.5: the strategy is generating noise, not edge
  • 0.5 to 1.0: modest, possibly worth developing further
  • 1.0 to 2.0: solid for a systematic strategy in live trading
  • 2.0 to 3.0: excellent; expect some in-sample inflation
  • Above 3.0: treat as a warning sign of overfitting (see "The overfitting tell" below)

Sortino ratio: penalizing only bad volatility

The Sortino ratio is a refinement of Sharpe that divides by the standard deviation of downside returns only, rather than all returns. The logic is sensible: upside volatility (large positive returns) is not actually a problem for a trader. Why penalize a strategy for winning big?

Sortino = (mean_return - risk_free_rate) / stddev_of_negative_returns

Sortino is almost always higher than Sharpe for the same strategy, because its denominator leaves out upside volatility. If a strategy has a Sharpe of 1.2 and a Sortino of 1.8, the gap tells you that downside volatility is well below total volatility, so the big moves are not concentrated in losses. A roughly symmetric return distribution already produces a gap of about this size; a wider gap points to positive skew, which is a good sign. If Sortino and Sharpe are nearly identical, downside volatility is about as large as total volatility, which indicates negative skew: the largest moves tend to be losses. For leveraged strategies in crypto, aim for Sortino above 1.5; above 2.0 is strong.
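A sketch of the comparison in Python follows. It assumes one common convention for the denominator: the downside deviation, computed as the root-mean-square of returns below a target of zero, taken over all periods. A literal standard deviation of the negative returns alone gives a somewhat different number. The function names and the sample returns are hypothetical.

```python
import math
import statistics


def downside_deviation(returns, target=0.0):
    """Root-mean-square shortfall below `target`, over all periods."""
    shortfalls = [min(r - target, 0.0) for r in returns]
    return math.sqrt(sum(s * s for s in shortfalls) / len(returns))


def annualized_sortino(returns, periods_per_year=365, target=0.0):
    dd = downside_deviation(returns, target)
    if dd == 0:
        raise ValueError("no returns below target; Sortino is undefined")
    mean_excess = statistics.fmean(returns) - target
    return (mean_excess / dd) * math.sqrt(periods_per_year)


def annualized_sharpe(returns, periods_per_year=365):
    mean = statistics.fmean(returns)
    return (mean / statistics.stdev(returns)) * math.sqrt(periods_per_year)


# Hypothetical daily returns with a few large gains and only small losses.
skewed = [0.03, -0.005, 0.02, -0.004, 0.001, -0.006, 0.025, -0.003]

sharpe = annualized_sharpe(skewed)
sortino = annualized_sortino(skewed)
print(f"Sharpe {sharpe:.2f}, Sortino {sortino:.2f}, ratio {sortino / sharpe:.2f}")
```

Because the losses in this sample are small relative to the gains, the Sortino comes out several times larger than the Sharpe. Whichever downside convention you choose, use the same one for every strategy you compare.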

Max drawdown: the real pain measure

Max drawdown is the largest peak-to-trough decline in equity over the test period, measured from the highest point reached before the decline. A strategy that grew from $100k to $200k and then fell to $80k is showing a max drawdown of 60% from the $200k peak, not 20% from the starting point.

Max drawdown matters for two reasons that go beyond aesthetics. First, it represents the actual loss a real trader would have experienced if they started at the worst possible moment and held through the trough. Second, losses and recoveries are asymmetric: a strategy with a 30% max drawdown requires a gain of about 43% from the trough just to get back to its prior peak, so a deep drawdown leaves the strategy dependent on a favorable sequence of returns afterward.

Equally important is the recovery time — how many days or weeks were spent underwater. A 25% drawdown that recovers in three weeks is qualitatively different from a 25% drawdown that takes 18 months. Prolonged drawdowns destroy real accounts not because the math fails but because traders quit or get margin-called.
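The three quantities discussed in this section (depth, time underwater, and the gain needed to recover) can be computed from an equity curve as sketched below. The function names and the equity values are hypothetical, and the curve is assumed to be a list of positive account values at evenly spaced periods.

```python
def max_drawdown(equity):
    """Return (drawdown_fraction, peak_index, trough_index) for the worst decline."""
    peak, peak_i = equity[0], 0
    worst, worst_peak_i, worst_trough_i = 0.0, 0, 0
    for i, value in enumerate(equity):
        if value > peak:
            peak, peak_i = value, i
        drawdown = (peak - value) / peak
        if drawdown > worst:
            worst, worst_peak_i, worst_trough_i = drawdown, peak_i, i
    return worst, worst_peak_i, worst_trough_i


def longest_underwater(equity):
    """Longest run of consecutive periods spent below a prior peak."""
    peak = equity[0]
    longest = current = 0
    for value in equity[1:]:
        if value >= peak:
            peak = value
            current = 0
        else:
            current += 1
            longest = max(longest, current)
    return longest


def required_recovery(drawdown):
    """Gain needed from the trough to regain the prior peak."""
    return drawdown / (1.0 - drawdown)


# Hypothetical equity curve: rises to 200k, falls to 80k, then recovers.
equity = [100_000, 120_000, 200_000, 150_000, 80_000, 110_000, 210_000]

dd, peak_i, trough_i = max_drawdown(equity)
print(f"max drawdown {dd:.0%} from index {peak_i} to index {trough_i}")
print(f"longest time underwater: {longest_underwater(equity)} periods")
print(f"gain needed after a 30% drawdown: {required_recovery(0.30):.1%}")
```

Measuring from the running peak rather than from the starting balance is what makes the drawdown here 60% rather than 20%.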

Rough drawdown benchmarks for leveraged crypto strategies:

  • Under 15%: exceptional, likely a conservative or low-leverage strategy
  • 15% to 30%: acceptable for actively managed leveraged strategies
  • 30% to 50%: aggressive; requires high Sharpe and fast recovery to be viable
  • Above 50%: very difficult to trade live without significant psychological and financial strain

CAGR: making multi-year comparisons honest

CAGR (compound annual growth rate) normalizes total return across different time horizons so you can compare them fairly.

CAGR = (final_equity / initial_equity)^(1 / years) - 1

A strategy that returns 300% over four years has a CAGR of about 41%. A strategy that returns 300% over one year has a CAGR of 300%. They are not the same thing, and using raw return to compare them is misleading. CAGR is especially useful when running walk-forward tests across different market regimes — you want to know whether the annualized performance was consistent across periods, not just the aggregate.

The most useful single summary metric for a leveraged strategy is the CAGR-to-max-drawdown ratio, often called the Calmar ratio. If your CAGR is 60% and your max drawdown is 30%, the ratio is 2.0. Targeting a ratio above 1.5 is a reasonable filter for strategies worth developing further.
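The arithmetic in this section can be checked with a short Python sketch. The function names are hypothetical, and a 300% return is taken to mean that equity grows to four times its starting value.

```python
def cagr(initial_equity, final_equity, years):
    """Compound annual growth rate as a fraction (0.41 means 41%)."""
    if initial_equity <= 0 or years <= 0:
        raise ValueError("initial_equity and years must be positive")
    return (final_equity / initial_equity) ** (1.0 / years) - 1.0


def cagr_to_drawdown(cagr_value, max_drawdown):
    """Annualized growth earned per unit of worst peak-to-trough loss."""
    if max_drawdown <= 0:
        raise ValueError("max_drawdown must be a positive fraction")
    return cagr_value / max_drawdown


# A 300% total return: 100k grows to 400k.
four_year = cagr(100_000, 400_000, years=4)
one_year = cagr(100_000, 400_000, years=1)
ratio = cagr_to_drawdown(0.60, 0.30)

print(f"300% over four years: CAGR {four_year:.1%}")
print(f"300% over one year:   CAGR {one_year:.1%}")
print(f"CAGR 60%, max drawdown 30%: ratio {ratio:.1f}")
```

The same total return produces very different CAGR figures depending on the horizon, which is the reason to normalize before comparing strategies.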

Volatility and what it tells you

Strategy-level volatility (the annualized standard deviation of daily returns) is the denominator of Sharpe. But it is worth looking at directly. A strategy with 80% annualized volatility is riding a violent return path even if the Sharpe looks acceptable. In practice, high-volatility strategies require larger equity buffers to absorb variance without getting liquidated or stopped out at the wrong time.

For leveraged crypto strategies, annualized return volatility between 30% and 60% is common and manageable. Above 80% starts to become uncomfortable to live with, even if the expected return is high.
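To get a feel for what these annualized figures mean day to day, the scaling can be run in reverse. The sketch below assumes daily returns are roughly independent, so volatility scales with the square root of time; the function names are hypothetical.

```python
import math
import statistics


def annualized_volatility(returns, periods_per_year=365):
    """Sample standard deviation of per-period returns, scaled to a year."""
    return statistics.stdev(returns) * math.sqrt(periods_per_year)


def per_period_volatility(annual_vol, periods_per_year=365):
    """Per-period standard deviation implied by an annualized volatility."""
    return annual_vol / math.sqrt(periods_per_year)


for annual in (0.30, 0.60, 0.80):
    daily = per_period_volatility(annual)
    print(f"{annual:.0%} annualized is about {daily:.1%} per day")
```

At 80% annualized, a one-standard-deviation day is a move of a little over 4% in account equity, which is why such strategies need larger equity buffers.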

The overfitting tell

Overfitting is the core danger of backtesting. A curve-fit strategy finds patterns in historical noise — it memorizes the past rather than learning from it. The red flags:

Too-perfect metrics. Sharpe above 3, max drawdown under 5%, win rate above 70% across thousands of trades. Real edge does not look this clean.

Too many parameters. A strategy with 12 tunable parameters fitted to 18 months of data has almost certainly found spurious patterns. The degrees of freedom consumed by fitting exceed the information content of the sample.

In-sample vs. out-of-sample collapse. The definitive test: hold out 20-30% of your data, run all development on the in-sample portion, then evaluate the final model once on the held-out data. If the Sharpe drops by more than 50%, the strategy is not generalizing.
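The hold-out check can be sketched as follows. The split is chronological (the most recent data is held out), the 25% fraction is one choice within the 20-30% range mentioned above, and the function names and the synthetic return series are assumptions made up for the example.

```python
import math
import statistics


def annualized_sharpe(returns, periods_per_year=365):
    mean = statistics.fmean(returns)
    return (mean / statistics.stdev(returns)) * math.sqrt(periods_per_year)


def holdout_split(returns, holdout_fraction=0.25):
    """Chronological split: the most recent fraction is held out."""
    if not 0 < holdout_fraction < 1:
        raise ValueError("holdout_fraction must be between 0 and 1")
    cut = int(len(returns) * (1 - holdout_fraction))
    return returns[:cut], returns[cut:]


def sharpe_decay(in_sample, out_of_sample):
    """Fractional drop in Sharpe from in-sample to out-of-sample."""
    is_sharpe = annualized_sharpe(in_sample)
    if is_sharpe <= 0:
        raise ValueError("in-sample Sharpe must be positive to measure decay")
    return (is_sharpe - annualized_sharpe(out_of_sample)) / is_sharpe


# Synthetic daily returns: a strong early stretch followed by a weaker one.
returns = [0.010, -0.004, 0.006, -0.002] * 30 + [0.004, -0.005, 0.003, -0.001] * 10

in_sample, out_of_sample = holdout_split(returns, holdout_fraction=0.25)
decay = sharpe_decay(in_sample, out_of_sample)
print(f"Sharpe decay out of sample: {decay:.0%}")
if decay > 0.5:
    print("Sharpe fell by more than half: the strategy is not generalizing")
```

The held-out portion should be evaluated once, at the end. Re-tuning after looking at it turns it into in-sample data.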

Walk-forward testing and the deflated Sharpe

A single in-sample backtest is almost never enough. Walk-forward testing splits the historical data into rolling windows: you train on the first 12 months, test on the next 3, then roll forward and repeat. The aggregate of all out-of-sample windows gives you a much more reliable picture of live performance.
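The rolling scheme can be expressed as a generator of index ranges. This is a sketch under simple assumptions: the data is daily, 12 months is approximated as 365 periods and 3 months as 90, and the window rolls forward by one test window at a time so that test ranges never overlap. The function name is hypothetical.

```python
def walk_forward_windows(n_periods, train_size, test_size):
    """Yield (train_start, train_end, test_start, test_end), end-exclusive."""
    start = 0
    while start + train_size + test_size <= n_periods:
        train_end = start + train_size
        test_end = train_end + test_size
        yield start, train_end, train_end, test_end
        start += test_size  # roll forward by one test window


# Two years of daily data, train on ~12 months, test on ~3 months.
windows = list(walk_forward_windows(n_periods=730, train_size=365, test_size=90))
for train_start, train_end, test_start, test_end in windows:
    print(f"train [{train_start}, {train_end})  test [{test_start}, {test_end})")
```

Fitting parameters on each train range and collecting returns only from the matching test range gives the aggregate out-of-sample series described above.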

Related is the concept of the deflated Sharpe, a statistical adjustment that accounts for the number of strategy variations you tested before arriving at the "winner." If you ran 50 parameter combinations and picked the best, you need to apply a penalty to the reported Sharpe, because the best of 50 pure-noise strategies will look good by chance alone. In its formal version (the Deflated Sharpe Ratio of Bailey and López de Prado), the result is a probability that the true Sharpe is above zero after that adjustment. Used informally as a haircut, quantitative researchers often require the penalized Sharpe to stay above 1.0 before taking a strategy to paper trading.
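The selection effect is easy to demonstrate with a small simulation. The sketch below is not the deflated Sharpe calculation itself; it only shows why a penalty is needed, by generating 50 strategies with zero true edge and reporting the best one. The function names, the 2% daily volatility, and the one-year length are assumptions chosen for the illustration.

```python
import math
import random
import statistics


def annualized_sharpe(returns, periods_per_year=365):
    mean = statistics.fmean(returns)
    return (mean / statistics.stdev(returns)) * math.sqrt(periods_per_year)


def best_of_noise(n_strategies=50, n_days=365, daily_vol=0.02, seed=0):
    """Simulate zero-edge strategies and return (best, average) Sharpe."""
    rng = random.Random(seed)
    sharpes = []
    for _ in range(n_strategies):
        # Each strategy is pure noise: zero mean return by construction.
        returns = [rng.gauss(0.0, daily_vol) for _ in range(n_days)]
        sharpes.append(annualized_sharpe(returns))
    return max(sharpes), statistics.fmean(sharpes)


best, average = best_of_noise()
print(f"best of 50 noise strategies: Sharpe {best:.2f}")
print(f"average across all 50:       Sharpe {average:.2f}")
```

The average sits near zero, as it should, while the best run typically shows a Sharpe that would pass for a real edge if the other 49 were never reported.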

A practical reading order

When you receive a backtest report, evaluate in this order:

  1. Max drawdown and recovery time. Can you survive this emotionally and financially?
  2. Sharpe ratio (annualized with 365 factor). Is the return meaningfully above its own volatility?
  3. CAGR-to-drawdown ratio. Is the return worth the worst-case pain?
  4. Sortino vs. Sharpe gap. Is the volatility coming from the good side or the bad side?
  5. Out-of-sample performance. Does the strategy hold up on data it has never seen?
  6. Number of parameters and trades. Fewer parameters and more trades mean the statistics are more reliable.

Total return is listed nowhere in that sequence. It is useful only after everything else checks out — and even then, it is mainly telling you how long the strategy has been running.