Short answer
The most useful ways to compare a backtest against a real baseline are the ones that show whether the strategy still works after costs, timing, turnover, and portfolio context are included, not just whether one chart looks cleaner in hindsight.
That usually means asking whether the strategy still beats the baseline once friction is charged and so changes the decision, not just whether it makes one in-sample score look impressive.
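
As a rough illustration of the "after friction" check, here is a minimal Python sketch that charges a proportional cost on turnover and then compares the strategy to the baseline on net returns. The function names, the cost model (cost = turnover times a flat rate), and the sample numbers are all assumptions made up for this example, not a standard method or real data.

```python
from statistics import mean


def net_returns(gross, turnover, cost_rate):
    """Subtract an assumed proportional trading cost from each period's gross return."""
    if len(gross) != len(turnover):
        raise ValueError("gross and turnover must have the same length")
    return [g - t * cost_rate for g, t in zip(gross, turnover)]


def compare_to_baseline(strat_gross, strat_turnover, base_gross, base_turnover, cost_rate):
    """Return mean per-period edge over the baseline, before and after costs."""
    strat_net = net_returns(strat_gross, strat_turnover, cost_rate)
    base_net = net_returns(base_gross, base_turnover, cost_rate)
    return {
        "gross_edge": mean(strat_gross) - mean(base_gross),
        "net_edge": mean(strat_net) - mean(base_net),
    }


# Hypothetical per-period returns and turnover (fraction of the portfolio traded).
strategy_gross = [0.02, -0.01, 0.03, 0.00]
strategy_turnover = [1.0, 1.0, 1.0, 1.0]   # fully rebalanced every period
baseline_gross = [0.01, -0.005, 0.015, 0.00]
baseline_turnover = [0.0, 0.0, 0.0, 0.0]   # buy and hold

result = compare_to_baseline(
    strategy_gross, strategy_turnover,
    baseline_gross, baseline_turnover,
    cost_rate=0.006,  # assumed 60 bps per unit of turnover
)
print(result)
```

With these made-up inputs the strategy is ahead of the baseline before costs and behind it after costs, which is the kind of reversal a gross-only comparison hides. A fuller check would also lag signals to respect execution timing and evaluate the strategy inside the portfolio it would actually sit in.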