Start with a paired comparison
The clean test is base strategy versus base strategy plus context layer, with everything else held constant. If the overlay changes costs, timing, rebalance rules, or allowable leverage, those changes need to be counted as part of the overlay rather than treated like free upgrades.
This sounds obvious, but many context layers look good only because they quietly smuggle in a second policy change at the same time.