2025-02-12 · Rafael Nogueira
Pairwise Grids Beat Lone Scores for Draft Reviews
Why we stopped asking reviewers for single numbers and started forcing comparisons between outputs.
evaluation · cohorts · quality
When teams evaluate model drafts with a lone 1–5 score, reviewers anchor differently week to week. We moved cohorts to pairwise grids so every score is relative to another draft in the same context window.
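The core change can be sketched in a few lines: store each judgment as a comparison between two drafts in the same batch, then derive a relative ranking from win tallies rather than asking anyone for an absolute number. This is a minimal illustration, not our actual tooling; the names (`Comparison`, `win_counts`) are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass(frozen=True)
class Comparison:
    """One reviewer judgment: which of two drafts in the same batch is preferred."""
    reviewer: str
    draft_a: str
    draft_b: str
    preferred: str  # must equal draft_a or draft_b

def win_counts(comparisons):
    """Derive a relative ranking from pairwise wins instead of absolute 1-5 scores."""
    wins = defaultdict(int)
    for c in comparisons:
        assert c.preferred in (c.draft_a, c.draft_b)
        wins[c.preferred] += 1
        # make sure losers still appear in the tally, even with zero wins
        loser = c.draft_b if c.preferred == c.draft_a else c.draft_a
        wins.setdefault(loser, 0)
    return sorted(wins.items(), key=lambda kv: kv[1], reverse=True)
```

Because every entry is anchored to another draft from the same batch, week-to-week drift in a reviewer's internal scale stops mattering: only the relative calls survive into the tally.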
The first week feels slower because reviewers compare more items. By week three, disagreements cluster in patterns a spreadsheet can actually show to leadership, without hiding the messy reality.
We also require a short note when reviewers flip their preference. That single constraint keeps the grid honest and surfaces prompt regressions earlier than aggregate averages ever did.
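The flip-note rule is easy to enforce mechanically: before appending a judgment, check whether the same reviewer previously preferred the other draft in the same pair, and reject the entry if no note accompanies the reversal. A sketch under assumed names (`record_preference` and the dict-based log are illustrative, not our real schema):

```python
def record_preference(log, reviewer, pair, preferred, note=None):
    """Append a preference to `log`; if it flips the reviewer's earlier call
    on the same pair, a short note explaining the change is mandatory."""
    key = (reviewer, frozenset(pair))
    previous = [e for e in log if (e["reviewer"], frozenset(e["pair"])) == key]
    if previous and previous[-1]["preferred"] != preferred and not note:
        raise ValueError(f"{reviewer} flipped on {pair}: a note is required")
    log.append({"reviewer": reviewer, "pair": pair,
                "preferred": preferred, "note": note})
```

The notes themselves are where prompt regressions tend to show up first, since a reviewer explaining a reversal usually names the exact behavior that changed.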
Downstream, product teams export the grid as markdown appendices for ship reviews. It is not glamorous work, but it keeps conversations grounded in observed differences instead of vibes.
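The markdown export is the least interesting part and also the part that makes the grid portable. A minimal version, assuming win tallies shaped like the `(draft_id, win_count)` pairs above (function name is illustrative):

```python
def grid_to_markdown(wins):
    """Render win tallies as a markdown table for a ship-review appendix.
    `wins` is a list of (draft_id, win_count) pairs, best first."""
    lines = ["| Draft | Pairwise wins |", "| --- | --- |"]
    for draft, count in wins:
        lines.append(f"| {draft} | {count} |")
    return "\n".join(lines)
```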