From demo-grade to lending-grade: what changed in our QA stack
A short, honest account of the things we had to add to be defensible in a room with bank risk officers — and what we still have ahead of us.
If you had asked us in March what our QA stack was, the answer was: ten KPIs graded by an architect model, overall score must clear 85, critical KPIs must clear 90, three retries. That answer was true and it was insufficient.
What changed: deterministic hard gates for identifiers, totals, and date envelopes, run before any LLM scoring. Three routing outcomes (auto_pass, review_queue, hard_reject) instead of binary approve/reject. Per-source-type thresholds, with CCRIS held to a 99.5% bar on critical fields. An independent verifier model running parallel to the maker, with critical-field disagreement forcing human review. Immutable audit storage of every maker, verifier, and architect output for every document.
What we have not done yet: per-field confidence and bounding-box citations exposed in the public API response. A weekly recalibration job anchored to a human-labeled gold set with published precision numbers. A public quality dashboard.
We will not claim to be lending-grade until those three are also live and a friendly bank's risk team has signed off on the numbers. Until then, the honest framing is: pilot-credible, not lending-final. We would rather under-promise that distinction than learn it the expensive way.