Evaluating LLM-Integrated Systems: What Works and What Doesn't
The large US technology company builds systems where LLM calls are in the critical path. These systems are hard to test in the traditional sense: the outputs are probabilistic, the failure modes are subtle, and for many queries the “correct” answer has no binary definition. After two years of building and evaluating LLM-integrated systems, here’s what actually works. ...