Tag: Research

05 April 2026

Retrieval-augmented generation (RAG) is the dominant pattern for building LLM applications over private data. You retrieve relevant documents, pass them as context to an LLM, and get a grounded answer. But which LLM should you use? Does the dataset matter? And how do you even measure whether the answers are good?
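The retrieve-then-generate loop can be sketched in a few lines. This is a minimal illustration, not the setup used in the experiments: the keyword-overlap retriever is a toy stand-in for a real embedding index, and `ask_llm` is a hypothetical placeholder for whatever LLM client you use.

```python
def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    """Rank documents by word overlap with the query (toy retriever)."""
    q_words = set(query.lower().split())
    ranked = sorted(
        docs,
        key=lambda d: len(q_words & set(d.lower().split())),
        reverse=True,
    )
    return ranked[:k]

def rag_answer(query: str, docs: list[str]) -> str:
    """Build a grounded prompt from the top-k retrieved documents."""
    context = "\n".join(retrieve(query, docs))
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
    return prompt  # in practice: return ask_llm(prompt)
```

In a real system the retriever would be a vector store and the prompt would go to an actual model; the shape of the loop is the same.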

To answer these questions, I ran 25,600 evaluations: 5 LLMs × 6 datasets × 2 retrieval conditions × 3 random seeds, with every response scored by 10 metrics — automated text similarity, a 3-judge panel, and RAGAS framework scores. The headline finding: automated metrics and the LLM judges disagree on which models are best.
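The experimental grid above is just a cross product of the four factors. A sketch of enumerating it (the model and dataset names here are illustrative placeholders, not the ones from the study):

```python
from itertools import product

# Placeholder names standing in for the actual 5 LLMs and 6 datasets.
models = [f"model_{i}" for i in range(5)]
datasets = [f"dataset_{i}" for i in range(6)]
conditions = ["with_retrieval", "no_retrieval"]
seeds = [0, 1, 2]

# 5 x 6 x 2 x 3 = 180 run configurations; each run's responses are
# then scored by the 10 metrics to produce the full evaluation set.
runs = list(product(models, datasets, conditions, seeds))
print(len(runs))
```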

02 April 2026

Anthropic’s 2026 report says developers use AI in 60% of their work but can fully delegate only 0–20% of tasks [1]. That’s a massive gap. I dug into the public research to understand why, and the answer isn’t what most people assume.

As someone who uses AI coding agents daily, I assumed the bottleneck was model capability. It’s not. The research points to something more fundamental.