Tag: GenAI

05 April 2026

Retrieval-augmented generation (RAG) is the dominant pattern for building LLM applications over private data. You retrieve relevant documents, pass them as context to an LLM, and get a grounded answer. But which LLM should you use? Does the dataset matter? And how do you even measure whether the answers are good?

To answer these questions, I ran 25,600 evaluations: 5 LLMs × 6 datasets × 2 retrieval conditions × 3 random seeds, with every response scored on 10 metrics spanning automated text-similarity measures, a 3-judge LLM panel, and RAGAS framework scores. The headline finding: the automated metrics and the LLM judges disagree about which models are best.
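The retrieve-then-answer loop described above can be sketched in a few lines. The corpus, the keyword-overlap retriever, and the prompt template here are illustrative stand-ins, not the setup used in the benchmark:

```python
import re

def retrieve(query: str, corpus: list[str], k: int = 2) -> list[str]:
    """Rank documents by naive keyword overlap with the query (toy retriever)."""
    q_terms = set(re.findall(r"\w+", query.lower()))

    def score(doc: str) -> int:
        return len(q_terms & set(re.findall(r"\w+", doc.lower())))

    return sorted(corpus, key=score, reverse=True)[:k]

def build_prompt(query: str, docs: list[str]) -> str:
    """Pack the retrieved documents into the model's context window."""
    context = "\n".join(f"- {d}" for d in docs)
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

corpus = [
    "Invoices are stored in the billing database.",
    "The cafeteria opens at 8am.",
    "Refunds are processed by the payments team.",
]
query = "Who processes refunds?"
prompt = build_prompt(query, retrieve(query, corpus))
print(prompt)
```

A production system would swap the overlap scorer for dense embeddings and send `prompt` to an LLM; the grounding step itself is just this assembly of retrieved text into the context.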

02 April 2026

Anthropic’s 2026 report says developers use AI in 60% of their work but can fully delegate only 0-20% of tasks [1]. That’s a massive gap. I dug into the public research to understand why — and the answer isn’t what most people assume.

As someone who uses AI coding agents daily, I assumed the bottleneck was model capability. It’s not. The research points to something more fundamental.

01 April 2026

There’s a growing gap in the AI agent conversation. Everyone’s talking about agents — autonomous systems that can reason, plan, and act. But most demos stop at “the agent wrote a nice email” or “it summarized a document.” The real challenge starts when you need an agent to interact with production backend systems, handle authentication, deal with partial failures, and return structured results that downstream systems can consume.

Over the past several months, I’ve been building exactly this: tool-equipped LLM agents operating in the payments domain. These agents can query transaction systems, look up order histories, check workflow statuses, retrieve standard operating procedures, and interact with ticketing systems — all through structured tool interfaces that the model invokes autonomously.

Here’s what I’ve learned.
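The "structured tool interfaces that the model invokes autonomously" can be sketched as a registry-plus-dispatcher pattern. All names and fields below are hypothetical illustrations of that shape, not the author's actual schema:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Tool:
    """A tool the agent can call: a name, a description the model reads,
    and a handler that returns structured JSON-like results."""
    name: str
    description: str
    handler: Callable[[dict], dict]

def lookup_order(args: dict) -> dict:
    # Placeholder backend call; a real handler would query the order system.
    return {"order_id": args["order_id"], "status": "settled"}

TOOLS = {
    "lookup_order": Tool(
        name="lookup_order",
        description="Fetch an order's payment status by order_id.",
        handler=lookup_order,
    ),
}

def dispatch(tool_call: dict) -> dict:
    """Route a model-emitted tool call to its handler.

    Returning structured data (rather than free text) is what lets
    downstream systems consume the result.
    """
    tool = TOOLS[tool_call["name"]]
    return tool.handler(tool_call["arguments"])

result = dispatch({"name": "lookup_order", "arguments": {"order_id": "ORD-42"}})
print(result)
```

The dispatcher is also the natural place to handle the hard parts the teaser mentions: authentication before the handler runs, and partial failures surfaced as structured error payloads instead of exceptions.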