Tag: GenAI

05 April 2026

Retrieval-augmented generation (RAG) is the dominant pattern for building LLM applications over private data. You retrieve relevant documents, pass them as context to an LLM, and get a grounded answer. But which LLM should you use? Does the dataset matter? And how do you even measure whether the answers are good?

To answer these questions, I ran 25,600 evaluations: 5 LLMs × 6 datasets × 2 retrieval conditions × 3 random seeds, with every response scored on 10 metrics spanning automated text-similarity measures, a 3-judge LLM panel, and RAGAS framework scores. The headline finding: the automated metrics and the LLM judges disagree about which models are best.
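The retrieve-then-answer loop described above can be sketched in a few lines. The corpus, the keyword-overlap retriever, and the prompt template here are illustrative stand-ins, not the setup used in the benchmark:

```python
import re

def retrieve(query: str, corpus: list[str], k: int = 2) -> list[str]:
    """Rank documents by naive keyword overlap with the query (toy retriever)."""
    q_terms = set(re.findall(r"\w+", query.lower()))

    def score(doc: str) -> int:
        return len(q_terms & set(re.findall(r"\w+", doc.lower())))

    return sorted(corpus, key=score, reverse=True)[:k]

def build_prompt(query: str, docs: list[str]) -> str:
    """Pack the retrieved documents into the model's context window."""
    context = "\n".join(f"- {d}" for d in docs)
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

corpus = [
    "Invoices are stored in the billing database.",
    "The cafeteria opens at 8am.",
    "Refunds are processed by the payments team.",
]
query = "Who processes refunds?"
prompt = build_prompt(query, retrieve(query, corpus))
print(prompt)
```

A production system would swap the overlap scorer for dense embeddings and send `prompt` to an LLM; the grounding step itself is just this assembly of retrieved text into the context.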

02 April 2026

Anthropic’s 2026 report says developers use AI in 60% of their work but can fully delegate only 0-20% of tasks [1]. That’s a massive gap. I dug into the public research to understand why — and the answer isn’t what most people assume.

As someone who uses AI coding agents daily, I assumed the bottleneck was model capability. It’s not. The research points to something more fundamental.

01 April 2026

There’s a growing gap in the AI agent conversation. Everyone’s talking about agents — autonomous systems that can reason, plan, and act. But most demos stop at “the agent wrote a nice email” or “it summarized a document.” The real challenge starts when you need an agent to interact with production backend systems, handle authentication, deal with partial failures, and return structured results that downstream systems can consume.

Over the past several months, I’ve been building exactly this: tool-equipped LLM agents operating in the payments domain. These agents can query transaction systems, look up order histories, check workflow statuses, retrieve standard operating procedures, and interact with ticketing systems — all through structured tool interfaces that the model invokes autonomously.

Here’s what I’ve learned.
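The "structured tool interfaces that the model invokes autonomously" can be sketched as a registry-plus-dispatcher pattern. All names and fields below are hypothetical illustrations of that shape, not the author's actual schema:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Tool:
    """A tool the agent can call: a name, a description the model reads,
    and a handler that returns structured JSON-like results."""
    name: str
    description: str
    handler: Callable[[dict], dict]

def lookup_order(args: dict) -> dict:
    # Placeholder backend call; a real handler would query the order system.
    return {"order_id": args["order_id"], "status": "settled"}

TOOLS = {
    "lookup_order": Tool(
        name="lookup_order",
        description="Fetch an order's payment status by order_id.",
        handler=lookup_order,
    ),
}

def dispatch(tool_call: dict) -> dict:
    """Route a model-emitted tool call to its handler.

    Returning structured data (rather than free text) is what lets
    downstream systems consume the result.
    """
    tool = TOOLS[tool_call["name"]]
    return tool.handler(tool_call["arguments"])

result = dispatch({"name": "lookup_order", "arguments": {"order_id": "ORD-42"}})
print(result)
```

The dispatcher is also the natural place to handle the hard parts the teaser mentions: authentication before the handler runs, and partial failures surfaced as structured error payloads instead of exceptions.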