May 10 · 11 min read · The agent returned a 200. The customer lost $4,200. The monitoring dashboard showed green. This is the failure mode that keeps me up at night — not the one where your AI agent crashes with a stack trace, but the one where it succeeds at the wrong thi...
May 9 · 11 min read · The agent returned a 200. The customer lost $4,200. The monitoring dashboard showed green. This is the failure mode that keeps me up at night — not the one where your AI agent crashes with a stack trace, but the one where it succeeds at the wrong thi...
Apr 13 · 8 min read · Last post covered the implementation layer — how speech-to-text, audio emotion, and facial analysis actually run in real-time systems. This one covers what happens after deployment. How you evaluate, monitor, and catch degradation before your users d...
Feb 22 · 2 min read · When it comes to deploying large language models in sensitive domains like defense, accuracy isn't just a preference—it's a necessity. That’s why Justin Norman’s release of DoDHaluEval v0.1.0 caught my attention. This open-source framework is specifi...
Feb 12 · 7 min read · Large language models deliver substantial gains in efficiency across numerous tasks, but their unpredictable outputs and tendency to generate incorrect information present significant risks. These potential errors can prove expensive and labor-intens...
Nov 25, 2025 · 8 min read · Your LLM scored 85% on your test set. How confident are you in that number? What if I told you it might actually be anywhere between 70% and 95%? Most engineering teams ship LLM systems based on evaluation numbers that look precise but hide massive u...
Sep 14, 2025 · 6 min read · LLM evaluation has become essential as artificial intelligence systems are increasingly deployed in real-world applications. While basic accuracy measurements are important, they don't tell the complete story of how well a large language model perfor...
Mar 10, 2025 · 5 min read · We all know that LLMs thrive in unstructured environments, capable of consuming large amounts of unstructured text and outputting more unstructured text based on the unstructured text prompts you provide. Of course, I have written articles in the pas...
Jan 2, 2025 · 6 min read · The generative AI race has been accompanied by a rise in usage of the term “Perplexity”. Google Trends suggests the same, and most of the references in academic journals come from the last year. And no, this is not perplexity.ai. It is a metric bei...