@Tebza
Building with AWSomeness
Mar 31 · 9 min read · Our first LLM judge gave a 9/10 to a page where the hero text was completely invisible. Dark grey text on a dark background image. The CSS was syntactically valid. The HTML was well-structured. Every tag was correct. The page was unusable. And our ju...
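The failure this preview describes, valid markup with invisible text, is exactly what a plain contrast check catches. Here is a minimal sketch of such a check using the WCAG 2.1 contrast-ratio formula; the hex values are hypothetical stand-ins for dark grey text on a dark background, not colors from the article:

```python
# Minimal sketch: WCAG 2.1 contrast ratio between a foreground and
# background color. Hex values below are hypothetical illustrations.

def relative_luminance(hex_color: str) -> float:
    """Relative luminance of an sRGB hex color like '#333333' (WCAG 2.1)."""
    def channel(c: int) -> float:
        s = c / 255
        return s / 12.92 if s <= 0.03928 else ((s + 0.055) / 1.055) ** 2.4
    r, g, b = (int(hex_color.lstrip('#')[i:i + 2], 16) for i in (0, 2, 4))
    return 0.2126 * channel(r) + 0.7152 * channel(g) + 0.0722 * channel(b)

def contrast_ratio(fg: str, bg: str) -> float:
    """Contrast ratio in [1, 21]; higher is more readable."""
    l1, l2 = sorted((relative_luminance(fg), relative_luminance(bg)), reverse=True)
    return (l1 + 0.05) / (l2 + 0.05)

# Dark grey text on a dark background: ~1.4:1, far below the 4.5:1 AA floor.
print(contrast_ratio('#333333', '#1a1a1a'))
```

WCAG AA requires at least 4.5:1 for body text, so a ratio near 1.4 should cap any judge's score no matter how clean the HTML is.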
Mar 30 · 10 min read · We tested five AI models on the same task 467 times. Each run produced a complete deployable website — not a code snippet, not a function, not a patch. A real site with HTML, CSS, JavaScript, and assets. The question: can cheaper models match Claude ...
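Producing hundreds of complete sites calls for a harness rather than hand-runs. A minimal sketch of such a trial loop follows; `generate_site`, the model names, the run count, and the directory layout are all hypothetical placeholders, since the preview does not show the actual setup:

```python
# Sketch of a repeated-trial harness. `generate_site` is a hypothetical
# stand-in for whatever client call produces a full site; model names and
# per-model run counts are placeholders (the article's total was 467 runs).
import json
from pathlib import Path

MODELS = ["model-a", "model-b", "model-c", "model-d", "model-e"]  # placeholders
RUNS_PER_MODEL = 10  # placeholder

def generate_site(model: str, prompt: str) -> dict[str, str]:
    """Hypothetical: returns {filename: contents} for a complete site."""
    raise NotImplementedError

def run_trials(prompt: str, out_dir: Path) -> None:
    for model in MODELS:
        for run in range(RUNS_PER_MODEL):
            site = generate_site(model, prompt)
            run_dir = out_dir / model / f"run-{run:03d}"
            run_dir.mkdir(parents=True, exist_ok=True)
            # Write every generated file, e.g. index.html, style.css, app.js.
            for name, contents in site.items():
                (run_dir / name).write_text(contents)
            # Record metadata so judging can be replayed per run later.
            (run_dir / "meta.json").write_text(json.dumps({"model": model, "run": run}))
```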
Mar 30 · 9 min read · Most LLM benchmarks evaluate text. HumanEval checks if a function passes unit tests. SWE-bench measures whether a model can patch a repository. MBPP scores single-function completions. None of these work when your AI agent generates an entire website...