Tag feed

#swe-bench

11 posts · 0 followers

Hououin Kyouma · xanther.hashnode.dev · May 9 · 10 min read

How a $0.02/Call Model Scored 78.2% on SWE-bench Verified — Beating Every Model on the Leaderboard

TL;DR: We added architectural context to AI coding agents via MCP and tested on SWE-bench Verified (500 real bugs). MiniMax M2.5 — a model that costs $0.02 per call — scored 78.2%, surpassing every mod

Anup Karanjkar · wowhow.hashnode.dev · May 9 · 8 min read

Poolside Laguna XS.2 and M.1: Agentic Coding Developer Guide 2026

Poolside released Laguna XS.2 and Laguna M.1 on April 28, 2026 — two agentic coding models built specifically for software engineering tasks that run plan-execute-observe loops across multi-file codebases. XS.2 is open-weight under Apache 2.0, runs o...

Anup Karanjkar · wowhow.hashnode.dev · May 6 · 8 min read

Poolside Laguna XS.2 and M.1: Agentic Coding Developer Guide 2026

Poolside released Laguna XS.2 and Laguna M.1 on April 28, 2026 — two agentic coding models built specifically for software engineering tasks that run plan-execute-observe loops across multi-file codebases. XS.2 is open-weight under Apache 2.0, runs o...

Anup Karanjkar · wowhow.hashnode.dev · May 5 · 8 min read

Poolside Laguna XS.2 and M.1: Agentic Coding Developer Guide 2026

Poolside released Laguna XS.2 and Laguna M.1 on April 28, 2026 — two agentic coding models built specifically for software engineering tasks that run plan-execute-observe loops across multi-file codebases. XS.2 is open-weight under Apache 2.0, runs o...

Anup Karanjkar · wowhow.hashnode.dev · May 2 · 4 min read

Best AI Models for Coding in 2026: Benchmarks That Matter

Every AI company claims their model is "best for coding." Marketing benchmarks are cherry-picked. Real-world performance is what matters. We tested 12 models on the benchmarks that correlate most strongly with actual developer productivity. The Mode...

Jangwook Kim · effloow.hashnode.dev · Apr 29 · 11 min read

MiniMax M2.5 API Guide: 80% SWE-Bench at $0.15/M Tokens

If you are building a coding agent that will run thousands of agentic loops per day, the model you choose determines whether your infrastructure bill is $50/day or $1,000/day — for nearly identical task performance. MiniMax M2.5 is the clearest illus...

Raj Darshan Pachori · nextgenrd.tech · Feb 1 · 1 min read

AI Software Engineering benchmark just went from 80% to 23%

What is SWE-bench? SWE-bench is a widely followed benchmark evaluation framework designed to test AI coding assistants on real software engineering tasks. AI coding assistant benchmarks are supposed to give us clarity. SWE-bench does the opposite. SW...

Rishi Vaish · blog.codesweep.ai · Dec 9, 2025 · 5 min read

Mixture of Open-Weight Models with Iterative Patch Generation Improves Performance on SWE-bench

Goal: Our goal in this study was to explore whether a mixture of open-weight models, combined through an iterative process, can outperform any single model on the SWE-bench Verified benchmark. Specifically, we wanted to evaluate if patches generated b...
