
We just open sourced Passmark, the AI testing framework we built at Bug0


TL;DR: Passmark is an open-source AI regression testing framework built on Playwright. Describe tests in plain English. AI executes them once and caches every action to Redis. When your UI changes, it self-heals. We built it at Bug0, and we just made it free.


The short version of why

We are Fazle and Sandeep, the co-founders of Hashnode. We have spent years building tools for developers, and the thing that keeps pulling us forward is the same question: where is the developer experience still broken?

In mid-2025, we found a clear answer. We were in SF meeting founder friends who were shipping fast with AI-generated code. The pattern was striking. Teams were pushing features daily, sometimes hourly. Cursor, Copilot, Claude Code. AI was writing real production code. But almost none of these teams had regression testing in place. No QA team. No test suites. Just vibes and staging environments.

Code was being written faster than ever. Nobody was checking if it still worked.

That gap is why we started building Bug0, an AI-native QA platform for modern web applications that also pairs teams with dedicated forward-deployed engineers. At its core sits a testing framework called Passmark.

We just open sourced it. Here is the official announcement on the Bug0 blog.

This post is about why we built it, how it works, and why we chose to give it away.

Screenshot of Passmark's GitHub open-source repository.


QA never got its developer experience moment

Think about how much better software delivery has gotten in the last ten years.

CI/CD pipelines are fast and cheap. Deployment is a merged PR away. Observability tools actually work.

Testing still feels like 2015. You write brittle Selenium scripts. They break when someone moves a button. You spend more time maintaining tests than writing features. Or you skip testing entirely and hope staging catches it.

AI tools showed up and promised to fix this. Most of them did not.

The typical AI testing demo looks great. An agent navigates your app, clicks around, generates a test. Impressive for a two-minute video. Then you try running 500 of those tests in CI. Each one calls an LLM on every single step. Your pipeline takes 40 minutes instead of 4. Your API bill looks like a hosting bill.

That is the gap we wanted to close.


What Passmark actually does

The tagline on passmark.dev says it well: "Simpler than hand-written Playwright. More reliable than real-time AI."

Passmark.dev website screenshot.

Passmark works in three phases.

Describe. You write test steps in plain English. No selectors, no page objects, no CSS paths.

Execute. On first run, AI agents navigate your app using accessibility snapshots and screenshots. Every successful action gets cached to Redis. This takes roughly 30 seconds per step.

Replay. On subsequent runs, Passmark replays cached actions using Playwright at native speed. Zero LLM calls. When a cached action fails because the UI changed, AI re-engages only for that broken step, heals it, and updates the cache.

Here is a real test:

import { test, expect } from "@playwright/test";
import { runSteps } from "passmark";

test("Shopping cart tests", async ({ page }) => {
  await runSteps({
    page,
    userFlow: "Add product to cart",
    steps: [
      { description: "Navigate to https://demo.vercel.store" },
      { description: "Click Acme Circles T-Shirt" },
      { description: "Select color", data: { value: "White" } },
      { description: "Select size", data: { value: "S" } },
      { description: "Add to cart", waitUntil: "My Cart is visible" },
    ],
    assertions: [
      { assertion: "You can see My Cart with Acme Circles T-Shirt" },
    ],
    test,
    expect,
  });
});

The entire test is plain English. If someone renames a component or moves a button, the test does not break. It still describes what a user does, and Passmark figures out the rest. It runs inside a standard Playwright test file with the same runner and config you already have.

If you have 200 regression tests running on every PR, you do not want 200 LLM sessions. You want 200 Playwright replays and maybe 3 LLM sessions for the steps that broke.
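The cache-first flow can be sketched roughly like this. This is a simplified illustration, not Passmark's actual internals: an in-memory Map stands in for Redis, and `replay` and `runWithAI` are hypothetical stand-ins for the Playwright replay and the LLM discovery pass.

```typescript
// Simplified sketch of cache-first execution with self-healing.
// A Map stands in for Redis; runWithAI stands in for the LLM agent.

type CachedAction = { selector: string; action: "click" | "fill" };

const cache = new Map<string, CachedAction>();

// Stand-in: replaying a cached action against the live page.
// Here we just simulate failure when the cached selector went stale.
function replay(action: CachedAction): boolean {
  return action.selector !== "#stale";
}

// Stand-in: an LLM-driven discovery pass that finds the element.
function runWithAI(step: string): CachedAction {
  return { selector: `#discovered-${step.length}`, action: "click" };
}

function executeStep(step: string): CachedAction {
  const cached = cache.get(step);
  if (cached && replay(cached)) {
    return cached; // fast path: native-speed replay, zero LLM calls
  }
  // Cache miss or stale selector: AI re-engages only for this step,
  // heals it, and updates the cache.
  const healed = runWithAI(step);
  cache.set(step, healed);
  return healed;
}

// First run populates the cache; later runs hit the fast path.
executeStep("Click Add to cart");
```

The key property is that the expensive path runs at most once per step per UI change; everything else is a plain cached replay.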


Multi-model consensus, not single-model guessing

One thing we got wrong early on: trusting a single AI model to verify test results.

LLMs hallucinate. They get things subtly wrong. A single model saying "looks good" is not enough when you are deciding whether a release ships.

Passmark runs assertions through Claude and Gemini independently. If they agree, the assertion passes. If they disagree, a third model acts as arbiter. We call this consensus-based assertion.

import { assert } from "passmark";

const result = await assert({
  page,
  assertion: "The dashboard shows 3 active projects",
  expect,
});

This is slower than a single model call. We think it is worth it. A false positive in regression testing means shipping a broken release. An extra second of verification is cheap compared to that.
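The shape of that consensus logic looks roughly like this. This is an illustrative sketch, not Passmark's source; the verdict functions are stand-ins for real model calls that would send the assertion plus a page snapshot to each provider.

```typescript
type Verdict = "pass" | "fail";

// Stand-ins for independent model calls (e.g. Claude and Gemini),
// plus a third arbiter model consulted only on disagreement.
async function consensusAssert(
  assertion: string,
  askClaude: (a: string) => Promise<Verdict>,
  askGemini: (a: string) => Promise<Verdict>,
  askArbiter: (a: string) => Promise<Verdict>
): Promise<Verdict> {
  // Both primary models evaluate the assertion independently.
  const [claude, gemini] = await Promise.all([
    askClaude(assertion),
    askGemini(assertion),
  ]);
  if (claude === gemini) return claude; // agreement: verdict stands
  return askArbiter(assertion); // disagreement: third model breaks the tie
}
```

Because the two primary calls run in parallel, the common agree-case costs one round trip; only a disagreement pays for the arbiter.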


Built for real testing workflows

We built Passmark inside Bug0, where it runs against production applications every day. That forced us to solve problems that AI testing demos never touch.

Email testing is the one that surprised us most. So many critical flows depend on email: sign up, verify your account, reset your password. We kept hitting this wall in client projects, so we built disposable inboxes and email content extraction directly into Passmark. OTP codes, verification links, email-dependent flows. No external tools needed.

{
  description: "Enter the verification code",
  data: {
    value: "{{email.otp:get the 6 digit verification code:{{run.dynamicEmail}}}}"
  }
}

Dynamic test data was another pain point. When you run the same test suite 50 times a day across parallel workers, hardcoded values collide. Passmark generates unique emails, names, and IDs per run using placeholders like {{run.email}} and {{run.shortid}}. The same test works whether you run it once or a thousand times.

Passmark also supports cross-test state via Redis (test B can read what test A created) and OpenTelemetry tracing via Axiom (every AI call is instrumented, so when something fails you can see exactly which model was called, what it saw, and what it decided).
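The cross-test state idea reduces to a shared, namespaced key-value store. A minimal sketch with a Map standing in for Redis (the function names here are illustrative, not Passmark's API):

```typescript
// A Map stands in for Redis; in practice this store outlives a single
// test process, which is what lets test B read what test A created.
const sharedState = new Map<string, string>();

function saveState(suite: string, key: string, value: string): void {
  sharedState.set(`${suite}:${key}`, value); // namespace keys per suite
}

function loadState(suite: string, key: string): string | undefined {
  return sharedState.get(`${suite}:${key}`);
}

// Test A creates an order and records its id...
saveState("checkout", "orderId", "ord_123");
// ...and test B picks it up without re-running the creation flow.
console.log(loadState("checkout", "orderId"));
```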


Why we open sourced it

When we were building Hashnode, we made a bet on headless APIs and open integrations. We believed developers should own their content and their workflow. That bet shaped everything about how Hashnode grew.

Passmark is the same bet applied to testing.

If you are asking engineers to trust an AI system to decide whether a release is safe to ship, that system should be inspectable. You should be able to read the code, see where AI is used and where it is not, understand how caching works, and decide whether the tradeoffs fit your team.

We have seen too many AI tools hide behind API wrappers. Engineers adopt them because the demo was good, then get stuck when something breaks and they cannot see inside. We did not want Passmark to be that.

Passmark is licensed under FSL-1.1-Apache-2.0: the Functional Source License today, converting to Apache 2.0 two years after each release.


What building Hashnode taught us about developer tools

At Hashnode, we spent years watching which developer tools got adopted and which ones got bookmarked and forgotten. The pattern was always the same: the tools that fit into existing workflows won. The ones that asked developers to change everything lost. Markdown won over proprietary editors. Custom domains won over locked subdomains. Git-backed content won over closed databases.

We built Passmark with that pattern in mind. It runs inside normal Playwright test files. You use @playwright/test, not some custom runner. Your existing playwright.config.ts works. Your CI stays the same.

There is a related thing we noticed at Hashnode that we did not expect. The features we removed mattered more than the features we added. Every time we stripped out something that was not pulling its weight, adoption went up. Passmark reflects that. It uses AI for two things: discovery and healing. Everything else is Playwright. We could have made AI do more. We chose not to, because reliability at 2 AM in a red CI pipeline matters more than capability in a demo.


Get started

Install Passmark in any Playwright project:

npm install passmark

You need API keys for Anthropic (Claude) and Google (Gemini) for multi-model consensus. Set them in your .env:

ANTHROPIC_API_KEY=sk-ant-...
GOOGLE_GENERATIVE_AI_API_KEY=AIza...

Optionally, connect Redis for step caching. Without Redis, every run uses AI. With Redis, you get cache-first execution.

The full docs are at passmark.dev. The code is at github.com/bug0inc/passmark.

For teams that want a managed experience where we handle test infrastructure, maintenance, and QA strategy, that is what Bug0 does. Passmark is the engine underneath.


FAQs

What is Passmark?

An open-source AI regression testing framework built on Playwright. You write tests in plain English, AI figures out how to execute them, and every successful action gets cached to Redis. Subsequent runs replay at native Playwright speed. AI only comes back when the UI changes.

How is it different from other AI testing tools?

Most AI testing tools call an LLM on every step of every run. Passmark calls AI once for discovery, then replays cached Playwright actions. It also verifies assertions using two models (Claude and Gemini) with a third as tiebreaker, instead of trusting one model to get it right.

What do we need to run it?

Node.js 18+, Playwright 1.57+, and API keys for both Anthropic and Google AI. Redis is optional but recommended. Without it, there is no caching and every run goes through AI.

Is it production-ready?

We run it in Bug0's production workflows every day against real applications. The codebase is stable, but test coverage is still growing. Contributions are welcome.

What is Bug0?

Bug0 is an AI-native QA platform used by 200+ engineering teams. Passmark is the engine that powers it. You can use Passmark on its own, or use Bug0 for a fully managed QA experience starting at $2,500/month.

Can we mix Passmark with existing Playwright code?

Yes. Passmark runs inside standard Playwright test files. You can use runSteps() alongside regular Playwright calls in the same suite. Same config, same runner, same CI.

What AI models does it use?

Gemini for step execution, Claude and Gemini together for assertions, with a configurable arbiter model for disagreements. All 8 model slots can be swapped. You can also route everything through Vercel AI Gateway instead of managing individual keys.
