
We built an AI that beta-tests your app

2026-04-14 · 6 min read

Last month, our AI rejected a deploy. The automated tests were green, but the product was not ready: controls were too close together, an error state broke on smaller screens, and a loading state looked like a crash. The score was 64 out of 100. The verdict was simple: do not ship.

That is the problem Gadget was built to solve. It is not an LLM wrapped around a test runner. It is a beta tester with eyes: a system that runs product flows, captures what a user would actually see, and asks a vision model to judge whether the experience is coherent.

End-to-end tests answer the wrong question

Cypress, Playwright and Selenium are excellent at proving that code paths execute. They can confirm that a form submits, a route changes, or an API returns the expected status. But a product can satisfy every assertion and still feel broken to a user.

The missing question is visual and experiential: does this look right, is it readable, would someone hesitate, and did the interface enter a state no one thought to assert? Manual QA catches those issues, but it rarely scales to every pull request, every viewport, and every release candidate.

What changes when the runner can see

Gadget keeps the parts of E2E testing that work. It uses Playwright to navigate, click, fill and assert. The difference is what happens after each step: the runner waits for the page to settle, captures a full-page screenshot, and sends the sequence to Claude, which is prompted to review it as a human beta tester would.

The model reviews layout, readability, broken states, UX friction and visual inconsistencies. It reports findings by severity and returns structured JSON, validated with Zod, so the pipeline can fail gracefully when the model output is imperfect instead of turning the tool itself into the weakest link.
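A minimal sketch of that validation step is shown here. Gadget uses Zod; to stay dependency-free, this example swaps in a hand-written type guard. The field names (severity, title, detail), the findings wrapper and the function names are assumptions for illustration, not Gadget's actual schema.

```typescript
// Hypothetical shape of one finding; Gadget's real schema may differ.
type Severity = "critical" | "warning" | "nitpick" | "improvement";

interface Finding {
  severity: Severity;
  title: string;
  detail: string;
}

type ReviewResult =
  | { ok: true; findings: Finding[] }
  | { ok: false; error: string };

const SEVERITIES: string[] = ["critical", "warning", "nitpick", "improvement"];

function isFinding(value: unknown): value is Finding {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  return (
    typeof v.severity === "string" &&
    SEVERITIES.indexOf(v.severity) !== -1 &&
    typeof v.title === "string" &&
    typeof v.detail === "string"
  );
}

// Parse raw model text without throwing, so a malformed reply becomes
// a reportable error instead of a crashed pipeline.
function parseReview(raw: string): ReviewResult {
  let data: unknown;
  try {
    data = JSON.parse(raw);
  } catch (err) {
    return { ok: false, error: "model output is not valid JSON" };
  }
  if (typeof data !== "object" || data === null) {
    return { ok: false, error: "model output is not a JSON object" };
  }
  const findings = (data as Record<string, unknown>).findings;
  if (!Array.isArray(findings)) {
    return { ok: false, error: "missing findings array" };
  }
  for (let i = 0; i < findings.length; i++) {
    if (!isFinding(findings[i])) {
      return { ok: false, error: `finding ${i} has an unexpected shape` };
    }
  }
  return { ok: true, findings: findings as Finding[] };
}
```

With Zod, the same idea is usually expressed through a schema's safeParse method, which returns a success flag rather than throwing.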

Five design choices shaped the product

First, tests are written in YAML so product managers, QA engineers and designers can describe flows without writing TypeScript. Second, locators follow what users see: labels, placeholders and accessible names before brittle selectors. That makes test files closer to product intent than implementation detail.
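The locator priority described above could be sketched roughly as follows. The Target fields and the planLocator function are hypothetical; only the strategy names (getByLabel, getByPlaceholder, getByRole, locator) correspond to real Playwright page methods.

```typescript
// Hypothetical target description as it might appear in a parsed YAML step.
interface Target {
  label?: string;
  placeholder?: string;
  role?: string;
  name?: string;
  selector?: string;
}

// Which Playwright method to call, and with what arguments.
type LocatorPlan =
  | { via: "getByLabel"; text: string }
  | { via: "getByPlaceholder"; text: string }
  | { via: "getByRole"; role: string; name: string }
  | { via: "locator"; selector: string };

// Prefer what a user can see or hear; fall back to a raw selector last.
function planLocator(target: Target): LocatorPlan {
  if (target.label) return { via: "getByLabel", text: target.label };
  if (target.placeholder) return { via: "getByPlaceholder", text: target.placeholder };
  if (target.role && target.name) {
    return { via: "getByRole", role: target.role, name: target.name };
  }
  if (target.selector) return { via: "locator", selector: target.selector };
  throw new Error("target has no usable locator");
}
```

Under this ordering, a step that supplies both a label and a CSS selector resolves through the label, so the selector only matters when nothing user-visible is available.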

Third, screenshots are treated as a first-class part of the pipeline. In audit mode every step is captured, but only after the network goes idle and a short settle timeout elapses. A vision model cannot tell a genuinely broken interface from a page caught mid-load unless the pipeline gives it a fair frame.
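One way to sketch the settle-then-capture step is shown here. PageLike is a structural subset of Playwright's Page, declared locally so the example does not need the dependency; captureSettled and the 500 ms default are assumptions, not Gadget's actual function or setting.

```typescript
// Structural subset of Playwright's Page; a real Page object satisfies it.
interface PageLike {
  waitForLoadState(state: "networkidle"): Promise<void>;
  waitForTimeout(ms: number): Promise<void>;
  screenshot(options: { fullPage: boolean }): Promise<Uint8Array>;
}

// Wait for network idle, pause briefly for late renders, then capture.
// The 500 ms default is a hypothetical value chosen for illustration.
async function captureSettled(
  page: PageLike,
  settleMs: number = 500
): Promise<Uint8Array> {
  await page.waitForLoadState("networkidle");
  await page.waitForTimeout(settleMs);
  return page.screenshot({ fullPage: true });
}
```

The fixed pause is a blunt instrument. It trades a little run time per step for frames that are less likely to show spinners or half-rendered layouts.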

Fourth, the review prompt is scoped aggressively. The AI is told not to critique test coverage, password strength, security posture or destination pages outside the tested flow. The hardest prompt work was often telling the reviewer what to ignore.

Fifth, readiness is scored rather than reduced to a boolean: critical issues, warnings, nitpicks and improvements each deduct from a 100-point baseline.
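The scoring idea can be sketched as a simple deduction from 100. The per-severity weights below are hypothetical, since this article does not state Gadget's real values, and clamping the result at zero is also an assumption.

```typescript
type Severity = "critical" | "warning" | "nitpick" | "improvement";

// Hypothetical deductions per finding; Gadget's real weights are not given here.
const DEDUCTIONS: Record<Severity, number> = {
  critical: 20,
  warning: 8,
  nitpick: 2,
  improvement: 1,
};

// Start from a 100-point baseline, subtract per finding, never go below zero.
function readinessScore(severities: Severity[]): number {
  let score = 100;
  for (const severity of severities) {
    score -= DEDUCTIONS[severity];
  }
  return Math.max(0, score);
}
```

A graded score lets a team set its own threshold for blocking a deploy, which a pass/fail boolean cannot express.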

The agentic loop

The audit loop is straightforward: parse YAML, interpolate variables, launch Chromium, execute each step, capture screenshots, ask Claude for a beta-test review, validate the response, aggregate readiness, then output reports for humans and CI systems. Console, JSON, HTML, JUnit and GitHub annotations all serve different parts of the same workflow.
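As an illustration of the interpolation step in that loop, here is a minimal sketch. The ${NAME} placeholder syntax and the interpolate function are assumptions; the article does not say how Gadget writes variables in YAML.

```typescript
// Replace ${NAME} placeholders with values from vars. Unknown names throw,
// so a typo in a test file fails fast instead of being typed into a form.
function interpolate(template: string, vars: Record<string, string>): string {
  return template.replace(/\$\{(\w+)\}/g, (_match: string, name: string) => {
    if (!Object.prototype.hasOwnProperty.call(vars, name)) {
      throw new Error(`undefined variable: ${name}`);
    }
    return vars[name];
  });
}
```

Failing on an undefined variable is a design choice: the alternative, leaving the placeholder in place, would produce screenshots of a form filled with literal placeholder text and waste a model review on a test-file mistake.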

There is also a second loop. Gadget can read a git diff and generate YAML tests aimed at the changed features, validate those tests against the schema, and optionally run them immediately. One agent writes the tests; another reviews what the product looked like when those tests ran.
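A plausible first step for the diff-driven loop is working out which files changed. The sketch below is an assumption about how that might be done, not Gadget's implementation: it reads the "+++ b/<path>" lines that git's default unified diff output uses for the new side of each file, and changedFiles is a hypothetical name.

```typescript
// Collect the paths of files that exist after the change.
// Deleted files appear as "+++ /dev/null" and are skipped.
function changedFiles(diff: string): string[] {
  const prefix = "+++ b/";
  const files: string[] = [];
  for (const line of diff.split("\n")) {
    if (line.indexOf(prefix) === 0) {
      const path = line.slice(prefix.length).trim();
      if (files.indexOf(path) === -1) files.push(path);
    }
  }
  return files;
}
```

This is deliberately naive: an added source line that itself begins with "++ b/" would be mistaken for a file header, so a production version would track hunk boundaries as well.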

What we did not expect

Prompt scoping mattered more than model choice. Settle time mattered more than we expected, because a frame captured mid-load reads as a broken interface. Structured output was non-negotiable. YAML lowered the contribution barrier for non-engineers. And the score changed team conversations: instead of arguing about whether something was ready, everyone could look at the same number and the same visible evidence.

The lesson is that the expensive bugs are often not crashes. They are experiences that technically work and still make users hesitate, misunderstand, or leave. Traditional E2E suites catch failures of code execution. Gadget is aimed at the gap between code that runs and a product that feels ready.

Open source by default

Gadget is published as an open-source tool under the MIT license. The package runs locally, uses your own Anthropic API key, and sends screenshots directly from your machine to the model provider. The repository includes the TypeScript runner, analyzer, checker, reporters, prompt files, example YAML suites, CI examples, documentation and Claude Code skills.

Explore the Gadget repository

Read the original LinkedIn article
