TestRobin — Test Execution Engine Architecture

Architecture

TestRobin is an AI test-execution engine — you write a test in plain English, and an autonomous agent drives a real browser by sight, locating elements on the rendered screen through a grid rather than by CSS/XPath selectors, then produces a tamper-evident, step-by-step evidence record.

Today TestRobin is a working single-instance reference implementation, built to demonstrate the architecture: one headless browser per run, packaged to run on two clouds (primary plus warm backup). Multi-tenant parallelism is on the roadmap, not yet built.

Vision-only grounding — no selectors, no DOM

The agent locates elements by looking at the screenshot through a grid rather than by CSS/XPath selectors — so a markup-only refactor is far less likely to break a test.

Plain English in, structured steps out

A test is written in natural language; a compiler turns it into structured steps; the agent runs each one and verifies the outcome before it advances.

Model-agnostic intelligence — the model is pluggable

The vision model is an external, swappable inference call to the ColdVault Platform — no model weights live in the engine; the engine is the harness.

Evidence as a byproduct — ALCOA+ aligned

Every run captures per-step screenshots and a tamper-evident evidence record with a PDF execution summary, derived from what actually happened — aligned to (not certified against) ALCOA+ / Annex 11 / Part 11.

TestRobin is live: sign in and run a test, or watch one execute headless, step by step.

Grounding — the agent locates by sight, through a grid

The whole flow, in order: a plain-English test → a compiler that turns it into structured steps → an agent loop that, each step, screenshots the page, locates the target on a grid (this section), acts, and verifies the outcome → a tamper-evident evidence record and a read-only API. The sections below walk it in that order.

The agent works from the rendered screenshot — the same pixels a human sees — and locates elements by sight, not by CSS/XPath selectors, IDs or DOM queries. Targeting is done by looking. The architectural payoff is durability: a refactor that renames a class, reshuffles the markup or swaps a component framework is far less likely to break the test, because the test was never bound to the markup in the first place.

Locating happens on a two-level grid. A continuous grid is drawn over the screenshot and the model names the cell: the column from the header tile, the row from the row tile. A finer grid is then drawn over just that cell and the model picks the precise spot. The hardest case, a dense matrix form (a severity×likelihood grid of checkboxes whose column headers have scrolled off the top), is exactly the one this method is built to nail; it is the signature capability.
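The two-level lookup can be sketched as plain coordinate arithmetic. This is a minimal illustration, not the engine's code: the 8×8 coarse grid, the 4×4 fine grid, and the names `Rect`, `cell_rect` and `ground` are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rect:
    x: float
    y: float
    w: float
    h: float


def cell_rect(area: Rect, cols: int, rows: int, col: int, row: int) -> Rect:
    """Return the rectangle of cell (col, row) in a cols x rows grid laid over area."""
    if not (0 <= col < cols and 0 <= row < rows):
        raise ValueError("cell is outside the grid")
    cell_w, cell_h = area.w / cols, area.h / rows
    return Rect(area.x + col * cell_w, area.y + row * cell_h, cell_w, cell_h)


def ground(screen: Rect, coarse: tuple[int, int], fine: tuple[int, int],
           coarse_dims: tuple[int, int] = (8, 8),    # assumed grid size
           fine_dims: tuple[int, int] = (4, 4)) -> tuple[int, int]:
    """Turn a coarse cell plus a fine cell inside it into a click point in pixels."""
    cell = cell_rect(screen, *coarse_dims, *coarse)   # level 1: the named cell
    spot = cell_rect(cell, *fine_dims, *fine)         # level 2: the spot inside it
    return round(spot.x + spot.w / 2), round(spot.y + spot.h / 2)


# On a 1280x1024 capture, coarse cell (2, 3) and fine cell (1, 2):
print(ground(Rect(0, 0, 1280, 1024), coarse=(2, 3), fine=(1, 2)))  # (380, 464)
```

The point of the second level is resolution: with these assumed sizes the model only ever chooses among 64 and then 16 options, yet the click lands within a 40×32 pixel region.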

For off-screen targets the agent tiles the page (scrolls and captures each view as a separate image, never stitched — stitching downscales and loses legibility) and grounds within the right tile. Every step grounds from a fresh capture; there is no cross-step pixel memory to drift.
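A rough sketch of the tiling arithmetic, assuming a fixed viewport height and an optional overlap between consecutive views. The overlap parameter and the function names are illustrative assumptions; the source only states that each view is captured separately and never stitched.

```python
def tile_offsets(page_height: int, viewport_height: int, overlap: int = 0) -> list[int]:
    """Scroll offsets (top edge of each view) that together cover the whole page."""
    if viewport_height <= 0 or not (0 <= overlap < viewport_height):
        raise ValueError("invalid viewport height or overlap")
    if page_height <= viewport_height:
        return [0]                                   # the page fits in one view
    stride = viewport_height - overlap
    last = page_height - viewport_height             # final view ends at the page bottom
    offsets = list(range(0, last, stride))
    offsets.append(last)
    return offsets


def locate_tile(offsets: list[int], viewport_height: int, page_y: int) -> tuple[int, int]:
    """Map a page-level y coordinate to (tile index, y inside that tile)."""
    for index, top in enumerate(offsets):
        if top <= page_y < top + viewport_height:
            return index, page_y - top
    raise ValueError("coordinate is outside the page")


offsets = tile_offsets(page_height=3000, viewport_height=1024)
print(offsets)                            # [0, 1024, 1976]
print(locate_tile(offsets, 1024, 1500))   # (1, 476)
```

Each offset corresponds to one separate capture at full resolution, which is what keeps small text legible compared with a single downscaled stitched image.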

Authoring & execution — plain English, compiled, then a two-phase loop

Tests are written in plain English. A compiler turns those natural-language steps into structured, executable steps. Steps name a value and a target semantically — “Click Rarely for the BCCA question” — never an ordinal position (“the third checkbox”), so a test survives columns being added or reordered.
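One way to picture a compiled step is as a small record of action, value and target. The schema below is hypothetical (the field names, including `expect`, and the ordinal check are assumptions for illustration, and the regular expression is a deliberately crude heuristic), but it shows how a semantic target can be enforced at compile time.

```python
import re
from dataclasses import dataclass

# Crude heuristic for positional wording such as "the third checkbox" or "2nd row".
ORDINAL = re.compile(r"\b(\d+(?:st|nd|rd|th)|first|second|third|fourth|fifth|last)\b", re.I)


@dataclass(frozen=True)
class Step:
    action: str   # "click", "type" or "select"
    value: str    # what to click, type or select, e.g. "Rarely"
    target: str   # where, named semantically, e.g. "the BCCA question"
    expect: str   # outcome to verify afterwards (assumed field)


def validate(step: Step) -> Step:
    """Reject unknown actions and targets that are named by position."""
    if step.action not in {"click", "type", "select"}:
        raise ValueError(f"unknown action: {step.action}")
    if ORDINAL.search(step.target) or ORDINAL.search(step.value):
        raise ValueError("steps must name the target semantically, not by position")
    return step


step = validate(Step("click", "Rarely", "the BCCA question", "Rarely is selected for BCCA"))
print(step.action, step.value, step.target)
```

A step such as `Step("click", "", "the third checkbox", "")` would be rejected by this check, which is the property that lets a test survive columns being added or reordered.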

The agent then runs each step in a two-phase loop: ACT (ground the target, then click / type / select) followed by VERIFY (confirm the page actually changed the way the step intended). It does not advance on an unverified step — a step that didn’t take effect is a failed step, not a silently-skipped one.
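The act-then-verify loop can be sketched as follows. The `act` and `verify` callables stand in for the real grounding and outcome checks, and stopping the run at the first failed step is an assumption of this sketch; the source only says the agent does not advance on an unverified step.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class StepResult:
    step: str
    status: str        # "PASS" or "FAIL"
    detail: str = ""


def run_steps(steps: Iterable[str],
              act: Callable[[str], None],
              verify: Callable[[str], bool]) -> list[StepResult]:
    """Run each step as ACT then VERIFY; never advance past an unverified step."""
    results: list[StepResult] = []
    for step in steps:
        try:
            act(step)              # ACT: ground the target, then click / type / select
            ok = verify(step)      # VERIFY: did the page change as the step intended?
        except Exception as exc:   # an error is a failed step, never a skipped one
            results.append(StepResult(step, "FAIL", f"error: {exc}"))
            break
        if not ok:
            results.append(StepResult(step, "FAIL", "outcome not verified"))
            break
        results.append(StepResult(step, "PASS"))
    return results


# Stub usage: the second step acts but its outcome cannot be verified.
done: list[str] = []
results = run_steps(["open form", "click Rarely", "submit"],
                    act=done.append,
                    verify=lambda s: s != "click Rarely")
print([(r.step, r.status) for r in results])
```

In the stub run above, "submit" is never attempted because "click Rarely" was not verified.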

Model replies are parsed forgivingly (JSON or prose), and an unparseable reply triggers a re-ask rather than a hard failure — structured-output strictness is never assumed.
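A minimal sketch of forgiving parsing with a re-ask. The reply shape (a column letter and a row number), the prose pattern, the retry limit and the function names are all assumptions for illustration; what the engine does once retries are exhausted is not stated in the source, so this sketch simply raises.

```python
import json
import re
from typing import Callable, Optional

# Matches a cell label such as "C7" or "C-7" written in ordinary prose.
CELL_IN_PROSE = re.compile(r"\b([A-Z])\s*-?\s*(\d{1,2})\b")


def parse_cell(reply: str) -> Optional[tuple[str, int]]:
    """Try JSON first, then a cell label in prose; return None if neither parses."""
    start, end = reply.find("{"), reply.rfind("}")
    if start != -1 and end > start:
        try:
            data = json.loads(reply[start:end + 1])
            return str(data["col"]).upper(), int(data["row"])
        except (ValueError, KeyError, TypeError):
            pass                                  # fall through to the prose pattern
    match = CELL_IN_PROSE.search(reply)
    if match:
        return match.group(1), int(match.group(2))
    return None


def ask_cell(ask: Callable[[str], str], prompt: str, max_attempts: int = 3) -> tuple[str, int]:
    """Ask the model; on an unparseable reply, re-ask with a stricter instruction."""
    retry_prompt = prompt + '\nReply with only JSON like {"col": "C", "row": 7}.'
    current = prompt
    for _ in range(max_attempts):
        cell = parse_cell(ask(current))
        if cell is not None:
            return cell
        current = retry_prompt
    raise RuntimeError("model reply could not be parsed after retries")


print(parse_cell('Sure: {"col": "b", "row": "3"}'))   # ('B', 3)
print(parse_cell("The target is in cell C7."))        # ('C', 7)
```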

Intelligence — model-agnostic, external, swappable

The agent’s sight and reasoning come from a model-agnostic external inference call to an inference-as-a-service Platform (the ColdVault Platform) — no model weights live in the engine; the model is configuration. Today it routes to a frontier vision model; pointing it at a different one is a setting, not a rebuild.

There are two model paths, both external and swappable: the agent run (sight + step reasoning) goes through the Platform, and the compiler (plain English → structured steps) is its own call. Keeping intelligence outside the engine is the architectural decision: the engine is the durable harness — grid, step loop, verification, evidence — and the model is a replaceable part.

Engine vs model: the harness is target- and model-agnostic; choosing a stronger or cheaper model, or swapping providers, changes a config value, not the architecture.
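The "model is configuration" idea can be illustrated with a small interface and a config record. Everything named here is hypothetical: the `VisionModel` protocol, the `INFERENCE_ENDPOINT` and `INFERENCE_MODEL` keys, and the `EchoModel` stub, which stands in for the real HTTPS call and does not reflect the ColdVault Platform API.

```python
from dataclasses import dataclass
from typing import Mapping, Protocol


class VisionModel(Protocol):
    def locate(self, screenshot_png: bytes, instruction: str) -> str: ...


@dataclass(frozen=True)
class ModelConfig:
    endpoint: str
    model: str


def load_config(env: Mapping[str, str]) -> ModelConfig:
    """Read the model choice from configuration (hypothetical key names)."""
    return ModelConfig(endpoint=env["INFERENCE_ENDPOINT"], model=env["INFERENCE_MODEL"])


class EchoModel:
    """Stub client: a real one would send the screenshot to config.endpoint."""

    def __init__(self, config: ModelConfig) -> None:
        self.config = config

    def locate(self, screenshot_png: bytes, instruction: str) -> str:
        return f"{self.config.model}: {instruction}"


def ground_step(model: VisionModel, screenshot_png: bytes, instruction: str) -> str:
    """Harness code: depends only on the interface, never on a specific model."""
    return model.locate(screenshot_png, instruction)


env = {"INFERENCE_ENDPOINT": "https://example.invalid/v1", "INFERENCE_MODEL": "model-a"}
print(ground_step(EchoModel(load_config(env)), b"", "find the Save button"))
```

Changing `INFERENCE_MODEL` changes which model answers while `ground_step` stays untouched, which is the separation the section describes.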

Evidence & compliance — a byproduct of execution

Every run captures per-step screenshots, each carrying a SHA-256 integrity hash printed in the report, and assembles a tamper-evident evidence record: if a screenshot is altered later, its recomputed hash no longer matches the value in the report. The PDF execution summary opens with a short, model-written narrative laid over the deterministic step log, covering what was attempted, what happened, and the verified outcome of each step. You cannot fake what the engine did; the screenshots are the evidence.
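The integrity check can be sketched with the standard library's `hashlib`. The per-screenshot SHA-256 comes from the text above; chaining each entry to the previous one, so that removed or reordered steps are also detectable, is an assumption added for this sketch, as are the field names.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def build_record(screenshots: list[bytes]) -> list[dict]:
    """One entry per step: the image hash plus a hash chained to the previous entry."""
    record, previous = [], ""
    for index, png in enumerate(screenshots, start=1):
        image_hash = sha256_hex(png)
        # Assumed chaining scheme: each entry commits to its predecessor.
        entry_hash = sha256_hex(f"{previous}:{index}:{image_hash}".encode())
        record.append({"step": index, "image_sha256": image_hash, "entry_sha256": entry_hash})
        previous = entry_hash
    return record


def verify_record(record: list[dict], screenshots: list[bytes]) -> bool:
    """Recompute every hash from the stored images and compare with the record."""
    return record == build_record(screenshots)


shots = [b"step-1-png-bytes", b"step-2-png-bytes"]
record = build_record(shots)
print(verify_record(record, shots))                                  # True
print(verify_record(record, [b"step-1-png-bytes", b"edited"]))       # False
```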

The reports are aligned to ALCOA+, EU GMP Annex 11 and 21 CFR Part 11 and support human review; the AI executes the test and drafts the summary, and a qualified human reviews and adopts every artifact. It is alignment and supporting evidence, not a certification or an independent validation.

Reliability — No Silent Green

The cardinal rule: a test must never report PASS when it didn’t really pass. A wall the agent cannot legitimately get past — a login screen, a permission dialog, a file download, an unexpected popup, an SSRF-blocked navigation — fails the step loudly through outcome verification, with a clear run-level banner, rather than quietly going green.

Can

Drive any visible UI it can see and reach
Report a truthful PASS or FAIL, backed by step screenshots

Cannot

Report PASS on a step it could not actually complete
Cross a login wall headless — it says so, loudly
Invent a result the screenshots don’t support
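The rule above can be expressed as a run-level verdict function. This is an illustrative sketch with assumed names and an assumed banner format: the run is green only if every planned step ran and was verified, so an aborted or partially executed run can never report PASS.

```python
def run_verdict(planned_steps: int, statuses: list[str]) -> tuple[str, str]:
    """Return (verdict, banner). PASS requires every planned step verified as PASS."""
    if planned_steps <= 0:
        return "FAIL", "no steps were planned"
    for index, status in enumerate(statuses, start=1):
        if status != "PASS":
            return "FAIL", f"step {index} did not pass"
    if len(statuses) < planned_steps:
        return "FAIL", f"only {len(statuses)} of {planned_steps} steps were executed"
    return "PASS", "all steps verified"


print(run_verdict(3, ["PASS", "PASS", "PASS"]))   # ('PASS', 'all steps verified')
print(run_verdict(3, ["PASS", "FAIL"]))           # ('FAIL', 'step 2 did not pass')
print(run_verdict(3, ["PASS", "PASS"]))           # ('FAIL', 'only 2 of 3 steps were executed')
```

The third case is the one that matters most: missing steps count against the run instead of being ignored.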

Execution model — headless by default, your browser optionally

By default the agent drives a server-side headless Chromium — no setup, runs anywhere. Optionally it can drive the user’s own browser through a Chrome extension, for flows that need an already-authenticated session. A run is a long-lived background task streamed live to the dashboard, so you watch it execute step by step instead of waiting for a verdict at the end.
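The "background task streamed live" shape can be sketched with `asyncio`: one task executes steps and pushes events onto a queue while a watcher consumes them as they arrive. The queue, the event fields and the `None` sentinel are assumptions of this sketch, and the `sleep(0)` call stands in for the real browser work.

```python
import asyncio


async def run_in_background(steps: list[str], events: asyncio.Queue) -> None:
    """Execute steps one by one, publishing an event after each."""
    for index, step in enumerate(steps, start=1):
        await asyncio.sleep(0)       # stand-in for actually driving the browser
        await events.put({"step": index, "text": step, "status": "PASS"})
    await events.put(None)           # sentinel: the run has finished


async def watch(steps: list[str]) -> list[dict]:
    """Start the run as a background task and consume its events as they arrive."""
    events: asyncio.Queue = asyncio.Queue()
    task = asyncio.create_task(run_in_background(steps, events))
    seen = []
    while (event := await events.get()) is not None:
        seen.append(event)           # a dashboard would render each event here
    await task
    return seen


for event in asyncio.run(watch(["open the form", "click Save"])):
    print(event)
```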

State — runs and evidence are persisted

A run streams from memory while it executes and is written through to a relational store (PostgreSQL): runs, steps, traces and the evidence record. A finished run’s evidence is durable and re-openable — the live view and the permanent record are the same data, captured once.
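A write-through store can be sketched as follows. To keep the example self-contained it uses the standard library's `sqlite3` with an in-memory database as a stand-in for PostgreSQL; the two-table schema, column names and `RunStore` class are hypothetical and far smaller than a real runs/steps/traces/evidence model.

```python
import sqlite3

SCHEMA = """
CREATE TABLE runs (id INTEGER PRIMARY KEY, test_name TEXT NOT NULL);
CREATE TABLE steps (
    run_id INTEGER NOT NULL REFERENCES runs(id),
    idx INTEGER NOT NULL,
    status TEXT NOT NULL,
    image_sha256 TEXT NOT NULL,
    PRIMARY KEY (run_id, idx)
);
"""


class RunStore:
    """Keeps the live view in memory and writes every step through to the database."""

    def __init__(self, conn: sqlite3.Connection) -> None:
        self.conn = conn
        self.conn.executescript(SCHEMA)
        self.live: dict[int, list[dict]] = {}     # run_id -> steps streamed so far

    def start_run(self, test_name: str) -> int:
        cur = self.conn.execute("INSERT INTO runs (test_name) VALUES (?)", (test_name,))
        self.conn.commit()
        self.live[cur.lastrowid] = []
        return cur.lastrowid

    def record_step(self, run_id: int, idx: int, status: str, image_sha256: str) -> None:
        step = {"idx": idx, "status": status, "image_sha256": image_sha256}
        self.live[run_id].append(step)             # what the dashboard streams
        self.conn.execute("INSERT INTO steps VALUES (?, ?, ?, ?)",
                          (run_id, idx, status, image_sha256))
        self.conn.commit()                         # what survives the process

    def reopen(self, run_id: int) -> list[dict]:
        rows = self.conn.execute(
            "SELECT idx, status, image_sha256 FROM steps WHERE run_id = ? ORDER BY idx",
            (run_id,)).fetchall()
        return [{"idx": i, "status": s, "image_sha256": h} for i, s, h in rows]


store = RunStore(sqlite3.connect(":memory:"))
run_id = store.start_run("BCCA questionnaire")
store.record_step(run_id, 1, "PASS", "ab12")
print(store.reopen(run_id) == store.live[run_id])   # True: same data, captured once
```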

Implementation details

These are the concrete bindings of the architecture above — the hosting, browser, intelligence and packaging. They are swappable implementation choices, not the architecture.

Hosting: Dual-cloud, with managed serverless containers on two public clouds (one primary, one warm backup), each deployed independently and staggered
Browser: Headless Chromium via Playwright, captured at 1280×1024
Intelligence: Model-agnostic inference via the ColdVault Platform, an external HTTPS call with no model weights in the engine
Packaging: One container, a Python FastAPI service serving the dashboard from the same process
Data: Managed PostgreSQL for runs, traces and evidence

A plain-English test is compiled to steps (the compiler and the agent each call the external model on the ColdVault Platform); the agent runs an act-and-verify loop — observing and acting on a headless browser by grid grounding — and emits a tamper-evident evidence record.

Roadmap — not yet built

Multi-user, parallel isolated sessions — the engine is single-instance today.
HTTPS + custom-domain cutover of the primary serving cloud.
Multi-model step verification — a second model confirms each PASS, to further kill false greens.
A standing public demo target — a recognizable enterprise portal the engine drives end to end.