TestRobin is an AI test-execution engine — you write a test in plain English, and an autonomous agent drives a real browser by sight, locating elements on the rendered screen through a grid rather than by CSS/XPath selectors, then produces a tamper-evident, step-by-step evidence record.
TestRobin today is a working single-instance reference implementation (one headless browser per run, packaged to run on two clouds: primary and warm backup), built to demonstrate the architecture; multi-tenant parallelism is on the roadmap, not yet built.
The agent locates elements by looking at the screenshot through a grid rather than by CSS/XPath selectors — so a markup-only refactor is far less likely to break a test.
A test is written in natural language; a compiler turns it into structured steps; the agent runs each one and verifies the outcome before it advances.
The vision model is an external, swappable inference call to the ColdVault Platform — no model weights live in the engine; the engine is the harness.
Every run captures per-step screenshots and a tamper-evident evidence record with a PDF execution summary, derived from what actually happened — aligned to (not certified against) ALCOA+ / Annex 11 / Part 11.
TestRobin is live: sign in and run a test, or watch one execute headless, step by step.
The whole flow, in order: a plain-English test → a compiler that turns it into structured steps → an agent loop that, each step, screenshots the page, locates the target on a grid (this section), acts, and verifies the outcome → a tamper-evident evidence record and a read-only API. The sections below walk it in that order.
The agent works from the rendered screenshot — the same pixels a human sees — and locates elements by sight, not by CSS/XPath selectors, IDs or DOM queries. Targeting is done by looking. The architectural payoff is durability: a refactor that renames a class, reshuffles the markup or swaps a component framework is far less likely to break the test, because the test was never bound to the markup in the first place.
Locating happens on a two-level grid. A continuous grid is drawn over the screenshot and the model names the cell: the column from the header tile, the row from the row tile. A finer grid is then drawn over just that cell and the model picks the precise spot. The hardest case, a dense matrix form (a severity×likelihood grid of checkboxes whose column headers have scrolled off the top), is exactly the one this method is built to nail; it is the signature capability.
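The arithmetic behind two-level grounding can be sketched as follows. This is a minimal illustration, not TestRobin's code: the function names, the grid sizes and the choice to click the centre of the fine cell are all assumptions made for the example.

```python
from typing import Tuple

Rect = Tuple[float, float, float, float]  # x, y, width, height in screenshot pixels


def cell_rect(rect: Rect, cols: int, rows: int, col: int, row: int) -> Rect:
    """Return the rectangle of one cell when `rect` is split into cols x rows."""
    if not (0 <= col < cols and 0 <= row < rows):
        raise ValueError(f"cell ({col}, {row}) is outside a {cols}x{rows} grid")
    x, y, w, h = rect
    cw, ch = w / cols, h / rows
    return (x + col * cw, y + row * ch, cw, ch)


def two_level_point(
    screen: Tuple[int, int],
    coarse: Tuple[int, int],
    coarse_pick: Tuple[int, int],
    fine: Tuple[int, int],
    fine_pick: Tuple[int, int],
) -> Tuple[float, float]:
    """Map a (coarse cell, fine cell) pair of picks to one pixel coordinate.

    `coarse` and `fine` are (cols, rows); the picks are zero-based (col, row).
    Hypothetical convention: the click lands on the centre of the fine cell.
    """
    full_screen: Rect = (0.0, 0.0, float(screen[0]), float(screen[1]))
    outer = cell_rect(full_screen, coarse[0], coarse[1], coarse_pick[0], coarse_pick[1])
    x, y, w, h = cell_rect(outer, fine[0], fine[1], fine_pick[0], fine_pick[1])
    return (x + w / 2, y + h / 2)
```

With a 1200×800 screenshot, a 12×8 coarse grid and a 4×4 fine grid, each coarse cell is 100×100 pixels and each fine cell 25×25, so the second pass narrows the target to a 25-pixel square without the model ever having to emit raw pixel coordinates.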
For off-screen targets the agent tiles the page (scrolls and captures each view as a separate image, never stitched — stitching downscales and loses legibility) and grounds within the right tile. Every step grounds from a fresh capture; there is no cross-step pixel memory to drift.
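Tiling a long page into separate viewport-sized captures might look like the sketch below. The function names are hypothetical, and the `overlap` parameter is an assumption added for illustration; the text above only states that each view is captured separately and never stitched.

```python
from typing import List, Tuple


def tile_offsets(page_height: int, viewport_height: int, overlap: int = 0) -> List[int]:
    """Scroll positions (top edge, in page pixels) of the tiles covering a page.

    The last tile is aligned to the bottom of the page so every tile is a
    full viewport; `overlap` (an assumed option) repeats some pixels between
    neighbouring tiles.
    """
    if viewport_height <= 0 or not 0 <= overlap < viewport_height:
        raise ValueError("need viewport_height > 0 and 0 <= overlap < viewport_height")
    if page_height <= viewport_height:
        return [0]
    last = page_height - viewport_height
    offsets = list(range(0, last, viewport_height - overlap))
    offsets.append(last)
    return offsets


def locate_in_tiles(page_y: int, offsets: List[int], viewport_height: int) -> Tuple[int, int]:
    """Return (tile index, y within that tile) for a page-level y coordinate."""
    for index, top in enumerate(offsets):
        if top <= page_y < top + viewport_height:
            return index, page_y - top
    raise ValueError("page_y is not covered by any tile")
```

Because each tile keeps its own scroll offset, a point grounded inside a tile can be converted back to a page position by adding that offset, and no image ever has to be downscaled.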
Tests are written in plain English. A compiler turns those natural-language steps into structured, executable steps. Steps name a value and a target semantically — “Click Rarely for the BCCA question” — never an ordinal position (“the third checkbox”), so a test survives columns being added or reordered.
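A toy illustration of why semantic naming survives reordering: the step stores labels, and positions are looked up only at run time. The `Step` fields and the list-based lookup are assumptions for the example; the engine described here resolves labels visually on the grid, not from Python lists.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Step:
    """A structured step as a compiler might emit it (hypothetical shape)."""
    action: str  # e.g. "click"
    value: str   # column label, e.g. "Rarely"
    target: str  # row label, e.g. "BCCA"


def resolve(step: Step, headers: List[str], rows: List[str]) -> Tuple[int, int]:
    """Find the (column, row) position by label, never by stored ordinal."""
    return headers.index(step.value), rows.index(step.target)


step = Step(action="click", value="Rarely", target="BCCA")
before = resolve(step, ["Never", "Rarely", "Often"], ["ABCD", "BCCA"])
after = resolve(step, ["Never", "Sometimes", "Rarely", "Often"], ["ABCD", "BCCA"])
```

Adding a "Sometimes" column moves "Rarely" from the second column to the third, yet the same step still resolves to the right cell; a step written as "the second checkbox" would now click the wrong one.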
The agent then runs each step in a two-phase loop: ACT (ground the target, then click / type / select) followed by VERIFY (confirm the page actually changed the way the step intended). It does not advance on an unverified step — a step that didn’t take effect is a failed step, not a silently-skipped one.
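The act-then-verify loop can be sketched as below. This is a simplified model under stated assumptions: `act` and `verify` are hypothetical callables standing in for grounding plus interaction and for outcome checking, and treating an exception as a failed step is a choice made for this example.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass(frozen=True)
class StepResult:
    step: str
    status: str  # "passed" or "failed"
    detail: str = ""


def run_steps(
    steps: Sequence[str],
    act: Callable[[str], None],
    verify: Callable[[str], bool],
) -> List[StepResult]:
    """Run each step as ACT then VERIFY; stop at the first unverified step."""
    results: List[StepResult] = []
    for step in steps:
        try:
            act(step)            # ACT: ground the target, then click / type / select
            ok = verify(step)    # VERIFY: did the page change as intended?
            detail = "" if ok else "outcome not verified"
        except Exception as exc:  # assumed policy: an error is a failure, never a skip
            ok, detail = False, f"error: {exc}"
        results.append(StepResult(step, "passed" if ok else "failed", detail))
        if not ok:
            break                # do not advance past an unverified step
    return results
```

The important property is that there is no path that appends a "passed" result without `verify` having returned true, and no path that continues after a failure.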
Model replies are parsed forgivingly (JSON or prose), and an unparseable reply triggers a re-ask rather than a hard failure — structured-output strictness is never assumed.
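Forgiving parsing with a re-ask could be approached along these lines. The reply shapes, the "column X, row N" prose pattern, the retry prompt and the `max_attempts` bound are all assumptions for illustration; the source states only that replies may be JSON or prose and that an unparseable reply is re-asked.

```python
import json
import re
from typing import Callable, Optional


def parse_reply(text: str) -> Optional[dict]:
    """Try strict JSON, then JSON embedded in prose, then a simple prose pattern."""
    candidates = [text]
    embedded = re.search(r"\{.*\}", text, re.DOTALL)
    if embedded:
        candidates.append(embedded.group(0))
    for candidate in candidates:
        try:
            parsed = json.loads(candidate)
        except json.JSONDecodeError:
            continue
        if isinstance(parsed, dict):
            return parsed
    # Hypothetical prose fallback, e.g. "I think it is column C, row 4."
    prose = re.search(r"column\s+(\w+).*?row\s+(\w+)", text, re.IGNORECASE | re.DOTALL)
    if prose:
        return {"column": prose.group(1), "row": prose.group(2)}
    return None


def ask_until_parsed(ask: Callable[[str], str], prompt: str, max_attempts: int = 3) -> dict:
    """Call the model, re-asking with a stricter hint while the reply is unparseable."""
    for attempt in range(max_attempts):
        hint = "" if attempt == 0 else "\nReply with a single JSON object only."
        parsed = parse_reply(ask(prompt + hint))
        if parsed is not None:
            return parsed
    raise ValueError(f"no parseable reply after {max_attempts} attempts")
```

Note that the prose fallback returns the row as a string, so a caller would still need to normalise types before acting on the result.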
The agent’s sight and reasoning come from a model-agnostic external inference call to an inference-as-a-service Platform (the ColdVault Platform) — no model weights live in the engine; the model is configuration. Today it routes to a frontier vision model; pointing it at a different one is a setting, not a rebuild.
There are two model paths, both external and swappable: the agent run (sight + step reasoning) goes through the Platform, and the compiler (plain English → structured steps) is its own call. Keeping intelligence outside the engine is the architectural decision: the engine is the durable harness — grid, step loop, verification, evidence — and the model is a replaceable part.
Engine vs model: the harness is target- and model-agnostic; choosing a stronger or cheaper model, or swapping providers, changes a config value, not the architecture.
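One way to keep the model a configuration value is to have the harness depend only on a callable, as in this sketch. `ModelConfig`, `make_client`, `locate` and the provider registry are hypothetical names; nothing here reflects the actual ColdVault Platform API.

```python
from dataclasses import dataclass
from typing import Callable, Dict

# (prompt, screenshot bytes) -> model reply; the only thing the harness knows about a model
InferenceFn = Callable[[str, bytes], str]


@dataclass(frozen=True)
class ModelConfig:
    provider: str  # hypothetical key into a provider registry
    model: str     # hypothetical model identifier passed through to the provider


def make_client(
    config: ModelConfig,
    providers: Dict[str, Callable[[str], InferenceFn]],
) -> InferenceFn:
    """Build the inference callable named by the configuration."""
    if config.provider not in providers:
        raise ValueError(f"unknown provider: {config.provider}")
    return providers[config.provider](config.model)


def locate(client: InferenceFn, screenshot: bytes, target: str) -> str:
    """Harness-side call; identical no matter which model answers."""
    return client(f"Locate: {target}", screenshot)
```

Swapping models then means constructing a different `ModelConfig`; `locate` and everything built on it stay unchanged.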
Every run captures per-step screenshots, each carrying a SHA-256 integrity hash printed in the report, and assembles a tamper-evident evidence record — any later alteration of a screenshot is detectable. The PDF execution summary opens with a short, model-written narrative laid over the deterministic step log — what was attempted, what happened, and the verified outcome of each step. You cannot fake what the engine did; the screenshots are the evidence.
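A minimal sketch of how per-screenshot SHA-256 hashes can make a record tamper-evident. Chaining each entry to the previous one is an assumption of this example, as are the field names; the source states only that each screenshot carries a SHA-256 hash and that later alteration is detectable.

```python
import hashlib
import json
from typing import List, Sequence, Tuple

GENESIS = "0" * 64  # assumed starting value for the chain


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def _entry_hash(entry: dict) -> str:
    body = {k: v for k, v in entry.items() if k != "entry_sha256"}
    return sha256_hex(json.dumps(body, sort_keys=True).encode("utf-8"))


def build_record(steps: Sequence[Tuple[str, bytes]]) -> List[dict]:
    """Build one entry per (description, screenshot), each linked to the previous."""
    record, prev = [], GENESIS
    for index, (description, screenshot) in enumerate(steps, start=1):
        entry = {
            "step": index,
            "description": description,
            "screenshot_sha256": sha256_hex(screenshot),
            "prev": prev,
        }
        entry["entry_sha256"] = _entry_hash(entry)
        prev = entry["entry_sha256"]
        record.append(entry)
    return record


def verify_record(record: List[dict], screenshots: Sequence[bytes]) -> bool:
    """Recompute every hash; any edited screenshot or entry makes this False."""
    if len(record) != len(screenshots):
        return False
    prev = GENESIS
    for entry, screenshot in zip(record, screenshots):
        if entry["prev"] != prev:
            return False
        if entry["screenshot_sha256"] != sha256_hex(screenshot):
            return False
        if entry["entry_sha256"] != _entry_hash(entry):
            return False
        prev = entry["entry_sha256"]
    return True
```

A hash chain of this kind detects alteration but does not prevent it, and it is only as trustworthy as the place the final hash is stored; a real deployment would need to anchor or sign that value.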
The reports are aligned to ALCOA+, EU GMP Annex 11 and 21 CFR Part 11 and support human review; the AI executes the test and drafts the summary, and a qualified human reviews and adopts every artifact. It is alignment and supporting evidence, not a certification or an independent validation.
The cardinal rule: a test must never report PASS when it didn’t really pass. A wall the agent cannot legitimately get past — a login screen, a permission dialog, a file download, an unexpected popup, an SSRF-blocked navigation — fails the step loudly through outcome verification, with a clear run-level banner, rather than quietly going green.
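The rule can be expressed as a verdict function in which PASS is the only outcome that has to be earned. The status strings, the banner wording and the treatment of an empty plan are assumptions for this sketch.

```python
from typing import List, Tuple


def run_verdict(planned_steps: int, statuses: List[str]) -> Tuple[str, str]:
    """Return (verdict, banner). PASS requires every planned step to be verified.

    `statuses` holds one entry per executed step, e.g. "passed", "failed"
    or "blocked" (hypothetical labels).
    """
    if planned_steps <= 0:
        return "FAIL", "No steps were planned, so nothing was verified."
    for index, status in enumerate(statuses, start=1):
        if status != "passed":
            return "FAIL", f"Step {index} of {planned_steps} ended as '{status}'."
    if len(statuses) != planned_steps:
        return "FAIL", f"{len(statuses)} of {planned_steps} planned steps have a result."
    return "PASS", ""
```

Every branch other than the last returns FAIL with a banner, so a run that stops early at a login wall or a blocked navigation cannot be reported green by omission.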
By default the agent drives a server-side headless Chromium — no setup, runs anywhere. Optionally it can drive the user’s own browser through a Chrome extension, for flows that need an already-authenticated session. A run is a long-lived background task streamed live to the dashboard, so you watch it execute step by step instead of waiting for a verdict at the end.
A run streams from memory while it executes and is written through to a relational store (PostgreSQL): runs, steps, traces and the evidence record. A finished run’s evidence is durable and re-openable — the live view and the permanent record are the same data, captured once.
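The write-through idea (one capture feeding both the live view and the durable record) can be sketched with Python's built-in `sqlite3` standing in for PostgreSQL. The `RunStore` class, the table layout and the column names are hypothetical.

```python
import sqlite3
from typing import List, Tuple

StepEvent = Tuple[str, int, str, str]  # run_id, step index, description, status


class RunStore:
    """Keeps step events in memory for streaming and writes each one through to a table."""

    def __init__(self, conn: sqlite3.Connection) -> None:
        self.conn = conn
        self.live: List[StepEvent] = []  # what a dashboard would stream from
        conn.execute(
            "CREATE TABLE IF NOT EXISTS steps ("
            "run_id TEXT, idx INTEGER, description TEXT, status TEXT, "
            "PRIMARY KEY (run_id, idx))"
        )

    def record_step(self, run_id: str, idx: int, description: str, status: str) -> None:
        event: StepEvent = (run_id, idx, description, status)
        with self.conn:  # commits on success, rolls back on error
            self.conn.execute("INSERT INTO steps VALUES (?, ?, ?, ?)", event)
        self.live.append(event)  # only visible live once it is durable

    def reopen(self, run_id: str) -> List[StepEvent]:
        """Read a run back from the durable store, in step order."""
        rows = self.conn.execute(
            "SELECT run_id, idx, description, status FROM steps "
            "WHERE run_id = ? ORDER BY idx",
            (run_id,),
        )
        return [tuple(row) for row in rows]
```

Appending to the in-memory list only after the insert commits is one way to keep the streamed view from ever showing a step that the permanent record lacks.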
These are the concrete bindings of the architecture above — the hosting, browser, intelligence and packaging. They are swappable implementation choices, not the architecture.
A plain-English test is compiled to steps (the compiler and the agent each call the external model on the ColdVault Platform); the agent runs an act-and-verify loop — observing and acting on a headless browser by grid grounding — and emits a tamper-evident evidence record.