A few sprints back, we merged a pull request with 60+ Playwright tests the LLM drafted in under 20 minutes. Dazzling velocity—until CI exposed three flaky selectors, two hallucinated API calls, and a “happy path” that never asserted the real business outcome. That’s the day we wrote a new rule on the whiteboard: AI can write tests fast, but only disciplined review turns them into assets.


These are field notes from the trenches—from our Omniit.ai QE team—reviewing AI-generated Selenium/Playwright and API tests in e-commerce and edtech stacks. This is exactly what we tried, what burned us, what stuck, and how we tuned the review process so velocity didn’t bury quality.




Why we’re sharing this now

AI is no longer just “helping” developers—it’s drafting code, proposing diffs, even raising PRs for small fixes. GitHub’s own engineering blog made one message unmissable: humans still own the merge button. AI can point; we decide. That’s been our practical reality as well. (The GitHub Blog)


AI review itself has grown up fast. GitHub Copilot now participates in pull-request reviews with suggested changes that we can apply with a click; GitLab and Bitbucket double down on approvals and merge checks so nothing lands without a human sign-off. We’ve leaned into those controls. (GitHub Docs, GitLab Docs, Atlassian Support)


And yet, we keep one research finding taped to the wall: AI assistance can nudge teams toward less secure patterns—and higher (false) confidence. It’s not a reason to avoid AI; it’s a reason to fortify review. (arXiv)



What actually changed in review once GenAI showed up

Before GenAI, most test PRs were small and hand-curated. After GenAI, we were triaging volume: more tests, more quickly, with uneven depth. That forced three pivots:

  1. We pushed the nits out of human brains. Linters, formatters, and AI review catch trivial problems. Humans invest in intent, coverage, and risk (Did we assert the business outcome? Is the API state correct?). GitHub’s Copilot review helped us drain nit-picks so our eyes go straight to the logic. (GitHub Docs)
  2. We hardened the merge gate. Required approvals, coverage deltas, green CI, no unresolved tasks. GitLab’s approval rules and Bitbucket’s merge checks made it mechanical to enforce. We learned to love the guardrails. (GitLab Docs, Atlassian Support)
  3. We reframed AI as a junior teammate. AI drafts and reviews; humans coach and decide. GitHub’s “developers own the merge” line captured exactly how our process feels day to day. (The GitHub Blog)


What it looked like side-by-side

| Dimension | Before GenAI (human-written tests) | After GenAI (AI-drafted + human review) |
| --- | --- | --- |
| Throughput | Few PRs, curated scenarios | Many PRs, draft breadth explodes |
| Reviewer focus | Style, logic, coverage | Intent & domain fit, flake risk, governance |
| Typical issues found | Misused waits, brittle locators | Hallucinated flows, shallow asserts, redundancy |
| Tooling | Linters + CI + review UI | + AI PR review, stricter approvals/merge checks |
| Approval model | 1–2 peers sign off | Human approval mandatory; AI is advisory only |
| Success signal | PR merges | Merge and post-merge stability/coverage clarity |


What we tried (and kept) for Selenium/Playwright + API reviews


1) Self-review before the PR exists

We asked authors to run their own “pre-review”:

  • Execute the tests locally and attach artifacts (Playwright videos/screenshots); see the config sketch after this list.
  • Let Copilot review in the IDE and accept only the helpful suggestions.
  • Trim anything that looks like a hallucination (e.g., nonexistent coupon codes).

This alone killed a surprising amount of noise before a reviewer ever looked. (GitHub Docs)
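
For the artifacts step, here is a minimal Playwright config sketch. The option values are standard Playwright; the retry count and reporter choice are illustrative, not a recommendation.

```ts
// playwright.config.ts — keep failure evidence for the pre-review and the PR
import { defineConfig } from '@playwright/test';

export default defineConfig({
  retries: 1, // illustrative: one retry surfaces flake without hiding it
  use: {
    screenshot: 'only-on-failure', // screenshots attached for failed tests
    video: 'retain-on-failure',    // keep video only when a test fails
    trace: 'on-first-retry',       // full trace whenever a retry was needed
  },
  reporter: [['html', { open: 'never' }], ['list']],
});
```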


2) Small, themed PRs

Even if the AI drafted a buffet, we sliced PRs by user-focused themes (“Cart totals after tax,” “Invalid card path”). Small diffs kept reviewers thinking, not skimming.


3) Two reviewers, two hats

We standardized two review roles per PR:

  • Framework reviewer (SDET): selectors, waits, fixtures, teardown, helper reuse, flake risk.
  • Domain reviewer (PM/QA/Dev): “Does this prove the business rule?” For edtech, that meant verifying progress persistence; for e-commerce, inventory and ledger effects.


4) Merge checks we won’t compromise on

We locked in required approvals, green CI, coverage thresholds, and “no unresolved tasks.” (GitLab approvals and Bitbucket merge checks made this easy to enforce across repos.) (GitLab Docs, Atlassian Support)


5) Post-merge observability

We graph new flakes per release, time-to-quarantine, and coverage deltas on critical paths. If a new flake appears, we quarantine in <48h or revert.
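
Behind those graphs sits a small script along these lines. This is a sketch that assumes a per-run JSON summary our CI emits; the run-summary.json file and its field names are hypothetical.

```ts
// flake-report.ts — aggregate post-merge flake signals per release
// (the input shape below is an assumption, not a standard report format)
import { readFileSync } from 'node:fs';

interface RunSummary {
  release: string;
  tests: { name: string; attempts: { status: 'passed' | 'failed' }[] }[];
}

const summary: RunSummary = JSON.parse(readFileSync('run-summary.json', 'utf8'));

// A test counts as flaky if it failed at least once but passed on a retry.
const flaky = summary.tests.filter((t) => {
  const failedOnce = t.attempts.some((a) => a.status === 'failed');
  const passedLast = t.attempts[t.attempts.length - 1].status === 'passed';
  return failedOnce && passedLast;
});

const flakeRate = (flaky.length / summary.tests.length) * 100;
console.log(`[${summary.release}] flaky: ${flaky.length}/${summary.tests.length} (${flakeRate.toFixed(1)}%)`);
flaky.forEach((t) => console.log(`  quarantine candidate: ${t.name}`));
```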



What failed (and how we patched it)

  • “Looks right” tests that didn’t assert business truth. Early on, AI produced lovely UI flows that never checked post-conditions (e.g., inventory decrement, learning progress). We added a hard rule: no merge without a real post-condition.
  • Brittle selectors and sleep() land mines. Absolute XPaths and blind waits made beautiful tests flaky. We centralized helper utilities, standardized on role/test-id locators, and banned raw sleeps except in rare, documented cases (see the helper sketch after this list).
  • Hallucinated steps/endpoints. In e-commerce checkout, AI kept “applying” fake coupons. In edtech, it “verified” a nonexistent profile badge. We built a review prompt: “Point to the live element/API you are asserting.” If no link, it’s out.
  • Over-trusting the AI review. Copilot’s PR suggestions saved us from nit-picks—but they’re suggestions, not decisions. We coach reviewers to treat AI comments the same way they’d treat a junior teammate’s note: evaluate, don’t rubber-stamp. (The GitHub Blog)
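
The "centralized helper utilities" look roughly like this. A minimal sketch assuming Playwright and a data-testid convention; the helper names are ours for illustration, not a standard API.

```ts
// test-helpers.ts — bounded, explicit waits instead of raw sleeps
import { Page, Locator, expect } from '@playwright/test';

// Prefer role/test-id locators over absolute XPath.
export function byTestId(page: Page, id: string): Locator {
  return page.getByTestId(id);
}

// Wait for an element to be visible and enabled, with an explicit upper bound.
export async function waitForReady(locator: Locator, timeoutMs = 10_000): Promise<void> {
  await expect(locator).toBeVisible({ timeout: timeoutMs });
  await expect(locator).toBeEnabled({ timeout: timeoutMs });
}

// The only sanctioned "sleep": documented, bounded, and greppable in review.
export async function documentedPause(page: Page, ms: number, reason: string): Promise<void> {
  if (ms > 2_000) throw new Error(`Pause too long (${ms}ms): ${reason}`);
  console.warn(`documentedPause(${ms}ms): ${reason}`); // reviewers can challenge the reason
  await page.waitForTimeout(ms);
}
```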


Community experiences mirrored ours: testers see speed, but also hallucinations and shallow checks—human validation remains mandatory. (Ministry of Testing)



What stuck: our living checklist (UI & API)

Selenium/Playwright

  • Selectors: test-ids or accessible locators; no absolute XPath.
  • Waits: explicit, bounded; helper wrappers over raw waits.
  • Assertions: verify business outcomes (e.g., taxed total, inventory), not just “toast visible.”
  • Fixtures/teardown: idempotent; unique test users/orders; clean up every time (see the fixture sketch after this list).
  • Flake magnets: animations, iframes, multi-tab—stabilize with helpers.
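
For the fixtures/teardown item, here is a minimal Playwright fixture sketch of what "idempotent, unique, always cleaned up" means in practice. The createUser/deleteUser helpers and the admin-api module are hypothetical stand-ins for whatever your stack exposes.

```ts
// fixtures.ts — unique test data per test, torn down even on failure
import { test as base } from '@playwright/test';
import { randomUUID } from 'node:crypto';

// Hypothetical admin-API helpers; swap in your own client.
import { createUser, deleteUser, TestUser } from './admin-api';

export const test = base.extend<{ testUser: TestUser }>({
  testUser: async ({}, use) => {
    // Unique identity per test keeps runs idempotent and parallel-safe.
    const user = await createUser({ email: `qa+${randomUUID()}@example.test` });
    try {
      await use(user); // the test body runs here
    } finally {
      await deleteUser(user.id); // teardown runs even if the test failed
    }
  },
});

export { expect } from '@playwright/test';
```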


API

  • Contract & schema: validate shape; flag drifts.
  • Negative cases: auth failures, rate limits, malformed bodies, boundary values.
  • Stateful effects: inventory/ledger/learning progress; assert the downstream state or event (see the sketch after this list).
  • Idempotency: retry-safe design, correlation IDs in CI.
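
A sketch of the stateful-effects and correlation-ID items using Playwright’s request fixture. The endpoints, payloads, and header name are illustrative, not our real contract, and a baseURL is assumed in config.

```ts
// order-api.spec.ts — assert the downstream state, not just the 200
import { test, expect } from '@playwright/test';
import { randomUUID } from 'node:crypto';

test('placing an order decrements inventory', async ({ request }) => {
  const correlationId = randomUUID(); // lets CI logs be joined across services

  const inventoryUrl = '/api/inventory/sku-123'; // illustrative endpoint
  const stockBefore = (await (await request.get(inventoryUrl)).json()).available;

  const order = await request.post('/api/orders', {
    headers: { 'x-correlation-id': correlationId },
    data: { sku: 'sku-123', quantity: 1 },
  });
  expect(order.status()).toBe(201);

  // Contract/shape check: flag drift even when the status code looks fine.
  const body = await order.json();
  expect(body).toEqual(expect.objectContaining({ orderId: expect.any(String) }));

  // Stateful effect: inventory actually moved.
  const stockAfter = (await (await request.get(inventoryUrl)).json()).available;
  expect(stockAfter).toBe(stockBefore - 1);
});
```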


Governance

  • Secrets: never in code; env-scoped only.
  • Licensing/IP: reject pasted chunks of unknown origin.
  • Approvals: AI cannot merge. Humans sign and own the outcome. (The GitHub Blog)


Two stories we tell new teammates

E-commerce checkout (Playwright)

What the AI drafted: add-to-cart → apply coupon → choose shipping → pay.

What we changed:

  • Replaced brittle XPaths with getByRole/data-testid.
  • Asserts that matter: order confirmation and inventory decrement via admin API.
  • Added invalid card and 3-D Secure challenge paths; stabilized async modals with helper waits (condensed sketch below).

Lesson: The AI nails the skeleton; we confirm the economic truth.
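
A condensed sketch of the reworked flow. The routes, test-ids, and the admin inventory endpoint are hypothetical placeholders for the real ones.

```ts
// checkout.spec.ts — the skeleton came from the AI; the assertions are ours
import { test, expect } from '@playwright/test';

test('checkout confirms the order and decrements inventory', async ({ page, request }) => {
  const inventoryUrl = '/admin/api/inventory/blue-widget'; // assumed admin endpoint
  const stockBefore = (await (await request.get(inventoryUrl)).json()).available;

  await page.goto('/products/blue-widget'); // illustrative route
  await page.getByRole('button', { name: 'Add to cart' }).click();
  await page.getByTestId('checkout-button').click();
  await page.getByLabel('Card number').fill('4242 4242 4242 4242'); // test card
  await page.getByRole('button', { name: 'Pay now' }).click();

  // Business outcome 1: the user sees a confirmed order, not just a toast.
  await expect(page.getByTestId('order-confirmation')).toContainText(/order #\d+/i);

  // Business outcome 2: inventory moved on the backend.
  const stockAfter = (await (await request.get(inventoryUrl)).json()).available;
  expect(stockAfter).toBe(stockBefore - 1);
});
```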


Edtech mobile + backend (API + UI)

What the AI drafted: open lecture, click “Play”, assert player UI.
What we changed:

  • Assert playback time increases and progress persists via learner profile API.
  • Simulated poor network; verified retries and UX fallback (see the sketch below).

Lesson: “UI is visible” ≠ “learning happened.” Assertions must reflect outcomes, not presence.
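
A sketch of the reworked edtech check, using Playwright’s expect.poll for the persisted-progress assertion and a blunt context.route abort to simulate a flaky network. The routes and the learner-profile endpoint are hypothetical.

```ts
// lecture-progress.spec.ts — "learning happened" means the backend says so
import { test, expect } from '@playwright/test';

test('playback progress persists to the learner profile', async ({ page, context, request }) => {
  // Crude poor-network simulation: drop every third progress call and rely on client retries.
  let calls = 0;
  await context.route('**/api/progress/**', (route) =>
    calls++ % 3 === 0 ? route.abort() : route.continue()
  );

  await page.goto('/courses/intro/lecture-1'); // illustrative route
  await page.getByRole('button', { name: 'Play' }).click();

  // Outcome, not presence: the learner-profile API must eventually report progress > 0.
  await expect
    .poll(
      async () => (await (await request.get('/api/learners/me/progress')).json()).seconds,
      { timeout: 30_000 }
    )
    .toBeGreaterThan(0);
});
```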


The visuals we use to keep ourselves honest


We watched time-to-merge drop across sprints as we shifted nit-picks to automation (formatters + AI suggestions) and focused human eyes on logic and risk.

[Chart: AI code review cycle time trend]


“Nits-per-PR” fell once AI suggested the obvious changes up front and our checklist took care of style. That freed reviews for the hard questions.

[Chart: AI nitpick comments per PR]

Stabilizing selectors, waits, and teardown kept flake rate dropping even as AI boosted test volume.

[Chart: flaky test rate trend for AI-generated tests]



The toolchain that actually helped

  • GitHub PRs + Copilot Code Review for suggested changes we can accept in one click—great for clearing noise, not for making business decisions. (GitHub Docs)
  • GitLab approvals to enforce minimum reviewers, coverage thresholds, and even special approvers for vulnerabilities. (GitLab Docs)
  • Bitbucket merge checks to block merges with unresolved tasks or missing approvals. (Atlassian Support)
  • Gerrit in a couple of regulated projects for granular, change-set workflows. (Fine-grained gates can be worth it.)
  • Static checks & secret scanning in CI before any human looks—AI should never be your first filter.


Emerging reality we’re tracking: agentic workflows that draft PRs for small tickets. Treat them as eager juniors—useful, but still subject to the same human gate. (IT Pro)



Metrics that moved the room

  • Avg hours to merge for test PRs (down as nits shifted to automation).
  • Nit-to-logic comment ratio (we want logic comments to dominate).
  • New flakes introduced per release (goal: near zero; quarantine fast).
  • Coverage delta on critical journeys (cart→checkout; enroll→complete).
  • Time-to-quarantine flaky tests (<48h).
  • % PRs with domain approver (edtech learning outcomes, e-com order integrity).

    Industry context: while AI boosts throughput, quality debt grows if teams chase speed blindly; recent industry press highlights widening gaps when guardrails are weak. Keep eyes on stability and security signals, not just velocity. (TechRadar)


What still trips us up (and how we cope)

  • Genuine understanding of intent. AI can’t feel the business like a product owner. We pair SDET + domain reviewer and insist on post-conditions that mirror business truth.
  • Noise vs. signal in AI suggestions. We tuned prompts and disabled rules that nag about style we don’t care about. (Humans decide; AI assists.) (GitHub Docs)
  • Maintaining a huge test suite. AI can create “too much of a good thing.” We prune redundant tests monthly and refactor helpers relentlessly.
  • Security complacency. We keep the Stanford finding in mind: AI can increase insecure patterns and overconfidence. Security review is part of our test PR checklist. (arXiv)


The broader testing community’s take aligns with ours: use AI to draft faster, then critically analyze and validate. (Ministry of Testing)



What we’d copy again next sprint

  1. Codify the checklist (UI/API + governance) in the repo.
  2. Lock merge checks (min approvals, CI green, coverage, no open tasks). (GitLab Docs, Atlassian Support)
  3. Adopt AI review as pre-filter, not judge. Accept the good, ignore the noise. (GitHub Docs)
  4. Split the review hats (framework + domain).
  5. Make flakiness a first-class metric; quarantine in <48h.
  6. Audit AI-generated tests monthly; prune redundancy.


Where Omniit.ai fits (and doesn’t)

We don’t use our Omniit.ai platform as a code-review tool—that’s GitHub/GitLab/Bitbucket territory. Where Omniit.ai does help our clients is everything around review that keeps AI-generated tests trustworthy:

  • Selector and fixture patterns that minimize flake risk before a PR exists.
  • Opinionated teardown & data management so suites stay clean at scale.
  • Governance & signal dashboards that make merge gates and stability visible to engineering leadership.


If your AI drafts are outpacing your review capacity (and confidence), we can help you dial in patterns and guardrails that keep speed and trust in balance—without changing the tools you already use for code review.



References & community reads

  • GitHub Engineering Blog — “Code review in the age of AI: Why developers will always own the merge button.” Jul 14, 2025. Humans decide; AI assists. (The GitHub Blog)
  • GitHub Docs — “Using GitHub Copilot code review.” PR suggestions you can apply with a click. (GitHub Docs)
  • GitLab Docs — “Merge request approvals.” Required approvals and coverage checks. (GitLab Docs)
  • Bitbucket Cloud Docs — “Require checks before a merge.” Approvals, builds, unresolved tasks. (Atlassian Support)
  • Stanford/ACM CCS 2023 — “Do Users Write More Insecure Code with AI Assistants?” Risks + overconfidence. (arXiv)
  • Ministry of Testing (community threads) — Pitfalls & critical analysis of AI-generated tests (hallucinations, validation). (Ministry of Testing)
  • Thoughtworks — Experimental study on AI-generated tests from user stories (coverage, iteration). (Thoughtworks)
  • GitHub “agents panel” news — AI agents that draft PRs (treat as junior contributors). (IT Pro)