Part 2 — How to review Cursor-generated test automation code before it hits master
Then a senior engineer pinged me: “Cursor rewrote the failing tests and added self-healing so the issue won’t happen again. Can we merge?”
I opened the PR. The diff was… enormous. Fuzzy matching. Locator fallbacks. An auto-retry loop that “stabilized” the flow. Fancy! But the real risk here isn’t flakiness anymore. It is tests that can pass while quietly rerouting the truth.
So picking up from Part 1, here’s how my QE code review changed, and how I review AI-generated test code and self-healing behavior so speed doesn’t turn into a blindfold.
Core problem: speed and self-heal both move the “truth boundary”
AI-generated automation code can absolutely accelerate delivery—especially when Cursor has enough repo context to stitch together page objects, fixtures, and stable locator patterns. Cursor also now advertises integrated review workflows (e.g., GitHub-connected review helpers) that can catch issues early.
Self-healing (locator fallback, semantic matching, auto-updating steps) reduces churn when UI changes. But the trade is subtle:
- Generation risk: AI can produce valid-looking automation that asserts the wrong thing, at the wrong layer, with the wrong waits.
- Self-heal risk: a tool can “find something clickable” and keep the test moving—masking regressions, role-based UI issues, and environment drift.
If Part 1 was “how to get value fast,” this Part 2 is the counterweight:
Treat AI-generated changes and healed changes as production code changes that require review-grade auditability.
What got easier in code review (honestly)
1) PRs arrive with structure, not fragments
Instead of 5 half-finished test files, you often get a cohesive set: fixtures, page helpers, and a first draft of selectors and assertions. That’s real velocity.
2) Consistency is easier to enforce
If your repo has a clear testing style (naming, locator strategy, assertion rules), Cursor tends to follow it after a few examples and guardrails. You can “shape” output faster than you can author from scratch.
3) Self-healing removes the most boring maintenance… sometimes
Locator-only healing can save hours when IDs or classes get renamed, but only while your test intent stays stable and the DOM shape doesn’t become ambiguous.
What got underestimated
1) The diff got bigger, and the risk got quieter
Cursor tends to refactor multiple files at once, even when it doesn’t have to. In practice, you often end up reviewing a small testing system rather than a single test change.
2) “Stability” features can become bug-masking features
Common patterns:
- Retry loops that ignore state correctness
- Fallback selectors that accept the wrong element
- Broad text matching (“closest match”)
- Self-heal auto-updates that normalize unintended UI behavior
3) Failures stop being obvious
A broken test is noisy but honest.
A healed test can be green and wrong.
So your review has to shift from “does it run?” to “does it prove the thing we think it proves?”
Self-Healing Strategies Compared
| Strategy | Typical Tools | What It Fixes Well | Where It Breaks | Risk Level | Required Guardrails |
|---|---|---|---|---|---|
| Locator fallback | Framework plugins | renamed IDs/classes | duplicate elements | Medium | page-state assertions |
| Semantic matching | NLP/vision | text/style changes | wrong “closest match” | High | change audit + approvals |
| Auto-update steps | SaaS platforms | minor UI churn | workflow changes | High | human review, diff logs |

(Self-healing commonly spans more than locators. Some approaches aim to “heal” interaction or flow changes too, which raises the governance stakes.)
The code review mindset shift: you’re auditing intent, not syntax
When I review Cursor/AI-generated automation code (especially with self-heal enabled), I assume three things until proven otherwise:
- The test may be asserting the wrong contract.
- A self-heal may have “fixed” the wrong element.
- The code may be robust in the wrong direction (stable but meaningless).
Code Review focus: Human-authored vs Cursor/AI-generated tests
| Review Dimension | Traditional Human Code | Cursor-Generated Code | What to do differently |
|---|---|---|---|
| Selector quality | Usually intentional | Often “works-first” | Enforce selector policy + ban brittle fallbacks |
| Assertions | Typically minimal but relevant | Can be verbose or misplaced | Require “assert the contract” rules |
| Waits/retries | Known anti-patterns | AI loves “stabilizers” | Reject silent retries without state checks |
| Test boundaries | Clear-ish (sometimes) | Can mix UI/API/fixtures in one step | Require layered structure & helper boundaries |
| Maintainability | Consistent with team norms | Can introduce new patterns fast | Block new patterns unless added to standards |
| Observability | Relies on CI output | May add logs but not signals | Require artifacts: screenshots, network logs, heal logs |
Mini case study: web app checkout flow + self-healing that “lies”
Scenario
A B2C web app checkout: Cart → Address → Payment → Confirm.
A UI change happens:
- “Place Order” button text changes to “Submit Order”
- A new “Submit” button appears in a newsletter widget on the same page
- Role-based users see different button order
What Cursor generates (typical failure mode)
Cursor updates the test to:
- Search for `Submit` text
- Add a fallback locator that chooses the first match
- Wrap click in a retry loop
- Add self-heal update to “remember” the new locator
Result: tests pass for admin users in staging, fail intermittently for standard users, and quietly click the newsletter widget in production-like data.
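To make the failure mode concrete, here is a minimal Playwright/TypeScript sketch of what this kind of “fixed” test tends to look like. The URL, selectors, and helper name are hypothetical, not taken from the actual PR.

```ts
import { test, Page } from '@playwright/test';

// Hypothetical reconstruction of the healed test: it keeps the flow moving,
// but never proves it clicked the checkout button or reached confirmation.
async function placeOrder(page: Page) {
  for (let attempt = 0; attempt < 3; attempt++) {
    try {
      // Fuzzy text + "first match wins": on this page, that can be the
      // newsletter widget's "Submit" instead of "Submit Order".
      await page.getByText(/submit/i).first().click({ timeout: 2_000 });
      return; // no check that the checkout step actually advanced
    } catch {
      await page.waitForTimeout(1_000); // sleep-based "stabilization"
    }
  }
}

test('checkout completes (healed, but unverified)', async ({ page }) => {
  await page.goto('https://shop.example.test/checkout'); // hypothetical URL
  await placeOrder(page);
  // Missing: confirmation banner, order ID format, backend success signal.
});
```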
The QE review intervention
Review focuses:
1) Locator policy enforcement
- Prefer user-facing, stable selectors (role/label/testid) over text fuzzy matches when the page is busy.
- If a fallback exists, it must prove uniqueness (count = 1) and validate context.
2) Intent-anchored assertions before and after critical actions
- Before clicking “Place Order”: assert you’re in the payment confirmation section (not just “button visible”).
- After clicking: assert order confirmation banner + order ID format + backend call succeeded (if allowed).
3) Self-heal governance
- The “healed” locator update is not a convenience; it’s a behavior change.
- Treat the healed diff like any other code change: reviewed, approved, and traceable.
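Applied to the same checkout flow, the reviewed version might look like the sketch below (Playwright/TypeScript). The test IDs, button name, and order-ID format are assumptions for illustration, not the real app’s contract.

```ts
import { test, expect } from '@playwright/test';

test('checkout places an order for a standard user', async ({ page }) => {
  await page.goto('https://shop.example.test/checkout'); // hypothetical URL

  // Intent anchor BEFORE the action: we must be in the payment section,
  // not just "some Submit button is visible somewhere".
  const payment = page.getByTestId('payment-confirmation'); // assumed test id
  await expect(payment).toBeVisible();

  // Role + accessible name, scoped to the payment section; prove uniqueness
  // before acting so a newsletter "Submit" can never satisfy this locator.
  const submitOrder = payment.getByRole('button', { name: 'Submit Order' });
  await expect(submitOrder).toHaveCount(1);
  await submitOrder.click();

  // Intent anchor AFTER the action: assert the contract, not the click.
  await expect(page.getByTestId('order-confirmation')).toBeVisible();
  await expect(page.getByTestId('order-id')).toHaveText(/^ORD-\d{8}$/); // assumed format
});
```

The point is not the specific selectors; it’s that every critical action is bracketed by assertions that pin down where we are and what just happened.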

Where self-healing helps (the safe win)
In the same flow, a CSS class rename breaks a secondary selector inside a stable container (the data-testid anchor remains). Locator fallback can safely recover here, provided the test validates page state and uniqueness.
Dividing line: self-heal is safe when it’s constrained to non-ambiguous recovery and proven by assertions.
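A constrained recovery helper makes that dividing line explicit: accept a fallback only when it is unique inside a known-stable, test-id-anchored container, and fail loudly otherwise. A minimal sketch (the helper and its signature are hypothetical):

```ts
import { Page, Locator, expect } from '@playwright/test';

// Hypothetical fallback helper: recover from a renamed class, but only when
// the candidate is unique inside a stable, test-id-anchored container and
// the page state still matches expectations. Otherwise, fail loudly.
export async function resolveWithFallback(
  page: Page,
  primary: Locator,
  fallback: Locator,
  containerTestId: string
): Promise<Locator> {
  if ((await primary.count()) === 1) return primary;

  const container = page.getByTestId(containerTestId);
  await expect(container).toBeVisible(); // page-state check, not just "exists"

  const scoped = container.locator(fallback);
  await expect(scoped).toHaveCount(1); // ambiguity means no self-heal
  return scoped;
}
```

In review, the question to ask is whether the helper refuses ambiguous matches rather than quietly picking one.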
Practical code review checklist for Cursor/AI-generated UI automation
Gate 1 — “What contract does this test prove?”
Require a one-paragraph PR description answering:
- What user behavior is validated?
- What critical states are asserted?
- What’s explicitly not asserted?
Gate 2 — Selector integrity (ban the silent footguns)
Reject if you see:
- `nth-child`, layout-based selectors, or brittle XPaths as the primary path
- “First match wins” behavior without a uniqueness check
- Text fuzzy matching on pages with repeated labels
- Locators that bypass accessibility roles where available
Accept if you see:
- `getByRole`, `getByLabel`, `getByTestId` patterns (or your equivalent)
- Strict uniqueness validation before action
- Page-object boundaries that keep locators centralized
(Playwright explicitly encourages resilient locator strategies and discourages brittle selectors.)
Gate 3 — Anti-mask rules (the big one)
Hard “no” patterns:
- Retries that don’t assert state change
- Catch-and-continue blocks around clicks/navigation
- Broad “waitForTimeout” stabilization sprinkled everywhere
- Screenshot-on-failure disabled “because noisy”
Hard “yes” patterns:
- Retries only around known transient infrastructure issues (and logged)
- Every retry accompanied by state assertions (“we are on step X”)
- Failure artifacts captured consistently
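For the “hard yes” side, a retry wrapper can be made honest by scoping it to known transient errors and forcing a state assertion on every attempt. A hedged sketch in TypeScript (the transient-error check and step assertions are assumptions about your environment):

```ts
// Hypothetical retry wrapper: retries only known transient failures,
// logs every attempt, and always re-asserts the expected page state.
export async function retryOnTransient<T>(
  action: () => Promise<T>,
  assertState: () => Promise<void>, // "we are on step X", not just "no exception"
  isTransient: (err: unknown) => boolean,
  maxAttempts = 2
): Promise<T> {
  for (let attempt = 1; ; attempt++) {
    try {
      const result = await action();
      await assertState();
      return result;
    } catch (err) {
      if (attempt >= maxAttempts || !isTransient(err)) throw err;
      console.warn(`transient failure, retrying (attempt ${attempt})`, err);
    }
  }
}

// Usage sketch (names hypothetical):
// await retryOnTransient(
//   () => paymentPage.submitOrder(),
//   async () => { await expect(page.getByTestId('order-confirmation')).toBeVisible(); },
//   (err) => String(err).includes('ECONNRESET') // known infra flake only
// );
```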
Gate 4 — Self-heal changes must be reviewable
If your tool is doing any auto-heal / auto-update, require:
- A diff or artifact showing what changed (locator before/after)
- A confidence score or rationale (if available)
- A “why it’s safe” statement tied to uniqueness + page-state assertions
- A flag if the healed element appears in multiple regions/components
Rule of thumb:
If a healed change would scare you as a manual PR, it should scare you as a healed change.
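One way to enforce this is to require every heal to ship a structured record the reviewer can read, and to auto-flag anything ambiguous. A possible shape, sketched in TypeScript (field names are assumptions, not any specific tool’s schema):

```ts
// Hypothetical heal-audit record a self-heal run would attach to the PR.
// The review gate below refuses anything ambiguous or cross-region.
interface HealRecord {
  testId: string;                 // e.g. "checkout.place-order"
  locatorBefore: string;          // "text=Place Order"
  locatorAfter: string;           // "getByRole('button', { name: 'Submit Order' })"
  confidence?: number;            // 0..1, if the tool reports one
  rationale: string;              // why the tool believes this is the same element
  matchCount: number;             // how many elements matched locatorAfter
  regions: string[];              // components/regions where the element appears
  pageStateAssertions: string[];  // assertions proving we were on the right step
}

function requiresHumanApproval(heal: HealRecord): boolean {
  return (
    heal.matchCount !== 1 ||                 // ambiguity
    heal.regions.length > 1 ||               // appears in multiple components
    heal.pageStateAssertions.length === 0 || // no "why it's safe" evidence
    (heal.confidence !== undefined && heal.confidence < 0.9)
  );
}
```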
A step-by-step adoption playbook (with guardrails)
Step 1 — Define “reviewable output” before you scale usage
Deliverables:
- Locator policy (preferred → fallback → forbidden)
- Assertion policy (critical state checks required)
- Self-heal policy (what can be healed automatically vs blocked)
- PR template for AI-generated changes
Guardrail: Block merges that don’t use the template.
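These policies work best as data in the repo that lint rules and the heal tooling can read, not as wiki prose. A minimal sketch of what that file could contain (structure and values are assumptions):

```ts
// Hypothetical policy file (e.g. test-policy.ts) consumed by custom lint
// rules and by the self-heal tool's configuration.
export const locatorPolicy = {
  preferred: ['getByRole', 'getByLabel', 'getByTestId'],
  fallback: ['css scoped to a test-id container, uniqueness asserted'],
  forbidden: ['nth-child', 'layout-based selectors', 'unscoped text match'],
} as const;

export const selfHealPolicy = {
  autoHeal: 'off',            // phase 1: propose only, never auto-update
  proposalArtifact: 'heal-proposals.jsonl',
  requiresApproval: true,
} as const;
```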
Step 2 — Start with “AI-assisted authoring,” not “AI autonomy”
Use Cursor to draft tests, but require:
- Human-owned final assertions
- Human-approved locator strategy
- No self-heal auto-update to main branch in the first phase
Guardrail: Self-heal runs can propose changes, but can’t auto-merge.
Step 3 — Introduce self-healing in “observe-only mode”
Let self-heal:
- Suggest alternative locators
- Produce heal logs
- Provide diffs
But it does not update baselines automatically.
Guardrail: Review heals weekly like flaky triage—same rigor.
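In observe-only mode, the heal hook records a proposal instead of touching the test. A minimal sketch, assuming your tool exposes a suggestion callback (the hook name and file layout are hypothetical):

```ts
import { appendFileSync } from 'node:fs';

// Hypothetical observe-only handler: the heal engine calls this with its
// suggestion; we record it for weekly triage and keep the test failing.
export function onHealSuggestion(suggestion: {
  testId: string;
  locatorBefore: string;
  locatorAfter: string;
  reason: string;
}): void {
  appendFileSync(
    'heal-proposals.jsonl',
    JSON.stringify({ ...suggestion, suggestedAt: new Date().toISOString() }) + '\n'
  );
  // Deliberately no baseline update and no retry: the failure stays visible.
}
```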
Step 4 — Expand to controlled auto-heal with strict boundaries
Allow auto-heal only for:
- Non-ambiguous elements (unique, anchored containers)
- Approved pages/components
- Low-risk flows
Guardrail: Any semantic/vision matching requires explicit approval.
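Those boundaries can be encoded as a simple eligibility gate the pipeline runs before accepting an auto-heal. The component allow-list, flow tags, and field names below are assumptions:

```ts
// Hypothetical auto-heal gate for Step 4: unique match, approved component,
// low-risk flow, and no semantic/vision matching without explicit approval.
const approvedComponents = new Set(['cart-summary', 'address-form']); // assumed allow-list
const lowRiskFlows = new Set(['smoke', 'navigation']);

function canAutoHeal(heal: {
  matchCount: number;
  regions: string[];
  flowTag: string;
  strategy: 'locator-fallback' | 'semantic' | 'vision';
}): boolean {
  if (heal.strategy !== 'locator-fallback') return false; // semantic/vision => manual approval
  if (heal.matchCount !== 1) return false;                // ambiguity => manual review
  if (!heal.regions.every((r) => approvedComponents.has(r))) return false;
  return lowRiskFlows.has(heal.flowTag);
}
```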

Step 5 — Add metrics that expose “quiet reroutes”
Track:
- Healed rate by suite/module
- “Healed then failed” rate
- Escaped defects correlated with healed flows
- Blocked heals (good signal!) vs accepted heals
Guardrail: A rising healed rate without a falling escape rate is a governance smell.
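All of these signals fall out of the heal records once they are joined with later CI results and defect data. A sketch of the rollup (field names follow the hypothetical records above):

```ts
// Hypothetical metric rollup over heal records joined with later CI runs
// and escaped-defect data for the same flows.
interface HealOutcome {
  suite: string;
  accepted: boolean;        // heal approved and merged vs blocked in review
  failedAfterHeal: boolean; // same test failed again within N runs of the heal
  escapedDefect: boolean;   // a production defect was traced back to this flow
}

function healMetrics(outcomes: HealOutcome[], suite: string) {
  const inSuite = outcomes.filter((o) => o.suite === suite);
  const accepted = inSuite.filter((o) => o.accepted);
  return {
    totalHeals: inSuite.length,
    blockedHeals: inSuite.length - accepted.length, // a healthy, visible number
    healedThenFailedRate:
      accepted.length === 0
        ? 0
        : accepted.filter((o) => o.failedAfterHeal).length / accepted.length,
    escapedDefectsInHealedFlows: accepted.filter((o) => o.escapedDefect).length,
  };
}
```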
Final Notes
Most teams fail here because they treat self-healing as a local test framework feature, not a governance problem.
The platform angle we push at Omniit.ai is simple and strict:
- Healed changes must be reviewed like human-made code changes (same approval workflow).
- Auto-flag suspicious heals (ambiguity, multiple matches, role drift, environment mismatch).
- Track healed rate vs escaped defects as first-class signals.
- Auto-create PRs for heal-generated updates (instead of silently accepting them).
- Require diff logs and audit trails so you can answer: “When did the test truth change, and who approved it?”





