Coding agents are starting to feel real now.
Claude Code, Codex, and similar tools made it normal to let an agent read a repo, edit files, run commands, and fix errors.
I’m curious whether GUI agents are the next step: instead of operating on code, they would operate apps directly through their interfaces.
For mobile, this seems especially hard because the agent needs to keep understanding and verifying UI state over time:
- What screen am I on?
- Is this a search box, a tab, a modal, or a result card?
- Did the last tap actually work?
- Is the page loading or stuck?
- Should I retry, go back, scroll, or stop?
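The loop above can be sketched as a verify-and-recover routine. This is only a toy illustration, not any existing agent's implementation: `observe` and `dismiss` are hypothetical callables standing in for however the agent perceives the screen (VLM, accessibility tree, or both) and dismisses an unexpected overlay.

```python
import time
from dataclasses import dataclass
from enum import Enum, auto

@dataclass
class UIState:
    screen: str                 # agent's current belief about which screen it is on
    is_loading: bool = False
    is_modal: bool = False

class Outcome(Enum):
    OK = auto()
    GO_BACK = auto()
    STOP = auto()

def verify_and_recover(expected_screen, observe, dismiss, max_checks=3, wait_s=1.0):
    """Check whether the last action landed on the expected screen.

    `observe()` and `dismiss()` are hypothetical stand-ins for the
    agent's perception and action primitives.
    """
    for _ in range(max_checks):
        state = observe()
        if state.screen == expected_screen:
            return Outcome.OK          # the last tap actually worked
        if state.is_loading:
            time.sleep(wait_s)         # page still loading: wait, then re-check
            continue
        if state.is_modal:
            dismiss()                  # unexpected modal: dismiss, then re-check
            continue
        return Outcome.GO_BACK         # wrong screen and not loading: back off
    return Outcome.STOP                # repeated failures: halt and escalate
```

The interesting design question is hidden inside `observe()`: everything else is bookkeeping.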
This feels very different from browser automation: there is no DOM to query, and mobile UI is more visual, less structured, and full of app-specific patterns.
What do you think is the right technical path here?
VLM-first, accessibility-tree-first, or a hybrid?
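To make the hybrid option concrete, here is one possible shape for it, purely as a sketch: prefer the structured accessibility tree when the app populates it, and fall back to a vision model on the raw screenshot when it doesn't. All three callables here are hypothetical stand-ins, not a real API.

```python
def observe_hybrid(dump_a11y_tree, screenshot, vlm_describe, min_nodes=3):
    """Prefer structure, fall back to pixels.

    Hypothetical stand-ins:
    - dump_a11y_tree(): list of UI node dicts, possibly empty
      (e.g. what an Android accessibility dump would give you)
    - screenshot(): raw image bytes of the current screen
    - vlm_describe(img): a vision-language model labeling UI elements
    """
    nodes = dump_a11y_tree()
    # Games, WebViews, and custom-drawn canvases often expose few or no
    # accessibility nodes, so a near-empty tree is the fallback signal.
    if len(nodes) >= min_nodes:
        return {"source": "a11y", "elements": nodes}
    return {"source": "vlm", "elements": vlm_describe(screenshot())}
```

The threshold heuristic is crude on purpose; the real question is how to tell when the tree is lying about what is on screen.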