Perspective Published 2026-06-15 · 6 min read

Agents should drive apps, not screens

Today's agents either screen-scrape or shell out to opaque CLIs. There's a third path: let the app declare what it can do and have the agent drive it through one open protocol — transparently.

Ask a simple question and the whole agent story gets clearer: when an AI agent needs to use one of your applications, how should it actually operate it? Not "what can the model reason about" — the mechanical question. How does the agent press the button, read the field, run the export? The honest answer for most of today's agents is that they either look at the screen and guess, or they route around the app entirely. Both work in a demo. Neither is something you'd want operating your software unattended. There's a third path, and it starts by letting the app participate.

Two options, both blind

The first option is computer-use vision: the agent takes a screenshot, finds the button by pixels, moves a cursor, clicks. It's general — it works on any app without integration — and that's the whole pitch. But it's brittle and slow. A layout change, a theme switch, a control that moved twelve pixels, and the agent is clicking the wrong thing. Worse, it bypasses the consent and accessibility gates the operating system already provides. The app has no idea it's being driven; nothing in the loop can say "this is an automated agent, and it's about to delete a file."

The second option is a per-app CLI or API adapter. This is cleaner — the agent calls a real function instead of hunting for pixels — but it doesn't scale and it isn't transparent. Every new app needs a new adapter, written and maintained by someone. And when the agent proposes to run an export command against a shared folder, the user approves a line of opaque text. They're trusting a string. They can't see which application is being operated, which feature, or what happens inside it. Approval becomes a rubber stamp.

Both options leave the human blind. With vision, nothing in the loop knows what the agent's clicks mean; with an adapter, the user can't see what the approved command actually does. That's the gap.

The third path: the app participates

The third path is the obvious one once you say it out loud: let the app describe itself. Instead of the agent guessing what's on screen, the application declares a surface — its screens, the elements on them, and the actions it supports — in an open, declarative spec.[1] The agent reads that spec and drives the app through a single protocol: describe the surface, read a screen, get and set values, invoke an action. No pixels, no bespoke adapter — one vocabulary that every participating app speaks.
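To make the four verbs concrete, here is a minimal in-memory sketch of a declared surface. Everything in it is an assumption made for illustration: the class names (`Surface`, `Screen`, `Element`, `Action`), the method names, and the toy "notes" app are hypothetical and are not taken from the actual spec, which may look quite different.

```python
from dataclasses import dataclass, field
from typing import Any, Callable

# Hypothetical shapes. The article names screens, elements, and actions,
# but these field names are illustrative assumptions, not the real spec.
@dataclass
class Element:
    id: str
    kind: str              # e.g. "text" or "checkbox"
    value: Any = None

@dataclass
class Action:
    id: str
    description: str
    run: Callable[[], str]

@dataclass
class Screen:
    id: str
    elements: dict[str, Element] = field(default_factory=dict)
    actions: dict[str, Action] = field(default_factory=dict)

class Surface:
    """Toy stand-in for an app's declared surface and its four verbs."""

    def __init__(self, app: str, screens: list[Screen]):
        self.app = app
        self.screens = {s.id: s for s in screens}

    def describe(self) -> dict:
        # Describe the surface: which screens, elements, and actions exist.
        return {
            "app": self.app,
            "screens": {
                sid: {"elements": sorted(s.elements), "actions": sorted(s.actions)}
                for sid, s in self.screens.items()
            },
        }

    def read_screen(self, screen_id: str) -> dict:
        # Read a screen: current value of every element on it.
        return {e.id: e.value for e in self.screens[screen_id].elements.values()}

    def get_value(self, screen_id: str, element_id: str) -> Any:
        return self.screens[screen_id].elements[element_id].value

    def set_value(self, screen_id: str, element_id: str, value: Any) -> None:
        self.screens[screen_id].elements[element_id].value = value

    def invoke(self, screen_id: str, action_id: str) -> str:
        # Invoke an action by its declared name.
        return self.screens[screen_id].actions[action_id].run()

def build_notes_surface() -> Surface:
    """A made-up 'notes' app with one export dialog."""
    filename = Element(id="filename", kind="text", value="notes.txt")
    export = Action(
        id="export",
        description="Export the current note",
        run=lambda: f"exported to {filename.value}",
    )
    dialog = Screen(
        id="export_dialog",
        elements={"filename": filename},
        actions={"export": export},
    )
    return Surface(app="notes", screens=[dialog])

surface = build_notes_surface()
surface.set_value("export_dialog", "filename", "report.txt")
result = surface.invoke("export_dialog", "export")
```

The agent side never touches coordinates or a command string: it addresses `export_dialog`, `filename`, and `export` by name, which is the property the rest of the argument depends on.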

Because the protocol rides on the Model Context Protocol, it's agent-neutral.[3] There's no assumption about which model or which client is on the other end. Any MCP-capable agent can drive any app that exposes the surface, and the app author only writes the surface — the protocol's transport, tokens, and audit come from the SDK. The app declares what it can do; the agent does it; nobody reverse-engineers anyone.

What changes

Three things change the moment the app participates.

Transparency. Because the agent operates declared features by name, the human in the loop sees exactly which application and which feature is being driven, and can watch the calls as they happen.[2] Approval stops being a rubber stamp on a string and becomes a real decision about a named action in a named app.
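As a rough sketch of what "a real decision about a named action in a named app" could look like in code, the snippet below gates a proposed call behind an approval callback and records the outcome. The names `ProposedCall`, `summarize`, and `gated_invoke` are hypothetical, and the actual consent and audit mechanism in the SDK is not described here, so treat this as an illustration of the idea only.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass(frozen=True)
class ProposedCall:
    # Hypothetical record of what the agent wants to do.
    app: str
    feature: str
    arguments: dict

def summarize(call: ProposedCall) -> str:
    """Render the call as a prompt that names the app and the feature."""
    args = ", ".join(f"{k}={v!r}" for k, v in sorted(call.arguments.items()))
    return f"{call.app}: {call.feature}({args})"

def gated_invoke(
    call: ProposedCall,
    approve: Callable[[str], bool],
    run: Callable[[ProposedCall], str],
    audit: list,
) -> Optional[str]:
    """Ask for approval, log the decision, and run only if approved."""
    prompt = summarize(call)
    allowed = approve(prompt)
    audit.append((prompt, "approved" if allowed else "denied"))
    return run(call) if allowed else None

audit_log: list = []
call = ProposedCall("notes", "export", {"target": "/shared/report.txt"})

# A stand-in policy: refuse anything that writes to the shared folder.
outcome = gated_invoke(
    call,
    approve=lambda prompt: "/shared/" not in prompt,
    run=lambda c: "exported",
    audit=audit_log,
)
```

The approver sees `notes: export(target='/shared/report.txt')` rather than an opaque shell line, and the denial is recorded next to the exact call it refers to.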

Stability. The agent reads declared semantics, not rendered pixels. When a button moves, when the theme changes, when a control is restyled, the surface's meaning is unchanged — so the automation doesn't break. You're binding to what the app means, not to where it happens to be drawn this week.
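The difference between binding to position and binding to meaning can be shown with a toy layout. The `Control` type and both lookup functions are invented for this example; they model the failure mode, not any real toolkit.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Control:
    id: str
    x: int
    y: int
    width: int
    height: int

def hit_test(layout: list, x: int, y: int) -> Optional[str]:
    """Pixel binding: return the id of whatever is drawn at (x, y)."""
    for c in layout:
        if c.x <= x < c.x + c.width and c.y <= y < c.y + c.height:
            return c.id
    return None

def find_by_id(layout: list, control_id: str) -> Optional[Control]:
    """Semantic binding: look the control up by its declared id."""
    return next((c for c in layout if c.id == control_id), None)

old_layout = [
    Control("export", x=100, y=40, width=80, height=24),
    Control("delete", x=200, y=40, width=80, height=24),
]
# A redesign swaps the two buttons; their meaning is unchanged.
new_layout = [
    Control("delete", x=100, y=40, width=80, height=24),
    Control("export", x=200, y=40, width=80, height=24),
]

recorded_click = (120, 50)  # where "export" was drawn in the old layout

before = hit_test(old_layout, *recorded_click)   # lands on "export"
after = hit_test(new_layout, *recorded_click)    # now lands on "delete"
still_export = find_by_id(new_layout, "export")  # unaffected by the move
```

The recorded click silently turns an export into a delete after the redesign, while the lookup by id still resolves to the same feature.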

It works with the apps you already have. Participation isn't a rewrite. The app implements one interface to expose its surface; the SDK handles the wire — the loopback transport, the per-instance token, registration, and the audit and consent gates.[2] A desktop app you shipped years ago can become agent-drivable without changing how it looks or how a person uses it.
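One way to picture the split between "the app implements one interface" and "the SDK handles the wire" is the sketch below. The `AppSurface` protocol, the `ToyHost` class, and the counter app are all hypothetical stand-ins; the real SDK's interface, transport, and token handling are assumptions here, reduced to a loopback check and a token comparison with no actual socket.

```python
import hmac
import secrets
from typing import Any, Protocol

class AppSurface(Protocol):
    """Hypothetical single interface an app would implement."""
    def describe(self) -> dict: ...
    def invoke(self, action: str, args: dict) -> Any: ...

class LegacyCounter:
    """Existing app code, left untouched."""
    def __init__(self) -> None:
        self.count = 0

    def increment(self, by: int = 1) -> int:
        self.count += by
        return self.count

class CounterSurface:
    """Thin wrapper exposing the legacy app through the interface."""
    def __init__(self, app: LegacyCounter) -> None:
        self.app = app

    def describe(self) -> dict:
        return {"app": "counter", "actions": ["increment"]}

    def invoke(self, action: str, args: dict) -> Any:
        if action != "increment":
            raise KeyError(action)
        return self.app.increment(**args)

class ToyHost:
    """Stand-in for the SDK side: loopback-only, per-instance token, audit."""
    LOOPBACK = "127.0.0.1"

    def __init__(self, surface: AppSurface) -> None:
        self.surface = surface
        self.token = secrets.token_urlsafe(32)  # fresh for each instance
        self.audit: list = []

    def handle(self, peer: str, token: str, action: str, args: dict) -> Any:
        if peer != self.LOOPBACK:
            raise PermissionError("only loopback peers are accepted")
        # Constant-time comparison avoids leaking the token via timing.
        if not hmac.compare_digest(token.encode(), self.token.encode()):
            raise PermissionError("invalid token")
        self.audit.append((action, args))
        return self.surface.invoke(action, args)

host = ToyHost(CounterSurface(LegacyCounter()))
value = host.handle("127.0.0.1", host.token, "increment", {"by": 2})
```

`LegacyCounter` is not modified at all; the only new app-side code is the small `CounterSurface` wrapper, which is the shape of the "not a rewrite" claim above.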

Across devices

And it doesn't stop at one machine. When apps live on a desktop, a phone, and a shared appliance, the hubs that broker them can federate into a cluster, so a single agent connected to one hub can transparently discover and drive apps across all of them — see the federation overview.
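A toy model of that discovery step might look like the following. The `Hub` class, its `federate` and `discover` methods, and the app names are assumptions made up for this sketch; how real hubs find, authenticate, and route to each other is not covered.

```python
class Hub:
    """Hypothetical broker that knows its local apps and its peer hubs."""

    def __init__(self, name: str, apps: list) -> None:
        self.name = name
        self.local_apps = set(apps)
        self.peers: list = []

    def federate(self, other: "Hub") -> None:
        # Peering is symmetric in this sketch.
        if other not in self.peers:
            self.peers.append(other)
        if self not in other.peers:
            other.peers.append(self)

    def discover(self) -> dict:
        """Map every reachable app to the hub that hosts it.

        Walks the peer graph breadth-first, visiting each hub once.
        """
        seen = {self.name}
        queue = [self]
        found: dict = {}
        while queue:
            hub = queue.pop(0)
            for app in sorted(hub.local_apps):
                found.setdefault(app, hub.name)
            for peer in hub.peers:
                if peer.name not in seen:
                    seen.add(peer.name)
                    queue.append(peer)
        return found

desktop = Hub("desktop", ["editor"])
phone = Hub("phone", ["camera"])
appliance = Hub("appliance", ["thermostat"])

desktop.federate(phone)
phone.federate(appliance)

# An agent connected only to the desktop hub still sees all three apps.
catalog = desktop.discover()
```

The agent holds one connection, and the hub it talks to is responsible for knowing where each app actually lives.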

The point

Screen-scraping treats your app as an image to be poked. Opaque adapters treat it as a black box behind a command line. The third path treats it as a participant — something that can say what it does and be driven on those terms, in the open. That's a better deal for everyone: the agent gets stable semantics, the user gets to see what's happening, and you get there by implementing one interface instead of fighting pixels forever. Read the protocol and let your app speak for itself.

References

  1. App Use — Overview. https://aiappuse.com/docs/overview.htm
  2. App Use — Architecture. https://aiappuse.com/docs/architecture.htm
  3. Model Context Protocol. https://modelcontextprotocol.io
