Open‑source VLM agent that controls computer GUIs by planning and executing mouse and keyboard actions.
ScreenAgent is an open‑source Vision Language Model (VLM) agent that interacts with real computer screens by observing screenshots and issuing mouse and keyboard actions in a planning‑execution‑reflection loop. It supports multi‑step GUI tasks and dataset collection, and achieves positioning accuracy comparable to GPT‑4V.
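The planning‑execution‑reflection loop described above can be pictured with a minimal Python sketch. This is not ScreenAgent's actual code or API. Every name here (`run_task`, `Verdict`, `Action`, and the `observe`, `plan`, `decide`, `execute`, `reflect` callbacks) is hypothetical, and the three reflection outcomes (advance, retry, replan) are an assumption about how such a loop might branch. A toy in-memory "screen" stands in for a real desktop and for the model.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List


class Verdict(Enum):
    """Hypothetical reflection outcomes (assumed, not from the project)."""
    DONE = "done"        # subtask succeeded, advance to the next one
    RETRY = "retry"      # subtask failed, attempt it again
    REPLAN = "replan"    # plan no longer matches the screen


@dataclass
class Action:
    kind: str                                   # "click", "type", ...
    args: Dict[str, object] = field(default_factory=dict)


def run_task(task, observe, plan, decide, execute, reflect, max_steps=20):
    """Drive one task and return the list of executed actions."""
    history: List[Action] = []
    subtasks = plan(task, observe())            # planning phase
    index = 0
    for _ in range(max_steps):
        if index >= len(subtasks):
            break                               # every subtask finished
        before = observe()
        for action in decide(subtasks[index], before):   # execution phase
            execute(action)
            history.append(action)
        verdict = reflect(subtasks[index], before, observe())  # reflection
        if verdict is Verdict.DONE:
            index += 1
        elif verdict is Verdict.REPLAN:
            subtasks, index = plan(task, observe()), 0
        # Verdict.RETRY: loop again with the same subtask
    return history


# Toy stand-ins for the model and the desktop (purely illustrative).
screen = {"focused": None, "text": ""}


def observe():
    return dict(screen)                         # snapshot, like a screenshot


def plan(task, shot):
    return ["focus the text box", "type hello"]


def decide(subtask, shot):
    if subtask.startswith("focus"):
        return [Action("click", {"x": 100, "y": 200})]
    return [Action("type", {"text": "hello"})]


def execute(action):
    if action.kind == "click":
        screen["focused"] = "textbox"
    elif action.kind == "type" and screen["focused"] == "textbox":
        screen["text"] += action.args["text"]


def reflect(subtask, before, after):
    # Naive check: any change on screen counts as success.
    return Verdict.DONE if after != before else Verdict.RETRY


history = run_task("write hello", observe, plan, decide, execute, reflect)
```

In a real agent, `observe` would capture a screenshot, `plan`, `decide`, and `reflect` would be model calls, and `execute` would send mouse and keyboard events. The `max_steps` bound keeps a stuck subtask from looping forever.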