UITron-Speech: Towards Automated GUI Agents Based on Speech Instructions

Abstract: Autonomous agents for Graphical User Interfaces (GUIs) are revolutionizing human-computer interaction, yet their reliance on text-based instructions imposes limitations on accessibility and convenience, particularly in hands-free scenarios. To address this issue, we propose replacing text with speech as the instruction input modality for GUI agents, and introduce UITron-Speech, which is the first end-to-end GUI agent capable of directly processing speech instructions and on-device screenshots to predict user actions. To tackle the problem of data scarcity, we synthesize high-quality speech instruction datasets using a random-speaker text-to-speech model. Additionally, we design a mixed-modality training strategy to mitigate the inherent modality imbalance in pre-trained foundation models. Furthermore, we conduct a statistical analysis of the distribution of GUI grounding prediction errors and propose a training-free two-step grounding refinement method to alleviate minor localization deviations. Extensive experiments on multiple benchmarks demonstrate that UITron-Speech achieves robust performance and superior adaptability, underscoring the feasibility and potential of speech-driven GUI agents for more accessible and intelligent human-computer interaction. Our code and datasets are available at this https URL.
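
The abstract does not spell out what the two steps of the grounding refinement are. One plausible reading, offered here only as an illustrative sketch and not as the authors' implementation, is a coarse-to-fine scheme: predict a point on the full screenshot, crop a window around that point, predict again inside the crop, and map the result back to full-screen coordinates. The names `crop_box`, `refine`, the `predict` callback, and the crop size are all hypothetical; `predict` stands in for any grounding model that takes a region and returns a point in that region's local pixel coordinates.

```python
from typing import Callable, Tuple

Point = Tuple[float, float]
Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in screen pixels


def crop_box(center: Point, size: Tuple[int, int], crop: Tuple[int, int]) -> Box:
    """Return a window of at most `crop` size centred on `center`, clamped to the screen."""
    width, height = size
    cw, ch = min(crop[0], width), min(crop[1], height)
    # Shift the window back inside the screen if the centre is near an edge.
    left = int(min(max(center[0] - cw / 2, 0), width - cw))
    top = int(min(max(center[1] - ch / 2, 0), height - ch))
    return (left, top, left + cw, top + ch)


def refine(
    predict: Callable[[Box], Point],  # hypothetical model call, returns region-local coords
    size: Tuple[int, int],
    crop: Tuple[int, int],
) -> Point:
    """Assumed two-step scheme: coarse prediction, then re-prediction inside a crop."""
    coarse = predict((0, 0, size[0], size[1]))  # step 1: full screenshot
    box = crop_box(coarse, size, crop)
    local = predict(box)                        # step 2: zoomed-in region
    # Translate the crop-local point back to full-screen coordinates.
    return (box[0] + local[0], box[1] + local[1])
```

Because the second call only changes the input region and reuses the same model, a scheme of this shape needs no additional training, which is consistent with the "training-free" description in the abstract.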

Submission history

From: Wenkang Han [view email]
[v1] Tue, 10 Jun 2025 12:16:27 UTC (2,252 KB)
[v2] Wed, 6 Aug 2025 09:16:16 UTC (2,307 KB)
[v3] Wed, 26 Nov 2025 07:48:14 UTC (2,304 KB)