Alibaba’s Tongyi Lab has unveiled MAI-UI, a family of foundation GUI agents designed for mobile interaction. The agents deliver strong performance in general GUI grounding and mobile navigation, surpassing established models such as Gemini 2.5 Pro, Seed1.8, and UI-TARS-2 on the AndroidWorld benchmark. MAI-UI addresses critical gaps in current GUI agents by natively integrating user interaction, MCP tool use, and a device-cloud collaboration architecture that prioritizes privacy.
What is MAI-UI?
MAI-UI is a suite of multimodal GUI agents built on the Qwen3 VL architecture, with model sizes ranging from 2B to 235B-A22B (a 235B-parameter mixture-of-experts model with roughly 22B active parameters). The agents take natural language instructions and UI screenshots as input and generate structured actions for Android environments. Their capabilities extend beyond basic operations such as clicking and typing to answering user queries, asking for clarification, and invoking external tools via MCP calls, enabling a seamless blend of GUI actions, direct language responses, and API operations.
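To make this unified action space concrete, here is a minimal sketch of how such structured actions might be represented. The field names and action types below are illustrative assumptions, not MAI-UI's actual output schema.

```python
# Hypothetical sketch of a unified action space covering GUI operations,
# direct language responses, and MCP tool calls. Field and action names
# are illustrative, not MAI-UI's actual schema.
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class AgentAction:
    kind: str                           # "click" | "type" | "answer" | "ask_user" | "mcp_call"
    coordinate: Optional[tuple] = None  # screen (x, y) for GUI actions
    text: Optional[str] = None          # typed text, answer, or clarification question
    tool: Optional[str] = None          # MCP tool name for "mcp_call"
    arguments: dict = field(default_factory=dict)

# Examples of the three action families the agent can emit:
tap_login  = AgentAction(kind="click", coordinate=(540, 1210))
reply_user = AgentAction(kind="answer", text="Your order ships on Friday.")
call_tool  = AgentAction(kind="mcp_call", tool="calendar.create_event",
                         arguments={"title": "Dentist", "date": "2025-07-01"})
```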
The system’s architecture unifies three key components: a self-evolving navigation data pipeline that incorporates user-interaction and MCP cases, an online Reinforcement Learning (RL) framework that scales to hundreds of parallel Android instances, and a native device-cloud collaboration system that routes execution between on-device and cloud models based on task status and privacy considerations.
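As a rough illustration of the third component, the snippet below sketches one way such routing could be expressed. The task states, privacy flag, and routing rules are assumptions made for illustration, not the actual policy described by Tongyi Lab.

```python
# Illustrative sketch of device-cloud routing, not MAI-UI's actual policy.
# The privacy flag, task states, and routing rules here are assumptions.
def route_execution(task_state: str, contains_private_data: bool) -> str:
    """Decide whether the next step runs on-device or in the cloud."""
    if contains_private_data:
        return "device"            # keep sensitive screens and inputs local
    if task_state in ("needs_heavy_reasoning", "stuck"):
        return "cloud"             # escalate hard steps to the larger cloud model
    return "device"                # default to the on-device model for routine steps
```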
GUI Grounding with Instruction Reasoning
A fundamental aspect of GUI agents is grounding: mapping natural language commands to specific on-screen controls. MAI-UI employs a grounding strategy built on multi-perspective instruction descriptions. For each UI element, the training pipeline generates multiple views, covering appearance, function, spatial location, and user intent, and uses them as reasoning evidence. This approach mitigates issues arising from flawed or ambiguous instructions. The models achieve strong accuracy on public GUI grounding benchmarks, outperforming Gemini 3 Pro and Seed1.8 on ScreenSpot Pro and significantly exceeding earlier open models on UI-Vision.
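The idea can be illustrated with a small sketch that produces several instruction views for the same UI element. The templates and field names are hypothetical, not the actual training prompts used by Tongyi Lab.

```python
# Sketch of multi-perspective instruction descriptions for one UI element.
# The view templates and element fields are illustrative placeholders.
def describe_element(element: dict) -> list[str]:
    """Produce several instruction views (appearance, function, location, intent)
    that all ground to the same target bounding box."""
    return [
        f"the {element['color']} button labeled '{element['label']}'",   # appearance
        f"the control that {element['function']}",                       # function
        f"the element in the {element['region']} of the screen",         # spatial location
        f"what you would tap to {element['user_goal']}",                 # user intent
    ]

views = describe_element({
    "label": "Checkout", "color": "orange", "function": "starts payment",
    "region": "bottom-right corner", "user_goal": "pay for the items in your cart",
})
# Each view is paired with the same target box during training, so the model
# learns to ground the element even when an instruction is ambiguous.
```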
Self-Evolving Navigation Data and MobileWorld
Navigating complex mobile interfaces requires maintaining context across multiple steps and applications. To foster robust navigation, Tongyi Lab developed a self-evolving data pipeline. This pipeline starts with seed tasks from app manuals and designed scenarios, which are then expanded through parameter perturbation and object-level substitutions. Multiple agents and human annotators execute these tasks, and a judge model refines the resulting trajectories. This continuous feedback loop ensures the training data distribution aligns with the current agent policy.
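A toy sketch of one iteration of such a loop is shown below. The task fields, perturbation rule, and agent/judge interfaces are placeholders for illustration rather than a released MAI-UI API.

```python
import random

# Toy sketch of one round of a self-evolving navigation data loop.
# The perturbation rule, task fields, and judge interface are assumptions.
def perturb(task: dict) -> dict:
    """Parameter perturbation / object-level substitution on a seed task."""
    variant = dict(task)
    variant["target_item"] = random.choice(["milk", "coffee", "batteries"])
    return variant

def evolve_once(seed_tasks, agents, judge):
    # 1) Expand seed tasks into perturbed variants.
    candidates = [perturb(task) for task in seed_tasks for _ in range(3)]
    # 2) Roll out each candidate task with several agents.
    trajectories = [agent.run(task) for task in candidates for agent in agents]
    # 3) Keep only trajectories the judge model accepts, so the training
    #    distribution stays aligned with what the current policy can do.
    return [traj for traj in trajectories if judge.accepts(traj)]
```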
MAI-UI is evaluated on MobileWorld, a benchmark featuring 201 tasks across 20 applications, encompassing pure GUI tasks, agent-user interaction tasks, and MCP-augmented tasks. MAI-UI demonstrates strong performance on this benchmark, significantly improving over existing end-to-end GUI baselines and showing competitiveness with proprietary agentic frameworks.
Online RL in Containerized Android Environments
To ensure robustness in dynamic mobile applications, MAI-UI leverages an online RL framework operating within containerized Android Virtual Devices. This setup allows the agent to learn directly from interactions. The RL framework utilizes an asynchronous on-policy method (GRPO) that supports extensive parallelism and long context sequences, enabling learning from trajectories with up to 50 steps. Rewards are generated by verifiers or judge models, with penalties for looping behaviors. The research highlights that scaling the number of parallel GUI environments and increasing the allowed environment steps significantly boosts navigation success rates.
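The reward design can be illustrated with a simple sketch that combines a verifier-based task reward with a loop penalty. The specific values and the loop-detection rule are assumptions for illustration, not the reward reported in the paper.

```python
# Illustrative reward shaping for online RL over Android trajectories.
# The bonus/penalty values and loop-detection rule are assumptions.
def trajectory_reward(steps: list[dict], verifier_passed: bool) -> float:
    reward = 1.0 if verifier_passed else 0.0   # task-level reward from a verifier/judge

    # Penalize looping: repeating the same action on the same screen.
    seen = set()
    loop_penalty = 0.0
    for step in steps:
        key = (step["screen_hash"], step["action"])
        if key in seen:
            loop_penalty += 0.1
        seen.add(key)

    return reward - min(loop_penalty, 0.5)     # cap the penalty so it never dominates
```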
On the AndroidWorld benchmark, the largest MAI-UI variant achieved 76.7% success, setting a new standard and surpassing previous leading models.