Adaptive Mobile Agent for Dynamic Interactions

Published in IEEE International Conference on Multimedia and Expo (ICME), 2025

With the rise of Multimodal Large Language Models (MLLMs), LLM-driven visual agents are transforming how software is operated, especially applications with graphical user interfaces. However, existing methods often struggle in diverse and complex mobile environments, such as rapidly changing app interfaces or non-standard UI components, which limits their adaptability and precision.

This work presents a novel LLM-based multimodal agent framework for mobile devices, designed to enhance interaction and adaptive capabilities in dynamic mobile environments. The agent autonomously navigates the device while emulating human-like behaviors, and it integrates UI parsing, text, and vision descriptions to construct a flexible action space.
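
The idea of merging parsed UI metadata with textual and visual descriptions into a single action space can be illustrated with a minimal sketch. All names below (`UIElement`, `build_action_space`, the field layout) are hypothetical and not the paper's actual API; this only shows the general pattern under those assumptions.

```python
# Hypothetical sketch: combine parsed UI metadata, text, and vision
# descriptions into candidate actions for the agent to choose from.
from dataclasses import dataclass


@dataclass
class UIElement:
    element_id: str
    bbox: tuple          # (left, top, right, bottom) in screen pixels
    text: str            # accessibility-tree or OCR text, may be empty
    vision_caption: str  # caption of the cropped element from a vision model


@dataclass
class Action:
    kind: str            # e.g. "tap", "scroll", "type"
    target: UIElement
    description: str     # natural-language summary shown to the LLM


def build_action_space(elements):
    """Turn detected UI elements into candidate actions."""
    actions = []
    for el in elements:
        desc = el.text or el.vision_caption or f"element at {el.bbox}"
        actions.append(Action(kind="tap", target=el,
                              description=f"tap {desc}"))
    return actions
```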

During the exploration phase, the functionalities of user-interface elements are documented in a customized, structured knowledge base. In the deployment phase, retrieval-augmented generation (RAG) enables efficient retrieval from and updates to this knowledge base. Experimental results on multiple benchmarks validate the framework's superior performance and practical effectiveness.
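
A hedged sketch of this two-phase pattern is shown below: exploration writes element functionality into a store, and deployment retrieves the most relevant entries RAG-style. The store layout and the token-overlap scorer are illustrative assumptions, not the paper's actual knowledge-base schema or retriever.

```python
# Illustrative two-phase knowledge base: document during exploration,
# retrieve during deployment. Schema and scoring are assumptions.
class UIKnowledgeBase:
    def __init__(self):
        self.entries = []  # list of {"element": str, "function": str}

    def document(self, element_desc, observed_function):
        """Exploration phase: record what interacting with an element did."""
        self.entries.append({"element": element_desc,
                             "function": observed_function})

    def retrieve(self, query, top_k=3):
        """Deployment phase: RAG-style lookup by simple token overlap."""
        q_tokens = set(query.lower().split())
        scored = []
        for entry in self.entries:
            text = f"{entry['element']} {entry['function']}".lower()
            score = len(q_tokens & set(text.split()))
            scored.append((score, entry))
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [entry for score, entry in scored[:top_k] if score > 0]


kb = UIKnowledgeBase()
kb.document("settings gear icon on home screen", "opens the Settings app")
print(kb.retrieve("how do I open settings?"))
```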

Recommended citation: Li, Y., Zhang, C., Yang, W., Fu, B., Cheng, P., Chen, X., Chen, L., & Wei, Y. (2025). Adaptive Mobile Agent for Dynamic Interactions. In Proceedings of IEEE International Conference on Multimedia and Expo (ICME) 2025.