Alibaba Unveils Qwen Robot Suite: First Foundation Models for Embodied AI and Real-World Robotics

18:31, 16 June

Alibaba’s Tongyi Lab has launched the Qwen Robot Suite, marking the debut of a specialized series of models engineered specifically for the physical world. This represents more than a simple upgrade to existing multimodal systems; it signifies a fundamental shift from chat-based AI that understands the world to embodied agents capable of perceiving their surroundings, planning actions, and executing them in real time.

The release features three core foundation models:

Qwen-RobotNav — navigation and movement within physical spaces;
Qwen-RobotManip — object manipulation and environmental interaction;
Qwen-RobotWorld — predicting scene dynamics and future world states.

Built on Qwen vision-language models, specifically Qwen3-VL and Qwen3.5, the three models were trained exclusively on open-source data and are already being piloted by Alibaba Cloud customers.

Bridging the Gap Between Perception and Action

Qwen models have long demonstrated a sophisticated understanding of the physical world, identifying objects, spatial relationships, and causal logic. A significant gap nevertheless remained between this high-level visual-linguistic reasoning and the low-level control commands required to operate a robot.

The Qwen Robot Suite addresses this specific challenge by building specialized "bridges" between perception and action across three critical domains: mobility, manipulation, and world prediction.

Qwen-RobotNav: Navigation and Mobility

This model consolidates five distinct navigation tasks into a single, unified framework:

following natural language instructions;
navigating to specific points or objects;
tracking moving targets;
autonomous driving;
answering questions within a physical environment (Embodied Question Answering).

The system employs guided observation encoding alongside a specialized tool interface, allowing a high-level planner like Qwen3.7 to dynamically switch operational modes and manage context.
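The mode switching described above can be pictured with a small dispatcher. This is only an illustrative sketch: the `NavMode` names, the `NavSession` class, the tool-call dictionary format, and the context cap are all assumptions made for the example, not the published Qwen-RobotNav interface, and the observation encoding itself is not modeled.

```python
from dataclasses import dataclass, field
from enum import Enum


class NavMode(Enum):
    """The five task types the article lists for Qwen-RobotNav."""
    INSTRUCTION_FOLLOWING = "instruction_following"
    GOAL_NAVIGATION = "goal_navigation"
    TARGET_TRACKING = "target_tracking"
    AUTONOMOUS_DRIVING = "autonomous_driving"
    EMBODIED_QA = "embodied_qa"


@dataclass
class NavSession:
    """Hypothetical state a planner might keep between tool calls."""
    mode: NavMode = NavMode.INSTRUCTION_FOLLOWING
    context: list = field(default_factory=list)
    max_context: int = 4  # assumed cap on retained observations

    def switch_mode(self, mode: NavMode, keep_context: bool = False) -> None:
        """Change the active task; drop old observations unless told to keep them."""
        self.mode = mode
        if not keep_context:
            self.context.clear()

    def observe(self, frame_id: str) -> None:
        """Record an observation, discarding the oldest once the cap is exceeded."""
        self.context.append(frame_id)
        if len(self.context) > self.max_context:
            self.context.pop(0)


def handle_tool_call(session: NavSession, call: dict) -> str:
    """Dispatch one planner-issued tool call (the call format is an assumption)."""
    if call["tool"] == "switch_mode":
        session.switch_mode(NavMode(call["mode"]), call.get("keep_context", False))
        return f"mode={session.mode.value}"
    if call["tool"] == "observe":
        session.observe(call["frame_id"])
        return f"context={len(session.context)}"
    raise ValueError(f"unknown tool: {call['tool']}")
```

In this sketch the planner decides when to change task and whether earlier observations remain relevant, while the session object only enforces the bookkeeping.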

Reported benchmark results include a 76.5% success rate on VLN-CE RxR, 75.6% on HM3Dv2 (object navigation using only RGB images), and 91.4 PDMS on NAVSIM for closed-loop autonomous driving. The model has also been tested successfully in the real world on the Unitree Go2 quadruped robot, using only a single low-resolution camera.

Qwen-RobotManip: Manipulation and Interaction

This stands out as the most mature and powerful component of the entire suite. It is built on the Qwen3.5-4B architecture with a flow-matching DiT action head. The model introduces a unified 80-dimensional state-action space based on camera-coordinate delta positions, which lets it learn from various robot types (single-arm, dual-arm, high-dexterity, and mobile platforms) without data conflicts.
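One way to picture a shared action space is a fixed-length vector in which each robot fills only the slots it has and marks the rest as unused. The sketch below assumes a slot layout, slot names, and a validity mask purely for illustration; only the 80-dimension size and the use of camera-coordinate delta positions come from the announcement.

```python
ACTION_DIM = 80  # dimensionality stated in the article

# Hypothetical slot layout: which slice of the vector each body part uses.
SLOTS = {
    "left_arm_delta_xyz": (0, 3),
    "right_arm_delta_xyz": (3, 6),
    "left_gripper": (6, 7),
    "right_gripper": (7, 8),
    "base_velocity": (8, 11),
}


def delta_in_camera_frame(prev_xyz, curr_xyz):
    """Delta between two end-effector positions already given in camera coordinates."""
    return [c - p for p, c in zip(prev_xyz, curr_xyz)]


def pack_action(parts):
    """Place each named part into its slot; return (vector, mask).

    The mask marks which dimensions are real for this robot, so unused
    slots can be ignored rather than treated as zero-valued commands.
    """
    vector = [0.0] * ACTION_DIM
    mask = [0] * ACTION_DIM
    for name, values in parts.items():
        start, end = SLOTS[name]
        if len(values) != end - start:
            raise ValueError(f"{name} expects {end - start} values, got {len(values)}")
        vector[start:end] = [float(v) for v in values]
        mask[start:end] = [1] * (end - start)
    return vector, mask


# A single-arm robot fills two slots; a dual-arm robot fills four.
single_vec, single_mask = pack_action({
    "left_arm_delta_xyz": delta_in_camera_frame([0.0, 0.0, 0.5], [0.0, 0.25, 0.5]),
    "left_gripper": [1.0],
})
dual_vec, dual_mask = pack_action({
    "left_arm_delta_xyz": [0.0, 0.0, 0.0],
    "right_arm_delta_xyz": [0.5, 0.0, 0.0],
    "left_gripper": [0.0],
    "right_gripper": [1.0],
})
```

Because both robots produce vectors of the same length, a single model head could in principle be trained on either, with the mask indicating which dimensions count toward the loss. Whether Qwen-RobotManip uses masking of this kind is not stated in the announcement.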

The scale and quality of its training are unprecedented: it utilizes over 38,100 hours of open-source data, including real robot logs, egocentric human videos, and synthetic data generated through a specialized human-to-robot conversion pipeline.

The benchmark results highlight its capabilities:

91.4% on LIBERO-Plus, outperforming the previous best by 7 percentage points;
1st place overall in RoboChallenge Table30 v1 with a 45% success rate, leading the third-place finisher by 20%;
strong performance across RoboTwin, RoboCasa, EBench, and other tests, particularly in out-of-distribution scenarios and cross-platform skill transfer.

The model also exhibits emergent properties, including resilience to external disturbances, autonomous error recovery, the ability to follow open-ended instructions, and the capacity to transfer skills between different robot types without additional training.

Qwen-RobotWorld: World Modeling and Future Prediction

This component is a language-conditioned video world model that generates physically plausible future states of a scene based on current observations and text instructions. Trained on 8.6 million video-text pairs (totaling over 200 million frames), it demonstrates a deep understanding of physics, including laws of motion, mass conservation, and fluid dynamics.

It currently holds the top spots on EWMBench, DreamGen Bench, WorldModelBench (among open models), and PBBench. Its most valuable features are its precise linguistic control and its ability to generate consistent scenes from multiple viewpoints.

Qwen-RobotClaw: The Integration Layer

An essential addition to the suite is Qwen-RobotClaw, an internal toolkit designed for robotic agents. It allows standard Qwen vision-language agents to call upon Robot Suite models as tools for the physical world, managing context and memory during long-term tasks.
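The tool-calling pattern described here can be sketched as a small registry with a bounded memory of recent steps. Everything in this example is hypothetical: the `RobotToolkit` class, its method names, and the stand-in tool functions are assumptions for illustration and do not reflect the actual Qwen-RobotClaw API, which has not been documented publicly in this announcement.

```python
from collections import deque
from typing import Callable, Deque, Dict, Tuple


class RobotToolkit:
    """Hypothetical registry exposing robot models to an agent as named tools."""

    def __init__(self, memory_size: int = 3) -> None:
        self._tools: Dict[str, Callable[[str], str]] = {}
        # Bounded log of (tool, request, result): long tasks keep only recent steps.
        self.memory: Deque[Tuple[str, str, str]] = deque(maxlen=memory_size)

    def register(self, name: str, fn: Callable[[str], str]) -> None:
        """Make a callable available to the agent under the given tool name."""
        self._tools[name] = fn

    def call(self, name: str, request: str) -> str:
        """Run a registered tool and remember the step."""
        if name not in self._tools:
            raise KeyError(f"no tool registered as {name!r}")
        result = self._tools[name](request)
        self.memory.append((name, request, result))
        return result

    def recent_context(self) -> str:
        """Summarize remembered steps as text an agent could read back."""
        return "\n".join(f"{t}({q}) -> {r}" for t, q, r in self.memory)


# Stand-in functions; real tools would invoke the respective models.
def fake_nav(request: str) -> str:
    return f"arrived: {request}"


def fake_manip(request: str) -> str:
    return f"done: {request}"
```

A vision-language agent would call `kit.call("navigate", "kitchen")` much as it calls any other tool, then read `recent_context()` to recall what has already been done in a multi-step task.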

This layer is what transforms the three foundation models into a cohesive system for agents operating in the real world.

These models are currently in pilot use with selected Alibaba Cloud corporate clients within the robotics sector.

Technical reports and GitHub repositories (including Qwen-RobotNav and Qwen-RobotManip) have been published. While the models are available through the Qwen ecosystem and Hugging Face, full weights and detailed integration instructions are expected to follow shortly.
