What 30 standardized tasks look like under the leaderboard lens.

Tier 1: Hello World
DEFINITION Top-3 = 100%
All three top models clear the task
Representative tasks:
· stack_bowls
· stack_color_blocks
SIGNAL Foundational manipulation has a working baseline across the leading models.

Tier 2: Easy
DEFINITION Top-1 ≥ 90%, Top-3 ≥ 70%
A leader exists; followers within reach
Representative tasks:
· place_shoes_on_rack
· search_green_boxes
SIGNAL Visual discrimination is mostly solved; pickup precision becomes the differentiator.

Tier 3: Specialty Win
DEFINITION Top-3 = 0% – 10%
Sharp drop after
Representative task:
· press_three_buttons (wall-oss-v0.1 only)
SIGNAL Architectural specialization shows. Different VLAs encode different priors; diversity is real.

Source: RoboChallenge per-task aggregate, snapshot 2026-01-23
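The Tier 1 and Tier 2 thresholds can be read as a simple rule over per-task success rates. The sketch below is one possible reading, not the leaderboard's own method: it assumes "Top-1" means the best model's success rate on a task and "Top-3" means the rate of the third-best model, so that all three leaders meet the threshold. The function name `classify_tier` and the example numbers are hypothetical. Tier 3 is left out because its printed definition does not pin down a single rule.

```python
from typing import Optional, Sequence


def classify_tier(success_rates: Sequence[float]) -> Optional[str]:
    """Map one task's per-model success rates (in percent) to a tier label.

    Assumption: "Top-1" is the best rate and "Top-3" is the third-best rate.
    Returns None when the task fits neither Tier 1 nor Tier 2.
    """
    top3 = sorted(success_rates, reverse=True)[:3]
    if len(top3) < 3:
        raise ValueError("need success rates from at least three models")

    best, third = top3[0], top3[2]

    # Tier 1: all three leading models clear the task.
    if third >= 100.0:
        return "Tier 1 (Hello World)"

    # Tier 2: a clear leader, with the followers within reach.
    if best >= 90.0 and third >= 70.0:
        return "Tier 2 (Easy)"

    # Tier 3 and unclassified tasks are not distinguished in this sketch.
    return None


# Hypothetical rates for illustration only; these are not leaderboard data.
print(classify_tier([100.0, 100.0, 100.0, 40.0]))
print(classify_tier([95.0, 80.0, 72.0, 10.0]))
```

Sorting before slicing keeps the rule independent of the order in which models are listed for a task.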
machine validation a necessity in the industry.
2. Stacking bowls and moving objects into boxes have become "Hello World" level tasks.
3. Organizing paper cups and making sandwiches remain challenging problems.
4. The top model's success rate is about 60%, indicating room for improvement.
5. VLA models are still at a very early stage, operating at a near-basic level of human intelligence.

Core Findings and Highlights