


RoboChallenge Annual Report Large Scale Real-Robot Evaluation for Embodied AI 2026 May 09


TAKASU Masakazu

May 08, 2026


Transcript

  1. RoboChallenge Annual Report Large Scale Real-Robot Evaluation for Embodied AI

    Emily Chen, GM of Dexmal, Committee of RoboChallenge
  2. The Current Status of VLA Evaluation

    The vast majority of VLA evaluation is conducted in simulation, while the availability of real-world testing is very limited.
  3. Data Driving Growth Hacking

    - 41,969 accumulated rollouts
    - 181 rollouts on the peak day
    - 39.2% user conversion rate
    - 17K Table30 dataset downloads
  4. Where Models Win, Where Models Fail

    Per-capability success rate. Each row shows the leader of that dimension (chart reference line at 50%).
    - Simple-pick: 85% (Spirit-v1.5)
    - Manipulation: 68.3% (Spirit-v1.5)
    - Classification: 52% (pi0.5)
    - Temporal / Sequence: 40% (wall-oss-v0.1)
    - Softbody: 13.3% (hardest dimension)
    Hard surfaces are easy. Soft, deformable, and long-horizon are still open research.
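The 50% line on this slide can be applied mechanically to the per-capability figures. The sketch below is only an illustration: the numbers are copied from the slide, while the dictionary name and the strict "below 50%" comparison are assumptions made for the example.

```python
# Leader success rate per capability, in percent, as listed on the slide.
leader_success = {
    "Simple-pick": 85.0,
    "Manipulation": 68.3,
    "Classification": 52.0,
    "Temporal / Sequence": 40.0,
    "Softbody": 13.3,
}

# Assumption: a capability counts as "still open" when even its leader
# falls strictly below the 50% reference line.
THRESHOLD = 50.0

below_line = [name for name, rate in leader_success.items() if rate < THRESHOLD]
print(below_line)
```

Under that assumption the two capabilities that fall below the line are the sequence and softbody dimensions, which matches the slide's remark that soft, deformable, and long-horizon tasks remain open.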
  5. Task Difficulty Spectrum: Three Tiers

    Hello-world, easy, specialty win: what 30 standardized tasks look like under the leaderboard lens.

    Tier 1: Hello World
    - Definition: Top-3 = 100%. All three top models clear the task.
    - Representative tasks: stack_bowls, stack_color_blocks
    - Signal: Foundational manipulation has a working baseline across the leading models.

    Tier 2: Easy
    - Definition: Top1 ≥ 90%, Top3 ≥ 70%. A leader exists; followers within reach.
    - Representative tasks: place_shoes_on_rack, search_green_boxes
    - Signal: Visual discrimination is mostly solved; pickup precision becomes the differentiator.

    Tier 3: Specialty Win
    - Definition: Top 3 = 0%–10%. Sharp drop after the leader.
    - Representative task: press_three_buttons (wall-oss-v0.1 only)
    - Signal: Architectural specialization shows. Different VLAs encode different priors; diversity is real.

    Source: RoboChallenge per-task aggregate, snapshot 2026-01-23
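The tier definitions can be read as a small decision rule over a task's per-model success rates. The sketch below is one possible reading, not the RoboChallenge implementation: it assumes "Top1" and "Top3" mean the success rates (in percent) of the best and third-best models on the task, and the function name and the "Unclassified" fallback are hypothetical.

```python
from typing import Sequence


def classify_task(success_rates: Sequence[float]) -> str:
    """Assign a tier from per-model success rates given in percent.

    Assumption: Top1 / Top3 are the best and third-best model's rates.
    """
    ranked = sorted(success_rates, reverse=True)
    if len(ranked) < 3:
        raise ValueError("need success rates for at least three models")
    top1, top3 = ranked[0], ranked[2]

    if top3 >= 100:
        # Even the third-best model clears the task every time.
        return "Tier 1: Hello World"
    if top1 >= 90 and top3 >= 70:
        # A leader exists and the followers are within reach.
        return "Tier 2: Easy"
    if top3 <= 10 and top1 > top3:
        # One model stands out while the third-best is at 0-10%.
        return "Tier 3: Specialty Win"
    # Hypothetical fallback: the slide does not name the remaining cases.
    return "Unclassified"


# Made-up rates, purely for illustration.
print(classify_task([100, 100, 100, 40]))
print(classify_task([95, 80, 70]))
print(classify_task([60, 5, 0]))
```

The three tiers on the slide do not cover every combination of rates, so the fallback label is kept explicit instead of forcing every task into a tier.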
  6. Core Findings and Highlights

    1. Demand for testing has grown exponentially, making real-robot validation a necessity in the industry.
    2. Stacking bowls and moving objects into boxes have become "Hello World"-level tasks.
    3. Organizing paper cups and making sandwiches remain challenging problems.
    4. The top model's success rate is about 60%, indicating room for improvement.
    5. VLA models are still at a very early stage, operating at a near-basic level of human intelligence.
  7. Roadmap 2026: Conferences & Platform Evolution

    From a 90-day prototype to a permanent venue for real-robot evaluation.

    Timeline
    - 2025-10: Platform launch
    - 2025-11-20: Committee founded
    - 2026-01: Leaderboard milestones
    - 2026-04-15 ↔ 05-15: CVPR Track
    - 2026-05-08 ↔ 05-25: ICRA Track
    - 2026 H2 →: Real 100 · Sim-vs-Real · Zero-shot

    CVPR 2026 · DENVER: Table30 v2, 18 New Bimanual Tasks
    - Open submissions: 2026-04-15 → 2026-05-15
    - 120+ teams interested, 96 registered, 400+ participants
    - 68 universities / 28 enterprises
    - Workshop spotlight: bimanual coordination, scene generalization, long-horizon planning under real-world distribution shift.

    ICRA 2026 · VIENNA: Real Supermarket, AGIBOT Track
    - Open submissions: 2026-05-08 → 2026-05-25
    - 50+ teams
    - Real supermarket scene
    - Closed-loop embodied evaluation
    - Beyond tabletop: shopping aisle navigation, cluttered shelf retrieval, multi-stage planning under live human bystanders.
  8. The Competition is On Fire

    Download the Table 30 v2 dataset: https://huggingface.co/datasets/RoboChallenge/Table30v2
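For readers who prefer to fetch the dataset programmatically, the sketch below uses `snapshot_download` from the `huggingface_hub` package. The repository ID is taken from the URL on the slide. The function name and the local directory are hypothetical, and nothing here assumes a particular file layout inside the repository.

```python
# Repository ID taken from the dataset URL on the slide.
REPO_ID = "RoboChallenge/Table30v2"


def download_table30v2(local_dir: str = "table30v2") -> str:
    """Download the full dataset repository and return the local path.

    Assumptions: `huggingface_hub` is installed and the repository is
    publicly readable. `local_dir` is a hypothetical default.
    """
    # Imported lazily so the module loads even without the package.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id=REPO_ID,
        repo_type="dataset",  # required for dataset (non-model) repos
        local_dir=local_dir,
    )


# Usage (network access required):
#     path = download_table30v2()
#     print(path)
```

A full snapshot may be large. `snapshot_download` also accepts an `allow_patterns` argument to restrict the download to matching files, once the repository's file layout is known.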
  9. Real Scenarios – Supermarket

    - Targeting real retail supermarket scenarios
    - Focusing on the deployable, real-world capabilities of embodied intelligence