EO-1

An Open Unified Embodied Foundation Model for General Robot Control

The human ability to seamlessly perform multimodal reasoning and physical interaction in the open world is a core goal for general-purpose embodied intelligent systems. EO Robotics aims to equip autonomous robots with human-like, seamless multimodal reasoning and physical acting capabilities through the EO-Series of unified embodied foundation models.

Today we introduce EO-1, an open-source unified embodied foundation model with 3B parameters, trained on the carefully curated interleaved embodied dataset EO-Data1.5M. EO-1 adopts a single unified decoder-only transformer that integrates discrete auto-regressive decoding with continuous flow-matching denoising for multimodal embodied reasoning and robot control, enabling seamless perception, planning, reasoning, and acting in a single model through interleaved vision-text-action pretraining.
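To make the hybrid decoding concrete, below is a minimal sketch of how a shared decoder-only backbone's hidden states could feed both a discrete language head and a continuous flow-matching action head. This is an illustrative assumption about the mechanism, not the released EO-1 implementation: the class UnifiedHeadsSketch, its methods, and all dimensions are hypothetical.

```python
# Minimal, illustrative sketch of the unified decoding idea (hypothetical
# names, not the released EO-1 code). One shared decoder-only backbone yields
# hidden states for the interleaved sequence; text tokens are read out
# auto-regressively from a language head, while an action chunk is produced by
# Euler-integrating a learned flow-matching velocity field conditioned on the
# same hidden state.
import torch
import torch.nn as nn

class UnifiedHeadsSketch(nn.Module):
    def __init__(self, hidden_dim=1024, vocab_size=32_000, action_dim=14, horizon=16):
        super().__init__()
        self.text_head = nn.Linear(hidden_dim, vocab_size)           # discrete AR branch
        self.flow_head = nn.Sequential(                               # continuous branch: v(a, t | h)
            nn.Linear(hidden_dim + action_dim + 1, hidden_dim), nn.GELU(),
            nn.Linear(hidden_dim, action_dim))
        self.action_dim, self.horizon = action_dim, horizon

    def next_token_logits(self, hidden_last):
        # Auto-regressive text/reasoning decoding from the last hidden state.
        return self.text_head(hidden_last)

    @torch.no_grad()
    def sample_action_chunk(self, hidden_last, steps=10):
        # Flow-matching denoising: start from Gaussian noise and integrate the
        # predicted velocity field toward a clean action chunk.
        b = hidden_last.size(0)
        a = torch.randn(b, self.horizon, self.action_dim)
        ctx = hidden_last.unsqueeze(1).expand(-1, self.horizon, -1)
        for i in range(steps):
            t = torch.full((b, self.horizon, 1), i / steps)
            v = self.flow_head(torch.cat([ctx, a, t], dim=-1))
            a = a + v / steps                                         # one Euler step along the flow
        return a

# Example usage: hidden states would come from the shared decoder-only transformer.
heads = UnifiedHeadsSketch()
h_last = torch.randn(2, 1024)                 # last-position hidden states for a batch of 2
logits = heads.next_token_logits(h_last)      # (2, 32000)
actions = heads.sample_action_chunk(h_last)   # (2, 16, 14)
```

In this sketch, text tokens come from ordinary next-token prediction while the action chunk is denoised from Gaussian noise, both conditioned on the same hidden state so that reasoning and control share one representation.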

EO-1 surpasses existing open-source models across multiple embodied reasoning and robot control benchmarks, including ERQA, LIBERO, SimplerEnv, and our self-constructed EO-Bench. Extensive real-robot evaluations further demonstrate its substantially stronger reasoning capability and dexterous control in open-world generalization.

(Figure: benchmark results on ERQA, EO-Bench, SimplerEnv@Google VM, and LIBERO.) EO-1 surpasses existing open-source models across embodied reasoning and robot control benchmarks.

(Figure: real-robot demonstrations of Franka pick-and-place, AgiBot long-horizon dexterity, WidowX out-of-box manipulation, and reasoning control.) EO-1 performs a wide range of real-world manipulation tasks across various robotic platforms.

Specializing in Long-horizon Dexterity

We investigate EO-1's ability to specialize in long-horizon dexterous tasks that require successful multi-stage execution. Four tasks demanding intricate multi-step decisions and fine manipulation are selected: 1) Make Breakfast Sandwich, 2) Roast Beef Steak, 3) Fold Household Clothes, and 4) Sort Grocery Items. EO-1 shows stable and strong long-horizon dexterity across these tasks, which require both multimodal understanding and fine manipulation, demonstrating its capability to handle complex real-world environments.

Emerging Open-world Embodied Generalization

The key challenge for embodied foundation models is generalizing to real-world scenarios where natural language instructions must be grounded into precise, executable actions. To evaluate this capability, we perform a generalization assessment with varied task instructions, changed object locations, dynamic lighting conditions, and unseen backgrounds. We observe that EO-1 demonstrates stable instruction following and open-world generalization.

Enhanced Generalization with Unified Reasoning

To evaluate whether a single interleaved vision-text-action policy can seamlessly integrate high-level reasoning with low-level control in real environments, we design two reasoning-control tasks: Visual Rearrangement and Tic-Tac-Toe. These tasks require joint perception, spatial reasoning, multi-step planning, and bimanual manipulation under real-world dynamics.
EO-1 seamlessly integrates high-level embodied reasoning with low-level robot control, enabling smooth, correct execution on reasoning-control tasks that require context-aware reasoning to guide acting.

Accessible Multimodal Training Data

EO-1 is trained on a diverse range of datasets across multiple modalities, including text, image, video, and robot control data, to perform embodied reasoning and dexterous control through a unified multimodal interface. The pre-training data corpus is structured into three main categories: web multimodal data, robot control data, and interleaved embodied data.

The interleaved embodied data, EO-Data1.5M, is a self-curated, large-scale, high-quality multimodal embodied reasoning dataset featuring interleaved embodied reasoning and robot control, built through a scalable data construction pipeline. It consists of 1) physical common sense data for understanding physical environments, 2) task reasoning and spatial understanding QA data focusing on task planning and the spatial relationships involved in complex manipulation tasks, and 3) interleaved manipulation data connecting temporal/spatial reasoning with robot control for learning multimodal causal relationships in embodied interactions.
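As a rough illustration of what an interleaved vision-text-action sample might look like at training time, the sketch below defines a hypothetical schema. The class names, fields, and example content are assumptions made for exposition, not EO-Data1.5M's actual format.

```python
# Hypothetical sketch of one interleaved vision-text-action training sample
# (illustrative schema only, not the dataset's actual layout).
from dataclasses import dataclass, field
from typing import List, Literal, Optional
import numpy as np

@dataclass
class Segment:
    kind: Literal["image", "text", "action"]
    text: Optional[str] = None              # instruction, reasoning step, or QA answer
    image: Optional[np.ndarray] = None      # RGB frame, e.g. (H, W, 3) uint8
    actions: Optional[np.ndarray] = None    # action chunk, e.g. (horizon, action_dim)

@dataclass
class InterleavedSample:
    task: str
    segments: List[Segment] = field(default_factory=list)

# An interleaved trajectory alternates observations, reasoning text, and action
# chunks, so the model can learn the causal links between perception, planning,
# and control within a single sequence.
sample = InterleavedSample(
    task="Make Breakfast Sandwich",
    segments=[
        Segment(kind="image", image=np.zeros((224, 224, 3), dtype=np.uint8)),
        Segment(kind="text", text="The bread is on the left plate; pick it up first."),
        Segment(kind="action", actions=np.zeros((16, 14), dtype=np.float32)),
        Segment(kind="image", image=np.zeros((224, 224, 3), dtype=np.uint8)),
        Segment(kind="text", text="Bread placed. Next, move the fried egg onto it."),
        Segment(kind="action", actions=np.zeros((16, 14), dtype=np.float32)),
    ],
)
```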

A Foundation for Future Work

In summary, we present EO-1 and EO-Data1.5M, a fully open training recipe intended to help the research community develop advanced embodied foundation models. EO-Data1.5M is collected through a scalable pipeline for curating interleaved vision-text-action data to improve EO-1's open-world generalization. EO-1 is released with full openness, including model weights, training code, and all components of the interleaved embodied dataset. We hope EO-1 will serve as a solid foundation for developing general-purpose autonomous robots with powerful, human-like reasoning and acting abilities.

In the future, we plan to enhance EO-1's reasoning and action abilities to handle complex scenarios involving navigation, obstacle avoidance, failure detection and analysis, human intent recognition, and human-robot interaction and cooperation, leading to robots that are easier to use in everyday life.