HumanEgo: Zero-Shot Robot Learning from Minutes of Human Egocentric Videos

Robot-Data-Free · Hardware-Agnostic · Data-Efficient · Zero-Shot Transferable

The world is its own best model… It is better to use the world as its own simulator and let the body directly interact with the physical physics.

Rodney Brooks

Real-World Video Clips

Water Flowers

Downstack Cups

Adjust Table

Serve Bread

Charge Devices

Unscrew Cap

Open Door

Open Cabinet

Grab Tissue

Downstack Cups: Cross-Embodiment Test

Water Flowers: Generalization Test

Serve Bread: Generalization Test

HumanEgo — Full Video

Insights

  1. Human video is not merely a cheap substitute for robot data; it is a superior and more efficient data source for policy learning.

  2. Explicit spatial representation, not visual fidelity, is the key to bridging the embodiment gap. Neither hand nor object alone defines a skill—what matters is their interaction.

  3. Beyond action labels, human videos contain far more information. Richer supervision signals provide complementary gains.

Overview

We introduce HumanEgo, a robot-data-free, hardware-agnostic, and data-efficient pipeline that learns robot manipulation policies from minutes of raw human egocentric videos—powered by a flow matching policy with dense auxiliary objectives.

Abstract

Human egocentric video captures rich manipulation demonstrations without any robot hardware, yet transferring these skills to robots remains challenging due to the embodiment gap between human and robot in both visual appearance and kinematics. We present HumanEgo, a framework that bridges the embodiment gap by lifting each human demonstration to an entity-level representation of hand–object interaction, and training a flow matching policy with dense auxiliary objectives that amplify supervision from every trajectory. HumanEgo is robot-data-free, hardware-agnostic, data-efficient, and zero-shot human-to-robot transferable. With only 30 minutes of human videos per task, HumanEgo achieves 92.5% average success across four real-world tasks (75% with just 15 minutes), outperforms matched-time robot teleoperation by 41%, and robustly transfers zero-shot across novel robots, cameras, and environments.

Architecture

System overview. Arm inpainting and visual keypoints bridge the visual gap; Interaction-Centric Tokens (ICT) encode spatial relationships among all entities; a flow matching policy with dense auxiliary objectives learns bimanual robot actions from minutes-scale human data.
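
For readers unfamiliar with flow matching, the following is a minimal sketch of a generic conditional flow matching objective with a straight-line path from noise to the action chunk, followed by Euler sampling. It is not taken from the HumanEgo implementation: `velocity_fn`, the array shapes, and the number of integration steps are all assumptions made for illustration.

```python
import numpy as np


def flow_matching_loss(velocity_fn, actions, obs, rng):
    """Generic conditional flow matching loss on a straight noise-to-data path.

    velocity_fn(x_t, t, obs) is a hypothetical model returning an array shaped like x_t.
    actions: (B, D) flattened action chunks; obs: (B, C) conditioning features.
    """
    noise = rng.standard_normal(actions.shape)   # x_0 ~ N(0, I)
    t = rng.uniform(size=(actions.shape[0], 1))  # one time in [0, 1) per sample
    x_t = (1.0 - t) * noise + t * actions        # interpolate from noise to action
    target = actions - noise                     # velocity of the straight path
    pred = velocity_fn(x_t, t, obs)
    return float(np.mean((pred - target) ** 2))


def sample_actions(velocity_fn, obs, noise, steps=10):
    """Integrate dx/dt = velocity_fn(x, t, obs) from t=0 to t=1 with Euler steps."""
    x = np.array(noise, dtype=float)
    dt = 1.0 / steps
    for i in range(steps):
        t = np.full((x.shape[0], 1), i * dt)
        x = x + dt * velocity_fn(x, t, obs)
    return x
```

At training time the loss is minimized over the parameters of `velocity_fn`; at deployment, `sample_actions` turns a noise sample into an action chunk conditioned on the current observation.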

Data Collection and Data Preprocessing

Data Collection. Data collection by anyone, anytime, anywhere; policies deployed to any lab, any camera, any robot — all from only 30 minutes of data. A human demonstrator wears Aria glasses and performs each task in any convenient environment — regardless of table height, lighting, or background, and without specialized workspace or calibration.

Data collection setup.

Data Preprocessing. Each demonstration takes only seconds. Aria glasses are particularly well suited for learning from human video: their Machine Perception Services (MPS) provide high-quality 6-DoF SLAM tracking, calibrated 3D hand pose estimation, and synchronized egocentric RGB streams—all from a single lightweight wearable device.

Hand-to-gripper mapping.

Results: HumanEgo Bridges the Embodiment Gap Efficiently

Overall Real-World Evaluation. Real-world success rate (%) for each method across all four tasks. HumanEgo with 30 min of data achieves the highest success rate on every task, demonstrating consistent improvements over both human-video baselines and robot teleoperation methods.

Results: The Efficiency of Human Demonstrations

Human vs. robot data. Human egocentric data exhibits higher SNR, smoother motion, less idle time (top), and greater spatial and trajectory diversity (bottom).
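
The exact metric definitions behind this comparison are not given on this page. As a rough illustration only, two of the properties named above (idle time and motion smoothness) could be estimated from an end-effector or wrist position track as follows. The function names, the speed threshold, and the use of mean squared jerk as a smoothness proxy are all assumptions, not the authors' definitions.

```python
import numpy as np


def idle_fraction(positions, dt, speed_threshold):
    """Fraction of steps whose speed is below `speed_threshold` (hypothetical metric)."""
    positions = np.asarray(positions, dtype=float)
    speeds = np.linalg.norm(np.diff(positions, axis=0), axis=1) / dt
    return float(np.mean(speeds < speed_threshold))


def mean_squared_jerk(positions, dt):
    """Mean squared third finite difference of position; lower values mean smoother motion."""
    positions = np.asarray(positions, dtype=float)
    jerk = np.diff(positions, n=3, axis=0) / dt**3
    return float(np.mean(np.sum(jerk**2, axis=1)))
```

Both functions take a `(T, 3)` array of positions sampled every `dt` seconds, so the same code can be applied to a human wrist track and a teleoperated gripper track for a like-for-like comparison.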

Data efficiency. Success rate (%) vs. data collection time. HumanEgo trained on 8 min of human data surpasses ACT trained on 30 min of robot data.

Results: One Policy, Many Conditions

Cross-condition real-world evaluation: cross-embodiment / environment / setup.

Zero-Shot Cross-Condition Generalization. HumanEgo maintains robust success across different conditions without retraining.

Results: What Drives Performance of HumanEgo?

Representation study. Success rate (%) for five input configurations. Visual-only methods plateau at 32.5% regardless of strategy; adding spatial tokens yields +52.5 pp.

Auxiliary training study. Success rate (%) at 15 min of data for each auxiliary objective individually. Object motion contributes the most (+17.5 pp); all three combine for +25 pp.

FAQ

Why use human data?

Human egocentric video captures rich manipulation demonstrations without any robot hardware: anyone can collect it, anywhere, anytime, with no specialized workspace, no teleoperation rig, and no calibration.

More fundamentally, we see the corpus of human–world interaction as one of the richest yet most under-explored data sources in existence. If the long-term goal is robots that operate effectively in the human world — helping people in homes, kitchens, labs, and workshops — then the most direct and natural source of supervision is people themselves interacting with that exact world. Every minute of egocentric video encodes how a body, a brain, and the physical environment jointly solve a manipulation task. From this lens, human video isn't merely a cheap substitute for robot data; it is a superior and more efficient data source for policy learning.

Why efficient robot learning from human egocentric videos?

The internet hosts an enormous amount of human video, and people often assume "just train on YouTube" is a viable shortcut. But if you actually look at what's inside these datasets, you find all kinds of issues: most clips have no accurate action labels, many suffer from uncompensated head motion (so the camera shakes everywhere), and most show random everyday activities with no specific task in mind. Human data is rich in quantity but often very poor in quality.

We find it useful to think about an egocentric data pyramid, analogous to the well-known robotics data pyramid:

  • Bottom (largest, lowest quality) — passive videos like YouTube. Pixels-only, unlabeled, noisy. The human in frame isn't collecting data for us. The biggest layer, but also the hardest to use directly.
  • Middle — egocentric demonstrations. Someone deliberately wears a camera and performs a task, with hand poses tracked. Cleaner, has action labels, the human is actively demonstrating. But still not good enough for training a deployable policy.
  • Top (smallest, highest quality) — teleop-grade human data: fully structured, interactive, with accurate action labels, where the human interacts with the scene in a way a robot could reproduce. This is what we want — but it is very rare.

So the central question becomes: how do we squeeze every bit of learning signal out of the small amount of teleop-grade human data we can actually collect? That is exactly what HumanEgo is built for — making the most of minutes, not hours or days, of high-quality human egocentric data.

Why use Aria glasses?

Aria glasses are, today, the most mature and capable platform for egocentric data collection. Two things set them apart from anything else in the category: genuinely lightweight, production-grade hardware that a demonstrator can wear naturally for extended sessions in any environment; and Meta's MPS pipeline, which delivers calibrated SLAM, hand-pose estimation, and synchronized multi-stream egocentric RGB out of the box — turnkey, no per-session calibration.

The precision of these signals is what really matters: our experiments show that the accuracy of the upstream SLAM and hand-tracking signal directly bounds downstream policy performance. Noisy poses propagate into the action loss and the learned representation; the clean, drift-free trajectories from Aria are exactly what enable HumanEgo to converge on minutes of data. No other consumer egocentric device today delivers this level of out-of-the-box accuracy.

Why are Interaction-Centric Tokens (ICTs) so powerful — and why doesn't visual fidelity matter as much?

ICTs encode each entity (hand and object) by its 6-DoF pose relative to other task entities, not relative to the camera or a fixed world frame. This makes the representation invariant to embodiment, viewpoint, and environment: the same ICTs describe the same skill whether the demonstrator is a human or a robot, whether the camera is a RealSense or a ZED, and whether the table is tall or short. The policy learns once and transfers across these conditions.
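
The viewpoint invariance of an entity-relative pose can be checked numerically. The sketch below is an illustration under assumed conventions, not the HumanEgo tokenizer: poses are 4x4 homogeneous transforms, the helper names are hypothetical, and the numeric values are made up. It expresses the object pose in the hand frame from two different camera frames and confirms the result is identical.

```python
import numpy as np


def make_pose(rotation, translation):
    """Build a 4x4 homogeneous transform from a 3x3 rotation and a 3-vector."""
    pose = np.eye(4)
    pose[:3, :3] = rotation
    pose[:3, 3] = translation
    return pose


def rot_z(theta):
    """Rotation by `theta` radians about the z axis."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])


def relative_pose(ref_T_a, ref_T_b):
    """Pose of entity b in entity a's frame, given both poses in one shared frame."""
    return np.linalg.inv(ref_T_a) @ ref_T_b


# Hand and object poses as seen from one camera (values are made up).
cam1_T_hand = make_pose(rot_z(0.3), [0.10, 0.00, 0.50])
cam1_T_obj = make_pose(rot_z(-0.2), [0.25, 0.05, 0.55])

# The same scene seen from a second camera at an arbitrary rigid offset.
cam2_T_cam1 = make_pose(rot_z(1.1), [-0.40, 0.20, 0.05])
cam2_T_hand = cam2_T_cam1 @ cam1_T_hand
cam2_T_obj = cam2_T_cam1 @ cam1_T_obj

# The hand-relative object pose does not depend on which camera observed it.
hand_T_obj_view1 = relative_pose(cam1_T_hand, cam1_T_obj)
hand_T_obj_view2 = relative_pose(cam2_T_hand, cam2_T_obj)
views_agree = bool(np.allclose(hand_T_obj_view1, hand_T_obj_view2))
```

The camera transform cancels algebraically, since `inv(G @ A) @ (G @ B) = inv(A) @ B` for any rigid transform `G`, which is why a change of camera or mount leaves the relative pose untouched.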

As for why visual fidelity matters less than people expect: vision is obviously central to how humans experience the world — but for most manipulation tasks, it isn't strictly required. Imagine glancing at a table, closing your eyes, and then putting a flower into a vase. You can still do it. Once a brief look gives you a coarse spatial map, the rest of the task runs on spatial memory, proprioception, touch, and the geometric interaction between hand and object — not on continued visual confirmation. Manipulation is fundamentally a spatial-and-interaction problem; vision is just one of several inputs that surfaces that structure.

Inspired by this, we designed ICT to make spatial observation and hand–object interaction the first-class citizens of the policy. The result is an elegantly compact representation that is unified (every entity, hand or object, lives in the same token format), easy to implement (off-the-shelf pose estimators are enough), variable-length (works whether the scene has one object or many), and embodiment-invariant (the same token whether the body is a human or a robot).

That mix of simple-but-right choices is precisely why HumanEgo generalizes so widely. Empirically, adding ICTs to raw human RGB raises Water Flowers success from 7.5% to 85%, a +77.5 pp gap that no amount of visual preprocessing can close. We believe ICT is a simple yet critical representation that should reach well beyond this paper: a general-purpose spatial encoding for any embodied agent that needs to reason about object interactions.

Why do dense auxiliary objectives work?

Think about what supervision the policy actually receives from a single demonstration: at each timestep, the model sees one image and one set of ICTs, and is asked to regress one action chunk. That's a single, narrow learning signal — action in, action out. With only minutes of data, that's just not a lot of bits flowing through the gradient.

But each demonstration secretly carries far more information than the action label alone. Where does the object end up? Where do the hand and object project on the image over time? How does the scene's internal state evolve? All of that signal is sitting for free inside every trajectory — we just aren't asking the model to predict it.

So we add three auxiliary objectives, all sharing the same encoder as the flow-matching head: Object Motion (forecast each manipulated object's future 6-DoF trajectory), 2D Trace (forecast the image-plane projection of hand and object — the path your eyes would follow watching the video), and Latent Consistency (forecast the encoder's own internal representation K steps ahead).
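
To make the 2D Trace target concrete, here is a minimal sketch of projecting future 3D hand or object positions onto the image plane. It assumes an ideal pinhole camera and points already expressed in the current camera frame; the actual pipeline may use a different camera model (for example a fisheye model), and the function name and intrinsics are hypothetical.

```python
import numpy as np


def project_points(points_cam, fx, fy, cx, cy):
    """Pinhole projection of (N, 3) camera-frame points to (N, 2) pixel coordinates."""
    points_cam = np.asarray(points_cam, dtype=float)
    z = points_cam[:, 2]
    if np.any(z <= 0):
        raise ValueError("points must be in front of the camera (z > 0)")
    u = fx * points_cam[:, 0] / z + cx
    v = fy * points_cam[:, 1] / z + cy
    return np.stack([u, v], axis=1)


# Future 3D positions of one entity over the next three steps (made-up values).
future_positions = np.array([
    [0.00, 0.00, 1.00],
    [0.05, 0.02, 1.00],
    [0.10, 0.20, 2.00],
])
trace_target = project_points(future_positions, fx=100.0, fy=100.0, cx=50.0, cy=40.0)
```

Each row of `trace_target` is one pixel location along the future path, so the supervision target comes from pose estimates the pipeline already produces.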

Zoom out, and all three are asking the model to do the same thing: forecast how the scene evolves over the next few steps, in three complementary spaces — 3D physical, 2D visual, and the encoder's own latent state. Together, this turns the encoder into a lightweight world model of hand–object interaction, sitting right inside the policy for free.

Because all three losses share the same encoder, the encoder is forced to learn the causal structure of manipulation — not just what action comes next, but why: how the scene will respond. And critically, none of these targets cost us anything to obtain — they are computed automatically from the same perception pipeline that gave us the ICTs. No extra annotation, no extra data — just more questions asked of the same demonstration.
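
A minimal sketch of how forecasting targets can be built from a single trajectory and folded into one training loss. This is an illustration, not the HumanEgo training code: the helper names, the choice of mean squared error, and the loss weights are assumptions, and the page does not state how the individual terms are weighted.

```python
import numpy as np


def k_step_pairs(sequence, k):
    """Pair each step t with the value k steps later, using one trajectory only."""
    sequence = np.asarray(sequence)
    if k <= 0 or k >= len(sequence):
        raise ValueError("k must satisfy 0 < k < len(sequence)")
    return sequence[:-k], sequence[k:]


def mse(pred, target):
    """Mean squared error between two arrays of the same shape."""
    diff = np.asarray(pred, dtype=float) - np.asarray(target, dtype=float)
    return float(np.mean(diff ** 2))


def combined_loss(action_loss, aux_losses, weights):
    """Action loss plus a weighted sum of auxiliary losses (weights are hypothetical)."""
    return action_loss + sum(weights[name] * value for name, value in aux_losses.items())


# Toy usage: the same sequence supplies both the inputs and the K-step-ahead targets.
object_positions = np.arange(5, dtype=float)
current, future = k_step_pairs(object_positions, k=2)
aux = {"object_motion": mse(current, future), "trace_2d": 0.5, "latent": 0.25}
total = combined_loss(1.0, aux, {"object_motion": 0.1, "trace_2d": 1.0, "latent": 2.0})
```

The point of `k_step_pairs` is that every target is a shifted copy of a signal the perception pipeline already produced, so no extra annotation is needed to supervise the forecasting heads.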

Why is human data better and more efficient than robot data?

First, an important caveat: when we say human data, we mean carefully designed human-egocentric data — captured with intent, the manipulation task clearly in mind, the camera stable on the head, the hand pose tracked. Comparing random YouTube footage against carefully designed teleop data is apples-to-oranges. What we are comparing here is carefully designed human data against carefully designed teleop data.

Higher-quality data is intrinsically easier for humans to produce. A demonstrator naturally generates motion that is smooth, dexterous, fast, and physically plausible: they cover a much larger workspace than any single robot, switch grip strategies on the fly, and adapt in milliseconds — without any active control loop or training. Teleoperation, by contrast, is fundamentally a lossy remote-control problem: smoothness, speed, and dexterity all bottleneck on the operator's skill at piloting the robot. Top-tier teleop exists, but it is rare and expensive; everyday teleop produces choppy, slow, or unrepresentative trajectories that the policy then has to learn from.

Human data generalizes more naturally across embodiments and environments. Given the right processing pipeline (ours uses an entity-relative ICT representation), human data is inherently more transferable — it isn't bound to any particular robot's kinematic configuration, gripper, base height, or camera mount, so the same dataset can serve many target embodiments. Teleop data is the opposite: it is born inside one specific robot's body, so a new arm, a new gripper, or even a new camera mount typically means re-collecting the entire dataset.

Our Team

University of Maryland

Acknowledgments

We thank Eadom Dessalene, Yoonkyo Jung, Zikui Cai, and other members of the PRG Lab and Furong's Lab for their helpful feedback and support throughout this project.

BibTeX

@misc{humanego2026,
  title         = {HumanEgo: Zero-Shot Robot Learning from Minutes of Human Egocentric Videos},
  author        = {Wang, Zhi and He, Botao and Yu, Kelin and Lee, Seungjae and Gao, Ruohan and Huang, Furong and Aloimonos, Yiannis},
  year          = {2026},
  eprint        = {XXXX.XXXXX},
  archivePrefix = {arXiv},
  primaryClass  = {cs.RO}
}