How can robots learn reward functions that capture true human preferences when
demonstrations and instructions are ambiguous?
Abstract
Robots can adapt to user preferences by learning reward functions from demonstrations, but with limited data, reward models often overfit to spurious correlations and fail to generalize.
This happens because demonstrations show robots how to do a task but not what matters for that task, causing the model to focus on irrelevant state details.
Natural language can more directly specify what the robot should focus on, and, in principle, disambiguate between many reward functions consistent with the demonstrations. However, existing language-conditioned reward learning methods typically treat instructions as simple conditioning signals, without fully exploiting their potential to resolve ambiguity. Moreover, real instructions are often ambiguous themselves, so naive conditioning is unreliable. Our key insight is that these two input types carry complementary information: demonstrations show how to act, while language specifies what is important. We propose Masked Inverse Reinforcement Learning (Masked IRL), a framework that uses large language models (LLMs) to combine the strengths of both input types. Masked IRL infers state-relevance masks from language instructions and enforces invariance to irrelevant state components. When instructions are ambiguous, it uses LLM reasoning to clarify them in the context of the demonstrations. In simulation and on a real robot, Masked IRL outperforms prior language-conditioned IRL methods by up to 15% while using up to 4.7 times less data, demonstrating improved sample-efficiency, generalization, and robustness to ambiguous language.
Method: Masked Inverse Reinforcement Learning
Masked IRL is a language-conditioned reward learning framework that reasons jointly
over language and demonstrations. Given an ambiguous instruction and a user
demonstration, an LLM first disambiguates the language in the context of a reference
(shortest-path) trajectory. The clarified instruction is then passed to a second LLM
that predicts which state dimensions are relevant for that preference, producing a
binary state relevance mask.
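To make the two-stage pipeline concrete, below is a minimal sketch of the two LLM calls. The `query_llm` helper, the prompt wording, and the state-dimension names are illustrative placeholders, not the prompts or feature set used in the paper.

```python
# Hypothetical sketch of the two-stage LLM pipeline (not the authors' exact prompts).
# `query_llm` stands in for any chat-completion interface that returns a text response.
import json

# Assumed state-dimension ordering for illustration only.
STATE_DIMS = ["dist_table", "dist_human", "dist_laptop", "dist_face", "mug_orientation"]

def disambiguate(instruction: str, reference_summary: str, query_llm) -> str:
    """Stage 1: clarify an ambiguous instruction in the context of a reference trajectory."""
    prompt = (
        f"The user said: '{instruction}'.\n"
        f"A shortest-path reference trajectory is described as: {reference_summary}.\n"
        "Rewrite the instruction as an unambiguous preference, e.g. 'Stay away from the table'."
    )
    return query_llm(prompt).strip()

def predict_state_mask(clarified_instruction: str, query_llm) -> list[int]:
    """Stage 2: map the clarified instruction to a binary relevance mask over state dimensions."""
    prompt = (
        f"Preference: '{clarified_instruction}'.\n"
        f"For each state dimension in {STATE_DIMS}, output 1 if it is relevant to the preference "
        "and 0 otherwise, as a JSON list of integers."
    )
    mask = json.loads(query_llm(prompt))
    assert len(mask) == len(STATE_DIMS)
    return mask
```

Any chat-completion backend can play the role of `query_llm`; the only requirement is that the second stage returns a parseable binary vector aligned with the reward model's state layout.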
Instead of explicitly zeroing out masked dimensions, Masked IRL applies an
implicit masking loss: it perturbs irrelevant state dimensions with random noise
and penalizes changes in the predicted reward. This drives the reward model to become
invariant to irrelevant features while remaining sensitive to the parts of the state that
language indicates matter for the task. The overall objective combines a standard
Maximum Entropy IRL loss with this masking loss.
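A minimal PyTorch-style sketch of this objective is shown below, assuming a reward model that maps a state and a language embedding to a scalar reward, and a binary mask with 1 for relevant dimensions. The noise scale and trade-off weight are illustrative defaults, not values reported in the paper.

```python
import torch

def masking_loss(reward_model, states, lang_emb, mask, noise_scale=1.0):
    # mask: 1 for relevant state dimensions, 0 for irrelevant ones (per the LLM prediction).
    # Perturb only the irrelevant dimensions with random noise.
    noise = torch.randn_like(states) * noise_scale
    perturbed = states + (1.0 - mask) * noise
    # Penalize any change in predicted reward caused by the irrelevant perturbation.
    r_clean = reward_model(states, lang_emb)
    r_pert = reward_model(perturbed, lang_emb)
    return ((r_clean - r_pert) ** 2).mean()

def total_loss(maxent_irl_loss, reward_model, states, lang_emb, mask, lam=1.0):
    # Overall objective: standard MaxEnt IRL loss plus the invariance (masking) term.
    # `lam` is an assumed trade-off weight, not a value from the paper.
    return maxent_irl_loss + lam * masking_loss(reward_model, states, lang_emb, mask)
```

Because the penalty only compares predictions before and after perturbing irrelevant dimensions, the model is free to remain fully sensitive to the dimensions the language marks as relevant.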
At test time, the reward model takes a new instruction and state as input and implicitly
infers which state components are important through its language-conditioned architecture,
enabling trajectory optimization for novel language-specified preferences.
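A sketch of this test-time selection step follows, assuming the learned model scores individual states conditioned on a language embedding and that an external sampler supplies the candidate motions; the function and argument names are illustrative.

```python
import torch

def select_trajectory(reward_model, candidates, lang_emb):
    # candidates: list of (T, state_dim) tensors representing candidate motions.
    # lang_emb: (1, d) language embedding, broadcast to every state in a trajectory.
    scores = []
    with torch.no_grad():
        for traj in candidates:
            per_state = reward_model(traj, lang_emb.expand(traj.shape[0], -1))
            scores.append(per_state.sum().item())  # cumulative predicted reward
    best = max(range(len(candidates)), key=lambda i: scores[i])
    return candidates[best], scores[best]
```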
System overview of Masked IRL: LLM-based language disambiguation, state mask prediction,
masking loss, and trajectory optimization.
Experiments
RQ1: Efficiency of the Masking Loss (Simulation)
We first evaluate Masked IRL in a PyBullet simulation of an object handover task with a Franka
arm. Ground-truth rewards are linear combinations of five semantic features (distances to the
table, human, laptop, and human face, plus mug orientation), and each preference is paired with a
language instruction that refers to a subset of these features. We compare Masked IRL to a
language-conditioned IRL baseline (LC-RL) and an explicit masking baseline that zeros out
state dimensions indicated as irrelevant by the mask.
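For concreteness, each simulated preference can be viewed as a weight vector over the five semantic features; the sketch below illustrates that structure with assumed feature names and an example sparse weighting, not the exact values used in the benchmark.

```python
import numpy as np

# Assumed feature order; sparse preferences set most weights to zero,
# denser preferences weight several features at once.
FEATURES = ["dist_table", "dist_human", "dist_laptop", "dist_face", "mug_orientation"]

def ground_truth_reward(feature_values: np.ndarray, weights: np.ndarray) -> float:
    # Linear combination of semantic features, as in the simulated benchmark.
    return float(weights @ feature_values)

# Example: a sparse preference that only cares about distance to the human.
sparse_weights = np.array([0.0, 1.0, 0.0, 0.0, 0.0])
```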
Average win rate when comparing trajectories across sparse, medium, and dense rewards.
Masked IRL achieves higher win rates than LC-RL and remains robust under noisy LLM-generated
masks, whereas explicit masking degrades sharply when the masks are imperfect. The masking
loss improves sample efficiency: Masked IRL can match or exceed LC-RL performance with up to
4.7× fewer demonstrations.
RQ2: Robustness to Ambiguous Language
To study robustness to underspecified instructions, we generate ambiguous commands (such as
referent-omitted “Stay away” or expression-omitted “Table”) for sparse preferences that focus
on a single feature. Masked IRL uses the LLM disambiguation step to propose clarified
commands in context (e.g., “Stay away from the table” or “Stay close to the human”), and then
derives state masks from these clarified instructions.
Disambiguated instructions (DI) improve state-mask quality and downstream reward prediction
compared to using the ambiguous instructions (AI) directly.
RQ3: Real-World Evaluation on a Franka Arm
Finally, we evaluate zero-shot transfer to a real Franka Emika Panda arm performing an object
handover task with a human. We collect kinesthetic demonstrations for 50 language-labeled
preferences and fit reward models using the same training procedure as in simulation. At test
time, we optimize trajectories over a set of candidate motions and execute the trajectory that
maximizes the learned reward.
Trajectory optimization selects the motion that maximizes the learned reward.
Zero-shot performance on novel real-robot preferences: win rate, reward variance under
irrelevant perturbations, and regret of optimized trajectories.
Masked IRL obtains higher win rates, lower reward variance when irrelevant state dimensions
are perturbed, and substantially lower regret of optimized trajectories than LC-RL and explicit
masking baselines, indicating better alignment with the underlying human preferences.
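The sketch below shows how such metrics can be computed under their standard definitions: regret is the ground-truth reward gap between the best achievable trajectory and the one selected by the learned model, and the variance is measured under noise applied only to irrelevant state dimensions. The helper names and arguments are illustrative, not the paper's evaluation code.

```python
import numpy as np

def regret(gt_reward, learned_best_traj, gt_best_traj):
    # Ground-truth reward gap between the trajectory that maximizes the true reward
    # and the trajectory selected by optimizing the learned reward.
    return gt_reward(gt_best_traj) - gt_reward(learned_best_traj)

def reward_variance_under_irrelevant_noise(reward_model, state, lang_emb, mask,
                                           n_samples=100, noise_scale=1.0):
    # Variance of the predicted reward when only irrelevant dimensions are perturbed;
    # lower is better, since the model should ignore those dimensions.
    preds = []
    for _ in range(n_samples):
        noise = np.random.randn(*state.shape) * noise_scale
        preds.append(reward_model(state + (1 - mask) * noise, lang_emb))
    return float(np.var(preds))
```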
Example Real-Robot Executions per Method
Below are video comparisons of our method (Masked IRL) versus baselines across four different scenarios:
Scenario 1: Stay away from human
The robot is tasked with moving a sealed snack bag from the left side of a dining table to a marked spot on the right,
while a person stands close by, casually leaning against the table and occasionally shifting their weight or moving their arms
as they eat breakfast. The snack bag itself is lightweight and harmless, so the human's concern is not about spilling or
damaging the object, but about feeling comfortable and safe sharing the workspace.
Baseline: LC-RL
Baseline method execution.
Ours: Masked IRL
Masked IRL maintains safe distance from the human.
Scenario 2: Stay close to table surface
The robot is instructed to wipe a section of the table using a towel, with the goal of moving the towel from one point
on the table to another to remove moisture and dust. The towel is flexible, so its orientation and grasp direction do not matter,
but effective cleaning clearly depends on maintaining continuous contact with the table surface.
Baseline: LC-RL
Baseline method execution.
Ours: Masked IRL
Masked IRL maintains continuous contact with the table surface.
Scenario 3: Stay away from the table surface. Stay away from the human.
The robot is asked to transfer a soft snack bag from a pickup location to a person's hand,
while the table surface itself is visibly cluttered with small objects of different shapes and heights.
The object remains intact in all cases, but the preferred behavior is reflected in whether the robot chooses a path
that keeps clear of both the human and the potentially contaminated table surface.
Baseline: LC-RL
Baseline method execution.
Ours: Masked IRL
Masked IRL avoids both the contaminated table surface and the human.
Scenario 4: Stay away from the human. Stay away from the laptop.
The robot is moving a coffee mug filled with liquid from one end of a desk to the other,
in a setting where a person is seated directly in front of an open laptop,
typing on the keyboard. The scene immediately suggests high stakes: a spill could damage the laptop or splash the person,
even though the robot's task is nominally simple.
The main difference between trajectories is how carefully the robot avoids regions associated with risk and discomfort.
Baseline: LC-RL
Baseline method execution.
Ours: Masked IRL
Masked IRL carefully avoids both the human and the laptop to prevent spills.
BibTeX
@inproceedings{hwang2026maskedirl,
  title     = {Masked IRL: LLM-Guided Reward Disambiguation from Demonstrations and Language},
  author    = {Hwang, Minyoung and Forsey-Smerek, Alexandra and Dennler, Nathaniel and Bobu, Andreea},
  booktitle = {{IEEE} International Conference on Robotics and Automation ({ICRA})},
  year      = {2026},
}