Masked IRL: LLM-Guided Reward Disambiguation
from Demonstrations and Language

MIT CSAIL

ArXiv (TBD) | Code


How can robots learn reward functions that capture true human preferences when demonstrations and instructions are ambiguous?

Abstract


Robots can adapt to user preferences by learning reward functions from demonstrations, but with limited data, reward models often overfit to spurious correlations and fail to generalize. This happens because demonstrations show robots how to do a task but not what matters for that task, causing the model to focus on irrelevant state details. Natural language can more directly specify what the robot should focus on, and, in principle, disambiguate between many reward functions consistent with the demonstrations. However, existing language-conditioned reward learning methods typically treat instructions as simple conditioning signals, without fully exploiting their potential to resolve ambiguity. Moreover, real instructions are often ambiguous themselves, so naive conditioning is unreliable. Our key insight is that these two input types carry complementary information: demonstrations show how to act, while language specifies what is important. We propose Masked Inverse Reinforcement Learning (Masked IRL), a framework that uses large language models (LLMs) to combine the strengths of both input types. Masked IRL infers state-relevance masks from language instructions and enforces invariance to irrelevant state components. When instructions are ambiguous, it uses LLM reasoning to clarify them in the context of the demonstrations. In simulation and on a real robot, Masked IRL outperforms prior language-conditioned IRL methods by up to 15% while using up to 4.7 times less data, demonstrating improved sample-efficiency, generalization, and robustness to ambiguous language.


Method: Masked Inverse Reinforcement Learning

Masked IRL is a language-conditioned reward learning framework that reasons jointly over language and demonstrations. Given an ambiguous instruction and a user demonstration, an LLM first disambiguates the language in the context of a reference (shortest-path) trajectory. The clarified instruction is then passed to a second LLM that predicts which state dimensions are relevant for that preference, producing a binary state relevance mask.
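As a concrete illustration, the mask-prediction step might look roughly like the sketch below. The `call_llm` hook, prompt wording, and dimension names (following the semantic features used in the experiments below) are illustrative placeholders, not the exact prompts or interface used in the paper.

```python
# Minimal sketch of the state-relevance mask prediction step. The call_llm
# hook, prompt wording, and dimension names are illustrative placeholders.
STATE_DIMS = ["dist_to_table", "dist_to_human", "dist_to_laptop",
              "dist_to_face", "mug_orientation"]

def call_llm(prompt: str) -> str:
    """Placeholder for a chat-completion request to an LLM."""
    raise NotImplementedError

def predict_state_mask(clarified_instruction: str) -> list[int]:
    """Ask the LLM which state dimensions matter for the given preference,
    returning a binary relevance mask (1 = relevant, 0 = irrelevant)."""
    prompt = (
        f"Instruction: '{clarified_instruction}'\n"
        f"State dimensions: {', '.join(STATE_DIMS)}\n"
        "For each dimension, answer 1 if it is relevant to the instruction "
        "and 0 otherwise. Reply with a comma-separated list of 0/1 flags."
    )
    reply = call_llm(prompt)
    return [int(tok.strip()) for tok in reply.split(",")]
```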

Instead of explicitly zeroing out masked dimensions, Masked IRL applies an implicit masking loss: it perturbs irrelevant state dimensions with random noise and penalizes changes in the predicted reward. This drives the reward model to become invariant to irrelevant features while remaining sensitive to the parts of the state that language indicates matter for the task. The overall objective combines a standard Maximum Entropy IRL loss with this masking loss.
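A minimal PyTorch sketch of this objective is given below, assuming a reward network that scores a state vector conditioned on a language embedding; the architecture, noise scale, and loss weighting are assumptions for illustration rather than the exact implementation.

```python
# Minimal PyTorch sketch of the implicit masking loss combined with a MaxEnt
# IRL term. Architecture, noise scale, and the loss weight are assumptions.
import torch
import torch.nn as nn

class LanguageConditionedReward(nn.Module):
    """Scalar reward from a state vector and a language embedding."""
    def __init__(self, state_dim: int, lang_dim: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + lang_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, states, lang_emb):
        lang_emb = lang_emb.expand(*states.shape[:-1], lang_emb.shape[-1])
        return self.net(torch.cat([states, lang_emb], dim=-1)).squeeze(-1)

def masking_loss(reward_net, states, lang_emb, relevance_mask, noise_scale=1.0):
    """Perturb only the irrelevant state dimensions with random noise and
    penalize any change in the predicted reward.

    relevance_mask: (state_dim,) tensor of 0/1 flags (1 = relevant).
    """
    irrelevant = 1.0 - relevance_mask.float()   # 1 where language says "ignore"
    perturbed = states + torch.randn_like(states) * noise_scale * irrelevant
    return (reward_net(states, lang_emb) - reward_net(perturbed, lang_emb)).pow(2).mean()

def maxent_irl_loss(reward_net, demo_states, candidate_states, lang_emb):
    """MaxEnt IRL surrogate: the demonstration's return should be high under a
    Boltzmann distribution over the demo and sampled alternative trajectories."""
    r_demo = reward_net(demo_states, lang_emb).sum()              # scalar return of the demo
    r_cands = reward_net(candidate_states, lang_emb).sum(dim=-1)  # (num_candidates,) returns
    logits = torch.cat([r_demo.unsqueeze(0), r_cands])
    return -(r_demo - torch.logsumexp(logits, dim=0))             # -log p(demo)

# Overall objective (lambda_mask is an assumed weighting coefficient):
#   loss = maxent_irl_loss(...) + lambda_mask * masking_loss(...)
```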

At test time, the reward model takes a new instruction and state as input and implicitly infers which state components are important through its language-conditioned architecture, enabling trajectory optimization for novel language-specified preferences.

Overview of the Masked IRL pipeline.
System overview of Masked IRL: LLM-based language disambiguation, state mask prediction, masking loss, and trajectory optimization.

Experiments

RQ1: Efficiency of the Masking Loss (Simulation)

We first evaluate Masked IRL in a PyBullet simulation of an object handover task with a Franka arm. Ground-truth rewards are linear combinations of five semantic features (distances to the table, human, laptop, human face, and mug orientation), and each preference is paired with a language instruction that refers to a subset of these features. We compare Masked IRL to a language-conditioned IRL baseline (LC-RL) and an explicit masking baseline that zeros out state dimensions indicated as irrelevant by the mask.
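For concreteness, a sketch of how such a ground-truth preference can be represented is shown below; the feature names, weights, and sign convention are illustrative.

```python
# Sketch of a simulated ground-truth preference: a linear combination of five
# semantic features paired with a language instruction that refers to a subset
# of them. Weights and sign conventions here are illustrative.
import numpy as np

FEATURES = ["dist_to_table", "dist_to_human", "dist_to_laptop",
            "dist_to_face", "mug_orientation"]

def ground_truth_reward(feature_values: np.ndarray, weights: np.ndarray) -> float:
    """R(s) = w . phi(s); sparse preferences use one nonzero weight,
    medium and dense preferences weight several features."""
    assert feature_values.shape == weights.shape == (len(FEATURES),)
    return float(weights @ feature_values)

# Example sparse preference: only the distance to the human matters.
w_sparse = np.array([0.0, -1.0, 0.0, 0.0, 0.0])  # negative weight: closer is better
instruction = "Stay close to the human."          # paired language instruction
```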

Average win rate vs number of demonstrations for each method.
Average win rate when comparing trajectories across sparse, medium, and dense rewards.

Masked IRL achieves higher win rates than LC-RL and remains robust under noisy LLM-generated masks, whereas explicit masking degrades sharply when the masks are imperfect. The masking loss improves sample efficiency: Masked IRL can match or exceed LC-RL performance with up to 4.7× fewer demonstrations.


Example Simulated Trajectories per Method

Below are example optimized trajectories in simulation.


Example Visualization of Learned Rewards

Below are visualizations of learned rewards on trajectory sets. Trajectories with higher reward are drawn in darker blue.


RQ2: Robustness to Ambiguous Language

To study robustness to underspecified instructions, we generate ambiguous commands (such as referent-omitted “Stay away” or expression-omitted “Table”) for sparse preferences that focus on a single feature. Masked IRL uses the LLM disambiguation step to propose clarified commands in context (e.g., “Stay away from the table” or “Stay close to the human”), and then derives state masks from these clarified instructions.
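A rough sketch of this disambiguation call is shown below, using the same placeholder `call_llm` hook as in the mask-prediction sketch above; the prompt wording is illustrative, not the paper's actual prompt.

```python
# Sketch of the LLM disambiguation step for ambiguous commands. The call_llm
# hook and prompt wording are placeholders.
def call_llm(prompt: str) -> str:
    """Placeholder for a chat-completion request to an LLM."""
    raise NotImplementedError

def disambiguate(ambiguous_instruction: str, demo_summary: str, reference_summary: str) -> str:
    """Clarify an underspecified command using the demonstration as context."""
    prompt = (
        f"Ambiguous instruction: '{ambiguous_instruction}'\n"
        f"User demonstration: {demo_summary}\n"
        f"Reference shortest-path trajectory: {reference_summary}\n"
        "Rewrite the instruction so that both the referent and the spatial "
        "relation are explicit, e.g. 'Stay away' -> 'Stay away from the table'."
    )
    return call_llm(prompt)
```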

Performance with ambiguous vs disambiguated instructions.
Disambiguated instructions (DI) improve state-mask quality and downstream reward prediction compared to using ambiguous commands directly (AI).

RQ3: Real-World Evaluation on a Franka Arm

Finally, we evaluate zero-shot transfer to a real Franka Emika Panda arm performing an object handover task with a human. We collect kinesthetic demonstrations for 50 language-labeled preferences and fit reward models using the same training procedure as in simulation. At test time, we optimize trajectories over a set of candidate motions and execute the trajectory that maximizes the learned reward.
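At a high level, this selection step could look like the sketch below; `reward_net`, `encode_language`, and the candidate motion set are assumed to come from the training pipeline, and the names are illustrative.

```python
# Sketch of test-time trajectory selection: score candidate motions with the
# learned language-conditioned reward and execute the best one.
import torch

def select_trajectory(reward_net, candidates, instruction, encode_language) -> int:
    """candidates: list of (T, state_dim) tensors; returns the index of the
    candidate motion with the highest predicted return under the instruction."""
    lang_emb = encode_language(instruction)  # (lang_dim,) language embedding
    with torch.no_grad():
        returns = torch.stack([reward_net(traj, lang_emb).sum() for traj in candidates])
    return int(returns.argmax())
```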

Trajectory optimization diagram.
Trajectory optimization selects the motion that maximizes the learned reward.
Zero-shot performance and regret on real robot.
Zero-shot performance on novel real-robot preferences: win rate, reward variance under irrelevant perturbations, and regret of optimized trajectories.

Masked IRL obtains higher win rates, lower reward variance when irrelevant state dimensions are perturbed, and substantially lower regret of optimized trajectories than LC-RL and explicit masking baselines, indicating better alignment with the underlying human preferences.


Example Real-Robot Executions per Method

For the test instruction “Stay close to the table surface and away from the human’s face.”:

Baseline LC-RL
LC-RL often fails to simultaneously satisfy both distance and safety constraints.
Baseline Explicit Mask (LLM Mask)
Explicit masking improves behavior but remains sensitive to mask errors.
Ours Masked IRL (LLM Mask)
Masked IRL maintains a safe distance from the human’s face while staying close to the table and keeping the cup upright.

BibTeX

@article{hwang2025maskedirl,
  title   = {Masked IRL: LLM-Guided Reward Disambiguation from Demonstrations and Language},
  author  = {Hwang, Minyoung and Forsey-Smerek, Alexandra and Dennler, Nathaniel and Bobu, Andreea},
  journal = {arXiv preprint},
  year    = {2025},
}