GIFT: Generalizing Intent for Flexible Test-Time Rewards

1NC State University, 2MIT CSAIL

How can a robot generalize a learned reward by reasoning about what the human intended, not what looks or sounds similar?

Abstract

Robots learn reward functions from user demonstrations, but these rewards often fail to generalize to new environments. This failure occurs because learned rewards latch onto spurious correlations in training data rather than the underlying human intent that demonstrations represent. Existing methods leverage visual or semantic similarity to improve robustness, yet these surface-level cues often diverge from what humans actually care about. We present Generalizing Intent for Flexible Test-Time rewards (GIFT), a framework that grounds reward generalization in human intent rather than surface cues. GIFT leverages language models to infer high-level intent from user demonstrations by contrasting preferred with non-preferred behaviors. At deployment, GIFT maps novel test states to behaviorally equivalent training states via intent-conditioned similarity, enabling learned rewards to generalize across distribution shifts without retraining. We evaluate GIFT on tabletop manipulation tasks with new objects and layouts. Across four simulated tasks with over 50 unseen objects, GIFT consistently outperforms visual and semantic similarity baselines in test-time pairwise win rate and state-alignment F1 score.

GIFT pipeline teaser

Top. During training, the robot receives task demonstrations (loading a paintbrush) and uses them to infer the user's high-level intent (“load art supplies”). Bottom. At test time, the robot encounters three unseen objects. GIFT uses the inferred intent to correctly identify that the molding clay is relevant. In contrast, visual-similarity baselines incorrectly prefer the dish scrubber due to its appearance, and language-similarity baselines make an analogous mistake (“toothbrush” and “paintbrush”).

How GIFT Works

GIFT pipeline

Generalizing Intent for Flexible Test-Time rewards. Left. GIFT infers the human's intent given pairs of human-preferred demonstrations and reference trajectories. Right. During inference, GIFT deduces which objects in the unseen states should play the same roles as objects in the training states. The unseen state components are then aligned to training states so that the reward function learned before deployment can be reused for planning.

GIFT achieves test-time reward generalization by defining a state similarity function conditioned on the human’s high-level intent. This allows novel test states to be mapped into the training domain so the learned reward can be reused without retraining.

Inferring high-level intent. GIFT first infers the user’s underlying intent by contrasting human-preferred demonstrations with reference trajectories from the same scenes. The LM outputs a natural-language summary of the high-level intent that explains why the preferred behavior is preferred.
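The contrastive query above can be sketched as prompt construction. This is a minimal illustration, not the paper's actual prompt or API: the function name, the demonstration strings, and the phrasing are all assumptions.

```python
# Hedged sketch: building a contrastive prompt for intent inference.
# The name build_intent_prompt and the example demos are illustrative.

def build_intent_prompt(pairs):
    """Format (preferred, reference) trajectory pairs into an LM prompt
    asking for the high-level intent that explains the preference."""
    lines = ["A human prefers trajectory A over trajectory B in each scene."]
    for i, (preferred, reference) in enumerate(pairs, 1):
        lines.append(f"Scene {i}: A = {preferred}; B = {reference}")
    lines.append("In one sentence, state the high-level intent that "
                 "explains why A is preferred in every scene.")
    return "\n".join(lines)

pairs = [
    ("pick up paintbrush, place in backpack", "pick up mug, place in backpack"),
    ("pick up sketchpad, place in backpack", "leave sketchpad on table"),
]
prompt = build_intent_prompt(pairs)
# The prompt would then be sent to an LM, which might return a
# natural-language intent such as "pack art supplies".
```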

Intent-conditioned alignment. Given the inferred intent, GIFT treats states as similar based on that intent rather than low-level visual or language features. At test time, GIFT aligns each novel state to its nearest intent-equivalent training state, so unseen objects can be interpreted according to the same task-relevant roles seen during training.

Test-time reward reuse. After alignment, GIFT evaluates the aligned trajectory with the fixed reward learned on the training data. In this way, the alignment operator carries the burden of generalization, while the reward itself remains unchanged at deployment.
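The two deployment steps above, alignment then reward reuse, can be sketched with toy stand-ins. In the real system an LM judges intent relevance; here a keyword match plays that role, and all names (`intent_relevance`, `align_state`, `reward_on_aligned`) are illustrative assumptions, not the paper's API.

```python
# Hedged sketch of intent-conditioned alignment and fixed-reward reuse.

def intent_relevance(obj, intent_keywords):
    """Toy relevance judgment: 1.0 if the object matches any intent
    keyword. The paper queries an LM for this judgment instead."""
    return float(any(k in obj for k in intent_keywords))

def align_state(test_objs, train_objs, intent_keywords):
    """Map each intent-relevant test object onto an intent-relevant
    training object; irrelevant objects map to a distractor role."""
    relevant_train = [o for o in train_objs
                      if intent_relevance(o, intent_keywords) > 0]
    mapping = {}
    for obj in test_objs:
        if intent_relevance(obj, intent_keywords) > 0 and relevant_train:
            mapping[obj] = relevant_train[0]
        else:
            mapping[obj] = "distractor"
    return mapping

def reward_on_aligned(traj, mapping, train_reward):
    """Evaluate the fixed training-time reward on the aligned trajectory;
    only the alignment changes at deployment, never the reward."""
    return train_reward([mapping[o] for o in traj])

# Toy example in the spirit of Pack Backpack:
mapping = align_state(
    ["molding clay", "dish scrubber", "toothbrush"],
    ["paintbrush", "mug"],
    ("paint", "clay", "art"),
)
train_reward = lambda objs: objs.count("paintbrush")  # toy fixed reward
score = reward_on_aligned(["molding clay", "toothbrush"], mapping, train_reward)
```

The design point the sketch preserves is that `train_reward` is frozen: generalization lives entirely in `align_state`.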

Experiments

We evaluate GIFT on four tabletop manipulation tasks under test-time distribution shift, including unseen objects, new layouts, and real-world deployment. We compare against visual similarity, language similarity, and an LM-based mapping baseline without inferred intent.

The tasks span distinct forms of high-level intent:

  • Place Mug: avoid carrying liquids near water-sensitive objects.
  • Sweep Spill: move paper-based items away from a spill.
  • Pack Backpack: pack art supplies.
  • Store Into Drawer: store valuables.

Before evaluating reward generalization, we first verify that the LM infers reasonable high-level intents from demonstrations. We then study three questions: whether intent-conditioned similarity improves test-time reward performance, whether it improves state alignment under confounds, and whether it transfers to the physical world.

Sanity Check: Does the LM infer reasonable intents?

We compare LM-inferred intent against the ground-truth task intent in both simulation and real-world demonstrations. This checks whether the inferred intent is a usable conditioning signal for alignment.

Similarity between LM-inferred intent and ground-truth intent

Similarity between LM-inferred intent and ground-truth intent for simulated Jaco and real-world Franka demonstrations. We gave the LM three demonstration pairs each from a simulated Jaco robot and a real-world Franka robot, and tasked it with deducing the human's intent. In both settings, the inferred intents were close enough to the ground truth to serve as a conditioning variable for alignment.

RQ1: Does intent-conditioned similarity improve test-time reward performance?

We evaluate pairwise win rate: given two candidate trajectories, can the method correctly predict which behavior the human would prefer at test time?
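The metric above can be sketched directly; the toy reward below is a stand-in (an illustrative assumption, not the learned reward) that counts intent-relevant objects.

```python
# Hedged sketch: pairwise win rate over (preferred, other) trajectory pairs.

def pairwise_win_rate(pairs, reward_fn):
    """Fraction of pairs on which the induced reward scores the
    human-preferred trajectory strictly higher than the alternative."""
    wins = sum(reward_fn(pref) > reward_fn(other) for pref, other in pairs)
    return wins / len(pairs)

# Toy reward: count intent-relevant objects in the trajectory.
toy_reward = lambda traj: sum(o in {"clay", "sketchpad"} for o in traj)
pairs = [
    (["clay"], ["scrubber"]),        # correctly preferred
    (["sketchpad"], ["toothbrush"]), # correctly preferred
    (["scrubber"], ["clay"]),        # toy reward gets this one wrong
]
rate = pairwise_win_rate(pairs, toy_reward)  # 2 of 3 pairs correct
```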

Test-time pairwise win rate

Test-time pairwise win rate across the four tasks. GIFT achieves the highest win rate across environments and unseen objects.

RQ2: When do low-level visual and language features fail at test time?

We next study where low-level similarity breaks down. Each method maps a test-time object either to a training-time role or to a distractor, and we evaluate the resulting alignment using F1, false positives, and false negatives. For the more complex tasks, we also isolate confounding cases involving similar names, similar appearances, and objects whose grouping depends on the intended task.
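Treating alignment as a binary decision per test object (mapped to a task-relevant role vs. a distractor), the metrics above reduce to standard counts. A minimal sketch, with illustrative inputs:

```python
# Hedged sketch: F1, false-positive rate, and false-negative rate for
# binary role-vs-distractor alignment decisions.

def alignment_metrics(pred, truth):
    """pred[i]: method mapped object i to a task-relevant role.
    truth[i]: object i is truly relevant under the intent."""
    tp = sum(p and t for p, t in zip(pred, truth))
    fp = sum(p and not t for p, t in zip(pred, truth))
    fn = sum(t and not p for p, t in zip(pred, truth))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return f1, fp / len(pred), fn / len(pred)

# Toy example: one correct role, one false positive, one miss, one
# correctly ignored distractor.
f1, fp_rate, fn_rate = alignment_metrics(
    pred=[True, True, False, False],
    truth=[True, False, True, False],
)
```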

Test-time state-alignment F1 score

Test-time state-alignment F1 across tasks. GIFT achieves the strongest overall alignment by conditioning on intent rather than surface similarity.

False positive and false negative rates for Store Into Drawer confounding states
False positive and false negative rates for Pack Backpack confounding states

FP/FN (%) on the Confounding States, Sconf. GIFT remains low across categories by relying on intent relevance. The oracle baselines, by contrast, merely retune thresholds and trade errors across confounds: thresholding cannot correct a misaligned similarity signal, so they show high errors. GIFT's ablation, LM No Intent, performs worse because it fails to recognize which test states are intent-relevant, leading to a high FN rate.

To understand these failures qualitatively, we also inspect how low-level language and vision similarities organize unseen objects. In the Pack Backpack task, distractors such as toothbrush and broomstick are pulled toward paintbrush, while relevant objects such as molding clay and tablet with stylus are pushed away. In the Store Into Drawer task, low-level similarity makes paper ring appear too close to diamond ring, while other valuable items can appear too dissimilar.

Language-vision similarity plot for pack backpack with art supplies

Language-vision similarity plot for Pack Backpack with Art Supplies.

Language-vision similarity plot for store valuables

Language-vision similarity plot for Store Valuables Into Drawer.

RQ3: Does GIFT transfer to robots in the physical world?

We recreate two tasks on a 7-DoF Franka Panda robot using held-out physical objects. For each method, we repeatedly sample small sets of executable candidate trajectories, score them with the induced reward, and convert those scores into a Boltzmann distribution over the candidates. This tests whether intents inferred from real-world demonstrations can guide reward reuse at deployment.
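The score-to-probability step above is a standard Boltzmann (softmax) distribution; a minimal sketch, where `beta` is an assumed rationality/temperature parameter and the scores are illustrative:

```python
import math

# Hedged sketch: converting candidate-trajectory reward scores into a
# Boltzmann distribution, with max-subtraction for numerical stability.

def boltzmann(scores, beta=1.0):
    """P(traj_i) proportional to exp(beta * score_i)."""
    m = max(scores)  # subtract the max so exp() cannot overflow
    exps = [math.exp(beta * (s - m)) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

# Toy candidate scores: the highest-reward candidate receives the
# most probability mass, but alternatives retain some.
probs = boltzmann([0.2, 1.5, 0.9], beta=2.0)
```

Larger `beta` concentrates mass on the top-scoring candidate; `beta = 0` recovers a uniform distribution over candidates.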

Real-world behavior on Franka

Real-world behavior on Franka. GIFT places more probability mass on held-out objects that match the inferred intent, such as packing molding clay and storing valuable items, while treating confounds as distractors.

Pack Backpack with Art Supplies

GIFT: Molding Clay

Vision Similarity: Dish Scrubber

Language Similarity: Toothbrush

LM No Intent: Salt Shaker

Store Valuables Into Drawer

GIFT: iPhone

Vision Similarity: Paper Ring

Language Similarity: Digital Clock

LM No Intent: Hand Sanitizer

RQ4: Is GIFT flexible across different intents?

We also test whether GIFT can handle the same environments under different intents. In these examples, the task setup remains similar, but the relevant object category changes with the user’s intended goal.

Pack backpack with reading materials

Store electronics into the drawer

Left. GIFT correctly ignores the art supply and packs the notebook under the new intent. Right. GIFT treats the valuable bracelet as a distractor and stores the electronic headset instead.

BibTeX

@misc{amin2026giftgeneralizingintentflexible,
  title         = {GIFT: Generalizing Intent for Flexible Test-Time Rewards},
  author        = {Fin Amin and Nathaniel Dennler and Andreea Bobu},
  year          = {2026},
  eprint        = {2603.22574},
  archivePrefix = {arXiv},
  primaryClass  = {cs.RO},
  url           = {https://arxiv.org/abs/2603.22574}
}