In this study, we consider the problem of predicting task success for object manipulation by a robot, based on the instruction sentence and egocentric images taken before and after the manipulation. Conventional approaches, including multimodal large language models, fail to appropriately understand detailed characteristics of objects and subtle changes in object positions.
We propose the Contrastive \(\boldsymbol \lambda\)-Repformer, which predicts task success for tabletop manipulation tasks by aligning images with instruction sentences. Our method integrates three key types of features into a multi-level aligned representation: features that preserve local image information; features aligned with natural language; and features structured through natural language. This allows the model to focus on subtle changes by looking at the differences in the representation between the two images.
To evaluate our approach, we built a novel dataset from real-world environments. The results show that our approach outperformed existing approaches, including multimodal LLMs, on this dataset. Our best model achieved an improvement of 8.66 points in accuracy over a representative multimodal LLM-based model.
We tackle the task of predicting whether a tabletop object manipulation task was performed successfully, given the instruction sentence and egocentric images taken before and after the manipulation. We define this task as Success Prediction for Object Manipulation (SPOM).
Fig. 1: Typical samples of the SPOM task. The sentence at the top of each sample is the given instruction. The top and bottom images depict the scene before and after the manipulation, respectively.
To tackle this task, we propose a novel approach that creates multi-level aligned representations of the images and a success prediction method built on them: the Contrastive \(\boldsymbol \lambda\)-Repformer.
Fig. 2: Overview of Contrastive \(\lambda\)-Repformer. Given an instruction sentence and images before and after manipulation, our model outputs the predicted probability that the robot successfully performed the manipulation.
Fig. 3: \(\boldsymbol \lambda\)-Representation—the multi-level aligned visual representation composed of three types of latent representations: features that capture visual characteristics such as colors and shapes (Scene Representation), features aligned with natural language (Aligned Representation), and features structured through natural language (Narrative Representation).
The Scene Representation \(\boldsymbol{h}_s\) is obtained by concatenating the outputs of several unimodal image encoders (e.g., ViT, Swin Transformer, and DINOv2).
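A minimal sketch of this step is shown below, assuming Hugging Face checkpoints for ViT and DINOv2 and CLS-token pooling; the specific encoders and pooling here are illustrative assumptions, not the exact configuration of our model.

# Sketch of the Scene Representation h_s: concatenate global features from
# several unimodal image encoders. Checkpoints and pooling are illustrative
# assumptions.
import torch
from transformers import AutoImageProcessor, AutoModel

SCENE_ENCODERS = ["google/vit-base-patch16-224-in21k", "facebook/dinov2-base"]

def scene_representation(image) -> torch.Tensor:
    feats = []
    for name in SCENE_ENCODERS:
        processor = AutoImageProcessor.from_pretrained(name)
        model = AutoModel.from_pretrained(name).eval()
        inputs = processor(images=image, return_tensors="pt")
        with torch.no_grad():
            out = model(**inputs)
        feats.append(out.last_hidden_state[:, 0])  # CLS-token feature
    return torch.cat(feats, dim=-1)  # h_s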
The Aligned Representation \(\boldsymbol{h}_a\) is obtained using the Aligned Representation Module, which is composed of multimodal foundation models such as CLIP, SigLIP, and BLIP.
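The sketch below illustrates this with CLIP and SigLIP image encoders; the checkpoints are assumptions for illustration rather than the exact models used.

# Sketch of the Aligned Representation h_a: image features from multimodal
# foundation models whose visual embeddings are aligned with text.
import torch
from transformers import CLIPModel, CLIPProcessor, SiglipModel, SiglipProcessor

ALIGNED_MODELS = [
    (CLIPModel, CLIPProcessor, "openai/clip-vit-base-patch32"),
    (SiglipModel, SiglipProcessor, "google/siglip-base-patch16-224"),
]

def aligned_representation(image) -> torch.Tensor:
    feats = []
    for model_cls, proc_cls, name in ALIGNED_MODELS:
        model = model_cls.from_pretrained(name).eval()
        processor = proc_cls.from_pretrained(name)
        inputs = processor(images=image, return_tensors="pt")
        with torch.no_grad():
            feats.append(model.get_image_features(**inputs))
    return torch.cat(feats, dim=-1)  # h_a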
The Narrative Representation \(\boldsymbol{h}_n\) is obtained using the Narrative Representation Module, which contains MLLMs (e.g., InstructBLIP, Gemini, GPT-4) and text embedders (e.g., BERT, text-embedding-ada-002). We designed the text prompt to focus on the colors, sizes, and shapes of the target objects, how they are placed, their positions within the image, and their positions relative to other objects. The outputs of the MLLMs are embedded by the text embedders, and the resulting features are concatenated to obtain \(\boldsymbol{h}_n\).
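A sketch of this module is given below, assuming InstructBLIP as the MLLM and BERT as the text embedder; the prompt wording and model choices are illustrative assumptions, and only a single MLLM/embedder pair is shown.

# Sketch of the Narrative Representation h_n: an MLLM describes the image
# (colors, sizes, shapes, placement), and the description is embedded with a
# text embedder. Prompt wording and model choices are illustrative assumptions.
import torch
from transformers import (AutoTokenizer, BertModel,
                          InstructBlipForConditionalGeneration,
                          InstructBlipProcessor)

PROMPT = ("Describe the colors, sizes, and shapes of the objects, how they are "
          "placed, and their positions relative to each other in the image.")

def narrative_representation(image) -> torch.Tensor:
    proc = InstructBlipProcessor.from_pretrained("Salesforce/instructblip-vicuna-7b")
    mllm = InstructBlipForConditionalGeneration.from_pretrained(
        "Salesforce/instructblip-vicuna-7b").eval()
    inputs = proc(images=image, text=PROMPT, return_tensors="pt")
    with torch.no_grad():
        ids = mllm.generate(**inputs, max_new_tokens=128)
    description = proc.batch_decode(ids, skip_special_tokens=True)[0]

    tok = AutoTokenizer.from_pretrained("bert-base-uncased")
    bert = BertModel.from_pretrained("bert-base-uncased").eval()
    with torch.no_grad():
        out = bert(**tok(description, return_tensors="pt", truncation=True))
    # With several MLLMs and/or embedders, the resulting features are concatenated.
    return out.last_hidden_state[:, 0]  # h_n (single MLLM/embedder shown)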
Finally, we obtain the \(\boldsymbol \lambda\)-Representation by concatenating the three representations: \[\boldsymbol{h}_{\lambda}=\left[\boldsymbol{h}_s^\mathsf{T}, \boldsymbol{h}_a^\mathsf{T}, \boldsymbol{h}_n^\mathsf{T}\right]^\mathsf{T}.\]
The differences between the images do not by themselves necessarily indicate the success of the task specified by the instruction. To address this issue, we propose the Contrastive \(\lambda\)-Representation Decoder, which uses a cross-attention-based architecture to obtain the predicted probability \(P(\hat{y}=1)\) that the manipulator has successfully executed the task.
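As a rough sketch under stated assumptions (the exact decoder architecture, token layout, and prediction head are not reproduced here), a cross-attention decoder of this kind can be written as follows: the instruction embedding attends to the \(\lambda\)-Representations of the two images and their difference, and a small head outputs \(P(\hat{y}=1)\).

# Sketch (an assumption, not the exact architecture) of a cross-attention
# decoder: the instruction embedding attends to the lambda-Representations of
# the two images and their difference; a small head outputs P(y_hat = 1).
import torch
import torch.nn as nn

class ContrastiveLambdaDecoderSketch(nn.Module):
    def __init__(self, d_lambda: int, d_text: int, d_model: int = 512):
        super().__init__()
        self.proj_lambda = nn.Linear(d_lambda, d_model)
        self.proj_text = nn.Linear(d_text, d_model)
        self.cross_attn = nn.MultiheadAttention(d_model, num_heads=8, batch_first=True)
        self.head = nn.Sequential(nn.Linear(d_model, d_model), nn.ReLU(),
                                  nn.Linear(d_model, 1))

    def forward(self, h_lambda_before, h_lambda_after, h_text):
        # Three tokens: before image, after image, and their difference.
        tokens = torch.stack(
            [h_lambda_before, h_lambda_after, h_lambda_after - h_lambda_before], dim=1)
        kv = self.proj_lambda(tokens)            # (B, 3, d_model)
        q = self.proj_text(h_text).unsqueeze(1)  # (B, 1, d_model)
        attended, _ = self.cross_attn(q, kv, kv)
        return torch.sigmoid(self.head(attended.squeeze(1)))  # P(y_hat = 1)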
Fig. 4: Successful cases of the proposed method on SP-RT-1. The left and right images show the scene before and after the manipulation, respectively.
Fig. 5: Qualitative results of the proposed method in zero-shot transfer experiments. The left image depicts the scene before the manipulation, while the right image shows it afterward. Examples (i) and (ii) are true positive and true negative cases, respectively.
Table 1: Quantitative results. The best results are marked in bold.
@inproceedings{goko2024task,
  title     = {{Task Success Prediction for Open-Vocabulary Manipulation Based on Multi-Level Aligned Representations}},
  author    = {Miyu Goko and Motonari Kambara and Daichi Saito and Seitaro Otsuki and Komei Sugiura},
  booktitle = {8th Annual Conference on Robot Learning},
  year      = {2024}
}