In this study, we consider the problem of predicting task success for object manipulation by a robot, based on the instruction sentence and egocentric images taken before and after the manipulation. Conventional approaches, including multimodal large language models, fail to appropriately understand detailed characteristics of objects and subtle changes in object positions.
We propose the Contrastive \(\boldsymbol \lambda\)-Repformer, which predicts task success for tabletop manipulation tasks by aligning images with instruction sentences. Our method integrates three key types of features into a multi-level aligned representation: features that preserve local image information; features aligned with natural language; and features structured through natural language. This allows the model to focus on subtle changes by looking at the differences in the representation between the two images.
To evaluate our approach, we built a novel dataset from real-world environments. The results show that our approach outperformed existing approaches, including multimodal LLMs, on this dataset. Our best model achieved an improvement of 8.66 points in accuracy over a representative multimodal LLM-based model.
We tackle the task of predicting whether a tabletop object manipulation task was performed successfully, given the instruction sentence and egocentric images taken before and after the manipulation. We define this task as Success Prediction for Object Manipulation (SPOM).
Fig. 1: Typical samples of the SPOM task. The sentence at the top of each sample is the given instruction. The top and bottom images depict the scene before and after the manipulation, respectively.
To tackle this task, we propose a novel approach that creates multi-level aligned representations of the images and a success prediction method built on them: the Contrastive \(\boldsymbol \lambda\)-Repformer.
Fig. 2: Overview of Contrastive \(\lambda\)-Repformer. Given an instruction sentence and images before and after manipulation, our model outputs the predicted probability that the robot successfully performed the manipulation.
Fig. 3: \(\boldsymbol \lambda\)-Representation—the multi-level aligned visual representation composed of three types of latent representations: features that capture visual characteristics such as colors and shapes (Scene Representation), features aligned with natural language (Aligned Representation), and features structured through natural language (Narrative Representation).
The Scene Representation \(\boldsymbol{h}_s\) is obtained by concatenating the outputs of several unimodal image encoders (e.g., ViT, Swin Transformer, and DINOv2).
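A minimal sketch of this step is shown below, assuming Hugging Face checkpoints for ViT and DINOv2 and CLS-token pooling; the specific encoders and pooling here are illustrative assumptions, not the exact configuration of our model.

# Sketch of the Scene Representation h_s: concatenate global features from
# several unimodal image encoders. Checkpoints and pooling are illustrative
# assumptions.
import torch
from transformers import AutoImageProcessor, AutoModel

SCENE_ENCODERS = ["google/vit-base-patch16-224-in21k", "facebook/dinov2-base"]

def scene_representation(image) -> torch.Tensor:
    feats = []
    for name in SCENE_ENCODERS:
        processor = AutoImageProcessor.from_pretrained(name)
        model = AutoModel.from_pretrained(name).eval()
        inputs = processor(images=image, return_tensors="pt")
        with torch.no_grad():
            out = model(**inputs)
        feats.append(out.last_hidden_state[:, 0])  # CLS-token feature
    return torch.cat(feats, dim=-1)  # h_s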
The Aligned Representation \(\boldsymbol{h}_a\) is obtained using the Aligned Representation Module, which is composed of multimodal foundation models such as CLIP, SigLIP, and BLIP.
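The sketch below illustrates this with CLIP and SigLIP image encoders; the checkpoints are assumptions for illustration rather than the exact models used.

# Sketch of the Aligned Representation h_a: image features from multimodal
# foundation models whose visual embeddings are aligned with text.
import torch
from transformers import CLIPModel, CLIPProcessor, SiglipModel, SiglipProcessor

ALIGNED_MODELS = [
    (CLIPModel, CLIPProcessor, "openai/clip-vit-base-patch32"),
    (SiglipModel, SiglipProcessor, "google/siglip-base-patch16-224"),
]

def aligned_representation(image) -> torch.Tensor:
    feats = []
    for model_cls, proc_cls, name in ALIGNED_MODELS:
        model = model_cls.from_pretrained(name).eval()
        processor = proc_cls.from_pretrained(name)
        inputs = processor(images=image, return_tensors="pt")
        with torch.no_grad():
            feats.append(model.get_image_features(**inputs))
    return torch.cat(feats, dim=-1)  # h_a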
The Narrative Representation \(\boldsymbol{h}_n\) is obtained using the Narrative Representation Module, which contains MLLMs (e.g., InstructBLIP, Gemini, GPT-4) and text embedders (e.g., BERT, text-embedding-ada-002). We designed the text prompt to focus on the colors, sizes, and shapes of the target objects, how they are placed, their positions within the image, and their positions relative to other objects. The outputs of the MLLMs are embedded by the text embedders, and the resulting features are concatenated to obtain \(\boldsymbol{h}_n\).
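A sketch of this module is given below, assuming InstructBLIP as the MLLM and BERT as the text embedder; the prompt wording and model choices are illustrative assumptions, and only a single MLLM/embedder pair is shown.

# Sketch of the Narrative Representation h_n: an MLLM describes the image
# (colors, sizes, shapes, placement), and the description is embedded with a
# text embedder. Prompt wording and model choices are illustrative assumptions.
import torch
from transformers import (AutoTokenizer, BertModel,
                          InstructBlipForConditionalGeneration,
                          InstructBlipProcessor)

PROMPT = ("Describe the colors, sizes, and shapes of the objects, how they are "
          "placed, and their positions relative to each other in the image.")

def narrative_representation(image) -> torch.Tensor:
    proc = InstructBlipProcessor.from_pretrained("Salesforce/instructblip-vicuna-7b")
    mllm = InstructBlipForConditionalGeneration.from_pretrained(
        "Salesforce/instructblip-vicuna-7b").eval()
    inputs = proc(images=image, text=PROMPT, return_tensors="pt")
    with torch.no_grad():
        ids = mllm.generate(**inputs, max_new_tokens=128)
    description = proc.batch_decode(ids, skip_special_tokens=True)[0]

    tok = AutoTokenizer.from_pretrained("bert-base-uncased")
    bert = BertModel.from_pretrained("bert-base-uncased").eval()
    with torch.no_grad():
        out = bert(**tok(description, return_tensors="pt", truncation=True))
    # With several MLLMs and/or embedders, the resulting features are concatenated.
    return out.last_hidden_state[:, 0]  # h_n (single MLLM/embedder shown)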
Finally, we obtain the \(\boldsymbol \lambda\)-Representation by concatenating the three representations: \[\boldsymbol{h}_{\lambda}=\left[\boldsymbol{h}_s^\mathsf{T}, \boldsymbol{h}_a^\mathsf{T}, \boldsymbol{h}_n^\mathsf{T}\right]^\mathsf{T}.\]
The differences between the images do not by themselves necessarily indicate the success of the task specified by the instruction. To address this issue, we propose the Contrastive \(\lambda\)-Representation Decoder, which uses a cross-attention-based architecture to obtain the predicted probability \(P(\hat{y}=1)\) that the manipulator has successfully executed the task.
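As a rough sketch under stated assumptions (the exact decoder architecture, token layout, and prediction head are not reproduced here), a cross-attention decoder of this kind can be written as follows: the instruction embedding attends to the \(\lambda\)-Representations of the two images and their difference, and a small head outputs \(P(\hat{y}=1)\).

# Sketch (an assumption, not the exact architecture) of a cross-attention
# decoder: the instruction embedding attends to the lambda-Representations of
# the two images and their difference; a small head outputs P(y_hat = 1).
import torch
import torch.nn as nn

class ContrastiveLambdaDecoderSketch(nn.Module):
    def __init__(self, d_lambda: int, d_text: int, d_model: int = 512):
        super().__init__()
        self.proj_lambda = nn.Linear(d_lambda, d_model)
        self.proj_text = nn.Linear(d_text, d_model)
        self.cross_attn = nn.MultiheadAttention(d_model, num_heads=8, batch_first=True)
        self.head = nn.Sequential(nn.Linear(d_model, d_model), nn.ReLU(),
                                  nn.Linear(d_model, 1))

    def forward(self, h_lambda_before, h_lambda_after, h_text):
        # Three tokens: before image, after image, and their difference.
        tokens = torch.stack(
            [h_lambda_before, h_lambda_after, h_lambda_after - h_lambda_before], dim=1)
        kv = self.proj_lambda(tokens)            # (B, 3, d_model)
        q = self.proj_text(h_text).unsqueeze(1)  # (B, 1, d_model)
        attended, _ = self.cross_attn(q, kv, kv)
        return torch.sigmoid(self.head(attended.squeeze(1)))  # P(y_hat = 1)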
Fig. 4: Successful cases of the proposed method on SP-RT-1. The left and right images show the scene before and after the manipulation, respectively.
Fig. 5: Qualitative results of the proposed method in zero-shot transfer experiments. The left image depicts the scene before the manipulation, while the right image shows it afterward. Examples (i) and (ii) are true positive and true negative cases, respectively.
Table 1: Quantitative results. The best results are marked in bold.
@inproceedings{goko2024task,
  title     = {{Task Success Prediction for Open-Vocabulary Manipulation Based on Multi-Level Aligned Representations}},
  author    = {Miyu Goko and Motonari Kambara and Daichi Saito and Seitaro Otsuki and Komei Sugiura},
  booktitle = {8th Annual Conference on Robot Learning},
  year      = {2024}
}