D2TriPO-DETR: Dual-Decoder Triple-Parallel-Output Detection Transformer

Author(s): (place author names here)


Vision-based grasping, though widely employed in industrial and household applications, still struggles in object-stacking scenarios. Current methods face three major challenges: 1) limited understanding of inter-object relationships, 2) poor grasping adaptation across viewpoints, and 3) error propagation. We propose D2TriPO-DETR, a dual-decoder transformer that produces three parallel outputs (object detection, manipulation relationship, and grasp detection) and includes distributed attention perception and visual attention adaptation modules to address these issues. On the Visual Manipulation Relationship Dataset, our method outperforms prior work across evaluation metrics and shows strong real-world performance.

Fig.1

Experiment setup

Fig.7

(Caption text for Fig. 7 goes here.)

Fig.8

To evaluate applicability to real-world grasping, we test configurations with 2–5 objects. For each object count, we run five trials in both the cluttered and the stacked setting, yielding 40 experiments in total (4 object counts × 5 trials × 2 settings).

Model results

Model                  Cluttered (%)   Stacked (%)
Multi-Task CNN [8]     90.60           65.65
SMTNet [30]            86.13           65.00
EGNet [5]              93.60           69.60
D2TriPO-DETR (Ours)    95.71           74.29

Success rates by object count

Objects            2          3          4          5
Cluttered scenes   100.00%    100.00%    95.00%     92.00%
Stacked scenes     90.00%     73.33%     75.00%     68.00%
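
These rates are consistent with the overall rates in the model results table if each configuration contributes one grasp attempt per object per trial (object count × five trials). A minimal Python sketch, under that weighting assumption (not stated explicitly above), reproduces the 95.71% and 74.29% figures:

    # Reconcile the per-count success rates with the overall table above.
    # Assumption: each configuration contributes (object count x 5 trials)
    # grasp attempts.
    TRIALS = 5
    counts = [2, 3, 4, 5]                        # objects per scene
    cluttered = [1.0000, 1.0000, 0.9500, 0.9200]
    stacked   = [0.9000, 0.7333, 0.7500, 0.6800]

    def overall(rates):
        attempts = [n * TRIALS for n in counts]                    # 10, 15, 20, 25
        hits = sum(round(r * a) for r, a in zip(rates, attempts))  # integer successes
        return 100.0 * hits / sum(attempts)

    print(f"cluttered: {overall(cluttered):.2f}%")   # 95.71
    print(f"stacked:   {overall(stacked):.2f}%")     # 74.29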

Real-world experiment

In the cluttered setting, objects are placed separately to test detection among nearby items. The stacked setting introduces occlusions that require not only detection and grasp estimation but also reasoning about manipulation order to avoid failures.

Our pipeline outputs detections, grasp candidates, and stacking relations in parallel; detections map to fixed queries to produce object indices and to update an adjacency matrix P. If P is all-zero, all objects are selectable; otherwise, successive powers of P identify top-layer candidates. We keep grasp candidates with IoU > 0.5, rank them by confidence, and execute high-confidence candidates via inverse kinematics with closed-loop verification, updating P after each grasp-and-place cycle. D2TriPO-DETR shows superior relation recognition and sequence planning, enabling robust ordered grasping in stacked scenes.
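
To make the selection logic concrete, here is a minimal runnable sketch of the top-layer test and the grasp-candidate filter. The convention P[i, j] = 1 meaning object i rests on object j, and the helper names top_layer and select_grasp, are illustrative assumptions, not the released implementation:

    import numpy as np

    def top_layer(P, remaining):
        """Return the graspable (top-layer) objects among `remaining`.

        Assumed convention: P[i, j] = 1 means object i rests on object j.
        An all-zero P makes every object selectable; otherwise successive
        powers of P accumulate transitive support, and any object whose
        column stays zero (nothing rests on it) is a top-layer candidate.
        """
        sub = P[np.ix_(remaining, remaining)]
        if not sub.any():
            return list(remaining)
        closure, Pk = np.zeros_like(sub), sub.copy()
        for _ in range(len(remaining)):             # P, P^2, ..., P^n
            closure = closure + Pk
            Pk = (Pk @ sub > 0).astype(sub.dtype)
        has_load = closure.any(axis=0)              # column j nonzero: something rests on j
        return [obj for obj, loaded in zip(remaining, has_load) if not loaded]

    def select_grasp(candidates, iou_threshold=0.5):
        """Keep candidates whose IoU exceeds the threshold; return the most confident."""
        kept = [g for g in candidates if g["iou"] > iou_threshold]
        return max(kept, key=lambda g: g["confidence"], default=None)

    # Example: objects 0, 1, 2 form a stack (0 rests on 1, 1 rests on 2);
    # object 3 lies apart on the table.
    P = np.zeros((4, 4), dtype=int)
    P[0, 1] = 1
    P[1, 2] = 1
    print(top_layer(P, [0, 1, 2, 3]))   # -> [0, 3]: grasp these before 1 and 2

In the full pipeline, the selected candidate is executed via inverse kinematics with closed-loop verification, and P is refreshed after each grasp-and-place cycle, so the loop peels the stack from the top down.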

Fig.9

Stability evaluation

We select representative scenarios with 2–5 objects and repeat each 30 times. The success rate drops only modestly as the object count increases, and performance remains high and stable across scales.

Fig.10

Robustness under boundary ambiguity

Under severe occlusion, similar colors, and insertion cases, we observe grasp success rates of roughly 65%, 70%, and 60%, respectively. Although boundary ambiguity reduces performance, the model still detects objects and infers stacking relations in most scenes, enabling successful grasps.

Demonstration video — real robot experiments