Manual2Skill: Learning to Read Manuals and Acquire Robotic Skills for Furniture Assembly Using Vision-Language Models

1National University of Singapore, 2University of Toronto, 3Peking University, 4Sichuan University, 5Zhejiang University
* denotes equal contribution

Abstract

Humans possess an extraordinary ability to understand and execute complex manipulation tasks by interpreting abstract instruction manuals. For robots, however, this capability remains a substantial challenge: they cannot readily interpret abstract instructions and translate them into executable actions. In this paper, we present Manual2Skill, a novel framework that enables robots to perform complex assembly tasks guided by high-level manual instructions. Our approach leverages Vision-Language Models (VLMs) to extract structured information from instructional images, which is then used to construct hierarchical assembly graphs. These graphs represent parts, subassemblies, and the relationships between them. To facilitate task execution, a pose estimation model predicts the relative 6D poses of components at each assembly step, while a motion planning module generates actionable sequences for real-world robotic implementation. We demonstrate the effectiveness of Manual2Skill by successfully assembling multiple real-world IKEA furniture items, highlighting its ability to handle long-horizon manipulation tasks with both efficiency and precision and significantly enhancing the practicality of robot learning from instruction manuals. This work marks a step toward robotic systems that understand and execute complex manipulation tasks in a manner akin to humans.
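
To make the hierarchical assembly graphs concrete, here is a minimal Python sketch in which leaf nodes are individual parts and internal nodes are subassemblies. This is our own illustrative structure with hypothetical class and part names, not the representation used in the paper.

# Minimal illustrative sketch of a hierarchical assembly graph (not the
# paper's actual data structure): leaves are atomic parts, internal nodes
# are subassemblies produced by earlier assembly steps.
from dataclasses import dataclass, field
from typing import List


@dataclass
class AssemblyNode:
    name: str                                      # e.g. "seat_subassembly" (hypothetical label)
    children: List["AssemblyNode"] = field(default_factory=list)

    @property
    def is_part(self) -> bool:
        """Leaf nodes correspond to individual furniture parts."""
        return not self.children


def assembly_order(node: AssemblyNode) -> List[str]:
    """Post-order traversal: children must be assembled before their parent."""
    order: List[str] = []
    for child in node.children:
        order.extend(assembly_order(child))
    if not node.is_part:
        order.append(node.name)
    return order


# Hypothetical example: two legs join the seat first, then the backrest is attached.
chair = AssemblyNode("chair", [
    AssemblyNode("seat_subassembly", [
        AssemblyNode("seat"), AssemblyNode("leg_left"), AssemblyNode("leg_right"),
    ]),
    AssemblyNode("backrest"),
])
print(assembly_order(chair))  # ['seat_subassembly', 'chair']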

Pipeline Overview


Overview of Manual2Skill: (1) GPT-4o is queried with manual pages to generate a text-based assembly plan, which is represented as a hierarchical assembly graph. (2) The components' point clouds, together with the corresponding manual image, are passed through a pose estimation module that predicts the target pose of each component. (3) With the poses estimated, the system plans the assembly action sequence and the robot executes each assembly step.
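
Downstream of graph generation, step (2) predicts a relative 6D pose for each component and step (3) plans and executes the assembly motions. The sketch below only illustrates the geometric bookkeeping of applying such a predicted pose, expressed as a 4x4 SE(3) transform, to a component point cloud; the pose estimation model itself is not shown, and every name here is hypothetical.

# Minimal numpy sketch of step (2) -> (3): placing a component's point cloud
# with a predicted relative 6D pose (a 4x4 homogeneous transform) before
# handing the placed geometry to motion planning. `predicted_pose` stands in
# for the pose estimation module's output; all names are hypothetical.
import numpy as np


def transform_point_cloud(points: np.ndarray, pose: np.ndarray) -> np.ndarray:
    """Apply a 4x4 SE(3) transform to an (N, 3) point cloud."""
    assert points.shape[1] == 3 and pose.shape == (4, 4)
    homogeneous = np.hstack([points, np.ones((points.shape[0], 1))])  # (N, 4)
    return (homogeneous @ pose.T)[:, :3]


# Hypothetical example: move a part 0.1 m along x into its target placement.
component_points = np.random.rand(1024, 3)           # stand-in for a scanned part
predicted_pose = np.eye(4)
predicted_pose[:3, 3] = [0.1, 0.0, 0.0]
placed_points = transform_point_cloud(component_points, predicted_pose)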

Assembly Graph Generation Results


(Left) Input Manual & Scene, (Right) Predicted Assembly Graph

Explore a real-time demo of VLM assembly graph generation in our Google Colab notebook with your own OpenAI API key.
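
For readers who prefer a local script to the notebook, a query behind step (1) might look roughly like the following sketch using the OpenAI Python SDK. The prompt wording and expected output format are our assumptions, not the authors' exact prompts.

# Hedged sketch of step (1): querying GPT-4o with a manual page image to
# obtain a text-based assembly plan. Requires the `openai` package and an
# OPENAI_API_KEY in the environment; prompt text is illustrative only.
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def encode_image(path: str) -> str:
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")


def generate_assembly_plan(manual_page_path: str) -> str:
    page_b64 = encode_image(manual_page_path)
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "This is one page of a furniture assembly manual. "
                         "List the parts involved and describe the assembly steps "
                         "as a hierarchical plan (parts -> subassemblies -> final item)."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{page_b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content


# Hypothetical usage:
# plan_text = generate_assembly_plan("manual_page_1.png")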

Real World Demos

Chair

Chair Assembly

Stool

Stool Assembly

Shelf

Shelf Assembly

Box

Box Assembly

Generalizing Beyond Furniture Assembly


(Left) Input Manual & Scene, (Right) Predicted Assembly Graph

BibTeX

@article{tie2025manual,
  title     = {Manual2Skill: Learning to Read Manuals and Acquire Robotic Skills for Furniture Assembly Using Vision-Language Models},
  author    = {Tie, Chenrui and Sun, Shengxiang and Zhu, Jinxuan and Liu, Yiwei and Guo, Jingxiang and Hu, Yue and Chen, Haonan and Chen, Junting and Wu, Ruihai and Shao, Lin},
  journal   = {arXiv preprint arXiv:2502.10090},
  year      = {2025}
}