TY - JOUR
T1 - Instruction-ViT: Multi-modal prompts for instruction learning in vision transformer
AU - Xiao, Zhenxiang
AU - Chen, Yuzhong
AU - Yao, Junjie
AU - Zhang, Lu
AU - Liu, Zhengliang
AU - Wu, Zihao
AU - Yu, Xiaowei
AU - Pan, Yi
AU - Zhao, Lin
AU - Ma, Chong
AU - Liu, Xinyu
AU - Liu, Wei
AU - Li, Xiang
AU - Yuan, Yixuan
AU - Shen, Dinggang
AU - Zhu, Dajiang
AU - Yao, Dezhong
AU - Liu, Tianming
AU - Jiang, Xi
N1 - Publisher Copyright:
© 2023 Elsevier B.V.
PY - 2024/4
Y1 - 2024/4
AB - Prompts play a crucial role in enhancing the control, adaptability, and scalable application of large language models. In recent years, strategies involving prompts have also been applied to visual models. However, the extent to which the fusion of multi-modal prompts (e.g., text or image prompts) can improve downstream task performance in visual models has not been systematically investigated. To address this issue, this paper focuses on adapting the design of prompts based on instruction tuning in a vision transformer model for visual tasks, which we have named Instruction-ViT. The key idea involves implementing and fusing multi-modal prompts (either text or image prompts) related to category information, guiding the fine-tuning of the model. Based on the experiments conducted on several image understanding tasks, including classification, segmentation, image captioning, and object detection, we observe consistently improved performance and domain adaptability. Our work presents an innovative strategy for fusing multi-modal prompts, enhancing performance and adaptability in visual models.
KW - Instruction learning
KW - Multi-modal information fusion
KW - Multi-modal prompt
KW - Vision transformer
UR - http://www.scopus.com/inward/record.url?scp=85182430800&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85182430800&partnerID=8YFLogxK
DO - 10.1016/j.inffus.2023.102204
M3 - Article
AN - SCOPUS:85182430800
SN - 1566-2535
VL - 104
JO - Information Fusion
JF - Information Fusion
M1 - 102204
ER -