TY - JOUR
T1 - Instruction-ViT: Multi-modal prompts for instruction learning in vision transformer
AU - Xiao, Zhenxiang
AU - Chen, Yuzhong
AU - Yao, Junjie
AU - Zhang, Lu
AU - Liu, Zhengliang
AU - Wu, Zihao
AU - Yu, Xiaowei
AU - Pan, Yi
AU - Zhao, Lin
AU - Ma, Chong
AU - Liu, Xinyu
AU - Liu, Wei
AU - Li, Xiang
AU - Yuan, Yixuan
AU - Shen, Dinggang
AU - Zhu, Dajiang
AU - Yao, Dezhong
AU - Liu, Tianming
AU - Jiang, Xi
N1 - Publisher Copyright:
© 2023 Elsevier B.V.
PY - 2024/4
Y1 - 2024/4
AB - Prompts play a crucial role in enhancing the control, adaptability, and scalable application of large language models. In recent years, strategies involving prompts have also been applied to visual models. However, the extent to which the fusion of multi-modal prompts (e.g., text or image prompts) can improve downstream task performance in visual models has not been systematically investigated. To address this issue, this paper focuses on adapting the design of prompts based on instruction tuning in a vision transformer model for visual tasks, which we have named Instruction-ViT. The key idea involves implementing and fusing multi-modal prompts (either text or image prompts) related to category information, guiding the fine-tuning of the model. Based on the experiments conducted on several image understanding tasks, including classification, segmentation, image captioning, and object detection, we observe consistently improved performance and domain adaptability. Our work presents an innovative strategy for fusing multi-modal prompts, enhancing performance and adaptability in visual models.
KW - Instruction learning
KW - Multi-modal information fusion
KW - Multi-modal prompt
KW - Vision transformer
UR - http://www.scopus.com/inward/record.url?scp=85182430800&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85182430800&partnerID=8YFLogxK
DO - 10.1016/j.inffus.2023.102204
M3 - Article
AN - SCOPUS:85182430800
SN - 1566-2535
VL - 104
JO - Information Fusion
JF - Information Fusion
M1 - 102204
ER -