TY - GEN
T1 - ReAugKD
T2 - 61st Annual Meeting of the Association for Computational Linguistics, ACL 2023
AU - Zhang, Jianyi
AU - Muhamed, Aashiq
AU - Anantharaman, Aditya
AU - Wang, Guoyin
AU - Chen, Changyou
AU - Zhong, Kai
AU - Cui, Qingjun
AU - Xu, Yi
AU - Zeng, Belinda
AU - Chilimbi, Trishul
AU - Chen, Yiran
N1 - Publisher Copyright: © 2023 Association for Computational Linguistics.
PY - 2023
Y1 - 2023
N2 - Knowledge Distillation (KD) (Hinton et al., 2015) is one of the most effective approaches for deploying large-scale pre-trained language models in low-latency environments by transferring the knowledge contained in the large-scale models to smaller student models. Previous KD approaches use the soft labels and intermediate activations generated by the teacher to transfer knowledge to the student model parameters alone. In this paper, we show that having access to non-parametric memory in the form of a knowledge base with the teacher’s soft labels and predictions can further enhance student capacity and improve generalization. To enable the student to retrieve from the knowledge base effectively, we propose a new Retrieval-augmented KD framework with a loss function that aligns the relational knowledge in teacher and student embedding spaces. We show through extensive experiments that our retrieval mechanism can achieve state-of-the-art performance for task-specific knowledge distillation on the GLUE benchmark (Wang et al., 2018a).
UR - https://www.scopus.com/pages/publications/85172268482
U2 - 10.18653/v1/2023.acl-short.97
DO - 10.18653/v1/2023.acl-short.97
M3 - Conference contribution
AN - SCOPUS:85172268482
T3 - Proceedings of the Annual Meeting of the Association for Computational Linguistics
SP - 1128
EP - 1136
BT - Short Papers
PB - Association for Computational Linguistics (ACL)
Y2 - 9 July 2023 through 14 July 2023
ER -