TY - GEN
T1 - Dynamic Cross-Feature Fusion for American Sign Language Translation
AU - Ananthanarayana, Tejaswini
AU - Kotecha, Nikunj
AU - Srivastava, Priyanshu
AU - Chaudhary, Lipisha
AU - Wilkins, Nicholas
AU - Nwogu, Ifeoma
N1 - Publisher Copyright: © 2021 IEEE.
PY - 2021
Y1 - 2021
N2 - While a significant amount of work has been done on the commonly used, tightly constrained, weather-based German sign language (GSL) dataset, little has been done for continuous sign language translation (SLT) in more realistic settings, including American sign language (ASL) translation. Also, while CNN-based features have been consistently shown to work well on the GSL dataset, it is not clear whether such features will work as well in more realistic settings when there are more heterogeneous signers in non-uniform backgrounds. To this end, in this work, we introduce a new, realistic phrase-level ASL dataset (ASLing), and explore the role of different types of visual features (CNN embeddings, human body keypoints, and optical flow vectors) in translating it to spoken American English. We propose a novel Transformer-based, visual feature learning method for ASL translation. We demonstrate the explainability efficacy of our proposed learning methods by visualizing activation weights under various input conditions and discover that the body keypoints are consistently the most reliable set of input features. Using our model, we successfully transfer-learn from the larger GSL dataset to ASLing, resulting in significant BLEU score improvements. In summary, this work goes a long way in bringing together the AI resources required for automated ASL translation in unconstrained environments.
AB - While a significant amount of work has been done on the commonly used, tightly constrained, weather-based German sign language (GSL) dataset, little has been done for continuous sign language translation (SLT) in more realistic settings, including American sign language (ASL) translation. Also, while CNN-based features have been consistently shown to work well on the GSL dataset, it is not clear whether such features will work as well in more realistic settings when there are more heterogeneous signers in non-uniform backgrounds. To this end, in this work, we introduce a new, realistic phrase-level ASL dataset (ASLing), and explore the role of different types of visual features (CNN embeddings, human body keypoints, and optical flow vectors) in translating it to spoken American English. We propose a novel Transformer-based, visual feature learning method for ASL translation. We demonstrate the explainability efficacy of our proposed learning methods by visualizing activation weights under various input conditions and discover that the body keypoints are consistently the most reliable set of input features. Using our model, we successfully transfer-learn from the larger GSL dataset to ASLing, resulting in significant BLEU score improvements. In summary, this work goes a long way in bringing together the AI resources required for automated ASL translation in unconstrained environments.
UR - https://www.scopus.com/pages/publications/85125105843
U2 - 10.1109/FG52635.2021.9667027
DO - 10.1109/FG52635.2021.9667027
M3 - Conference contribution
AN - SCOPUS:85125105843
T3 - Proceedings - 2021 16th IEEE International Conference on Automatic Face and Gesture Recognition, FG 2021
BT - Proceedings - 2021 16th IEEE International Conference on Automatic Face and Gesture Recognition, FG 2021
A2 - Struc, Vitomir
A2 - Ivanovska, Marija
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 16th IEEE International Conference on Automatic Face and Gesture Recognition, FG 2021
Y2 - 15 December 2021 through 18 December 2021
ER -