TY - GEN
T1 - AirObject
T2 - 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022
AU - Keetha, Nikhil Varma
AU - Wang, Chen
AU - Qiu, Yuheng
AU - Xu, Kuan
AU - Scherer, Sebastian
N1 - Publisher Copyright: © 2022 IEEE.
PY - 2022
Y1 - 2022
AB - Object encoding and identification are vital for robotic tasks such as autonomous exploration, semantic scene understanding, and relocalization. Previous approaches have attempted to either track objects or generate descriptors for object identification. However, such systems are limited to a 'fixed' partial object representation from a single viewpoint. In a robot exploration setup, there is a requirement for a temporally 'evolving' global object representation built as the robot observes the object from multiple viewpoints. Furthermore, given the vast distribution of unknown novel objects in the real world, the object identification process must be class-agnostic. In this context, we propose a novel temporal 3D object encoding approach, dubbed AirObject, to obtain global keypoint graph-based embeddings of objects. Specifically, the global 3D object embeddings are generated using a temporal convolutional network across structural information of multiple frames obtained from a graph attention-based encoding method. We demonstrate that AirObject achieves the state-of-the-art performance for video object identification and is robust to severe occlusion, perceptual aliasing, viewpoint shift, deformation, and scale transform, outperforming the state-of-the-art single-frame and sequential descriptors. To the best of our knowledge, AirObject is one of the first temporal object encoding methods. Source code is available at https://github.com/Nik-v9/AirObject.
KW - 3D from multi-view and sensors
KW - Deep learning architectures and techniques
KW - Machine learning
KW - Recognition: detection, categorization, retrieval
KW - Representation learning
KW - Robot vision
KW - Video analysis and understanding
KW - Vision applications and systems
UR - https://www.scopus.com/pages/publications/85143057401
U2 - 10.1109/CVPR52688.2022.00822
DO - 10.1109/CVPR52688.2022.00822
M3 - Conference contribution
AN - SCOPUS:85143057401
T3 - Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition
SP - 8397
EP - 8406
BT - Proceedings - 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022
PB - IEEE Computer Society
Y2 - 19 June 2022 through 24 June 2022
ER -