TY - GEN
T1 - Synthetic Data Generation for Semantic Segmentation of Lecture Videos
AU - Davila, Kenny
AU - Xu, Fei
AU - Molina, James
AU - Setlur, Srirangaraj
AU - Govindaraju, Venu
N1 - Publisher Copyright: © 2022, The Author(s), under exclusive license to Springer Nature Switzerland AG.
PY - 2022
Y1 - 2022
N2 - Lecture videos have become a great resource for students and teachers. These videos are a vast information source, but most search engines only index them by their audio. To make these videos searchable by handwritten content, it is important to develop accurate methods for analyzing such content at scale. However, training deep neural networks to their full potential requires large-scale lecture video datasets. In this paper, we use synthetic data generation to improve binarization of lecture videos. We also use it to semantically segment pixels into background, speaker, text, mathematical expressions, and graphics. Our method for synthetic data generation renders content from multiple handwritten and typeset datasets, and blends it into real images using random tight layouts and the location of the people. In addition, we propose a mixed data approach that trains networks on two detection tasks at once: person and text. Both binarization and semantic segmentation are carried out using fully convolutional neural networks with a typical encoder-decoder architecture and residual connections. Our experiments show that pretraining on both synthetic and mixed data leads to better performance than training with real data alone. While final results are promising, more work will be needed to reduce the domain shift between synthetic and real data. Our code and data are publicly available.
KW - Lecture videos
KW - Semantic segmentation
KW - Synthetic data
UR - https://www.scopus.com/pages/publications/85144363670
U2 - 10.1007/978-3-031-21648-0_32
DO - 10.1007/978-3-031-21648-0_32
M3 - Conference contribution
AN - SCOPUS:85144363670
SN - 9783031216473
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 468
EP - 483
BT - Frontiers in Handwriting Recognition - 18th International Conference, ICFHR 2022, Proceedings
A2 - Porwal, Utkarsh
A2 - Fornés, Alicia
A2 - Shafait, Faisal
PB - Springer Science and Business Media Deutschland GmbH
T2 - 18th International Conference on Frontiers in Handwriting Recognition, ICFHR 2022
Y2 - 4 December 2022 through 7 December 2022
ER -