TY - GEN
T1 - Tools for enabling digital access to multi-lingual Indic documents
AU - Govindaraju, Venu
AU - Khedekar, Swapnil
AU - Kompalli, Suryaprakash
AU - Farooq, Faisal
AU - Setlur, Srirangaraj
AU - Vemulapati, Ramanaprasad
PY - 2004
Y1 - 2004
AB - In this paper we present methodologies for three important tasks that will eventually enable digital access to multilingual Indian document images. First, we describe several document image analysis techniques necessary to prepare Devanagari document images for OCR. The second task is OCR for machine-printed Devanagari words without the help of a lexicon. We describe the OCR methodology and show how it is being extended to other Indian languages. Finally, we describe a versatile platform that facilitates automatic segmentation of document images in multiple Indian languages and provides an interface to capture the ground truth corresponding to the text. We use transliterated English text and virtual keyboards in a range of Indian languages for this purpose. The multilingual data entry capabilities of the tool and its underlying Unicode data representation within a structured XML document also allow users to annotate passages of text in one language with text in other languages, using a markup scheme to switch between scripts. Text and annotations are rendered in the appropriate scripts as the text is being annotated, giving users prompt and natural feedback. The XML back end allows metadata describing the annotated document to be recorded.
UR - https://www.scopus.com/pages/publications/1942516408
U2 - 10.1109/DIAL.2004.1263244
DO - 10.1109/DIAL.2004.1263244
M3 - Conference contribution
AN - SCOPUS:1942516408
SN - 076952088X
SN - 9780769520889
T3 - Proceedings - First International Workshop on Document Image Analysis for Libraries - DIAL 2004
SP - 122
EP - 133
BT - Proceedings - First International Workshop on Document Image Analysis for Libraries - DIAL 2004
T2 - Proceedings - First International Workshop on Document Image Analysis for Libraries - DIAL 2004
Y2 - 23 January 2004 through 24 January 2004
ER -