TY - GEN
T1 - A case restoration approach to named entity tagging in degraded documents
AU - Srihari, Rohini K.
AU - Niu, Cheng
AU - Li, Wei
AU - Ding, Jihong
N1 - Publisher Copyright: © 2003 IEEE.
PY - 2003
Y1 - 2003
AB - This paper describes a novel approach to named entity (NE) tagging on degraded documents. NE tagging is the process of identifying salient text strings in unstructured text that correspond to names of people, places, organizations, times/dates, etc. Although NE tagging is typically part of a larger information extraction process, it has other applications, such as improving search in an information retrieval system and post-processing the results of an OCR system. We focus on degraded documents, i.e., case-insensitive documents that lack orthographic information. Examples include the output of speech recognition systems, as well as e-mail. The traditional approach involves retraining an NE tagger on degraded text, a cumbersome operation. This paper describes an approach whereby text is first "restored" to its implicit case-sensitive form and subsequently processed by the original NE tagger. Results show that this new approach leads to far less precision loss in NE tagging of degraded documents.
UR - https://www.scopus.com/pages/publications/84945974273
U2 - 10.1109/ICDAR.2003.1227756
DO - 10.1109/ICDAR.2003.1227756
M3 - Conference contribution
AN - SCOPUS:84945974273
T3 - Proceedings of the International Conference on Document Analysis and Recognition, ICDAR
SP - 720
EP - 724
BT - Proceedings - 7th International Conference on Document Analysis and Recognition, ICDAR 2003
PB - IEEE Computer Society
T2 - 7th International Conference on Document Analysis and Recognition, ICDAR 2003
Y2 - 3 August 2003 through 6 August 2003
ER -