
Shape codebook based handwritten and machine printed text zone extraction

  • Jayant Kumar
  • Rohit Prasad
  • Huaigu Cao
  • Wael Abd-Almageed
  • David Doermann
  • Premkumar Natarajan
  • University of Maryland, College Park
  • BBN Technologies

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

15 Scopus citations

Abstract

In this paper, we present a novel method for extracting handwritten and printed text zones from noisy document images with mixed content. We use Triple-Adjacent-Segment (TAS) based features which encode local shape characteristics of text in a consistent manner. We first construct two codebooks of the shape features extracted from a set of handwritten and printed text documents respectively. We then compute the normalized histogram of codewords for each segmented zone and use it to train a Support Vector Machine (SVM) classifier. The codebook based approach is robust to the background noise present in the image and TAS features are invariant to translation, scale and rotation of text. In experiments, we show that a pixel-weighted zone classification accuracy of 98% can be achieved for noisy Arabic documents. Further, we demonstrate the effectiveness of our method for document page classification and show that a high precision can be achieved for the detection of machine printed documents. The proposed method is robust to the size of zones, which may contain text content at line or paragraph level.
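The zone representation described in the abstract (a normalized histogram of codewords per segmented zone) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the toy 2-D vectors stand in for real TAS descriptors, the two codebooks are simply concatenated, each feature is hard-assigned to its nearest codeword by Euclidean distance, and the function name `codeword_histogram` is hypothetical. The SVM training step is omitted.

```python
from math import dist  # Euclidean distance, Python 3.8+


def codeword_histogram(features, codebook):
    """Return the normalized histogram of nearest-codeword assignments.

    Hypothetical helper: hard assignment by Euclidean distance is an
    assumption, not necessarily the matching rule used in the paper.
    """
    counts = [0] * len(codebook)
    for f in features:
        # Index of the codeword closest to this feature vector.
        nearest = min(range(len(codebook)), key=lambda k: dist(f, codebook[k]))
        counts[nearest] += 1
    total = sum(counts)
    if total == 0:
        # A zone with no features yields an all-zero histogram.
        return [0.0] * len(codebook)
    return [c / total for c in counts]


# Toy codebooks (assumed values standing in for clustered TAS features).
handwritten_codebook = [(0.0, 0.0), (0.0, 1.0)]
printed_codebook = [(5.0, 5.0), (5.0, 6.0)]
codebook = handwritten_codebook + printed_codebook

# Toy features extracted from one segmented zone.
zone_features = [(0.1, 0.1), (0.2, 0.9), (4.9, 5.1), (0.0, 0.2)]
hist = codeword_histogram(zone_features, codebook)
print(hist)
```

In this sketch, most of the histogram mass falls on the handwritten codewords, which is the kind of signal a classifier trained on such histograms could exploit. Because the histogram is normalized by the number of features, its scale does not depend on how many features a zone contains.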

Original language: English
Title of host publication: Proceedings of SPIE-IS and T Electronic Imaging - Document Recognition and Retrieval XVIII
State: Published - 2011
Event: Document Recognition and Retrieval XVIII - San Francisco, CA, United States
Duration: Jan 26 2011 → Jan 27 2011

Publication series

Name: Proceedings of SPIE - The International Society for Optical Engineering
Volume: 7874
ISSN (Print): 0277-786X

Conference

Conference: Document Recognition and Retrieval XVIII
Country/Territory: United States
City: San Francisco, CA
Period: 01/26/11 → 01/27/11

Keywords

  • Arabic
  • handwriting
  • noisy documents
  • page classification
  • zone classification
  • zone segmentation
