Authors' names extraction from scanned documents

Manabu Ohta, Shun Yamasaki, Takayuki Yakushi, Atsuhiro Takasu

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

2 Citations (Scopus)

Abstract

Authors' names are a critical bibliographic element when searching or browsing academic articles stored in digital libraries. However, extracting such bibliographic data from printed documents requires human intervention; it is therefore not cost-effective, even using various document image-processing techniques such as Optical Character Recognition (OCR). In this paper, we describe an automatic authors' names extraction method for academic articles scanned with OCR mark-up. The proposed method first extracts authors' blocks, which include assumed author/delimiter characters based on layout analysis, and then uses a specifically designed Hidden Markov Model (HMM) for labeling the unsegmented character strings in the block as those of either an author or a delimiter. We applied the proposed method to Japanese academic articles. Results of these experiments showed that the proposed method correctly extracted more than 99% of authors' blocks with manual tuning; the proposed HMM correctly labeled more than 95% of the author name strings.
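The labeling step described in the abstract can be illustrated with a minimal sketch: a two-state HMM decoded with the Viterbi algorithm, which tags each character of an unsegmented authors' block as belonging to an author name or to a delimiter. This is not the model from the paper. The state set, the coarse character classes in `char_class`, and every probability below are hand-set assumptions chosen only to make the example runnable; the paper's HMM is specifically designed for this task and its details are not given in the abstract.

```python
import math

# Hypothetical two-state model: each character is part of an author name
# or part of a delimiter between names.
STATES = ("author", "delimiter")


def char_class(ch):
    """Map a character to a coarse observation symbol (assumed feature set)."""
    if ch in ",;/、，・":
        return "punct"
    if ch.isspace():
        return "space"
    return "other"


# All probabilities are illustrative assumptions, not values from the paper.
START = {"author": 0.9, "delimiter": 0.1}
TRANS = {
    "author": {"author": 0.8, "delimiter": 0.2},
    "delimiter": {"author": 0.6, "delimiter": 0.4},
}
EMIT = {
    "author": {"other": 0.90, "space": 0.09, "punct": 0.01},
    "delimiter": {"other": 0.05, "space": 0.35, "punct": 0.60},
}


def viterbi(text):
    """Return the most likely state label for every character in text."""
    obs = [char_class(c) for c in text]
    if not obs:
        return []
    # Log-probability of the best path ending in each state at position 0.
    score = [{s: math.log(START[s]) + math.log(EMIT[s][obs[0]]) for s in STATES}]
    back = []  # back[i][s] = best predecessor of state s at position i + 1
    for o in obs[1:]:
        prev, cur, ptr = score[-1], {}, {}
        for s in STATES:
            best = max(STATES, key=lambda p: prev[p] + math.log(TRANS[p][s]))
            cur[s] = prev[best] + math.log(TRANS[best][s]) + math.log(EMIT[s][o])
            ptr[s] = best
        score.append(cur)
        back.append(ptr)
    # Trace back from the best final state.
    path = [max(STATES, key=lambda s: score[-1][s])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    path.reverse()
    return path


def extract_names(text):
    """Group consecutive author-labeled characters into name strings."""
    names, current = [], []
    for ch, label in zip(text, viterbi(text)):
        if label == "author":
            current.append(ch)
        elif current:
            names.append("".join(current).strip())
            current = []
    if current:
        names.append("".join(current).strip())
    return [n for n in names if n]
```

With these assumed parameters, `extract_names("Manabu Ohta, Shun Yamasaki")` returns `["Manabu Ohta", "Shun Yamasaki"]`: the space inside a name stays in the author state, while the comma and the space after it are labeled as a delimiter. A real system would estimate the transition and emission probabilities from labeled authors' blocks rather than fixing them by hand.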

Original language: English
Title of host publication: 2007 2nd International Conference on Digital Information Management, ICDIM
Publisher: IEEE Computer Society
Pages: 67-72
Number of pages: 6
ISBN (Print): 1424414768, 9781424414765
Publication status: Published - 2007
Event: 2007 2nd International Conference on Digital Information Management, ICDIM - Lyon, France
Duration: Oct 28 2007 to Oct 31 2007

Publication series

Name: 2007 2nd International Conference on Digital Information Management, ICDIM
Volume: 1

ASJC Scopus subject areas

  • Information Systems
  • Information Systems and Management
