TY - GEN
T1 - Authors' names extraction from scanned documents
AU - Ohta, Manabu
AU - Yamasaki, Shun
AU - Yakushi, Takayuki
AU - Takasu, Atsuhiro
PY - 2007
Y1 - 2007
N2 - Authors' names are a critical bibliographic element when searching or browsing academic articles stored in digital libraries. However, extracting such bibliographic data from printed documents requires human intervention; it is therefore not cost-effective, even using various document image-processing techniques such as Optical Character Recognition (OCR). In this paper, we describe an automatic authors' names extraction method for academic articles scanned with OCR mark-up. The proposed method first extracts authors' blocks, which include assumed author/delimiter characters based on layout analysis, and then uses a specifically designed Hidden Markov Model (HMM) for labeling the unsegmented character strings in the block as those of either an author or a delimiter. We applied the proposed method to Japanese academic articles. Results of these experiments showed that the proposed method correctly extracted more than 99% of authors' blocks with manual tuning; the proposed HMM correctly labeled more than 95% of the author name strings.
UR - http://www.scopus.com/inward/record.url?scp=50149103478&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=50149103478&partnerID=8YFLogxK
U2 - 10.1109/ICDIM.2007.4444202
DO - 10.1109/ICDIM.2007.4444202
M3 - Conference contribution
AN - SCOPUS:50149103478
SN - 1424414768
SN - 9781424414765
T3 - 2007 2nd International Conference on Digital Information Management, ICDIM
SP - 67
EP - 72
BT - 2007 2nd International Conference on Digital Information Management, ICDIM
PB - IEEE Computer Society
T2 - 2007 2nd International Conference on Digital Information Management, ICDIM
Y2 - 28 October 2007 through 31 October 2007
ER -