TY - GEN
T1 - Bibliographic element extraction from scanned documents using conditional random fields
AU - Ohta, Manabu
AU - Yakushi, Takayuki
AU - Takasu, Atsuhiro
PY - 2008
Y1 - 2008
N2 - Bibliographic databases are indispensable to digital libraries for academic articles. However, extracting bibliographic elements from printed documents requires a lot of human intervention; it is not cost-effective, even when using various document image-processing techniques such as optical character recognition (OCR). In this paper, we propose an automatic bibliographic element extraction method for academic articles scanned with OCR markup. The proposed method first labels text blocks as predetermined bibliographic elements and then further labels the characters in each labeled text block if necessary. The second labeling enables us to extract each author's name from the authors' text block. The method uses conditional random fields (CRF) for labeling both text blocks and the characters in them. We applied the method to Japanese academic articles. The experiments showed that the proposed text block labeling correctly extracted all the predefined bibliographic elements from more than 97% of the articles; the proposed character labeling also correctly extracted all the author name strings from more than 99% of the authors' text blocks in Japanese.
UR - http://www.scopus.com/inward/record.url?scp=62949156808&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=62949156808&partnerID=8YFLogxK
U2 - 10.1109/ICDIM.2008.4746745
DO - 10.1109/ICDIM.2008.4746745
M3 - Conference contribution
AN - SCOPUS:62949156808
SN - 9781424429172
T3 - 3rd International Conference on Digital Information Management, ICDIM 2008
SP - 99
EP - 104
BT - 3rd International Conference on Digital Information Management, ICDIM 2008
T2 - 3rd International Conference on Digital Information Management, ICDIM 2008
Y2 - 13 November 2008 through 16 November 2008
ER -