TY - GEN
T1 - Empirical evaluation of active sampling for CRF-based analysis of pages
AU - Ohta, Manabu
AU - Inoue, Ryohei
AU - Takasu, Atsuhiro
PY - 2010
Y1 - 2010
AB - We propose an automatic method of extracting bibliographies for academic articles scanned with OCR markup. The method uses conditional random fields (CRF) for labeling serially OCR-ed text lines on an article's title page as appropriate names for bibliographic elements. Although we achieved excellent extraction accuracies for some Japanese academic journals, we needed a substantial amount of training data that had to be obtained through costly manual extraction of bibliographies from printed documents. Therefore, this paper reports an empirical evaluation of active sampling applied to the CRF-based extraction of bibliographies to reduce the amount of training data. We applied active sampling techniques to three academic journals published in Japan. The experiments revealed that the sampling strategy using the proposed criteria for selecting samples could reduce the amount of training data to less than half or even a third of those for two academic journals. This paper also reports the effect of pseudo-training data that were added to training.
KW - Active sampling
KW - Bibliography extraction
KW - CRF
KW - Digital library
KW - OCR
UR - http://www.scopus.com/inward/record.url?scp=77958016174&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=77958016174&partnerID=8YFLogxK
U2 - 10.1109/IRI.2010.5558973
DO - 10.1109/IRI.2010.5558973
M3 - Conference contribution
AN - SCOPUS:77958016174
SN - 9781424480975
T3 - 2010 IEEE International Conference on Information Reuse and Integration, IRI 2010
SP - 13
EP - 18
BT - 2010 IEEE International Conference on Information Reuse and Integration, IRI 2010
T2 - 11th IEEE International Conference on Information Reuse and Integration, IRI 2010
Y2 - 4 August 2010 through 6 August 2010
ER -