TY - GEN
T1 - Error detection of CRF-based bibliography extraction from reference strings
AU - Ohta, Manabu
AU - Arauchi, Daiki
AU - Takasu, Atsuhiro
AU - Adachi, Jun
PY - 2012
Y1 - 2012
AB - We proposed a parsing method for reference strings usually listed at the end of research papers to extract important bibliographies such as a title from them. The method uses a conditional random field (CRF) to estimate the correct bibliographic label for each token in the token sequence generated from a reference string. Although we achieved reasonable parsing accuracies for a Japanese academic journal, errors are inevitable. Therefore, this paper proposes ways to increase confidence for CRF-based bibliography parsing to detect such parsing errors. This paper also reports an empirical evaluation of the proposed parsing on the basis not only of its accuracies but also of how easy it is to detect errors. The experiments showed that the proposed measures reasonably indicated parsing errors and could be used to improve the quality of extracted bibliographies at a moderate manual post-editing cost.
KW - bibliography extraction
KW - conditional random field (CRF)
KW - confidence measure
KW - digital library
KW - error detection
KW - reference string
UR - http://www.scopus.com/inward/record.url?scp=84869046469&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84869046469&partnerID=8YFLogxK
U2 - 10.1007/978-3-642-34752-8_29
DO - 10.1007/978-3-642-34752-8_29
M3 - Conference contribution
AN - SCOPUS:84869046469
SN - 9783642347511
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 229
EP - 238
BT - The Outreach of Digital Libraries
T2 - 14th International Conference on Asia-Pacific Digital Libraries, ICADL 2012
Y2 - 12 November 2012 through 15 November 2012
ER -