TY - GEN
T1 - Rule management for information extraction from title pages of academic papers
AU - Takasu, Atsuhiro
AU - Ohta, Manabu
PY - 2014
Y1 - 2014
AB - This paper discusses the problem of managing rules for page layout analysis and information extraction. We have been developing a system to extract information from academic papers that exploits both page layout and textual information. For this purpose, a conditional random field (CRF) analyzer is designed according to the layout of the object pages. Because various layouts are used in academic papers, we must prepare a set of rules for each type of layout to achieve high extraction accuracy. As the number of papers in a system grows, rule management becomes a big problem. For example, when should we make a new set of rules, and how can we acquire them efficiently while receiving new articles? This paper examines two scores to measure the fitness of rules and the applicability of rules learned for another type of layout. We evaluate the scores for bibliographic information extraction from title pages of academic papers and show that they are effective for measuring the fitness. We also examine the sampling of training data when learning a new set of rules.
KW - CRF
KW - Digital library
KW - Document understanding
KW - Information extraction
UR - http://www.scopus.com/inward/record.url?scp=84902341099&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84902341099&partnerID=8YFLogxK
U2 - 10.5220/0004827204380444
DO - 10.5220/0004827204380444
M3 - Conference contribution
AN - SCOPUS:84902341099
SN - 9789897580185
T3 - ICPRAM 2014 - Proceedings of the 3rd International Conference on Pattern Recognition Applications and Methods
SP - 438
EP - 444
BT - ICPRAM 2014 - Proceedings of the 3rd International Conference on Pattern Recognition Applications and Methods
PB - SciTePress
T2 - 3rd International Conference on Pattern Recognition Applications and Methods, ICPRAM 2014
Y2 - 6 March 2014 through 8 March 2014
ER -