Rule management for information extraction from title pages of academic papers

Atsuhiro Takasu, Manabu Ohta

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

2 Citations (Scopus)

Abstract

This paper discusses the problem of managing rules for page layout analysis and information extraction. We have been developing a system to extract information from academic papers that exploits both page layout and textual information. For this purpose, a conditional random field (CRF) analyzer is designed according to the layout of the object pages. Because various layouts are used in academic papers, we must prepare a set of rules for each type of layout to achieve high extraction accuracy. As the number of papers in a system grows, rule management becomes a big problem. For example, when should we make a new set of rules, and how can we acquire them efficiently while receiving new articles? This paper examines two scores to measure the fitness of rules and the applicability of rules learned for another type of layout. We evaluate the scores for bibliographic information extraction from title pages of academic papers and show that they are effective for measuring the fitness. We also examine the sampling of training data when learning a new set of rules.

Original language: English
Title of host publication: ICPRAM 2014 - Proceedings of the 3rd International Conference on Pattern Recognition Applications and Methods
Publisher: SciTePress
Pages: 438-444
Number of pages: 7
ISBN (Print): 9789897580185
Publication status: Published - 2014
Event: 3rd International Conference on Pattern Recognition Applications and Methods, ICPRAM 2014 - Angers, Loire Valley, France
Duration: Mar 6 2014 to Mar 8 2014

Keywords

  • CRF
  • Digital library
  • Document understanding
  • Information extraction

ASJC Scopus subject areas

  • Computer Vision and Pattern Recognition
