TY - GEN
T1 - Reduction of expanded search terms for fuzzy English-text retrieval
AU - Ohta, Manabu
AU - Takasu, Atsuhiro
AU - Adachi, Jun
PY - 1998/1/1
Y1 - 1998/1/1
N2 - Optical character reader (OCR) misrecognition is a serious problem when OCR-recognized text is used for retrieval purposes in digital libraries. We have proposed fuzzy retrieval methods that, instead of correcting the errors manually, assume that errors remain in the recognized text. Costs are thereby reduced. The proposed methods generate multiple search terms for each input query term by referring to the confusion matrices, which store all characters likely to be misrecognized and the respective probability of each misrecognition. The proposed methods can improve recall rates without decreasing precision rates. However, in English fuzzy retrieval, occasionally a few million search terms are generated, which has an intolerable effect on retrieval speed. Therefore, this paper presents two heuristics to reduce the number of generated search terms by restricting the number of errors included in each expanded search term while maintaining retrieval effectiveness.
AB - Optical character reader (OCR) misrecognition is a serious problem when OCR-recognized text is used for retrieval purposes in digital libraries. We have proposed fuzzy retrieval methods that, instead of correcting the errors manually, assume that errors remain in the recognized text. Costs are thereby reduced. The proposed methods generate multiple search terms for each input query term by referring to the confusion matrices, which store all characters likely to be misrecognized and the respective probability of each misrecognition. The proposed methods can improve recall rates without decreasing precision rates. However, in English fuzzy retrieval, occasionally a few million search terms are generated, which has an intolerable effect on retrieval speed. Therefore, this paper presents two heuristics to reduce the number of generated search terms by restricting the number of errors included in each expanded search term while maintaining retrieval effectiveness.
UR - http://www.scopus.com/inward/record.url?scp=84945237083&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84945237083&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:84945237083
SN - 9783540651017
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 619
EP - 633
BT - Research and Advanced Technology for Digital Libraries - 2nd European Conference, ECDL 1998, Proceedings
A2 - Nikolaou, Christos
A2 - Stephanidis, Constantine
PB - Springer Verlag
T2 - 2nd European Conference on Digital Libraries, ECDL 1998
Y2 - 21 September 1998 through 23 September 1998
ER -