TY - JOUR
T1 - Bio-medical entity extraction using support vector machines
AU - Takeuchi, Koichi
AU - Collier, Nigel
N1 - Funding Information:
This work was supported in part by the Japan Society for the Promotion of Science (grant no. 14701020) and by a Leadership Fund grant from the National Institute of Informatics. The authors would like to thank Jun-ichi Tsujii (University of Tokyo) for providing the data set Bio1. We would also like to thank the anonymous reviewers for their many helpful comments.
PY - 2005/2
Y1 - 2005/2
N2 - Objective: Support vector machines (SVMs) have achieved state-of-the-art performance in several classification tasks. In this article we apply them to the identification and semantic annotation of scientific and technical terminology in the domain of molecular biology. This illustrates the extensibility of the traditional named entity task to special domains with large-scale terminologies such as those in medicine and related disciplines. Methods and materials: The foundation for the model is a sample of text annotated by a domain expert according to an ontology of concepts, properties and relations. The model then learns to annotate unseen terms in new texts and contexts. The results can be used for a variety of intelligent language processing applications. We illustrate SVMs' capabilities using a sample of 100 journal abstract texts taken from the {human, blood cell, transcription factor} domain of MEDLINE. Results: Approximately 3400 terms are annotated and the model performs at about 74% F-score on cross-validation tests. A detailed analysis based on empirical evidence shows the contribution of various feature sets to performance. Conclusion: Our experiments indicate a relationship between feature window size and the amount of training data and that a combination of surface words, orthographic features and head noun features achieves the best performance among the feature sets tested.
AB - Objective: Support vector machines (SVMs) have achieved state-of-the-art performance in several classification tasks. In this article we apply them to the identification and semantic annotation of scientific and technical terminology in the domain of molecular biology. This illustrates the extensibility of the traditional named entity task to special domains with large-scale terminologies such as those in medicine and related disciplines. Methods and materials: The foundation for the model is a sample of text annotated by a domain expert according to an ontology of concepts, properties and relations. The model then learns to annotate unseen terms in new texts and contexts. The results can be used for a variety of intelligent language processing applications. We illustrate SVMs' capabilities using a sample of 100 journal abstract texts taken from the {human, blood cell, transcription factor} domain of MEDLINE. Results: Approximately 3400 terms are annotated and the model performs at about 74% F-score on cross-validation tests. A detailed analysis based on empirical evidence shows the contribution of various feature sets to performance. Conclusion: Our experiments indicate a relationship between feature window size and the amount of training data and that a combination of surface words, orthographic features and head noun features achieves the best performance among the feature sets tested.
KW - MEDLINE
KW - Machine learning
KW - Multi-classifier
KW - Named entity
KW - Natural language processing
KW - Support vector machines
KW - Text mining
UR - http://www.scopus.com/inward/record.url?scp=16244362685&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=16244362685&partnerID=8YFLogxK
U2 - 10.1016/j.artmed.2004.07.019
DO - 10.1016/j.artmed.2004.07.019
M3 - Article
C2 - 15811781
AN - SCOPUS:16244362685
SN - 0933-3657
VL - 33
SP - 125
EP - 137
JO - Artificial Intelligence in Medicine
JF - Artificial Intelligence in Medicine
IS - 2
ER -