TY - GEN
T1 - De-identifying Free Text of Japanese Electronic Health Records
AU - Kajiyama, Kohei
AU - Horiguchi, Hiromasa
AU - Okumura, Takashi
AU - Morita, Mizuki
AU - Kano, Yoshinobu
N1 - Funding Information:
This work was partially supported by Japanese Health Labour Sciences Research Grant and JST CREST.
Publisher Copyright:
© 2018 Association for Computational Linguistics.
PY - 2018
Y1 - 2018
AB - A new law was established in Japan to promote the use of EHRs for research and development, but de-identification is required before EHRs can be used. However, automatic de-identification in the healthcare domain has not been actively studied for the Japanese language, and, as far as we know, no de-identification tool with practical performance is available for Japanese medical domains. Previous work shows that rule-based methods are still effective, while deep learning methods have recently been reported to perform better. To implement and evaluate a de-identification tool at a practical level, we implemented three methods: rule-based, CRF, and LSTM. We prepared three datasets of pseudo EHRs with manually annotated de-identification tags. These datasets are derived from shared task data, to allow comparison with previous work, and from our new data, to increase the amount of training data. Our results show that our LSTM-based method performs better and is robust; in future work, we plan to apply our system to actual de-identification tasks in hospitals.
UR - http://www.scopus.com/inward/record.url?scp=85090204648&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85090204648&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:85090204648
T3 - EMNLP 2018 - 9th International Workshop on Health Text Mining and Information Analysis, LOUHI 2018 - Proceedings of the Workshop
SP - 65
EP - 70
BT - EMNLP 2018 - 9th International Workshop on Health Text Mining and Information Analysis, LOUHI 2018 - Proceedings of the Workshop
PB - Association for Computational Linguistics (ACL)
T2 - 9th International Workshop on Health Text Mining and Information Analysis, LOUHI 2018, co-located with EMNLP 2018
Y2 - 31 October 2018
ER -