Drugs, Diseases, Genes and Proteins in the CORD-19 Corpus
Authors/Creators
- 1. Universidad Politécnica de Madrid
Description
The BioNER+BioNEN system described in the paper "An Overview of Drugs, Diseases, Genes and Proteins in the CORD-19 Corpus" (Badenes-Olmedo, Carlos et al., 2022) was used to identify and normalize the drugs, diseases, and genetics-related terms mentioned in the CORD-19 corpus (January 2022 edition). Entity recognition and normalization were performed on each paragraph of each scientific article. A first group of labels identifies the medical terms as they appear in the text (diseases_ss, chemicals_ss, genetics_ss) and in their normalized form (disease_terms_ss, chemical_terms_ss, genetic_terms_ss). For diseases and genes/proteins, a predefined category is also assigned during normalization (disease_types_ss, genetic_types_ss). A second group of labels contains the codes for each of the classification systems described in Section 3 of the paper (mesh_codes_ss, atc_codes_ss, cid_codes_ss, doid_codes_ss, cui_codes_ss, icd10_codes_ss, icd9_codes_ss, gard_codes_ss, snomed_codes_ss, nci_codes_ss, ncbi_codes_ss, uniprot_codes_ss). The suffix _ss on every label indicates that its value is a list of strings (a string sequence).
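As a rough illustration of how these labels might be consumed, the sketch below counts normalized terms across annotated paragraphs. It assumes each paragraph has already been loaded as a dictionary whose `_ss` fields hold lists of strings; the on-disk format of the dataset is not stated here, so the loading step is left out, and the sample records and the `count_terms` helper are hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Mapping

# Field names taken from the description above.
SURFACE_FIELDS = ("diseases_ss", "chemicals_ss", "genetics_ss")
TERM_FIELDS = ("disease_terms_ss", "chemical_terms_ss", "genetic_terms_ss")


def count_terms(paragraphs: Iterable[Mapping[str, List[str]]], field: str) -> Counter:
    """Count how often each value of `field` occurs across paragraphs.

    Assumes every `_ss` field, when present, is a list of strings.
    Paragraphs that lack the field are skipped.
    """
    counts: Counter = Counter()
    for paragraph in paragraphs:
        counts.update(paragraph.get(field, []))
    return counts


# Hypothetical records with placeholder values (not taken from the dataset).
sample_paragraphs = [
    {"diseases_ss": ["Disease A"], "disease_terms_ss": ["disease-a"],
     "chemicals_ss": ["Drug X"], "chemical_terms_ss": ["drug-x"]},
    {"diseases_ss": ["disease a", "Disease B"],
     "disease_terms_ss": ["disease-a", "disease-b"]},
    {"genetics_ss": ["Gene G"], "genetic_terms_ss": ["gene-g"]},
]

disease_counts = count_terms(sample_paragraphs, "disease_terms_ss")
```

Counting on the normalized fields (`TERM_FIELDS`) rather than the surface fields groups spelling variants of the same entity together, which is the purpose of the normalization step.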
Files
(3.7 GB)
| Name | Checksum | Size |
|---|---|---|
| | md5:2923356eb55ed4455e63db3eabf6c3bc | 3.7 GB |
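The MD5 checksum listed for the file can be used to check that a download completed intact. The following is a minimal sketch using Python's standard `hashlib`; the file is read in chunks because of its size. The helper names and the local path passed to `verify_download` are assumptions, since the file name is not shown here.

```python
import hashlib

# Checksum as listed for the file in this record.
EXPECTED_MD5 = "2923356eb55ed4455e63db3eabf6c3bc"


def md5_of_file(path: str, chunk_size: int = 1024 * 1024) -> str:
    """Return the hex MD5 digest of a file, reading it in fixed-size chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns empty bytes (EOF).
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path: str, expected: str = EXPECTED_MD5) -> bool:
    """Compare the file's MD5 digest with the expected checksum."""
    return md5_of_file(path) == expected.lower()
```

A call such as `verify_download("path/to/downloaded-file")` (hypothetical path) returns `True` only if the local file matches the published checksum.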