J Transl Med. 2023 Feb 28;21(1):157.
doi: 10.1186/s12967-023-04011-y.

Precision information extraction for rare disease epidemiology at scale

William Z Kariampuzha et al. J Transl Med. 2023.

Abstract

Background: The United Nations recently made a call to address the challenges of an estimated 300 million persons worldwide living with a rare disease through the collection, analysis, and dissemination of disaggregated data. Epidemiologic information (EI) regarding the prevalence and incidence of rare diseases is sparse, and current paradigms of identifying, extracting, and curating EI rely upon time-intensive, error-prone manual processes. These limitations hamper a clear understanding of the variation in epidemiology and outcomes for rare disease patients. This challenges the public health of rare disease patients through a lack of the information necessary to prioritize research, policy decisions, therapeutic development, and health system allocations.

Methods: In this study, we developed a newly curated epidemiology corpus for Named Entity Recognition (NER), a deep learning framework, and a novel rare disease epidemiologic information pipeline named EpiPipeline4RD, consisting of a web interface and a RESTful API. For the corpus creation, we programmatically gathered a representative sample of rare disease epidemiologic abstracts, utilized weakly supervised machine learning techniques to label the dataset, and manually validated the labeled dataset. For the deep learning framework development, we adapted the BioBERT model for NER by fine-tuning it on our dataset. We measured the performance of our BioBERT model for epidemiology entity recognition quantitatively with precision, recall, and F1 and qualitatively through a comparison with Orphanet. We demonstrated the ability of our pipeline to gather, identify, and extract epidemiology information from rare disease abstracts through three case studies.
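The weakly supervised labeling step can be pictured with a small sketch. The abstract does not say how the outputs of the different weak sources were combined, so the majority vote below is an assumption made purely for illustration, and the source names (distant, noisy, rules) and label names (EPI, STAT, LOC) are hypothetical.

```python
from collections import Counter

ABSTAIN = "O"  # "no entity"; also what a weak source emits when it has no opinion


def aggregate_weak_labels(votes_per_token):
    """Combine per-token labels from several weak sources by majority vote.

    Assumption: simple majority voting with ties left unlabeled. This is an
    illustrative stand-in, not the aggregation method used in the paper.
    """
    merged = []
    for votes in votes_per_token:
        counts = Counter(v for v in votes if v != ABSTAIN)
        if not counts:
            merged.append(ABSTAIN)
            continue
        (top, top_n), *rest = counts.most_common()
        # A tie between the two most frequent labels is left unlabeled.
        if rest and rest[0][1] == top_n:
            merged.append(ABSTAIN)
        else:
            merged.append(top)
    return merged


# Hypothetical tokens and weak-source outputs.
tokens = ["Prevalence", "is", "1/10,000", "in", "Finland"]
distant = ["EPI", "O", "O", "O", "LOC"]   # e.g. knowledge-base lookups
noisy = ["O", "O", "O", "O", "LOC"]       # e.g. a general-purpose NER model
rules = ["EPI", "O", "STAT", "O", "O"]    # e.g. hand-written patterns

labels = aggregate_weak_labels(list(zip(distant, noisy, rules)))
print(list(zip(tokens, labels)))
```

Labels produced this way are noisy by construction, which is why a manual validation pass over the labeled dataset, as described above, remains necessary.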

Results: We developed a deep learning model to extract EI with overall F1 scores of 0.817 and 0.878, evaluated at the entity level and token level respectively, which achieved qualitative results comparable to Orphanet's collection paradigm. Additionally, case studies of the rare diseases classic homocystinuria, GRACILE syndrome, and phenylketonuria demonstrated adequate recall of abstracts with epidemiology information, high precision of epidemiology information extraction through our deep learning model, and the increased efficiency of EpiPipeline4RD compared to a manual curation paradigm.
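To make the distinction between the two reported F1 scores concrete, the following is a minimal sketch of token-level and entity-level scoring over BIO-tagged sequences. It assumes a BIO tagging scheme, exact-match scoring of entity spans, and token-level scoring over non-O tokens only; the abstract does not state the exact scoring conventions used, and the tag names are hypothetical.

```python
def extract_entities(tags):
    """Return the set of (start, end, type) spans in a BIO tag sequence."""
    spans, start, etype = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes the last span
        continues = tag.startswith("I-") and tag[2:] == etype
        if start is not None and not continues:
            spans.add((start, i, etype))
            start, etype = None, None
        # A stray I- tag with no open span is treated leniently as a span start.
        if tag.startswith("B-") or (tag.startswith("I-") and start is None):
            start, etype = i, tag[2:]
    return spans


def prf(tp, n_pred, n_gold):
    """Precision, recall, and F1 from true-positive and total counts."""
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def token_level_scores(gold, pred):
    """Score each non-O token independently (assumed convention)."""
    tp = sum(1 for g, p in zip(gold, pred) if p != "O" and g == p)
    n_pred = sum(1 for p in pred if p != "O")
    n_gold = sum(1 for g in gold if g != "O")
    return prf(tp, n_pred, n_gold)


def entity_level_scores(gold, pred):
    """Score whole spans; a span counts only if boundaries and type match."""
    gold_spans, pred_spans = extract_entities(gold), extract_entities(pred)
    return prf(len(gold_spans & pred_spans), len(pred_spans), len(gold_spans))


# Hypothetical example: the model truncates one three-token entity.
gold = ["B-EPI", "O", "B-STAT", "I-STAT", "I-STAT", "O", "B-LOC"]
pred = ["B-EPI", "O", "B-STAT", "I-STAT", "O", "O", "B-LOC"]

print("token-level  P/R/F1:", token_level_scores(gold, pred))
print("entity-level P/R/F1:", entity_level_scores(gold, pred))
```

In this toy case a single missed token costs one token at the token level but invalidates the whole span at the entity level, so the entity-level F1 is the lower of the two. The same ordering appears in the reported scores (0.817 entity-level versus 0.878 token-level).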

Conclusions: EpiPipeline4RD demonstrated high performance of EI extraction from rare disease literature to augment manual curation processes. This automated information curation paradigm will not only effectively empower development of the NIH Genetic and Rare Diseases Information Center (GARD), but also support the public health of the rare disease community.

Conflict of interest statement

The authors declare that they have no competing interests.

Figures

Fig. 1
Implementation workflow of EpiPipeline4RD. (A) Steps applied to prepare ES data for deep learning model training. EMBL-EBI refers to the EBI API for gathering abstracts. ES_Predict is a Long Short-Term Memory Recurrent Neural Network for ES prediction. (B) Methods applied for the epidemiology corpus generation. Distant supervision draws upon the NGKG from Neo4j and Wikipedia. Noisy supervision draws upon a spaCy NER model. Prescriptive supervision is dependent upon rules described in Additional file 2. (C) Transformer model architecture. Positional embeddings are added to the WordPiece embeddings. “Add” refers to the addition of the sub-layer output to its input (residual connection). “Norm” refers to sub-layer normalization after employing a residual connection [55]. (D) EpiPipeline4RD implementation. Output of the EI extraction via the User Interface.
Fig. 2
An example of labeling using weakly supervised ML techniques and NLP. Correct labeling is bolded on the left. Actual programmatic output is on the right. Abstract is from [66].
Fig. 3
A screenshot of EpiPipeline4RD User Interface
Fig. 4
Composition of the entire rare disease epidemiology dataset for named entity recognition (NER)
Fig. 5
Absolute difference between token-level and entity-level test results
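
Panel C of Fig. 1 describes three operations in the transformer architecture: adding positional embeddings to WordPiece embeddings, the residual "Add", and the subsequent "Norm". A minimal pure-Python sketch of these operations follows. It omits the learned gain and bias of layer normalization, uses plain lists in place of tensors, and all function names are hypothetical; it is not the BioBERT implementation.

```python
import math


def embed(token_vecs, pos_vecs):
    """Add a positional embedding to each token (WordPiece) embedding."""
    return [[t + p for t, p in zip(tv, pv)]
            for tv, pv in zip(token_vecs, pos_vecs)]


def layer_norm(x, eps=1e-5):
    """Normalize one vector to zero mean and (approximately) unit variance.

    Simplification: the learned scale and shift parameters are left out.
    """
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def add_and_norm(x, sublayer):
    """Apply a sub-layer, add its output to its input, then normalize."""
    out = sublayer(x)
    added = [a + b for a, b in zip(x, out)]  # residual connection ("Add")
    return layer_norm(added)                 # normalization ("Norm")


# Hypothetical two-dimensional embeddings for a single token.
x = embed([[1.0, 2.0]], [[0.5, 0.25]])[0]

# A stand-in sub-layer; in a real model this would be attention or a
# feed-forward network.
y = add_and_norm(x, lambda v: [0.5 * a for a in v])
print(y)
```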

References

    1. Health Promotion and Disease Prevention Amendments of 1984. In: 21 USC 360bb, 98th Congress, 2nd Session edition. United States of America: U.S. Government Printing Office; 1984. p. 2817.
    2. Regulation (EC) N°141/2000 of the European Parliament and of the Council of 16 December 1999 on orphan medicinal products. European Union; 2000. p. 1.
    3. Dicken J. Rare diseases: although limited, available evidence suggests medical and other costs can be substantial. Government Accountability Office (GAO); 2021.
    4. Nguengang Wakap S, Lambert DM, Olry A, Rodwell C, Gueydan C, Lanneau V, Murphy D, Le Cam Y, Rath A. Estimating cumulative point prevalence of rare diseases: analysis of the Orphanet database. Eur J Hum Genet. 2020;28(2):165–173. doi: 10.1038/s41431-019-0508-0.
    5. Stanarevic KS. Health information behaviour of rare disease patients: seeking, finding and sharing health information. Health Info Libr J. 2019;36(4):341–356. doi: 10.1111/hir.12261.