How CRDC Resources Are Used by the Research Community: Review Article by Zhaoyi Chen, PhD, et al.

Several members of the CRDC team recently published an assessment of publications that reference CRDC-housed data, NCI Cloud Resources, or the analytical tools the CRDC makes publicly available. The paper, “Usage of the National Cancer Institute Cancer Research Data Commons by Researchers: A Scoping Review of the Literature,” was published in JCO Clinical Cancer Informatics, an American Society of Clinical Oncology journal.
The authors include Zhaoyi Chen, PhD, an NIH Data and Technology Advancement (DATA) Scholar; Erika Kim, PhD, a Supervisory Health Science Administrator at the NCI Center for Biomedical Informatics & Information Technology (CBIIT); Tanja Davidsen, PhD, the Branch Chief at NCI CBIIT; and Jill Barnholtz-Sloan, PhD, the Acting Director of NCI CBIIT.
As Dr. Chen notes, “Our scoping study examined published studies using CRDC’s resources to understand how the cancer research community relies on the CRDC as the foundation of a cancer research data ecosystem. In understanding current use, we will determine areas for ongoing improvements.”
More than 200 papers published through 2023 were included in this review. Among the main findings:
- The Cancer Genome Atlas (TCGA) is the most widely used dataset, referenced in approximately 80 percent of the reviewed papers.
- Most researchers relied on downloading datasets, even though secure cloud-based environments exist through the NCI Cloud Resources.
- More than half of the reviewed papers were descriptive or association analyses, including associations between biomarkers and cancer risks or outcomes.
- The most recent publications used a wider range of research approaches, including validation studies comparing locally acquired cohorts against data from the CRDC.
The authors highlight opportunities for the CRDC to make a greater impact within the research community. They note that, as of 2023:
- Multi-modal data analysis was applied in only a few, relatively recent papers, but it is becoming more feasible with CRDC’s ongoing work to aggregate data across all its data commons for easier multi-modal data search and use.
- The research community does not yet use NCI Cloud Resources to the extent possible for secure analysis, pointing to the need to raise awareness about these environments and available training.
- The research community would benefit from a unified portal for easy data search and analysis across all CRDC data commons.
The authors also note that federated and transfer learning models could be used to leverage CRDC resources. They suggest that a federated learning approach could be applied by training models separately on distinct CRDC datasets of the same type of cancer from different studies or sources, without compiling raw data into a single dataset. The learned insights could then be aggregated centrally, enabling collaborative analysis. The transfer learning approach could also facilitate a wide range of research applications. For example, in translational research, models could be trained using canine data available through the CRDC. The extracted knowledge could then be applied to human data for the same or similar types of cancer, also available through the CRDC.
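To make the federated idea concrete, here is a minimal Python sketch of one common aggregation scheme, size-weighted parameter averaging (often called federated averaging). It is an illustration only, not the authors' method: the paper does not prescribe an aggregation scheme, the data are synthetic stand-ins rather than CRDC data, and every function name (`make_site`, `local_train`, `federated_average`, `federated_fit`) is hypothetical. Each "site" fits a simple linear model on its own records and shares only the fitted parameters, never the raw data.

```python
import random


def make_site(n, seed):
    """Create synthetic (x, y) records for one site; NOT real CRDC data."""
    rng = random.Random(seed)
    records = []
    for _ in range(n):
        x = rng.uniform(0.0, 1.0)
        y = 2.0 * x + 1.0 + rng.gauss(0.0, 0.01)  # true slope 2, intercept 1
        records.append((x, y))
    return records


def local_train(weights, data, lr=0.5, epochs=5):
    """Run a few gradient-descent steps on one site's data (model: y = w*x + b)."""
    w, b = weights
    n = len(data)
    for _ in range(epochs):
        grad_w = sum((w * x + b - y) * x for x, y in data) / n
        grad_b = sum((w * x + b - y) for x, y in data) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return (w, b)


def federated_average(site_weights, site_sizes):
    """Combine per-site parameters, weighting each site by its record count."""
    total = sum(site_sizes)
    w = sum(wt[0] * n for wt, n in zip(site_weights, site_sizes)) / total
    b = sum(wt[1] * n for wt, n in zip(site_weights, site_sizes)) / total
    return (w, b)


def federated_fit(sites, rounds=200):
    """Alternate local training and central averaging; raw data never leaves a site."""
    global_weights = (0.0, 0.0)
    for _ in range(rounds):
        updates = [local_train(global_weights, data) for data in sites]
        sizes = [len(data) for data in sites]
        global_weights = federated_average(updates, sizes)
    return global_weights


if __name__ == "__main__":
    # Three hypothetical studies of different sizes.
    sites = [make_site(30, 1), make_site(50, 2), make_site(20, 3)]
    print(federated_fit(sites))
```

In this toy setup the shared model should land close to the slope and intercept used to generate the data, even though no site ever sees another site's records. A real analysis would swap in an appropriate model and would also need to address governance, privacy, and data harmonization, none of which this sketch covers.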
As Dr. Kim notes, “The CRDC, as the foundation of a national cancer data ecosystem, is continuing to expand its data and analytical resources to empower the research community. Leveraging large datasets and using CRDC resources to develop analytical strategies will accelerate discovery to prevent, diagnose, and treat cancer more effectively.”
A short write-up on the paper is also available on the CBIIT website.