Cancer Data Aggregator
Integrative cancer research is currently hampered because important datasets are stored in separate, non-interoperable data commons.
The Cancer Data Aggregator (CDA) improves findability within the cancer data ecosystem by aggregating diverse data types generated by NCI-funded programs, such as The Cancer Genome Atlas Program (TCGA) and the Clinical Proteomic Tumor Analysis Consortium (CPTAC).
The CDA combines descriptive information about these datasets into a common model – the CRDC-H – that users can search using variables such as participant, sample, tissue, disease, or race.
This aggregation lets researchers assemble complex synthetic datasets for integrative analysis, drawing on both open- and controlled-access datasets.
While anyone can browse and download this indexed metadata, researchers must still apply for the appropriate access to obtain the underlying data files.
To facilitate aggregation, the CDA uses a common language to describe basic clinical and biospecimen metadata, and indexes that information inside the CRDC-H. This structured format supports federation across multiple data commons, serves as a primary resource for the CRDC, and will act as a basic set of required variables for new datasets being submitted to the CRDC.
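The harmonize-then-search idea described above can be illustrated with a minimal sketch. It does not use the CDA API or cda-python, and it does not reproduce the CRDC-H schema: the record fields, the source names (source_a, source_b), the per-source field mappings, and the helper functions harmonize and search are all hypothetical, chosen only to show how records from differently structured sources might be mapped into one common model and then filtered by shared variables.

```python
from dataclasses import dataclass


# Hypothetical common record. These fields are illustrative only and are
# NOT the actual CRDC-H model.
@dataclass(frozen=True)
class CommonRecord:
    source: str
    participant_id: str
    disease: str
    tissue: str
    access: str  # "open" or "controlled"


# Hypothetical mappings: common field name -> field name used by each source.
FIELD_MAPS = {
    "source_a": {
        "participant_id": "case_id",
        "disease": "primary_diagnosis",
        "tissue": "tissue_type",
        "access": "access",
    },
    "source_b": {
        "participant_id": "subject",
        "disease": "disease_type",
        "tissue": "sample_site",
        "access": "access_level",
    },
}


def harmonize(source, raw):
    """Translate one source-specific record into the common record."""
    values = {}
    for common, native in FIELD_MAPS[source].items():
        value = str(raw[native]).strip()
        # Keep identifiers as-is; normalize descriptive terms to lower case.
        values[common] = value if common == "participant_id" else value.lower()
    return CommonRecord(source=source, **values)


def search(index, **criteria):
    """Return records whose fields equal every given criterion."""
    return [
        record
        for record in index
        if all(getattr(record, key) == value.lower() for key, value in criteria.items())
    ]


# Made-up example metadata from two differently structured sources.
raw_a = [
    {"case_id": "A-001", "primary_diagnosis": "Glioblastoma", "tissue_type": "Brain", "access": "open"},
    {"case_id": "A-002", "primary_diagnosis": "Adenocarcinoma", "tissue_type": "Lung", "access": "controlled"},
]
raw_b = [
    {"subject": "B-17", "disease_type": "glioblastoma ", "sample_site": "brain", "access_level": "Controlled"},
]

# Build one index over both sources, then query it with shared variables.
index = [harmonize("source_a", r) for r in raw_a] + [harmonize("source_b", r) for r in raw_b]
hits = search(index, disease="glioblastoma", tissue="brain")
for hit in hits:
    print(hit.source, hit.participant_id, hit.access)
```

In this sketch a single query returns matching records from both sources, including one marked as controlled access. That mirrors the distinction drawn earlier: the metadata is searchable by anyone, while the files behind a controlled-access record would still require approval.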
The CDA index is accessible via an Application Programming Interface (API) and via a custom Python library, cda-python, which makes it easier to build custom queries.