CRDC Components: Updates
The CRDC team, whether engaged in activities specific to the CRDC Data Commons, NCI Cloud Resources, or CRDC’s Core Services, remains focused on advancing its mission of making data and resources securely accessible to the cancer research community. The team has provided the following updates.
Data Commons
Core Standards and Services
-
Genomic Data Commons (GDC)
Data Release
The GDC’s most recent data release (#43) came out in May 2025. Full release notes are available here. Highlights from the May release include:
- New Whole Genome Sequence (WGS) variant calls, featuring VarScan2 software
- 350+ additional cases from the Human Cancer Models Initiative (HCMI), with WGS, Whole Exome Sequence (WXS), and RNA-Sequence data
- Updated clinical data from The Cancer Genome Atlas (TCGA) and the Clinical Proteomic Tumor Analysis Consortium (CPTAC)
Resources to Support AI-informed Research
A new resources page focused on AI-informed research using GDC data has been created. It details various AI applications and objectives that showcase how GDC data is used by the research community and provides information on GDC resources supporting AI applications. Learn more.
Release of New Single-Cell RNA-Seq Analysis Tool
The newly developed GDC single-cell RNA-Sequence (scRNA-Seq) Tool streamlines the exploration of single-cell RNA sequencing data through cluster plots and gene expression overlays. Key features include the visualization of dimensionality reduction plots such as UMAP, t-SNE, and PCA, differential gene expression analysis, gene set enrichment analysis, and a violin plot summarizing the data. An overview of the scRNA-Seq Tool is available in the GDC Data Portal User’s Guide. There is also a corresponding scRNA-Seq gene expression Application Programming Interface (API) endpoint with instructions available in the GDC API documentation.
New Copy Number Variation (CNV) Categories
The GDC now offers enhanced CNV categories with more specific values. Gains are now classified as "gain" or "amplification," while losses are categorized as "heterozygous deletion" and "homozygous deletion." See the GDC Data Portal Release Notes for details.
-
Proteomic Data Commons (PDC)
This year, the Proteomic Data Commons (PDC) has added nearly 9 TB of data from several ongoing programs, including:
- Clinical Proteomic Tumor Analysis Consortium (CPTAC)
- International Cancer Proteogenome Consortium (ICPC)
- Applied Proteogenomics OrganizationaL Learning and Outcomes (APOLLO)
For more information about the projects contributing the data, please visit the PDC Release Notes page.
Clinical data from CPTAC studies, which include attributes related to patient outcomes, have also been updated for several cohorts, including:
• Clear Cell Renal Cell Carcinoma
• Glioblastoma
• Head and Neck Squamous Cell Carcinoma
• Lung Adenocarcinoma
• Lung Squamous Cell Carcinoma
• Pancreatic Ductal Adenocarcinoma
• Uterine Corpus Endometrial Carcinoma
The release notes page also provides detailed information on downloading publicly accessible data, including instructions for filtering by demographics and diagnoses.
-
Imaging Data Commons (IDC)
The Imaging Data Commons (IDC) recently released V21. Detailed data release notes can be found in the IDC User Guide.
This release includes many additions:
• Files: +174,244 (45,784,454 total)
• Series: +3,308 (950,888 total)
• Studies: +2,070 (149,577 total)
• Cases: +1,893 (71,082 total)
• Collections: +1 (150 total)
• Disk size: +1.94 TB (87.49 TB total)
The main highlight of this release is the update of the Childhood Cancer Data Initiative Molecular Characterization (CCDI-MCI) collection (see more details on the collection page), which has doubled the number of slides and patients to include 3,715 digital pathology slides and 3,582 patients.
Users can get started working with the pathology slides available through the IDC by using this notebook. The team has also created a guide to provide general information on accessing patient-level clinical data with images for the CCDI-MCI Collection.
Earlier this year, the IDC team added a new feature to its “cart.” Users can now create a collection of data files for individual patients, studies, or series, and then download the selected files directly to their computing environment.
The IDC also refined the appearance of the Explore section of its portal, making it more compact and intuitive.
Also note that the IDC has revised the way it supports researchers. It has replaced standing virtual office hours with appointments made online. A one-on-one support session with an IDC team member can be requested by filling out a short form (which is also linked from the landing page of the IDC Portal and can be found in the IDC documentation).
-
Integrated Canine Data Commons (ICDC)
The Integrated Canine Data Commons (ICDC) recently released several new studies, including:
- OSA02: Association of Canine Osteosarcoma Outcomes with Clinical, Genomic Mutations, and Transcriptomic Expression Profiles. The related publication for this new data is Immune Pathways and TP53 missense mutations are associated with longer survival in canine osteosarcoma. This adds
- 117 cases
- 53 Whole Exome Sequencing files (~ 1 TB)
- 117 Affymetrix GeneChip Analysis files (~ 1 GB)
- COTC021: Evaluation of Orally Administered mTOR Inhibitor Rapamycin in Dogs in the Adjuvant Setting with Osteosarcoma. The related publication has the same name. This study is related to COTC022, which was released in January 2025. This study adds
- 152 cases
- 186 FASTQ files (~200 GB)
- UBC03: Transcriptomic analyses of early-stage bladder cancer in Scottish Terriers detected through screening. The related publication is Identification of a naturally-occurring canine model for early detection and intervention research in high grade urothelial carcinoma. This study adds
- 20 cases
- 40 RNA-Seq files (~160 GB)
Additionally, the ICDC has added informational videos to each of the respective Program Details pages.
-
Clinical and Translational Data Commons
The Clinical and Translational Data Commons (CTDC) has been actively developing several new features to improve the user experience. These will be available after the upcoming software release, expected by the end of June, and will include the following:
- Single-click export of molecular files to the Velsera Seven Bridges Cancer Genomics Cloud (SB-CGC). With this feature, users can select files of interest and, with one click, transfer that data to SB-CGC for analysis in a secure cloud-based environment.
- Expanded Histogram View, which graphically illustrates the data and files selected in the Explore tab. This allows users to easily see the breakdown of a cohort they are building by characteristic properties such as diagnosis, stage of disease, race and ethnicity, sex, and targeted therapy.
- A new Interoperability Microservice that will help users find related data from other CRDC resources directly on a CTDC study page. This makes it easier to explore supporting information, such as matching datasets, that may exist in other data commons. This is a step toward a more connected experience across NCI’s data platforms.
- The Data Availability Landscape View, a new feature that displays a high-level summary of available data types across CTDC studies, is located on the CTDC Studies page.
- The Data Model Navigator (DMN), a new tool that lets users navigate the nodes, properties, controlled vocabularies, common data elements, and relationships of the CTDC data model. This tool is available within the CTDC to facilitate data submission.
-
General Commons
The Cancer Data Service Has Been Renamed the General Commons
The Cancer Data Service (CDS), one of CRDC’s Data Commons, has a new name – the General Commons (GC). This better reflects its role within the CRDC of providing data management, storage, and sharing capabilities for NCI-funded studies that fall under the following categories:
- Studies with data that do not match existing CRDC Data Commons
- Studies with data that do not fit current data type criteria and/or the required metadata standards for existing CRDC Data Commons
Datasets
Over the last several months, the GC has added distinct datasets as well as new data to existing program data, including:
- Human Tumor Atlas Network (HTAN)
- Childhood Cancer Data Initiative (CCDI): Genomic Sequencing of Pediatric Rhabdomyosarcoma
- Childhood Cancer Data Initiative (CCDI): Single-Cell Atlas of NF1 Nerve Sheath Tumors
- CPTAC Proteogenomic Study
- The Molecular Profiling to Predict Response to Treatment (MP2PRT – CESC)
Features and Resources
Earlier this year, the GC added new features and support resources to enhance the user experience, including:
- Enhanced data discovery capabilities with improved search functionality across key metadata fields, including diagnosis, accession numbers, study identifiers, and file types
- Updated terminology to increase consistency
- Updated GC data model to share patient-derived xenograft (PDX) data
- Refined library source classification by separating the Common Data Element (CDE) into distinct fields for molecule type and material source, improving data accuracy and traceability
Most recently, the GC added even more features and functionality, including:
- Enhanced performance, scalability, and support for advanced graph queries enabled by the migration of metadata to the Memgraph Database
- Improved user support with a direct link to the CRDC Helpdesk
- Enhanced search filters within the Data page, for Participants, Samples, and Files
- The addition of a GC User Guide and GC API Query documentation in the About section
- Improved user experience with streamlined Programs pages
- Updated data releases grouped by years and software releases tracked on the GC Releases page under the About tab. Even more granular information about bug fixes can be found on the GC portal’s GitHub page.
-
Cancer Data Aggregator
The Cancer Data Aggregator (CDA) completed its most recent data and metadata update in late May. Full notes can be found here.
CDA regularly refreshes its central search database to align with any updates that the CRDC Data Commons, such as GDC, PDC, IDC, GC, and ICDC, are making to their data and metadata. When accessing CDA, users can search by harmonized, common language terms and build comprehensive cohorts from subjects that may have data in various data commons. Users can then take the aggregated cohorts to the Cloud Resources for further analysis.
-
Data Commons Framework
The Data Commons Framework (DCF) enables access to more than 14 PB of CRDC data through consistent indexing. It also ensures access to both open and controlled-access data with appropriate permissions. Recent activities from the DCF team include:
Data Indexing
- Made an additional 750 terabytes of data (250,000 files from 28 research studies) easier to find and use, utilizing the IndexD system.
- Supported the Genomic Data Commons (GDC) so their data can be accessed more quickly by researchers. Specifically, the DCF updated GDC data replication strategies to improve access across multiple platforms, including portals and NCI Cloud Resources.
Security & Compliance
- Completed its annual security audit and received a 3-year renewal of its FedRAMP authorization, which allows the CRDC to continue to deliver controlled-access data to the research community in compliance with the NIH Genomic Data Sharing Policy.
-
CRDC Submission Portal
Since the CRDC Submission Portal was launched in June 2024, the Data Hub team has successfully deployed quarterly releases with new front-end and back-end improvements to facilitate data submission to CRDC for NCI- and NIH-funded studies. Some of the most recent features that were added to enhance the submitter’s experience include:
- An ability for multiple users to collaborate on a single data submission
- A progress bar for submitters to better visualize and understand the status of the data upload
- An aggregated view of validation errors so submitters can see the classes of errors that are causing the most issues
- A feature that suggests permissible values if the entered value is found in the NCI Thesaurus (NCIt)
- An option to restore a previously cancelled Submission Request
Detailed Release Notes can be found on the CRDC Submission Portal.
-
Broad Institute FireCloud
Updated Features
- The FireCloud Data Library is now powered by the Data Use Oversight System (DUOS), which streamlines access to data. NCI datasets such as TARGET and TCGA can be searched and accessed through this new system, which is compatible with FireCloud/Terra’s analytical tools and applications.
- FireCloud/Terra has released its publicly available roadmap, providing a transparent view of the Terra platform's ongoing development. It gives insight into features and functionality across four active development stages: Near Term, Preview, Launching, and Released.
Billing and Cost Management
- Several cost management and visibility improvements have been made, allowing users to set lifecycle rules to delete files in a specified location after a defined period. However, files in workspace buckets can still be restored if needed. Additionally:
- Users can now view a combined spend report for all the workspaces they own across all their billing projects within Terra.
- For new workflows initiated after March 2025, the billing dashboard’s Run Cost column displays the estimated run cost when the actual cost is unavailable.
- Users can also set workflow cost thresholds to manage their budgets.
Note that the estimated run cost and threshold features do not apply retroactively to older workflows.
-
Seven Bridges Cancer Genomics Cloud
New Tools and Apps
- ImmunoVerse Interactive App: Developed by Guangyuan (Frank) Li, PhD, Postdoctoral Researcher, Perlmutter Cancer Center, NYU Grossman School of Medicine. This app supports the discovery of tumor-specific antigens across molecular classes (e.g., splicing, transposable elements). This tool is publicly available on the SB-CGC platform. For more details, go to the GitHub reference page.
- JSON2TSV App: Published by Catherine Bullen, PhD, Bioinformatics Manager, Frederick National Laboratory. This app is available through SB-CGC’s Public Apps Gallery and converts JSON to TSV files to facilitate analysis of clinical data. Learn more on the SB-CGC GitHub page.
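To illustrate the kind of JSON-to-TSV conversion the JSON2TSV App is described as performing, here is a minimal Python sketch. It is not the app's code: the function names `flatten` and `json_to_tsv`, the dotted column names, and the choice to join list values with semicolons are all assumptions made for this example.

```python
import csv
import io
import json


def flatten(record, prefix=""):
    """Flatten nested dicts into one level using dotted keys (assumed convention)."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{name}."))
        elif isinstance(value, list):
            # Lists become one cell; nested structures are kept as JSON text.
            flat[name] = ";".join(
                json.dumps(v) if isinstance(v, (dict, list)) else str(v)
                for v in value
            )
        else:
            flat[name] = "" if value is None else value
    return flat


def json_to_tsv(json_text):
    """Convert a JSON object or array of objects into TSV text."""
    records = json.loads(json_text)
    if isinstance(records, dict):
        records = [records]
    rows = [flatten(r) for r in records]

    # Columns appear in the order they are first seen across all rows.
    columns = []
    for row in rows:
        for key in row:
            if key not in columns:
                columns.append(key)

    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=columns, delimiter="\t",
        restval="", lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    # Hypothetical clinical records, invented for illustration only.
    sample = (
        '[{"case_id": "C1", "demographic": {"sex": "female"}},'
        ' {"case_id": "C2", "diagnoses": ["A", "B"]}]'
    )
    print(json_to_tsv(sample), end="")
```

Records that lack a field get an empty cell, so every row has the same columns and the table loads cleanly into spreadsheet or statistics software.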
Workshops & Educational Engagement
- Georgetown University (February and March 2025): The SB-CGC team collaborated with Yuriy Gusev, PhD, to showcase MCMICRO as a tool for analyzing slide images and the COPD Machine Learning tutorial for examining medical imaging data. Dr. Gusev presented his work in a webinar held on May 28, 2025. A recording of the webinar is available.
- Purdue Workshop (week of May 21): The SB-CGC team presented a workshop on creating bioinformatics workflows using SB-CGC tools and associated CRDC resources, particularly data from the Integrated Canine Data Commons (ICDC). Slides are available.
- University of Florida Cancer Systems Biology Course (Spring Semester): The SB-CGC team taught how to use the SB-CGC for RNA/protein CWL pipelines. Meghan Ferrall Fairbanks, PhD, hosted the classroom work.
- NCI BTEP Presentation (April 9): The SB-CGC team presented SB-CGC access and analysis options for HTAN imaging and spatial transcriptomic data to the Bioinformatics Training and Education Program of the NCI.
- Learn more about collaborative instruction on the SB-CGC website.
Collaboration & Public Engagement
- AACR Participation
- The SB-CGC team presented a poster at the recent AACR Annual Meeting titled Cloud-Based Machine Learning for Enhanced Tumor Classification in Cancer Genomics: End-to-End Solution for Whole Slide Imaging Data.
- Collaborator Frank Li, PhD (noted above), also presented at the AACR Annual Meeting. His talk, “A Pan-Cancer Intracellular Tumor Antigen Atlas,” is available through the AACR Annual Meeting site to meeting registrants.
- Collaborative Project Fund Awards (Learn more here.)
- Meghan Ferrall Fairbanks (University of Florida) was awarded a project to build RNA/protein CWL pipelines on SB-CGC, which led to the instructional work noted above.
- Charlie Vaske and Demetris Roumis (Anaconda) have been awarded funds to develop an LLM-integrated project utilizing CRDC resources.
- Frank Li’s Neoverse App (noted above) was funded through the Collaborative Project Awards.
-
ISB Cancer Gateway in the Cloud
Updated Features
- BigQuery Table Search improvements
- ISB-CGC introduced a new filter to distinguish between versioned tables and those that consistently point to the latest version, simplifying search results.
- To enhance the system's scalability, backend queries are now executed in SQL, with all ISB-CGC's BigQuery metadata stored and updated in tables.
- BigQuery Ecosystem-derived data improvements
- New tables have been developed for miRNA, gene expression, and somatic mutation from recently updated r41 and r42 GDC data. The updated data comes from programs including CPTAC, TCGA, and TARGET.
- In addition to publishing new GDC (r41, r42) and PDC (V4.7, V4.9) metadata tables, ISB-CGC has expanded the TCGA and TARGET clinical data from supplemental file table offerings. They now provide 117 additional columns for TARGET in a single table and 427 additional columns for TCGA across seven tables.
- Mitelman Graphical User Interface improvements
- Data exploration capabilities have been improved with the addition of the data browser. Users can now create Circos plots for data subsets or explore the relationship between gene fusions and morphology as well as topography.
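The distinction above between versioned tables and tables that always point to the latest version can be sketched in Python. This sketch assumes a naming pattern in which "current" tables live in a program dataset with a `_current` suffix and versioned tables live in a `<program>_versioned` dataset with a release suffix such as `_r42`. The project ID `isb-cgc-bq`, the table name `clinical_gdc`, and the helper names are assumptions for this example and should be checked against the BigQuery Table Search before use.

```python
import re

# Assumed Google Cloud project that hosts the ISB-CGC BigQuery tables.
PROJECT = "isb-cgc-bq"


def table_id(program, table, release=None):
    """Build a fully qualified BigQuery table ID (hypothetical helper).

    release=None  -> the table that tracks the latest release
    release="r42" -> the table pinned to that GDC release
    """
    if release is None:
        return f"{PROJECT}.{program}.{table}_current"
    if not re.fullmatch(r"r\d+", release):
        raise ValueError(f"expected a release tag like 'r42', got {release!r}")
    return f"{PROJECT}.{program}_versioned.{table}_{release}"


def count_rows_sql(full_table_id):
    """Return a simple SQL statement that counts rows in the given table."""
    return f"SELECT COUNT(*) AS n_rows FROM `{full_table_id}`"


if __name__ == "__main__":
    latest = table_id("TCGA", "clinical_gdc")
    pinned = table_id("TCGA", "clinical_gdc", release="r42")
    print(count_rows_sql(latest))
    print(count_rows_sql(pinned))
```

Pinning a release keeps an analysis reproducible after later data updates, while the "current" form picks up new releases automatically. The generated SQL can be pasted into the BigQuery console or passed to a BigQuery client library.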
Data Releases
- The Mitelman Database: Quarterly data updates have added more than 1,200 new cases, raising the total number of unique cytogenetic aberrations by nearly 800. Additionally, over 200 new gene fusions involving dozens of previously unreported genes were added.
Workshops & Educational Engagement
- The ISB-CGC team presented two seminars demonstrating the use of BigQuery in analyzing Human Tumor Atlas Network (HTAN) data. The presentations were given to NCI’s Bioinformatics Training & Education Program (BTEP) group, and recordings of both presentations are available.