Genomic Data Commons (GDC)
Overview
The Genomic Data Commons (GDC) is a cancer knowledge network that supports hosting, standardization, and analysis of genomic, clinical, and biospecimen data from cancer research programs. The GDC harmonizes raw sequencing data, identifies and applies state-of-the-art bioinformatics methods for generating mutation calls, structural variants and other high-level data, and provides scalable downloads and web-based analysis tools.
Data in the GDC are structured using the GDC Data Model, with properties and data types defined in the GDC Data Dictionary. The data model is continually updated to accommodate evolving genomic technology and biomedical research, with guidance and input from the research community and external reference standards.
Because genomic data are personal in nature, some data in the GDC are controlled access and require eRA Commons authentication and dbGaP authorization. Whether a dataset is open or controlled is determined by the Data Access Policies, in a process driven by the informed consent of research participants.
Data in the GDC are accessible in several ways, including: the GDC Data Portal, a web-based platform with a graphical user interface to search for and download data; the GDC Data Transfer Tool (DTT), a client-based utility to efficiently download and upload large volumes of data; and the GDC Application Programming Interface (API), a programmatic interface to query, download, upload, and analyze data.
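As an illustration of the programmatic route, the following is a minimal sketch of how a file search against the GDC API might be assembled in Python. It rests on several assumptions that are not stated in this overview: the base URL `https://api.gdc.cancer.gov`, the `files` endpoint, the field names `cases.project.project_id` and `data_format`, and the `data.hits` layout of the response. The helper names (`build_files_params`, `build_files_url`, `fetch_files`) are hypothetical. Check the GDC API documentation before relying on any of these.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: public base URL of the GDC API (not given in the overview text).
GDC_API_BASE = "https://api.gdc.cancer.gov"


def build_files_params(project_id, data_format, size=10):
    """Build query parameters for a file search (hypothetical helper)."""
    # Assumed filter syntax: nested operators over file fields.
    filters = {
        "op": "and",
        "content": [
            {"op": "in",
             "content": {"field": "cases.project.project_id",
                         "value": [project_id]}},
            {"op": "in",
             "content": {"field": "data_format",
                         "value": [data_format]}},
        ],
    }
    return {
        "filters": json.dumps(filters),
        "fields": "file_id,file_name,data_type,access",
        "format": "JSON",
        "size": str(size),
    }


def build_files_url(project_id, data_format, size=10):
    """Return the full GET URL for the assumed files endpoint."""
    params = build_files_params(project_id, data_format, size)
    return f"{GDC_API_BASE}/files?{urlencode(params)}"


def fetch_files(project_id, data_format, size=10):
    """Send the request and return the list of hits (needs network access)."""
    with urlopen(build_files_url(project_id, data_format, size), timeout=30) as resp:
        # Assumed response layout: {"data": {"hits": [...]}}
        return json.load(resp)["data"]["hits"]


if __name__ == "__main__":
    # Only prints the URL; call fetch_files(...) to actually query the API.
    print(build_files_url("TCGA-BRCA", "MAF"))
```

Separating URL construction from the network call keeps the query easy to inspect before it is sent. Controlled-access files would additionally require an authentication token, which this sketch does not handle.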
In addition to providing access to data, the GDC provides several analysis tools via the Data Portal Exploration and Analysis features:
- Mutation Frequency Distribution Graph - View the most frequently mutated genes for any cohort and plot frequencies of cases with mutations and copy number variants for a selected gene
- OncoGrid - Visualize combinations of gene mutations and copy number variants for a project or custom cohort
- Survival Analysis - Compare overall survival of any two cohorts, such as patients with and without a mutated gene of interest
- Set Operations - Perform operations on gene, mutation, or case sets by visualizing set similarities and differences in a Venn diagram
- Cohort Comparison - Display the survival analysis of custom case sets and compare characteristics such as gender, vital status and age at diagnosis
- Clinical Data Analysis - Select a clinical variable and view cohort-level survival plots, histograms, box plots, and Q-Q plots
- Protein Viewer - Visualize gene mutations mapped to their protein functional domains
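The idea behind the Set Operations tool can be illustrated with plain Python sets. This is only a sketch of the underlying set logic, not the GDC's implementation; the case identifiers and the `venn_regions` helper are hypothetical.

```python
def venn_regions(set_a, set_b):
    """Split two sets into the three regions of a two-circle Venn diagram."""
    a, b = set(set_a), set(set_b)
    return {
        "only_a": a - b,   # members unique to the first set
        "both": a & b,     # members shared by both sets
        "only_b": b - a,   # members unique to the second set
    }


# Hypothetical case sets, e.g. cases with a mutation vs. cases with a
# copy number variant in the same gene.
mutated_cases = {"case-001", "case-002", "case-003"}
cnv_cases = {"case-002", "case-003", "case-004"}

regions = venn_regions(mutated_cases, cnv_cases)
for name, members in regions.items():
    print(name, sorted(members))
```

Each region corresponds to one area of the diagram, so the sizes of the three sets are the counts a two-set Venn diagram would display.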
Data Types
The GDC provides data that are processed through a uniform set of bioinformatics pipelines. GDC-generated data types and the associated file formats for each experimental strategy include:
Experimental Strategy | Data Type | File Format |
---|---|---|
Clinical and Biospecimen | Clinical and Biospecimen Metadata | JSON and Tab-delimited |
Diagnostic and Tissue Slide | Slide Image | SVS |
Genotyping Array | Copy Number Segment | TXT |
Methylation Array | Methylation Beta Value | TXT |
miRNA-Seq | miRNA and Isoform Expression Quantification | TXT |
RNA-Seq | Gene Expression and Splice Junction Quantification | TXT and Tab-delimited |
Targeted Sequencing | Transcript Fusion | Tab-delimited |
WGS | Structural Rearrangements | BED |
WGS | Raw Somatic Mutations | VCF |
WGS | MSISensor (Tumor-Only) | TXT |
WGS, Targeted Sequencing, Genotyping Array | Copy Number Scores | TXT |
WXS, Targeted Sequencing | Raw and Annotated Somatic Variants | VCF |
WXS, Targeted Sequencing | Aggregated and Masked Somatic Mutations | MAF |
WXS, WGS, RNA-Seq, miRNA-Seq, ATAC-Seq | Aligned Reads | BAM |
Datasets
The GDC provides access to datasets from key NCI programs such as:
- The Cancer Genome Atlas (TCGA) - A collaboration between NCI and the National Human Genome Research Institute (NHGRI) that has characterized tumor and normal tissues from 11,000 patients, covering 33 cancer types.
- Therapeutically Applicable Research to Generate Effective Treatments (TARGET) - A consortium of extramural and NCI investigators working to characterize and understand hard-to-treat childhood cancers and translate findings into the clinic.
- Clinical Proteomic Tumor Analysis Consortium (CPTAC) - A national effort to accelerate the understanding of the molecular basis of cancer through the application of large-scale proteome and genome analysis, or proteogenomics.
- Human Cancer Models Initiative (HCMI) - An international consortium that is generating novel, next-generation, tumor-derived culture models complete with genomic and clinical data.
- Cancer Genome Characterization Initiatives (CGCI) - An initiative examining genomes, exomes, and transcriptomes of various types of adult and pediatric cancers.
The GDC also collaborates with organizations external to NCI to provide harmonized data from critical cancer programs such as:
- Foundation Medicine (FM) - Targeted sequencing data from ~18,000 adult patients, generated by Foundation Medicine Inc., a molecular information company seeking to match patients with personalized treatment plans.
- Multiple Myeloma Research Foundation (MMRF) - Data from nearly 1,000 patients with extensive molecular and clinical data, including longitudinal information collected over the course of disease for many patients.
- Genomics Evidence Neoplasia Information Exchange (GENIE) - Over 44,000 cases from an international pan-cancer registry, an American Association for Cancer Research (AACR) initiative that continues to collect data.
The GDC has ongoing data releases to make additional datasets available to the cancer research community.
Anatomical Sites
The GDC includes data from multiple organ sites. Major sites include:
- Adrenal Gland
- Bile Duct
- Bladder
- Blood
- Bone
- Bone Marrow
- Brain
- Breast
- Cervix
- Colorectal
- Esophagus
- Eye
- Head and Neck
- Kidney
- Liver
- Lung
- Lymph Nodes
- Nervous System
- Ovary
- Pancreas
- Pleura
- Prostate
- Skin
- Soft Tissue
- Stomach
- Testis
- Thymus
- Thyroid
- Uterus