TCGA Open Data

Note

Important Update 9/19/2024: All URLs are changing. We are enabling public access to all Genomics Data Lake containers. The existing “signed URLs” (shared access signatures) will be retired at 2024-11-04T00:00:00Z. After that time, URLs without a query string will continue to work, but the signed URLs will stop working and will return a 403 HTTP status code. Plan to switch to the public URLs without a query string by that date (remove the ‘?’ and all characters after it).
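As a minimal sketch of the change described above, the following Python snippet converts a signed URL into its public form by dropping the query string. The helper name `strip_sas` and the shortened example token are hypothetical and used only for illustration.

```python
from urllib.parse import urlsplit, urlunsplit


def strip_sas(url: str) -> str:
    """Return the URL with its query string (the SAS token) and fragment removed."""
    parts = urlsplit(url)
    # Keep scheme, host, and path; pass empty strings for query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


# Hypothetical signed URL; the token here is a shortened placeholder.
signed = "https://datasettcga.blob.core.windows.net/dataset?sp=rl&sv=2021-06-08&sr=c&sig=EXAMPLE"
public = strip_sas(signed)
print(public)
```

Applying the helper to a URL that has no query string returns it unchanged, so it can be run safely over a mixed list of old and new URLs.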

The Cancer Genome Atlas (TCGA), a landmark cancer genomics program, molecularly characterized over 20,000 primary cancer and matched normal samples spanning 33 cancer types[1]. The publicly available TCGA data falls into two tiers: open access and controlled access.

  • Open access [available on Azure]: This tier contains de-identified clinical and biospecimen data, or summarized data, that does not contain any individually identifiable information. Data types include gene expression, methylation beta values, and protein quantification. DNA-level data types include gene-level copy number and masked copy number segments.
  • Controlled access: This tier contains individual-level sequence data and requires approval through dbGaP for access.

Note

Microsoft provides Azure Open Datasets on an “as is” basis. Microsoft makes no warranties, express or implied, guarantees or conditions with respect to your use of the datasets. To the extent permitted under your local law, Microsoft disclaims all liability for any damages or losses, including direct, consequential, special, indirect, incidental or punitive, resulting from your use of the datasets.

This dataset is provided under the original terms under which Microsoft received the source data. The dataset may include data sourced from Microsoft.

Data source

This dataset is a mirror of TCGA Open Data.

Data volumes and update frequency

This dataset contains approximately 387 GB of data.

Storage location

This dataset is stored in the East US 2 Azure region. Allocating compute resources in East US 2 is recommended for affinity.

Data access

East US 2: 'https://datasettcga.blob.core.windows.net/dataset'

SAS Token: ?sp=rl&st=2022-10-07T19:43:37Z&se=2030-10-02T03:43:37Z&spr=https&sv=2021-06-08&sr=c&sig=9YgXjisOpHJNgdeMb5lOOzBhA38PWGM8g2DHjo9A5Cs%3D
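The container URL above can be explored with the Azure Blob Storage "List Blobs" REST operation. The sketch below uses only the Python standard library; it assumes the container allows anonymous listing now that public access is enabled, which this page does not explicitly confirm. The function names (`build_list_url`, `parse_blob_names`, `list_blobs`) are hypothetical helpers, not part of any Azure SDK.

```python
import urllib.request
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Public container URL from this page, without a SAS query string.
CONTAINER_URL = "https://datasettcga.blob.core.windows.net/dataset"


def build_list_url(container_url: str, prefix: str = "", max_results: int = 10) -> str:
    """Build a List Blobs request URL for the given container."""
    params = {"restype": "container", "comp": "list", "maxresults": str(max_results)}
    if prefix:
        params["prefix"] = prefix
    return f"{container_url}?{urlencode(params)}"


def parse_blob_names(xml_payload: bytes) -> list:
    """Extract blob names from a List Blobs XML response body."""
    root = ET.fromstring(xml_payload)
    # The response root is <EnumerationResults>; blobs sit under <Blobs><Blob><Name>.
    return [name.text for name in root.findall("./Blobs/Blob/Name")]


def list_blobs(prefix: str = "", max_results: int = 10) -> list:
    """Fetch one page of blob names (assumes anonymous list access is allowed)."""
    url = build_list_url(CONTAINER_URL, prefix, max_results)
    with urllib.request.urlopen(url) as response:
        return parse_blob_names(response.read())
```

Calling `list_blobs(max_results=5)` from a machine with network access should return up to five blob names if anonymous listing is permitted. Each name can then be appended to the container URL to form a direct download link. For larger listings, the response also carries a `NextMarker` element for paging, which this sketch does not handle.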

Use terms

Data is available without restrictions. For more information and citation details, see the TCGA Program page.

Contact

For questions regarding TCGA data and program: https://www.cancer.gov/about-nci/organization/ccg/research/structural-genomics/tcga/contact

Next steps

View the rest of the datasets in the Open Datasets catalog.