1000 Genomes

Note

Important Update 9/19/2024: All URLs are changing. We are enabling public access to all Genomics Data Lake containers. The existing “signed URLs” (shared access signatures) will be retired at: 2024-11-04T00:00:00Z. After this time, the URLs without a query string will continue to work, however the “signed URLs” will no longer work and will return a 403 HTTP status code. Please plan accordingly to access the public URLs without a query string after this date (remove the ‘?’ and trailing characters).

The 1000 Genomes Project ran between 2008 and 2015, to create the largest public catalog of human variation and genotype data. The final data set contains data for 2,504 individuals from 26 populations and 84 million identified variants. For more information, visit the 1000 Genome Project website and these publications:

Pilot Analysis: A map of human genome variation from population-scale sequencing Nature 467, 1061-1073 (28 October 2010)

Phase 1 Analysis: An integrated map of genetic variation from 1,092 human genomes Nature 491, 56-65 (01 November 2012)

Phase 3 Analysis: A global reference for human genetic variation Nature 526, 68-74 (01 October 2015) and An integrated map of structural variation in 2,504 human genomes Nature 526, 75-81

Visit this resource for more information about the relevant data formats.

[NEW]: The dataset is also available in parquet format.

Note

Microsoft provides Azure Open Datasets on an “as is” basis. Microsoft makes no warranties, express or implied, guarantees or conditions with respect to your use of the datasets. To the extent permitted under your local law, Microsoft disclaims all liability for any damages or losses, including direct, consequential, special, indirect, incidental or punitive, resulting from your use of the datasets.

This dataset is provided under the original terms that Microsoft received source data. The dataset may include data sourced from Microsoft.

Data source

This dataset is a mirror of this FTP resource.

Data volumes and update frequency

This dataset contains approximately 815 TB of data. It receives daily updates.

Use Terms

Following the final publications, data from the 1000 Genomes Project is publicly available, without embargo, to anyone for use under the terms provided by the dataset source. Use of the data should be cited per details available in the 1000 Genome Project FAQ resource.

Contact

Scroll down at this resource for the contact information.

Next steps

View the rest of the datasets in the Open Datasets catalog.