SGDP Phased dataset (publicly available samples)
------------------------------------------------

** Important **: This is the recommended release version for phased data analysis, and replaces the original phased data release (Dec 2016).

Last updated: Wed Dec 22 11:59:04 EST 2021 [by Shop Mallick]

VERSION HISTORY
- [Wed Dec 22 11:51:54 EST 2021]: Minor text edits to README.
- [May 5 2021]: New release. Phased dataset using the bcftools and glimpse approach released. This is the recommended data version to use.
- [Apr 2018]: Bug identified; release notes for initial release modified.
- [Dec 2016]: Initial release. Not recommended for analysis.

DETAILS
The newer release here (May 5 2021) is recommended. The initial release version (Dec 2016) was found to have a bug in the processing chain. For details, see:
https://sharehost.hms.harvard.edu/genetics/reich_lab/sgdp/phased_data.knownbugs.not_recommended.please_use_newer_dataset_instead/Readme

This dataset was created by Ali Akbari; details of the methodology will be provided in a manuscript in preparation.

Samples are imputed against the 1000 Genomes Project dataset (phase 3) [1000 Genomes Project Consortium, Nature 2015]. The initial mpileup is constructed using bcftools (version 1.10.2) [Danecek et al, GigaScience 2021]; phasing is generated using Glimpse (version 1.0.0) [Rubinacci et al, Nature Genetics 2021].

The imputed file format is BCF with the following fields:

Generated by imputation tool (glimpse):
  GT: Phased and imputed genotypes
  DS: Genotype dosage
  GP: Genotype posteriors

Generated by genotype caller (mpileup):
  PL: Phred-scaled genotype likelihoods
  AD: Allelic depths (high-quality bases)

REFERENCES
[Fan et al. Genome Biology 2019]: "African evolutionary history inferred from whole genome sequence data of 44 indigenous African populations". Fan S, Kelly DE, Beltrame MH, Hansen MEB, Mallick S, Ranciaro A, Hirbo J, Thompson S, Beggs W, Nyambo T, Omar SA, Meskel DW, Belay G, Froment A, Patterson N, Reich D, Tishkoff SA. Genome Biol. 2019.
[Danecek et al, GigaScience 2021]: "Twelve years of SAMtools and BCFtools". Danecek P, Bonfield JK, Liddle J, Marshall J, Ohan V, Pollard MO, Whitwham A, Keane T, McCarthy SA, Davies RM, Li H. GigaScience, Volume 10, Issue 2, February 2021. https://doi.org/10.1093/gigascience/giab008.
[Rubinacci et al, Nature Genetics 2021]: "Efficient phasing and imputation of low-coverage sequencing data using large reference panels". Rubinacci S, Ribeiro D, Hofmeister R, Delaneau O. Nature Genetics 53.1 (2021): 120-126.
[1000 Genomes Project Consortium, Nature 2015]: "A global reference for human genetic variation". 1000 Genomes Project Consortium, Auton A, Brooks LD, Durbin RM, Garrison EP, Kang HM, Korbel JO, Marchini JL, McCarthy S, McVean GA, Abecasis GR. Nature. 2015 Oct 1;526(7571):68-74. DOI: 10.1038/nature15393. PMID: 26432245. http://www.internationalgenome.org/data/.
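
EXAMPLE
The per-sample fields listed under DETAILS (GT, DS, GP, PL, AD) can be hard to picture from their names alone. The Python sketch below shows one way they might be read once a BCF record has been converted to text VCF (for example with bcftools view). It is a minimal illustration under stated assumptions, not the processing code used for this release: the FORMAT order, the sample values, and the helper names parse_sample, dosage_from_gp and likelihoods_from_pl are all made up for the example, and the site is assumed to be biallelic and diploid.

```python
# Illustrative only: this record is invented, not taken from the SGDP files.
# Assumed FORMAT order; the real files may order the tags differently.
FORMAT = "GT:DS:GP:PL:AD"
SAMPLE = "0|1:1.05:0.02,0.91,0.07:30,0,45:6,5"


def parse_sample(format_str, sample_str):
    """Split one VCF-style sample column into a dict keyed by FORMAT tag."""
    fields = dict(zip(format_str.split(":"), sample_str.split(":")))
    return {
        # Phased alleles: haplotype 1, then haplotype 2 ("|" marks phasing).
        "GT": fields["GT"].split("|"),
        "DS": float(fields["DS"]),
        "GP": [float(x) for x in fields["GP"].split(",")],
        "PL": [int(x) for x in fields["PL"].split(",")],
        "AD": [int(x) for x in fields["AD"].split(",")],
    }


def dosage_from_gp(gp):
    """Expected ALT allele count at a biallelic diploid site: P(0/1) + 2*P(1/1)."""
    return gp[1] + 2.0 * gp[2]


def likelihoods_from_pl(pl):
    """Convert Phred-scaled likelihoods to probabilities that sum to 1."""
    raw = [10 ** (-p / 10.0) for p in pl]
    total = sum(raw)
    return [r / total for r in raw]


rec = parse_sample(FORMAT, SAMPLE)
print("phased alleles:", rec["GT"])
print("reported dosage:", rec["DS"])
print("dosage recomputed from GP:", dosage_from_gp(rec["GP"]))
print("normalized likelihoods from PL:", likelihoods_from_pl(rec["PL"]))
```

In this made-up record the reported DS agrees with the value recomputed from GP, which is the relationship one would expect between dosage and genotype posteriors at a biallelic site. PL and AD come from the genotype caller rather than the imputation step, so they reflect the sequencing reads alone and need not agree with the imputed GT.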