SGDP phased dataset
-------------------

** Important **: A bug was found in the initial release (in this directory).
A new phased dataset release (May 5 2021) is now available from:
https://sharehost.hms.harvard.edu/genetics/reich_lab/sgdp/phased_data2021/

Last updated: Wed Dec 22 11:51:54 EST 2021 [by Shop Mallick]

VERSION HISTORY
- [Wed Dec 22 11:51:54 EST 2021]: Minor text edits to README.
- [May 5 2021]: New release. Phased dataset using bcftools and glimpse approach released. This is the recommended data version to use.
- [Apr 2018]: Bug identified; release notes for initial release modified.
- [Dec 2016]: Initial release. Not recommended for analysis.

DETAILS

(A) [Dec 2016]: Initial release (not recommended; please use the newer release, see (B) below).

SGDP data was phased and analysed by Iain Mathieson, who also packaged the data for sharing. Some caveats include sensitivity of some demographic parameters, such as split times, to the phasing/filtering.

..

DESCRIPTION

This directory contains pointers to data phased using three approaches:

1) PS2: 280 fully public SGDP samples, phased using shapeit. All sites phased. Includes homref sites.
2) PS3: 280 fully public samples, phased using impute2 with the 1000 Genomes reference panel. All sites phased. Includes homref sites.

In general I believe that the phasing is best in PS1 and worst in PS3. Note that sites that are phased in PS2/3 but not in PS1 are likely to be very poorly phased.

..

DIRECTORY_STRUCTURE

PS?_by_sample: One vcf per sample per chromosome. Phased data in:
    ${SAMPLE}/${SAMPLE}.chr${CHR}.phased.vcf.gz
PS?_multisample: One vcf per chromosome with all samples included (only for PS2 and PS3).

..

WARNINGS

1) These files were filtered to include only biallelic SNPs. When the reference panel was used, SNPs that did not agree with the reference panel were also excluded.
2) These files should be further filtered using the site- and sample-specific filters.
3) Do not use the sample S_Daur-1.
   It has regions of missing data.
4) Some samples have chromosome 2 truncated; an update will be done at some point.
5) [Update from Apr 2018]: This version is no longer recommended. A bug in the processing chain causes heterozygous sites that occur at positions without a chimpanzee allele to be incorrectly assigned the homozygous reference state. This artificially increases the size of some homozygous chunks and can affect some analyses; for example, it is known to inflate recent effective population sizes in MSMC estimates. See (B) below for the newer version.

(B) [May 2021]: New release. This is the recommended data version to use. See:
https://sharehost.hms.harvard.edu/genetics/reich_lab/sgdp/phased_data2021/
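
For anyone who still needs to read the initial release (A), for example to reproduce an older analysis, the per-sample layout and warning 3 can be sketched in Python as below. This is a minimal sketch and not part of the release. The function names, the root directory, and the assumption that ${SAMPLE}/ sits directly under PS?_by_sample are illustrative and should be checked against the actual download.

```python
from pathlib import Path

# Warning 3 of the initial release: S_Daur-1 has regions of missing data.
EXCLUDED_SAMPLES = {"S_Daur-1"}


def phased_vcf_path(root, panel, sample, chrom):
    """Build a per-sample, per-chromosome path for the initial release.

    Follows the documented pattern ${SAMPLE}/${SAMPLE}.chr${CHR}.phased.vcf.gz.
    Assumption: the sample directory sits directly under <panel>_by_sample
    (e.g. PS2_by_sample); `root` is a hypothetical local download directory.
    """
    return Path(root) / f"{panel}_by_sample" / sample / f"{sample}.chr{chrom}.phased.vcf.gz"


def usable_samples(samples):
    """Drop samples that the release notes say should not be used."""
    return [s for s in samples if s not in EXCLUDED_SAMPLES]
```

For instance, phased_vcf_path("/data/sgdp", "PS2", "S_French-1", 21) would point at /data/sgdp/PS2_by_sample/S_French-1/S_French-1.chr21.phased.vcf.gz, where both the root and the sample ID are placeholders. Note that this only locates files; the homozygous-reference bug described in warning 5 still applies to their contents, so the May 2021 release remains the recommended version.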