Data Preparation¶

Below are the requirements and instructions to ensure your data is ready for analysis.

You will need the following:

A file assigning each sample name to a population (see instructions below)
A VCF file (see instructions below)
Mutation rate (per bp per generation) for your species
Length of sequence covered by your data

Example Dataset¶

If you do not have your own data, you can download the example dataset from here.

This dataset includes genetic data for two populations of orangutans. The VCF file chr1.vcf.gz contains SNPs only for chromosome 1, which has a length of 227,913,704 base pairs. The popmap file (popmap.txt) provides the population assignments for each sample. Two populations are present: Bornean and Sumatran, with 10 samples each. The mutation rate is equal to 1.5 × 10⁻⁸ per base pair per generation.

If you are using the example dataset provided for the workshop, you’re all set! You do not need to follow the steps below. Instead, please let us know you were successful by visiting the feedback section at the end of this page.

File with Population Assignments (`popmap.txt`)¶

For the workshop, you will need a file listing which population each of your samples belongs to. Please create a plain text file called popmap.txt with two columns, separated by a tab, like this:

sample1    PopulationA
sample2    PopulationA
sample3    PopulationB
sample4    PopulationB

The first column is the sample name (must match the sample names used in your VCF file). The second column is the population name (choose any labels you wish, e.g. “Pop1” or real population names).
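As a quick sanity check on the two-column format described above, the following sketch writes a small example file and counts lines that do not have exactly two non-empty, tab-separated fields. The file name popmap_example.txt and the sample and population labels are illustrative assumptions; point the awk command at your own popmap.txt instead.

```shell
# Write a small illustrative popmap (hypothetical names), tab-separated.
printf 'sample1\tPopulationA\nsample2\tPopulationA\nsample3\tPopulationB\n' > popmap_example.txt

# Count lines that do NOT have exactly two non-empty tab-separated fields.
bad_lines=$(awk -F'\t' 'NF != 2 || $1 == "" || $2 == "" {n++} END {print n+0}' popmap_example.txt)
echo "malformed lines: $bad_lines"

# Show how many samples are assigned to each population (second column).
cut -f2 popmap_example.txt | sort | uniq -c
```

A count of 0 malformed lines means every row has a sample name and a population label separated by a single tab.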

Tip

If you have many samples and want to generate a list of them from your VCF, use one of the following commands, depending on whether your file is uncompressed (.vcf) or compressed (.vcf.gz):

bcftools query -l your_file.vcf

bcftools query -l your_file.vcf.gz

This will print one sample name per line, which you can copy and paste into your popmap.txt and then add the corresponding population names.
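Instead of copying and pasting, the sample list can be turned into a popmap skeleton directly. This is a minimal sketch: the sample list is simulated with hypothetical names so that it runs on its own, and the file names sample_names.txt and popmap_skeleton.txt as well as the placeholder label UNASSIGNED are assumptions chosen for illustration.

```shell
# In practice the list would come from: bcftools query -l your_file.vcf.gz > sample_names.txt
# Here it is simulated with hypothetical sample names.
printf 'sample1\nsample2\nsample3\n' > sample_names.txt

# Append a tab and a placeholder population label to every sample name.
awk '{print $0 "\tUNASSIGNED"}' sample_names.txt > popmap_skeleton.txt

cat popmap_skeleton.txt
```

You would then open the resulting file in a text editor and replace each placeholder with the real population name.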

Note

It is absolutely fine if some samples in your VCF are not listed in your popmap.txt. They will simply be ignored in downstream analyses that require population assignments.
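To see which samples fall into which situation, the two name lists can be compared. The sketch below uses hypothetical file names (vcf_samples.txt, popmap_check.txt) and made-up sample names. It reports popmap entries that do not occur in the VCF, which usually indicate a typo, and VCF samples without a population assignment, which are the ones that will be ignored.

```shell
# In practice: bcftools query -l your_file.vcf.gz > vcf_samples.txt
# Simulated here with hypothetical names so the sketch is self-contained.
printf 'sample1\nsample2\nsample3\nsample4\n' > vcf_samples.txt
printf 'sample1\tPopulationA\nsample3\tPopulationB\nsampleX\tPopulationB\n' > popmap_check.txt

# Names in the popmap that are absent from the VCF (likely typos).
missing=$(awk -F'\t' 'NR==FNR {seen[$1]=1; next} !($1 in seen) {print $1}' vcf_samples.txt popmap_check.txt)

# Names in the VCF that have no population assignment (will be ignored).
unassigned=$(awk -F'\t' 'NR==FNR {seen[$1]=1; next} !($1 in seen) {print $1}' popmap_check.txt vcf_samples.txt)

echo "In popmap but not in VCF: $missing"
echo "In VCF but not in popmap: $unassigned"
```

In this toy example, sampleX is flagged as missing from the VCF, while sample2 and sample4 are reported as unassigned.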

Let Us Know You Were Successful¶

To help us understand how many participants are ready for the workshop, please confirm that your data is ready.

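Please run the command below that matches your file format, replacing your_final.vcf or your_final.vcf.gz with the name of your VCF file (chr1.vcf.gz if you are using the example dataset):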

For an uncompressed file (.vcf):

awk '!/^#/ {if(length($4)!=1 || length($5)!=1 || $5 ~ /,/) print $0}' your_final.vcf | wc -l

For a compressed file (.vcf.gz):

bcftools view -H your_final.vcf.gz | awk '{if(length($4)!=1 || length($5)!=1 || $5 ~ /,/) print $0}' | wc -l
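To illustrate what this check counts, the sketch below builds a tiny VCF with hypothetical records and applies the same conditions on the REF (column 4) and ALT (column 5) fields. The file name toy.vcf and its contents are assumptions for demonstration only. Header lines are skipped explicitly here, so only data records are counted.

```shell
# Build a tiny illustrative VCF (hypothetical records, tab-separated).
{
  printf '##fileformat=VCFv4.2\n'
  printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n'
  printf 'chr1\t100\t.\tA\tG\t.\tPASS\t.\n'     # biallelic SNP: not counted
  printf 'chr1\t200\t.\tAT\tA\t.\tPASS\t.\n'    # deletion (REF longer than 1): counted
  printf 'chr1\t300\t.\tC\tG,T\t.\tPASS\t.\n'   # two ALT alleles: counted
} > toy.vcf

# Count data records whose REF or ALT is not a single character,
# or whose ALT lists several alleles separated by commas.
n_flagged=$(awk -F'\t' '!/^#/ && (length($4)!=1 || length($5)!=1 || $5 ~ /,/) {n++} END {print n+0}' toy.vcf)
echo "records that are not biallelic SNPs: $n_flagged"
```

Here the count is 2, because the deletion and the multiallelic site are flagged while the simple A-to-G SNP is not.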

Please send us the output via the following form: Send Your Result
