
Step-by-Step Guide to Obtaining Ancestral Allele Information from the 1000 Genomes Project

January 10, 2025, by admin

Ancestral allele information is crucial for understanding the evolutionary context of genetic variants. This guide provides a detailed protocol for extracting ancestral allele information from the 1000 Genomes Project data, including handling VCF files and using external resources like Ensembl and dbSNP.


Step 1: Understanding Ancestral Alleles

1.1 Definition

  • Ancestral Allele: The allele inferred to have been carried by the most recent common ancestor of the species being studied and its closest relatives (for humans, typically the human-chimpanzee ancestor).
  • Derived Allele: The allele that arose by mutation on the lineage leading from that ancestor to the present-day species.

1.2 Sources of Ancestral Allele Information

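  • Chimpanzee and Other Primates: A single outgroup genome is often used as a proxy, but the call is wrong wherever the mutation occurred on the outgroup's own lineage.
  • Phylogenetic Trees: Inferring the ancestral allele across several species on a tree is more reliable than using a single outgroup, because one lineage-specific mutation can be outweighed by the other species.
  • Ensembl Compara: Provides ancestral allele calls inferred from multi-species alignments.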
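
To make the outgroup idea concrete, here is a minimal sketch of a majority vote across outgroup species. The input layout (position, human REF, human ALT, then one allele per outgroup) and the function name call_ancestral are assumptions made for illustration, and the records are toy data. Alignment-based resources such as Ensembl Compara use a proper model over the species tree, so treat this only as a way to see why several outgroups beat one.

```shell
# Hypothetical helper: majority vote across outgroup alleles.
# Input columns (tab-separated): POS REF ALT OUTGROUP1 OUTGROUP2 ...
# Output: POS and the called ancestral allele, or "." when the vote is tied.
call_ancestral() {
  awk -F'\t' '{
    ref = 0; alt = 0
    for (i = 4; i <= NF; i++) {
      a = toupper($i)
      if (a == $2)      ref++     # outgroup supports the reference allele
      else if (a == $3) alt++     # outgroup supports the alternate allele
    }
    if (ref > alt)      aa = $2
    else if (alt > ref) aa = $3
    else                aa = "."  # tie or no informative outgroup
    print $1 "\t" aa
  }'
}

# Toy records: three outgroups, three outgroups with one gap, two outgroups.
printf '1000\tA\tG\tA\tA\tG\n2000\tC\tT\tT\tT\tN\n3000\tG\tA\tG\tA\n' | call_ancestral
```

The third toy record is left undetermined because the two outgroups disagree, which is exactly the situation where a single-outgroup proxy would silently pick one answer.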

Step 2: Downloading 1000 Genomes VCF Files

2.1 Accessing the Data

  1. Go to the 1000 Genomes Project website.
  2. Navigate to the data portal and select the VCF files for the population or region of interest.

2.2 Downloading VCF Files

  • Use wget or curl to download the VCF files:
    wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.chr1.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz
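
If you need more than one chromosome, the same command can be wrapped in a loop. The sketch below only prints the URLs (a dry run). The helper name kg_vcf_url is hypothetical, and it assumes that every autosome follows the same file-name pattern as the chr1 file above, so check the directory listing on the FTP site before downloading, since names and version suffixes can differ between files.

```shell
# Base path and file-name pattern copied from the chr1 command above.
BASE="ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502"

# Hypothetical helper: print the VCF URL for one chromosome.
kg_vcf_url() {
  printf '%s/ALL.chr%s.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz\n' "$BASE" "$1"
}

# Dry run: list the URLs for chromosomes 1 to 3.
# To download, pipe the list into wget, for example: ... | wget -c -i -
for chr in 1 2 3; do
  kg_vcf_url "$chr"
done
```

Downloading the matching .tbi index next to each VCF lets bcftools pull out a region without reading the whole file.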

Step 3: Extracting Ancestral Allele Information from VCF Files

3.1 Using bcftools

  1. Install bcftools if not already installed:
    sudo apt-get install bcftools
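  2. Extract the ancestral allele field (AA) from the INFO column, keeping the coordinates and alleles alongside it:
    bcftools query -f '%CHROM\t%POS\t%REF\t%ALT\t%INFO/AA\n' ALL.chr1.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz > ancestral_alleles.txt
    Each output line holds the chromosome, position, REF, ALT and the raw AA value. Sites without an AA tag are written as ".".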
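
In the phase 3 files the AA value is a pipe-delimited string (the VCF header describes it as AA|REF|ALT|IndelType), so a SNP typically looks like "g|||" rather than a bare base. The sketch below splits that value and records a confidence label. It assumes a tab-separated table with the columns CHROM, POS, REF, ALT and raw AA (for example, produced by bcftools query -f '%CHROM\t%POS\t%REF\t%ALT\t%INFO/AA\n'). The function name parse_aa and the toy records are made up for illustration.

```shell
# Hypothetical helper: split a raw AA value into allele and confidence.
# Input columns (tab-separated): CHROM POS REF ALT RAW_AA
# Output: CHROM POS REF ALT AA CONFIDENCE
parse_aa() {
  awk -F'\t' 'BEGIN { OFS = "\t" } {
    split($5, f, "|")                 # first piece is the allele itself
    raw = f[1]
    if (raw ~ /^[ACGT]+$/)      conf = "high"   # uppercase call
    else if (raw ~ /^[acgt]+$/) conf = "low"    # lowercase call
    else                        conf = "none"   # ".", "N", "-", empty
    aa = (conf == "none") ? "." : toupper(raw)
    print $1, $2, $3, $4, aa, conf
  }'
}

# Toy records, not real 1000 Genomes sites.
printf '1\t100\tG\tA\tg|||\n1\t200\tA\tT\tA|||\n1\t300\tC\tT\t.|||\n' | parse_aa
```

Keeping the confidence as its own column means you can decide later whether low-confidence calls are acceptable for a given analysis, instead of discarding them up front.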

3.2 Handling Missing Data

  • “.”, “N” or “-” in the AA field: Indicates that the ancestral allele could not be determined at that site. A lowercase base (for example “g”) marks a low-confidence call.
  • Filtering: Remove or flag entries with missing ancestral allele information.
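
Both options, removing and flagging, can be sketched in a line of awk each. The helpers below assume a tab-separated table with CHROM, POS and an already-extracted AA base in the third column. The names drop_missing_aa and flag_missing_aa are hypothetical, and the records are toy data.

```shell
# Input columns (tab-separated): CHROM POS AA

# Hypothetical helper: keep only sites with a determined ancestral allele.
drop_missing_aa() {
  awk -F'\t' '$3 ~ /^[ACGTacgt]+$/'
}

# Hypothetical helper: keep every site but append an "ok"/"missing" flag.
flag_missing_aa() {
  awk -F'\t' 'BEGIN { OFS = "\t" } {
    print $0, (($3 ~ /^[ACGTacgt]+$/) ? "ok" : "missing")
  }'
}

printf '1\t100\tA\n1\t200\t.\n1\t300\tN\n1\t400\tc\n' | flag_missing_aa
```

Flagging is usually the safer default while exploring the data, because it keeps the site count intact and lets you report how many variants were lost to missing calls.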

Step 4: Using Ensembl Compara for Ancestral Alleles

4.1 Accessing Ensembl Compara

  1. Go to the Ensembl website.
  2. Use the BioMart tool to query ancestral allele information.

4.2 Querying Ancestral Alleles

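  1. Select the “Ensembl Variation” database, which covers variants genome-wide rather than only those linked to genes.
  2. Choose the short variants dataset for the appropriate species (e.g., Human).
  3. Under “Attributes”, select “Ancestral allele” together with the variant name and its chromosome and position.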
  4. Export the results to a file.

Step 5: Using dbSNP for Ancestral Alleles

5.1 Downloading dbSNP Data

  1. Go to the dbSNP FTP site.
  2. Download the SNPAncestralAllele.bcp.gz and Allele.bcp.gz files.

5.2 Extracting Ancestral Alleles

  1. Decompress the files:
    gunzip SNPAncestralAllele.bcp.gz
    gunzip Allele.bcp.gz
  2. Use a script to parse and match the ancestral alleles with your SNP data.
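
A minimal sketch of such a script is shown below. It assumes that both files are tab-separated, that Allele.bcp starts with an allele ID followed by the allele string, and that SNPAncestralAllele.bcp starts with the SNP ID followed by the ancestral allele ID. Verify these layouts against the dbSNP schema for your download before relying on them. The function name dbsnp_ancestral and all IDs in the toy files are made up.

```shell
# Hypothetical helper: join SNPAncestralAllele.bcp to Allele.bcp.
# Assumed layouts (check the dbSNP schema):
#   Allele.bcp:             allele_id <TAB> allele <TAB> ...
#   SNPAncestralAllele.bcp: snp_id <TAB> ancestral_allele_id <TAB> ...
dbsnp_ancestral() {
  # $1 = Allele.bcp, $2 = SNPAncestralAllele.bcp
  awk -F'\t' 'BEGIN { OFS = "\t" }
    NR == FNR { allele[$1] = $2; next }            # first file: id to allele
    ($2 in allele) { print "rs" $1, allele[$2] }   # second file: look up id
  ' "$1" "$2"
}

# Toy files standing in for the real downloads.
tmp=$(mktemp -d)
printf '1\tA\n2\tC\n3\tG\n' > "$tmp/Allele.bcp"
printf '111\t3\n222\t1\n333\t9\n' > "$tmp/SNPAncestralAllele.bcp"
dbsnp_ancestral "$tmp/Allele.bcp" "$tmp/SNPAncestralAllele.bcp"
rm -r "$tmp"
```

SNPs whose allele ID has no match in the allele table (the third toy row) are dropped silently here. In a real run it is worth counting them.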

Step 6: Combining Data Sources

6.1 Merging VCF and Ensembl Data

  • Use a script to merge the ancestral allele information from Ensembl with your VCF data based on genomic coordinates.
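
One way to sketch this merge is a two-file awk join keyed on chromosome and position. The code assumes a tab-separated Ensembl export with CHROM, POS and AA columns, and a VCF-derived table whose first two columns are CHROM and POS. It also assumes both tables use the same genome assembly and the same chromosome naming ("1" versus "chr1"), which you should confirm first. The function name merge_by_position is hypothetical.

```shell
# Hypothetical helper: append the Ensembl ancestral allele to each VCF row.
# $1 = Ensembl export (CHROM POS AA), $2 = VCF-derived table (CHROM POS ...)
merge_by_position() {
  awk -F'\t' 'BEGIN { OFS = "\t" }
    NR == FNR { ens[$1 ":" $2] = $3; next }   # index Ensembl by chrom:pos
    {
      key = $1 ":" $2
      print $0, ((key in ens) ? ens[key] : "NA")   # NA when no Ensembl row
    }
  ' "$1" "$2"
}

# Toy tables standing in for the real exports.
tmp=$(mktemp -d)
printf '1\t100\tG\n' > "$tmp/ensembl.tsv"
printf '1\t100\tG\tA\tg|||\n1\t200\tC\tT\t.|||\n' > "$tmp/vcf.tsv"
merge_by_position "$tmp/ensembl.tsv" "$tmp/vcf.tsv"
rm -r "$tmp"
```

The join keeps every VCF row and marks unmatched positions with NA, so no variant disappears just because one source lacks it.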

6.2 Cross-Referencing with dbSNP

  • Ensure consistency by cross-referencing the ancestral alleles obtained from different sources.
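
A simple consistency check labels each site as agreeing, conflicting or incomplete. The sketch below assumes a tab-separated table with CHROM, POS and one ancestral allele column per source, all reported on the same strand. Alleles from different databases may need strand harmonising first. The function name compare_sources and the records are illustrative only.

```shell
# Hypothetical helper: compare ancestral alleles from two sources.
# Input columns (tab-separated): CHROM POS AA_SOURCE1 AA_SOURCE2
# Output: the input row plus a status column.
compare_sources() {
  awk -F'\t' 'BEGIN { OFS = "\t" } {
    a = toupper($3); b = toupper($4)        # ignore confidence casing
    if (a !~ /^[ACGT]+$/ || b !~ /^[ACGT]+$/) status = "incomplete"
    else if (a == b)                          status = "agree"
    else                                      status = "conflict"
    print $0, status
  }'
}

printf '1\t100\tg\tG\n1\t200\tC\tT\n1\t300\t.\tA\n' | compare_sources
```

Sites labelled "conflict" are the ones worth inspecting by hand or excluding, while "incomplete" sites simply rely on a single source.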

Step 7: Handling Ambiguous Cases

7.1 Multiple Ancestral Alleles

  • Resolution: Use phylogenetic trees or additional species to resolve ambiguities.

7.2 Missing Data

  • Imputation: Use statistical methods to infer missing ancestral alleles, though this should be done cautiously.

Step 8: Example Workflow

8.1 Download and Extract VCF Data

wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.chr1.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz
bcftools query -f '%CHROM\t%POS\t%REF\t%ALT\t%INFO/AA\n' ALL.chr1.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz > ancestral_alleles.txt

8.2 Query Ensembl Compara

  1. Use BioMart to export ancestral allele information.
  2. Merge with VCF data using a script.

8.3 Cross-Reference with dbSNP

  1. Download and parse SNPAncestralAllele.bcp and Allele.bcp.
  2. Merge with existing data.

Conclusion

Obtaining ancestral allele information from the 1000 Genomes Project involves extracting data from VCF files, querying Ensembl Compara, and cross-referencing with dbSNP. By following this guide, you can assign ancestral alleles to your SNPs and see how well each call is supported, which improves your understanding of genetic variation and evolution. Always handle missing data and ambiguities with care, and consider using multiple data sources for robust results.
