AI-computer

How To Analyze Imputed GWAS Data

January 3, 2025, by admin
Analyzing imputed GWAS data involves several steps, including quality control (QC), association analysis, and interpretation of results. Below is a step-by-step guide to help you analyze imputed GWAS data using a combination of tools and scripts.

Step 1: Data Preparation

1.1. Check Data Format

Ensure your data is in the correct format. Imputed GWAS data is often provided as a matrix of allele dosages: the expected count of one allele for each sample at each SNP, a continuous value between 0 and 2. Hard-called genotypes take only the values 0, 1, or 2.
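
As a quick sanity check on the format, the sketch below scans a whitespace-delimited dosage table and reports values that are non-numeric or fall outside the 0 to 2 range. The layout it assumes (one header row, two leading ID columns, then one dosage column per SNP) and the function name find_bad_dosages are illustrative assumptions, not part of any tool's specification.

```python
import io


def find_bad_dosages(handle, n_id_cols=2, low=0.0, high=2.0):
    """Return (row_id, column_name, raw_value) for every invalid dosage.

    Assumes a whitespace-delimited table with one header row and
    `n_id_cols` leading identifier columns (hypothetical layout).
    """
    header = handle.readline().split()
    problems = []
    for line in handle:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        row_id = "_".join(fields[:n_id_cols])
        for name, raw in zip(header[n_id_cols:], fields[n_id_cols:]):
            try:
                value = float(raw)
            except ValueError:
                problems.append((row_id, name, raw))  # non-numeric, e.g. "NA"
                continue
            if not (low <= value <= high):
                problems.append((row_id, name, raw))  # out of range (or NaN)
    return problems


# Toy input: the second sample has one out-of-range and one missing value.
example = io.StringIO(
    "FID IID SNP1 SNP2 SNP3\n"
    "1 1 0.1 1.8 2.0\n"
    "1 2 1.0 2.5 NA\n"
)
for row_id, snp, raw in find_bad_dosages(example):
    print(row_id, snp, raw)
```

For a real file, pass an open file handle instead of the io.StringIO toy input.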

1.2. Convert Data to PLINK Format (if necessary)

If your data is not in PLINK format, you can convert it using the following steps:

Using Python to Transpose Data (if needed):

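import pandas as pd

# Load the data, treating the first column as row labels so they survive the transpose
data = pd.read_csv("imputed_data.txt", sep="\t", index_col=0)

# Transpose the data (rows become columns and vice versa)
data_transposed = data.transpose()

# Save the transposed data; the index now holds the original column names, so keep it
data_transposed.to_csv("imputed_data_transposed.txt", sep="\t", index=True)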

Convert to PLINK Dosage Format:

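PLINK can read dosage data directly, but it expects a variant-major layout: one row per SNP, with the SNP ID and the two alleles first, followed by the dosage values for each sample. The header carries an FID and IID pair for each sample. With a single dosage value per sample (format=1), the file looks like this (the allele codes are placeholders):

SNP  A1 A2 1 1 1 2 ...
SNP1 A  G  0.1 1.0
SNP2 C  T  1.8 0.5
SNP3 G  A  2.0 1.5
...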

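Use PLINK 2 to import the dosage file. In PLINK 1.9, --dosage runs an association test on the dosages rather than converting them.

bash
# --import-dosage reads the dosage file; --fam supplies the sample information.
# Variant positions are not part of the dosage file and may need to be supplied separately.
# --make-bed writes hard-called genotypes: dosages too far from 0, 1, or 2 become missing calls.
plink2 --import-dosage imputed_data_transposed.txt format=1 --fam sample_fam_file.fam --make-bed --out imputed_data_plink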

Step 2: Quality Control (QC)

2.1. Sample QC

  • Sample Call Rate: Remove samples with low call rates.
  • Relatedness: Check for related samples using PLINK.
  • Population Stratification: Perform PCA to identify population outliers.

PLINK Commands for Sample QC:

bash
# Calculate sample call rates
plink --bfile imputed_data_plink --missing --out sample_call_rate

# Check for relatedness
plink --bfile imputed_data_plink --genome --out relatedness_check

# Perform PCA
plink --bfile imputed_data_plink --pca 10 --out pca_results
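
The outputs of --missing and --genome still need to be turned into exclusion decisions. The sketch below reads the .imiss and .genome column layouts and flags low-call-rate samples and closely related pairs. The thresholds (2% missingness, PI_HAT of 0.1875) are common choices used here as assumptions, and the function names are hypothetical.

```python
import io


def low_call_rate_samples(imiss_handle, max_missing=0.02):
    """Return (FID, IID) pairs whose F_MISS exceeds `max_missing`."""
    header = imiss_handle.readline().split()
    fid, iid, fmiss = (header.index(c) for c in ("FID", "IID", "F_MISS"))
    flagged = []
    for line in imiss_handle:
        fields = line.split()
        if fields and float(fields[fmiss]) > max_missing:
            flagged.append((fields[fid], fields[iid]))
    return flagged


def related_pairs(genome_handle, min_pi_hat=0.1875):
    """Return (IID1, IID2, PI_HAT) for pairs at or above `min_pi_hat`."""
    header = genome_handle.readline().split()
    i1, i2, pi = (header.index(c) for c in ("IID1", "IID2", "PI_HAT"))
    pairs = []
    for line in genome_handle:
        fields = line.split()
        if fields and float(fields[pi]) >= min_pi_hat:
            pairs.append((fields[i1], fields[i2], float(fields[pi])))
    return pairs


# Toy stand-ins for sample_call_rate.imiss and relatedness_check.genome.
imiss = io.StringIO(
    "FID IID MISS_PHENO N_MISS N_GENO F_MISS\n"
    "1 1 N 10 1000 0.01\n"
    "1 2 N 50 1000 0.05\n"
)
genome = io.StringIO(
    "FID1 IID1 FID2 IID2 RT EZ Z0 Z1 Z2 PI_HAT PHE DST PPC RATIO\n"
    "1 1 1 2 UN NA 0.5 0.5 0.0 0.25 -1 0.8 1.0 NA\n"
)
print(low_call_rate_samples(imiss))
print(related_pairs(genome))
```

The flagged samples can then be written to a two-column FID/IID file and passed to PLINK's --remove option.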

2.2. SNP QC

  • MAF Filtering: Remove SNPs with minor allele frequency (MAF) < 0.01.
  • Imputation Quality: Filter SNPs based on imputation quality scores (e.g., Rsq > 0.3).

PLINK Commands for SNP QC:

bash
# Filter by MAF
plink --bfile imputed_data_plink --maf 0.01 --make-bed --out imputed_data_maf_filtered

# Filter by imputation quality (assuming you have a file with Rsq values)
plink --bfile imputed_data_maf_filtered --extract high_quality_snps.txt --make-bed --out imputed_data_high_quality
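
The --extract step assumes that high_quality_snps.txt already exists. One way to build it is sketched below: read the imputation info table and keep the IDs of SNPs whose quality score exceeds the cutoff. The column names SNP and Rsq follow the Minimac-style info file and are assumptions; adjust them to whatever your imputation software writes.

```python
import io


def select_high_quality_snps(info_handle, id_col="SNP", rsq_col="Rsq", min_rsq=0.3):
    """Return IDs of SNPs whose imputation quality exceeds `min_rsq`.

    Column names are assumptions about the info file layout.
    """
    header = info_handle.readline().split()
    id_idx, rsq_idx = header.index(id_col), header.index(rsq_col)
    keep = []
    for line in info_handle:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        try:
            rsq = float(fields[rsq_idx])
        except ValueError:
            continue  # non-numeric score (e.g. "-"): treat as failing the filter
        if rsq > min_rsq:
            keep.append(fields[id_idx])
    return keep


# Toy info table with only the two columns this sketch needs.
info = io.StringIO(
    "SNP Rsq\n"
    "rs1 0.95\n"
    "rs2 0.12\n"
    "rs3 -\n"
)
snps = select_high_quality_snps(info)
print("\n".join(snps))  # one SNP ID per line, the layout --extract expects
```

Writing that output to high_quality_snps.txt gives the list used in the command above.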

Step 3: Association Analysis

3.1. Run Association Analysis

Use PLINK or other tools like SNPTEST or REGENIE for association analysis.

PLINK Command for Association Analysis:

bash
plink --bfile imputed_data_high_quality --assoc --out association_results

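Using SNPTEST for Dosage Data:

bash
# The genotype file comes first, then the sample file.
# A frequentist test also needs -method and -pheno; "phenotype" is a placeholder
# for the phenotype column name in the sample file.
snptest -data imputed_data.dose imputed_data.sample -o snptest_results.txt -frequentist 1 -method expected -pheno phenotype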

3.2. Adjust for Covariates

Include covariates such as age, sex, and principal components (PCs) in your analysis.

PLINK Command with Covariates:

bash
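# pca_results.eigenvec has no header row by default, so select the first three PCs by position.
# hide-covar keeps only the SNP rows in the output; use --logistic instead of --linear for a binary trait.
plink --bfile imputed_data_high_quality --linear hide-covar --covar pca_results.eigenvec --covar-number 1-3 --out association_results_covariates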
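
To adjust for age and sex as well as the PCs, all covariates have to sit in one file keyed by FID and IID. The sketch below joins a covariate table with the first few columns of the eigenvector file. It assumes the covariate table has a header row starting with FID and IID; the AGE and SEX columns and the function name are illustrative.

```python
import io


def merge_covariates(covar_handle, eigenvec_handle, n_pcs=3):
    """Join covariates and the first `n_pcs` PCs on (FID, IID).

    Returns a list of rows (lists of strings); the first row is the header.
    Samples missing from either input are dropped.
    """
    pcs = {}
    for line in eigenvec_handle:
        fields = line.split()
        if not fields or fields[0] == "FID":
            continue  # skip blank lines and an optional header row
        pcs[(fields[0], fields[1])] = fields[2:2 + n_pcs]

    header = covar_handle.readline().split()
    rows = [header + [f"PC{i}" for i in range(1, n_pcs + 1)]]
    for line in covar_handle:
        fields = line.split()
        if not fields:
            continue
        key = (fields[0], fields[1])
        if key in pcs:
            rows.append(fields + pcs[key])
    return rows


# Toy inputs: a covariate table and a headerless eigenvector file.
covars = io.StringIO("FID IID AGE SEX\n1 1 54 1\n1 2 61 2\n")
eigenvec = io.StringIO("1 1 0.01 -0.02 0.03 0.00\n1 2 -0.04 0.05 0.01 0.02\n")
for row in merge_covariates(covars, eigenvec):
    print("\t".join(row))
```

Because the merged file has a header row, its columns can be selected by name, for example with --covar-name AGE,SEX,PC1,PC2,PC3.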

Step 4: Post-Analysis

4.1. Manhattan Plot and QQ Plot

Visualize your results using Manhattan and QQ plots to check for inflation and significant hits.

Using R to Generate Plots:

R
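library(qqman)

# Load association results
results <- read.table("association_results_covariates.assoc.linear", header=TRUE)

# Keep only the additive SNP effect rows and drop tests with missing p-values
results <- subset(results, TEST == "ADD" & !is.na(P))

# Generate Manhattan plot
manhattan(results, chr="CHR", bp="BP", p="P", snp="SNP")

# Generate QQ plot
qq(results$P)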
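
Inflation can also be summarized as a single number, the genomic inflation factor (lambda), alongside the QQ plot. The sketch below computes it as the median observed 1-df chi-square statistic divided by its expected median under the null (about 0.455), using only the Python standard library. The function name is hypothetical; values close to 1 suggest little inflation.

```python
from statistics import NormalDist, median


def genomic_inflation(p_values):
    """Return lambda_GC: median observed 1-df chi-square / expected median."""
    std_normal = NormalDist()
    # A two-sided p-value p corresponds to chi-square = z**2 with z = Phi^-1(p / 2).
    chi2 = [std_normal.inv_cdf(p / 2) ** 2 for p in p_values if 0 < p <= 1]
    expected_median = std_normal.inv_cdf(0.25) ** 2  # about 0.455
    return median(chi2) / expected_median


# Evenly spaced p-values mimic a null distribution with no inflation.
uniform_p = [i / 1000 for i in range(1, 1000)]
print(round(genomic_inflation(uniform_p), 3))
```

In practice the input would be the P column of the association results, with missing values removed first.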

4.2. Annotation of Significant SNPs

Annotate significant SNPs using tools like ANNOVAR or online resources like Ensembl.

Using ANNOVAR:

bash
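# ANNOVAR expects its own input format (chr, start, end, ref, alt) and a database directory,
# so the PLINK results must be converted first. significant_snps.avinput and humandb/ are placeholder names.
annotate_variation.pl -geneanno -buildver hg19 -dbtype refGene -out annotated_results significant_snps.avinput humandb/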

Step 5: Interpretation and Reporting

  • Identify Significant Loci: Focus on SNPs with p-values below the genome-wide significance threshold (e.g., 5e-8).
  • Functional Annotation: Investigate the functional relevance of significant SNPs using databases like GTEx or RegulomeDB.
  • Replication: Consider replicating your findings in an independent cohort.
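
As a starting point for identifying significant loci, the sketch below pulls the rows below the genome-wide threshold out of a PLINK-style results table and sorts them by p-value. The 5e-8 default corresponds to a Bonferroni correction of 0.05 for roughly one million independent tests. The function name and the toy rows are illustrative assumptions.

```python
import io


def genome_wide_hits(results_handle, p_col="P", threshold=5e-8):
    """Return result rows (as dicts) with p-value below `threshold`, sorted by p."""
    header = results_handle.readline().split()
    hits = []
    for line in results_handle:
        fields = line.split()
        if len(fields) != len(header):
            continue  # skip blank or malformed lines
        row = dict(zip(header, fields))
        try:
            p = float(row[p_col])
        except ValueError:
            continue  # PLINK writes "NA" when a test could not be run
        if p < threshold:
            hits.append(row)
    return sorted(hits, key=lambda r: float(r[p_col]))


# Toy stand-in for an .assoc.linear file (values are made up for illustration).
results = io.StringIO(
    "CHR SNP BP A1 TEST NMISS BETA STAT P\n"
    "1 rs1 1000 A ADD 500 0.30 6.1 2e-9\n"
    "1 rs2 2000 C ADD 500 0.01 0.2 0.84\n"
    "2 rs3 3000 G ADD 500 NA NA NA\n"
    "2 rs4 4000 T ADD 500 0.40 7.0 4e-12\n"
)
for hit in genome_wide_hits(results):
    print(hit["CHR"], hit["SNP"], hit["BP"], hit["P"])
```

Nearby hits usually reflect the same signal through linkage disequilibrium, so a list like this is typically grouped into loci before follow-up.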

Recent Tools and Software

  • REGENIE: Efficient for large-scale GWAS with imputed data.
  • SAIGE: Suitable for binary traits and large datasets.
  • LocusZoom: For visualizing GWAS results in specific genomic regions.

Conclusion

Analyzing imputed GWAS data involves several steps, from data preparation and QC to association analysis and interpretation. Tools like PLINK, SNPTEST, and R let you manage and analyze large-scale imputed GWAS data efficiently. Follow best practices for QC and adjust for potential confounders to obtain reliable results.