How To Analyze Imputed GWAS Data
January 3, 2025
Analyzing imputed GWAS data involves several steps, including quality control (QC), association analysis, and interpretation of results. Below is a step-by-step guide to analyzing imputed GWAS data using a combination of tools and scripts.
Step 1: Data Preparation
1.1. Check Data Format
Ensure your data is in the correct format. Imputed GWAS data is often provided as a matrix of allele dosages: continuous values between 0 and 2 that give the expected number of copies of the counted allele, as opposed to the hard-call genotypes 0, 1, and 2.
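As a small illustration of how dosages relate to hard calls, the sketch below converts per-sample genotype probabilities into an expected dosage, and into a hard call when one genotype is sufficiently certain. The function names and the 0.9 certainty threshold are illustrative assumptions, not part of any tool's API.

```python
def dosage_from_probs(p_aa, p_ab, p_bb):
    """Expected number of B alleles given P(AA), P(AB), P(BB)."""
    total = p_aa + p_ab + p_bb
    if total <= 0:
        raise ValueError("genotype probabilities sum to zero")
    # Normalise in case the three probabilities do not sum exactly to 1.
    return (p_ab + 2.0 * p_bb) / total


def hard_call(p_aa, p_ab, p_bb, threshold=0.9):
    """Return 0, 1 or 2 if the most likely genotype reaches the threshold, else None."""
    probs = [p_aa, p_ab, p_bb]
    best = max(range(3), key=lambda g: probs[g])
    return best if probs[best] >= threshold else None


# An uncertain genotype keeps its information as a dosage but yields no hard call.
print(dosage_from_probs(0.0, 0.2, 0.8))
print(hard_call(0.0, 0.2, 0.8))
print(hard_call(0.02, 0.03, 0.95))
```

Keeping the dosage rather than forcing a hard call preserves the imputation uncertainty, which is why dosage-aware association tools are generally preferred for imputed data.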
1.2. Convert Data to PLINK Format (if necessary)
If your data is not in PLINK format, you can convert it using the following steps:
Using Python to Transpose Data (if needed):
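import pandas as pd

# Load the data; the first column is assumed to hold the row labels
data = pd.read_csv("imputed_data.txt", sep="\t", index_col=0)

# Transpose the data so that rows become columns
data_transposed = data.transpose()

# Save the transposed data, keeping the labels (index=False would drop them)
data_transposed.to_csv("imputed_data_transposed.txt", sep="\t", index=True)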
Convert to PLINK Dosage Format:
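PLINK 1.x reads dosage files in a variant-major layout: one row per SNP, holding the SNP ID and the two allele codes, followed by one dosage per sample when format=1 is given. The header row lists an FID and IID pair for each sample:
SNP A1 A2 F1 I1 F2 I2 F3 I3
rs0001 A C 0.1 1.0 1.9
rs0002 G T 1.8 0.5 0.2
...
In PLINK 1.9, --dosage runs an association test on the dosages rather than writing a binary fileset. To produce the .bed/.bim/.fam files used in the following steps, one option is PLINK 2's --import-dosage combined with --make-bed, which stores hard calls. The .map file with variant positions is an assumed extra input; check the PLINK 2 documentation for the companion files your data needs:
plink2 --import-dosage imputed_data_transposed.txt format=1 --fam sample_fam_file.fam --map imputed_data.map --make-bed --out imputed_data_plink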
Step 2: Quality Control (QC)
2.1. Sample QC
- Sample Call Rate: Remove samples with low call rates.
- Relatedness: Check for related samples using PLINK.
- Population Stratification: Perform PCA to identify population outliers.
PLINK Commands for Sample QC:
# Calculate sample call rates
plink --bfile imputed_data_plink --missing --out sample_call_rate

# Check for relatedness
plink --bfile imputed_data_plink --genome --out relatedness_check

# Perform PCA
plink --bfile imputed_data_plink --pca 10 --out pca_results
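The reports written by --missing and --genome still have to be turned into exclusion lists. A minimal sketch of that step is shown here, assuming the standard whitespace-delimited .imiss and .genome layouts. The 2% missingness cutoff and the PI_HAT cutoff of 0.1875 are common conventions chosen for illustration, and the function names are hypothetical.

```python
import io


def read_plink_table(handle):
    """Yield one dict per row of a whitespace-delimited PLINK report."""
    header = handle.readline().split()
    for line in handle:
        fields = line.split()
        if fields:
            yield dict(zip(header, fields))


def low_call_rate_samples(imiss_handle, max_missing=0.02):
    """Samples whose missing-genotype fraction (F_MISS) exceeds max_missing."""
    return [(row["FID"], row["IID"])
            for row in read_plink_table(imiss_handle)
            if float(row["F_MISS"]) > max_missing]


def related_pairs(genome_handle, min_pi_hat=0.1875):
    """Sample pairs whose estimated IBD proportion (PI_HAT) is at least min_pi_hat."""
    return [(row["IID1"], row["IID2"], float(row["PI_HAT"]))
            for row in read_plink_table(genome_handle)
            if float(row["PI_HAT"]) >= min_pi_hat]


# Made-up rows in the layout of sample_call_rate.imiss and relatedness_check.genome
imiss = io.StringIO(
    " FID IID MISS_PHENO N_MISS N_GENO F_MISS\n"
    "   1   1          N     10   1000   0.01\n"
    "   1   2          N     80   1000   0.08\n"
)
genome = io.StringIO(
    " FID1 IID1 FID2 IID2 RT EZ Z0 Z1 Z2 PI_HAT PHE DST PPC RATIO\n"
    "    1    1    1    2 UN NA 0.25 0.50 0.25 0.5000 -1 0.85 1.0 NA\n"
    "    1    1    2    3 UN NA 1.00 0.00 0.00 0.0000 -1 0.70 0.5 2.0\n"
)

flagged = low_call_rate_samples(imiss)
pairs = related_pairs(genome)
print(flagged)
print(pairs)
```

With real data, the StringIO objects would be replaced by open("sample_call_rate.imiss") and open("relatedness_check.genome"), and the flagged IDs could be written to a file for PLINK's --remove option.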
2.2. SNP QC
- MAF Filtering: Remove SNPs with minor allele frequency (MAF) < 0.01.
- Imputation Quality: Filter SNPs based on imputation quality scores (e.g., Rsq > 0.3).
PLINK Commands for SNP QC:
# Filter by MAF
plink --bfile imputed_data_plink --maf 0.01 --make-bed --out imputed_data_maf_filtered

# Filter by imputation quality (assuming you have a file with Rsq values)
plink --bfile imputed_data_maf_filtered --extract high_quality_snps.txt --make-bed --out imputed_data_high_quality
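The list passed to --extract has to be built from the imputation quality scores first. The following sketch shows one way to do that, assuming a tab-delimited info file with a header and columns named SNP and Rsq. Those column names are an assumption modelled on Minimac-style output, so adjust them to whatever your imputation software writes.

```python
import csv
import io


def select_high_quality_snps(info_handle, min_rsq=0.3, id_col="SNP", rsq_col="Rsq"):
    """Return IDs of variants whose imputation quality exceeds min_rsq."""
    keep = []
    for row in csv.DictReader(info_handle, delimiter="\t"):
        try:
            rsq = float(row[rsq_col])
        except (TypeError, ValueError):
            continue  # skip rows with a missing or non-numeric score
        if rsq > min_rsq:
            keep.append(row[id_col])
    return keep


# Made-up example; a real run would open the info file from the imputation software.
info = io.StringIO(
    "SNP\tMAF\tRsq\n"
    "rs0001\t0.21\t0.95\n"
    "rs0002\t0.02\t0.12\n"
    "rs0003\t0.35\t-\n"
    "rs0004\t0.10\t0.31\n"
)
snp_ids = select_high_quality_snps(info)

# One ID per line is the layout that --extract reads.
extract_list = "\n".join(snp_ids) + "\n"
print(extract_list, end="")
```

Saving extract_list as high_quality_snps.txt gives the file used in the PLINK command for imputation quality filtering.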
Step 3: Association Analysis
3.1. Run Association Analysis
Use PLINK or other tools like SNPTEST or REGENIE for association analysis.
PLINK Command for Association Analysis:
plink --bfile imputed_data_high_quality --assoc --out association_results
Using SNPTEST for Dosage Data:
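# -data takes the genotype file first, then the sample file.
# "phenotype" is a placeholder for the phenotype column name in the sample file.
snptest -data imputed_data.dose imputed_data.sample -o snptest_results.txt -frequentist 1 -method expected -pheno phenotype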
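To make concrete what an additive test on dosages computes for a quantitative trait, here is a minimal single-SNP sketch: an ordinary least-squares regression of the phenotype on the dosage. It is a simplification, not how PLINK or SNPTEST are implemented. It has no covariates, and the p-value uses a normal approximation that is only reasonable for large samples. The function name is hypothetical.

```python
import math
from statistics import fmean


def dosage_linear_test(dosages, phenotypes):
    """Regress phenotype on dosage; return (beta, standard error, two-sided p)."""
    n = len(dosages)
    if n != len(phenotypes) or n < 3:
        raise ValueError("need at least 3 paired observations")
    mean_x, mean_y = fmean(dosages), fmean(phenotypes)
    sxx = sum((x - mean_x) ** 2 for x in dosages)
    if sxx == 0:
        raise ValueError("dosage has no variation")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(dosages, phenotypes))
    beta = sxy / sxx                      # effect per copy of the counted allele
    intercept = mean_y - beta * mean_x
    rss = sum((y - intercept - beta * x) ** 2 for x, y in zip(dosages, phenotypes))
    se = math.sqrt(rss / (n - 2) / sxx)   # standard error of the slope
    if se == 0:
        return beta, se, 0.0              # perfect fit
    z = beta / se
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal approximation
    return beta, se, p


# Toy data: the phenotype rises by roughly one unit per allele copy.
dosages = [0.0, 1.0, 2.0, 0.0, 1.0, 2.0]
phenotypes = [0.1, 1.0, 2.1, -0.1, 1.1, 1.9]
beta, se, p = dosage_linear_test(dosages, phenotypes)
print(beta, se, p)
```

Dedicated GWAS tools run this kind of model for every variant, add covariates, and use exact test distributions, so the sketch is meant only to show where the effect size and p-value come from.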
3.2. Adjust for Covariates
Include covariates such as age, sex, and principal components (PCs) in your analysis.
PLINK Command with Covariates:
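# The .eigenvec file written by PLINK 1.9 has no header line, so the PCs are selected by position.
# hide-covar keeps only the SNP rows in the output.
plink --bfile imputed_data_high_quality --linear hide-covar --covar pca_results.eigenvec --covar-number 1-3 --out association_results_covariates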
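To include age and sex alongside the principal components, the covariates need to be in a single file. A minimal sketch of that merge follows, assuming a whitespace-delimited covariate table with a header (FID, IID, AGE, SEX are hypothetical column names) and a header-less .eigenvec file as PLINK 1.9 writes by default.

```python
import io


def merge_covariates(covar_handle, eigenvec_handle, n_pcs=3):
    """Join a headered covariate table with header-less .eigenvec rows on (FID, IID)."""
    pcs = {}
    for line in eigenvec_handle:
        fields = line.split()
        if len(fields) >= 2 + n_pcs:
            pcs[(fields[0], fields[1])] = fields[2:2 + n_pcs]

    header = covar_handle.readline().split()
    rows = [header + [f"PC{i + 1}" for i in range(n_pcs)]]
    for line in covar_handle:
        fields = line.split()
        if not fields:
            continue
        key = (fields[0], fields[1])
        if key in pcs:  # samples without PCs are dropped
            rows.append(fields + pcs[key])
    return rows


# Made-up inputs standing in for a covariate table and pca_results.eigenvec
covars = io.StringIO("FID IID AGE SEX\n1 1 54 1\n1 2 61 2\n2 3 47 1\n")
eigenvec = io.StringIO("1 1 0.01 -0.02 0.03 0.00\n1 2 -0.04 0.05 0.01 0.02\n")

merged = merge_covariates(covars, eigenvec)
for row in merged:
    print(" ".join(row))
```

Because the merged table has a header, it could be passed to --covar together with --covar-name AGE,SEX,PC1,PC2,PC3.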
Step 4: Post-Analysis
4.1. Manhattan Plot and QQ Plot
Visualize your results using Manhattan and QQ plots to check for inflation and significant hits.
Using R to Generate Plots:
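library(qqman)

# Load association results
results <- read.table("association_results_covariates.assoc.linear", header=TRUE)

# Keep the additive SNP effect rows and drop missing p-values
results <- results[results$TEST == "ADD" & !is.na(results$P), ]

# Generate Manhattan plot
manhattan(results, chr="CHR", bp="BP", p="P", snp="SNP")

# Generate QQ plot
qq(results$P)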
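Inflation visible in a QQ plot is usually summarized by the genomic inflation factor, lambda: the median observed 1-df chi-square statistic divided by its expected median under the null (about 0.455). The sketch below computes it from a list of p-values using only the standard library. The function name is hypothetical, and the p-values are generated for illustration.

```python
from statistics import NormalDist, median

CHI2_1DF_MEDIAN = 0.4549364  # median of a chi-square distribution with 1 df


def genomic_inflation(p_values):
    """Genomic inflation factor (lambda GC) from two-sided p-values."""
    nd = NormalDist()
    # A two-sided p-value p corresponds to z = inv_cdf(p / 2), and chi2 = z ** 2.
    chi2 = [nd.inv_cdf(p / 2.0) ** 2 for p in p_values if 0.0 < p <= 1.0]
    if not chi2:
        raise ValueError("no valid p-values")
    return median(chi2) / CHI2_1DF_MEDIAN


# Evenly spaced p-values mimic a well-calibrated null distribution.
n = 1000
uniform_p = [(i + 0.5) / n for i in range(n)]
lambda_null = genomic_inflation(uniform_p)

# Squaring shifts the p-values toward zero, mimicking inflated test statistics.
lambda_inflated = genomic_inflation([p ** 2 for p in uniform_p])
print(lambda_null, lambda_inflated)
```

Values close to 1 suggest well-calibrated statistics, while values clearly above 1 can indicate residual stratification or relatedness, although polygenic traits in large samples also raise lambda.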
4.2. Annotation of Significant SNPs
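Annotate significant SNPs using tools like ANNOVAR or online resources like Ensembl. ANNOVAR does not read PLINK output directly: it expects an input file with chromosome, start, end, reference allele, and alternate allele columns, plus the path to a downloaded annotation database.
Using ANNOVAR:
# significant_snps.avinput and humandb/ are placeholder names for the converted input file and the database directory
annotate_variation.pl -geneanno -buildver hg19 -dbtype refGene -out annotated_results significant_snps.avinput humandb/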
Step 5: Interpretation and Reporting
- Identify Significant Loci: Focus on SNPs with p-values below the genome-wide significance threshold (e.g., 5e-8).
- Functional Annotation: Investigate the functional relevance of significant SNPs using databases like GTEx or RegulomeDB.
- Replication: Consider replicating your findings in an independent cohort.
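The first of these steps can be sketched as a simple distance-based grouping: keep SNPs below the significance threshold, merge neighbouring hits on the same chromosome into one locus, and report the SNP with the smallest p-value in each. The 500 kb window and the function name are illustrative assumptions. LD-based clumping (for example PLINK's --clump) is the more common approach when reference genotypes are available.

```python
def lead_snps(results, p_threshold=5e-8, window_bp=500_000):
    """Group significant hits into loci and return (lead SNP record, hit count) per locus.

    results: iterable of (snp_id, chromosome, base_pair_position, p_value) tuples.
    """
    hits = sorted((r for r in results if r[3] < p_threshold),
                  key=lambda r: (r[1], r[2]))
    loci = []
    for hit in hits:
        last = loci[-1] if loci else None
        if last and last["chrom"] == hit[1] and hit[2] - last["end"] <= window_bp:
            # Close enough to the previous hit: extend the current locus.
            last["end"] = hit[2]
            last["n_hits"] += 1
            if hit[3] < last["lead"][3]:
                last["lead"] = hit
        else:
            loci.append({"chrom": hit[1], "end": hit[2], "n_hits": 1, "lead": hit})
    return [(locus["lead"], locus["n_hits"]) for locus in loci]


# Made-up association results: (SNP, CHR, BP, P)
example = [
    ("rs1", 1, 1_000_000, 1e-9),
    ("rs2", 1, 1_200_000, 3e-10),
    ("rs3", 1, 5_000_000, 2e-8),
    ("rs4", 2, 1_100_000, 1e-12),
    ("rs5", 2, 3_000_000, 0.01),
]
loci = lead_snps(example)
for lead, n_hits in loci:
    print(lead[0], lead[1], lead[2], lead[3], n_hits)
```

The lead SNPs returned here are natural starting points for the functional annotation and replication steps.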
Recent Tools and Software
- REGENIE: Efficient for large-scale GWAS with imputed data.
- SAIGE: Suitable for binary traits and large datasets, particularly with unbalanced case-control ratios or related samples.
- LocusZoom: For visualizing GWAS results in specific genomic regions.
Conclusion
Analyzing imputed GWAS data involves several steps, from data preparation and QC to association analysis and interpretation. With tools like PLINK, SNPTEST, and R, you can manage and analyze large-scale imputed GWAS data efficiently. Follow best practices for QC and adjust for potential confounders to obtain reliable results.