How To Analyze Imputed GWAS Data

January 3, 2025, by admin

Analyzing imputed GWAS data involves several steps, including quality control (QC), association analysis, and interpretation of results. Below is a step-by-step guide to help you analyze imputed GWAS data using a combination of tools and scripts.

Step 1: Data Preparation

1.1. Check Data Format

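Ensure your data is in the correct format. Imputed GWAS data is often provided as a matrix of allele dosages: the expected number of copies of one allele for each sample and SNP, which takes continuous values between 0 and 2 (hard-called genotypes are the special case 0, 1, or 2).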
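
A quick check of the value range can catch format problems before any conversion. The sketch below is one way to do it, assuming a tab-delimited table with a header row and a single leading identifier column; the function name find_invalid_dosages and the layout are illustrative assumptions, not a fixed standard.

```python
import csv
import io


def find_invalid_dosages(handle, n_id_cols=1, delimiter="\t"):
    """Return (row_id, column_name, raw_value) for each cell that is not a dosage in [0, 2].

    Assumes a header row and `n_id_cols` leading identifier columns
    (an assumption about the file layout, adjust as needed).
    """
    reader = csv.reader(handle, delimiter=delimiter)
    header = next(reader)
    problems = []
    for row in reader:
        row_id = row[0]
        for name, raw in zip(header[n_id_cols:], row[n_id_cols:]):
            try:
                value = float(raw)
            except ValueError:
                problems.append((row_id, name, raw))  # non-numeric entry such as "NA"
                continue
            if not 0.0 <= value <= 2.0:
                problems.append((row_id, name, raw))  # outside the valid dosage range
    return problems


# Tiny in-memory example; for a real file, pass open("imputed_data.txt") instead.
example = io.StringIO("ID\tSNP1\tSNP2\nS1\t0.1\t1.8\nS2\t2.4\tNA\n")
print(find_invalid_dosages(example))
```

An empty list means every value parsed as a number between 0 and 2.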

1.2. Convert Data to PLINK Format (if necessary)

If your data is not in PLINK format, you can convert it using the following steps:

Using Python to Transpose Data (if needed):

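import pandas as pd

# Load the data, using the first column as row labels so the IDs survive the transpose
data = pd.read_csv("imputed_data.txt", sep="\t", index_col=0)

# Transpose the data (rows become columns and vice versa)
data_transposed = data.transpose()

# Save the transposed data, keeping the index (the original column names)
data_transposed.to_csv("imputed_data_transposed.txt", sep="\t")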

Convert to PLINK Dosage Format:

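PLINK 1.9 reads dosage files in a variant-major layout: one row per SNP, with the SNP ID and both alleles followed by one dosage per sample (format=1). The header carries an FID/IID pair for each sample (the allele codes below are illustrative):

SNP   A1  A2  1 1   1 2  ...
SNP1  A   C   0.1   1.0
SNP2  G   T   1.8   0.5
SNP3  C   T   2.0   1.5
...

Note that in PLINK 1.9, --dosage runs an association test directly and does not write a new fileset. To create the imputed_data_plink binary fileset used in the following steps, one option is PLINK 2's --import-dosage, which converts dosages to hard calls when writing .bed files (uncertain dosages may be set to missing). Variant positions come from a separate .map file; snp_positions.map below is a placeholder name:

plink2 --import-dosage imputed_data_transposed.txt format=1 --fam sample_fam_file.fam --map snp_positions.map --make-bed --out imputed_data_plink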

Step 2: Quality Control (QC)

2.1. Sample QC

Sample Call Rate: Remove samples with low call rates.
Relatedness: Check for related samples using PLINK.
Population Stratification: Perform PCA to identify population outliers.

PLINK Commands for Sample QC:

# Calculate sample call rates
plink --bfile imputed_data_plink --missing --out sample_call_rate

# Check for relatedness
plink --bfile imputed_data_plink --genome --out relatedness_check

# Perform PCA
plink --bfile imputed_data_plink --pca 10 --out pca_results
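
The reports written by these commands still have to be turned into exclusion lists. The sketch below is a minimal example of that step, assuming the standard PLINK 1.9 column names (F_MISS in the .imiss report, PI_HAT in the .genome report). The helper names and the cutoffs of 0.02 and 0.185 are illustrative assumptions and should be adapted to your study.

```python
import io


def read_plink_table(handle):
    """Parse a whitespace-delimited PLINK report into a list of dicts keyed by column name."""
    lines = [line.split() for line in handle if line.strip()]
    header, rows = lines[0], lines[1:]
    return [dict(zip(header, row)) for row in rows]


def low_call_rate_samples(imiss_rows, max_missing=0.02):
    """Samples whose missing-genotype fraction (F_MISS) exceeds max_missing."""
    return [(r["FID"], r["IID"]) for r in imiss_rows if float(r["F_MISS"]) > max_missing]


def related_pairs(genome_rows, min_pi_hat=0.185):
    """Sample pairs whose estimated IBD proportion (PI_HAT) exceeds min_pi_hat."""
    return [
        ((r["FID1"], r["IID1"]), (r["FID2"], r["IID2"]))
        for r in genome_rows
        if float(r["PI_HAT"]) > min_pi_hat
    ]


# In-memory stand-ins for sample_call_rate.imiss and relatedness_check.genome.
imiss = read_plink_table(io.StringIO(
    " FID IID MISS_PHENO N_MISS N_GENO F_MISS\n"
    "   1   1          N     10   1000   0.01\n"
    "   1   2          N     50   1000   0.05\n"
))
genome = read_plink_table(io.StringIO(
    " FID1 IID1 FID2 IID2 RT EZ Z0 Z1 Z2 PI_HAT PHE DST PPC RATIO\n"
    "    1    1    1    2 UN NA 0.5 0.5 0.0 0.25 -1 0.8 1.0 NA\n"
))
print(low_call_rate_samples(imiss))
print(related_pairs(genome))
```

The flagged FID/IID pairs can be written one per line to a text file and passed to PLINK's --remove option.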

2.2. SNP QC

MAF Filtering: Remove SNPs with minor allele frequency (MAF) < 0.01.
Imputation Quality: Filter SNPs based on imputation quality scores (e.g., Rsq > 0.3).

PLINK Commands for SNP QC:

# Filter by MAF
plink --bfile imputed_data_plink --maf 0.01 --make-bed --out imputed_data_maf_filtered

# Filter by imputation quality (assuming you have a file with Rsq values)
plink --bfile imputed_data_maf_filtered --extract high_quality_snps.txt --make-bed --out imputed_data_high_quality
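
The high_quality_snps.txt list has to be built from the imputation quality file first. The sketch below shows one way, assuming a tab-delimited info file with a header and columns named SNP and Rsq. Those column names, like the function name select_high_quality_snps, are assumptions, so check them against the output of your imputation software.

```python
import csv
import io


def select_high_quality_snps(info_handle, min_rsq=0.3, id_col="SNP", rsq_col="Rsq"):
    """Return the IDs of SNPs whose imputation quality exceeds min_rsq.

    The column names are assumptions about the info file; rows without a
    numeric quality score are skipped.
    """
    reader = csv.DictReader(info_handle, delimiter="\t")
    keep = []
    for row in reader:
        try:
            rsq = float(row[rsq_col])
        except ValueError:
            continue  # no usable quality score for this SNP
        if rsq > min_rsq:
            keep.append(row[id_col])
    return keep


# In-memory stand-in for an imputation info file.
info = io.StringIO("SNP\tMAF\tRsq\nrs1\t0.20\t0.95\nrs2\t0.05\t0.10\nrs3\t0.30\t-\n")
snp_ids = select_high_quality_snps(info)
print(snp_ids)

# To produce the list used by --extract, write one ID per line:
# with open("high_quality_snps.txt", "w") as out:
#     out.write("\n".join(snp_ids) + "\n")
```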

Step 3: Association Analysis

3.1. Run Association Analysis

Use PLINK or other tools like SNPTEST or REGENIE for association analysis.

PLINK Command for Association Analysis:

plink --bfile imputed_data_high_quality --assoc --out association_results

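Using SNPTEST for Dosage Data:

SNPTEST takes the genotype file first and the sample file second. It also needs a test method and the name of the phenotype column in the sample file (phenotype below is a placeholder):

snptest -data imputed_data.dose imputed_data.sample -o snptest_results.txt -frequentist 1 -method expected -pheno phenotype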

3.2. Adjust for Covariates

Include covariates such as age, sex, and principal components (PCs) in your analysis.

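PLINK Command with Covariates:

PLINK 1.9 writes the .eigenvec file without a header line by default, so the first three PCs are selected by position with --covar-number rather than by name. The hide-covar modifier keeps only the SNP rows in the output.

plink --bfile imputed_data_high_quality --linear hide-covar --covar pca_results.eigenvec --covar-number 1-3 --out association_results_covariates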
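
Covariates such as age and sex usually live in a separate file from the PCs, so the two need to be combined first. The sketch below is a minimal example, assuming a whitespace-delimited covariate table with the header FID IID AGE SEX and a headerless .eigenvec file. The function name merge_covariates and the AGE and SEX column names are hypothetical.

```python
import io


def merge_covariates(covar_handle, eigenvec_handle, n_pcs=3):
    """Join per-sample covariates with the first n_pcs principal components.

    covar_handle: whitespace-delimited, header "FID IID <covariates...>" (assumed layout)
    eigenvec_handle: headerless .eigenvec rows of the form "FID IID PC1 PC2 ..."
    Returns a list of rows (header first); samples missing from either input are dropped.
    """
    pcs = {}
    for line in eigenvec_handle:
        fields = line.split()
        if fields:
            pcs[(fields[0], fields[1])] = fields[2:2 + n_pcs]

    covar_lines = [line.split() for line in covar_handle if line.strip()]
    header = covar_lines[0] + [f"PC{i}" for i in range(1, n_pcs + 1)]
    merged = [header]
    for fields in covar_lines[1:]:
        key = (fields[0], fields[1])
        if key in pcs:
            merged.append(fields + pcs[key])
    return merged


# In-memory stand-ins for a covariate table and pca_results.eigenvec.
covar = io.StringIO("FID IID AGE SEX\n1 1 54 1\n1 2 61 2\n")
eigenvec = io.StringIO("1 1 0.01 -0.02 0.03 0.5\n1 2 -0.04 0.05 0.06 0.1\n")
for row in merge_covariates(covar, eigenvec):
    print("\t".join(row))
```

Because the merged table has a header line, it can be passed to --covar and its columns selected by name, for example with --covar-name AGE,SEX,PC1,PC2,PC3.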

Step 4: Post-Analysis

4.1. Manhattan Plot and QQ Plot

Visualize your results using Manhattan and QQ plots to check for inflation and significant hits.

Using R to Generate Plots:

library(qqman)

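# Load association results
results <- read.table("association_results_covariates.assoc.linear", header=TRUE)

# Keep the additive SNP effect rows and drop tests without a p-value
results <- results[results$TEST == "ADD" & !is.na(results$P), ]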

# Generate Manhattan plot
manhattan(results, chr="CHR", bp="BP", p="P", snp="SNP")

# Generate QQ plot
qq(results$P)
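
The QQ plot gives a visual check for inflation, and the genomic inflation factor (lambda) summarizes it as a single number: the median observed test statistic divided by the median expected under the null. The sketch below is a minimal example, assuming two-sided p-values from 1-degree-of-freedom tests; the function name genomic_inflation is illustrative.

```python
from statistics import NormalDist, median


def genomic_inflation(p_values):
    """Estimate the genomic inflation factor from p-values, assuming 1-df tests."""
    norm = NormalDist()
    # For a 1-df test the chi-square statistic is the squared z-score, and a
    # two-sided p-value maps back to z through the normal quantile of p / 2.
    chi2_stats = [norm.inv_cdf(p / 2) ** 2 for p in p_values if 0 < p <= 1]
    # Median of the chi-square distribution with 1 df (about 0.455).
    expected_median = norm.inv_cdf(0.75) ** 2
    return median(chi2_stats) / expected_median


# Evenly spaced p-values mimic a well-calibrated analysis with no inflation.
uniform_p = [i / 1000 for i in range(1, 1000)]
print(round(genomic_inflation(uniform_p), 3))
```

Values close to 1 suggest little inflation. Values well above 1 can point to residual population stratification or relatedness, although polygenic traits in large samples also raise the statistic.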

4.2. Annotation of Significant SNPs

Annotate significant SNPs using tools like ANNOVAR or online resources like Ensembl.

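Using ANNOVAR:

ANNOVAR reads its own input format (chromosome, start, end, reference allele, alternate allele), so the significant SNPs must first be written to such a file. In the command below, significant_snps.avinput is a placeholder for that file and humandb/ for the directory holding the downloaded annotation databases:

annotate_variation.pl -geneanno -buildver hg19 -dbtype refGene -out annotated_results significant_snps.avinput humandb/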

Step 5: Interpretation and Reporting

Identify Significant Loci: Focus on SNPs with p-values below the genome-wide significance threshold (e.g., 5e-8).
Functional Annotation: Investigate the functional relevance of significant SNPs using databases like GTEx or RegulomeDB.
Replication: Consider replicating your findings in an independent cohort.
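
Picking out the significant loci can be scripted. The sketch below is a minimal example, assuming a whitespace-delimited results table with CHR, SNP, BP, and P columns, as in PLINK's association output; the function name significant_hits is illustrative.

```python
import io


def significant_hits(handle, threshold=5e-8):
    """Return (SNP, CHR, BP, P) tuples with P below threshold, most significant first."""
    lines = [line.split() for line in handle if line.strip()]
    header, rows = lines[0], lines[1:]
    hits = []
    for fields in rows:
        record = dict(zip(header, fields))
        try:
            p_value = float(record["P"])
        except ValueError:
            continue  # skip rows where the test was not performed (P is "NA")
        if p_value < threshold:
            hits.append((record["SNP"], record["CHR"], int(record["BP"]), p_value))
    return sorted(hits, key=lambda hit: hit[3])


# In-memory stand-in for an association results file.
table = io.StringIO(
    "CHR SNP BP A1 TEST NMISS BETA STAT P\n"
    "1 rs1 1000 A ADD 500 0.5 6.1 1e-9\n"
    "2 rs2 2000 G ADD 500 0.1 1.0 0.3\n"
    "3 rs3 3000 T ADD 500 NA NA NA\n"
    "4 rs4 4000 C ADD 500 0.6 7.0 2e-12\n"
)
for snp, chrom, bp, p_value in significant_hits(table):
    print(snp, chrom, bp, p_value)
```

Neighboring hits usually reflect the same signal through linkage disequilibrium, so the resulting list is a starting point for defining loci rather than a count of independent associations.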

Recent Tools and Software

REGENIE: Efficient for large-scale GWAS with imputed data.
SAIGE: Suitable for binary traits and large datasets.
LocusZoom: For visualizing GWAS results in specific genomic regions.

Conclusion

Analyzing imputed GWAS data involves several steps, from data preparation and QC to association analysis and interpretation. Using tools like PLINK, SNPTEST, and R, you can efficiently manage and analyze large-scale imputed GWAS data. Always follow best practices for QC and adjust for potential confounders to obtain reliable results.