Which Human Reference Genome Should I Use?
January 9, 2025
Choosing the right human reference genome is a critical decision in bioinformatics, especially for tasks like alignment, variant calling, and gene expression analysis. The choice depends on your specific research goals, the tools you plan to use, and the compatibility of existing datasets. Below is a step-by-step guide to help you decide which human reference genome to use and how to prepare it for analysis.
1. Understanding the Options
a. Major Releases
- GRCh37 (hg19):
  - Released in 2009.
  - Widely used in older studies and databases.
  - Still relevant for compatibility with legacy data.
- GRCh38 (hg38):
  - Released in 2013.
  - Includes updates and corrections to GRCh37.
  - Recommended for new studies due to improved accuracy and additional sequences.
b. Minor Releases
- Patches (e.g., GRCh38.p12):
  - Add fix patches (corrections) and novel patches (new alternate sequences) as separate scaffolds, leaving the chromosome coordinates of the primary assembly unchanged.
  - Useful for specific studies but not typically used for general alignment.
c. Sources
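- Genome Reference Consortium (GRC): The primary source for the reference genome; the sequences are distributed through NCBI.
- Ensembl: Provides annotated versions of the reference genome, with chromosomes named 1, 2, ..., X, Y, MT.
- UCSC: Offers versions like hg19 and hg38 with additional annotations, with chromosomes named chr1, chr2, ..., chrX, chrY, chrM.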
2. Factors to Consider
a. Compatibility with Existing Data
- Legacy Data: If you are working with older datasets, GRCh37 (hg19) might be necessary.
- New Data: For new studies, GRCh38 (hg38) is recommended.
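When it is unclear which build an existing dataset was aligned to, the length of chromosome 1 is a quick discriminator: 249,250,621 bp in GRCh37 and 248,956,422 bp in GRCh38. The following is a minimal sketch, assuming a FASTA index (.fai) whose first two columns are sequence name and length and whose chromosome 1 is named either chr1 or 1. The function name guess_build and the demo file are hypothetical.

```shell
#!/bin/sh
# guess_build FAI: report the assembly implied by the length of chromosome 1.
# Column 1 of a .fai file is the sequence name, column 2 is its length.
guess_build() {
    awk -F '\t' '
        $1 == "chr1" || $1 == "1" {
            if ($2 == 249250621)      build = "GRCh37"
            else if ($2 == 248956422) build = "GRCh38"
            else                      build = "unknown"
            found = 1
            exit
        }
        END { print (found ? build : "unknown") }
    ' "$1"
}

# Hypothetical demo index; only the first two columns matter here.
build_dir=$(mktemp -d)
printf 'chr1\t248956422\t6\t60\t61\n' > "$build_dir/demo.fai"
guess_build "$build_dir/demo.fai"
```

A file that still uses GenBank or RefSeq accessions as names would report "unknown" with this sketch, so the name test would need adapting in that case.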
b. Annotation Availability
- Ensure that the annotations (e.g., gene models, regulatory elements) you need are available for the chosen reference genome.
c. Tool Compatibility
- Some tools and pipelines are optimized for specific reference genomes. Check the documentation of the tools you plan to use.
3. Preparing the Reference Genome for Alignment
a. Downloading the Reference Genome
- GRC: Download from NCBI FTP.
- Ensembl: Download from Ensembl FTP.
b. Subsetting the Sequences
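- Primary Assembly: The assembled chromosomes 1-22, X, and Y; the mitochondrial genome (MT) is usually kept alongside them.
- Unlocalized/Unplaced Sequences: Scaffolds known to belong to a chromosome but not placed on it (unlocalized), or not assigned to any chromosome (unplaced).
- Alternate Loci: Alternative representations of specific regions; they are highly similar to the primary sequence and can cause ambiguous mapping.
Command (subset_ids.txt lists the sequence names to keep, one per line, exactly as they appear in the FASTA headers):
samtools faidx -r subset_ids.txt -o GRCh38_subset.fa GCA_000001405.15_GRCh38_genomic.fna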
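The list of sequence names can be derived from the NCBI assembly report instead of being typed by hand. The sketch below assumes the report layout used later in this guide (tab-separated, Sequence-Role in column 2, GenBank accession in column 5) and assumes you want to keep chromosomes, MT, and unlocalized/unplaced scaffolds while dropping alternate loci and patches. The function name list_primary_ids and the toy report rows are hypothetical; fields not needed here are filled with "na".

```shell
#!/bin/sh
# list_primary_ids REPORT: print GenBank accessions of the sequences to keep.
list_primary_ids() {
    awk -F '\t' '
        /^#/ { next }                  # skip comment lines in the report
        { sub(/\r$/, "") }             # drop a trailing carriage return, if any
        $2 == "assembled-molecule" ||
        $2 == "unlocalized-scaffold" ||
        $2 == "unplaced-scaffold" { print $5 }
    ' "$1"
}

# Toy report (illustrative rows, not copied from the real file).
subset_dir=$(mktemp -d)
{
    printf '# Sequence-Name\tSequence-Role\n'
    printf '1\tassembled-molecule\t1\tChromosome\tCM000663.2\tna\tna\tna\tna\tna\n'
    printf 'DEMO_UNLOC\tunlocalized-scaffold\t1\tChromosome\tDEMO_UNLOC.1\tna\tna\tna\tna\tna\n'
    printf 'DEMO_ALT\talt-scaffold\t1\tChromosome\tDEMO_ALT.1\tna\tna\tna\tna\tna\n'
    printf 'DEMO_FIX\tfix-patch\t1\tChromosome\tDEMO_FIX.1\tna\tna\tna\tna\tna\n'
} > "$subset_dir/report.txt"

list_primary_ids "$subset_dir/report.txt" > "$subset_dir/subset_ids.txt"
cat "$subset_dir/subset_ids.txt"
```

To keep only the assembled chromosomes and MT, the two scaffold roles could be removed from the condition.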
c. Masking the Pseudoautosomal Region (PAR)
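- Why: The PAR on chromosome Y is identical or nearly identical to the PAR on chromosome X, so reads from these regions map equally well to both chromosomes and receive ambiguous, low-quality alignments.
- How: Use bedtools to mask the PAR on chromosome Y. By default, bedtools maskfasta replaces the listed intervals with N. The sequence names in parY.bed must match the names in the FASTA at this step.
Command:
bedtools maskfasta -fi GRCh38_subset.fa -bed parY.bed -fo GRCh38_subset_masked.fa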
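The contents of parY.bed are not shown above. Here is a minimal sketch of how the file might be written, assuming the GRCh38 chromosome Y PAR coordinates published by the GRC (PAR1: 10,001-2,781,479; PAR2: 56,887,903-57,217,415; 1-based, inclusive) and assuming the FASTA still carries GenBank accessions at this step, so that chromosome Y is named CM000686.2. The helper write_par_bed is hypothetical.

```shell
#!/bin/sh
# write_par_bed SEQNAME OUTFILE: write the two chrY PAR intervals as BED.
# BED is 0-based and half-open, so each 1-based start is reduced by 1
# and each end is kept as is.
write_par_bed() {
    printf '%s\t10000\t2781479\n%s\t56887902\t57217415\n' "$1" "$1" > "$2"
}

# Assumption: chrY is still named by its GenBank accession at this step.
write_par_bed "CM000686.2" parY.bed
cat parY.bed
```

If the sequences have already been renamed, pass the new chromosome Y name (for example chrY or Y) instead. These coordinates apply to GRCh38 only; GRCh37 uses different PAR boundaries.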
d. Renaming the Sequences
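- Why: Ensure consistent naming conventions for compatibility with tools.
- How: Use awk to rename sequences based on the assembly report. The first pass maps each GenBank accession (column 5) to a new header built from the sequence name, GenBank accession, RefSeq accession, and UCSC-style name (columns 1, 5, 7, and 10); the second pass rewrites the FASTA headers. The command below also strips trailing carriage returns, in case the report uses Windows line endings, and leaves any header missing from the report unchanged instead of blanking it.
Command:
awk -v FS="\t" '{sub(/\r$/, "")} NR==FNR {header[">"$5] = ">"$1" "$5" "$7" "$10; next} /^>/ && ($0 in header) {$0 = header[$0]} 1' GCA_000001405.15_GRCh38_assembly_report.txt GRCh38_subset_masked.fa > GRCh38_alignment.fa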
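To see what the renaming produces, it can help to run the same idea on toy inputs first. This sketch wraps a slightly more defensive variant of the awk program in a hypothetical function, rename_fasta: it skips comment lines, strips carriage returns, and passes through headers that are absent from the report. The toy report row and FASTA records are illustrative only.

```shell
#!/bin/sh
# rename_fasta REPORT FASTA: rewrite FASTA headers using an assembly report.
rename_fasta() {
    awk -F '\t' '
        { sub(/\r$/, "") }                  # drop a trailing carriage return
        NR == FNR {                         # first file: the assembly report
            if ($0 !~ /^#/)
                header[">" $5] = ">" $1 " " $5 " " $7 " " $10
            next
        }
        /^>/ && ($0 in header) { $0 = header[$0] }   # known header: replace it
        { print }                           # all other lines pass through
    ' "$1" "$2"
}

# Toy inputs: one report row with a CRLF line ending, two FASTA records.
rename_dir=$(mktemp -d)
printf '# Sequence-Name\tSequence-Role\n' > "$rename_dir/report.txt"
printf '1\tassembled-molecule\t1\tChromosome\tCM000663.2\t=\tNC_000001.11\tPrimary Assembly\t248956422\tchr1\r\n' >> "$rename_dir/report.txt"
printf '>CM000663.2\nACGT\n>UNLISTED.1\nNNNN\n' > "$rename_dir/in.fa"

rename_fasta "$rename_dir/report.txt" "$rename_dir/in.fa"
```

Aligners generally take the first word of the header as the sequence name, so this layout yields Ensembl-style names (1, 2, ...). Placing column 10 first would give UCSC-style names (chr1, chr2, ...) instead.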
e. Indexing the Reference Genome
- Why: Required for alignment tools like bwa and samtools.
- How: Use samtools and bwa to index the reference genome.
Commands:
samtools faidx GRCh38_alignment.fa
bwa index GRCh38_alignment.fa
4. Handling Alternate Loci and Patches
a. Alternate Loci
- Why: Represent alternative sequences for specific regions.
- Considerations: Can cause ambiguous mapping if not handled properly.
- Recommendation: Exclude alternate loci unless specifically needed.
b. Patches
- Why: Include fixes and new sequences not yet in the primary assembly.
- Considerations: Similar to alternate loci, can cause ambiguous mapping.
- Recommendation: Exclude patches unless specifically needed.
5. Practical Tips
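a. Use Standard File Formats
- Keep sequences in FASTA (with the accompanying .fai index) and region lists in BED, the formats used in the steps above.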
b. Document Your Workflow
- Keep detailed records of the steps taken to prepare the reference genome, including the source files, the exact commands, and the tool versions.
c. Validate Your Reference Genome
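- Perform sanity checks to ensure the reference genome is correctly prepared, for example that it contains the expected sequences, that sequence names are unique, and that the PAR on chromosome Y is masked.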
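A few such checks need nothing more than awk and standard shell tools. The sketch below prints, for each sequence, its name, length, and number of N bases (a fully masked region shows up as a high N count), and lists any duplicated sequence names. The function names fasta_summary and duplicate_names and the toy FASTA are hypothetical.

```shell
#!/bin/sh
# fasta_summary FASTA: print "name <TAB> length <TAB> N-count" per sequence.
fasta_summary() {
    awk '
        /^>/ {
            if (name != "") print name "\t" len "\t" n
            name = substr($1, 2); len = 0; n = 0
            next
        }
        { len += length($0); n += gsub(/[Nn]/, "") }
        END { if (name != "") print name "\t" len "\t" n }
    ' "$1"
}

# duplicate_names FASTA: print sequence names that occur more than once.
duplicate_names() {
    grep '^>' "$1" | awk '{ print substr($1, 2) }' | sort | uniq -d
}

# Toy FASTA with two records; chrA carries four masked bases.
qc_dir=$(mktemp -d)
printf '>chrA demo\nACGTNN\nNNAC\n>chrB\nACGT\n' > "$qc_dir/toy.fa"

fasta_summary "$qc_dir/toy.fa"
duplicate_names "$qc_dir/toy.fa"
```

On a full genome, comparing the lengths against the assembly report and checking that the N count for chromosome Y increased after masking are two quick ways to confirm the earlier steps worked.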
6. Resources
- Genome Reference Consortium (GRC): GRC Website
- Ensembl: Ensembl Website
- UCSC Genome Browser: UCSC Website
- Biostars: Biostars Q&A
Conclusion
Choosing and preparing the right human reference genome is a foundational step in bioinformatics. By considering factors like compatibility, annotation availability, and tool requirements, you can ensure that your analyses are robust and reproducible. Following best practices for preparing the reference genome will save you time and prevent common pitfalls in downstream analyses.