Genome annotation tutorial

Comprehensive Genome Annotation: A Step-by-Step Guide

September 28, 2023 By admin

Installing MAKER on Linux

  1. Open a Terminal
    • You can open it by searching for “Terminal” in your application menu or using the shortcut Ctrl + Alt + T.
  2. Install Dependencies
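    • MAKER depends on several external programs. The commands below install only the basic build tools and Perl; MAKER also needs BioPerl, a BLAST implementation, RepeatMasker, Exonerate, and at least one gene predictor such as SNAP or Augustus. Package names and commands vary by Linux distribution.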
    shell
    sudo apt-get update
    sudo apt-get install -y build-essential gcc perl
  3. Download MAKER
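    • Navigate to the directory where you want to download MAKER, then download it using wget. MAKER releases are normally distributed through the Yandell lab website and may require registration, so you may need to substitute the download link you are given for the URL below.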
    shell
    wget http://yandell-lab.org/software/maker.tar.gz
  4. Extract MAKER
    • Extract the downloaded MAKER tarball.
    shell
    tar -zxvf maker.tar.gz
  5. Compile MAKER
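    • Navigate to the MAKER src directory, then configure and install MAKER with its Perl build script. Build.PL reports any missing prerequisites, and the installed executables are placed in maker/bin, which you should add to your PATH.
    shell
    cd maker/src
    perl Build.PL
    ./Build install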
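Before running MAKER, it can help to confirm that the external programs it calls are on your PATH. The following is a minimal sketch of such a check, not part of MAKER itself: the function name check_tools is hypothetical, and the tool names in the final call are assumptions about a typical setup that you should adjust to match your installation.

```shell
#!/bin/sh
# Report which of the given executables are missing from PATH.
# Prints one "missing: NAME" line per absent tool; returns 1 if any are missing.
check_tools() {
    rc=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool"
            rc=1
        fi
    done
    return "$rc"
}

# Assumed tool list for a typical MAKER setup; adjust to your installation.
check_tools perl blastn makeblastdb RepeatMasker exonerate snap ||
    echo "Install the tools listed above before running MAKER."
```

If nothing is printed, every listed tool was found.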

Installing MAKER on Windows

On Windows, you can use the Windows Subsystem for Linux (WSL) to install Linux-based software. To set up WSL, follow Microsoft's official WSL installation instructions.

Once WSL is set up, you can follow the Linux instructions mentioned above to install MAKER within the Linux subsystem.

Running MAKER

Once MAKER is installed, you can use it to annotate genomes. Here is a simplified outline of annotating a sample genome with MAKER.

  1. Prepare Input Files
    • Prepare your genome sequence in FASTA format.
    • If available, prepare protein sequences, ESTs, and other evidence in appropriate formats.
  2. Configure MAKER
    • MAKER is configured through control files. Running maker -CTL in your working directory generates maker_opts.ctl, maker_bopts.ctl, and maker_exe.ctl. The maker_opts.ctl file contains the main options and parameters; edit it to point to your input files and set other parameters as needed.
  3. Run MAKER
    • Once the configuration file is set up, run MAKER from the directory that contains the control files:
    shell
    maker
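Malformed input is a common cause of failed runs, so a quick check of the FASTA files before starting can save time. The sketch below counts records and lists duplicated sequence IDs using only grep and awk; the function names and the sample file contents are made up for illustration.

```shell
#!/bin/sh
# Count the FASTA records (header lines starting with '>') in a file.
count_fasta_records() {
    grep -c '^>' "$1"
}

# Print each sequence ID that occurs more than once.
# The ID is the header text up to the first whitespace, without the '>'.
duplicate_fasta_ids() {
    awk '/^>/ { id = substr($1, 2); if (seen[id]++ == 1) print id }' "$1"
}

# Demo on a tiny made-up file (hypothetical content, for illustration only).
workdir=$(mktemp -d)
cat > "$workdir/sample.fasta" <<'EOF'
>contig1 example
ACGTACGT
>contig2
GGCCTTAA
>contig1 repeated id
TTTT
EOF

echo "records: $(count_fasta_records "$workdir/sample.fasta")"
echo "duplicate IDs: $(duplicate_fasta_ids "$workdir/sample.fasta")"
```

Duplicate IDs are worth fixing before annotation, since downstream files are keyed on the sequence name.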

Annotation Process

  1. Annotating Eukaryotic Genome
    • For a eukaryotic genome, you would generally include protein evidence and might also include EST evidence. Edit the maker_opts.ctl file to include paths to these evidence files.
  2. Annotating Prokaryotic Genome
    • For prokaryotic genomes, protein evidence is usually sufficient. Set organism_type=prokaryotic in maker_opts.ctl and include the path to the protein evidence.

The walkthrough below assumes a hypothetical scenario in which you have sample genome sequences in FASTA format for a prokaryote and a eukaryote, plus a set of protein sequences to use as evidence.

Sample Files:

  • Prokaryotic Genome: prokaryote.fasta
  • Eukaryotic Genome: eukaryote.fasta
  • Protein Evidence: protein.fasta

Step by Step Guide:

1. Configure MAKER

  • Navigate to the directory that holds your MAKER control files and open maker_opts.ctl in a text editor.
    shell
    cd path_to_maker_directory
    nano maker_opts.ctl
  • Modify the following lines in the maker_opts.ctl file as per your scenario. For example:
    shell
    #-----Genome (these are always required)
    genome=prokaryote.fasta #genome sequence (fasta file or fasta embeded in GFF3 file)
    ...
    protein=protein.fasta #protein sequence file (only need one)
    ...

    For the eukaryotic genome, replace prokaryote.fasta with eukaryote.fasta and make sure organism_type is set to eukaryotic (the default) rather than prokaryotic.
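If you prefer to script these edits rather than make them by hand, the same changes can be applied with sed. This is a minimal sketch under stated assumptions: set_ctl_option is a hypothetical helper, the control file created here is an abbreviated stand-in rather than a full maker_opts.ctl, and values containing "|" or "&" would need escaping.

```shell
#!/bin/sh
# Set KEY=VALUE in a MAKER-style control file, keeping the trailing comment.
# Usage: set_ctl_option FILE KEY VALUE
set_ctl_option() {
    file=$1 key=$2 value=$3
    # Replace everything between "KEY=" and the first '#' (or end of line).
    sed "s|^${key}=[^#]*|${key}=${value} |" "$file" > "$file.tmp" &&
        mv "$file.tmp" "$file"
}

# Abbreviated stand-in for maker_opts.ctl (illustration only).
workdir=$(mktemp -d)
cat > "$workdir/maker_opts.ctl" <<'EOF'
#-----Genome (these are always required)
genome= #genome sequence
organism_type=eukaryotic #eukaryotic or prokaryotic
protein= #protein sequence file
EOF

set_ctl_option "$workdir/maker_opts.ctl" genome prokaryote.fasta
set_ctl_option "$workdir/maker_opts.ctl" organism_type prokaryotic
set_ctl_option "$workdir/maker_opts.ctl" protein protein.fasta
cat "$workdir/maker_opts.ctl"
```

Scripting the edits makes it easier to keep one configured control file per genome.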

2. Run MAKER

  • After configuring the maker_opts.ctl file, save it and close the text editor.
  • Run MAKER using the following command:
    shell
    maker

3. Inspect the Output

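  • After MAKER has finished running, the results are written to a directory named after the genome file with the suffix .maker.output (for example, prokaryote.maker.output), created where you ran MAKER. Results are stored per contig and can be merged into a single GFF3 file with MAKER's gff3_merge script. The GFF3 output can be viewed in genome browsers such as Apollo.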

4. Switching Between Prokaryotic and Eukaryotic Genomes

  • To switch from annotating a prokaryotic genome to annotating a eukaryotic genome, change the genome= line in maker_opts.ctl to point to the eukaryotic genome FASTA file, set organism_type back to eukaryotic if you changed it, and re-run MAKER.
    shell
    #-----Genome (these are always required)
    genome=eukaryote.fasta #genome sequence (fasta file or fasta embeded in GFF3 file)
    ...

5. Retraining MAKER (Optional)

  • If needed, after the initial run, you can use the output to retrain MAKER’s gene prediction algorithms for improved accuracy in subsequent runs. Follow the MAKER documentation for the detailed retraining process.

Detailed Step-by-Step Instruction:

Step 1: Configure MAKER

  • Navigate to the MAKER directory
    shell
    cd path_to_maker_directory
  • Open maker_opts.ctl for Editing
    shell
    nano maker_opts.ctl

    Inside the maker_opts.ctl, the key fields you might be interested in are:

    • genome: the path to your genome file.
    • protein: the path to your protein evidence.
    • est: the path to your EST evidence.
    • model_org: the model organism whose repeat library RepeatMasker uses for masking (it does not train the gene predictor).
  • Edit the Key Fields
    shell
    genome=prokaryote.fasta #For the prokaryotic genome
    # OR
    genome=eukaryote.fasta #For the eukaryotic genome

    protein=protein.fasta

  • Save and exit the editor: in nano, press Ctrl+X, then Y, then Enter.

Step 2: Run MAKER

  • Run MAKER in the Terminal
    shell
    maker

    The process might take a while, depending on the size of the genome and the complexity of the configuration.

Step 3: Review Outputs

  • After MAKER has finished running, outputs are written to a directory named after the genome file with the suffix .maker.output, inside the directory where you ran maker.
  • Navigate to the Output Directory and List the Outputs
    shell
    cd path_to_maker_output_directory
    ls

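    Look for the master datastore index log, which records where each contig's results are stored. Within those per-contig directories, the .gff files contain the annotations and the .fasta files contain the predicted transcript and protein sequences.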
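A quick way to get a first impression of a GFF3 file is to count its features by type. The sketch below does this with awk; count_gff3_features is a hypothetical helper and the sample records are made up, not real MAKER output.

```shell
#!/bin/sh
# Print "count<TAB>feature_type" for each feature type (column 3) in a GFF3 file.
# Comment lines and any embedded FASTA lines (fewer than 9 columns) are skipped.
count_gff3_features() {
    awk -F'\t' '!/^#/ && NF >= 9 { n[$3]++ } END { for (t in n) print n[t] "\t" t }' "$1" |
        sort -k2,2
}

# Made-up sample records, for illustration only.
workdir=$(mktemp -d)
{
    printf '##gff-version 3\n'
    printf 'contig1\tmaker\tgene\t100\t900\t.\t+\t.\tID=gene1\n'
    printf 'contig1\tmaker\tmRNA\t100\t900\t.\t+\t.\tID=mrna1;Parent=gene1\n'
    printf 'contig1\tmaker\texon\t100\t400\t.\t+\t.\tID=exon1;Parent=mrna1\n'
    printf 'contig1\tmaker\texon\t600\t900\t.\t+\t.\tID=exon2;Parent=mrna1\n'
} > "$workdir/sample.gff"

count_gff3_features "$workdir/sample.gff"
```

A gene count of zero, or a gene count far from expectations for the organism, is an early sign that the configuration or evidence needs another look.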

Step 4: Visualize Annotations

  • To visualize the annotations, you can use tools like Apollo or GBrowse. To load the GFF3 and associated FASTA files into Apollo:
    1. Launch Apollo.
    2. Load the genome: use the “Add Organisms” option and upload your FASTA file.
    3. Add annotations: use “Annotations” → “Load Annotations” to load your GFF3 file.

Step 5: Fine-Tuning and Retraining (if needed)

  • Review the generated annotations, compare them with the evidence, and manually adjust mis-predicted gene models in the Apollo browser.
  • If you have manually curated some gene models, you can use them to retrain the gene prediction models and re-run MAKER for improved annotations.

Step 6: Annotation of the Eukaryotic Genome

  • If you now wish to annotate the eukaryotic genome, repeat steps 1-5, changing the genome field in maker_opts.ctl to point to your eukaryotic genome file.
    shell
    genome=eukaryote.fasta
  • Then re-run MAKER, review the output, visualize the annotations, and fine-tune the results as needed.

Advanced Configuration

Step 7: Advanced Configuration in maker_opts.ctl

Once you’re familiar with the basic operations, delve into advanced configurations to refine your results further. Open maker_opts.ctl again.

shell
nano maker_opts.ctl

Key Configurations:

  • augustus_species: Specify species for Augustus.
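  • est2genome: Set to 1 to infer gene models directly from EST alignments. This is mainly useful for producing a first set of models before the ab initio predictors have been trained.
  • protein2genome: Set to 1 to infer gene models directly from protein alignments.
  • rmlib: The path to your repeat library in fasta format.
  • RepeatMasker: The location of the RepeatMasker executable is set in maker_exe.ctl rather than in maker_opts.ctl; which repeats are masked is controlled by model_org and rmlib.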
  • snaphmm: Path to the trained HMM file for SNAP.

Modify the configuration to suit your needs, then save and exit (in nano: Ctrl+X, then Y, then Enter).
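With many options in one file, it is easy to lose track of which ones are actually set. The following sketch prints only the options that currently have a value; show_set_options is a hypothetical helper, and the control file created here is an abbreviated stand-in with paraphrased comments.

```shell
#!/bin/sh
# Print the options that currently have a value in a MAKER-style control file.
show_set_options() {
    awk '
        /^#/ || !/=/ { next }            # skip comment-only lines and lines without "="
        {
            line = $0
            sub(/#.*/, "", line)         # drop the trailing comment
            sub(/[ \t]+$/, "", line)     # drop trailing whitespace
            key = line;   sub(/=.*/, "", key)
            value = line; sub(/^[^=]*=/, "", value)
            if (value != "") print key "=" value
        }
    ' "$1"
}

# Abbreviated stand-in for maker_opts.ctl (illustration only).
workdir=$(mktemp -d)
cat > "$workdir/maker_opts.ctl" <<'EOF'
#-----Genome (these are always required)
genome=eukaryote.fasta #genome sequence
est= #set of ESTs
protein=protein.fasta #protein sequence file
snaphmm= #SNAP HMM file
augustus_species= #Augustus species model
est2genome=0 #infer predictions from ESTs
EOF

show_set_options "$workdir/maker_opts.ctl"
```

Saving this output alongside each run gives a compact record of the configuration that produced it.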

Refining and Retraining

Step 8: Refining Predictions

  • After the initial run of MAKER, review the preliminary annotations using Apollo or another suitable genome browser.
  • If certain gene models are incorrect, you can manually refine them in the genome browser.

Step 9: Retraining Augustus and SNAP

  • Based on your refinements, you can retrain Augustus and SNAP for better gene predictions.
  • Use the curated gene models to retrain Augustus and SNAP following their respective manuals and training procedures.
  • Update the snaphmm and augustus_species in maker_opts.ctl to use the newly trained models.

Step 10: Run MAKER with Refined Models

shell
maker

Visualization and Analysis

Step 11: Visualization of Refined Annotations

  • Load the newly generated annotations into a genome browser and verify the quality of the refined gene models.
  • Manually adjust any remaining inaccuracies in the gene models.

Step 12: Analysis of Annotations

  • Examine the annotations for the distribution of gene lengths, exon lengths, intron lengths, etc.
  • Compare the annotated genes with known gene databases to validate the accuracy of the annotations.
  • Perform functional annotation using tools like Blast2GO to assign GO terms, EC numbers, and KEGG pathways to the annotated genes.
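As a starting point for the length distributions mentioned above, the count and mean length of a feature type can be computed directly from the GFF3 file. This is a minimal sketch: feature_length_stats is a hypothetical helper, the sample records are made up, and intron lengths are not covered because they must be derived from exon coordinates.

```shell
#!/bin/sh
# Print count and mean length for one feature type in a GFF3 file.
# Usage: feature_length_stats FILE TYPE
feature_length_stats() {
    awk -F'\t' -v type="$2" '
        !/^#/ && $3 == type { total += $5 - $4 + 1; n++ }   # GFF3 coordinates are inclusive
        END {
            if (n > 0) printf "%s\tn=%d\tmean_length=%.1f\n", type, n, total / n
            else printf "%s\tn=0\n", type
        }
    ' "$1"
}

# Made-up sample records, for illustration only.
workdir=$(mktemp -d)
{
    printf '##gff-version 3\n'
    printf 'contig1\tmaker\tgene\t100\t900\t.\t+\t.\tID=gene1\n'
    printf 'contig1\tmaker\tmRNA\t100\t900\t.\t+\t.\tID=mrna1;Parent=gene1\n'
    printf 'contig1\tmaker\texon\t100\t400\t.\t+\t.\tID=exon1;Parent=mrna1\n'
    printf 'contig1\tmaker\texon\t600\t900\t.\t+\t.\tID=exon2;Parent=mrna1\n'
} > "$workdir/sample.gff"

feature_length_stats "$workdir/sample.gff" gene
feature_length_stats "$workdir/sample.gff" exon
```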

Validation and Sharing

Step 13: Validation of Annotations

  • Compare your annotations to existing ones (if available) to validate them.
  • Use tools like BUSCO to assess the completeness of your annotations.

Step 14: Share Annotations

  • Once validated and refined, consider sharing your annotations with the community by submitting them to public databases.
  • Follow the submission guidelines provided by the respective databases.

Miscellaneous

Step 15: Backing Up Results

  • Regularly back up your results and configurations to prevent any loss of data.
shell
tar -czvf maker_results_backup.tar.gz path_to_maker_output_directory
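A backup is only useful if it can be read back, so it is worth verifying the archive right after creating it. The sketch below wraps the tar command with a readability check and a checksum; backup_dir is a hypothetical helper and the demo directory is a throwaway stand-in for real MAKER output.

```shell
#!/bin/sh
# Create a compressed backup of a directory, confirm the archive can be listed
# (listing fails on a corrupt archive), and write a checksum file next to it.
# Usage: backup_dir SOURCE_DIR ARCHIVE
backup_dir() {
    src=$1 archive=$2
    tar -czf "$archive" -C "$(dirname "$src")" "$(basename "$src")" &&
        tar -tzf "$archive" >/dev/null &&
        cksum "$archive" > "$archive.cksum"
}

# Demo with a throwaway directory standing in for real MAKER output.
workdir=$(mktemp -d)
mkdir "$workdir/demo.maker.output"
echo "placeholder" > "$workdir/demo.maker.output/example.gff"

backup_dir "$workdir/demo.maker.output" "$workdir/maker_results_backup.tar.gz"
tar -tzf "$workdir/maker_results_backup.tar.gz"
```

The checksum file lets you confirm later that a copied or restored archive is unchanged.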

Step 16: Regular Updates

  • Regularly check the MAKER website or repository for updates and new releases.
  • Update MAKER and its dependencies to their latest versions as needed.

Final Words:

  • While the above steps provide a comprehensive approach to using MAKER, every annotation project is unique and may require specific adjustments.
  • Refer to the MAKER manual regularly for detailed guidance and troubleshooting.

Step 7: Detailed Retraining Augustus

7.1 Generate Training Set

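After the initial MAKER runs, you will have a set of gene models that can serve as a training set. MAKER's maker2zff script exports them in ZFF format, which is the format SNAP is trained on; for Augustus, the models must be converted to its own training format as described in the Augustus documentation. Pass maker2zff the master datastore index log from the MAKER output directory. By default the script keeps only the better-supported models, whereas the -n option disables that filtering, so omit -n when you want a high-quality set.

shell
maker2zff -d <path_to_master_datastore_index_log>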
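Before training, it is worth checking how many gene models ended up in the exported training set, since a very small set trains poorly. The sketch below counts distinct models in a ZFF annotation file. It assumes the usual ZFF layout in which each sequence starts with a ">" header and each feature line ends with the gene model name; count_zff_genes is a hypothetical helper and the sample records are made up.

```shell
#!/bin/sh
# Count distinct gene models in a ZFF annotation file.
# Assumes ">" header lines per sequence and the model name as the last field
# of each feature line.
count_zff_genes() {
    awk '
        /^>/ { seq = $1; next }
        NF >= 4 { genes[seq SUBSEP $NF] = 1 }
        END { n = 0; for (g in genes) n++; print n }
    ' "$1"
}

# Made-up sample records, for illustration only.
workdir=$(mktemp -d)
cat > "$workdir/genome.ann" <<'EOF'
>contig1
Einit 201 325 MODEL1
Eterm 400 520 MODEL1
Esngl 900 1500 MODEL2
>contig2
Esngl 50 650 MODEL3
EOF

echo "gene models: $(count_zff_genes "$workdir/genome.ann")"
```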

7.2 Train Augustus

Follow the Augustus training procedures in its documentation to retrain it with the generated training set.

7.3 Update maker_opts.ctl

After retraining, update the augustus_species parameter in maker_opts.ctl with the new species model you trained.

Step 8: Detailed Repeat Library Creation

8.1 Build Custom Repeat Library

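Use tools like RepeatModeler to build a custom repeat library for your genome. RepeatModeler works on a database, which you first build from the genome FASTA file with its BuildDatabase utility.

shell
BuildDatabase -name my_genome <path_to_genome_fasta>
RepeatModeler -database my_genome -engine ncbi -pa 4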

8.2 Update maker_opts.ctl

Update the rmlib parameter in maker_opts.ctl to point to your new repeat library.

shell
rmlib=<path_to_your_custom_repeat_library>
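Before pointing rmlib at a new library, a summary of what it contains can be a useful sanity check. The sketch below tallies repeat families by class. It assumes FASTA headers of the form name#Class/Family, which is the convention RepeatModeler output typically follows; summarize_repeat_classes is a hypothetical helper and the sample library is made up.

```shell
#!/bin/sh
# Print "count<TAB>class" for each repeat class in a repeat library.
# Assumes headers of the form ">name#Class/Family".
summarize_repeat_classes() {
    awk '
        /^>/ {
            header = $1
            class = "Unclassified"
            if (index(header, "#") > 0) {
                class = substr(header, index(header, "#") + 1)
                sub(/\/.*/, "", class)   # keep the class, drop the family after "/"
            }
            n[class]++
        }
        END { for (c in n) print n[c] "\t" c }
    ' "$1" | sort -k2,2
}

# Made-up sample library, for illustration only.
workdir=$(mktemp -d)
cat > "$workdir/repeats.fa" <<'EOF'
>rnd-1_family-1#LTR/Gypsy
ACGTACGTAC
>rnd-1_family-2#DNA/hAT
GGGGCCCCAA
>rnd-1_family-3#LTR/Copia
TTTTAAAACC
>rnd-1_family-4#Unknown
ACACACACAC
EOF

summarize_repeat_classes "$workdir/repeats.fa"
```

A library dominated by unclassified families still works for masking, but it may be worth screening it for fragments of real genes first.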

Step 9: Run MAKER with Advanced Configurations

9.1 Run MAKER

After setting advanced configurations, run MAKER again. It might take a while depending on the genome size and hardware.

shell
maker

9.2 Review Advanced Output

Review the new output of MAKER with attention to regions where the repeat masking and refined Augustus predictions impact the annotations.

Step 10: Detailed Analysis and Validation

10.1 Functional Annotation

Use tools like InterProScan to perform functional annotation of the predicted proteins. Adding the -goterms option switches on the lookup of Gene Ontology (GO) terms for the matched entries.

shell
interproscan.sh -i <input_protein_sequences> -f tsv -goterms -o <output_file>
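To use the GO assignments downstream, it helps to reduce the TSV output to one protein and GO term per line. The sketch below assumes the standard InterProScan TSV layout, with the protein ID in column 1 and pipe-separated GO terms in column 14; extract_go_terms is a hypothetical helper and the sample rows are made up.

```shell
#!/bin/sh
# Print unique "protein<TAB>GO term" pairs from an InterProScan TSV file.
# Assumes the protein ID is in column 1 and GO terms are in column 14.
extract_go_terms() {
    awk -F'\t' '
        NF >= 14 && $14 != "" && $14 != "-" {
            n = split($14, terms, "|")
            for (i = 1; i <= n; i++) {
                term = terms[i]
                sub(/\(.*\)$/, "", term)   # drop a trailing "(source)" suffix if present
                print $1 "\t" term
            }
        }
    ' "$1" | sort -u
}

# Made-up sample rows, for illustration only.
workdir=$(mktemp -d)
{
    printf 'prot1\tmd5a\t300\tPfam\tPF00069\tProtein kinase domain\t10\t250\t1.0E-50\tT\t01-01-2023\tIPR000719\tProtein kinase domain\tGO:0004672|GO:0005524\n'
    printf 'prot2\tmd5b\t120\tPfam\tPF99999\tExample domain\t5\t100\t2.0E-10\tT\t01-01-2023\t-\t-\t-\n'
    printf 'prot1\tmd5a\t300\tSMART\tSM00220\tKinase\t10\t250\t3.0E-40\tT\t01-01-2023\tIPR000719\tProtein kinase domain\tGO:0005524(InterPro)\n'
} > "$workdir/interpro.tsv"

extract_go_terms "$workdir/interpro.tsv"
```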

10.2 Annotation Validation

Validate your annotations using tools like BUSCO, or by comparing them with a closely related species that has a well-annotated genome.

shell
busco -i <input_sequences> -l <lineage_dataset> -o <output_directory> -m <mode>
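BUSCO reports its result as a one-line summary string, and pulling the completeness figure out of it makes it easy to compare runs. The sketch below assumes the summary has the usual C:…%[S:…%,D:…%],F:…%,M:…%,n:… form; busco_complete_pct is a hypothetical helper and the numbers in the sample file are made up.

```shell
#!/bin/sh
# Extract the "complete" percentage from a file containing a BUSCO summary line.
busco_complete_pct() {
    sed -n 's/.*C:\([0-9.]*\)%.*/\1/p' "$1" | head -n 1
}

# Made-up summary line, for illustration only.
workdir=$(mktemp -d)
printf '%s\n' 'C:95.3%[S:93.0%,D:2.3%],F:2.0%,M:2.7%,n:255' > "$workdir/short_summary.txt"

echo "complete BUSCOs: $(busco_complete_pct "$workdir/short_summary.txt")%"
```

Tracking this value across successive MAKER rounds shows whether retraining is improving the gene set.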

Step 11: Final Adjustments and Updates

11.1 Final Review and Adjustment

Perform a final review of all annotated sequences and manually adjust any regions of uncertainty or conflicting evidence.

11.2 Update to Latest Versions

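Check regularly for new versions of MAKER and its dependencies to benefit from enhancements and bug fixes. MAKER has no built-in update command; to upgrade, download and unpack the new release, then rebuild it from its src directory.

shell
cd maker/src
perl Build.PL
./Build install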

Step 12: Sharing and Backing Up

12.1 Submit Annotations to Public Databases

Consider submitting your refined and validated annotations to relevant public databases to contribute to the scientific community.

12.2 Regular Backups

Ensure regular backups of your refined annotations, custom repeat libraries, trained models, and configuration files to prevent data loss.

shell
tar -czvf maker_refined_backup.tar.gz <path_to_refined_annotation_and_related_files>

Conclusion

By now, you should have achieved a refined and validated set of annotations using MAKER with both basic and advanced configurations. Always refer back to the official MAKER documentation for any specific or advanced features you might need during the annotation process.

 

Post-Annotation Steps and Insights

Step 13: Downstream Analyses

13.1 Comparative Genomics

After annotations, you may want to compare gene content, synteny, or evolutionary relationships with other genomes.

Tools:

  • MCScanX for synteny analysis.
  • OrthoFinder or OrthoMCL for orthologous gene cluster identification.

13.2 Pathway Analysis

Identify which metabolic pathways are present, absent, or expanded in your genome.

13.3 Structural Variant Analysis

Investigate structural variations using your annotations to provide context.

Tools:

  • SnpEff can annotate and predict effects of variants on genes.

Step 14: Advanced Visualization

14.1 Complex Genomic Visualizations

For enhanced visuals or if dealing with large datasets:

  • JBrowse for smooth zooming and panning, or
  • IGV (Integrative Genomics Viewer) for viewing local assemblies.

14.2 Interactive Exploration

Explore the data interactively and collaboratively with colleagues.

Step 15: Additional Training Data

15.1 External Datasets

Incorporate RNA-Seq data, proteomics data, or any other experimental data to refine gene models.

15.2 Transcript Assembly

Use tools like Trinity or StringTie to assemble RNA-Seq data into transcript sequences. These can be used as additional evidence in MAKER.

Step 16: Population Genomics

16.1 Genomic Variations

Use your annotations to interpret SNPs and indels called against the genome, for example by determining which genes and coding regions they affect. This can provide insights for population genetics or breeding programs.

Tools:

  • VCFtools or BCFtools for variant analysis.

16.2 Population Structure

Determine the genetic structure of different populations based on your annotated genome.

Tools:

  • STRUCTURE or ADMIXTURE for ancestry inference.

Step 17: Community Involvement

17.1 Community Annotations

Engage the scientific community to manually curate and refine the annotations. Platforms like WebApollo allow collaborative annotations.

17.2 Feedback Loop

Iterate based on community feedback and release updated annotations periodically.

Step 18: Data Sharing and Integration

18.1 Databases

Integrate your annotations and supporting data into a genome database, for example one built on the Chado schema.

18.2 Public Repositories

Submit your final annotated genomes to repositories such as NCBI GenBank or the European Nucleotide Archive (ENA) at EMBL-EBI.

Step 19: Continuous Monitoring

19.1 Literature

Monitor related research publications to incorporate any new evidence or findings into your annotations.

19.2 Software Updates

Regularly update annotation tools to capitalize on improved algorithms or new features.

Step 20: Documentation and Reporting

20.1 Documentation

Maintain comprehensive documentation of the annotation process, decisions made, tools used, and their versions.

20.2 Reporting

Publish your methodology and findings to share your work and provide the community with insights into your annotation process.

Wrap-Up:

Genome annotation is an iterative and ever-evolving process. The steps and tools highlighted above offer a foundation, but research projects may necessitate specific adjustments. Engaging with the scientific community, continuous learning, and iterative refinement are vital to achieving high-quality, reliable genome annotations.

Step 21: In-Depth Annotation Refinement

21.1 Utilizing Additional Evidence

Incorporate multiple types of experimental evidence, such as proteomic or metabolomic data, to refine your annotations.

21.2 Manual Curation

Carry out detailed manual curation of the predicted gene models, verifying intron-exon boundaries, UTRs, and alternative splice variants.

Step 22: Advanced Comparative Genomics

22.1 Phylogenomic Analysis

Conduct extensive phylogenomic analyses to study the evolutionary relationships, examining gene family expansions, contractions, and orthologous group evolution.

22.2 Pan-Genome Analysis

Build a pan-genome to study the genomic diversity within a species, focusing on core and accessory genome components.

Step 23: Functional Annotation Refinement

23.1 Enrichment Analysis

Perform Gene Ontology (GO) enrichment and KEGG pathway analysis to interpret the biological significance of your annotated genes.

23.2 Protein-Protein Interaction

Investigate potential protein-protein interactions to get insights into the functional networks within the cell.

Step 24: Community Engagement & Collaboration

24.1 Workshops and Training

Organize workshops and training sessions to facilitate the use of your annotated genome by other researchers and to gather feedback.

24.2 Collaborative Platforms

Leverage platforms like GitHub to make your annotation project collaborative, allowing others to contribute to refinement and curation.

Step 25: Continuous Annotation Improvement

25.1 Iterative Annotation Refinement

Regularly update annotations based on new evidence, community feedback, and advancements in annotation methodologies.

25.2 Annotation Versioning

Maintain version histories of annotations to allow users to trace the modifications and refinements.

Step 26: Enhanced Visualization & Exploration

26.1 Custom Genome Browser

Develop a custom genome browser for your annotated genome, offering advanced features and facilitating exploration for other researchers.

26.2 Multi-Omics Integration

Integrate multi-omics data into the genome browser, enabling researchers to overlay various types of data onto the genome.

Step 27: Advanced Analysis Integrations

27.1 Integrated “Omics” Analysis

Integrate your annotations with transcriptomic, proteomic, and metabolomic data to conduct a comprehensive analysis of the organism.

27.2 Systems Biology Modeling

Use the refined annotations to build systems biology models, facilitating the understanding of the organism at a systems level.

Conclusion:

The journey of genome annotation doesn’t end with just predicting genes; it’s a continuous iterative process of refinement and collaboration, enriched by multiple layers of analyses and diverse types of biological data. By engaging deeply with each step and actively involving the scientific community, you can ensure that your genome annotations are robust, comprehensive, and valuable to ongoing and future biological research.
