Comprehensive Genome Annotation: A Step-by-Step Guide
September 28, 2023
Installing MAKER on Linux
- Open a Terminal
- You can open it by searching for “Terminal” in your application menu or by pressing Ctrl + Alt + T.
- Install Dependencies
- MAKER needs a number of software dependencies. The commands below install some of the basic build tools; the exact commands vary with your Linux distribution. MAKER's external programs (such as BLAST, RepeatMasker, Exonerate, and SNAP) must be installed separately.
sudo apt-get update
sudo apt-get install -y build-essential gcc perl
- Download MAKER
- Navigate to the directory where you want to download MAKER, then download it using wget.
wget http://yandell-lab.org/software/maker.tar.gz
- Extract MAKER
- Extract the downloaded MAKER tarball.
tar -zxvf maker.tar.gz
- Compile MAKER
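- Navigate to the MAKER source directory, then configure and install the software. MAKER is built with Perl's Build script rather than with make.
cd maker/src
perl Build.PL
./Build install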
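Before building, it can help to confirm which tools are already on your PATH. The sketch below uses a hypothetical helper, check_tools, which is not part of MAKER. The executable names in the second call are assumptions about how BLAST, RepeatMasker, Exonerate, SNAP, and Augustus are typically installed, and they may differ on your system.

```shell
# check_tools: hypothetical helper, not part of MAKER.
# Prints one line per tool and succeeds only if all of them are on PATH.
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            printf 'found    %s\n' "$tool"
        else
            printf 'MISSING  %s\n' "$tool"
            missing=$((missing + 1))
        fi
    done
    # Succeed only when every tool was found.
    [ "$missing" -eq 0 ]
}

# Build tools covered by the apt-get commands above.
check_tools perl gcc make || echo "Install the missing build tools first."

# External programs MAKER commonly relies on (executable names are assumptions).
check_tools blastn RepeatMasker exonerate snap augustus ||
    echo "Some annotation programs are missing or not on PATH."
```

Anything reported as MISSING needs to be installed, or added to PATH, before MAKER can use it.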
Installing MAKER on Windows
For Windows, you can use Windows Subsystem for Linux (WSL) to install Linux-based software. To set up WSL, follow Microsoft's official WSL installation instructions.
Once WSL is set up, you can follow the Linux instructions mentioned above to install MAKER within the Linux subsystem.
Running MAKER
Once MAKER is installed, you can use it to annotate genomes. Here’s a simplified outline for annotating a sample genome with MAKER.
- Prepare Input Files
- Configure MAKER
- MAKER reads its settings from control files. Running maker -CTL in your working directory generates them, including maker_opts.ctl, which holds the main options and parameters. Edit this file to point to your input files and set other parameters as needed.
- Run MAKER
- Once the configuration file is set up, run MAKER from the directory that contains the control files:
maker
Annotation Process
- Annotating Eukaryotic Genome
- For a eukaryotic genome, you would generally include protein evidence and might also include EST evidence. Edit the maker_opts.ctl file to include paths to these evidence files.
- Annotating Prokaryotic Genome
- For prokaryotic genomes, protein evidence is usually sufficient. Edit maker_opts.ctl to include the path to the protein evidence, and set organism_type=prokaryotic (the default is eukaryotic).
To run MAKER with a sample genome sequence for both prokaryotic and eukaryotic genomes, let’s consider a hypothetical scenario where you have sample genome sequences in FASTA format and some hypothetical protein sequences as evidence.
Sample Files:
- Prokaryotic Genome: prokaryote.fasta
- Eukaryotic Genome: eukaryote.fasta
- Protein Evidence: protein.fasta
Step by Step Guide:
1. Configure MAKER
- Navigate to the MAKER directory and locate the maker_opts.ctl file. Open this file with a text editor to modify it.
cd path_to_maker_directory
nano maker_opts.ctl
- Modify the following lines in the maker_opts.ctl file as per your scenario. For example:
#-----Genome (these are always required)
genome=prokaryote.fasta #genome sequence (fasta file or fasta embeded in GFF3 file)
...
protein=protein.fasta #protein sequence file (only need one)
...
For the eukaryotic genome, replace prokaryote.fasta with eukaryote.fasta.
2. Run MAKER
- After configuring the maker_opts.ctl file, save it and close the text editor.
- Run MAKER using the following command:
maker
3. Inspect the Output
- After MAKER has finished running, you will find the results in the base directory where you have run MAKER. The output annotations are generally in GFF3 format which can be viewed with genome browsers like Apollo.
4. Switching Between Prokaryotic and Eukaryotic Genomes
- To switch from annotating the prokaryotic genome to the eukaryotic one, change the genome= line in the maker_opts.ctl file to point to the eukaryotic genome FASTA file, set organism_type back to eukaryotic if you changed it, and re-run MAKER.
#-----Genome (these are always required)
genome=eukaryote.fasta #genome sequence (fasta file or fasta embeded in GFF3 file)
...
5. Retraining MAKER (Optional)
- If needed, after the initial run, you can use the output to retrain MAKER’s gene prediction algorithms for improved accuracy in subsequent runs. Follow the MAKER documentation for the detailed retraining process.
Detailed Step-by-Step Instruction:
Step 1: Configure MAKER
- Navigate to the MAKER directory
cd path_to_maker_directory
- Open maker_opts.ctl for Editing
nano maker_opts.ctl
Inside maker_opts.ctl, the key fields you might be interested in are:
- genome: the path to your genome file.
- protein: the path to your protein evidence.
- est: the path to your EST evidence.
- model_org: the model organism RepeatMasker uses to select the repeats it masks.
- Edit the Key Fields
genome=prokaryote.fasta #For the prokaryotic genome
OR
genome=eukaryote.fasta #For the eukaryotic genome
protein=protein.fasta
- Save & Exit the Editor
CTRL+X, then Y, then Enter
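For repeated runs, the same fields can also be set without opening an editor. The following is a minimal sketch using sed. set_ctl_option is a hypothetical helper, not a MAKER command, and it assumes each option sits on its own line in the form key=value #comment, with values that contain no spaces and none of the characters | or &.

```shell
# set_ctl_option FILE KEY VALUE
# Hypothetical helper: rewrites the value of KEY in a MAKER control file
# and leaves the trailing "#comment" untouched. A copy of the original
# file is kept as FILE.bak.
set_ctl_option() {
    ctl_file=$1
    ctl_key=$2
    ctl_value=$3
    # Match "KEY=" at the start of a line and replace everything up to
    # the first space or '#' with the new value.
    sed -i.bak -E "s|^(${ctl_key}=)[^ #]*|\1${ctl_value}|" "$ctl_file"
}

# Apply the edits from this step, but only if the control file is present.
if [ -f maker_opts.ctl ]; then
    set_ctl_option maker_opts.ctl genome prokaryote.fasta
    set_ctl_option maker_opts.ctl protein protein.fasta
fi
```

Because the pattern is anchored on `key=` at the start of a line, setting est does not touch est2genome, and setting protein does not touch protein2genome.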
Step 2: Run MAKER
- Run MAKER in the Terminal
maker
The process might take a while, depending on the size of the genome and the complexity of the configuration.
Step 3: Review Outputs
- After MAKER has finished running, outputs are generated in the output directory, a folder named after the genome file with a .maker.output suffix, usually created where you ran maker.
- Navigate to the Output Directory and List the Outputs
cd path_to_maker_output_directory
ls
Look for files with the extensions .gff and .fasta, which contain the annotations and sequences. MAKER writes these per contig; its gff3_merge and fasta_merge scripts combine them into genome-wide files. Feature tables (.tbl), if you need them, are typically produced by a separate conversion step.
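Locating and merging the per-contig results can be scripted. This sketch assumes MAKER's usual naming, in which the output directory and the master datastore index log take their base name from the genome file. The two helper functions are hypothetical; gff3_merge and fasta_merge are the merge scripts that ship with MAKER.

```shell
# Hypothetical helpers that derive MAKER's output paths from a genome file name.
maker_output_dir() {
    base=$(basename "$1")
    printf '%s.maker.output\n' "${base%.*}"
}

maker_index_log() {
    base=$(basename "$1")
    base=${base%.*}
    printf '%s.maker.output/%s_master_datastore_index.log\n' "$base" "$base"
}

index_log=$(maker_index_log prokaryote.fasta)

# Merge only if MAKER's scripts are installed and a finished run exists.
if command -v gff3_merge >/dev/null 2>&1 && [ -f "$index_log" ]; then
    gff3_merge -d "$index_log"     # combined GFF3 annotations
    fasta_merge -d "$index_log"    # combined transcript and protein FASTA
else
    echo "No MAKER run found; expected index log at $index_log"
fi
```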
Step 4: Visualize Annotations
- To visualize the annotations, you can use tools like Apollo or GBrowse. To load the GFF3 and associated FASTA files into Apollo:
- Launch Apollo.
- Load the Genome: Use the “Add Organisms” option and upload your FASTA file.
- Add Annotations: Use “Annotations” → “Load Annotations” to load your GFF3 file.
Step 5: Fine-Tuning and Retraining (if needed)
- Review the generated annotations, compare them with the evidence, and manually adjust mis-predicted gene models in the Apollo browser.
- If you have manually curated some gene models, you can use them to retrain the gene prediction models and re-run MAKER for improved annotations.
Step 6: Annotation of the Eukaryotic Genome
- If you now wish to annotate the eukaryotic genome, repeat steps 1-5, altering the genome field in maker_opts.ctl to point to your eukaryotic genome file.
genome=eukaryote.fasta
- Then, re-run MAKER, review the output, visualize the annotations, and fine-tune the results as needed.
Advanced Configuration
Step 7: Advanced Configuration in maker_opts.ctl
Once you’re familiar with the basic operations, delve into advanced configurations to refine your results further. Open maker_opts.ctl again.
nano maker_opts.ctl
Key Configurations:
- augustus_species: the species model Augustus should use.
- est2genome: set to 1 to infer gene predictions directly from EST alignments.
- protein2genome: set to 1 to infer gene predictions directly from protein alignments.
- rmlib: the path to your repeat library in FASTA format.
- RepeatMasker: the path to the RepeatMasker executable, set in maker_exe.ctl rather than maker_opts.ctl; what gets masked is controlled by model_org and rmlib.
- snaphmm: the path to the trained HMM file for SNAP.
Modify configurations as per your needs, save, and exit.
CTRL+X then Y and Enter
Refining and Retraining
Step 8: Refining Predictions
- After the initial run of MAKER, review the preliminary annotations using Apollo or another suitable genome browser.
- If certain gene models are incorrect, you can manually refine them in the genome browser.
Step 9: Retraining Augustus and SNAP
- Based on your refinements, you can retrain Augustus and SNAP for better gene predictions.
- Use the curated gene models to retrain Augustus and SNAP following their respective manuals and training procedures.
- Update snaphmm and augustus_species in maker_opts.ctl to use the newly trained models.
Step 10: Run MAKER with Refined Models
maker
Visualization and Analysis
Step 11: Visualization of Refined Annotations
- Load the newly generated annotations into a genome browser and verify the quality of the refined gene models.
- Manually adjust any remaining inaccuracies in the gene models.
Step 12: Analysis of Annotations
- Examine the annotations for the distribution of gene lengths, exon lengths, intron lengths, etc.
- Compare the annotated genes with known gene databases to validate the accuracy of the annotations.
- Perform functional annotation using tools like Blast2GO to assign GO terms, EC numbers, and KEGG pathways to the annotated genes.
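A first pass at the length distributions can be taken directly from the GFF3 file. The sketch below is a hypothetical helper that reports, for each feature type, the number of features and their mean length. It assumes a standard nine-column, tab-separated GFF3 file. Introns are usually implicit in GFF3, so intron lengths would have to be derived from neighbouring exons and are not covered here.

```shell
# feature_length_summary GFF3_FILE
# Hypothetical helper: prints "type<TAB>count<TAB>mean_length" per feature type.
feature_length_summary() {
    awk -F '\t' '
        # Skip comments, directives, and any embedded FASTA lines.
        !/^#/ && NF >= 9 {
            len = $5 - $4 + 1          # GFF3 coordinates are 1-based, inclusive
            count[$3]++
            total[$3] += len
        }
        END {
            for (type in count)
                printf "%s\t%d\t%.1f\n", type, count[type], total[type] / count[type]
        }' "$1" | sort
}

# Usage (the file name is a placeholder for your merged GFF3 file):
# feature_length_summary <merged_annotations.gff>
```

MAKER's GFF3 output also carries evidence alignments, so expect rows for those feature types alongside gene, mRNA, exon, and CDS.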
Validation and Sharing
Step 13: Validation of Annotations
- Compare your annotations to existing ones (if available) to validate them.
- Use tools like BUSCO to assess the completeness of your annotations.
Step 14: Share Annotations
- Once validated and refined, consider sharing your annotations with the community by submitting them to public databases.
- Follow the submission guidelines provided by the respective databases.
Miscellaneous
Step 15: Backing Up Results
- Regularly back up your results and configurations to prevent any loss of data.
tar -czvf maker_results_backup.tar.gz path_to_maker_output_directory
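To keep several backups side by side, the archive name can carry a timestamp. This is a minimal sketch; backup_dir is a hypothetical helper, and the archive naming scheme is only one possible convention.

```shell
# backup_dir SOURCE_DIR [DEST_DIR]
# Hypothetical helper: writes SOURCE_DIR into a timestamped .tar.gz in
# DEST_DIR (default: current directory), checks that the archive can be
# listed, and prints the archive path.
backup_dir() {
    src=$1
    dest_dir=${2:-.}
    archive="$dest_dir/$(basename "$src")_$(date +%Y%m%d_%H%M%S).tar.gz"
    # -C stores paths relative to the parent of SOURCE_DIR.
    tar -czf "$archive" -C "$(dirname "$src")" "$(basename "$src")" &&
        tar -tzf "$archive" >/dev/null &&
        printf '%s\n' "$archive"
}

# Usage (the path is a placeholder):
# backup_dir path_to_maker_output_directory /path/to/backups
```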
Step 16: Regular Updates
- Regularly check the MAKER website or repository for updates and new releases.
- Update MAKER and its dependencies to their latest versions as needed.
Final Words:
- While the above steps provide a comprehensive approach to using MAKER, every annotation project is unique and may require specific adjustments.
- Refer to the MAKER Manual regularly for detailed insights and troubleshooting.
Step 7: Detailed Retraining Augustus
7.1 Generate Training Set
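After the initial MAKER runs, you will have a set of annotations from which a training set can be drawn. Extract them with maker2zff, pointing -d at the master datastore index log inside the MAKER output directory. This writes ZFF files (genome.ann and genome.dna), which SNAP can train on directly and which must be converted before Augustus can use them.
maker2zff -n -d <maker_output_directory>/<base>_master_datastore_index.log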
7.2 Train Augustus
Follow the Augustus training procedures in its documentation to retrain it with the generated training set.
7.3 Update maker_opts.ctl
After retraining, update the augustus_species parameter in maker_opts.ctl with the new species model you trained.
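The ZFF files written by maker2zff are also the input that SNAP's training tools expect. The sketch below covers the SNAP side of retraining only, following the commonly documented fathom, forge, and hmm-assembler.pl sequence. train_snap is a hypothetical wrapper, and my_genome is a placeholder model name. Augustus training is a separate procedure described in its own documentation.

```shell
# train_snap MODEL_NAME [WORKDIR]
# Hypothetical wrapper around SNAP's training tools. Expects genome.ann and
# genome.dna (from maker2zff) in WORKDIR and writes MODEL_NAME.hmm there.
train_snap() {
    model_name=$1
    workdir=${2:-.}
    (
        cd "$workdir" || exit 1
        [ -f genome.ann ] && [ -f genome.dna ] || {
            echo "genome.ann/genome.dna not found; run maker2zff first" >&2
            exit 1
        }
        # Categorize gene models, keeping 1000 bp of flanking sequence.
        fathom -categorize 1000 genome.ann genome.dna &&
        # Export the unique genes on the plus strand.
        fathom -export 1000 -plus uni.ann uni.dna &&
        # Estimate model parameters, then assemble the HMM.
        forge export.ann export.dna &&
        hmm-assembler.pl "$model_name" . > "$model_name.hmm"
    )
}

# Train only when the ZFF files are present in the current directory.
if [ -f genome.ann ] && [ -f genome.dna ]; then
    train_snap my_genome
fi
```

The resulting my_genome.hmm is the file that the snaphmm parameter in maker_opts.ctl should point to.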
Step 8: Detailed Repeat Library Creation
8.1 Build Custom Repeat Library
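Use tools like RepeatModeler to build a custom repeat library for your genome. RepeatModeler works on a database built from the genome FASTA file with its BuildDatabase utility, so create that first. Option names differ between RepeatModeler versions, so check the help output of your installation.
BuildDatabase -name my_genome <genome_fasta>
RepeatModeler -database my_genome -engine ncbi -pa 4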
8.2 Update maker_opts.ctl
Update the rmlib parameter in maker_opts.ctl to point to your new repeat library.
rmlib=<path_to_your_custom_repeat_library>
Step 9: Run MAKER with Advanced Configurations
9.1 Run MAKER
After setting advanced configurations, run MAKER again. It might take a while depending on the genome size and hardware.
maker
9.2 Review Advanced Output
Review the new output of MAKER with attention to regions where the repeat masking and refined Augustus predictions impact the annotations.
Step 10: Detailed Analysis and Validation
10.1 Functional Annotation
Use tools like InterProScan to perform functional annotation of the predicted genes and assign them GO terms.
interproscan.sh -i <input_protein_sequences> -f tsv -o <output_file>
10.2 Annotation Validation
Validate your annotations using tools like BUSCO or by comparison with a closely related species that has a well-annotated genome.
busco -i <input_sequences> -l <lineage_dataset> -o <output_directory> -m <mode>
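BUSCO's short summary reports its results on a single line of the form C:..%[S:..%,D:..%],F:..%,M:..%,n:... Assuming that layout, which may vary between BUSCO versions, the completeness figure can be pulled out for logging or comparison between runs. busco_complete_pct is a hypothetical helper, and the default file name below is a placeholder.

```shell
# busco_complete_pct SUMMARY_FILE
# Hypothetical helper: prints the "C:" (complete BUSCOs) percentage from a
# BUSCO short summary file, without the percent sign.
busco_complete_pct() {
    sed -n -E 's/.*C:([0-9.]+)%.*/\1/p' "$1" | head -n 1
}

# Placeholder path; point this at the short summary file of your BUSCO run.
summary=${1:-short_summary.txt}
if [ -f "$summary" ]; then
    echo "Complete BUSCOs: $(busco_complete_pct "$summary")%"
fi
```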
Step 11: Final Adjustments and Updates
11.1 Final Review and Adjustment
Perform a final review of all annotated sequences and manually adjust any regions of uncertainty or conflicting evidence.
11.2 Update to Latest Versions
Check for new releases of MAKER and its dependencies regularly to benefit from enhancements and bug fixes. To upgrade MAKER, download and extract the new release, then rebuild it from its src directory.
cd maker/src
perl Build.PL
./Build install
Step 12: Sharing and Backing Up
12.1 Submit Annotations to Public Databases
Consider submitting your refined and validated annotations to relevant public databases to contribute to the scientific community.
12.2 Regular Backups
Ensure regular backups of your refined annotations, custom repeat libraries, trained models, and configuration files to prevent data loss.
tar -czvf maker_refined_backup.tar.gz <path_to_refined_annotation_and_related_files>
Conclusion
By now, you should have achieved a refined and validated set of annotations using MAKER with both basic and advanced configurations. Always refer back to the official MAKER documentation for any specific or advanced features you might need during the annotation process.
Post-Annotation Steps and Insights
Step 13: Downstream Analyses
13.1 Comparative Genomics
After annotations, you may want to compare gene content, synteny, or evolutionary relationships with other genomes.
Tools:
- MCScanX for synteny analysis.
- OrthoFinder or OrthoMCL for orthologous gene cluster identification.
13.2 Pathway Analysis
Identify which metabolic pathways are present, absent, or expanded in your genome.
Tools:
- KEGG for pathway mapping.
- Pathway Tools to construct a metabolic model.
13.3 Structural Variant Analysis
Investigate structural variations using your annotations to provide context.
Tools:
- SnpEff can annotate variants and predict their effects on genes.
Step 14: Advanced Visualization
14.1 Complex Genomic Visualizations
For enhanced visuals or if dealing with large datasets:
- JBrowse for smooth zooming and panning, or IGV (Integrative Genomics Viewer) for viewing local assemblies.
14.2 Interactive Exploration
Explore data interactively and collaboratively using:
- UCSC Genome Browser, which also provides a platform for sharing with the broader community.
Step 15: Additional Training Data
15.1 External Datasets
Incorporate RNA-Seq data, proteomics data, or any other experimental data to refine gene models.
15.2 Transcript Assembly
Use tools like Trinity or StringTie to assemble RNA-Seq data into transcript sequences. These can be used as additional evidence in MAKER.
Step 16: Population Genomics
16.1 Genomic Variations
Use your annotations to interpret SNPs and indels called against the genome; these can provide insights into population genetics or breeding programs.
Tools:
- VCFtools or BCFtools for variant analysis.
16.2 Population Structure
Determine the genetic structure of different populations based on your annotated genome.
Tools:
- STRUCTURE or ADMIXTURE for ancestry inference.
Step 17: Community Involvement
17.1 Community Annotations
Engage the scientific community to manually curate and refine the annotations. Platforms like WebApollo allow collaborative annotation.
17.2 Feedback Loop
Iterate based on community feedback and release updated annotations periodically.
Step 18: Data Sharing and Integration
18.1 Databases
Integrate your annotations and supporting data into genome databases like Chado.
18.2 Public Repositories
Submit your final annotated genomes to repositories such as NCBI GenBank or the European Nucleotide Archive (ENA) at EMBL-EBI.
Step 19: Continuous Monitoring
19.1 Literature
Monitor related research publications to incorporate any new evidence or findings into your annotations.
19.2 Software Updates
Regularly update annotation tools to capitalize on improved algorithms or new features.
Step 20: Documentation and Reporting
20.1 Documentation
Maintain comprehensive documentation of the annotation process, decisions made, tools used, and their versions.
20.2 Reporting
Publish your methodology and findings to share your work and provide the community with insights into your annotation process.
Wrap-Up:
Genome annotation is an iterative and ever-evolving process. The steps and tools highlighted above offer a foundation, but research projects may necessitate specific adjustments. Engaging with the scientific community, continuous learning, and iterative refinement are vital to achieving high-quality, reliable genome annotations.
Step 21: In-Depth Annotation Refinement
21.1 Utilizing Additional Evidence
Incorporate multiple types of experimental evidence, such as proteomic or metabolomic data, to refine your annotations.
21.2 Manual Curation
Engage in detailed manual curation of each predicted gene model, verifying intron-exon boundaries, UTRs, and alternate splicing variants.
Step 22: Advanced Comparative Genomics
22.1 Phylogenomic Analysis
Conduct extensive phylogenomic analyses to study the evolutionary relationships, examining gene family expansions, contractions, and orthologous group evolution.
22.2 Pan-Genome Analysis
Build a pan-genome to study the genomic diversity within a species, focusing on core and accessory genome components.
Step 23: Functional Annotation Refinement
23.1 Enrichment Analysis
Perform Gene Ontology (GO) enrichment and KEGG pathway analysis to interpret the biological significance of your annotated genes.
23.2 Protein-Protein Interaction
Investigate potential protein-protein interactions to get insights into the functional networks within the cell.
Step 24: Community Engagement & Collaboration
24.1 Workshops and Training
Organize workshops and training sessions to facilitate the use of your annotated genome by other researchers and to gather feedback.
24.2 Collaborative Platforms
Leverage platforms like GitHub to make your annotation project collaborative, allowing others to contribute to refinement and curation.
Step 25: Continuous Annotation Improvement
25.1 Iterative Annotation Refinement
Regularly update annotations based on new evidence, community feedback, and advancements in annotation methodologies.
25.2 Annotation Versioning
Maintain version histories of annotations to allow users to trace the modifications and refinements.
Step 26: Enhanced Visualization & Exploration
26.1 Custom Genome Browser
Develop a custom genome browser for your annotated genome, offering advanced features and facilitating exploration for other researchers.
26.2 Multi-Omics Integration
Integrate multi-omics data into the genome browser, enabling researchers to overlay various types of data onto the genome.
Step 27: Advanced Analysis Integrations
27.1 Integrated “Omics” Analysis
Integrate your annotations with transcriptomic, proteomic, and metabolomic data to conduct a comprehensive analysis of the organism.
27.2 Systems Biology Modeling
Use the refined annotations to build systems biology models, facilitating the understanding of the organism at a systems level.
Conclusion:
The journey of genome annotation doesn’t end with just predicting genes; it’s a continuous iterative process of refinement and collaboration, enriched by multiple layers of analyses and diverse types of biological data. By engaging deeply with each step and actively involving the scientific community, you can ensure that your genome annotations are robust, comprehensive, and valuable to ongoing and future biological research.