Comparative Genomic Analysis: A Step-by-Step Guide to Identifying and Comparing Human and Mouse Genes Using NCBI and Command-Line Tools
September 27, 2023
The biologist wants to compare human genes with mouse genes to identify common and unique genes between the two species. They are not familiar with the process of downloading and analyzing gene data and seek assistance in conducting this study using NCBI resources and command-line tools in a Linux environment.
Exercise Summary:
Objective:
- Download the gene information of both human and mouse from NCBI.
- Extract relevant gene information from the downloaded data.
- Identify common and unique genes between human and mouse.
- Optionally, retrieve and compare gene sequences.
Downloading and comparing genes involves multiple steps and tools. The National Center for Biotechnology Information (NCBI) hosts the gene records, and its Entrez Direct (EDirect) command-line tools let you download gene information from a Linux shell.
Step 1: Install Entrez Direct
conda create -n entrez python=3
conda activate entrez
conda install -c bioconda entrez-direct
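Before moving on, it can help to confirm that the tools are visible on your PATH. The following is a minimal sketch, assuming the activated environment exposes the esearch, efetch, and elink executables; the helper name check_tools is made up for illustration.

```python
import shutil

def check_tools(names=("esearch", "efetch", "elink")):
    """Return a dict mapping each tool name to its resolved path, or None if not found."""
    return {name: shutil.which(name) for name in names}

if __name__ == "__main__":
    for tool, path in check_tools().items():
        print(f"{tool}: {path or 'NOT FOUND'}")
```

If any tool is reported as not found, re-activate the conda environment before continuing.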
Step 2: Download Gene Information
Once installed, use EDirect to download human and mouse gene information.
For Human Genes:
esearch -db gene -query "Homo sapiens[ORGN]" | efetch -format docsum > human_genes.xml
For Mouse Genes:
esearch -db gene -query "Mus musculus[ORGN]" | efetch -format docsum > mouse_genes.xml
Step 3: Parse Gene Information
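You will need to parse the XML files to extract the relevant fields, using Python, Perl, or another scripting language. Here is a Python example that extracts gene IDs and symbols from a docsum XML file. In EDirect docsum output the gene ID normally appears as an Id element (raw E-utilities output carries it as a uid attribute) rather than as a GeneID element, so the code checks both.
import xml.etree.ElementTree as ET

def parse_genes(file):
    """Return a dict mapping gene ID to gene symbol from a docsum XML file."""
    tree = ET.parse(file)
    root = tree.getroot()
    genes = {}
    for gene in root.findall(".//DocumentSummary"):
        # Gene ID: <Id> element in EDirect output, uid attribute in raw esummary XML
        gene_id = gene.findtext("./Id") or gene.get("uid")
        symbol = gene.findtext("./Name")
        if gene_id and symbol:  # skip records missing either field
            genes[gene_id] = symbol
    return genes

human_genes = parse_genes('human_genes.xml')
mouse_genes = parse_genes('mouse_genes.xml')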
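The docsum file for a whole organism can be large, and ET.parse loads all of it into memory. A streaming alternative can be sketched with iterparse. This assumes each DocumentSummary carries its gene ID in an Id element (or a uid attribute) and its symbol in a Name element; iter_genes is a hypothetical helper, and the sample XML is hand-written for illustration rather than real NCBI output.

```python
import io
import xml.etree.ElementTree as ET

def iter_genes(source):
    """Yield (gene_id, symbol) pairs from docsum XML without loading it all at once."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "DocumentSummary":
            gene_id = elem.findtext("Id") or elem.get("uid")
            symbol = elem.findtext("Name")
            if gene_id and symbol:
                yield gene_id, symbol
            elem.clear()  # release the children of this record once it has been read

# Tiny hand-written sample in the assumed layout, for illustration only.
SAMPLE = b"""<DocumentSummarySet>
  <DocumentSummary><Id>7157</Id><Name>TP53</Name></DocumentSummary>
  <DocumentSummary><Id>672</Id><Name>BRCA1</Name></DocumentSummary>
</DocumentSummarySet>"""

sample_genes = dict(iter_genes(io.BytesIO(SAMPLE)))
print(sample_genes)
```

For a real run, pass the file path (for example 'human_genes.xml') instead of the in-memory sample.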
Step 4: Comparison
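You can now compare the parsed gene symbols between human and mouse to find common and unique genes. Human symbols are conventionally written in uppercase (BRCA1) while mouse symbols are capitalized (Brca1), so normalize the case before comparing; otherwise almost nothing will match. Keep in mind that a shared symbol is only a rough proxy for orthology: human TP53 and mouse Trp53 are orthologs with different symbols.
# Normalize case so that human BRCA1 and mouse Brca1 compare equal
human_symbols = {symbol.upper() for symbol in human_genes.values()}
mouse_symbols = {symbol.upper() for symbol in mouse_genes.values()}

common_genes = human_symbols & mouse_symbols
unique_human_genes = human_symbols - mouse_symbols
unique_mouse_genes = mouse_symbols - human_symbols

# Output the results
print("Common Genes:", common_genes)
print("Unique Human Genes:", unique_human_genes)
print("Unique Mouse Genes:", unique_mouse_genes)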
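The set operations return symbols only. To carry on to sequence retrieval you will usually want the gene IDs behind each shared symbol. The sketch below pairs human and mouse gene IDs by case-normalized symbol. The toy dictionaries stand in for real parse output, and pair_by_symbol is a hypothetical helper.

```python
def pair_by_symbol(human_genes, mouse_genes):
    """Return {SYMBOL: (human_gene_id, mouse_gene_id)} for symbols present in both species.

    Symbols are compared in uppercase. If a symbol occurs more than once
    within a species, the last gene ID seen for it is kept.
    """
    human_by_symbol = {sym.upper(): gid for gid, sym in human_genes.items()}
    mouse_by_symbol = {sym.upper(): gid for gid, sym in mouse_genes.items()}
    shared = human_by_symbol.keys() & mouse_by_symbol.keys()
    return {sym: (human_by_symbol[sym], mouse_by_symbol[sym]) for sym in sorted(shared)}

# Toy input in the same {gene_id: symbol} shape; values are for illustration only.
toy_human = {"672": "BRCA1", "7157": "TP53"}
toy_mouse = {"12189": "Brca1", "22059": "Trp53"}

pairs = pair_by_symbol(toy_human, toy_mouse)
print(pairs)
```

In this toy input only BRCA1 is paired. TP53 and Trp53 stay unmatched, which shows the limit of symbol-based comparison.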
Step 5: Sequence Retrieval and Comparison
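If you want to compare the gene sequences, you will have to fetch the sequence data for each gene. A gene ID is not a nucleotide identifier, so it cannot be passed directly to the nucleotide database. Instead, follow the links from the gene record to its RefSeq transcripts and fetch those in FASTA format. Here’s how you can do it with EDirect:
elink -db gene -id [GeneID] -target nuccore -name gene_nuccore_refseqrna | efetch -format fasta > sequence.fasta
Replace [GeneID] with the actual gene ID. A gene with several transcripts will produce several FASTA records.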
Once you have the sequence data, you can perform sequence alignment and comparison using tools like BLAST or other bioinformatics tools.
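For a quick look at two short sequences before running BLAST, a global alignment can be sketched in plain Python. The block below is a minimal Needleman-Wunsch implementation with arbitrary scores (match +1, mismatch -1, gap -2), not a substitute for BLAST: it uses memory proportional to the product of the sequence lengths, so it only suits short sequences. The function names and the toy FASTA records are made up for illustration.

```python
def read_fasta(text):
    """Parse FASTA text into {header: sequence}."""
    records, header = {}, None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            records[header] = []
        elif header is not None:
            records[header].append(line.upper())
    return {h: "".join(parts) for h, parts in records.items()}

def global_identity(a, b, match=1, mismatch=-1, gap=-2):
    """Globally align a and b; return the fraction of aligned columns that are identical."""
    n, m = len(a), len(b)

    def sub(i, j):
        return match if a[i - 1] == b[j - 1] else mismatch

    # Fill the dynamic-programming score matrix.
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(score[i - 1][j - 1] + sub(i, j),
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trace back from the bottom-right corner, counting identical columns.
    i, j, identical, columns = n, m, 0, 0
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            identical += a[i - 1] == b[j - 1]
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            i -= 1  # gap in b
        else:
            j -= 1  # gap in a
        columns += 1
    return identical / columns if columns else 0.0

# Toy records, not real gene sequences.
TOY_FASTA = ">human_toy\nACGTACGT\n>mouse_toy\nACGTACGA\n"
seqs = read_fasta(TOY_FASTA)
identity = global_identity(seqs["human_toy"], seqs["mouse_toy"])
print(f"identity: {identity:.3f}")
```

For real transcripts, read the text of sequence.fasta into read_fasta, and prefer BLAST or a dedicated aligner once sequences exceed a few thousand bases.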
Additional Notes:
- For a large number of genes, you might want to consider using the NCBI’s Batch Entrez or other API services.
- Consider the restrictions and limitations that NCBI puts on bulk downloading. Ensure that your activities are in compliance with the NCBI’s usage policies.
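One common way to stay within NCBI's limits (at most 3 requests per second without an API key, 10 with one) is to send IDs in batches and pause between requests. The sketch below shows only the batching and pacing logic. The batch size of 200 and the delay of 0.34 seconds are assumptions chosen for illustration, and fetch is a placeholder for whatever function actually performs the request.

```python
import time

def batched(ids, size=200):
    """Yield successive lists of at most `size` IDs."""
    ids = list(ids)
    for start in range(0, len(ids), size):
        yield ids[start:start + size]

def run_in_batches(ids, fetch, size=200, delay=0.34):
    """Call fetch(batch) for each batch, sleeping `delay` seconds between calls."""
    results = []
    for n, batch in enumerate(batched(ids, size)):
        if n:  # no pause needed before the first request
            time.sleep(delay)
        results.append(fetch(batch))
    return results

# Stand-in for a real request: it only builds the comma-separated ID string.
demo = run_in_batches(range(1, 6), lambda b: ",".join(map(str, b)), size=2, delay=0)
print(demo)
```

In a real run, fetch could call efetch through subprocess with the joined ID string as its -id argument.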
For more complex analyses, consider specialized resources such as Ensembl BioMart, which can export human-mouse ortholog tables directly, and programming environments such as Biopython or Bioconductor in R.