Comparative Genomic Analysis: A Step-by-Step Guide to Identifying and Comparing Human and Mouse Genes Using NCBI and Command-Line Tools

September 27, 2023, by admin

A biologist wants to compare human and mouse genes to identify which genes the two species share and which are unique to each. They are not familiar with downloading and analyzing gene data, so this guide walks through the study using NCBI resources and command-line tools in a Linux environment.

Exercise Summary:

Objective:

- Download the gene information of both human and mouse from NCBI.
- Extract relevant gene information from the downloaded data.
- Identify common and unique genes between human and mouse.
- Optionally, retrieve and compare gene sequences.

Step 1: Install Entrez Direct

bash

conda create -n entrez python=3
conda activate entrez
conda install -c bioconda entrez-direct

Step 2: Download Gene Information

Once installed, use EDirect to download human and mouse gene information.

For Human Genes:

bash

esearch -db gene -query "Homo sapiens[ORGN]" | efetch -format docsum > human_genes.xml

For Mouse Genes:

bash

esearch -db gene -query "Mus musculus[ORGN]" | efetch -format docsum > mouse_genes.xml

Step 3: Parse Gene Information

You will likely need to parse the XML files to extract relevant information using Python, Perl, or another scripting language. Here is a Python example to extract gene IDs and symbols from the XML file.

python

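import xml.etree.ElementTree as ET


def parse_genes(file):
    """Return a dict mapping gene ID to gene symbol for one docsum XML file."""
    tree = ET.parse(file)
    root = tree.getroot()
    genes = {}
    for gene in root.findall(".//DocumentSummary"):
        # EDirect docsum output usually stores the gene ID in an <Id> child;
        # fall back to <GeneID> or the uid attribute if that is missing.
        gene_id = gene.findtext("Id") or gene.findtext("GeneID") or gene.get("uid")
        symbol = gene.findtext("Name")
        if gene_id and symbol:  # skip records that lack an ID or a symbol
            genes[gene_id] = symbol
    return genes


human_genes = parse_genes('human_genes.xml')
mouse_genes = parse_genes('mouse_genes.xml')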
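
Whole-organism docsum files can be large, and reading one in a single pass with ET.parse may use a lot of memory. A minimal sketch of a streaming alternative, using the standard library's iterparse, might look like the following. The function name iter_gene_symbols and the sample XML are hypothetical, and the lookup order for the ID (an Id element, then a uid attribute) is an assumption about the docsum layout that you should check against your own file.

```python
import io
import xml.etree.ElementTree as ET


def iter_gene_symbols(source):
    """Yield (gene_id, symbol) pairs from a docsum XML file, one record at a time."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "DocumentSummary":
            continue
        # Assumption: the ID is in an <Id> child or, failing that, a uid attribute.
        gene_id = elem.findtext("Id") or elem.get("uid")
        symbol = elem.findtext("Name")
        if gene_id and symbol:
            yield gene_id, symbol
        elem.clear()  # release the memory held by this record


# Tiny hand-written sample standing in for a real download.
sample = io.BytesIO(
    b"<DocumentSummarySet>"
    b"<DocumentSummary><Id>672</Id><Name>BRCA1</Name></DocumentSummary>"
    b"<DocumentSummary><Id>7157</Id><Name>TP53</Name></DocumentSummary>"
    b"</DocumentSummarySet>"
)
genes = dict(iter_gene_symbols(sample))
print(genes)
```

To use it on a real download, pass the file name instead of the in-memory sample, for example dict(iter_gene_symbols('human_genes.xml')).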

Step 4: Comparison

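You can now compare the parsed gene symbols between human and mouse to find common and unique genes. Human symbols are conventionally written in upper case (BRCA1) while mouse symbols are capitalized (Brca1), so the comparison has to ignore case. Keep in mind that a shared symbol is only a rough proxy for orthology.

python

# Normalize case first: human symbols are upper case, mouse symbols are capitalized.
human_symbols = {symbol.upper() for symbol in human_genes.values()}
mouse_symbols = {symbol.upper() for symbol in mouse_genes.values()}
common_genes = human_symbols & mouse_symbols
unique_human_genes = human_symbols - mouse_symbols
unique_mouse_genes = mouse_symbols - human_symbols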

# Output the results
print("Common Genes:", common_genes)
print("Unique Human Genes:", unique_human_genes)
print("Unique Mouse Genes:", unique_mouse_genes)
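
Human gene symbols are conventionally upper case and mouse symbols capitalized, so a symbol comparison only works if case is ignored. The sketch below wraps the comparison in a small function and runs it on toy data to show the effect. compare_symbols is a hypothetical helper, and the two dictionaries are hand-made stand-ins for the parsed results, with gene IDs included only for illustration.

```python
def compare_symbols(human, mouse):
    """Compare two {gene_id: symbol} dicts by case-normalized symbol."""
    human_symbols = {s.upper() for s in human.values()}
    mouse_symbols = {s.upper() for s in mouse.values()}
    return {
        "common": human_symbols & mouse_symbols,
        "human_only": human_symbols - mouse_symbols,
        "mouse_only": mouse_symbols - human_symbols,
    }


# Toy stand-ins for the parsed human and mouse dictionaries.
toy_human = {"672": "BRCA1", "7157": "TP53"}
toy_mouse = {"12189": "Brca1", "22059": "Trp53"}

result = compare_symbols(toy_human, toy_mouse)
for label, symbols in result.items():
    print(label, sorted(symbols))
```

BRCA1 and Brca1 match once case is ignored, but TP53 and Trp53 do not, even though they are orthologs. Symbol matching therefore tends to undercount shared genes, which is one reason to treat its output as a first approximation.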

Step 5: Sequence Retrieval and Comparison

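If you want to compare the gene sequences, you first have to map each gene ID to its sequence records, because a Gene ID is not a valid identifier in the nucleotide database. With EDirect, you can link from the gene record to its RefSeq transcripts and fetch them in one pipeline:

bash

elink -db gene -id [GeneID] -target nuccore -name gene_nuccore_refseqrna | efetch -format fasta > sequence.fasta

Replace [GeneID] with the actual gene ID, for example 672 for human BRCA1.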

Once you have the sequence data, you can perform sequence alignment and comparison using tools like BLAST or other bioinformatics tools.
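
Before running a full alignment, a quick alignment-free check can indicate whether two sequences are similar at all. The sketch below reads FASTA text and computes the Jaccard similarity of the two sequences' k-mer sets. It is not BLAST and not a substitute for alignment; read_fasta, kmer_jaccard, and the short made-up sequences are hypothetical and serve only as an illustration.

```python
def read_fasta(text):
    """Parse FASTA-formatted text into a {header: sequence} dict."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            records[header] = ""
        elif header is not None:
            records[header] += line.upper()
    return records


def kmer_jaccard(seq_a, seq_b, k=3):
    """Jaccard similarity between the sets of k-mers of two sequences."""
    kmers_a = {seq_a[i:i + k] for i in range(len(seq_a) - k + 1)}
    kmers_b = {seq_b[i:i + k] for i in range(len(seq_b) - k + 1)}
    union = kmers_a | kmers_b
    return len(kmers_a & kmers_b) / len(union) if union else 0.0


# Made-up fragments, not real gene sequences.
fasta_text = """>human_fragment
ATGGATTTATCTGCT
>mouse_fragment
ATGGATTTATCTGCC
"""
seqs = read_fasta(fasta_text)
score = kmer_jaccard(seqs["human_fragment"], seqs["mouse_fragment"])
print(round(score, 2))
```

A score near 1 suggests the sequences share most of their k-mers and are worth aligning properly; a score near 0 suggests they have little in common at this k.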

Additional Notes:

- For a large number of genes, consider using NCBI's Batch Entrez or the E-utilities API rather than fetching records one at a time.
- NCBI restricts bulk downloading: without an API key, E-utilities accepts at most three requests per second. Make sure your activities comply with NCBI's usage policies.
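
One way to stay within request limits is to send IDs in batches and pause between requests. The sketch below shows only the batching and pacing logic. batched_ids, fetch_in_batches, and the fetch callback are hypothetical names, the batch size of 200 is an arbitrary choice, and the default delay of 0.34 seconds assumes a ceiling of about three requests per second, which you should check against NCBI's current guidance.

```python
import time


def batched_ids(ids, size):
    """Yield successive lists of at most `size` IDs."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


def fetch_in_batches(ids, fetch, size=200, delay=0.34):
    """Call fetch(batch) for each batch, sleeping `delay` seconds between calls."""
    results = []
    for index, batch in enumerate(batched_ids(ids, size)):
        if index:
            time.sleep(delay)  # pause before every request after the first
        results.append(fetch(batch))
    return results


# Stand-in for a real request: join the IDs as they might appear in a query.
demo_ids = [str(n) for n in range(1, 6)]
demo = fetch_in_batches(demo_ids, fetch=",".join, size=2, delay=0)
print(demo)
```

In real use, the fetch callback would be whatever function sends one batch of IDs to NCBI and returns the response.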

For more complex analyses, consider specialized resources such as Ensembl BioMart, which provides curated human-mouse ortholog mappings, and libraries such as Biopython (Python) or Bioconductor (R).