Comparative Genomic Analysis: A Step-by-Step Guide to Identifying and Comparing Human and Mouse Genes Using NCBI and Command-Line Tools

September 27, 2023 By admin

A biologist wants to compare human and mouse genes to identify the genes common to both species and those unique to each. They are unfamiliar with downloading and analyzing gene data and need help carrying out this study with NCBI resources and command-line tools in a Linux environment.

Exercise Summary:

Objective:

    • Download the gene information of both human and mouse from NCBI.
    • Extract relevant gene information from the downloaded data.
    • Identify common and unique genes between human and mouse.
    • Optionally, retrieve and compare gene sequences.

Downloading and comparing genes involves several steps and tools. The National Center for Biotechnology Information (NCBI) is a primary source of genetic data, and its Entrez Direct (EDirect) command-line tools let you download gene information from a Linux shell.

Step 1: Install Entrez Direct

bash
conda create -n entrez python=3
conda activate entrez
conda install -c bioconda entrez-direct

Step 2: Download Gene Information

Once installed, use EDirect to download human and mouse gene information.

For Human Genes:

bash
esearch -db gene -query "Homo sapiens[ORGN]" | efetch -format docsum > human_genes.xml

For Mouse Genes:

bash
esearch -db gene -query "Mus musculus[ORGN]" | efetch -format docsum > mouse_genes.xml

Step 3: Parse Gene Information

You will likely need to parse the XML files to extract relevant information using Python, Perl, or another scripting language. Here is a Python example to extract gene IDs and symbols from the XML file.

python
import xml.etree.ElementTree as ET

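def parse_genes(file):
    """Return a {gene_id: symbol} dict from an EDirect docsum XML file."""
    tree = ET.parse(file)
    root = tree.getroot()
    genes = {}
    for gene in root.findall(".//DocumentSummary"):
        # The gene ID is usually an <Id> child in EDirect output and a
        # "uid" attribute in raw E-utilities output; check your own file.
        gene_id = gene.findtext("Id") or gene.get("uid")
        symbol = gene.findtext("Name")
        if gene_id and symbol:
            genes[gene_id] = symbol
    return genes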

human_genes = parse_genes('human_genes.xml')
mouse_genes = parse_genes('mouse_genes.xml')
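
The human docsum file can contain a very large number of records, so loading it in one go may use a lot of memory. The sketch below shows one possible streaming variant built on `iterparse` from the standard library. The function name `iter_gene_symbols` is hypothetical, the sample XML is hand-written, and the `<Id>`/`uid` lookup is an assumption about the docsum layout that should be checked against your own download.

```python
import io
import xml.etree.ElementTree as ET


def iter_gene_symbols(source):
    """Yield (gene_id, symbol) pairs from a docsum XML file or file object.

    Assumes each record is a <DocumentSummary> element with a <Name> child
    and either an <Id> child or a "uid" attribute holding the gene ID.
    """
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "DocumentSummary":
            continue
        gene_id = elem.findtext("Id") or elem.get("uid")
        symbol = elem.findtext("Name")
        if gene_id and symbol:
            yield gene_id, symbol
        elem.clear()  # drop the record's children once it has been read


# Tiny hand-written sample standing in for a real download.
SAMPLE = b"""<DocumentSummarySet>
  <DocumentSummary><Id>7157</Id><Name>TP53</Name></DocumentSummary>
  <DocumentSummary uid="22059"><Name>Trp53</Name></DocumentSummary>
</DocumentSummarySet>"""

sample_genes = dict(iter_gene_symbols(io.BytesIO(SAMPLE)))
print(sample_genes)
```

Passing a file name such as 'human_genes.xml' instead of the in-memory sample should work the same way, since `iterparse` accepts either a path or a file object.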

Step 4: Comparison

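You can now compare the parsed gene symbols between human and mouse to find common and unique genes. Human symbols are written in upper case (BRCA1), while mouse symbols capitalize only the first letter (Brca1), so normalize the case before comparing. Matching by symbol is only an approximation: some counterparts carry different symbols in the two species (human TP53, mouse Trp53).

python
# Upper-case the symbols so that BRCA1 (human) and Brca1 (mouse) compare equal
human_symbols = {symbol.upper() for symbol in human_genes.values()}
mouse_symbols = {symbol.upper() for symbol in mouse_genes.values()}

common_genes = human_symbols & mouse_symbols
unique_human_genes = human_symbols - mouse_symbols
unique_mouse_genes = mouse_symbols - human_symbols

# Output the results
print("Common Genes:", common_genes)
print("Unique Human Genes:", unique_human_genes)
print("Unique Mouse Genes:", unique_mouse_genes)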
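
The sets above hold only symbols, but the next step needs gene IDs. As a minimal sketch, the hypothetical helper `pair_ids_by_symbol` below keeps the human and mouse gene IDs for every symbol found in both species. It assumes the `{gene_id: symbol}` dictionaries produced in Step 3 and a case-insensitive symbol match, which is a heuristic rather than a proper orthology call. The small dictionaries are toy data for illustration.

```python
def pair_ids_by_symbol(human_genes, mouse_genes):
    """Map upper-cased symbol -> (human_id, mouse_id) for shared symbols.

    Both arguments are {gene_id: symbol} dicts. If a symbol occurs more
    than once within a species, the last gene ID seen is kept.
    """
    human_by_symbol = {sym.upper(): gid for gid, sym in human_genes.items()}
    mouse_by_symbol = {sym.upper(): gid for gid, sym in mouse_genes.items()}
    shared = human_by_symbol.keys() & mouse_by_symbol.keys()
    return {
        sym: (human_by_symbol[sym], mouse_by_symbol[sym])
        for sym in sorted(shared)
    }


# Toy data for illustration only.
human_demo = {"7157": "TP53", "672": "BRCA1"}
mouse_demo = {"22059": "Trp53", "12189": "Brca1"}

pairs = pair_ids_by_symbol(human_demo, mouse_demo)
for symbol, (human_id, mouse_id) in pairs.items():
    print(symbol, human_id, mouse_id)
```

With this toy data only BRCA1 pairs up. TP53 and Trp53 are counterparts that symbol matching misses, which illustrates why the symbol-based lists should be treated as a first pass.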

Step 5: Sequence Retrieval and Comparison

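If you want to compare the gene sequences, fetch the sequence records linked to each gene. A Gene ID is not a nucleotide identifier, so passing it directly to the nucleotide database would return an unrelated record. Instead, follow the link from the gene record to its RefSeq transcripts with EDirect:

bash
elink -db gene -id [GeneID] -target nuccore -name gene_nuccore_refseqrna | efetch -format fasta > sequence.fasta

Replace [GeneID] with the actual gene ID (for example, 7157 for human TP53).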

Once you have the sequence data, you can perform sequence alignment and comparison using tools like BLAST or other bioinformatics tools.
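
For a quick sanity check before running a real aligner, the downloaded FASTA text can be read and scored with the Python standard library alone. The sketch below is an assumption-laden illustration, not a substitute for BLAST: `read_fasta` and `rough_similarity` are hypothetical helper names, the two sequences are made up, and `difflib` reports a generic string-similarity ratio rather than a biological alignment score.

```python
from difflib import SequenceMatcher


def read_fasta(text):
    """Parse FASTA-formatted text into a {header: sequence} dict."""
    records, header, parts = {}, None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records[header] = "".join(parts)
            header, parts = line[1:], []
        elif header is not None:
            parts.append(line.upper())
    if header is not None:
        records[header] = "".join(parts)
    return records


def rough_similarity(seq_a, seq_b):
    """Return difflib's similarity ratio (0.0 to 1.0) for two sequences."""
    # autojunk is disabled because nucleotide alphabets are tiny and every
    # base would otherwise be treated as "popular" junk in long sequences.
    return SequenceMatcher(None, seq_a, seq_b, autojunk=False).ratio()


# Made-up sequences standing in for real downloads.
DEMO = """>seq_a made-up example
ACGTACGT
ACGT
>seq_b made-up example
ACGTACGA
ACGT
"""

records = read_fasta(DEMO)
seq_a, seq_b = records.values()
score = rough_similarity(seq_a, seq_b)
print(f"rough similarity: {score:.3f}")
```

`SequenceMatcher` slows down sharply on long sequences, so this is only practical for short transcripts or spot checks.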

Additional Notes:

  • For a large number of genes, you might want to consider using the NCBI’s Batch Entrez or other API services.
  • Consider the restrictions and limitations that NCBI puts on bulk downloading. Ensure that your activities are in compliance with the NCBI’s usage policies.
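
One way to stay within NCBI's request limits when working through many gene IDs is to group them into batches and pause between requests. The sketch below only builds the request URLs and picks a delay; it does not contact NCBI. The helper names (`chunk_ids`, `build_esummary_urls`, `min_delay`) and the batch size of 200 are assumptions for illustration. The delays reflect NCBI's published guideline of roughly 3 requests per second without an API key and 10 with one.

```python
from urllib.parse import urlencode

ESUMMARY_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi"


def chunk_ids(ids, size):
    """Yield consecutive lists of at most `size` IDs."""
    ids = list(ids)
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


def build_esummary_urls(gene_ids, batch_size=200, api_key=None):
    """Return one esummary URL per batch of gene IDs."""
    urls = []
    for batch in chunk_ids(gene_ids, batch_size):
        params = {"db": "gene", "id": ",".join(batch), "retmode": "xml"}
        if api_key:
            params["api_key"] = api_key
        urls.append(f"{ESUMMARY_URL}?{urlencode(params)}")
    return urls


def min_delay(api_key=None):
    """Seconds to wait between requests to stay under the rate limit."""
    return 0.1 if api_key else 0.34


# 450 placeholder IDs -> three batches of 200, 200 and 50.
demo_ids = [str(n) for n in range(1, 451)]
urls = build_esummary_urls(demo_ids)
print(len(urls), "requests, waiting", min_delay(), "s between each")
```

In a real run, each URL would be fetched in turn with a `time.sleep(min_delay(api_key))` call between requests.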

For more complex analyses, consider specialized resources such as Ensembl BioMart, and programming environments such as Biopython or Bioconductor in R.
