Step-by-Step Guide to Creating a Database of Tumor Suppressors and Oncogenes

December 28, 2024, by admin

Introduction

A comprehensive database of tumor suppressors and oncogenes is a crucial resource for cancer research. Such a database provides insight into the molecular mechanisms of cancer initiation, progression, and metastasis, and it helps identify novel therapeutic targets and biomarkers. This guide, written for bioinformaticians and researchers working in cancer genomics, walks through building and using one.

Step 1: Understand the Key Concepts

  • Oncogenes: Genes that, when mutated or expressed at high levels, can promote cancer. They often drive cell proliferation.
  • Tumor Suppressors: Genes that inhibit tumor formation. Loss-of-function mutations or deletions in tumor suppressor genes can result in uncontrolled cell division.
  • Key Features: A good database should include gene annotations, mutation types (such as missense, nonsense, and frameshift mutations), and associated cancer types.

Step 2: Identify Trusted Data Sources

Several trusted data sources compile information on tumor suppressors and oncogenes:

  • Cancer Gene Census: A list of genes curated based on experimental evidence for cancer association.
  • TSGene: An annotation database of tumor suppressor genes.
  • COSMIC (Catalogue of Somatic Mutations in Cancer): Provides detailed mutation data across various cancer types.
  • cBioPortal: A data portal that includes oncogenes and tumor suppressors from various cancer studies.
  • OncoKB: A knowledge base for cancer mutations and their clinical significance.

Step 3: Text Mining and Database Construction

To construct a robust database, data can be mined from publications, clinical studies, and genomic databases. For example:

  • Text Mining Approach: Use tools such as BioBERT or DeepPubMed to extract gene-related cancer data from scientific literature.
  • Automated Curation: Using Natural Language Processing (NLP), extract gene names, their functional roles, and associated mutations from research papers. Text mining approaches allow for systematic identification of genes related to cancer.
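
As a rough illustration of the extraction idea, the following Python sketch links gene symbols to a role when both appear in the same sentence. It is a plain keyword co-occurrence stand-in, not BioBERT or any trained NLP model, and the names `KNOWN_SYMBOLS`, `ROLE_KEYWORDS`, and `extract_gene_roles` are hypothetical. The symbol list is a tiny illustrative set; a real pipeline would load approved symbols from a curated source.

```python
import re

# Illustrative symbol list (assumption); load approved symbols from a curated source in practice.
KNOWN_SYMBOLS = {"TP53", "BRCA1", "KRAS", "MYC", "RB1"}

# Keyword patterns that hint at a gene's role in a sentence.
ROLE_KEYWORDS = {
    "oncogene": re.compile(r"\boncogen(?:e|es|ic)\b", re.IGNORECASE),
    "tumor suppressor": re.compile(r"\btumou?r[- ]suppressors?\b", re.IGNORECASE),
}

# Candidate tokens: runs of 2 to 10 uppercase letters or digits starting with a letter.
TOKEN = re.compile(r"\b[A-Z][A-Z0-9]{1,9}\b")


def extract_gene_roles(text):
    """Return {symbol: set of roles} for sentences that contain both a known symbol and a role keyword."""
    results = {}
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        symbols = {tok for tok in TOKEN.findall(sentence) if tok in KNOWN_SYMBOLS}
        roles = {role for role, pattern in ROLE_KEYWORDS.items() if pattern.search(sentence)}
        if not roles:
            continue  # a symbol without a role keyword is not recorded
        for symbol in symbols:
            results.setdefault(symbol, set()).update(roles)
    return results
```

Co-occurrence of this kind produces false positives (for example, negated statements), so its output is best treated as a list of candidates for manual or model-based curation.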

Tools and Resources for Text Mining:

  1. Text Mining with PubMed: Use PubMed search queries combined with keywords like “tumor suppressor,” “oncogene,” or specific cancer types to extract gene-related data.
    • Example query: "tumor suppressor AND mutation AND cancer".
  2. Using the NCBI E-utilities:
    • Use esearch to search PubMed for relevant articles.
    • Use efetch to download the data into a structured format (e.g., XML).

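Example esearch query on the UNIX command line (esearch and efetch are part of NCBI's Entrez Direct tools, which must be installed):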

bash
esearch -db pubmed -query "tumor suppressor AND mutation AND cancer" | efetch -format xml > cancer_genes.xml
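
Once the XML has been downloaded, the records need to be read back into a usable structure. The following Python sketch pulls the PMID, title, and abstract out of PubMed XML using only the standard library. The function name `parse_pubmed_xml` is hypothetical, and `SAMPLE` is an invented record (the PMID and text are placeholders, not a real article) that mirrors the usual `PubmedArticle` layout.

```python
import xml.etree.ElementTree as ET


def parse_pubmed_xml(xml_text):
    """Return a list of dicts with the PMID, title, and abstract of each article."""
    root = ET.fromstring(xml_text)
    records = []
    for article in root.iter("PubmedArticle"):
        citation = article.find("MedlineCitation")
        if citation is None:
            continue
        title = citation.find("Article/ArticleTitle")
        # Abstracts may be split into several labelled AbstractText sections.
        sections = citation.findall("Article/Abstract/AbstractText")
        records.append({
            "pmid": citation.findtext("PMID", default=""),
            # itertext() keeps text nested inside inline markup such as <i>.
            "title": "".join(title.itertext()).strip() if title is not None else "",
            "abstract": " ".join("".join(s.itertext()).strip() for s in sections),
        })
    return records


# Invented sample record for illustration only.
SAMPLE = """<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID Version="1">00000001</PMID>
      <Article>
        <ArticleTitle>Placeholder title about <i>TP53</i> mutations.</ArticleTitle>
        <Abstract>
          <AbstractText Label="BACKGROUND">Invented background text.</AbstractText>
          <AbstractText Label="RESULTS">Invented results text.</AbstractText>
        </Abstract>
      </Article>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>"""

if __name__ == "__main__":
    for record in parse_pubmed_xml(SAMPLE):
        print(record["pmid"], record["title"])
```

The same function can be applied to the contents of cancer_genes.xml; articles without an abstract simply yield an empty string.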

  3. Database Schema Design: Your database should have tables for:
    • Gene Information: Gene name, symbol, description, function.
    • Cancer Association: Cancer type, mutation type (e.g., deletion, amplification), evidence level.
    • Pathway Involvement: DNA repair, apoptosis, cell cycle regulation, etc.
    • Literature Evidence: Paper references, experimental validation.
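
One possible way to lay out these four tables is sketched below in Python. SQLite is used here only because it ships with Python and needs no server; the table and column names are assumptions chosen for illustration, and the same design would need minor dialect changes for MySQL or PostgreSQL.

```python
import sqlite3

# Hypothetical schema: one table per category listed above, linked by gene_id.
SCHEMA = """
CREATE TABLE gene (
    gene_id     INTEGER PRIMARY KEY,
    symbol      TEXT NOT NULL UNIQUE,
    name        TEXT,
    description TEXT,
    function    TEXT
);
CREATE TABLE cancer_association (
    association_id INTEGER PRIMARY KEY,
    gene_id        INTEGER NOT NULL REFERENCES gene(gene_id),
    cancer_type    TEXT NOT NULL,
    mutation_type  TEXT,
    evidence_level TEXT
);
CREATE TABLE pathway_involvement (
    gene_id INTEGER NOT NULL REFERENCES gene(gene_id),
    pathway TEXT NOT NULL,
    PRIMARY KEY (gene_id, pathway)
);
CREATE TABLE literature_evidence (
    evidence_id              INTEGER PRIMARY KEY,
    gene_id                  INTEGER NOT NULL REFERENCES gene(gene_id),
    reference                TEXT NOT NULL,
    experimentally_validated INTEGER DEFAULT 0
);
"""


def create_database(path=":memory:"):
    """Create the tables and return an open connection."""
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when asked
    conn.executescript(SCHEMA)
    return conn


if __name__ == "__main__":
    conn = create_database()
    conn.execute("INSERT INTO gene (symbol, name) VALUES (?, ?)", ("TP53", "tumor protein p53"))
    conn.execute(
        "INSERT INTO cancer_association (gene_id, cancer_type, mutation_type) VALUES (1, ?, ?)",
        ("lung cancer", "missense"),
    )
    print(conn.execute("SELECT symbol FROM gene").fetchall())
```

Keeping gene information in its own table means a symbol is stored once, and each association, pathway, or reference row points back to it.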

Step 4: Data Integration and Validation

Integrate data from various sources (e.g., COSMIC, TSGene) into a unified database:

  1. Database Integration: Write Perl or Python scripts to parse and merge data from different sources into a relational database (e.g., MySQL, PostgreSQL).
    • Use regular expressions to extract relevant data.
    • Ensure that gene symbols, mutation types, and cancer types are consistently formatted across data sources.
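
A minimal sketch of the consistent-formatting step is given below in Python. The alias table and the function names are hypothetical and deliberately small; each source uses its own vocabulary, so the mapping would need to be extended after inspecting the actual files.

```python
# Illustrative alias table (assumption): maps source-specific spellings to one canonical term.
MUTATION_ALIASES = {
    "frameshift": "frameshift",
    "frame-shift": "frameshift",
    "frame shift": "frameshift",
    "missense": "missense",
    "nonsense": "nonsense",
    "stop gained": "nonsense",
    "del": "deletion",
    "deletion": "deletion",
    "amp": "amplification",
    "amplification": "amplification",
}


def normalize_symbol(symbol):
    """Trim and upper-case a gene symbol.

    Assumes symbols are compared case-insensitively; mixed-case symbols
    such as C1orf112 would need special handling.
    """
    return symbol.strip().upper()


def normalize_mutation_type(raw):
    """Lower-case, collapse whitespace and underscores, then apply the alias table."""
    key = " ".join(raw.strip().lower().replace("_", " ").split())
    return MUTATION_ALIASES.get(key, key)  # unknown terms pass through unchanged


def normalize_record(record):
    """Return a copy of a record with all three fields in canonical form."""
    return {
        "gene_name": normalize_symbol(record["gene_name"]),
        "cancer_type": " ".join(record["cancer_type"].split()).lower(),
        "mutation_type": normalize_mutation_type(record["mutation_type"]),
    }


if __name__ == "__main__":
    print(normalize_record(
        {"gene_name": " tp53 ", "cancer_type": "Breast  Cancer", "mutation_type": "AMP"}
    ))
```

Letting unknown mutation terms pass through unchanged, rather than discarding them, keeps them visible for later review.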

Example Perl script for parsing a TSV file:

perl
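#!/usr/bin/perl
use strict;
use warnings;
use Text::CSV;
use DBI;

# Connection details are placeholders; substitute your own database, user, and password.
my $dbh = DBI->connect("DBI:mysql:database=cancer_db;host=localhost",
                       "user", "password", { RaiseError => 1 });
# Placeholders (?) stop quotes in the data from breaking or injecting SQL.
my $sth = $dbh->prepare(
    "INSERT INTO cancer_genes (gene_name, cancer_type, mutation_type) VALUES (?, ?, ?)");

# Text::CSV splits on commas by default; sep_char must be a tab for TSV input.
my $csv = Text::CSV->new({ binary => 1, auto_diag => 1, sep_char => "\t" });
open my $fh, "<", "cancer_genes.tsv" or die "cancer_genes.tsv: $!";
while (my $row = $csv->getline($fh)) {
    # Each row holds: gene name, cancer type, mutation type
    my ($gene_name, $cancer_type, $mutation_type) = @$row;
    $sth->execute($gene_name, $cancer_type, $mutation_type);
}
close $fh;
$dbh->disconnect;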

  2. Validation and Quality Control: Ensure data integrity by validating gene names, mutation types, and related cancer types. Cross-reference with external databases like UniProt, Ensembl, and NCBI Gene for accurate annotations.
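
The validation step could look something like the following Python sketch. It assumes the reference gene symbols and the allowed mutation types have already been loaded into sets, for example from a downloaded export of one of the databases named above; the function name `validate_records` and the record layout are hypothetical.

```python
def validate_records(records, reference_symbols, allowed_mutation_types):
    """Split records into (valid, rejected); each rejected entry carries its reasons."""
    valid, rejected = [], []
    for record in records:
        problems = []
        if record.get("gene_name") not in reference_symbols:
            problems.append("unknown gene symbol")
        if record.get("mutation_type") not in allowed_mutation_types:
            problems.append("unrecognized mutation type")
        if not record.get("cancer_type"):
            problems.append("missing cancer type")
        if problems:
            rejected.append((record, problems))
        else:
            valid.append(record)
    return valid, rejected


if __name__ == "__main__":
    # Tiny illustrative inputs; the second record contains a deliberate typo.
    records = [
        {"gene_name": "TP53", "cancer_type": "lung cancer", "mutation_type": "missense"},
        {"gene_name": "TP35", "cancer_type": "", "mutation_type": "missense"},
    ]
    valid, rejected = validate_records(records, {"TP53", "MYC"}, {"missense", "nonsense"})
    for record, problems in rejected:
        print(record["gene_name"], "->", ", ".join(problems))
```

Returning the rejected records together with their reasons, instead of silently dropping them, makes it easier to spot systematic problems such as an outdated symbol list.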

Step 5: Develop a Query Interface

Create a web interface or a query system for users to interact with the database. This can be achieved using web frameworks like Flask (Python) or CGI (Perl) with SQL queries for retrieval.

Example MySQL query to retrieve genes altered by amplification, a mutation type typical of oncogenes:

sql
SELECT gene_name, mutation_type, cancer_type
FROM cancer_genes
WHERE mutation_type = 'amplification';
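
A query interface usually needs to accept user-supplied filters. The Python sketch below shows one way to build such a query with bound parameters against the `cancer_genes` table. It uses SQLite purely so the example runs without a server; the helper names `find_genes` and `demo_connection` and the three demo rows are illustrative assumptions. With a MySQL driver the placeholder is typically `%s` rather than `?`.

```python
import sqlite3


def find_genes(conn, mutation_type=None, cancer_type=None):
    """Return (gene_name, mutation_type, cancer_type) rows matching the given filters."""
    sql = "SELECT gene_name, mutation_type, cancer_type FROM cancer_genes"
    clauses, params = [], []
    if mutation_type is not None:
        clauses.append("mutation_type = ?")
        params.append(mutation_type)
    if cancer_type is not None:
        clauses.append("cancer_type = ?")
        params.append(cancer_type)
    if clauses:
        sql += " WHERE " + " AND ".join(clauses)
    sql += " ORDER BY gene_name"
    # User input is passed as bound parameters, never pasted into the SQL string.
    return conn.execute(sql, params).fetchall()


def demo_connection():
    """Build an in-memory table with a few illustrative rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE cancer_genes (gene_name TEXT, cancer_type TEXT, mutation_type TEXT)"
    )
    conn.executemany(
        "INSERT INTO cancer_genes VALUES (?, ?, ?)",
        [
            ("MYC", "breast cancer", "amplification"),
            ("ERBB2", "breast cancer", "amplification"),
            ("TP53", "lung cancer", "missense"),
        ],
    )
    return conn


if __name__ == "__main__":
    for row in find_genes(demo_connection(), mutation_type="amplification"):
        print(row)
```

A function like this could sit behind a Flask route or a CGI script, with the filter values taken from the request.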

Step 6: Application and Use Cases

Step 7: Updates and Maintenance

To keep the database up-to-date:

  • Regularly Mine New Literature: Use automated scripts for monthly literature mining to capture emerging genes and mutations.
  • Incorporate New Data from Portals: Integrate new data from COSMIC, TCGA, and other cancer mutation databases as they become available.
  • Community Contributions: Allow external contributions through an open-source model to keep the database current.
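
For the recurring literature search, one option is to restrict each run to a recent publication-date window. The Python sketch below only builds the E-utilities esearch URL for such a window and does not send any request; the function name `build_update_query`, the 30-day default, and the `retmax` value are assumptions for illustration.

```python
from datetime import date, timedelta
from urllib.parse import urlencode, urlparse, parse_qs

ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_update_query(term, end, days=30, retmax=200):
    """Build an esearch URL limited to articles published in the `days` before `end`."""
    start = end - timedelta(days=days)
    params = {
        "db": "pubmed",
        "term": term,
        "datetype": "pdat",  # filter on publication date
        "mindate": start.strftime("%Y/%m/%d"),
        "maxdate": end.strftime("%Y/%m/%d"),
        "retmax": retmax,
        "retmode": "xml",
    }
    return ESEARCH_URL + "?" + urlencode(params)


if __name__ == "__main__":
    url = build_update_query("tumor suppressor AND mutation AND cancer", date.today())
    print(url)
    # The query string can be inspected before the URL is used.
    print(parse_qs(urlparse(url).query)["mindate"])
```

A scheduler such as cron could run a script built around this once a month, passing the returned URL to whatever HTTP client the pipeline already uses.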

Conclusion

A comprehensive database of tumor suppressors and oncogenes is a powerful resource for cancer research. Built by systematically integrating and curating data from multiple sources, supplemented with text mining, and kept current through regular updates, it can provide valuable insights into the molecular mechanisms of cancer. Researchers and clinicians can use it to understand cancer biology and to develop targeted therapies.