A Step-by-Step Tutorial to Analyzing and Interpreting Next-Generation Sequencing Data
September 27, 2023Step 1: Downloading Sample Dataset
a. Download a Viral Genome
Visit NCBI’s Virus Database, and search for a simple viral genome, e.g., PhiX174. Download its genome in FASTA format and note the path.
Step 2: Setup Your Environment
a. Install the Required Software
b. Create a Working Directory
mkdir ngs_analysis
cd ngs_analysis
c. Move the Downloaded Genome to the Working Directory
mv /path_to_your_downloaded_file/phiX174.fasta .
Step 3: Sequence Alignment
a. Build the Bowtie2 Index
bowtie2-build phiX174.fasta phiX174_index
b. Download Sample Reads
For the tutorial’s purpose, we will assume you have sample reads in FASTQ format named sample.fastq
. Move it to your working directory.
c. Run the Bowtie2 Alignment
bowtie2 -x phiX174_index -U sample.fastq -S output.sam
Step 4: Process Alignment Files
a. Convert SAM to BAM
samtools view -bS output.sam > output.bam
b. Sort and Index the BAM File
samtools sort output.bam -o sorted_output.bam
samtools index sorted_output.bam
We will also go through the variant analysis in more detail and explore how to use Python or R to generate meaningful plots.
Step 5: Variant Analysis and Visualization
a. Variant Filtering
After calling variants using freebayes
, the next step is to filter the variants in the VCF file. This can be done using vcffilter
, a tool that comes with vcflib
. Install it as follows:
sudo apt-get install vcflib
Now, filter the variants:
vcffilter -f "QUAL > 20" variants.vcf > filtered_variants.vcf
This command will keep only variants with a quality score above 20.
Step 6: Interpretation of Variants
a. Further Exploration of Variants
After annotating your variants, you should further explore them based on their annotations to understand their biological impact.
Using Python:
# Assuming you have a 'TYPE' column specifying the variant type
variant_types = df['TYPE'].value_counts()
print(variant_types)# If you have an 'EFFECT' column specifying the effect of variants
variant_effects = df['EFFECT'].value_counts()
print(variant_effects)
Using R:
# Assuming vcf object holds the variant data
variantTypes <- table(elementMetadata(vcf)$TYPE)
print(variantTypes)variantEffects <- table(elementMetadata(vcf)$EFFECT)
print(variantEffects)
Step 7: Advanced Visualization
a. Visualization of Variant Effects and Types
You can visualize the different types and effects of variants through bar plots or pie charts.
Using Python:
import matplotlib.pyplot as plt# Visualizing variant types
variant_types.plot(kind='bar', color='teal', alpha=0.7)
plt.title('Variant Types')
plt.ylabel('Count')
plt.show()
# Visualizing variant effects
variant_effects.plot(kind='pie', autopct='%1.1f%%', startangle=140, shadow=True)
plt.axis('equal') # Equal aspect ratio ensures that pie is drawn as a circle.
plt.title('Variant Effects')
plt.show()
Using R:
library(ggplot2)# Visualizing variant types
ggplot(data=data.frame(variantTypes), aes(x=Var1, y=Freq)) +
geom_bar(stat='identity', fill='teal', alpha=0.7) +
labs(title="Variant Types", x="Type", y="Count") + theme_minimal()
# Visualizing variant effects
library(plotly)
effects_df <- as.data.frame(variantEffects)
plot_ly(effects_df, labels = ~Var1, values = ~Freq, type = 'pie') %>%
layout(title = 'Variant Effects', showlegend = T)
Step 8: Biological Interpretation
a. Pathway Analysis and Clinical Relevance:
i. DAVID:
- List your variant genes and upload them to DAVID for enrichment analysis.
- Study the enriched pathways or biological processes in your gene list, which might give insight into the biological context of your findings.
ii. ClinVar and gnomAD:
- Use ClinVar to find out if any of your variants are known to be associated with any disease.
- Use gnomAD to study the population frequency of your variants.
Step 9: Comprehensive Report Creation
a. Organize Your Findings:
- Introduction: Background of your study, and the importance of the study.
- Objective: Clearly define what you intend to achieve with this study.
- Methods: Clearly describe every step, the tools used, and their parameters.
- Results: Use tables, charts, graphs to represent your findings.
- Discussion: Explain the importance of your findings and their implications.
b. Utilize Visual Aids:
- Use appropriate visual aids like tables, charts, and graphs to represent your findings effectively.
- Properly label each visual aid and reference them appropriately in the text.
c. Review and Edit:
- Review the report multiple times to ensure clarity, coherence, and logical flow.
- Edit ruthlessly to ensure brevity while maintaining essence and clarity.
Step 10: Review and Finalize Report
a. Peer Review:
- Get colleagues or mentors to review your report.
- Incorporate the feedback and finalize the report.
b. Submission-Ready:
- Make sure the report is in the required format if you are aiming for publication.
- Make necessary adjustments as per journal guidelines.
Step 11: Share Your Findings
a. Presentation:
- Create a concise and clear presentation, highlighting the key aspects of your work.
- Practice it to ensure smooth delivery.
b. Publications:
- Select a suitable journal.
- Prepare your manuscript as per the journal’s guideline and submit it.
c. Data Sharing:
Step 12: Reviewers’ Comments and Publication
a. Address Reviewers’ Comments:
- Once submitted, you may receive comments and suggestions from peer reviewers.
- Address all comments diligently, make necessary changes, and resubmit the manuscript.
b. Publication:
- Once accepted, go through the proof carefully before it goes to publication.
- Share your published work with your peers and promote it for maximum reach.
Example of Python Code for Visuals:
import matplotlib.pyplot as plt# Example: Visualizing variant impact categories
impact_categories = df['IMPACT'].value_counts()
impact_categories.plot(kind='barh', color='skyblue')
plt.title('Variant Impact Categories')
plt.xlabel('Count')
plt.ylabel('Impact Category')
plt.show()
Example of R Code for Visuals:
library(ggplot2)# Example: Visualizing variant impact categories
impactCategories <- table(elementMetadata(vcf)$IMPACT)
ggplot(data=as.data.frame(impactCategories), aes(x=Var1, y=Freq)) +
geom_bar(stat='identity', fill='skyblue') +
labs(title="Variant Impact Categories", x="Impact Category", y="Count") + theme_minimal()
Remember, the steps involved may vary depending on the specifics of the study and the requirements of the journal if you are aiming to publish your work.