Step-by-Step Guide to Download Raw Sequence Data from GEO/SRA
December 27, 2024
This guide outlines the process of downloading raw sequencing data from GEO/SRA using the NCBI SRA Toolkit. It also covers troubleshooting for common issues and tips for efficient downloading.
Prerequisites
- Install the SRA Toolkit:
- Download the latest version of the SRA Toolkit: NCBI SRA Toolkit.
- Follow the installation instructions for your operating system.
- Install Entrez Direct (optional):
- Entrez Direct simplifies querying SRA databases. Installation guide: Entrez Direct.
- Environment Setup:
- Add the toolkit binaries to your PATH.
- Verify Installation:
- Run vdb-config -i to set up default download directories or change them to a directory with sufficient space.
- Test the installation.
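The environment setup and verification steps above might look like the following in a POSIX shell. This is a minimal sketch: the install location in SRA_BIN is an assumption (use wherever you unpacked the toolkit), and check_tools is a hypothetical helper written for this guide, not part of the SRA Toolkit.

```shell
# Assumed install location; replace with the directory you unpacked the toolkit into.
SRA_BIN="${SRA_BIN:-$HOME/sratoolkit/bin}"

# Prepend the toolkit to PATH for the current shell session.
export PATH="$SRA_BIN:$PATH"

# Hypothetical helper: report which of the named commands the shell can find.
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found:   $tool"
        else
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}

check_tools prefetch fastq-dump vdb-config || echo "Fix PATH before continuing."
```

To make the PATH change permanent, the export line can be added to your shell startup file (for example ~/.bashrc).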
Steps to Download Data
1. Identify the Accession Number
- Go to GEO: GEO
- Search for your dataset (e.g., GSE48215).
- Navigate to the sub-series or sample page (e.g., GSM1173000).
- Find the corresponding SRA links (e.g., SRP026538 or SRX317818).
2. Use prefetch to Download .sra Files
- Run prefetch with the accession number to download the .sra files.
- By default, files are saved to /home/<USER>/ncbi/public/sra.
Note: If your home directory lacks space, update the cache location by running vdb-config -i and changing the “Workspace Name” to a directory on a larger disk.
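As a rough sketch of this step, the helper below checks free space and then calls prefetch with an explicit output directory. The function names free_kb and download_run are hypothetical, SRR0000000 is a placeholder accession, and the --output-directory option assumes a reasonably recent toolkit release; older versions may only honor the location set through vdb-config.

```shell
# Print the free space (in KiB) on the filesystem that holds a directory.
free_kb() {
    df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

# Hypothetical helper: download one accession into a chosen directory.
download_run() {
    accession=$1
    dest=$2
    mkdir -p "$dest"
    echo "Free space in $dest: $(free_kb "$dest") KiB"
    prefetch --output-directory "$dest" "$accession"
}

# Example call; SRR0000000 is a placeholder, not a real run accession:
# download_run SRR0000000 /path/to/large/disk/sra
```

Checking free space first is optional, but .sra files for a single run can reach several gigabytes, so it helps to catch a full disk before the transfer starts.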
3. Convert .sra to .fastq
- Use fastq-dump to convert each downloaded .sra file.
- The --split-files option generates separate files for paired-end reads (e.g., _1.fastq and _2.fastq).
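A conversion along these lines could be wrapped as follows. This is a sketch rather than the only way to call the tool: sra_to_fastq and check_pair are hypothetical helper names, the paths are placeholders, and check_pair assumes the standard four lines per FASTQ record.

```shell
# Hypothetical helper: convert one .sra file, writing FASTQ files into out_dir.
sra_to_fastq() {
    sra_file=$1
    out_dir=$2
    mkdir -p "$out_dir"
    fastq-dump --split-files --outdir "$out_dir" "$sra_file"
}

# Hypothetical helper: confirm both mate files exist and hold the same number of lines.
# The argument is the path prefix without the _1.fastq / _2.fastq suffix.
check_pair() {
    r1="$1_1.fastq"
    r2="$1_2.fastq"
    if [ ! -s "$r1" ] || [ ! -s "$r2" ]; then
        echo "missing mate file for $1" >&2
        return 1
    fi
    n1=$(($(wc -l < "$r1")))
    n2=$(($(wc -l < "$r2")))
    if [ "$n1" -ne "$n2" ]; then
        echo "mate files differ in length: $n1 vs $n2 lines" >&2
        return 1
    fi
    # Assumes four lines per FASTQ record.
    echo "$((n1 / 4)) read pairs"
}

# Example calls; the accession and paths are placeholders:
# sra_to_fastq /path/to/sra/SRR0000000.sra /path/to/fastq
# check_pair /path/to/fastq/SRR0000000
```

For single-end runs only one FASTQ file is produced, so the pair check does not apply.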
4. Download Multiple Runs
- Use esearch and efetch with xargs for batch downloads.
- Convert all downloaded .sra files to .fastq.
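One possible shape for the batch workflow is sketched below. It assumes Entrez Direct is installed and that the run accession sits in the first column of the runinfo table; runs_from_runinfo, list_runs, and convert_all are hypothetical helper names, and the directory in the example is the default location mentioned in step 2.

```shell
# Pull run accessions (first CSV column) out of a runinfo table read from stdin.
runs_from_runinfo() {
    cut -d ',' -f 1 | grep -E '^[SED]RR[0-9]+$'
}

# List every run that belongs to a study, experiment, or project accession.
list_runs() {
    esearch -db sra -query "$1" | efetch -format runinfo | runs_from_runinfo
}

# Convert every .sra file found under a directory, writing FASTQ next to each file.
convert_all() {
    find "$1" -name '*.sra' | while IFS= read -r sra; do
        fastq-dump --split-files --outdir "$(dirname "$sra")" "$sra"
    done
}

# Example (needs network access and Entrez Direct):
# list_runs SRP026538 | xargs -n 1 prefetch
# convert_all "$HOME/ncbi/public/sra"
```

Filtering on the accession pattern drops the CSV header line and any blank rows before the list reaches xargs.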
5. Using Docker for SRA Toolkit
- Use a pre-built Docker container:
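The wrapper below sketches how a containerized toolkit might be called. The image name ncbi/sra-tools is an assumption (the image this guide originally referred to is not shown), sra_docker is a hypothetical helper, and SRR0000000 is a placeholder accession. The DRY_RUN switch only prints the docker command so you can inspect it before running anything.

```shell
# Assumed image name; substitute the image you actually use.
SRA_IMAGE="${SRA_IMAGE:-ncbi/sra-tools}"

# Hypothetical helper: run a toolkit command in the container with the
# current directory mounted at /data. Set DRY_RUN=1 to print the command only.
sra_docker() {
    set -- docker run --rm -v "$PWD:/data" -w /data "$SRA_IMAGE" "$@"
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Print the command that would be executed for a placeholder accession.
DRY_RUN=1 sra_docker prefetch SRR0000000
```

Mounting the working directory means downloaded files land on the host rather than inside the container, which is removed after each run.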
6. Alternative: Direct Download from ENA
- If you encounter issues with SRA Toolkit, download directly from ENA:
- Visit: EBI ENA
- Search for the dataset (e.g., SRX317818).
- Download .fastq files via FTP.
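If you prefer to build the FTP location yourself, the sketch below follows the directory layout ENA documents for its fastq area: the first six characters of the run accession, an extra zero-padded subdirectory for accessions longer than nine characters, then the accession itself. The function name ena_fastq_dir is hypothetical and SRR1234567 is a placeholder; confirm the real file names on the ENA record page before downloading.

```shell
# Hypothetical helper: print the ENA FTP directory for a run accession.
ena_fastq_dir() {
    run=$1
    prefix=$(printf '%s' "$run" | cut -c 1-6)
    case ${#run} in
        9)  sub="" ;;
        10) sub="00$(printf '%s' "$run" | cut -c 10)/" ;;
        11) sub="0$(printf '%s' "$run" | cut -c 10-11)/" ;;
        12) sub="$(printf '%s' "$run" | cut -c 10-12)/" ;;
        *)  echo "unexpected accession: $run" >&2; return 1 ;;
    esac
    echo "ftp://ftp.sra.ebi.ac.uk/vol1/fastq/$prefix/$sub$run/"
}

# Example (needs network access; SRR1234567 is a placeholder):
# RUN=SRR1234567
# wget "$(ena_fastq_dir "$RUN")${RUN}_1.fastq.gz" "$(ena_fastq_dir "$RUN")${RUN}_2.fastq.gz"
```

Paired-end runs are typically stored as _1.fastq.gz and _2.fastq.gz, while single-end runs use the accession alone as the file name.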
7. Troubleshooting Common Errors
- UTF-8 Character Error:
- Ensure there are no special characters in file paths, then re-run the command.
- No Accession to Process:
- Check that the .sra file exists and that paths are correct.
- Command Not Found:
- Verify that PATH includes the SRA Toolkit binaries.
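The three checks above can be bundled into a small diagnostic. This is only a sketch: has_non_ascii and diagnose are hypothetical helpers, and "special characters" is interpreted here as any byte outside printable ASCII, which is an assumption about what triggers the error.

```shell
# Succeeds when the argument contains bytes outside printable ASCII.
has_non_ascii() {
    printf '%s' "$1" | LC_ALL=C grep -q '[^ -~]'
}

# Hypothetical helper: run the checks from the list above against one .sra path.
diagnose() {
    sra_path=$1
    if has_non_ascii "$sra_path"; then
        echo "special characters in path: rename or move the file"
    elif [ ! -f "$sra_path" ]; then
        echo "file not found: check the path"
    elif ! command -v fastq-dump >/dev/null 2>&1; then
        echo "fastq-dump not on PATH"
    else
        echo "ok"
    fi
}

# Example call; the path is a placeholder:
# diagnose /path/to/sra/SRR0000000.sra
```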
8. Automate Download with a Script
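A script for this step could look like the sketch below, which reads run accessions from a text file (one per line) and runs prefetch followed by fastq-dump for each. It is an illustration, not a canonical script: process_runs, the RUNNER variable, and the file names in the example are hypothetical. Setting RUNNER=echo prints the commands instead of executing them.

```shell
#!/bin/sh
# Hypothetical helper: download and convert every run listed in a text file.
# Usage: process_runs <accession-list> <output-directory>
process_runs() {
    list=$1
    out=$2
    mkdir -p "$out"
    while IFS= read -r run; do
        # Skip blank lines and comment lines.
        case $run in ''|'#'*) continue ;; esac
        (
            # Work inside the output directory so results land there.
            cd "$out" || exit 1
            ${RUNNER:-} prefetch "$run" </dev/null &&
            ${RUNNER:-} fastq-dump --split-files "$run" </dev/null
        ) || echo "failed: $run" >&2
    done < "$list"
}

# Example: preview the commands for the runs listed in runs.txt
# RUNNER=echo process_runs runs.txt fastq_out
```

A failure on one accession is reported and the loop moves on, so a long batch is not aborted by a single bad run.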
9. Further Processing
- Align .fastq files using an aligner (e.g., STAR, BWA, TopHat).
- Perform variant calling or downstream analysis.
By following this guide, you can efficiently download and prepare sequencing data for downstream analysis.