Step-by-Step Guide to Download Raw Sequence Data from GEO/SRA

December 27, 2024, by admin

This guide outlines the process of downloading raw sequencing data from GEO/SRA using the NCBI SRA Toolkit. We will also cover troubleshooting common issues and tips for efficient downloading.

Prerequisites

Install the SRA Toolkit:
- Download the latest version of the SRA Toolkit: NCBI SRA Toolkit.
- Follow the installation instructions for your operating system.
Install Entrez Direct (optional):
- Entrez Direct simplifies querying SRA databases. Installation guide: Entrez Direct.
Environment Setup:
- Add the toolkit binaries to your PATH:
  bash
  export PATH=$PATH:/path/to/sratoolkit/bin
Verify Installation:
- Run vdb-config -i to set up default download directories or change them to a directory with sufficient space.
- Test the installation:
  bash
  prefetch --version
  fastq-dump --version
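
The version checks above fail with "command not found" if the PATH step was skipped. A minimal sketch of a pre-flight check follows; the function name check_tools is hypothetical and not part of the SRA Toolkit, and it only confirms that the commands are discoverable, not that they are configured.

```bash
# Report which of the named commands are missing from PATH.
# Returns 0 if every command is found, 1 otherwise.
check_tools() {
    missing=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing from PATH: $tool" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Usage:
#   check_tools prefetch fastq-dump vdb-config || echo "fix PATH first" >&2
```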

Steps to Download Data

1. Identify the Accession Number

Go to GEO (the NCBI Gene Expression Omnibus).
Search for your dataset (e.g., GSE48215).
Navigate to the sub-series or sample page (e.g., GSM1173000).
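Find the corresponding SRA links (e.g., SRP026538 or SRX317818). The run accessions (prefixed SRR) that the download commands below take as input are listed on the SRX page.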
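
The accession prefixes in the steps above indicate which database level an identifier belongs to. As a rough illustration, the hypothetical helper below (accession_kind is not part of any NCBI tool) maps a prefix to its record type. The download commands in this guide are given run accessions (SRR).

```bash
# Classify an accession by its prefix.
# Prints the record type; returns 1 for prefixes this sketch does not know.
accession_kind() {
    case "$1" in
        GSE[0-9]*)   echo "GEO series" ;;
        GSM[0-9]*)   echo "GEO sample" ;;
        PRJNA[0-9]*) echo "BioProject" ;;
        SRP[0-9]*)   echo "SRA study" ;;
        SRX[0-9]*)   echo "SRA experiment" ;;
        SRR[0-9]*)   echo "SRA run" ;;
        *)           echo "unknown"; return 1 ;;
    esac
}

# Usage:
#   accession_kind GSM1173000
#   accession_kind SRR925811
```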

2. Use `prefetch` to Download `.sra` Files

Run the following command to download .sra files:
bash
prefetch -v SRR925811
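- By default, older toolkit releases save files to /home/<USER>/ncbi/public/sra. Recent releases typically create a directory named after the accession (e.g., SRR925811/SRR925811.sra) in the current working directory unless a cache location has been configured.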

Note: If your home directory lacks space, update the cache location: run vdb-config -i (as in the Prerequisites) and change the “Workspace Name” to a directory on a larger disk.
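
Because .sra files can be large, it may help to check free space before starting a download. The sketch below uses the standard df utility; the function names free_kb and has_free_kb are hypothetical, and the threshold in the usage line is an arbitrary placeholder, not a recommendation from the toolkit.

```bash
# Print the free space, in kilobytes, on the filesystem holding a directory.
free_kb() {
    # -P forces the portable one-line-per-filesystem format; field 4 is "Available".
    df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

# Succeed only if the directory has at least the requested number of kilobytes free.
has_free_kb() {
    [ "$(free_kb "$1")" -ge "$2" ]
}

# Usage (52428800 KB is roughly 50 GB, an arbitrary example threshold):
#   has_free_kb "$HOME" 52428800 || echo "low disk space; consider moving the cache" >&2
```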

3. Convert `.sra` to `.fastq`

Use fastq-dump:
bash
fastq-dump --outdir /path/to/output/ --split-files /home/<USER>/ncbi/public/sra/SRR925811.sra
- The --split-files option generates separate files for paired-end reads (e.g., _1.fastq and _2.fastq).
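
After conversion, a quick sanity check is to count the reads in each output file; for paired-end data, the _1 and _2 files should report the same number. The sketch below assumes uncompressed FASTQ with exactly four lines per record. The function name fastq_read_count is hypothetical.

```bash
# Print the number of reads in an uncompressed FASTQ file.
# Fails if the line count is not a multiple of 4, which suggests truncation.
fastq_read_count() {
    lines=$(wc -l < "$1") || return 1
    lines=$((lines + 0))   # normalize any padding some wc implementations add
    if [ $((lines % 4)) -ne 0 ]; then
        echo "$1: line count $lines is not a multiple of 4" >&2
        return 1
    fi
    echo $((lines / 4))
}

# Usage:
#   fastq_read_count /path/to/output/SRR925811_1.fastq
#   fastq_read_count /path/to/output/SRR925811_2.fastq
```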

4. Download Multiple Runs

Use esearch and efetch with xargs for batch downloads:
bash
esearch -db sra -query "PRJNA40075" | efetch -format runinfo | cut -d ',' -f 1 | grep SRR | xargs -I {} prefetch {}
Convert all downloaded .sra files to .fastq:
bash
for file in /path/to/sra/*.sra; do
  fastq-dump --split-files "$file"
done
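
The cut and grep stages in the batch download pipeline above can be replaced by a single, stricter filter. The sketch below assumes, as that pipeline does, that the run accession is the first column of the runinfo CSV. It matches the whole field, so header lines are skipped, and it also accepts the ERR and DRR prefixes used for runs submitted through ENA and DDBJ. The function name runinfo_runs is hypothetical.

```bash
# Read runinfo CSV on stdin and print one run accession per line.
# Header lines and blank lines do not match the pattern and are dropped.
runinfo_runs() {
    awk -F',' '$1 ~ /^[SED]RR[0-9]+$/ { print $1 }'
}

# Usage:
#   esearch -db sra -query "PRJNA40075" | efetch -format runinfo \
#       | runinfo_runs | xargs -I {} prefetch {}
```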

5. Using Docker for SRA Toolkit

Use a pre-built Docker container:
bash
docker run -v $PWD:$PWD quay.io/biocontainers/sra-tools:3.1.1--h4304569_2 prefetch SRR28520745 --output-directory $PWD/
docker run -v $PWD:$PWD quay.io/biocontainers/sra-tools:3.1.1--h4304569_2 fastq-dump --outdir $PWD/SRR28520745/ --split-files $PWD/SRR28520745/SRR28520745.sra

6. Alternative: Direct Download from ENA

If you encounter issues with SRA Toolkit, download directly from ENA:
- Visit: EBI ENA
- Search for the dataset (e.g., SRX317818).
- Download .fastq files via FTP:
  bash
  wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR925/SRR925811/SRR925811_1.fastq.gz
  wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR925/SRR925811/SRR925811_2.fastq.gz
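
The FTP paths above follow a pattern that can be derived from the run accession. The sketch below builds such a URL. It assumes ENA's published directory layout, in which accessions longer than nine characters gain an extra zero-padded subdirectory made from their trailing digits. Treat that layout as an assumption and confirm the actual links on the ENA record page. The function name ena_fastq_url is hypothetical.

```bash
# Build an ENA FASTQ FTP URL for a run accession.
#   $1: run accession, e.g. SRR925811
#   $2: mate number (1 or 2) for paired-end data; omit for single-end files
ena_fastq_url() {
    acc=$1
    mate=$2
    prefix=$(printf '%s' "$acc" | cut -c1-6)
    rest=${acc#?????????}   # whatever follows the first nine characters
    case ${#acc} in
        9)  subdir="" ;;
        10) subdir="00$rest/" ;;
        11) subdir="0$rest/" ;;
        12) subdir="$rest/" ;;
        *)  echo "unexpected accession length: $acc" >&2; return 1 ;;
    esac
    printf 'ftp://ftp.sra.ebi.ac.uk/vol1/fastq/%s/%s%s/%s%s.fastq.gz\n' \
        "$prefix" "$subdir" "$acc" "$acc" "${mate:+_$mate}"
}

# Usage:
#   wget "$(ena_fastq_url SRR925811 1)"
#   wget "$(ena_fastq_url SRR925811 2)"
```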

7. Troubleshooting Common Errors

UTF-8 Character Error:
- Ensure the file paths contain no special (non-ASCII) characters, then re-run:
  bash
  fastq-dump --outdir /clean/path/ --split-files /path/to/SRR6294675.sra
No Accession to Process:
- Check that the .sra file exists and that the path passed to fastq-dump is correct.
Command Not Found:
- Verify PATH includes the SRA Toolkit binaries.
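
For the UTF-8 character error above, a path can be screened before it is passed to fastq-dump. The sketch below flags any path containing bytes outside printable ASCII. It is an illustrative check only, the function name path_has_non_ascii is hypothetical, and it does not reproduce the toolkit's own validation.

```bash
# Succeed (return 0) if the given path contains any byte outside printable ASCII.
path_has_non_ascii() {
    # Delete every printable ASCII character (space through tilde);
    # anything left over is a control or non-ASCII byte.
    leftover=$(printf '%s' "$1" | LC_ALL=C tr -d ' -~')
    [ -n "$leftover" ]
}

# Usage:
#   if path_has_non_ascii "/path/to/SRR6294675.sra"; then
#       echo "rename or move the file to an ASCII-only path" >&2
#   fi
```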