
Step-by-Step Guide to Download Raw Sequence Data from GEO/SRA

December 27, 2024, by admin

This guide outlines the process of downloading raw sequencing data from GEO/SRA using the NCBI SRA Toolkit. We will also cover troubleshooting common issues and tips for efficient downloading.


Prerequisites

  1. Install the SRA Toolkit:
    • Download the latest version of the SRA Toolkit: NCBI SRA Toolkit.
    • Follow the installation instructions for your operating system.
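  2. Install Entrez Direct (optional):
    • Entrez Direct provides the esearch and efetch commands used for the batch downloads in step 4 and the script in step 8.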
  3. Environment Setup:
    • Add the toolkit binaries to your PATH:
      bash
      export PATH=$PATH:/path/to/sratoolkit/bin
  4. Verify Installation:
    • Run vdb-config -i to set up default download directories or change them to a directory with sufficient space.
    • Test the installation:
      bash
      prefetch --version
      fastq-dump --version
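
The PATH check above can also be scripted. The following is a minimal sketch; the function name check_tools is hypothetical, and it only confirms that each executable can be found, not that it is configured correctly.

```shell
# check_tools: report which of the given commands are missing from PATH.
# Returns 0 if every command is found, 1 otherwise. (Hypothetical helper.)
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found:   $tool"
        else
            echo "missing: $tool" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Example call (commented out so that sourcing this sketch has no side effects):
# check_tools prefetch fastq-dump vdb-config
```

If any tool is reported as missing, revisit the PATH line in step 3 before continuing.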

Steps to Download Data

1. Identify the Accession Number

  • Go to GEO: GEO
  • Search for your dataset (e.g., GSE48215).
  • Navigate to the sub-series or sample page (e.g., GSM1173000).
  • Find the corresponding SRA links (e.g., SRP026538 or SRX317818).
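
The accession prefixes in this step identify different record levels, and the download commands in this guide operate on run accessions (SRR). As a rough illustration, the prefixes can be told apart with a small helper; the function name accession_type is hypothetical and it covers only the prefixes that appear in this guide.

```shell
# accession_type: print the record level that an accession prefix denotes.
# Only the prefixes used in this guide are covered. (Hypothetical helper.)
accession_type() {
    case "$1" in
        GSE*)   echo "GEO series" ;;
        GSM*)   echo "GEO sample" ;;
        PRJNA*) echo "BioProject" ;;
        SRP*)   echo "SRA study" ;;
        SRX*)   echo "SRA experiment" ;;
        SRR*)   echo "SRA run" ;;
        *)      echo "unknown"; return 1 ;;
    esac
}

accession_type GSM1173000   # prints: GEO sample
accession_type SRR925811    # prints: SRA run
```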

2. Use prefetch to Download .sra Files

  • Run the following command to download .sra files:
    bash
    prefetch -v SRR925811
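    • Depending on the toolkit version and configuration, files are saved either to /home/<USER>/ncbi/public/sra or to a directory named after the accession in the current working directory (e.g., ./SRR925811/SRR925811.sra).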

Note: If your home directory lacks space, update the cache location:

bash
vdb-config -i

Change the “Workspace Name” to a directory on a larger disk.
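
Before choosing a cache location, it can help to check how much space a candidate directory actually has. The sketch below assumes a POSIX df; the function names free_kb and has_space are hypothetical, and the required size is something you estimate yourself.

```shell
# free_kb: print the free space, in KiB, on the filesystem holding a directory.
# df -P forces the portable one-line-per-filesystem layout; column 4 is "Available".
free_kb() {
    df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

# has_space DIR NEEDED_KB: succeed if DIR has at least NEEDED_KB KiB free.
has_space() {
    [ "$(free_kb "$1")" -ge "$2" ]
}

# Example (the 50 GiB threshold is an arbitrary illustration):
# has_space /path/to/larger/disk $((50 * 1024 * 1024)) || echo "not enough space"
```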


3. Convert .sra to .fastq

  • Use fastq-dump:
    bash
    fastq-dump --outdir /path/to/output/ --split-files /home/<USER>/ncbi/public/sra/SRR925811.sra
    • The --split-files option generates separate files for paired-end reads (e.g., _1.fastq and _2.fastq).

Example:

bash
fastq-dump --outdir ./data/ --split-files ./sra/SRR925811.sra
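
After converting a paired-end run, a quick sanity check is to confirm that both mate files contain the same number of reads. This is a minimal sketch, assuming uncompressed FASTQ with four lines per record; the function names count_reads and pairs_match are hypothetical.

```shell
# count_reads: number of records in an uncompressed FASTQ file,
# assuming the usual four lines per record.
count_reads() {
    echo $(( $(wc -l < "$1") / 4 ))
}

# pairs_match: succeed if two mate files contain the same number of reads.
pairs_match() {
    [ "$(count_reads "$1")" -eq "$(count_reads "$2")" ]
}

# Example (commented out; the paths follow the example above):
# pairs_match ./data/SRR925811_1.fastq ./data/SRR925811_2.fastq || echo "mate files differ"
```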

4. Download Multiple Runs

  • Use esearch and efetch with xargs for batch downloads:
    bash
    esearch -db sra -query "PRJNA40075" | efetch --format runinfo | cut -d ',' -f 1 | grep SRR | xargs -I {} prefetch {}
  • Convert all downloaded .sra files to .fastq:
    bash
    for file in /path/to/sra/*.sra; do
        fastq-dump --split-files "$file"
    done
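
The cut and grep filter above relies on the run accession being the first column and starting with SRR. A slightly more defensive variant looks up the column by its header name, which also keeps runs with other prefixes such as ERR or DRR. This is a sketch only; the function name runs_from_runinfo is hypothetical, and it assumes the runinfo output is comma-separated with a header field named Run.

```shell
# runs_from_runinfo: read a runinfo CSV on stdin and print the "Run" column.
# The column is located by header name; blank lines and any repeated
# header lines are skipped. (Hypothetical helper.)
runs_from_runinfo() {
    awk -F ',' '
        col == 0 { for (i = 1; i <= NF; i++) if ($i == "Run") col = i; next }
        $col != "" && $col != "Run" { print $col }
    '
}

# Example (commented out because it needs network access and Entrez Direct):
# esearch -db sra -query "PRJNA40075" | efetch --format runinfo | runs_from_runinfo | xargs -I {} prefetch {}
```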

5. Using Docker for SRA Toolkit

  • Use a pre-built Docker container:
    bash
    docker run -v $PWD:$PWD quay.io/biocontainers/sra-tools:3.1.1--h4304569_2 prefetch SRR28520745 --output-directory $PWD/
    docker run -v $PWD:$PWD quay.io/biocontainers/sra-tools:3.1.1--h4304569_2 fastq-dump --outdir $PWD/SRR28520745/ --split-files $PWD/SRR28520745/SRR28520745.sra

6. Alternative: Direct Download from ENA

  • If you encounter issues with SRA Toolkit, download directly from ENA:
    • Visit: EBI ENA
    • Search for the dataset (e.g., SRX317818).
    • Download .fastq files via FTP:
      bash
      wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR925/SRR925811/SRR925811_1.fastq.gz
      wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR925/SRR925811/SRR925811_2.fastq.gz
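
The FTP paths above follow a regular pattern: the first six characters of the run accession form one directory level, and accessions longer than nine characters get an additional zero-padded subdirectory built from the remaining digits. The sketch below builds such URLs under that assumption; the function names ena_fastq_dir and ena_fastq_url are hypothetical, and it is worth confirming the result on the ENA website, since single-end runs are typically named without the _1 or _2 suffix.

```shell
# ena_fastq_dir: print the ENA FTP directory for a run accession.
ena_fastq_dir() {
    acc=$1
    base="ftp://ftp.sra.ebi.ac.uk/vol1/fastq"
    prefix=$(printf '%s' "$acc" | cut -c1-6)   # e.g. SRR925
    extra=$(printf '%s' "$acc" | cut -c10-)    # characters after the ninth, if any
    case ${#extra} in
        0) echo "$base/$prefix/$acc" ;;
        1) echo "$base/$prefix/00$extra/$acc" ;;
        2) echo "$base/$prefix/0$extra/$acc" ;;
        *) echo "$base/$prefix/$extra/$acc" ;;
    esac
}

# ena_fastq_url ACCESSION MATE: print the URL of one paired-end mate file.
ena_fastq_url() {
    echo "$(ena_fastq_dir "$1")/${1}_${2}.fastq.gz"
}

# Example (commented out because it downloads data):
# wget "$(ena_fastq_url SRR925811 1)" "$(ena_fastq_url SRR925811 2)"
```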

7. Troubleshooting Common Errors

  • UTF-8 Character Error:
    • Ensure that file paths contain no special (non-ASCII) characters, then re-run:
      bash
      fastq-dump --outdir /clean/path/ --split-files /path/to/SRR6294675.sra
  • No Accession to Process:
    • Check that the .sra file exists and that the path to it is correct.
  • Command Not Found:
    • Verify PATH includes the SRA Toolkit binaries.
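
For the character error, one way to spot a problematic path before running fastq-dump is to test whether it contains anything other than printable ASCII. This is a minimal sketch of such a check; the function name has_only_plain_ascii is hypothetical, and treating all non-ASCII bytes as suspect is an assumption about what triggers the error.

```shell
# has_only_plain_ascii: succeed if the string contains only printable ASCII
# characters (space through tilde). Any other byte is left over after tr
# deletes that range, which makes the test fail.
has_only_plain_ascii() {
    leftover=$(printf '%s' "$1" | LC_ALL=C tr -d '\040-\176')
    [ -z "$leftover" ]
}

# Example (commented out; the path follows the example above):
# has_only_plain_ascii "/path/to/SRR6294675.sra" || echo "path contains special characters"
```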

8. Automate Download with a Script

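bash
#!/bin/bash

# Input file: download_list.txt (tab-separated: sample_id, SRX_accession)
input_file="download_list.txt"

while IFS=$'\t' read -r sample srx; do
    echo "Processing $sample ($srx)"
    # < /dev/null stops esearch from consuming the rest of the input list
    runs=$(esearch -db sra -query "$srx" < /dev/null | efetch --format runinfo | cut -d ',' -f 1 | grep SRR)
    # An experiment can contain several runs, so handle each one separately
    for srr in $runs; do
        prefetch "$srr"
        fastq-dump --split-files "$srr"
    done
done < "$input_file"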


9. Further Processing


10. Recommended Reading


By following this guide, you can efficiently download and prepare sequencing data for downstream analysis.
