Step-by-Step Guide to Downloading TCGA Data from GDC

January 10, 2025, by admin

The Genomic Data Commons (GDC) is the primary repository for The Cancer Genome Atlas (TCGA) data. This guide walks through downloading TCGA data from the GDC with the GDC Data Transfer Tool (gdc-client) and the GenomicDataCommons R package, then merging the downloaded files.


Step 1: Access the GDC Data Portal

  1. Go to the GDC Data Portal (https://portal.gdc.cancer.gov/).
  2. Use the search and filtering options to find the files you need. You can filter by project (e.g., TCGA-OV), data category (e.g., DNA Methylation, Transcriptome Profiling), and data type (e.g., Gene Expression Quantification).
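
The same filters can also be sent to the GDC REST API, which is useful when a portal search has to be repeated. The sketch below assumes the public `https://api.gdc.cancer.gov/files` endpoint and its JSON filter syntax; the helper names `build_filter` and `search_files` and the example values are illustrative and not part of this guide.

```python
import json
import urllib.parse
import urllib.request

GDC_FILES_ENDPOINT = "https://api.gdc.cancer.gov/files"  # public GDC API endpoint


def build_filter(project_id, data_category, data_type):
    """Return a GDC filter that requires all three fields to match."""
    def clause(field, value):
        return {"op": "in", "content": {"field": field, "value": [value]}}

    return {
        "op": "and",
        "content": [
            clause("cases.project.project_id", project_id),
            clause("data_category", data_category),
            clause("data_type", data_type),
        ],
    }


def search_files(filters, size=10):
    """Query the files endpoint and return the list of matching file records."""
    params = urllib.parse.urlencode({
        "filters": json.dumps(filters),
        "fields": "file_id,file_name,access",
        "format": "JSON",
        "size": str(size),
    })
    with urllib.request.urlopen(f"{GDC_FILES_ENDPOINT}?{params}") as response:
        return json.load(response)["data"]["hits"]


# Example values are assumptions; adjust them to the facets shown in the portal.
example_filter = build_filter(
    "TCGA-OV", "Transcriptome Profiling", "Gene Expression Quantification"
)
```

Calling `search_files(example_filter)` needs network access and returns one dictionary per matching file.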

Step 2: Obtain a Manifest File

  1. After selecting the files you want to download, add them to your cart.
  2. Click on the “Cart” icon in the top-right corner.
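  3. Click “Download Manifest” to download a manifest file (referred to below as gdc_manifest.txt; the downloaded name may include a date). This tab-separated file lists the UUID, file name, MD5 checksum, and size of each selected file, and the GDC Data Transfer Tool reads it to decide what to download.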
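
Before starting a large download, it can be worth checking how many files the manifest lists and how much data they add up to. The sketch below assumes the usual tab-separated layout with `id`, `filename`, `md5`, `size`, and `state` columns; the sample rows are made up for illustration, and the helper names are hypothetical.

```python
import csv
import io


def read_manifest(text):
    """Parse tab-separated manifest text into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def summarize_manifest(rows):
    """Return the number of files and their combined size in bytes."""
    total_bytes = sum(int(row["size"]) for row in rows)
    return len(rows), total_bytes


# Made-up two-file manifest used only to illustrate the layout.
SAMPLE_MANIFEST = (
    "id\tfilename\tmd5\tsize\tstate\n"
    "00000000-0000-0000-0000-000000000001\tsample_a.counts.gz\t"
    "d41d8cd98f00b204e9800998ecf8427e\t1024\treleased\n"
    "00000000-0000-0000-0000-000000000002\tsample_b.counts.gz\t"
    "d41d8cd98f00b204e9800998ecf8427e\t2048\treleased\n"
)

rows = read_manifest(SAMPLE_MANIFEST)
file_count, total_bytes = summarize_manifest(rows)
print(f"{file_count} files, {total_bytes} bytes")  # prints "2 files, 3072 bytes"
```

For a real manifest, read the file with `open(path).read()` and pass the text to `read_manifest`.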

Step 3: Install the GDC Data Transfer Tool

  1. Download the GDC Data Transfer Tool for your operating system from the GDC website (https://gdc.cancer.gov/access-data/gdc-data-transfer-tool).
  2. Extract the downloaded file to a directory of your choice.
  3. Add the directory to your system’s PATH for easy access.
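
To confirm that the PATH change took effect, you can check from a script whether the executable is found. This is a minimal sketch using Python's standard library; it assumes the executable is named `gdc-client` and that it accepts a `--version` flag, and the helper names are hypothetical.

```python
import shutil
import subprocess


def find_client(name="gdc-client"):
    """Return the full path of the executable if it is on PATH, else None."""
    return shutil.which(name)


def client_version(name="gdc-client"):
    """Return the version text printed by the tool, or None if it is not found."""
    path = find_client(name)
    if path is None:
        return None
    result = subprocess.run(
        [path, "--version"], capture_output=True, text=True, check=False
    )
    # Some tools print the version to stderr, so both streams are checked.
    return (result.stdout or result.stderr).strip()
```

If `find_client()` returns `None`, the directory containing the tool is not on PATH for the current session.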

Step 4: Download Data Using the Manifest File

  1. Open a terminal or command prompt.
  2. Navigate to the directory where the manifest file is located.
  3. Run the following command to download the data:
    bash
    gdc-client download -m gdc_manifest.txt

    This downloads every file listed in the manifest into the current directory. gdc-client places each file in its own subdirectory named after the file's UUID.
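
When the download is part of a larger pipeline, the command can be assembled and launched from a script. The sketch below assumes gdc-client's `-m` (manifest), `-d` (output directory), and `-t` (token file) options; the function names and paths are hypothetical.

```python
import subprocess
from pathlib import Path


def build_download_command(manifest, out_dir=None, token=None):
    """Assemble the gdc-client argument list without running it."""
    command = ["gdc-client", "download", "-m", str(manifest)]
    if out_dir is not None:
        command += ["-d", str(out_dir)]  # -d/--dir: where downloads are written
    if token is not None:
        command += ["-t", str(token)]  # -t/--token-file: for controlled-access files
    return command


def run_download(manifest, out_dir=None, token=None):
    """Create the output directory if needed, then run the download."""
    if out_dir is not None:
        Path(out_dir).mkdir(parents=True, exist_ok=True)
    command = build_download_command(manifest, out_dir, token)
    # check=True raises CalledProcessError if gdc-client exits with an error.
    return subprocess.run(command, check=True)
```

For example, `run_download("gdc_manifest.txt", out_dir="tcga_data")` would write the files under a `tcga_data` folder instead of the current directory.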


Step 5: Download Controlled-Access Data

  1. To download controlled-access data, you need an authentication token.
  2. Log in to the GDC Data Portal and download your token from the “Download Token” link under your username.
  3. Save the token file (e.g., gdc-user-token.txt) in a secure location.
  4. Run the following command to download controlled-access data:
    bash
    gdc-client download -m gdc_manifest.txt -t gdc-user-token.txt
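
Because the token grants access to controlled data, "a secure location" usually also means a file that only your user account can read. The following sketch shows one way to restrict and check the permissions on Linux or macOS; it is an illustration using Python's standard library, not a GDC requirement stated in this guide, and the function names are hypothetical.

```python
import os
import stat


def secure_token_file(path):
    """Make the token file readable and writable by its owner only."""
    os.chmod(path, stat.S_IRUSR | stat.S_IWUSR)  # equivalent to chmod 600


def is_private(path):
    """Return True if group and others have no permissions on the file."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    return mode & (stat.S_IRWXG | stat.S_IRWXO) == 0
```

Calling `secure_token_file("gdc-user-token.txt")` once after saving the token is enough; `is_private` can be used as a check before each download.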

Step 6: Download Data Using UUIDs

If you already know the UUIDs of the files you want, you can skip the manifest and pass a UUID directly to gdc-client:

bash
gdc-client download <UUID>

Replace <UUID> with the actual UUID of the file. To download several files at once, list their UUIDs separated by spaces.
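
A mistyped UUID only fails once the download starts, so it can help to validate identifiers first. The sketch below checks the format with Python's `uuid` module and builds the matching URL for the GDC API's `data` endpoint (`https://api.gdc.cancer.gov/data/<UUID>`), which serves open-access files over HTTPS; the helper names are hypothetical.

```python
import uuid

GDC_DATA_ENDPOINT = "https://api.gdc.cancer.gov/data"


def is_valid_uuid(value):
    """Return True if value is a UUID in the canonical hyphenated form."""
    try:
        return str(uuid.UUID(value)) == value.lower()
    except (ValueError, AttributeError, TypeError):
        return False


def data_url(file_uuid):
    """Return the API download URL for one file, rejecting malformed UUIDs."""
    if not is_valid_uuid(file_uuid):
        raise ValueError(f"not a valid UUID: {file_uuid!r}")
    return f"{GDC_DATA_ENDPOINT}/{file_uuid}"
```

The check only confirms the format; it does not tell you whether the GDC actually holds a file with that UUID.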


Step 7: Troubleshooting Common Issues

7.1 GLIBC Version Error

  • Error: ./gdc-client: /lib64/libc.so.6: version 'GLIBC_2.14' not found
  • Solution: Run the tool on a system that provides the required GLIBC version (e.g., a recent Ubuntu release). Upgrading GLIBC in place is possible but risky, because most programs on the system depend on it.
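
To see which GLIBC version a machine provides before installing the tool, you can query it from Python. This sketch relies on `platform.libc_ver()` from the standard library; the required version 2.14 is taken from the error message above, and the helper names are hypothetical. Versions are compared as number tuples because a plain string comparison would rank "2.5" above "2.14".

```python
import platform


def parse_version(text):
    """Turn a version string such as '2.14' into a tuple of integers."""
    return tuple(int(part) for part in text.split(".") if part.isdigit())


def glibc_is_at_least(required="2.14"):
    """Return True/False for glibc systems, or None if glibc was not detected."""
    lib, version = platform.libc_ver()
    if lib != "glibc" or not version:
        return None  # not a glibc system, or the version could not be read
    return parse_version(version) >= parse_version(required)
```

A result of `False` means the machine's GLIBC is older than the version the binary asks for.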

7.2 Internet Connection Issues

  • Error: Download fails due to internet problems.
  • Solution: Rerun the same download command. The GDC Data Transfer Tool supports resuming interrupted downloads.
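
On an unreliable connection, the retry can be automated. The sketch below wraps any callable in a simple retry loop with an increasing pause between attempts; the function names, attempt count, and delays are illustrative choices, not settings of the GDC Data Transfer Tool.

```python
import subprocess
import time


def retry(action, attempts=3, delay_seconds=5.0, sleep=time.sleep):
    """Call action() until it succeeds or the attempts run out."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except (subprocess.CalledProcessError, OSError) as error:
            last_error = error
            if attempt < attempts:
                sleep(delay_seconds * attempt)  # wait longer after each failure
    raise last_error


def download_manifest(manifest="gdc_manifest.txt"):
    """Run gdc-client once; raises CalledProcessError on a non-zero exit code."""
    return subprocess.run(["gdc-client", "download", "-m", manifest], check=True)
```

`retry(download_manifest)` would then rerun the download up to three times before giving up and re-raising the last error.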

7.3 Permission Issues

  • Error: Permission denied when running gdc-client.
  • Solution: Use chmod to make the file executable:
    bash
    chmod +x gdc-client

Step 8: Using the GenomicDataCommons R Package

For a programmatic approach, you can use the GenomicDataCommons R package to query and download TCGA data.

8.1 Install the Package

R
if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("GenomicDataCommons")

8.2 Query and Download Data

R
library(GenomicDataCommons)
library(magrittr)

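# Build a manifest for gene expression data from ovarian cancer.
# Note: newer GDC data releases report counts under the 'STAR - Counts'
# workflow; if this query returns no files, try that value instead.
ge_manifest <- files() %>%
    filter(~ cases.project.project_id == 'TCGA-OV' &
           type == 'gene_expression' &
           analysis.workflow_type == 'HTSeq - Counts') %>%
    manifest()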

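# Download the files.
# Note: the arguments below follow older releases of the package; recent
# releases may store downloads in a cache directory and not accept
# destination_dir or overwrite. Check ?gdcdata for your installed version.
destdir <- tempdir()
fnames <- lapply(ge_manifest$id, gdcdata, destination_dir = destdir, overwrite = TRUE, progress = FALSE)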

Step 9: Merging and Processing Downloaded Data

After downloading, you may need to merge and process the data. For example, to merge gene expression files:

9.1 Using R

R
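library(data.table)  # reading .gz files with fread() also requires the R.utils package

# List all downloaded count files (pattern is a regular expression, not a glob)
count_files <- list.files(path = destdir, pattern = "\\.htseq\\.counts\\.gz$",
                          full.names = TRUE, recursive = TRUE)

# Read each file, naming the count column after the file so column names stay unique
data_list <- lapply(count_files, function(f) {
    sample_id <- sub("\\.htseq\\.counts\\.gz$", "", basename(f))
    fread(f, header = FALSE, col.names = c("gene_id", sample_id))
})

# Merge all samples on the gene ID column
merged_data <- Reduce(function(x, y) merge(x, y, by = "gene_id"), data_list)

# Save merged data
fwrite(merged_data, "merged_gene_expression.csv")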

9.2 Using Python

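python
import glob
import os

import pandas as pd

# List all downloaded files, including those inside per-file subdirectories
files = glob.glob("path/to/downloaded/files/**/*.htseq.counts.gz", recursive=True)

# Read each file with the gene ID as the index and the file name prefix as the column name
data_list = [
    pd.read_csv(f, sep="\t", header=None, index_col="Gene",
                names=["Gene", os.path.basename(f).split(".")[0]])
    for f in files
]

# Join the samples column-wise; rows are aligned on the gene ID index
merged_data = pd.concat(data_list, axis=1)

# Save merged data (the index holds the gene IDs, so it is kept)
merged_data.to_csv("merged_gene_expression.csv")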
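
One detail to watch when merging: files written by htseq-count end with summary rows whose names start with a double underscore (for example `__no_feature` and `__ambiguous`), and these are not genes. The sketch below, using only the standard library, reads one gzipped two-column counts file and skips those rows; the function name is hypothetical, and it assumes the tab-separated, header-free layout used in the examples above.

```python
import csv
import gzip


def read_counts(path):
    """Read one gzipped counts file into {gene_id: count}, skipping summary rows."""
    counts = {}
    with gzip.open(path, "rt", newline="") as handle:
        for row in csv.reader(handle, delimiter="\t"):
            if len(row) != 2:
                continue  # ignore blank or malformed lines
            gene_id, count = row
            if gene_id.startswith("__"):
                continue  # htseq-count summary rows such as __no_feature
            counts[gene_id] = int(count)
    return counts
```

The same filter can be applied to a merged table by dropping every row whose gene ID starts with `__`.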

Conclusion

TCGA data can be downloaded from the GDC with the GDC Data Transfer Tool or programmatically with the GenomicDataCommons R package. The steps above cover both routes, from building a manifest to merging the downloaded expression files, so you can use whichever fits your workflow.