Step-by-Step Guide to Downloading TCGA Data from GDC
January 10, 2025
The Genomic Data Commons (GDC) is the primary repository for The Cancer Genome Atlas (TCGA) data. This guide provides a detailed protocol for downloading TCGA data from the GDC using the GDC Data Transfer Tool and other methods.
Step 1: Access the GDC Data Portal
- Go to the GDC Data Portal.
- Use the search and filtering options to find the datasets you need. You can filter by project (e.g., TCGA), data category (e.g., DNA methylation, gene expression), and data type (e.g., raw data, processed data).
Step 2: Obtain a Manifest File
- After selecting the files you want to download, add them to your cart.
- Click on the “Cart” icon in the top-right corner.
- Click “Download Manifest” to download a manifest file (gdc_manifest.txt). This file lists the UUIDs of the selected files and tells the GDC Data Transfer Tool which files to fetch.
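Before starting a large download, it can help to check what the manifest actually contains. The following is a minimal sketch, assuming the manifest is tab-separated with a header row that includes id, filename, md5, size, and state columns; the sample rows and the function names are hypothetical and exist only for illustration.

```python
import csv
import io


def read_manifest(text):
    """Parse manifest text (tab-separated, with a header row) into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def summarize(rows):
    """Return the number of files and their combined size in bytes."""
    total_bytes = sum(int(row["size"]) for row in rows)
    return len(rows), total_bytes


# Hypothetical two-file manifest; the UUIDs and checksums are made up.
SAMPLE = (
    "id\tfilename\tmd5\tsize\tstate\n"
    "11111111-2222-3333-4444-555555555555\ta.htseq.counts.gz\t"
    "0123456789abcdef0123456789abcdef\t1024\treleased\n"
    "66666666-7777-8888-9999-000000000000\tb.htseq.counts.gz\t"
    "fedcba9876543210fedcba9876543210\t2048\treleased\n"
)

rows = read_manifest(SAMPLE)
file_count, total_bytes = summarize(rows)
print(f"{file_count} files, {total_bytes} bytes in total")
```

For a real manifest, read the file contents first (for example with open("gdc_manifest.txt").read()) and pass them to read_manifest.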
Step 3: Install the GDC Data Transfer Tool
- Download the GDC Data Transfer Tool from the GDC website.
- Extract the downloaded file to a directory of your choice.
- Add the directory to your system’s PATH for easy access.
Step 4: Download Data Using the Manifest File
- Open a terminal or command prompt.
- Navigate to the directory where the manifest file is located.
- Run the following command to download the data:
gdc-client download -m gdc_manifest.txt
This downloads every file listed in the manifest into the current directory, placing each file in a subdirectory named after its UUID.
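Once the download finishes, you may want to confirm that every file in the manifest arrived intact. The sketch below assumes the usual gdc-client layout of one subdirectory per file UUID and a manifest with id, filename, and md5 columns; load_manifest and check_downloads are hypothetical helper names, not part of any GDC tool.

```python
import csv
import hashlib
from pathlib import Path


def load_manifest(manifest_path):
    """Read a tab-separated manifest file into a list of dicts."""
    with open(manifest_path, newline="") as handle:
        return list(csv.DictReader(handle, delimiter="\t"))


def md5sum(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file without loading it all into memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_downloads(rows, download_dir="."):
    """Return (missing, corrupted): UUIDs whose file is absent or fails its checksum."""
    missing, corrupted = [], []
    for row in rows:
        # Assumed layout: <download_dir>/<UUID>/<filename>
        path = Path(download_dir) / row["id"] / row["filename"]
        if not path.is_file():
            missing.append(row["id"])
        elif md5sum(path) != row["md5"]:
            corrupted.append(row["id"])
    return missing, corrupted
```

Calling check_downloads(load_manifest("gdc_manifest.txt")) from the download directory returns two lists; if both are empty, every file was found and matched its checksum.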
Step 5: Download Controlled-Access Data
- To download controlled-access data, you need an authentication token.
- Log in to the GDC Data Portal and download your token from the “Download Token” link under your username.
- Save the token file (e.g., gdc-user-token.txt) in a secure location.
- Run the following command to download controlled-access data:
gdc-client download -m gdc_manifest.txt -t gdc-user-token.txt
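If you launch downloads from a script, the open-access and controlled-access commands can share one helper. This is a sketch rather than an official wrapper: it assumes gdc-client is on your PATH and uses its -m (manifest), -t (token file), and -d (download directory) options. The function names are hypothetical, and the permission check only makes sense on POSIX systems.

```python
import os
import stat
import subprocess


def build_download_command(manifest, token=None, out_dir=None, client="gdc-client"):
    """Assemble the gdc-client argument list; the token is only added when given."""
    cmd = [client, "download", "-m", manifest]
    if token:
        cmd += ["-t", token]
    if out_dir:
        cmd += ["-d", out_dir]
    return cmd


def token_is_private(token_path):
    """Return True if group and other users have no access to the token file."""
    mode = stat.S_IMODE(os.stat(token_path).st_mode)
    return mode & (stat.S_IRWXG | stat.S_IRWXO) == 0


def run_download(manifest, token=None, out_dir=None):
    """Run the download and raise CalledProcessError if gdc-client exits with an error."""
    if token and not token_is_private(token):
        raise PermissionError(f"{token} is readable by other users; restrict it first")
    subprocess.run(build_download_command(manifest, token, out_dir), check=True)
```

For example, run_download("gdc_manifest.txt", token="gdc-user-token.txt") corresponds to the controlled-access command above, and omitting the token gives the open-access form.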
Step 6: Download Data Using UUIDs
If you know the UUIDs of the files you want to download, you can download them directly using the following command:
gdc-client download <UUID>
Replace <UUID> with the actual UUID of the file.
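A mistyped UUID is easy to miss, so a script can check the format before calling the tool. The sketch below validates each identifier with Python's uuid module and builds a single gdc-client call, relying on the tool accepting several UUIDs separated by spaces. The UUIDs used with it here are made up, and the function names are hypothetical.

```python
import uuid


def is_valid_uuid(value):
    """Return True if value is a hyphenated UUID string such as those shown in the GDC portal."""
    try:
        return str(uuid.UUID(value)) == value.lower()
    except ValueError:
        return False


def uuid_download_command(file_ids, client="gdc-client"):
    """Build a gdc-client call for one or more file UUIDs, rejecting malformed ones."""
    bad = [file_id for file_id in file_ids if not is_valid_uuid(file_id)]
    if bad:
        raise ValueError(f"Not valid UUIDs: {bad}")
    return [client, "download", *file_ids]


# Made-up identifier, used only to show the shape of the result.
print(uuid_download_command(["11111111-2222-3333-4444-555555555555"]))
```

The resulting list can be passed to subprocess.run to perform the download.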
Step 7: Troubleshooting Common Issues
7.1 GLIBC Version Error
- Error: ./gdc-client: /lib64/libc.so.6: version 'GLIBC_2.14' not found
- Solution: Upgrade your system’s GLIBC or use a different system (e.g., Ubuntu) that supports the required version.
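To find out in advance whether a machine is likely to hit this error, you can compare its C library version with the one named in the message. This is a rough sketch using Python's platform.libc_ver(); the 2.14 threshold comes from the error above, and the helper names are hypothetical.

```python
import platform


def parse_version(text):
    """Turn a dotted version string such as '2.31' into a tuple of ints."""
    return tuple(int(part) for part in text.split(".") if part.isdigit())


def glibc_is_sufficient(required=(2, 14)):
    """Return True or False on glibc systems, or None when glibc is not detected."""
    lib, version = platform.libc_ver()
    if lib != "glibc" or not version:
        return None
    return parse_version(version) >= required


print("glibc meets the requirement:", glibc_is_sufficient())
```

A result of None means the interpreter could not identify glibc (for example on macOS or Windows), so the check does not apply there.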
7.2 Internet Connection Issues
- Error: Download fails due to internet problems.
- Solution: Retry the download. The GDC Data Transfer Tool supports resuming interrupted downloads.
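On an unreliable connection, retrying can be automated. The sketch below reruns a command a few times with an increasing pause between attempts. It assumes gdc-client exits with a non-zero status when a download fails, which you should confirm for your version; the function name and defaults are hypothetical.

```python
import subprocess
import time


def download_with_retries(cmd, attempts=3, wait_seconds=30,
                          run=subprocess.run, sleep=time.sleep):
    """Run cmd until it succeeds; return the attempt number that worked.

    run and sleep are injectable so the logic can be tested without a network.
    """
    for attempt in range(1, attempts + 1):
        result = run(cmd)
        if result.returncode == 0:
            return attempt
        if attempt < attempts:
            # Wait a little longer after each failed attempt.
            sleep(wait_seconds * attempt)
    raise RuntimeError(f"Download failed after {attempts} attempts")
```

A call such as download_with_retries(["gdc-client", "download", "-m", "gdc_manifest.txt"]) repeats the same command, so the tool can pick up where the previous attempt stopped.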
7.3 Permission Issues
- Error: Permission denied when running gdc-client.
- Solution: Use chmod to make the file executable:
chmod +x gdc-client
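If the tool is unpacked by a setup script rather than by hand, the same fix can be applied from Python. This is a small sketch, roughly equivalent to chmod +x on POSIX systems; the function name is hypothetical.

```python
import os
import stat


def make_executable(path):
    """Add execute permission for user, group, and others; return True if it took effect."""
    mode = os.stat(path).st_mode
    os.chmod(path, mode | stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH)
    return os.access(path, os.X_OK)
```

For example, make_executable("gdc-client") can be called right after extracting the archive.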
Step 8: Using the GenomicDataCommons R Package
For a programmatic approach, you can use the GenomicDataCommons R package to query and download TCGA data.
8.1 Install the Package
if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("GenomicDataCommons")
8.2 Query and Download Data
library(GenomicDataCommons)
library(magrittr)

# Build a manifest for gene expression data from ovarian cancer.
# Note: recent GDC data releases provide these counts under the 'STAR - Counts'
# workflow; use the workflow name that the portal currently lists.
ge_manifest <- files() %>%
    filter(~ cases.project.project_id == 'TCGA-OV' &
             type == 'gene_expression' &
             analysis.workflow_type == 'HTSeq - Counts') %>%
    manifest()

# Download the files
destdir <- tempdir()
fnames <- lapply(ge_manifest$id, gdcdata,
                 destination_dir = destdir, overwrite = TRUE, progress = FALSE)
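The same kind of query can be expressed against the GDC REST API from other languages. The sketch below only builds the request URL for the files endpoint at https://api.gdc.cancer.gov/files, using the API's JSON filter format with the three fields from the R example; it does not contact the network. The helper names, the chosen fields, and the page size are assumptions made for illustration.

```python
import json
import urllib.parse

FILES_ENDPOINT = "https://api.gdc.cancer.gov/files"


def is_in(field, values):
    """One filter clause: the field must match one of the given values."""
    return {"op": "in", "content": {"field": field, "value": list(values)}}


def build_filters(project, file_type, workflow):
    """Combine the three conditions from the R example with a logical AND."""
    return {
        "op": "and",
        "content": [
            is_in("cases.project.project_id", [project]),
            is_in("type", [file_type]),
            is_in("analysis.workflow_type", [workflow]),
        ],
    }


def build_query_url(filters, fields=("file_id", "file_name"), size=100):
    """Encode the filters and options as query parameters of the files endpoint."""
    params = {
        "filters": json.dumps(filters),
        "fields": ",".join(fields),
        "format": "JSON",
        "size": str(size),
    }
    return FILES_ENDPOINT + "?" + urllib.parse.urlencode(params)


url = build_query_url(build_filters("TCGA-OV", "gene_expression", "HTSeq - Counts"))
print(url)
```

Fetching this URL with any HTTP client should return a JSON document describing the matching files, whose UUIDs can then be passed to gdc-client.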
Step 9: Merging and Processing Downloaded Data
After downloading, you may need to merge and process the data. For example, to merge gene expression files:
9.1 Using R
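library(data.table)

# List all downloaded count files. list.files() expects a regular expression,
# not a shell glob; recursive = TRUE also finds files placed in subdirectories.
files <- list.files(path = destdir, pattern = "\\.htseq\\.counts\\.gz$",
                    full.names = TRUE, recursive = TRUE)

# Read each file and name its count column after the file, so that columns
# stay distinct after merging (fread() needs the R.utils package to read .gz files)
data_list <- lapply(files, function(f) {
    sample_name <- sub("\\.htseq\\.counts\\.gz$", "", basename(f))
    fread(f, header = FALSE, col.names = c("gene_id", sample_name))
})

# Merge all samples on the gene identifier
merged_data <- Reduce(function(x, y) merge(x, y, by = "gene_id"), data_list)

# Save merged data
fwrite(merged_data, "merged_gene_expression.csv")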
9.2 Using Python
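import glob
import os

import pandas as pd

# List all downloaded files
files = sorted(glob.glob("path/to/downloaded/files/*.htseq.counts.gz"))

# Read each file with the gene ID as the index and the file's base name as the column name
data_list = [
    pd.read_csv(f, sep="\t", header=None, index_col=0,
                names=["Gene", os.path.basename(f).split(".")[0]])
    for f in files
]

# Combine the samples column-wise; rows are aligned on the gene ID index
merged_data = pd.concat(data_list, axis=1)

# Save merged data, keeping the gene IDs as the first column
merged_data.to_csv("merged_gene_expression.csv", index_label="Gene")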
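If pandas is not available, the same merge can be done with the standard library. The sketch below assumes each file is a gzipped, two-column, tab-separated table of gene ID and count, and it skips the summary rows that HTSeq writes with a leading double underscore (such as __no_feature). The function names are hypothetical, and genes missing from a sample are left blank.

```python
import csv
import gzip
import os


def read_counts(path):
    """Read one gzipped count file into {gene_id: count}, skipping '__' summary rows."""
    counts = {}
    with gzip.open(path, "rt", newline="") as handle:
        for row in csv.reader(handle, delimiter="\t"):
            if len(row) != 2 or row[0].startswith("__"):
                continue
            counts[row[0]] = int(row[1])
    return counts


def merge_counts(paths):
    """Return (header, rows) with one column per file, aligned on gene ID."""
    samples = {os.path.basename(p).split(".")[0]: read_counts(p) for p in paths}
    genes = sorted(set().union(*(counts.keys() for counts in samples.values())))
    header = ["Gene", *samples]
    rows = [[gene, *(samples[name].get(gene, "") for name in samples)] for gene in genes]
    return header, rows


def write_merged(paths, out_path):
    """Merge the count files and write the result as a CSV file."""
    header, rows = merge_counts(paths)
    with open(out_path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(header)
        writer.writerows(rows)
```

Calling write_merged(files, "merged_gene_expression.csv") with a list of count-file paths produces one row per gene and one column per file.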
Conclusion
Downloading TCGA data from the GDC can be done efficiently using the GDC Data Transfer Tool or programmatically with the GenomicDataCommons R package. By following this guide, you can easily retrieve and process the data you need for your research. Whether you prefer command-line tools or scripting, this protocol provides a comprehensive approach to accessing TCGA data.