Step-by-Step Guide to Downloading TCGA Data from GDC

January 10, 2025, by admin

The Genomic Data Commons (GDC) is the primary repository for The Cancer Genome Atlas (TCGA) data. This guide provides a detailed protocol for downloading TCGA data from the GDC using the GDC Data Transfer Tool and other methods.

Step 1: Access the GDC Data Portal

Go to the GDC Data Portal (https://portal.gdc.cancer.gov).
Use the search and filtering options to find the files you need. You can filter by program (e.g., TCGA) or project (e.g., TCGA-OV), data category (e.g., DNA Methylation, Transcriptome Profiling), and data type (e.g., Gene Expression Quantification, Aligned Reads).
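The same kind of filtering can also be expressed as a query against the GDC API. The following is a minimal sketch, not an official client: it assumes the public files endpoint at https://api.gdc.cancer.gov/files and the field names cases.project.project_id, data_category, and data_type. The helper names build_filters and search_files are hypothetical, chosen only for this illustration.

```python
import json
import urllib.parse
import urllib.request

# Assumed public endpoint of the GDC API for file metadata.
GDC_FILES_ENDPOINT = "https://api.gdc.cancer.gov/files"


def build_filters(project_id, data_category, data_type):
    """Return a filter expression that ANDs three 'in' clauses."""
    def clause(field, value):
        return {"op": "in", "content": {"field": field, "value": [value]}}

    return {
        "op": "and",
        "content": [
            clause("cases.project.project_id", project_id),
            clause("data_category", data_category),
            clause("data_type", data_type),
        ],
    }


def search_files(filters, size=10):
    """Query the files endpoint (needs network access) and return the hits."""
    params = urllib.parse.urlencode({
        "filters": json.dumps(filters),
        "fields": "file_id,file_name,data_type",
        "format": "JSON",
        "size": str(size),
    })
    with urllib.request.urlopen(f"{GDC_FILES_ENDPOINT}?{params}") as response:
        return json.load(response)["data"]["hits"]


# Build (but do not send) a filter for TCGA-OV gene expression files.
example_filters = build_filters(
    "TCGA-OV", "Transcriptome Profiling", "Gene Expression Quantification"
)
print(json.dumps(example_filters, indent=2))
```

Calling search_files(example_filters) should return a list of matching file records, but the portal remains the simplest way to build a cart and a manifest.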

Step 2: Obtain a Manifest File

After selecting the files you want to download, add them to your cart.
Click on the “Cart” icon in the top-right corner.
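Click “Download Manifest” to download a manifest file (called gdc_manifest.txt in this guide; the portal may add a date to the file name). This tab-separated file lists the UUID, file name, MD5 checksum, and size of each selected file, and the GDC Data Transfer Tool uses it to decide what to download.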
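Before starting a large download, it can help to check how many files the manifest lists and how much disk space they need. The sketch below assumes the manifest is tab-separated with a header row containing the columns id, filename, md5, size, and state. The function names are hypothetical.

```python
import csv


def read_manifest(path):
    """Return one dict per manifest row, keyed by the header names."""
    with open(path, newline="") as handle:
        return list(csv.DictReader(handle, delimiter="\t"))


def summarize_manifest(rows):
    """Return (file count, total size in bytes) for parsed manifest rows."""
    return len(rows), sum(int(row["size"]) for row in rows)
```

For example, summarize_manifest(read_manifest("gdc_manifest.txt")) returns the number of files and their combined size in bytes.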

Step 3: Install the GDC Data Transfer Tool

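Download the GDC Data Transfer Tool build for your operating system from the GDC website.
Extract the downloaded archive to a directory of your choice; it contains the gdc-client executable.
Add that directory to your system’s PATH so that you can run gdc-client from any location.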
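To confirm that the PATH change took effect, you can check whether the executable is discoverable. This is a small sketch using Python's standard library; find_gdc_client is a hypothetical helper name, and the optional search_path argument exists only so the check can be pointed at a specific directory.

```python
import shutil


def find_gdc_client(search_path=None):
    """Return the full path of gdc-client, or None if it cannot be found.

    With search_path=None the directories in the PATH environment
    variable are searched.
    """
    return shutil.which("gdc-client", path=search_path)
```

If find_gdc_client() returns None, open a new terminal (so the updated PATH is loaded) or recheck the directory you added.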

Step 4: Download Data Using the Manifest File

Open a terminal or command prompt.
Navigate to the directory where the manifest file is located.
Run the following command to download the data:
```
gdc-client download -m gdc_manifest.txt
```
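This downloads every file listed in the manifest into the current directory, placing each file in a subdirectory named after its UUID. To save the files elsewhere, add -d followed by the target directory.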
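After a long download, you may want to confirm that every file in the manifest arrived intact. The sketch below is one way to do that, assuming the manifest has id, filename, and md5 columns and that each file is stored as <download directory>/<id>/<filename>. The function names are hypothetical, and the tool may already perform its own checksum validation, so treat this as an optional extra check.

```python
import csv
import hashlib
import os


def md5_of(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_downloads(manifest_path, download_dir="."):
    """Return {file id: problem} for missing or corrupted files."""
    problems = {}
    with open(manifest_path, newline="") as handle:
        for row in csv.DictReader(handle, delimiter="\t"):
            path = os.path.join(download_dir, row["id"], row["filename"])
            if not os.path.exists(path):
                problems[row["id"]] = "missing"
            elif md5_of(path) != row["md5"]:
                problems[row["id"]] = "md5 mismatch"
    return problems
```

An empty dictionary means every listed file was found and matched its checksum.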

Step 5: Download Controlled-Access Data

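To download controlled-access data, you need authorized access to the dataset and an authentication token.
Log in to the GDC Data Portal and download your token from the “Download Token” link under your username.
Save the token file (e.g., gdc-user-token.txt) in a secure location and do not share it, because anyone holding the token can download data under your account. Tokens expire, so download a fresh one if authentication starts to fail.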
Run the following command to download controlled-access data:
```
gdc-client download -m gdc_manifest.txt -t gdc-user-token.txt
```
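Because the token grants access to controlled data, it is worth making sure that only your own user account can read it. The following sketch applies to Linux and macOS; secure_token_file is a hypothetical helper name.

```python
import os
import stat


def secure_token_file(path):
    """Restrict the file to owner read/write (mode 600) and return the new mode."""
    os.chmod(path, stat.S_IRUSR | stat.S_IWUSR)
    return stat.S_IMODE(os.stat(path).st_mode)
```

Running secure_token_file("gdc-user-token.txt") has the same effect as chmod 600 gdc-user-token.txt in a terminal.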

Step 6: Download Data Using UUIDs

If you know the UUID of a file, you can download it directly without a manifest:

gdc-client download <UUID>

Replace <UUID> with the actual UUID of the file. To download several files in one command, list their UUIDs separated by spaces.
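When the UUIDs come from a script or a spreadsheet, it can be convenient to assemble the command programmatically and catch malformed IDs before anything is downloaded. This is a sketch under the assumption that gdc-client accepts several UUIDs as positional arguments, along with the -t (token) and -d (output directory) options; build_download_command is a hypothetical helper name.

```python
import uuid


def build_download_command(file_ids, token_path=None, out_dir=None):
    """Return the gdc-client argument list for downloading the given UUIDs."""
    if not file_ids:
        raise ValueError("at least one UUID is required")
    for file_id in file_ids:
        uuid.UUID(file_id)  # raises ValueError if the string is not a valid UUID
    command = ["gdc-client", "download"]
    if token_path:
        command += ["-t", token_path]
    if out_dir:
        command += ["-d", out_dir]
    return command + list(file_ids)
```

The returned list can be passed to subprocess.run(command, check=True) to start the download.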

Step 7: Troubleshooting Common Issues

7.1 GLIBC Version Error

Error: ./gdc-client: /lib64/libc.so.6: version 'GLIBC_2.14' not found
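Solution: The gdc-client binary needs a newer GLIBC than your system provides. Upgrade your system’s GLIBC or, since replacing the system C library can break other software, run the tool on a newer Linux distribution (e.g., a current Ubuntu release) that ships the required version.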
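You can check which GLIBC version a machine provides before installing the tool. The sketch below uses Python's platform module and takes the required version (2.14) from the error message above; the function names are hypothetical. Versions are compared as tuples of integers because a plain string comparison would wrongly rank "2.9" above "2.14".

```python
import platform


def parse_version(text):
    """Turn a dotted version string such as '2.17' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def glibc_is_sufficient(required="2.14", detected=None):
    """Return True or False, or None when the C library is not glibc."""
    if detected is None:
        lib, detected = platform.libc_ver()
        if lib != "glibc" or not detected:
            return None
    return parse_version(detected) >= parse_version(required)
```

A result of None means the check does not apply, for example on macOS or Windows.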

7.2 Internet Connection Issues

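Error: The download stops or fails because of an unstable network connection.
Solution: Rerun the same command. The GDC Data Transfer Tool supports resuming interrupted downloads, so files that were only partly transferred do not have to start over.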
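For unattended downloads, the retry can be automated. The sketch below reruns a command until it succeeds, waiting longer after each failure. It assumes that gdc-client exits with a non-zero status when a download fails; run_with_retries and its parameters are hypothetical names for this illustration.

```python
import subprocess
import time


def run_with_retries(command, attempts=3, wait_seconds=30,
                     runner=subprocess.run, sleep=time.sleep):
    """Run command until it exits with status 0; return the attempts used."""
    for attempt in range(1, attempts + 1):
        result = runner(command)
        if result.returncode == 0:
            return attempt
        if attempt < attempts:
            sleep(wait_seconds * attempt)  # wait longer after each failure
    raise RuntimeError(f"command failed after {attempts} attempts: {command}")
```

For example, run_with_retries(["gdc-client", "download", "-m", "gdc_manifest.txt"]) would try the manifest download up to three times.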

7.3 Permission Issues

Error: Permission denied when running gdc-client.
Solution: Use chmod to make the file executable:
```
chmod +x gdc-client
```

Step 8: Using the GenomicDataCommons R Package

For a programmatic approach, you can use the GenomicDataCommons R package to query and download TCGA data.

8.1 Install the Package

if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("GenomicDataCommons")

8.2 Query and Download Data

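library(GenomicDataCommons)
library(magrittr)  # provides the %>% pipe

# Build a manifest for gene expression data from ovarian cancer.
# Note: 'HTSeq - Counts' matches older GDC data releases; newer releases
# label RNA-seq count files 'STAR - Counts', so change this value if the
# manifest comes back empty.
ge_manifest <- files() %>%
    filter(~ cases.project.project_id == 'TCGA-OV' &
           type == 'gene_expression' &
           analysis.workflow_type == 'HTSeq - Counts') %>%
    manifest()

# Download the files into a temporary directory.
# Note: destination_dir and overwrite are arguments of older package versions;
# newer versions may instead save into the directory returned by gdc_cache().
destdir <- tempdir()
fnames <- lapply(ge_manifest$id, gdcdata, destination_dir = destdir, overwrite = TRUE, progress = FALSE)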

Step 9: Merging and Processing Downloaded Data

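After downloading, you may need to merge the per-sample files into a single table. The examples below merge gene expression files and assume HTSeq count files, which are headerless, tab-separated files with a gene ID column and a count column; files from other workflows have a different layout and need adjusted code.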

9.1 Using R

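library(data.table)  # reading .gz files with fread() also requires the R.utils package

# List all downloaded files (pattern is a regular expression, not a glob)
files <- list.files(path = destdir, pattern = "\\.htseq\\.counts\\.gz$", full.names = TRUE)

# Read each file, naming the count column after its file so column names stay unique
data_list <- lapply(files, function(f) {
    sample_name <- sub("\\.htseq\\.counts\\.gz$", "", basename(f))
    fread(f, header = FALSE, col.names = c("gene_id", sample_name))
})

# Merge all tables on the gene ID column
merged_data <- Reduce(function(x, y) merge(x, y, by = "gene_id"), data_list)

# Drop the HTSeq summary rows (e.g., __no_feature, __ambiguous)
merged_data <- merged_data[!startsWith(gene_id, "__")]

# Save merged data
fwrite(merged_data, "merged_gene_expression.csv")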

9.2 Using Python

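import glob
import os

import pandas as pd

# List all downloaded files (sorted so the column order is reproducible)
files = sorted(glob.glob("path/to/downloaded/files/*.htseq.counts.gz"))

# Read each file with the gene ID as the index and the file name prefix as the column name
data_list = [
    pd.read_csv(f, sep="\t", header=None, index_col=0,
                names=["Gene", os.path.basename(f).split(".")[0]])
    for f in files
]

# Combine the columns side by side; rows are aligned on the gene ID index
merged_data = pd.concat(data_list, axis=1)

# Drop the HTSeq summary rows (e.g., __no_feature, __ambiguous)
merged_data = merged_data[~merged_data.index.str.startswith("__")]

# Save merged data, keeping the gene ID index as the first column
merged_data.to_csv("merged_gene_expression.csv", index_label="Gene")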

Conclusion

You can download TCGA data from the GDC with the GDC Data Transfer Tool or programmatically with the GenomicDataCommons R package. The command-line route suits manifest-based bulk downloads, while the R package lets you query, download, and process files in a single script. Either way, the merged tables produced in Step 9 give you a starting point for downstream analysis.