Managing Data Lists: Removing Duplicate Values with Excel- Tutorial
March 13, 2024Table of Contents
Managing lists of data using Excel and online tools
Websites:
EBI BioMart: http://useast.ensembl.org/index.html then click on the BioMart link
Tutorial 1: Removing duplicate values from a list using Excel
It is not uncommon to have a list of data that contains redundant values. Genes with multiple transcript isoforms is one example. If you are only interested in the genes and not the different transcripts, then you will probably want to filter the list to remove the redundant values. I did a search of the UCSC human genome browser with the query “colon cancer” and got back >500 matches. I created an Excel file listing the first 500 matches. You can download this data from the Exercise 1 home page by clicking on the link ListofGenesfromUCSC.xlsx. The file has a single column called Gene Name. Once you’ve downloaded the text file, do the following:
Advanced filter:
Select the column of gene names
Click on the Data menu and select Advanced filter (if you get a warning about being unable to determine which row contains column labels and you have a column header in row 1, just click OK).
Check the radio button “Copy to another location”
This should move our mouse to the “Copy to” text box. Select a column (not Columns A) Check the box “Unique records only”
Click the OK button.
This should produce a list of 156 genes from the original 500 genes. Why duplicate genes?
A single gene ID can have multiple transcript isoforms and thus when you search a
database for gene names or a geneID you may get multiple records back depending on how the database is configured.
Tutorial 2: Managing and editing text data
An issue common to gene names or gene identifiers is slight variations that can prevent their identification via a database lookup. For example, as gene or transcript records are reviewed by curators, they are often given an appended number such as NM_0012345.1 or NM_0012345.3 indicating which version they are. The base identifier of NM_0012345 is the same between them but if one list has the base number and a second list has the appended version number, the database lookup or Excel lookup won’t recognize the two as being the same record.
For this part of the tutorial, download the Excel file titled ExpressionData.xlsx from the Exercise 2 homepage. The ExpressionData.xlsx file has two columns. The first column has Ensembl GeneIDs with the version number. The second column contains gene expression information in the form of Log2 ratio of treatment/control. In order to find additional information about this list of genes, you want to use only the stable GeneID, not the version.
To correct this, you will use a text-related function called LEFT to change the GeneIDs in the
ExpressionData.xlsx file to just the stable GeneID.
- Insert a column to the left of the GeneID column in the ExpressionData.xlsx file.
- In cell A2, type = and select the LEFT function
- Select cell B2 for the text box in the FormulaBuilder dialog box
- Tab to the num_chars box and type in 15
- This should return the ENSG## up to the “.”
- Select the newly generated ID in A2, then copy down to the end of the column. Type Ctrl-D
to copy the function down through the entire selection.
- Then Edit->copy the newly generated IDs and use Edit->Paste->Special->Values to replace the formula with values.
- Save the changes to the file and you will use this file to find additional information about the genes using the EBI BioMart website.
Tutorial 3: Extracting data using EBI BioMart
In this part of the tutorial, you will use the EBI BioMart interface to pull in gene identifiers from different data sources. The BioMart interface has greater functionality for list-based searching and you can readily pull sequence data for a list of genes; something that is more difficult to do using the NCBI interface.
When you click on the weblink above, it should open a page the is a link to various tutorials for using BioMart. Click on the link BioMart data mining tool located in the first paragraph of that page. It will open a pretty simple interface that starts with choosing the database to search. In this case, change the drop-down menu to Ensemble Genes 106. The select Human genes (GRCh38.p13) as the Dataset. You then have to set the options for the Filters and Attributes. Filters sets parameters around the data to look for, either a region on a chromosome or a list of genes as input, ect. Click on the Filters link and it will display a window from which you can make
selections to restrict the search criteria. In this case, expand the GENE section and check the box next to Input external references ID list as shown in Figure 1.
Fig. 1: Input IDs for BioMart search
Next, you can select the attributes or what data you want returned from the search. For this exercise, you will request the Gene stable ID so you can match to your other data set as well as Gene name, NCBI gene ID and the HGNC symbol. Expand the GENE section and unselect everything but the Gene stable ID. Then select Gene name in the 2nd column. Then scroll down and expand the EXTERNAL section. Here, select the NCBI gene (formerly Entrezgene) ID in the right column. Once that is done,
click on the Results button at the top left side of the page. The page should look like that shown in Fig. 2. You can use the GO button to download the data. TSV stands for tab separated values and can be imported into Excel.
Fig 2: Results from BioMart search
Save this file in Excel
format as BioMartExport.xlsx
Tutorial 4: Comparing two lists of data using common identifiers
You now have two separate worksheets: ExpressionData.xlsx with the log2FC data and the BioMartExport.xlsx. These two files share one column of data (Ensembl Gene IDs). If you want to merge those two lists, you need a method to ensure that the data brought in matches the correct ID. This is a common problem in bioinformatics or information processing of any kind is having multiple lists of data that you want to compare to each other. In Excel is a function called VLOOKUP that makes this easy to do. It is also useful for transferring data from 1 worksheet to another. For this part of the tutorial, you will use VLOOKUP to transfer the Log2FC data from the ExpressionData.xlsx sheet to a worksheet in the BioMartExport.xlsx file.
Steps:
Open You can delete the column from ExpressionData.xlsx that had the Ensembl GeneIDs with version number in them. If it is not already open, open the BioMartExport.xlsx file also.
- In the BioMartExport file, insert a column between columns 1 and 2.
- In the second row of column 2 (cell B2), type and “=” sign. Then go to the drop-down menu in the upper left of the worksheet, find the function “VLOOKUP” and select it. If you do not see VLOOKUP on the main menu, scroll down to “more functions” which opens a dialog box with all of the available Excel functions. Under “lookup and reference” you will find VLOOKUP.
- Once you’ve inserted the function, you must fill out the arguments for the function using the dialog box that opens up. Select cell A2 as the lookup value.
- Then click into the box “Table_array”. Go up to the window menu and select
ExpressionData.xlsx as shown in Figure 4. This will activate the file.
- Select the first 2 columns of ExpressionData.xlsx
- Tab or click on the box “Col_index_num.” This tells the argument which column of data to bring over to the first worksheet. Type in a 2.
- In the final box, “Range_lookup,” type “false”. If A2 in the BioMartExport worksheet matches A2 in ExpressionData worksheet, then the value from column 2 of ExpressionData will be entered into cell B2 of BioMartExport. If the 2 cells do not match, it will fill in “N/A”.
- To fill in the rest of the column, select from cell B2 through then end of the data and under the Edit menu, select Fill Down or use the keyboard shortcut of “Ctl+D”.
When you are done, your BioMartExport worksheet should look like that shown in Figure 3:
At this point, the data in column 2 is still linked to the ExpressionData.xlsx file. You can see this if you click on one of the Log2FC values in column 2 and look at what is displayed in the text box at the top of the sheet. The cell is also highlighted in Fig. 3 and you can see the formula in the text box. You do not want to leave your file like that, otherwise every time you open it will go through the data lookup function again. To avoid this, select the entire column, copy it and then do an Edit->Paste Special and select “values” in the “Paste special” dialog box. This will replace the
Figure 3: BioMartExport worksheet with Log2FC data brought in via VLOOKUP.
function with the value of the function. After you complete that, click on a gene name. You should see just the Log2FC value displayed in the text box at the top. Add a column headers as appropriate and save your file.
When I’m working with a GeneList I will keep a Master copy of the genes with expression data and any other data I add to it. Ideally, you would have a database with all of the potential data sources as separate tables and do a look-up in real time, but that is beyond the scope of this course.