SED Text Processing for Bioinformatics: An Introduction

March 11, 2024, by admin

Introduction to SED

SED, which stands for Stream Editor, is a powerful utility in Unix and Linux systems used for parsing and transforming text. In the context of bioinformatics, SED is often used for text manipulation tasks such as filtering, formatting, and summarizing large sequence data files.

For instance, SED can be used to convert a FASTQ file to a FASTA file, which is a common format for DNA sequences in bioinformatics. Here’s an example command:

bash

sed -n '1~4p;2~4p' FASTQ_FILE | sed 's/^@/>/' > FASTA_FILE

This command prints the first two lines of every four-line FASTQ record (the sequence identifier and the sequence itself), then replaces the leading “@” of each identifier with “>”, which is the header marker that FASTA files require. The first~step address form (1~4) is a GNU sed extension.
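
As a quick check of the conversion above, the following sketch runs it on a two-record FASTQ sample. The read names, sequences, and the variable names fastq_sample and fasta_out are made up for illustration, and the first~step addresses assume GNU sed.

```shell
# Two made-up FASTQ records (4 lines each): header, sequence, '+', qualities.
fastq_sample='@read1
ACGT
+
IIII
@read2
TTGA
+
FFFF'

# Keep lines 1 and 2 of every record, then turn the leading '@' into '>'.
fasta_out=$(printf '%s\n' "$fastq_sample" | sed -n '1~4p;2~4p' | sed 's/^@/>/')
printf '%s\n' "$fasta_out"
```

The output contains only the header and sequence lines; the “+” and quality lines are dropped because FASTA has no place for them.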

SED is also useful for searching and replacing text in large files, which is a common task in bioinformatics data analysis. For example, you can use SED to replace all occurrences of “chr” with “chromosome” in a file:

bash

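sed -i 's/chr/chromosome/g' FILENAME

This command replaces all occurrences of “chr” with “chromosome” in the file named “FILENAME”. The “-i” option tells SED to edit the file in place, meaning the changes are saved to the original file. Note that the pattern also matches “chr” inside longer words, so running the command a second time would turn “chromosome” into “chromosomeomosome”.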

Overall, SED is a versatile and powerful tool for text manipulation in bioinformatics, and is often used in combination with other command-line utilities such as awk and grep to perform complex data analysis tasks.

Basic SED syntax and commands

SED reads its input line by line, applies a script of editing commands to each line, and writes the result to standard output. The basic syntax of SED is:

bash

sed OPTIONS... [SCRIPT] [INPUTFILE...]

Here are some common SED commands:

  • sed 's/unix/linux/' geekfile.txt: This command replaces the first occurrence of “unix” on each line of “geekfile.txt” with “linux”.
  • sed 's/unix/linux/2' geekfile.txt: This command replaces the second occurrence of “unix” on each line with “linux”.
  • sed 's/unix/linux/g' geekfile.txt: This command replaces all occurrences of “unix” with “linux”.
  • sed 's/unix/linux/g;s/os/system/' geekfile.txt: This command replaces all occurrences of “unix” with “linux”, then replaces the first occurrence of “os” on each line with “system”.
  • sed '1,5d' geekfile.txt: This command omits lines 1 to 5 from the output. Without the -i option, the file itself is left unchanged.
  • sed '/^#/d' geekfile.txt: This command removes any lines that start with “#”. (Using s/^#.*$// instead would only empty those lines and leave blank lines behind.)
  • sed 's/\s\+$//' geekfile.txt: This command removes trailing whitespace from every line. The \s and \+ operators are GNU sed extensions.
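
To make the difference between the substitution flags concrete, here is a small sketch that applies three of the commands above to a single made-up line. The sample text and the variable names are illustrative only.

```shell
# One sample line containing "unix" three times.
line='unix is great. unix is free. unix os.'

first=$(printf '%s\n' "$line" | sed 's/unix/linux/')    # no flag: first match only
second=$(printf '%s\n' "$line" | sed 's/unix/linux/2')  # 2: second match only
all=$(printf '%s\n' "$line" | sed 's/unix/linux/g')     # g: every match

printf '%s\n' "$first" "$second" "$all"
# linux is great. unix is free. unix os.
# unix is great. linux is free. unix os.
# linux is great. linux is free. linux os.
```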

SED can also be used in pipes and loops to perform more complex text manipulation tasks. For example, you can use SED to remove all the “CHR”s from a file:

bash

sed 's/CHR//g' inputfile.txt > outputfile.txt

This command replaces all occurrences of “CHR” with nothing in the file “inputfile.txt” and saves the output to “outputfile.txt”.

Exercise: Using SED to print specific lines from a file

To print specific lines from a file using SED, you can use the following commands:

  • To print the last line of a file:
bash

sed -n '$p' filename
  • To print a range of lines from N to the end of a file:
bash

sed -n 'N,$p' filename

Replace N with the starting line number.

  • To print a line that matches a pattern:
bash

sed -n '/PATTERN/p' filename

Replace PATTERN with the regular expression you want to match.

  • To print every Nth line starting from line M:
bash

sed -n 'M~Np' filename

Replace M with the starting line number and N with the interval between lines. The M~N address form is a GNU sed extension.

  • To print a range of lines from M to N:
bash

sed -n 'M,Np' filename

Replace M and N with the starting and ending line numbers, respectively.

  • To print a range of lines from the line that matches P1 to the line that matches P2:
bash

sed -n '/P1/,/P2/p' filename

Replace P1 and P2 with the regular expressions you want to match.

  • To print the lines that match P1 and the next N lines following P1:
bash

sed -n '/P1/,+Np' filename

Replace P1 with the regular expression you want to match and N with the number of lines to print after the match. The ,+N address form is a GNU sed extension.
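
The address forms in this exercise can be tried without a data file. The sketch below uses seq 1 10 as a stand-in for a ten-line file whose content is its own line number; the variable names are illustrative, and the 2~3 and ,+2 forms assume GNU sed.

```shell
last=$(seq 1 10 | sed -n '$p')             # last line: 10
from_n=$(seq 1 10 | sed -n '8,$p')         # line 8 to the end: 8 9 10
range=$(seq 1 10 | sed -n '3,5p')          # lines 3 to 5: 3 4 5
every=$(seq 1 10 | sed -n '2~3p')          # every 3rd line from line 2: 2 5 8
after=$(seq 1 10 | sed -n '/^4$/,+2p')     # the line matching ^4$ plus 2 more: 4 5 6

printf '%s\n' "$last" "$from_n" "$range" "$every" "$after"
```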

Search and Replace with SED

How to search for and replace text using SED

To search for and replace text using SED, you can use the following commands:

  • To replace the first occurrence of a pattern on each line of a file:
bash

sed 's/PATTERN/REPLACEMENT/' filename

Replace PATTERN with the regular expression you want to match and REPLACEMENT with the text you want to replace it with.

  • To replace all occurrences of a pattern in a file:
bash

sed 's/PATTERN/REPLACEMENT/g' filename

Add the g flag to replace all occurrences of the pattern in the file.

  • To replace a pattern only in a specific range of lines:
bash

sed 'M,Ns/PATTERN/REPLACEMENT/g' filename

Replace M and N with the starting and ending line numbers, respectively, and PATTERN and REPLACEMENT with the text you want to match and replace.

  • To replace a pattern only in the first N lines of a file:
bash

sed '1,Ns/PATTERN/REPLACEMENT/g' filename

Replace N with the number of lines to apply the replacement to.

  • To replace a pattern only in the last N lines of a file:
bash

tac filename | sed '1,Ns/PATTERN/REPLACEMENT/g' | tac

SED has no address that counts lines back from the end of the input, so this pipeline reverses the file with tac (part of GNU coreutils), applies the replacement to the first N lines of the reversed stream, and reverses the result back. Replace N with the number of lines to change.

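  • To replace a pattern only on lines that also match another pattern:
bash

sed '/PRECEDES/s/PATTERN/REPLACEMENT/g' filename

Replace PRECEDES and PATTERN with the regular expressions you want to match and REPLACEMENT with the text you want to replace it with. The address /PRECEDES/ selects every line that contains a match for PRECEDES anywhere on the line; it does not require PRECEDES to appear before PATTERN.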
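
The following sketch shows how an address in front of the s command limits where a replacement happens. The three sample lines and the variable names are made up; the last example reverses the input with tac (GNU coreutils) as one possible way to reach the final lines of a file.

```shell
sample='chr1 gene chr1
chr2 exon chr2
chr3 gene chr3'

# Line-number range: only lines 1 and 2 are changed.
range_out=$(printf '%s\n' "$sample" | sed '1,2s/chr/chromosome/g')

# Pattern address: only lines containing "gene" are changed.
cond_out=$(printf '%s\n' "$sample" | sed '/gene/s/chr/chromosome/g')

# Last 2 lines: reverse, edit the first 2 lines, reverse back.
tail_out=$(printf '%s\n' "$sample" | tac | sed '1,2s/chr/chromosome/g' | tac)

printf '%s\n\n' "$range_out" "$cond_out" "$tail_out"
```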

Regular expressions in SED

SED supports regular expressions, which are patterns used to match and manipulate text. Here are some examples of using regular expressions in SED:

  1. Matching a pattern:

To match a pattern, you can use the s command followed by the pattern and the replacement string. For example, to replace all occurrences of the word “hello” with “hi” in a file, you can use the following command:

bash

sed 's/hello/hi/g' filename

The g flag at the end of the command tells SED to replace all occurrences of the pattern on each line.

  2. Matching a pattern with a regular expression:

You can use regular expressions to match more complex patterns. For example, to print all lines that start with the letter “a”, you can use the following command:

bash

sed -n '/^a/p' filename

The ^ character matches the beginning of a line, and the p command prints the matching lines. The -n option suppresses SED’s default output; without it, every line would be printed and the matching lines would appear twice.

  3. Matching a pattern with a regular expression and replacing it:

You can also use regular expressions to match a pattern and replace it with a different string. For example, to replace one or more digits at the beginning of a line (followed by a space) with the word “number”, you can use the following command:

bash

sed 's/^[0-9][0-9]* /number /' filename

The [0-9] character class matches any digit, and the * quantifier matches zero or more occurrences of the previous character, so [0-9][0-9]* requires at least one digit. With [0-9]* alone, a line that merely starts with a space would also be changed. The g flag is unnecessary here because ^ can match only once per line.

  4. Matching a pattern with a regular expression and deleting the line:

You can also use regular expressions to match a pattern and delete the entire line. For example, to delete all lines that contain the word “error”, you can use the following command:

bash

sed '/error/d' filename

The d command deletes the entire line.

  5. Matching a pattern with a regular expression and printing only the matching lines:

You can also use regular expressions to match a pattern and print only the matching lines. For example, to print all lines that contain the word “warning”, you can use the following command:

bash

sed -n '/warning/p' filename

The -n option tells SED not to print any lines by default, and the p command prints only the matching lines.
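
The difference between “zero or more” and “one or more” is easy to miss, so here is a small sketch comparing the two on made-up input. The second sample line starts with a space and has no leading digits; the variable names loose and strict are illustrative.

```shell
sample='12 apples
 pears
7 plums'

# [0-9]* also matches an empty run of digits, so " pears" is changed too.
loose=$(printf '%s\n' "$sample" | sed 's/^[0-9]* /number /')

# [0-9][0-9]* requires at least one digit, so " pears" is left alone.
strict=$(printf '%s\n' "$sample" | sed 's/^[0-9][0-9]* /number /')

printf '%s\n\n' "$loose" "$strict"
```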

Exercise: Using SED to replace all occurrences of a gene name in a FASTA file

To replace all occurrences of a gene name in a FASTA file using SED, you can use the following command:

bash

sed -i 's/old_gene_name/new_gene_name/g' filename.fasta

Replace old_gene_name with the name of the gene you want to replace, and new_gene_name with the new name you want to use. The -i option tells SED to edit the file in place, meaning the changes will be saved to the original file. The g flag at the end of the command tells SED to replace all occurrences of the pattern on each line.

Here’s an example:

bash

$ cat example.fasta
>gene1
ATGCGATCGATCGATCGATCGATCG
>gene2
ATGCGATCGATCGATCGATCGATCG
>gene1
ATGCGATCGATCGATCGATCGATCG

$ sed -i 's/gene1/new_gene_name/g' example.fasta

$ cat example.fasta
>new_gene_name
ATGCGATCGATCGATCGATCGATCG
>gene2
ATGCGATCGATCGATCGATCGATCG
>new_gene_name
ATGCGATCGATCGATCGATCGATCG

In this example, we replaced all occurrences of “gene1” with “new_gene_name” in the example.fasta file.
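
A plain s/gene1/.../g also rewrites longer names such as “gene10”. One possible refinement, sketched below on made-up records, is to anchor the pattern to the start of a header line and require the name to end there or be followed by whitespace. The -E option (extended regular expressions) and the sample data are assumptions of this sketch.

```shell
fasta='>gene1
ATGCGATCG
>gene10
ATGGENE1A
>gene1 hypothetical protein
ATGCCC'

# Match ">gene1" only at the start of a line, followed by whitespace or end of line.
renamed=$(printf '%s\n' "$fasta" | sed -E 's/^>gene1([[:space:]]|$)/>new_gene_name\1/')
printf '%s\n' "$renamed"
```

Here “>gene10” and the sequence lines are left untouched, while both “>gene1” headers are renamed and the description after the name is preserved.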

File Manipulation with SED

How to delete, insert, and modify lines in a file using SED

Here are some examples of how to delete, insert, and modify lines in a file using SED:

  1. Deleting lines:

To delete a line, you can use the d command. For example, to delete the third line of a file, you can use the following command:

bash

sed '3d' filename

To delete a range of lines, you can use the , character to specify the range. For example, to delete lines 3 to 5 of a file, you can use the following command:

bash

sed '3,5d' filename

To delete all lines that match a pattern, you can use the following command:

bash

sed '/pattern/d' filename
  2. Inserting lines:

To insert a line, you can use the i command. For example, to insert the line “This is a new line” before the third line of a file, you can use the following command:

bash

sed '3i This is a new line' filename

To insert multiple lines, you can use the -e option to specify multiple commands. For example, to insert the lines “This is a new line 1” and “This is a new line 2” before the third line of a file, you can use the following command:

bash

sed -e '3i This is a new line 1' -e '3i This is a new line 2' filename
  3. Modifying lines:

To modify a line, you can use the s command. For example, to replace the first occurrence of the word “old” with “new” in the third line of a file, you can use the following command:

bash

sed '3s/old/new/' filename

To replace all occurrences of the word “old” with “new” in the third line of a file, you can use the following command:

bash

sed '3s/old/new/g' filename

To replace the first occurrence of the word “old” with “new” in all lines that match a pattern, you can use the following command:

bash

sed '/pattern/s/old/new/' filename
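
The three operations can be compared side by side on a small made-up input, as in the sketch below. The one-line form of the i command used here is a GNU sed convenience; the sample text and variable names are illustrative.

```shell
sample='line one
old old line two
line three'

deleted=$(printf '%s\n' "$sample" | sed '2d')                      # drop line 2
inserted=$(printf '%s\n' "$sample" | sed '3i This is a new line')  # add a line before line 3
modified=$(printf '%s\n' "$sample" | sed '2s/old/new/')            # first "old" on line 2 only

printf '%s\n\n' "$deleted" "$inserted" "$modified"
```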

Exercise: Using SED to extract specific fields from a tab-separated value (TSV) file

The simplest way to extract specific fields from a tab-separated value (TSV) file is the cut command:

bash

cut -f N1,N2,N3 filename.tsv

Replace N1, N2, and N3 with the field numbers you want to extract, separated by commas. For example, to extract the first and third fields from a TSV file named example.tsv, you can use the following command:

bash

cut -f 1,3 example.tsv

If you want to use SED instead of cut, you can use the following command:

bash

sed 's/^\([^\t]*\)\t[^\t]*\t\([^\t]*\).*/\1\t\2/' filename.tsv

This command extracts the first and third fields from each line of a TSV file. Each [^\t]* matches one field (a run of non-tab characters), the groups \( \) capture the first and third fields as \1 and \2, and the trailing .* consumes the rest of the line. The replacement then writes the two captured fields separated by a tab character. Treating \t as a tab is a GNU sed extension.

Here’s an example:

bash

$ cat example.tsv
field1 field2 field3 field4
value1 value2 value3 value4
value5 value6 value7 value8

$ sed 's/^\([^\t]*\)\t[^\t]*\t\([^\t]*\).*/\1\t\2/' example.tsv
field1 field3
value1 value3
value5 value7

In this example, we extracted the first and third fields from the example.tsv file using SED.
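
As a sanity check, the sketch below builds a two-row TSV sample with real tab characters and confirms that a capture-group substitution produces the same result as cut -f 1,3. It assumes GNU sed, which interprets \t as a tab both in the pattern and in the replacement; the variable names are illustrative.

```shell
# printf turns each \t into a real tab character.
tsv=$(printf 'field1\tfield2\tfield3\tfield4\nvalue1\tvalue2\tvalue3\tvalue4\n')

with_cut=$(printf '%s\n' "$tsv" | cut -f 1,3)

# Capture field 1 and field 3, skip field 2, discard the rest of the line.
with_sed=$(printf '%s\n' "$tsv" | sed 's/^\([^\t]*\)\t[^\t]*\t\([^\t]*\).*/\1\t\2/')

printf '%s\n' "$with_sed"
```

For routine column extraction, cut (or awk) is usually the more readable choice; the SED version mainly illustrates capture groups.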

Advanced SED Techniques

SED can process several input files at once, as well as streams produced by other commands:

  1. To process multiple files:

To process multiple files using SED, you can simply list them after the SED command. Line numbers run continuously across all the files, so the following command prints only the first line of file1.txt:

bash

sed -n '1p' file1.txt file2.txt file3.txt

By default, SED treats multiple input files as one long stream. However, you can use the -s option to process each file separately. For example, to print the last line of each file, you can use the following command:

bash

sed -ns '$p' file1.txt file2.txt file3.txt
  2. To process a stream:

SED reads from standard input when no file name is given, or when the - character is used as a file name, so it can process a stream produced by another command. For example, to print the first 10 lines of a stream, you can use the following command:

bash

echo "hello world" | sed '1,10!d'

You can also use a pipe to process the output of another command. For example, to print the number of lines in a file, you can use the following command:

bash

wc -l < file.txt | sed 's/^ *//'
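
The effect of the -s option can be seen with two small temporary files, as in this sketch. The file names and contents are made up, mktemp -d is used only to keep the example self-contained, and -s assumes GNU sed.

```shell
workdir=$(mktemp -d)
printf 'a1\na2\n' > "$workdir/file1.txt"
printf 'b1\nb2\n' > "$workdir/file2.txt"

# Without -s the files form one stream, so "line 1" exists only once.
combined=$(sed -n '1p' "$workdir/file1.txt" "$workdir/file2.txt")

# With -s each file restarts the line count, so each contributes its first line.
separate=$(sed -s -n '1p' "$workdir/file1.txt" "$workdir/file2.txt")

printf '%s\n---\n%s\n' "$combined" "$separate"
rm -r "$workdir"
```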

How to combine SED with other command-line tools

To combine SED with other command-line tools, you can use pipes (|) or input/output redirection. Here are some examples:

  1. Using pipes:

To combine SED with other command-line tools using pipes, you can simply separate the commands with a pipe character (|). For example, to print the first 10 lines of a file and then count the number of characters in those lines, you can use the following command:

bash

sed '1,10!d' file.txt | wc -c
  2. Using input/output redirection:

To combine SED with other command-line tools using input/output redirection, you can use the < and > characters, which connect a command to a file rather than to another command. For example, to sort the lines of a file, keep only the unique lines, and save them to a new file, you can use the following command:

bash

sort < file.txt | uniq > unique_lines.txt

You can also use input/output redirection with SED. For example, to replace all occurrences of “hello” with “world” in a file and save the result to a new file, you can use the following command:

bash

sed 's/hello/world/g' file.txt > newfile.txt

Exercise: Using SED to filter and format BLAST output

The commands below assume BLAST tabular output (-outfmt 6), which by default has twelve tab-separated columns: query ID, subject ID, percentage identity, alignment length, mismatches, gap opens, query start, query end, subject start, subject end, E-value, and bit score.

  1. To extract specific fields from the BLAST output:

To extract specific fields from the BLAST output, you can use the cut command to select the desired columns. For example, to extract the query ID, subject ID, percentage identity, alignment length, and E-value, you can use the following command:

bash

cut -f 1,2,3,4,11 blast_output.txt
  2. To replace a space field separator with a tab character:

Tabular BLAST output is already tab-separated, so this step is only needed for files that were saved with spaces between fields. To replace all spaces with tabs, you can use the tr command:

bash

tr ' ' '\t' < blast_output.txt > blast_output_tab.txt
  3. To remove the header line from the BLAST output:

Tabular BLAST output has no header line by default, but if your file has one (for example, added by another tool), you can remove the first line with the sed command:

bash

sed '1d' blast_output.txt
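  4. To sort the BLAST output by E-value:

To sort the BLAST output by E-value, you can use the sort command on column 11. E-values are usually written in scientific notation (for example 1e-50), which the -n option does not understand, so the general-numeric g modifier of GNU sort is used instead. For example, to sort the output in ascending order, you can use the following command:

bash

sort -k 11,11g blast_output.txt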
  5. To remove duplicate entries from the BLAST output:

To remove duplicate entries from the BLAST output, you can use the sort and uniq commands. For example, to list each combination of query and subject ID only once, you can use the following commands. Note that the result contains only the two ID columns; to keep one complete row per pair instead, sort -u -k 1,2 blast_output.txt can be used.

bash

cut -f 1,2 blast_output.txt | sort | uniq > unique_blast_output.txt
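
The steps above can be chained into one pipeline. The sketch below runs on three made-up hits in the default twelve-column tabular layout, preceded by a made-up comment line of the kind some BLAST output formats include; it is an illustration under those assumptions, not output from a real search. The g sort modifier assumes GNU sort.

```shell
tab=$(printf '\t')
blast=$(printf '# made-up comment line\nq1\ts1\t98.5\t200\t3\t0\t1\t200\t5\t204\t1e-50\t350\nq1\ts2\t90.0\t180\t18\t0\t1\t180\t9\t188\t2e-10\t120\nq2\ts1\t99.0\t150\t1\t0\t1\t150\t1\t150\t3e-80\t280\n')

# 1) sed drops comment lines, 2) cut keeps IDs, identity, length and E-value,
# 3) sort orders by the E-value, which is now column 5.
summary=$(printf '%s\n' "$blast" \
  | sed '/^#/d' \
  | cut -f 1,2,3,4,11 \
  | sort -t "$tab" -k 5,5g)

printf '%s\n' "$summary"
```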

SED in Bioinformatics Workflows

How to incorporate SED into bioinformatics pipelines and workflows

To incorporate SED into bioinformatics pipelines and workflows, you can use it to filter and format data in various ways. Here’s an example of how to count the occurrences of different base-quality sequences in a FASTQ file using SED:

First, let’s extract the lines that hold the base quality scores, which occur on lines 4, 8, 12, etc. We can use SED to print every fourth line starting from the fourth line:

bash

zcat data/DPCh_plate1_F12_S72.R1.fq.gz | sed -n '4~4p'

Next, we can count how often each quality string occurs. The base quality scores are represented by characters with ASCII values 33 to 126, so each quality line is an ordinary printable string that can be prefixed with its length, sorted, and counted:

bash

zcat data/DPCh_plate1_F12_S72.R1.fq.gz | sed -n '4~4p' | awk '{print length, $0}' | sort | uniq -c | sort -nr

Let’s break down this command:

  • zcat data/DPCh_plate1_F12_S72.R1.fq.gz: Decompress the gzipped FASTQ file to standard output.
  • sed -n '4~4p': Print every fourth line starting from the fourth line (the quality strings).
  • awk '{print length, $0}': Print the length of each line followed by the line itself.
  • sort: Sort the lines so that identical lines end up next to each other, which uniq requires.
  • uniq -c: Count the number of occurrences of each line.
  • sort -nr: Sort the lines in reverse numerical order based on the count.

This command will output the number of occurrences, the length, and the base quality sequence, sorted in descending order based on the number of occurrences.

The final sort can also be written with explicit options: -k 1 names the count column as the sort key, -b ignores the leading blanks that uniq -c adds, and -n -r sort numerically in descending order. For example:

bash

zcat data/DPCh_plate1_F12_S72.R1.fq.gz | sed -n '4~4p' | awk '{print length, $0}' | sort | uniq -c | sort -n -b -r -k 1
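
The counting idea can be tried on a tiny in-line sample before running it on a real file. In this sketch the three FASTQ records are made up, the awk length step is left out for brevity, and a final sed strips the leading spaces that uniq -c adds. The 4~4 address assumes GNU sed.

```shell
fastq='@r1
ACGT
+
IIII
@r2
TTGA
+
FFFF
@r3
GGCC
+
IIII'

# Quality strings are line 4 of each record; count identical strings.
counts=$(printf '%s\n' "$fastq" \
  | sed -n '4~4p' \
  | sort \
  | uniq -c \
  | sort -nr \
  | sed 's/^ *//')

printf '%s\n' "$counts"
# 2 IIII
# 1 FFFF
```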

Real-world examples of SED in bioinformatics data analysis

Here are a few real-world examples of SED in bioinformatics data analysis:

  1. Filtering FASTQ files: SED can be used to filter out low-quality reads from FASTQ files. For example, the following command drops reads with more than 10 bases with a quality score below 20. It assumes four-line records and Phred+33 encoding, in which scores below 20 are the characters “!” through “4”:
bash

LC_ALL=C sed -n 'N;N;N;/\n[^\n]*\([!-4][^\n]*\)\{11\}$/!p' input.fastq > output.fastq
  2. Extracting specific fields from VCF files: SED can be used to extract specific fields from Variant Call Format (VCF) files. For example, the following command extracts the chromosome, position, and reference allele fields (columns 1, 2, and 4) from the data lines and skips the header lines that start with “#”:
bash

sed -n '/^#/!s/^\([^\t]*\)\t\([^\t]*\)\t[^\t]*\t\([^\t]*\).*/\1\t\2\t\3/p' input.vcf
  3. Replacing sequence names in FASTA files: SED can be used to replace sequence names in FASTA files. For example, the following command replaces all occurrences of “>seq1” with “>gene1”:
bash

sed 's/>seq1/>gene1/g' input.fasta > output.fasta
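  4. Converting between FASTA and FASTQ formats: SED can be used to convert between FASTA and FASTQ formats. FASTA files carry no quality information, so a FASTA-to-FASTQ conversion has to invent placeholder qualities. For example, the following command converts a FASTA file with single-line sequences to a FASTQ file, giving every base the placeholder quality “I”:
bash

sed '/^>/{s/^>/@/;n;h;s/./I/g;x;G;s/\n/\n+\n/}' input.fasta > output.fastq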
  5. Extracting header information from BAM files: Binary Alignment Map (BAM) files are binary, so SED cannot read them directly, but it can process the text that samtools produces from them. For example, the following command prints the header lines and strips the leading “@” from each:
bash

samtools view -H input.bam | sed 's/^@//' > header.txt

These examples give an idea of how SED can be used in bioinformatics data analysis.
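
Quality filtering is the least obvious of these tasks, so here is one possible way to do it in SED, checked on three made-up reads. The sketch assumes four-line records and Phred+33 qualities (where “!” through “4” mean a score below 20), uses GNU sed, and sets LC_ALL=C so the character range follows ASCII order. All names and sequences are hypothetical.

```shell
seq12=ACGTACGTACGT
q_good=IIIIIIIIIIII
q_bad="$(printf '%11s' '' | tr ' ' '!')I"    # 11 low-quality bases: should be dropped
q_edge="$(printf '%10s' '' | tr ' ' '!')II"  # exactly 10 low-quality bases: should be kept

fastq=$(printf '@good\n%s\n+\n%s\n@bad\n%s\n+\n%s\n@edge\n%s\n+\n%s\n' \
  "$seq12" "$q_good" "$seq12" "$q_bad" "$seq12" "$q_edge")

# N;N;N joins each 4-line record; the regex looks for 11 or more characters
# in the range !-4 on the last line of the record; !p prints records without a match.
kept=$(printf '%s\n' "$fastq" \
  | LC_ALL=C sed -n 'N;N;N;/\n[^\n]*\([!-4][^\n]*\)\{11\}$/!p')

kept_ids=$(printf '%s\n' "$kept" | sed -n '1~4p')
printf '%s\n' "$kept_ids"
```

For production pipelines a dedicated read-trimming or filtering tool is generally a safer choice, since it handles other quality encodings and multi-line records.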

Exercise: Using SED to process and analyze a real-world bioinformatics dataset

To count up the number of times different base-quality sequences occur in the file DPCh_plate1_F12_S72.R1.fq.gz, you can use the following command:

bash

zcat DPCh_plate1_F12_S72.R1.fq.gz | awk 'NR % 4 == 0 {print length, $0}' | sort | uniq -c | sort -nr

Let’s break down this command:

  • zcat DPCh_plate1_F12_S72.R1.fq.gz: Decompress the gzipped FASTQ file to standard output.
  • awk 'NR % 4 == 0 {print length, $0}': Print the length and the line for every fourth line (lines 4, 8, 12, and so on, which hold the base quality scores).
  • sort: Sort the lines so that identical lines end up next to each other, which uniq requires.
  • uniq -c: Count the number of occurrences of each line.
  • sort -nr: Sort the lines in reverse numerical order based on the count.

This command will output the number of occurrences and the base quality score sequence, sorted in descending order based on the number of occurrences.

To sort the output in a different way, you can pipe the output into:

sort -n -b -r -k 1

Pipe that to less and look through it. It is actually pretty cool!

Here are some sample SED commands that could be used in the exercises:

  • Print specific lines from a file:
bash

sed -n '10p' FILENAME
  • Replace all occurrences of a gene name in a FASTA file:
bash

sed 's/gene_name/new_gene_name/g' FASTA_FILE > NEW_FASTA_FILE
  • Extract specific fields (here the second and third) from a TSV file:
bash

sed 's/^[^\t]*\t\([^\t]*\)\t\([^\t]*\).*/\1\t\2/' TSV_FILE
  • Format BLAST tabular output as comma-separated values:
bash

blastn -query query.fasta -db db.fasta -outfmt 6 | sed 's/\t/,/g' > blast_output.txt