SED Text Processing for Bioinformatics: An Introduction
March 11, 2024
Introduction to SED
SED, which stands for Stream Editor, is a powerful utility in Unix and Linux systems used for parsing and transforming text. In the context of bioinformatics, SED is often used for text manipulation tasks such as filtering, formatting, and summarizing large sequence data files.
For instance, SED can convert a FASTQ file to a FASTA file, a common format for DNA sequences in bioinformatics. Here’s an example command:
sed -n '1~4p;2~4p' FASTQ_FILE | sed 's/^@/>/' > FASTA_FILE
Each FASTQ record spans four lines. The first sed prints lines 1 and 2 of every record (the identifier line and the sequence) and drops the separator and quality lines; the second replaces the leading “@” of each identifier line with “>”, as the FASTA format requires. The first~step address form (1~4) is a GNU sed extension.
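As a hedged illustration of this conversion, the sketch below builds a tiny two-record FASTQ file (the read names and sequences are made up) and performs both steps in a single SED call. It assumes GNU sed and strict four-line records.

```shell
# Build a two-record FASTQ file in a temporary directory (sample data is made up).
tmpdir=$(mktemp -d)
printf '%s\n' '@read1' 'ACGT' '+' 'IIII' '@read2' 'GGCA' '+' '@III' > "$tmpdir/reads.fastq"

# Line 1 of each record: swap the leading "@" for ">" and print.
# Line 2 of each record: print the sequence unchanged.
# Lines 3 and 4 are never printed, so a quality line starting with "@" is harmless.
fasta=$(sed -n '1~4s/^@/>/p;2~4p' "$tmpdir/reads.fastq")
printf '%s\n' "$fasta"

rm -r "$tmpdir"
```

Selecting lines by position rather than by a leading “@” matters because quality strings may themselves begin with “@”, as the second record shows.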
SED is also useful for searching and replacing text in large files, which is a common task in bioinformatics data analysis. For example, you can use SED to replace all occurrences of “chr” with “chromosome” in a file:
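sed -i 's/chr/chromosome/g' FILENAME
This command replaces every occurrence of “chr” with “chromosome” in the file named “FILENAME”. The “-i” option tells SED to edit the file in place, so the changes are saved to the original file. Because the pattern is not anchored, any existing “chromosome” would also be rewritten (to “chromosomeomosome”), so check the input first or anchor the pattern.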
Overall, SED is a versatile and powerful tool for text manipulation in bioinformatics, and is often used in combination with other command-line utilities such as awk and grep to perform complex data analysis tasks.
Basic SED syntax and commands
Here is the basic syntax of SED:
sed OPTIONS... [SCRIPT] [INPUTFILE...]
Here are some common SED commands:
sed 's/unix/linux/' geekfile.txt
: Replaces the first occurrence of “unix” with “linux” on each line of “geekfile.txt”.
sed 's/unix/linux/2' geekfile.txt
: Replaces only the second occurrence of “unix” with “linux” on each line.
sed 's/unix/linux/g' geekfile.txt
: Replaces all occurrences of “unix” with “linux” on each line.
sed 's/unix/linux/g;s/os/system/' geekfile.txt
: Replaces all occurrences of “unix” with “linux”, then the first occurrence of “os” on each line with “system”.
sed '1,5d' geekfile.txt
: Deletes lines 1 to 5.
sed '/^#/d' geekfile.txt
: Deletes every line that starts with “#”. (The substitution s/^#.*$// would only empty those lines and leave blank lines behind.)
sed 's/\s\+$//' geekfile.txt
: Removes trailing whitespace from every line (\s and \+ are GNU sed extensions).
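The difference between the substitution flags is easiest to see on a single line of input. The following sketch uses a made-up sentence and a made-up four-line input; the variable names are only for illustration.

```shell
line='unix is great os. unix is open source. unix is free os.'

first=$(printf '%s\n' "$line" | sed 's/unix/linux/')     # first match on the line
second=$(printf '%s\n' "$line" | sed 's/unix/linux/2')   # second match only
all=$(printf '%s\n' "$line" | sed 's/unix/linux/g')      # every match

# Deleting comment lines with d leaves no blank lines behind.
nocomments=$(printf '%s\n' '# header' 'data1' '# note' 'data2' | sed '/^#/d')

printf '%s\n' "$first" "$second" "$all" "$nocomments"
```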
SED can also be used in pipes and loops to perform more complex text manipulation tasks. For example, you can use SED to remove all the “CHR”s from a file:
sed 's/CHR//g' inputfile.txt > outputfile.txt
This command replaces all occurrences of “CHR” with nothing in the file “inputfile.txt” and saves the output to “outputfile.txt”.
Exercise: Using SED to print specific lines from a file
To print specific lines from a file using SED, you can use the following commands. In each of them, the -n option suppresses SED's default output, so only the lines selected by the p command are printed:
- To print the last line of a file:
sed -n '$p' filename
- To print a range of lines from N to the end of a file:
sed -n 'N,$p' filename
Replace N with the starting line number.
- To print a line that matches a pattern:
sed -n '/PATTERN/p' filename
Replace PATTERN with the regular expression you want to match.
- To print every Nth line starting from line M (GNU sed):
sed -n 'M~Np' filename
Replace M with the starting line number and N with the interval between lines.
- To print a range of lines from M to N:
sed -n 'M,Np' filename
Replace M and N with the starting and ending line numbers, respectively.
- To print a range of lines from the line that matches P1 to the line that matches P2:
sed -n '/P1/,/P2/p' filename
Replace P1 and P2 with the regular expressions you want to match.
- To print the lines that match P1 and the next N lines following P1 (GNU sed):
sed -n '/P1/,+Np' filename
Replace P1 with the regular expression you want to match and N with the number of lines to print after the match.
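The address forms above can be tried on a small file. This is a minimal sketch on a made-up six-line file; the first~step and /P1/,+N forms assume GNU sed.

```shell
tmpdir=$(mktemp -d)
printf '%s\n' one two three four five six > "$tmpdir/lines.txt"

last=$(sed -n '$p' "$tmpdir/lines.txt")                # last line
range=$(sed -n '2,4p' "$tmpdir/lines.txt")             # lines 2 to 4
every=$(sed -n '1~2p' "$tmpdir/lines.txt")             # every 2nd line from line 1 (GNU)
between=$(sed -n '/two/,/four/p' "$tmpdir/lines.txt")  # from first match to second match
after=$(sed -n '/three/,+1p' "$tmpdir/lines.txt")      # match plus the next line (GNU)

printf '%s\n' "$last" "$range" "$every" "$between" "$after"
rm -r "$tmpdir"
```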
Search and Replace with SED
How to search for and replace text using SED
To search for and replace text using SED, you can use the following commands:
- To replace the first occurrence of a pattern on each line of a file:
sed 's/PATTERN/REPLACEMENT/' filename
Replace PATTERN with the regular expression you want to match and REPLACEMENT with the text you want to replace it with.
- To replace all occurrences of a pattern in a file:
sed 's/PATTERN/REPLACEMENT/g' filename
Add the g flag to replace every occurrence on each line rather than only the first.
- To replace a pattern only in a specific range of lines:
sed 'M,Ns/PATTERN/REPLACEMENT/g' filename
Replace M and N with the starting and ending line numbers, respectively, and PATTERN and REPLACEMENT with the text you want to match and replace.
- To replace a pattern only in the first N lines of a file:
sed '1,Ns/PATTERN/REPLACEMENT/g' filename
Replace N with the number of lines to apply the replacement to.
- To replace a pattern only in the last N lines of a file:
sed "$(( $(wc -l < filename) - N + 1 )),\$s/PATTERN/REPLACEMENT/g" filename
SED reads its input as a stream and does not know the total number of lines in advance, so the starting line is computed with wc -l and passed to SED as the first address; the $ address (escaped as \$ inside double quotes) stands for the last line. Replace N with the number of trailing lines to edit. Add the -i option to edit the file in place.
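- To replace a pattern only on lines that match another pattern:
sed '/PRECEDES/s/PATTERN/REPLACEMENT/g' filename
Replace PRECEDES and PATTERN with the regular expressions you want to match and REPLACEMENT with the text you want to replace it with. The /PRECEDES/ address selects whole lines: the substitution runs on every line that contains a match for PRECEDES, wherever on the line that match occurs.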
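As a sketch of address-restricted substitution, the example below edits only the last two lines of a made-up four-line file, then only the lines matching a pattern. The file contents and the value of n are hypothetical; the start line is derived from wc -l because SED itself cannot count back from the end.

```shell
tmpdir=$(mktemp -d)
printf '%s\n' 'chr1 a' 'chr2 b' 'chr3 c' 'chr4 d' > "$tmpdir/regions.txt"

n=2                                      # number of trailing lines to edit
total=$(wc -l < "$tmpdir/regions.txt")   # total line count
start=$(( total - n + 1 ))               # first line of the trailing block

# Double quotes let the shell expand $start; \$ keeps sed's last-line address literal.
tail_edit=$(sed "${start},\$s/chr/chromosome/" "$tmpdir/regions.txt")

# A regex address restricts the substitution to lines that contain "chr2".
by_pattern=$(sed '/chr2/s/b$/B/' "$tmpdir/regions.txt")

printf '%s\n' "$tail_edit" "$by_pattern"
rm -r "$tmpdir"
```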
Regular expressions in SED
SED supports regular expressions, which are patterns used to match and manipulate text. Here are some examples of using regular expressions in SED:
- Matching a pattern:
To match a pattern, you can use the s command followed by the pattern and the replacement string. For example, to replace all occurrences of the word “hello” with “hi” in a file, you can use the following command:
sed 's/hello/hi/g' filename
The g flag at the end of the command tells SED to replace all occurrences of the pattern on each line.
- Matching a pattern with a regular expression:
You can use regular expressions to match more complex patterns. For example, to print all lines that start with the letter “a”, you can use the following command:
sed -n '/^a/p' filename
The ^ character matches the beginning of a line, and the p command prints the matching lines. The -n option suppresses SED's default output; without it, every line would be printed and the matching lines would appear twice.
- Matching a pattern with a regular expression and replacing it:
You can also use regular expressions to match a pattern and replace it with a different string. For example, to replace one or more digits at the beginning of a line with the word “number”, you can use the following command:
sed 's/^[0-9][0-9]* /number /' filename
The [0-9] character class matches any digit, and the * quantifier matches zero or more occurrences of the preceding item, so [0-9][0-9]* matches one or more digits. The g flag is unnecessary here because the ^ anchor allows at most one match per line.
- Matching a pattern with a regular expression and deleting the line:
You can also use regular expressions to match a pattern and delete the entire line. For example, to delete all lines that contain the word “error”, you can use the following command:
sed '/error/d' filename
The d command deletes the entire line.
- Matching a pattern with a regular expression and printing only the matching lines:
You can also use regular expressions to match a pattern and print only the matching lines. For example, to print all lines that contain the word “warning”, you can use the following command:
sed -n '/warning/p' filename
The -n option tells SED not to print any lines by default, and the p command prints only the matching lines.
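The effect of the quantifier and of the -n option can be checked on a few made-up lines. In this sketch, the second input line starts with a space and no digits, which is exactly the case where “zero or more digits” and “one or more digits” differ.

```shell
input=$(printf '%s\n' '12 apples' ' pears' 'abc 7 plums' 'apricot')

# [0-9][0-9]* requires at least one digit, so " pears" is left alone.
numbered=$(printf '%s\n' "$input" | sed 's/^[0-9][0-9]* /number /')

# [0-9]* alone also matches zero digits, so " pears" is rewritten too.
too_greedy=$(printf '%s\n' "$input" | sed 's/^[0-9]* /number /')

# With -n, only the lines selected by p are printed.
starts_with_a=$(printf '%s\n' "$input" | sed -n '/^a/p')

printf '%s\n' "$numbered" "$too_greedy" "$starts_with_a"
```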
Exercise: Using SED to replace all occurrences of a gene name in a FASTA file
To replace all occurrences of a gene name in a FASTA file using SED, you can use the following command:
sed -i 's/old_gene_name/new_gene_name/g' filename.fasta
Replace old_gene_name with the name of the gene you want to replace, and new_gene_name with the new name you want to use. The -i option tells SED to edit the file in place, meaning the changes will be saved to the original file. The g flag at the end of the command tells SED to replace all occurrences of the pattern on each line.
Here’s an example:
$ cat example.fasta
>gene1
ATGCGATCGATCGATCGATCGATCG
>gene2
ATGCGATCGATCGATCGATCGATCG
>gene1
ATGCGATCGATCGATCGATCGATCG

$ sed -i 's/gene1/new_gene_name/g' example.fasta

$ cat example.fasta
>new_gene_name
ATGCGATCGATCGATCGATCGATCG
>gene2
ATGCGATCGATCGATCGATCGATCG
>new_gene_name
ATGCGATCGATCGATCGATCGATCG
In this example, we replaced all occurrences of “gene1” with “new_gene_name” in the example.fasta file.
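One caveat worth sketching: an unanchored pattern such as gene1 also matches the start of gene10. The example below, on made-up records with a hypothetical new name abcA, contrasts the plain substitution with one restricted to header lines and to whole names.

```shell
tmpdir=$(mktemp -d)
printf '%s\n' '>gene1' 'ATGC' '>gene10' 'GGCC' '>gene1 putative kinase' 'TTAA' > "$tmpdir/genes.fasta"

# Unanchored: also rewrites the "gene1" prefix of "gene10".
loose=$(sed 's/gene1/abcA/g' "$tmpdir/genes.fasta")

# Restricted to header lines, with the name followed by a space or end of line.
strict=$(sed -E '/^>/s/^>gene1( |$)/>abcA\1/' "$tmpdir/genes.fasta")

printf '%s\n' "$loose" "$strict"
rm -r "$tmpdir"
```

The /^>/ address also keeps the substitution away from sequence lines, which is a sensible default whenever the name could coincide with sequence text.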
File Manipulation with SED
How to delete, insert, and modify lines in a file using SED
Here are some examples of how to delete, insert, and modify lines in a file using SED:
- Deleting lines:
To delete a line, you can use the d command. For example, to delete the third line of a file, you can use the following command:
sed '3d' filename
To delete a range of lines, you can use the , character to specify the range. For example, to delete lines 3 to 5 of a file, you can use the following command:
sed '3,5d' filename
To delete all lines that match a pattern, you can use the following command:
sed '/pattern/d' filename
- Inserting lines:
To insert a line, you can use the i command. For example, to insert the line “This is a new line” before the third line of a file, you can use the following command:
sed '3i This is a new line' filename
To insert multiple lines, you can use the -e option to specify multiple commands. For example, to insert the lines “This is a new line 1” and “This is a new line 2” before the third line of a file, you can use the following command:
sed -e '3i This is a new line 1' -e '3i This is a new line 2' filename
- Modifying lines:
To modify a line, you can use the s command. For example, to replace the first occurrence of the word “old” with “new” in the third line of a file, you can use the following command:
sed '3s/old/new/' filename
To replace all occurrences of the word “old” with “new” in the third line of a file, you can use the following command:
sed '3s/old/new/g' filename
To replace the first occurrence of the word “old” with “new” in all lines that match a pattern, you can use the following command:
sed '/pattern/s/old/new/' filename
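The three operations can be compared side by side on a small made-up file. This sketch assumes GNU sed for the one-line form of the i command.

```shell
tmpdir=$(mktemp -d)
printf '%s\n' 'line 1' 'line 2' 'old old line 3' 'line 4' > "$tmpdir/notes.txt"

deleted=$(sed '2d' "$tmpdir/notes.txt")                      # drop line 2
inserted=$(sed '3i This is a new line' "$tmpdir/notes.txt")  # add a line before line 3
modified=$(sed '3s/old/new/' "$tmpdir/notes.txt")            # first "old" on line 3 only

printf '%s\n' "$deleted" "$inserted" "$modified"
rm -r "$tmpdir"
```

None of these commands changes notes.txt itself; without -i, SED writes the edited text to standard output.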
Exercise: Using SED to extract specific fields from a tab-separated value (TSV) file
The simplest way to extract specific fields from a tab-separated value (TSV) file is the cut command:
cut -f N1,N2,N3 filename.tsv
Replace N1, N2, and N3 with the field numbers you want to extract, separated by commas. For example, to extract the first and third fields from a TSV file named example.tsv, you can use the following command:
cut -f 1,3 example.tsv
If you want to use SED instead of cut, you can use the following command:
sed -E 's/^([^\t]*)\t[^\t]*\t([^\t]*).*/\1\t\2/' filename.tsv
This command captures the first and third fields in groups, where [^\t]* matches one field (any run of non-tab characters), and replaces the whole line with the two captured fields separated by a tab character. The \t escape is a GNU sed extension.
Here’s an example:
$ cat example.tsv
field1 field2 field3 field4
value1 value2 value3 value4
value5 value6 value7 value8

$ sed -E 's/^([^\t]*)\t[^\t]*\t([^\t]*).*/\1\t\2/' example.tsv
field1 field3
value1 value3
value5 value7
In this example, we extracted the first and third fields from the example.tsv file using SED.
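To check that a SED-based extraction agrees with cut, the two can be run on the same input. The sketch below is one way to write the SED version, assuming GNU sed (for the \t escape) and a made-up two-row file; it is not the only possible expression.

```shell
tmpdir=$(mktemp -d)
printf 'field1\tfield2\tfield3\tfield4\n' >  "$tmpdir/example.tsv"
printf 'value1\tvalue2\tvalue3\tvalue4\n' >> "$tmpdir/example.tsv"

# Reference result from cut.
with_cut=$(cut -f 1,3 "$tmpdir/example.tsv")

# Capture field 1 and field 3; [^\t]* matches one tab-free field.
with_sed=$(sed -E 's/^([^\t]*)\t[^\t]*\t([^\t]*).*/\1\t\2/' "$tmpdir/example.tsv")

printf '%s\n' "$with_sed"
rm -r "$tmpdir"
```

For anything beyond two or three columns, cut or awk is usually easier to read than a capture-group expression.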
Advanced SED Techniques
SED can process multiple files and streams:
- To process multiple files:
To process multiple files using SED, list them after the SED script. By default, SED treats multiple input files as one long stream, so the following command prints only the first line of file1.txt:
sed -n '1p' file1.txt file2.txt file3.txt
The -s option (GNU sed) processes each file separately. For example, to print the last line of each file, you can use the following command:
sed -ns '$p' file1.txt file2.txt file3.txt
- To process a stream:
SED reads standard input when no file name is given, or when the file name is the - character. For example, to print the first 10 lines of a stream, you can use the following command:
echo "hello world" | sed '1,10!d'
You can also use a pipe to process the output of another command. For example, some versions of wc pad the line count with leading spaces; the following command prints the number of lines in a file with that padding removed:
wc -l < file.txt | sed 's/^ *//'
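The difference between one combined stream and separate files shows up clearly with the $ address. This is a small sketch on two made-up files; the -s option and the - file name are used as in GNU sed.

```shell
tmpdir=$(mktemp -d)
printf '%s\n' a1 a2 > "$tmpdir/a.txt"
printf '%s\n' b1 b2 > "$tmpdir/b.txt"

# One combined stream: "$" is the last line of the last file.
combined=$(sed -n '$p' "$tmpdir/a.txt" "$tmpdir/b.txt")

# Separate files: "$" is the last line of each file.
separate=$(sed -n -s '$p' "$tmpdir/a.txt" "$tmpdir/b.txt")

# Standard input, named explicitly with "-".
from_stdin=$(printf '%s\n' x y z | sed -n '2p' -)

printf '%s\n' "$combined" "$separate" "$from_stdin"
rm -r "$tmpdir"
```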
How to combine SED with other command-line tools
To combine SED with other command-line tools, you can use pipes (|) or input/output redirection. Here are some examples:
- Using pipes:
To combine SED with other command-line tools using pipes, separate the commands with a pipe character (|). For example, to print the first 10 lines of a file and then count the number of bytes in those lines, you can use the following command:
sed '1,10!d' file.txt | wc -c
- Using input/output redirection:
The < character feeds a file to a command's standard input, and the > character writes a command's standard output to a file. For example, to sort the lines of a file and save only the unique lines to a new file, you can use the following command:
sort < file.txt | uniq > unique.txt
You can also use input/output redirection with SED. For example, to replace all occurrences of “hello” with “world” in a file and save the result to a new file, you can use the following command:
sed 's/hello/world/g' file.txt > newfile.txt
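Pipes and redirection are often mixed in one workflow. The following sketch, on a made-up three-line file, uses redirection for the SED step and a pipe for the counting step; the file names are only for illustration.

```shell
tmpdir=$(mktemp -d)
printf '%s\n' 'hello b' 'hello a' 'hello b' > "$tmpdir/file.txt"

# Redirection: read from file.txt, write the edited text to newfile.txt.
sed 's/hello/world/' < "$tmpdir/file.txt" > "$tmpdir/newfile.txt"

# Pipe: sort, count adjacent duplicates, then strip the padding uniq -c adds.
counts=$(sort "$tmpdir/newfile.txt" | uniq -c | sed 's/^ *//')

printf '%s\n' "$counts"
rm -r "$tmpdir"
```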
Exercise: Using SED to filter and format BLAST output
To filter and format BLAST output, SED is typically combined with other tools. The examples below assume tab-separated output in the standard 12-column tabular format (-outfmt 6: qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore):
- To extract specific fields from the BLAST output:
To extract specific fields from the BLAST output, you can use the cut command to select the desired columns. For example, to extract the query ID, subject ID, percentage identity, alignment length, and E-value, you can use the following command:
cut -f 1,2,3,4,11 blast_output.txt
- To replace spaces with tab characters:
If a file uses spaces rather than tabs as the field separator, you can convert it with the tr command. For example, to replace all spaces with tabs, you can use the following command:
tr ' ' '\t' < blast_output.txt > blast_output_tab.txt
- To remove the header line from the BLAST output:
If the file starts with a header line, you can remove it with the sed command. For example, to remove the first line, you can use the following command:
sed '1d' blast_output.txt
- To sort the BLAST output by E-value:
To sort the BLAST output by E-value, you can use the sort command. E-values are written in scientific notation (for example, 1e-50), so use the general numeric key (g) rather than n. For example, to sort the output in ascending order, you can use the following command:
sort -t $'\t' -k 11,11g blast_output.txt
- To remove duplicate entries from the BLAST output:
To list each pair of query and subject IDs only once, you can use the sort and uniq commands:
cut -f 1,2 blast_output.txt | sort | uniq > unique_blast_output.txt
This keeps only the two ID columns. To keep one full row per pair instead, you can use sort -u -k 1,1 -k 2,2 blast_output.txt.
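These steps chain naturally into one pipeline. The sketch below uses three made-up hits laid out in the standard 12-column tabular format (an assumption about the input), sorts them by E-value, and keeps five columns; the -g key assumes a sort implementation that supports general numeric comparison, such as GNU sort.

```shell
tmpdir=$(mktemp -d)
# Columns: qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore
printf 'q1\ts1\t98.5\t200\t3\t0\t1\t200\t5\t204\t1e-50\t380\n'  >  "$tmpdir/blast.tsv"
printf 'q1\ts2\t90.0\t150\t15\t0\t1\t150\t9\t158\t2e-10\t120\n' >> "$tmpdir/blast.tsv"
printf 'q2\ts1\t99.0\t300\t3\t0\t1\t300\t1\t300\t1e-120\t550\n' >> "$tmpdir/blast.tsv"

tab=$(printf '\t')

# Sort by column 11 (E-value) with -g so that 1e-120 sorts before 1e-50,
# then keep query ID, subject ID, identity, length, and E-value.
best_first=$(sort -t "$tab" -k 11,11g "$tmpdir/blast.tsv" | cut -f 1,2,3,4,11)

# The first line is the hit with the smallest E-value.
top_hit=$(printf '%s\n' "$best_first" | sed -n '1p')

printf '%s\n' "$best_first"
rm -r "$tmpdir"
```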
SED in Bioinformatics Workflows
How to incorporate SED into bioinformatics pipelines and workflows
To incorporate SED into bioinformatics pipelines and workflows, you can use it to filter and format data in various ways. Here’s an example of how to count the occurrences of different base-quality sequences in a FASTQ file using SED:
First, let’s extract the lines that hold base quality scores, which are lines 4, 8, 12, etc. We can use SED to print every fourth line starting from the fourth line:
zcat data/DPCh_plate1_F12_S72.R1.fq.gz | sed -n '4~4p'
Next, we can clean up the quality strings and count them. The base quality scores are represented by characters with ASCII values 33 to 126 (“!” to “~”). We can use SED to delete any character outside this range, such as a stray carriage return, and then count how often each quality string occurs:
zcat data/DPCh_plate1_F12_S72.R1.fq.gz | sed -n '4~4p' | sed -e 's/[^!-~]//g' | awk '{print length, $0}' | sort | uniq -c | sort -nr
Let’s break down this command:
zcat data/DPCh_plate1_F12_S72.R1.fq.gz
: Extract the contents of the gzipped FASTQ file.
sed -n '4~4p'
: Print every fourth line starting from the fourth line (the quality lines).
sed -e 's/[^!-~]//g'
: Delete all characters outside the range of base quality scores.
awk '{print length, $0}'
: Print the length of each line followed by the line itself.
sort
: Sort the lines so that identical lines are adjacent.
uniq -c
: Count the number of occurrences of each line.
sort -nr
: Sort the lines in reverse numerical order based on the count.
This command will output the number of occurrences, the length, and the base quality sequence, sorted in descending order based on the number of occurrences.
The final sort can also be written with explicit options. For example, the following command sorts numerically (-n) on the first field (-k 1), ignoring leading blanks (-b), in descending order (-r):
zcat data/DPCh_plate1_F12_S72.R1.fq.gz | sed -n '4~4p' | sed -e 's/[^!-~]//g' | awk '{print length, $0}' | sort | uniq -c | sort -n -b -r -k 1
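A scaled-down version of this count can be run on a made-up three-read file. The sketch below assumes GNU sed and four-line FASTQ records, and uses gzip -dc in place of zcat for portability.

```shell
tmpdir=$(mktemp -d)
printf '%s\n' '@r1' 'ACGT' '+' 'IIII' \
              '@r2' 'ACGA' '+' 'IIII' \
              '@r3' 'TTGA' '+' 'II#I' > "$tmpdir/reads.fastq"
gzip "$tmpdir/reads.fastq"

# Quality lines are lines 4, 8, 12, ...; count identical quality strings,
# most frequent first, and strip the padding that uniq -c adds.
counts=$(gzip -dc "$tmpdir/reads.fastq.gz" | sed -n '4~4p' | sort | uniq -c | sort -nr | sed 's/^ *//')

printf '%s\n' "$counts"
rm -r "$tmpdir"
```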
Real-world examples of SED in bioinformatics data analysis
Here are a few real-world examples of SED in bioinformatics data analysis:
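- Filtering FASTQ files: SED can be used to filter out low-quality reads from FASTQ files. SED cannot compare numbers, but with Phred+33 encoding a quality score below 20 corresponds to one of the characters “!” through “4”. Assuming four-line records, the following command (GNU sed) filters out reads with more than 10 bases with a quality score below 20:
LC_ALL=C sed -n 'N;N;N;/\n\([^\n!-4]*[!-4]\)\{11\}[^\n]*$/!p' input.fastq > output.fastq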
- Extracting specific fields from VCF files: SED can be used to extract specific fields from Variant Call Format (VCF) files. For example, the following command (GNU sed) drops the “##” meta-information lines and extracts the chromosome, position, and reference allele fields (columns 1, 2, and 4):
sed -E '/^##/d; s/^([^\t]*)\t([^\t]*)\t[^\t]*\t([^\t]*).*/\1\t\2\t\3/' input.vcf
- Replacing sequence names in FASTA files: SED can be used to replace sequence names in FASTA files. For example, the following command replaces all occurrences of “>seq1” with “>gene1”:
sed 's/>seq1/>gene1/g' input.fasta > output.fasta
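- Converting between FASTA and FASTQ formats: SED can be used to convert between FASTA and FASTQ formats. FASTA files carry no quality scores, so a FASTA-to-FASTQ conversion has to invent them. For example, assuming each sequence occupies a single line, the following command (GNU sed) converts a FASTA file to a FASTQ file with a placeholder quality of “I” for every base:
sed -n '/^>/{s/^>/@/;N;p;s/.*\n//;s/./I/g;s/^/+\n/;p}' input.fasta > output.fastq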
- Extracting specific lines from BAM files: Binary Alignment Map (BAM) files are binary, so SED is applied to the text output of samtools rather than to the file itself. For example, the following command extracts the header lines and strips the leading “@” from each:
samtools view -H input.bam | sed 's/^@//' > header.txt
I hope these examples give you an idea of how SED can be used in bioinformatics data analysis.
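Two of these ideas can be sketched on made-up data: dropping reads with many low-quality bases, and pulling three columns out of a VCF file. The sketch assumes GNU sed, Phred+33 quality encoding (so scores below 20 are the characters “!” through “4”), and strict four-line FASTQ records; the file names are hypothetical.

```shell
tmpdir=$(mktemp -d)

# r1: every base has quality "I" (Q40); r2: twelve "#" (Q2) characters.
printf '%s\n' '@r1' 'ACGTACGTACGT' '+' 'IIIIIIIIIIII' \
              '@r2' 'ACGTACGTACGT' '+' '############' > "$tmpdir/in.fastq"

# N;N;N gathers one 4-line record. The regex matches when the last line
# (the quality string) holds at least 11 characters in the range "!".."4";
# "!p" prints only the records that do NOT match.
kept=$(LC_ALL=C sed -n 'N;N;N;/\n\([^\n!-4]*[!-4]\)\{11\}[^\n]*$/!p' "$tmpdir/in.fastq")

# VCF: delete "##" meta lines, keep columns 1 (CHROM), 2 (POS), and 4 (REF).
printf '##fileformat=VCFv4.2\n#CHROM\tPOS\tID\tREF\tALT\n1\t100\trs1\tA\tG\n' > "$tmpdir/in.vcf"
fields=$(sed -E '/^##/d; s/^([^\t]*)\t([^\t]*)\t[^\t]*\t([^\t]*).*/\1\t\2\t\3/' "$tmpdir/in.vcf")

printf '%s\n' "$kept" "$fields"
rm -r "$tmpdir"
```

Dedicated tools are generally a safer choice for quality filtering on real data; the SED version mainly shows how far addresses and regular expressions can be pushed.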
Exercise: Using SED to process and analyze a real-world bioinformatics dataset
To count the number of times different base-quality sequences occur in the file DPCh_plate1_F12_S72.R1.fq.gz, you can use the following command:
zcat DPCh_plate1_F12_S72.R1.fq.gz | awk 'NR % 4 == 0 {print length, $0}' | sort | uniq -c | sort -nr
Let’s break down this command:
zcat DPCh_plate1_F12_S72.R1.fq.gz
: Extract the contents of the gzipped FASTQ file.
awk 'NR % 4 == 0 {print length, $0}'
: Print the length and the line for every fourth line (the base quality scores).
sort
: Sort the lines so that identical lines are adjacent.
uniq -c
: Count the number of occurrences of each line.
sort -nr
: Sort the lines in reverse numerical order based on the count.
This command will output the number of occurrences, the length, and the base quality score sequence, sorted in descending order based on the number of occurrences.
To sort the output in a different way, you can pipe the output into:
sort -n -b -r -k 1
Pipe that to less and look through it. It is actually pretty cool!
Here are some sample SED commands that could be used in the exercises:
- Print specific lines from a file:
sed -n '10p' FILENAME
- Replace all occurrences of a gene name in a FASTA file:
sed 's/gene_name/new_gene_name/g' FASTA_FILE > NEW_FASTA_FILE
- Extract specific fields from a TSV file (here, the second and third fields; GNU sed):
sed -E 's/^[^\t]*\t([^\t]*)\t([^\t]*).*/\1\t\2/' TSV_FILE
- Filter and format BLAST output (drop any comment lines and convert tabs to commas; GNU sed):
blastn -query query.fasta -db db.fasta -outfmt 6 | sed -e '/^#/d' -e 's/\t/,/g' > blast_output.txt