Python Strings for Bioinformatics: From Basics to Applications
March 25, 2024, by admin
Course Description: This course provides a comprehensive introduction to using Python strings in bioinformatics. Starting with the fundamentals of Python strings, the course progresses to advanced topics such as sequence alignment, parsing biological data, and manipulating biological sequences. Through hands-on exercises and real-world examples, students will learn how to apply Python string operations to solve common bioinformatics tasks.
Introduction to Python Strings
Basic string operations
In Python, strings are sequences of characters and support various operations for manipulation. Here are some basic string operations:
- Concatenation: Joining two or more strings together.
str1 = "Hello"
str2 = "World"
result = str1 + " " + str2
print(result) # Output: Hello World
- Length: Getting the length of a string.
string = "Python"
length = len(string)
print(length) # Output: 6
- Indexing: Accessing individual characters in a string.
string = "Python"
print(string[0]) # Output: P
- Slicing: Extracting a substring from a string.
string = "Python"
substring = string[2:5]
print(substring) # Output: tho
- Formatting: Formatting strings using placeholders or the format() method.
name = "Alice"
age = 30
message = "My name is {} and I am {} years old.".format(name, age)
print(message) # Output: My name is Alice and I am 30 years old.
- Conversion: Converting the case of a string.
string = "Hello, World!"
print(string.lower()) # Output: hello, world!
print(string.upper()) # Output: HELLO, WORLD!
- Splitting: Splitting a string into a list of substrings.
string = "apple,banana,cherry"
fruits = string.split(",")
print(fruits) # Output: ['apple', 'banana', 'cherry']
- Joining: Joining a list of strings into a single string.
fruits = ['apple', 'banana', 'cherry']
string = ",".join(fruits)
print(string) # Output: apple,banana,cherry
These are just some of the basic string operations in Python. Strings in Python are versatile and offer many more operations for manipulation and formatting.
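To connect these operations to sequence data, here is a minimal sketch that computes GC content using only len(), count(), and upper(). The sequence is a made-up example, not taken from any real record.

```python
# Hypothetical example sequence, chosen only for illustration
dna = "ATGCGCTAAT"

# len() and str.count() are enough to compute base composition
gc_count = dna.count("G") + dna.count("C")
gc_fraction = gc_count / len(dna)

# Case conversion helps when the input mixes upper and lower case
normalized = "atgCGCtaat".upper()

print(f"GC content: {gc_fraction:.2f}")
print("Normalized matches original:", normalized == dna)
```

Normalizing case before counting matters because count() is case-sensitive.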
String indexing and slicing
String indexing and slicing are common operations used to access and extract substrings from a string in Python. Here’s a brief overview of how indexing and slicing work:
- String Indexing:
- In Python, indexing starts at 0, so the first character of a string has index 0, the second character has index 1, and so on.
- You can use positive indexes to access characters from the beginning of the string and negative indexes to access characters from the end of the string.
string = "Hello, World!"
# Accessing individual characters
print(string[0]) # Output: 'H'
print(string[7]) # Output: 'W'
# Accessing characters using negative indexes
print(string[-1]) # Output: '!'
print(string[-3]) # Output: 'l'
- String Slicing:
- String slicing allows you to extract a substring from a string by specifying a start index and an end index (not inclusive).
- The syntax for string slicing is string[start:end]. If start is not specified, it defaults to 0. If end is not specified, it defaults to the end of the string.
string = "Hello, World!"
# Slicing from index 2 to index 5 (not inclusive)
print(string[2:5]) # Output: 'llo'
# Slicing from the beginning to index 5 (not inclusive)
print(string[:5]) # Output: 'Hello'
# Slicing from index 7 to the end
print(string[7:]) # Output: 'World!'
# Slicing using negative indexes
print(string[-6:-1]) # Output: 'World'
- Step Size in Slicing:
- You can specify a step size when slicing a string to extract characters at regular intervals.
- The syntax for specifying a step size is string[start:end:step].
string = "Hello, World!"
# Slicing with a step size of 2
print(string[::2]) # Output: 'Hlo ol!'
# Reversing a string
print(string[::-1]) # Output: '!dlroW ,olleH'
String indexing and slicing are powerful features in Python that allow you to work with strings efficiently and effectively.
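Slicing maps naturally onto codons and reading frames. The sketch below uses a made-up 12-base sequence to show one way to split a sequence into codons and to shift the reading frame.

```python
dna = "ATGGCCATTGTA"  # hypothetical 12-base sequence

# Slice the sequence into consecutive 3-base codons
codons = [dna[i:i + 3] for i in range(0, len(dna), 3)]

# A step of 3 picks the first base of every codon
first_positions = dna[::3]

# The second reading frame simply starts one base later
frame2 = dna[1:]

print(codons)
print(first_positions)
print(frame2)
```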
String formatting
String formatting in Python allows you to create strings with placeholders that can be replaced with variables or values. There are several ways to format strings in Python, including using the format() method, f-strings (formatted string literals), and the % operator. Here’s an overview of each method:
- Using the format() Method:
- The format() method formats a string by replacing placeholders ({}) with values.
- You can specify the order of placeholders or use named placeholders for more clarity.
name = "Alice"
age = 30
message = "My name is {} and I am {} years old.".format(name, age)
print(message) # Output: My name is Alice and I am 30 years old.
- Using f-strings (Formatted String Literals):
- F-strings provide a more concise and readable way to format strings by embedding expressions directly inside string literals.
- F-strings are prefixed with f or F before the string literal.
name = "Alice"
age = 30
message = f"My name is {name} and I am {age} years old."
print(message) # Output: My name is Alice and I am 30 years old.
- Using the % Operator (older printf-style formatting):
- The % operator can be used for string formatting. It is not deprecated, but format() and f-strings are generally preferred in new code.
- Placeholders are represented by %s for strings, %d for integers, and %f for floats.
name = "Alice"
age = 30
message = "My name is %s and I am %d years old." % (name, age)
print(message) # Output: My name is Alice and I am 30 years old.
String formatting is a powerful feature in Python that allows you to create dynamic and customized strings for various purposes, such as output messages, logging, and data representation.
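Format specifiers inside f-strings are useful for reporting sequence statistics. The following sketch builds a FASTA-style header and a fixed-width table row; the identifier, the sequence, and the header layout are illustrative assumptions, not a standard.

```python
seq_id = "seq1"          # hypothetical identifier
sequence = "ATGCGCTAAT"  # hypothetical sequence
gc = (sequence.count("G") + sequence.count("C")) / len(sequence)

# .1% renders a fraction as a percentage with one decimal place
header = f">{seq_id} length={len(sequence)} gc={gc:.1%}"

# Width and alignment specifiers produce fixed-width columns:
# left-aligned ID (10), right-aligned length (6), right-aligned GC (8)
row = f"{seq_id:<10}{len(sequence):>6}{gc:>8.3f}"

print(header)
print(row)
```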
Working with Biological Sequences
Understanding DNA, RNA, and protein sequences
DNA, RNA, and protein sequences are fundamental components of living organisms, each serving distinct biological functions:
- DNA (Deoxyribonucleic Acid):
- DNA is a molecule that carries the genetic instructions used in the growth, development, functioning, and reproduction of all known living organisms.
- It consists of two long chains of nucleotides twisted into a double helix and held together by hydrogen bonds between complementary base pairs (adenine [A] with thymine [T], and cytosine [C] with guanine [G]).
- DNA encodes the information necessary for the synthesis of proteins, which are essential for the structure, function, and regulation of the body’s tissues and organs.
- RNA (Ribonucleic Acid):
- RNA is a molecule that plays multiple roles in the coding, decoding, regulation, and expression of genes.
- It is single-stranded and is typically synthesized from DNA templates during the process of transcription.
- There are several types of RNA, including messenger RNA (mRNA), transfer RNA (tRNA), and ribosomal RNA (rRNA), each with specific functions in protein synthesis.
- Proteins:
- Proteins are large, complex molecules that are essential for the structure and function of cells, tissues, and organs.
- Proteins are made up of amino acids linked together in a specific sequence, which is determined by the sequence of nucleotides in the corresponding mRNA molecule.
- The sequence of amino acids in a protein determines its structure and function, and any changes in the sequence can lead to changes in the protein’s properties.
In summary, DNA carries the genetic information, RNA helps in the expression of this information, and proteins are the primary functional molecules in cells, performing a wide variety of roles based on their structure and function. Understanding these sequences and their interactions is crucial for understanding the molecular basis of life and for various applications in biotechnology, medicine, and evolutionary biology.
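Because each of these molecules is written as a string over a small alphabet, simple string methods already model some of the relationships described above. The sketch below assumes a made-up coding-strand DNA sequence, derives the corresponding mRNA by replacing T with U, and tallies the base composition.

```python
from collections import Counter

dna = "ATGGCCATTGTAATGGGCCGCTGA"  # hypothetical coding-strand sequence

# The mRNA has the same sequence as the coding strand, with U in place of T
rna = dna.replace("T", "U")

# Base composition of the DNA sequence
base_counts = Counter(dna)

print(rna)
print(dict(base_counts))
```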
Reading sequences from files
To read sequences from files in Python, you can use the Bio.SeqIO module from Biopython. This module provides a convenient way to parse and read sequences from various file formats, such as FASTA, GenBank, and FASTQ. Here’s an example of how to read sequences from a FASTA file:
- Install Biopython: If you haven’t installed Biopython yet, you can install it using pip:
pip install biopython
- Read sequences from a FASTA file: Suppose you have a file named sequences.fasta containing one or more sequences in FASTA format. You can use the following code to read these sequences:
from Bio import SeqIO
fasta_file = "sequences.fasta"
# Read sequences from the FASTA file
sequences = []
for record in SeqIO.parse(fasta_file, "fasta"):
    sequences.append(record.seq)
# Print the first sequence
print("First sequence:", sequences[0])
This code reads the sequences from the sequences.fasta file and stores them in a list. You can then access and manipulate these sequences as needed in your program.
- Read sequences from other file formats: If your sequences are stored in a different format, such as GenBank or FASTQ, you can use the same approach with minor modifications. For example, to read sequences from a GenBank file:
from Bio import SeqIO
genbank_file = "sequences.gb"
# Read sequences from the GenBank file
sequences = []
for record in SeqIO.parse(genbank_file, "genbank"):
    sequences.append(record.seq)
# Print the first sequence
print("First sequence:", sequences[0])
Replace sequences.gb with the name of your GenBank file.
The SeqIO.parse() function reads sequences from the specified file and returns an iterator over the records. You can then iterate over this iterator to access individual sequences and their associated metadata (e.g., sequence ID, description).
To read sequences from a file in Python without using Biopython, you can simply open the file and read its contents. Here’s a basic example of how to read a sequence from a FASTA file:
# Open the FASTA file for reading
with open("sequence.fasta", "r") as file:
    lines = file.readlines()
# Parse the sequence from the file
sequence = ""
for line in lines:
    if not line.startswith(">"): # Skip the header line
        sequence += line.strip()
print("Sequence:", sequence)
In this example, replace "sequence.fasta" with the path to your FASTA file. The code reads the lines from the file, skips the header line (which starts with > in FASTA format), and concatenates the remaining lines to form the sequence.
If you have a file with multiple sequences and want to read them one by one, you can modify the code to store each sequence in a list:
# Open the FASTA file for reading
with open("sequences.fasta", "r") as file:
    lines = file.readlines()
sequences = []
current_sequence = ""
for line in lines:
    if line.startswith(">"): # Start of a new sequence
        if current_sequence:
            sequences.append(current_sequence)
        current_sequence = ""
    else:
        current_sequence += line.strip()
# Add the last sequence to the list
if current_sequence:
    sequences.append(current_sequence)
# Print the list of sequences
for idx, seq in enumerate(sequences, start=1):
    print(f"Sequence {idx}: {seq}")
This code reads multiple sequences from a FASTA file, stores them in a list, and then prints each sequence along with its index.
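The reader above discards the header lines. A possible variation, sketched here as a generator, keeps each header together with its sequence and avoids loading the whole file at once. The function name read_fasta and the in-memory sample data are illustrative assumptions.

```python
import io

def read_fasta(handle):
    """Yield (header, sequence) tuples from an open FASTA handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    # Emit the final record once the input is exhausted
    if header is not None:
        yield header, "".join(chunks)

# Hypothetical in-memory FASTA data standing in for a file on disk
example = io.StringIO(">seq1 first\nATGC\nGGTA\n>seq2\nTTAA\n")
records = list(read_fasta(example))
for header, sequence in records:
    print(header, sequence)
```

With a real file, the same function can be called as read_fasta(open_file) inside a with open(...) block.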
Writing sequences to files
To write sequences to a file in Python, you can open a file in write mode and then write the sequences to the file. Here’s a basic example of how to write a sequence to a file in FASTA format:
sequence_id = "Sequence_1"
sequence = "ATCGATCGATCG"
# Open the output file for writing
with open("output.fasta", "w") as file:
    # Write the sequence in FASTA format
    file.write(f">{sequence_id}\n") # Write the sequence ID
    file.write(f"{sequence}\n") # Write the sequence
In this example, the with open("output.fasta", "w") as file statement opens a file named output.fasta in write mode. The file.write() statements are then used to write the sequence ID (prefixed with >) and the sequence itself to the file. The "\n" character is used to add a newline at the end of each line.
If you have multiple sequences and want to write them to the same file, you can modify the code to write each sequence one by one:
sequences = {
    "Sequence_1": "ATCGATCGATCG",
    "Sequence_2": "GCTAGCTAGCTA",
    "Sequence_3": "TATGTATGTATG"
}
# Open the output file for writing
with open("output.fasta", "w") as file:
    # Write each sequence in FASTA format
    for sequence_id, sequence in sequences.items():
        file.write(f">{sequence_id}\n") # Write the sequence ID
        file.write(f"{sequence}\n") # Write the sequence
In this example, the sequences dictionary contains sequence IDs as keys and sequences as values. The for loop iterates over each key-value pair in the dictionary and writes the sequence ID and sequence to the file in FASTA format.
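FASTA files conventionally wrap long sequences onto several lines. The sketch below shows one way to do that with slicing. The function name write_fasta and the default width of 60 are assumptions for illustration, and an in-memory buffer stands in for a real file.

```python
import io

def write_fasta(handle, records, width=60):
    """Write a dict of {id: sequence} in FASTA format, wrapping sequence lines."""
    for seq_id, sequence in records.items():
        handle.write(f">{seq_id}\n")
        # Emit the sequence in chunks of at most `width` characters
        for i in range(0, len(sequence), width):
            handle.write(sequence[i:i + width] + "\n")

# Hypothetical record: 130 bases, so it wraps into lines of 60, 60, and 10
buffer = io.StringIO()
write_fasta(buffer, {"seq1": "A" * 130}, width=60)
output = buffer.getvalue()
print(output)
```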
Sequence Alignment
Pairwise sequence alignment
Pairwise sequence alignment is a method used to compare two sequences (DNA, RNA, or protein) to identify regions of similarity. It helps in understanding the evolutionary relationships between sequences and identifying functional or structural similarities. One of the most common algorithms used for pairwise sequence alignment is the Needleman-Wunsch algorithm for global alignment.
Here’s a basic example of pairwise sequence alignment using the Needleman-Wunsch algorithm in Python, without using Biopython:
def needleman_wunsch(seq1, seq2, match=1, mismatch=-1, gap=-1):
    # Initialize the scoring matrix
    rows = len(seq1) + 1
    cols = len(seq2) + 1
    score_matrix = [[0 for _ in range(cols)] for _ in range(rows)]
    # Initialize the traceback matrix (1 = up, 2 = left, 3 = diagonal)
    traceback_matrix = [[0 for _ in range(cols)] for _ in range(rows)]
    # Initialize the first row and column of the scoring matrix
    for i in range(1, rows):
        score_matrix[i][0] = score_matrix[i-1][0] + gap
        traceback_matrix[i][0] = 1
    for j in range(1, cols):
        score_matrix[0][j] = score_matrix[0][j-1] + gap
        traceback_matrix[0][j] = 2
    # Fill in the scoring matrix
    for i in range(1, rows):
        for j in range(1, cols):
            match_mismatch = match if seq1[i-1] == seq2[j-1] else mismatch
            diag_score = score_matrix[i-1][j-1] + match_mismatch
            up_score = score_matrix[i-1][j] + gap
            left_score = score_matrix[i][j-1] + gap
            max_score = max(diag_score, up_score, left_score)
            score_matrix[i][j] = max_score
            if max_score == diag_score:
                traceback_matrix[i][j] = 3
            elif max_score == up_score:
                traceback_matrix[i][j] = 1
            else:
                traceback_matrix[i][j] = 2
    # Traceback to find the alignment
    align1 = ""
    align2 = ""
    i, j = rows - 1, cols - 1
    while i > 0 or j > 0:
        if traceback_matrix[i][j] == 3:
            align1 = seq1[i-1] + align1
            align2 = seq2[j-1] + align2
            i -= 1
            j -= 1
        elif traceback_matrix[i][j] == 1:
            align1 = seq1[i-1] + align1
            align2 = "-" + align2
            i -= 1
        else:
            align1 = "-" + align1
            align2 = seq2[j-1] + align2
            j -= 1
    return align1, align2
# Example usage
seq1 = "AGTACGCA"
seq2 = "TATGC"
alignment1, alignment2 = needleman_wunsch(seq1, seq2)
print("Sequence 1:", alignment1)
print("Sequence 2:", alignment2)
This example defines the needleman_wunsch function, which takes two sequences (seq1 and seq2) and optional scoring parameters (match, mismatch, gap) and performs a global pairwise sequence alignment using the Needleman-Wunsch algorithm. The function returns the two aligned sequences as strings.
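Once two sequences are aligned, it is often useful to summarize the result. The sketch below scores a gapped pair with the same linear match/mismatch/gap scheme and computes percent identity over alignment columns. The function names and the example pair are illustrative assumptions, and the pair is hand-written rather than the output of the function above.

```python
def score_alignment(aligned1, aligned2, match=1, mismatch=-1, gap=-1):
    """Score two gapped, equal-length strings column by column."""
    if len(aligned1) != len(aligned2):
        raise ValueError("Aligned sequences must have equal length")
    score = 0
    for a, b in zip(aligned1, aligned2):
        if a == "-" or b == "-":
            score += gap
        elif a == b:
            score += match
        else:
            score += mismatch
    return score

def percent_identity(aligned1, aligned2):
    """Percentage of alignment columns holding identical, non-gap characters."""
    matches = sum(a == b and a != "-" for a, b in zip(aligned1, aligned2))
    return 100.0 * matches / len(aligned1)

# Hypothetical pair of already-aligned strings
a1 = "AGTACGCA"
a2 = "--TATGC-"
print(score_alignment(a1, a2), percent_identity(a1, a2))
```

Percent identity can also be defined relative to the shorter sequence or to ungapped columns only; the definition used here (all columns) is just one choice.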
Multiple sequence alignment
Multiple sequence alignment (MSA) is a technique used to align three or more sequences simultaneously. It is used to identify conserved regions, detect evolutionary relationships, and predict the structure and function of proteins. One of the most commonly used algorithms for MSA is the progressive alignment method, such as the ClustalW algorithm.
Implementing ClustalW itself (pairwise distances, a guide tree, and profile alignment) is beyond the scope of a short example. The toy function below only illustrates the greedy idea behind progressive methods: it repeatedly finds the two most similar sequences and merges them position by position into a consensus. It does not insert gaps to align the sequences, and it compares only up to the length of the shorter sequence, so it should not be used for real alignments:
def greedy_consensus(sequences):
    # Work on lists of characters
    alignment = [list(seq) for seq in sequences]
    # Repeatedly merge the most similar pair
    while len(alignment) > 1:
        # Calculate pairwise similarity scores (number of identical positions)
        scores = []
        for i in range(len(alignment)):
            for j in range(i+1, len(alignment)):
                score = sum(a == b for a, b in zip(alignment[i], alignment[j]))
                scores.append((i, j, score))
        # Find the most similar pair
        i, j, _ = max(scores, key=lambda x: x[2])
        # Merge the two sequences: keep matching characters, mark differences with '-'
        merged_seq = []
        for a, b in zip(alignment[i], alignment[j]):
            if a == b:
                merged_seq.append(a)
            else:
                merged_seq.append('-')
        alignment[i] = merged_seq
        alignment.pop(j)
    # Return the final merged sequence
    return ''.join(alignment[0])
# Example usage
sequences = [
    "AGTACGCA",
    "TATGC",
    "GACTA",
    "AGCT"
]
consensus = greedy_consensus(sequences)
print("Merged consensus:")
print(consensus)
This example defines the greedy_consensus function, which takes a list of sequences and returns a single merged string in which positions that differ between the merged sequences are replaced by "-". For an actual multiple sequence alignment, use a dedicated tool such as ClustalW, Clustal Omega, or MUSCLE.
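One of the uses of MSA mentioned above is identifying conserved regions. Given sequences that are already aligned (equal length, gaps as "-"), a per-column consensus and conservation score can be sketched as follows. The function name and the three aligned strings are made up for illustration.

```python
from collections import Counter

def column_consensus(aligned):
    """Return (consensus string, per-column conservation) for aligned sequences."""
    if len({len(seq) for seq in aligned}) != 1:
        raise ValueError("All aligned sequences must have the same length")
    consensus = []
    conservation = []
    # zip(*aligned) iterates over alignment columns
    for column in zip(*aligned):
        residue, count = Counter(column).most_common(1)[0]
        consensus.append(residue)
        conservation.append(count / len(column))
    return "".join(consensus), conservation

# Hypothetical, already-aligned sequences
aligned = ["AGT-CA", "AGTACA", "AGT-CG"]
consensus, conservation = column_consensus(aligned)
print(consensus)
print([round(value, 2) for value in conservation])
```

Columns with conservation 1.0 are identical in every sequence. Ties between equally common characters are broken arbitrarily in this sketch.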
Using Python libraries for alignment (e.g., Biopython)
To perform multiple sequence alignment (MSA) from Python, a common approach is to use Biopython for sequence input and output and an external aligner such as ClustalW for the alignment itself. Biopython’s Bio.AlignIO module then reads the resulting alignment file. Here’s an example:
import subprocess
from Bio import AlignIO, SeqIO
from Bio.SeqRecord import SeqRecord
from Bio.Seq import Seq
# Define the sequences
seq1 = Seq("AGTACGCA")
seq2 = Seq("TATGC")
seq3 = Seq("GACTA")
seq4 = Seq("AGCT")
# Create SeqRecord objects for each sequence
seq_records = [
    SeqRecord(seq1, id="seq1"),
    SeqRecord(seq2, id="seq2"),
    SeqRecord(seq3, id="seq3"),
    SeqRecord(seq4, id="seq4")
]
# Write the unaligned sequences to a FASTA file for ClustalW to read
SeqIO.write(seq_records, "temp.fasta", "fasta")
# Perform the alignment using the ClustalW algorithm
# You may need to install the ClustalW software separately
# and provide the path to the executable
clustalw_exe = "clustalw2" # Path to the ClustalW executable
subprocess.run([clustalw_exe, "-infile=temp.fasta"], check=True)
# Parse the alignment result (ClustalW writes temp.aln next to the input file)
alignment = AlignIO.read("temp.aln", "clustal")
# Print the alignment
print("Multiple Sequence Alignment:")
print(alignment)
In this example, we first create SeqRecord objects for each sequence and write them to temp.fasta with SeqIO.write. The sequences are unaligned and differ in length, so they cannot be placed directly in a MultipleSeqAlignment object, which requires sequences of equal length. We then run the ClustalW executable with subprocess.run. Finally, we parse the alignment result using AlignIO.read and print the aligned sequences. Older tutorials use the ClustalwCommandline wrapper from Bio.Align.Applications for this step; Biopython has deprecated those wrappers in favor of calling the tool directly with subprocess.
Note: Before running this code, you need to install the ClustalW software and provide the path to the executable (clustalw_exe). You also need to have Biopython installed in your Python environment.
Sequence Manipulation
Translating DNA sequences to protein sequences
Translating DNA sequences to protein sequences is a common task in bioinformatics, and it can be easily done using Biopython. Here’s how you can translate a DNA sequence to a protein sequence:
from Bio.Seq import Seq
# DNA sequence
dna_seq = Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG")
# Translate DNA to protein sequence
protein_seq = dna_seq.translate()
print("DNA sequence:", dna_seq)
print("Protein sequence:", protein_seq)
In this example, we use the Seq class from Biopython to create a Seq object representing the DNA sequence. We then use the translate() method of the Seq object to translate the DNA sequence to a protein sequence. The result is a Seq object containing the translated sequence.
Note that translate() uses the standard genetic code by default, where each codon (a sequence of three nucleotides) is translated to a specific amino acid. By default it translates the whole sequence starting from the first nucleotide and marks stop codons (TAA, TAG, or TGA) with "*"; it does not search for a start codon (ATG). To end the translation at the first in-frame stop codon, call dna_seq.translate(to_stop=True).
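To show what translation involves at the string level, here is a minimal pure-Python sketch that does not use Biopython. It builds the standard genetic code from the conventional TCAG ordering; the names CODON_TABLE and translate, and the choice of "X" for unrecognized codons, are assumptions made for this illustration.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def translate(dna, to_stop=False):
    """Translate a DNA string codon by codon, starting at the first base."""
    protein = []
    # Ignore any trailing bases that do not form a complete codon
    for i in range(0, len(dna) - len(dna) % 3, 3):
        amino_acid = CODON_TABLE.get(dna[i:i + 3].upper(), "X")
        if amino_acid == "*" and to_stop:
            break
        protein.append(amino_acid)
    return "".join(protein)

dna = "ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG"
print(translate(dna))                # stop codons appear as "*"
print(translate(dna, to_stop=True))  # ends before the first stop codon
```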
Reverse complement of DNA sequences
To find the reverse complement of a DNA sequence in Python using Biopython, you can use the reverse_complement method of the Seq object. Here’s an example:
from Bio.Seq import Seq
# DNA sequence
dna_seq = Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG")
# Reverse complement
reverse_complement_seq = dna_seq.reverse_complement()
print("DNA sequence:", dna_seq)
print("Reverse complement:", reverse_complement_seq)
In this example, we first create a Seq object representing the DNA sequence. Then, we use the reverse_complement method to find the reverse complement of the DNA sequence. The result is a Seq object containing the reverse complement sequence.
Seq objects support slicing like Python strings, so if you only need the reverse of the sequence without complementing it, you can use a slice with a step of -1:
# Reverse sequence only
reverse_seq = dna_seq[::-1]
print("Reverse sequence:", reverse_seq)
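The same operation can be written with plain Python strings. The sketch below uses str.maketrans and str.translate for the complement and a reversing slice for the orientation. It handles only the four unambiguous bases (in either case); the function name is an assumption.

```python
# Translation table mapping each base to its complement (upper and lower case)
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse_complement(dna):
    """Complement every base, then reverse the string."""
    return dna.translate(COMPLEMENT)[::-1]

print(reverse_complement("ATGC"))
print(reverse_complement("AAGCTT"))  # a palindromic site equals its own reverse complement
```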
Finding motifs and patterns in sequences
To find motifs and patterns in sequences, you can use regular expressions in Python. Regular expressions provide a powerful way to search for specific patterns in strings. Here’s a basic example of how to find motifs in a DNA sequence using regular expressions:
import re
# DNA sequence
dna_seq = "ATGCGTACGTCATGCGTAGCG"
# Define the motif pattern
motif_pattern = "ATG"
# Find motifs using regular expression
motif_positions = [match.start() for match in re.finditer(motif_pattern, dna_seq)]
print("DNA sequence:", dna_seq)
print("Motif pattern:", motif_pattern)
print("Motif positions:", motif_positions)
In this example, we use the re.finditer() function to find all non-overlapping occurrences of the motif pattern (“ATG”) in the DNA sequence. The function returns an iterator yielding match objects for each match found. We extract the starting position (0-based) of each match using the start() method of the match object.
You can use more complex regular expressions to define motifs with specific patterns, such as motifs containing ambiguous characters or motifs with certain constraints. The regular expression syntax allows for a wide range of possibilities to specify patterns.
Keep in mind that regular expressions are case-sensitive by default. If you want to perform case-insensitive searches, you can use the re.IGNORECASE flag in the re.finditer() function:
# Find motifs using case-insensitive search
motif_positions = [match.start() for match in re.finditer(motif_pattern, dna_seq, re.IGNORECASE)]
Additionally, if you need to search for motifs in RNA sequences, you can first convert the RNA sequence to DNA (e.g., by replacing “U” with “T”) and then use regular expressions as shown above.
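Two refinements are often needed in practice: finding overlapping occurrences, and matching a set of alternatives rather than one fixed string. A minimal sketch using a lookahead and an alternation group follows; the sequence is a made-up example.

```python
import re

dna_seq = "ATATATGCGTGA"  # hypothetical sequence

# Plain patterns consume the match, so overlapping hits are skipped
non_overlapping = [m.start() for m in re.finditer(r"ATAT", dna_seq)]

# A zero-width lookahead reports every start position, overlapping or not
overlapping = [m.start() for m in re.finditer(r"(?=ATAT)", dna_seq)]

# Alternation matches any of several motifs, here the three stop codons
stop_positions = [m.start() for m in re.finditer(r"(?=T(?:AA|AG|GA))", dna_seq)]

print(non_overlapping, overlapping, stop_positions)
```

The stop-codon search here reports positions in any reading frame; restricting it to one frame would require checking each position modulo 3.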
Parsing Biological Data
Parsing data from biological databases (e.g., GenBank, UniProt)
To parse data from biological databases like GenBank or UniProt, you can use Biopython’s Bio.Entrez and Bio.SeqIO modules for GenBank, and the Bio.SwissProt module for UniProt. Here’s a basic example of how to parse data from GenBank and UniProt using Biopython:
- Parsing data from GenBank:
from Bio import Entrez, SeqIO
# Email address (required by NCBI); placeholder, replace with your own
Entrez.email = "your.email@example.com"
# Accession number of the sequence
accession = "NC_000913.3"
# Fetch the sequence record from NCBI
handle = Entrez.efetch(db="nucleotide", id=accession, rettype="gb", retmode="text")
record = SeqIO.read(handle, "genbank")
handle.close()
# Print the sequence record
print(record)
- Parsing data from UniProt:
from Bio import SwissProt
# UniProt accession number
accession = "P0A7B8"
# Read the UniProt record from a local text file
# (download the entry in text format from UniProt first)
with open(f"{accession}.txt") as handle:
    record = SwissProt.read(handle)
# Print the UniProt record
print(record)
These examples demonstrate how to fetch and parse data from GenBank and UniProt using Biopython. Make sure to replace the email address with your own email address in the Entrez.email field.
Parsing data from biological databases without Biopython can be more challenging, as you’ll need to handle the parsing of the specific file formats used by each database. Here’s a basic example of how you might parse data from a GenBank file without using Biopython:
- Parsing data from GenBank:
def parse_genbank(file_path):
    record = {}
    sequence_parts = []
    in_sequence = False
    with open(file_path, "r") as file:
        for line in file:
            if line.startswith("//"): # End of the record
                break
            if in_sequence:
                # Sequence lines contain position numbers and spaces; keep letters only
                sequence_parts.append("".join(ch for ch in line if ch.isalpha()))
            elif line.startswith("LOCUS"):
                record["LOCUS"] = line.strip()
            elif line.startswith("ACCESSION"):
                record["ACCESSION"] = line.strip()
            elif line.startswith("DEFINITION"):
                record["DEFINITION"] = line.strip()
            elif line.startswith("ORIGIN"):
                in_sequence = True # The sequence section starts after this line
    record["SEQUENCE"] = "".join(sequence_parts)
    return record
# Example usage
file_path = "example.gb"
parsed_record = parse_genbank(file_path)
print(parsed_record)
In this example, parse_genbank is a function that reads a GenBank file and extracts information such as the locus, accession number, definition, and sequence. Note that this is a simplified example and may need to be adapted to handle more complex GenBank files.
For parsing UniProt data without Biopython, you would similarly need to parse the specific file format used by UniProt. UniProt provides data in several formats (e.g., UniProtKB/Swiss-Prot in text format, UniProtKB/TrEMBL in text format), so you would need to handle each format accordingly.
Extracting relevant information from biological records
To extract relevant information from biological records, you’ll need to understand the format of the records and the specific information you’re interested in extracting. Here’s a general approach you can follow:
- Identify the Record Format: Determine the format of the biological records you’re working with (e.g., GenBank, UniProt, FASTA).
- Read the Records: Read the records from the file or database.
- Parse the Records: Parse the records to extract the relevant information. For example, in a GenBank record, you might want to extract the locus, accession number, definition, and sequence.
- Process the Information: Process the extracted information as needed (e.g., store it in a data structure, perform further analysis).
Here’s an example of how you might extract information from a GenBank record in Python:
def extract_genbank_info(file_path):
    record = {}
    sequence_parts = []
    in_sequence = False
    with open(file_path, "r") as file:
        for line in file:
            if line.startswith("//"): # End of the record
                break
            if in_sequence:
                # Sequence lines contain position numbers and spaces; keep letters only
                sequence_parts.append("".join(ch for ch in line if ch.isalpha()))
            elif line.startswith("LOCUS"):
                record["LOCUS"] = line.strip()
            elif line.startswith("ACCESSION"):
                record["ACCESSION"] = line.strip()
            elif line.startswith("DEFINITION"):
                record["DEFINITION"] = line.strip()
            elif line.startswith("ORIGIN"):
                in_sequence = True # The sequence section starts after this line
    record["SEQUENCE"] = "".join(sequence_parts)
    return record
# Example usage
file_path = "example.gb"
genbank_record = extract_genbank_info(file_path)
print(genbank_record)
This is a simplified example and may need to be adapted to handle more complex GenBank records or other record formats.
Handling different data formats (e.g., FASTA, BED, GFF)
Handling different data formats in bioinformatics requires understanding the specific format and how to parse it. Here’s a general approach to handle some common bioinformatics file formats (FASTA, BED, GFF) in Python:
- FASTA Format:
- The FASTA format is a simple format for representing nucleotide or protein sequences.
- Each sequence is represented by a header line starting with “>” followed by the sequence data.
def parse_fasta(file_path):
    sequences = {}
    with open(file_path, "r") as file:
        header = None
        sequence = ""
        for line in file:
            line = line.strip()
            if line.startswith(">"):
                if header is not None:
                    sequences[header] = sequence
                header = line[1:]
                sequence = ""
            else:
                sequence += line
        if header is not None:
            sequences[header] = sequence
    return sequences
# Example usage
file_path = "example.fasta"
fasta_sequences = parse_fasta(file_path)
print(fasta_sequences)
- BED Format:
- The BED format is used to represent genomic annotations, such as gene structures or regulatory regions.
- Each line in a BED file represents a genomic feature, with columns specifying the chromosome, start position, end position, and optionally additional information.
def parse_bed(file_path):
    features = []
    with open(file_path, "r") as file:
        for line in file:
            if line.startswith("#"):
                continue # Skip comment lines
            fields = line.strip().split("\t")
            if len(fields) < 3:
                continue # Skip invalid lines
            chromosome, start, end = fields[:3]
            feature = {"chromosome": chromosome, "start": int(start), "end": int(end)}
            features.append(feature)
    return features
# Example usage
file_path = "example.bed"
bed_features = parse_bed(file_path)
print(bed_features)
- GFF Format:
- The GFF (General Feature Format) format is similar to BED but allows for more detailed annotations, including strand information and feature types.
- GFF files have nine tab-separated columns, with the first eight columns containing feature information and the ninth column containing additional attributes.
def parse_gff(file_path):
    features = []
    with open(file_path, "r") as file:
        for line in file:
            if line.startswith("#"):
                continue # Skip comment lines
            fields = line.strip().split("\t")
            if len(fields) != 9:
                continue # Skip lines that do not have exactly nine columns
            chromosome, _, feature_type, start, end, _, strand, _, attributes = fields
            feature = {
                "chromosome": chromosome,
                "start": int(start),
                "end": int(end),
                "feature_type": feature_type,
                "strand": strand,
                "attributes": attributes
            }
            features.append(feature)
    return features
# Example usage
file_path = "example.gff"
gff_features = parse_gff(file_path)
print(gff_features)
These are simplified examples and may need to be adapted to handle specific requirements or variations in the file formats.
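Two details tend to matter when moving between these formats: the ninth GFF column is itself a structured string, and BED and GFF count coordinates differently (BED intervals are 0-based and half-open, GFF intervals are 1-based and inclusive). The sketch below assumes GFF3-style key=value attributes separated by semicolons; GFF2/GTF files use a different attribute syntax. The function names and sample values are illustrative.

```python
def parse_gff3_attributes(attributes):
    """Split a GFF3 attribute string such as 'ID=gene1;Name=abc' into a dict."""
    result = {}
    for item in attributes.strip().split(";"):
        if "=" in item:
            key, value = item.split("=", 1)
            result[key.strip()] = value.strip()
    return result

def gff_to_bed_interval(start, end):
    """Convert a 1-based inclusive GFF interval to a 0-based half-open BED interval."""
    return start - 1, end

# Hypothetical attribute string and coordinates
attrs = parse_gff3_attributes("ID=gene1;Name=abcA;Note=hypothetical protein")
bed_start, bed_end = gff_to_bed_interval(100, 200)
print(attrs)
print(bed_start, bed_end)
```

Both intervals in the example cover the same 101 bases; only the numbering convention differs.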
Applications in Bioinformatics
Gene prediction and annotation
Gene prediction and annotation are crucial steps in bioinformatics for identifying and characterizing genes in genomic sequences. Here’s a general approach to gene prediction and annotation using Python:
- Gene Prediction:
- Gene prediction algorithms aim to identify the locations of genes in a genome based on various features, such as open reading frames (ORFs), sequence conservation, and gene expression patterns.
- Common gene prediction algorithms include GeneMark, Glimmer, and Augustus.
- Annotation:
- Gene annotation involves assigning biological information to the predicted genes, such as gene function, protein domains, and regulatory elements.
- Annotation is often performed using databases and tools such as BLAST, InterProScan, and Pfam.
Here’s a simplified example of a gene prediction and annotation pipeline that calls external tools from Python with the subprocess module:
import subprocess
# Step 1: Gene Prediction
def predict_genes(input_file):
    # Run a gene prediction tool (this is a simplified example;
    # the "genemark" command and its options are placeholders)
    cmd = ["genemark", "-sequence", input_file, "-output", f"{input_file}.out"]
    subprocess.run(cmd, check=True)
# Step 2: Gene Annotation
def annotate_genes(protein_fasta):
    # Use BLAST for functional annotation (this is a simplified example)
    # blastp expects protein sequences in FASTA format as the query
    cmd = ["blastp", "-query", protein_fasta, "-db", "nr", "-out", f"{protein_fasta}.blastp.out"]
    subprocess.run(cmd, check=True)
# Example usage
input_file = "genome.fasta"
predict_genes(input_file)
# Placeholder name for the predicted protein sequences produced in step 1
protein_fasta = f"{input_file}.out.faa"
annotate_genes(protein_fasta)
In this example, we first use a hypothetical gene prediction command genemark to predict genes in a genomic sequence (genome.fasta). We then use BLAST (blastp) to annotate the predicted proteins based on similarity to known proteins in the NCBI non-redundant (nr) database. Note that blastp takes protein sequences in FASTA format as its query, not a GFF file, so the predicted genes must first be exported as protein sequences. Passing the command as a list of arguments rather than a shell string avoids quoting problems with unusual file names.
Note: Gene prediction and annotation are complex processes that require specialized tools and databases. This example provides a simplified overview and may need to be adapted for specific requirements and tools.
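The open reading frames (ORFs) mentioned above can be located with string operations alone. The following sketch scans the forward strand for ATG start codons and extends each one to the first in-frame stop codon. It is a teaching illustration rather than a gene predictor: it ignores the reverse strand, and the function name and min_codons threshold are assumptions.

```python
import re

STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(dna, min_codons=2):
    """Return (start, end, sequence) for each ATG...stop ORF on the forward strand.

    start is 0-based and end is exclusive; min_codons counts codons before the stop.
    """
    dna = dna.upper()
    orfs = []
    # The lookahead finds every ATG, including overlapping ones
    for match in re.finditer(r"(?=ATG)", dna):
        start = match.start()
        # Walk codon by codon in the frame set by this ATG
        for i in range(start, len(dna) - 2, 3):
            if dna[i:i + 3] in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, dna[start:i + 3]))
                break
    return orfs

# Hypothetical sequence containing one short ORF
orfs = find_orfs("CCATGGCCTAAGG")
print(orfs)
```

An ATG that lies inside a longer ORF in the same frame is reported as a separate, nested ORF; real gene finders apply additional filtering and statistical models.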
Phylogenetic analysis
Phylogenetic analysis is a key method in bioinformatics for studying evolutionary relationships between organisms. It involves constructing phylogenetic trees, which depict the evolutionary history and relatedness of different species or groups of organisms. Here’s a general approach to phylogenetic analysis using Python:
- Sequence Alignment:
- Align homologous sequences (e.g., DNA, RNA, or protein sequences) using alignment tools like Clustal Omega or MUSCLE.
- Phylogenetic Tree Construction:
- Construct a phylogenetic tree using alignment data and a tree-building algorithm, such as neighbor-joining, maximum likelihood, or Bayesian inference.
- Biopython provides modules for phylogenetic tree construction, including Phylo and Bio.Phylo.Applications.
- Visualization:
- Visualize the phylogenetic tree using tools like PhyloTree or ETE Toolkit.
- Biopython's Phylo module also provides functions for tree visualization.
Here’s a simplified example of phylogenetic tree construction using Biopython:
from Bio import Phylo
from Bio.Phylo.TreeConstruction import DistanceCalculator, DistanceTreeConstructor
from Bio import AlignIO

# Example alignment file in FASTA format
alignment_file = "alignment.fasta"
alignment = AlignIO.read(alignment_file, "fasta")

# Calculate distances between sequences
# (kept for inspection; build_tree() below recomputes them from the alignment)
calculator = DistanceCalculator('identity')
dm = calculator.get_distance(alignment)
print(dm)

# Construct a tree using neighbor-joining algorithm
constructor = DistanceTreeConstructor(calculator, 'nj')
tree = constructor.build_tree(alignment)

# Visualize the tree (requires matplotlib)
Phylo.draw(tree)
In this example, we first read a multiple sequence alignment from a file (alignment.fasta). We then calculate the pairwise distances between sequences, construct a phylogenetic tree using the neighbor-joining algorithm, and visualize the tree using Biopython's Phylo module.
Note: Phylogenetic analysis can be computationally intensive, especially for large datasets. It’s important to consider the appropriate algorithms and methods based on the size and complexity of your data.
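The distance step can also be illustrated with plain strings. The sketch below computes the proportion of mismatching positions (often called the p-distance) between aligned sequences. It is based on the same idea as an identity-based distance, but it is a simplified stand-in: the function name p_distance and the three short sequences are made up for illustration, and gaps are treated like any other character.

```python
def p_distance(seq_a, seq_b):
    """Fraction of aligned positions at which two sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    if not seq_a:
        raise ValueError("sequences must not be empty")
    mismatches = sum(1 for a, b in zip(seq_a, seq_b) if a != b)
    return mismatches / len(seq_a)

# Hypothetical, already-aligned sequences
aligned = {"human": "ACGTACGT", "chimp": "ACGTACGA", "mouse": "ACGAACTA"}

names = sorted(aligned)
for i, first in enumerate(names):
    for second in names[i + 1:]:
        print(first, second, p_distance(aligned[first], aligned[second]))
```

A tree-building method such as neighbor-joining takes a table of pairwise distances like this one as its input.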
Structural bioinformatics
Structural bioinformatics is a field that focuses on the analysis, prediction, and modeling of biomolecular structures, such as proteins, nucleic acids, and complexes. It plays a crucial role in understanding the structure-function relationships of biomolecules and is essential for drug discovery, protein engineering, and molecular biology research. Here’s an overview of key concepts and methods in structural bioinformatics:
- Protein Structure Prediction:
- Predicting the three-dimensional structure of a protein from its amino acid sequence.
- Methods include homology modeling (comparative modeling), ab initio modeling, and threading (fold recognition).
- Protein Structure Determination:
- Experimental methods for determining protein structures, such as X-ray crystallography, nuclear magnetic resonance (NMR) spectroscopy, and cryo-electron microscopy (cryo-EM).
- Structural databases like the Protein Data Bank (PDB) store experimentally determined protein structures.
- Molecular Docking:
- Predicting the binding mode and affinity of small molecules (ligands) to a target protein.
- Used in drug discovery and virtual screening.
- Molecular Dynamics (MD) Simulation:
- Simulating the dynamic behavior of biomolecular systems over time.
- Provides insights into protein folding, dynamics, and interactions.
- Structural Analysis and Visualization:
- Analyzing and visualizing biomolecular structures to understand their properties and functions.
- Tools like PyMOL, VMD, and Chimera are commonly used for visualization.
- Structure-Function Relationships:
- Studying how the three-dimensional structure of a biomolecule relates to its biological function.
- Important for understanding enzyme mechanisms, protein-ligand interactions, and molecular recognition.
- Bioinformatics Databases and Tools:
- Databases like the Protein Data Bank (PDB), Structural Classification of Proteins (SCOP), and CATH provide resources for structural bioinformatics research.
- Software tools and packages (e.g., MODELLER, Rosetta, GROMACS) are used for structure prediction, analysis, and simulation.
Overall, structural bioinformatics combines computational methods, bioinformatics tools, and experimental techniques to study the structure and function of biomolecules, contributing to advances in biotechnology, medicine, and molecular biology.
Advanced Topics
Regular expressions for pattern matching
Regular expressions (regex) are powerful tools for pattern matching and searching within strings. They allow you to define complex search patterns, making them very useful in bioinformatics for tasks such as sequence pattern matching, motif finding, and data extraction. Here’s a basic overview of using regular expressions in Python for pattern matching:
- Import the re Module: The re module in Python provides support for working with regular expressions.
- Compile the Regular Expression: Use the re.compile() function to compile your regular expression pattern into a regex object, which can then be used for matching.
- Search for Patterns: Use the search() method of the regex object to search for the pattern in a string. This method returns a match object if the pattern is found, or None otherwise.
- Extracting Matched Patterns: Use methods like group() or groups() of the match object to extract the matched patterns.
Here’s a simple example to illustrate:
import re

# Define a regular expression pattern
pattern = r"\b[A-Z]{3}\b"  # Matches three uppercase letters

# Compile the pattern into a regex object
regex = re.compile(pattern)

# Search for the pattern in a string
text = "The ATP-binding domain is essential for protein function"
match = regex.search(text)
if match:
    print("Found:", match.group())
else:
    print("Pattern not found")
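In this example, the regular expression pattern r"\b[A-Z]{3}\b" matches a run of exactly three uppercase letters delimited by word boundaries (e.g., “ATP” in the input string). The r prefix matters: without it, Python would read \b as a backspace character instead of passing it to the regex engine. The search() method is used to search for this pattern in the input text. If the pattern is found, match.group() is used to extract and print the matched substring.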
Regular expressions can be much more complex, allowing you to define patterns with varying levels of specificity and complexity depending on your needs. They are a powerful tool for working with text data in bioinformatics and beyond.
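As a sketch of motif finding, the example below searches a protein sequence for the N-glycosylation consensus pattern as written in PROSITE notation, N-{P}-[ST]-{P}. The helper name find_motifs and the protein string are made up for illustration. Note that finditer() reports non-overlapping matches only.

```python
import re

# N, then any residue except P, then S or T, then any residue except P
MOTIF = re.compile(r"N[^P][ST][^P]")

def find_motifs(protein):
    """Return (position, matched_text) for each non-overlapping motif hit."""
    return [(m.start(), m.group()) for m in MOTIF.finditer(protein)]

protein = "MKNGSLPNPTANATW"  # made-up sequence
for position, hit in find_motifs(protein):
    print(position, hit)
```

The candidate starting at position 7 (NPTA) is rejected because a proline follows the asparagine, which is exactly what the [^P] character class expresses.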
Handling large biological datasets efficiently
Handling large biological datasets efficiently often requires a combination of proper data storage, processing, and analysis techniques. Here are some strategies to consider:
- Use Efficient Data Structures: Choose appropriate data structures for storing your data. For large datasets, consider using data storage solutions like databases (e.g., SQLite, MySQL, PostgreSQL) or specialized bioinformatics databases (e.g., BioSQL, BioMart) instead of plain text files.
- Batch Processing: Process data in batches rather than all at once. This can help manage memory usage and improve processing speed. Use streaming techniques when possible to avoid loading entire datasets into memory.
- Parallel and Distributed Processing: Use parallel and distributed computing techniques to process data more efficiently. Libraries like Dask, multiprocessing, and Apache Spark can help distribute processing tasks across multiple cores or machines.
- Optimized Algorithms: Use algorithms optimized for large datasets. For example, use indexing techniques (e.g., k-d trees, hash tables) for efficient searching and retrieval operations.
- Data Filtering and Preprocessing: Filter out irrelevant data and preprocess data before analysis to reduce the dataset size and complexity. This can include removing duplicates, handling missing values, and normalizing data.
- Compressed File Formats: Use compressed file formats (e.g., gzip, bzip2) for storing and transferring large datasets to reduce storage space and improve data transfer speeds.
- Data Pipelines: Use data pipelines to automate data processing tasks. This can help manage complex workflows and ensure data integrity and consistency.
- Use of Cloud Computing: Consider using cloud computing services (e.g., AWS, Google Cloud, Microsoft Azure) for scalable and on-demand computing resources to handle large datasets.
By implementing these strategies, you can efficiently handle and analyze large biological datasets in your bioinformatics projects.
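The streaming and compression strategies above can be sketched with the standard library alone. The generator below reads a FASTA file one record at a time, so only the current record is held in memory, and gzip.open() lets the same code read compressed files. The names iter_fasta and sequence_lengths are assumptions made for this illustration; for real projects, Biopython's SeqIO.parse() offers the same record-by-record behavior.

```python
import gzip
import io

def iter_fasta(handle):
    """Yield (header, sequence) pairs from an open FASTA text handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

def sequence_lengths(path):
    """Map each header to its sequence length for a gzip-compressed FASTA file."""
    with gzip.open(path, "rt") as handle:
        return {header: len(seq) for header, seq in iter_fasta(handle)}

# Small in-memory demonstration (stands in for a real file)
demo = io.StringIO(">seq1\nACGT\nAC\n>seq2\nGG\n")
records = list(iter_fasta(demo))
print(records)
```

Because iter_fasta() is a generator, a loop over it can stop early or filter records without ever loading the whole dataset.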
Best practices and tips for bioinformatics programming
Here are some best practices and tips for bioinformatics programming:
- Use Version Control: Use version control systems like Git to manage your code. This helps track changes, collaborate with others, and revert to previous versions if needed.
- Document Your Code: Write clear and concise documentation for your code, including comments and docstrings. This helps others understand your code and makes it easier to maintain.
- Use Meaningful Variable Names: Use descriptive variable names that convey the purpose of the variable. This improves readability and makes your code easier to understand.
- Modularize Your Code: Break your code into small, reusable modules or functions. This makes your code more modular, easier to test, and helps avoid repetition.
- Use Libraries and Packages: Leverage existing libraries and packages for common tasks in bioinformatics. This saves time and ensures that you are using well-tested and optimized code.
- Optimize Your Code: Write efficient code by avoiding unnecessary loops, using appropriate data structures, and optimizing algorithms. This is especially important when working with large datasets.
- Handle Errors Gracefully: Use try-except blocks to handle errors and exceptions in your code. This prevents your program from crashing and helps identify and fix issues.
- Test Your Code: Write unit tests to ensure that your code behaves as expected. This helps catch bugs early and ensures that your code is robust and reliable.
- Use Virtual Environments: Use virtual environments like conda or virtualenv to manage dependencies and isolate your project’s environment. This avoids conflicts between different projects.
- Stay Updated: Keep your software and libraries up to date to benefit from the latest features, improvements, and security patches.
- Collaborate and Seek Feedback: Collaborate with others and seek feedback on your code. This can help improve your code quality and learn new techniques.
- Stay Organized: Organize your code and project files in a logical manner. Use meaningful directory structures and naming conventions.
Following these best practices can help you write better code, improve productivity, and build more robust and maintainable bioinformatics applications.
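Several of these practices fit in a few lines. The sketch below shows a small, documented function with input validation, a wrapper that handles the error gracefully, and a few assert-based checks in the spirit of unit tests. The function names and the choice to reject any character outside A, C, G, T are assumptions made for illustration.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(dna):
    """Return the reverse complement of a DNA sequence.

    Raises ValueError if the sequence contains characters other than A, C, G, T.
    """
    dna = dna.upper()
    invalid = set(dna) - set("ACGT")
    if invalid:
        raise ValueError(f"unexpected characters: {sorted(invalid)}")
    return dna.translate(COMPLEMENT)[::-1]

def safe_reverse_complement(dna):
    """Like reverse_complement(), but report bad input and return None."""
    try:
        return reverse_complement(dna)
    except ValueError as error:
        print("skipping sequence:", error)
        return None

# Minimal test-style checks
assert reverse_complement("ATGC") == "GCAT"
assert reverse_complement("aacg") == "CGTT"
assert safe_reverse_complement("ATGX") is None
```

In a real project the checks would live in a separate test file run by a test runner such as pytest or unittest.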
Examples and Exercises
Python: Strings
Strings are immutable objects representing text. To define a string, we can write:
var = "text"
or, equivalently:
var = 'text'
To insert special characters, we need to perform escaping with a backslash \:
path = "data\\fasta"
or use the prefix r (raw):
path = r"data\fasta"
Here is a reference list of escape characters. You will probably only need the most obvious ones, like \\, \n and \t.
Escape Character | Meaning |
\\ | Backslash |
\' | Single-quote |
\" | Double-quote |
\a | ASCII bell |
\b | ASCII backspace |
\f | ASCII formfeed |
\n | ASCII linefeed (also known as newline) |
\r | Carriage Return |
\t | Horizontal Tab |
\v | ASCII vertical tab |
\N{name} | Unicode character with the given name |
\uxxxx | Unicode character with 16-bit hex value xxxx |
\Uxxxxxxxx | Unicode character with 32-bit hex value xxxxxxxx |
\ooo | Character with octal value ooo |
\xhh | Character with hex value hh |
To create a multi-line string, we can manually place the newline character \n at each line:
sad_joke = "Time flies like an arrow.\nFruit flies like a banana."
print(sad_joke)
or we can use triple quotes:
sad_joke = """Time flies like an arrow.
Fruit flies like a banana."""
print(sad_joke)
Warning
print() interprets special characters, while the interpreter's echo doesn't. Try to write:
print(path)
and (from the interpreter):
path
In the first case, we see one backslash (the escaping backslash is automatically interpreted by print()), in the second case we see two backslashes (the escape backslash is not interpreted).
The same happens if we print sad_joke.
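The difference can be reproduced inside a script with repr(), which returns the same representation that the interpreter echo displays. This is a small sketch; the variable name path mirrors the example above.

```python
path = "data\\fasta"

print(path)        # escapes interpreted: one backslash is shown
print(repr(path))  # interpreter-style echo: quotes and the escaped backslash
print(len(path))   # the string itself holds a single backslash
```

The length confirms that the doubled backslash exists only in the source code, not in the string.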
String-Number conversion
We can convert a number into a string with str():
n = 10
s = str(n)
print(n, type(n))
print(s, type(s))
or perform the opposite conversion with int() or float():
n = int("123")
q = float("1.23")
print(n, type(n))
print(q, type(q))
If the string doesn't contain the correct numeric type, Python will give an error message:
int("3.14") # Not an int
float("ribosome") # Not a number
int("1 2 3") # Not a number
int("fifteen") # Not a number
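When the input comes from a file, a failed conversion usually should not stop the whole program. The sketch below catches the ValueError raised by int() and float(); the helper name to_number and its policy of returning None for non-numeric text are assumptions made for illustration.

```python
def to_number(text):
    """Convert text to int if possible, else float; return None otherwise."""
    for convert in (int, float):
        try:
            return convert(text)
        except ValueError:
            continue
    return None

for raw in ["123", "3.14", "ribosome", "1 2 3"]:
    print(repr(raw), "->", to_number(raw))
```

Trying int() first keeps whole numbers as integers, and float() is only used as a fallback.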
Operations
Result | Operator | Meaning |
bool | str == str | Check whether two strings are identical |
int | len(str) | Return the length of the string |
str | str + str | Concatenate two strings |
str | str * int | Replicate the string |
bool | str in str | Check if a string is present in another string |
str | str[int:int] | Extract a sub-string |
Example. Let's concatenate two strings:
string = "one" + " " + "string"
length = len(string)
print("the string:", string, "is", length, "characters long")
Another example:
string = "Python is hell!" * 1000
print("the string is", len(string), "characters long")
Warning
We cannot concatenate strings with other types. For example:
var = 123
print("the value of var is" + var)
gives an error message. Two working alternatives:
print("the value of var is" + str(123))
or:
print("the value of var is", var)
(In the first case there is no space between is and 123; in the second case print() inserts one automatically.)
Example. The operator substring in string checks if substring appears once or more times in string, for example:
string = "A beautiful journey"
print("A" in string) # True
print("beautiful" in string) # True
print("BEAUTIFUL" in string) # False
print("ul jour" in string) # True
print("Gengis Khan" in string) # False
print(" " in string) # True
print("  " in string) # False (two consecutive spaces)
The result is always True or False.
Example. To extract a substring we can use indexes:
#           0                       -1
#           |1                     -2|
#           ||2                   -3||
#           |||        ...         |||
alphabet = "abcdefghijklmnopqrstuvwxyz"
print(alphabet[0]) # "a"
print(alphabet[1]) # "b"
print(alphabet[-1]) # "z"
print(alphabet[-2]) # "y"
print(alphabet[0:1]) # "a"
print(alphabet[0:2]) # "ab"
print(alphabet[0:5]) # "abcde"
print(alphabet[:5]) # "abcde"
print(alphabet[-5:-1]) # "vwxy"
print(alphabet[-5:]) # "vwxyz"
print(alphabet[10:-10]) # "klmnop"
print(alphabet[len(alphabet)-1]) # "z"
print(alphabet[len(alphabet)]) # Error
print(alphabet[10000]) # Error
Warning
Extraction is inclusive with respect to the first index, but exclusive with respect to the second. In other words alphabet[i:j] corresponds to:
alphabet[i] + alphabet[i+1] + ... + alphabet[j-1]
Note that alphabet[j] is excluded.
Warning
Extraction returns a new string, leaving the original unchanged:
alphabet = "abcdefghijklmnopqrstuvwxyz"
substring = alphabet[2:-2]
print(substring)
print(alphabet) # Is unchanged
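Both warnings can be checked directly. The short sketch below verifies the half-open slicing rule, shows that in-place modification is refused, and builds a modified copy instead. The indices and variable names are arbitrary choices for illustration.

```python
alphabet = "abcdefghijklmnopqrstuvwxyz"

# Half-open slicing: alphabet[i:j] holds exactly j - i characters
i, j = 2, 7
piece = alphabet[i:j]
print(piece, len(piece) == j - i)

# Splitting at any index and gluing the halves back gives the original
print(alphabet[:j] + alphabet[j:] == alphabet)

# Strings are immutable: item assignment raises TypeError
try:
    alphabet[0] = "A"
except TypeError as error:
    print("cannot modify in place:", error)

# The usual workaround is to build a new string
capitalized = "A" + alphabet[1:]
print(capitalized)
```

Because the second index is excluded, adjacent slices such as alphabet[:j] and alphabet[j:] never overlap and never skip a character.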
Methods
Result | Method | Meaning |
str | str.upper() | Return the string in upper case |
str | str.lower() | Return the string in lower case |
str | str.strip(str) | Remove the given characters from both sides |
str | str.lstrip(str) | Remove the given characters from the left |
str | str.rstrip(str) | Remove the given characters from the right |
bool | str.startswith(str) | Check if the string starts with another |
bool | str.endswith(str) | Check if the string ends with another |
int | str.find(str) | Return the position of a substring |
int | str.count(str) | Count the number of occurrences of a substring |
str | str.replace(str, str) | Replace substrings |
Warning
Methods return a new string, leaving the original unchanged (as with extraction):
alphabet = "abcdefghijklmnopqrstuvwxyz"
alphabet_upper = alphabet.upper()
print(alphabet_upper)
print(alphabet) # Is unchanged
Example. upper() and lower() are very simple:
text = "No Yelling"
result = text.upper()
print(result)
result = result.lower()
print(result)
Example. strip() variants are also simple:
text = " one example "
print(text.strip()) # here equivalent to text.strip(" ")
print(text.lstrip()) # idem
print(text.rstrip()) # idem
print(text) # text is unchanged
Note that the space between "one" and "example" is never removed. We can specify more than one character to be removed:
"AAAA one example BBBB".strip("AB")
Example. The same is valid with startswith() and endswith():
text = "123456789"
print(text.startswith("1")) # True
print(text.startswith("a")) # False
print(text.endswith("56789")) # True
print(text.endswith("5ABC9")) # False
Example. find() returns the position of the first occurrence of a substring, or -1 if the substring never occurs:
text = "123456789"
print(text.find("1")) # 0
print(text.find("56789")) # 4
print(text.find("Q")) # -1
Example. replace() returns a copy of the string where a substring is replaced with another:
text = "if roses were rotten, then"
print(text.replace("ro", "gro"))
Example. Given this unformatted string of amino acids:
sequence = ">MAnlFKLgaENIFLGrKW "
To increase uniformity, we want to remove the ">" character, remove spaces and finally convert everything to upper case:
s1 = sequence.lstrip(">")
s2 = s1.rstrip(" ")
s3 = s2.upper()
print(s3)
Alternatively, all in one step:
print(sequence.lstrip(">").rstrip(" ").upper())
Why does it work? Let's write it with brackets:
print(( ( sequence.lstrip(">") ).rstrip(" ") ).upper())
# sequence.lstrip(">")    -> str
# ( ... ).rstrip(" ")     -> str
# ( ... ).upper()         -> str
As you can see, the result of each method is a string (as s1, s2 and s3 in the example above), and we can invoke string methods on it.
Exercises
- How can I:
- Create a string consisting of five spaces only.
- Check whether a string contains at least one space.
- Check whether a string contains exactly five (arbitrary) characters.
- Create an empty string, and check whether it is really empty.
- Create a string that contains one hundred copies of "Python is great".
- Given the strings "but cell", "biology", and "is way better", compose them into the string "but cell biology is way better".
- Check whether the string "12345" begins with 1 (the character, not the number!)
- Create a string consisting of a single character \. (Check whether the output matches using both the echo of the interpreter and print(), and possibly also with len())
- Check whether the string "\\" contains one or two backslashes.
- Check whether a string (of choice) begins or ends by \.
- Check whether a string (of choice) contains x at least three times at the beginning and/or at the end. For instance, the following strings satisfy the desideratum:
"x. xx" # 1 + 2 >= 3
"xx. x" # 2 + 1 >= 3
"xxxx. " # 4 + 0 >= 3
while these do not:
"x. x" # 1 + 1 < 3
"...x. " # 0 + 0 < 3
" " # 0 + 0 < 3
- Given the string:
s = "0123456789"
which of the following extractions are correct?
1. s[9]
2. s[10]
3. s[:10]
4. s[1000]
5. s[0]
6. s[-1]
7. s[1:5]
8. s[-1:-5]
9. s[-5:-1]
10. s[-1000]
- Create a two-line string that contains the two following lines of text literally, including all the special characters and the implicit newline character:
never say "never!"
\said the sad turtle.
- Given the strings:
string = "a 1 b 2"
digit = "DIGIT"
character = "CHARACTER"
replace all the digits in the variable string with the text provided by the variable digit, and all alphabetic characters with the content of the variable character. The result should look like this:
"CHARACTER DIGIT CHARACTER DIGIT"
You are free to use auxiliary variables to hold any intermediate results, but do not need to.
- Given the following multi-line sequence:
chain_a = """SSSVPSQKTYQGSYGFRLGFLHSGTAKSVTCTYSPALNKM
FCQLAKTCPVQLWVDSTPPPGTRVRAMAIYKQSQHMTEVV
RRCPHHERCSDSDGLAPPQHLIRVEGNLRVEYLDDRNTFR
HSVVVPYEPPEVGSDCTTIHYNYMCNSSCMGGMNRRPILT
IITLEDSSGNLLGRNSFEVRVCACPGRDRRTEEENLRKKG
EPHHELPPGSTKRALPNNT"""
which represents the amino acid sequence of the DNA-binding domain of the Tumor Suppressor Protein TP53, answer the following questions.
- How many lines does it hold?
- How long is the sequence? (Do not forget to ignore the special characters!)
- Remove all newline characters, and put the result in a new variable sequence.
- How many cysteines "C" are there in the sequence? How many histidines "H"?
- Does the chain contain the sub-sequence "NLRVEYLDDRN"? In what position?
- How can I use find() and the sub-string extraction [i:j] operators to extract the first line from chain_a?
- Given (a small portion of) the tertiary structure of chain A of the TP53 protein:
structure_chain_a = """SER A 96 77.253 20.522 75.007
VAL A 97 76.066 22.304 71.921
PRO A 98 77.731 23.371 68.681
SER A 99 80.136 26.246 68.973
GLN A 100 79.039 29.534 67.364
LYS A 101 81.787 32.022 68.157"""
Each line represents a \(C_\alpha\) atom of the backbone of the structure. Of each atom, we know: the amino acid code of the residue; the chain (which is always "A" in this example); the position of the residue within the chain (starting from the N-terminal); and the \(x, y, z\) coordinates of the atom.
- Extract the second line using find() and the extraction operator. Put the line in a new variable line.
- Extract the coordinates of the second residue, and put them into three variables x, y, and z.
- Extract the coordinates from the third residue as well, putting them in different variables x_prime, y_prime, z_prime.
- Compute the Euclidean distance between the two residues:
\(d((x,y,z),(x',y',z')) = \sqrt{(x-x')^2 + (y-y')^2 + (z-z')^2}\)
Hint: make sure to use float numbers when computing the distance.
- Given the following DNA sequence, part of the BRCA2 human gene:
dna_seq = """GGGCTTGTGGCGCGAGCTTCTGAAACTAGGCGGCAGAGGCGGAGCCGCT
GTGGCACTGCTGCGCCTCTGCTGCGCCTCGGGTGTCTTTT
GCGGCGGTGGGTCGCCGCCGGGAGAAGCGTGAGGGGACAG
ATTTGTGACCGGCGCGGTTTTTGTCAGCTTACTCCGGCCA
AAAAAGAACTGCACCTCTGGAGCGG"""
- Calculate the GC-content of the sequence
- Convert the DNA sequence into an RNA sequence
- Assuming that this sequence contains an intron ranging from nucleotide 51 to nucleotide 156, store the sequence of the intron in a string, and the sequence of the spliced transcript in another string.
Python: Strings (Solutions)
Note
Later on in the solutions, I will sometimes use the backslash character \ at the end of a line. When used this way, \ tells Python that the command continues on the following line, which allows long commands to be broken over multiple lines.
- Solutions:
- Solution:
#       12345
text = "     "
print(text)
print(len(text))
- Solution:
at_least_one_space = " " in text
# check whether it works
print(" " in "nospaceatallhere")
print(" " in "onlyonespacehere--> <--")
print(" " in "more spaces in here")
- Solution:
exactly_5_characters = len(text) == 5
# check whether it works
print(len("1234") == 5)
print(len("12345") == 5)
print(len("123456") == 5)
- Solution:
empty_string = ""
print(len(empty_string) == 0)
- Solution:
base = "Python is great"
repeats = base * 100
# check whether the length is correct
print(len(repeats) == len(base) * 100)
- Solution:
part_1 = "but cell"
part_2 = "biology"
part_3 = "is way better"
# the spaces between the parts must be added explicitly
text = part_1 + " " + part_2 + " " + part_3
- Let's try this:
start_with_1 = "12345".startswith(1)
but Python gives an error message:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: startswith first arg must be str or a tuple of str, not int
The error message says that startswith() requires the argument to be a string, not an int as in our case: 1 is an int. The solution is:
start_with_1 = "12345".startswith("1")
print(start_with_1)
the value is True, as expected.
- Solution:
string = "\\"
string
print(string)
print(len(string)) # 1
Note that the raw-string form r"\" is not an alternative here: a raw string literal cannot end with a single backslash, so Python rejects it with a syntax error.
- Already checked before: the string contains only one backslash. Anyway:
backslash = "\\"
print(backslash*2 in "\\") # False
- First method:
backslash = "\\"
condition = text.startswith(backslash) or \
    text.endswith(backslash)
Second method:
condition = (text[0] == backslash) or \
    (text[-1] == backslash)
- Solution:
condition = \
    text.startswith("xxx") or \
    (text.startswith("xx") and text.endswith("x")) or \
    (text.startswith("x") and text.endswith("xx")) or \
    text.endswith("xxx")
It's worth checking the condition using the examples provided in the exercise.
- Solution:
s = "0123456789"
print(len(s)) # 10
Which of the following extractions are correct?
1. s[9]: correct, extracts the last character.
2. s[10]: invalid.
3. s[:10]: correct, extracts all characters (remember that the second index, 10 in this case, is exclusive).
4. s[1000]: invalid.
5. s[0]: correct, extracts the first character.
6. s[-1]: correct, extracts the last character.
7. s[1:5]: correct, extracts from the 2nd to the 5th character.
8. s[-1:-5]: valid, but nothing is extracted (indexes are inverted!)
9. s[-5:-1]: correct, extracts "5678".
10. s[-1000]: invalid.
- Solution (one of two possible solutions):
text = """never say \"never!\"
\\said the sad turtle."""
- Solution:
string = "a 1 b 2 c 3"
digit = "DIGIT"
character = "CHARACTER"
result = string.replace("1", digit)
result = result.replace("2", digit)
result = result.replace("3", digit)
result = result.replace("a", character)
result = result.replace("b", character)
result = result.replace("c", character)
print(result) # "CHARACTER DIGIT CHARACTER ..."
In one line:
print(string.replace("1", digit).replace("2", digit) ...
- Solution:
chain_a = """SSSVPSQKTYQGSYGFRLGFLHSGTAKSVTCTYSPALNKM
FCQLAKTCPVQLWVDSTPPPGTRVRAMAIYKQSQHMTEVV
RRCPHHERCSDSDGLAPPQHLIRVEGNLRVEYLDDRNTFR
HSVVVPYEPPEVGSDCTTIHYNYMCNSSCMGGMNRRPILT
IITLEDSSGNLLGRNSFEVRVCACPGRDRRTEEENLRKKG
EPHHELPPGSTKRALPNNT"""
num_lines = chain_a.count("\n") + 1
print(num_lines) # 6
# NOTE: we want to know the length of the actual *sequence*, not the length of the *string*
length_sequence = len(chain_a) - chain_a.count("\n")
print(length_sequence) # 219
sequence = chain_a.replace("\n", "")
print(len(chain_a) - len(sequence)) # 5 (correct)
print(len(sequence)) # 219
num_cysteine = sequence.count("C")
num_histidine = sequence.count("H")
print(num_cysteine, num_histidine) # 10 9
print("NLRVEYLDDRN" in sequence) # True
print(sequence.find("NLRVEYLDDRN")) # 106
# let's check
print(sequence[106 : 106 + len("NLRVEYLDDRN")]) # "NLRVEYLDDRN"
index_first_newline = chain_a.find("\n")
first_line = chain_a[:index_first_newline]
print(first_line)
- Solution:
structure_chain_a = """SER A 96 77.253 20.522 75.007
VAL A 97 76.066 22.304 71.921
PRO A 98 77.731 23.371 68.681
SER A 99 80.136 26.246 68.973
GLN A 100 79.039 29.534 67.364
LYS A 101 81.787 32.022 68.157"""
# I use a variable with a shorter name
chain = structure_chain_a
# find(sub, start) searches from position start and returns an index into the whole string
index_first_newline = chain.find("\n")
index_second_newline = chain.find("\n", index_first_newline + 1)
index_third_newline = chain.find("\n", index_second_newline + 1)
print(index_first_newline, index_second_newline, index_third_newline)
second_line = chain[index_first_newline + 1 : index_second_newline]
print(second_line) # "VAL A 97 76.066 22.304 71.921"
#                     01234567890123456789012345678
#                     0         1         2
x = second_line[9:15]
y = second_line[16:22]
z = second_line[23:]
print(x, y, z)
# NOTE: they are all strings
third_line = chain[index_second_newline + 1 : index_third_newline]
print(third_line) # "PRO A 98 77.731 23.371 68.681"
#                    01234567890123456789012345678
#                    0         1         2
x_prime = third_line[9:15]
y_prime = third_line[16:22]
z_prime = third_line[23:]
print(x_prime, y_prime, z_prime)
# NOTE: they are all strings
# we should convert all variables to floats, in order to calculate distances
x, y, z = float(x), float(y), float(z)
x_prime, y_prime, z_prime = float(x_prime), float(y_prime), float(z_prime)
diff_x = x - x_prime
diff_y = y - y_prime
diff_z = z - z_prime
distance = (diff_x**2 + diff_y**2 + diff_z**2)**0.5
print(distance)
The solution is way simpler using split():
lines = chain.split("\n")
second_line = lines[1]
third_line = lines[2]
words = second_line.split()
x, y, z = float(words[-3]), float(words[-2]), float(words[-1])
words = third_line.split()
x_prime, y_prime, z_prime = float(words[-3]), float(words[-2]), float(words[-1])
distance = ((x - x_prime)**2 + (y - y_prime)**2 + (z - z_prime)**2)**0.5
- Solutions:
- Solution:
dna_seq = dna_seq.replace("\n", "") # Remove newline characters
length = len(dna_seq) # Calculate length
ng = dna_seq.count("G") # Calculate the number of Gs
nc = dna_seq.count("C") # Calculate the number of Cs
gc_cont = (ng + nc) / float(length) # Calculate the GC-content
- Solution:
rna_seq = dna_seq.replace("T", "U")
- Solution:
intron = dna_seq[50:156] # Careful with indexes
exon1 = dna_seq[:50] # Careful with indexes
exon2 = dna_seq[156:] # Careful with indexes
spliced = exon1 + exon2