Python Strings for Bioinformatics: From Basics to Applications

March 25, 2024

Course Description: This course provides a comprehensive introduction to using Python strings in bioinformatics. Starting with the fundamentals of Python strings, the course progresses to advanced topics such as sequence alignment, parsing biological data, and manipulating biological sequences. Through hands-on exercises and real-world examples, students will learn how to apply Python string operations to solve common bioinformatics tasks.

Introduction to Python Strings

Basic string operations

In Python, strings are sequences of characters and support various operations for manipulation. Here are some basic string operations:

Concatenation: Joining two or more strings together.
python
str1 = "Hello"
str2 = "World"
result = str1 + " " + str2
print(result)  # Output: Hello World
Length: Getting the length of a string.
python
string = "Python"
length = len(string)
print(length)  # Output: 6
Indexing: Accessing individual characters in a string.
python
string = "Python"
print(string[0])  # Output: P
Slicing: Extracting a substring from a string.
python
string = "Python"
substring = string[2:5]
print(substring)  # Output: tho
Formatting: Formatting strings using placeholders or the format() method.
python
name = "Alice"
age = 30
message = "My name is {} and I am {} years old.".format(name, age)
print(message)  # Output: My name is Alice and I am 30 years old.
Conversion: Converting the case of a string.
python
string = "Hello, World!"
print(string.lower())  # Output: hello, world!
print(string.upper())  # Output: HELLO, WORLD!
Splitting: Splitting a string into a list of substrings.
python
string = "apple,banana,cherry"
fruits = string.split(",")
print(fruits)  # Output: ['apple', 'banana', 'cherry']
Joining: Joining a list of strings into a single string.
python
fruits = ['apple', 'banana', 'cherry']
string = ",".join(fruits)
print(string)  # Output: apple,banana,cherry

These are just some of the basic string operations in Python. Strings in Python are versatile and offer many more operations for manipulation and formatting.
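
A few further string methods come up constantly when working with sequences. The sketch below applies count(), find(), replace(), and startswith() to a short DNA string; the sequence is made up purely for illustration.

```python
# A made-up DNA string used only for illustration
dna = "ATGCGTACGTTAG"

# count() tallies non-overlapping occurrences of a substring
g_count = dna.count("G")

# find() returns the index of the first occurrence, or -1 if absent
first_cg = dna.find("CG")

# replace() returns a new string; the original is left unchanged
rna = dna.replace("T", "U")

# startswith() tests a prefix, for example a start codon
has_start = dna.startswith("ATG")

print(g_count, first_cg, rna, has_start)
```

Because strings are immutable, each of these methods returns a new value rather than modifying dna in place.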

String indexing and slicing

String indexing and slicing are common operations used to access and extract substrings from a string in Python. Here’s a brief overview of how indexing and slicing work:

String Indexing:
- In Python, indexing starts at 0, so the first character of a string has index 0, the second character has index 1, and so on.
- You can use positive indexes to access characters from the beginning of the string and negative indexes to access characters from the end of the string.
python
string = "Hello, World!"

# Accessing individual characters
print(string[0])   # Output: H
print(string[7])   # Output: W

# Accessing characters using negative indexes
print(string[-1])  # Output: !
print(string[-3])  # Output: l
String Slicing:
- String slicing allows you to extract a substring from a string by specifying a start index and an end index (not inclusive).
- The syntax for string slicing is string[start:end]. If start is not specified, it defaults to 0. If end is not specified, it defaults to the end of the string.
python
string = "Hello, World!"

# Slicing from index 2 to index 5 (not inclusive)
print(string[2:5])    # Output: llo

# Slicing from the beginning to index 5 (not inclusive)
print(string[:5])     # Output: Hello

# Slicing from index 7 to the end
print(string[7:])     # Output: World!

# Slicing using negative indexes
print(string[-6:-1])  # Output: World
Step Size in Slicing:
- You can specify a step size when slicing a string to extract characters at regular intervals.
- The syntax for specifying a step size is string[start:end:step].
python
string = "Hello, World!"

# Slicing with a step size of 2
print(string[::2])   # Output: Hlo ol!

# Reversing a string
print(string[::-1])  # Output: !dlroW ,olleH

String indexing and slicing are powerful features in Python that allow you to work with strings efficiently and effectively.
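
Slicing with a step maps naturally onto codons. As a minimal sketch (the helper name split_codons and the example sequence are assumptions for illustration), the code below cuts a DNA string into complete three-base codons and picks out every third base.

```python
def split_codons(seq):
    """Return the complete, non-overlapping codons of seq, reading from index 0."""
    usable = len(seq) - len(seq) % 3  # drop a trailing partial codon, if any
    return [seq[i:i + 3] for i in range(0, usable, 3)]

dna = "ATGGCCATTGTAAT"  # 14 bases: the last two do not form a full codon
codons = split_codons(dna)

# Start at index 2 and step by 3 to get the third base of each codon
third_positions = dna[2::3]

print(codons)
print(third_positions)
```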

String formatting

String formatting in Python allows you to create strings with placeholders that can be replaced with variables or values. There are several ways to format strings in Python, including using the format() method, f-strings (formatted string literals), and the % operator. Here’s an overview of each method:

Using the format() Method:
- The format() method formats a string by replacing placeholders ({}) with values.
- You can specify the order of placeholders or use named placeholders for more clarity.
python
name = "Alice"
age = 30
message = "My name is {} and I am {} years old.".format(name, age)
print(message)  # Output: My name is Alice and I am 30 years old.
Using f-strings (Formatted String Literals):
- F-strings provide a more concise and readable way to format strings by embedding expressions directly inside string literals.
- F-strings are prefixed with f or F before the string literal.
python
name = "Alice"
age = 30
message = f"My name is {name} and I am {age} years old."
print(message)  # Output: My name is Alice and I am 30 years old.
Using the % Operator (Older printf-Style Formatting):
- The % operator is still supported and is not deprecated, but format() and f-strings are generally preferred in new code.
- Placeholders are represented by %s for strings, %d for integers, and %f for floats.
python
name = "Alice"
age = 30
message = "My name is %s and I am %d years old." % (name, age)
print(message)  # Output: My name is Alice and I am 30 years old.

String formatting is a powerful feature in Python that allows you to create dynamic and customized strings for various purposes, such as output messages, logging, and data representation.
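
Format specifiers inside an f-string control width, alignment, and precision, which helps when printing sequence summaries. The following is a small sketch; gc_fraction, the identifier seq1, and the sequence are hypothetical names and values chosen for the example.

```python
def gc_fraction(seq):
    """Fraction of G and C characters in seq (0.0 for an empty string)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)

seq_id = "seq1"     # hypothetical identifier
seq = "ATGCGCGTTA"  # made-up sequence

# <8 left-aligns in 8 columns, >5 right-aligns in 5 columns,
# and .1% renders a fraction as a percentage with one decimal place
line = f"{seq_id:<8}{len(seq):>5} bp  GC={gc_fraction(seq):.1%}"
print(line)
```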

Working with Biological Sequences

Understanding DNA, RNA, and protein sequences

DNA, RNA, and protein sequences are fundamental components of living organisms, each serving distinct biological functions:

DNA (Deoxyribonucleic Acid):
- DNA is a molecule that carries the genetic instructions used in the growth, development, functioning, and reproduction of all known living organisms.
- It consists of two long chains of nucleotides twisted into a double helix and held together by hydrogen bonds between complementary base pairs (adenine [A] with thymine [T], and cytosine [C] with guanine [G]).
- DNA encodes the information necessary for the synthesis of proteins, which are essential for the structure, function, and regulation of the body’s tissues and organs.
RNA (Ribonucleic Acid):
- RNA is a molecule that plays multiple roles in the coding, decoding, regulation, and expression of genes.
- It is single-stranded and is typically synthesized from DNA templates during the process of transcription.
- There are several types of RNA, including messenger RNA (mRNA), transfer RNA (tRNA), and ribosomal RNA (rRNA), each with specific functions in protein synthesis.
Proteins:
- Proteins are large, complex molecules that are essential for the structure and function of cells, tissues, and organs.
- Proteins are made up of amino acids linked together in a specific sequence, which is determined by the sequence of nucleotides in the corresponding mRNA molecule.
- The sequence of amino acids in a protein determines its structure and function, and any changes in the sequence can lead to changes in the protein’s properties.

In summary, DNA carries the genetic information, RNA helps in the expression of this information, and proteins are the primary functional molecules in cells, performing a wide variety of roles based on their structure and function. Understanding these sequences and their interactions is crucial for understanding the molecular basis of life and for various applications in biotechnology, medicine, and evolutionary biology.
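
The relationship between a DNA coding strand and its mRNA can be expressed with plain string operations. The sketch below (is_dna and transcribe are illustrative helper names, not part of any library) validates the alphabet and then replaces T with U.

```python
DNA_ALPHABET = set("ACGT")

def is_dna(seq):
    """True if seq is non-empty and contains only A, C, G, T (case-insensitive)."""
    return bool(seq) and set(seq.upper()) <= DNA_ALPHABET

def transcribe(coding_strand):
    """Return the mRNA for a DNA coding strand by replacing T with U."""
    if not is_dna(coding_strand):
        raise ValueError("expected a DNA sequence made of A, C, G, T")
    return coding_strand.upper().replace("T", "U")

print(transcribe("ATGGCCATT"))
```

This sketch rejects ambiguity codes such as N; real data often contains them, so the alphabet check may need to be relaxed.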

Reading sequences from files

To read sequences from files in Python, you can use the Bio.SeqIO module from Biopython. This module provides a convenient way to parse and read sequences from various file formats, such as FASTA, GenBank, and FASTQ. Here’s an example of how to read sequences from a FASTA file:

Install Biopython: If you haven’t installed Biopython yet, you can install it using pip:
bash
pip install biopython
Read sequences from a FASTA file: Suppose you have a file named sequences.fasta containing one or more sequences in FASTA format. You can use the following code to read these sequences:
python
from Bio import SeqIO

fasta_file = "sequences.fasta"

# Read sequences from the FASTA file
sequences = []
for record in SeqIO.parse(fasta_file, "fasta"):
    sequences.append(record.seq)

# Print the first sequence
print("First sequence:", sequences[0])
This code reads the sequences from the sequences.fasta file and stores them in a list. You can then access and manipulate these sequences as needed in your program.
Read sequences from other file formats: If your sequences are stored in a different format, such as GenBank or FASTQ, you can use the same approach with minor modifications. For example, to read sequences from a GenBank file:
python
from Bio import SeqIO

genbank_file = "sequences.gb"

# Read sequences from the GenBank file
sequences = []
for record in SeqIO.parse(genbank_file, "genbank"):
    sequences.append(record.seq)

# Print the first sequence
print("First sequence:", sequences[0])
Replace sequences.gb with the name of your GenBank file.

The SeqIO.parse() function reads sequences from the specified file and returns an iterator over the sequences. You can then iterate over this iterator to access individual sequences and their associated metadata (e.g., sequence ID, description).

To read sequences from a file in Python without using Biopython, you can simply open the file and read its contents. Here’s a basic example of how to read a sequence from a FASTA file:

python

# Open the FASTA file for reading
with open("sequence.fasta", "r") as file:
    lines = file.readlines()

# Parse the sequence from the file
sequence = ""
for line in lines:
    if not line.startswith(">"):  # Skip the header line
        sequence += line.strip()

print("Sequence:", sequence)

In this example, replace "sequence.fasta" with the path to your FASTA file. The code reads the lines from the file, skips the header line (which starts with > in FASTA format), and concatenates the remaining lines to form the sequence.

If you have a file with multiple sequences and want to read them one by one, you can modify the code to store each sequence in a list:

python

# Open the FASTA file for reading
with open("sequences.fasta", "r") as file:
    lines = file.readlines()

sequences = []
current_sequence = ""
for line in lines:
    if line.startswith(">"):  # Start of a new sequence
        if current_sequence:
            sequences.append(current_sequence)
        current_sequence = ""
    else:
        current_sequence += line.strip()

# Add the last sequence to the list
if current_sequence:
    sequences.append(current_sequence)

# Print the list of sequences
for idx, seq in enumerate(sequences, start=1):
    print(f"Sequence {idx}: {seq}")

This code reads multiple sequences from a FASTA file, stores them in a list, and then prints each sequence along with its index.
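
For large files it can help to keep the headers and avoid loading every line at once. One possible approach, sketched here with a hypothetical read_fasta generator and an in-memory stand-in for a file, yields one (header, sequence) pair at a time.

```python
import io

def read_fasta(handle):
    """Yield (header, sequence) pairs from an open FASTA file handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

# In-memory stand-in for a file, so the example runs without any input file
example = io.StringIO(">seq1 first\nATGC\nGGTA\n>seq2\nTTAA\n")
records = list(read_fasta(example))
print(records)
```

With a real file, the same function can be used as `with open("sequences.fasta") as handle:` followed by a loop over `read_fasta(handle)`.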

Writing sequences to files

To write sequences to a file in Python, you can open a file in write mode and then write the sequences to the file. Here’s a basic example of how to write a sequence to a file in FASTA format:

python

sequence_id = "Sequence_1"
sequence = "ATCGATCGATCG"

# Open the output file for writing
with open("output.fasta", "w") as file:
    # Write the sequence in FASTA format
    file.write(f">{sequence_id}\n")  # Write the sequence ID
    file.write(f"{sequence}\n")  # Write the sequence

In this example, the with open("output.fasta", "w") as file statement opens a file named output.fasta in write mode. The file.write() statements are then used to write the sequence ID (prefixed with >) and the sequence itself to the file. The "\n" character is used to add a newline at the end of each line.

If you have multiple sequences and want to write them to the same file, you can modify the code to write each sequence one by one:

python

sequences = {
    "Sequence_1": "ATCGATCGATCG",
    "Sequence_2": "GCTAGCTAGCTA",
    "Sequence_3": "TATGTATGTATG"
}

# Open the output file for writing
with open("output.fasta", "w") as file:
    # Write each sequence in FASTA format
    for sequence_id, sequence in sequences.items():
        file.write(f">{sequence_id}\n")  # Write the sequence ID
        file.write(f"{sequence}\n")  # Write the sequence

In this example, the sequences dictionary contains sequence IDs as keys and sequences as values. The for loop iterates over each key-value pair in the dictionary and writes the sequence ID and sequence to the file in FASTA format.
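
FASTA files conventionally wrap long sequences over several lines. The sketch below shows one way to do that with slicing; write_fasta and its width parameter are illustrative names, and an in-memory buffer stands in for a real output file.

```python
import io

def write_fasta(handle, records, width=60):
    """Write (identifier, sequence) pairs in FASTA format, wrapping sequence lines."""
    for seq_id, seq in records:
        handle.write(f">{seq_id}\n")
        # Emit the sequence in chunks of at most `width` characters
        for start in range(0, len(seq), width):
            handle.write(seq[start:start + width] + "\n")

buffer = io.StringIO()  # stands in for open("output.fasta", "w")
write_fasta(buffer, [("Sequence_1", "ATCGATCGATCG")], width=5)
fasta_text = buffer.getvalue()
print(fasta_text)
```

A width of 5 is used only to make the wrapping visible on a short sequence; 60 or 80 characters per line is more typical.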

Sequence Alignment

Pairwise sequence alignment

Pairwise sequence alignment is a method used to compare two sequences (DNA, RNA, or protein) to identify regions of similarity. It helps in understanding the evolutionary relationships between sequences and identifying functional or structural similarities. One of the most common algorithms used for pairwise sequence alignment is the Needleman-Wunsch algorithm for global alignment.

Here’s a basic example of pairwise sequence alignment using the Needleman-Wunsch algorithm in Python, without using Biopython:

python

def needleman_wunsch(seq1, seq2, match=1, mismatch=-1, gap=-1):
    # Initialize the scoring matrix
    rows = len(seq1) + 1
    cols = len(seq2) + 1
    score_matrix = [[0 for _ in range(cols)] for _ in range(rows)]
    # Initialize the traceback matrix (1 = up, 2 = left, 3 = diagonal)
    traceback_matrix = [[0 for _ in range(cols)] for _ in range(rows)]
    # Initialize the first row and column of the scoring matrix
    for i in range(1, rows):
        score_matrix[i][0] = score_matrix[i-1][0] + gap
        traceback_matrix[i][0] = 1
    for j in range(1, cols):
        score_matrix[0][j] = score_matrix[0][j-1] + gap
        traceback_matrix[0][j] = 2
    # Fill in the scoring matrix
    for i in range(1, rows):
        for j in range(1, cols):
            match_mismatch = match if seq1[i-1] == seq2[j-1] else mismatch
            diag_score = score_matrix[i-1][j-1] + match_mismatch
            up_score = score_matrix[i-1][j] + gap
            left_score = score_matrix[i][j-1] + gap
            max_score = max(diag_score, up_score, left_score)
            score_matrix[i][j] = max_score
            if max_score == diag_score:
                traceback_matrix[i][j] = 3
            elif max_score == up_score:
                traceback_matrix[i][j] = 1
            else:
                traceback_matrix[i][j] = 2
    # Traceback to find the alignment
    align1 = ""
    align2 = ""
    i, j = rows - 1, cols - 1
    while i > 0 or j > 0:
        if traceback_matrix[i][j] == 3:
            align1 = seq1[i-1] + align1
            align2 = seq2[j-1] + align2
            i -= 1
            j -= 1
        elif traceback_matrix[i][j] == 1:
            align1 = seq1[i-1] + align1
            align2 = "-" + align2
            i -= 1
        else:
            align1 = "-" + align1
            align2 = seq2[j-1] + align2
            j -= 1
    return align1, align2

# Example usage
seq1 = "AGTACGCA"
seq2 = "TATGC"
alignment1, alignment2 = needleman_wunsch(seq1, seq2)
print("Sequence 1:", alignment1)
print("Sequence 2:", alignment2)

This example defines the needleman_wunsch function, which takes two sequences (seq1 and seq2) and optional scoring parameters (match, mismatch, gap) and performs a global pairwise sequence alignment using the Needleman-Wunsch algorithm. The function returns the two aligned sequences as strings.
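
Once two aligned strings are available, it is often useful to summarize them. The sketch below scores an alignment under the same match, mismatch, and linear gap scheme and computes percent identity; the function names and the hand-written example alignment are assumptions for illustration and are not the output of the function above.

```python
def score_alignment(align1, align2, match=1, mismatch=-1, gap=-1):
    """Score two aligned strings of equal length under a linear gap penalty."""
    if len(align1) != len(align2):
        raise ValueError("aligned sequences must have the same length")
    score = 0
    for a, b in zip(align1, align2):
        if a == "-" or b == "-":
            score += gap
        elif a == b:
            score += match
        else:
            score += mismatch
    return score

def percent_identity(align1, align2):
    """Identical, non-gap columns as a percentage of the alignment length."""
    if len(align1) != len(align2) or not align1:
        raise ValueError("aligned sequences must have the same non-zero length")
    identical = sum(a == b and a != "-" for a, b in zip(align1, align2))
    return 100.0 * identical / len(align1)

# A hand-written alignment used only to exercise the helpers
a1, a2 = "AGTACGCA", "--TATGC-"
print(score_alignment(a1, a2), percent_identity(a1, a2))
```

Percent identity has several definitions (for example, dividing by the shorter sequence length instead of the alignment length), so the denominator used here is a design choice rather than a standard.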

Multiple sequence alignment

Multiple sequence alignment (MSA) is a technique used to align three or more sequences simultaneously. It is used to identify conserved regions, detect evolutionary relationships, and predict the structure and function of proteins. One of the most commonly used approaches is progressive alignment, as implemented in tools such as ClustalW.

Here’s a toy illustration of the progressive idea in Python, without using Biopython. It is not the ClustalW algorithm: it builds no guide tree, inserts no gaps to line sequences up, and only compares sequences position by position:

python

def clustalw(sequences, match=1, mismatch=-1, gap=-1):
    # Note: match, mismatch, and gap are accepted but unused in this toy version
    # Initialize the working list, one list of characters per sequence
    alignment = [list(seq) for seq in sequences]
    # Repeatedly merge the most similar pair
    while len(alignment) > 1:
        # Calculate pairwise similarity scores (position-by-position matches)
        scores = []
        for i in range(len(alignment)):
            for j in range(i+1, len(alignment)):
                score = sum(a == b for a, b in zip(alignment[i], alignment[j]))
                scores.append((i, j, score))
        # Find the most similar pair
        i, j, _ = max(scores, key=lambda x: x[2])
        # Merge the two sequences; zip() truncates to the shorter one
        merged_seq = []
        for a, b in zip(alignment[i], alignment[j]):
            if a == b:
                merged_seq.append(a)
            else:
                merged_seq.append('-')
        alignment[i] = merged_seq
        alignment.pop(j)
    # Return the final merged string
    return ''.join(alignment[0])

# Example usage
sequences = [
    "AGTACGCA",
    "TATGC",
    "GACTA",
    "AGCT"
]
alignment = clustalw(sequences)
print("Multiple Sequence Alignment:")
print(alignment)

This example defines the clustalw function, which takes a list of sequences and repeatedly merges the two most similar entries until only one remains. The optional scoring parameters (match, mismatch, gap) are accepted but not used. Because zip() stops at the shorter sequence, each merge truncates the result to the shorter length. The function returns a single consensus-like string in which “-” marks positions where the merged sequences disagree, not a true alignment with one row per input sequence. For real data, use a dedicated tool such as ClustalW, Clustal Omega, or MUSCLE.
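
When the sequences are already aligned (equal length, with gaps in place), a column-wise consensus is straightforward to compute. The sketch below uses collections.Counter; the consensus helper and the hand-made alignment are assumptions for illustration.

```python
from collections import Counter

def consensus(aligned):
    """Most common character in each column of equal-length aligned strings."""
    if not aligned:
        return ""
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("all aligned sequences must have the same length")
    result = []
    for column in zip(*aligned):
        # most_common(1) returns [(character, count)]; on a tie, the character
        # seen first in the column wins
        result.append(Counter(column).most_common(1)[0][0])
    return "".join(result)

aligned = ["AG-TAC", "AGCTAC", "AG-TTC"]  # hand-made alignment
print(consensus(aligned))
```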

Using Python libraries for alignment (e.g., Biopython)

To perform multiple sequence alignment (MSA) from Python, you can use Biopython to write the input file and parse the result, and run an external aligner such as ClustalW to compute the alignment. Here’s an example:

python

import subprocess

from Bio import AlignIO, SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

# Define the sequences
seq1 = Seq("AGTACGCA")
seq2 = Seq("TATGC")
seq3 = Seq("GACTA")
seq4 = Seq("AGCT")

# Create SeqRecord objects for each sequence
seq_records = [
    SeqRecord(seq1, id="seq1"),
    SeqRecord(seq2, id="seq2"),
    SeqRecord(seq3, id="seq3"),
    SeqRecord(seq4, id="seq4")
]

# Write the unaligned sequences to a FASTA file for ClustalW to read
SeqIO.write(seq_records, "temp.fasta", "fasta")

# Perform the alignment using the ClustalW algorithm
# You may need to install the ClustalW software separately
# and provide the path to the executable
clustalw_exe = "clustalw2"  # Path to the ClustalW executable
subprocess.run([clustalw_exe, "-infile=temp.fasta"], check=True)

# Parse the alignment result (ClustalW writes temp.aln by default)
alignment = AlignIO.read("temp.aln", "clustal")

# Print the alignment
print("Multiple Sequence Alignment:")
print(alignment)

In this example, we first create SeqRecord objects for each sequence and write them to temp.fasta with SeqIO.write. We then run the ClustalW executable with subprocess to perform the alignment. Finally, we parse the alignment result using AlignIO.read and print the aligned sequences.

Note: Before running this code, you need to install the ClustalW software and provide the path to the executable (clustalw_exe). You also need to have Biopython installed in your Python environment. Older tutorials use the ClustalwCommandline wrapper from Bio.Align.Applications; those wrappers have been deprecated in recent Biopython releases in favor of calling the tool with subprocess. A MultipleSeqAlignment object cannot be built directly from the unaligned records above, because it requires all sequences to have the same length.

Sequence Manipulation

Translating DNA sequences to protein sequences

Translating DNA sequences to protein sequences is a common task in bioinformatics, and it can be easily done using Biopython. Here’s how you can translate a DNA sequence to a protein sequence:

python

from Bio.Seq import Seq

# DNA sequence
dna_seq = Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG")

# Translate DNA to protein sequence
protein_seq = dna_seq.translate()

print("DNA sequence:", dna_seq)
print("Protein sequence:", protein_seq)

In this example, we use the Seq class from Biopython to create a Seq object representing the DNA sequence. We then use the translate() method of the Seq object to translate the DNA sequence to a protein sequence. The resulting protein sequence is a Seq object containing the translated sequence.

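Note that DNA sequences are translated using the standard genetic code by default, where each codon (a sequence of three nucleotides) is translated to a specific amino acid. The translate() method starts at the first nucleotide and translates the whole sequence, writing each stop codon (TAA, TAG, or TGA) as “*”; it does not search for a start codon (ATG). For the sequence above, the result is MAIVMGR*KGAR*. To stop at the first in-frame stop codon instead, pass to_stop=True.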
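
To see what translation involves at the string level, here is a minimal pure-Python sketch using the standard genetic code. The names CODON_TABLE and translate are illustrative, and mapping unrecognized codons to “X” is an assumption made for this example rather than Biopython’s behavior.

```python
from itertools import product

# Standard genetic code, listed in the conventional T, C, A, G codon order
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def translate(dna, to_stop=False):
    """Translate dna from its first base; '*' marks a stop codon."""
    dna = dna.upper()
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE.get(dna[i:i + 3], "X")  # assumption: "X" for unknown codons
        if aa == "*" and to_stop:
            break
        protein.append(aa)
    return "".join(protein)

dna = "ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG"
print(translate(dna))
print(translate(dna, to_stop=True))
```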

Reverse complement of DNA sequences

To find the reverse complement of a DNA sequence in Python using Biopython, you can use the reverse_complement method of the Seq object. Here’s an example:

python

from Bio.Seq import Seq

# DNA sequence
dna_seq = Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG")

# Reverse complement
reverse_complement_seq = dna_seq.reverse_complement()

print("DNA sequence:", dna_seq)
print("Reverse complement:", reverse_complement_seq)

In this example, we first create a Seq object representing the DNA sequence. Then, we use the reverse_complement method to find the reverse complement of the DNA sequence. The result is a Seq object containing the reverse complement sequence.

Seq objects do not provide a reverse() method. If you only need the reverse of the sequence without complementing it, slice with a step of -1, just as you would with a Python string:

python

# Reverse sequence only
reverse_seq = dna_seq[::-1]
print("Reverse sequence:", reverse_seq)
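
The same operation can be written with plain Python strings. As a minimal sketch (the helper name reverse_complement and the COMPLEMENT table are illustrative, and only unambiguous bases are handled), str.maketrans builds the complement mapping and slicing reverses the result.

```python
# Translation table mapping each base to its complement (both cases)
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse_complement(dna):
    """Complement every base, then reverse the string."""
    return dna.translate(COMPLEMENT)[::-1]

print(reverse_complement("ATGGCC"))
print(reverse_complement("GAATTC"))  # a palindromic site equals its own reverse complement
```

Characters outside the table, such as N, pass through unchanged in this sketch.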

Finding motifs and patterns in sequences

To find motifs and patterns in sequences, you can use regular expressions in Python. Regular expressions provide a powerful way to search for specific patterns in strings. Here’s a basic example of how to find motifs in a DNA sequence using regular expressions:

python

import re

# DNA sequence
dna_seq = "ATGCGTACGTCATGCGTAGCG"

# Define the motif pattern
motif_pattern = "ATG"

# Find motifs using regular expression
motif_positions = [match.start() for match in re.finditer(motif_pattern, dna_seq)]

print("DNA sequence:", dna_seq)
print("Motif pattern:", motif_pattern)
print("Motif positions:", motif_positions)

In this example, we use the re.finditer() function to find all occurrences of the motif pattern (“ATG”) in the DNA sequence. The function returns an iterator yielding match objects for each match found. We extract the starting position of each match using the start() method of the match object.

You can use more complex regular expressions to define motifs with specific patterns, such as motifs containing ambiguous characters or motifs with certain constraints. The regular expression syntax allows for a wide range of possibilities to specify patterns.
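
As one illustration of ambiguous characters, the sketch below converts a motif written with IUPAC nucleotide codes into a regular expression and wraps it in a lookahead so that overlapping matches are also reported. The find_motif helper and the example sequences are assumptions for this example.

```python
import re

# IUPAC nucleotide codes expressed as regular-expression fragments
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}

def find_motif(motif, seq):
    """Return 0-based start positions of all matches, including overlapping ones."""
    pattern = "".join(IUPAC[base] for base in motif.upper())
    # A lookahead matches without consuming characters, so the search
    # resumes at the next position and overlapping hits are found
    return [m.start() for m in re.finditer(f"(?={pattern})", seq.upper())]

print(find_motif("TATAWA", "GGTATAAATATATA"))
print(find_motif("AA", "AAAA"))
```

Without the lookahead, re.finditer() reports only non-overlapping matches, so "AA" in "AAAA" would be found at positions 0 and 2 only.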

Keep in mind that regular expressions are case-sensitive by default. If you want to perform case-insensitive searches, you can use the re.IGNORECASE flag in the re.finditer() function:

python

# Find motifs using case-insensitive search
motif_positions = [match.start() for match in re.finditer(motif_pattern, dna_seq, re.IGNORECASE)]

Additionally, if you need to search for motifs in RNA sequences, you can first convert the RNA sequence to DNA (e.g., by replacing “U” with “T”) and then use regular expressions as shown above.

Parsing Biological Data

Parsing data from biological databases (e.g., GenBank, UniProt)

To parse data from biological databases like GenBank or UniProt, you can use Biopython’s Bio.Entrez and Bio.SeqIO modules for GenBank, and Bio.SwissProt module for UniProt. Here’s a basic example of how to parse data from GenBank and UniProt using Biopython:

Parsing data from GenBank:

python

from Bio import Entrez, SeqIO

# Email address (required by NCBI)
Entrez.email = "your.email@example.com"

# Accession number of the sequence
accession = "NC_000913.3"

# Fetch the sequence record from NCBI
handle = Entrez.efetch(db="nucleotide", id=accession, rettype="gb", retmode="text")
record = SeqIO.read(handle, "genbank")
handle.close()

# Print the sequence record
print(record)

Parsing data from UniProt:

python

from Bio import SwissProt

# UniProt accession number
accession = "P0A7B8"

# Read a UniProt record that was previously downloaded in text format
with open(f"{accession}.txt") as handle:
    record = SwissProt.read(handle)

# Print the UniProt record
print(record)

These examples demonstrate how to fetch a GenBank record from NCBI and how to parse a downloaded UniProt record using Biopython. Make sure to replace the email address with your own email address in the Entrez.email field.

Parsing data from biological databases without Biopython can be more challenging, as you’ll need to handle the parsing of the specific file formats used by each database. Here’s a basic example of how you might parse data from a GenBank file without using Biopython:

Parsing data from GenBank:

python

def parse_genbank(file_path):
    record = {}
    sequence = ""
    in_sequence = False
    with open(file_path, "r") as file:
        for line in file:
            if line.startswith("//"):
                break  # End of the record
            if in_sequence:
                # Sequence lines contain position numbers and spaces; keep letters only
                sequence += "".join(ch for ch in line if ch.isalpha())
            elif line.startswith("LOCUS"):
                record["LOCUS"] = line.strip()
            elif line.startswith("ACCESSION"):
                record["ACCESSION"] = line.strip()
            elif line.startswith("DEFINITION"):
                record["DEFINITION"] = line.strip()
            elif line.startswith("ORIGIN"):
                in_sequence = True  # Sequence lines follow the ORIGIN line
    record["SEQUENCE"] = sequence
    return record

# Example usage
file_path = "example.gb"
parsed_record = parse_genbank(file_path)
print(parsed_record)

In this example, parse_genbank is a function that reads a GenBank file and extracts information such as the locus, accession number, definition, and sequence. Note that this is a simplified example and may need to be adapted to handle more complex GenBank files.

For parsing UniProt data without Biopython, you would similarly need to parse the specific file format used by UniProt. UniProt provides data in several formats (e.g., UniProtKB/Swiss-Prot in text format, UniProtKB/TrEMBL in text format), so you would need to handle each format accordingly.
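
As a rough sketch of what that involves for the Swiss-Prot text format, the code below reads the two-letter line codes ID, AC, and SQ and collects the sequence lines up to the terminating “//”. The parse_swissprot helper is a hypothetical name, and the record shown is entirely made up for illustration; real entries contain many more line types.

```python
import io

def parse_swissprot(handle):
    """Extract the entry name, accessions, and sequence from one flat-file record."""
    record = {"ID": None, "ACCESSIONS": [], "SEQUENCE": ""}
    in_sequence = False
    for line in handle:
        if line.startswith("//"):
            break  # end of record
        if in_sequence:
            # Sequence lines are indented and split into blocks of ten residues
            record["SEQUENCE"] += "".join(line.split())
        elif line.startswith("ID   "):
            record["ID"] = line[5:].split()[0]
        elif line.startswith("AC   "):
            record["ACCESSIONS"] += [a.strip() for a in line[5:].split(";") if a.strip()]
        elif line.startswith("SQ   "):
            in_sequence = True
    return record

# A made-up record used only to exercise the parser
example = io.StringIO(
    "ID   TEST_ECOLI              Reviewed;          12 AA.\n"
    "AC   P00000; Q00000;\n"
    "DE   RecName: Full=Hypothetical example protein;\n"
    "SQ   SEQUENCE   12 AA;  1300 MW;  0000000000000000 CRC64;\n"
    "     MKTAYIAKQR QI\n"
    "//\n"
)
uniprot_record = parse_swissprot(example)
print(uniprot_record)
```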

Extracting relevant information from biological records

To extract relevant information from biological records, you’ll need to understand the format of the records and the specific information you’re interested in extracting. Here’s a general approach you can follow:

Identify the Record Format: Determine the format of the biological records you’re working with (e.g., GenBank, UniProt, FASTA).
Read the Records: Read the records from the file or database.
Parse the Records: Parse the records to extract the relevant information. For example, in a GenBank record, you might want to extract the locus, accession number, definition, and sequence.
Process the Information: Process the extracted information as needed (e.g., store it in a data structure, perform further analysis).

Here’s an example of how you might extract information from a GenBank record in Python:

python

def extract_genbank_info(file_path):
    record = {}
    sequence = ""
    in_sequence = False
    with open(file_path, "r") as file:
        for line in file:
            if line.startswith("//"):
                break  # End of the record
            if in_sequence:
                # Sequence lines contain position numbers and spaces; keep letters only
                sequence += "".join(ch for ch in line if ch.isalpha())
            elif line.startswith("LOCUS"):
                record["LOCUS"] = line.strip()
            elif line.startswith("ACCESSION"):
                record["ACCESSION"] = line.strip()
            elif line.startswith("DEFINITION"):
                record["DEFINITION"] = line.strip()
            elif line.startswith("ORIGIN"):
                in_sequence = True  # Sequence lines follow the ORIGIN line
    record["SEQUENCE"] = sequence
    return record

# Example usage
file_path = "example.gb"
genbank_record = extract_genbank_info(file_path)
print(genbank_record)

This is a simplified example and may need to be adapted to handle more complex GenBank records or other record formats.

Handling different data formats (e.g., FASTA, BED, GFF)

Handling different data formats in bioinformatics requires understanding the specific format and how to parse it. Here’s a general approach to handle some common bioinformatics file formats (FASTA, BED, GFF) in Python:

FASTA Format:
- The FASTA format is a simple format for representing nucleotide or protein sequences.
- Each sequence is represented by a header line starting with “>” followed by the sequence data.

python

def parse_fasta(file_path):
    sequences = {}
    with open(file_path, "r") as file:
        header = None
        sequence = ""
        for line in file:
            line = line.strip()
            if line.startswith(">"):
                if header is not None:
                    sequences[header] = sequence
                header = line[1:]
                sequence = ""
            else:
                sequence += line
        if header is not None:
            sequences[header] = sequence
    return sequences

# Example usage
file_path = "example.fasta"
fasta_sequences = parse_fasta(file_path)
print(fasta_sequences)

BED Format:
- The BED format is used to represent genomic annotations, such as gene structures or regulatory regions.
- Each line in a BED file represents a genomic feature, with columns specifying the chromosome, start position, end position, and optionally additional information.

python

def parse_bed(file_path):
    features = []
    with open(file_path, "r") as file:
        for line in file:
            if line.startswith("#"):
                continue  # Skip comment lines
            fields = line.strip().split("\t")
            if len(fields) < 3:
                continue  # Skip invalid lines
            chromosome, start, end = fields[:3]
            feature = {"chromosome": chromosome, "start": int(start), "end": int(end)}
            features.append(feature)
    return features

# Example usage
file_path = "example.bed"
bed_features = parse_bed(file_path)
print(bed_features)

GFF Format:
- The GFF (General Feature Format) format is similar to BED but allows for more detailed annotations, including strand information and feature types.
- GFF files have nine tab-separated columns, with the first eight columns containing feature information and the ninth column containing additional attributes.

python

def parse_gff(file_path):
    features = []
    with open(file_path, "r") as file:
        for line in file:
            if line.startswith("#"):
                continue  # Skip comment lines
            fields = line.strip().split("\t")
            if len(fields) != 9:
                continue  # Skip lines that do not have exactly nine columns
            chromosome, _, feature_type, start, end, _, strand, _, attributes = fields
            feature = {
                "chromosome": chromosome,
                "start": int(start),
                "end": int(end),
                "feature_type": feature_type,
                "strand": strand,
                "attributes": attributes
            }
            features.append(feature)
    return features

# Example usage
file_path = "example.gff"
gff_features = parse_gff(file_path)
print(gff_features)

These are simplified examples and may need to be adapted to handle specific requirements or variations in the file formats.
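
Two details often matter when moving between these formats: BED coordinates are 0-based and half-open, while GFF coordinates are 1-based and inclusive, and the GFF3 attributes column holds semicolon-separated key=value pairs. The sketch below covers both, assuming GFF3-style attributes; the helper names and example values are hypothetical.

```python
def parse_gff3_attributes(column):
    """Split a GFF3 attributes column such as 'ID=gene1;Name=thrA' into a dict."""
    attributes = {}
    for item in column.strip().split(";"):
        if "=" in item:
            key, value = item.split("=", 1)
            attributes[key.strip()] = value.strip()
    return attributes

def gff_to_bed_interval(start, end):
    """Convert 1-based inclusive GFF coordinates to 0-based half-open BED coordinates."""
    return start - 1, end

print(parse_gff3_attributes("ID=gene1;Name=thrA"))
print(gff_to_bed_interval(100, 200))
```

Both intervals in the last call describe the same 101 bases; only the numbering convention differs. GTF and older GFF2 files use a different attribute syntax, which this sketch does not handle.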

Applications in Bioinformatics

Gene prediction and annotation

Gene prediction and annotation are crucial steps in bioinformatics for identifying and characterizing genes in genomic sequences. Here’s a general approach to gene prediction and annotation using Python:

Gene Prediction:
- Gene prediction algorithms aim to identify the locations of genes in a genome based on various features, such as open reading frames (ORFs), sequence conservation, and gene expression patterns.
- Common gene prediction algorithms include GeneMark, Glimmer, and Augustus.
Annotation:
- Gene annotation involves assigning biological information to the predicted genes, such as gene function, protein domains, and regulatory elements.
- Annotation is often performed using databases and tools such as BLAST, InterProScan, and Pfam.

Here’s a simplified example of a gene prediction and annotation pipeline that uses Python’s subprocess module to call external tools:

python

import subprocess

# Step 1: Gene Prediction
def predict_genes(input_file):
    # Run a gene prediction tool (this is a simplified example).
    # The command name and options are placeholders; check your tool's
    # documentation for the real syntax.
    cmd = ["genemark", "-sequence", input_file, "-output", f"{input_file}.out"]
    subprocess.run(cmd, check=True)

# Step 2: Gene Annotation
def annotate_genes(protein_fasta):
    # Use BLAST for functional annotation (this is a simplified example).
    # blastp expects protein sequences in FASTA format as the query.
    cmd = ["blastp", "-query", protein_fasta, "-db", "nr",
           "-out", f"{protein_fasta}.blastp.out"]
    subprocess.run(cmd, check=True)

# Example usage
input_file = "genome.fasta"
predict_genes(input_file)

# Placeholder name for the predicted protein sequences
protein_fasta = f"{input_file}.out.faa"
annotate_genes(protein_fasta)

In this example, we first call a placeholder genemark command to predict genes in a genomic sequence (genome.fasta). GeneMark is a real family of tools, but the command line shown here is hypothetical. We then use BLAST (blastp) to annotate the predicted proteins based on similarity to known proteins in the NCBI non-redundant (nr) database. Note that blastp takes protein sequences in FASTA format as its query, not a GFF file, so the predicted genes must first be exported or translated as proteins.

Note: Gene prediction and annotation are complex processes that require specialized tools and databases. This example provides a simplified overview and may need to be adapted for specific requirements and tools.
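
To make the idea of an open reading frame concrete, here is a minimal sketch that scans the forward strand for ORFs with a regular expression. It is far simpler than what tools like Glimmer or Augustus do; find_orfs, ORF_PATTERN, and the min_length default are illustrative assumptions.

```python
import re

# ATG, then as few whole codons as possible, then the first in-frame stop codon
ORF_PATTERN = re.compile(r"ATG(?:[ACGT]{3})*?(?:TAA|TAG|TGA)")

def find_orfs(dna, min_length=6):
    """Return (start, end, sequence) for each forward-strand ORF, one per ATG."""
    dna = dna.upper()
    orfs = []
    # The lookahead visits every ATG, including ones inside another ORF
    for start in (m.start() for m in re.finditer("(?=ATG)", dna)):
        match = ORF_PATTERN.match(dna, start)
        if match and len(match.group()) >= min_length:
            orfs.append((start, match.end(), match.group()))
    return orfs

print(find_orfs("GGATGAAACCCTAGTT"))
```

The end coordinate is exclusive, matching Python slicing, so dna[start:end] returns the ORF. To cover the reverse strand, the same function could be applied to the reverse complement of the sequence.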

Phylogenetic analysis

Phylogenetic analysis is a key method in bioinformatics for studying evolutionary relationships between organisms. It involves constructing phylogenetic trees, which depict the evolutionary history and relatedness of different species or groups of organisms. Here’s a general approach to phylogenetic analysis using Python:

Sequence Alignment:
- Align homologous sequences (e.g., DNA, RNA, or protein sequences) using alignment tools like Clustal Omega or MUSCLE.
Phylogenetic Tree Construction:
- Construct a phylogenetic tree using alignment data and a tree-building algorithm, such as neighbor-joining, maximum likelihood, or Bayesian inference.
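- Biopython’s Bio.Phylo package supports this step: Bio.Phylo.TreeConstruction builds distance-based and parsimony trees, while Bio.Phylo.Applications wraps external command-line programs (these wrappers are deprecated in recent Biopython releases).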
Visualization:
- Visualize the phylogenetic tree using tools like PhyloTree or ETE Toolkit.
- Biopython’s Phylo module also provides functions for tree visualization.

Here’s a simplified example of phylogenetic tree construction using Biopython:

python

from Bio import AlignIO, Phylo
from Bio.Phylo.TreeConstruction import DistanceCalculator, DistanceTreeConstructor

# Example alignment file in FASTA format
alignment_file = "alignment.fasta"
alignment = AlignIO.read(alignment_file, "fasta")

# Calculate pairwise distances between sequences
calculator = DistanceCalculator("identity")
dm = calculator.get_distance(alignment)
print(dm)

# Construct a tree using the neighbor-joining algorithm
constructor = DistanceTreeConstructor(calculator, "nj")
tree = constructor.build_tree(alignment)

# Visualize the tree (requires matplotlib)
Phylo.draw(tree)

In this example, we first read a multiple sequence alignment from a file (alignment.fasta). We then calculate the pairwise distances between sequences, construct a phylogenetic tree using the neighbor-joining algorithm, and visualize the tree using Biopython’s Phylo module.

Note: Phylogenetic analysis can be computationally intensive, especially for large datasets. It’s important to consider the appropriate algorithms and methods based on the size and complexity of your data.
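To make the idea of a pairwise distance concrete, here is a minimal pure-string sketch of the proportion of differing positions (often called the p-distance) between aligned sequences. It is an illustration only and is not claimed to reproduce Biopython's 'identity' model exactly; the function name p_distance, the gap handling, and the toy alignment are assumptions.

```python
def p_distance(seq_a, seq_b, gap="-"):
    """Fraction of differing positions between two aligned sequences.

    Columns where either sequence has a gap are ignored.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    mismatches = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a == gap or b == gap:
            continue
        compared += 1
        if a != b:
            mismatches += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return mismatches / compared

# Made-up alignment for illustration
aligned = {
    "seq1": "ACGTACGT",
    "seq2": "ACGTACGA",
    "seq3": "AC-TTCGA",
}
names = sorted(aligned)
for i, first in enumerate(names):
    for second in names[i + 1:]:
        d = p_distance(aligned[first], aligned[second])
        print(first, second, round(d, 3))
```

A matrix of such distances is the input that distance-based methods such as neighbor-joining work from.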

Structural bioinformatics

Structural bioinformatics is a field that focuses on the analysis, prediction, and modeling of biomolecular structures, such as proteins, nucleic acids, and complexes. It plays a crucial role in understanding the structure-function relationships of biomolecules and is essential for drug discovery, protein engineering, and molecular biology research. Here’s an overview of key concepts and methods in structural bioinformatics:

Protein Structure Prediction:
- Predicting the three-dimensional structure of a protein from its amino acid sequence.
- Methods include homology modeling (comparative modeling), ab initio modeling, and threading (fold recognition).
Protein Structure Determination:
- Experimental methods for determining protein structures, such as X-ray crystallography, nuclear magnetic resonance (NMR) spectroscopy, and cryo-electron microscopy (cryo-EM).
- Structural databases like the Protein Data Bank (PDB) store experimentally determined protein structures.
Molecular Docking:
- Predicting the binding mode and affinity of small molecules (ligands) to a target protein.
- Used in drug discovery and virtual screening.
Molecular Dynamics (MD) Simulation:
- Simulating the dynamic behavior of biomolecular systems over time.
- Provides insights into protein folding, dynamics, and interactions.
Structural Analysis and Visualization:
- Analyzing and visualizing biomolecular structures to understand their properties and functions.
- Tools like PyMOL, VMD, and Chimera are commonly used for visualization.
Structure-Function Relationships:
- Studying how the three-dimensional structure of a biomolecule relates to its biological function.
- Important for understanding enzyme mechanisms, protein-ligand interactions, and molecular recognition.
Bioinformatics Databases and Tools:
- Databases like the Protein Data Bank (PDB), Structural Classification of Proteins (SCOP), and CATH provide resources for structural bioinformatics research.
- Software tools and packages (e.g., MODELLER, Rosetta, GROMACS) are used for structure prediction, analysis, and simulation.

Overall, structural bioinformatics combines computational methods, bioinformatics tools, and experimental techniques to study the structure and function of biomolecules, contributing to advances in biotechnology, medicine, and molecular biology.

Advanced Topics

Regular expressions for pattern matching

Regular expressions (regex) are powerful tools for pattern matching and searching within strings. They allow you to define complex search patterns, making them very useful in bioinformatics for tasks such as sequence pattern matching, motif finding, and data extraction. Here’s a basic overview of using regular expressions in Python for pattern matching:

Import the re Module: The re module in Python provides support for working with regular expressions.
Compile the Regular Expression: Use the re.compile() function to compile your regular expression pattern into a regex object, which can then be used for matching.
Search for Patterns: Use the search() method of the regex object to search for the pattern in a string. This method returns a match object if the pattern is found, or None otherwise.
Extracting Matched Patterns: Use methods like group() or groups() of the match object to extract the matched patterns.

Here’s a simple example to illustrate:

python

import re

# Define a regular expression pattern
pattern = r"\b[A-Z]{3}\b"  # Matches a standalone run of exactly three uppercase letters

# Compile the pattern into a regex object
regex = re.compile(pattern)

# Search for the pattern in a string
text = "The ATP-binding domain is essential for protein function"
match = regex.search(text)

if match:
    print("Found:", match.group())
else:
    print("Pattern not found")

In this example, the regular expression pattern "\b[A-Z]{3}\b" matches exactly three consecutive uppercase letters bounded by word boundaries (e.g., “ATP” in the input string; “The” does not match because only its first letter is uppercase). The search() method looks for the first occurrence of this pattern in the input text. If the pattern is found, match.group() is used to extract and print the matched substring.

Regular expressions can be much more complex, allowing you to define patterns with varying levels of specificity and complexity depending on your needs. They are a powerful tool for working with text data in bioinformatics and beyond.
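As a sketch of motif finding, the example below searches a protein string for the N-glycosylation motif (PROSITE PS00001: N, then any residue except P, then S or T, then any residue except P). The pattern is wrapped in a lookahead so that overlapping occurrences are reported too. The function name find_motifs and the protein sequence are made up for illustration.

```python
import re

# N-glycosylation motif: N, not P, S or T, not P.
# The lookahead (?=...) lets finditer report overlapping matches.
MOTIF = re.compile(r"(?=(N[^P][ST][^P]))")

def find_motifs(protein):
    """Return (position, matched_text) pairs, including overlapping matches."""
    return [(m.start(), m.group(1)) for m in MOTIF.finditer(protein)]

protein = "MKNGSANNTSPLNPSA"  # made-up sequence for illustration
for pos, hit in find_motifs(protein):
    print(pos, hit)
```

Positions are 0-based string indexes; add 1 when reporting residue numbers in the usual biological convention.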

Handling large biological datasets efficiently

Handling large biological datasets efficiently often requires a combination of proper data storage, processing, and analysis techniques. Here are some strategies to consider:

Use Efficient Data Structures: Choose appropriate data structures for storing your data. For large datasets, consider using data storage solutions like databases (e.g., SQLite, MySQL, PostgreSQL) or specialized bioinformatics databases (e.g., BioSQL, BioMart) instead of plain text files.
Batch Processing: Process data in batches rather than all at once. This can help manage memory usage and improve processing speed. Use streaming techniques when possible to avoid loading entire datasets into memory.
Parallel and Distributed Processing: Use parallel and distributed computing techniques to process data more efficiently. Libraries like Dask, multiprocessing, and Apache Spark can help distribute processing tasks across multiple cores or machines.
Optimized Algorithms: Use algorithms optimized for large datasets. For example, use indexing techniques (e.g., k-d trees, hash tables) for efficient searching and retrieval operations.
Data Filtering and Preprocessing: Filter out irrelevant data and preprocess data before analysis to reduce the dataset size and complexity. This can include removing duplicates, handling missing values, and normalizing data.
Compressed File Formats: Use compressed file formats (e.g., gzip, bzip2) for storing and transferring large datasets to reduce storage space and improve data transfer speeds.
Data Pipelines: Use data pipelines to automate data processing tasks. This can help manage complex workflows and ensure data integrity and consistency.
Use of Cloud Computing: Consider using cloud computing services (e.g., AWS, Google Cloud, Microsoft Azure) for scalable and on-demand computing resources to handle large datasets.

By implementing these strategies, you can efficiently handle and analyze large biological datasets in your bioinformatics projects.
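The streaming and compression points above can be sketched with the standard library alone. The generator below reads a FASTA file (optionally gzip-compressed) one record at a time, so memory use depends on the largest record rather than on the file size. The function name read_fasta and the tiny file written at the start are assumptions made so the example is self-contained.

```python
import gzip
import os
import tempfile

def read_fasta(path):
    """Yield (header, sequence) pairs one record at a time."""
    opener = gzip.open if path.endswith(".gz") else open
    header, chunks = None, []
    with opener(path, "rt") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                # A new header closes the previous record
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            else:
                chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

# Build a tiny gzip-compressed FASTA file for the demonstration
tmp_dir = tempfile.mkdtemp()
fasta_path = os.path.join(tmp_dir, "toy.fasta.gz")
with gzip.open(fasta_path, "wt") as out:
    out.write(">seq1 first\nACGT\nACGT\n>seq2 second\nGGCC\n")

lengths = {header.split()[0]: len(seq) for header, seq in read_fasta(fasta_path)}
print(lengths)
```

For production work, Biopython's SeqIO.parse offers the same record-by-record iteration with support for many more formats.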

Best practices and tips for bioinformatics programming

Here are some best practices and tips for bioinformatics programming:

Use Version Control: Use version control systems like Git to manage your code. This helps track changes, collaborate with others, and revert to previous versions if needed.
Document Your Code: Write clear and concise documentation for your code, including comments and docstrings. This helps others understand your code and makes it easier to maintain.
Use Meaningful Variable Names: Use descriptive variable names that convey the purpose of the variable. This improves readability and makes your code easier to understand.
Modularize Your Code: Break your code into small, reusable modules or functions. This makes your code more modular, easier to test, and helps avoid repetition.
Use Libraries and Packages: Leverage existing libraries and packages for common tasks in bioinformatics. This saves time and ensures that you are using well-tested and optimized code.
Optimize Your Code: Write efficient code by avoiding unnecessary loops, using appropriate data structures, and optimizing algorithms. This is especially important when working with large datasets.
Handle Errors Gracefully: Use try-except blocks to handle errors and exceptions in your code. This prevents your program from crashing and helps identify and fix issues.
Test Your Code: Write unit tests to ensure that your code behaves as expected. This helps catch bugs early and ensures that your code is robust and reliable.
Use Virtual Environments: Use virtual environments like conda or virtualenv to manage dependencies and isolate your project’s environment. This avoids conflicts between different projects.
Stay Updated: Keep your software and libraries up to date to benefit from the latest features, improvements, and security patches.
Collaborate and Seek Feedback: Collaborate with others and seek feedback on your code. This can help improve your code quality and learn new techniques.
Stay Organized: Organize your code and project files in a logical manner. Use meaningful directory structures and naming conventions.

Following these best practices can help you write better code, improve productivity, and build more robust and maintainable bioinformatics applications.
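Several of these practices (a descriptive name, a docstring, explicit error handling, and a small test) fit in one short function. This is a minimal sketch; the function gc_content and its validation rules are illustrative choices, not a prescribed design.

```python
def gc_content(sequence):
    """Return the GC fraction (0.0 to 1.0) of a DNA sequence.

    Raises ValueError for empty input or characters other than A, C, G, T.
    """
    sequence = sequence.upper()
    if not sequence:
        raise ValueError("sequence is empty")
    invalid = set(sequence) - set("ACGT")
    if invalid:
        raise ValueError(f"unexpected characters: {sorted(invalid)}")
    return (sequence.count("G") + sequence.count("C")) / len(sequence)

def test_gc_content():
    # Normal cases, including lower-case input
    assert gc_content("GGCC") == 1.0
    assert gc_content("atgc") == 0.5
    # Invalid input should raise instead of returning a misleading number
    try:
        gc_content("ATGX")
    except ValueError:
        pass
    else:
        raise AssertionError("expected ValueError for invalid input")

test_gc_content()
print("all tests passed")
```

In a real project the test function would live in a separate test module and be run by a test runner such as pytest.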

Examples and Exercises

Python: Strings

Strings are immutable objects representing text. To define a string, we can write:

var = "text"

or, equivalently:

var = 'text'

To insert special characters, we need to escape them with a backslash (\):

path = "data\\fasta"

or use the r prefix (raw string):

path = r"data\fasta"

Here is a reference list of escape characters. You will probably only need the most obvious ones, like \\, \n, and \t.

Escape Character	Meaning
\\	Backslash
\'	Single-quote
\"	Double-quote
\a	ASCII bell
\b	ASCII backspace
\f	ASCII formfeed
\n	ASCII linefeed (also known as newline)
\r	Carriage return
\t	Horizontal tab
\v	ASCII vertical tab
\N{name}	Unicode character with the given name
\uxxxx	Unicode character with 16-bit hex value xxxx
\Uxxxxxxxx	Unicode character with 32-bit hex value xxxxxxxx
\ooo	Character with octal value ooo
\xhh	Character with hex value hh

To create a multi-line string, we can manually place the newline character at the end of each line:

sad_joke = "Time flies like an arrow.\nFruit flies like a banana."
print(sad_joke)

or we can use triple quotes:

sad_joke = """Time flies like an arrow.
Fruit flies like a banana."""
print(sad_joke)

Warning

print() interprets special characters, while the interpreter echo doesn’t. Try to write:

print(path)

and (from the interpreter):

path

In the first case, we see one backslash (the escaping backslash is automatically interpreted by print()); in the second case we see two backslashes (the escaping backslash is not interpreted).

The same happens with sad_joke: print() shows two lines, while the echo shows the \n escape.

String-Number conversion

We can convert a number into a string with str():

n = 10
s = str(n)
print(n, type(n))
print(s, type(s))

or perform the opposite conversion with int() and float():

n = int("123")
q = float("1.23")
print(n, type(n))
print(q, type(q))

If the string doesn’t contain a number of the correct type, Python raises an error (a ValueError):

int("3.14")         # Not an int
float("ribosome")   # Not a number
int("1 2 3")        # Not a number
int("fifteen")      # Not a number
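When the input comes from a file rather than from our own code, it is often useful to catch the conversion error instead of letting the program stop. The sketch below tries int() first and falls back to float(); the helper name to_number and the choice to return None for non-numeric text are assumptions made for illustration.

```python
def to_number(text):
    """Return text converted to int or float, or None if it is not numeric."""
    for convert in (int, float):
        try:
            return convert(text)
        except ValueError:
            # This conversion failed; try the next one
            continue
    return None

for raw in ["123", "3.14", "ribosome", "1 2 3"]:
    print(repr(raw), "->", to_number(raw))
```

Returning None keeps the example short; in a real parser it is usually better to report which line and field could not be converted.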

Operations

Result	Operator	Meaning
bool	==	Check whether two strings are identical
int	len(str)	Return the length of the string
str	str + str	Concatenate two strings
str	str * int	Replicate the string
bool	str in str	Check if a string is present in another string
str	str[int:int]	Extract a sub-string

Example. Let’s concatenate two strings:

string = "one" + " " + "string"
length = len(string)
print("the string:", string, "is", length, "characters long")

Another example:

string = "Python is hell!" * 1000
print("the string is", len(string), "characters long")

Warning

We cannot concatenate strings with other types. For example:

var = 123
print("the value of var is" + var)

raises a TypeError. Two working alternatives:

print("the value of var is" + str(var))

or:

print("the value of var is", var)

(In the first case there is no space between "is" and 123, because + joins the strings exactly as they are; in the second case print() inserts a space between its arguments.)

Example. The operator in checks whether substring appears one or more times in string:

substring in string

For example:

string = "A beautiful journey"

print("A" in string)             # True
print("beautiful" in string)     # True
print("BEAUTIFUL" in string)     # False
print("ul jour" in string)       # True
print("Gengis Khan" in string)   # False
print(" " in string)             # True (one space)
print("  " in string)            # False (two spaces)

The result is always True or False.

Example. To extract a substring we can use indexes. Positive indexes count from 0 starting at the left; negative indexes count from -1 starting at the right:

alphabet = "abcdefghijklmnopqrstuvwxyz"

print(alphabet[0])    # "a"
print(alphabet[1])    # "b"

print(alphabet[-1])   # "z"
print(alphabet[-2])   # "y"

print(alphabet[0:1])  # "a"
print(alphabet[0:2])  # "ab"
print(alphabet[0:5])  # "abcde"
print(alphabet[:5])   # "abcde"

print(alphabet[-5:-1])   # "vwxy"
print(alphabet[-5:])     # "vwxyz"

print(alphabet[10:-10])  # "klmnop"

print(alphabet[len(alphabet)-1])  # "z"
print(alphabet[len(alphabet)])    # Error
print(alphabet[10000])            # Error

Warning

Extraction is inclusive with respect to the first index, but exclusive with respect to the second. In other words, alphabet[i:j] corresponds to:

alphabet[i] + alphabet[i+1] + … + alphabet[j-1]

Note that alphabet[j] is excluded.

Warning

Extraction returns a new string, leaving the original unchanged:

alphabet = "abcdefghijklmnopqrstuvwxyz"
substring = alphabet[2:-2]
print(substring)
print(alphabet)   # Is unchanged

Methods

Result	Method	Meaning
str	str.upper()	Return the string in upper case
str	str.lower()	Return the string in lower case
str	str.strip(str)	Remove strings from the sides
str	str.lstrip(str)	Remove strings from the left
str	str.rstrip(str)	Remove strings from the right
bool	str.startswith(str)	Check if the string starts with another
bool	str.endswith(str)	Check if the string ends with another
int	str.find(str)	Return the position of a substring
int	str.count(str)	Count the number of occurrences of a substring
str	str.replace(str, str)	Replace substrings

Warning

Methods return a new string, leaving the original unchanged (as with extraction):

alphabet = "abcdefghijklmnopqrstuvwxyz"
alphabet_upper = alphabet.upper()
print(alphabet_upper)
print(alphabet)   # Is unchanged

Example. upper() and lower() are very simple:

text = "No Yelling"
result = text.upper()
print(result)
result = result.lower()
print(result)

Example. The strip() variants are also simple:

text = "    one example    "
print(text.strip())    # for this text, equivalent to text.strip(" ")
print(text.lstrip())   # idem
print(text.rstrip())   # idem
print(text)            # text is unchanged

Note that the space between "one" and "example" is never removed. We can specify more than one character to be removed:

"AAAA one example BBBB".strip("AB")

Example. The same is valid with startswith() and endswith():

text = "123456789"
print(text.startswith("1"))     # True
print(text.startswith("a"))     # False
print(text.endswith("56789"))   # True
print(text.endswith("5ABC9"))   # False

Example. find() returns the position of the first occurrence of a substring, or -1 if the substring never occurs:

text = "123456789"
print(text.find("1"))       # 0
print(text.find("56789"))   # 4
print(text.find("Q"))       # -1

Example. replace() returns a copy of the string where a substring is replaced with another:

text = "if roses were rotten, then"
print(text.replace("ro", "gro"))

Example. Given this unformatted string of amino acids:

sequence = ">MAnlFKLgaENIFLGrKW   "

To increase uniformity, we want to remove the ">" character, remove the trailing spaces, and finally convert everything to upper case:

s1 = sequence.lstrip(">")
s2 = s1.rstrip(" ")
s3 = s2.upper()
print(s3)

Alternatively, all in one step:

print(sequence.lstrip(">").rstrip(" ").upper())

Why does it work? Let’s write it with brackets:

print(((sequence.lstrip(">")).rstrip(" ")).upper())
# each bracketed sub-expression evaluates to a str

As you can see, the result of each method is a string (as s1, s2, and s3 in the example above), so we can invoke string methods on it.

Exercises

How can I:
1. Create a string consisting of five spaces only.
2. Check whether a string contains at least one space.
3. Check whether a string contains exactly five (arbitrary) characters.
4. Create an empty string, and check whether it is really empty.
5. Create a string that contains one hundred copies of "Python is great".
6. Given the strings "but cell", "biology", and "is way better", compose them into the string "but cell biology is way better", and then replicate it a thousand times.
7. Check whether the string "12345" begins with 1 (the character, not the number!).
8. Create a string consisting of a single character \. (Check whether the output matches using both the echo of the interpreter and print(), and possibly also with len().)
9. Check whether the string "\\" contains one or two backslashes.
10. Check whether a string (of choice) begins or ends with \.
11. Check whether a string (of choice) contains x at least three times at the beginning and/or at the end. For instance, the following strings satisfy the desideratum:

"x....xx"   # 1 + 2 >= 3
"xx....x"   # 2 + 1 >= 3
"xxxx..."   # 4 + 0 >= 3

while these do not:

"x.....x"   # 1 + 1 < 3
"...x..."   # 0 + 0 < 3
"       "   # 0 + 0 < 3

Given the string:

s = "0123456789"

which of the following extractions are correct?

s[9]
s[10]
s[:10]
s[1000]
s[0]
s[-1]
s[1:5]
s[-1:-5]
s[-5:-1]
s[-1000]

Create a two-line string that contains the two following lines of text literally, including all the special characters and the implicit newline character:

never say "never!"
\said the sad turtle.

Given the strings:

string = "a 1 b 2"
digit = "DIGIT"
character = "CHARACTER"

replace all the digits in the variable string with the text provided by the variable digit, and all alphabetic characters with the content of the variable character. The result should look like this:

"CHARACTER DIGIT CHARACTER DIGIT"

You are free to use auxiliary variables to hold any intermediate results, but do not need to.

Given the following multi-line sequence:

chain_a = """SSSVPSQKTYQGSYGFRLGFLHSGTAKSVTCTYSPALNKM
FCQLAKTCPVQLWVDSTPPPGTRVRAMAIYKQSQHMTEVV
RRCPHHERCSDSDGLAPPQHLIRVEGNLRVEYLDDRNTFR
HSVVVPYEPPEVGSDCTTIHYNYMCNSSCMGGMNRRPILT
IITLEDSSGNLLGRNSFEVRVCACPGRDRRTEEENLRKKG
EPHHELPPGSTKRALPNNT"""

which represents the amino acid sequence of the DNA-binding domain of the Tumor Suppressor Protein TP53, answer the following questions.

1. How many lines does it hold?
2. How long is the sequence? (Do not forget to ignore the special characters!)
3. Remove all newline characters, and put the result in a new variable sequence.
4. How many cysteines "C" are there in the sequence? How many histidines "H"?
5. Does the chain contain the sub-sequence "NLRVEYLDDRN"? In what position?
6. How can I use find() and the sub-string extraction operator [i:j] to extract the first line from chain_a?

Given (a small portion of) the tertiary structure of chain A of the TP53 protein:

structure_chain_a = """SER A 96 77.253 20.522 75.007
VAL A 97 76.066 22.304 71.921
PRO A 98 77.731 23.371 68.681
SER A 99 80.136 26.246 68.973
GLN A 100 79.039 29.534 67.364
LYS A 101 81.787 32.022 68.157"""

Each line represents a \(C_\alpha\) atom of the backbone of the structure. Of each atom, we know:
- the amino acid code of the residue
- the chain (which is always "A" in this example)
- the position of the residue within the chain (starting from the N-terminal)
- the \(x, y, z\) coordinates of the atom

1. Extract the second line using find() and the extraction operator. Put the line in a new variable line.
2. Extract the coordinates of the second residue, and put them into three variables x, y, and z.
3. Extract the coordinates of the third residue as well, putting them in different variables x_prime, y_prime, z_prime.
4. Compute the Euclidean distance between the two residues:

\(d((x,y,z),(x',y',z')) = \sqrt{(x-x')^2 + (y-y')^2 + (z-z')^2}\)

Hint: make sure to use float numbers when computing the distance.

Given the following DNA sequence, part of the BRCA2 human gene:

dna_seq = """GGGCTTGTGGCGCGAGCTTCTGAAACTAGGCGGCAGAGGCGGAGCCGCT
GTGGCACTGCTGCGCCTCTGCTGCGCCTCGGGTGTCTTTT
GCGGCGGTGGGTCGCCGCCGGGAGAAGCGTGAGGGGACAG
ATTTGTGACCGGCGCGGTTTTTGTCAGCTTACTCCGGCCA
AAAAAGAACTGCACCTCTGGAGCGG"""

1. Calculate the GC-content of the sequence
2. Convert the DNA sequence into an RNA sequence
3. Assuming that this sequence contains an intron ranging from nucleotide 51 to nucleotide 156, store the sequence of the intron in a string, and the sequence of the spliced transcript in another string.

Python: Strings (Solutions)

Note

Later on in the solutions, I will sometimes use the backslash character \ at the end of a line. When used this way, \ tells Python that the command continues on the following line, allowing us to break long commands over multiple lines.

Solutions:
1. Solution:

#       12345
text = "     "
print(text)
print(len(text))

2. Solution:

at_least_one_space = " " in text

# check whether it works
print(" " in "nospaceatallhere")
print(" " in "onlyonespacehere--> <--")
print(" " in "more spaces in here")

3. Solution:

exactly_5_characters = len(text) == 5

# check whether it works
print(len("1234") == 5)
print(len("12345") == 5)
print(len("123456") == 5)

4. Solution:

empty_string = ""
print(len(empty_string) == 0)

5. Solution:

base = "Python is great"
repeats = base * 100

# check whether the length is correct
print(len(repeats) == len(base) * 100)

6. Solution:

part_1 = "but cell"
part_2 = "biology"
part_3 = "is way better"

# the spaces between the parts must be added explicitly
text = (part_1 + " " + part_2 + " " + part_3) * 1000

7. Let’s try this:

start_with_1 = "12345".startswith(1)

but Python gives an error message:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: startswith first arg must be str or a tuple of str, not int

The error message says that startswith() requires its argument to be a string, not an int as in our case: 1 is an int. The solution is:

start_with_1 = "12345".startswith("1")
print(start_with_1)

the value is True, as expected.

8. Solution:

string = "\\"
string               # echo of the interpreter: '\\'
print(string)        # \
print(len(string))   # 1

Note that a raw string cannot end with a single backslash: r"\" is a syntax error, so the escaped form "\\" is the one to use here.

9. Already checked before: the string contains only one backslash. Anyway:

backslash = "\\"
print(backslash * 2 in "\\")   # False

10. First method:

backslash = "\\"
condition = text.startswith(backslash) or \
            text.endswith(backslash)

Second method (it assumes that text is not empty):

condition = (text[0] == backslash) or \
            (text[-1] == backslash)

11. Solution:

condition = \
    text.startswith("xxx") or \
    (text.startswith("xx") and text.endswith("x")) or \
    (text.startswith("x") and text.endswith("xx")) or \
    text.endswith("xxx")

It’s worth checking the condition using the examples provided in the exercise.

Solution:

s = "0123456789"
print(len(s))   # 10

Which of the following extractions are correct?

s[9]: correct, extracts the last character.
s[10]: invalid.
s[:10]: correct, extracts all characters (remember that the second index, 10 in this case, is exclusive).
s[1000]: invalid.
s[0]: correct, extracts the first character.
s[-1]: correct, extracts the last character.
s[1:5]: correct, extracts from the 2nd to the 5th character.
s[-1:-5]: correct, but nothing is extracted (indexes are inverted!)
s[-5:-1]: correct, extracts from the 6th to the 9th character.
s[-1000]: invalid.

Solution (one of two possible solutions):

text = """never say \"never!\"
\\said the sad turtle."""

Solution:

string = "a 1 b 2 c 3"
digit = "DIGIT"
character = "CHARACTER"

result = string.replace("1", digit)
result = result.replace("2", digit)
result = result.replace("3", digit)
result = result.replace("a", character)
result = result.replace("b", character)
result = result.replace("c", character)
print(result)   # "CHARACTER DIGIT CHARACTER …"

In one line:

print(string.replace("1", digit).replace("2", digit) …

Solution:

chain_a = """SSSVPSQKTYQGSYGFRLGFLHSGTAKSVTCTYSPALNKM
FCQLAKTCPVQLWVDSTPPPGTRVRAMAIYKQSQHMTEVV
RRCPHHERCSDSDGLAPPQHLIRVEGNLRVEYLDDRNTFR
HSVVVPYEPPEVGSDCTTIHYNYMCNSSCMGGMNRRPILT
IITLEDSSGNLLGRNSFEVRVCACPGRDRRTEEENLRKKG
EPHHELPPGSTKRALPNNT"""

num_lines = chain_a.count("\n") + 1
print(num_lines)   # 6

# NOTE: we want to know the length of the actual *sequence*, not the length of the *string*
length_sequence = len(chain_a) - chain_a.count("\n")
print(length_sequence)   # 219

sequence = chain_a.replace("\n", "")
print(len(chain_a) - len(sequence))   # 5 (correct)
print(len(sequence))                  # 219

num_cysteine = sequence.count("C")
num_histidine = sequence.count("H")
print(num_cysteine, num_histidine)   # 10 9

print("NLRVEYLDDRN" in sequence)      # True
print(sequence.find("NLRVEYLDDRN"))   # 106

# let's check
print(sequence[106 : 106 + len("NLRVEYLDDRN")])   # "NLRVEYLDDRN"

index_first_newline = chain_a.find("\n")
first_line = chain_a[:index_first_newline]
print(first_line)

Solution:

structure_chain_a = """SER A 96 77.253 20.522 75.007
VAL A 97 76.066 22.304 71.921
PRO A 98 77.731 23.371 68.681
SER A 99 80.136 26.246 68.973
GLN A 100 79.039 29.534 67.364
LYS A 101 81.787 32.022 68.157"""

# I use a variable with a shorter name
chain = structure_chain_a

# find() accepts a start position, so each search resumes right after
# the previous newline and returns an index into the whole string
index_first_newline = chain.find("\n")
index_second_newline = chain.find("\n", index_first_newline + 1)
index_third_newline = chain.find("\n", index_second_newline + 1)
print(index_first_newline, index_second_newline, index_third_newline)

second_line = chain[index_first_newline + 1 : index_second_newline]
print(second_line)
# "VAL A 97 76.066 22.304 71.921"
#  01234567890123456789012345678
#  0         1         2

x = second_line[9:15]
y = second_line[16:22]
z = second_line[23:]
print(x, y, z)
# NOTE: they are all strings

third_line = chain[index_second_newline + 1 : index_third_newline]
print(third_line)
# "PRO A 98 77.731 23.371 68.681"
#  01234567890123456789012345678
#  0         1         2

x_prime = third_line[9:15]
y_prime = third_line[16:22]
z_prime = third_line[23:]
print(x_prime, y_prime, z_prime)
# NOTE: they are all strings

# we should convert all variables to floats, in order to calculate distances
x, y, z = float(x), float(y), float(z)
x_prime, y_prime, z_prime = float(x_prime), float(y_prime), float(z_prime)

diff_x = x - x_prime
diff_y = y - y_prime
diff_z = z - z_prime

distance = (diff_x**2 + diff_y**2 + diff_z**2)**0.5
print(distance)

The solution is way simpler using split():

lines = chain.split("\n")
second_line = lines[1]
third_line = lines[2]

words = second_line.split()
x, y, z = float(words[-3]), float(words[-2]), float(words[-1])

words = third_line.split()
x_prime, y_prime, z_prime = float(words[-3]), float(words[-2]), float(words[-1])

distance = ((x - x_prime)**2 + (y - y_prime)**2 + (z - z_prime)**2)**0.5
print(distance)
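As a further variation, and assuming Python 3.8 or later, the standard library function math.dist() computes the Euclidean distance directly from two coordinate tuples. The helper name ca_coordinates is an illustrative choice; the data are the first three lines of the exercise.

```python
import math

structure_chain_a = """SER A 96 77.253 20.522 75.007
VAL A 97 76.066 22.304 71.921
PRO A 98 77.731 23.371 68.681"""

def ca_coordinates(line):
    """Return the last three whitespace-separated fields of a line as floats."""
    return tuple(float(value) for value in line.split()[-3:])

rows = structure_chain_a.split("\n")
second = ca_coordinates(rows[1])
third = ca_coordinates(rows[2])

# math.dist() is available from Python 3.8 onwards
distance = math.dist(second, third)
print(round(distance, 3))
```

The result is close to 3.8, the typical spacing between consecutive alpha carbons in a protein backbone, which is a quick sanity check on the parsing.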