will bioinformatics replace by AI

Revolutionizing Bioinformatics for Beginners: A Journey with ChatGPT

September 23, 2023 Off By admin
Shares

Bioinformatics and ChatGPT

Introduction:

In today’s fast-evolving technological landscape, a new beacon shines on the path of bioinformatics, particularly for those at the beginner level. ChatGPT, with its exquisite conversational and programming abilities, opens a realm of possibilities for those looking to decode the complexities of bioinformatics data analysis. This post aims to introduce an innovative approach, called the OPTICAL model, a paradigm shift designed to guide ChatGPT in generating robust code for diverse bioinformatics tasks, elucidating its feasibility, practical considerations, and the unprecedented value it brings to bioinformatics education.

Harnessing the Power of ChatGPT

The emergence of Large Language Models (LLMs) like ChatGPT has sparked immense interest due to its potential to facilitate programming-heavy data analysis in fields such as bioinformatics. The behavior of this advanced AI system significantly relies on meticulously crafted prompts from human operators, thereby necessitating the creation of a model to fine-tune instructions, enabling responses from ChatGPT to be valid, reproducible, and robust.

chatgpt-bioinformatics
The OPTICAL model for LLM chatbot-assisted scientific data analysis

The Pioneering OPTICAL Model

Derived from the essence of adaptive learning, the OPTICAL model—Optimization of Prompts Through Iterative Conversation and Assessment with an LLM chatbot—is a revolutionary step in chatbot-aided scientific data analysis. This model employs iterative steps to enhance communication with a chatbot, thus refining students’ learning outcomes. It aids students in reviewing scientific questions, computational methods, analysis tasks, and expected outputs, allowing them to create draft prompts, converse with the chatbot, execute chatbot-produced code, and assess the results critically until the expected result is acquired.

Empirical Validation Through Case Studies

To validate the effectiveness and versatility of the OPTICAL model, it was applied to three diverse bioinformatics topics. These studies included challenges like aligning short sequencing reads, making phylogenetic inferences from DNA sequences, and implementing robust circle fitting in computer vision. The interaction with ChatGPT involved several iterations, including instructing the chatbot to handle error messages and refining initial prompts to ensure detailed and error-free analysis.

Bridging the Gap: Practical Considerations and Limitations

The firsthand experience with implementing the OPTICAL model has highlighted the importance of clear definitions and guidelines on how the chatbot should respond to prompts and the significance of reviewing the code at the end of each session. It is essential to cross-reference with manuals to ensure understanding, precision, and accuracy in functions, options, and chatbot’s comments on code. This model also provides a repository of well-defined prompts, sample inputs, reference chatbot code, and expected results to further assist beginners in bioinformatics.

Addressing Disparities and Challenges

For beginners, especially those who are disadvantaged, learning bioinformatics presents numerous challenges, such as limited access to tutoring services and academic interactions. ChatGPT emerges as a beacon of hope in addressing this knowledge-dissemination disparity, allowing students to interact instantaneously and learn coding skills, thus serving as a valuable peer, especially for those who often struggle to find capable peers.

A Visionary Perspective and Conclusion

While the OPTICAL model has its limitations and areas for improvement, it symbolizes a transformative step in bioinformatics education for beginners. The application of ChatGPT, in conjunction with the OPTICAL model, transcends traditional learning barriers, fostering an environment where students can enhance their coding skills and critical thinking. The journey of practicing bioinformatics with a chatbot like ChatGPT is not just confined to a classroom; it extends to be a lifelong learning experience, holding immense promise for the future of bioinformatics education.

OPTICAL CASE STUDY: Short Sequencing Reads Alignment and Visualization
Scientific question, analysis methods, expected outputs

 

ChIP-seq is a technique used to measure transcription factor (TF) binding by sequencing DNA fragments that have been pulled down by an antibody specific to the TF. The number of reads sequenced at a genomic region is proportional to the likelihood of the TF binding to that region. In this session, we will instruct ChatGPT to align the sequencing reads back to a reference genome and generate a file in bigwig format for visualizing the distribution of the reads across the genome.

 

We generated two sequencing files of different depths for a CTCF ChIP-seq library in human ESC H1 cell line from ENCODE, one containing one million reads and the other containing ten million reads. We explored the impact of sequencing depth on TF prediction for the CTCF protein. Results will be compared to the visualization of the distribution generated by ENCODE.

 

Tools to be used: Bowtie2, samtools, and deepTools.

 

Coding language(s)

Bash

DEFINE CHATBOT’S BEHAVIORS
Act as an experienced bioinformatician proficient in ChIP-Seq data analysis, you will assist me by writing code with number of lines as minimal as possible. Rest the thread if asked to. Reply “YES” if understand.
ITERATION 0
Prompts
I have two fastq files in current folder from single-end sequencing of a ChIP-Seq library: ENCFF000AVS_1m.fastq.gz, and ENCFF000AVS_10m.fastq.gz. For each fastq file, align reads to the human reference genome, save to bam file, and then covert it to bigwig file. Tools to use: bowtie2, samtools, and deepTools. The index for bowtie2 is in the folder “../data/indx/bowtie2_whole_genome/” with “hg38” as the prefix. Use 24 CPU for the alignment. Please draft the code in bash.
Code
#!/bin/bash

 

# Define variables

INDEX=”../data/indx/bowtie2_whole_genome/hg38″

CPUS=24

FASTQ1=”ENCFF000AVS_1m.fastq.gz”

FASTQ2=”ENCFF000AVS_10m.fastq.gz”

 

# Align reads to reference genome using bowtie2

bowtie2 -p $CPUS -x $INDEX -U $FASTQ1 | samtools sort -@ $CPUS -o ENCFF000AVS_1m.bam

bowtie2 -p $CPUS -x $INDEX -U $FASTQ2 | samtools sort -@ $CPUS -o ENCFF000AVS_10m.bam

 

# Convert BAM files to bigWig format using deepTools

bamCoverage -b ENCFF000AVS_1m.bam -o ENCFF000AVS_1m.bw –binSize 10 –normalizeUsing CPM

bamCoverage -b ENCFF000AVS_10m.bam -o ENCFF000AVS_10m.bw –binSize 10 –normalizeUsing CPM

Result
 
Assessment*
The alignment worked well. However, error message indicates that bam files are not indexed. This error message was passed to ChatGPT and see its response.
ITERATION 1
Prompts
[E::idx_find_and_load] Could not retrieve index file for ‘ENCFF000AVS_1m.bam’
Code
#!/bin/bash

 

# Define variables

INDEX=”../data/indx/bowtie2_whole_genome/hg38″

CPUS=24

BAM1=”ENCFF000AVS_1m.bam”

BAM2=”ENCFF000AVS_10m.bam”

 

# Sort BAM files using samtools

samtools sort -@ $CPUS -o ENCFF000AVS_1m.sorted.bam $BAM1

samtools sort -@ $CPUS -o ENCFF000AVS_10m.sorted.bam $BAM2

 

# Index sorted BAM files using samtools

samtools index ENCFF000AVS_1m.sorted.bam

samtools index ENCFF000AVS_10m.sorted.bam

 

# Convert BAM files to bigWig format using deepTools

bamCoverage -b ENCFF000AVS_1m.sorted.bam -o ENCFF000AVS_1m.bw –binSize 10 –normalizeUsing CPM

bamCoverage -b ENCFF000AVS_10m.sorted.bam -o ENCFF000AVS_10m.bw –binSize 10 –normalizeUsing CPM

Result
NA
Assessment*
This time I noticed that “samtools index” was inserted ahead of bamCoverage. Also notice that the code assumed that bowtie alignment was done successfully, and therefore can not be directly applied to address our needs without adding back the code for the alignment.
ITERATION 2
Prompts
Wait, I saw that you have “samtools index” before “bamcoverage“. Does bamcoverage as bam to be sorted before using as input?
Code
#!/bin/bash

 

# Define variables

INDEX=”../data/indx/bowtie2_whole_genome/hg38″

CPUS=24

FASTQ1=”ENCFF000AVS_1m.fastq.gz”

FASTQ2=”ENCFF000AVS_10m.fastq.gz”

 

# Align reads to reference genome using bowtie2

bowtie2 -p $CPUS -x $INDEX -U $FASTQ1 | samtools sort -@ $CPUS -o ENCFF000AVS_1m.sorted.bam

bowtie2 -p $CPUS -x $INDEX -U $FASTQ2 | samtools sort -@ $CPUS -o ENCFF000AVS_10m.sorted.bam

 

# Index sorted BAM files using samtools

samtools index ENCFF000AVS_1m.sorted.bam

samtools index ENCFF000AVS_10m.sorted.bam

 

# Convert sorted BAM files to bigWig format using deepTools

bamCoverage -b ENCFF000AVS_1m.sorted.bam -o ENCFF000AVS_1m.bw –binSize 10 –normalizeUsing CPM

bamCoverage -b ENCFF000AVS_10m.sorted.bam -o ENCFF000AVS_10m.bw –binSize 10 –normalizeUsing CPM

Result
Figure legend: Distribution of CTCF ChIP-Seq reads on human H1-hESC across a genomic region of 198 kb. Y-axis for the first two tracks (ENCFF000AVS_1m.bw and ENCFF000AVS_10m.bw) are normalized reads density (CPM: count per million), generated by running codes drafted by chatGPT. Y-axis for the last track represent fold change over control from ENCODE using all the sequencing reads (ENCFF332TNJ.bigWig; 24.6 million).
Assessment*
The chatbot said that “bamCoverage requires that the input BAM file is sorted by genomic position” and generate code to satisfy this requirement. I confirmed that the missing of index error message was issued by bamCoverage. The code ran successfully and generated bigwig files for both fastq files.

 

The bigwig files were uploaded to IGV for visualization. We assessed the results by comparing to the bigwig from ENCODE. The results are generally consistent. The more sequencing depth, the more peaks will be observed.

WRAPPING UP
Review the final codes

Review the code comments added by ChatGPT and identify key steps, as represented by functions and parameters, that are missing in the initial prompts.

Prompts
I need to insert line-by-line comments to the below code which works well to address the needs for the data analysis task. Wait for my code.
Code with explanations
#!/bin/bash

 

# Set the path to the bowtie2 index

INDEX=”../data/indx/bowtie2_whole_genome/hg38″

 

# Set the number of CPUs to use for alignment

CPUS=24

 

# Set the path to the first fastq file

FASTQ1=”ENCFF000AVS_1m.fastq.gz”

 

# Set the path to the second fastq file

FASTQ2=”ENCFF000AVS_10m.fastq.gz”

 

# Align the reads from the first fastq file using bowtie2, then sort the resulting SAM file and convert it to BAM using samtools

bowtie2 -p $CPUS -x $INDEX -U $FASTQ1 | samtools sort -@ $CPUS -o ENCFF000AVS_1m.sorted.bam

 

# Align the reads from the second fastq file using bowtie2, then sort the resulting SAM file and convert it to BAM using samtools

bowtie2 -p $CPUS -x $INDEX -U $FASTQ2 | samtools sort -@ $CPUS -o ENCFF000AVS_10m.sorted.bam

 

# Index the BAM file from the first fastq file using samtools

samtools index ENCFF000AVS_1m.sorted.bam

 

# Index the BAM file from the second fastq file using samtools

samtools index ENCFF000AVS_10m.sorted.bam

 

# Convert the sorted BAM file from the first fastq file to bigwig using deepTools, with a bin size of 10 and normalized using CPM

bamCoverage -b ENCFF000AVS_1m.sorted.bam -o ENCFF000AVS_1m.bw –binSize 10 –normalizeUsing CPM

 

# Convert the sorted BAM file from the second fastq file to bigwig using deepTools, with a bin size of 10 and normalized using CPM

bamCoverage -b ENCFF000AVS_10m.sorted.bam -o ENCFF000AVS_10m.bw –binSize 10 –normalizeUsing CPM

Missing details in the initial prompts
1.     Index the bam file

2.      The bamCoverage has an option “–normalizeUsing” to specify the type of normalization to facilitate comparison between different libraries.

Crafting the final prompts

The initial prompts were revised to include the missing details (in bold) discovered during code review.

 

Act as an experienced bioinformatician proficient in ChIP-Seq data analysis, you will assist me by writing code with number of lines as minimal as possible. Rest the thread if asked to. Reply “YES” if understand.

 

I have two fastq files in current folder from single-end sequencing of a ChIP-Seq library: ENCFF000AVS_1m.fastq.gz, and ENCFF000AVS_10m.fastq.gz. For each fastq file, align reads to the human reference genome, save to bam file, index it, and then covert it to bigwig file with CPM normalization. Tools to use: bowtie2, samtools, and deepTools. The index for bowtie2 is in the folder “../data/indx/bowtie2_whole_genome/” with “hg38” as the prefix. Use 24 CPU for the alignment. Please draft the code in bash.

 

Summary*
Aims

The goal of this session is to optimize the prompts which will guide ChatGPT to generate code for aligning sequencing files (fastq) and then generate a summarized file (bigwig) for the distribution of reads across the genome.

 

Materials and Methods

The original fastq sequencing file was obtained from ENCODE (ENCFF000AVS). In this analysis session, we extracted the first one million reads and the first ten million reads from the original file. In this way, we could tell how sequence depth impacts the finding TF bindings sites even from a visual inspection.

 

We aligned the sequencing file to the human reference (hg38) genome using Bowtie2 (v 2.4.2)1 and converted the output to sorted and indexed BAM files using Samtools (v 1.10)2. We then used the bamCoverage (v 3.5.1) function from deepTools 3 to convert the indexed BAM file to a bigwig file. The bigwig files were visualized by IGV (v2.9.2)4.

 

Results and Discussions

Visual inspection of the distribution of CTCF ChIP-Seq reads across the genome revealed a significantly higher number of peaks in the ten million reads dataset compared to the one million reads dataset. Data analysis, such as examining the signal at known CTCF target sites in ESC, conducting motif enrichment analysis, and comparing to other public CTCF ChIP-Seq datasets can be used as additional quality control for the current library.

Additional Comments*
Students are encouraged to test the robust of these prompts with new ChatGPT sessions. Alternatively, students could drop out certain details in the prompts to check what alternative solutions ChatGPT may come up with.

 

OPTICAL CASE STUDY: Phylogeny inference based on TP53
Description:

Test how ChatGPT can help students build a phylogenetic tree starting with aligned TP53 gene coding sequences from 9 species.

Coding language(s):

R

Prompts to define chatbot’s behaviors
Act as an experienced bioinformatician proficient in R, you will write code with number of lines as minimal as possible. Rest the thread if asked to. Reply “YES” if understand.
ITERATION 0
Prompts
You have a multiple alignment file named as tp53.clustal in ClustalW format. Please write R code that can load the file, calculate evolutionary distance, build a NJ tree, and visualize the phylogeny.
Code
# Load the required packages

library(ape)

 

# Read in the alignment file

aln <- read.alignment(“tp53.clustal”, format=”clustal”)

 

# Calculate the evolutionary distance

dist <- dist.dna(aln)

 

# Build the NJ tree

tree <- nj(dist)

 

# Plot the phylogeny

plot(tree)

Result
Failed with an error message.
ITERATION 1
Prompts
I got an error message complaining “could not find function “read.alignment”. Please fix it.
Code
# Load the required packages

library(seqinr)

library(ape)

 

# Read in the alignment file

aln <- read.alignment(“tp53.clustal”, format=”clustal”)

 

# Calculate the evolutionary distance using the Jukes-Cantor distance

dist <- dist.dna(aln, model=”JC”)

 

# Build the NJ tree

tree <- nj(dist)

 

# Plot the phylogeny

plot(tree)

Result
Failed with an error message.
ITERATION 2
Prompts
I got a warning message ” In dist.dna(aln) : NAs introduced by coercion”. Please fix it.
Code
# Load the required packages

library(seqinr)

library(ape)

 

# Read in the alignment file

aln <- read.alignment(“tp53.clustal”, format=”clustal”)

 

# Calculate the evolutionary distance using the Jukes-Cantor distance

dist <- dist.dna(aln, model=”JC”)

 

# Build the NJ tree

tree <- nj(dist)

 

# Plot the phylogeny

plot(tree)

Result
The code displays an unroot tree. Manual editing of the code and instruct ChatGPT to generate a rooted tree.
ITERATION 3
Prompts
I wrote an R program to read a multiple alignment file named as tp53.clustal in ClustalW format, calculate evolutionary distance, build a NJ tree, and visualize the phylogeny. But I want to root the tree with the Zebrafish sequence as the outgroup. Can you help me revise the R code? Below is my R code.

# Load the required packages

library(seqinr)

library(ape)

 

# Read in the alignment file

aln <- read.alignment(“tp53.clustal”, format=”clustal”)

 

# Calculate the evolutionary distance

dist <- dist.dna(as.DNAbin(aln))

 

# Build the NJ tree

tree <- nj(dist)

 

# Plot the phylogeny

plot(tree)

Codes
# Load the required packages

library(seqinr)

library(ape)

 

# Read in the alignment file

aln <- read.alignment(“tp53.clustal”, format=”clustal”)

 

# Identify the index of the Zebrafish sequence in the alignment

zebrafish_idx <- which(aln$species == “Danio_rerio”)

 

# Calculate the evolutionary distance

dist <- dist.dna(as.DNAbin(aln))

 

# Build the NJ tree

tree <- nj(dist, outgroup = zebrafish_idx)

 

# Plot the phylogeny

plot(tree)

Results
Failed with an error message.
ITERATION 4
Prompts
I got an error message complaining ” Error in nj(dist, outgroup = zebrafish_idx) unused argument (outgroup = zebrafish_idx)”. Please fix it.
Codes
# Load the required packages

library(seqinr)

library(ape)

 

# Read in the alignment file

aln <- read.alignment(“tp53.clustal”, format=”clustal”)

 

# Identify the index of the Zebrafish sequence in the alignment

zebrafish_idx <- which(aln$species == “Danio_rerio”)

 

# Calculate the evolutionary distance

dist <- dist.dna(as.DNAbin(aln))

 

# Build the NJ tree

tree <- nj(dist)

 

# Root the NJ tree with the Zebrafish sequence as the outgroup

tree_rooted <- root(tree, outgroup = zebrafish_idx)

 

# Plot the phylogeny

plot(tree_rooted)

Results
Failed with an error message.
ITERATION 5
Prompts
I got an error message complaining “Error in if (newroot == ROOT) { : argument is of length zero”. Please fix it.
Codes
# Load the required packages

library(seqinr)

library(ape)

 

# Read in the alignment file

aln <- read.alignment(“tp53.clustal”, format=”clustal”)

 

# Identify the index of the Zebrafish sequence in the alignment

zebrafish_idx <- which(aln$species == “Danio_rerio”)

 

# Calculate the evolutionary distance

dist <- dist.dna(as.DNAbin(aln))

 

# Build the NJ tree

tree <- nj(dist)

 

# Root the NJ tree with the Zebrafish sequence as the outgroup

tree_rooted <- root(tree, outgroup = zebrafish_idx)

 

# Plot the phylogeny

plot(tree_rooted)

Results
Failed with an error message.
Manual editing of the code.

The code displays a correctly rooted tree.

WRAPPING UP
Final codes with explanation to review:

Prompt: I created the following R code. Please add inline comments.

library(seqinr)

library(ape)

aln <- read.alignment(“tp53.clustal”, format = “clustal”)

dist_matrix <- dist.dna(as.DNAbin(aln))

zebrafish_index <- which(attr(dist_matrix, ‘Labels’) == “Zebrafish”)

tree <- nj(dist_matrix)

tree_rooted <- root(tree, zebrafish_index)

plot(tree_rooted, cex = 0.8)

# Load the required packages

library(seqinr)

library(ape)

 

# Read in the alignment file

aln <- read.alignment(“tp53.clustal”, format = “clustal”)

 

# Calculate the evolutionary distance matrix

dist_matrix <- dist.dna(as.DNAbin(aln))

 

# Identify the index of the Zebrafish sequence in the distance matrix labels

zebrafish_index <- which(attr(dist_matrix, ‘Labels’) == “Zebrafish”)

 

# Build the NJ tree

tree <- nj(dist_matrix)

 

# Root the NJ tree with the Zebrafish sequence

tree_rooted <- root(tree, zebrafish_index)

 

# Plot the phylogeny with modified label size

plot(tree_rooted, cex = 0.8)

Summarize the question, methods, and results
In this session, we aim to assess how ChatGPT can help beginners to build a phylogenetic tree starting with a set of aligned sequences of the TP53 gene from 9 species. We started with a prompt that specified the analysis steps and asked ChatGPT to generate R code. We then copied the code and tested it in R. If the execution generated an error or a warning, we presented the message to ChatGPT and requested it to fix the problem. If ChatGPT couldn’t fix the problem after two iterations, we manually edited the code.
Additional Comments
1.      ChatGPT was able to solute easy problems such as calling a function in a library but not loading the library beforehand.

2.      ChatGPT struggled to resolve more complicated problems, such as identifying and converting to correct data types, and setting advanced parameters in function calls.

3.      ChatGPT sometimes made up functions that did not exist.

 

OPTICAL CASE STUDY:  Robust Circle Fitting Under Matlab: Automatic Algorithm Development

 

Description

 

Circle fitting to computer vision is the fruit-fly problem to genetics. It takes a set of 2D coordinates as the input and tries to fit one or multiple circles to the given data. In model-based approaches, circle fitting is often cast as a nonlinear optimization problem. Analytical solutions only exist in simplified scenario (i.e., single circle and clean data). In more challengeing situations (e.g., multiple circles with outliers), RANSAC-based algorithm has been extensively studied in the literature but its Matlab implementation is often beyond the reach of Matlab beginners.

 

 

This case study demonstrates how to use ChatGPT as a virtual tutor to solve the problems of circle fitting with gradually increasing difficulty. By decomposing the challenging research problem into a series of chain-of-thought (CoT) prompts, we can assemble a sophisticated circle fitting algorithm from building blocks (like lego-building).

 

The four prompts in below will be executed in sequential order

1.     I need a matlab function for circle fitting.

2.     Can you make this fitting function robust to outliers?

3.     Can you extend it to robust fitting of multiple circles?

4.     If we know that the radius of all circles are within a range of [a,b], can you incorporate this prior information into the fitting algorithm?

 

ChatGPT failed to generate workable code when the first three prompts were combined into one, for example: “I need a matlab function for fitting multiple circles, robust to outlier points”. (see section “Additional comments”).

 

Coding language(s)

 

Matlab

 

Prompts to define chatbot’s behaviors
Act as an experienced computer vision researcher working with Matlab, you will write code with a number of lines as minimal as possible.

Reply “Yes” if understood.

 

STEP 1

Prompts
I need a matlab function for circle fitting.

 

Code
function [center, radius, residual] = fit_circle(x, y)

    % Fit a circle to a set of 2D points (x, y).

    % Return the center and radius of the fitted circle.

    % Also return the residuals (sum of squared distances between the points

    % and the fitted circle).

    %

    % Inputs:

    % x, y – the x and y coordinates of the points.

    %

    % Outputs:

    % center – a 2-element vector containing the x and y coordinates of the center of the fitted circle.

    % radius – the radius of the fitted circle.

    % residual – the sum of squared distances between the points and the fitted circle.

    %

    % Example:

    % x = [1, 2, 3];

    % y = [2, 3, 4];

    % [center, radius, residual] = fit_circle(x, y);

    %

    % Author: Zhen Li, adapted from the code by Damien Garcia.

 

    x = x(:);

    y = y(:);

    N = length(x);

    X = [x, y, ones(N, 1)];

    Y = -(x.^2 + y.^2);

    a = X\Y;

    center = [-a(1)/2, -a(2)/2];

    radius = sqrt((a(1)^2 + a(2)^2)/4 – a(3));

    residual = sum((X*a + Y).^2);

end

Result
 
Assessment
This is a relatively easy problem to solve.

 

STEP 2
Prompts
Can you make this fitting function robust to outliers?
Code
function [center, radius, inliers] = ransac_fit_circle(x, y, thresh)

% Fit a circle to a set of 2D points (x, y) using the RANSAC algorithm.

% Return the center and radius of the fitted circle, as well as the indices of the inliers.

%

% Inputs:

% x, y – the x and y coordinates of the points.

% thresh – the maximum allowed distance between a point and the fitted circle.

%

% Outputs:

% center – a 2-element vector containing the x and y coordinates of the center of the fitted circle.

% radius – the radius of the fitted circle.

% inliers – a vector of indices of the inliers (points within the threshold distance).

%

% Example:

% x = [1, 2, 3];

% y = [2, 3, 4];

% thresh = 0.1;

% [center, radius, inliers] = ransac_fit_circle(x, y, thresh);

%

% Author: Zhen Li, adapted from the code by Damien Garcia.

 

n = length(x);

best_center = [];

best_radius = [];

best_inliers = [];

max_inliers = 0;

for i = 1:1000

% Randomly select 3 points

idx = randsample(n, 3);

X = [x(idx), y(idx)];

[center_i, radius_i] = fit_circle(X(:, 1), X(:, 2));

 

% Compute the distance between each point and the fitted circle

dist = sqrt((x – center_i(1)).^2 + (y – center_i(2)).^2) – radius_i;

 

% Identify the inliers (points within the threshold)

inliers_i = abs(dist) < thresh;

 

% Update the best fitting circle if a larger set of inliers is found

if sum(inliers_i) > max_inliers

best_center = center_i;

best_radius = radius_i;

best_inliers = inliers_i;

max_inliers = sum(inliers_i);

end

end

 

center = best_center;

radius = best_radius;

inliers = find(best_inliers);

end

Result
 
Assessment
This is a textbook-based extension of the baseline algorithm.

 

STEP 3

Prompts
Can you extend it to robust fitting of multiple circles?
Code
function [centers, radii, inliers] = ransac_fit_multiple_circles(x, y, thresh, min_points, num_circles)

% Fit multiple circles to a set of 2D points (x, y) using the RANSAC algorithm.

% Return the centers, radii, and indices of the inliers of the fitted circles.

%

% Inputs:

% x, y – the x and y coordinates of the points.

% thresh – the maximum allowed distance between a point and the fitted circles.

% min_points – the minimum number of points required to fit a circle.

% num_circles – the number of circles to fit.

%

% Outputs:

% centers – a num_circles x 2 matrix containing the x and y coordinates of the centers of the fitted circles.

% radii – a num_circles x 1 vector containing the radii of the fitted circles.

% inliers – a cell array of length num_circles, where each element is a vector of indices of the inliers for a fitted circle.

%

% Example:

%     x = [1, 2, 3, 4, 5];

%     y = [2, 3, 4, 5, 6];

%     thresh = 0.1;

%     min_points = 3;

%     num_circles = 2;

%     [centers, radii, inliers] = ransac_fit_multiple_circles(x, y, thresh, min_points, num_circles);

%

% Author: Zhen Li

 

n = length(x);

points = [x, y];

inliers = cell(num_circles, 1);

centers = zeros(num_circles, 2);

radii = zeros(num_circles, 1);

for i = 1:num_circles

% Fit a circle using RANSAC

[center, radius, inlier_idx] = ransac_fit_circle(x, y, thresh);

centers(i, 🙂 = center;

radii(i) = radius;

inliers{i} = inlier_idx;

 

% Remove the inliers from the set of points

x(inlier_idx) = [];

y(inlier_idx) = [];

if length(x) < min_points

break;

end

end

end

 

Result
 
Assessment
The code run successfully. The circle fitting results look satisfactory.

 

STEP 4

Prompts
If we know that the radius of all circles are within a range of [a,b], can you incorporate this prior information into the fitting algorithm?
Code
function [centers, radii] = fit_circles_ransac_range(x, y, num_circles, num_iterations, a, b)

% Fit multiple circles to a set of points (x, y) using RANSAC.

% Returns the centers and radii of the fitted circles.

 

% Set default values for parameters

if nargin < 4

num_iterations = 1000;

end

if nargin < 5

a = 0;

end

if nargin < 6

b = Inf;

end

 

% Initialize best models

best_centers = zeros(num_circles, 2);

best_radii = zeros(num_circles, 1);

best_num_inliers = zeros(num_circles, 1);

 

% Loop over circles

for k = 1:num_circles

% Initialize inliers

inlier_indices = 1:length(x);

inlier_x = x;

inlier_y = y;

 

% Compute distance threshold

if isfinite(b)

r = (a + b) / 2;

R = (b – a) / 2;

threshold = 4 * R^2 / r^2;

else

threshold = 0.01;

end

 

% Loop over iterations

for i = 1:num_iterations

% Randomly select three points

sample_indices = randperm(length(inlier_x), 3);

sample_x = inlier_x(sample_indices);

sample_y = inlier_y(sample_indices);

 

% Fit circle to sample points

[sample_center, sample_radius] = fit_circle(sample_x, sample_y);

 

% Compute distance to circle for all points

distances = abs(sqrt((inlier_x – sample_center(1)).^2 + (inlier_y – sample_center(2)).^2) – sample_radius);

 

% Count inliers

inlier_indices = find(distances < threshold * sample_radius);

inlier_x = x(inlier_indices);

inlier_y = y(inlier_indices);

num_inliers = length(inlier_indices);

 

% Check if this is the best model so far

if num_inliers > best_num_inliers(k)

best_centers(k,:) = sample_center;

best_radii(k) = sample_radius;

best_num_inliers(k) = num_inliers;

end

end

 

% Refit model to all inliers

[centers(k,:), radii(k)] = fit_circle(inlier_x, inlier_y);

 

% Remove inliers from list of points

x(inlier_indices) = [];

y(inlier_indices) = [];

end

 

end

Result
 
Assessment
The code run successfully but the fitting results are not satisfactory.

This is because RANSAC can be trapped by a local minimum during the iterations.

This failure case is an open problem in the literature of CV.

WRAPPING UP

Aims

 

The goal of this session is to learn how to use Matlab to generate more and more sophisticated circle fitting algorithms.

 

Methods

 

By providing proper prompts to ChatGPT, the case study demonstrates how beginners could obtain codes from ChatGPT to experiment the circle fitting problem with increasing difficulty levels.

 

Results and Discussions

 

The Matlab codes generated by ChatGPT work fairly well for simple scenarios such as single circle and multiple circles without noise contamination.

 

The generated codes start to fail for more challenging situations when multiple circles intersect with each other and have interference from outliers (step 4). This is a well-known weakness of RANSAC-based solution- i.e., the result varies from experiment to experiment due to the randomness of sampling strategy,

 

The codes obtained from ChatGPT are satisfactory for novice CV researchers. They can use the codes generated by ChatGPT to jump-start their research.

 

Additional Comments
 

When the first three requirements were combined into one, “I need a matlab function for fitting multiple circles, robust to outlier points”, ChatGPT simply got frozen after outputting the main function. By hitting “regenerate response” button, among the three times of trying, each time ChatGPT tried a different approach but got stuck and could not finish the generation.

 

This is because for a sophisticated programming task that require calling subroutines (it is called “helper function” under Matlab), ChatGPT did not yet master the art of one function calling another one, including itself (known as recursion).

 

Final Thoughts

Our journey through the realms of bioinformatics, with ChatGPT and the OPTICAL model, is not just about overcoming hurdles in bioinformatics data analysis; it’s about creating a revolution in learning, an evolution in thinking, and most importantly, building a foundation for a future where learning is not confined by limitations but propelled by innovations and integrations like never before.

Reference:

Shue, E., Liu, L., Li, B., Feng, Z., Li, X., & Hu, G. (2023). Empowering beginners in bioinformatics with chatgpt. bioRxiv, 2023-03.

Shares