How to predict structures with AlphaFold
December 4, 2024
Introduction: AlphaFold
AlphaFold, developed by DeepMind, represents a groundbreaking advancement in the field of structural biology. It is an artificial intelligence (AI)-powered tool designed to predict the three-dimensional structures of proteins with remarkable accuracy. AlphaFold has revolutionized the way researchers approach protein structure determination, addressing one of biology’s most challenging problems: bridging the gap between a protein’s amino acid sequence and its folded structure.
The success of AlphaFold lies in its ability to leverage deep learning algorithms and massive datasets of protein sequences and structures. It uses innovative techniques such as attention mechanisms, evolutionary information, and co-evolutionary signals to predict atomic-level details of protein structures. The accuracy of AlphaFold’s predictions has been validated in independent assessments, including the Critical Assessment of Protein Structure Prediction (CASP) competition, where it outperformed other methods by a significant margin.
AlphaFold’s contributions extend beyond research laboratories, offering immense potential for drug discovery, enzyme engineering, and understanding diseases linked to protein misfolding or dysfunction. By democratizing access to high-quality protein models, AlphaFold has become an indispensable tool for scientists worldwide, accelerating discoveries across a broad range of biological disciplines.
In 2020, the AlphaFold project of Google’s DeepMind team demonstrated a major breakthrough in predicting protein structure from sequence. Their success in the blind CASP competition astonished many experts. For an overview, see Theoretical models, bearing in mind “The Joys and Perils of AlphaFold”. AlphaFold2 continued to have the highest success rate in the 2022 CASP 15 competition. In 2024, the AlphaFold team won half of the Nobel Prize in Chemistry.
In July 2021, DeepMind released AlphaFold as open-source code. Subsequently, several Colabs became available offering free structure prediction for user-submitted protein sequences. These Google Colabs (collaboratories) enable users to submit sequences via a web browser; the code executes in the Google cloud, using space private to each user, and returns predicted structures. In 2024, DeepMind provided the AlphaFold3 server.
Below are instructions for beginners who wish to predict structures.
Is An Empirical Model Available?
Empirical models are the most accurate, so you should look for those first. See How To Find A Structure. If there is no empirical model for your amino acid sequence, it may be useful to explore empirical models for closely-related sequences, if available. Even if an empirical structure is available, most have missing residues or atoms, and it may be useful to compare it with the AlphaFold prediction: see Missing residues and incomplete sidechains.
Does AlphaFold Database Already Have Your Protein?
Structure predictions for over 200,000,000 proteins are available from the AlphaFold Database. If your protein is there, download the prediction from the Database. Then you can explore it in the viewer of your choice. For beginners, FirstGlance in Jmol is easiest if you want more than a momentary impression. Upload your PDB file. FirstGlance has numerous unique conveniences yet considerable depth and power.
In 2024, AlphaFold Database predictions are always single protein chain structures without ligands. If your protein is an assembly of multiple chains, you will likely want to compare the Database structure with predictions from the latest servers capable of multiple-chain + ligand predictions (see below).
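If you already know the UniProt accession of your protein, the download step can be scripted. The sketch below is a minimal illustration, assuming the Database serves single-chain models at `https://alphafold.ebi.ac.uk/files/AF-<accession>-F1-model_v4.pdb`. The URL pattern, the `v4` version tag, and the function names are assumptions not taken from this page, so check the download links on the Database entry page if a request fails.

```python
import urllib.request

# Assumed file-naming pattern of the AlphaFold Database (not stated on this page).
AFDB_URL = "https://alphafold.ebi.ac.uk/files/AF-{accession}-F1-model_{version}.pdb"


def afdb_pdb_url(accession: str, version: str = "v4") -> str:
    """Build the assumed download URL for a UniProt accession."""
    accession = accession.strip().upper()
    if not accession.isalnum():
        raise ValueError(f"not a plausible UniProt accession: {accession!r}")
    return AFDB_URL.format(accession=accession, version=version)


def download_prediction(accession: str, out_path: str, version: str = "v4") -> str:
    """Download the predicted model to out_path (requires network access)."""
    url = afdb_pdb_url(accession, version)
    with urllib.request.urlopen(url) as response, open(out_path, "wb") as handle:
        handle.write(response.read())
    return out_path
```

For example, `download_prediction("P69905", "AF-P69905.pdb")` would attempt to fetch the model for human hemoglobin alpha; the saved file can then be uploaded to the viewer of your choice.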
Prediction Servers
You can submit one sequence (or a set of sequences) to the servers listed below, and they will return predicted structures, along with estimates of confidence in their predictions. This is not a comprehensive list. Please add other servers of interest to a broad range of users, including beginners.
- 2024: AlphaFoldServer.Com. Using AlphaFold3, predicts homo- and hetero-multimers involving protein, DNA, RNA, ligands, and modified residues. Straightforward to use; Guide and FAQ provided. Predictions are templated without user control (see FAQ). Free for non-commercial use: see Terms of Service and Output Terms of Use. From the DeepMind team[3].
  - If you get "Invalid character" after pasting in your sequence, try removing the breaks between lines.
  - Predicted models are in mmCIF format only. To convert to PDB format for use in FirstGlance (which colors by confidence/pLDDT automatically), see Converting AlphaFold3 CIF to PDB.
  - To easily obtain average pLDDT (predicted confidence) for a range of residues, see FirstGlance/How to get average pLDDT from AlphaFold models.
  - See also #Visualizing Predicted Structures and AlphaFold3 case studies.
- 2024: RoseTTAFold All-Atom (RFAA) predicts multimers of protein and nucleic acids with ligands. From the Baker team[4]. A free server limited to very small numbers of jobs is available from Neurosnap.
- 2024: CombFold predicts the structures of large protein complexes from subunit sequences using AlphaFold Multimer paired with a combinatorial method to assemble subunits. From Shor and Schneidman-Duhovny.
- 2022: ColabFold AlphaFold2_advanced. Predicts homo- and hetero-multimers using methods from the Steinegger/Mirdita team, before AlphaFold-multimer was available. Does NOT use templates. See Instructions from EMBL-EBI.
- 2022: AlphaFold2/Multimer Colab able to predict protein multimers. From the DeepMind team. See Instructions from EMBL-EBI.
- 2022: AlphaFill “transplants” missing ligands, cofactors and (metal) ions into AlphaFold models. From the Perrakis team. Ligand positioning is approximate.
“AlphaFill models are not meant or suitable for precise quantification of interactions between the transferred ligand(s) and the protein (e.g. hydrogen bonds, π-π or cation-π interactions, van der Waals interactions, hydrophobic interactions, halogen bonds).”
- 2021: RoseTTAFold at Robetta is an independent design from the Baker team, influenced by the design of AlphaFold2. Predicts monomers and multimers. Comparing results of RoseTTAFold with results of AlphaFold2/3 is worthwhile. At Robetta, open the Structure Prediction menu at the top, and choose Submit. Be sure to check RoseTTAFold under Optional!
The above servers are free for limited, non-commercial use.
Colabs: After multiple free jobs in a Colab, a new job may be refused. You may be informed that a GPU could not be assigned. In 2024, a subscription to Colab Pro is US $10/month. Paying this will enable you to do many more jobs.
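The "Invalid character" note for AlphaFold Server above suggests removing line breaks before pasting a sequence. A small sketch of cleaning pasted text before submitting it to any of these servers follows. The function name and the decision to accept only the 20 standard one-letter amino acid codes are assumptions made for illustration; individual servers may accept a different alphabet.

```python
# The 20 standard amino acids, one-letter codes (an illustrative whitelist).
STANDARD_AA = set("ACDEFGHIKLMNPQRSTVWY")


def clean_sequence(text: str) -> str:
    """Return a one-line, upper-case sequence from pasted FASTA or plain text."""
    lines = [ln.strip() for ln in text.splitlines()]
    # Drop FASTA header lines (those beginning with ">") and blank lines.
    body = [ln for ln in lines if ln and not ln.startswith(">")]
    # Join the remaining lines and remove any internal whitespace.
    sequence = "".join("".join(body).split()).upper()
    bad = sorted(set(sequence) - STANDARD_AA)
    if bad:
        raise ValueError(f"unexpected characters in sequence: {bad}")
    return sequence
```

Raising an error on unexpected characters, rather than silently dropping them, makes it harder to submit a sequence that differs from the one you intended.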
Visualizing Predicted Structures
FirstGlance in Jmol automatically colors its initial view of uploaded AlphaFold or RoseTTAFold models by estimated confidence pLDDT (blue for high confidence, red for low confidence). After you go to other views or tools, you can always get back to this color scheme by clicking Reliability Estimates in the Views tab.
- iCn3D automatically colors AlphaFold2 Database models loaded from their UniProt IDs. For AlphaFold files opened from your computer, use pLDDT on the pull-down Color menu.
- PyMOL and ChimeraX have no built-in confidence/pLDDT color scheme. Their rainbow/spectrum color schemes for temperature/B-factor do color by confidence/pLDDT, but with the colors inverted relative to the AlphaFold color scheme.
Upload your predicted PDB file to FirstGlance.Jmol.Org, which has many unique conveniences and capabilities.
You can easily visualize:
- Estimated confidence/pLDDT by touching an atom
- Average confidence/pLDDT (“reliability”) for the entire model, or for a specified sequence range.
- Secondary structure (Views tab)
- Distribution of hydrophobic vs. polar residues (Views tab: integral membrane proteins will have large hydrophobic surfaces while soluble proteins will have hydrophobic cores revealed by the Slab button)
- Distribution of charges (Views tab: nucleic acid binding sites will have clusters of positive charges)
- Disulfide bonds (Tools tab)
- Domain structure and positions of the ends of the polypeptide chain (Views tab: N -> C Rainbow)
- Locations of functional sites by evolutionary conservation (see instructions at How_to_see_conserved_regions)
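For readers who prefer a script, the average confidence for a sequence range can also be computed directly from the PDB file, because AlphaFold stores per-residue pLDDT in the temperature-factor column. This is a minimal sketch, assuming standard fixed-column PDB `ATOM` records and counting alpha carbons only. The function name is hypothetical, and this is not necessarily how FirstGlance computes its value.

```python
from statistics import mean
from typing import Iterable, Optional


def average_plddt(pdb_lines: Iterable[str], chain: str = "A",
                  start: Optional[int] = None,
                  end: Optional[int] = None) -> float:
    """Mean pLDDT over alpha carbons of one chain, optionally within start..end."""
    values = []
    for line in pdb_lines:
        # PDB columns: atom name 13-16, chain 22, residue number 23-26,
        # temperature factor (here pLDDT) 61-66.
        if not line.startswith("ATOM") or line[12:16].strip() != "CA":
            continue
        if line[21] != chain:
            continue
        resnum = int(line[22:26])
        if start is not None and resnum < start:
            continue
        if end is not None and resnum > end:
            continue
        values.append(float(line[60:66]))
    if not values:
        raise ValueError("no matching alpha carbons found")
    return mean(values)
```

A typical call would open the downloaded file and pass the file handle, for example `average_plddt(open("model.pdb"), start=10, end=50)`, where `model.pdb` stands in for your own file name.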
Instructions for ColabFold 2022
This procedure was written in 2022. In 2024, ColabFold is not necessarily the best or only place to submit your job: see #Prediction Servers.
Initially, AlphaFold and ColabFold performed best with single chains, which may include one or a few domains. The instructions below were written before ColabFold was adapted to prediction of multimers. If you are interested in complexes or alternate conformations, please see ColabFold instructions in the 2023 paper by Kim et al.
First, if your query is a single chain molecule, check the AlphaFold Database for the protein of interest. If its structure has already been predicted there, download it, and skip to Interpreting Results below. Otherwise …
Don’t worry about any of the options not specifically mentioned below. Leave them at their default settings.
1. Obtain the sequence of the protein of interest, e.g. at UniProt. Click on the FASTA button above the sequence in UniProt. Copy only the sequence, excluding the FASTA header line that begins with “>”.
2. Log in with a Google account at AlphaFold2_advanced. You can register for a free Gmail account to use for login. (Another free AlphaFold2 service is ColabFold. Using it may require a procedure different from the steps below.)
3. Paste in your sequence, making sure to completely replace the default sequence:
This input slot can accept sequences >1,000 amino acids, even though it is only one line. Sequence lengths of ~1,000 amino acids, or longer, may cause the Colab to fail, but can be predicted by submitting in two halves.[14] See also [14] and Joining AlphaFold predictions for halves of a molecule.
4. Enter a jobname in the slot below the sequence slot. The results.zip filename will begin with this jobname (but none of its contents include the jobname).
5. Scroll down to the section titled run alphafold, subsection Sampling options:
- num_models, the number of models to be predicted, is 5 by default. You could reduce this to 3 if you are in a hurry.
- max_recycles: Set this to 48 (or at least 12). The actual number of “recycles” performed will stop when the model has converged to the specified tolerance. The default of 3 recycles is often not enough for an optimal result.
- tol (tolerance): Set this to 0.5 Å (or 1.0 to get a faster result). When a prediction differs from the previous “recycle” prediction by less than this value (RMSD in Å between alpha carbons), the recycles will stop.
- num_samples (random seeds): Leave this at 1. Beware that if you increase this above 1, you will generate a number of models equal to the product of this value times num_models. This will proportionally increase the time to complete a result.
6. Open the Runtime menu at the very top of the page, and select Run all.
Don’t worry about the “Warning”. It is just Google’s disclaimer that they did not write the code you are about to execute. Click Run anyway.
Do NOT close your AlphaFold2_advanced browser tab until the job is completed. It appears that you will lose your job if you close the browser tab. You will be warned if you inadvertently try.
When the job is completed, a dialog to download a zip file will appear automatically. (Sometimes you will be asked for permission to enable download first.)
Static images of backbone renderings of predicted models will appear in your web browser at the bottom of the section run alphafold as each is completed.
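To make the Sampling options in step 5 concrete, the sketch below shows how the settings interact. The total number of models is num_models times num_samples, and recycling stops either when the change between successive recycles falls below tol or when max_recycles is reached. This illustrates the rule as described in the steps above and is not ColabFold's actual code; the function names and the list of per-recycle RMSD values are hypothetical.

```python
from typing import Sequence, Tuple


def total_models(num_models: int = 5, num_samples: int = 1) -> int:
    """Number of predicted models: num_models times num_samples."""
    return num_models * num_samples


def recycles_performed(rmsd_per_recycle: Sequence[float], tol: float = 0.5,
                       max_recycles: int = 48) -> Tuple[int, bool]:
    """Return (recycles run, converged?) for a series of recycle-to-recycle RMSDs.

    rmsd_per_recycle[i] is the alpha-carbon RMSD (in Å) between the prediction
    from recycle i+1 and the prediction before it.
    """
    for count, rmsd in enumerate(rmsd_per_recycle[:max_recycles], start=1):
        if rmsd < tol:
            return count, True  # change fell below the tolerance: stop early
    # Tolerance never reached within the allowed number of recycles.
    return min(len(rmsd_per_recycle), max_recycles), False
```

With num_models left at 5 and num_samples raised to 2, `total_models(5, 2)` gives 10 models, which is why increasing num_samples lengthens the job proportionally.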
Interpreting Results
Each predicted model has an average estimated reliability (pLDDT, predicted local distance difference test). Values above 90 are likely accurate; values below 70 indicate low confidence. For more about interpreting these values, please see the AlphaFold Database FAQ.
Each residue has an estimated reliability of its position (0-100) in the PDB temperature column. BEWARE that high values mean high confidence, and low values mean low confidence. This is the INVERSE of crystallographic temperature values, where low values are good and high values are bad. Uploading your PDB file to FirstGlance in Jmol will automatically color each residue by its estimated reliability.
Some models have high confidence in a folded domain, and low confidence in a segment that is not part of a compact domain. Low-confidence segments may be intrinsically disordered. It is useful to compare predictions of disorder with AlphaFold reliability estimates.
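One way to locate such segments is to scan the per-residue values for consecutive runs below the low-confidence cutoff. This is a minimal sketch, assuming the pLDDT values have already been read into (residue number, pLDDT) pairs in chain order. The function name, the default cutoff of 70, and the minimum run length are illustrative choices.

```python
from typing import Iterable, List, Tuple


def low_confidence_segments(residues: Iterable[Tuple[int, float]],
                            cutoff: float = 70.0,
                            min_length: int = 1) -> List[Tuple[int, int]]:
    """Return (first, last) residue numbers of consecutive runs with pLDDT < cutoff."""
    segments: List[Tuple[int, int]] = []
    run: List[int] = []  # residue numbers in the current low-confidence run
    for resnum, plddt in residues:
        if plddt < cutoff:
            run.append(resnum)
            continue
        # A confident residue ends the current run, if there is one.
        if run and len(run) >= min_length:
            segments.append((run[0], run[-1]))
        run = []
    # Close a run that extends to the end of the chain.
    if run and len(run) >= min_length:
        segments.append((run[0], run[-1]))
    return segments
```

The resulting ranges can then be compared with the output of a disorder predictor for the same sequence.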
If the predicted model has more than one domain, each domain may have high confidence, yet the relative positions of the domains may not. The estimated reliability of relative domain positions is in graphs of predicted aligned error (PAE) which are included in the downloadable zip file of results. For an explanation, see How should I interpret the relative positions of domains? in the AlphaFold Database FAQ.
You may be interested to note the number of recycles required for each model to converge to the specified tolerance. These numbers are not captured in the downloaded zip file.
The models will be ranked with number one having the highest estimated reliability (pLDDT). This is usually not in the order in which they were calculated. You might want to copy the ranking list, perhaps adding the number of recycles and final tolerance values:
| Model (ranked by pLDDT) | pLDDT | Recycles | Tolerance (Å) |
|---|---|---|---|
| rank_1_model_2_ptm_seed_0 | 62.46 | 10 | 0.33 |
| rank_2_model_3_ptm_seed_0 | 59.59 | 9 | 0.47 |
| rank_3_model_1_ptm_seed_0 | 55.63 | 12 | 0.52 |
Notice that the model predicted 2nd had the best estimated reliability (pLDDT), and that the model ranked 3rd did not quite achieve the specified tolerance of 0.5 Å RMSD after 12 recycles. (12 was specified as the maximum in this job.) Also notice that, in this case, all 3 models have low confidence (pLDDT < 70), and are of questionable value.
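If you copy the ranking list as text, a short script can pull out the values and flag low-confidence models. This sketch assumes each line looks like `rank_1_model_2_ptm_seed_0 pLDDT:62.46`, optionally followed by recycle and tolerance values added by hand. The regular expression and function name are illustrative and tied to this example output, not to a documented ColabFold format.

```python
import re
from typing import Dict, List

# Matches e.g. "rank_1_model_2_ptm_seed_0 pLDDT:62.46" (format assumed from the example).
RANK_LINE = re.compile(
    r"rank_(?P<rank>\d+)_model_(?P<model>\d+)\S*\s+pLDDT:(?P<plddt>\d+(?:\.\d+)?)"
)


def parse_ranking(text: str, low_cutoff: float = 70.0) -> List[Dict[str, object]]:
    """Extract rank, model number and pLDDT from each ranking line."""
    rows: List[Dict[str, object]] = []
    for line in text.splitlines():
        match = RANK_LINE.search(line)
        if match is None:
            continue  # skip header or unrelated lines
        plddt = float(match.group("plddt"))
        rows.append({
            "rank": int(match.group("rank")),
            "model": int(match.group("model")),
            "plddt": plddt,
            "low_confidence": plddt < low_cutoff,
        })
    return rows
```

Applied to the three lines of the example job, every row would be flagged as low confidence, matching the observation that all three models fall below pLDDT 70.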