GitHub - konstin/ColabFold: Making Protein folding accessible to all via Google Colab!

ColabFold - v1.5.5

For details of what was changed in v1.5, see change log!

Making Protein folding accessible to all via Google Colab!

Notebooks                                           monomers  complexes  mmseqs2  jackhmmer  templates
AlphaFold2_mmseqs2                                  Yes       Yes        Yes      No         Yes
AlphaFold2_batch                                    Yes       Yes        Yes      No         Yes
AlphaFold2 (from Deepmind)                          Yes       Yes        No       Yes        No
relax_amber (relax input structure)
ESMFold                                             Yes       Maybe      No       No         No

BETA (in development) notebooks
RoseTTAFold2                                        Yes       Yes        Yes      No         WIP
Boltz                                               Yes       Yes        Yes      No         No
BioEmu                                              Yes       No         Yes      No         No
OmegaFold                                           Yes       Maybe      No       No         No
AlphaFold2_advanced_v2 (new experimental notebook)  Yes       Yes        Yes      No         Yes

Check the wiki page old retired notebooks for unsupported notebooks.

FAQ

Running locally

For instructions on how to install ColabFold locally refer to localcolabfold or see our wiki on how to run ColabFold within Docker.

Generating MSAs for small scale local structure/complex predictions using the MSA server

When you pass a FASTA or CSV file containing your sequences to colabfold_batch it will automatically query the public MSA server to generate MSAs. You might want to split this into two steps for better GPU resource utilization:

# Query the MSA server and predict the structure on local GPU in one go:
colabfold_batch input_sequences.fasta out_dir

# Split querying MSA server and GPU predictions into two steps
colabfold_batch input_sequences.fasta out_dir --msa-only
colabfold_batch input_sequences.fasta out_dir
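
The two-step split can be wrapped in a small helper so that the GPU step only starts once the MSA step has succeeded. This is a minimal sketch: the function name run_split and the RUN dry-run prefix are hypothetical names introduced here and are not part of ColabFold.

```shell
#!/bin/sh
# Hypothetical helper (not part of ColabFold): run the MSA-only step,
# then the prediction step, stopping if the first step fails.
# Set RUN=echo to print the commands instead of executing them.
run_split() {
    input=$1
    out_dir=$2
    # Step 1: query the MSA server only.
    ${RUN:-} colabfold_batch "$input" "$out_dir" --msa-only || return 1
    # Step 2: run the predictions (reached only if step 1 succeeded).
    ${RUN:-} colabfold_batch "$input" "$out_dir"
}

# Dry run:  RUN=echo; run_split input_sequences.fasta out_dir
```

With RUN=echo the helper only prints the two commands, which is a cheap way to check paths before occupying a GPU node.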

Generating MSAs for large scale structure/complex predictions

First create a directory for the databases on a disk with sufficient storage (940GB (!)). Depending on where you are, this will take a couple of hours:

Note: MMseqs2 71dd32ec43e3ac4dabf111bbc4b124f1c66a85f1 (May 28, 2023) is used to create the databases and perform sequence searches in the ColabFold MSA server. Please use this version if you want to obtain the same MSAs as the server.

MMSEQS_NO_INDEX=1 ./setup_databases.sh /path/to/db_folder
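
Because the download is large, it may be worth checking free disk space before starting. The following is a sketch of such a pre-flight check; has_free_space is a hypothetical helper, and it treats 1 GB as 1024*1024 one-kilobyte blocks, which is an assumption about how the 940GB figure is meant.

```shell
#!/bin/sh
# Hypothetical pre-flight check (not part of ColabFold):
# succeed if the filesystem holding DIR has at least NEED_GB gigabytes free.
has_free_space() {
    dir=$1
    need_gb=$2
    # df -Pk prints one POSIX-format line per filesystem in 1024-byte blocks;
    # the fourth column of the second line is the available space.
    avail_kb=$(df -Pk "$dir" | awk 'NR == 2 { print $4 }')
    [ "$avail_kb" -ge $((need_gb * 1024 * 1024)) ]
}

# Usage (940 is the figure quoted above):
# has_free_space /path/to/db_folder 940 \
#     && MMSEQS_NO_INDEX=1 ./setup_databases.sh /path/to/db_folder
```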

If MMseqs2 is not installed in your PATH, pass --mmseqs <path to mmseqs> to colabfold_search:

# This needs a lot of CPU
colabfold_search --mmseqs /path/to/bin/mmseqs input_sequences.fasta /path/to/db_folder msas
# This needs a GPU
colabfold_batch msas predictions

This will create an intermediate folder msas that contains all input multiple sequence alignments formatted as a3m files, and a predictions folder with all predicted pdb, json and png files.
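
After both steps have finished, a quick count of the output files can confirm that the run produced something for each stage. This is only a sketch: count_ext is a hypothetical helper, and the number of pdb files per input depends on how many models were predicted, so the two counts are not expected to be equal.

```shell
#!/bin/sh
# Hypothetical helper (not part of ColabFold):
# count regular files with extension EXT directly inside DIR.
count_ext() {
    find "$1" -maxdepth 1 -type f -name "*.$2" | wc -l | tr -d ' '
}

# Usage:
# echo "a3m files: $(count_ext msas a3m)"
# echo "pdb files: $(count_ext predictions pdb)"
```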

The procedure above disables MMseqs2 preindexing of the various ColabFold databases by setting the MMSEQS_NO_INDEX=1 environment variable before calling the database setup script. For most use cases of colabfold_search, precomputing the index is not required and might hurt search speed. The precomputed index is necessary for fast response times of the ColabFold server, where the whole database is permanently kept in memory. In any case, batch searches require a machine with about 128GB RAM or, if the databases are to be kept permanently in RAM, with over 1TB RAM.
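
On Linux, the installed memory can be compared against these figures before starting a search. A minimal sketch, assuming a Linux host where /proc/meminfo reports MemTotal in kB; mem_total_gib is a hypothetical helper name. MemTotal is usually slightly below the nominal installed RAM, so treat any threshold as approximate.

```shell
#!/bin/sh
# Hypothetical helper (not part of ColabFold): print total RAM in whole GiB,
# read from a meminfo-style file (defaults to Linux's /proc/meminfo).
mem_total_gib() {
    awk '/^MemTotal:/ { print int($2 / 1024 / 1024) }' "${1:-/proc/meminfo}"
}

# Usage: warn if the machine is clearly below the ~128GB mentioned above.
# [ "$(mem_total_gib)" -ge 120 ] || echo "warning: probably too little RAM" >&2
```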

In some cases using a precomputed index can still be useful. For the following cases, call the setup_databases.sh script without the MMSEQS_NO_INDEX environment variable:

(0) As mentioned above, if you want to set up a server.

(1) If the precomputed index is stored on a very fast storage system (e.g., NVMe SSDs), it might be faster to read the index from disk than to compute it on the fly. In this case, the search should be performed on the same machine that called setup_databases.sh, since the precomputed index is created to fit within the given main memory size. Additionally, pass the --db-load-mode 0 option to make sure the database is read once from the storage system before use.

(2) Fast single-query searches require the full index (the .idx files) to be kept in memory. This can be done, e.g., by using vmtouch. Thus, this type of search requires a machine with at least 768GB to 1TB RAM for the ColabFoldDB. If the index is present in memory, use the --db-load-mode 2 parameter in colabfold_search to avoid index loading overhead.

If no index was created (MMSEQS_NO_INDEX=1 was set), then --db-load-mode does not do anything and can be ignored.
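
The cases above can be condensed into a small helper that prints the matching --db-load-mode argument. This is a sketch under assumptions: db_load_mode_args is a hypothetical name, and it assumes the .idx files sit directly inside the database folder.

```shell
#!/bin/sh
# Hypothetical helper (not part of ColabFold): print the --db-load-mode
# argument suggested by the cases above, or nothing if no index exists.
#   $1: database folder (assumed to contain the .idx files directly)
#   $2: "memory" if the index is already held in RAM (e.g. via vmtouch),
#       anything else if it should be read once from fast storage
db_load_mode_args() {
    # Without .idx files there is no precomputed index and the flag is ignored.
    ls "$1"/*.idx >/dev/null 2>&1 || return 0
    if [ "${2:-}" = "memory" ]; then
        echo "--db-load-mode 2"
    else
        echo "--db-load-mode 0"
    fi
}

# Usage (the unquoted substitution is intentional so the flag splits in two):
# colabfold_search $(db_load_mode_args /path/to/db_folder memory) \
#     input_sequences.fasta /path/to/db_folder msas
```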

Generating MSAs on the GPU

GPU-accelerated search was recently introduced in MMseqs2 and is now supported in ColabFold. To use it, you need to adjust the database setup and how you run colabfold_search.

GPU database setup

To set up the GPU databases, run the setup_databases.sh command with GPU=1:

GPU=1 ./setup_databases.sh /path/to/db_folder

This will download and set up the GPU databases in the specified folder. Note that we do not set MMSEQS_NO_INDEX=1 here, because the GPU search benefits from the indices, which are kept in GPU memory.

GPU search with colabfold_search

To run the MSA search on the GPU, it is recommended (although not required) to start a GPU server before running the search; this server will keep the indices in the GPU memory and will be used to accelerate the search. To start a GPU server, run:

mmseqs gpuserver /path/to/db_folder/colabfold_envdb_202108_db --max-seqs 10000 --db-load-mode 0 --prefilter-mode 1 &
PID1=$!
mmseqs gpuserver /path/to/db_folder/uniref30_2302 --max-seqs 10000 --db-load-mode 0 --prefilter-mode 1 &
PID2=$!

By default, this server will use all available GPUs and split the database evenly across them. To restrict the number of GPUs used, set the environment variable CUDA_VISIBLE_DEVICES to a specific GPU or set of GPUs, e.g., CUDA_VISIBLE_DEVICES=0,1. You can control how many sequences are loaded onto the GPU with the --max-seqs option. If your database is larger than the available GPU memory, the GPU server will efficiently swap the required data in and out of the GPU memory, overlapping data transfer and computation. The GPU server is started in the background and continues to run until you stop it explicitly by killing the processes with kill $PID1 and kill $PID2.
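
Since the servers keep running until they are killed, a script that starts them may want to stop them automatically when it exits. The following sketch shows one way to do that with a shell trap; start_server, stop_servers and SERVER_PIDS are hypothetical names introduced here, not part of MMseqs2 or ColabFold.

```shell
#!/bin/sh
# Hypothetical wrapper: start background jobs, remember their PIDs,
# and stop all of them when the script exits.
SERVER_PIDS=""

start_server() {
    "$@" &                             # launch the given command in the background
    SERVER_PIDS="$SERVER_PIDS $!"      # record its PID
}

stop_servers() {
    for pid in $SERVER_PIDS; do
        kill "$pid" 2>/dev/null || true
        wait "$pid" 2>/dev/null || true   # reap the job so it is really gone
    done
    SERVER_PIDS=""
}

trap stop_servers EXIT

# Usage, with the gpuserver commands from above:
# start_server mmseqs gpuserver /path/to/db_folder/colabfold_envdb_202108_db \
#     --max-seqs 10000 --db-load-mode 0 --prefilter-mode 1
# start_server mmseqs gpuserver /path/to/db_folder/uniref30_2302 \
#     --max-seqs 10000 --db-load-mode 0 --prefilter-mode 1
```

The trap fires on normal exit, so the servers do not linger if the search step fails partway through the script.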

You can then run colabfold_search with the --gpu and --gpu-server options enabled:

colabfold_search --mmseqs /path/to/bin/mmseqs --gpu 1 --gpu-server 1 input_sequences.fasta /path/to/db_folder msas

You can also run the search with only the --gpu option enabled if you do not want to start a GPU server, but the GPU server option is generally faster. As with the GPU server, you can control which GPUs are used for the search via the CUDA_VISIBLE_DEVICES environment variable.

Tutorials & Presentations

Projects based on ColabFold or helpers

Acknowledgments

How do I reference this work?



OLD Updates

31Jul2023: The ColabFold MSA server is back to normal. It was using the older DB (UniRef30 2202/PDB70 220313) from 27th ~8:30 AM CEST to 31st ~11:10 AM CEST.
27Jul2023: ColabFold MSA server issue: We are using the backup server with old databases (UniRef30 2202/PDB70 220313) starting from ~8:30 AM CEST until we resolve the issue. Resolved on 31Jul2023 ~11:10 CEST.
12Jun2023: New databases! UniRef30 updated to 2302 and PDB to 230517. We now use PDB100 instead of PDB70 (see notes in the main notebook).
12Jun2023: We introduced a new default pairing strategy: Previously, for multimer predictions with more than 2 chains, we only paired if all sequences taxonomically matched ("complete" pairing). The new default "greedy" strategy pairs any taxonomically matching subsets.
30Apr2023: Amber is working again in our ColabFold Notebook.
29Apr2023: Amber is not working in our Notebook due to a Colab update.
18Feb2023: v1.5.2
  - fixing: memory leak for large proteins
  - fixing: --use_dropout (random seed was not changing between recycles)
06Feb2023: v1.5.1
  - fixing: --save-all/--save-recycles
04Feb2023: v1.5.0 - ColabFold updated to use AlphaFold v2.3.1!
03Jan2023: The MSA server's faulty hardware from 12/26 was replaced. There were intermittent failures on 12/26 and 1/3. Currently, there are no known issues. Let us know if you experience any.
10Oct2022: Bugfix: random_seed was not being used for alphafold-multimer. The same structure was returned regardless of the defined seed. This has been fixed!
13Jul2022: We have set up a new ColabFold MSA server provided by the Korean Bioinformation Center. It provides accelerated MSA generation. We updated UniRef30 to 2022_02 and PDB/PDB70 to 220313.
11Mar2022: We use AlphaFold-multimer-v2 weights by default for complex modeling. We also offer the old complex modes "AlphaFold-ptm" or "AlphaFold-multimer-v1".
04Mar2022: ColabFold now uses a much more powerful server for MSAs and searches through the ColabFoldDB instead of BFD/MGnify. Please let us know if you observe any issues.
26Jan2022: Multimer complex predictions from AlphaFold2_mmseqs2, AlphaFold2_batch and colabfold_batch are now reranked by default by iptmscore*0.8+ptmscore*0.2 instead of ptmscore.
16Aug2021: WARNING - MMseqs2 API is undergoing an upgrade, you may see error messages.
17Aug2021: If you see any errors, please report them.
17Aug2021: We are still debugging the MSA generation procedure...
20Aug2021: WARNING - MMseqs2 API is undergoing an upgrade, you may see error messages. To keep Google Colab from crashing, for large MSAs we used -diff 1000 to get the 1K most diverse sequences. This caused some large MSAs to degrade in quality, as sequences close to the query were being merged into a single representative. We are working on updating the server (today) to fix this, by making sure that both diverse sequences and sequences close to the query are included in the final MSA. We'll post an update here when the update is complete.
21Aug2021: The MSA issues should now be resolved! Please report any errors you see. In short, to reduce MSA size we filter (qsc > 0.8, id > 0.95) and take the 3K most diverse sequences at different qid (sequence identity to query) intervals and merge them. More specifically, 3K sequences at qid intervals (0→0.2), (0.2→0.4), (0.4→0.6), (0.6→0.8) and (0.8→1). If you submitted your sequence between 16Aug2021 and 20Aug2021, we recommend submitting again for best results!
21Aug2021: The use_templates option in AlphaFold2_mmseqs2 is not working properly. We are working on fixing this. If you are not using templates, this does not affect the results. Other notebooks that do not use templates are unaffected.
21Aug2021: The templates issue is resolved!
11Nov2021: [AlphaFold2_mmseqs2] now uses AlphaFold-multimer for complex (homo/hetero-oligomer) modeling. Use the [AlphaFold2_advanced] notebook for the old complex prediction logic.
11Nov2021: ColabFold can be installed locally using pip!
14Nov2021: Template-based predictions work again in the AlphaFold2_mmseqs2 notebook.
14Nov2021: WARNING "Single-sequence" mode in AlphaFold2_mmseqs2 and AlphaFold2_batch was broken starting 11Nov2021. The MMseqs2 MSA was being used regardless of selection.
14Nov2021: "Single-sequence" mode is now fixed.
20Nov2021: WARNING "AMBER" mode in AlphaFold2_mmseqs2 and AlphaFold2_batch was broken starting 11Nov2021. Unrelaxed proteins were returned instead.
20Nov2021: "AMBER" is fixed thanks to Kevin Pan