GitHub - konstin/ColabFold: Making Protein folding accessible to all via Google Colab!

ColabFold

Making Protein folding accessible to all via Google Colab!

| Notebooks | monomers | complexes | mmseqs2 | jackhmmer | templates |
| --- | --- | --- | --- | --- | --- |
| AlphaFold2_mmseqs2 | Yes | Yes | Yes | No | Yes |
| AlphaFold2_batch | Yes | Yes | Yes | No | Yes |
| RoseTTAFold | Yes | No | Yes | No | No |
| AlphaFold2 (from Deepmind) | Yes | Yes | No | Yes | No |
| BETA (in development) notebooks | | | | | |
| OmegaFold | Yes | No | No | No | No |
| AlphaFold2_advanced | Yes | Yes | Yes | Yes | No |
| OLD retired notebooks | | | | | |
| AlphaFold2_complexes | No | Yes | No | No | No |
| AlphaFold2_jackhmmer | Yes | No | Yes | Yes | No |
| AlphaFold2_noTemplates_noMD | | | | | |
| AlphaFold2_noTemplates_yesMD | | | | | |

FAQ

Running locally

Note: Check out localcolabfold too.

Install ColabFold using the pip commands below. pip will resolve and install all required dependencies, and ColabFold should be ready to use within a few minutes. Please check the JAX documentation for how to get JAX to work on your GPU or TPU.

pip install "colabfold[alphafold] @ git+https://github.com/sokrypton/ColabFold"
pip install -q "jax[cuda]>=0.3.8,<0.4" -f https://storage.googleapis.com/jax-releases/jax_cuda_releases.html
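After installing, it can help to confirm which devices JAX actually sees before starting a long prediction run. The following is a minimal sketch and not part of ColabFold; the helper name `visible_jax_platforms` is made up for illustration. It relies only on `jax.devices()` and the `platform` attribute of each returned device.

```python
def visible_jax_platforms():
    """Return the platform name of every device JAX can see.

    Returns an empty list if JAX cannot be imported in this environment.
    """
    try:
        import jax
    except ImportError:
        return []
    # Each entry is a string such as "cpu", "gpu" or "tpu".
    return [device.platform for device in jax.devices()]


if __name__ == "__main__":
    platforms = visible_jax_platforms()
    if not platforms:
        print("JAX is not installed in this environment.")
    elif set(platforms) == {"cpu"}:
        print("JAX only sees the CPU; check the CUDA-enabled jax install.")
    else:
        print("JAX devices:", platforms)
```

If only `cpu` is reported on a machine with a GPU, the JAX installation notes are the place to look.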

For template-based predictions, also install kalign and hhsuite:

conda install -c conda-forge -c bioconda kalign2=2.04 hhsuite=3.3.0

For amber, also install openmm and pdbfixer:

conda install -c conda-forge openmm=7.5.1 pdbfixer

colabfold_batch

If no GPU or TPU is present, colabfold_batch can be executed (slowly) using only a CPU with the --cpu parameter.
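As a rough illustration of the CPU fallback, the sketch below assembles a `colabfold_batch` command and adds `--cpu` when no NVIDIA driver tool is found. The helper names and the `nvidia-smi` check are assumptions made for this example (the check is only a heuristic and says nothing about TPUs); the `msas` and `predictions` arguments are placeholders.

```python
import shutil


def build_batch_command(input_path, output_dir, use_cpu=False):
    """Assemble the argument list for colabfold_batch."""
    command = ["colabfold_batch"]
    if use_cpu:
        # --cpu is the parameter described above for CPU-only runs.
        command.append("--cpu")
    command += [str(input_path), str(output_dir)]
    return command


def nvidia_driver_present():
    # Heuristic only: assumes a usable NVIDIA GPU whenever nvidia-smi is on PATH.
    return shutil.which("nvidia-smi") is not None


if __name__ == "__main__":
    cmd = build_batch_command("msas", "predictions",
                              use_cpu=not nvidia_driver_present())
    # Print instead of running; pass cmd to subprocess.run(cmd, check=True) to execute.
    print(" ".join(cmd))
```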

Generating MSAs for large scale structure/complex predictions

First create a directory for the databases on a disk with sufficient storage (940GB (!)). Depending on where you are, this will take a couple of hours:

./setup_databases.sh /path/to/db_folder
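Because the download is large, a quick free-space check on the target disk before running the setup script can save hours. This is a minimal sketch using only the Python standard library; the helper name is hypothetical, and reading "940GB" as 940 × 10^9 bytes is an assumption.

```python
import shutil

# Assumption: the 940GB figure quoted above is read as decimal gigabytes.
REQUIRED_BYTES = 940 * 10**9


def has_enough_space(path, required_bytes=REQUIRED_BYTES):
    """Return True if the filesystem holding `path` has enough free space."""
    return shutil.disk_usage(path).free >= required_bytes


if __name__ == "__main__":
    target = "."  # replace with the directory that will hold the db folder
    free_gb = shutil.disk_usage(target).free / 10**9
    print(f"{free_gb:.0f} GB free; enough for the databases: {has_enough_space(target)}")
```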

Download and unpack mmseqs (Note: The required features aren't in a release yet, so currently, you need to compile the latest version from source yourself or use a static binary). If mmseqs is not in your PATH, replace mmseqs below with the path to your mmseqs:

This needs a lot of CPU

colabfold_search input_sequences.fasta /path/to/db_folder msas

This needs a GPU

colabfold_batch msas predictions

This will create an intermediate folder msas that contains all input multiple sequence alignments formatted as a3m files, and a predictions folder with all predicted pdb, json and png files.
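To get an overview of what a run produced, the output folder can be summarized by file type. The sketch below is an illustration only: the function name is made up, and it assumes nothing about ColabFold's file naming beyond the pdb, json and png extensions mentioned above.

```python
from collections import defaultdict
from pathlib import Path


def group_outputs(predictions_dir):
    """Map each file extension (without the dot) to its sorted file names."""
    grouped = defaultdict(list)
    for path in Path(predictions_dir).iterdir():
        if path.is_file():
            grouped[path.suffix.lstrip(".").lower()].append(path.name)
    return {ext: sorted(names) for ext, names in grouped.items()}


if __name__ == "__main__":
    folder = Path("predictions")  # the output folder from the command above
    if folder.is_dir():
        outputs = group_outputs(folder)
        for ext in ("pdb", "json", "png"):
            print(ext, len(outputs.get(ext, [])))
```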

Searches against the ColabFoldDB can be done in two different modes:

(1) Batch searches with many sequences against the ColabFoldDB require a machine with approx. 128GB RAM. The search should be performed on the same machine that called setup_databases.sh, since the database index size is adjusted to the main memory size. To search on computers with less main memory, delete the index by removing all .idx files; this will force MMseqs2 to create an index on the fly in memory. MMseqs2 is optimized for large input sequence sets. For batch searches use the --db-load-mode 0 option.

(2) Single query searches require the full index (the .idx files) to be kept in memory. This can be done, e.g., by using vmtouch. Thus, this type of search requires a machine with at least 768GB RAM for the ColabFoldDB. If the index is in memory, use the --db-load-mode 3 parameter in colabfold_search to avoid index loading overhead. If the database is already in memory, use the --db-load-mode 2 option.
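The choice of `--db-load-mode` in the two modes above can be written down as a small decision helper. This is a sketch under stated assumptions: the function names and keyword arguments are hypothetical, and raising an error for a single query without anything held in memory is a choice of this example that mirrors the requirement stated in (2), not documented tool behavior.

```python
def db_load_mode(batch_search, index_in_memory=False, database_in_memory=False):
    """Pick a --db-load-mode value following the two search modes above."""
    if batch_search:
        return 0  # mode (1): batch searches
    if index_in_memory:
        return 3  # mode (2): index kept in memory, e.g. via vmtouch
    if database_in_memory:
        return 2  # mode (2): database already in memory
    raise ValueError("single query searches expect the index or database in memory")


def build_search_command(fasta, db_folder, msa_dir, **mode_kwargs):
    """Assemble a colabfold_search argument list with the chosen load mode."""
    mode = db_load_mode(**mode_kwargs)
    return ["colabfold_search", str(fasta), str(db_folder), str(msa_dir),
            "--db-load-mode", str(mode)]


if __name__ == "__main__":
    cmd = build_search_command("input_sequences.fasta", "/path/to/db_folder",
                               "msas", batch_search=True)
    print(" ".join(cmd))
```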

Tutorials & Presentations

Projects based on ColabFold or helpers

Acknowledgments

How do I reference this work?



OLD Updates

11Mar2022: By default we now use the AlphaFold-multimer-v2 weights for complex modeling. We also offer the old complex modes "AlphaFold-ptm" or "AlphaFold-multimer-v1".
04Mar2022: ColabFold now uses a much more powerful server for MSAs and searches through the ColabFoldDB instead of BFD/MGnify. Please let us know if you observe any issues.
26Jan2022: AlphaFold2_mmseqs2, AlphaFold2_batch and colabfold_batch's multimer complex predictions are now by default reranked by iptmscore*0.8+ptmscore*0.2 instead of ptmscore.
16Aug2021: WARNING - MMseqs2 API is undergoing upgrade, you may see error messages.
17Aug2021: If you see any errors, please report them.
17Aug2021: We are still debugging the MSA generation procedure...
20Aug2021: WARNING - MMseqs2 API is undergoing upgrade, you may see error messages. To prevent Google Colab from crashing, for large MSAs we used -diff 1000 to get the 1K most diverse sequences. This caused some large MSAs to degrade in quality, as sequences close to the query were being merged into a single representative. We are working on updating the server (today) to fix this, by making sure that both diverse sequences and sequences close to the query are included in the final MSA. We'll post an update here when the update is complete.
21Aug2021: The MSA issues should now be resolved! Please report any errors you see. In short, to reduce MSA size we filter (qsc > 0.8, id > 0.95), take the 3K most diverse sequences at different qid (sequence identity to query) intervals and merge them. More specifically, 3K sequences at each of the qid intervals (0→0.2), (0.2→0.4), (0.4→0.6), (0.6→0.8) and (0.8→1). If you submitted your sequence between 16Aug2021 and 20Aug2021, we recommend submitting again for best results!
21Aug2021: The use_templates option in AlphaFold2_mmseqs2 is not working properly. We are working on fixing this. If you are not using templates, this does not affect the results. Other notebooks that do not use_templates are unaffected.
21Aug2021: The templates issue is resolved!
11Nov2021: AlphaFold2_mmseqs2 now uses AlphaFold-multimer for complex (homo/hetero-oligomer) modeling. Use the AlphaFold2_advanced notebook for the old complex prediction logic.
11Nov2021: ColabFold can be installed locally using pip!
14Nov2021: Template-based predictions work again in the AlphaFold2_mmseqs2 notebook.
14Nov2021: WARNING "Single-sequence" mode in AlphaFold2_mmseqs2 and AlphaFold2_batch was broken starting 11Nov2021. The MMseqs2 MSA was being used regardless of selection.
14Nov2021: "Single-sequence" mode is now fixed.
20Nov2021: WARNING "AMBER" mode in AlphaFold2_mmseqs2 and AlphaFold2_batch was broken starting 11Nov2021. Unrelaxed proteins were returned instead.
20Nov2021: "AMBER" is fixed thanks to Kevin Pan.
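The multimer reranking rule from the 26Jan2022 entry, read here as 0.8 × iptm + 0.2 × ptm, can be illustrated with a few lines of Python. This is a sketch for intuition only, not ColabFold's implementation; the function names, model names and score values are invented for the example.

```python
def multimer_rank_score(iptm, ptm):
    """Weighted score: 0.8 * iptm + 0.2 * ptm."""
    return 0.8 * iptm + 0.2 * ptm


def rerank(models):
    """Sort (name, iptm, ptm) tuples from best to worst weighted score."""
    return sorted(models,
                  key=lambda m: multimer_rank_score(m[1], m[2]),
                  reverse=True)


if __name__ == "__main__":
    # Hypothetical scores: model_1 has the higher ptm, model_2 the higher iptm.
    models = [("model_1", 0.50, 0.90), ("model_2", 0.70, 0.60)]
    for name, iptm, ptm in rerank(models):
        print(name, round(multimer_rank_score(iptm, ptm), 3))
```

With these made-up numbers, ranking by ptm alone would put model_1 first, while the weighted score favors model_2 because the interface term carries most of the weight.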