Add suport to --batch directories. by pslacerda · Pull Request #384 · ccsb-scripps/AutoDock-Vina (original) (raw)

@pslacerda

On the way to maybe fix #382.

@pslacerda

Olá, @diogomart, saudações em língua portuguesa!

I don't know how to test this large set of options that vina has without proper automated testing. If it's desirable, I may hack some tests scripts for the command line usage covering indirectly most aspects of the C++ code.

rwxayheee

@pslacerda

hey @rwxayheee,

I don't know if it's ok because I don't understand the code well. I usually program only in Python and R.

@rwxayheee

Hi @pslacerda
Thanks for the updates! I want to assist with the review process as much as possible. I will take a look, do some testing and leave a few comments, but it will take some time. Just wanted to let you know that we haven't given up on this, and lets keep in touch :)

rwxayheee

rwxayheee

rwxayheee

rwxayheee

rwxayheee

@rwxayheee

Hi @pslacerda
No hurry on this or the other PR. I haven't finished looking at the PR yet, but would like to take a shortbreak so you have time to reply to my comments. You could look at my comments and see if they make sense, take the time to run some tests (compile and run some calculations) by yourself, and you're more than welcome to message me back at any time.
After you have made some changes, I will come back if no one is reviewing this. Thanks!

@pslacerda

@pslacerda1

@pslacerda1

@pslacerda

@rwxayheee

Hi @pslacerda

Thanks for the updates. I have just one minor edit request:

In your code, when batch_ligand_names.size() == 1 and the single input is not a valid directory (either because it is just a file or because the directory does not exist), the condition if (batch_ligand_names.size() == 1 && is_directory(batch_ligand_names[0])) will be false. I think these two conditions need to be handled differently. This is more of a limitation with the is_directory function Vina has in its utils.cpp.

I have a suggestion with minimal edition:
pslacerda/AutoDock-Vina@batch_dir...rwxayheee:batch_dir_edited

Could you try this and if it's working, feel free to pull. I haven't got the time to do any actual testing, sorry, but if you accept the edits I will come back and test both PRs.

A few more comments (not requesting more work, don't worry):

I noticed the lambda expression for processing ligand input that you initially implemented. I liked the reduction of code redundancy and that it can possibly enable multi threading of ligand input processing. If you need it for your own project, feel free to implement in your fork. Doing that needs more work and might not be the real bottleneck of the screening performance (but if you think so, please show us a benchmark and feel free to open an issue/discussion on it). Don't worry about including it in this PR

For your reference, we have another docking program, AD-GPU, with the option to iterate PDBQT files from a directory as @diogomart have mentioned before. Here's the code:
https://github.com/ccsb-scripps/AutoDock-GPU/blob/a46ab564d2ac5f1a1523f65239b43505a1c29364/host/src/getparameters.cpp#L931-L950
The behavior of AD-GPU batch option is slightly different (it does recursive search). But I'm not requesting more work to implement recursive iteration. Actually I'm not sure how repeated names will be handled if we enable recursive iteration.. please don't worry and don't add to this PR, unless you need it or more people request it

If we merge this PR, I will update the documentation. Thank you

@pslacerda

Hey @rwxayheee

In your code, when batch_ligand_names.size() == 1 and the single input is not a valid directory (either because it is just a file or because the directory does not exist), the condition if (batch_ligand_names.size() == 1 && is_directory(batch_ligand_names[0])) will be false. I think these two conditions need to be handled differently. This is more of a limitation with the is_directory function Vina has in its utils.cpp..

I have a suggestion with minimal edition: pslacerda/AutoDock-Vina@batch_dir...rwxayheee:batch_dir_edited

Could you try this and if it's working, feel free to pull. I haven't got the time to do any actual testing, sorry, but if you accept the edits I will come back and test both PRs.

Ok, I agree and will checkout your code soon...

I noticed the lambda expression for processing ligand input that you initially implemented. I liked the reduction of code redundancy and that it can possibly enable multi threading of ligand input processing. If you need it for your own project, feel free to implement in your fork. Doing that needs more work and might not be the real bottleneck of the screening performance (but if you think so, please show us a benchmark and feel free to open an issue/discussion on it). Don't worry about including it in this PR

Vina already does a great job in parallelization but maybe it can be improved because there are times that it is using less than 100%. But I don't know if I'll try to do it.

If we merge this PR, I will update the documentation. Thank you

Would be awesome and a honor this code be merged into such a relevant software.

@pslacerda

@rwxayheee

This is the behavior of --batch ./input when a non-existent ./input directory or file is given:

Error: could not open "./input" for reading. 

So your changes pslacerda/AutoDock-Vina@batch_dir...rwxayheee:batch_dir_edited aren't needed.

Technically, yes, when trying to open a non-existing file/folder there's always an error. But the error will be raised further down the code (probably set_ligand_from_file?) because boost::filesystem::directory_iterator doesn't throw the exception. I think it might be better to exit earlier.

Honestly I just did it to avoid unknown behaviors.. but I also don't know where the code exits when there are no PDBQT files under the given directory.

@rwxayheee

I want to add another exit after:

                    batch_ligand_names.clear();
                    for (const auto& entry : boost::filesystem::directory_iterator(in_dir)) {
                        if (entry.path().extension() == ".pdbqt") {
                            batch_ligand_names.push_back(entry.path().string());
                        }

In case batch_ligand_names becomes empty. I am only doing this for precaution, because I can't foresee too much what will happen further down the code. But it's ok if you don't want to change. I agree it's not necessary, and you have achieved the original purpose of this PR.

I will add this in another PR.

@rwxayheee

Don't worry about the changes. I will test the current state of this PR and approve if it passes the checks. Thanks!

@pslacerda

Technically, yes, when trying to open a non-existing file/folder there's always an error. But the error will be raised further down the code (probably set_ligand_from_file?) because boost::filesystem::directory_iterator doesn't throw the exception. I think it might be better to exit earlier.

Honestly I just did it to avoid unknown behaviors.. but I also don't know where the code exits when there are no PDBQT files under the given directory.

When the first entry is nonexistent, the exit path flows the same as any nonexistent --batch input . As you observed, it fails at set_ligand_from_file.

In case batch_ligand_names becomes empty. I am only doing this for precaution, because I can't foresee too much what will happen further down the code.

This seems a defensive approach to programming, what is desirable if possible.

@pslacerda

I noticed the lambda expression for processing ligand input that you initially implemented. I liked the reduction of code redundancy and that it can possibly enable multi threading of ligand input processing.

In my machine, the CPU consumption is below 100% for a few instants before the next batch molecule fully starts. On the level of that lambda function, this low CPU usage time fraction may be improved. Maybe this means a large gain when doing large batches.

What do you think?

@rwxayheee

Thanks again for contributing @pslacerda. This is a useful function. After testing on Mac and Windows, I think it behaves as expected and addressed the original purpose of this PR. I'm approving with the following comments.

1- Nonexistent file/folder
Program exits when attempting to set ligand. Exits with 1 and an acceptable error message. This is fine.

2- Valid folder, no matched PDBQT files
Program exits after computing grids. No error and exits with 0. This might be a problem for people who have Vina in a pipeline and need to check the exit status.

3- Usage notes
--batch currently takes single directory. When multiple directory names are given, they will be parsed as if they were ligand files.
--batch doesn't search recursively in the folder.

These are possible upgrades for devs to consider in future, or to warn users about in the documentation. Not requesting more changes.

@pslacerda Just a heads-up, we are merging this to develop, but it won't be available immediately in the release. Please give us some more time to wrap up other open issues, and we will have a patch for v1.2.6, or v1.2.7 with this feature. Thanks again for contributing.

In addition, if you want to incorporate features and use the modified Vina immediately in your own workflow, you can open a PR to yourself in your own fork. As you might have already noticed, our pre-configured Git Actions workflow is able to compile the modified Vina on a few different platforms, upon an open PR. If you expect to use this feature immediately, you can get the artifacts and continue developing with the feature. In future, if you need to modify Vina for your own purpose, you can also give users access to your own version of Vina :D

@rwxayheee

I noticed the lambda expression for processing ligand input that you initially implemented. I liked the reduction of code redundancy and that it can possibly enable multi threading of ligand input processing.

In my machine, the CPU consumption is below 100% for a few instants before the next batch molecule fully starts. On the level of that lambda function, this low CPU usage time fraction may be improved. Maybe this means a large gain when doing large batches.

What do you think?

Yes, I have seen that before because my jobs are monitored on a HPC. I think it's the initialization process, and that only a part of the execution is parallelized when processing an individual ligand. You are more than welcome to explore options to optimize the performance for batch calculation :D

@pslacerda1

@pslacerda

Hey, I have some points.

2- Valid folder, no matched PDBQT files Program exits after computing grids. No error and exits with 0. This might be a problem for people who have Vina in a pipeline and need to check the exit status.

This is the standard behavior according to the Unix philosophy. It is saying that all files were processed correctly, zero in this case.

3- Usage notes --batch currently takes single directory. When multiple directory names are given, they will be /parsed as if they were ligand files. --batch doesn't search recursively in the folder.

A small correction, when multiple --batch options are given, it's always tried to parse as ligand files. It's read as directory only when a single --batch is given as stated by the clause batch_ligand_names.size() == 1.

@pslacerda Just a heads-up, we are merging this to develop, but it won't be available immediately in the release. Please give us some more time to wrap up other open issues, and we will have a patch for v1.2.6, or v1.2.7 with this feature. Thanks again for contributing.

I don't know why, but you labeled this PR as "documentation". Do you intend to do some documentation on this branch? When do you will merge this PR?

In future, if you need to modify Vina for your own purpose, you can also give users access to your own version of Vina :D

I'm not so ambitious but it's funny to think about this. =]

@rwxayheee

Hi @pslacerda

A small correction, when multiple --batch options are given, it's always tried to parse as ligand files. It's read as directory only when a single --batch is given as stated by the clause batch_ligand_names.size() == 1.

No the flag itself doesn't need to be repeated, and the size is not counted by how many times it repeats. --batch file1 file2 is equivalent to --batch file1 --batch file2. This is why:

            ("batch", value< std::vector<std::string> >(&batch_ligand_names)->multitoken(), "batch ligand (PDBQT)")

Currently after this PR, --batch takes single directory name, or one or more ligand files. I made the notes to say that multiple directory names are not supported, which might be different from others' expectations. But I'm not requesting changes.

Actually --batch filename* is already working before this PR. See the current documentation: https://autodock-vina.readthedocs.io/en/latest/docking_in_batch.html#command-line-options
Before this PR, if people want to pass all files under a directly, they do something like --batch directory_name/*.pdbqt.

I don't know why, but you labeled this PR as "documentation". Do you intend to do some documentation on this branch? When do you will merge this PR?

Because of this PR I will need to update the documentation, but only after this feature is available in release. So that will be worked on in another PR, and i will take care of it.

I'm merging if you're satisfied with the current state of this PR. Let me know when it's ready. I have done the tests and I'm happy with the current state.

@pslacerda

Oh, I see. Had thought that multiple files required multiple --batch because the option description was on singular. So sorry for the noise, it solve less cases than I initially thought. Still relevant (not so much) because of the related issue. Em sex., 14 de fev. de 2025 16:46, Amy He ***@***.***> escreveu:

Hi @pslacerda <https://github.com/pslacerda> A small correction, when multiple --batch options are given, it's always tried to parse as ligand files. It's read as directory only when a single --batch is given as stated by the clause batch_ligand_names.size() == 1. No the flag itself doesn't need to be repeated, and the size is not counted by how many times it repeats. --batch file1 file2 is equivalent to --batch file1 --batch file2. This is why: ("batch", value< std::vectorstd::string >(&batch_ligand_names)->multitoken(), "batch ligand (PDBQT)") It takes single directory name, or one or more ligand files. I made the notes to say that multiple directory names are not supported, which might be different from others' expectations. But I'm not requesting changes. I'm merging if you're satisfied with the current state of this PR. Actually --batch filename* is already working before this PR. See the current documentation:https://autodock-vina.readthedocs.io/en/latest/docking_in_batch.html#command-line-options Before this PR, if people want to pass all files under a directly, they do something like --batch directory_name/*.pdbqt. I don't know why, but you labeled this PR as "documentation". Do you intend to do some documentation on this branch? When do you will merge this PR? Because of this PR I will need to update the documentation, but only after this feature is available in release. So that will be worked on in another PR, and i will take care of it — Reply to this email directly, view it on GitHub <#384 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AABEO3S2267NH577K3WY7ZT2PZBYRAVCNFSM6AAAAABWLLL5X2VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDMNRQGEZTOMBWGQ> . You are receiving this because you were mentioned.Message ID: ***@***.***> [image: rwxayheee]*rwxayheee* left a comment (ccsb-scripps/AutoDock-Vina#384) <#384 (comment)> Hi @pslacerda <https://github.com/pslacerda> A small correction, when multiple --batch options are given, it's always tried to parse as ligand files. It's read as directory only when a single --batch is given as stated by the clause batch_ligand_names.size() == 1. No the flag itself doesn't need to be repeated, and the size is not counted by how many times it repeats. --batch file1 file2 is equivalent to --batch file1 --batch file2. This is why: ("batch", value< std::vectorstd::string >(&batch_ligand_names)->multitoken(), "batch ligand (PDBQT)") It takes single directory name, or one or more ligand files. I made the notes to say that multiple directory names are not supported, which might be different from others' expectations. But I'm not requesting changes. I'm merging if you're satisfied with the current state of this PR. Actually --batch filename* is already working before this PR. See the current documentation:https://autodock-vina.readthedocs.io/en/latest/docking_in_batch.html#command-line-options Before this PR, if people want to pass all files under a directly, they do something like --batch directory_name/*.pdbqt. I don't know why, but you labeled this PR as "documentation". Do you intend to do some documentation on this branch? When do you will merge this PR? Because of this PR I will need to update the documentation, but only after this feature is available in release. So that will be worked on in another PR, and i will take care of it — Reply to this email directly, view it on GitHub <#384 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AABEO3S2267NH577K3WY7ZT2PZBYRAVCNFSM6AAAAABWLLL5X2VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDMNRQGEZTOMBWGQ> . You are receiving this because you were mentioned.Message ID: ***@***.***>

@rwxayheee

Hi @pslacerda

Thanks! I'm merging this to develop.

Just one last note on this:

We can enable --batch directory1 --batch directory2 (equivalently, --batch directory1 directory2) in the future if more people need it. But it requires more edits. Don't worry about adding it.

In my opinion it's still reasonable to maintain the single-directory, non-recursive behavior for now :D When ligands have identical names (e.g., --batch folder1/ligand.pdbqt folder2/ligand.pdbqt), Vina does not retain the input file’s path in the output—only appending a suffix, making it difficult to trace the source of the file.

This was referenced

Feb 18, 2025

This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.Learn more about bidirectional Unicode characters

[ Show hidden characters]({{ revealButtonHref }})