Issue 31324: support._match_test() used by test.bisect is very inefficient (original) (raw)

Issue31324

Created on 2017-09-01 15:11 by vstinner, last changed 2022-04-11 14:58 by admin. This issue is now closed.

Pull Requests
URL Status Linked Edit
PR 4420 closed serhiy.storchaka,2017-11-16 14:35
PR 4421 merged vstinner,2017-11-16 15:16
PR 4505 merged vstinner,2017-11-22 18:05
PR 4523 merged vstinner,2017-11-23 15:41
PR 4524 merged vstinner,2017-11-23 16:54
Messages (19)
msg301122 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2017-09-01 15:11
support._match_test() uses a nested loop calling fnmatch.fnmatchcase(). This function creates a temporary regular expression object. The cache of the re module works around the performance... if the length of support.match_tests fits into the cache. But when the list is longer, the function becomes dead slow... def _match_test(test): global match_tests if match_tests is None: return True test_id = test.id() for match_test in match_tests: if fnmatch.fnmatchcase(test_id, match_test): return True for name in test_id.split("."): if fnmatch.fnmatchcase(name, match_test): return True return False Maybe we should build a giant regex matching test_id at each, but cache the regex since support.match_tests can be modified anytime. I implemented this once, but I lost the code :-) Currently, it's possible to match 3 things: * test method name: test_exit * test class name: SysModuleTest * full test id: test.test_sys.SysModuleTest.test_exit It's also possible to use "*" joker character in a test name. I would like to keep these convenient CLI.
msg301123 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2017-09-01 15:13
Example of the performance issue: $ ./python -m test.bisect --fail-env-changed -o bisect test_asyncio [+] Iteration 1: run 756 tests/1512 (...) The first iteration takes forever because it runs tests with support.match_tests which contains 756 patterns.
msg301126 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2017-09-01 15:14
Workaround to run test.bisect on test_asyncio: diff --git a/Lib/test/support/__init__.py b/Lib/test/support/__init__.py index 522804be60..201d0665b2 100644 --- a/Lib/test/support/__init__.py +++ b/Lib/test/support/__init__.py @@ -1911,15 +1911,7 @@ def _match_test(test): if match_tests is None: return True test_id = test.id() - - for match_test in match_tests: - if fnmatch.fnmatchcase(test_id, match_test): - return True - - for name in test_id.split("."): - if fnmatch.fnmatchcase(name, match_test): - return True - return False + return (test_id in match_tests) def run_unittest(*classes): With this patch, running 756 tests takes 12 seconds, as expected.
msg301128 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2017-09-01 15:33
Oh, by the way, -m test -m TestCaseName doesn't work fully on test_asyncio. It only works if I pass the full test identifier. With TestCaseName, no test is run: haypo@selma$ ./python -m test -u all test_asyncio --fail-env-changed -m PollEventLoopTests Run tests sequentially 0:00:00 load avg: 6.97 [1/1] test_asyncio 1 test OK. Total duration: 458 ms Tests result: SUCCESS With a full test identifier, a test is run: haypo@selma$ ./python -m test -u all test_asyncio --fail-env-changed -m test.test_asyncio.test_events.PollEventLoopTests.test_create_datagram_endpoint_sock Run tests sequentially 0:00:00 load avg: 7.06 [1/1] test_asyncio Warning -- threading_cleanup() detected 1 leaked threads (count: 1, dangling: 2) test_asyncio failed (env changed) 1 test altered the execution environment: test_asyncio Total duration: 561 ms Tests result: ENV CHANGED
msg301129 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2017-09-01 15:34
Oops. Ignore my previous comment, I was using my temporary workaround! ()
msg301140 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2017-09-01 17:05
_match_test1 = re.compile('|'.join(map(fnmatch.translate, match_tests)), re.I).match def _match_test(test): test_id = test.id() return bool(_match_test1(test_id) or any(map(_match_test1, test_id.split(".")))
msg306362 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2017-11-16 14:38
I tested manually Serhiy's PR 4420: it works as expected! haypo@selma$ ./python -m test test_os --list-cases -m test.test_os.FileTests.test_access test.test_os.FileTests.test_access haypo@selma$ ./python -m test test_os --list-cases -m test_access test.test_os.FileTests.test_access haypo@selma$ ./python -m test test_os --list-cases -m FileTests test.test_os.FileTests.test_access test.test_os.FileTests.test_closerange test.test_os.FileTests.test_fdopen test.test_os.FileTests.test_large_read test.test_os.FileTests.test_open_keywords test.test_os.FileTests.test_read test.test_os.FileTests.test_rename test.test_os.FileTests.test_replace test.test_os.FileTests.test_symlink_keywords test.test_os.FileTests.test_write test.test_os.FileTests.test_write_windows_console haypo@selma$ ./python -m test test_os --list-cases -m 'FileTest*' test.test_os.FileTests.test_access test.test_os.FileTests.test_closerange test.test_os.FileTests.test_fdopen test.test_os.FileTests.test_large_read test.test_os.FileTests.test_open_keywords test.test_os.FileTests.test_read test.test_os.FileTests.test_rename test.test_os.FileTests.test_replace test.test_os.FileTests.test_symlink_keywords test.test_os.FileTests.test_write test.test_os.FileTests.test_write_windows_console haypo@selma$ ./python -m test test_os --list-cases -m 'test_acc*' test.test_os.FileTests.test_access test.test_os.StatAttributeTests.test_access_denied haypo@selma$ ./python -m test test_os --list-cases -m '*FileTests.test_access' test.test_os.FileTests.test_access
msg306363 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2017-11-16 14:51
$ time ./python -m test.bisect --fail-env-changed -o bisect test_asyncio -N1 ... Bisection failed after 2 iterations and 0:00:11 real 0m10,441s user 0m8,726s sys 0m0,292s There is good opportunity for optimizing the regular expression. All cases have the literal 'test.test_asyncio.' prefix, and even longer literal prefixes are common for tens cases.
msg306382 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2017-11-16 17:04
> There is good opportunity for optimizing the regular expression. All cases have the literal 'test.test_asyncio.' prefix, and even longer literal prefixes are common for tens cases. I implemented such code long time ago :-) https://github.com/haypo/hachoir3/blob/master/hachoir/regex/regex.py I'm not sure that this code works on Python 3, it was written for Python 2 but not only ported to Python 3 partially. It only supports a subset of regular expressions.
msg306387 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2017-11-16 17:14
I know :-) Unfortunately it is not obviously that these optimization actually optimize regular expressions. Due to the implementation particularities sometimes "unoptimized" code is faster than "optimized". I'm going to implement these optimizations in the re module, but it needs to be very careful, and may need changing the matching engine.
msg306640 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2017-11-21 14:01
I created a list of all test cases using: ./python -m test --list-cases > all_cases The list contains 29,569 test cases. Sadly, the set().__contains__ matcher of my PR 4421 cannot be taken because test_json produces two test cases called "json". IMHO it's a bug in test_json, all test identifiers should be "fully qualified". test_builtin has a similar issue: it produces identifiers like "builtins.bin" instead of "test.test_builtin.builtins.bin". If I removed the two "json" from all_cases, "./python -u -m test.bisect -i all_cases -v" takes less than one second before running the first test method, and so it's fast enough. With PR 4420, it takes around 17 seconds before running the first test. The compilation of the giant regex and matching test identifiers are slow. My PR has the same bad performance when the regex path is taken.
msg306666 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2017-11-21 16:22
We could split patterns on two parts and create both matchers. Then the final matching function could look like: return _match_test_patterns is None or test_id in id_set or regex_match(test_id) or any(map(regex_match, test_id.split("."))) I don't know whether it is worth to do.
msg306693 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2017-11-21 23:34
New changeset 803ddd8ce22f0de3ab42fb98a225a704c000ef06 by Victor Stinner in branch 'master': bpo-31324: Optimize support._match_test() (#4421) https://github.com/python/cpython/commit/803ddd8ce22f0de3ab42fb98a225a704c000ef06
msg306694 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2017-11-21 23:38
I merged my PR 4421 which is based on Serhiy's PR 4420. Thank you Serhiy for your reviews! -- Serhiy Storchaka: "We could split patterns on two parts and create both matchers (...) I don't know whether it is worth to do." My use bisect is to find a regression using the test.bisect tool. According to my tests, the fastest set().__contains__ matcher is always used, except of test_json. If you bisect a bug in test_json, it will work, but just be a little bit slower. Honestly, I don't think that we can go further in term of optimization. In my tests, it seems like creating the matcher function and filter all tests take less than 1 second, so it's enough for my use case ;-)
msg306743 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2017-11-22 18:06
Oops, the commit 803ddd8ce22f0de3ab42fb98a225a704c000ef06 broke the GCC job on Travis CI: --- Run tests sequentially 0:00:00 load avg: 48.91 [ 1/404] test_grammar 0:00:00 load avg: 48.91 [ 2/404] test_opcodes 0:00:00 load avg: 48.91 [ 3/404] test_dict 0:00:00 load avg: 48.91 [ 4/404] test_builtin 0:00:00 load avg: 48.91 [ 5/404] test_exceptions 0:00:01 load avg: 48.91 [ 6/404] test_types 0:00:01 load avg: 48.91 [ 7/404] test_unittest 0:00:03 load avg: 48.91 [ 8/404] test_doctest 0:00:04 load avg: 48.11 [ 9/404] test_doctest2 0:00:04 load avg: 48.11 [ 10/404] test_support 0:00:07 load avg: 48.11 [ 11/404] test___all__ Traceback (most recent call last): File "/home/travis/build/python/cpython/Lib/test/__main__.py", line 2, in main() File "/home/travis/build/python/cpython/Lib/test/libregrtest/main.py", line 585, in main Regrtest().main(tests=tests, **kwargs) File "/home/travis/build/python/cpython/Lib/test/libregrtest/main.py", line 510, in main self._main(tests, kwargs) File "/home/travis/build/python/cpython/Lib/test/libregrtest/main.py", line 536, in _main self.run_tests() File "/home/travis/build/python/cpython/Lib/test/libregrtest/main.py", line 458, in run_tests self.run_tests_sequential() File "/home/travis/build/python/cpython/Lib/test/libregrtest/main.py", line 379, in run_tests_sequential result = runtest(self.ns, test) File "/home/travis/build/python/cpython/Lib/test/libregrtest/runtest.py", line 105, in runtest support.set_match_tests(ns.match_tests) File "/home/travis/build/python/cpython/Lib/test/support/__init__.py", line 1958, in set_match_tests _match_test_patterns = tuple(patterns) TypeError: 'NoneType' object is not iterable --- Attached PR 4505 fixes it.
msg306756 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2017-11-22 19:59
New changeset bb11c3c967afaf263e00844d4ab461b7fafd6d36 by Victor Stinner in branch 'master': bpo-31324: Fix test.support.set_match_tests(None) (#4505) https://github.com/python/cpython/commit/bb11c3c967afaf263e00844d4ab461b7fafd6d36
msg306839 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2017-11-23 16:42
New changeset 70b2f8797146a56a6880743424f0bedf4fc30c62 by Victor Stinner in branch '3.6': [3.6] bpo-31324: Optimize support._match_test() (#4523) https://github.com/python/cpython/commit/70b2f8797146a56a6880743424f0bedf4fc30c62
msg306849 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2017-11-23 17:35
New changeset 35d99830f1878867e3964577741d9a2d5a7fc8f7 by Victor Stinner in branch '2.7': bpo-31324: Optimize support._match_test() (#4523) (#4524) https://github.com/python/cpython/commit/35d99830f1878867e3964577741d9a2d5a7fc8f7
msg306850 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2017-11-23 17:36
I tested set_match_tests() on python2.7 with: python2.7 -m test --list-cases > cases python2.7 -m test.bisect -i cases Good news: the set().__contains__ matcher can be used!
History
Date User Action Args
2022-04-11 14:58:51 admin set github: 75505
2017-11-23 17:36:02 vstinner set messages: +
2017-11-23 17:35:05 vstinner set messages: +
2017-11-23 16:54:19 vstinner set pull_requests: + <pull%5Frequest4461>
2017-11-23 16:42:11 vstinner set messages: +
2017-11-23 15:41:51 vstinner set pull_requests: + <pull%5Frequest4460>
2017-11-22 19:59:05 vstinner set messages: +
2017-11-22 18:06:32 vstinner set messages: +
2017-11-22 18:05:50 vstinner set pull_requests: + <pull%5Frequest4443>
2017-11-21 23:38:18 vstinner set status: open -> closedresolution: fixedmessages: + stage: patch review -> resolved
2017-11-21 23:34:05 vstinner set messages: +
2017-11-21 16:22:07 serhiy.storchaka set messages: +
2017-11-21 14:01:17 vstinner set messages: +
2017-11-16 17:14:15 serhiy.storchaka set messages: +
2017-11-16 17:04:10 vstinner set messages: +
2017-11-16 15:16:45 vstinner set pull_requests: + <pull%5Frequest4369>
2017-11-16 14:51:12 serhiy.storchaka set messages: +
2017-11-16 14:38:56 vstinner set messages: +
2017-11-16 14:35:25 serhiy.storchaka set keywords: + patchstage: patch reviewpull_requests: + <pull%5Frequest4368>
2017-09-01 17:05:37 serhiy.storchaka set messages: +
2017-09-01 15:34:03 vstinner set messages: +
2017-09-01 15:33:30 vstinner set messages: +
2017-09-01 15:14:42 vstinner set messages: +
2017-09-01 15:13:14 vstinner set messages: +
2017-09-01 15:12:27 vstinner set nosy: + louielu
2017-09-01 15:11:02 vstinner create