msg181903 - (view) |
Author: Guido Reina (guido.reina) |
Date: 2013-02-11 16:11 |
In the file: Lib/_markupbase.py, function: "_parse_doctype_element" there is: if '>' in rawdata[j:]: return rawdata.find(">", j) + 1 rawdata[j:] is being scanned twice. It would be better to do: pos = rawdata.find(">", j) if pos != -1: return pos + 1 Same thing in the function: "_parse_doctype_attlist": if ")" in rawdata[j:]: j = rawdata.find(")", j) + 1 else: return -1 It would be better to do: pos = rawdata.find(")", j) if pos != -1: j = pos + 1 else: return -1 |
|
|
msg181904 - (view) |
Author: Serhiy Storchaka (serhiy.storchaka) *  |
Date: 2013-02-11 16:18 |
> if '>' in rawdata[j:]: > return rawdata.find(">", j) + 1 See for this idiom. |
|
|
msg182179 - (view) |
Author: Ezio Melotti (ezio.melotti) *  |
Date: 2013-02-15 21:42 |
We should add some benchmarks to see if there is any difference between the two forms. |
|
|
msg182183 - (view) |
Author: Terry J. Reedy (terry.reedy) *  |
Date: 2013-02-15 22:10 |
'Enhancement' issues are for visible behavior additions (or occasionally, changes). This is intended to be an invisible small speedup, hence it is a 'performance' issue, and gets a different title. As explained in #17170, the change will not be a speedup if the substring being looked for is usually not there. The reason is the .find lookup and function call versus the direct syntax. Even if it is faster, I strongly doubt it would be hardly noticeable in the context of this function, which itself is a small piece of parsing an entire document, and it is against our policy to make such micro-optimizations in working code. The complete block in question Lib/_markupbase.py, 254:7 is rawdata = self.rawdata if '>' in rawdata[j:]: return rawdata.find(">", j) + 1 return -1 [Ugh. Localizing rawdata negates some of whatever advantage is gained from the double scan.] If I were to rewrite it, I would replace it with try: return self.rawdata.index(">", j) + 1 except ValueError: return -1 as better style, and a better example for readers, regardless of micro speed differences. But style-only changes in working code is also against our policy. So I would be closing this if Ezio had not grabbed it ;-). |
|
|
msg182185 - (view) |
Author: Ezio Melotti (ezio.melotti) *  |
Date: 2013-02-15 22:28 |
I would still do a benchmark, for these reasons: 1) IIRC rawdata might be the whole document (or at least everything that has not been parsed yet); 2) the '>' is very likely to be found; This situation is fairly different from the one presented in #17170, where the strings are shorts and the character is not present in the majority of the strings. Profiling and improving html.parser (and hence _markupbase) was already on my todo list (even if admittedly not anywhere near the top :), so writing a benchmark for it might be useful for further enhancements too. (Note: HTMLParser is already fairly fast, parsing ~1.3MB/s according to http://www.crummy.com/2012/02/06/0, but I've never done anything to make it even faster, so there might still be room for improvements.) |
|
|
msg182215 - (view) |
Author: Guido Reina (guido.reina) |
Date: 2013-02-16 10:36 |
I am attaching a .tgz file with the tests I have performed. The .tgz file contains also a README.txt file with more detailed information. I have done the following test: The script loads the HTML file 'search.html' in 'rawdata' and searches '>' in a loop from the position 'i', being i in: range(len(rawdata)). with the three variants: "in" + "find" (test1.py), "find" (test2.py), "index" (test3.py). Result: Script First run Second run Third run --------------------------------------------------------- test1.py 2.33 2.32 2.33 test2.py 0.75 0.74 0.76 test3.py 0.75 0.74 0.74 I don't know if the test is representative and whether it helps. If you think that the test could be improved/changed, just let me know, I will be happy to help. |
|
|
msg182626 - (view) |
Author: Ezio Melotti (ezio.melotti) *  |
Date: 2013-02-22 03:46 |
I did some macro-benchmarks and the proposed changes don't seem to affect the result (most likely because they are in _parse_doctype_element and _parse_doctype_attlist which should be called only once per document). I did some profiling, and this is the result: 4437196 function calls (4436748 primitive calls) in 36.582 seconds Ordered by: internal time ncalls tottime percall cumtime percall filename:lineno(function) 92931 7.400 0.000 17.082 0.000 parser.py:320(parse_starttag) 202 6.363 0.032 36.281 0.180 parser.py:171(goahead) 673285 5.302 0.000 5.302 0.000 {method 'match' of '_sre.SRE_Pattern' objects} 369418 3.272 0.000 4.554 0.000 _markupbase.py:48(updatepos) 83243 2.698 0.000 4.639 0.000 parser.py:421(parse_endtag) 308882 2.006 0.000 2.006 0.000 {method 'group' of '_sre.SRE_Match' objects} 270074 1.521 0.000 1.521 0.000 {method 'search' of '_sre.SRE_Pattern' objects} 92931 1.150 0.000 2.643 0.000 parser.py:378(check_for_whole_start_tag) 291079 1.028 0.000 1.028 0.000 {method 'count' of 'str' objects} 295892 0.883 0.000 0.883 0.000 {method 'startswith' of 'str' objects} 387439 0.733 0.000 0.733 0.000 {method 'lower' of 'str' objects} 403922 0.642 0.000 0.642 0.000 {method 'end' of '_sre.SRE_Match' objects} 124512 0.406 0.000 1.156 0.000 parser.py:504(unescape) 186775 0.326 0.000 0.326 0.000 {method 'start' of '_sre.SRE_Match' objects} 96213 0.255 0.000 0.255 0.000 {method 'endswith' of 'str' objects} 59522 0.253 0.000 0.253 0.000 {method 'rindex' of 'str' objects} 83226 0.215 0.000 0.215 0.000 parser.py:164(clear_cdata_mode) 6428 0.194 0.000 0.337 0.000 parser.py:507(replaceEntities) 106487 0.183 0.000 0.183 0.000 parser.py:484(handle_data) Excluding string and regex methods, the 3 slowest methods are parse_starttag, goahead, and updatepos. The attached patch adds a couple of simple optimizations to the first two -- I couldn't think a way to optimize updatepos. The resulting speedup is however fairly small, so I'm not sure it's worth applying the patch. I might try doing other benchmarks in future (should I add them somewhere in Tools?). |
|
|
msg381297 - (view) |
Author: Irit Katriel (iritkatriel) *  |
Date: 2020-11-17 22:20 |
I there anything left to do here or should we close it? |
|
|
msg381347 - (view) |
Author: Guido Reina (guido.reina) |
Date: 2020-11-18 14:30 |
It can be closed. |
|
|