msg72910 - (view) |
Author: Matthew Barnett (mrabarnett) *  |
Date: 2008-09-09 21:07 |
This is a major reworking of the re module in Python 2.5.2. Added atomic groups. Added possessive quantifiers. Lookbehinds can now be variable length. Typically x2 faster. More changes to follow. |
|
|
msg72915 - (view) |
Author: Benjamin Peterson (benjamin.peterson) *  |
Date: 2008-09-09 21:16 |
Very interesting! Have you seen #3626? Another thing to note is that this will have to wait for 2.7 before it could potentially be integrated into the trunk. |
|
|
msg72916 - (view) |
Author: Antoine Pitrou (pitrou) *  |
Date: 2008-09-09 21:16 |
Hi, This looks impressive. You should really work from the current SVN trunk, not from the 2.5 sources, as there were some additions to the re module (mainly, a bytecode verifier contributed by Google). Also, if it can be split into several functionally independent patches, it will certainly help reviewing and (perhaps) integrating. Last remark: we are currently in the last phases of the release process for 2.6 and 3.0, so your work will probably not get a lot of attention in the next few weeks. Don't be discouraged though! |
|
|
msg72920 - (view) |
Author: Gregory P. Smith (gregory.p.smith) *  |
Date: 2008-09-09 21:26 |
This sounds really neat. but as Anotine said it'll be several weeks before any of us can give this serious attention. Definitely update to trunk and base your work off of that. quick comments: Your _sre.c diff appears to remove and replace the entire file. I don't think thats actually what you did. Chances are that diff is a result of changed newline f lea. Can you please make sure your file uses the previous line ending style so that the diff is more managable. Also, please include unit test updates to validate all new functionality in Lib/test/test_re.py. thanks for your efforts! -gps |
|
|
msg72921 - (view) |
Author: Gregory P. Smith (gregory.p.smith) *  |
Date: 2008-09-09 21:27 |
weird typo: s/f lea/formats/ |
|
|
msg72933 - (view) |
Author: Matthew Barnett (mrabarnett) *  |
Date: 2008-09-10 00:10 |
Corrected the diff file. I worked from Python 2.5.2 because that's what I'm currently using. I'll work from the trunk in future. |
|
|
msg72953 - (view) |
Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) *  |
Date: 2008-09-10 10:10 |
The correct link is #2636. Is it the same work? |
|
|
msg72965 - (view) |
Author: Matthew Barnett (mrabarnett) *  |
Date: 2008-09-10 14:51 |
This is different work from a different author than #2636. I've submitted what I've done so far in case my computer gets hit by a bus. :-) I still have more work to do on it, so I'm not concerned that it might not get any attention for a while. |
|
|
msg73162 - (view) |
Author: Terry J. Reedy (terry.reedy) *  |
Date: 2008-09-13 02:29 |
Atomic groups and possessive quantifiers appear to be relatively new: http://en.wikipedia.org/wiki/Regular_expressions for instance, has no mention of either that I found. http://www.regular-expressions.info/atomic.html http://www.regular-expressions.info/possessive.html seemed pretty clear to me. If they accurately describe what you are adding, they might be a starting point for the needed new manual sections. They should include a warning about carefully ordering alternatives lest one prune too much. |
|
|
msg73184 - (view) |
Author: Antoine Pitrou (pitrou) *  |
Date: 2008-09-13 13:40 |
By the way, the patch must be pretty incomplete, since there are almost no changes to _sre.c. Am I missing something? |
|
|
msg73189 - (view) |
Author: Matthew Barnett (mrabarnett) *  |
Date: 2008-09-13 15:20 |
Corrected the diff file, again. :-( The atomic groups and possessive quantifiers are as described at http://www.regular-expressions.info. |
|
|
msg73191 - (view) |
Author: Fredrik Lundh (effbot) *  |
Date: 2008-09-13 15:35 |
A bit more information on the changes to the core engine that are responsible for the 2x speedup (on what?) would be nice to have, I think (especially since you seem to have removed the KMP prefix scanner). (Isn't there a RE benchmark suite somewhere under tests?) |
|
|
msg73255 - (view) |
Author: Jeffrey C. Jacobs (timehorse) |
Date: 2008-09-15 11:21 |
Well, I implemented this months ago, but have been busy with other things so I haven't updated in a while. I noticed that the current version is missing my patches for Atomic Grouping / Possessive Qualifiers and a number of other patches I added in #2636 , but I do have working test cases and documentation updates for all the features I've so far implemented as well as splitting my work into separate sub-issues to make individual selection easier -- though with a number of my modifications, I found that there are SO MANY co-dependencies between, say, an engine modification (item 9) and adding Atomic Grouping / Possessive Qualifiers (item 1) and using shared Engine Constants (item 10) that I need a branch for Atomic, a branch for Atomic + Engine Mod 1, Atomic + Engine Mod 2, Atomic + Shared Constants, Atomic + Engine Mod 1 + Shared Constants AND Atomic + Engine Mod 2 + Shared Constants, and those were just THREE item co-dependencies. My code is all off of the trunk line for 2.6 and is currently available via my Bazaar repository under https://code.launchpad.net/~timehorse, where you can access any source tree via the bazaar version control client. The main reason I got stumped in my development which might otherwise have implemented ALL the issues I intended by now is that very situation I just described where development of new features is NOT mutually independent. You can see by all my branches that the multiplicity of A or B or C is just nightmarish, but what had to be done to keep issues independent. Anyway, I'm looking forward to having a look at your suggestions and think we may take best advantage with combining our work visa vi these things; after all, there's no point re-inventing the wheel. Thanks again for your contribution, Matthew! |
|
|
msg73261 - (view) |
Author: Matthew Barnett (mrabarnett) *  |
Date: 2008-09-15 13:19 |
I know what you mean about the dependencies! My current problem is that now I'm working with the current trunk, which means using Visual C++ Express 2008 instead of 2005. When debugging it's behaving like the debug info is out of date (showing the wrong values for variables in the debugger, single-stepping to the wrong line, etc). Rebuilding doesn't fix it. This might take a while. :-( |
|
|
msg73262 - (view) |
Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) *  |
Date: 2008-09-15 13:24 |
If you use Visual C++ Express 2005, you can build python from the PC\VS8.0 directory. |
|
|
msg73272 - (view) |
Author: Matthew Barnett (mrabarnett) *  |
Date: 2008-09-15 17:18 |
Used Visual C++ Express 2005 and the PC\VS8.0 directory. Same problem. |
|
|
msg73274 - (view) |
Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) *  |
Date: 2008-09-15 17:30 |
Do you have some big source files, of more than 10000 lines? |
|
|
msg73279 - (view) |
Author: Matthew Barnett (mrabarnett) *  |
Date: 2008-09-15 17:55 |
_sre.c is over 6000, but it does contain macros. I didn't have this problem when based on Python 2.5.2 in Express 2005. |
|
|
msg73280 - (view) |
Author: Jeffrey C. Jacobs (timehorse) |
Date: 2008-09-15 18:58 |
I have uploaded my test cases for Atomic Grouping / Possessive Qualifier, which is the common code we seem to have developed, as this may be of use to you. I also have documentation, but for now, would you mind running these tests against your code to see what the test outputs and also, how did you come up with the 2x result? Was that running the test suite? Usually, the regexp module is benchmarked against its test suite and there are timings built into that, so it may be useful if you could run the unmodified Lib/test/test_re.py you got from the trunk against the original code before modification and your modification, and do so a few times to get a good average result on multi-tasking systems, and post the results here so we can get a good statistical feel for how your new engine improves efficiency. Certainly, I support any Engine that works faster, as I myself have tried to make it faster but ended up with something 8% slower instead, alas. Also, good thinking on fixing the Negative Look-behind variable-width issue; I wish I'd thought of that, but I am curious about something: did you remove the optimization for fixed-width look-behind? The old code only allowed fixed with because that test can be done quickly; I noticed your code adds a lot of new REV opcodes to handle back-tracking and I assume look-behind logic for variable-width look-behind. It would be handy if the compiler and engine would be able to differentiate between fixed-width look-behind (optimized as was originally) and variable-width (using your advanced code). Thanks to AMK for some of these suggestions. Your changes are quite radical though so I am still trying to wade through them all and I still don't have a full-picture of how you've changed things, but there are some good ideas here, IMHO, especially if you do indeed get 2x speedup. |
|
|
msg73466 - (view) |
Author: Matthew Barnett (mrabarnett) *  |
Date: 2008-09-20 16:43 |
This patch is now based on Python 2.6rc2. I've reduced the number of macros and used functions instead, provided that it didn't cost much in terms of speed. In many cases it should be faster than the current release, and at worst no slower. More speed tests and tweaks needed. BTW, the impression I got was that look-behind was fixed width because the matching operations could only match forwards through the text, so in order to look behind it had to step back through the text and then match forwards. For simplicity and speed it insisted that it must be able to determine the size of the step beforehand, hence fixed-width. My addition was to add matching operations which worked matched backwards and also reverse the order of the matching for look-behinds, an idea which I got from a page on how it could be implemented in Perl 6! |
|
|
msg73472 - (view) |
Author: Matthew Barnett (mrabarnett) *  |
Date: 2008-09-20 20:32 |
Bugfix. |
|
|
msg73524 - (view) |
Author: Matthew Barnett (mrabarnett) *  |
Date: 2008-09-21 19:51 |
Fixed the matching of word boundaries when searching and matching in substrings. |
|
|
msg73543 - (view) |
Author: Matthew Barnett (mrabarnett) *  |
Date: 2008-09-22 00:15 |
regex_2.6rc2+2.diff is a bugfix for capture groups in look-behinds. |
|
|
msg73546 - (view) |
Author: Matthew Barnett (mrabarnett) *  |
Date: 2008-09-22 00:43 |
Needed to correct regex_2.6rc2+2.diff. |
|
|
msg73548 - (view) |
Author: Matthew Barnett (mrabarnett) *  |
Date: 2008-09-22 00:49 |
regex_2.6rc2+3.diff adds reverse searching with the re.REVERSE/re.R and "(?r)" flag. This gives results such as: >>> re.findall("(\w+)", "one two three") ['one', 'two', 'three'] >>> re.findall("(?r)(\w+)", "one two three") ['three', 'two', 'one'] See #516762. |
|
|
msg73583 - (view) |
Author: Matthew Barnett (mrabarnett) *  |
Date: 2008-09-22 16:02 |
regex_2.6rc2+4.diff fixes the ordering of the capture groups for reverse searching. |
|
|
msg73584 - (view) |
Author: Matthew Barnett (mrabarnett) *  |
Date: 2008-09-22 16:04 |
Correction of regex_2.6rc2+4.diff. (Aargh!) |
|
|
msg73684 - (view) |
Author: Matthew Barnett (mrabarnett) *  |
Date: 2008-09-24 00:37 |
Patch regex_2.6rc2+5.diff adds scoped and 'negative' flags for (?i), (?m) and (?s). The other flags remain unchanged in behaviour. See #433024, #433027 and #433028. |
|
|
msg73690 - (view) |
Author: Matthew Barnett (mrabarnett) *  |
Date: 2008-09-24 02:20 |
Patch regex_2.6rc2+6.diff is a bugfix. |
|
|
msg73708 - (view) |
Author: Jeffrey C. Jacobs (timehorse) |
Date: 2008-09-24 12:06 |
Matthew, I am really happy that you are making such progress on your engine, but can I PLEASE ask you to slow down for a moment? We have a lot of issues already listed in issue 2636 that is a catch-all for any Python 2.7 Regexp improvements, including your new engine, and I have been working frantically to try and document all the changes YOU are making here into the general Regexp 2.7 modification thread and setting up development trees in my Bazaar VCS repository for your work. There is also a recommended process for patching which makes it easier for the moderators to accept your patches which is to provide dis-entangled functionality and letting each improvement stand on its own two feet. In other words, let your engine stand ONLY on it's 2x speed improvements. We already have an implementation of Atomic Grouping / Possessive Qualifiers in issue 2636 but you have a version of your engine with both. We have no such 'feature-only' implementation for Variable-Length Look-Behind, for a Reverse flag, for Positionally Dependent modifier flags or modifier negation flags, as well as the zero-width Regular Expression split feature, though you and I completely agree these would all be great things to have! The more features you add to your engine as an all-or-nothing proposition, the less likely the moderators are going to be to adapt it because it's harder for them to examine the merits of each individual piece. That is why issue 2636 is broken up into items (currently 1 - 18, with your proposals bringing that up toward 22) and where alternate, combined features are provided if implementing 1 features would affect the implementation of another. Please understand that I personally have no problem with you redesigning large swaths of the Python Regular Expression engine. I would personally, like to see one of the design goals of any new engine not only be speed but better source comments because my main beef with the current engine is that it took me a month to understand and part of my redesign in issue 2636 9-1 was to add copious comments to the engine so that future developers would understand what was going on and be able to pick up from my work. I am not proposing we use my 9-1 engine because it is 8% slower than the current engine and I don't intend to propose anything slower. But it would be nice if you could add lots of comments to your engine so that others could help develop features against it. None the less, I will fully support your engine if it does indeed perform substantially and measurably faster and am happy to see all the Regexp issues you are finding are finally being implemented, all be it entangled with your engine. But let's return to the fundamentals of what you propose IN THIS THREAD, which simply to propose a new Regexp Engine which is 2x faster than the existing engine (Which I have allocated item 9-2 in the issue 2636 thread). I am not trying to put more work on your hands -- in fact, what I am trying to do is get us to co-operate on a better python Regexp Engine so that I can help you to achieve your goals. Please read issue 2636 and join the discussion there; feel free to add any new items you feel are missing from my existing list. And remember, each new feature needs tests and documentation changes. I have been doing each for any feature I undertake and would be happy to share those skills with you. Let's work together to see your engine be the new model, okay? Thanks. |
|
|
msg114610 - (view) |
Author: Georg Brandl (georg.brandl) *  |
Date: 2010-08-21 23:46 |
Work has gone on in #2636. |
|
|