| msg55885 - (view) |
Author: Guido Ostkamp (ostkamp) |
Date: 2007-09-13 08:00 |
| Hello, a medium-sized regular expression crashes Python 2.5.1 as follows:

    Traceback (most recent call last):
      File "./regtest.py", line 14, in <module>
        m = rematch(pats)
      File "./regtest.py", line 12, in rematch
        return re.compile(pat).match
      File "/export/home/ostkamp/local/lib/python2.5/re.py", line 180, in compile
        return _compile(pattern, flags)
      File "/export/home/ostkamp/local/lib/python2.5/re.py", line 231, in _compile
        p = sre_compile.compile(pattern, flags)
      File "/export/home/ostkamp/local/lib/python2.5/sre_compile.py", line 530, in compile
        groupindex, indexgroup
    OverflowError: regular expression code size limit exceeded

This is apparently caused by some code in Modules/_sre.c and Modules/sre.h, as follows:

    self->code[i] = (SRE_CODE) value;
    if ((unsigned long) self->code[i] != value) {
        PyErr_SetString(PyExc_OverflowError,
                        "regular expression code size limit exceeded");
        break;
    }

An 'unsigned int' value is unnecessarily squeezed into an 'unsigned short' field defined in sre.h:

    #ifdef Py_UNICODE_WIDE
    #define SRE_CODE Py_UCS4
    #else
    #define SRE_CODE unsigned short
    #endif

On all systems I'm working on (SuSE Linux SLES 9, Solaris 8, etc.) the else branch of the ifdef applies, which chooses 'unsigned short'. I don't understand the relationship between Unicode and what is apparently the size of the regular expression stack here. Some experiments have shown that changing the 'unsigned short' to 'unsigned long' and rebuilding Python fixes the problem.

Here is a test program to reproduce the error:

    #!/usr/bin/env python
    import re, random, sys

    def randhexstring():
        return "".join(["%04x" % random.randint(0, 0xffff) for x in range(20)])

    pats = [randhexstring() for x in range(1000)]

    def rematch(pats):
        pat = '(?:%s)' % '|'.join(pats)
        return re.compile(pat).match

    m = rematch(pats)

    count = 0
    for s in pats * 100:
        if m(s):
            count += 1
    print count

Regards, Guido |
|
|
| msg55916 - (view) |
Author: Guido van Rossum (gvanrossum) *  |
Date: 2007-09-14 19:44 |
| /F? |
|
|
| msg56098 - (view) |
Author: Fredrik Lundh (effbot) *  |
Date: 2007-09-23 19:09 |
| Well, I'm not sure 81k qualifies as "medium-sized", really. If you look at the size distribution for typical REs (which are usually handwritten, not machine-generated), that's one or two orders of magnitude larger than "medium". (And even if this was guaranteed to work on all Python builds, my guess is that performance would be pretty bad compared to using a minimal RE and checking potential matches against a set. The "|" operator is mostly O(N), not O(1).)

As for fixing this, the "byte code" used by the RE engine uses a word size equal to the Unicode character size (sizeof(Py_UNICODE)) for the given platform. I don't think it would be that hard to set it to 32 bits also on platforms using 16-bit Unicode characters (if anyone would like to experiment, just set SRE_CODE to "unsigned long" in sre.h and see what happens when you run the test suite). |
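Effbot's set suggestion can be sketched as follows. This is purely illustrative and not code from the issue; the helper names and sizes are assumptions, with the input shrunk so the pattern compiles even on narrow (UCS2) builds.

```python
# Sketch: for machine-generated alternations of exact strings, test
# membership in a set instead of compiling a huge '(?:a|b|...)' pattern.
# Set lookup is O(1) on average; the alternation is tried branch by branch.
import random
import re

def randhexstring():
    # Same shape of data as the original reproducer, but small enough
    # to stay well under the code size limit.
    return "".join("%04x" % random.randint(0, 0xFFFF) for _ in range(20))

pats = [randhexstring() for _ in range(100)]

# Regex approach: pattern size grows linearly with the alternatives.
regex_match = re.compile("(?:%s)$" % "|".join(map(re.escape, pats))).match

# Set approach: no pattern-size limit, constant-time average lookup.
exact = set(pats)

# Both approaches agree on strings drawn from the pattern list.
assert all((regex_match(s) is not None) == (s in exact) for s in pats)
```

For exact-string matching the two are interchangeable; the set only falls short when the generated pattern actually needs regex features such as wildcards or anchors.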
|
|
| msg59775 - (view) |
Author: A.M. Kuchling (akuchling) *  |
Date: 2008-01-12 01:52 |
| Trying effbot's suggested experiment is easy, at least, and would provide useful info. If it fails, then fixing this bug might be difficult. |
|
|
| msg62143 - (view) |
Author: Guilherme Polo (gpolo) *  |
Date: 2008-02-07 12:00 |
| I tried Fredrik's solution against trunk and it works. I compiled Python with UCS2, so it is surely setting SRE_CODE to unsigned long. Before this change I got the same exception as reported by Guido Ostkamp. |
|
|
| msg65651 - (view) |
Author: Greg Detre (greg@gregdetre.co.uk) |
Date: 2008-04-20 18:16 |
| Dear all, I've just switched from linux to a mac, and I'm suddenly starting to experience this issue with a machine-generated regexp that I depend on. Are there any plans to fix this in a future version of python? Thank you, Greg |
|
|
| msg72319 - (view) |
Author: Adi (adi) |
Date: 2008-09-02 08:30 |
| Is there any progress on this bug? I am currently considering maintaining a branch of 2.5.2 which includes the patch suggested by effbot. |
|
|
| msg73781 - (view) |
Author: Jeffrey C. Jacobs (timehorse) |
Date: 2008-09-25 12:19 |
| It seems that changing the size type of the regular expression byte code is a nice quick fix, even though it doubles the size of a pattern. It may have the added benefit that most machine architectures available today are at least partially, if not fully, 32-bit oriented, so retrieving op codes may in fact be faster if we make this change.

OTOH, it implies something interesting IMHO about the repeat count limits we currently have. Repeat counts can be explicitly set up to 65534 times, because 65535, being the largest number you can express in a 16-bit unsigned integer, is currently reserved to mean Infinite. It seems to me this is a great opportunity to set that limit to (unsigned long)-1, since that repeat count is incredibly large.

OTOH, if size is an issue, we could change the way sizes are expressed in the regexp op codes (typically in skip counts) to be 15-bit, with the most significant bit reserved for 'extended' expressions. In this way, a value of 0xFFFFFFFF could be expressed as:

    0xFFFF 0xFFFF 0x0003

Of course, parsing numbers in this form is a pain, to say the least, and unlike in Python, the C library would not play nicely if someone tried to express a number that could not fit into what the architecture defines an int to be. Plus, there is the problem of how you express Infinite with this scheme. The advantage, though, would be that we don't have to change the op-code size, and these 'extended' counts would be very rare indeed.

Over all, I'm more of an Occam's Razor fan, in that the simplest solution is probably the best: just change the op-code size to unsigned long (which on SOME architectures would actually make it 64 bits!) and define the 'Infinite' constant as (unsigned long)-1. Mind you, I prefer defining the constant in Python, not C, and it would be hard for Python to determine that particular value, given that Python is meant to be 'the same' regardless of the underlying architecture, but that's another issue.

Anyway, as 2.6 is in beta, this will have to wait for Python 2.7 / 3.1, so I will add an item to Issue 2636 with respect to it. |
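The 15-bit "extended count" idea above can be modeled concretely. The sketch below is a hypothetical Python rendering of that encoding, written for illustration only; nothing like it exists in sre. Each 16-bit word carries 15 payload bits, and the most significant bit flags that another word follows, which reproduces the 0xFFFF 0xFFFF 0x0003 example from the message.

```python
# Hypothetical model of the variable-length count encoding discussed above.
# Words are emitted least-significant chunk first.

def encode_extended(value):
    """Encode a non-negative int as 16-bit words with 15 payload bits each."""
    words = []
    while True:
        chunk = value & 0x7FFF
        value >>= 15
        if value:
            words.append(chunk | 0x8000)  # MSB set: more words follow
        else:
            words.append(chunk)           # MSB clear: final word
            return words

def decode_extended(words):
    """Inverse of encode_extended: reassemble the 15-bit chunks."""
    value = 0
    for shift, word in enumerate(words):
        value |= (word & 0x7FFF) << (15 * shift)
        if not word & 0x8000:
            break
    return value

# The example from the message: 0xFFFFFFFF -> 0xFFFF 0xFFFF 0x0003
assert encode_extended(0xFFFFFFFF) == [0xFFFF, 0xFFFF, 0x0003]
assert decode_extended([0xFFFF, 0xFFFF, 0x0003]) == 0xFFFFFFFF
```

Small counts still fit in a single word, which is why extended counts would be rare in practice; the cost is only paid by the pathological patterns that need them.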
|
|
| msg94192 - (view) |
Author: (kristall) |
Date: 2009-10-17 22:48 |
| Aloha, I also use large REs. re.compile() worked fine under Python 2.6 (OS: Ubuntu Linux). After moving the code to Python 3.0 I get the same error as ostkamp did, and under 3.1 also. Under 3.1 I tried the fix that ostkamp described (changing 'short' to 'long' in Modules/sre.h) and rebuilt Python, but the error still occurs. I want to change to 3.x since my variables contain German text with umlauts (ä, ö, ü, etc.) and it's a pain to work with Unicode under 2.x. Is there anything else I can try, or is there a planned date when this bug will be fixed? I am thankful in advance for any help. kristall |
|
|
| msg94233 - (view) |
Author: Antoine Pitrou (pitrou) *  |
Date: 2009-10-19 12:40 |
| Kristall, can you post the troublesome regular expression? |
|
|
| msg95187 - (view) |
Author: Michael K Johnson (johnsonm) |
Date: 2009-11-13 17:00 |
| I also ran into this issue and dealt with it as suggested here, by switching to sets. Because my underlying code has to handle both small hand-crafted regular expressions and arbitrarily large machine-generated sets of paths, I subclassed set and implemented the match and search methods in my subclass, so that the underlying code works against both the hand-generated regular expressions and the machine-generated sets of paths. Hope this helps someone else who runs into this restriction. |
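The workaround described above can be sketched roughly as follows. The class name and the exact match/search semantics are guesses for illustration, not johnsonm's actual code.

```python
# Sketch: a set subclass exposing a compiled-pattern-like interface, so
# code written against re objects also accepts large sets of exact strings.
import re

class MatchableSet(set):
    """A set of exact strings that answers match() and search()."""

    def match(self, s):
        # re.match anchors at the start; exact membership is the closest
        # analogue for a set of complete strings.
        return s if s in self else None

    def search(self, s):
        # Accept any member occurring as a substring, as re.search would.
        for member in self:
            if member in s:
                return member
        return None

paths = MatchableSet(["/usr/bin/python", "/usr/lib/libc.so"])
small = re.compile(r"/usr/.*")

# Downstream code can call .match() without caring which object it got.
for matcher in (paths, small):
    assert matcher.match("/usr/bin/python")
```

Returning the matched string rather than a real match object is a simplification; code that inspects groups would need a richer stand-in.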
|
|
| msg95221 - (view) |
Author: Ezio Melotti (ezio.melotti) *  |
Date: 2009-11-14 00:34 |
| Michael, can you provide the regex or even better a testcase that shows the problem? |
|
|
| msg95320 - (view) |
Author: Michael K Johnson (johnsonm) |
Date: 2009-11-16 01:13 |
| The test case at the top of this issue reproduces just fine; if you are looking for a different test case, you'll have to specify what you don't like about it so that it's clear what you are looking for. I don't think there's any mystery about this issue; it seems perfectly well understood.

I commented merely to encourage others who run into this issue to consider one way of using sets, if they are running into the same case I was: trying to use a regular expression to match a candidate string against a large set of exact matches. I was doing this because the initial purpose of the interface I was working with was to allow small, hand-specified regular expressions; this interface (originally, and still, intended for hand-crafted regular expressions) was later additionally wrapped in code that automatically created regular expressions for it. That's why the interface was not originally crafted to use sets, and why it was not appropriate to simply change the interface to use sets. However, my interface also allows passing a callable which resolves the object at the time of use, so I merely passed a reference to a method which returned an object derived from set but which implemented the match and search methods.

If you REALLY want a simpler reproducer, this does it for me in the restricted case (i.e., not using UCS4 encoding):

    import re
    r = re.compile('|'.join(('%d'%x for x in range(7000))))

But I really don't think that additional test cases are a barrier here. Again, my goal was merely to suggest an easy way to use sets as a replacement for regexps, for machine-generated regexps intended to match against exact strings: subclass set and add the necessary methods, such as search and/or match. |
|
|
| msg99151 - (view) |
Author: Matthew Barnett (mrabarnett) *  |
Date: 2010-02-10 03:22 |
| As stated above, this is being addressed in issue #2636. My regex module handles the test case without complaint:

    >>> import regex
    >>> r = regex.compile('|'.join('%d'%x for x in range(7000)))
    >>> r.match("1000")
    <_regex.REGEX_Match object at 0x015D2920>
    >>> r.match("6999")
    <_regex.REGEX_Match object at 0x016DDC20>
|
|
|
| msg163863 - (view) |
Author: Antoine Pitrou (pitrou) *  |
Date: 2012-06-24 21:18 |
| This has probably been fixed in 3.3 in c67b7e0c818a. |
|
|
| msg164372 - (view) |
Author: Serhiy Storchaka (serhiy.storchaka) *  |
Date: 2012-06-30 06:40 |
| > This has probably been fixed in 3.3 in c67b7e0c818a. Then the issue may be closed? |
|
|
| msg164391 - (view) |
Author: Antoine Pitrou (pitrou) *  |
Date: 2012-06-30 15:36 |
| > Then the issue may be closed? Well, it's still present in 2.7 and 3.2 (assuming we consider it's important enough to fix). |
|
|
| msg175382 - (view) |
Author: Serhiy Storchaka (serhiy.storchaka) *  |
Date: 2012-11-11 18:51 |
| If we do consider it important enough to fix, here are patches. |
|
|
| msg176029 - (view) |
Author: Roundup Robot (python-dev)  |
Date: 2012-11-20 21:31 |
| New changeset 10314c9b7c5a by Antoine Pitrou in branch '2.7':
Issue #1160: Fix compiling large regular expressions on UCS2 builds.
http://hg.python.org/cpython/rev/10314c9b7c5a |
|
|
| msg176030 - (view) |
Author: Roundup Robot (python-dev)  |
Date: 2012-11-20 21:38 |
| New changeset a3579d766fb6 by Antoine Pitrou in branch '3.2':
Issue #1160: Fix compiling large regular expressions on UCS2 builds.
http://hg.python.org/cpython/rev/a3579d766fb6

New changeset 8b73a069ae4f by Antoine Pitrou in branch '3.3':
Merge test from issue #1160.
http://hg.python.org/cpython/rev/8b73a069ae4f

New changeset 10c1f9e05f4f by Antoine Pitrou in branch 'default':
Merge test from issue #1160.
http://hg.python.org/cpython/rev/10c1f9e05f4f |
|
|
| msg176031 - (view) |
Author: Antoine Pitrou (pitrou) *  |
Date: 2012-11-20 21:39 |
| I've committed the patch to 3.2 and 2.7, and added the test to 3.3 and default. Thank you! |
|
|