Issue 856617: HTMLParser parsers AT&T to AT
Created on 2003-12-09 02:47 by lhy719, last changed 2022-04-11 14:56 by admin. This issue is now closed.
Messages (4) | ||
---|---|---|
msg19332 - (view) | Author: Hammer Lee (lhy719) | Date: 2003-12-09 02:47 |
I use HTMLParser to parse HTML files. There is a mistake when the HTML content contains '&', as in "AT&T Research Labs Cambridge - WinVNC Version 3, 3, 3, 7". HTMLParser parses "AT&T Research" to "AT Research". It also happens with "ETTC&P EpSCTWeb_Fr Application Version 1, 0, 0, 1". I'm a newbie in Python and don't know how to solve it. |
||
msg19333 - (view) | Author: Jim Jewett (jimjjewett) | Date: 2003-12-11 18:32 |
Technically, that isn't legal HTML; they're supposed to write &amp; (follow the & with "amp;"), because & is an escape character. That said, it is a pretty common error in web pages. The parser already recovers at the next space (instead of waiting for a ";"), and I think it would be reasonable to just return the "&T" when T doesn't turn out to be a known entity. You would do this by overriding handle_entityref -- but to be honest, I suspect that you're "really" using some other library (or local code) which already does this, so you may have to make the modification there. | ||
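A minimal sketch of the handle_entityref override suggested above, written against Python 3's html.parser rather than the 2003-era HTMLParser module. The class name TextExtractor and the helper extract_text are hypothetical, and convert_charrefs=False is an assumption made so that handle_entityref is actually called for references in text.

```python
from html.entities import name2codepoint
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Hypothetical helper: collects text and keeps unknown entity refs."""

    def __init__(self):
        # Assumption: turn off automatic conversion so the parser reports
        # entity references through handle_entityref.
        super().__init__(convert_charrefs=False)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

    def handle_entityref(self, name):
        if name in name2codepoint:
            # Known entity such as "amp": emit the character it stands for.
            self.parts.append(chr(name2codepoint[name]))
        else:
            # Unknown "entity" such as the T in "AT&T": put the text back.
            self.parts.append("&" + name)


def extract_text(markup):
    """Return the text content of markup with entity refs resolved."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    return "".join(parser.parts)


print(extract_text("<p>AT&T Research Labs</p>"))
print(extract_text("<p>Tom &amp; Jerry</p>"))
```

One limitation of this approach: handle_entityref receives only the name, so an unknown reference that was written with a trailing ";" is re-emitted without it.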
msg19334 - (view) | Author: Jordan R McCoy (jrm) | Date: 2003-12-24 17:40 |
The HTML being parsed should use '&amp;' for the '&'; however, HTMLParser uses this regexp to identify entity references (line 20): entityref = re.compile('&([a-zA-Z][-.a-zA-Z0-9]*)[^a-zA-Z0-9]'), which doesn't require the ';' that the HTML specification mandates at the end. This may or may not be intentional. | ||
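To see what the quoted pattern accepts, here is a small sketch that applies it to the text from the report. The second pattern, strict_entityref, is a hypothetical variant that insists on the closing ";" and is shown only for comparison; it is not taken from the library.

```python
import re

# Pattern as quoted in the message above.
entityref = re.compile(r'&([a-zA-Z][-.a-zA-Z0-9]*)[^a-zA-Z0-9]')

# Hypothetical stricter variant that requires the terminating ';'.
strict_entityref = re.compile(r'&([a-zA-Z][-.a-zA-Z0-9]*);')

sample = "AT&T Research"

lenient = entityref.search(sample)
# The quoted pattern treats "&T" followed by a space as an entity reference.
print(lenient.group(1))

strict = strict_entityref.search(sample)
# The strict variant finds nothing, because no ';' follows "&T".
print(strict)

# A properly escaped ampersand is recognised by both patterns.
print(strict_entityref.search("AT&amp;T").group(1))
```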
msg19335 - (view) | Author: Martin v. Löwis (loewis) | Date: 2003-12-30 11:21 |
What do you mean, "it parses it to AT Research"? It most certainly does no such thing. Instead, it invokes handle_entityref with the "T" entity, which you should process. Closing as not-a-bug. |
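The behaviour described in this message can be reproduced with a short sketch: a subclass that gathers only handle_data output ends up with "AT Research", while the "T" is delivered separately through handle_entityref. This assumes Python 3's html.parser with convert_charrefs=False; the names DataOnlyParser and parse are hypothetical.

```python
from html.parser import HTMLParser


class DataOnlyParser(HTMLParser):
    """Hypothetical parser that records data and entity refs separately."""

    def __init__(self):
        # Assumption: disable automatic conversion so entity references
        # are reported via handle_entityref instead of being folded into data.
        super().__init__(convert_charrefs=False)
        self.data = []
        self.entities = []

    def handle_data(self, data):
        self.data.append(data)

    def handle_entityref(self, name):
        # The parser hands over the reference name; it is up to the
        # subclass to decide what to do with it.
        self.entities.append(name)


def parse(markup):
    """Return (joined data, list of entity reference names)."""
    parser = DataOnlyParser()
    parser.feed(markup)
    parser.close()
    return "".join(parser.data), parser.entities


text, entities = parse("AT&T Research")
print(repr(text))
print(entities)
```

Code that joins only the data pieces loses the "&T", which matches the symptom in the original report even though the parser did report the reference.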
History | |||
---|---|---|---|
Date | User | Action | Args |
2022-04-11 14:56:01 | admin | set | github: 39682 |
2003-12-09 02:47:41 | lhy719 | create |