Issue 856617: HTMLParser parses AT&T to AT

Created on 2003-12-09 02:47 by lhy719, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Messages (4)
msg19332 - Author: Hammer Lee (lhy719) Date: 2003-12-09 02:47
I use HTMLParser to parse HTML files. There is a mistake when the HTML contents contain '&', as in "AT&T Research Labs Cambridge - WinVNC Version 3, 3, 3, 7": HTMLParser parses "AT&T Research" to "AT Research". The same happens with "ETTC&P EpSCTWeb_Fr Application Version 1, 0, 0, 1". I'm a newbie in Python and I don't know how to solve it.
msg19333 - Author: Jim Jewett (jimjjewett) Date: 2003-12-11 18:32
Technically, that isn't legal HTML; they're supposed to write &amp; (follow the & with "amp;"), because & is an escape character. That said, it is a pretty common error in web pages. The parser already recovers at the next space (instead of waiting for a ";"), and I think it would be reasonable to just return the "&T" when T doesn't turn out to be a known entity. You would do this by overriding handle_entityref -- but to be honest, I suspect that you're "really" using some other library (or local code) which already does this, so you may have to make the modification there.
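The handle_entityref override suggested here could look roughly like the sketch below. It is written against Python 3's html.parser rather than the 2003 HTMLParser module, and it assumes convert_charrefs=False so that entity references are still reported to handle_entityref. The names LenientTextExtractor and extract_text are made up for illustration, and numeric references such as &#38; are not handled.

```python
from html.entities import name2codepoint
from html.parser import HTMLParser


class LenientTextExtractor(HTMLParser):
    """Collects text, passing unknown entity references through unchanged."""

    def __init__(self):
        # Assumption: convert_charrefs=False, so the parser reports entity
        # references via handle_entityref instead of decoding them itself.
        super().__init__(convert_charrefs=False)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

    def handle_entityref(self, name):
        if name in name2codepoint:
            # Known entity such as "amp": substitute the character.
            self.parts.append(chr(name2codepoint[name]))
        else:
            # Unknown entity such as "T": put the raw "&T" back.
            self.parts.append("&" + name)

    def text(self):
        return "".join(self.parts)


def extract_text(markup):
    """Hypothetical helper: parse markup and return the collected text."""
    parser = LenientTextExtractor()
    parser.feed(markup)
    parser.close()
    return parser.text()


print(extract_text("<p>AT&T Research Labs Cambridge</p>"))
print(extract_text("<p>AT&amp;T Research Labs Cambridge</p>"))
```

Both calls should print the same text, since the bare "&T" is restored and the correctly escaped "&amp;" is decoded.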
msg19334 - Author: Jordan R McCoy (jrm) Date: 2003-12-24 17:40
The HTML being parsed should use '&amp;' for the '&'; however, HTMLParser uses this regexp to identify entity references (line 20): entityref = re.compile('&([a-zA-Z][-.a-zA-Z0-9]*)[^a-zA-Z0-9]') which doesn't require the ';' that the HTML specification mandates at the end of an entity reference. This may or may not be intentional.
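To see what that pattern accepts, here is a small sketch that copies the regexp exactly as quoted in this message and tries it on a few inputs. It only exercises the regular expression, not the parser, and the helper name entity_name is made up for illustration.

```python
import re

# Pattern copied from the message above; any single non-alphanumeric
# character (a space, a ';', ...) terminates the entity name.
entityref = re.compile('&([a-zA-Z][-.a-zA-Z0-9]*)[^a-zA-Z0-9]')


def entity_name(text):
    """Hypothetical helper: return the entity name matched at the start, or None."""
    match = entityref.match(text)
    return match.group(1) if match else None


print(entity_name("&T Research"))     # a space ends the name
print(entity_name("&amp;T Research")) # a ';' ends the name
print(entity_name("&T"))              # no terminating character at all
```

The first two calls yield "T" and "amp"; the last yields None, because the pattern needs one terminating character after the name.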
msg19335 - Author: Martin v. Löwis (loewis) * (Python committer) Date: 2003-12-30 11:21
What do you mean, "it parses it to AT Research"? It most certainly does no such thing. Instead, it invokes handle_entityref with the "T" entity, which you should process. Closing as not-a-bug.
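The callback sequence described here can be observed with a small recording parser. This is a sketch against Python 3's html.parser, again assuming convert_charrefs=False (with the modern default of True, handle_entityref is never called). EventRecorder is a hypothetical name.

```python
from html.parser import HTMLParser


class EventRecorder(HTMLParser):
    """Records the data and entityref callbacks in the order they arrive."""

    def __init__(self):
        # Assumption: convert_charrefs=False so entity references are reported.
        super().__init__(convert_charrefs=False)
        self.events = []

    def handle_data(self, data):
        self.events.append(("data", data))

    def handle_entityref(self, name):
        self.events.append(("entityref", name))


recorder = EventRecorder()
recorder.feed("<p>AT&T Research</p>")
recorder.close()

for event in recorder.events:
    print(event)

# Text as seen by code that collects handle_data only and ignores entityrefs.
data_only = "".join(value for kind, value in recorder.events if kind == "data")
print(data_only)
```

The "T" arrives as a separate entityref event between two data events, so a handler that gathers only handle_data ends up with "AT Research", which matches what the reporter observed.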
History
Date                 User    Action  Args
2022-04-11 14:56:01  admin   set     github: 39682
2003-12-09 02:47:41  lhy719  create