Issue 590682: New codecs: html, asciihtml

Created on 2002-08-04 04:58 by orenti, last changed 2022-04-10 16:05 by admin. This issue is now closed.

Files
File name	Uploaded	Description
htmlcodecs.patch	orenti,2002-08-04 04:58
htmlescapecodec.diff	orenti,2002-08-09 15:38

Messages (13)
msg40815 - (view)	Author: Oren Tirosh (orenti)	Date: 2002-08-04 04:58
These codecs translate HTML character &entity; references. The html codec may be applied after other codecs such as utf-8 or iso8859_X and preserves their encoding. The asciihtml encoder produces 7-bit ascii and its output is therefore safe for insertion into almost any document regardless of its encoding.
msg40816 - (view)	Author: Martin v. Löwis (loewis) *	Date: 2002-08-04 08:54
Logged In: YES user_id=21627 This patch is superseded by PEP 293 and patch #432401, which allow you to write unitext.encode("ascii", errors="xmlcharrefreplace"). This probably should be left open until PEP 293 is pronounced upon, and then either rejected or reviewed in detail. I'd encourage a patch that uses Unicode in htmlentitydefs directly, and computes entitydefs from that, instead of vice-versa (or at least exposes a unicode_entitydefs, perhaps even lazily) - perhaps also with a reverse mapping.
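For reference, the call mentioned above can be sketched in current Python, where the PEP 293 handler is built in. The sample string is made up for illustration.

```python
# Illustrative input; any text containing non-ASCII characters would do.
text = "caf\u00e9 \u00a9"  # "café ©"

# Characters that ASCII cannot represent become decimal character references.
encoded = text.encode("ascii", errors="xmlcharrefreplace")
print(encoded)
```

Only characters the target encoding cannot represent are replaced, so the same call with a wider encoding leaves more of the text untouched.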
msg40817 - (view)	Author: Oren Tirosh (orenti)	Date: 2002-08-04 11:00
Logged In: YES user_id=562624 Yes, the error callback approach handles strange mixes better than my method of chaining codecs. But it only does encoding - this patch also provides full decoding of named, decimal and hexadecimal character entity references. Assuming PEP 293 is accepted, I'd like to see the asciihtml codec stay for its decoding ability and renamed to xmlcharref. The encoding part of this codec can just call .encode("ascii", errors="xmlcharrefreplace") to make it a full two-way codec. I'd prefer htmlentitydefs.py to use unicode, too. It's not so useful the way it is. Another problem is that it uses mixed case names as keys. The dictionary lookup is likely to miss incoming entities with arbitrary case so it's more-or-less broken. Does anyone actually use it the way it is? Can it be changed to use unicode without breaking anyone's code?
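As a rough illustration of the three reference forms the decoder is meant to handle (named, decimal, hexadecimal), here is a sketch using `html.unescape` from later versions of the standard library. This is not the patch's codec, only a stand-in that shows the intended decoding result.

```python
from html import unescape

# The same character written as a named, decimal, and hexadecimal reference.
samples = ["&eacute;", "&#233;", "&#xE9;"]

decoded = [unescape(s) for s in samples]
print(decoded)
```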
msg40818 - (view)	Author: Oren Tirosh (orenti)	Date: 2002-08-04 11:10
Logged In: YES user_id=562624 PEP 293 and patch #432401 are not a replacement for these codecs - this patch does decoding as well as encoding and also translates <, >, and &, which are valid in all encodings and therefore won't get translated by error callbacks.
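The point about <, >, and & can be checked with a short sketch. It uses `html.escape` from later versions of the standard library as a stand-in for the markup-escaping step; the sample string is made up.

```python
from html import escape

markup = "a < b & c > d \u00e9"

# The error handler only fires for characters the encoding cannot represent,
# so the ASCII markup characters pass through unchanged.
handler_only = markup.encode("ascii", errors="xmlcharrefreplace")

# Escaping the markup characters first, then encoding, covers both cases.
escaped_first = escape(markup, quote=False).encode("ascii", errors="xmlcharrefreplace")

print(handler_only)
print(escaped_first)
```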
msg40819 - (view)	Author: Martin v. Löwis (loewis) *	Date: 2002-08-04 11:50
Logged In: YES user_id=21627 You can easily enough arrange to get errors on <, >, and &, by using codecs.charmap_encode with an appropriate encoding map. In fact, with that, you can easily get all entity references into the encoded data, without any need for an explicit iteration. However, I am concerned that you offer decoding as well. People may be tricked into believing that they can decode arbitrary HTML with your codec - when your codec would incorrectly deal with CDATA sections.
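A minimal sketch of the charmap idea in current Python, under some assumptions: the encoding map is plain ASCII with <, >, and & left out so that they count as unencodable, and the error handler (`charref_bytes_demo`, a hypothetical name) is a custom one that returns bytes, so the replacement does not depend on the map containing '&', '#', or ';'.

```python
import codecs

def charref_bytes(exc):
    """Hypothetical handler: replace the unencodable run with &#N; references."""
    if not isinstance(exc, UnicodeEncodeError):
        raise exc
    chunk = exc.object[exc.start:exc.end]
    replacement = "".join(f"&#{ord(ch)};" for ch in chunk).encode("ascii")
    # Returning bytes makes the encoder copy them to the output as-is.
    return replacement, exc.end

codecs.register_error("charref_bytes_demo", charref_bytes)

# ASCII map with the markup characters removed, so they trigger the handler.
encoding_map = {i: i for i in range(128) if chr(i) not in "<>&"}

text = "a < b & caf\u00e9"
data, consumed = codecs.charmap_encode(text, "charref_bytes_demo", encoding_map)
print(data)
```

The markup characters and the non-ASCII character all go through the same handler, with no explicit loop over the input in user code.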
msg40820 - (view)	Author: Oren Tirosh (orenti)	Date: 2002-08-04 15:07
Logged In: YES user_id=562624
> People may be tricked into believing that they can
> decode arbitrary HTML with your codec - when your
> codec would incorrectly deal with CDATA sections.
You don't even need to go as far as CDATA to see that tags must be parsed first and only then tag bodies and attribute values can be individually decoded. If you do it in the reverse order the tag parser will try to parse a decoded &lt; as a tag. It should be documented, though. For encoding it's also obvious that encoding must be done first and then the encoded strings can be inserted into tags - < in strings is encoded into &lt;, preventing it from being interpreted as a tag. This is a good thing! It prevents insertion attacks.
> You can easily enough arrange to get errors on <, >,
> and &, by using codecs.charmap_encode with an
> appropriate encoding map.
If you mean to use this as some internal implementation detail it's ok. Are you actually proposing that this is the way end users should use it? How about this: Install an encoder registry function that responds to any codec name matching "xmlcharref.SPAM" and does all the internal magic you describe to create a codec instance that combines xmlcharref translation including <,>,& and the SPAM encoding. This dynamically-generated codec will do both encoding and decoding and be cached, of course. "Namespaces are one honking great idea -- let's do more of those!"
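The registry-function proposal could look roughly like this in current Python. This is a sketch, not the patch: `xmlcharref_search` is a hypothetical name, `html.escape` and `html.unescape` stand in for the translation step, and the `errors` argument passed to the encoder is ignored for brevity.

```python
import codecs
from html import escape, unescape

PREFIX = "xmlcharref."  # naming scheme proposed in the thread

def xmlcharref_search(name):
    """Hypothetical search function for names like 'xmlcharref.latin1'."""
    if not name.startswith(PREFIX):
        return None
    try:
        base = codecs.lookup(name[len(PREFIX):])
    except LookupError:
        return None

    def encode(text, errors="strict"):
        # Escape <, >, & first, then let the base codec emit &#N; for the rest.
        data, _ = base.encode(escape(text, quote=False), "xmlcharrefreplace")
        return data, len(text)

    def decode(data, errors="strict"):
        text, _ = base.decode(bytes(data), errors)
        return unescape(text), len(data)

    return codecs.CodecInfo(encode, decode, name=name)

codecs.register(xmlcharref_search)

original = "a < \u00e9 \u2603"
encoded = original.encode("xmlcharref.latin1")
print(encoded)
print(encoded.decode("xmlcharref.latin1"))
```

The codec registry caches successful lookups, so the search function runs once per distinct name. The decoding caveat discussed above still applies: this only makes sense on text that has already been separated from the surrounding markup.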
msg40821 - (view)	Author: Martin v. Löwis (loewis) *	Date: 2002-08-04 15:54
Logged In: YES user_id=21627 I'm in favour of exposing this via a search function, for generated codec names, on top of PEP 293 (I would not like your codec to compete with the alternative mechanism). My dislike for the current patch also comes from the fact that it singles-out ASCII, which the search function would not. You could implement two forms: html.codecname and xml.codecname. The html form would do HTML entity references in both directions, and fall back to character references only if necessary; the XML form would use character references all the time, and entity references only for the builtin entities. And yes, I do recommend users to use codecs.charmap_encode directly, as this is probably the most efficient, yet most compact way to convert Unicode to a less-than-7-bit form. In any case, I'd encourage you to contribute to the progress of PEP 293 first - this has been an issue for several years now, and I would be sorry if it would fail. While you are waiting for PEP 293 to complete, please do consider cleaning up htmlentitydefs to provide mappings from and to Unicode characters.
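The encoding half of the "html" form described above (named entity references, falling back to character references) can be sketched as a PEP 293 error handler. The handler name `htmlentityreplace_demo` is hypothetical, and `html.entities.codepoint2name` is the current name of the code point to entity name table (the module was `htmlentitydefs` at the time of this thread).

```python
import codecs
from html.entities import codepoint2name

def html_entity_replace(exc):
    """Hypothetical handler: named entity where one exists, else &#N;."""
    if not isinstance(exc, UnicodeEncodeError):
        raise exc
    parts = []
    for ch in exc.object[exc.start:exc.end]:
        name = codepoint2name.get(ord(ch))
        parts.append(f"&{name};" if name else f"&#{ord(ch)};")
    return "".join(parts), exc.end

codecs.register_error("htmlentityreplace_demo", html_entity_replace)

text = "caf\u00e9 \u2603"  # U+2603 has no HTML 4 entity name
print(text.encode("ascii", errors="htmlentityreplace_demo"))
print(text.encode("latin-1", errors="htmlentityreplace_demo"))
```

Because it is an error handler rather than a codec, it works with any target encoding and does not single out ASCII.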
msg40822 - (view)	Author: Marc-Andre Lemburg (lemburg) *	Date: 2002-08-05 07:59
Logged In: YES user_id=38388 On the htmlentitydefs: yes, these are in use as they are defined now. If you want a mapping from and to Unicode, I'd suggest to provide this as a new table. About the cased key in the entitydefs dict: AFAIK, these have to be cased since entities are case-sensitive. Could be wrong though. On PEP 293: this is going in the final round now. Your patch doesn't compete with it though, since PEP 293 is a much more general approach. On the general idea: I think the codecs are misnamed. They should be called htmlescape and asciihtmlescape since they don't provide "real" HTML encoding/decoding as Martin already mentioned. There's something wrong with your approach, BTW: the codec should only operate on Unicode (taking only Unicode input and generating Unicode). If you apply it to an 8-bit UTF-8 encoded string you'll get garbage!
msg40823 - (view)	Author: Oren Tirosh (orenti)	Date: 2002-08-05 12:11
Logged In: YES user_id=562624 Yes, entities are supposed to be case sensitive but I'm working with manually-generated html in which &GT; is not so uncommon... I guess life is different in XML world. Case-smashing loses the distinction between some entities. I guess I need a more intelligent solution.
> If you apply it to an 8-bit UTF-8 encoded string you'll get garbage!
Actually, it works great. The html codec passes characters 128-255 unmodified and therefore can be chained with other codecs. But I now have a more elegant and high-performance approach than codec chaining. See my python-dev posting.
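The case-smashing problem can be made concrete with the entity table from the current standard library. The lookup policy below (exact match first, then a case-insensitive match only when it is unambiguous) is one possible approach and an assumption on my part, not necessarily what the later patch does; `lookup_entity` is a hypothetical helper.

```python
from collections import defaultdict
from html.entities import name2codepoint

# Group entity names by their lowercased form to find names that would collide.
by_lower = defaultdict(set)
for name in name2codepoint:
    by_lower[name.lower()].add(name)

collisions = {key: sorted(names) for key, names in by_lower.items() if len(names) > 1}
print(collisions["eacute"])

def lookup_entity(name):
    """Hypothetical lookup: exact match, else an unambiguous case-insensitive one."""
    if name in name2codepoint:
        return chr(name2codepoint[name])
    candidates = by_lower.get(name.lower(), ())
    if len(candidates) == 1:
        return chr(name2codepoint[next(iter(candidates))])
    return None  # unknown, or ambiguous once case is ignored

print(lookup_entity("GT"), lookup_entity("Eacute"), lookup_entity("EACUTE"))
```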
msg40824 - (view)	Author: Oren Tirosh (orenti)	Date: 2002-08-05 12:11
Logged In: YES user_id=562624 Yes, entities are supposed to be case sensitive but I'm working with manually-generated html in which &GT; is not so uncommon... I guess life is different in XML world. Case-smashing loses the distinction between some entities. I guess I need a more intelligent solution.
> If you apply it to an 8-bit UTF-8 encoded string you'll get garbage!
Actually, it works great. The html codec passes characters 128-255 unmodified and therefore can be chained with other codecs. But I now have a more elegant and high-performance approach than codec chaining. See my python-dev posting.
msg40825 - (view)	Author: Oren Tirosh (orenti)	Date: 2002-08-09 15:38
Logged In: YES user_id=562624 Case insensitivity fixed. General cleanup. Codecs renamed to htmlescape and htmlescape8bit. Improved error handling. Update unicode_test.
msg40826 - (view)	Author: Martin v. Löwis (loewis) *	Date: 2002-12-12 10:11
Logged In: YES user_id=21627 Oren, is this patch still needed, as we now have the xmlcharrefreplace error handler?
msg40827 - (view)	Author: Martin v. Löwis (loewis) *	Date: 2003-03-28 23:34
Logged In: YES user_id=21627 Apparently, this patch is not needed anymore, so I'm rejecting it.

History
Date	User	Action	Args
2022-04-10 16:05:33	admin	set	github: 36976
2002-08-04 04:58:58	orenti	create