[Python-Dev] Smuggling bytes into text (was Re: RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5)
Steven D'Aprano steve at pearwood.info
Mon Jan 13 03:03:15 CET 2014
- Previous message: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5
- Next message: [Python-Dev] Smuggling bytes into text (was Re: RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5)
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
Changing the subject line to better describe what we're talking about. I hope it is of interest to others apart from Ethan and me -- mixed bytes and text is hard to get right. (And if I've got something wrong, I'd like to know about it.)
On Sat, Jan 11, 2014 at 08:38:49PM -0800, Ethan Furman wrote:
On 01/11/2014 06:29 PM, Steven D'Aprano wrote:
[...]
Since you're talking to me, it would be nice if you addressed the same use-case I was addressing, which is mixed: ascii-encoded text, ascii-encoded numbers, ascii-encoded bools, binary-encoded numbers, and misc-encoded text.
I thought I had addressed it. But since your use-case is underspecified, please excuse me if I get some of it wrong.
And no, your example will not work with any text, it would completely moji-bake my dbf files.
I don't think it will. Admittedly, I don't know all the ins and outs of your files, but as far as I can tell, nothing you have said so far suggests that my plan will fail.
Code speaks louder than words: http://www.pearwood.info/ethan_demo.py
This code produces a string containing smuggled bytes. There is:
a header containing raw bytes;
metadata consisting of the name of some encoding in ASCII;
A series of tagged fields. Each field has a name, which is always ASCII, and terminated with a colon. It is then followed by a single ASCII character and some data:
- T for some arbitrary chunk of text, encoded in the metadata encoding, with a length byte prefix (that is, like a Pascal string);
- F for a boolean flag "true" or "false" in ASCII;
- N for an integer, a C long;
- D for an integer, in ASCII, terminated at the first non-digit;
- B for a chunk of arbitrary bytes, with a two-byte length prefix.
And the whole thing is written out to a file, then read back in, without data corruption or mojibake. I wrote this about 1am this morning, so it may or may not be a shining example of idiomatic Python code, but it works and is readable.
I understand that this won't match your actual use-case precisely, but I hope it contains the same sorts of mixed binary data and ASCII text that you're talking about. There are fixed width fields, variable length fields, binary fields, ASCII fields, non-ASCII text, and multiple encodings, all living in perfect harmony :-)
And it runs unchanged under both Python 2.7 and 3.3.
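The tagged-field layout described above can be sketched roughly as follows. This is not the code from ethan_demo.py: the helper names (smuggle, unsmuggle, write_field, read_field), the little-endian struct formats, the cp1251 encoding, and the sample field values are all assumptions made for illustration. The header and the encoding-name metadata are left out, D fields are assumed non-negative, and the sketch targets Python 3 only.

```python
import struct

def smuggle(raw):
    """Bytes -> str, one code point per byte (the Latin-1 trick)."""
    return raw.decode('latin-1')

def unsmuggle(text):
    """Inverse of smuggle(); fails if text holds code points above U+00FF."""
    return text.encode('latin-1')

def write_field(name, tag, value, encoding):
    """Render one tagged field as a str holding only code points 0-255."""
    if tag == 'T':    # text in the file's encoding, one-byte length prefix
        raw = value.encode(encoding)
        body = smuggle(struct.pack('B', len(raw)) + raw)
    elif tag == 'F':  # boolean flag as ASCII "true" / "false"
        body = 'true' if value else 'false'
    elif tag == 'N':  # fixed-width binary integer (assumed 4 bytes, little-endian)
        body = smuggle(struct.pack('<l', value))
    elif tag == 'D':  # integer as ASCII digits
        body = str(value)
    elif tag == 'B':  # opaque bytes, two-byte length prefix
        body = smuggle(struct.pack('<H', len(value)) + value)
    else:
        raise ValueError('unknown tag %r' % tag)
    return name + ':' + tag + body

def read_field(text, pos, encoding):
    """Parse one field starting at pos; return (name, value, new_pos)."""
    colon = text.index(':', pos)
    name, tag, pos = text[pos:colon], text[colon + 1], colon + 2
    if tag == 'T':
        end = pos + 1 + ord(text[pos])
        return name, unsmuggle(text[pos + 1:end]).decode(encoding), end
    if tag == 'F':
        flag = text.startswith('true', pos)
        return name, flag, pos + (4 if flag else 5)
    if tag == 'N':
        return name, struct.unpack('<l', unsmuggle(text[pos:pos + 4]))[0], pos + 4
    if tag == 'D':    # terminated at the first non-digit
        end = pos
        while end < len(text) and text[end] in '0123456789':
            end += 1
        return name, int(text[pos:end]), end
    if tag == 'B':
        size = struct.unpack('<H', unsmuggle(text[pos:pos + 2]))[0]
        return name, unsmuggle(text[pos + 2:pos + 2 + size]), pos + 2 + size
    raise ValueError('unknown tag %r' % tag)

encoding = 'cp1251'  # stands in for the encoding named in the file metadata
fields = [('name', 'T', '\u0441\u0440\u0403'), ('active', 'F', True),
          ('count', 'N', -7), ('age', 'D', 42),
          ('blob', 'B', b'\x82\xe1\xc2\x00\x00\x7b\x00\xff\xa8')]

# Build the record entirely in the text domain, then encode once for disk.
record = ''.join(write_field(n, t, v, encoding) for n, t, v in fields)
on_disk = unsmuggle(record)

# Reading reverses the process: decode once, then parse as text.
text, pos, decoded = smuggle(on_disk), 0, []
while pos < len(text):
    name, value, pos = read_field(text, pos, encoding)
    decoded.append((name, value))
```

Every variable-length field carries its own length or terminator, so the reader never has to guess where smuggled bytes end and ASCII field names begin.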
As so often happens, what seems good in principle is less useful in practice. Once I actually started writing code, I quickly moved beyond the simple model:
template = "some text"
data = template % ("text", 42, b'\x16foo'.decode('latin-1'))
that I thought would be easy, to a more structured approach. So I wrote reader and writer classes and abstracted away the messy bits, although in truth none of it is very messy. The worst is dealing with the 2 versus 3 differences, and even that requires only a handful of small helper functions.
I don't claim that the code I tossed together is the optimal design, or bug-free, or even that the exact same approach will work for your specific case. But it is enough to demonstrate that the basic idea is sound: you can process mixed text and bytes in a clean way, it doesn't generate mojibake, and it can operate in both 2.7 and 3.3 without even using a future directive.
>>>Only the binary blobs need to be decoded. We don't need to encode the template to bytes, and the textual data doesn't get encoded until we're ready to send it across the wire or write it to disk.
No! When I have text, part of which gets ascii-encoded and part of which gets, say, cp1251 encoded, I cannot wait till the end!
I think we are talking about different textual data. It's a bit ambiguous, my apologies. You're talking about taking individual fields and deciding how to process them. I'm talking about doing your processing in the text domain, which means at the end of the process I have a Unicode string object rather than a bytes object. Before that str can be written to disk, it needs to be encoded.
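A minimal sketch of what "processing in the text domain" means here, assuming Python 3. The template layout and the field values are made up for illustration; only the b'\x16foo' blob comes from the earlier example. Note that the final encode succeeds only because every character in the record is below U+0100.

```python
blob = b'\x16foo'                       # opaque bytes, not text
template = 'name=%s;count=%d;blob=%s'   # hypothetical ASCII-only layout

# Everything stays a str while the record is assembled; only the blob
# needs to be decoded, and Latin-1 does that without loss.
record = template % ('George', 42, blob.decode('latin-1'))

# One encode step, at the moment the record goes to disk or the wire.
wire = record.encode('latin-1')
```

The blob comes out of the final encode byte-for-byte identical to what went in, and the ASCII parts of the template are unaffected because Latin-1 agrees with ASCII on the first 128 code points.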
>>And what if your name field has data not representable in latin-1?
>>
>>--> '\xd1\x81\xd1\x80\xd0\x83'.decode('utf8')
>>u'\u0441\u0440\u0403'
>
>Where did you get those bytes from? You got them from somewhere.
For the sake of argument, pretend a user entered them in.
>Who knows? Who cares? Once you have bytes, you can treat them as a blob of arbitrary bytes and write them to the record using the Latin-1 trick.
No, I can't. See above.
>If you're reading those bytes from some stream that gives you bytes, you don't have to care where they came from.
You're kidding, right? If I don't know where they came from (a graphics field? a note field?) how am I going to know how to treat them?
As I understand it, you want the ability to store arbitrary bytes in the file, right? Here are nine arbitrary bytes:
b'\x82\xE1\xC2\0\0\x7B\0\xFF\xA8'
You don't need to know how I generated them, whether they are sound samples, data from a serial port, three RGB values, or some strange C struct. I need to know how to generate them, but you can treat them as an opaque blob. They're already bytes, you're not responsible for converting whatever the data was into bytes, because it's already done. It's just a blob of bytes as far as you're concerned. All you need to do is smuggle them into a text string.
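To illustrate with those nine bytes (a sketch assuming Python 3; the variable names are mine): Latin-1 assigns a character to every byte value, so the decode cannot fail and the encode gives back exactly what went in. A structured codec such as UTF-8 makes no such promise.

```python
blob = b'\x82\xE1\xC2\0\0\x7B\0\xFF\xA8'

# Latin-1 maps byte N to code point N, so decoding can never fail.
smuggled = blob.decode('latin-1')
assert [ord(ch) for ch in smuggled] == list(blob)

# Encoding with the same codec returns the original bytes untouched.
assert smuggled.encode('latin-1') == blob

# UTF-8 rejects the same blob: 0x82 is not a valid leading byte.
try:
    blob.decode('utf-8')
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False
```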
>But what if you don't start with bytes? If you start with a bunch of >floats, you'll probably convert them to bytes using the struct module.
Yup, and I do.
>If you start with non-ASCII text, you have to convert them to bytes too. No difference here.
Really?
Again, I fear I failed to explain myself in sufficient detail. If your non-ASCII text doesn't match the encoding specified, how else are you going to include it? See below.
You just said above that "it will work with any text data" -- you can't have it both ways.
I have been unclear, I apologise. Let me try again with an example.
As the end-user, I get to specify the encoding, that's what you said. Okay, I specify ISO-8859-7, which is Greek. Now obviously if I hand you a bunch of Russian letters in a string, and you try to encode them using ISO-8859-7, you're going to get an exception. That's okay, as presumably I'm sensible enough to only include characters which exist in the encoding I choose, and if not, it's my own damn fault.
But suppose I have a reason for this strange behaviour. If I pre-encode those Russian letters to bytes, using (say) UTF-16, then I can hand you the raw bytes to store as a binary blob. Later, I get the binary blob back again, and I can decode them using UTF-16, to get the original Russian text back again. So long as you don't mangle the binary blob, the process is completely reversible.
That is what I am talking about.
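A sketch of that scenario, assuming Python 3. The sample Russian word and the variable names are illustrative only.

```python
file_encoding = 'iso-8859-7'      # the Greek encoding chosen by the user
russian = '\u0414\u0430'          # two Cyrillic letters

# Encoding Cyrillic with a Greek codec fails, as expected.
try:
    russian.encode(file_encoding)
    fits = True
except UnicodeEncodeError:
    fits = False

# The user pre-encodes the text and hands over an opaque blob instead.
blob = russian.encode('utf-16')

# The file-handling code smuggles the blob through a str and back out.
stored = blob.decode('latin-1')
on_disk = stored.encode('latin-1')

# Later the user gets the blob back and decodes it with the same codec.
restored = on_disk.decode('utf-16')
```

The file-handling code never needs to know the blob is UTF-16; its only obligation is to return the same bytes.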
>You ask the user for their name, they answer "срЃ" which is given to you as a Unicode string, and you want to include it in your data record. The specifications of your file format aren't clear, so I'm going to assume that:
>
>1) ASCII text is allowed "as-is" (that is, the name "George" will be in the final data file as b'George');
User data is not (typically) where the ASCII data is, but some of the metadata is definitely and always ASCII. The user text data needs to be encoded using whichever codec is specified by the file, which is only occasionally ASCII.
>2) any other non-ASCII text will be encoded as some fixed encoding which we can choose to suit ourselves;
Well, the user chooses it, we have to abide by their choice. (It's kept in the file metadata.)
>3) arbitrary binary data is allowed "as-is" (i.e. byte N has to end up being written as byte N, for any value of N between 0 and 255).
In a couple field types, yes. Usually the binary data is numeric or date related and there is conversion going on there, too, to give me the bytes I need.
The above all sounds reasonable. But the following does not -- I think it shows some fundamental confusion on your part.
[snip]
>>--> '\xd1\x81\xd1\x80\xd0\x83'.decode('utf8').encode('latin1')
>>Traceback (most recent call last):
>>  File "<stdin>", line 1, in <module>
>>UnicodeEncodeError: 'latin-1' codec can't encode characters in position 0-2: ordinal not in range(256)
>
>That is backwards to what I've shown. Look at my earlier example again:
And you are not paying attention:
'\xd1\x81\xd1\x80\xd0\x83'.decode('utf8').encode('latin1')
\--------------------------------------/ \-------------/
 a non-ascii compatible unicode string    to latin1 bytes
You can't decode Unicode strings. Try it in Python 3, and it breaks:
py> '\xd1\x81\xd1\x80\xd0\x83'.decode('utf8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'str' object has no attribute 'decode'
For your code to work, you can't be using Python 3, you have to be using Python 2, where "..." is already bytes, not Unicode. Since it's a byte string, there's no point in decoding it from UTF-8, then encoding it back to bytes. All you are doing is running the risk of a UnicodeEncodeError:
# Python 2.7 this time
py> '\xd0\x94'.decode('utf-8').encode('latin-1')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'latin-1' codec can't encode character u'\u0414' in position 0: ordinal not in range(256)
Latin-1 does not work with arbitrary characters, but it does work with arbitrary bytes. You're trying to take a UTF-8 encoded byte string, decode back to arbitrary Unicode characters, then encode to Latin-1, which may fail.
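The asymmetry can be checked directly. A sketch assuming Python 3, reusing the b'\xd0\x94' bytes from the session above; the variable names are mine.

```python
# Arbitrary bytes: all 256 values survive a Latin-1 round trip.
every_byte = bytes(range(256))
assert every_byte.decode('latin-1').encode('latin-1') == every_byte

# Arbitrary characters: U+0414 lies outside Latin-1, so encoding fails.
utf8_bytes = b'\xd0\x94'
character = utf8_bytes.decode('utf-8')
try:
    character.encode('latin-1')
    encoded_ok = True
except UnicodeEncodeError:
    encoded_ok = False

# Treating the same two bytes as an opaque blob works fine.
smuggled = utf8_bytes.decode('latin-1')
assert smuggled.encode('latin-1') == utf8_bytes
```

The difference is in which direction Latin-1 is applied first: decoding bytes always succeeds, encoding characters only succeeds for code points up to U+00FF.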
What I am doing is taking arbitrary bytes, then decoding them as Latin-1 as a way of smuggling those bytes into a str.
("срЃ".encode('somenonasciiencodingsuchascp1251').decode('latin-1'), 42, blob.decode('latin-1'))
 \----------------------------------------------/ \--------------/
  getting the actual bytes I need                  and back into unicode until I write them later
In Python 3, that works, but I'm not sure if it does what you intend (I don't know what you intend). You have encode and decode the right way around this time, for Python 3 strings.
In Python 2, the interpreter (wrongly) accepts "срЃ" as a byte-string literal, but the results are poorly defined. What you actually get (probably) depends on your environment. On my system, I seem to get UTF-8 encoded bytes, but that's not guaranteed.
You did say to use a text template to manipulate my data, and then write it later, no? Well, this is what it would look like.
If the text strings the user gives you are compatible with the encoding they specify, you don't need that. Just use:
("срЃ", 42, blob.decode('latin-1'))
It's the user's responsibility if they choose to specify an encoding which is more restrictive than the contents of some field. If they do that, they have to encode that field somehow, so they can treat it as a binary blob. You don't have to do this, and you certainly don't have to take perfectly good text and turn it into bytes then back to text just so you can insert it back into text. That would be silly.
>Bytes get DECODED to latin-1, not encoded.
>
>Bytes -> text is decoding
>Text -> bytes is encoding
Pretend for a moment I know that, and look at my examples again.
Sorry to be harsh, but based on your swapping decode and encode around in the examples above, I would have to pretend :-)
I am demonstrating the contortions needed when my TEXTual data is not ASCII-compatible: It must be ENcoded using the appropriate codec to BYTES, then DEcoded back to unicode using latin1, all so later I can ENcode the bloomin' unicode data structure back to bytes using latin1 again. Dizzy yet?
No.
If I, the end user, insist on using a stupid legacy encoding, then YES absolutely of course I have to jump through hoops to store arbitrary Unicode characters using a legacy encoding that only supports a tiny subset of Unicode. This should not surprise you.
And you must know this, because it is what your bytify function does. Are you trolling?
No.
-- Steven