Issue 20699: Document that binary IO classes work with bytes-likes objects (original) (raw)

Created on 2014-02-20 09:49 by Henning.von.Bargen, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Messages (31)

msg211714 - (view)

Author: Henning von Bargen (Henning.von.Bargen)

Date: 2014-02-20 09:49

Regression: Behavior of ZipFile with file-like object and BufferedWriter.

The following code worked with Python 2.6:

LOB_BLOCKSIZE = 1024*1024 # 1 MB

class UnbufferedBlobWriter(io.RawIOBase): """ A file-like wrapper for a write-only cx_Oracle BLOB object. """ def init(self, blobLocator): self.blobLocator = blobLocator self.offset = 0 self.blobLocator.open()

def seekable(self):
    return True
    
def seek(self, offset, whence):
    if whence == 0:
        self.offset = offset
    elif whence == 1:
        self.offset += offset
        if self.offset < 0:
            self.offset = 0
    elif whence == 2:
        if offset <= 0 and -offset <= self.blobLocator.size():
            self.offset = self.blobLocator.size() + offset
        else:
            raise IOError(96, "Invalid offset for BlobWriter")
    else:
        self._unsupported("seek")
    return self.offset
    
def writable(self):
    return True

def write(self, data, offset=None):
    if offset is None:
        offset = self.offset
    self.blobLocator.write(bytes(data), offset + 1)
    self.offset = offset + len(data)
    return len(data)

def close(self):
    self.flush()
    self.blobLocator.close()        

def BlobWriter(blobLocator): """ A file-like wrapper (buffered) for a write-only cx_Oracle BLOB object. """ return io.BufferedWriter(UnbufferedBlobWriter(blobLocator), LOB_BLOCKGROESSE)

Note: The cx_Oracle BLOB object is used to store binary content inside a database. It's basically a file-like-like object.

I'm using it in conjunction with a ZipFile object to store a ZipFile as a BLOB inside the DB, like this:

curs.execute("""
     insert into ...  values (..., Empty_BLOB())
     returning BDATA into :po_BDATA
     """,
     [..., blobvar])
blob = BlobWriter(blobvar.getvalue())
archive = ZipFile(blob, "w", ZIP_DEFLATED)
for filename in ...:
self.log.debug("Saving to ZIP file in the DB: %s", filename)
archive.write(filename, filename)
archive.close()

This used to work with Python 2.6.

With Python 2.7.5 however, somethin like this gets written into the blob: <memory at 0x......>

Digging deeper, I found out that when using the UnbufferedBlobWriter directly (without BufferedWriter), the problem does not occur.

It seems like the behaviour of the BufferedWriter class changed from 2.6 to 2.7, most probably caused by the internal optimization of using the memoryview class.

As a workaround, I had to change my write method, calling tobytes() if necessary:

def write(self, data, offset=None):
    if offset is None:
        offset = self.offset
    if hasattr(data, "tobytes"):
        self.blobLocator.write(data.tobytes(), offset + 1)
    else:
        self.blobLocator.write(bytes(data), offset + 1)
    self.offset = offset + len(data)
    return len(data)

I'm not sure if this is actually a bug in 2.7 or if my usage of BufferedWriter is incorrect (see remark).

For understanding the problem it is important to know that the ZipFile.write method often calls write and seek.

Remark: If I am mis-using BufferedWriter: What precisely is wrong? And if so, why is it so complicated to support a buffered-random-writer? I cannot use io.BufferedRandom because I don't have a read method (and ZipFile.write does not need that).

msg220019 - (view)

Author: Martin Panter (martin.panter) * (Python committer)

Date: 2014-06-08 06:28

I have a related issue in Python 3.4. I suspect it is the same underlying problem as Henning’s. BufferedWriter is trying to write memoryview() objects, but the documentation for RawIOBase.write() implies it only has to accept bytes() and bytearray() objects.

from io import BufferedWriter, RawIOBase class Raw(RawIOBase): ... def writable(self): return True ... def write(self, b): print(b.startswith(b"\n")) ... b = BufferedWriter(Raw()) b.write(b"abc") 3 b.close() Traceback (most recent call last): File "", line 1, in File "", line 3, in write AttributeError: 'memoryview' object has no attribute 'startswith'

msg233802 - (view)

Author: Martin Panter (martin.panter) * (Python committer)

Date: 2015-01-10 07:44

I think the simplest thing to do here would be to update the documentation to match the usage. This patch does so, saying that all write() methods, as well as the BytesIO() constructor, have to accept bytes-like objects. It also expands some tests to verify this, and fixes a resulting bug in _pyio.

msg235805 - (view)

Author: Martin Panter (martin.panter) * (Python committer)

Date: 2015-02-12 11:13

Posting patch v2:

msg235820 - (view)

Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer)

Date: 2015-02-12 13:21

The patch LGTM. But native speaker should check documentation part.

msg236251 - (view)

Author: Martin Panter (martin.panter) * (Python committer)

Date: 2015-02-20 04:27

Posting patch v3; thanks for the reviews!

msg236279 - (view)

Author: R. David Murray (r.david.murray) * (Python committer)

Date: 2015-02-20 13:04

What is your objection to "len(b)"? When I read "len(b)" I know exactly what it means. When I read "the number of bytes in b", I have to think about it it, because it could mean "the number of bytes that that b is long" or "the number of bytes that have been already written to b", and the latter is the meaning my mind goes to first, so it takes time for my mind to realize it is the first.

msg236282 - (view)

Author: Martin Panter (martin.panter) * (Python committer)

Date: 2015-02-20 13:37

Using len(b) is fine if b is a bytes() or bytearray() object, but a bytes-like object can be anything that you can pass to memoryview(). In general len(b) is not part of that protocol, so can return some other value, or even be unimplemented:

from array import array b = array("H", range(2)) len(b) 2 bytes(b) # Actually 4 bytes = 2 items × 2 bytes b'\x00\x00\x01\x00' from ctypes import c_int b = c_int(100) len(b) Traceback (most recent call last): File "", line 1, in TypeError: object of type 'c_long' has no len() bytes(b) # 4 bytes despite not implementing len() b'd\x00\x00\x00'

I see your point that “the number of bytes in b” can be misleading. I will think about the wording some more. Maybe we can come up with a third alternative, like “the number of bytes given” or something.

msg236283 - (view)

Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer)

Date: 2015-02-20 13:41

Because it is not len(b). I fixed several bugs in Python code which called len() for bytes-like argument and failed with array.array or memoryview with non-byte items.

The term "bytes-like object" is slightly misleading. In some cases it implies indexing and len, and iterating, and may be slicing -- common operations for bytes, bytearray, array('B'), memoryview().cast('B'). In more narrow meaning it may require such operations as concatenation (operator +) and .startswith() -- common for bytes and bytearray. In more general meaning it requires only the support of buffer protocol and contiguity. In more general meaning it may be even non-contiguous.

msg236287 - (view)

Author: R. David Murray (r.david.murray) * (Python committer)

Date: 2015-02-20 14:25

How about "the length of b in bytes"?

msg236426 - (view)

Author: Martin Panter (martin.panter) * (Python committer)

Date: 2015-02-23 04:37

Posting bytes-like-param.v4.patch, with David’s suggested wording

msg263247 - (view)

Author: Martin Panter (martin.panter) * (Python committer)

Date: 2016-04-12 13:25

After thinking about Issue 26720 (see also Issue 15994), I think it might be worth documenting not only that bytes-like objects may be passed, but in some cases the class should not be access the object after the method returns. This applies to at least RawIOBase.readinto() and especially RawIOBase.write(). If you want to save the write() data in memory, you have to make a copy, because the original may be lost when BufferedWriter overwrites its internal buffeer.

msg263483 - (view)

Author: Martin Panter (martin.panter) * (Python committer)

Date: 2016-04-15 11:55

Here is an updated patch merged with recent changes. I also added a sentence to the RawIOBase.write() and BufferedIOBase.write() documentation about access to the buffer after the method returns. And I added tests for this.

msg265850 - (view)

Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer)

Date: 2016-05-19 06:46

Added few minor comments on Rietveld.

msg266043 - (view)

Author: Martin Panter (martin.panter) * (Python committer)

Date: 2016-05-22 02:09

I updated my patch thanks to Serhiy’s comments. Also omitted Arg Clinic generated code in this version to avoid future merge conflicts with the hashes.

msg266055 - (view)

Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer)

Date: 2016-05-22 06:35

bytes-like-param.v6.patch LGTM.

msg266523 - (view)

Author: Martin Panter (martin.panter) * (Python committer)

Date: 2016-05-28 00:34

Porting this to Python 2 uncovered a few flaws that I am unsure how to best handle. In this patch I have added workarounds, but perhaps we can find better fixes.

  1. The io.BytesIO.write() C implementation does not accept array.array() objects. I suspect it only accepts objects implementing the “new” memoryview() buffer protocol, and array() only implements the “old” buffer() protocol.

In Python 2, is array.array() considered a bytes-like object? If so, the proposed update to the write() documentation is not consistent. Perhaps it is a real bug to fix in the BytesIO implementation. Or is test_array_writes() just going too far for Python 2?

I added code to skip the relevant test in test_array_writes(). Maybe it would be simplest to document that in general, BufferedIOBase.write() only has to accept “new” memoryview()-able objects.

  1. The _pyio.BufferedIOBase.readinto() native Python implementation cannot assign into arbitrary bytes-like objects. Python 2 does not have memoryview.cast("B"), buffer() objects seem to be read-only, and I can’t think of any other easy way.

I reverted back to only testing native Python readinto() methods with bytearray(). But perhaps it would be best to document that for Python 2, only memoryview()-able arrays of bytes are acceptable, which should be practical to handle in native Python code.

  1. The _pyio.BytesIO.write() and BufferedWriter.write() native Python implementations did not handle obscure bytes-like objects well. I added buffer() calls to the code. But I am not sure this is a good idea, because it will raise warnings with “python -3”. An alternative might be to again only require write() to accept memoryview()-able arrays of bytes.

Perhaps the Python 2 definition of “bytes-like object” should be restricted to the “new” memoryview() buffer protocol, and to 1-D arrays of bytes. I.e. memoryview(byteslike).ndim == 1 and memoryview(byteslike).format == "B".

msg266525 - (view)

Author: Roundup Robot (python-dev) (Python triager)

Date: 2016-05-28 01:35

New changeset bd41151a7429 by Martin Panter in branch '3.5': Issue #20699: Document that “io” methods accept bytes-like objects https://hg.python.org/cpython/rev/bd41151a7429

New changeset c1b40a29dc7a by Martin Panter in branch 'default': Issue #20699: Merge io bytes-like fixes from 3.5 https://hg.python.org/cpython/rev/c1b40a29dc7a

msg266537 - (view)

Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer)

Date: 2016-05-28 06:44

Things are more complex in 2.7. Not all objects supporting old buffer protocol support new buffer protocol, and not all functions accepting old buffer protocol work with new buffer protocol.

buffer(array.array('I')) <read-only buffer for 0xb70fec00, size -1, offset 0 at 0xb70feb60> memoryview(array.array('I')) Traceback (most recent call last): File "", line 1, in TypeError: cannot make memory view because object does not have the buffer interface zlib.compress(buffer('abc')) "x\x9cKLJ\x06\x00\x02M\x01'" zlib.compress(memoryview('abc')) Traceback (most recent call last): File "", line 1, in TypeError: compress() argument 1 must be string or read-only buffer, not memoryview

The definition of the "bytes-like object" term (changed in 74e1c50498f8) looks not correct in 2.7, because it mentions memoryview (rarely supported in 2.7 besides backported from 3.x features), but not buffer (widely supported in 2.7). It is referred only in one place (in Doc/library/hmac.rst, copied from 3.x in ). I would suggest to avoid using this term, and be more specific about supported protocols in every concrete case.

msg267057 - (view)

Author: Martin Panter (martin.panter) * (Python committer)

Date: 2016-06-03 06:25

I agree with avoiding the term “bytes-like object” in 2.7. It is easy to be specific when talking about concrete classes like BufferedWriter and BytesIO. But the harder question is what to specify for the abstract base classes.

In patch v8, I propose to change the documentation to refer to “sequences of bytes which are usable with ‘memoryview’ ”. Technically, some of the _pyio implementations assume more than this (e.g. len(), iteration and slice assignment), but I have left them alone.

msg267069 - (view)

Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer)

Date: 2016-06-03 07:18

Since the _pyio implementation doesn't support general memoryview-like objects, I consider using the new buffer protocol in the C implementation an implementation detail. Other implementations can lack this support. I think it would be better to left the wording more general ("preallocated writable array of bytes"), but mention other concrete example (array('b'), are there other examples in the stdlib?).

msg267072 - (view)

Author: Martin Panter (martin.panter) * (Python committer)

Date: 2016-06-03 07:46

Actually this bug report was opened because an implementation that lacked memoryview support was broken by 2.7. The point is to document that (a subset of) memoryview objects may be passed to custom implementations.

Does this need a “changed in 2.7” notice? I am not so familiar with the history of Python 2 or the various buffer APIs to be sure.

Also see <https://bugs.python.org/issue20699#msg266523>, where I discovered array('b') does not work with io.BytesIO.write(). It seems you have to wrap it in buffer() first, presumably because array() does not support the “new” protocol, but buffer() does.

msg267079 - (view)

Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer)

Date: 2016-06-03 08:09

array is supported explicitly in the Python implementation of readinto().

What if write not "usable with memoryview", but mention memoryview as an example?

The object b should be a pre-allocated, writable array of bytes, such as :class:bytearray, :class:array.array or :class:memoryview.

msg267084 - (view)

Author: Martin Panter (martin.panter) * (Python committer)

Date: 2016-06-03 09:59

I don’t see the point of mentioning array() objects at all. It’s hard to support array in a Python 2 implementation, as demonstrated by readinto(). And the special support for array('b') won’t help if you pass in array('B') with values 128–255. I see this as an implementation detail of _pyio, rather than a need for others to follow its lead.

The array('b') code was added by r56225. I guess it is experimental code from an early phase of the io module. Also, test_array_writes() is disabled for _pyio in 2.7 (https://hg.python.org/cpython/diff/760a710eb6c1/Lib/test/test_io.py), and in 2.6 (http://svn.python.org/view/python/branches/trunk-bytearray/Lib/test/test_io.py?r1=61775&r2=61774&pathrev=61775&view=patch).

I think it is better to avoid “such as” and be specific about what has to be supported. Perhaps:

readinto(b): The object b should be a pre-allocated, writable array of bytes, either :class:bytearray or :class:memoryview.

.. versionchanged:: 2.7 Support for :class:memoryview.

write(b): The object b should be an array of bytes, either :class:str, :class:bytearray, or :class:memoryview.

.. versionchanged:: 2.7 Support for :class:memoryview.

msg267086 - (view)

Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer)

Date: 2016-06-03 10:10

Why :class:str and not :class:bytes?

msg267087 - (view)

Author: Martin Panter (martin.panter) * (Python committer)

Date: 2016-06-03 10:15

I thought :class:str is better documented in Python 2, but I can write bytes if you prefer. I guess it is more consistent with the rest of io.rst.

msg267088 - (view)

Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer)

Date: 2016-06-03 10:35

Current docs use "bytes or bytearray object" (without references). "bytes" is just an alias to "str" in 2.7, but I prefer to use this name, for consistency with 3.x and for accenting the binary nature of the data.

Your new wording LGTM.

msg267791 - (view)

Author: Martin Panter (martin.panter) * (Python committer)

Date: 2016-06-08 05:41

New patch that I plan to commit:

msg268107 - (view)

Author: Roundup Robot (python-dev) (Python triager)

Date: 2016-06-10 08:49

New changeset ec69518aeebc by Martin Panter in branch '2.7': Issue #20699: Document that “io” methods should accept memoryview https://hg.python.org/cpython/rev/ec69518aeebc

msg268185 - (view)

Author: Martin Panter (martin.panter) * (Python committer)

Date: 2016-06-11 05:28

Thanks for you help figuring this out Serhiy, especially for the Python 2 case.

msg268187 - (view)

Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer)

Date: 2016-06-11 05:58

Oh, I forgot to say my LGTM on your last patch. It LGTM.

But I'm feeling the documentation epic about bytes-like objects is far from the end.

History

Date

User

Action

Args

2022-04-11 14:57:58

admin

set

github: 64898

2016-06-11 05:58:43

serhiy.storchaka

set

messages: +

2016-06-11 05:28:44

martin.panter

set

status: open -> closed
resolution: fixed
messages: +

stage: commit review -> resolved

2016-06-10 08:49:42

python-dev

set

messages: +

2016-06-08 05:41:27

martin.panter

set

files: + bytes-like-param.py2.v9.patch

messages: +
stage: patch review -> commit review

2016-06-03 10:35:24

serhiy.storchaka

set

messages: +

2016-06-03 10:15:54

martin.panter

set

messages: +

2016-06-03 10:10:01

serhiy.storchaka

set

messages: +

2016-06-03 09:59:13

martin.panter

set

messages: +

2016-06-03 08:09:47

serhiy.storchaka

set

messages: +

2016-06-03 07:46:51

martin.panter

set

messages: +

2016-06-03 07🔞49

serhiy.storchaka

set

messages: +

2016-06-03 06:25:52

martin.panter

set

files: + bytes-like-param.py2.v8.patch

messages: +

2016-05-28 06:44:41

serhiy.storchaka

set

messages: +

2016-05-28 01:35:48

python-dev

set

nosy: + python-dev
messages: +

2016-05-28 00:34:35

martin.panter

set

files: + bytes-like-param.py2.patch

messages: +

2016-05-22 06:35:27

serhiy.storchaka

set

messages: +

2016-05-22 02:10:00

martin.panter

set

files: + bytes-like-param.v6.patch

messages: +
versions: + Python 3.6, - Python 3.4

2016-05-19 06:46:06

serhiy.storchaka

set

messages: +

2016-04-15 11:55:44

martin.panter

set

files: + bytes-like-param.v5.patch

messages: +

2016-04-12 13:25:06

martin.panter

set

messages: +

2015-02-23 04:37:52

martin.panter

set

files: + bytes-like-param.v4.patch

messages: +

2015-02-20 14:25:43

r.david.murray

set

messages: +

2015-02-20 13:41:16

serhiy.storchaka

set

messages: +

2015-02-20 13:37:46

martin.panter

set

messages: +

2015-02-20 13:04:07

r.david.murray

set

messages: +

2015-02-20 04:27:12

martin.panter

set

files: + bytes-like-param.v3.patch

messages: +

2015-02-12 13:21:24

serhiy.storchaka

set

messages: +
title: Behavior of ZipFile with file-like object and BufferedWriter. -> Document that binary IO classes work with bytes-likes objects

2015-02-12 11:13:08

martin.panter

set

files: + bytes-like-param.v2.patch

messages: +

2015-02-11 23:26:06

serhiy.storchaka

set

nosy: + r.david.murray

2015-02-11 23:01:51

serhiy.storchaka

link

issue12340 superseder

2015-01-10 09:34:53

serhiy.storchaka

set

nosy: + serhiy.storchaka
stage: patch review

versions: + Python 3.5

2015-01-10 07:44:25

martin.panter

set

files: + bytes-like-param.patch

assignee: docs@python
components: + Documentation

keywords: + patch
nosy: + docs@python
messages: +

2014-06-09 19:06:38

ned.deily

set

nosy: + pitrou, benjamin.peterson

2014-06-08 06:28:00

martin.panter

set

nosy: + martin.panter

messages: +
versions: + Python 3.4

2014-02-20 09:49:52

Henning.von.Bargen

create