Issue 31025: io.BytesIO: no way to get the length of the underlying buffer without copying data

Created on 2017-07-25 12:03 by rthr, last changed 2022-04-11 14:58 by admin. This issue is now closed.

Messages (6)
msg299054 - (view)	Author: Arthur Darcet (rthr) *	Date: 2017-07-25 12:03
If I'm not mistaken, a BytesIO buffer can be in three states:

(1) `b = BytesIO(b'data')` -> free of any constraints
(2) `d = b'data'; b = BytesIO(d)` -> cannot modify the underlying bytes without copying them
(3) `b = BytesIO(b'data'); d = b.getbuffer()` -> cannot return a "bytes" representation of the data without copying it (the underlying buffer might change)

My use-case is "how to get the length of the data currently in the BytesIO object". And right now, there are two solutions:

(a) `len(b.getvalue())`
(b) `len(b.getbuffer())`

but solution (a) copies data if the buffer is in state (3), and solution (b) copies data in state (2). And I don't see any way to distinguish between the three states from Python code.

So as far as I understand it, there is no way to get the size of the buffer in Python that would reliably not copy any data.

Should I open a PR to add a `size()` method on the BytesIO class? (simply returning `PyLong_FromSsize_t(self->string_size)`)
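
For readers following along, here is a minimal sketch of solutions (a) and (b). It only shows that both report the same length; whether either call copies depends on the internal state described above, which this snippet does not try to detect. The variable names are illustrative.

```python
import io

b = io.BytesIO(b"data")

# Solution (a): getvalue() returns a bytes object holding the whole contents.
size_via_getvalue = len(b.getvalue())

# Solution (b): getbuffer() returns a memoryview over the contents.
# Releasing the view (here via the context manager) ends the export,
# so the BytesIO object can be resized again afterwards.
with b.getbuffer() as view:
    size_via_getbuffer = len(view)

print(size_via_getvalue, size_via_getbuffer)
```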
msg299056 - (view)	Author: Martin Panter (martin.panter) *	Date: 2017-07-25 12:14
Can’t you use b.seek(0, SEEK_END)?
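
A sketch of what this suggestion could look like as a helper. The name `stream_length` is hypothetical, not part of the `io` module; it relies on `seek()` returning the new absolute position and restores the original position afterwards.

```python
import io


def stream_length(stream):
    """Return the total length of a seekable stream without reading its data.

    Hypothetical helper: the current position is saved and restored,
    so the caller's read/write offset is left unchanged.
    """
    saved_position = stream.tell()
    try:
        # seek() returns the new absolute position, i.e. the end offset.
        return stream.seek(0, io.SEEK_END)
    finally:
        stream.seek(saved_position)


b = io.BytesIO(b"hello world")
b.seek(3)
length = stream_length(b)
print(length, b.tell())
```

Since this uses only `tell()` and `seek()`, it does not depend on how BytesIO stores its data internally.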
msg299060 - (view)	Author: Arthur Darcet (rthr) *	Date: 2017-07-25 12:21
it's a tiny bit slow, but that works, thank you. I guess we can close this

% python -m timeit -s "import io; b = io.BytesIO(b'0' * 2 ** 30)" "p = b.tell(); b.seek(0, 2); b.tell(); b.seek(p)"
1000000 loops, best of 3: 0.615 usec per loop

% python -m timeit -s "import io; b = io.BytesIO(b'0' * 2 ** 30)" "len(b.getvalue())"
10000000 loops, best of 3: 0.174 usec per loop
msg299125 - (view)	Author: R. David Murray (r.david.murray) *	Date: 2017-07-25 17:43
I'm confused, I don't see how there can be any difference between (1) and (2).
msg299215 - (view)	Author: Arthur Darcet (rthr) *	Date: 2017-07-26 08:15
BytesIO is heavily optimised to avoid copying bytes when it can.

For case (1), if you want to modify the data, then there is no need to actually copy it before overwriting it, because no-one else is using it.

For case (2), if you want to change something, then you need to copy it first, otherwise the original bytes object would get modified.

Case (1):
% python -m timeit -s "import io; b = io.BytesIO(b'0' * 2 ** 30)" "b.getbuffer()"
1000000 loops, best of 3: 0.201 usec per loop

Case (2):
% python -m timeit -s "import io; a = b'0' * 2 ** 30; b = io.BytesIO(a)" "b.getbuffer()"
10 loops, best of 3: 54.5 msec per loop
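
The comparison above can also be sketched as a script. This is an illustration under assumptions, not a benchmark: the helper names and the 16 MiB size are made up here, the timings vary by machine, and the difference between the two cases is a CPython implementation detail. The sketch times a single `getbuffer()` call on a fresh object each time, on the assumption that only the first call can trigger a copy.

```python
import io
import time

SIZE = 2 ** 24  # 16 MiB; an arbitrary size chosen for this sketch


def first_getbuffer_seconds(make_stream):
    """Time one getbuffer() call on a freshly created BytesIO object."""
    stream = make_stream()
    start = time.perf_counter()
    view = stream.getbuffer()
    elapsed = time.perf_counter() - start
    view.release()  # end the export so the stream is left unconstrained
    return elapsed


def make_case_1():
    # Case (1): the initial bytes object is referenced only by the BytesIO.
    return io.BytesIO(b"0" * SIZE)


shared_bytes = b"0" * SIZE


def make_case_2():
    # Case (2): the initial bytes object is also referenced from outside.
    return io.BytesIO(shared_bytes)


case_1_seconds = first_getbuffer_seconds(make_case_1)
case_2_seconds = first_getbuffer_seconds(make_case_2)
print(f"case (1): {case_1_seconds:.6f} s")
print(f"case (2): {case_2_seconds:.6f} s")
```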
msg299233 - (view)	Author: R. David Murray (r.david.murray) *	Date: 2017-07-26 14:09
So you are saying that BytesIO has code that checks that its argument only has a single reference and modifies the string in place when it can if so? You can't depend on that in any other implementation of Python, and shouldn't depend on it in CPython either. Even in CPython you can't guarantee that case 1 is case 1, since the argument could conceivably be an interned string. So the seek approach is the only one that makes semantic sense, I think.

History
Date	User	Action	Args
2022-04-11 14:58:49	admin	set	github: 75208
2017-07-26 14:09:58	r.david.murray	set	messages: +
2017-07-26 08:15:34	rthr	set	messages: + versions: - Python 3.7
2017-07-25 17:43:02	r.david.murray	set	nosy: + r.david.murray messages: +
2017-07-25 12:21:58	rthr	set	status: open -> closed messages: + stage: resolved
2017-07-25 12:14:58	martin.panter	set	nosy: + martin.panter messages: + versions: - Python 2.7, Python 3.3, Python 3.4, Python 3.5, Python 3.6
2017-07-25 12:03:57	rthr	create