[Python-Dev] Adding bytes.frombuffer() constructor to PEP 467 (was: [Python-ideas] Adding bytes.frombuffer() constructor)

INADA Naoki songofacandy at gmail.com
Wed Oct 12 05:34:18 EDT 2016


On Wed, Oct 12, 2016 at 2:07 PM, Nick Coghlan <ncoghlan at gmail.com> wrote:

I don't think it makes sense to add any more ideas to PEP 467. That needed to be a PEP because it proposed breaking backwards compatibility in a couple of areas, and because of the complex history of Python 3's "bytes-as-tuple-of-ints" and Python 2's "bytes-as-str" semantics.

Other enhancements to the binary data handling APIs in Python 3 can be considered on their own merits.

I see. My proposal should be a separate PEP (if a PEP is required).

* It isn't "one obvious way": Developers including me may forget to use the context manager. And since it works on CPython, it's hard to point it out.

To add to the confusion, there's also https://docs.python.org/3/library/stdtypes.html#memoryview.tobytes giving:

    line = memoryview(buf)[:n].tobytes()

However, folks do need to learn that many mutable data types will lock themselves against modification while you have a live memory view on them, so it's important to release views promptly and reliably when we don't need them any more.
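The locking behaviour described above can be seen with a short sketch. It assumes CPython and a bytearray as the mutable buffer; the names `view`, `resize_blocked` and `line` are only illustrative.

```python
# Illustration only: a live memoryview blocks resizing of a bytearray.
buf = bytearray(b"foo\r\nbar\r\n")
n = 3

view = memoryview(buf)
try:
    buf.extend(b"baz\r\n")      # attempt to resize while a view is exported
    resize_blocked = False
except BufferError:
    resize_blocked = True       # the bytearray refuses to resize

view.release()                  # explicit release unlocks the bytearray
buf.extend(b"baz\r\n")          # now succeeds

# Releasing through the context manager gives the same guarantee.
with memoryview(buf) as m:
    line = m[:n].tobytes()      # copies only the first n bytes
```

In-place modification that keeps the size unchanged is still allowed while a view is live; it is resizing that raises BufferError.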

I agree. io.TextIOWrapper objects report a ResourceWarning for an unclosed file. I think the same warning for unreleased memoryview objects may help developers.
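As a rough sketch of that idea: memoryview itself emits no such warning, so the wrapper class below (`TrackedView`, a hypothetical name) only imitates what a ResourceWarning for unreleased views might look like. It relies on `__del__`, so the timing of the warning depends on the implementation's garbage collection.

```python
import warnings


class TrackedView:
    """Hypothetical wrapper that warns when a view is dropped unreleased."""

    def __init__(self, obj):
        self._released = True            # nothing to release until the view exists
        self._view = memoryview(obj)
        self._released = False

    def release(self):
        self._view.release()
        self._released = True

    def __enter__(self):
        return self._view

    def __exit__(self, *exc_info):
        self.release()

    def __del__(self):
        if not self._released:
            # Mirror the style of the warning files give when left unclosed.
            warnings.warn("unreleased memoryview", ResourceWarning)
            self._view.release()


buf = bytearray(b"foo\r\nbar\r\n")
with TrackedView(buf) as m:              # released on exit, so no warning
    line = bytes(m[:3])
```

ResourceWarning is ignored by default, so it would only show up when warnings are enabled, for example with `python3 -W default`.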

Quick benchmark:

(temporary bytes)
$ python3 -m perf timeit -s 'buf = bytearray(b"foo\r\nbar\r\nbaz\r\n")' -- 'bytes(buf)[:3]'
....................
Median +- std dev: 652 ns +- 19 ns

(temporary memoryview without "with")
$ python3 -m perf timeit -s 'buf = bytearray(b"foo\r\nbar\r\nbaz\r\n")' -- 'bytes(memoryview(buf)[:3])'
....................
Median +- std dev: 886 ns +- 26 ns

(temporary memoryview with "with")
$ python3 -m perf timeit -s 'buf = bytearray(b"foo\r\nbar\r\nbaz\r\n")' -- '
with memoryview(buf) as m:
    bytes(m[:3])
'
....................
Median +- std dev: 1.11 us +- 0.03 us

This is normal though, as memory views trade lower O(N) costs (reduced data copying) for higher O(1) setup costs (creating and managing the view, indirection for data access).
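For readers without the perf module, the same three variants can be compared with the standard library's timeit. This is only a sketch: the function names are made up here, and the absolute numbers will differ from those quoted above depending on machine and Python version.

```python
import timeit

buf = bytearray(b"foo\r\nbar\r\nbaz\r\n")


def via_bytes():
    # Copy the whole buffer, then slice the copy.
    return bytes(buf)[:3]


def via_view():
    # Temporary memoryview, released implicitly when it is garbage collected.
    return bytes(memoryview(buf)[:3])


def via_view_with():
    # Temporary memoryview, released explicitly by the context manager.
    with memoryview(buf) as m:
        return bytes(m[:3])


loops = 100_000
results = {}
for func in (via_bytes, via_view, via_view_with):
    best = min(timeit.repeat(func, number=loops, repeat=3))
    results[func.__name__] = best / loops * 1e9   # nanoseconds per call
    print(f"{func.__name__:15s} {results[func.__name__]:8.1f} ns per call")
```

With a 15-byte buffer the copy is cheap, so the fixed cost of creating and releasing the view is likely to dominate; a much larger buffer should favour the memoryview variants.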

Yes. When the data is small, the benefit of copying less data is easily hidden.

One big difficulty for I/O frameworks like asyncio is that we can't assume the data size. A framework should be optimized for both many small chunks and large data.

With memoryview, when we optimize for large data (e.g. downloading a large file), performance for lots of small data (e.g. a small JSON API) becomes worse.

Actually, one pull request gave up on using memoryview because of this.

https://github.com/python/asyncio/pull/395#issuecomment-249044218

Proposed solution
=================

Adding one more constructor to bytes:

    # when length=-1 (default), use until end of byteslike.
    bytes.frombuffer(byteslike, length=-1, offset=0)

With this API

    with memoryview(buf) as m:
        line = bytes(m[:n])

becomes

    line = bytes.frombuffer(buf, n)

Does that need to be a method on the builtin rather than a separate helper function, though? Once you define:

    def snapshot(buf, length=None, offset=0):
        with memoryview(buf) as m:
            return m[offset:length].tobytes()

then that can be replaced by a more optimised C implementation without users needing to care about the internal details.
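A pure-Python stand-in for the proposed constructor might look like the sketch below. bytes.frombuffer() does not exist; the module-level function `frombuffer` is hypothetical, and the argument checks and the clamping of out-of-range values (inherited from slicing) are assumptions rather than part of the proposal.

```python
def frombuffer(byteslike, length=-1, offset=0):
    """Copy `length` bytes starting at `offset` out of a bytes-like object.

    Hypothetical stand-in for the proposed bytes.frombuffer();
    length=-1 means "until the end of byteslike".
    """
    if offset < 0 or length < -1:
        raise ValueError("offset must be >= 0 and length must be >= -1")
    with memoryview(byteslike) as m:
        # Convert (offset, length) into slice bounds.
        stop = None if length == -1 else offset + length
        with m[offset:stop] as part:     # release the sub-view as well
            return part.tobytes()


buf = bytearray(b"foo\r\nbar\r\nbaz\r\n")
first = frombuffer(buf, 3)               # first three bytes
second = frombuffer(buf, 3, 5)           # three bytes starting at offset 5
rest = frombuffer(buf, offset=10)        # everything from offset 10 onwards
```

Here `length` is a count of bytes, whereas slicing with `m[offset:length]` treats the second value as a stop index; the two only agree when `offset` is 0.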

I'm thinking about adding such a helper function to an asyncio speedup C extension. But there are other non-blocking I/O frameworks: Tornado, Twisted, and curio.

And relying on a C extension makes it harder to optimize for other Python implementations. If it is in the standard library, PyPy and other Python implementations can optimize it.

That is, getting back to a variant on one of Serhiy's suggestions in the last PEP 467 discussion, it may make sense for us to offer a "buffertools" library that's specifically aimed at supporting efficient buffer manipulation operations that minimise data copying. The pure Python implementations would work entirely through memoryview, but we could also have selected C accelerated operations if that showed a noticeable improvement on asyncio's benchmarks.

It seems like a nice idea. I'll read the discussion.

Regards, Nick.

P.S. The length/offset API design is also problematic due to the way it differs from range() & slice(), but I don't think it makes sense to get into that kind of detail before discussing the larger question of adding a new helper module for working efficiently with memory buffers vs further widening the method API for the builtin bytes type.

-- Nick Coghlan | ncoghlan at gmail.com | Brisbane, Australia

I avoided a slice-like API intentionally, because if it looks like a slice, someone will propose adding step support just for consistency.

But, as Serhiy said, consistency with the old buffer API is nice.

-- INADA Naoki <songofacandy at gmail.com>


