Performance of str.encode vs codecs.getwriter

davetapley (Dave Tapley) February 14, 2024, 5:34pm 1

For reasons* I need to get from json.dump to a BytesIO.

My first thought was:

import io
import json

out = io.BytesIO(json.dumps(data).encode('utf-8'))

But it occurred to me that json.dump accepts SupportsWrite[str] (i.e. any text-mode writable object, such as a StringIO), and this SO answer suggests using codecs.getwriter to bridge from a text writer to BytesIO, e.g.:

import codecs
import io
import json

out = io.BytesIO()
writer = codecs.getwriter('utf-8')(out)
json.dump(data, writer)

It works, but to my surprise it is much slower than the encode version, at least according to my test** (results in gist).

Could someone sanity-check whether that test is ‘realistic’, and offer ideas on why the latter implementation is so much slower?


* by using the ‘file-like’ BytesIO I presume I get the benefit of wsgi.file_wrapper per:

** I also put an orjson implementation in there, and reassuringly that’s even faster.

Stefan2 (Stefan) February 14, 2024, 5:47pm 2

So much slower? Where did you tell us how much slower it is?

barry-scott (Barry Scott) February 14, 2024, 5:48pm 3

My guess is that it’s the number of function calls that is making the difference.
With encode() there is one call to change the JSON-encoded data from unicode to bytes.
With the getwriter it is called repeatedly for each piece of encoded data.

You should be able to confirm that by using a wrapper around writer that counts the number of calls made.
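For what it’s worth, a minimal sketch of such a counting wrapper (the CountingWriter class and sample data here are illustrative, not from the original test):

import codecs
import io
import json

class CountingWriter:
    """Forwards write() to the wrapped writer and counts the calls."""
    def __init__(self, writer):
        self._writer = writer
        self.calls = 0

    def write(self, s):
        self.calls += 1
        return self._writer.write(s)

data = {"numbers": list(range(1000))}

out = io.BytesIO()
writer = CountingWriter(codecs.getwriter('utf-8')(out))
json.dump(data, writer)
print(writer.calls)  # json.dump issues many small writes; encode() is one call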

bschubert (Brian Schubert) February 14, 2024, 6:03pm 4

I’m not so sure about this premise. Typically, the point of using a file-like object is to avoid loading data all at once. But in this case, all of the data is already in a Python object. Without knowing anything about the library you’re using, I would guess that using BytesIO in this way is going to be strictly slower, since (1) you’re creating extra copies of the data, and (2) you’re forcing the library to make many function calls to access the data chunk by chunk instead of letting it use the already-existing str/bytes object.

jamestwebber (James Webber) February 14, 2024, 6:08pm 5

Along those lines: it’s just trading speed for memory. The first version converts the whole thing in one go; the second writes it a bit at a time. For large data this can make a difference in timing, but for very large data it’s necessary.

davetapley (Dave Tapley) February 14, 2024, 6:26pm 6

Ha, good point. Here’s my output:

13.968287179000981    # codecs.getwriter version
2.118424963999132     # str.encode version
0.22043878200020117   # orjson version

Rosuav (Chris Angelico) February 14, 2024, 6:30pm 7

UTF-8 is an incredibly common encoding, and str.encode() has a fast path for it. I guess codecs.getwriter() doesn’t, which might change if anyone cares enough, but since most people use str.encode(), that’s the one worth optimizing the most.
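A rough way to see that fast path in isolation (an illustrative sketch; the payload and iteration count are arbitrary):

import codecs
import io
import timeit

s = '{"key": "value"} ' * 1000

def with_encode():
    # One str.encode() call, which has an optimized UTF-8 path
    io.BytesIO(s.encode('utf-8'))

def with_writer():
    # Same data routed through the codecs StreamWriter machinery
    out = io.BytesIO()
    codecs.getwriter('utf-8')(out).write(s)

print(timeit.timeit(with_encode, number=10_000))
print(timeit.timeit(with_writer, number=10_000))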

davetapley (Dave Tapley) February 14, 2024, 6:32pm 8

Thanks for addressing that part of the post 🙏

So initially I was using json.dumps (to a str) and setting Falcon’s Response.text. When I did that and made a request with a large response body (multiple MBs), it would block other requests.

I switched to Response.set_stream with a BytesIO, and that blocking behavior went away.

I assume it’s GIL-related, as hinted here; perhaps it would be slower (for the reasons you mention) were it not for the GIL.
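For reference, the switch looks roughly like this (a minimal sketch, assuming a Falcon resource; the resource class and payload are illustrative):

import io
import json

import falcon

class Resource:
    def on_get(self, req, resp):
        data = {"numbers": list(range(1000))}  # stand-in payload
        body = json.dumps(data).encode('utf-8')

        # Instead of: resp.text = json.dumps(data)
        resp.set_stream(io.BytesIO(body), len(body))
        resp.content_type = falcon.MEDIA_JSON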

barry-scott (Barry Scott) February 14, 2024, 10:22pm 9

Sounds like an issue in Falcon’s (the library?) code that is not doing network I/O fairly.
I doubt the GIL is involved, as I think you are describing an I/O issue, not a CPU-bound problem.

Rosuav (Chris Angelico) February 14, 2024, 10:37pm 10

Or possibly, the correct way to do it is set_stream. It certainly seems plausible.