"Yielding" outside the context of generators (original) (raw)
I was looking at ways to iterate through directory entries and came across Path.iterdir(), which according to the documentation:
When the path points to a directory, yield path objects of the directory contents:
Looking at the source code for this method, it does not “yield” anything in the literal Python sense of the word. It does use something that resembles a generator under the hood, namely os.scandir(), which yields one directory entry at a time, but Path.iterdir() seems to just collect everything it yields into a single collection and return it.
I personally find this phrasing confusing, and this yielding-but-actually-not-really term is used in other places in the documentation as well. It’s especially unclear when, like here, the function or method returns an iterable: do we iterate through it one-by-one like a generator? Or might it store a million elements in a single go before we actually iterate through the thing (if we have a large directory structure in this particular case)? The latter poses a potential problem in low-RAM environments.
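Roughly the shape I mean (my own sketch, not the actual CPython source):

    import os
    from pathlib import Path

    def iterdir_sketch(path):
        # Collects everything os.scandir() yields up front,
        # then hands back an iterator over the finished list.
        with os.scandir(path) as entries:
            children = [Path(path, entry.name) for entry in entries]
        return iter(children)

Something shaped like this still satisfies the documented wording, but the whole listing sits in memory before the caller sees the first element.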
nedbat (Ned Batchelder) June 14, 2025, 7:51pm 2
I’ve used the verb “produces” when I want to describe what an iterator does without getting into implementation details.
AA-Turner (Adam Turner) June 14, 2025, 7:54pm 3
My interpretation of ‘yield’ here is an instruction to the programmer that results are not returned as a single collection, but that iterating over the returned object will be required. In this case, the function returns an iterator, which behaves similarly to yield in a loop.
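For example, from the caller’s side these two are interchangeable (a throwaway sketch):

    def with_yield():
        for i in range(3):
            yield i * i

    def with_iterator():
        return iter([i * i for i in range(3)])

    # Callers cannot tell the difference without peeking at the implementation:
    assert list(with_yield()) == list(with_iterator()) == [0, 1, 4]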
I don’t think one can draw inferences about the memory consumption of a method or function from its return type in a general fashion, but perhaps we can consider improving documentation or guidance for memory usage. This would need some thought, though, as it could introduce unforeseen constraints on refactoring in the future.
JamesParrott (James Parrott) June 15, 2025, 12:00pm 4
I think there’s just been a design compromise in that method. Using “produce” instead of “yield” in the docs would be unambiguous, both for it and for other methods with iter in the name.
Regarding the implementation, the memory use from the list comp could be avoided by writing a slightly more elaborate generator. But the existing implementation is tried and tested.
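Something along these lines would keep only one entry alive at a time (just a sketch, not a drop-in replacement for the real method):

    import os
    from pathlib import Path

    def iterdir_lazy(path):
        # A true generator: each child Path is produced only when the caller asks for it.
        with os.scandir(path) as entries:
            for entry in entries:
                yield Path(path, entry.name)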
might it store a million elements in a single go before we actually iterate through the thing (if we have a large directory structure in this particular case)? The latter poses a potential problem in low-RAM environments.
I’m genuinely curious, but I think my Beagle boards can handle this. How many low-RAM environments that can run CPython have access to file systems with directories containing a million file objects, yet have less than 200 MB of RAM available (200 B is what sys.getsizeof tells me an os.DirEntry object is)? And how many devs targeting such environments are going to be relying on pathlib, rather than optimising their code to handle millions of files in the same directory?
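For reference, the 200 B figure comes from something like this (the exact number varies by platform and Python version, and this assumes the current directory isn’t empty):

    import os
    import sys

    # Peek at the reported size of a single os.DirEntry object.
    with os.scandir(".") as entries:
        entry = next(entries)
        print(sys.getsizeof(entry))

    # Back-of-the-envelope: a million entries at ~200 B each.
    print(1_000_000 * 200)   # 200_000_000 bytes, i.e. roughly 200 MB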
AlSweigart (Al Sweigart) June 15, 2025, 3:38pm 5
Grepping the docs for “generates” would probably produce similar points of possible confusion.
encukou (Petr Viktorin) June 16, 2025, 6:31am 6
I’ve always read “yield” in docs as “returns an iterator of”. (Very convenient with docstrings, to describe the function in the first line.)
A yield statement isn’t the only way to yield things, just like class isn’t the only way to define a class.
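A quick illustration of the analogy (my own toy example, nothing to do with pathlib):

    # An iterator that yields squares without a single yield statement:
    class Squares:
        def __init__(self, n):
            self.i, self.n = 0, n
        def __iter__(self):
            return self
        def __next__(self):
            if self.i >= self.n:
                raise StopIteration
            self.i += 1
            return self.i ** 2

    # A class defined without the class statement:
    Point = type("Point", (), {"x": 0, "y": 0})

    print(list(Squares(4)))   # [1, 4, 9, 16]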
peterc (peter) June 16, 2025, 6:52am 7
I could still think up situations where this would bite.
If you assemble a list of 10,000 iterdir objects, my intuition would be that the operation should be pretty much free. But if each of those iterdir objects requires 2 MB of RAM (for 10,000 files in the folder), that’s 20 GB of RAM, which could be a problem.
Granted, I’ve never assembled a list of 10,000 iterdir objects, and I don’t think I’d ever do this in production. But during development
    my_iterators = [
        # g(y).iterdir() runs once per y as soon as this list is built,
        # even though f() only runs when the inner generators are consumed.
        (f(p, y) for p in g(y).iterdir())
        for y in my_y
    ]
(or something vaguely in that direction) is something I might conceivably want to play around with. And I wouldn’t expect danger from memory overuse, because iterator objects in Python are “known” to be memory efficient.
JamesParrott (James Parrott) June 16, 2025, 7:27am 8
It’s not perfect, and I have the same intuition about iterator objects. But I don’t think the core devs should feel obliged to optimise the library for such avoidable edge cases, e.g. ones that need 20 GB of RAM to process a million (the same million?) files 10,000 times.
peterc (peter) June 16, 2025, 7:32am 9
I thought you said a single os.DirEntry instance is 200 B, and the iterdir object has all the DirEntry objects in memory, for a total size of number_of_addresses * 200 B? Which computes to 10_000 * 200 B == 2_000_000 B == 2 MB?
JamesParrott (James Parrott) June 16, 2025, 7:34am 10
I did. I misunderstood - I’ve corrected it now. Apologies.
peterc (peter) June 16, 2025, 7:46am 11
I agree, this doesn’t desperately need a fix. The total time required to fix it is probably much more than the total time that will be lost by people running into this issue.
Google is quite capable of finding Issue 39907: `pathlib.Path.iterdir()` wastes memory by using `os.listdir()` rather than `os.scandir()` - Python tracker, which documents the issue well enough, so the documentation doesn’t necessarily need an update either.
encukou (Petr Viktorin) June 16, 2025, 8:11am 12
That’s an overgeneralization. Iterators allow a memory-efficient implementation; they don’t guarantee it. Even a real generator with genuine yield statements can build a list first and then yield its elements.
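For instance, this is a perfectly real generator, yet it materialises every name before yielding the first one (a sketch):

    import os

    def iter_names(path):
        names = sorted(os.listdir(path))   # the whole listing is built in memory here
        for name in names:                 # ...and only then yielded one at a time
            yield name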
Of course, it would be “good” if CPython used a “good” implementation where it can, but it’s not a bug if it doesn’t.
IMO, if the docs made the guarantee you seek, you would see something like “CPython implementation detail: a memory-efficient implementation is used on platforms X and Y.” You don’t see those much, because we tend not to make such guarantees.