Issue 33081: multiprocessing Queue leaks a file descriptor associated with the pipe writer

Created on 2018-03-15 18:05 by Henrique Andrade, last changed 2022-04-11 14:58 by admin. This issue is now closed.

Messages (21)

msg313899 - (view)

Author: Henrique Andrade (Henrique Andrade)

Date: 2018-03-15 18:05

A simple example like the following demonstrates that one of the file descriptors associated with the underlying pipe will be leaked:

from multiprocessing.queues import Queue

x = Queue()
x.close()

Right after the queue is created we get (assuming the Python interpreter is associated with pid 8096 below):

ll /proc/8096/fd
total 0
dr-x------ 2 hcma hcma  0 2018-03-15 14:03:23.210089578 -0400 .
dr-xr-xr-x 9 hcma hcma  0 2018-03-15 14:03:23.190089760 -0400 ..
lrwx------ 1 hcma hcma 64 2018-03-15 14:03:33.145998954 -0400 0 -> /dev/pts/25
lrwx------ 1 hcma hcma 64 2018-03-15 14:03:33.145998954 -0400 1 -> /dev/pts/25
lrwx------ 1 hcma hcma 64 2018-03-15 14:03:23.210089578 -0400 2 -> /dev/pts/25
lr-x------ 1 hcma hcma 64 2018-03-15 14:03:33.145998954 -0400 3 -> pipe:[44076946]
l-wx------ 1 hcma hcma 64 2018-03-15 14:03:33.145998954 -0400 4 -> pipe:[44076946]
lr-x------ 1 hcma hcma 64 2018-03-15 14:03:33.145998954 -0400 5 -> /dev/urandom

After close():

ll /proc/8096/fd
total 0
dr-x------ 2 hcma hcma  0 2018-03-15 14:03:23.210089578 -0400 .
dr-xr-xr-x 9 hcma hcma  0 2018-03-15 14:03:23.190089760 -0400 ..
lrwx------ 1 hcma hcma 64 2018-03-15 14:03:33.145998954 -0400 0 -> /dev/pts/25
lrwx------ 1 hcma hcma 64 2018-03-15 14:03:33.145998954 -0400 1 -> /dev/pts/25
lrwx------ 1 hcma hcma 64 2018-03-15 14:03:23.210089578 -0400 2 -> /dev/pts/25
lr-x------ 1 hcma hcma 64 2018-03-15 14:03:33.145998954 -0400 3 -> pipe:[44076946]
l-wx------ 1 hcma hcma 64 2018-03-15 14:03:33.145998954 -0400 4 -> pipe:[44076946]
lr-x------ 1 hcma hcma 64 2018-03-15 14:03:33.145998954 -0400 5 -> /dev/urandom

msg313900 - (view)

Author: Henrique Andrade (Henrique Andrade)

Date: 2018-03-15 18:08

Correcting my original - after close():

ll /proc/8096/fd
total 0
dr-x------ 2 hcma hcma  0 2018-03-15 14:03:23.210089578 -0400 .
dr-xr-xr-x 9 hcma hcma  0 2018-03-15 14:03:23.190089760 -0400 ..
lrwx------ 1 hcma hcma 64 2018-03-15 14:03:33.145998954 -0400 0 -> /dev/pts/25
lrwx------ 1 hcma hcma 64 2018-03-15 14:03:33.145998954 -0400 1 -> /dev/pts/25
lrwx------ 1 hcma hcma 64 2018-03-15 14:03:23.210089578 -0400 2 -> /dev/pts/25
l-wx------ 1 hcma hcma 64 2018-03-15 14:03:33.145998954 -0400 4 -> pipe:[44076946]
lr-x------ 1 hcma hcma 64 2018-03-15 14:03:33.145998954 -0400 5 -> /dev/urandom

msg313921 - (view)

Author: Pablo Galindo Salgado (pablogsal) * (Python committer)

Date: 2018-03-15 23:26

It is not leaked, as the object is still alive. Only one side is closed when you call close(). If you destroy the object:

del x

You will see that there are no more file descriptors in /proc/PID/fd associated with that queue.

Also, notice that close() is called when the queue is destroyed.
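For illustration, a minimal sketch of that sequence (assuming CPython 2.7 on Linux, as in the report, so that dropping the last reference reclaims the object immediately; the /proc listing is just one way to observe the fds):

import os
from multiprocessing.queues import Queue

x = Queue()
print(os.listdir('/proc/%d/fd' % os.getpid()))  # both pipe fds (reader and writer) are present
x.close()   # closes only the reader side of the pipe
del x       # destroying the queue releases the writer fd as well
print(os.listdir('/proc/%d/fd' % os.getpid()))  # the pipe fds should now be gone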

msg313922 - (view)

Author: Henrique Andrade (Henrique Andrade)

Date: 2018-03-15 23:46

Pablo, but there is no way to close the other side.

Indeed, if you look at the implementation, you will see that the writer file descriptor can't be closed.

msg313924 - (view)

Author: Pablo Galindo Salgado (pablogsal) * (Python committer)

Date: 2018-03-15 23:56

The way is to delete the object. IMHO I would not say it is "leaked", as the object is still alive and holds resources, and these resources are properly handled on destruction.

I cannot think of an immediate use case of closing both file descriptors but not deleting the object.

msg313926 - (view)

Author: Pablo Galindo Salgado (pablogsal) * (Python committer)

Date: 2018-03-16 00:06

Notice that the writer gets closed when it receives a sentinel value (which is how the queue knows when to close, as part of the design):

x.put(multiprocessing.queues._sentinel)

If you call close() after this line, there will not be any fd associated with the queue left open.
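A minimal sketch of that suggestion (assuming Python 2.7, where _sentinel is a private module-level object in multiprocessing.queues; whether the writer fd really disappears is disputed in the following messages):

import os
from multiprocessing.queues import Queue, _sentinel

q = Queue()
q.put(_sentinel)   # the feeder thread is expected to close the writer when it sees the sentinel
q.close()          # closes the reader side
print(os.listdir('/proc/%d/fd' % os.getpid()))  # check on Linux whether the pipe fds remain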

msg313927 - (view)

Author: Henrique Andrade (Henrique Andrade)

Date: 2018-03-16 01:29

Unfortunately this is not the case.

I will shrink my repro down to a more manageable size and post it here.

msg314014 - (view)

Author: Henrique Andrade (Henrique Andrade)

Date: 2018-03-17 21:19

Here is the repro (I am running this on Ubuntu 16 with the stock Python version 2.7.12):

========================================================================================

#!/usr/bin/env python

import os
import subprocess
import sys

from multiprocessing import Process, Queue
from multiprocessing.queues import _sentinel


def run_external_application(application_name, queue):
    """Runs an Oxygen application"""
    exit_status = 10
    queue.put(exit_status)
    # none of the following help as far as making the pipe go away
    queue.put(_sentinel)
    queue.close()


def run(application_name="external_application"):
    print "Starting '%s'" % application_name

    queue = Queue()
    application_process = Process(target=run_external_application,
                                  args=(application_name, queue))

    application_process.start()

    try:
        application_process.join()
    except KeyboardInterrupt:
        application_process.terminate()

    exit_status = queue.get()
    print "exit status", exit_status

    queue.close()
    # the deletion below has no effect
    del queue
    # the only thing that will make the pipe go away is to uncomment the below statement
    # queue._writer.close()

    print "\nthe '%s' application finished with exit status '%s'...\n" % (application_name, exit_status)

    print "Note the file descriptor #4 below"
    subprocess.call(["ls", "-la", "/proc/%d/fd" % os.getpid()])

    return exit_status


if __name__ == "__main__":
    print "starting ", os.getpid()
    exit_status = run()
    sys.exit(exit_status)

msg314037 - (view)

Author: Pablo Galindo Salgado (pablogsal) * (Python committer)

Date: 2018-03-18 12:19

When I run your script I do not see any file descriptor associated with the queue when

subprocess.call(["ls", "-la", "/proc/%d/fd" % os.getpid()])

is executed.

This is my output if I just execute your program:

starting  3728
Starting 'external_application'
exit status 10

the 'external_application' application finished with exit status '10'...

Note the file descriptor #4 below
total 0
dr-x------ 2 pablogsal pablogsal  0 Mar 18 12:17 .
dr-xr-xr-x 9 pablogsal pablogsal  0 Mar 18 12:17 ..
lrwx------ 1 pablogsal pablogsal 64 Mar 18 12:17 0 -> /dev/pts/1
lrwx------ 1 pablogsal pablogsal 64 Mar 18 12:17 1 -> /dev/pts/1
lrwx------ 1 pablogsal pablogsal 64 Mar 18 12:17 2 -> /dev/pts/1
lr-x------ 1 pablogsal pablogsal 64 Mar 18 12:17 5 -> /dev/urandom

This is my output if I remove the call to "del queue":

starting  3892
Starting 'external_application'
exit status 10

the 'external_application' application finished with exit status '10'...

Note the file descriptor #4 below
total 0
dr-x------ 2 pablogsal pablogsal  0 Mar 18 12:18 .
dr-xr-xr-x 9 pablogsal pablogsal  0 Mar 18 12:18 ..
lrwx------ 1 pablogsal pablogsal 64 Mar 18 12:18 0 -> /dev/pts/1
lrwx------ 1 pablogsal pablogsal 64 Mar 18 12:18 1 -> /dev/pts/1
lrwx------ 1 pablogsal pablogsal 64 Mar 18 12:18 2 -> /dev/pts/1
l-wx------ 1 pablogsal pablogsal 64 Mar 18 12:18 4 -> 'pipe:[104568]'
lr-x------ 1 pablogsal pablogsal 64 Mar 18 12:18 5 -> /dev/urandom

So at least on my side "del queue" is having the desired effect.

Notice that calling

del queue

does not destroy the object but just decrements its reference count by one. To force collection of any possible cycles, once you have dropped all references on your side, you can call:

import gc
gc.collect()
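Putting both steps together, a minimal sketch (assuming the same Linux /proc inspection used in the repro; the listing is only there to observe the effect):

import gc
import os
import subprocess
from multiprocessing import Queue

queue = Queue()
queue.close()   # closes the reader side only
del queue       # drop the last reference; CPython normally reclaims the object right away
gc.collect()    # force collection in case the queue was kept alive by a reference cycle
subprocess.call(["ls", "-la", "/proc/%d/fd" % os.getpid()])  # the pipe fds should be gone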

msg314043 - (view)

Author: Henrique Andrade (Henrique Andrade)

Date: 2018-03-18 15:34

@pablo: I am using Python 2.7.12 (distributed with Ubuntu 16), what are you using? This might explain the difference between what we see.

Yet, irrespective of this difference, IMHO it would be a better design to have "close" actually close the underlying resources.

In general, if one has to delete and/or invoke the garbage collector on an object, it's an indication that the design needs a bit of polish. Just picture the small scenario I described amplified to a situation where a large number of queues is used; that is perhaps an artificial scenario, but one would end up with a bunch of file descriptors hanging around for no reason.

This is what files and sockets, for example, do.
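A minimal sketch of that amplified scenario (assuming Python 2.7 on Linux; the behaviour described in the comments is what this report claims, not a measured result):

import os
from multiprocessing import Queue

queues = [Queue() for _ in range(100)]
for q in queues:
    q.close()   # closes only the reader side of each pipe
# the 100 writer fds are expected to stay open until the queue objects are destroyed
print(len(os.listdir('/proc/%d/fd' % os.getpid())))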

msg314045 - (view)

Author: Pablo Galindo Salgado (pablogsal) * (Python committer)

Date: 2018-03-18 16:46

Notice that the documentation for close says:

Indicate that no more data will be put on this queue by the current process. The background thread will quit once it has flushed all buffered data to the pipe. This is called automatically when the queue is garbage collected.

The method does not promise to close any pipe, just "Indicate that no more data will be put on this queue by the current process". Prematurely closing the writer side could lead to issues. I still do not understand why you would want to close the pipes but keep the queue alive.

I could be missing something, so let's see if other people think differently about this.

msg314047 - (view)

Author: Henrique Andrade (Henrique Andrade)

Date: 2018-03-18 17:04

I don't want to "close the pipes but maintain the queue alive" - I want to terminate the queue and make sure that no resources are leaked. It's that simple.

When one closes a file or a socket, there is no underlying OS resource being held. That's what I would like to have here too.

Apparently the design does not support that and, if that's the case, it's fine; it's just that it goes against the norm, as far as I can tell.
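For contrast, a small sketch of the file behaviour being referred to (nothing beyond the standard library and the Linux /proc layout is assumed; the path is illustrative):

import os

f = open('/tmp/example.txt', 'w')
fd = f.fileno()
f.close()   # the OS-level file descriptor is released immediately
print(str(fd) in os.listdir('/proc/%d/fd' % os.getpid()))  # False: nothing lingers until GC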


msg314048 - (view)

Author: Pablo Galindo Salgado (pablogsal) * (Python committer)

Date: 2018-03-18 17:12

"I want to terminate the queue and make sure that no resources are leaked.

Then you don't need to do anything special; those will be cleared on object destruction. This is not an unusual pattern even in other languages. For example, RAII in C++ is one of the most used patterns for acquiring resources, and it works by cleaning up those resources on object destruction.

msg314049 - (view)

Author: Henrique Andrade (Henrique Andrade)

Date: 2018-03-18 17:22

Your comparison is not correct.

RAII in C++ ensures that, on object destruction, resources that have been acquired will be closed and deallocated.

The closest analogy in Python is the use of a context manager, which, by the way, a Queue does not provide.

Indeed, such a design (with a context manager) would have been cleaner because, on exit, both pipes would have been closed and file descriptors would not hang around.

And, yes, that is what I'd prefer too - but one can't have everything. :)

With the current design, which is more akin to Java, one is exposed to the vagaries of the garbage collector. Note that, even in Java, try-with-resources and the AutoCloseable interface would also take care of this. In fact, most of the Java classes that require external resources have migrated to this model.

For these reasons, I think the design could be substantially improved (i.e., with a context manager or with the provision of a method that really terminates the queue, so resources are properly closed immediately).
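To make the suggestion concrete, a hypothetical wrapper (this is not something multiprocessing provides; the final line reaches into the private _writer attribute, mirroring the workaround noted in the repro above):

from contextlib import contextmanager
from multiprocessing import Queue

@contextmanager
def closing_queue(*args, **kwargs):
    queue = Queue(*args, **kwargs)
    try:
        yield queue
    finally:
        queue.close()          # public API: no more puts from this process, reader is closed
        queue.join_thread()    # public API: wait for the feeder thread to flush and exit
        queue._writer.close()  # private attribute: force the writer fd to be released

# usage sketch
with closing_queue() as q:
    q.put(10)
    print(q.get())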


msg314050 - (view)

Author: Pablo Galindo Salgado (pablogsal) * (Python committer)

Date: 2018-03-18 18:01

RAII in C++ ensures that, on object destruction, resources that have been acquired will be closed and deallocated.

Which is exactly what is happening here. When the queue gets destroyed (because the reference count reaches 0 or because of the garbage collector), resources that have been acquired by the queue will be closed and deallocated.

Sadly, I don't think I have anything further to contribute to this discussion, so let's see what other people's opinions are on this.

Of course, feel free to start a thread on python-dev or python-ideas on how to improve the design. :)

msg314066 - (view)

Author: Henrique Andrade (Henrique Andrade)

Date: 2018-03-18 22:55

You're missing the fact that in C++ there is no garbage collector. The destructor both releases resources and deallocates memory.

There is no window between releasing resources and object destruction.

Python, while not exactly like Java, is similar to Java in the sense that there is a window of time between an object no longer having a reference and its being reaped by the garbage collector. During that window, resources can be held even if no longer in use.

In extreme cases, a lot of these resources can be held (think hundreds of Queues being created and closed in succession without an intervening GC run), even if not used. Sure, at some point, they will be reaped, but it might be a while.

And that's one of the reasons Python and Java have mechanisms to acquire/release resources in a more timely fashion. Context managers in the former and try-with-resources in the latter.

The mere presence of a proper close/shutdown method can make this work in an improved way in the case of a Queue, allowing OS resources (pipes) to be released in a more timely fashion.
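For example, a try/finally sketch of the timely-release pattern being argued for, using only the public Queue methods (whether this alone releases the writer fd depends on the Python version, per the discussion above):

from multiprocessing import Queue

queue = Queue()
try:
    queue.put(42)
    print(queue.get())
finally:
    queue.close()        # no more data will be put from this process; the reader fd is closed
    queue.join_thread()  # wait for the feeder thread to flush buffered data and exit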

But, sure, let's hear what the community thinks.



msg314068 - (view)

Author: Pablo Galindo Salgado (pablogsal) * (Python committer)

Date: 2018-03-18 23:37

The garbage collector in Python does not work like that. If an object reaches zero references, it is destroyed immediately. The only problem is when circular references exist, and it is in this case that object deletion is delayed until the garbage collector runs the algorithm to detect circular references and delete them. This time is longer depending on the generation in which the object is placed.

Although this is true, there might be a problem along the lines you explain, because garbage containing circular references is not guaranteed to be collected (see the Data Model section in the documentation). In this case it is important, as you state, to have an interface to ensure releasing the resources.
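A small sketch of that difference (assuming CPython; a weakref callback is used to make the destruction observable without giving the object a __del__ method, which Python 2 cannot collect inside a cycle):

import gc
import weakref

class Resource(object):
    pass

def report(ref):
    print('released')

r = Resource()
watcher = weakref.ref(r, report)
del r            # reference count reaches zero: 'released' is printed immediately

r = Resource()
watcher = weakref.ref(r, report)
r.cycle = r      # a reference cycle keeps the object alive past the del
del r            # nothing is printed yet
gc.collect()     # the cyclic collector finds the cycle and frees it; 'released' is printed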

msg314316 - (view)

Author: Antoine Pitrou (pitrou) * (Python committer)

Date: 2018-03-23 16:57

Thanks for reporting this. I agree this is a real issue, but it doesn't exist on Python 3 anymore:

>>> q = multiprocessing.Queue()
>>> q.put(1)
>>> q.get()
1
>>> threading.enumerate()
[<_MainThread(MainThread, started 139978753529600)>, <Thread(QueueFeederThread, started daemon 139978667779840)>]
>>> q.close()
>>> threading.enumerate()
[<_MainThread(MainThread, started 139978753529600)>]
>>> os.getpid()
17318

And in another terminal:

$ ls -la /proc/17318/fd
total 0
dr-x------ 2 antoine antoine  0 mars 23 17:51 .
dr-xr-xr-x 9 antoine antoine  0 mars 23 17:51 ..
lrwx------ 1 antoine antoine 64 mars 23 17:52 0 -> /dev/pts/8
lrwx------ 1 antoine antoine 64 mars 23 17:52 1 -> /dev/pts/8
lrwx------ 1 antoine antoine 64 mars 23 17:51 2 -> /dev/pts/8

I'm uninterested in fixing this on Python 2, so I'm closing.

msg334640 - (view)

Author: Chris Langton (Chris Langton)

Date: 2019-02-01 00:31

@pitrou I am interested in a fix for Python 2.7 because in Python 3.x the manner in which arithmetic is output is not arbitrarily precise.

So I will continue using Python 2.7 until another language I am familiar with, one with superior arbitrary-precision arithmetic compared to Python 3.x, reaches a point where its library ecosystem is as mature as Python 2.7's.

I heavily use multiprocessing and have many use cases where I work around this issue, because I encounter it almost every time I need multiprocessing. Basically, I decide I need multiprocessing when I have too many external resources being processed by one CPU, meaning that multiprocessing will be managing thousands of external resources from the moment it is used!

I work around this issue with the following solution instead of the Queue, which always failed!

========================================================================================

#!/usr/bin/env python

import multiprocessing, time

ARBITRARY_DELAY = 10

processes = []
for data in parse_file(zonefile_path, regex):
    t = multiprocessing.Process(target=write_to_json, args=(data, ))
    processes.append(t)

i = 0
for one_process in processes:
    i += 1
    if i % 1000 == 0:
        time.sleep(ARBITRARY_DELAY)
    one_process.start()

for one_process in processes:
    one_process.join()

========================================================================================

At the time (years ago) I don't think I knew enough about file descriptors to solve the root cause (or be elegant), and I've reused this code every time Queue failed me (every time I use Queue, basically).

To be frank, ask anyone and they will say Queue is flawed.

Now I am older and had some free time, so I decided to fix my zonefile parsing scripts and use more elegant solutions. I finally looked at the old code I had reused in many projects and identified that it was actually an fd issue (yay for knowledge), and was VERY disappointed to see here that you didn't care to solve the problem for Python 2.7. Very unprofessional.

Now I am disappointed to be unpythonic and add gc calls to my script... your unprofessionalism now makes me be unprofessional.

msg334641 - (view)

Author: Chris Langton (Chris Langton)

Date: 2019-02-01 01:29

Interestingly, while one would expect Process or Queue to actually close resource file descriptors, they don't, because a dev decided to defer to the user how to manage gc themselves. The interesting thing is that if you 'upgrade' your code to use a pool, the process fds will be closed, as the pool will destroy the objects (so they are garbage-collected more often);

Say you're limited to a little over 1000 fds in your OS; you can do this:

#######################################################################

import multiprocessing
import json

def process(data):
    with open('/tmp/fd/%d.json' % data['name'], 'w') as f:
        f.write(json.dumps(data))
    return 'processed %d' % data['name']

if __name__ == '__main__':
    pool = multiprocessing.Pool(1000)
    try:
        for _ in range(10000000):
            x = {'name': _}
            pool.apply(process, args=(x,))
    finally:
        pool.close()
        del pool

#######################################################################

Only the pool fds hang around longer than they should, which is a huge improvement, and you might not find a scenario where you need many pool objects.
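For completeness, a sketch of the same idea with explicit cleanup; it assumes Python 3, where multiprocessing.Pool can be used as a context manager (that protocol is not available in Python 2.7), and uses smaller illustrative numbers:

import json
import multiprocessing

def process(data):
    with open('/tmp/fd/%d.json' % data['name'], 'w') as f:
        f.write(json.dumps(data))
    return 'processed %d' % data['name']

if __name__ == '__main__':
    # the pool's context manager calls terminate() on exit, releasing the worker fds
    with multiprocessing.Pool(8) as pool:
        for i in range(1000):
            pool.apply(process, args=({'name': i},))
    pool.join()  # wait for the worker processes to actually exit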

msg338588 - (view)

Author: Yongzhi Pan (fossilet) *

Date: 2019-03-22 07:47

On macOS with Python 3.7.2, using pitrou's code, I suspect Python does not delete some semaphores used by Queue.

Run these:

import multiprocessing
import os
import threading

os.system('lsof -p {} | grep -v txt'.format(os.getpid()))
q = multiprocessing.Queue()
q.put(1)
q.get()
threading.enumerate()
os.system('lsof -p {} | grep -v txt'.format(os.getpid()))
q.close()
threading.enumerate()
os.system('lsof -p {} | grep -v txt'.format(os.getpid()))

I see:

>>> import multiprocessing
>>> import os
>>> import threading
>>> os.system('lsof -p {} | grep -v txt'.format(os.getpid()))
COMMAND   PID  USER   FD    TYPE             DEVICE   SIZE/OFF    NODE NAME
Python  56029   tux  cwd     DIR                1,4         96 1927156 /Users/tux/Desktop
Python  56029   tux    0u    CHR               16,2  0t2867183    2393 /dev/ttys002
Python  56029   tux    1u    CHR               16,2  0t2867183    2393 /dev/ttys002
Python  56029   tux    2u    CHR               16,2  0t2867183    2393 /dev/ttys002
0
>>> q = multiprocessing.Queue()
>>> q.put(1)
>>> q.get()
1
>>> threading.enumerate()
[<_MainThread(MainThread, started 4570830272)>, <Thread(QueueFeederThread, started daemon 123145368662016)>]
>>> os.system('lsof -p {} | grep -v txt'.format(os.getpid()))
COMMAND   PID  USER   FD    TYPE             DEVICE   SIZE/OFF    NODE NAME
Python  56029   tux  cwd     DIR                1,4         96 1927156 /Users/tux/Desktop
Python  56029   tux    0u    CHR               16,2  0t2867914    2393 /dev/ttys002
Python  56029   tux    1u    CHR               16,2  0t2867914    2393 /dev/ttys002
Python  56029   tux    2u    CHR               16,2  0t2867914    2393 /dev/ttys002
Python  56029   tux    3    PIPE 0x5ab56e2f13ca4abb      16384         ->0x5ab56e2f13ca5a7b
Python  56029   tux    4    PIPE 0x5ab56e2f13ca5a7b      16384         ->0x5ab56e2f13ca4abb
Python  56029   tux    5r PSXSEM                               0t0         /mp-oa1x27kb
Python  56029   tux    6r PSXSEM                               0t0         /mp-khu1swie
Python  56029   tux    7r PSXSEM                               0t0         /mp-pwrgzmzz
0
>>> q.close()
>>> threading.enumerate()
[<_MainThread(MainThread, started 4570830272)>]
>>> os.system('lsof -p {} | grep -v txt'.format(os.getpid()))
COMMAND   PID  USER   FD    TYPE             DEVICE   SIZE/OFF    NODE NAME
Python  56029   tux  cwd     DIR                1,4         96 1927156 /Users/tux/Desktop
Python  56029   tux    0u    CHR               16,2  0t2869010    2393 /dev/ttys002
Python  56029   tux    1u    CHR               16,2  0t2869010    2393 /dev/ttys002
Python  56029   tux    2u    CHR               16,2  0t2869010    2393 /dev/ttys002
Python  56029   tux    5r PSXSEM                               0t0         /mp-oa1x27kb
Python  56029   tux    6r PSXSEM                               0t0         /mp-khu1swie
Python  56029   tux    7r PSXSEM                               0t0         /mp-pwrgzmzz

The three PSXSEM entries persist even after some time. Is this some type of leakage?
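For context, a small sketch of where those semaphores likely come from; _rlock, _wlock and _sem are private attributes of the Queue implementation in CPython, so this is an assumption about internals rather than documented behaviour:

import multiprocessing

q = multiprocessing.Queue()
# the queue holds three synchronisation primitives; on macOS each is backed by a
# POSIX named semaphore, which is what lsof reports as PSXSEM
print(q._rlock)  # lock serialising reads from the pipe
print(q._wlock)  # lock serialising writes to the pipe (None on Windows)
print(q._sem)    # bounded semaphore tracking how many items may be queued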

History

Date                 User              Action  Args
2022-04-11 14:58:58  admin             set     github: 77262
2019-03-22 07:47:37  fossilet          set     messages: +
2019-03-22 03:35:04  fossilet          set     nosy: + fossilet
2019-02-01 01:29:59  Chris Langton     set     messages: +
2019-02-01 00:31:03  Chris Langton     set     nosy: + Chris Langton; messages: +
2018-03-23 16:57:13  pitrou            set     status: open -> closed; resolution: wont fix; messages: +; stage: resolved
2018-03-18 23:37:32  pablogsal         set     messages: +
2018-03-18 22:55:14  Henrique Andrade  set     messages: +
2018-03-18 18:01:05  pablogsal         set     messages: +
2018-03-18 17:22:06  Henrique Andrade  set     messages: +
2018-03-18 17:12:15  pablogsal         set     messages: +
2018-03-18 17:04:53  Henrique Andrade  set     messages: +
2018-03-18 16:46:23  pablogsal         set     messages: +
2018-03-18 15:34:34  Henrique Andrade  set     messages: +
2018-03-18 12:19:14  pablogsal         set     messages: +
2018-03-17 21:19:47  Henrique Andrade  set     messages: +
2018-03-16 02:10:29  ned.deily         set     nosy: + pitrou, davin
2018-03-16 01:29:15  Henrique Andrade  set     messages: +
2018-03-16 00:06:49  pablogsal         set     messages: +
2018-03-15 23:56:09  pablogsal         set     messages: +
2018-03-15 23:46:47  Henrique Andrade  set     messages: +
2018-03-15 23:26:14  pablogsal         set     nosy: + pablogsal; messages: +
2018-03-15 18:08:51  Henrique Andrade  set     messages: +
2018-03-15 18:05:17  Henrique Andrade  create