msg300779 |
Author: desbma (desbma) * |
Date: 2017-08-24 10:53 |
When trying to connect a classic TCP socket to an unreachable peer, the exception reported is inconsistent if the socket has a timeout set. See the attached program; on my system (Arch Linux with Linux 4.9 & Python 3.6.2) it outputs:

timeout timed out
timeout timed out
timeout timed out
OSError [Errno 113] No route to host
timeout timed out
timeout timed out
timeout timed out
OSError [Errno 113] No route to host
timeout timed out
timeout timed out
timeout timed out
OSError [Errno 113] No route to host
timeout timed out
timeout timed out
timeout timed out
OSError [Errno 113] No route to host
timeout timed out
timeout timed out
timeout timed out
OSError [Errno 113] No route to host

I expect one of the two exceptions to be thrown every time, not a mix of both. Thank you |
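The attached program is not reproduced in this message. A minimal sketch of what such a reproducer might look like is given here; the names `probe` and `repeat_probe`, the address `192.168.1.250` (assumed to be an unused address on the local subnet), the port and the 1-second timeout are all illustrative assumptions, not taken from the attachment.

```python
import socket


def probe(host, port, timeout):
    """Attempt one TCP connect and return a one-line description of the outcome."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.settimeout(timeout)
    try:
        sock.connect((host, port))
    except OSError as exc:
        # socket.timeout is a subclass of OSError, so both cases land here.
        return f"{type(exc).__name__} {exc}"
    else:
        return "connected"
    finally:
        sock.close()


def repeat_probe(host, port, attempts, timeout):
    """Run probe() several times in a row, printing and collecting each outcome."""
    outcomes = []
    for _ in range(attempts):
        outcome = probe(host, port, timeout)
        print(outcome)
        outcomes.append(outcome)
    return outcomes
```

Calling `repeat_probe("192.168.1.250", 80, 20, 1.0)` on a network where that address is unassigned should print one line per attempt in the same shape as the output above. On Python 3.10 and later, `socket.timeout` is an alias of `TimeoutError`, so the first word of the timeout lines would read `TimeoutError` rather than `timeout`.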
|
|
msg300783 |
Author: R. David Murray (r.david.murray) *  |
Date: 2017-08-24 13:25 |
Have you tried the equivalent C program? I'm guessing this is happening at the OS layer and Python is just reporting it. On my system a timeout of 5 will always report the OS error. |
|
|
msg300803 |
Author: desbma (desbma) * |
Date: 2017-08-24 21:11 |
Yes, you are right: I tried with a small C program, and compared with strace log of the Python program. In both cases poll sometimes returns -1 (error), or sometimes 0 (timeout). This is a weird behavior (at least for me) of the TCP stack, but clearly Python is not the cause so I am closing this issue. |
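For readers who want to watch the two outcomes without writing C, a rough Python sketch of a timed connect built on a non-blocking socket and `poll` follows. It is neither CPython's implementation nor the C program mentioned above; the function name `timed_connect` and its return strings are hypothetical, and `select.poll` is assumed to be available (it is on Linux). Here an empty result from `poll` is treated as a timeout, and a failure is read back through `SO_ERROR`.

```python
import errno
import os
import select
import socket


def timed_connect(host, port, timeout):
    """Connect with a non-blocking socket and wait for the result with poll()."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setblocking(False)
    try:
        rc = sock.connect_ex((host, port))
        if rc == 0:
            return "connected"
        if rc != errno.EINPROGRESS:
            return f"error: {os.strerror(rc)}"

        poller = select.poll()
        poller.register(sock, select.POLLOUT)
        # poll() takes milliseconds; an empty list means nothing happened in time.
        events = poller.poll(int(timeout * 1000))
        if not events:
            return "timeout"

        # The socket is ready: SO_ERROR tells success (0) from failure (errno).
        err = sock.getsockopt(socket.SOL_SOCKET, socket.SO_ERROR)
        if err:
            return f"error: {os.strerror(err)}"
        return "connected"
    finally:
        sock.close()
```

Run in a loop against an unreachable address on the local subnet, this sketch should alternate between `timeout` and an `error:` line in the same way as the exceptions above, since the decision is made by the kernel rather than by the caller.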
|
|
msg300806 |
Author: R. David Murray (r.david.murray) *  |
Date: 2017-08-24 21:45 |
I'm not a networking expert at this level, but I believe what is happening here is that the network stack does an arp, and has a timeout waiting for the arp response that is longer than your socket timeout. So at some point its arp timeout expires while the socket timeout hasn't, and it reports that there's no route to the host and resets its state. Then on the *next* socket request it sends another arp request (because the host may have appeared since the last time it checked), and the cycle repeats. I think this is a reasonable way for it to behave when the socket timeout is shorter than the arp response timeout, because otherwise you'd either lose the information that there's no route to the host, or you'd lose the association between "open the socket" and "send an arp". But like I said, I'm not an expert at the layer 2 stuff. I suppose in theory one could associate arp requests with socket operations one-for-one, but that would require more memory and I'm not surprised that the network stack doesn't go that route. |
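The explanation above can be turned into a toy model to check that it produces the observed pattern. This is only a sketch of the reasoning, not of the kernel: the function name `simulate` is hypothetical, and the 3.5-second resolution window is an assumed value chosen for illustration, not a measured ARP timeout.

```python
def simulate(attempts, socket_timeout, resolve_window):
    """Model back-to-back connect attempts against one shared ARP resolution.

    A resolution starts with the first connect that finds none in progress and
    fails resolve_window seconds later. A connect whose own timeout expires
    first reports "timeout"; the connect that is waiting when the resolution
    fails reports "OSError", and the resolution state is reset.
    """
    now = 0.0
    resolve_deadline = None  # None means no resolution is in progress
    outcomes = []
    for _ in range(attempts):
        if resolve_deadline is None:
            resolve_deadline = now + resolve_window
        if resolve_deadline <= now + socket_timeout:
            # Resolution fails while this connect is still waiting.
            outcomes.append("OSError")
            now = max(now, resolve_deadline)
            resolve_deadline = None
        else:
            # The socket timeout expires first; resolution keeps running.
            outcomes.append("timeout")
            now += socket_timeout
    return outcomes


print(simulate(8, 1.0, 3.5))
print(simulate(3, 5.0, 3.5))
```

Under these assumed values, a 1-second socket timeout yields three `timeout` results followed by one `OSError`, repeating, while a 5-second timeout yields `OSError` every time, which is consistent with both observations reported in this issue.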
|
|
msg300811 |
Author: desbma (desbma) * |
Date: 2017-08-24 22:24 |
Thanks for the insight. Well, the most logical thing for the OS to do, in my view, would have been:

1. Send an ARP request
2. At the first poll call, report a timeout if no response was received
3. Repeat step 2 until the destination is considered unreachable
4. At the next connect call, fire off another ARP request
5. At the next poll call, if the response to the ARP sent in 4 was not received, report "No route to host" immediately because it is the last cached result (from 3)
6. Always report "No route to host" for the following calls (even if new ARP requests are sent)

With "ip -4 neigh" I can see that the neighbor is in FAILED state when the OSError is reported, but immediately goes to INCOMPLETE state at the next connect call (because another ARP request is sent).

The behavior is the same with an IPv6 socket. By re-reading the NDP RFC (https://tools.ietf.org/html/rfc4861) I can understand why implementations behave like that. The RFC does not define the "FAILED" neighbor cache state, so "resolution failed" means "neighbor is not in the cache". Linux has a delayed garbage collection for the neighbor cache, which is why we can sometimes see entries in FAILED state, but when a new socket tries to connect to the peer, the resolution starts again as if nothing had happened before.

So it's weird, but it conforms to the standard :) |
|
|