[medusa] Re: hideously long selects? (original) (raw)

Skip Montanaro skip@m...
Fri, 3 Sep 1999 12:50:10 -0500


I'm baaack...

Jim Fulton asked:

What version of Zope are you running? I beleave that this is due to a bug that was fixed in the Zope 2.0 beta 5.

I was running 1.something. I downloaded the 2.0.0 source today and installed it, then modified my old start.py to work with the new ZServer.

I tried it out a bit on my development machine. Things were real slow, but nothing was happening, so I applied strace (fast becoming one of my favorite debugging tools) to both my database server and the ZServer process. The results were instructive. The database server trace looks like:

12:12:36 select(6, [5], [], [], {2, 500000}) = 0 (Timeout) 12:12:39 select(6, [5], [], [], {2, 500000}) = 1 (in [5], left {0, 280000}) 12:12:41 accept(5, {sin_family=AF_INET, sin_port=htons(4321), sin_addr=inet_addr("127.0.0.1")}, [16]) = 12 ... chunka chunka ... 12:12:41 read(13, "POST /RPC2 HTTP/1.0\r\nHost: loc"..., 1024) = 43 12:12:41 read(13, "User-Agent: xmlrpclib.py/0.9.8 ("..., 1024) = 56 12:12:41 read(13, "Content-Type: text/xml\r\nConten"..., 1024) = 170 ... chunka chunka ... 12:12:41 write(14, "HTTP/1.0 200 OK\r\nServer: BaseH"..., 1024) = 1024 ... bunch more writes ... 12:12:41 close(15) = 0 ... a few munmaps ... 12:12:41 select(6, [5], [], [], {2, 500000}) = 0 (Timeout) ...

So, within the space of somewhat less than a second it accepts a connect, handles the request (in this case an XML-RPC call requesting the top 50 cities and performers the server has gotten requests for), then returns to its waiting state.

Now, lets have a look at the trace for the ZServer:

12:12:39 accept(5, {sin_family=AF_INET, sin_port=htons(4319), sin_addr=inet_addr("127.0.0.2")}, [16]) = 6 12:12:39 write(1, "info: adding channel <zhttp_chan"..., 72info: adding channel <zhttp_channel at 8268ae0 channel#: 0 requests:0>) = 72 12:12:39 fcntl(6, F_GETFL) = 0x2 (flags O_RDWR) 12:12:39 fcntl(6, F_SETFL, O_RDWR|O_NONBLOCK) = 0 12:12:39 gettimeofday({936378759, 269643}, NULL) = 0 12:12:39 gettimeofday({936378759, 270168}, NULL) = 0 12:12:39 select(7, [5 6], [], [], {30, 0}) = 1 (in [6], left {30, 0}) 12:12:39 recv(6, "GET /filter?file=index.shtml HTT"..., 4096, 0) = 277 12:12:39 gettimeofday({936378759, 273572}, NULL) = 0 12:12:39 brk(0x827b000) = 0x827b000 12:12:39 pipe([11, 12]) = 0 12:12:39 clone() = 8159 12:12:39 write(12, "4\24%@\4\0\0\0\201\0\0\0 \205$@`"..., 148) = 148 12:12:39 sigprocmask(SIG_SETMASK, NULL, [USR1]) = 0 12:12:39 write(12, "4\24%@\0\0\0\0\0\0\0\0T\370\7\10"..., 148) = 148 12:12:39 sigprocmask(SIG_SETMASK, NULL, [USR1]) = 0 12:12:39 sigsuspend([] <unfinished ...> 12:12:39 --- SIGUSR1 (User defined signal 1) --- 12:12:39 <... sigsuspend resumed> ) = -1 EINTR (Interrupted system call) 12:12:39 sigreturn() = ? (mask now [USR1]) 12:12:39 kill(8160, SIGUSR1) = 0 12:12:39 kill(8160, SIGUSR1) = 0 12:12:39 sigprocmask(SIG_SETMASK, NULL, [USR1]) = 0 12:12:39 sigsuspend([] <unfinished ...> 12:12:39 --- SIGUSR1 (User defined signal 1) --- 12:12:39 <... sigsuspend resumed> ) = -1 EINTR (Interrupted system call) 12:12:39 sigreturn() = ? (mask now [USR1]) 12:12:39 select(7, [5 6], [], [], {30, 0}) = 0 (Timeout) 12:13:09 select(7, [5 6], [6], [], {30, 0}) = 1 (out [6], left {30, 0}) 12:13:09 send(6, "HTTP/1.0 200 OK\r\nServer: Zope/"..., 13297, 0) = 13297 12:13:09 select(7, [5 6], [6], [], {30, 0}) = 1 (out [6], left {30, 0}) 12:13:09 gettimeofday({936378789, 284258}, NULL) = 0 12:13:09 write(4, "127.0.0.2:4319 - - [03/Sep/1999:"..., 98) = 98 12:13:09 write(1, "info: closing channel 6:<zhttp_c"..., 99info: closing channel 6:<zhttp_channel connected 127.0.0.2:4319 at 8268ae0 channel#: 0 requests:1>) = 99 12:13:09 close(6) = 0 12:13:09 close(-1) = -1 EBADF (Bad file descriptor) 12:13:09 select(6, [5], [], [], {30, 0}) = 0 (Timeout)

I assume the chit-chat with the backend database server is happening in a separate thread, because I see no reads or writes around 12:12:41 in the trace. At any rate, strace seems not to be able to attach to those threads, so their chit-chat is unavailable.

Note the back-to-back selects at 12:12:39 and 12:13:09. First it has fd 6 only in the read set and the select timed out after 30 seconds. Then it added 6 to the write set for the second select which returned immediately, and it sent the response back to Apache (which is serving as a proxy in this case) and from there to the user.

Why'd it sit on that first select for 30 seconds? Any reason why fd 6 wasn't in the write set to begin with? I presume it is the socket which it uses to talk to the backend server. (This would be in httplib or stuff it calls.)

To make sure I was getting appropriate versions of important modules, I had start.py print out sys.path and the locations of those modules I thought most important when ZServer was started:

sys.path is (ordered first to last): 0: /usr/local/lib/python1.5/site-packages 1: /home/dolphin/skip/src/Zope-2.0.0-src/ZServer/../lib/python 2: /home/dolphin/skip/src/Zope-2.0.0-src/ZServer 3: /home/dolphin/skip/src/Zope-2.0.0-src/ZServer/.. 4: /home/dolphin/skip/src/Zope-2.0.0-src/ZServer 5: /home/dolphin/skip/src/vtk-linux/lib 6: /usr/local/lib/python1.5/ 7: /usr/local/lib/python1.5/plat-linux2 8: /usr/local/lib/python1.5/lib-tk 9: /usr/local/lib/python1.5/lib-dynload 10: /usr/local/lib/automatrix/python

Here are the locations of some key modules, as discovered by imp.find_module:

asyncore is at /home/dolphin/skip/src/Zope-2.0.0-src/ZServer/medusa/asyncore.py asynchat is at /home/dolphin/skip/src/Zope-2.0.0-src/ZServer/medusa/asynchat.py xmlrpclib is at /usr/local/lib/python1.5/site-packages/xmlrpclib.py sgmlop is builtin

Now, if I shorten up the timeout for asyncore.loop, the pregnant pause goes away, but there's still this fd 6 in the read set followed by fd 6 in both the read and write sets:

12:39:13 sigreturn() = ? (mask now [USR1]) 12:39:13 select(7, [5 6], [], [], {0, 800000}) = 0 (Timeout) 12:39:14 select(7, [5 6], [6], [], {0, 800000}) = 1 (out [6], left {0, 800000}) 12:39:14 send(6, "HTTP/1.0 200 OK\r\nServer: Zope/"..., 37112, 0) = 37112

I would agree that long timeouts are okay if we could just eliminate that first select.

Am I hosed because I'm mixing blocking (httplib) and non-blocking (asyncore) I/O? Should httplib be telling asyncore about its sockets?

Skip Montanaro | http://www.mojam.com/ skip@m... | http://www.musi-cal.com/~skip/ 847-971-7098 | Python: Programming the way Guido indented...