strace It

hacklinux

We found an unexpected behavior during deployment: it takes quite a long time for supervisord to stop a pyres worker. Since we don’t want to kill the worker process while a job is still being processed, supervisord is configured to send the QUIT signal to the worker. The pyres signal handler is supposed to set a global exit flag, finish the outstanding job, and then exit. The stop takes much longer than finishing the job should, though.

So I ran strace against both supervisord and the pyres worker:

# supervisor
sudo strace -p 2312 -o /tmp/supervisor.log
# pyres worker
sudo strace -p 7350 -r -o /tmp/pyres.log
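
The -r option prefixes each system call with a relative timestamp (the time since the previous call), which makes it easy to see where the worker spends its time.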

Here is a snippet of supervisor.log:

gettimeofday({1392164214, 250457}, NULL) = 0
kill(7350, SIGQUIT)                     = 0
gettimeofday({1392164214, 250900}, NULL) = 0
gettimeofday({1392164214, 250949}, NULL) = 0
... ....
gettimeofday({1392164227, 22023}, NULL) = 0
select(24, [4 5 6 8 11 14 18 22 23], [], [], {1, 0}) = 1 (in [22], left {0, 956550})
--- SIGCHLD (Child exited) @ 0 (0) ---
rt_sigreturn(0xffffffff)                = 1
read(22, "", 131072)                    = 0
gettimeofday({1392164227, 66013}, NULL) = 0
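
The gettimeofday values are plain Unix epoch seconds, so the wall-clock times below can be recovered with a couple of lines of Python (the exact value depends on the machine's local timezone):

#!python
from datetime import datetime

# Convert the first and last gettimeofday values from supervisor.log;
# the difference is the ~13 seconds it took the worker to exit.
print(datetime.fromtimestamp(1392164214))   # when SIGQUIT was sent
print(datetime.fromtimestamp(1392164227))   # when SIGCHLD arrived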

Supervisord sent SIGQUIT at 4:16:54 PM, and the child exited at 4:17:07 PM, 13 seconds in total. The pyres.log is more interesting:

 0.000000 recvfrom(3, 0x3793044, 8192, 0, 0, 0) = ? ERESTARTSYS (To be restarted)
 2.983523 --- SIGQUIT (Quit) @ 0 (0) ---
 0.000119 rt_sigreturn(0xffffffff)  = -1 EINTR (Interrupted system call)
 0.000289 recvfrom(3, "*-1\r\n", 8192, 0, NULL, NULL) = 5
12.193469 prctl(PR_SET_NAME, 0x7ffff96d5e82, 0x29, 0x2, 0x7f2ae32b80b2) = 0
 0.000793 gettimeofday({1392164734, 82373}, NULL) = 0
 0.000420 stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=118, ...}) = 0
 0.000567 write(1, "00:25:34,082 INFO  [7469] [pyres"..., 73) = 73
 0.000336 sendto(3, "*3\r\n$4\r\nSREM\r\n$14\r\nresque:worker"..., 65, 0, NULL, 0) = 65
 0.000162 recvfrom(3, ":1\r\n", 8192, 0, NULL, NULL) = 4
 0.000264 sendto(3, "*2\r\n$3\r\nDEL\r\n$45\r\nresque:worker:"..., 65, 0, NULL, 0) = 65
 0.000288 recvfrom(3, ":1\r\n", 8192, 0, NULL, NULL) = 4
 0.000696 sendto(3, "*2\r\n$3\r\nDEL\r\n$45\r\nresque:stat:pr"..., 65, 0, NULL, 0) = 65
 0.000296 recvfrom(3, ":0\r\n", 8192, 0, NULL, NULL) = 4
 0.000591 sendto(3, "*2\r\n$3\r\nDEL\r\n$42\r\nresque:stat:fa"..., 62, 0, NULL, 0) = 62
 0.000264 recvfrom(3, ":0\r\n", 8192, 0, NULL, NULL) = 4
 0.000310 gettimeofday({1392164734, 86646}, NULL) = 0
 0.000519 stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=118, ...}) = 0
 0.000501 write(1, "00:25:34,086 INFO  [7469] [panel"..., 90) = 90
 0.000905 rt_sigaction(SIGINT, {SIG_DFL, [], SA_RESTORER, 0x7f2ae6a74cb0}, {0x5699b0, [], SA_RESTORER, 0x7f2ae6a74cb0}, 8) = 0
 0.000569 rt_sigaction(SIGQUIT, {SIG_DFL, [], SA_RESTORER, 0x7f2ae6a74cb0}, {0x5699b0, [], SA_RESTORER, 0x7f2ae6a74cb0}, 8) = 0
 0.000517 rt_sigaction(SIGUSR1, {SIG_DFL, [], SA_RESTORER, 0x7f2ae6a74cb0}, {0x5699b0, [], SA_RESTORER, 0x7f2ae6a74cb0}, 8) = 0
 0.000457 rt_sigaction(SIGTERM, {SIG_DFL, [], SA_RESTORER, 0x7f2ae6a74cb0}, {0x5699b0, [], SA_RESTORER, 0x7f2ae6a74cb0}, 8) = 0
 0.000526 close(3)                  = 0
 0.051131 exit_group(0)             = ?

The QUIT signal is delivered about 2.98 seconds after recvfrom starts blocking; the call is then restarted and keeps blocking for roughly 12 more seconds before the 5 bytes *-1\r\n finally arrive (hence the 12.193469 relative timestamp on the prctl line that follows). I think this is the smoking gun, but why?

The pyres source shows that Worker.reserve is implemented with the BLPOP command and a timeout parameter. If the timeout expires and all the lists are empty, redis simply returns *-1\r\n. So the global exit flag won’t take effect until the reserve method times out.
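
To see this behavior in isolation, here is a minimal sketch using the redis-py client directly (the queue name and the timeout are just examples):

#!python
import redis

r = redis.Redis()   # assumes a redis server on localhost:6379

# BLPOP blocks for up to `timeout` seconds. If every listed key stays empty,
# redis answers with a null multi-bulk reply -- the "*-1\r\n" seen in the
# strace output above -- and redis-py returns None.
item = r.blpop(['resque:queue:default'], timeout=5)
if item is None:
    print('BLPOP timed out, no job available')
else:
    queue, payload = item
    print('got a job from %s: %s' % (queue, payload))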

It could get even worse if the QUIT signal arrives just after reserve is invoked and a job is pushed just before the timeout expires; see the infinite loop in the Worker.work method:

#!python
while True:
    # the exit flag set by the SIGQUIT handler is only checked here,
    # once per iteration
    if self._shutdown:
        logger.info('shutdown scheduled')
        break

    # BLPOP with `interval` as the timeout: blocks until a job arrives
    # or the timeout expires
    job = self.reserve(interval)

    if job:
        self.fork_worker(job)
    else:
        if interval == 0:
            break
        #procline @paused ? "Paused" : "Waiting for #{@queues.join(',')}"
        self._setproctitle("Waiting")
        #time.sleep(interval)
self.unregister_worker()

In that case the shutdown has to wait for almost the full interval, and then keep waiting for the forked child to finish the job. I think a more intuitive approach would be: in the SIGQUIT handler, if the worker is still waiting for a job, trigger the cleanup and exit immediately.
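
A rough sketch of that idea (this is not the actual pyres code; the _working flag and the method bodies are made up for illustration):

#!python
import signal
import sys

class Worker(object):
    def __init__(self):
        self._shutdown = False
        self._working = False   # hypothetical flag, set while a job is processed
        signal.signal(signal.SIGQUIT, self.schedule_shutdown)

    def schedule_shutdown(self, signum, frame):
        self._shutdown = True
        if not self._working:
            # No job in flight: clean up the redis keys and exit right away
            # instead of waiting for reserve() to time out.
            self.unregister_worker()
            sys.exit(0)

    def unregister_worker(self):
        pass   # remove the worker entries from redis, as pyres already does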

Also the doc string is totally misleading:

 ``interval`` -- Number of seconds the worker will wait until processing the next job. Default is "5".

The interval has no effect on job scheduling: the reserve method returns as soon as a job is pushed to the queue. To fulfill the docstring's promise, the loop in the Worker.work method would have to use the non-blocking list pop command, LPOP, combined with a sleep.
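
For example, the loop could look roughly like this (a sketch only; reserve_nowait is a hypothetical helper that does a plain LPOP and returns a job or None):

#!python
import time

def work(self, interval=5):
    while True:
        if self._shutdown:
            break
        job = self.reserve_nowait()      # hypothetical non-blocking LPOP
        if job:
            self.fork_worker(job)
        else:
            self._setproctitle("Waiting")
            # now `interval` really is the pause before looking
            # for the next job, as the docstring claims
            time.sleep(interval)
    self.unregister_worker()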