[cmds] Further tricky fixes for change in wait4/waitpid operation from #2653 by ghaerr · Pull Request #2668 · ghaerr/elks (original) (raw)

These fixes correct what was originally thought a simpler issue found by AI in the operation of wait4 (and waitpid C library wrapper) changed in #2653.

The tricky issue is that the wait4 syscall when passed the WNOHANG option now returns -ECHILD in the case of no children instead of 0 (zero is now only returned when there are unexited children). The problem is that wait4 returns type pid_t which is unsigned, for which many programs compare with an int. The programs which did not now fail, since -ECHILD is then interpreted as a large unsigned number rather than < 0. The construct causing failure includes code like the following:

   while (waitpid(-1, NULL, WNOHANG) > 0) ...

which was likely meant to include -ECHILD and 0, which works when waited is type int bug fails when unsigned. Now that #2653 returns -ECHILD first instead of 0, these programs fail.

The original bug found by AI did not correctly realize the effect this change would have on existing user programs, likely because elkscmd/ was not specified for analysis, but also because AI assumed POSIX compliance, which the user programs were (also) not.

A further fix is included for comparing vfork/fork() with a pid_t variable as "< 0", which will also always fail, since pid_t is unsigned.

Thus, the lesson here is to be very careful when using pid_t functions or variables, and not to make any assumptions as to their signedness. This can be tricky. There remains lots of code that compares a fork() return value with an 'int' pid, which would fail if the pid variable were changed to type pid_t to match the fork() return value.

Another lesson is that changing syscall behavior even in seemingly small ways can have large side unintended or unnoticed side effects. Without an extensive suite of regression tests, this can be riskier than perceived.