* * * * *

                      The Case Of The Missing Core Files

At work, I test the various components of “Project: Wolowizard [1].” These
tests usually require running multiple copies of a program on a single
computer. I use Lua [2] (with help from a module [3]) to start and monitor
the programs being tested. The code starts N copies, and if any of the
programs crash, the reason is logged. It's fairly straightforward code.

Now, one of the components of “Project: Wolowizard” was updated to support a
new project (“Project: Sippy-Cup”) and that component is occasionally
crashing on an assert [4], but the problem is: there are no core files to
check.

And I've spent the past two days trying to figure out why there are no core
files to check.

The first culprit—have we told the system not to generate core files? Yup.
The account under which the program runs (root) has a core file size limit of
zero bytes. There are a few ways to fix this, and I picked what, to me, was
the simplest solution: in the Lua script that runs the programs, set the core
file size to “unlimited.” And this is easy enough to do:

> proc = require "org.conman.process"
> proc.limits.hard.core = "inf"
> proc.limits.soft.core = "inf"
>

Slight digression: you can set various resource limits, for everything from
maximum memory usage to core file size. Each limit has a soft value, which is
the one actually enforced, and a hard value, which is the ceiling for the
soft one. Any process can lower either value, or raise its soft limit as far
as the hard limit, but only a process running as root can raise the hard
limit. Since the program I'm running is running as root, setting both the
hard and soft limits to “infinity” is easy.
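
For reference, this is roughly what setting those limits looks like in C with
the POSIX getrlimit() and setrlimit() calls. It is only a sketch: the two
function names are made up for illustration and are not taken from the Lua
module.

```c
#include <sys/resource.h>

/* Hypothetical helper: raise the soft core-file limit as far as the
 * current hard limit allows. Any process may do this.
 * Returns 0 on success, -1 on failure (errno is set). */
int raise_core_soft_limit(void)
{
  struct rlimit limit;

  if (getrlimit(RLIMIT_CORE, &limit) < 0)
    return -1;

  limit.rlim_cur = limit.rlim_max;
  return setrlimit(RLIMIT_CORE, &limit);
}

/* Hypothetical helper: set both limits to "unlimited". Raising the hard
 * limit needs privilege (root), so for an ordinary user whose hard limit
 * is lower this fails with EPERM. */
int unlimit_core(void)
{
  struct rlimit limit;

  limit.rlim_cur = RLIM_INFINITY;
  limit.rlim_max = RLIM_INFINITY;
  return setrlimit(RLIMIT_CORE, &limit);
}
```

A child process inherits the limits of its parent, which is why setting them
in the script that starts the programs is enough.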

But there was still a disturbing lack of core files.

I checked the code of the Lua module I was using, and yes, I flubbed the
parsing code. I made the fix, my tests showed I got the logic right,
installed the updated module and still, no core files.

I did a bunch more tests and checked off the following reasons for the lack
of core files: it wasn't because the program dropped permissions; it wasn't
because the program couldn't write the core file in its current working
directory; and the program is not setuid [5]. It was clear there was
something wrong with the module.

I was able to isolate the issue to the following:

> struct rlimit limit;
> lua_Number    ival;
>
> /* ... */
>
> if (lua_isnumber(L,3))
>   ival = lua_tonumber(L,3);
>
> /* ... */
>
> if (ival >= RLIM_INFINITY)
>   ival = RLIM_INFINITY;
>
> limit.rlim_cur = ival;
>

Now, lua_Number is of type double (a floating point value), and limit.rlim_cur
is some form of integer. ival was properly HUGE_VAL (the C floating point
equivalent of “infinity”) but limit.rlim_cur was 0.

But it worked on my home system just fine.

Then it dawned on me—my home system was a 32-bit system! That was the system
where I did the patch and initial test; the systems at work are all 64-bit
systems.
Some digging revealed that the definition of RLIM_INFINITY on the 64-bit
system was

> ((unsigned long int)(~0UL))
>

or in other words: the largest unsigned long integer value. And on a 64-bit
system, an unsigned long integer is 64 bits in size.

I do believe I was bitten by an IEEE (Institute of Electrical and Electronics
Engineers) 754 floating point implementation detail.

Lua treats all numbers as type double, and on modern systems, that means IEEE
754 floating point [6]. A double can store 53-bit integers without loss [7],
and on a 32-bit system, you can pass integer values into and out of a double
without issue (32 being less than 53) and because I did my initial testing of
the Lua module on a 32-bit system, there was no issue.

But on a 64-bit system … it gets interesting. Doing some empirical testing, I
found the largest integer value you can store into a double and get something
out is 18,446,744,073,709,550,591 (and what you get out is
18,446,744,073,709,549,568—I'll leave the reason for the discrepancy for the
reader); anything larger, you get zero back out.
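
Those numbers can be reproduced with a few lines of C. This is a sketch that
assumes an IEEE 754 double with the default round-to-nearest mode; the
function name and the macro are made up for illustration. It checks the
stored value against 2^64 before converting back, because converting a double
that is out of range for the integer type is undefined behavior in C (the
zero is simply what came out on the system described here).

```c
#include <stdbool.h>
#include <stdint.h>

/* 2^64 as a double. This value is exactly representable. */
#define TWO_POW_64 18446744073709551616.0

/* Store n in a double and try to get it back out.
 * Returns true and writes the result to *out if the stored value still
 * fits in a uint64_t. Returns false if the conversion to double rounded
 * the value up to 2^64, which no 64-bit integer can hold. */
bool round_trip(uint64_t n, uint64_t *out)
{
  double d = (double)n;

  if (d >= TWO_POW_64)
    return false;

  *out = (uint64_t)d;
  return true;
}
```

With this, round_trip(18446744073709550591ULL, &out) succeeds and leaves
18446744073709549568 in out, while the next integer up, and UINT64_MAX
itself, are reported as out of range.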

So, no wonder I wasn't getting any core files! I was inadvertently setting
the core file size to zero bytes!

Sigh.

Off to fix the code …
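
One way such a conversion could be made safe is sketched below. This is not
necessarily the fix that went into the module; to_rlim() is a hypothetical
helper, and it takes a plain double in place of lua_Number so it stands on
its own. The idea is to do the range check in floating point and then assign
RLIM_INFINITY directly to the integer, so the out-of-range value never passes
through a double-to-integer conversion.

```c
#include <math.h>
#include <sys/resource.h>

/* Hypothetical helper: turn a double into an rlim_t without ever
 * converting a value that is out of range for rlim_t. */
rlim_t to_rlim(double value)
{
  /* NaN and negative numbers make no sense as a limit; use zero. */
  if (isnan(value) || value <= 0.0)
    return 0;

  /* Compare as doubles. On a 64-bit system (double)RLIM_INFINITY rounds
   * up to 2^64, so every value that gets past this test is below 2^64
   * and fits in a 64-bit rlim_t. */
  if (value >= (double)RLIM_INFINITY)
    return RLIM_INFINITY;

  /* In range: the conversion truncates any fractional part. */
  return (rlim_t)value;
}
```

With this, to_rlim(HUGE_VAL) is RLIM_INFINITY on both 32-bit and 64-bit
systems, and to_rlim(4096.0) is 4096.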

[1] gopher://gopher.conman.org/0Phlog:2010/10/11.1
[2] http://www.lua.org/
[3] https://github.com/spc476/lua-conmanorg/blob/master/src/process.c
[4] http://en.wikipedia.org/wiki/Assertion_(software_development)
[5] http://en.wikipedia.org/wiki/Setuid
[6] http://en.wikipedia.org/wiki/IEEE_floating_point
[7] http://en.wikipedia.org/wiki/Double-precision_floating-point_format
