My favorite bug: segfaults in Java
Update: Two years later, I wrote a more detailed version of this article: My favorite bug: segfaults in Java (redux).
I’ve told this story orally a number of times, but realized that I have never written it down. This is my favorite bug story; it might not be my hardest bug, but it is the one I most like to tell.
The context
In 2012, I was a Senior programmer on the FIRST Robotics Competition team 1024. For the unfamiliar, the relevant part of the setup is that matches last 2 minutes and 15 seconds, during which a 120-pound robot sometimes runs autonomously and sometimes is controlled over WiFi by a person at a laptop running stock “driver station” software and modifiable “dashboard” software.
That year, we mostly used the dashboard software to allow the human driver and operator to monitor sensors on the robot, one of them being a video feed from a webcam mounted on it. This was really easy because the new standard dashboard program had a click-and-drag interface to add stock widgets; you just had to make sure the code on the robot was actually sending the data.
That was great until, while debugging things, the dashboard would suddenly vanish. If it was run manually from a terminal (instead of letting the driver station software launch it), you would see a core dump indicating a segmentation fault.
This wasn’t just us, either; I spoke with people on other teams, and everyone who was streaming video had this issue. But because it only happened every couple of minutes, and a match is only 2:15, the dashboard didn’t need to run very long; they just crossed their fingers and hoped it didn’t happen during a match.
The dashboard was written in Java, and the source was available (under a 3-clause BSD license), so I dove in, hunting for the bug. Now, the program did use the Java Native Interface (JNI) to talk to OpenCV, which the video ran through, so I figured that it must be a bug in the C/C++ code that was being called. It was especially a pain to track down the pointers that were causing the issue, because it was hard with native debuggers to see through all of the JVM stuff to the OpenCV code, and the OpenCV stuff is opaque to Java debuggers.
Eventually the issue led me back into the Java code: a native pointer was being stored in a Java variable; Java code called the native routine to `free()` the structure, but then tried to feed it to another routine later. This led to difficulty again: tracking objects with Java debuggers was hard because they don’t expect the program to suddenly segfault; it’s Java code, Java doesn’t segfault, it throws exceptions!
With the help of `println()`, I was eventually able to see that some code was executing in an order that straight-up didn’t make sense.
The bug
The issue was that Java was making an unsafe optimization (I never bothered to figure out whether it is the compiler or the JVM making the mistake; I was satisfied once I had a work-around).
Java was doing something similar to tail-call optimization with regard to garbage collection. You see, if it is waiting for the return value of a method `m()` of object `o`, and the code in `m()` that is yet to be executed doesn’t access any other methods or properties of `o`, then it will go ahead and consider `o` eligible for garbage collection before `m()` has finished running.
That is normally a safe optimization to make… except when a destructor method (`finalize()`) is defined for the object; the destructor can have side effects, and Java has no way to know whether it is safe for them to happen before `m()` has finished running.
The work-around
The routine that the segmentation fault was occurring in was something like:
    public type1 getFrame() {
        type2 child = this.getChild();
        type3 var = this.something();
        // `this` may now be garbage collected
        return child.somethingElse(var); // segfault comes here
    }
Where the destructor method of `this` calls a method that will `free()` native memory that is also accessed by `child`; if `this` is garbage collected before `child.somethingElse()` runs, the backing native code will try to access memory that has been `free()`ed, and receive a segmentation fault. That usually didn’t happen, as the routines were pretty fast. However, running 30 times a second, eventually bad luck with the garbage collector happens, and the program crashes.
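The failure mode is hard to trigger on demand, since it depends on garbage collector timing, but its shape can be simulated deterministically. The following is only a sketch under stated assumptions: every class and method name is hypothetical (none of it is the dashboard's actual code), a boolean flag stands in for native memory, an exception stands in for the segfault, and `finalize()` is called by hand to play the part of the collector finalizing the owner early.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical names throughout; this simulates the hazard, it is not the dashboard's code.
@SuppressWarnings({"deprecation", "removal"})
class FinalizerHazardSketch {

    /** Stands in for native memory; a flag replaces a real free(). */
    static final class NativeBuffer {
        private final AtomicBoolean freed = new AtomicBoolean(false);

        void free() {
            freed.set(true);
        }

        int read() {
            // Real native code would segfault here; the sketch throws instead.
            if (freed.get()) {
                throw new IllegalStateException("use after free");
            }
            return 42;
        }
    }

    /** Plays the role of `child`: it reads memory that it does not own. */
    static final class Child {
        private final NativeBuffer buffer;

        Child(NativeBuffer buffer) {
            this.buffer = buffer;
        }

        int somethingElse() {
            return buffer.read();
        }
    }

    /** Plays the role of `this`: its finalizer frees the shared buffer. */
    static final class Owner {
        private final NativeBuffer buffer = new NativeBuffer();
        private final Child child = new Child(buffer);

        Child getChild() {
            return child;
        }

        @Override
        protected void finalize() {
            buffer.free();
        }

        int getFrame() {
            Child c = this.getChild();
            // `this` is not used past this point, so the JVM may treat it as unreachable.
            return c.somethingElse();
        }
    }

    public static void main(String[] args) {
        Owner owner = new Owner();
        Child child = owner.getChild();
        System.out.println(owner.getFrame()); // buffer is still live here

        owner.finalize(); // simulate the collector finalizing `owner` early
        try {
            child.somethingElse();
        } catch (IllegalStateException e) {
            System.out.println("caught: " + e.getMessage());
        }
    }
}
```

In the real program the "early finalization" step happens on the collector's schedule rather than on a line you can point to, which is why the crash only showed up every couple of minutes.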
The work-around was to insert a bogus call to `this` to keep `this` around until after we were also done with `child`:
    public type1 getFrame() {
        type2 child = this.getChild();
        type3 var = this.something();
        type1 ret = child.somethingElse(var);
        this.getSize(); // bogus call to keep `this` around
        return ret;
    }
Yeah. After spending weeks wading through thousands of lines of Java, C, and C++, a bogus call to a method I didn’t care about was the fix.
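For readers hitting the same pattern today: Java 9 and later ship `java.lang.ref.Reference.reachabilityFence()`, which states the same intent as the bogus call without depending on an unrelated method. It did not exist in 2012, so it was not an option for the fix described here. A minimal sketch, with hypothetical class names and `int` standing in for the placeholder types:

```java
import java.lang.ref.Reference;

// Hypothetical names; assumes Java 9+ for Reference.reachabilityFence.
class ReachabilityFenceSketch {

    /** Stand-in for the child object that would read natively owned memory. */
    static final class Child {
        int somethingElse(int var) {
            return var * 2;
        }
    }

    static final class Owner {
        private final Child child = new Child();

        Child getChild() {
            return child;
        }

        int something() {
            return 21;
        }

        int getFrame() {
            Child c = this.getChild();
            int var = this.something();
            try {
                return c.somethingElse(var);
            } finally {
                // Keeps `this` strongly reachable at least until this point.
                Reference.reachabilityFence(this);
            }
        }
    }

    public static void main(String[] args) {
        System.out.println(new Owner().getFrame());
    }
}
```

Putting the fence in a `finally` block means it still applies when the call on `child` throws.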