JEP 132: More-prompt finalization

Peter Levart peter.levart at gmail.com
Thu May 28 17:12:14 UTC 2015

Hi,

Did you know that the following simple loop:

    public class FinalizableBottleneck {
        static boolean no;

        @Override
        protected void finalize() throws Throwable {
            // The body is deliberately non-empty: an empty finalize() method
            // does not make the object finalizable (it is not even registered
            // on the finalizer's list).
            if (no) {
                throw new AssertionError();
            }
        }

        public static void main(String[] args) {
            while (true) {
                new FinalizableBottleneck();
            }
        }
    }

...quickly fills the entire heap with FinalizableBottleneck and internal Finalizer objects and brings the JVM to a halt? After a few seconds of running the above program, jmap -histo:live reports:

 num     #instances         #bytes  class name
----------------------------------------------
   1:      50048325     2001933000  java.lang.ref.Finalizer
   2:      50048278      800772448  FinalizableBottleneck

There are a couple of bottlenecks that make this happen:

- The ReferenceHandler thread synchronizes with the VM to unhook Reference(s) from the pending chain one by one and dispatches them to their respective ReferenceQueue(s), which also use synchronization for enqueueing each Reference.
- Enqueueing synchronizes with the finalization thread, which removes the Finalizer(s) (FinalReferences) from the finalization queue and executes them.
- Executing the Finalizer(s) removes them from the doubly-linked list of all Finalizer(s), which is used to retain them until they are needed, and this synchronizes with the threads that link new Finalizer(s) into the doubly-linked list as new finalizable objects get registered.

We see that the creation of a finalizable object only takes one synchronization (registering into the doubly-linked list) and is performed synchronously, while finalization takes 4 synchronizations among 4 different threads (in pairs) and happens as the Finalizer instance "travels" from the VM thread to the ReferenceHandler thread and then to the finalization thread. No wonder that finalization cannot keep up with allocation in a single thread. The situation is even worse when finalize() methods do some actual work.

I have experimented with various approaches to widen these bottlenecks and found that I cannot beat the ForkJoinPool when it is combined with some improvements to the internal data structures used in reference processing. Here's a prototype I came up with:

http://cr.openjdk.java.net/~plevart/misc/JEP132/ReferenceHandling/webrev.01/

And this is the benchmark I use for measuring the throughput:

http://cr.openjdk.java.net/~plevart/misc/JEP132/ReferenceHandling/FinalizerThroughput.java

The benchmark shows (results inline in the source) that with an unpatched JDK on my PC (i7-2700K, Linux, JDK 8) I cannot construct more than 1500 finalizable objects per ms in a single thread, and that while doing so, finalization only manages to process approx. 100 - 120 objects in the same time (per ms). Objects "in-flight" quickly accumulate and bring the VM to a halt, where it is not doing anything but full GC cycles.

When constructing in 4 threads, there's not much difference. Construction of finalizable objects simply doesn't scale.

The patched JDK shows something completely different. Single-thread construction achieves a rate of 3600 objects / ms. The number of "in-flight" objects is kept constant at about 5-6M instances, which amounts to approx. 1.5 s of allocation. I think this is about the interval between GC cycles, during which the VM also processes the references. The benchmark also shows the ForkJoinPool statistics, which show that the number of queued tasks is also kept low.

Increasing the allocation threads to 4 increases the allocation rate to about 4300 objects / ms and finalization keeps up. Increasing the allocation threads to 8 further increases the allocation rate to about 4600 objects / ms and finalization still keeps up. The increase in rate is not linear, but keep in mind that the i7 is a 4-core CPU.

About the implementation...

The first improvement I made was to the doubly-linked list of Finalizer instances that is used to keep them alive until they are needed. I ripped off the wonderful ConcurrentLinkedDeque by Doug Lea and Martin Buchholz and just kept the internal link/unlink methods while specializing them to Finalizer entries (very straightforward). I experimented with throughput and got some improvement, but throughput increased much more when I used several instances of independent lists and distributed registrations among them randomly (unlinking is consequently also distributed randomly).
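To illustrate the idea of several independent lists, here is a minimal sketch. It is not the code from the webrev: the names StripedRegistry, Stripe and Node are made up for this example, and each list is guarded by its own monitor instead of the lock-free link/unlink taken from ConcurrentLinkedDeque, purely to keep the example short. The point it shows is that a registration picks a random list, and a node remembers its list so that unlinking touches only that list.

```java
import java.util.concurrent.ThreadLocalRandom;

/** Hypothetical sketch: registrations spread over independent doubly-linked lists. */
final class StripedRegistry<T> {

    /** A list node; it remembers its stripe so that unlinking needs no search. */
    static final class Node<T> {
        final T item;
        final Stripe<T> stripe;
        Node<T> prev, next;
        boolean linked;          // guarded by the stripe's monitor

        Node(T item, Stripe<T> stripe) {
            this.item = item;
            this.stripe = stripe;
        }
    }

    /** One doubly-linked list guarded by its own monitor (assumption: not lock-free). */
    static final class Stripe<T> {
        private Node<T> head;
        private int size;

        synchronized void link(Node<T> n) {
            n.next = head;
            if (head != null) head.prev = n;
            head = n;
            n.linked = true;
            size++;
        }

        synchronized boolean unlink(Node<T> n) {
            if (!n.linked) return false;           // already unlinked
            if (n.prev != null) n.prev.next = n.next; else head = n.next;
            if (n.next != null) n.next.prev = n.prev;
            n.prev = null;
            n.next = null;
            n.linked = false;
            size--;
            return true;
        }

        synchronized int size() {
            return size;
        }
    }

    private final Stripe<T>[] stripes;

    @SuppressWarnings("unchecked")
    StripedRegistry(int stripeCount) {
        stripes = (Stripe<T>[]) new Stripe[stripeCount];
        for (int i = 0; i < stripeCount; i++) stripes[i] = new Stripe<>();
    }

    /** Links the item into a randomly chosen stripe and returns its node. */
    Node<T> register(T item) {
        Stripe<T> s = stripes[ThreadLocalRandom.current().nextInt(stripes.length)];
        Node<T> n = new Node<>(item, s);
        s.link(n);
        return n;
    }

    /** Unlinks the node from the stripe it was registered in. */
    boolean unregister(Node<T> n) {
        return n.stripe.unlink(n);
    }

    /** Sum of the stripe sizes; only a snapshot under concurrent use. */
    int size() {
        int total = 0;
        for (Stripe<T> s : stripes) total += s.size();
        return total;
    }
}
```

With N stripes, two threads registering or unregistering at the same time contend only when they happen to hit the same stripe.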

I found that no matter how hard I tried to optimize ReferenceQueue while keeping the API unchanged, I could only do so much, and that was not enough. I was surprised by how well ForkJoinPool distributes tasks among threads, so I concluded that leveraging it is the best choice. I redesigned the pending-list unhooking loop to unhook pending references in chunks, which greatly improves throughput. Since unhooking can be performed by a single thread while holding a lock that is mandated by the interface between the VM and Java, I didn't employ multiple threads, but a single eternal ForkJoinTask that unhooks in chunks and forks off other tasks that process the chunks. When there are just a couple of References pending at one time and a not-full chunk is unhooked, the processing is performed by the same thread that unhooked the references, but when there are more, worker tasks are forked off and the unhooking thread continues at full pace. This processing includes execution of Cleaners, forking the finalizer tasks and enqueueing other references. Finalizer(s) are always executed as separate ForkJoinTask(s).
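The chunked unhooking can be sketched roughly as follows. This is only an illustration under several assumptions, not the code from the webrev: ChunkedDrainer, publish and drainIn are hypothetical names, an ArrayDeque guarded by a plain lock stands in for the VM's pending list, and the task drains what is pending once and then joins its forked tasks instead of running forever.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.ForkJoinTask;
import java.util.function.Consumer;

/** Hypothetical sketch: one task unhooks in chunks and forks the processing. */
final class ChunkedDrainer<R> {
    private final Object lock = new Object();              // stands in for the VM/Java pending-list lock
    private final ArrayDeque<R> pending = new ArrayDeque<>();
    private final int chunkSize;
    private final Consumer<R> processor;

    ChunkedDrainer(int chunkSize, Consumer<R> processor) {
        this.chunkSize = chunkSize;
        this.processor = processor;
    }

    /** Adds an item to the pending list (in the real VM this is done by GC). */
    void publish(R ref) {
        synchronized (lock) {
            pending.add(ref);
        }
    }

    /** Unhooks up to chunkSize items while holding the lock once. */
    private List<R> unhookChunk() {
        List<R> chunk = new ArrayList<>(chunkSize);
        synchronized (lock) {
            R r;
            while (chunk.size() < chunkSize && (r = pending.poll()) != null) {
                chunk.add(r);
            }
        }
        return chunk;
    }

    /** Drains what is pending; returns the number of chunks that were forked. */
    private int drain() {
        List<ForkJoinTask<?>> forked = new ArrayList<>();
        for (;;) {
            List<R> chunk = unhookChunk();
            if (chunk.isEmpty()) break;
            if (chunk.size() < chunkSize) {
                chunk.forEach(processor);                   // not-full chunk: process in this thread
            } else {
                Runnable work = () -> chunk.forEach(processor);
                forked.add(ForkJoinTask.adapt(work).fork()); // full chunk: hand off, keep unhooking
            }
        }
        // The sketch is bounded, so it waits for the forked chunks before returning.
        for (ForkJoinTask<?> t : forked) t.join();
        return forked.size();
    }

    /** Runs the drain as a task inside the given pool and waits for it. */
    int drainIn(ForkJoinPool pool) {
        Callable<Integer> job = this::drain;
        return pool.invoke(ForkJoinTask.adapt(job));
    }
}
```

For example, with a chunk size of 64 and 1000 published items, the drain forks 15 full chunks and processes the remaining 40 items in the draining thread itself.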

It's interesting how Runtime.runFinalization() is implemented in this patch - it basically amounts to ForkJoinPool.awaitQuiescence() ...
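As a rough illustration of waiting for outstanding work via quiescence, here is a small sketch. QuiescenceBarrier and submitAndAwait are hypothetical names, the counter-bumping tasks merely stand in for finalizer tasks, and the timeout is an arbitrary choice for the example; how the patch wires this into the runtime is not shown here.

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical sketch: using pool quiescence as a "wait for pending work" barrier. */
final class QuiescenceBarrier {

    /**
     * Submits taskCount tasks that each bump the counter, then waits until
     * the pool reports quiescence or the timeout elapses.
     * Returns what awaitQuiescence returned (true if the pool became quiescent).
     */
    static boolean submitAndAwait(ForkJoinPool pool, AtomicInteger counter,
                                  int taskCount, long timeoutSeconds) {
        // Stands in for a forked "run one finalizer" task.
        Runnable work = counter::incrementAndGet;
        for (int i = 0; i < taskCount; i++) {
            pool.execute(work);
        }
        // The calling thread may help run tasks while it waits.
        return pool.awaitQuiescence(timeoutSeconds, TimeUnit.SECONDS);
    }
}
```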

I also tweaked the ReferenceQueue implementation a bit (it is still used for other kinds of references) so that it avoids synchronization with a monitor lock when there are no blocking waiters and uses CAS to enqueue/dequeue. This improves throughput when the queue is not empty. Since in the prototype multiple threads can enqueue into the same queue, I thought this would improve throughput in such situations.
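The CAS part can be sketched like this. It is a generic lock-free LIFO (a Treiber-style stack), not the patched ReferenceQueue: CasQueue is a hypothetical name, the elements are wrapped in freshly allocated nodes, and the handling of blocking waiters (where a monitor is still needed) is left out entirely.

```java
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical sketch: enqueue/poll with CAS on the head, no monitor lock. */
final class CasQueue<T> {

    private static final class Node<T> {
        final T item;
        Node<T> next;

        Node(T item) {
            this.item = item;
        }
    }

    private final AtomicReference<Node<T>> head = new AtomicReference<>();

    /** Pushes the item at the head; retries if another thread won the race. */
    void enqueue(T item) {
        Node<T> n = new Node<>(item);
        Node<T> h;
        do {
            h = head.get();
            n.next = h;
        } while (!head.compareAndSet(h, n));
    }

    /** Pops the most recently enqueued item, or returns null if empty. */
    T poll() {
        Node<T> h;
        do {
            h = head.get();
            if (h == null) return null;
        } while (!head.compareAndSet(h, h.next));
        return h.item;
    }
}
```

Because every enqueue allocates a new node and nodes are never reused, the classic ABA problem of CAS-based stacks does not arise in this sketch.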

Comments, suggestions, criticism are welcome.

Regards, Peter

More information about the core-libs-dev mailing list