JEP 189: Shenandoah: An Ultra-Low-Pause-Time Garbage Collector

Thomas Schatzl thomas.schatzl at oracle.com
Tue Jan 21 09:38:34 UTC 2014


Hi Christine,

great to see somebody hacking GCs in Hotspot. Welcome :)

On Mon, 2014-01-20 at 16:17 -0500, Christine Flood wrote:

> ----- Original Message -----
> > From: "Kirk Pepperdine" <kirk at kodewerk.com>
> > Which we all happily grant :-)
> >
> > BTW, the history of G1 is that it started without a generational
> > perturbation but it was quickly determined that the gains to be made
> > with having a separate nursery were too great to ignore. IBM also
> > tried to keep a single space for quite some time but eventually backed
> > out to a generational arrangement. My question is; what is in this
> > collector that makes you believe you can ignore the weak generational
> > hypothesis?
>
> The benefit of generational systems is as much about remembered set
> size as it is about object lifetimes. Generational G1 is a win because
> you don't need to keep track of pointers from young regions to older
> regions. The young regions are collected at every GC pause and will be
> traced then.

It's a simple, reasonably robust heuristic that tells G1 where the most memory can be reclaimed efficiently, without the need for a full liveness analysis or other means of determining object death. That is not free either.

The remembered set savings (and data structure and runtime savings in general) are secondary imo. They are a nice additional benefit, but since the biggest point of a GC is presumably to reclaim memory, it seems beneficial to look first where there is most likely a lot of unused space.

Also, the number of references from old to young can be large.

After introducing this heuristic and knowing that it is here to stay, you will certainly try to exploit it as much as possible, won't you?
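
To make the remembered-set point concrete, here is a minimal sketch of a generational post-write barrier, assuming a card-table design with fixed-size regions. All names and sizes are invented for illustration; this is not HotSpot's actual barrier code.

#include <cstddef>
#include <cstdint>

// Sketch: stores into young regions need no record at all, because young
// regions are evacuated and fully traced at every pause. Only pointers
// from old regions into young ones have to be remembered.
namespace sketch {

constexpr size_t kRegionShift = 20;          // 1 MiB regions (assumed)
constexpr size_t kCardShift   = 9;           // 512-byte cards (assumed)
constexpr size_t kMaxRegions  = 4096;

inline char*   heap_base;                    // set at heap initialization
inline bool    region_is_young[kMaxRegions]; // maintained by the allocator
inline uint8_t card_table[1u << 23];         // one byte per card

inline bool is_young(const void* p) {
  size_t r = (static_cast<const char*>(p) - heap_base) >> kRegionShift;
  return region_is_young[r];
}

// Post-write barrier, conceptually run after every "obj.field = value".
inline void post_write_barrier(void* field_addr, void* new_value) {
  if (is_young(field_addr)) return;          // young source: nothing to do
  if (new_value != nullptr && is_young(new_value)) {
    // An old region now points into young: dirty the card so the next
    // young collection can find this reference without a full heap scan.
    size_t card = (static_cast<char*>(field_addr) - heap_base) >> kCardShift;
    card_table[card] = 1;
  }
}

}  // namespace sketch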

Some ideas off the top of my head for saving additional remembered set space: you could arbitrarily group regions into more "young gens", or lazily create the remembered sets for regions as needed (again, using the marking thread), or use a different remembered set implementation with a different memory usage/performance tradeoff.

That would be something interesting to explore, as the remembered set itself can be made as small or large as you desire, depending on the cost/benefit tradeoff you want to make and what your other goals are.
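
As an illustration of such a tradeoff, here is a sketch of a lazily created remembered set that coarsens under memory pressure. The thresholds and names are invented; real implementations differ in many details.

#include <cstdint>
#include <memory>
#include <set>
#include <bitset>

// Sketch: a region's remembered set is created lazily (e.g. by the
// marking thread), stores exact card indices while small, and coarsens
// to a per-region bitmap once it grows, trading scan precision for
// bounded memory.
namespace sketch {

constexpr size_t kMaxRegions  = 2048;
constexpr size_t kSparseLimit = 128;   // coarsen beyond this many cards

class RememberedSet {
  std::set<uint32_t> sparse_cards_;                   // precise entries
  std::unique_ptr<std::bitset<kMaxRegions>> coarse_;  // imprecise fallback
 public:
  void add(uint32_t card, uint32_t from_region) {
    if (coarse_) { coarse_->set(from_region); return; }
    sparse_cards_.insert(card);
    if (sparse_cards_.size() > kSparseLimit) {
      // Bound the memory: from now on remember only which regions point
      // in here; collecting this region then costs a scan of each one.
      coarse_ = std::make_unique<std::bitset<kMaxRegions>>();
      coarse_->set(from_region);  // (real code would migrate all entries)
      sparse_cards_.clear();
    }
  }
  bool is_coarse() const { return coarse_ != nullptr; }
};

}  // namespace sketch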

Replacing the RS implementation should be reasonably simple for any programmer with some introduction to the code, at least compared to introducing a new GC :) Getting the same performance across a large set of applications may not be so simple, though. Still, I am sure there is much room for improvement.

The problem in that area, at least in G1, is imo more a matter of getting the selection policy right for the regions you want to keep remembered sets for at a particular moment. Shenandoah will need to make similar selection decisions. Maybe this is somewhat easier there because you have a complete overview of liveness after marking. (But you can continuously run concurrent markings there too, at a cost in throughput.)
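
For illustration, a minimal sketch of what such a liveness-driven selection could look like when exact per-region liveness is known after a concurrent marking cycle. All names are invented; this is neither G1's nor Shenandoah's actual policy.

#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Sketch: "garbage-first" collection-set selection. Pick the regions
// with the most garbage first, since they reclaim the most memory per
// byte of live data that has to be copied.
namespace sketch {

struct Region {
  uint32_t id;
  size_t   live_bytes;   // exact, from the last marking cycle
  size_t   capacity;
};

std::vector<uint32_t> choose_collection_set(std::vector<Region> regions,
                                            size_t evac_budget_bytes) {
  std::sort(regions.begin(), regions.end(),
            [](const Region& a, const Region& b) {
              return (a.capacity - a.live_bytes) >
                     (b.capacity - b.live_bytes);
            });
  std::vector<uint32_t> cset;
  size_t copied = 0;
  for (const Region& r : regions) {
    if (copied + r.live_bytes > evac_budget_bytes) break;
    copied += r.live_bytes;   // evacuation cost is the live data
    cset.push_back(r.id);
  }
  return cset;
}

}  // namespace sketch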

> Our algorithm doesn't require remembered sets because reference updates
> happen lazily. We don't need to be able to find every reference to an
> object and update it, we can wait for the concurrent marking thread to
> find those references and update them for us. This gives us a different
> cost/benefit analysis. I believe Shenandoah fulfills the promise of G1

Shenandoah buys its lack of remembered sets and its pauselessness by intentionally slowing down the mutator with "complex" read/write barriers and the additional indirection of the Brooks pointers. There is also the increased memory footprint that a user needs to be aware of.

The Brooks pointer is an additional pointer per object (you mentioned 60G heaps, so I will assume 8 bytes per pointer). At an average object size of maybe 30-40 bytes (just a guess which should be somewhat correct), this increases the footprint of the application by 15-20%. That increase alone will (I assume) impact memory access performance (caching, bandwidth).
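
For illustration, a minimal sketch of the indirection and the arithmetic. Names are invented; this is not actual Shenandoah code.

#include <atomic>

// Sketch: every object carries one extra word, a forwarding pointer that
// points to the object itself until the object is evacuated, and to the
// new copy afterwards.
namespace sketch {

struct ObjHeader {
  std::atomic<ObjHeader*> forwardee;   // the extra 8-byte word per object
  // ... ordinary header and fields would follow ...
};

// Read barrier: every access pays one extra dependent load.
inline ObjHeader* resolve(ObjHeader* obj) {
  return obj->forwardee.load(std::memory_order_acquire);
}

// Footprint arithmetic from the paragraph above: 8 extra bytes on a
// 30-40 byte object is 8/38 to 8/48, i.e. roughly 17-21% of the total,
// which is the ballpark of the quoted 15-20% estimate.
}  // namespace sketch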

So it's really just a different tradeoff that Shenandoah makes. I do think everyone is already really interested in how this tradeoff looks at the moment. :)

Not that the G1 remembered set is particularly small :), but first, it is less overhead than that, and as mentioned above there is a lot of potential for memory savings. Shenandoah cannot do away with the Brooks pointers.

Otoh G1 has a few pauses (which we are working on greatly reducing at the moment).

> by concentrating GC effort where the garbage is whether or not that is
> in young regions. There is no distinction in Shenandoah between
> allocating in a separate eden and allocating in an empty region, you
> would perform exactly the same work.

I can imagine this was how the non-generational G1 initially worked too. It did not deliver the expected performance, or at least it proved extremely beneficial, for a few reasons, to collect the nursery regions by default. Probably mainly because G1 would select the regions it had recently allocated into anyway.

As you mention, Shenandoah will likely end up doing the same automatically.

Again, note that young gen in G1 is really mostly about selecting the "best" regions for reclamation. You may notice most code in G1 is only interested in the collection set, not the young gen.

> As an aside, I have no bias for generational collectors but I have
> seen places where having a continuously running collector would have
> been very advantageous. For example, I had one case where we had a
> hard 16ms time budget on bursts of fixed units of work. In that case
> the work could be completed in 5 or 6 ms which left 10ms for GC
> (should it decide to run, and it undoubtedly would have run in the
> middle of the 5-6ms burst of work) but we could never get the
> collector to run consistently in under 10ms without risking an OOME.
> I'm pretty convinced that a continuous collection policy would have
> helped us meet that goal and we had plenty of CPU to throw at the
> problem so….

I think another hardware-related problem is memory, not so much CPU. As you mention, there are typically always enough CPUs to throw at the problem. Increasing the amount of memory gets comparatively expensive beyond a certain threshold, and getting good memory bandwidth (at good latency) even more so.

Others probably have more insight into that.

Thanks, and welcome again,
  Thomas


