8152312: ParNew: Restore preserved marks in parallel
Tony Printezis tprintezis at twitter.com
Fri Mar 25 13:35:31 UTC 2016
Thomas,
Inline.
On March 25, 2016 at 8:56:59 AM, Thomas Schatzl (thomas.schatzl at oracle.com) wrote:
Hi,
On Thu, 2016-03-24 at 11:36 -0400, Tony Printezis wrote:
Thomas,
Thanks for the feedback. Updated webrev here: http://cr.openjdk.java.net/~tonyp/8152312/webrev.1/

Changes:

- Your main comment was whether to make the serial vs. parallel decision in DefNew or in restore(). I liked the suggestion of either passing NULL or the workers to restore(WorkGang*) and deciding in restore(...) what to do. The only problem is that ParallelGC doesn't use a WorkGang; it uses a GCTaskManager instead.
Drat!
Indeed. :-)
And it would be nice for ParallelGC to also take advantage of anything we do in restore(...). So, I came up with a cunning scheme to do that. Let me know what you think. :-)
Yes, it works fine with ParallelGC too (I already implemented/tested the changes; I'll open them for code review once this is done). I left a restore() method there temporarily to be used by ParallelGC, but it will go away.
Some specialized implementation of the method based on the type of E, I guess.
First, if it helps you, here are the changes for PS (as applied on top of the ParNew changes):
http://cr.openjdk.java.net/~tonyp/8152312/webrev.1.ps/
Note that we now basically duplicate the restore() method,
Depends which restore() method you’re talking about. restore(E*) remains unchanged in the PS changes. We have to introduce a new restore_internal() that knows how to use the GCTaskManager (and, yes, there’s a bit of code replication in that task; but only a few lines). So, in the future, if we add some additional logic to decide whether to do the restoration serially or in parallel, we’ll only have to change restore(E*) and the “executor”-specific code remains the same (I have hacked up a version of that too, as a proof-of-concept, and it does work out fine, FWIW).
it gets close to just having two (parallel gc, the rest) implementations of the class. Let's see your solution.
We have a single method that decides what to do (serial or parallel restoration) and yields to the specific parallel implementation as needed.
I used a template so that I can put the shared implementation in one place and still be able to easily call the “executor”-specific methods when needed without having lots of ridiculous boilerplate code (pass both executors to the method, make sure only one of them is non-NULL, based on which one is not NULL call the appropriate parallel method, etc.).
If you don’t like the use of the template here, there’s an alternative. Expose the following:
restore(WorkGang* workers)
restore(GCTaskManager* gc_task_manager)
and each of those first calls a private restore_internal() which either does the restoration serially or decides not to and tells the “executor”-specific restore() that it needs to do the work in parallel. But you said in your e-mail that you wanted a single restore() which yields to the parallel when needed, which is what I tried to do in the webrev. :-)
Either way is totally fine with me. Pick your poison.
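To make the template idea concrete, here's a minimal, self-contained sketch of the shape I have in mind (the executor types and method names below are hypothetical stand-ins for illustration, not the actual classes in the webrev):

  #include <cstddef>

  // Hypothetical executor wrappers: each one knows how to run the
  // parallel restore task on its own thread-pool abstraction.
  struct WorkGangExecutor {
    // Would wrap a WorkGang* and run an AbstractGangTask.
    void restore_in_parallel() { /* run the restore task on the gang */ }
  };

  struct TaskManagerExecutor {
    // Would wrap a GCTaskManager* and enqueue GCTasks instead.
    void restore_in_parallel() { /* enqueue the restore GCTasks */ }
  };

  class PreservedMarksSet {
  public:
    // Single shared entry point: the serial-vs-parallel decision lives
    // here; everything executor-specific hides behind the type E.
    template <class E>
    void restore(E* executor) {
      if (executor == NULL) {
        restore_serially();
      } else {
        // A future policy check here could still pick the serial path
        // when only a few marks were preserved.
        executor->restore_in_parallel();
      }
    }

  private:
    void restore_serially() { /* walk each stack and restore its marks */ }
  };

Callers pass NULL for the serial path, or whichever executor fits their collector; only restore(E*) would need to change if we later add a smarter policy.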
I do not know the actual implementation of your scheme, but it would be nice if the code between lines 65 and 68 in preservedMarks.inline.hpp were moved into a method too, particularly if the future implementation also has a single-threaded path.
It doesn’t; that loop is not replicated (see the PS webrev).
I would prefer to have the implementation(s) of restore() in the cpp file(s) though
I agree. But, given that restore(E*) is a template, I assume I have to put it in the .inline.hpp file. If you really want it in the .cpp file, I think we have to go with the scheme I describe above (expose restore(WorkGang*) and restore(GCTaskManager*)).
(and possibly hide the one for parallel somewhere very far away ;) ).
:-)
The additional call in this case does not seem to be problematic performance-wise.
I totally agree with this.
- Yes, if each stack has a small number of entries we’re better off doing the restoration serially instead of firing up the workers. This will be a small localized change to what I have in the webrev (we’ll only have to change restore(E*);
One comment here: the comment at
70 // Right now, if the executor is not NULL we do the work in
71 // parallel. In the future we might want to do the restoration
72 // serially, if there's only a small number of marks per stack.
imo is better placed before line 64, and cut down a lot to something like:
// Decide on the amount of parallelism.
Please add an RFE in the bug tracker explaining these "in the future" ideas. That would be much better than dumping your thoughts in some random place in the code.
I put the comment where we were calling the parallel version, so I didn’t think it was a random place. :-) But, sure, I’ll move it earlier...
the main question will be what policy to use!). I'll do a follow-up to this too.
The main goal is not to be perfect imo, but to determine a reasonable, somewhat conservative upper bound on the number of threads used. Too many people use too large machines on small problems.
I agree, I don’t think it needs to be perfect and we shouldn’t spend too much time on it. I was actually going to propose a very simple scheme that makes the determination based on the size of the stacks. I.e., start iterating over the stacks and check each stack’s size. If it’s under a limit, we just go ahead and do the work serially. Eventually, if we come across a stack whose size is “too large”, we bail out of the serial phase and yield to the parallel version (and we can notify the parallel version that we’ve already done N stacks and it should ignore them). Even if there’s a small number of GC threads (2 / 4), it’s still worth doing the work serially if only a small number of marks have been preserved. (IMHO)
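Here's a rough, self-contained sketch of that bail-out scheme (the threshold value, types, and helper name are made up for illustration, not taken from the webrev):

  #include <cstddef>
  #include <vector>

  // Stand-in for a per-thread PreservedMarks stack.
  struct MarkStack {
    std::vector<int> marks;  // placeholder for the preserved (oop, mark) pairs
    size_t size() const { return marks.size(); }
    void restore() { marks.clear(); }  // placeholder for the real restoration
  };

  // Made-up limit: a stack at or under this size is restored serially.
  const size_t SmallStackLimit = 64;

  // Restores a serial prefix of the stacks and returns the index of the
  // first stack left over; the parallel version is then told to ignore
  // stacks [0, done).
  size_t restore_serial_prefix(std::vector<MarkStack>& stacks) {
    size_t done = 0;
    for (; done < stacks.size(); done++) {
      if (stacks[done].size() > SmallStackLimit) {
        break;  // "too large": bail out and let the workers take over
      }
      stacks[done].restore();
    }
    return done;
  }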
One could argue that work units < 1ms (or another low value) are not worth distributing.
I agree.
From that one could determine a "reasonable" minimum number of references to process per thread, from which you could then deduce a range of PreservedMarks buffers that each thread could use (assuming that the actual references are evenly distributed across these sets of references).
At least that's what I would start with.
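As a back-of-the-envelope sketch of that sizing rule (the per-mark cost is an invented figure, only there to show the shape of the calculation):

  #include <algorithm>
  #include <cstddef>

  // Invented figure: if one thread restores ~50 marks per microsecond,
  // a ~1ms minimum work unit is roughly 50,000 marks per thread.
  const size_t MinMarksPerThread = 50 * 1000;

  // Cap the worker count so that each thread gets at least ~1ms of work.
  unsigned workers_for(size_t total_marks, unsigned max_workers) {
    size_t by_work = std::max<size_t>(1, total_marks / MinMarksPerThread);
    return (unsigned) std::min<size_t>(by_work, max_workers);
  }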
- Ah, yes, I’m glad there’s an atomic add for size_t in JDK 9. :-)
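For what that enables, a portable sketch of the pattern (std::atomic stands in here for HotSpot's Atomic support, and using it to accumulate a restored-marks total is my assumption about the use):

  #include <atomic>
  #include <cstddef>

  // Shared running total of restored marks, updated by several workers.
  std::atomic<size_t> total_restored(0);

  // Each worker adds the count from the stacks it restored; a relaxed
  // atomic add suffices since the total is only read after the pause.
  void note_restored(size_t n) {
    total_restored.fetch_add(n, std::memory_order_relaxed);
  }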
- Re: ParRestoreTask vs. RestoreTask: I tend to add “Par” as the tasks should be safe to use in parallel (even if in practice they are not). Hope that’s OK.
Another reason is that nowadays practically everything done during a STW pause must somehow be implemented to run in parallel to be actually useful (except for serial gc obviously). So this marker is kind of misleading and redundant.
Depends whether you see “Par” as “it will be done in parallel” or “it might be done in parallel”. I see it as the latter (and there are lots of cases in the code where Par is used for such classes).
Tony
Not really insisting on this, but it may be that some future general cleanup will modify this (nothing actually planned from me).
Thanks, Thomas
Tony Printezis | JVM/GC Engineer / VM Team | Twitter
@TonyPrintezis tprintezis at twitter.com