Hi Vladimir --

I noticed the increase even with Initial and Reserved set to the default of 240 MB, but with actual usage much lower (less than a quarter).

Look at this code path. Note that this is invoked at every safepoint (although the comment says "periodically").
In the mark_active_nmethods() method, there's a thread iteration in both branches of the if. I haven't checked yet
to see which of the two was the culprit here (if either).

// Various cleaning tasks that should be done periodically at safepoints

void SafepointSynchronize::do_cleanup_tasks() {
  ...
  {
    TraceTime t4("mark nmethods", TraceSafepointCleanupTime);
    NMethodSweeper::mark_active_nmethods();
  }
  ...
}


void NMethodSweeper::mark_active_nmethods() {
  ...
  if (!sweep_in_progress()) {
    _seen = 0;
    _sweep_fractions_left = NmethodSweepFraction;
    _current = CodeCache::first_nmethod();
    _traversals += 1;
    _total_time_this_sweep = Tickspan();

    if (PrintMethodFlushing) {
      tty->print_cr("### Sweep: stack traversal %d", _traversals);
    }
    Threads::nmethods_do(&mark_activation_closure);

  } else {
    // Only set hotness counter
    Threads::nmethods_do(&set_hotness_closure);
  }

  OrderAccess::storestore();
}
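
To put the cost in perspective, here's a rough standalone model of what that walk amounts to (this is not the HotSpot code -- the types and the marking detail below are made up for illustration): every frame of every Java thread is visited by a single thread while the world is stopped, so the time scales with thread count times average stack depth.

#include <cstdio>
#include <vector>

// Hypothetical stand-ins for a compiled method and a Java thread's stack.
struct Nmethod         { bool marked = false; };
struct Frame           { Nmethod* nm; };   // nm == nullptr for interpreted frames
struct JavaThreadModel { std::vector<Frame> frames; };

// Single-threaded marking pass: visit every frame of every thread, touching
// the nmethod bookkeeping for each compiled frame found. This mirrors only
// the shape of Threads::nmethods_do(&closure), not its internals.
static long mark_active_nmethods_model(std::vector<JavaThreadModel>& threads) {
  long visited = 0;
  for (JavaThreadModel& t : threads) {     // several thousand threads...
    for (Frame& f : t.frames) {            // ...times the depth of each stack
      if (f.nm != nullptr) {
        f.nm->marked = true;               // stand-in for "seen on stack" bookkeeping
      }
      ++visited;
    }
  }
  return visited;
}

int main() {
  // e.g. 4000 threads with 50-frame stacks => 200,000 frames per safepoint cleanup
  std::vector<JavaThreadModel> threads(4000, JavaThreadModel{std::vector<Frame>(50, Frame{nullptr})});
  std::printf("visited %ld frames single-threaded\n", mark_active_nmethods_model(threads));
  return 0;
}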


On Fri, Jul 31, 2015 at 11:43 AM, Vladimir Kozlov <vladimir.kozlov@oracle.com> wrote:
Hi Ramki,

Did you fill up the CodeCache? It starts scanning aggressively only when the CodeCache is full:

// Force stack scanning if there is only 10% free space in the code cache.
// We force stack scanning only if the non-profiled code heap gets full, since critical
// allocations go to the non-profiled heap and we must make sure that there is
// enough space.
double free_percent = 1 / CodeCache::reverse_free_ratio(CodeBlobType::MethodNonProfiled) * 100;
if (free_percent <= StartAggressiveSweepingAt) {
  do_stack_scanning();
}

Vladimir
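
For what it's worth, plugging the earlier numbers into the check quoted above: with roughly a quarter of a 240 MB code cache in use, free space is around 75%, far above the 10% trigger described in the comment, so that aggressive-scanning path shouldn't be the one firing here. A tiny sketch of the arithmetic (assuming reverse_free_ratio is essentially capacity divided by free space -- an assumption about its meaning, not a quote of the HotSpot implementation):

#include <cstdio>

// Illustrative arithmetic only, not the HotSpot code.
static double free_percent(double capacity_mb, double used_mb) {
  double reverse_free_ratio = capacity_mb / (capacity_mb - used_mb);  // assumed definition
  return 1.0 / reverse_free_ratio * 100.0;  // same expression as the snippet above
}

int main() {
  // Reported setup: ~240 MB reserved, actual usage below a quarter of that (60 MB as an upper bound).
  std::printf("free_percent = %.1f%%\n", free_percent(240.0, 60.0));  // prints ~75.0%
  return 0;
}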


On 7/31/15 11:33 AM, Srinivas Ramakrishna wrote:

Yes.


On Fri, Jul 31, 2015 at 11:31 AM, Vitaly Davidovich <vitalyd@gmail.com> wrote:

Ramki, are you running tiered compilation?

sent from my phone

On Jul 31, 2015 2:19 PM, "Srinivas Ramakrishna" <ysr1729@gmail.com> wrote:


Hello GC and Compiler teams!

One of our services that runs with several thousand threads recently noticed an increase in safepoint stop times, but not GC times, upon transitioning to JDK 8.

Further investigation revealed that most of the delta was related to the so-called pre-GC/vmop "cleanup" phase when various book-keeping activities are performed, and more specifically in the portion that walks Java thread stacks single-threaded (!) and updates the hotness counters for the active nmethods. This code appears to be new to JDK 8 (in JDK 7 one would walk the stacks only during code cache sweeps).

I have two questions:
(1) Has anyone else (typically, I'd expect applications with many hundreds or thousands of threads) noticed this regression?
(2) Can we do better, for example, by:
    (a) doing these updates by walking thread stacks in multiple worker threads in parallel, or best of all:
    (b) doing these updates when we walk the thread stacks during GC, and skipping this phase entirely for non-GC safepoints (with an attendant loss in frequency of this update in low-GC-frequency scenarios)?

It seems kind of silly to do GCs with many parallel worker threads, but do these thread stack walks single-threaded when the work is embarrassingly parallel (one could predicate the parallelization on the measured stack sizes and thread population, if there were concerns about the overhead of activating and deactivating the thread gangs for the work).
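
As a strawman for (a), the work splits naturally over the thread list. The sketch below uses plain C++ threads and made-up thread/frame types purely to illustrate the chunking; it is not VM code (inside HotSpot one would presumably reuse the existing worker gang and the same per-thread closure):

#include <atomic>
#include <thread>
#include <vector>

// Hypothetical stand-ins, as in the earlier model; not HotSpot types.
struct Nmethod         { std::atomic<bool> marked{false}; };
struct Frame           { Nmethod* nm; };
struct JavaThreadModel { std::vector<Frame> frames; };

// Each worker claims whole Java threads by bumping a shared cursor, so
// uneven stack depths still balance reasonably well across the gang.
static void parallel_mark(std::vector<JavaThreadModel>& threads, int nworkers) {
  std::atomic<size_t> cursor{0};
  auto worker = [&threads, &cursor]() {
    for (size_t i = cursor.fetch_add(1); i < threads.size(); i = cursor.fetch_add(1)) {
      for (Frame& f : threads[i].frames) {
        if (f.nm != nullptr) {
          f.nm->marked.store(true, std::memory_order_relaxed);
        }
      }
    }
  };
  std::vector<std::thread> gang;
  for (int w = 0; w < nworkers; ++w) gang.emplace_back(worker);
  for (std::thread& t : gang) t.join();
}

int main() {
  std::vector<JavaThreadModel> threads(4000, JavaThreadModel{std::vector<Frame>(50, Frame{nullptr})});
  parallel_mark(threads, 8);   // 8 workers standing in for a GC-sized gang
  return 0;
}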

A followup question: any guesses as to how code cache sweep/eviction quality might be compromised if one were to dispense with these hotness updates entirely (or at a much reduced frequency), as a temporary workaround to the performance problem?

Thoughts/Comments? In particular, has this issue been addressed
perhaps in newer JVMs?

Thanks for any comments, feedback, pointers!
-- ramki

PS: for comparison, here's data with +TraceSafepointCleanup from JDK 7 (first, where this isn't done) vs JDK 8 (where this is done) with a program that has a few thousand threads:



JDK 7:
..
2827.308: [sweeping nmethods, 0.0000020 secs]
2828.679: [sweeping nmethods, 0.0000030 secs]
2829.984: [sweeping nmethods, 0.0000030 secs]
2830.956: [sweeping nmethods, 0.0000030 secs]
..

JDK 8:
..
7368.634: [mark nmethods, 0.0177030 secs]
7369.587: [mark nmethods, 0.0178305 secs]
7370.479: [mark nmethods, 0.0180260 secs]
7371.503: [mark nmethods, 0.0186494 secs]
..
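
Eyeballing those numbers: roughly 18 ms per cleanup in JDK 8 versus a few microseconds in JDK 7 -- three to four orders of magnitude -- and, per the code path above, it is paid at every safepoint, not just at GC pauses.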