RFR (XS) CR 8014233: java.lang.Thread should have @Contended on TLR fields

Aleksey Shipilev aleksey.shipilev at oracle.com
Tue Jun 18 06:56:30 UTC 2013

Hi David,

It depends on the scenario we are assessing. For the sake of argument, let's say every thread has called TLR.current() at least once.

Before the merge:
  Thread maps for ThreadLocal =~ 32 bytes x #threads
  TLR instances + padding =~ (128 + 8?) bytes x #threads

After the merge:
  TLR fields in Thread + padding =~ (2x128 + 16) bytes x #threads

So, there is an additional footprint cost per Thread, but it seems negligible compared to what a native thread already allocates for its native structures (e.g. the stack). Note that @Contended applies larger padding in anticipation of hardware prefetchers also being turned on (the VM can get better at this, though).
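To make the estimate above concrete, here is a minimal sketch that just multiplies out the quoted figures. The class name, method names and the thread count are hypothetical, and the constants are the approximations from this message (including the uncertain "8?"), not measurements.

```java
// Rough per-thread footprint estimate built from the approximate figures
// quoted in this message. Names and the thread count are illustrative only.
final class TlrFootprintEstimate {
    // Before the merge: ThreadLocal map cost plus a padded TLR instance.
    static final long THREAD_LOCAL_MAP_BYTES = 32;
    static final long PADDED_TLR_INSTANCE_BYTES = 128 + 8; // the "8?" is a guess in the original

    // After the merge: @Contended padding plus the three TLR fields themselves.
    static final long CONTENDED_PADDING_BYTES = 2 * 128;
    static final long TLR_FIELD_BYTES = 8 + 4 + 4; // seed (long), probe (int), secondary seed (int)

    static long beforeMergeBytes(long threads) {
        return (THREAD_LOCAL_MAP_BYTES + PADDED_TLR_INSTANCE_BYTES) * threads;
    }

    static long afterMergeBytes(long threads) {
        return (CONTENDED_PADDING_BYTES + TLR_FIELD_BYTES) * threads;
    }

    public static void main(String[] args) {
        long threads = 1_000; // hypothetical thread count
        System.out.printf("before merge: ~%d bytes%n", beforeMergeBytes(threads));
        System.out.printf("after merge:  ~%d bytes%n", afterMergeBytes(threads));
    }
}
```

Under these assumptions the difference is on the order of a hundred bytes per thread.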

Gory details:

**** -XX:-EnableContended: ****

Running 64-bit HotSpot VM. Using compressed references with 3-bit shift. Objects are 8 bytes aligned.

java.lang.Thread
 offset  size                     type description
      0    12                          (assumed to be the object header + first field alignment)
     12     4                      int Thread.priority
     16     8                     long Thread.eetop
     24     8                     long Thread.stackSize
     32     8                     long Thread.nativeParkEventPointer
     40     8                     long Thread.tid
     48     8                     long Thread.threadLocalRandomSeed
     56     4                      int Thread.threadStatus
     60     4                      int Thread.threadLocalRandomProbe
     64     4                      int Thread.threadLocalRandomSecondarySeed
     68     1                  boolean Thread.single_step
     69     1                  boolean Thread.daemon
     70     1                  boolean Thread.stillborn
     71     1                          (alignment/padding gap)
     72     4                   char[] Thread.name
     76     4                   Thread Thread.threadQ
     80     4                 Runnable Thread.target
     84     4              ThreadGroup Thread.group
     88     4              ClassLoader Thread.contextClassLoader
     92     4     AccessControlContext Thread.inheritedAccessControlContext
     96     4           ThreadLocalMap Thread.threadLocals
    100     4           ThreadLocalMap Thread.inheritableThreadLocals
    104     4                   Object Thread.parkBlocker
    108     4            Interruptible Thread.blocker
    112     4                   Object Thread.blockerLock
    116     4 UncaughtExceptionHandler Thread.uncaughtExceptionHandler
    120                                (object boundary, size estimate)
VM reports 120 bytes per instance

**** -XX:+EnableContended: ****

Running 64-bit HotSpot VM. Using compressed references with 3-bit shift. Objects are 8 bytes aligned.

java.lang.Thread
 offset  size                     type description
      0    12                          (assumed to be the object header + first field alignment)
     12     4                      int Thread.priority
     16     8                     long Thread.eetop
     24     8                     long Thread.stackSize
     32     8                     long Thread.nativeParkEventPointer
     40     8                     long Thread.tid
     48     4                      int Thread.threadStatus
     52     1                  boolean Thread.single_step
     53     1                  boolean Thread.daemon
     54     1                  boolean Thread.stillborn
     55     1                          (alignment/padding gap)
     56     4                   char[] Thread.name
     60     4                   Thread Thread.threadQ
     64     4                 Runnable Thread.target
     68     4              ThreadGroup Thread.group
     72     4              ClassLoader Thread.contextClassLoader
     76     4     AccessControlContext Thread.inheritedAccessControlContext
     80     4           ThreadLocalMap Thread.threadLocals
     84     4           ThreadLocalMap Thread.inheritableThreadLocals
     88     4                   Object Thread.parkBlocker
     92     4            Interruptible Thread.blocker
     96     4                   Object Thread.blockerLock
    100     4 UncaughtExceptionHandler Thread.uncaughtExceptionHandler
    104   128                          (alignment/padding gap)
    232     8                     long Thread.threadLocalRandomSeed
    240     4                      int Thread.threadLocalRandomProbe
    244     4                      int Thread.threadLocalRandomSecondarySeed
    248                                (object boundary, size estimate)
VM reports 376 bytes per instance
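The offsets in the two dumps can be cross-checked with a little arithmetic. The sketch below uses only numbers from the dumps; the class and method names are hypothetical. It also assumes that the difference between the 248-byte object boundary and the 376 bytes reported by the VM is trailing padding, which the dump does not show explicitly.

```java
// Cross-check of the -XX:+EnableContended layout using the dumped numbers.
// The "trailing padding" interpretation is an assumption, not tool output.
final class ThreadLayoutCheck {
    static final int SIZE_WITHOUT_CONTENDED = 120; // VM-reported, -XX:-EnableContended
    static final int SIZE_WITH_CONTENDED = 376;    // VM-reported, -XX:+EnableContended
    static final int LAST_REGULAR_FIELD_END = 104; // end of Thread.uncaughtExceptionHandler
    static final int LEADING_GAP = 128;            // alignment/padding gap before the TLR fields
    static final int TLR_FIELD_BYTES = 8 + 4 + 4;  // seed, probe, secondary seed

    static int tlrStartOffset() {
        return LAST_REGULAR_FIELD_END + LEADING_GAP;
    }

    static int tlrEndOffset() {
        return tlrStartOffset() + TLR_FIELD_BYTES;
    }

    // Assumed to be padding after the TLR fields (not listed in the dump).
    static int impliedTrailingPadding() {
        return SIZE_WITH_CONTENDED - tlrEndOffset();
    }

    static int instanceGrowth() {
        return SIZE_WITH_CONTENDED - SIZE_WITHOUT_CONTENDED;
    }

    public static void main(String[] args) {
        System.out.println("TLR fields start at offset " + tlrStartOffset());
        System.out.println("TLR fields end at offset   " + tlrEndOffset());
        System.out.println("implied trailing padding   " + impliedTrailingPadding());
        System.out.println("growth per Thread instance " + instanceGrowth());
    }
}
```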

-Aleksey.

On 06/18/2013 06:03 AM, David Holmes wrote:

Hi Aleksey,

What is the overall change in memory use for this set of changes, i.e. what did we use pre TLR merging and what do we use now?

Thanks,
David

On 17/06/2013 7:00 PM, Aleksey Shipilev wrote:

Hi,

This is the respin of the RFE filed a month ago:
  http://mail.openjdk.java.net/pipermail/core-libs-dev/2013-May/016754.html

The webrev is here:
  http://cr.openjdk.java.net/~shade/8014233/webrev.02/

Testing:
  - JPRT build passes
  - Linux x86_64/release passes jdk/java/lang jtreg
  - vm.quick.testlist, vm.quick-gc.testlist on selected platforms
  - microbenchmarks, see below

The rationale follows. After we merged the ThreadLocalRandom state into Thread, we are now missing the padding that prevents false sharing on those heavily-updated fields. While Thread is already large enough to separate the TLR states of two distinct threads, we can still get false sharing with other Thread fields.

There is a benchmark showcasing this:
  http://cr.openjdk.java.net/~shade/8014233/threadbench.zip

There are two test cases: the first one only calls its own TLR with nextInt() and then reads the current thread's ID; the other reads another thread's ID instead, thus inducing false sharing against that thread's TLR state.

On my 2x2 i5 laptop, running Linux x86_64:
  same:  355 +- 1 ops/usec
  other: 100 +- 5 ops/usec

Note the decrease in throughput because of the false sharing.

With the patch:
  same:  359 +- 1 ops/usec
  other: 356 +- 1 ops/usec

Note the performance is back. We want to avoid these spurious decreases in performance, caused either by an unlucky memory layout or by user code (un)intentionally ruining the cache line locality for the updater thread.

Thanks,
-Aleksey.
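For readers unfamiliar with the effect, the following is a minimal sketch of false sharing in general. It is not the benchmark from threadbench.zip and does not touch Thread or TLR. Two threads each update their own slot of an AtomicLongArray; the slots are either adjacent (likely on the same cache line) or 128 bytes apart. The class name, the stride and the iteration count are all hypothetical choices. It has no warmup and no statistics, so treat the timings as indicative at best; a harness such as JMH would be needed for real numbers.

```java
import java.util.concurrent.atomic.AtomicLongArray;

// Illustrative only: two threads bump separate counters that are either
// adjacent in memory or spaced 128 bytes apart. Not a rigorous benchmark.
final class FalseSharingSketch {
    static final int STRIDE = 16; // 16 longs = 128 bytes between slots

    // Returns { final value of slotA, final value of slotB, elapsed nanos }.
    static long[] run(int slotA, int slotB, long iterations) {
        AtomicLongArray cells = new AtomicLongArray(3 * STRIDE);
        Thread a = new Thread(() -> bump(cells, slotA, iterations));
        Thread b = new Thread(() -> bump(cells, slotB, iterations));
        long start = System.nanoTime();
        a.start();
        b.start();
        try {
            a.join();
            b.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for workers", e);
        }
        long elapsed = System.nanoTime() - start;
        return new long[] { cells.get(slotA), cells.get(slotB), elapsed };
    }

    // Each thread only ever writes to its own slot.
    static void bump(AtomicLongArray cells, int slot, long iterations) {
        for (long i = 0; i < iterations; i++) {
            cells.incrementAndGet(slot);
        }
    }

    public static void main(String[] args) {
        long iterations = 5_000_000L; // hypothetical workload size
        long[] adjacent = run(0, 1, iterations);
        long[] spaced = run(STRIDE, 2 * STRIDE, iterations);
        System.out.printf("adjacent slots: %d ms%n", adjacent[2] / 1_000_000);
        System.out.printf("spaced slots:   %d ms%n", spaced[2] / 1_000_000);
    }
}
```

On typical multi-core hardware the adjacent case tends to run slower, but the size of the gap depends on the machine and the JVM.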

More information about the core-libs-dev mailing list