Performance of techniques for correctly implementing lazy initialization
by Doug Lea
[This note was originally sent as email by Doug Lea on the Java Memory Model mailing list in response to questions about the performance of a technique implementing lazy initialization using ThreadLocals -- Bill Pugh]
The main concern, that I should have mentioned before, is that ThreadLocal varies tremendously in speed across JVMs and JDK versions. On most 1.2.x JVMs, performance is so bad in this context that you'd never want to use it. (The main reason is that until 1.3 ThreadLocal internally used WeakHashMaps, which are needlessly heavy. The 1.4 version will in turn be faster than 1.3.)
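For concreteness, the ThreadLocal technique being timed looks roughly like the following sketch (modern generic syntax; in the 1.2/1.3 era the casts were explicit, and the names here are illustrative, not taken from the attached file):

```java
// Sketch of ThreadLocal-based lazy initialization.
// Fast path: one per-thread read, no locking. Slow path: at most one
// synchronized block per thread, to publish the shared instance safely.
class Singleton {
    private static final ThreadLocal<Singleton> perThread = new ThreadLocal<>();
    private static Singleton instance; // guarded by the Singleton.class lock

    public static Singleton getInstance() {
        Singleton s = perThread.get();       // per-thread read on the fast path
        if (s == null) {
            synchronized (Singleton.class) { // first access from this thread
                if (instance == null)
                    instance = new Singleton();
                s = instance;
            }
            perThread.set(s);                // cache for this thread
        }
        return s;
    }
}
```

The cost of `perThread.get()` on the fast path is exactly what varies so much across JVM versions.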
You can usually avoid this uncertainty, though, if you need to.
If you can create and use your own Thread subclass, you can implement your own variants of ThreadLocals. (See Section 2.3.2.1 of the 2nd edition of my CPJ book.) In fact, if you know in advance all of the singletons you'll use, you don't need a table at all; fields in the thread subclass will do. You can squeeze times even further if you can pass in Thread refs directly rather than looking the current thread up each time via Thread.currentThread. The attached file shows examples/hacks. I'm not sure I recommend any of this, but if you are going to go this route, you might as well make it both fast and correct.
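A minimal sketch of the thread-subclass variant, assuming you control thread creation (class and field names here are illustrative):

```java
// Sketch: the per-thread cache is a plain field on a custom Thread
// subclass, so no ThreadLocal map lookup is needed at all.
class AppThread extends Thread {
    Singleton cachedSingleton; // read and written only by this thread

    AppThread(Runnable r) { super(r); }
}

class Singleton {
    private static Singleton instance; // guarded by the Singleton.class lock

    public static Singleton getInstance() {
        Thread t = Thread.currentThread();
        if (t instanceof AppThread) {
            AppThread at = (AppThread) t;
            if (at.cachedSingleton == null)
                at.cachedSingleton = create(); // once per thread
            return at.cachedSingleton;
        }
        return create(); // fallback for threads we don't control
    }

    private static synchronized Singleton create() {
        if (instance == null) instance = new Singleton();
        return instance;
    }
}
```

Passing the AppThread reference around and reading `at.cachedSingleton` directly, instead of calling Thread.currentThread each time, corresponds to the "Direct Field" row in the table below.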
Due to the nice folks at http://www.testdrive.compaq.com, I did test out some of this on alphas. (Testdrive is a very nice service! Anyone can register. It would be great if other MP vendors did this too.)
The fastest versions of Java I could find on MP alphas at testdrive were 1.2.2 VMs on a 2X500 running Tru64 and a 4X667 running linux. The 4-CPU box failed some of Bill's "volatile" tests (at http://www.cs.umd.edu/~pugh/java/memoryModel/). I gather that these JVMs don't use enough barriers even for "old" volatile (which is itself insufficient to guarantee double check).
The machines were NOT idle (load average was usually around one), but repeated tests gave about the same ratios, so these figures are probably in the right ballpark.
Here are the results. (The 3rd and 4th columns are 4-CPU sparc, and the last two columns are results on basically the same tests, taken from my last post.) Table entries are ratios relative to the "Eager" version of Singleton.
CPUs | 4-CPU | 2-CPU | 4-CPU | 4-CPU | 2-CPU | 1-CPU |
---|---|---|---|---|---|---|
chip | alpha | alpha | sparc | sparc | x86 | sparc |
OS | linux | Tru64 | sol 8 | sol 8 | ? | sol 8 |
JDK | 1.2.2 | 1.2.2 | 1.3 | 1.2.2_07 | 1.3 | 1.3 |
Eager | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
Volatile(DCL) | 1.09 | 1.01 | 1.22 | 1.34 | 1.31 | 1.18 |
ThreadLocal | 300.80 | 17.84 | 6.32 | 240.74 | 6.50 | 5.01 |
SimThreadLocal | 4.43 | 4.19 | 4.81 | 2.39 | ? | ? |
Synch | 189.26 | 5.73 | 69.03 | 66.41 | 32.12 | 9.64 |
Thread Field | 2.16 | 2.71 | 4.16 | 2.00 | ? | ? |
Direct Field | 1.00 | 1.25 | 1.18 | 1.29 | ? | ? |
Notes:
- The run on 4-CPU sparc under 1.2.2_07 demonstrates the above remark that ThreadLocal was unusable in this context until 1.3.
- SimThreadLocal handcrafts something close to the 1.4 ThreadLocal implementation, in a way that works on pre-1.4.
- Again, I'm pretty sure the alpha JVMs didn't put in enough barriers in the volatile/DCL code. This is not their fault; they weren't required to. But these results are wrong (too fast) for a properly barriered version. In fact, on this set of runs, NONE of the volatile results are likely to be exactly right (all too fast).
- "Direct Field" differs from "Thread Field" by directly referencing the singleton field off the thread object rather than going through Thread.currentThread. This doesn't apply very often in practice, but shows the best possible results you could ever get via this kind of design.
- As always, remember that this is a microbenchmark, that might not have much relevance to practical use of singletons.
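For reference, the Volatile(DCL) row times double-checked locking on a volatile field, roughly as sketched below. Note that, as the notes above say, "old" (pre-JSR-133) volatile does not make this correct; under the revised memory model (JDK 5+) the volatile read/write ordering does:

```java
// Sketch of double-checked locking with a volatile field.
// Correct only under the revised (JSR-133) memory model.
class Singleton {
    private static volatile Singleton instance;

    public static Singleton getInstance() {
        Singleton s = instance;              // one volatile read, fast path
        if (s == null) {
            synchronized (Singleton.class) {
                s = instance;                // re-check under the lock
                if (s == null)
                    instance = s = new Singleton();
            }
        }
        return s;
    }
}
```

The cost of that one volatile read per call is what the table measures against the plain "Eager" baseline.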