Good question, I've been asking myself that recently. To give you definite numbers, the benchmarks below (in Scala, compiled to virtually the same bytecodes as the equivalent Java code):
var cnt: String = "" val tlocal = new java.lang.ThreadLocal[String] { override def initialValue = "" } def loop_heap_write = { var i = 0 val until = totalwork / threadnum while (i < until) { if (cnt ne "") cnt = "!" i += 1 } cnt } def threadlocal = { var i = 0 val until = totalwork / threadnum while (i < until) { if (tlocal.get eq null) i = until + i + 1 i += 1 } if (i > until) println("thread local value was null " + i) }
available here, were performed on an AMD 4x 2.8 GHz dual-cores and a quad-core i7 with hyperthreading (2.67 GHz).
These are the numbers:
i7
Specs: Intel i7 2x quad-core @ 2.67 GHz Test: scala.threads.ParallelTests
Test name: loop_heap_read
Thread num.: 1 Total tests: 200
Run times: (showing last 5) 9.0069 9.0036 9.0017 9.0084 9.0074 (avg = 9.1034 min = 8.9986 max = 21.0306 )
Thread num.: 2 Total tests: 200
Run times: (showing last 5) 4.5563 4.7128 4.5663 4.5617 4.5724 (avg = 4.6337 min = 4.5509 max = 13.9476 )
Thread num.: 4 Total tests: 200
Run times: (showing last 5) 2.3946 2.3979 2.3934 2.3937 2.3964 (avg = 2.5113 min = 2.3884 max = 13.5496 )
Thread num.: 8 Total tests: 200
Run times: (showing last 5) 2.4479 2.4362 2.4323 2.4472 2.4383 (avg = 2.5562 min = 2.4166 max = 10.3726 )
Test name: threadlocal
Thread num.: 1 Total tests: 200
Run times: (showing last 5) 91.1741 90.8978 90.6181 90.6200 90.6113 (avg = 91.0291 min = 90.6000 max = 129.7501 )
Thread num.: 2 Total tests: 200
Run times: (showing last 5) 45.3838 45.3858 45.6676 45.3772 45.3839 (avg = 46.0555 min = 45.3726 max = 90.7108 )
Thread num.: 4 Total tests: 200
Run times: (showing last 5) 22.8118 22.8135 59.1753 22.8229 22.8172 (avg = 23.9752 min = 22.7951 max = 59.1753 )
Thread num.: 8 Total tests: 200
Run times: (showing last 5) 22.2965 22.2415 22.3438 22.3109 22.4460 (avg = 23.2676 min = 22.2346 max = 50.3583 )
AMD
Specs: AMD 8220 4x dual-core @ 2.8 GHz Test: scala.threads.ParallelTests
Test name: loop_heap_read
Total work: 20000000 Thread num.: 1 Total tests: 200
Run times: (showing last 5) 12.625 12.631 12.634 12.632 12.628 (avg = 12.7333 min = 12.619 max = 26.698 )
Test name: loop_heap_read Total work: 20000000
Run times: (showing last 5) 6.412 6.424 6.408 6.397 6.43 (avg = 6.5367 min = 6.393 max = 19.716 )
Thread num.: 4 Total tests: 200
Run times: (showing last 5) 3.385 4.298 9.7 6.535 3.385 (avg = 5.6079 min = 3.354 max = 21.603 )
Thread num.: 8 Total tests: 200
Run times: (showing last 5) 5.389 5.795 10.818 3.823 3.824 (avg = 5.5810 min = 2.405 max = 19.755 )
Test name: threadlocal
Thread num.: 1 Total tests: 200
Run times: (showing last 5) 200.217 207.335 200.241 207.342 200.23 (avg = 202.2424 min = 200.184 max = 245.369 )
Thread num.: 2 Total tests: 200
Run times: (showing last 5) 100.208 100.199 100.211 103.781 100.215 (avg = 102.2238 min = 100.192 max = 129.505 )
Thread num.: 4 Total tests: 200
Run times: (showing last 5) 62.101 67.629 62.087 52.021 55.766 (avg = 65.6361 min = 50.282 max = 167.433 )
Thread num.: 8 Total tests: 200
Run times: (showing last 5) 40.672 74.301 34.434 41.549 28.119 (avg = 54.7701 min = 28.119 max = 94.424 )
Summary
A thread local is around 10-20x that of the heap read. It also seems to scale well on this JVM implementation and these architectures with the number of processors.
Threads contain a (unsynchronized) hashmap where the key is the currentThreadLocalobject