Saturday, July 21, 2012

Java hiccups and how to beat them

Recently, The H was looking into Java Virtual Machine performance and in the process came across a different kind of performance evaluation tool, jHiccup, which promised to measure the hiccups that afflict applications in the real world. To find out more about this open source tool, The H talked to its author, Gil Tene, CTO of Azul Systems, makers of the high performance JVM Zing.

The H: Why do Java programs get the hiccups?
Gil Tene: Well, all programs get the hiccups. It's just that those running on modern-day managed runtimes (such as a JVM, or .NET, or Dalvik, or Ruby, etc.) have bigger causes for those hiccups than "what we were used to" up to the mid-90s.
We tend to think of computers as things that run continuously, without interruption, and we tend to treat interruptions in this model as "noise". I call any situation where (from the application's point of view) a computer is not running for a while a "hiccup". If the platform you run your application on exhibits "hiccups" with a magnitude that "matters" to your application, application response time behaviour will invariably be affected.
The reality is that computers often stop running programs for short (for an arbitrary definition of "short") periods of time for all sorts of reasons: interrupts and their handling are microsecond-level disruptions. Context switches can reach tens of microseconds, and scheduling delays (when more things are runnable than there are CPUs and the OS scheduler time-slices a CPU) are usually counted in "OS quantums" that range into the milliseconds (a 10 msec quantum being a pretty common occurrence). All this has been around for decades. Modern-day environments add other interesting things, like power-saving modes and hypervisors, that do more time slicing.
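The effect of such platform-level stalls can be observed directly: a thread that sleeps for a fixed interval and compares the time that actually elapsed against the time it asked for will see any "hiccup" show up as extra delay. This is, in essence, the idea behind jHiccup's measurement; the sketch below (class and method names are invented for illustration, and this is not jHiccup's actual implementation) shows the principle:

```java
// Simplified sketch of hiccup measurement: sleep for a known interval and
// treat any extra elapsed time as a platform-induced "hiccup".
// (Illustrative only; jHiccup records such samples into a histogram.)
public class HiccupSketch {
    static final long SLEEP_MS = 1;

    // Returns the observed "hiccup" in ms: extra delay beyond the requested sleep.
    static long measureHiccupOnce() {
        long start = System.nanoTime();
        try {
            Thread.sleep(SLEEP_MS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        return Math.max(0, elapsedMs - SLEEP_MS);
    }

    public static void main(String[] args) {
        long maxHiccup = 0;
        for (int i = 0; i < 100; i++) {
            maxHiccup = Math.max(maxHiccup, measureHiccupOnce());
        }
        System.out.println("Max observed hiccup: " + maxHiccup + " ms");
    }
}
```

On a quiet machine the reported maximum stays near zero; scheduling delays, power management or a GC pause in the measuring JVM itself will push it up.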
However, the biggest "gift" of all to modern-day hiccups usually comes from garbage collection. Most modern languages (most of those created since 1995, and many from before then) use some form of implicit, automatic memory management, with the managed runtime environment they run on performing implicit garbage collection. Java is such an environment – arguably the most successful one in history – and most people who run on or use a Java JVM know what garbage collection is: it's that thing that makes their application pause every once in a while.
In most modern-day JVMs (e.g. Oracle HotSpot and JRockit, OpenJDK, IBM J9) garbage collection invariably involves "stop-the-world" events. These stop-the-world events range from frequent "young generation" or "minor" collections that usually come in "hiccup" chunks of tens or hundreds of milliseconds, to the bigger "major" collections (aka "oldgen", "Full GC") that can take several seconds to complete. The length of the bigger GC "hiccups" usually depends on the amount of live data in the program at the time of collection. The larger the live data set (and the larger the heap, for some/most JVMs), the larger the pause you can expect to see.
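The link between allocation, live data and collection activity can be watched directly. The sketch below (illustrative, not from the interview; the class name and the 1:10 retention ratio are made up) churns through short-lived allocations while growing a retained live set. Run with GC logging enabled, e.g. `java -verbose:gc GcPressure`, the minor collections show up as the garbage accumulates, and the growing live set lengthens them:

```java
import java.util.ArrayList;
import java.util.List;

// Allocation-pressure sketch: most chunks become garbage immediately,
// but every 10th one is retained, so the live set steadily grows.
public class GcPressure {
    // Returns the number of retained (live) chunks after 'iterations' allocations.
    static int allocate(int iterations) {
        List<byte[]> liveSet = new ArrayList<>();
        for (int i = 0; i < iterations; i++) {
            byte[] chunk = new byte[1024];  // mostly short-lived garbage...
            if (i % 10 == 0) {
                liveSet.add(chunk);         // ...plus a steadily growing live set
            }
        }
        return liveSet.size();
    }

    public static void main(String[] args) {
        System.out.println("Retained " + allocate(1_000_000) + " chunks");
    }
}
```

Shrinking the heap (e.g. `-Xmx256m`) or increasing the retained fraction makes the effect more pronounced, matching the point above: the bigger the live set relative to the heap, the longer the pauses.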
When a garbage collector "stops the world" for any reason, applications will experience a "hiccup" in execution. Whenever this "hiccup" reaches a magnitude that matters to the application, this effect will typically dominate the worst-case and high-percentile (e.g. 99.9%) response time behaviour of the application. For example, if an application typically responds in sub-second times to user requests, having regularly occurring 3 second "major" GC pauses would dominate the application's response time behaviour. If an application is expected to respond in milliseconds (as is common in Telco and financial applications), even the frequent "minor" GC events tend to dominate the application's response time characteristics.
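To see why occasional pauses dominate the high percentiles, consider a made-up but representative workload: 10,000 requests where 9,980 complete in 10 ms and 20 happen to land behind a 3-second GC pause. Only 0.2% of requests are slow, yet the 99.9th percentile already reports the full pause length (the numbers and the nearest-rank percentile method here are illustrative, not from the interview):

```java
import java.util.Arrays;

// Shows how a handful of GC-stalled requests dominate the 99.9th
// percentile even though the vast majority of requests are fast.
public class PercentileDemo {
    // Nearest-rank percentile over a sorted copy of the data.
    static long percentile(long[] latenciesMs, double pct) {
        long[] sorted = latenciesMs.clone();
        Arrays.sort(sorted);
        int idx = (int) Math.ceil((pct / 100.0) * sorted.length) - 1;
        return sorted[Math.max(0, idx)];
    }

    public static void main(String[] args) {
        long[] latencies = new long[10_000];
        Arrays.fill(latencies, 10);       // typical requests: 10 ms
        for (int i = 0; i < 20; i++) {
            latencies[i] = 3_000;         // 20 requests caught behind a 3 s GC pause
        }
        System.out.println("median = " + percentile(latencies, 50.0) + " ms");
        System.out.println("99.9%  = " + percentile(latencies, 99.9) + " ms");
        // median = 10 ms, 99.9% = 3000 ms
    }
}
```

The median is completely blind to the pauses; the 99.9th percentile is defined by them, which is exactly why averages and medians say so little about hiccup-prone systems.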
The H: Is how those hiccups manifest dependent on the VM?
GT: Yes. More precisely, it depends on the garbage collector used, and the garbage collector is a key part of a JVM.
A lot of work has been done on garbage collectors on current JVMs, but most of that work hasn't focused on eliminating hiccups or reducing their absolute magnitude. Instead it has been focused on making them more "rare" through a combination of intricate tuning efforts and garbage collector tricks. The commonly used JVMs on the market today all exhibit significant stop-the-world pauses on a regular basis – something that is readily observable by looking at "hiccup charts" that can be collected (with jHiccup) for pretty much any Java application. Most Java applications with more than a GB or two of heap will exhibit multi-second hiccups multiple times per day or week, and hiccups in the tens or hundreds of milliseconds several times per minute. When heap sizes grow and allocation rates increase to consume more than a tiny fraction of a modern-day commodity server, these numbers tend to get much worse.
As an example of how the choice of VM can dramatically affect the hiccup levels that an application experiences, you can look at the behaviour of Azul's Zing JVM. Zing is specifically focused on dramatically reducing both the absolute magnitude and frequency of application hiccups experienced due to garbage collection. Zing's C4 garbage collector uses a concurrent algorithm for both minor and major collections, allowing it to perform garbage collection without incurring the long stop-the-world events that are inevitable on other JVMs.
