Saturday, July 21, 2012

Curing the hiccups

The H: Generally, what should you do about Java VM hiccups?
GT: First, you should watch for and be aware of any hiccups in your application behaviour, preferably at all times (including in production). Next, you should compare the hiccup behaviour to your expected and required responsiveness behaviour, e.g. a Service Level Expectation (SLE) or Service Level Agreement (SLA). With that information at hand, it is the job of the architect, developer, and/or operations and deployment specialists to make the application behave as it should, which generally means removing hiccups that violate the expected behaviour by any means necessary (but not dwelling too much on ones that do not violate the expected behaviour).
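As a rough illustration of "watching for hiccups" and comparing them to an SLE, here is a minimal Java sketch. It is not the interviewee's tooling: the class `HiccupMeter`, its method names, the 1 ms sampling interval and the 20 ms limit are all assumptions made up for this example. The idea is simply that a thread which asks to sleep for a short interval and wakes up late has observed a stall that application threads would probably have seen too.

```java
import java.util.Arrays;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.LockSupport;

/** Hypothetical hiccup meter: an illustrative sketch, not a real product or library. */
final class HiccupMeter {

    /**
     * Pauses for intervalNanos, 'samples' times, and records how much longer
     * than requested each pause took. The excess approximates a stall that
     * any other thread in this JVM could have experienced at that moment.
     */
    static long[] sampleHiccupsNanos(int samples, long intervalNanos) {
        long[] hiccups = new long[samples];
        for (int i = 0; i < samples; i++) {
            long start = System.nanoTime();
            LockSupport.parkNanos(intervalNanos);
            long elapsed = System.nanoTime() - start;
            // parkNanos may return early, so clamp the excess at zero.
            hiccups[i] = Math.max(0L, elapsed - intervalNanos);
        }
        return hiccups;
    }

    /** Nearest-rank percentile (0 < percentile <= 100) of the recorded values. */
    static long percentile(long[] values, double percentile) {
        if (values.length == 0 || percentile <= 0.0 || percentile > 100.0) {
            throw new IllegalArgumentException("need samples and 0 < percentile <= 100");
        }
        long[] sorted = values.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(percentile / 100.0 * sorted.length);
        return sorted[Math.max(0, rank - 1)];
    }

    /** True when the chosen percentile of hiccups stays within the limit. */
    static boolean meetsExpectation(long[] hiccupsNanos, double percentile, long limitNanos) {
        return percentile(hiccupsNanos, percentile) <= limitNanos;
    }

    public static void main(String[] args) {
        long[] hiccups = sampleHiccupsNanos(2_000, TimeUnit.MILLISECONDS.toNanos(1));
        long limit = TimeUnit.MILLISECONDS.toNanos(20); // assumed SLE, for illustration only
        System.out.printf("p99.9 = %d us, max = %d us, within expectation: %b%n",
                TimeUnit.NANOSECONDS.toMicros(percentile(hiccups, 99.9)),
                TimeUnit.NANOSECONDS.toMicros(percentile(hiccups, 100.0)),
                meetsExpectation(hiccups, 99.9, limit));
    }
}
```

In a real deployment the sampling loop would run continuously on its own thread alongside the application, and the percentile and limit would come from the actual SLE or SLA rather than from the placeholder values used here.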
So it is those hiccups – the ones that violate your application's expected behaviour – for which you need to figure out the cause and how to deal with it. It has been our experience that the nearly universal dominant cause of application hiccups on most JVMs is Garbage Collection pauses, and that these dwarf other causes in both size and frequency. Eliminating GC pauses will usually leave system-related hiccups that range in the low tens of milliseconds, which can also be addressed if needed (e.g. for a low latency application) through tuning of system settings (e.g. power management modes, swap and filesystem behaviour, and avoiding deep scheduling queues).
In keeping the GC-related JVM hiccups to within expected levels, you are currently faced with three choices:
1. Continuously tune various GC controls and parameters in the hope of reducing the frequency of very large events (e.g. full GCs) to "acceptable" levels, and the frequency and magnitude of "merely large" events (e.g. "minor" GCs, young gen collections, etc.) to acceptable levels as well. These efforts will typically result in a compromise about the acceptable percentile of certain magnitudes of hiccups, because complete elimination of GC pauses of either kind is not practical on most JVMs.
2. Combine continuous tuning with coding in a "GC friendly" or "Heap friendly" way, in an attempt to reduce the pressure on a garbage collector's young or old generation, with the hope of reducing the occurrence or magnitude of "Bad GC" events. This practice (which has been successful to some degree in low latency applications) typically results in what I call "programming in Java syntax, but without the Java ecosystem", since, for it to be successful, the use of many core behaviours, as well as of any third-party code, is typically prohibited.
3. Address the core problem and eliminate GC altogether as a dominant cause of application hiccups. This is where Azul's flagship product – Zing – comes in. Zing's use of the C4 collector (which stands for "Continuously Concurrent Compacting Collector") simply and completely eliminates GC as a dominant cause for application hiccups, and does so without any special tuning or coding efforts or practices, and without the need for continuously re-tuning and/or re-coding.
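To make the "GC friendly" coding style of option 2 concrete, here is a small Java sketch of the kind of code it tends to produce. The class `LatencyWindow` is hypothetical and invented for this example. It keeps recent measurements in a primitive array that is allocated once and then overwritten in place, rather than in a growing collection of boxed values, so recording a sample creates no garbage for the collector to deal with.

```java
/** Hypothetical example of "GC friendly" code: fixed storage, no per-call allocation. */
final class LatencyWindow {
    private final long[] samples; // allocated once, then reused for the life of the object
    private int next;             // slot that the next record() call will overwrite
    private int count;            // number of valid samples, capped at the capacity

    LatencyWindow(int capacity) {
        if (capacity <= 0) {
            throw new IllegalArgumentException("capacity must be positive");
        }
        samples = new long[capacity];
    }

    /** Stores a value by overwriting the oldest slot; allocates nothing. */
    void record(long value) {
        samples[next] = value;
        next = (next + 1) % samples.length;
        if (count < samples.length) {
            count++;
        }
    }

    /** Largest value currently held, or 0 if nothing has been recorded yet. */
    long max() {
        if (count == 0) {
            return 0L;
        }
        long max = Long.MIN_VALUE;
        for (int i = 0; i < count; i++) {
            max = Math.max(max, samples[i]);
        }
        return max;
    }

    /** Number of samples currently held. */
    int size() {
        return count;
    }
}
```

The trade-off described in the interview is visible even at this scale: the class avoids standard collections and boxed types entirely, which is exactly the sort of restriction that makes this style hard to combine with ordinary library and third-party code.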
Obviously, we believe Zing is a cure for the hiccups. It has been our experience that Zing will immediately bring enterprise applications to a worst case hiccup level in the low tens of milliseconds right out of the box. Those remaining hiccup levels, once Zing is deployed, are typically dominated by non-JVM system artifacts (e.g. scheduling pressure with many runnable threads competing for CPUs, or OS and hardware settings around power management and swap/file behaviour, all of which can be tuned and addressed if needed). In the low latency world, where scheduling pressure is already avoided as a matter of normal practice, we've seen applications immediately reach a worst case of below 10 milliseconds, and with relatively little tuning get to 1-2 millisecond worst case levels. Going below the 1 millisecond mark is quite possible, for those brave enough to tune their system (not their JVM) to provide that level of consistency, and to make their code provide it.
