Saturday, July 21, 2012

Diagnosing hiccups

The H: A lot of developers might say "But I can see when GC kicks in so I don't need a measuring tool for it"; what would you say to them?
GT: First, jHiccup reports on any observed hiccups, regardless of cause. Hiccups can occur for a multitude of possible reasons, including GC, power savings artifacts, swapping, scheduling pressure, and other OS-level artifacts such as Transparent Huge Page compaction that is done in recent Linux kernels.
Second, jHiccup reports what an application actually experiences and observes, as opposed to what a log file for a JVM or an OS may "claim" has happened. Often reporting in JVM and OS logs is "honest", in which case the hiccups jHiccup will report should closely match up with pauses seen in GC logs. However, sometimes the times or events the JVM reports may be only part of the picture. For example, if you compare pause times reported by verbose GC logs with those reported by using additional flags like -XX:+PrintGCApplicationStoppedTime, you'll often see significant discrepancies. With jHiccup, you don't have to wonder if you have turned on the appropriate logging, and if the logging is accurate, optimistic, or pessimistic – you have a log of *observed* discontinuities in execution, and any discontinuity larger than 1msec simply cannot hide.
I recommend using jHiccup in addition to (and not in place of) other monitoring mechanisms. Results seen with other measurement tools should almost always be worse than those reported by jHiccup (since jHiccup reports the hiccups seen when trying to do absolutely no work). If jHiccup results conflict with those measured through other means (e.g. if an external response time measure shows much better percentile results than those observed by jHiccup), then either jHiccup or the other measurement mechanism probably has a bug or a "methodology problem".
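The idea of measuring hiccups "when trying to do absolutely no work" can be illustrated with a minimal sketch. This is not jHiccup's actual code; the class name `HiccupSketch`, its methods, and the 1 msec interval are assumptions chosen for illustration. A loop asks to sleep for a fixed interval and treats any time beyond that interval as an observed hiccup, whatever the cause.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical illustration only: not jHiccup's implementation.
public class HiccupSketch {

    // Sleeps `samples` times for `intervalNanos` each and returns, per sample,
    // how much longer than requested the wake-up took (in nanoseconds).
    static long[] sample(int samples, long intervalNanos) {
        long[] hiccups = new long[samples];
        for (int i = 0; i < samples; i++) {
            long start = System.nanoTime();
            try {
                TimeUnit.NANOSECONDS.sleep(intervalNanos);
            } catch (InterruptedException e) {
                // Restore the interrupt flag and stop sampling early;
                // the remaining entries stay at zero.
                Thread.currentThread().interrupt();
                break;
            }
            long elapsed = System.nanoTime() - start;
            // Any time beyond the requested interval counts as a hiccup.
            hiccups[i] = Math.max(0L, elapsed - intervalNanos);
        }
        return hiccups;
    }

    // Returns the largest value in the array, or 0 for an empty array.
    static long max(long[] values) {
        long best = 0L;
        for (long v : values) {
            best = Math.max(best, v);
        }
        return best;
    }

    public static void main(String[] args) {
        // Assumed settings: 1000 samples at a 1 msec interval.
        long[] hiccups = sample(1000, 1_000_000L);
        System.out.printf("max observed hiccup: %.3f ms%n", max(hiccups) / 1e6);
    }
}
```

Because the loop does no work of its own, whatever it records comes from the platform (JVM, OS, hardware) rather than from application logic. Note that the recorded values also include ordinary timer and scheduler overhead, so a real tool would need to account for that baseline.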
The H: You've placed jHiccup in the public domain using the CC zero licence. Any particular reason for that uncommon licensing choice?
GT: It's the most permissive way I know of to allow people to use the code in any and all forms. Placing the code in the public domain removes any questions about conflict with other forms of licensing. I followed the example of what Doug Lea has done with much of his work, such as his extremely well adopted dlmalloc, as well as java.util.concurrent (all the sources originated by the JSR166 group were similarly placed in the public domain).
The H: Are there any enhancements you'd like to see?
GT: I'd love to see people do more with jHiccup, both in terms of using it as-is, and by incorporating its simple measurement technique as a common way of self-measurement by applications and application platforms.
I'd be happy to see people build non-Java versions (e.g. for the various .NET languages, as well as for Ruby, Python, etc., and maybe even for C/C++). I think the issue of implicitly assuming "platform continuity" for applications, and of ignoring or under-reporting "platform discontinuity", is universal, and is especially prevalent in managed runtimes with automatic memory management (aka GC).
The H: How about integrating jHiccup with management platforms so people can spot hiccuppy JVMs?
GT: jHiccup was intentionally kept simple, and intentionally kept separate from any specific JVM, OS, or application platform. It would be trivial to take the data collected and reported by jHiccup and incorporate it for presentation as part of an overall monitoring or management solution. It's been placed in the public domain with exactly this sort of thing in mind.
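As a rough sketch of what such an integration might compute, the snippet below summarises a set of hiccup samples as percentiles, the kind of figure a monitoring dashboard could display. It assumes the samples have already been obtained as an array of millisecond values; the class name `HiccupPercentiles`, the nearest-rank method, and the sample data are illustrative assumptions, and reading jHiccup's actual output is deliberately left out.

```java
import java.util.Arrays;

// Hypothetical illustration only: not part of jHiccup.
public class HiccupPercentiles {

    // Nearest-rank percentile: the smallest sample such that at least
    // `percentile` percent of all samples are less than or equal to it.
    static double percentile(double[] samplesMs, double percentile) {
        if (samplesMs.length == 0) {
            throw new IllegalArgumentException("no samples");
        }
        if (percentile <= 0.0 || percentile > 100.0) {
            throw new IllegalArgumentException("percentile must be in (0, 100]");
        }
        double[] sorted = samplesMs.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(percentile * sorted.length / 100.0);
        return sorted[Math.max(rank, 1) - 1];
    }

    public static void main(String[] args) {
        // Made-up sample data for illustration, in milliseconds.
        double[] samplesMs = {0.1, 0.2, 0.1, 0.3, 12.0, 0.2, 0.1, 0.4, 0.2, 0.1};
        System.out.println("median: " + percentile(samplesMs, 50.0) + " ms");
        System.out.println("90th:   " + percentile(samplesMs, 90.0) + " ms");
        System.out.println("max:    " + percentile(samplesMs, 100.0) + " ms");
    }
}
```

Reporting high percentiles and the maximum, rather than only an average, keeps rare but large hiccups visible in the summary.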
The H: Does Zing detect its own hiccups if they happen or does C4 prevent that?
GT: In a way, jHiccup is there to keep everyone honest, and that includes Zing. No amount of logging or reporting by a runtime or an OS can be used as a replacement for logging the actual experiences that an application would see on such a platform.
Zing seems to do a fairly good job of reporting on what it thinks it is doing, and GC-related hiccups tend to be a complete non-issue for Zing users. That's what C4 is meant for, after all. However, the best way for them to actually know that is to observe it with tools like jHiccup, rather than to believe what our JVM logs say.
With Zing, we most often find that observed hiccups are no longer dominated by GC effects, and as a result tend to be dramatically lower than those seen with other JVMs. However, hiccups in the multi-msec range can easily be seen, with causes ranging from scheduling pressure (having even a momentary situation with more runnable threads than available cores generally leads to ~10msec+ hiccups being observed), to power-management tuning artifacts, to background cron jobs being kicked off. In the very-low-latency space, we've seen people successfully tune "vanilla" Linux configurations running Zing such that the worst case observed hiccup levels (as reported by jHiccup) were kept well below 1msec.
