So yesterday was the 6/30/2015 leap second day and boy was it (not) fun.
We’ve had previous leap second events and they also were (not) a barrel of laughs. The last one triggered a myriad of Java Virtual Machine process malfunctions, especially among services that depended on Zookeeper sessions for distributed system coordination; remedying it required finding the errant processes and restarting them. Klout uses HBase as its distributed big data store; we have multiple clusters for different workloads, and HBase’s region (essentially, data shard) management has proven to be very sensitive to Zookeeper anomalies. The leap second incident prior to that caused a spin-up in JVM resource consumption; it actually triggered a power surge and tripped a circuit. That data center incident took many hours to work through as we got power restored and the clusters running again.
Since those prior leap second events, we’ve upgraded our JVM versions in a number of our clusters, and some of our HBase clusters were recently refreshed with updated hardware and software versions. The HBase version we updated to isn’t the final one we’re landing on this year, due to compatibility with extant components, but the interim state we landed on was a big improvement. The data infrastructure, data science, Lithium community analytics, and Klout core services teams have been working together over the past several months as we consolidate Klout’s and Lithium’s workloads onto common infrastructure, update versions, and normalize the division of operational responsibilities. We’re very excited about the version of HBase we’re ultimately moving to; it improves numerous performance and stability characteristics.
Anticipating the 6/30/2015 leap second, we had our engineers on stand-by. Since the prior leap second events were unlike each other, it was difficult to predict what this one would look like. The one area we expected to be most susceptible to anomalies was, of course, HBase. Worst case, we expected we’d need to restart one or more of our HBase clusters, be able to do so in 10-15 minutes, and suffer an outage of no more than 15-20 minutes. The hour struck and when Klout went down, we immediately homed in on our HBase clusters. Klout’s service is highly dependent upon HBase – if it’s not operating correctly, Klout’s service is not either. The HBase clusters that were updated earlier this year were fine; the ones we’re planning to refresh were not (we had a timeline to get this done in mid-June, but too many slips in the project’s predecessors had pushed it to late July).
This is where things got weird. HBase relies on a distributed filesystem (HDFS), and during cluster restarts, while the cluster master was processing its transaction logs (the SplitLogManager), it reported that the filesystem “went away” with its recovery in mid-flight. From outward appearances, HDFS seemed fine: none of the runtimes were misbehaving and data integrity seemed intact. After a few iterations of that, it seemed clear that something more nuanced was going on. OK, back up and restart HDFS – not a quick process, but one we needed to do to eliminate it as a culprit. Then we restarted the HBase recovery. Although the HDFS restart seemed to help a little and the recovery cadence seemed steadier, it was still troublingly slow. In fact, it seemed to get progressively slower and grind to a halt after a while, but it was no longer hitting the HDFS “went away” condition.
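For the curious, the kind of outside-in sanity check we mean is roughly this – a minimal sketch using the Hadoop FileSystem client API. The NameNode address and the /hbase root directory here are placeholders, not our actual setup, and this is illustrative rather than the exact tooling we ran:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsSanityCheck {
    public static void main(String[] args) throws Exception {
        // Hypothetical NameNode URI and HBase root dir -- substitute your own.
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020");

        try (FileSystem fs = FileSystem.get(conf)) {
            // If the NameNode is unreachable, these calls retry and eventually
            // throw; a prompt listing means the filesystem is at least reachable
            // and serving metadata.
            Path hbaseRoot = new Path("/hbase");
            for (FileStatus status : fs.listStatus(hbaseRoot)) {
                System.out.println(status.getPath() + "\t" + status.getLen());
            }
        }
    }
}
```

Checks like this kept coming back clean, which is why the master’s “went away” complaints were so puzzling.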
The region servers seemed fine. We hooked up jconsole to the JMX port we have exposed on the HBase master. At first, it reported nothing alarming: heap usage looked reasonable and garbage collections looked like pretty short, sub-second episodes. Until we hit one long into the recovery cycle – a two-minute stop-the-world garbage collection pause! The long pauses are killers: the master’s Zookeeper session times out, its ephemeral node disappears, region servers think it’s dead, and they don’t recover when the pause has passed; the cluster essentially dies. We raised the Zookeeper session timeouts to three minutes, but the next iteration hit the same cluster-death condition. Thinking that maybe the “tenured” generation of objects cleaned up by the concurrent mark/sweep garbage collector had too much work to do, we dialed down the CMSInitiatingOccupancyFraction JVM parameter and restarted. It ran its course for a long time and then hit the same two-minute pause and cluster death. Upon closer observation, the pauses were in the ParNew (Parallel New) garbage collector, which handles the “young generation” of JVM objects. My intimacy with JVM tuning isn’t especially deep, so I may have this wrong, but my understanding is that if there’s too much fragmented garbage in the young generation – lots of little objects – the garbage collection pause becomes a long stop-the-world event. We were four hours into the downtime and still stymied by this failed HBase recovery. We had the NewSize and MaxNewSize JVM parameters set to 192m; we dialed them down to 128m and restarted: voila! The recovery cadence was faster by a factor of five and never hit any long GC pauses – within 15 minutes we had the cluster recovered and restored service.
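If you’d rather not babysit jconsole, the same GC counters it displays are available programmatically through the JMX MXBeans. Below is a minimal sketch that polls the collectors of its own JVM and flags intervals with heavy collection time; watching a remote process like the HBase master means connecting a JMXConnector to its exposed JMX port instead, but the MXBean data is the same. The one-second threshold and five-second polling interval are arbitrary choices for illustration:

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class GcPauseWatcher {
    public static void main(String[] args) throws InterruptedException {
        // Previous cumulative collection time (ms) per collector name.
        Map<String, Long> lastTime = new HashMap<>();

        while (true) {
            List<GarbageCollectorMXBean> gcs = ManagementFactory.getGarbageCollectorMXBeans();
            for (GarbageCollectorMXBean gc : gcs) {
                long total = gc.getCollectionTime();  // cumulative ms spent in this collector
                long delta = total - lastTime.getOrDefault(gc.getName(), 0L);
                lastTime.put(gc.getName(), total);
                if (delta > 1000) {                   // flag anything over a second in this interval
                    System.out.printf("%s spent %d ms collecting in the last interval (%d collections total)%n",
                            gc.getName(), delta, gc.getCollectionCount());
                }
            }
            Thread.sleep(5000);
        }
    }
}
```

The eventual fix itself was just JVM flags on the master – e.g. -XX:NewSize=128m -XX:MaxNewSize=128m, alongside the CMSInitiatingOccupancyFraction adjustment – typically set via the HBase options in hbase-env.sh.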
We turned our attention to other internally facing services that were running older JVMs and needed to be restarted. Once things stabilized, we came up for air and reckoned with how things ran so far off the tracks. We’ve had HBase cluster restarts before; they are service-interrupting but generally not long-lived efforts. We’re still not clear why the SplitLogManager’s work was any different from any other cluster restart, requiring JVM tuning parameters different from our otherwise stable-for-years settings.
Some key take-aways:
- we need to complete our upgrades to get onto updated JVM versions everywhere and newer versions of our big data infrastructure
- when preparing for an anticipated incident, designate a communications coordinator – when things got weird, we needed someone to tell the rest of the staff that things had gotten really weird, so decisions could be made about what further communication was required
- our teams need more smart people who are experienced with Java programming, distributed systems, JVM tuning, and Linux systems – is that you? Join us!
We expect there will be another leap second within the next 18 months or so, and we should be on a much better footing by then with updated JVMs and infrastructure. Until then, we hope this peek into the fun (or not) of running complex big data infrastructure at scale was entertaining!