Node Stability and Memory Leak Fix
Hey everyone, wanted to give a quick update on node stability and our short-term future development plans.
As Tim announced yesterday, the goal this week is to have 6 community pools up, in sync, and being mined on. We’re actively partnering with interested pool operators and helping troubleshoot any issues they’re running into.
As many of you probably remember, roughly 3 weeks ago we were experiencing a memory leak using gRPC v1.10, which manifested itself as something like:
(note how NodeCore-related objects are using <1 GB, and gRPC is taking every last byte of remaining heap space).
(Note that a single channel instance spins out of control, rather than several channels erroring out simultaneously. This implies that a single event in one channel caused something in the depths of gRPC/Netty to run away in a matter of seconds.)
We updated to gRPC v1.14 in a subsequent release (v1.11 and v1.12 both fixed potential channel memory leaks), and the incidence of these crashes went down significantly. However, the issue still occurs at random.
This is also not the kind of memory leak that gradually drips over time; an instance of NodeCore can run successfully for 48+ hours with no memory leak whatsoever, or can crash 2 minutes after startup. Additionally, the time between when the leak starts and when it causes NodeCore to crash is generally less than a minute. For example, here’s a screenshot of a profiler attached to an instance of NodeCore which crashed due to gRPC memory consumption:
This node ran successfully for ~2 hours before encountering the random memory crash. At timestamp 120:07 the memory leak began; by 120:30 it had consumed all of the available heap, and >95% of the CPU was being spent on garbage collection that was unable to free any memory.
If this node had been a pool, it would have begun failing to respond to share submissions and failing to keep up with the blockchain until the JVM eventually killed it with an OOM error.
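For anyone who wants to watch for this pattern on their own node, here is a rough sketch of how the failure signature described above (heap nearly full while most time goes to garbage collection) could be detected from inside a JVM using the standard `java.lang.management` beans. This is not NodeCore code: the class name `HeapPressureMonitor` and both thresholds are made up for illustration, and the GC figure is the collectors' reported elapsed time, which only approximates the CPU share a profiler would show.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

public class HeapPressureMonitor {
    // Hypothetical thresholds chosen for illustration, not values used by NodeCore.
    static final double HEAP_THRESHOLD = 0.90; // fraction of max heap in use
    static final double GC_THRESHOLD = 0.50;   // fraction of wall time spent in GC

    // Fraction of the maximum heap currently used (0.0 if the max is undefined).
    static double heapFraction(long usedBytes, long maxBytes) {
        return maxBytes <= 0 ? 0.0 : (double) usedBytes / maxBytes;
    }

    // Fraction of an interval spent collecting garbage.
    static double gcFraction(long gcMillisDelta, long wallMillisDelta) {
        return wallMillisDelta <= 0 ? 0.0 : (double) gcMillisDelta / wallMillisDelta;
    }

    // True when the heap is nearly full AND GC is eating most of the interval.
    static boolean underPressure(double heapFraction, double gcFraction) {
        return heapFraction >= HEAP_THRESHOLD && gcFraction >= GC_THRESHOLD;
    }

    // Sum of accumulated collection time across all collectors, in milliseconds.
    static long totalGcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long t = gc.getCollectionTime(); // -1 when the collector does not report it
            if (t > 0) {
                total += t;
            }
        }
        return total;
    }

    public static void main(String[] args) throws InterruptedException {
        long lastGc = totalGcMillis();
        long lastWall = System.currentTimeMillis();
        // Sample a few times, one second apart; a real monitor would loop forever.
        for (int i = 0; i < 3; i++) {
            Thread.sleep(1000);
            MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
            long nowGc = totalGcMillis();
            long nowWall = System.currentTimeMillis();
            double h = heapFraction(heap.getUsed(), heap.getMax());
            double g = gcFraction(nowGc - lastGc, nowWall - lastWall);
            System.out.printf("heap=%.2f gc=%.2f pressure=%b%n", h, g, underPressure(h, g));
            lastGc = nowGc;
            lastWall = nowWall;
        }
    }
}
```

Because the leak goes from nothing to a full heap in well under a minute, a check like this would need to sample every second or so to be of any use; a once-a-minute poll could miss the whole event.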
The problem seems to be inside gRPC (or something gRPC is using, like Netty).
To maintain NodeCore stability in the meantime, we’ve implemented auto-restart and recovery features while we work on a permanent solution.
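To give a feel for what auto-restart can look like in general, here is a minimal sketch of an external supervisor that relaunches a child process whenever it exits with a non-zero code, backing off between rapid consecutive failures. This is an illustration under our own assumptions, not the actual NodeCore implementation: the class name `RestartSupervisor`, the backoff values, and the ten-minute "healthy run" window are all hypothetical.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

public class RestartSupervisor {
    // Exponential backoff: base * 2^failures, capped at capMillis.
    static long backoffMillis(int consecutiveFailures, long baseMillis, long capMillis) {
        int shift = Math.min(Math.max(consecutiveFailures, 0), 20); // bound the shift
        return Math.min(capMillis, baseMillis << shift);
    }

    // Launches the command, waits for it to exit, and returns its exit code.
    static int runOnce(List<String> command) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command).inheritIO().start();
        return process.waitFor();
    }

    public static void main(String[] args) throws Exception {
        if (args.length == 0) {
            System.err.println("usage: java RestartSupervisor <command> [args...]");
            return;
        }
        List<String> command = Arrays.asList(args);
        int failures = 0;
        while (true) {
            long startedAt = System.currentTimeMillis();
            int exitCode = runOnce(command);
            if (exitCode == 0) {
                break; // clean shutdown: do not restart
            }
            // Hypothetical rule: a run longer than ten minutes resets the failure count.
            boolean ranLong = System.currentTimeMillis() - startedAt > 10 * 60_000L;
            failures = ranLong ? 0 : failures + 1;
            long delay = backoffMillis(failures, 1_000L, 60_000L);
            System.err.printf("child exited with %d, restarting in %d ms%n", exitCode, delay);
            Thread.sleep(delay);
        }
    }
}
```

A supervisor like this only helps if the child actually exits. A JVM stuck in garbage collection can limp along for a while before it dies, so starting the child with the HotSpot flag `-XX:+ExitOnOutOfMemoryError` is one way to make it exit promptly on the first OOM error so the restart can kick in.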