Witness Outage Explained: A Perfect Storm
During the day and night of 11-12 September 2019, my witness stopped producing blocks for several hours. This is an unforgivable failure and should not have happened, given that I had multiple failover nodes running and a monitoring script in place to switch between them automatically.
But it still happened. What was going on?
It appears to have been a perfect storm: my main node seems to have segfaulted (I am still not sure why), and my witnessMonitorFailover script got stuck, so I did not receive any notifications telling me to check. This is the first time this has happened in 1000+ days of producing blocks.
After investigating, it turned out to be an unlikely coincidence: the server provider that hosts my monitoring script carried out unexpected maintenance on the VPS environment rack (?!), which caused the script to become unresponsive. (FYI, I have BitShares-related nodes at 3 different providers in 4 different physical locations.)
The most painful part is that I relied 100% on my failover script. I was missing blocks while literally sitting at my computer. Whenever I launch my browser, an overview page of my witness work opens by default, but I didn't launch a new browser that day. Bummer.
Lessons learned:
- Double redundancy on failover scripts, or at least a monitoring and notification tool on a second server, is not such a bad idea.
- Don't forget to check all nodes before going to bed (!)
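A minimal Python sketch of what a monitoring and notification tool on a second server could look like. This is not the witnessMonitorFailover script itself. The names heartbeat_is_stale, watchdog_check, read_heartbeat and notify are all hypothetical, and the sketch assumes the primary monitoring script publishes a timestamp somewhere the second server can read each time it completes a check.

```python
import time
from typing import Callable, Optional


def heartbeat_is_stale(last_beat: Optional[float], now: float, max_age_s: float) -> bool:
    """Return True if no heartbeat was ever seen or the latest one is too old."""
    return last_beat is None or (now - last_beat) > max_age_s


def watchdog_check(
    read_heartbeat: Callable[[], Optional[float]],  # hypothetical: returns the monitor's last timestamp
    notify: Callable[[str], None],                  # hypothetical: sends a mail, chat message, SMS, ...
    max_age_s: float = 120.0,                       # assumed threshold, tune to the monitor's interval
    now: Optional[float] = None,
) -> bool:
    """Run one check from the second server. Returns True if the primary monitor looks alive."""
    now = time.time() if now is None else now
    try:
        last_beat = read_heartbeat()
    except Exception as exc:
        # Not being able to reach the monitor at all is exactly the case to alert on.
        notify(f"watchdog: cannot read heartbeat of primary monitor ({exc})")
        return False
    if heartbeat_is_stale(last_beat, now, max_age_s):
        notify("watchdog: primary monitor heartbeat is stale, check the witness nodes")
        return False
    return True
```

Run from cron (or a simple loop) on a server at a different provider than the primary monitor, this alerts when the monitor itself goes quiet, which is the failure mode described above. It only notifies; it does not attempt a failover on its own.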
Sorry for unintentionally delaying your transactions. I will learn from this.
See you on the chain & thank you for your continued support,
RoelandP