1. Monitor missing blocks
Whenever a new block is missed you will get a notification. This part of the script can (and will) be extended toward automatically switching to the backup witness signing key once a missed-block threshold is passed.
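A minimal sketch of what that monitoring loop can look like, assuming a local cli_wallet-style HTTP JSON-RPC endpoint exposing "get_witness" and a witness object with a "total_missed" counter; the URL, account name, threshold and notification hook are all placeholders to adapt:

```python
# Missed-block monitor sketch. RPC_URL, WITNESS and the alerting are assumptions.
import time
import requests

RPC_URL = "http://127.0.0.1:8092/rpc"   # hypothetical local cli_wallet endpoint
WITNESS = "my-witness"                   # hypothetical witness account name
MISSED_THRESHOLD = 3                     # act once this many new misses accumulate
POLL_SECONDS = 60

def get_total_missed():
    payload = {"jsonrpc": "2.0", "id": 1,
               "method": "get_witness", "params": [WITNESS]}
    r = requests.post(RPC_URL, json=payload, timeout=10)
    r.raise_for_status()
    return r.json()["result"]["total_missed"]

def notify(message):
    # Placeholder: wire up e-mail, SMS or chat alerts here.
    print(message)

baseline = get_total_missed()
while True:
    time.sleep(POLL_SECONDS)
    missed = get_total_missed()
    if missed > baseline:
        notify(f"{WITNESS} missed {missed - baseline} block(s) since baseline")
    if missed - baseline >= MISSED_THRESHOLD:
        notify("Threshold passed - this is where automated key switching would go")
        baseline = missed  # reset the baseline so the alert does not repeat forever
```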
Several witnesses have attempted to code an automatic failover algorithm, but I don't believe any succeeded without introducing new problems.
One important thing to consider: you absolutely do NOT want two nodes producing blocks for the same witness, as that is sure to cause havoc and fork the network.
Whenever I switch production using the "update_witness" API call I manually make sure both the old and the new witness node are listening and in sync before I execute the call. I usually submit the call on the old witness going out of production, not the new node going into production. I can then use the get_witness API call to verify the signing key for the new node is in effect before I shut down the old witness node.
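The same sequence, sketched in Python against hypothetical cli_wallet-style endpoints on each node. The "info", "update_witness" and "get_witness" method names mirror the cli_wallet, but the URLs, field names and the update_witness parameter order (name, url, signing_key, broadcast) are assumptions to verify against your own wallet, which must also be unlocked and hold the witness's keys:

```python
# Manual switch-over sketch: sync check, key update from the OLD node, verify, then shut down.
import time
import requests

OLD_NODE_RPC = "http://old-witness:8092/rpc"   # hypothetical node going out of production
NEW_NODE_RPC = "http://new-witness:8092/rpc"   # hypothetical node going into production
WITNESS = "my-witness"
NEW_SIGNING_KEY = "BTS..."                     # public key already configured on the new node

def rpc(url, method, params=None):
    payload = {"jsonrpc": "2.0", "id": 1, "method": method, "params": params or []}
    r = requests.post(url, json=payload, timeout=10)
    r.raise_for_status()
    return r.json()["result"]

def head_block(url):
    return rpc(url, "info")["head_block_num"]

# 1. Make sure both nodes are listening and in sync before doing anything.
old_head, new_head = head_block(OLD_NODE_RPC), head_block(NEW_NODE_RPC)
assert abs(old_head - new_head) <= 1, "nodes are not in sync - do not switch yet"

# 2. Submit update_witness from the OLD node, pointing the witness at the new signing key.
rpc(OLD_NODE_RPC, "update_witness", [WITNESS, "", NEW_SIGNING_KEY, True])

# 3. Verify the new signing key is in effect before shutting down the old witness node.
while rpc(OLD_NODE_RPC, "get_witness", [WITNESS])["signing_key"] != NEW_SIGNING_KEY:
    time.sleep(3)
print("New signing key is active; safe to shut down the old witness node.")
```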
The difficulty is coming up with a reliable way to know for certain that the node you want to take out of production will not be able to generate blocks after you switch production to another node. Suppose the "aberrant" node has not crashed and is still running, but is cut off from the net (or the watchdog listener is cut off from that node), and the watchdog falsely concludes it is dead. The watchdog may broadcast a new signing key, causing a new node to take over; then the network to the aberrant server is restored, and it resumes communications still thinking it is the block producer, generating blocks alongside the failover node. As far as the aberrant node is concerned it never saw the new signing key and never thought it was offline, so it keeps generating a block whenever its turn comes around.
When the block producer fails it may not be possible to determine for certain why, or to get confirmation that it will not resume block production. You will need to determine whether the OS of the failing node is responding but the app is not, in which case failover may be possible if you build in some kind of communication to restart the witness_node app or the entire OS. The issue is: what if you can't communicate with the failing node at all? Is it dead or just temporarily cut off? Will it fork the network if it comes back online?
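A sketch of that distinction, assuming the watchdog can probe the host's SSH port for OS liveness and the witness RPC port for app liveness; the host, ports and restart command are placeholders, and the point is the decision logic, not the specifics:

```python
# Is the OS reachable while the witness app is not, or is the whole host unreachable?
import socket
import subprocess

HOST = "witness.example.net"   # hypothetical failing node
SSH_PORT = 22
RPC_PORT = 8090                # hypothetical witness_node RPC port

def tcp_alive(host, port, timeout=5):
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

os_alive = tcp_alive(HOST, SSH_PORT)
app_alive = tcp_alive(HOST, RPC_PORT)

if os_alive and not app_alive:
    # OS responds but the app does not: a remote restart may be enough.
    subprocess.run(["ssh", HOST, "systemctl restart witness_node"], check=False)
elif not os_alive:
    # Host unreachable: it may be dead, or merely cut off from *this* watchdog.
    # Failing over now risks two producers if it comes back still holding the
    # old signing key, so treat this as "unknown", not "dead".
    print("Cannot confirm the node is down - do not fail over automatically.")
else:
    print("Both the OS and the witness app respond; no action needed.")
```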
I was hoping wackou & I could have implemented the backbone architecture and a failover protocol along with it, but there wasn't enough funding and wackou's time was very scarce (and still is, actually). If this ecosystem is going to survive a frontal attack, the witness nodes need to be protected from direct public access. Seed nodes and API servers should be the only routes available for public access, leaving witnesses free to process and generate blocks quickly with minimum latency.