I think perhaps an easier way to explain the risk is this.
There are 11 witnesses, 5 of them running failover scripts.
The initial state of the network is nodes 1-6 and nodes 7a, 7b, 8a, 8b, 9a, 9b, 10a, 10b, 11a, and 11b all on chain A. On every failover pair only node a is signing.
There is a fork. Nodes 1-3 and nodes 7a, 8a, 9a, 10a, and 11a all stay on chain A. With 8 of 11 witnesses still signing, the network is at roughly 72% (8/11). Low but sustainable.
On chain B we now have nodes 4-6 and 7b, 8b, 9b, 10b, and 11b. As witnesses 7 through 11 miss blocks on chain B, the failover script switches each signing key to the key held by 7b, 8b, 9b, 10b, and 11b respectively. Since these nodes are now connected only to chain B, they can only change the signing key on chain B, leaving chain A unchanged.
The end result is:
Nodes 1-3, 7a, 8a, 9a, 10a, and 11a signing on chain A, with 72% participation and transactions becoming "irreversible" once 67% of witnesses have signed them.
Nodes 4-6, 7b, 8b, 9b, 10b, and 11b signing on chain B, with 72% participation and transactions becoming "irreversible" once 67% of witnesses have signed them.
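To make the arithmetic above concrete, here is a minimal Python sketch of the two-chain end state. The node labels come from the example; the function and variable names are made up for illustration, and the 67% threshold is simply the figure quoted above.

```python
# Rough model of the fork end state described above (illustrative names only).
TOTAL_WITNESSES = 11
IRREVERSIBLE_THRESHOLD = 0.67  # fraction of witnesses, as quoted in the post

# Nodes actively signing on each chain after the failover scripts have run.
chain_a = ["1", "2", "3", "7a", "8a", "9a", "10a", "11a"]
chain_b = ["4", "5", "6", "7b", "8b", "9b", "10b", "11b"]


def witness_of(node):
    """Map a node label such as '7a' or '7b' to its witness number '7'."""
    return node.rstrip("ab")


def participation(signing_nodes):
    """Fraction of all witnesses that have a signing node on this chain."""
    witnesses = {witness_of(n) for n in signing_nodes}
    return len(witnesses) / TOTAL_WITNESSES


part_a = participation(chain_a)
part_b = participation(chain_b)

# Witnesses that are signing on both chains at once (the failover pairs).
double_signers = sorted(
    {witness_of(n) for n in chain_a} & {witness_of(n) for n in chain_b},
    key=int,
)

# The dangerous condition: both chains can finalize transactions.
both_final = part_a > IRREVERSIBLE_THRESHOLD and part_b > IRREVERSIBLE_THRESHOLD

print(f"chain A: {part_a:.1%}, chain B: {part_b:.1%}")
print(f"witnesses signing on both chains: {double_signers}")
print(f"both chains above threshold: {both_final}")
```

Both chains sit at 8/11, which is above the threshold, and the five failover witnesses are exactly the ones counted on both sides.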
This is an extremely unlikely worst-case scenario, even with a simple script that only looked at missed blocks. The results would be terrible though, and need to be prevented. It could lead to double spending. Even if it didn't lead to double spending, it would be a giant pain in the ass to fix and would cause permanent damage to our image.
This worst case scenario is not possible even with my current rough script, and this script could be improved upon to reduce the risk even further.
Ultimately, while dropping below 67% and having the chain come to a halt is not as bad as two chains both existing above 67%, it is far more likely to occur. The chain stopping would still be a giant pain in the ass to fix and would cause permanent damage to our image.
I think a properly designed failover script could mitigate both risks to acceptable levels.
I don't think I have reasoned out all possibilities by any means, but with regard to my script in its current form I see a few ways that double sign issues could arise.
First of all would be a massive split of the internet. Let's assume that the majority of primary producing nodes are in the United States. Let's further assume that the United States gets entirely disconnected from the rest of the world. If the majority of control nodes and backup nodes are outside of the United States, then when they switched over there would effectively be two networks: one within the United States and one outside of it. I think this is so unlikely that we don't really need to game plan for it. If anyone disagrees then let me know. I think there may even be a solution to this extreme possibility, but I haven't spent a lot of time thinking about it.
The script as it is currently written involves three different nodes: 2 producer nodes and a control node. Each producer node will restart the witness node and cli wallet if it crashes or if witness participation falls below 50%, but the producer nodes will not attempt to change signing keys. They will happily miss blocks as long as witness participation stays above 50%. The control node will not sign blocks, but will restart itself if it crashes or if witness participation falls below 50%. As blocks are missed it will round robin between nodes in a deterministic fashion (the node chosen depends upon the total missed blocks reported to the control node by a get_witness command). The possibilities I see are:
One producer node forks off onto a minority fork.
Both producer nodes fork off onto a minority fork.
All three nodes fork onto different forks.
I am saying minority fork, but ultimately we are really only concerned with signing on one fork at a time. Therefore one producer node forking is exactly the same as the control node and one producer node forking. It is effectively two networks, one with a single producer and the other with the control node and a producer. As always, if my reasoning seems incorrect please let me know.
As the only node that is capable of changing the signing key is the control node, any fork that separates the control node from both producer nodes is not a concern with regard to double signing. The control node will furiously update the witness from signing key to signing key until witness participation drops below 50%. It will then replay, and if it is unable to fix itself with a replay it will resync. All the while, the witness node that is on the majority chain will either be signing blocks or missing blocks. It is of course possible that if all three nodes split, the control node will replay and end up on the same chain as one of the producer nodes. I am not sure that this will make a difference though.
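The control node behaviour described so far (deterministic round robin driven by the missed-block count, replay below 50% participation) could be sketched roughly as below. This is not the actual script: the function names, the modulo rule for picking a key, and the returned action tuples are all assumptions made for illustration.

```python
# Hypothetical sketch of the control node's decision, not the real script.
PARTICIPATION_FLOOR = 50  # percent; below this the node replays or resyncs


def next_signing_key(total_missed, keys):
    """Deterministic round robin: the same missed count always picks the same key.

    Assumes the key is chosen as total_missed modulo the number of producers.
    """
    return keys[total_missed % len(keys)]


def control_action(participation, total_missed, last_seen_missed, keys):
    """Decide what the control node does on one polling pass.

    participation    -- witness participation in percent, as seen by this node
    total_missed     -- total missed blocks currently reported for the witness
    last_seen_missed -- total missed blocks seen on the previous pass
    keys             -- signing keys of the producer nodes, in a fixed order
    """
    if participation < PARTICIPATION_FLOOR:
        # The node is probably on a dead fork: stop switching keys and recover.
        return ("replay", None)
    if total_missed > last_seen_missed:
        # A new block was missed: rotate to the key this count maps to.
        return ("update_key", next_signing_key(total_missed, keys))
    return ("wait", None)


keys = ["A", "B"]
print(control_action(80, 11, 10, keys))  # a new miss while participation is healthy
print(control_action(80, 10, 10, keys))  # nothing new missed
print(control_action(49, 12, 10, keys))  # participation has dropped below the floor
```

Because the key depends only on the missed-block count, two control nodes looking at the same chain would pick the same key, which is the point of making the rotation deterministic.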
This leads us to the interesting possibility: one producer node and the control node together on a single chain. In that case the control node is capable of allowing the secondary producer node to sign while the primary is signing on a different chain. The way it should work if there is a fork is that producer a and the control node go off on chain 1 and producer b goes off on chain 2. If the witness misses a block on chain 1, the control node will change the signing key on chain 1. The control node will then notice the signing key variance and change both producer nodes to the signing key that is active on the chain with higher witness participation. If chain 1 has higher participation then all is well, and eventually producer b will fall below 50% and will replay and/or resync. If chain 2 has higher participation, the control node will attempt to change the signing key every time the witness misses a block on chain 1 until chain 1 falls below 50% witness participation. At that time both producer a and the control node replay and/or resync.
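The reconciliation step in that paragraph, where the control node compares what the two producers see and sides with the chain that has higher participation, might look something like this. The `ProducerView` structure and the tie-breaking rule (prefer the first producer on equal participation) are assumptions for the sake of the example.

```python
from dataclasses import dataclass


@dataclass
class ProducerView:
    """What the control node learns when it queries one producer node."""
    name: str
    active_key: str       # signing key active on the chain this producer follows
    participation: float  # witness participation (percent) on that chain


def reconcile(a, b):
    """Return the key both producers should converge on, or None if they agree.

    When the two producers report different active keys they are on different
    chains, so the key from the chain with higher participation wins.
    Ties go to the first producer, which is an arbitrary choice for this sketch.
    """
    if a.active_key == b.active_key:
        return None
    leader = a if a.participation >= b.participation else b
    return leader.active_key


# Producer a sits on chain 1 (minority), producer b on chain 2 (majority).
view_a = ProducerView("a", active_key="A", participation=40.0)
view_b = ProducerView("b", active_key="B", participation=85.0)
print(reconcile(view_a, view_b))
```

Note that the control node can only actually broadcast the key change on the chain it is connected to, which is why the repeated change attempts described above continue until that chain drops below 50%.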
Most of the time this split will be okay. However, if the signing key change happens within 3 blocks of the witness's next signing slot, it is possible that the witness will double sign a block before it is fixed. The worst case scenario I can determine would double sign two blocks. For this worst case scenario: producer a = a, producer b = b, control node = c, chain 1 = 1, chain 2 = 2, producer a holds signing key A, and producer b holds signing key B.
The starting situation is all nodes spending some quality time on chain 1. A is the active signing key on chain 1. Sadly, node a crashes. Signing key A then misses a block and node c switches the signing key on chain 1 to B. b happily signs blocks on 1 until a replays. Unfortunately there has been a minor fork in the meantime. When a replays it ends up on chain 2. Chain 2 still has signing key A active. As soon as a comes back up, c compares the signing keys of a and b. Sadly, however, a comes back immediately before it is set to sign a block on chain 2, and happily signs that block. c changes the active signing key on chain 2, but a had back-to-back blocks and therefore has signed two blocks on the wrong chain. It is further possible that a will continue to crash and replay, end up on a new minority chain every time, and sign two blocks before it can be caught by c and put back in its place.
The second possibility for concern I see is if a and c end up on a minority chain (1) while b ends up on a majority chain (2). The issue here is that every time the witness misses a block on 1, c will change the signing key on 1. c will almost immediately catch that a and b no longer have the same signing key and will switch the signing key to B. There is, however, a risk if the witness has two blocks within 3 blocks of each other. Let's assume that the witness misses block 1000 on chain 1. c will switch the signing key on chain 1 at block 1001. c will then notice that there is a variance in signing keys and will switch the signing key to B at block 1002. However, with lag it is possible this will not take effect until block 1003. If a was designated to sign block 1001, 1002, or 1003, then the witness would have double signed a single block. This could conceivably keep happening until 1 falls below 50% and both a and c replay and/or resync.
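The window in that example can be written down as a small check. This is only a sketch of the timing argument: the delay values mirror the block numbers used above (switch one block after the miss, revert one block later, plus one block of lag), and the function name and parameters are made up.

```python
def double_sign_slots(missed_block, scheduled_slots,
                      switch_delay=1, revert_delay=2, lag=1):
    """Return the scheduled slots that fall inside the exposure window.

    missed_block    -- block the witness missed on the minority chain (1000 above)
    scheduled_slots -- block numbers at which the witness is due to sign
    switch_delay    -- blocks until c's first key change lands (block 1001 above)
    revert_delay    -- blocks until c issues the corrective change (block 1002 above)
    lag             -- extra blocks before the correction takes effect (block 1003 above)
    """
    window_start = missed_block + switch_delay
    window_end = missed_block + revert_delay + lag
    return [s for s in scheduled_slots if window_start <= s <= window_end]


# The witness missed block 1000 and is next scheduled at 1002 and 1015.
print(double_sign_slots(1000, [990, 1002, 1015]))

# With the next slot at 1004 the correction has already taken effect.
print(double_sign_slots(1000, [1004, 1020]))
```

With these defaults the exposure window is blocks 1001 through 1003, so only a slot landing in that range results in a double-signed block.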
I haven't reasoned through what all of these variations would mean for the network, but it does seem that it would be extremely improbable for enough witnesses to run into either of these problems close enough together in time to cause two majority forks.
If you have made it this far, I would like to apologize for the massive walls of text that you have waded through. If I could come up with a better way of explaining my reasoning I most certainly would. If you know of a better way of explaining it, please let me know. Also please let me know if my reasoning or assumptions seem suspect.