Author Topic: [python] failover script  (Read 9563 times)


Offline puppies

  • Hero Member
  • *****
  • Posts: 1659
    • View Profile
  • BitShares: puppies
I've pushed a new update that will now check whether your node is in one of the last 5 slots in the shuffle order (to reduce the chances of getting back-to-back blocks) and will not broadcast the update_witness command if it is.  I am no longer sure this is really needed, though.
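For illustration, that check could look roughly like the sketch below. This is not the script itself: it assumes the witness schedule object (2.12.0) exposes the current shuffle order as current_shuffled_witnesses, and that rpc is a cli_wallet handle like the one the script already uses.
Code: [Select]
# Illustrative sketch only -- assumes get_object("2.12.0") returns the witness
# schedule with "current_shuffled_witnesses" (a list of witness ids in
# production order), and that rpc is a cli_wallet handle as used elsewhere.
SAFETY_SLOTS = 5

def in_final_slots(rpc, witnessname, slots=SAFETY_SLOTS):
    witness_id = rpc.get_witness(witnessname)["id"]
    schedule = rpc.get_object("2.12.0")[0]
    order = schedule["current_shuffled_witnesses"]
    return witness_id in order[-slots:]

# The failover loop would then only broadcast the key change when it is safe:
# if not in_final_slots(rpc, witnessname):
#     rpc.update_witness(witnessname, "", backup_signing_key, "true")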

I noticed some odd behavior in regards to the reported signing key.  I had assumed that a node would return the signing key from the last valid block.  During testing I killed the network (too many blocks missed).  I restarted the node with all of my init witnesses (which has stale production turned on).

Now, naturally, my other running nodes would not accept these new blocks.  They weren't linked, and since no block production was taking place the head block age just kept climbing.

The odd part is that even in that state, with no blocks being accepted and no new blocks being created, these nodes would still change their reported signing key immediately when given a local update_witness command.

I think this would make it impossible to double-sign blocks even with my current script, and a slight modification (broadcasting key changes to every node rather than relying on them being saved to the blockchain) would completely remove the need to look at the shuffle order.
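That modification could be as simple as the sketch below, assuming one cli_wallet RPC handle per reachable node (the wallets list and the helper name are illustrative, not part of the current script).
Code: [Select]
# Illustrative sketch: push the key change through every node we can reach
# instead of relying on the chain to carry it. `wallets` would hold one
# cli_wallet RPC handle per witness/backup node.
def broadcast_new_key(wallets, witnessname, new_signing_key):
    for wallet in wallets:
        try:
            wallet.update_witness(witnessname, "", new_signing_key, "true")
        except Exception as exc:
            # A node stuck on a dead fork may reject or time out; that is fine,
            # the goal is just that every reachable node hears about the change.
            print("could not update via this node:", exc)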

https://metaexchange.info | Bitcoin<->Altcoin exchange | Instant | Safe | Low spreads

Offline puppies

  • Hero Member
  • *****
  • Posts: 1659
    • View Profile
  • BitShares: puppies
Quote from: kuro112
Well made, man. I'm a Python fan myself and I really like your style. This caught my attention when I was setting up nodes for my own projects and I'm addicted.

 +5%  Am I using this right? :D

Thanks Kuro,

I wouldn't suggest running the failover script on any DPOS chain at this point.  I've learned a lot writing it, and had a lot of fun too.
https://metaexchange.info | Bitcoin<->Altcoin exchange | Instant | Safe | Low spreads

Offline kuro112

Well made, man. I'm a Python fan myself and I really like your style. This caught my attention when I was setting up nodes for my own projects and I'm addicted.

 +5%  Am I using this right? :D
CTO @ Freebie, LLC

Offline cube

  • Hero Member
  • *****
  • Posts: 1404
  • Bit by bit, we will get there!
    • View Profile
  • BitShares: bitcube
Quote from: puppies
I think it would be, cube, but you're more likely to drop blocks in the transition.  It's also still a low risk, not zero risk.  Human error from manually switching is not zero risk either.

I think a better solution would be to only update the signing key if there is enough time to ensure it took, and to kill the node if it didn't.

Having a produce true/false flag like 1.0 did would make this easier.  I believe that still caused lots of problems in 1.0, though.

I think dropping a few blocks during a transition is fine. And if the controller node can shut down the no-longer-signing node (i.e. the first node) once the switch command is sent, the risk of forked-chain signing is essentially zero.

Is this the 'better solution' that you are proposing?
ID: bitcube
bitcube is a dedicated witness and committee member. Please vote for bitcube.

Offline puppies

  • Hero Member
  • *****
  • Posts: 1659
    • View Profile
  • BitShares: puppies
I think it would be, cube, but you're more likely to drop blocks in the transition.  It's also still a low risk, not zero risk.  Human error from manually switching is not zero risk either.

I think a better solution would be to only update the signing key if there is enough time to ensure it took, and to kill the node if it didn't.

Having a produce true/false flag like 1.0 did would make this easier.  I believe that still caused lots of problems in 1.0, though.
« Last Edit: October 30, 2015, 03:30:39 pm by puppies »
https://metaexchange.info | Bitcoin<->Altcoin exchange | Instant | Safe | Low spreads

Offline cube

  • Hero Member
  • *****
  • Posts: 1404
  • Bit by bit, we will get there!
    • View Profile
  • BitShares: bitcube
Quote from: emski
He already has an external control node. The issue is that both of his witness nodes are simultaneously signing blocks (with different signing keys); the control node just updates the signing keys. What I propose is that the control node ensures that only one of the signers signs at any moment.

I did not realise it had an external control node. Cool!

puppies, is it possible to make this change?
ID: bitcube
bitcube is a dedicated witness and committee member. Please vote for bitcube.

Offline emski

  • Hero Member
  • *****
  • Posts: 1282
    • View Profile
    • http://lnkd.in/nPbhxG

Quote from: emski
There is no concept of "low risk" when you are dealing with such a system. It either works in all possible cases or it is not secure. Simultaneously signing two chains (even with different signing keys) is an issue. This breaks BM's definition of irreversible.

My recommendation is to have two synchronized nodes, with the control node only allowing one of them to sign blocks (if they are on different forks, your control node just picks which one should be active).

Quote from: cube
I think 'low risk' is 'too much risk' for the BTS network to take.  This is especially so since we are dealing with people's money and the reputation of BitShares.

puppies, is it possible to refine the failover script to include an external 'control node' as recommended by emski?

He already has an external control node. The issue is that both of his witness nodes are simultaneously signing blocks (with different signing keys); the control node just updates the signing keys. What I propose is that the control node ensures that only one of the signers signs at any moment.

Offline cube

  • Hero Member
  • *****
  • Posts: 1404
  • Bit by bit, we will get there!
    • View Profile
  • BitShares: bitcube

Quote from: emski
There is no concept of "low risk" when you are dealing with such a system. It either works in all possible cases or it is not secure. Simultaneously signing two chains (even with different signing keys) is an issue. This breaks BM's definition of irreversible.

My recommendation is to have two synchronized nodes, with the control node only allowing one of them to sign blocks (if they are on different forks, your control node just picks which one should be active).

I think 'low risk' is 'too much risk' for the BTS network to take.  This is especially so since we are dealing with people's money and the reputation of BitShares.

puppies, is it possible to refine the failover script to include an external 'control node' as recommended by emski?

Quote from: tonyk
Well, add to that that witness "holytransaction" is a guy no one knows anything about... other than that the account is fully controlled by another witness...

Are you saying there is a witness account controlled by another witness?
« Last Edit: October 30, 2015, 09:23:28 am by cube »
ID: bitcube
bitcube is a dedicated witness and committee member. Please vote for bitcube.

Offline tonyk

  • Hero Member
  • *****
  • Posts: 3308
    • View Profile
Well, add to that that witness "holytransaction" is a guy no one knows anything about... other than that the account is fully controlled by another witness...


And this is just a case that a guy with poor tech skills can identify... In reality, there might be only 4 witnesses in all,
3 of them running witnesses on 2 blockchains...?????
« Last Edit: October 30, 2015, 08:56:32 am by tonyk »
Lack of arbitrage is the problem, isn't it. And this 'should' solves it.

Offline emski

  • Hero Member
  • *****
  • Posts: 1282
    • View Profile
    • http://lnkd.in/nPbhxG
I've read your post.
I state that allowing two nodes to sign blocks with the same witness account simultaneously should be banned.
@Bytemaster, do you agree?
@puppies, your control node should make sure that only one of the witness nodes is signing blocks at any moment.

There is no concept of "low risk" when you are dealing with such a system. It either works in all possible cases or it is not secure. Simultaneously signing two chains (even with different signing keys) is an issue. This breaks BM's definition of irreversible.

My recommendation is to have two synchronized nodes, with the control node only allowing one of them to sign blocks (if they are on different forks, your control node just picks which one should be active).

This is my opinion. Feel free to do whatever you consider "low enough risk".
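For illustration, the control-node behaviour emski describes could be sketched roughly as below. This is not puppies' script: stop_production() and start_production() are hypothetical hooks (something like them would have to be exposed on each producer box, e.g. by watcher.py), and producer_a/producer_b stand for handles like the producer modules switch.py already uses.
Code: [Select]
# Rough sketch of emski's suggestion, not the current script.
# stop_production()/start_production() are hypothetical hooks on each producer.
def enforce_single_signer(producer_a, producer_b):
    # Prefer whichever node sits on the fork with higher witness participation.
    if producer_a.info() >= producer_b.info():
        active, standby = producer_a, producer_b
    else:
        active, standby = producer_b, producer_a
    standby.stop_production()    # make sure the standby cannot sign on any fork
    active.start_production()    # exactly one node is allowed to produce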
« Last Edit: October 30, 2015, 08:28:59 am by emski »

Offline puppies

  • Hero Member
  • *****
  • Posts: 1659
    • View Profile
  • BitShares: puppies
I think perhaps an easier way to explain the risk is this.

11 witnesses.  5 running failover scripts.

The initial state of the network is nodes 1-6 and nodes 7a, 7b, 8a, 8b, 9a, 9b, 10a, 10b, 11a, and 11b all on chain A.  For each failover pair, only the 'a' node is signing.

There is a fork.  Nodes 1-3 and nodes 7a, 8a, 9a, 10a, and 11a all stay on chain A.  With 8 of 11 witnesses still signing, chain A is at roughly 72% participation.  Low, but sustainable.

On chain B we now have nodes 4-6 and 7b, 8b, 9b, 10b, and 11b.  As witnesses 7 through 11 miss blocks on chain B, the failover scripts switch each witness's signing key to the key held by 7b, 8b, 9b, 10b, and 11b respectively.  Since these nodes are now connected to chain B they can only change the signing key on chain B, leaving chain A unchanged.

The end result is:

Nodes 1-3, 7a, 8a, 9a, 10a, and 11a signing on chain A: 72% participation, with transactions treated as "irreversible" after 67% of witnesses have signed them.

Nodes 4-6, 7b, 8b, 9b, 10b, and 11b signing on chain B: 72% participation, with transactions treated as "irreversible" after 67% of witnesses have signed them.

This is an extremely unlikely worst-case scenario, even with a simple script that only looked at missed blocks.  The results would be terrible though, and need to be prevented.  It could lead to double spending.  Even if it didn't lead to double spending it would be a giant pain in the ass to fix, and would cause permanent damage to our image.

This worst-case scenario is not possible even with my current rough script, and the script could be improved upon to reduce the risk even further.

Ultimately, while the result of us dropping below 67% and the chain coming to a halt is not as bad as two chains over 67% existing, it is far more likely to occur.  The chain stopping would still be a giant pain in the ass to fix and would cause permanent damage to our image.

I think a properly designed failover script could mitigate both risks to acceptable levels.

I don't think I have reasoned out all the possibilities by any means, but with my script in its current form I see a few ways that double-signing issues could arise.

First of all would be a massive split of the internet.  Let's assume that the majority of primary producing nodes are in the United States, and let's further assume that the United States gets entirely disconnected from the rest of the world.  If the majority of control nodes and backup nodes are outside of the United States, then when they switched over there would effectively be two networks: one within the United States and one outside it.  I think this is so unlikely that we don't really need to game plan for it.  If anyone disagrees then let me know.  I think there may even be a solution to this extreme possibility, but I haven't spent a lot of time thinking about it.

The script as it is currently written involves three different nodes: two producer nodes and a control node.  Each producer node will restart its witness node and cli_wallet if it crashes or if witness participation falls below 50%, but the producer nodes will not attempt to change signing keys.  They will happily miss blocks as long as witness participation stays above 50%.  The control node does not sign blocks, but will restart itself if it crashes or if witness participation falls below 50%.  As blocks are missed it will round-robin between the producer nodes in a deterministic fashion (which key is chosen depends on the total missed blocks reported to the control node by a get_witness command; a small sketch of this key selection appears a bit further down).  The possibilities I see are:

One producer node forks off onto a minority fork.

Both producer nodes fork off onto a minority fork.

All three nodes fork onto different forks.

I say 'minority fork', but ultimately we are really only concerned with signing on one fork at a time.  Therefore one producer node forking off is exactly the same as the control node and one producer node forking off: it is effectively two networks, one with a single producer and the other with the control node and a producer.  As always, if my reasoning seems incorrect please let me know.
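The deterministic round-robin mentioned above could be sketched like this (illustrative only; backup_keys is an assumed list of the backup signing keys, and total_missed is the counter get_witness reports):
Code: [Select]
# Illustrative sketch of the deterministic round-robin: every control node that
# sees the same chain state derives the same choice of backup key.
def pick_backup_key(rpc, witnessname, backup_keys):
    total_missed = rpc.get_witness(witnessname)["total_missed"]
    return backup_keys[total_missed % len(backup_keys)]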

Since the only node capable of changing the signing key is the control node, any fork that separates the control node from both producer nodes is not a concern as far as double signing goes.  The control node will furiously update the witness from signing key to signing key until witness participation drops below 50%.  It will then replay, and if unable to fix itself with a replay it will resync.  All the while, the producer node that is on the majority chain will either be signing blocks or missing blocks.  It is of course possible, if all three nodes split, that the control node will replay and end up on the same chain as one of the producer nodes, but I am not sure that this makes a difference.

This leads us to the interesting possibility: one producer node and the control node on the same chain, with the other producer elsewhere.  In that case the control node is capable of allowing the secondary producer node to sign while the primary is signing on a different chain.  The way it should work if there is a fork is that producer a and the control node go off on chain 1 and producer b goes off on chain 2.  If the witness misses a block on chain 1 then the control node will change the signing key on chain 1.  The control node will then notice the signing key variance and change both producer nodes to the signing key that is active on the chain with higher witness participation.  If chain 1 has higher participation then all is well, and eventually producer b will fall below 50% and will replay and/or resync.  If chain 2 has higher participation then the control node will attempt to change the signing key every time the witness misses a block on chain 1, until chain 1 falls below 50% witness participation; at that point both producer a and the control node replay and/or resync.

Most of the time this split will be okay.  However, if the signing key change happens within 3 blocks of the witness's next signing slot, it is possible that the witness will double sign a block before it is fixed.  The worst case scenario I can construct would double sign two blocks.  For this worst case: producer a = a, producer b = b, control node = c, chain 1 = 1, chain 2 = 2; producer a holds signing key A and producer b holds signing key B.

The starting situation is all nodes spending some quality time on chain 1, with A as the active signing key on chain 1.  Sadly, node a crashes.  Signing key A then misses a block and node c switches the signing key on chain 1 to B.  b happily signs blocks on 1 until a replays.  Unfortunately, there has been a minor fork in the meantime: when a replays it ends up on chain 2, and chain 2 still has signing key A active.  As soon as a comes back up, c compares the signing keys of a and b.  Sadly, however, a happens to come back immediately before it is scheduled to sign a block on chain 2, and node a happily signs a block on chain 2.  c changes the active signing key on chain 2, but sadly a had back-to-back blocks and has therefore signed two blocks on the wrong chain.  It is further possible that a will continue to crash and replay, end up on a new minority chain every time, and sign two blocks before it can be caught by c and put back in its place.

The second possibility for concern I see is if a and c end up on a minority chain (1) while b ends up on the majority chain (2).  The issue here is that every time the witness misses a block on 1, c will change the signing key on 1.  c will almost immediately catch that a and b no longer have the same signing key and will switch the signing key to B.  There is, however, a risk if the witness has two slots within 3 blocks of each other.  Let's assume that the witness misses block 1000 on chain 1.  c will switch the signing key on chain 1 at block 1001.  c will then notice that there is a variance in signing keys and will switch the signing key to B at block 1002.  However, with lag it is possible this will not take effect until block 1003.  If a was designated to sign block 1001, 1002, or 1003 then the witness would have double signed a single block.  This could conceivably happen repeatedly until 1 falls below 50% and both a and c replay and/or resync.

I haven't reasoned through what all of these variations would mean for the network, but it does seem that it would be extremely improbable for enough witnesses to run into either of these problems close enough together in time to cause two majority forks. 

If you have made it this far, I would like to apologize for the massive walls of text that you have waded through.  If I could come up with a better way of explaining my reasoning I most certainly would, so if you know of a better way of explaining it please let me know.  Also please let me know if my reasoning or assumptions seem suspect.

« Last Edit: October 30, 2015, 07:21:07 am by puppies »
https://metaexchange.info | Bitcoin<->Altcoin exchange | Instant | Safe | Low spreads

Offline puppies

  • Hero Member
  • *****
  • Posts: 1659
    • View Profile
  • BitShares: puppies
I'm aware of the risk, emski.

I think it can be mitigated to acceptable levels.  In fact I think a scripted solution could be far less likely to cause an issue than human error is. 
https://metaexchange.info | Bitcoin<->Altcoin exchange | Instant | Safe | Low spreads

Offline emski

  • Hero Member
  • *****
  • Posts: 1282
    • View Profile
    • http://lnkd.in/nPbhxG
Let me see if I got it right:

1. You are running two witness instances for the same witness account, but with different signing keys.
2. You allow both nodes to sign blocks.
3. At some point in time you want to switch the signing key.

This can work only if the switch signing key transaction is confirmed in BOTH chains AND only one node signs blocks at any moment.

2/3 confirmation is not irreversible if there is an option for double signing.

See my example here (from this thread: https://bitsharestalk.org/index.php/topic,19360.0.html ):
No response?

Imagine the following situation:

31 witnesses total.
Automated backup that works like this (from the secondary node; a rough sketch follows the list):
1. If the primary node is missing blocks, publish a change-signing-key transaction.
2. Check the latest irreversible block (by BM's definition, one signed by 66% of witnesses, i.e. 21 of them) and verify that the signing key has been irreversibly changed.
3. Start signing blocks with the new key once the change is irreversible.
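Code: [Select]
# Rough sketch of steps 1-3 above, illustrative only. Assumes a Graphene-style
# wallet exposing get_dynamic_global_properties / last_irreversible_block_num.
import time

def switch_and_wait_irreversible(rpc, witnessname, backup_key):
    # Step 1: publish the key change and note roughly where it went in.
    rpc.update_witness(witnessname, "", backup_key, "true")
    update_height = rpc.get_dynamic_global_properties()["head_block_number"]
    # Step 2: wait until that height is irreversible.
    while rpc.get_dynamic_global_properties()["last_irreversible_block_num"] < update_height:
        time.sleep(3)
    # Step 3: only start signing with backup_key if the change really stuck.
    return rpc.get_witness(witnessname)["signing_key"] == backup_key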

Let's say we have witnesses with the above-mentioned automated backup.
Let's say we have a network split where the witnesses are divided into two groups: group A (21) / group B (10).
In chain A (with 21 witnesses) we have 10 witnesses missing blocks.
In chain B (with 10 witnesses) we have 21 witnesses missing blocks.

In chain A we have 10 signing-key-change transactions (one for each witness from group B). When these transactions are confirmed, the backup nodes for group B start signing blocks.

Imagine now that witnesses from A begin to lose connection to other nodes in A and connect to witnesses in B. Let this happen one witness at a time.
When the first witness (X) "transfers" from A to B, group A will still have more than 66% participation. Then X's backup node (assume it is connected to group A) will activate, changing the signing key and starting to sign blocks, maintaining 100% participation in chain A. However, the original X will continue signing blocks together with group B. If this is repeated 11 times (note that this can happen with up to 10 witnesses simultaneously) we'll have:
 Fork A with >66% active witnesses; Fork B with >66% active witnesses.

Again, I'm not saying this is likely to happen, but it might be doable if witnesses are able to sign on two chains simultaneously.
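The arithmetic behind the scenario, just as a sanity check (numbers only, nothing chain-specific):
Code: [Select]
# Sanity check of the numbers in the scenario above (illustrative only).
TOTAL = 31
THRESHOLD = 2 / 3                 # the "66%" irreversibility threshold

# After the initial split, the 10 group-B backups activate on chain A,
# so all 31 witness slots are being produced on A.
signing_on_a = 21 + 10            # 100% participation on chain A

# Each of the 11 "transfers" moves one original signer over to chain B,
# while its backup keeps the slot filled on chain A.
signing_on_b = 10 + 11            # 21 signers on chain B

print(signing_on_a / TOTAL > THRESHOLD)   # True -> fork A stays above 66%
print(signing_on_b / TOTAL > THRESHOLD)   # True (21/31 ~ 0.68) -> fork B above 66%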

Offline puppies

  • Hero Member
  • *****
  • Posts: 1659
    • View Profile
  • BitShares: puppies

Quote from: puppies
switch.py will now integrate with two remote witness nodes.  It will ensure that the signing keys for the specified witness match.  If there is a fork and they do not match, switch.py will copy the signing key from the node with higher witness participation to the node with lower witness participation.  Documentation and comments are still pretty minimal.  I will try to flesh those out when I get a chance.

Quote from: emski
Can you provide more info and/or an example of this?

Okay.  So one thing I didn't mention is that this does require you to expose the websocket on your witness nodes to outside traffic.  You could restrict this to only accepting traffic from your control node if you are concerned about it security-wise.

When the script launches it opens a wallet on the control node and connects that wallet to the websocket port of your producing witness.  If there is no wallet file it creates one, imports your witness's active private key, unlocks the wallet, and saves it.  If there is already a wallet it just unlocks it.
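This is the job of the checkIfNew() and unlockWallet() calls that appear in openProducer() below. Roughly, they could look like the sketch here; the real versions in switch.py may differ, and wallet_password / active_wif are assumed configuration values rather than names from the script.
Code: [Select]
# Rough sketch of the wallet bootstrap described above; the real checkIfNew()
# and unlockWallet() in switch.py may differ. wallet_password and active_wif
# are assumed to come from the script's configuration.
def checkIfNew():
    if rpc.is_new():                             # brand new wallet file?
        rpc.set_password(wallet_password)
        rpc.unlock(wallet_password)
        rpc.import_key(witnessname, active_wif)  # witness active private key
        rpc.save_wallet_file("")                 # persist the wallet to disk

def unlockWallet():
    if rpc.is_locked():
        rpc.unlock(wallet_password)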

Every 3 seconds each node is queried with get_witness witnessname, and the signing keys of the two nodes are compared and printed.  If there is a mismatch between the two production nodes it looks at the witness participation rate.  If the participation rate is the same it does nothing.  If the participation rate is higher on one node, it issues update_witness witnessname "" <signing key from node with higher participation> true on the node with lower participation.

Each node should be running a copy of watcher.py, which will replay or resync in case of a crash or witness participation dropping below 50%.
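watcher.py itself is not posted here, but the behaviour described is roughly the sketch below. Paths, the screen session name and the check interval are illustrative, and rpc again stands for a local cli_wallet handle.
Code: [Select]
# Rough sketch of the watcher behaviour described above, not the real watcher.py.
import subprocess
import time

PATH_TO_WITNESS_NODE = "/usr/local/bin/witness_node"   # illustrative placeholders
DATA_DIR = "witness_node_data_dir"

def start_witness(extra_args=()):
    subprocess.call(["screen", "-dmS", "witness",
                     PATH_TO_WITNESS_NODE, "--data-dir", DATA_DIR, *extra_args])

def watch(rpc, interval=60):
    while True:
        try:
            participation = float(rpc.info()["participation"])
            if participation < 50:
                # Probably stuck on a fork: stop the node and replay it
                # (a full resync would be the next escalation step).
                subprocess.call(["screen", "-S", "witness", "-X", "quit"])
                start_witness(["--replay-blockchain"])
        except Exception:
            # Node or wallet unreachable: assume a crash and restart it.
            start_witness()
        time.sleep(interval)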

Hopefully I answered your question, emski.

The code to open a wallet connection to each production node's websocket is:
Code: [Select]
import subprocess
import time

def openProducer():
    # Launch a cli_wallet in a detached screen session, pointed at the
    # producer node's websocket, then prepare and unlock its wallet.
    print("opening " + wallet_name)
    attempt = 0
    result = None
    while result is None:
        if attempt < 4:
            try:
                print("waiting ...")
#                subprocess.call(["screen","-dmS",wallet_name,path_to_cli_wallet,"-H",local_port,"-s",remote_ws,"--chain-id","16362d305df19018476052eed629bb4052903c7655a586a0e0cfbdb0eaf1bfd8"]) ### uncomment this line if running on testnet
                subprocess.call(["screen","-dmS",wallet_name,path_to_cli_wallet,"-H",local_port,"-s",remote_ws]) ### comment this line out if running on testnet
                time.sleep(1)
                checkIfNew()
                unlockWallet()
                result = rpc.info()   # succeeds once the wallet is reachable
            except:
                time.sleep(10)        # wallet not up yet; back off and retry
                attempt += 1
        else:
            break
The portion of the main loop that checks the signing key is
Code: [Select]
        else:
            try:
                # Compare the two producers' signing keys; if they differ,
                # copy the key over from whichever node has higher participation.
                if compareSigningKeys() == False:
                    choice = comparePart()
                    setRemoteKey(choice)
            except:
                # Something did not answer -- restart whichever producer
                # connection fails its info() call.
                try:
                    part1 = producer1.info()
                    print(part1)
                except:
                    print("producer1 no workie")
                    producer1.closeProducer()
                    producer1.openProducer()
                try:
                    part2 = producer2.info()
                    print(part2)
                except:
                    producer2.closeProducer()
                    producer2.openProducer()
The functions related to this are:
Code: [Select]
def compareSigningKeys():
    if producer1.getSigningKey() == producer2.getSigningKey():
        print("node1 signing key= "+producer1.getSigningKey()+"       node1 witness participation = " + str(producer1.info()))
        print("node2 signing key= "+producer2.getSigningKey()+"       node2 witness participation = " + str(producer2.info()))
        return True
    else:
        print("ERROR....ERROR....ERROR....ERROR....ERROR")
        print("signing keys are different.  You have been forked")
        return False
Code: [Select]
def comparePart():
    if producer1.info() == producer2.info():
        return 0
    elif producer1.info() > producer2.info():
        return 1
    elif producer2.info() > producer1.info():
        return 2
Code: [Select]
def setRemoteKey(num):
    if num == 0:
        return
    elif num == 1:
        signingKey = producer1.getSigningKey()
        producer2.setSigningKey(signingKey)
    elif num == 2:
        signingKey = producer2.getSigningKey()
        producer1.setSigningKey(signingKey)
Code: [Select]
def getSigningKey():
    witness = rpc.get_witness(witnessname)
    signingKey = witness["signing_key"]
    return signingKey
Code: [Select]
def setSigningKey(signingKey):
    rpc.update_witness(witnessname,"",signingKey,"true")

Code: [Select]
def info():
    info = rpc.info()
    part = info["participation"]
    part = float(part)
    return part



As always if you have any input I would love to hear it.

If we end up deciding that running any automated failover script is too risky, and this code is never used by anyone then I will be okay with that.  I have learned a lot and had lots of fun writing it.
 
https://metaexchange.info | Bitcoin<->Altcoin exchange | Instant | Safe | Low spreads

Offline puppies

  • Hero Member
  • *****
  • Posts: 1659
    • View Profile
  • BitShares: puppies
Good idea Xeroc. If you verified that your next witness slot was far enough in the future to ensure that the update_witness went through, and then killed the node if it could not be switched, you should have no extra liability from switching nodes.  I think this might be a little bit of overkill, though.
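For illustration, that check might look something like the sketch below. slots_until_next_block() and kill_producer() are hypothetical helpers (neither exists in the current scripts), and the 6 second wait assumes the 3 second block interval.
Code: [Select]
# Rough sketch of the idea above, not part of the current script.
# slots_until_next_block() and kill_producer() are hypothetical helpers.
import time

SAFETY_MARGIN = 5   # only switch if our next slot is at least this many slots away

def safe_switch(rpc, witnessname, new_key):
    if slots_until_next_block(rpc, witnessname) < SAFETY_MARGIN:
        return False                      # too close to our slot, do not switch yet
    rpc.update_witness(witnessname, "", new_key, "true")
    time.sleep(6)                         # give the change a couple of blocks to land
    if rpc.get_witness(witnessname)["signing_key"] != new_key:
        kill_producer()                   # the update did not take: stop the old node
        return False
    return True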

Any time an active witness node misses the block immediately preceding its own block there will be a fork.  I need to spend some time mapping out the possibilities and then testing the fork resolution.

Emski, give me a few minutes and I will go into detail about how the script currently works.
https://metaexchange.info | Bitcoin<->Altcoin exchange | Instant | Safe | Low spreads