Author Topic: [python] failover script (Read 25847 times)

xeroc

Quote from: pc on October 29, 2015, 05:37:04 pm

Conclusion: if you want to switch your active node you *must* turn off the current signing node first, and only after that publish the update_witness TX.

Can you not wait to activate "_W" until you see the update_witness transaction confirmed by 2/3 of the witnesses (so that it is IRREVERSIBLE), and then be sure that you are on the correct fork when activating "_W"?
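
A rough sketch of that waiting step, in case it helps the discussion. It assumes you already know the block number that contains your update_witness transaction and that you have some way to ask a node for its last irreversible block number. The callable get_last_irreversible_block_num is a hypothetical placeholder for that query, not a real API call.

```python
import time


def wait_until_irreversible(tx_block_num, get_last_irreversible_block_num,
                            poll_seconds=3.0, timeout_seconds=600.0,
                            sleep=time.sleep, clock=time.monotonic):
    """Poll until the block holding the transaction is irreversible.

    get_last_irreversible_block_num is a hypothetical callable that returns
    the node's last irreversible block number as an int.
    Returns True once the block is irreversible, False on timeout.
    """
    deadline = clock() + timeout_seconds
    while True:
        if get_last_irreversible_block_num() >= tx_block_num:
            return True
        if clock() >= deadline:
            return False
        sleep(poll_seconds)
```

Only activate "_W" when this returns True. Note that this covers only the node you are polling; it says nothing about which fork the old node "W" is on.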

pc

Switching witness nodes by updating the signing key will never be secure in the sense that you cannot completely prevent double-signing.

Let's call your current active witness node "W" and the node to which you want to switch "_W".

Suppose that the current head block is block #1, some other node X will sign the next block #2, and it's your turn to sign block #3.

Now you publish the update_witness transaction. Suppose that node X sees it and includes it in block #2.

Suppose that for some reason W does not receive block #2 in time. W will create block #3 and link to block #1.
Suppose further that _W does receive block #2 in time. For _W the switch is in effect, so _W will create block #3 and link to block #2.

You cannot prevent that by monitoring your nodes, because by the time you notice the problem it may already be too late.

Conclusion: if you want to switch your active node you *must* turn off the current signing node first, and only after that publish the update_witness TX.
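
The ordering in that conclusion could be enforced in a script roughly like this. It is only a sketch: the three callables are hypothetical hooks you would have to supply yourself (for example, stopping the witness process and broadcasting through your wallet).

```python
def switch_signing_node(stop_current_node, is_current_node_stopped,
                        publish_update_witness):
    """Stop the current signer first, and only then publish update_witness.

    All three arguments are hypothetical callables supplied by the operator.
    """
    stop_current_node()
    # Refuse to continue unless the old signer is confirmed down, because
    # publishing while it still runs is what allows double-signing.
    if not is_current_node_stopped():
        raise RuntimeError(
            "current signing node is still running; not publishing update_witness"
        )
    publish_update_witness()
```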

emski

Quote from: puppies on October 28, 2015, 08:13:07 pm

Switch.py will now integrate with 2 remote witness nodes. It will ensure that the signing keys for the specified witness match. If there is a fork and they do not match, switch.py will copy the signing key from the node with higher witness participation to the node with lower witness participation. Documentation and comments are still pretty minimal. I will try to flesh those out when I get a chance.

Can you provide more info and/or an example of this?

puppies

Alrighty.

I have updated both switch.py and watcher.py.

I have improved the launching of watcher.py during initial launch, during replay and during resync. Rather than waiting a specified time and then loading the cli_wallet it will attempt to load the cli_wallet every 11 seconds until it is successful.
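
The retry idea looks roughly like the following. This is a simplified sketch and not the actual code in watcher.py; try_connect is a hypothetical callable that raises OSError (for example ConnectionRefusedError) while the cli_wallet is not reachable yet.

```python
import time


def connect_with_retry(try_connect, retry_seconds=11, sleep=time.sleep):
    """Call try_connect until it succeeds, pausing between attempts.

    try_connect is a hypothetical callable that returns a connection object
    or raises OSError while the wallet cannot be reached.
    """
    while True:
        try:
            return try_connect()
        except OSError:
            # Wallet not up yet (replay or resync may still be running).
            sleep(retry_seconds)
```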

Switch.py will now integrate with 2 remote witness nodes. It will ensure that the signing keys for the specified witness match. If there is a fork and they do not match, switch.py will copy the signing key from the node with higher witness participation to the node with lower witness participation. Documentation and comments are still pretty minimal. I will try to flesh those out when I get a chance.
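
To illustrate the key matching rule, here is a minimal sketch of the decision only. It is not the code from switch.py. The dict layout and the function name are made up for this example, and the tie handling is my assumption.

```python
def plan_key_sync(node_a, node_b):
    """Decide which node's signing key should be copied to the other.

    Each node is a dict with the illustrative fields 'name', 'signing_key'
    and 'participation' (witness participation in percent).
    Returns a copy instruction, or None if no copy is warranted.
    """
    if node_a["signing_key"] == node_b["signing_key"]:
        return None  # keys already match, nothing to do
    if node_a["participation"] == node_b["participation"]:
        return None  # assumption: on a tie, leave the decision to the operator
    if node_a["participation"] > node_b["participation"]:
        source, target = node_a, node_b
    else:
        source, target = node_b, node_a
    return {"copy_to": target["name"], "signing_key": source["signing_key"]}
```

For example, with node1 at 95% participation and node2 at 40% and different keys, the result says to copy node1's key to node2.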

I have a testnet up and running which you can use to test if you would like. It is based on a modified oct5-genesis.json (available at https://github.com/gileadmcgee/dele-puppy/) and is running on the most recent bitshares tag, v2.15.294. The seed node is 107.170.232.94:1776.

If you are going to run switch.py on the testnet then you will need to slightly modify producer1.py and producer2.py, specifically lines 34, 35 and 50, 51. Just comment out one line and uncomment the other. It should be pretty self-explanatory. You will need to modify config-example.py lines 11, 25, 28, 34, and 40 with your local settings and save it as config.py. There must be a wallet.json at the location specified on line 37 of config-example.py. That wallet must have the password specified on line 11 of config-example.py. You will need to have launched it with --chain-id, set the password, and saved the wallet. The producer sub scripts will generate a wallet if you do not have one, but I have not added that functionality to switch.py yet.

The config for producer1.py and producer2.py is built into the files themselves, currently lines 7-21. These should not need to be modified to work on the testnet, but will of course need to be modified if run under other conditions. Also, it's possible that you will need an empty file named __init__.py in both the producer1 and producer2 directories.

Let me know what you think. I believe that if properly configured this script mitigates the risk of signing on two different forks at the same time. I would appreciate any feedback before I start using this on the live network though.

Thom

In the testnet I switched block production off on one server and on on another server with a single update_witness transaction.

However, that was where both nodes were fully synchronized and on the same fork.

If block production is to be switched to another witness node, the current block producer must be disabled or shut down to ensure blocks aren't signed on multiple forks. That could be done with some type of multi-node comm, but that is problematic if the other nodes cannot be reached due to DoS or other reasons. It is better for the block producer to switch off block production through self-contained means.

Is it possible to monitor missed blocks and if they cross a configured threshold to disable production from the API? I don't see an API command by which that could be done, tho it was possible in the 0.9.x client. It seems like it would be very simple to add such a cli command. However, if the node was missing blocks due to being on a fork, it probably isn't very useful to keep that node running at all, so might as well just kill the process.

The key to robust failover protection is monitoring all nodes participating and communicating in a witness' failover group. Monitoring the witness participation % is probably the best way to detect being on the majority fork. Once that falls below a certain threshold that node should just kill itself. The other nodes can monitor the missed block count and participation rate, and if the missed block count increases and the participation rate stays high it could issue an update_witness to switch keys and activate production on itself.
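
As a sketch, the two rules for that scheme come down to something like this. The 50% default threshold is an arbitrary assumption for illustration, and how you obtain the participation rate and the missed block count from your node is left out.

```python
def should_shut_down(participation_pct, min_participation_pct=50.0):
    """Producing node: stop when participation suggests a minority fork."""
    return participation_pct < min_participation_pct


def should_take_over(previous_missed, current_missed, participation_pct,
                     min_participation_pct=50.0):
    """Standby node: take over only if the witness is missing blocks
    while this node still sees healthy participation."""
    missing_blocks = current_missed > previous_missed
    on_majority_fork = participation_pct >= min_participation_pct
    return missing_blocks and on_majority_fork
```

When should_take_over returns True, the standby would issue the update_witness and start producing. With only 2 nodes the two rules cannot both fire for the same view of the chain, but nothing here protects against the two nodes seeing different forks.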

That scheme would work fine for 2 nodes but gets tricky for 3 or more, without causing multiple block producers. I think it would be very difficult if not impossible to create an algorithm which wouldn't have the potential of race conditions without some sort of communication between the participating failover nodes.

emski

(: I'll be shocked if it isn't better as mine is just a "restart if it crashes".

puppies

Quote from: emski on October 20, 2015, 07:21:09 am

Quote from: puppies on October 20, 2015, 01:40:35 am
Thanks emski.

I just submitted a pull request for a new script. This one is called watcher.py. I have not tested the witness participation portion nor the resync since I removed all the key switching stuff. If you use it, please let me know if something doesn't work.

I'll review it and if it looks better than my quick-n-dirty solution I'll use it.

I would be shocked if it is better than your solution. I'm a nubbin at the programming. I'm fully expecting someone who is a better programmer than I am to come up with a better solution than I have. I figure it's a good way to learn more about programming while the better developers are working on more important things. My feelings are not going to be hurt when my script is replaced.

emski

Quote from: puppies on October 20, 2015, 01:40:35 am

Thanks emski.

I just submitted a pull request for a new script. This one is called watcher.py. I have not tested the witness participation portion nor the resync since I removed all the key switching stuff. If you use it, please let me know if something doesn't work.

I'll review it and if it looks better than my quick-n-dirty solution I'll use it.

puppies

Thanks emski.

I just submitted a pull request for a new script. This one is called watcher.py. I have not tested the witness participation portion nor the resync since I removed all the key switching stuff. If you use it, please let me know if something doesn't work.

emski

Quote from: puppies on October 19, 2015, 08:33:51 pm

I will pull all the key switching stuff out and push a different script that just replays and resyncs in case of crash or low participation. Hopefully tonight or tomorrow, but I'm getting married on Wednesday so things are a little hectic.

Congrats!

puppies

I will pull all the key switching stuff out and push a different script that just replays and resyncs in case of crash or low participation. Hopefully tonight or tomorrow, but I'm getting married on Wednesday so things are a little hectic.

emski

Quote from: rnglab on October 19, 2015, 08:22:34 pm

In the meantime, running puppy's script on just one node, with a single signing key pair, seems to be a good strategy for witnesses to quickly recover from a fork that requires reindexing or resyncing the blockchain.

Please correct me if I'm missing something.

Yes, running it on a single node is a good idea.

rnglab

In the meantime, running puppy's script on just one node, with a single signing key pair, seems to be a good strategy for witnesses to quickly recover from a fork that requires reindexing or resyncing the blockchain.

Please correct me if I'm missing something.

emski

Quote from: puppies on October 19, 2015, 06:18:40 pm

If we can ensure that all nodes return the same signing key with get_witness, and every node has only 1 signing key, this should mitigate the risk of signing on two forks. Should be pretty easy to do as well. What does everyone think?

In order to ensure this you need actual (direct) connectivity between your nodes.
In that case it is easier just to stop the primary and start the backup. You don't need to change signing keys.

puppies

If we can ensure that all nodes return the same signing key with get_witness, and every node has only 1 signing key, this should mitigate the risk of signing on two forks. Should be pretty easy to do as well. What does everyone think?
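
Something like this is what I have in mind for the check itself. It is a sketch under assumptions: the input dict is built by querying each node beforehand, and the field names 'witness_key' (the signing key the node returns for the witness) and 'local_keys' (the signing keys configured on that node) are made up for this example.

```python
def keys_are_consistent(nodes):
    """Return True if every node reports the same signing key for the
    witness and each node holds exactly one signing key.

    nodes maps a node name to a dict with the illustrative fields
    'witness_key' (str) and 'local_keys' (list of str).
    """
    published = {info["witness_key"] for info in nodes.values()}
    one_key_each = all(len(info["local_keys"]) == 1 for info in nodes.values())
    return len(published) == 1 and one_key_each
```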
