BitShares Forum

Main => Technical Support => Topic started by: puppies on October 10, 2015, 06:47:28 pm

Title: [python] failover script
Post by: puppies on October 10, 2015, 06:47:28 pm


Hey everybody.  I have updated and improved my failover script.  I have tested it quite a bit myself, but I don't think I have hit all the edge cases.  Further testing is appreciated. 

I am just learning how to code, and am releasing this in the hopes that it is useful to somebody until someone comes out with a better version.  I will be working on improving this script, and the quality of my programming.  I would appreciate any feedback that helps me move towards those goals.

The script keeps an eye on your witness's missed blocks.  When you miss a block (or multiple blocks), it will switch your signing key for you.  If no blocks have been produced after switching through every key twice, it will switch to a set of emergency keys.  If those keys also fail to produce blocks, it will switch back.  Even if the emergency keys do produce blocks, it will still switch back after 30 or so minutes.

The idea is that you can have multiple keys running on multiple servers, and if one goes down you can automatically switch over to a new one.  If all of them go down, you can switch over to a lower-powered emergency device such as a home PC.  The script's behaviour is a little odd with strictness over 1; it should still do a reasonable job, but for best results I would use strictness = 1.

If none of this makes sense please let me know.

This script requires a config.py with the following parameters.
witnessname = <the name of your witness>
publickeys = <tuple of public keys as strings> e.g. ("GPH57pBVHtJzfsZZ117e5dBfaMTJxbfzfZQRFFMVuompRQAidAEwK", "GPH75xxKG4ZeztPpnhmFch99smunUWMvDy9mB6Le497vpAA3XUXaD")  Must have at least 2.
strictness = <the number of blocks missed before a new public key is switched to> must be set to 1 or higher.
emergencykeys = <tuple of emergency public keys as strings>  If no emergency nodes are used, set emergencykeys = 0.  If keys are used, there must be at least two entries.  You can use the same key twice if only running a single emergency node.
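Put together, a minimal config.py might look like this (the key strings below are the placeholder examples from above, not real witness keys):

```python
# config.py -- example values only; substitute your own witness name and keys
witnessname = "my-witness"

# at least two public keys; the script rotates through these in order
publickeys = ("GPH57pBVHtJzfsZZ117e5dBfaMTJxbfzfZQRFFMVuompRQAidAEwK",
              "GPH75xxKG4ZeztPpnhmFch99smunUWMvDy9mB6Le497vpAA3XUXaD")

# how many missed blocks before switching keys (1 is the best-tested value)
strictness = 1

# 0 means no emergency nodes; otherwise a tuple of at least two keys
emergencykeys = 0
```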

and here is the script
Code: [Select]
#!/usr/bin/env python
# -*- coding: utf-8 -*-

### You must have a config.py with the following parameters.
### witnessname = <the name of your witness>
### publickeys = <tuple of public keys as strings> e.g. ("GPH57pBVHtJzfsZZ117e5dBfaMTJxbfzfZQRFFMVuompRQAidAEwK", "GPH75xxKG4ZeztPpnhmFch99smunUWMvDy9mB6Le497vpAA3XUXaD")  Must have at least 2.
### strictness = <the number of blocks missed before a new public key is switched to> must be set to 1 or higher.
### emergencykeys = <tuple of emergency public keys as strings>  If no emergency nodes are used, set emergencykeys = 0.  If keys are used, there must be at least two entries.  You can use the same key twice if only running a single emergency node.
### If all public keys fail to produce blocks after two rotations, then emergencykeys will be used.
### If all emergency keys fail to produce blocks after two rotations, then an attempt will be made to switch back to the primary keys.
### If emergency keys produce blocks, an attempt will still be made to switch back to the primary keys after 30ish minutes.


import sys
import json
from grapheneapi import GrapheneWebsocket, GrapheneWebsocketProtocol
import time
import config

rpc = GrapheneWebsocket("localhost", 8092, "", "")

### returns total missed blocks from witnessname
def getmissed(witnessname):
    witness = rpc.get_witness(witnessname)
    missed = witness["total_missed"]
    return missed

### work on cleaning up these preliminary variables
missed = getmissed(config.witnessname)
recentmissed = 0
witness = rpc.get_witness(config.witnessname)
lastblock = witness["last_confirmed_block_num"]
block = rpc.info()["head_block_num"]  # initialise here so the emergency checks never hit an undefined name
emergency = False

### switches to next public key after config.strictness missed blocks
def switch(witnessname, publickeys, missed):
    keynumber = (missed//config.strictness) % len(publickeys)
    key = publickeys[keynumber]
    rpc.update_witness(witnessname, "", key, "true")
    print("updated signing key to " + key)

### break some of this out into separate functions.
while True:
    witness = rpc.get_witness(config.witnessname)
    if lastblock < witness["last_confirmed_block_num"]:
        lastblock = witness["last_confirmed_block_num"]
        print(config.witnessname + " generated block num " + str(lastblock))
        recentmissed = 0
    elif config.emergencykeys != 0:
        if emergency == True:
            witness = rpc.get_witness(config.witnessname)
            if missed <= getmissed(config.witnessname) - config.strictness:
                missed = getmissed(config.witnessname)
                switch(config.witnessname, config.emergencykeys, missed)
                recentmissed +=1
                lastblock = witness["last_confirmed_block_num"]
                print("EMERGENCY!!! total missed = " + str(missed) + " recent missed = " + str(recentmissed))
            elif emergencyblock < block - 600:
                emergency = False
                switch(config.witnessname, config.publickeys, missed)
                recentmissed = 0
                print("attempting to switch back to primary nodes")
            elif recentmissed == len(config.emergencykeys) * 2:
                emergency = False
                switch(config.witnessname, config.publickeys, missed)
                recentmissed = 0
                print("attempting to switch back to primary nodes")
            else:
                time.sleep(3)
                info = rpc.info()
                block = info["head_block_num"]
                age = info["head_block_age"]
                participation = info["participation"]
                print(str(block) + "     " + str(age) + "     " + str(participation))
        elif recentmissed > len(config.publickeys) * config.strictness * 2:
            emergency = True
            missed = getmissed(config.witnessname)
            switch(config.witnessname, config.emergencykeys, missed)
            recentmissed = 0
            lastblock = witness["last_confirmed_block_num"]
            print("all primary nodes down. switching to emergency nodes")
            emergencyblock = block
        elif missed <= getmissed(config.witnessname) - config.strictness:
            missed = getmissed(config.witnessname)
            switch(config.witnessname, config.publickeys, missed)
            recentmissed +=1
            print(config.witnessname + " missed a block.  total missed = " + str(missed) + " recent missed = " + str(recentmissed))
            lastblock = witness["last_confirmed_block_num"]
        else:
            time.sleep(3)
            info = rpc.info()
            block = info["head_block_num"]
            age = info["head_block_age"]
            participation = info["participation"]
            print(str(block) + "     " + str(age) + "     " + str(participation))

    elif missed <= getmissed(config.witnessname) - config.strictness:
        missed = getmissed(config.witnessname)
        switch(config.witnessname, config.publickeys, missed)
        recentmissed +=1
        print(config.witnessname + " missed a block.  total missed = " + str(missed) + " recent missed = " + str(recentmissed))
        lastblock = witness["last_confirmed_block_num"]
    else:
        time.sleep(3)
        info = rpc.info()
        block = info["head_block_num"]
        age = info["head_block_age"]
        participation = info["participation"]
        print(str(block) + "     " + str(age) + "     " + str(participation))

Title: Re: [python] failover script
Post by: xeroc on October 10, 2015, 07:23:08 pm
Cool .. would you like to join the python development on this and have this script be a part of my repo? You can start by forking the repo and putting your script into the scripts subfolder.. then send a pull request!
Title: Re: [python] failover script
Post by: puppies on October 10, 2015, 07:27:16 pm
Cool .. would you like to join the python development on this and have this script be a part of my repo? You can start by forking the repo and putting your script into the scripts subfolder.. then send a pull request!

Most certainly.  Thanks Xeroc
Title: Re: [python] failover script
Post by: cube on October 11, 2015, 01:36:37 am
Nice!

Can the script choose to switch when the participation rate is < 50%?
Title: Re: [python] failover script
Post by: puppies on October 11, 2015, 02:15:27 am
Nice!

Can the script choose to switch when the participation rate is < 50%?

It doesn't really have to.  I was thinking I would probably run 3 nodes with 3 different signing keys.  I would also run two emergency backups with two more keys. One probably on my seed node, and another on a desktop at home.  The key selection should be deterministic so you can run the failover script on multiple boxes and they should all be selecting the same key at the same time.
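The key choice is pure arithmetic on the total missed-block count, so any box that sees the same chain state picks the same key. A quick sketch of that selection (key strings here are placeholders):

```python
# Deterministic key rotation, as in the script's switch() function:
# integer-divide by strictness so a key only rotates after `strictness`
# misses, then wrap around the tuple of public keys.
def pick_key(missed, strictness, publickeys):
    return publickeys[(missed // strictness) % len(publickeys)]

keys = ("KEY-A", "KEY-B", "KEY-C")
# with strictness = 1, every miss advances to the next key, wrapping around
assert pick_key(0, 1, keys) == "KEY-A"
assert pick_key(1, 1, keys) == "KEY-B"
assert pick_key(4, 1, keys) == "KEY-B"  # 4 % 3 == 1
```

Two machines running this against the same total_missed value will always agree on which key should be active.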

If production falls below 50 percent on one of the nodes, then it is probably on a minority fork.  If one of the witness nodes is on this fork, it will be missing blocks on the main chain, and so production will be switched away from it.  If one of the failover-script nodes is on the fork, it will attempt to switch away from any witnesses it sees missing blocks, but since it's on a fork its transactions will not make it onto the main chain and the signing key will not be updated. 

It would be possible to use a local cli_wallet and connect to the witness_nodes running on block producing nodes.  In that case you would want it to switch to another node if block production fell below 50%.  I like the added redundancy of running the failover script on multiple nodes more though. 

Also @Xeroc I submitted a pull request, but I am a noob at github so I am not sure I did it right. 

Title: Re: [python] failover script
Post by: cube on October 11, 2015, 04:07:35 am
I am thinking of running four nodes - one main node, one backup node and two emergency nodes.

The main node will switch to the backup node upon falling 3 blocks behind. If ever both main and backup nodes are <50%, it will switch to one of the two emergency nodes which has >50% participation. Is this possible?
Title: Re: [python] failover script
Post by: puppies on October 11, 2015, 04:28:54 am
I am thinking of running four nodes - one main node, one backup node and two emergency nodes.

The main node will switch to the backup node upon falling 3 blocks behind. If ever both main and backup nodes are <50%, it will switch to one of the two emergency nodes which has >50% participation. Is this possible?

good question.  @Xeroc, can graphene api communicate with 4 wallets at once?
Title: Re: [python] failover script
Post by: xeroc on October 11, 2015, 09:45:13 am
I am thinking of running four nodes - one main node, one backup node and two emergency nodes.

The main node will switch to the backup node upon falling 3 blocks behind. If ever both main and backup nodes are <50%, it will switch to one of the two emergency nodes which has >50% participation. Is this possible?

good question.  @Xeroc, can graphene api communicate with 4 wallets at once?
sure .. its yet another instance of an api connection ...

The python libs can do so too ..

@git pullrequest .. i am currently traveling to shanghai .. and wont be able to take a look probably for another 24h. sorry for the inconvenience
Title: Re: [python] failover script
Post by: puppies on October 15, 2015, 09:47:40 pm
Alright, so the script has been updated.  Until the latest pull request is approved it can be grabbed from https://github.com/gileadmcgee/python-graphenlib/scripts/switch-keys

I figured I would type up some documentation about it.  All of this will undoubtedly change.  I will try to remember to come back here and delete or update this when it does but I can't make any promises.

The script is designed to:
close any screens named witness or wallet (we don't want extra screens open)
pkill witness_node (two wallets running off the same config.ini == bad)
open a detached screen with a witness_node running inside it (with the --replay-blockchain flag)
wait 3 minutes for that witness to be ready to accept RPC calls
open a detached screen with a cli_wallet running inside it
unlock the wallet
watch for missed blocks, and if your witness misses a block, switch the signing key of the witness to a different key
keep track of how many blocks have been missed recently, and switch to an emergency signing key if all primary keys miss two full rotations
if emergency keys miss blocks, switch back to primary keys
if emergency keys sign blocks, still try to switch back every 600 blocks (roughly 30 minutes)
if your node crashes or witness participation falls below 50%, kill the screens and start over
if three launches of witness_node --replay-blockchain don't fix your problem, try a --resync-blockchain

Restarting on below-50% participation has never been tested.
--resync-blockchain after three crashes does not appear to be working currently; it will just keep --replaying.
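The replay-vs-resync decision is roughly this shape (a sketch only: the function name and paths are made up, and it shows the intended behaviour described above, which for --resync is not working yet):

```python
# Hypothetical helper showing how the relaunch command could be chosen:
# use --replay-blockchain for the first three crashes, then fall back to
# --resync-blockchain.  Paths are placeholders.
def witness_command(path_to_witness_node, path_to_data_dir, crashes):
    flag = "--replay-blockchain" if crashes < 3 else "--resync-blockchain"
    return [path_to_witness_node, "--data-dir", path_to_data_dir, flag]

cmd = witness_command("/home/user/bin/witness_node", "/home/user/data", crashes=0)
# -> ['/home/user/bin/witness_node', '--data-dir', '/home/user/data', '--replay-blockchain']
```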

You must launch with a config.py in the same directory as the switch.py program.  This is the current example-config.py:
Code: [Select]
### this is very experimental code, it has barely been tested and I don't really even know what I am doing.
### Use it with caution and at your own risk.  It seriously might not work right.
### It will kill any running instance of witness_node and relaunch witness_node in a new screen named witness
### there is a delay between launching the witness_node and launching the cli_wallet.  This is to give your witness node enough time to open up and get ready to accept the connection
### from your cli_wallet.  It is currently set to 3 minutes for a --replay and 5 minutes for a --resync.  You can modify these wait times on lines 27 and 82 of switch.py

# the name of your witness as string
witnessname = "dele-puppy"

# the password of your wallet as string
wallet_password = "puppiesRkewl" # not really my password.  Just left in so people can see how it should look.

# the public keys you would like to switch between must have at least two.  Can list the same key twice if needed
publickeys = ("BTS6v1yYVgrvrMV8XsThUT6f7YtyoSxYaec1qcthbA6sU9Xtps7fi","BTS73UhnE6uD8Axdp3cU8EmvjjaFuiAAPRwARqrgRY1vZkJLFYo4u","BTS5gH5wokGkbhcZZpxLEc884xNby3HAkiEo39bMXZ4b2AvNuSWni")

# How many missed blocks to wait for until switching to new key.
# very little testing has been done with any value other than 1
strictness = 1

# public keys you would like to use in case of emergency.  Set to 0 if you do not want to use emergency keys.
# if keys are used, must enter at least two.  Can list the same key twice if needed.
emergencykeys = 0

# the full path to your witness_node binary including binary name
path_to_witness_node = "/home/user/src/bitshares-2/programs/witness_node/witness_node"

# The full path to your data directory
path_to_data_dir = "/home/user/src/bitshares-2/programs/witness_node/witness_node_data_dir"

# rpc host and port
rpc_port = "127.0.0.1:8092"

# the full path to your cli_wallet binary including binary
path_to_cli_wallet = "/home/user/src/bitshares-2/programs/cli_wallet/cli_wallet"

# the full path to your wallet json including json file.
path_to_wallet_json = "/home/user/src/bitshares-2/programs/cli_wallet/wallet.json"

I left my command in whenever possible so people could see what it should look like.

If you would like to run this on a single node for the crash and fork protection, I would suggest launching your witness_node with only a single public/private key pair and just using that public key twice in the config.py.  So, for example, if my witness signing key was BTS6v1yYVgrvrMV8XsThUT6f7YtyoSxYaec1qcthbA6sU9Xtps7fi, my config.py could look like
Code: [Select]
publickeys = ("BTS6v1yYVgrvrMV8XsThUT6f7YtyoSxYaec1qcthbA6sU9Xtps7fi","BTS6v1yYVgrvrMV8XsThUT6f7YtyoSxYaec1qcthbA6sU9Xtps7fi")

The script will still attempt to switch between these keys when your witness misses a block, and there is a fee associated with that.  I will find a way to turn different features on and off in the future.

Oh, and the witness_node is not currently set to launch with any parameters other than --replay-blockchain and the data directory.  Everything else must be in your config.ini.

Let me know if you have any questions, and I will try to answer them.  If you have any feedback or advice I would appreciate it.  I am just learning to script.
Title: Re: [python] failover script
Post by: puppies on October 19, 2015, 03:58:57 pm
After some discussion on the witness telegram chat I decided that my previous warnings might not have been verbose enough. 

There is a risk when using any failover script that you can sign blocks on two different forks at the same time.  If 66% of witnesses did this it would be a really really bad thing.  Any witness that signs blocks on two chains at the same time should be fired.  Any time you are running more than 1 witness_node with block production enabled you need to be extremely careful.  It would be better for the network for you to miss blocks on both forks than for you to produce blocks on both forks. 

If you don't understand what I am talking about then please don't run my script.  I am planning on adding some remote rpc calls to ensure that all nodes are on the same chain, but that is not implemented yet. 

If you wanted to run a single witness node and use my script for crash and fork protection then there would not be any risk of signing on two forks at the same time.

Ultimately all witnesses are responsible for ensuring that their nodes are not misbehaving.  My script will remove some risks, but also increases the risk associated with running multiple nodes.  Please consider the health of the network and only run my script if you know what you are doing and understand the risks.  This script is currently in the experimental stage, and any use should be considered testing and debugging. 
Title: Re: [python] failover script
Post by: xeroc on October 19, 2015, 04:13:14 pm
I agree with puppies concerns .. I would propose (for the short-term) .. to enable your backup machine only in the case your change-signing-key transactions has been confirmed by the network ..
Title: Re: [python] failover script
Post by: emski on October 19, 2015, 04:17:32 pm
I agree with puppies concerns .. I would propose (for the short-term) .. to enable your backup machine only in the case your change-signing-key transactions has been confirmed by the network ..
I'd advice against that.
I would propose that you enable your backup machine only in the case you confirmed your primary machine is not signing blocks (with any key).
Title: Re: [python] failover script
Post by: puppies on October 19, 2015, 06:18:40 pm
If we can ensure that all nodes return the same signing key with get_witness, and every node has only 1 signing key, this should mitigate the risk of signing on two forks.  Should be pretty easy to add as well.  What does everyone think?
Title: Re: [python] failover script
Post by: emski on October 19, 2015, 07:05:39 pm
If we can ensure that all nodes return the same signing key with get_witness, and every node has only 1 signing key, this should mitigate the risk of signing on two forks.  Should be pretty easy to add as well.  What does everyone think?

In order to ensure this you need actual (direct) connectivity between your nodes.
In that case it is easier just to stop the primary and start the backup. You don't need to change signing keys.
Title: Re: [python] failover script
Post by: rnglab on October 19, 2015, 08:22:34 pm
In the meantime,  running puppy's script on just one node, with a single signing key pair, seems to be a good strategy for witnesses to quickly recover from a fork that requires to reindex or resync the blockchain.

Please correct me if I'm missing something.
Title: Re: [python] failover script
Post by: emski on October 19, 2015, 08:24:03 pm
In the meantime,  running puppy's script on just one node, with a single signing key pair, seems to be a good strategy for witnesses to quickly recover from a fork that requires to reindex or resync the blockchain.

Please correct me if I'm missing something.

Yes running it on a single node is a good idea.
Title: Re: [python] failover script
Post by: puppies on October 19, 2015, 08:33:51 pm
I will pull all the key switching stuff out and push a different script that just replays and resyncs in case of crash or low participation.  Hopefully tonight or tomorrow, but I'm getting married on Wednesday so things are a little hectic.
Title: Re: [python] failover script
Post by: emski on October 19, 2015, 09:04:04 pm
I will pull all the key switching stuff out and push a different script that just replays and resyncs in case of crash or low participation.  Hopefully tonight or tomorrow, but I'm getting married on Wednesday so things are a little hectic.

Congrats!
Title: Re: [python] failover script
Post by: puppies on October 20, 2015, 01:40:35 am
Thanks emski.

I just submitted a pull request for a new script.  This one is called watcher.py.  I have not tested the witness participation portion nor the resync since I removed all the key switching stuff. If you use it, please let me know if something doesn't work.
Title: Re: [python] failover script
Post by: emski on October 20, 2015, 07:21:09 am
Thanks emski.

I just submitted a pull request for a new script.  This one is called watcher.py.  I have not tested the witness participation portion nor the resync since I removed all the key switching stuff. If you use it, please let me know if something doesn't work.

I'll review it and if it looks better than my quick-n-dirty solution I'll use it.
Title: Re: [python] failover script
Post by: puppies on October 20, 2015, 03:29:14 pm
Thanks emski.

I just submitted a pull request for a new script.  This one is called watcher.py.  I have not tested the witness participation portion nor the resync since I removed all the key switching stuff. If you use it, please let me know if something doesn't work.

I'll review it and if it looks better than my quick-n-dirty solution I'll use it.
I would be shocked if it is better than your solution.  I'm a nubbin at the programming.  I'm fully expecting someone who is a better programmer than I am to come up with a better solution than I have.  I figure it's a good way to learn more about programming while the better developers are working on more important things.  My feelings are not going to be hurt when my script is replaced.
Title: Re: [python] failover script
Post by: emski on October 20, 2015, 04:27:03 pm
(: I'll be shocked if it isn't better as mine is just a "restart if it crashes".
Title: Re: [python] failover script
Post by: Thom on October 20, 2015, 11:12:50 pm
In the testnet I switched block production off on one server and on in another server with a single update_witness transaction.

However, that was a case where both nodes were fully synchronized and on the same fork.

If block production is to be switched to another witness node, the current block producer must be disabled or shut down to ensure blocks aren't signed on multiple forks. That could be done with some type of multi-node communication, but that is problematic if the other nodes cannot be reached due to DoS or other reasons. It is better for the block producer to switch off block production through self-contained means.

Is it possible to monitor missed blocks and, if they cross a configured threshold, to disable production from the API? I don't see an API command by which that could be done, though it was possible in the 0.9.x client. It seems like it would be very simple to add such a cli command. However, if the node was missing blocks due to being on a fork, it probably isn't very useful to keep that node running at all, so we might as well just kill the process.

The key to robust failover protection is monitoring all nodes participating and communicating in a witness's failover group. Monitoring the witness participation % is probably the best way to detect being on the majority fork. Once that falls below a certain threshold, that node should just kill itself. The other nodes can monitor the missed block count and participation rate, and if the missed block count increases while the participation rate stays high, one of them could issue an update_witness to switch keys and activate production on itself.

That scheme would work fine for 2 nodes but gets tricky for 3 or more without causing multiple block producers. I think it would be very difficult, if not impossible, to create an algorithm without the potential for race conditions unless there is some sort of communication between the participating failover nodes.
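The "kill itself below a threshold" part of this scheme is simple to sketch (a toy illustration, not code from the script):

```python
# A node on a minority fork sees low witness participation, because most
# scheduled witnesses (as this node understands the schedule) appear to be
# missing their slots.
def should_self_kill(participation_percent, threshold=50.0):
    return participation_percent < threshold

assert should_self_kill(31.25) is True   # almost certainly on a minority fork
assert should_self_kill(98.4) is False   # healthy node
```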
Title: Re: [python] failover script
Post by: puppies on October 28, 2015, 08:13:07 pm
Alrighty.

I have updated both switch.py and watcher.py.

I have improved the launching of watcher.py during initial launch, during replay, and during resync.  Rather than waiting a specified time and then loading the cli_wallet, it will attempt to load the cli_wallet every 11 seconds until it is successful.
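The retry loop has roughly this shape (a simplified sketch: the real script opens the cli_wallet in a screen and probes it over RPC, stood in here by a generic connect callable):

```python
import time

# Keep trying the wallet every 11 seconds until it answers, instead of
# sleeping a fixed 3 or 5 minutes and hoping the node is ready.
def wait_for_wallet(connect, interval=11, max_attempts=50, sleep=time.sleep):
    for _ in range(max_attempts):
        try:
            return connect()       # e.g. an rpc.info() call against the wallet
        except Exception:
            sleep(interval)        # wallet not up yet; wait and retry
    raise RuntimeError("wallet never came up")

# usage with a fake connection that fails twice, then succeeds
attempts = []
def fake_connect():
    attempts.append(1)
    if len(attempts) < 3:
        raise ConnectionError("not ready")
    return "ok"

assert wait_for_wallet(fake_connect, sleep=lambda s: None) == "ok"
```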

Switch.py will now integrate with 2 remote witness nodes.  It will ensure that the signing keys for the specified witness match.  If there is a fork and they do not match, switch.py will copy the signing key from the node with higher witness participation to the node with lower witness participation.  Documentation and comments are still pretty minimal.  I will try to flesh those out when I get a chance.

I have a testnet up and running which you can use to test if you would like.  It is based off of a modified oct5-genesis.json which is available at https://github.com/gileadmcgee/dele-puppy/ (https://github.com/gileadmcgee/dele-puppy/)  It is running on the most recent bitshares tag v2.15.294.  Seed node is 107.170.232.94:1776.

If you are going to run switch.py on the testnet, you will need to slightly modify producer1.py and producer2.py, specifically lines 34, 35 and 50, 51.  Just comment out one line and uncomment the other; it should be pretty self-explanatory.  You will need to modify config-example.py lines 11, 25, 28, 34, and 40 with your local settings and save it as config.py.  There must be a wallet.json at the location called out on line 37 of config-example.py.  That wallet must have the password called out on line 11 of config-example.py.  You will need to have launched it with --chain-id, set the password, and saved the wallet.  The producer sub-scripts will generate a wallet if you do not have one, but I have not added that functionality to switch.py yet.

The config for producer1.py and producer2.py is built into the files themselves, currently lines 7-21.  These should not need to be modified to work on the testnet, but will of course need to be modified if run under other conditions.  Also, it's possible that you will need an empty file named __init__.py in both the producer1 and producer2 directories. 

Let me know what you think.  I believe that if properly configured this script mitigates the risk of signing on two different forks at the same time.  I would appreciate any feedback before I start using this on the live network though.

Title: Re: [python] failover script
Post by: emski on October 29, 2015, 07:23:51 am

Switch.py will now integrate with 2 remote witness nodes.  It will ensure that the signing keys for the specified witness match.  If there is a fork and they do not match, switch.py will copy the signing key from the node with higher witness participation to the node with lower witness participation.  Documentation and comments are still pretty minimal.  I will try to flesh those out when I get a chance.


Can you provide more info and/or example of this ?
Title: Re: [python] failover script
Post by: pc on October 29, 2015, 05:37:04 pm
Switching witness nodes by updating the signing key will never be secure in the sense that you cannot completely prevent double-signing.

Let's call your current active witness node "W" and the node to which you want to switch "_W".

Suppose that the current head block is block #1, some other node X will sign the next block #2, and it's your turn to sign block #3.

Now you publish the update_witness transaction. Suppose that node X sees it and includes it in block #2.

Suppose that for some reason W does not receive block #2 in time. W will create block #3 and link to block #1.
Suppose further that _W does receive block #2 in time. For _W the switch is in effect, so _W will create block #3 and link to block #2.

You cannot prevent that by monitoring your nodes, because by the time you notice the problem it may already be too late.

Conclusion: if you want to switch your active node you *must* turn off the current signing node first, and only after that publish the update_witness TX.
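pc's race can be shown with a toy model: each node derives the active signing key from the chain it has seen, so divergent views yield divergent keys (all names here are made up):

```python
# W misses block #2 (which carries the update_witness), _W receives it.
# Each node then believes a different key is active for slot #3, and since
# each node holds one of the two keys, both produce a block #3.
def active_key(chain):
    key = "old-key"
    for block in chain:
        if "update_witness:new-key" in block.get("txs", ()):
            key = "new-key"
    return key

chain_seen_by_W = [{"num": 1}]  # never saw block #2
chain_seen_by_W2 = [{"num": 1}, {"num": 2, "txs": ("update_witness:new-key",)}]

assert active_key(chain_seen_by_W) == "old-key"   # W signs #3 with old-key
assert active_key(chain_seen_by_W2) == "new-key"  # _W signs #3 with new-key
```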
Title: Re: [python] failover script
Post by: xeroc on October 29, 2015, 06:30:39 pm
Conclusion: if you want to switch your active node you *must* turn off the current signing node first, and only after that publish the update_witness TX.
Can you not wait to activate "_W" until you see the update_witness transaction being confirmed by 2/3 of the witnesses (so that it is IRREVERSIBLE), and then be sure that you are on the correct fork when activating "_W"?
Title: Re: [python] failover script
Post by: puppies on October 29, 2015, 07:00:29 pm
Good idea Xeroc.  If you verified that your next witness slot was far enough in the future to ensure that the update_witness went through, and then killed the node if it could not be switched, you should have no extra liability from switching nodes.  I think this might be a little bit of overkill. 

Any time an active witness node misses the block immediately preceding its own block, there will be a fork.  I need to spend some time mapping out the possibilities and then testing the fork resolution. 

Emski give me a few minutes and I will go into detail about how the script currently works.
Title: Re: [python] failover script
Post by: puppies on October 29, 2015, 09:14:20 pm

Switch.py will now integrate with 2 remote witness nodes.  It will ensure that the signing keys for the specified witness match.  If there is a fork and they do not match, switch.py will copy the signing key from the node with higher witness participation to the node with lower witness participation.  Documentation and comments are still pretty minimal.  I will try to flesh those out when I get a chance.


Can you provide more info and/or example of this ?

Okay.  So one thing I didn't mention is that this does require you to expose the websocket on your witnesses to outside traffic.  You could restrict this to only accept traffic from your control node if you were concerned about the security implications. 

When the script launches, it opens a wallet on the control node and connects this wallet to the websocket port of your producing witness.  If there is no wallet file, it creates one and imports your witness's active private key.  It then unlocks the wallet and saves it.  If there is already a wallet, it just unlocks it. 

Every 3 seconds each node is queried with get_witness witnessname, and the signing keys of the two nodes are compared and printed.  If there is a mismatch between the two production nodes, it looks at the witness participation rate.  If the participation rate is the same, it does nothing.  If the participation rate is higher on one node, it runs update_witness witnessname "" <signing key from node with higher participation> true on the node with lower participation.

Each node should be running a copy of watcher.py, which will replay or resync in case of a crash or witness participation below 50%.

Hopefully I answered your question emski.

The code that opens websocket connections to each production node is:
Code: [Select]
def openProducer():
    print("opening " + wallet_name)
    attempt = 0
    result = None
    while result is None:
        if attempt < 4:
            try:
                print("waiting ...")
#                subprocess.call(["screen","-dmS",wallet_name,path_to_cli_wallet,"-H",local_port,"-s",remote_ws,"--chain-id","16362d305df19018476052eed629bb4052903c7655a586a0e0cfbdb0eaf1bfd8"]) ### uncomment this line if running on testnet
                subprocess.call(["screen","-dmS",wallet_name,path_to_cli_wallet,"-H",local_port,"-s",remote_ws]) ### comment this line out if running on testnet
                time.sleep(1)
                checkIfNew()
                unlockWallet()
                result = rpc.info()
            except Exception:
                time.sleep(10)
                attempt += 1
        else:
            break
The portion of the main loop that checks the signing key is
Code: [Select]
        else:
            try:
                if compareSigningKeys() == False:
                    choice = comparePart()
                    setRemoteKey(choice)
            except Exception:
                try:
                    part1 = producer1.info()
                    print(part1)
                except Exception:
                    print("producer1 no workie")
                    producer1.closeProducer()
                    producer1.openProducer()
                try:
                    part2 = producer2.info()
                    print(part2)
                except Exception:
                    print("producer2 no workie")
                    producer2.closeProducer()
                    producer2.openProducer()
The functions related to this are
Code: [Select]
def compareSigningKeys():
    if producer1.getSigningKey() == producer2.getSigningKey():
        print("node1 signing key= "+producer1.getSigningKey()+"       node1 witness participation = " + str(producer1.info()))
        print("node2 signing key= "+producer2.getSigningKey()+"       node2 witness participation = " + str(producer2.info()))
        return True
    else:
        print("ERROR....ERROR....ERROR....ERROR....ERROR")
        print("signing keys are different.  You have been forked")
        return False
Code: [Select]
def comparePart():
    if producer1.info() == producer2.info():
        return 0
    elif producer1.info() > producer2.info():
        return 1
    elif producer2.info() > producer1.info():
        return 2
Code: [Select]
def setRemoteKey(num):
    if num == 0:
        return
    elif num == 1:
        signingKey = producer1.getSigningKey()
        producer2.setSigningKey(signingKey)
    elif num == 2:
        signingKey = producer2.getSigningKey()
        producer1.setSigningKey(signingKey)
Code: [Select]
def getSigningKey():
    witness = rpc.get_witness(witnessname)
    signingKey = witness["signing_key"]
    return signingKey
Code: [Select]
def setSigningKey(signingKey):
    rpc.update_witness(witnessname,"",signingKey,"true")

Code: [Select]
def info():
    info = rpc.info()
    part = info["participation"]
    part = float(part)
    return part



As always if you have any input I would love to hear it.

If we end up deciding that running any automated failover script is too risky, and this code is never used by anyone then I will be okay with that.  I have learned a lot and had lots of fun writing it.
 
Title: Re: [python] failover script
Post by: emski on October 29, 2015, 09:54:20 pm
Let me see if I got it right:

1 You are running two witness instances for the same witness account but with different signing keys
2 You allow both nodes to sign blocks.
3 At some point in time you want to switch the signing key.

This can work only if the switch-signing-key transaction is confirmed in BOTH chains AND only one node signs blocks at any given moment.

2/3 confirmation is not irreversible if there is an option for double signing.

See my example here (from this thread: https://bitsharestalk.org/index.php/topic,19360.0.html ):
No response ?

Imagine the following situation:

31 witnesses total.
Automated backup that works like this (from secondary node):
1 If the primary node is missing blocks publish change signing key transaction.
2 Checks the latest irreversible (by BM's definition 66% of witnesses signed (total of 21)) block and verifies that signing key is irreversibly changed.
3 Starts to sign blocks with the new key if it is irreversibly changed.

Let's say we have witnesses with the above-mentioned automated backup.
Let's say we have a network split where the witnesses are divided into two groups -> group A (21) / group B (10).
In chain A (with 21 witnesses) we have 10 witnesses missing blocks.
In chain B (with 10 witnesses) we have 21 witnesses missing blocks.

In chain A we have 10 change-of-signing-key transactions (for all witnesses from group B).  When these transactions are confirmed, the backup nodes for group B start signing blocks.

Imagine now that witnesses from A begin to lose connection to other nodes in A and connect to witnesses in B.  Let this happen one witness at a time.
When the first witness (X) "transfers" from A to B we will still have group A with more than 66% participation.  Then X's backup node will activate (let it be connected to group A), changing the signing key and starting to sign blocks => maintaining 100% participation in chain A.  However, the original X will continue signing blocks together with group B.  If this is repeated 11 times (note that this can happen with up to 10 witnesses simultaneously) we'll have:
 Fork A with >66% active witnesses; Fork B with >66% active witnesses.

Again, I'm not saying this is likely to happen, but it might be doable if witnesses are able to sign on two chains simultaneously.
Title: Re: [python] failover script
Post by: puppies on October 29, 2015, 10:09:41 pm
I'm aware of the risk emski.

I think it can be mitigated to acceptable levels.  In fact I think a scripted solution could be far less likely to cause an issue than human error is. 
Title: Re: [python] failover script
Post by: puppies on October 29, 2015, 10:45:26 pm
I think perhaps an easier way to explain the risk is this.

11 witnesses.  5 running failover scripts.

The initial state of the network is nodes 1-6 and nodes 7a, 7b, 8a, 8b, 9a, 9b, 10a, 10b, 11a, and 11b all on chain A.  On each failover pair only node a is signing.

There is a fork.  Nodes 1-3 and nodes 7a, 8a, 9a, 10a, and 11a all stay on chain A.  With 8 of 11 nodes still signing, the network is at 72% participation: low but sustainable.

On chain B we now have nodes 4-6, and 7b, 8b, 9b, 10b, and 11b.  As 7b, 8b, 9b, 10b, and 11b miss blocks, the failover script switches the signing key to the one active on 7b, 8b, 9b, 10b, and 11b respectively.  Since these nodes are now connected to chain B, they can only change the signing key on chain B, leaving chain A unchanged.

The end result is:

Nodes 1-3, 7a, 8a, 9a, 10a, and 11a signing on chain A: 72% participation, with transactions treated as "irreversible" after 67% of witnesses sign them.

Nodes 4-6, 7b, 8b, 9b, 10b, and 11b signing on chain B: 72% participation, with transactions treated as "irreversible" after 67% of witnesses sign them.

This is an extremely unlikely worst-case scenario, even with a simple script that only looks at missed blocks.  The results would be terrible, though, and need to be prevented.  It could lead to double spending.  Even if it didn't, it would be a giant pain in the ass to fix and would cause permanent damage to our image.

This worst-case scenario is not possible even with my current rough script, and the script could be improved upon to reduce the risk even further.

Ultimately, while the result of us dropping below 67% and the chain coming to a halt is not as bad as two chains over 67% existing, it is far more likely to occur.  The chain stopping would still be a giant pain in the ass to fix and would cause permanent damage to our image.

I think a properly designed failover script could mitigate both risks to acceptable levels.

I don't think I have reasoned out all possibilities by any means but in regards to my script in its current form I see a few ways that double sign issues could arise.

First of all would be a massive split of the internet.  Let's assume that the majority of primary producing nodes are in the United States, and that the United States gets entirely disconnected from the rest of the world.  If the majority of control nodes and backup nodes are outside the United States, then when they switched over there would effectively be two networks: one within the United States and one outside it.  I think this is so unlikely that we don't really need to game-plan for it.  If anyone disagrees, let me know.  I think there may even be a solution to this extreme possibility, but I haven't spent a lot of time thinking about it.

The script as currently written involves three different nodes: two producer nodes and a control node.  Each producer node will restart the witness node and cli wallet if it crashes or if witness participation falls below 50%, but the production nodes will not attempt to change signing keys.  They will happily miss blocks as long as witness participation stays above 50%.  The control node will not sign blocks, but will restart itself if it crashes or if witness participation falls below 50%.  As blocks are missed it will round-robin between nodes in a deterministic fashion (which node depends on the total missed blocks reported to the control node from a get_witness command).  The possibilities I see are
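The deterministic round robin can be expressed as a pure function of the missed-block count from get_witness.  This is a sketch of the idea rather than the script's exact code, and pick_signing_key is a name I'm using here purely for illustration:

```python
def pick_signing_key(publickeys, total_missed, strictness=1):
    """Map a missed-block count to one of the configured public keys.

    Because the choice depends only on total_missed (which every node
    reads from the same get_witness result), all nodes agree on which
    key should currently be active without any extra coordination.
    """
    index = (total_missed // strictness) % len(publickeys)
    return publickeys[index]
```

With publickeys = ("key_a", "key_b") and strictness = 1, each additional missed block rotates to the next key and wraps around, which is why the behaviour is a little odd with strictness above 1.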

One producer node forks off onto minority fork

both producer nodes fork off onto minority fork

All three nodes fork onto different forks.

I keep saying minority fork, but ultimately we are only concerned with signing on one fork at a time.  Therefore one producer node forking is exactly the same as the control node and one producer node forking: either way it is effectively two networks, one with a single producer and the other with the control node and a producer.  As always, if my reasoning seems incorrect please let me know.

Since the only node capable of changing the signing key is the control node, any fork that separates the control node from both production nodes is not a concern with regard to double signing.  The control node will furiously update the witness from signing key to signing key until witness participation drops below 50%.  It will then replay, and if unable to fix itself with a replay it will resync.  All the while, the witness node on the majority chain will either be signing blocks or missing blocks.  It is of course possible, if all three nodes split, that the control node will replay and end up on the same chain as one of the producer nodes.  I am not sure that makes a difference, though.

This leads us to the interesting possibility: one producer node and the control node on a single chain.  The control node is then capable of letting the secondary producer node sign while the primary is signing on a different chain.  The way it should work, if there is a fork, is that producer a and the control node go off on chain 1 while producer b goes off on chain 2.  If the witness misses a block on chain 1, the control node will change the signing key on chain 1.  The control node will then notice the signing key variance and set both producer nodes to the signing key active on the chain with higher witness participation.  If chain 1 has higher participation, all is well, and eventually producer b will fall below 50% and will replay and/or resync.  If chain 2 has higher participation, the control node will attempt to change the signing key every time the witness misses a block on chain 1, until chain 1 falls below 50% witness participation, at which point both producer a and the control node replay and/or resync.

Most of the time this split will be okay.  However, if the signing key change happens within 3 blocks of the witness's next signing slot, it is possible that the witness will double sign a block before it is fixed.  The worst case I can construct double signs two blocks.  For this scenario: producer a = a, producer b = b, control node = c, chain 1 = 1, chain 2 = 2; producer a holds signing key A and producer b holds signing key B.

The starting situation is all nodes spending some quality time on chain 1, with A the active signing key on chain 1.  Sadly, node a crashes.  Signing key A then misses a block and node c switches the signing key on chain 1 to B.  b happily signs blocks on 1 until a replays.  Unfortunately there has been a minor fork in the meantime, and when a replays it ends up on chain 2.  Chain 2 still has signing key A active.  As soon as a comes back up, c compares the signing keys of a and b.  Sadly, however, a comes back immediately before it is due to sign a block on chain 2, and happily signs it.  c changes the active signing key on chain 2, but a had back-to-back blocks and has therefore signed two blocks on the wrong chain.  It is further possible that a will keep crashing and replaying, end up on a new minority chain every time, and sign two blocks before it can be caught by c and put back in its place.

The second possibility for concern is if a and c end up on a minority chain (1) while b ends up on the majority chain (2).  The issue here is that every time the witness misses a block on 1, c will change the signing key on 1.  c will almost immediately catch that a and b no longer have the same signing key and will switch the signing key back to B.  There is, however, a risk if the witness has two blocks within 3 blocks of each other.  Let's assume the witness misses block 1000 on chain 1.  c will switch the signing key on chain 1 at block 1001.  c will then notice the variance in signing keys and switch the key to B at block 1002.  With lag, it is possible this will not take effect until block 1003.  If a was scheduled to sign block 1001, 1002, or 1003, the witness would have double signed a single block.  This could conceivably repeat until chain 1 falls below 50% and both a and c replay and/or resync.

I haven't reasoned through what all of these variations would mean for the network, but it does seem extremely improbable that enough witnesses would run into either of these problems close enough together in time to cause two majority forks.

If you have made it this far, I apologize for the massive walls of text you have waded through.  If I could come up with a better way of explaining my reasoning I certainly would; if you know of one, please let me know.  Also please let me know if my reasoning or assumptions seem suspect.

Title: Re: [python] failover script
Post by: emski on October 30, 2015, 08:26:51 am
I've read your post.
I state that allowing two nodes to sign blocks with the same witness account simultaneously should be banned.
@Bytemaster do you agree ?
@puppies your control node should make sure that only one of the witness nodes is signing blocks at any moment.

There is no concept of "low risk" when you are dealing with such a system.  It either works in all possible cases or it is not secure.  Simultaneously signing on two chains (even with different signing keys) is an issue.  This breaks BM's definition of irreversible.

My recommendation is to have two synchronized nodes and the control node only allows one of them to sign blocks (if they are on a different fork your control node just picks which one should be active).

This is my opinion. Feel free to do whatever you consider "low enough risk".
Title: Re: [python] failover script
Post by: tonyk on October 30, 2015, 08:37:28 am
well, add to that that witness "holytransaction" is a guy no one knows anything about....other than that the account is fully controlled by another witness....


and this is just a case that a guy with poor tech skills can identify... In reality, there might be only 4 witnesses  in all.
3 of them running witnesses on 2 blockchains....?????
Title: Re: [python] failover script
Post by: cube on October 30, 2015, 09:19:53 am

There is no concept of "low risk" when you are dealing with such a system.  It either works in all possible cases or it is not secure.  Simultaneously signing on two chains (even with different signing keys) is an issue.  This breaks BM's definition of irreversible.

My recommendation is to have two synchronized nodes and the control node only allows one of them to sign blocks (if they are on a different fork your control node just picks which one should be active).

I think 'low risk' is 'too much risk' for the bts network to take, especially as we are dealing with people's money and the reputation of bitshares.

puppies, is it possible to refine the failover script to include an external 'control node' as recommended by emski?

well, add to that that witness "holytransaction" is a guy no one knows anything about....other than that the account is fully controlled by another witness....

Are you saying there is a witness account controlled by another witness?
Title: Re: [python] failover script
Post by: emski on October 30, 2015, 09:25:06 am

There is no concept of "low risk" when you are dealing with such a system.  It either works in all possible cases or it is not secure.  Simultaneously signing on two chains (even with different signing keys) is an issue.  This breaks BM's definition of irreversible.

My recommendation is to have two synchronized nodes and the control node only allows one of them to sign blocks (if they are on a different fork your control node just picks which one should be active).

I think 'low risk' is 'too much risk' for the bts network to take, especially as we are dealing with people's money and the reputation of bitshares.

puppies, is it possible to refine the failover script to include an external 'control node' as recommended by emski?

He already has an external control node. The issue is that both his witness nodes are simultaneously signing blocks (with different signing keys). The control node just updates the signing keys. What I propose is that the control node ensures that only one of the signers will sign at any moment.
Title: Re: [python] failover script
Post by: cube on October 30, 2015, 01:00:06 pm
He already has an external control node. The issue is that both his witness nodes are simultaneously signing blocks (with different signing keys). The control node just updates the signing keys. What I propose is that the control node ensures that only one of the signers will sign at any moment.

I did not realise it has an external control node. Cool!

puppies, is it possible to make this change?
Title: Re: [python] failover script
Post by: puppies on October 30, 2015, 03:24:10 pm
I think it could be done, cube, but you're more likely to drop blocks in the transition.  It's also still low risk, not zero risk; then again, human error from manual switching is not zero risk either.

I think a better solution would be to only update the signing key when there is enough time to ensure it took effect, and to kill the node if it didn't.

Having a produce true/false flag like 1.0 had would make this easier, though I believe that flag still caused lots of problems in 1.0.
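The "update only if there is enough time, and kill the node if it didn't take" idea could look roughly like this.  A sketch only: switch_and_verify is a name I made up, rpc stands for a cli_wallet connection as in the script above, and the pkill fallback is purely illustrative.

```python
import subprocess
import time


def switch_and_verify(rpc, witnessname, new_key, timeout=6):
    """Broadcast the key change, then confirm it actually shows up in
    get_witness; if it does not confirm in time, stop the old signer
    rather than risk signing on two chains."""
    rpc.update_witness(witnessname, "", new_key, "true")
    deadline = time.time() + timeout
    while time.time() < deadline:
        if rpc.get_witness(witnessname)["signing_key"] == new_key:
            return True  # the change is visible on-chain
        time.sleep(1)
    # The change never confirmed -- kill the producing node (illustrative).
    subprocess.call(["pkill", "-f", "witness_node"])
    return False
```

The design choice here is fail-closed: a dropped block or a killed node costs a little participation, while two live signers on different forks risk the irreversibility problem emski describes.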
Title: Re: [python] failover script
Post by: cube on October 30, 2015, 03:50:12 pm
I think it would be cube, but you're more likely to drop blocks in the transition.  It's also still a low risk not zero risk.  Human error from manually switching is not a zero risk. 

I think a better solution would be to only update the signing key if there is enough time to ensure it took, and kill the node if it didn't.

Havin a produce true false flag like 1.0 did would make this easier.  I believe that still caused lots of problems in 1.0.

I think dropping a few blocks during a transition is fine.  And if the controller node can shut down the no-longer-signing node (i.e. the first node) once the switch command is sent, the risk of signing on a forked chain is essentially zero.

Is this the 'better solution' that you are proposing?
Title: Re: [python] failover script
Post by: kuro112 on October 31, 2015, 04:59:17 am
well made man, im a python fan myself and i really like your style. this caught my attention when i was setting up nodes for my own projects and im addicted.

 +5%  am i using this right? :D
Title: Re: [python] failover script
Post by: puppies on October 31, 2015, 06:22:24 pm
well made man, im a python fan myself and i really like your style. this caught my attention when i was setting up nodes for my own projects and im addicted.

 +5%  am i using this right? :D

Thanks Kuro,

I wouldn't suggest running the failover script on any dpos chain at this point.  I've learned a lot writing it, and had a lot of fun too.
Title: Re: [python] failover script
Post by: puppies on November 05, 2015, 07:06:46 pm
I've pushed a new update that checks whether your node is in one of the last 5 slots in the shuffle order (to reduce the chance of getting back-to-back blocks) and will not broadcast the update_witness command if it is.  I am no longer sure this is really needed, though.
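The slot check amounts to looking up the witness's position in the current shuffle order.  A simplified sketch (the function name is mine, and I'm assuming you can fetch the shuffled witness list from the node, e.g. via the witness schedule object):

```python
def safe_to_broadcast(shuffled_witnesses, witness_id, danger_slots=5):
    """Return False when our witness sits in one of the last few slots
    of the shuffle order, where an update_witness broadcast might not
    confirm before our next production slot."""
    if witness_id not in shuffled_witnesses:
        return True  # not scheduled this round, nothing to collide with
    position = shuffled_witnesses.index(witness_id)
    return position < len(shuffled_witnesses) - danger_slots
```

The control node would call this right before broadcasting update_witness and simply wait for the next round if it returns False.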

I noticed some odd behavior regarding the reported signing key.  I had assumed that a node would return the signing key from the last valid block.  During testing I killed the network (too many blocks missed) and restarted the node with all of my init witnesses (which have stale production turned on).

Naturally, my other running nodes would not accept these new blocks.  They weren't linked, and since no block production was taking place the head block age just kept climbing.

The odd part is that even in that state, accepting no blocks and with no new blocks being created, these nodes would still change their reported signing key immediately when given a local update_witness command.

I think this would make it impossible to double sign blocks even with my current script, and a slight modification to broadcast all changes to all nodes, never relying on them being saved to the blockchain, would completely remove the need to look at the shuffle order.