Author Topic: [python] failover script (Read 17845 times)

emski

Quote from: xeroc on October 19, 2015, 04:13:14 pm

I agree with puppies' concerns .. I would propose (for the short-term) .. to enable your backup machine only in the case your change-signing-key transaction has been confirmed by the network ..

I'd advise against that.
I would propose that you enable your backup machine only in the case you confirmed your primary machine is not signing blocks (with any key).
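A minimal sketch of such a check, assuming the witness object's last_confirmed_block_num (the field the script in this thread reads) advances whenever the witness signs a block with any key. The function name, the active_witnesses count and the two-round threshold are hypothetical. Note that this only reflects what the local node sees: it cannot prove the primary is not signing on a different fork.

```python
def primary_is_silent(last_confirmed_block_num, head_block_num,
                      active_witnesses, rounds=2):
    """Return True if the witness has not signed for more than `rounds`
    full witness rounds, as seen from this node's chain."""
    blocks_since_last_signed = head_block_num - last_confirmed_block_num
    return blocks_since_last_signed > active_witnesses * rounds


# Example: 20 active witnesses, witness last signed block 1000.
print(primary_is_silent(1000, 1010, 20))  # still within two rounds
print(primary_is_silent(1000, 1041, 20))  # more than two rounds without a block
```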

xeroc

I agree with puppies' concerns .. I would propose (for the short-term) .. to enable your backup machine only in the case your change-signing-key transaction has been confirmed by the network ..

puppies

After some discussion on the witness telegram chat I decided that my previous warnings might not have been verbose enough.

There is a risk when using any failover script that you can sign blocks on two different forks at the same time. If 66% of witnesses did this it would be a really really bad thing. Any witness that signs blocks on two chains at the same time should be fired. Any time you are running more than 1 witness_node with block production enabled you need to be extremely careful. It would be better for the network for you to miss blocks on both forks than for you to produce blocks on both forks.

If you don't understand what I am talking about then please don't run my script. I am planning on adding some remote rpc calls to ensure that all nodes are on the same chain, but that is not implemented yet.
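The same-chain check is not implemented, so the following is only a sketch of what the comparison could look like. It assumes each node has already been asked (by whatever rpc call ends up being used) for the block id it holds at one agreed block height; the function name and the dictionary layout are hypothetical.

```python
def all_on_same_chain(block_ids):
    """block_ids maps a node name to the block id that node reports at one
    agreed height. A missing answer (None) is treated as disagreement, so
    the caller errs on the side of not producing blocks."""
    if not block_ids:
        return False
    if any(block_id is None for block_id in block_ids.values()):
        return False
    return len(set(block_ids.values())) == 1


reports = {"node-a": "0001e240ab", "node-b": "0001e240ab", "node-c": "0001e240ff"}
print(all_on_same_chain(reports))  # node-c reports a different block id
```

Treating an unreachable node as a disagreement follows the point above: missing blocks is better for the network than producing on two forks.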

If you wanted to run a single witness node and use my script for crash and fork protection then there would not be any risk of signing on two forks at the same time.

Ultimately all witnesses are responsible for ensuring that their nodes are not misbehaving. My script will remove some risks, but also increases the risk associated with running multiple nodes. Please consider the health of the network and only run my script if you know what you are doing and understand the risks. This script is currently in the experimental stage, and any use should be considered testing and debugging.

puppies

Alright so script has been updated. Until the latest pull request is approved it can be grabbed from https://github.com/gileadmcgee/python-graphenlib/scripts/switch-keys

I figured I would type up some documentation about it. All of this will undoubtedly change. I will try to remember to come back here and delete or update this when it does but I can't make any promises.

The script is designed to:
close any screens named witness or wallet # we don't want extra screens open
pkill witness_node # two wallets running off of the same config.ini == bad
open a detached screen with a witness_node running inside it ## with the --replay-blockchain flag
wait 3 minutes for that witness to be ready to accept rpc calls
open a detached screen with a cli_wallet running inside it
unlock the wallet
watch for missed blocks, and if your witness misses a block, switch the signing key of the witness to a different key
keep track of how many blocks have been missed recently and switch to an emergency signing key if all primary keys miss two full rotations
if emergency keys miss blocks, switch back to primary keys
if emergency keys sign blocks, still try to switch back every 600 blocks (roughly 30 minutes)
if your node crashes or witness production falls below 50%, kill the screens and start over
if three launches of witness_node --replay-blockchain don't fix your problem, try a --resync-blockchain

Restarting below 50% production has never been tested.
--resync-blockchain after three crashes does not appear to be working currently. It will just keep --replaying.
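The intended replay-then-resync rule from the list above could be sketched like this. It is not the code in switch.py: the function names and the failed_launches counter are hypothetical, and the wait times are the 3 and 5 minutes mentioned in the example config. The command is only built as a list here, not executed.

```python
def blockchain_flag(failed_launches, max_replays=3):
    """Replay for the first three launches, then fall back to a full resync."""
    if failed_launches < max_replays:
        return "--replay-blockchain"
    return "--resync-blockchain"


def startup_wait_seconds(flag):
    """3 minutes for a replay, 5 minutes for a resync."""
    return 300 if flag == "--resync-blockchain" else 180


def witness_command(path_to_witness_node, path_to_data_dir, failed_launches):
    """Build the argv for a detached screen session named 'witness'."""
    flag = blockchain_flag(failed_launches)
    return ["screen", "-dmS", "witness",
            path_to_witness_node, "--data-dir", path_to_data_dir, flag]


for attempt in range(4):
    flag = blockchain_flag(attempt)
    print(attempt, flag, startup_wait_seconds(flag))
```

Keeping the decision in a small function of its own makes it easy to check that the fourth launch really switches to --resync-blockchain.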

You must launch with a config.py in the same directory as the switch.py programs. This is the current example-config.py

Code:

### this is very experimental code, it has barely been tested and I don't really even know what I am doing.
### Use it with caution and at your own risk.  It seriously might not work right.
### It will kill any running instance of witness_node and relaunch witness_node in a new screen named witness
### there is a delay between launching the witness_node and launching the cli_wallet.  This is to give your witness node enough time to open up and get ready to accept the connection
### from your cli_wallet.  It is currently set to 3 minutes for a --replay and 5 minutes for a --resync.  You can modify these wait times on lines 27 and 82 of switch.py

# the name of your witness as string
witnessname = "dele-puppy"

# the password of your wallet as string
wallet_password = "puppiesRkewl" # not really my password.  Just left in so people can see how it should look.

# the public keys you would like to switch between.  Must list at least two.  Can list the same key twice if needed
publickeys = ("BTS6v1yYVgrvrMV8XsThUT6f7YtyoSxYaec1qcthbA6sU9Xtps7fi","BTS73UhnE6uD8Axdp3cU8EmvjjaFuiAAPRwARqrgRY1vZkJLFYo4u","BTS5gH5wokGkbhcZZpxLEc884xNby3HAkiEo39bMXZ4b2AvNuSWni")

# How many missed blocks to wait for until switching to new key.
# very little testing has been done with any value other than 1
strictness = 1

# public keys you would like to use in case of emergency.  Set to 0 if you do not want to use emergency keys.
# if keys are used, must enter at least two.  Can list the same key twice if needed.
emergencykeys = 0

# the full path to your witness_node binary including binary name
path_to_witness_node = "/home/user/src/bitshares-2/programs/witness_node/witness_node"

# The full path to your data directory
path_to_data_dir = "/home/user/src/bitshares-2/programs/witness_node/witness_node_data_dir"

# rpc host and port
rpc_port = "127.0.0.1:8092"

# the full path to your cli_wallet binary including binary name
path_to_cli_wallet = "/home/user/src/bitshares-2/programs/cli_wallet/cli_wallet"

# the full path to your wallet json including json file.
path_to_wallet_json = "/home/user/src/bitshares-2/programs/cli_wallet/wallet.json"

I left my own values in wherever possible so people could see what they should look like.

If you would like to run this on a single node for the crash and fork protection, I would suggest launching your witness_node with only a single public/private key pair and just using that public key twice in the config.py. So for example, if my witness signing key was BTS6v1yYVgrvrMV8XsThUT6f7YtyoSxYaec1qcthbA6sU9Xtps7fi, my config.py could look like

Code:

publickeys = ("BTS6v1yYVgrvrMV8XsThUT6f7YtyoSxYaec1qcthbA6sU9Xtps7fi","BTS6v1yYVgrvrMV8XsThUT6f7YtyoSxYaec1qcthbA6sU9Xtps7fi")

The script will still attempt to switch between these keys when your witness misses a block, and there is a fee associated with that. I will find a way to turn different features on and off in the future.
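Until such a switch exists, one possible way to avoid the fee is to skip the update when the selected key is already the active one. This is a sketch only: it assumes the object returned by get_witness exposes the current key under "signing_key", and the function names are hypothetical. The key selection rule is the one used by switch() in the script.

```python
def pick_key(publickeys, missed, strictness=1):
    """Same selection rule as switch() in the failover script."""
    return publickeys[(missed // strictness) % len(publickeys)]


def key_to_broadcast(witness, publickeys, missed, strictness=1):
    """Return the key to switch to, or None if the witness already signs
    with it (so no update transaction, and no fee, is needed)."""
    candidate = pick_key(publickeys, missed, strictness)
    if witness.get("signing_key") == candidate:
        return None
    return candidate


single_node_keys = ("KEY_A", "KEY_A")  # same key listed twice, as in the example
witness = {"signing_key": "KEY_A", "total_missed": 7}
print(key_to_broadcast(witness, single_node_keys, witness["total_missed"]))
```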

Oh, and the witness node is not currently set to launch with any parameters outside of --replay-blockchain and the data directory. Everything else must be in your config.ini

Let me know if you have any questions, and I will try to answer them. If you have any feedback or advice I would appreciate it. I am just learning to script.

xeroc

Quote from: puppies on October 11, 2015, 04:28:54 am

Quote from: cube on October 11, 2015, 04:07:35 am
I am thinking of running four nodes - one main node, one backup node and two emergency nodes.

The main node will switch to the backup node upon falling 3 blocks behind. If ever both main and backup nodes are <50%, it will switch to one of the two emergency nodes which has >50% participation. Is this possible?

good question. @Xeroc, can graphene api communicate with 4 wallets at once?

sure .. it's yet another instance of an api connection ...

The python libs can do so too ..

@git pull request .. I am currently traveling to Shanghai .. and won't be able to take a look probably for another 24h. Sorry for the inconvenience

puppies

Quote from: cube on October 11, 2015, 04:07:35 am

I am thinking of running four nodes - one main node, one backup node and two emergency nodes.

The main node will switch to the backup node upon falling 3 blocks behind. If ever both main and backup nodes are <50%, it will switch to one of the two emergency nodes which has >50% participation. Is this possible?

good question. @Xeroc, can graphene api communicate with 4 wallets at once?

cube

I am thinking of running four nodes - one main node, one backup node and two emergency nodes.

The main node will switch to the backup node upon falling 3 blocks behind. If ever both main and backup nodes are <50%, it will switch to one of the two emergency nodes which has >50% participation. Is this possible?

puppies

Quote from: cube on October 11, 2015, 01:36:37 am

Nice!

Can the script choose to switch when the participation rate is < 50%?

It doesn't really have to. I was thinking I would probably run 3 nodes with 3 different signing keys. I would also run two emergency backups with two more keys. One probably on my seed node, and another on a desktop at home. The key selection should be deterministic so you can run the failover script on multiple boxes and they should all be selecting the same key at the same time.
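To illustrate the deterministic selection: the script picks the key index from the witness's total missed block count, so two boxes with the same config that see the same count pick the same key. A small sketch with placeholder key names (KEY_A etc. are made up for the example):

```python
def select_key(publickeys, total_missed, strictness=1):
    """Key selection as in the script: index derived from total missed blocks."""
    return publickeys[(total_missed // strictness) % len(publickeys)]


keys = ("KEY_A", "KEY_B", "KEY_C")

# Two independent boxes observing the same total_missed values.
box_one = [select_key(keys, missed) for missed in range(100, 106)]
box_two = [select_key(keys, missed) for missed in range(100, 106)]

print(box_one)
print(box_one == box_two)
```

This only holds while both boxes read the same total_missed, which means both have to be following the same chain.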

If production falls below 50 percent on one of the nodes then it is probably on a minority fork. If one of the witness nodes is on this fork it will be missing blocks on the main chain, and so production will be switched away from it. If one of the failover script nodes is on the fork, then it will attempt to switch away from any witnesses it sees missing blocks, but since it's on a fork its transactions will not make it onto the main chain and the signing key will not be updated.
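A sketch of the 50 percent test, assuming the "participation" value from the wallet's info call (the script prints it) can be parsed as a percentage, whether it arrives as a string or a number. The function name and the default threshold parameter are hypothetical.

```python
def probably_on_minority_fork(participation, threshold=50.0):
    """True if the participation rate reported by this node is below the
    threshold, which suggests the node is following a minority fork."""
    return float(participation) < threshold


print(probably_on_minority_fork("100.00000000000000000"))
print(probably_on_minority_fork("48.4375"))
```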

It would be possible to use a local cli_wallet and connect to the witness_nodes running on block producing nodes. In that case you would want it to switch to another node if block production fell below 50%. I like the added redundancy of running the failover script on multiple nodes more though.

Also @Xeroc I submitted a pull request, but I am a noob at github so I am not sure I did it right.

cube

Nice!

Can the script choose to switch when the participation rate is < 50%?

puppies

Quote from: xeroc on October 10, 2015, 07:23:08 pm

Cool .. would you like to join the python development on this and have this script be a part of my repo? You can start by forking the repo and putting your script into the scripts subfolder.. then send a pull request!

Most certainly. Thanks Xeroc

xeroc

Cool .. would you like to join the python development on this and have this script be a part of my repo? You can start by forking the repo and putting your script into the scripts subfolder.. then send a pull request!

puppies · « *Last Edit: October 10, 2015, 07:04:45 pm by puppies* »

Hey everybody. I have updated and improved my failover script. I have tested it quite a bit myself, but I don't think I have hit all the edge cases. Further testing is appreciated.

I am just learning how to code, and am releasing this in the hopes that it is useful to somebody until someone comes out with a better version. I will be working on improving this script, and the quality of my programming. I would appreciate any feedback that helps me move towards those goals.

The script keeps an eye on your witness's missed blocks. When you miss a block (or multiple blocks) it will switch your signing key for you. If no blocks have been produced after switching through every key twice, it will switch to a set of emergency keys. If those keys still fail to produce blocks it will switch back. Even if those keys do produce blocks it will still switch back after 30 or so minutes.

The idea is that you can have multiple keys running on multiple servers, and if one goes down you can automatically switch over to a new one. If all of them go down you can switch over to a lower powered emergency device such as a home pc. The script's behaviour is a little odd when using strictness over 1. It should still do a reasonable job, but for best results I would use strictness = 1.
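The switching between primary and emergency keys can be read as a small two-state machine. The sketch below restates the thresholds used in the script as a pure function so they can be checked without a running node. The function and parameter names are hypothetical, and n_emergency = 0 stands for "no emergency keys configured".

```python
PRIMARY, EMERGENCY = "primary", "emergency"


def next_mode(mode, recent_missed, n_primary, n_emergency,
              blocks_in_emergency, strictness=1, switch_back_after=600):
    """Decide which key set should be active next."""
    if mode == PRIMARY:
        # All primary keys have been tried twice without producing a block.
        if n_emergency and recent_missed > n_primary * strictness * 2:
            return EMERGENCY
        return PRIMARY
    # Emergency keys produced blocks for about 30 minutes (600 blocks at 3 s).
    if blocks_in_emergency > switch_back_after:
        return PRIMARY
    # Emergency keys also missed two full rotations.
    if recent_missed == n_emergency * 2:
        return PRIMARY
    return EMERGENCY


print(next_mode(PRIMARY, 7, n_primary=3, n_emergency=2, blocks_in_emergency=0))
print(next_mode(EMERGENCY, 0, n_primary=3, n_emergency=2, blocks_in_emergency=601))
```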

If none of this makes sense please let me know.

This script requires a config.py with the following parameters.
witnessname = <the name of your witness>
publickeys = <tuple of public keys as strings> i.e. ("GPH57pBVHtJzfsZZ117e5dBfaMTJxbfzfZQRFFMVuompRQAidAEwK", "GPH75xxKG4ZeztPpnhmFch99smunUWMvDy9mB6Le497vpAA3XUXaD") must have at least 2
strictness = <the number of blocks missed before a new public key is switched to> must be set to 1 or higher.
emergencykeys = <tuple of emergency public keys as strings> If no emergency nodes are used set emergency keys = 0. If keys are used, must have at least two entries. Can use same key twice if only running single emergency node
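These rules could be checked before the script starts. The following is a hypothetical helper, not part of the script; it only restates the constraints listed above. One thing it catches is ("KEY") written without a trailing comma, which Python treats as a plain string rather than a tuple.

```python
def validate_config(witnessname, publickeys, strictness, emergencykeys):
    """Return a list of problems found in the config values (empty if none)."""
    errors = []
    if not isinstance(witnessname, str) or not witnessname:
        errors.append("witnessname must be a non-empty string")
    if not isinstance(publickeys, (tuple, list)) or len(publickeys) < 2:
        errors.append("publickeys must be a tuple with at least two entries")
    if not isinstance(strictness, int) or strictness < 1:
        errors.append("strictness must be an integer of 1 or higher")
    if emergencykeys != 0:
        if not isinstance(emergencykeys, (tuple, list)) or len(emergencykeys) < 2:
            errors.append("emergencykeys must be 0 or a tuple with at least two entries")
    return errors


print(validate_config("dele-puppy", ("KEY_A", "KEY_B"), 1, 0))
print(validate_config("dele-puppy", ("KEY_A"), 0, ("KEY_E",)))
```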

and here is the script

Code:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

### You must have a config.py with the following parameters.
### witnessname = <the name of your witness>
### publickeys = <tuple of public keys as strings> i.e. ("GPH57pBVHtJzfsZZ117e5dBfaMTJxbfzfZQRFFMVuompRQAidAEwK", "GPH75xxKG4ZeztPpnhmFch99smunUWMvDy9mB6Le497vpAA3XUXaD") must have at least 2
### strictness = <the number of blocks missed before a new public key is switched to> must be set to 1 or higher.
### emergencykeys = <tuple of emergency public keys as strings>  If no emergency nodes are used set emergency keys = 0.  If keys are used, must have at least two entries.  Can use same key twice if only running single emergency node
### If all public keys fail to produce blocks after two rotations, then emergencykeys will be used.
### If all emergency keys fail to produce blocks after two rotations, then attempt will be made to switch back to primary keys
### If emergency keys produce blocks attempt will still be made to switch back to primary keys after 30ish minutes


import sys
import json
from grapheneapi import GrapheneWebsocket, GrapheneWebsocketProtocol
import time
import config

rpc = GrapheneWebsocket("localhost", 8092, "", "")

### returns total missed blocks from witnessname
def getmissed(witnessname):
    witness = rpc.get_witness(witnessname)
    missed = witness["total_missed"]
    return missed

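### work on cleaning up these preliminary variables
### block and emergencyblock are set here so the main loop never reads them before they are assigned
missed = getmissed(config.witnessname)
recentmissed = 0
witness = rpc.get_witness(config.witnessname)
lastblock = witness["last_confirmed_block_num"]
emergency = False
block = rpc.info()["head_block_num"]
emergencyblock = block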

### switches to next public key after config.strictness missed blocks
def switch(witnessname, publickeys, missed):
    keynumber = (missed//config.strictness) % len(publickeys)
    key = publickeys[keynumber]
    rpc.update_witness(witnessname, "", key, "true")
    print("updated signing key to " + key)

### break some of this out into separate functions.
while True:
    witness = rpc.get_witness(config.witnessname)
    if lastblock < witness["last_confirmed_block_num"]:
        lastblock = witness["last_confirmed_block_num"]
        print(config.witnessname + " generated block num " + str(lastblock))
        recentmissed = 0
    elif config.emergencykeys != 0:
        if emergency == True:
            witness = rpc.get_witness(config.witnessname)
            if missed <= getmissed(config.witnessname) - config.strictness:
                missed = getmissed(config.witnessname)
                switch(config.witnessname, config.emergencykeys, missed)
                recentmissed +=1
                lastblock = witness["last_confirmed_block_num"]
                print("EMERGENCY!!! total missed = " + str(missed) + " recent missed = " + str(recentmissed))
            elif emergencyblock < block - 600:
                emergency = False
                switch(config.witnessname, config.publickeys, missed)
                recentmissed = 0
                print("attempting to switch back to primary nodes")
            elif recentmissed == len(config.emergencykeys) * 2:
                emergency = False
                switch(config.witnessname, config.publickeys, missed)
                recentmissed = 0
                print("attempting to switch back to primary nodes")
            else:
                time.sleep(3)
                info = rpc.info()
                block = info["head_block_num"]
                age = info["head_block_age"]
                participation = info["participation"]
                print(str(block) + "     " + str(age) + "     " + str(participation))
        elif recentmissed > len(config.publickeys) * config.strictness * 2:
            emergency = True
            missed = getmissed(config.witnessname)
            switch(config.witnessname, config.emergencykeys, missed)
            recentmissed = 0
            lastblock = witness["last_confirmed_block_num"]
            print("all primary nodes down. switching to emergency nodes")
            emergencyblock = block
        elif missed <= getmissed(config.witnessname) - config.strictness:
            missed = getmissed(config.witnessname)
            switch(config.witnessname, config.publickeys, missed)
            recentmissed +=1
            print(config.witnessname + " missed a block.  total missed = " + str(missed) + " recent missed = " + str(recentmissed))
            lastblock = witness["last_confirmed_block_num"]
        else:
            time.sleep(3)
            info = rpc.info()
            block = info["head_block_num"]
            age = info["head_block_age"]
            participation = info["participation"]
            print(str(block) + "     " + str(age) + "     " + str(participation))

    elif missed <= getmissed(config.witnessname) - config.strictness:
        missed = getmissed(config.witnessname)
        switch(config.witnessname, config.publickeys, missed)
        recentmissed +=1
        print(config.witnessname + " missed a block.  total missed = " + str(missed) + " recent missed = " + str(recentmissed))
        lastblock = witness["last_confirmed_block_num"]
    else:
        time.sleep(3)
        info = rpc.info()
        block = info["head_block_num"]
        age = info["head_block_age"]
        participation = info["participation"]
        print(str(block) + "     " + str(age) + "     " + str(participation))
