1. Monitor missing blocks
Whenever a new block is missed you will get a notification. This part of the script can (and will) be extended toward automatically switching to the backup witness signing key once a missed-block threshold is passed.
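A minimal sketch of what that monitoring loop can look like, assuming a local cli_wallet-style HTTP JSON-RPC endpoint exposing "get_witness" and a witness object with a "total_missed" counter; the URL, account name, threshold and notification hook are all placeholders to adapt:

```python
# Missed-block monitor sketch. RPC_URL, WITNESS and the alerting are assumptions.
import time
import requests

RPC_URL = "http://127.0.0.1:8092/rpc"   # hypothetical local cli_wallet endpoint
WITNESS = "my-witness"                   # hypothetical witness account name
MISSED_THRESHOLD = 3                     # act once this many new misses accumulate
POLL_SECONDS = 60

def get_total_missed():
    payload = {"jsonrpc": "2.0", "id": 1,
               "method": "get_witness", "params": [WITNESS]}
    r = requests.post(RPC_URL, json=payload, timeout=10)
    r.raise_for_status()
    return r.json()["result"]["total_missed"]

def notify(message):
    # Placeholder: wire up e-mail, SMS or chat alerts here.
    print(message)

baseline = get_total_missed()
while True:
    time.sleep(POLL_SECONDS)
    missed = get_total_missed()
    if missed > baseline:
        notify(f"{WITNESS} missed {missed - baseline} block(s) since baseline")
    if missed - baseline >= MISSED_THRESHOLD:
        notify("Threshold passed - this is where automated key switching would go")
        baseline = missed  # reset the baseline so the alert does not repeat forever
```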
Several witnesses have attempted to code an automatic failover algorithm, but I don't believe any succeeded without introducing new problems.
One important thing to consider: you absolutely do NOT want two nodes producing blocks for the same witness, as that is sure to cause havoc and fork the network.
Whenever I switch production using the "update_witness" API call I manually make sure both the old and the new witness node are listening and in sync before I execute the call. I usually submit the call on the old witness going out of production, not the new node going into production. I can then use the get_witness API call to verify the signing key for the new node is in effect before I shut down the old witness node.
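The same sequence, sketched in Python against hypothetical cli_wallet-style endpoints on each node. The "info", "update_witness" and "get_witness" method names mirror the cli_wallet, but the URLs, field names and the update_witness parameter order (name, url, signing_key, broadcast) are assumptions to verify against your own wallet, which must also be unlocked and hold the witness's keys:

```python
# Manual switch-over sketch: sync check, key update from the OLD node, verify, then shut down.
import time
import requests

OLD_NODE_RPC = "http://old-witness:8092/rpc"   # hypothetical node going out of production
NEW_NODE_RPC = "http://new-witness:8092/rpc"   # hypothetical node going into production
WITNESS = "my-witness"
NEW_SIGNING_KEY = "BTS..."                     # public key already configured on the new node

def rpc(url, method, params=None):
    payload = {"jsonrpc": "2.0", "id": 1, "method": method, "params": params or []}
    r = requests.post(url, json=payload, timeout=10)
    r.raise_for_status()
    return r.json()["result"]

def head_block(url):
    return rpc(url, "info")["head_block_num"]

# 1. Make sure both nodes are listening and in sync before doing anything.
old_head, new_head = head_block(OLD_NODE_RPC), head_block(NEW_NODE_RPC)
assert abs(old_head - new_head) <= 1, "nodes are not in sync - do not switch yet"

# 2. Submit update_witness from the OLD node, pointing the witness at the new signing key.
rpc(OLD_NODE_RPC, "update_witness", [WITNESS, "", NEW_SIGNING_KEY, True])

# 3. Verify the new signing key is in effect before shutting down the old witness node.
while rpc(OLD_NODE_RPC, "get_witness", [WITNESS])["signing_key"] != NEW_SIGNING_KEY:
    time.sleep(3)
print("New signing key is active; safe to shut down the old witness node.")
```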
The difficulty is coming up with a reliable way to know for certain that the node you want to take out of production will not be able to generate blocks after you switch production to another node. Suppose the "aberrant" node has not crashed and is still running, but is cut off from the net (or the watchdog listener is cut off from that node), and the watchdog falsely concludes it is dead. The watchdog may broadcast a new signing key, causing a new node to take over; then the network to the aberrant server is restored, and it resumes communications still thinking it is the block producer, generating blocks alongside the failover node. As far as the aberrant node is concerned it never saw the new signing key and never thought it was offline, so it keeps generating a block whenever its turn comes around.
When the block producer fails it may not be possible to determine for certain why, or to get confirmation that it will not resume block production. You will need to determine whether the OS of the failing node is responding but the app is not, in which case failover may be possible if you build in some kind of communication to restart the witness_node app or the entire OS. The issue is: what if you can't communicate with the failing node at all? Is it dead or just temporarily cut off? Will it fork the network if it comes back online?
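A sketch of that distinction, assuming the watchdog can probe the host's SSH port for OS liveness and the witness RPC port for app liveness; the host, ports and restart command are placeholders, and the point is the decision logic, not the specifics:

```python
# Is the OS reachable while the witness app is not, or is the whole host unreachable?
import socket
import subprocess

HOST = "witness.example.net"   # hypothetical failing node
SSH_PORT = 22
RPC_PORT = 8090                # hypothetical witness_node RPC port

def tcp_alive(host, port, timeout=5):
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

os_alive = tcp_alive(HOST, SSH_PORT)
app_alive = tcp_alive(HOST, RPC_PORT)

if os_alive and not app_alive:
    # OS responds but the app does not: a remote restart may be enough.
    subprocess.run(["ssh", HOST, "systemctl restart witness_node"], check=False)
elif not os_alive:
    # Host unreachable: it may be dead, or merely cut off from *this* watchdog.
    # Failing over now risks two producers if it comes back still holding the
    # old signing key, so treat this as "unknown", not "dead".
    print("Cannot confirm the node is down - do not fail over automatically.")
else:
    print("Both the OS and the witness app respond; no action needed.")
```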
I was hoping wackou & I could have implemented the backbone architecture and a failover protocol along with it, but there wasn't enough funding and wackou's time was very scarce (and still is, actually). If this ecosystem is going to survive a frontal attack, the witness nodes need to be protected from direct public access. Seed nodes and API servers should be the only routes available for public access, leaving witnesses free to process and generate blocks quickly with minimum latency.