Author Topic: Witness Monitoring Script based on websocket connection only (Python Bitshares)  (Read 4463 times)

Offline roelandp

  • Full Member
  • ***
  • Posts: 111
  • Witness, dad, kitesurfer, event organiser
    • View Profile
    • RoelandP.nl
  • BitShares: roelandp
  • GitHub: roelandp
Yes! I got voted into the active witness list! (here is my proposal) Thanks for your support!!! I immediately started publishing my feeds more frequently (twice per hour) and will continue to add more price feeds.

This morning I took the time to write a Witness Monitoring Script that monitors my witness's main tasks from an independent server, powered by @xeroc's Python Bitshares libraries for python3 (he just released 0.1.5!).

The script monitors 3 core witness tasks and reports the following via a Telegram bot API call:

1. Monitor missing blocks
Whenever a new block is missed you will get a notification. This part of the script can (and will) be extended towards automated switching to the backup witness signing key once a threshold is passed.

2. Monitor the availability of your public seednode
By utilising the telnet library the script tries to connect to the given seednode and will report on time-out or errors.

3. Monitor the publishing of a set of assets' pricefeed(s)
By requesting the asset's feeds and checking them against your witness name (configurable), the script tracks how long it has been since you last published the given asset's feed. Whenever the configurable threshold in hours has passed and you have not yet published a new feed for the asset, you will get a notification.
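
A minimal sketch of the three checks described above, not the actual script. All function names are hypothetical. The chain data (total missed blocks, last feed time) is assumed to be fetched elsewhere, and a plain TCP socket stands in for the telnet library mentioned in the post.

```python
import socket
from datetime import datetime, timedelta, timezone


def newly_missed(previous_total, current_total):
    """Blocks missed since the last poll (never negative)."""
    return max(0, current_total - previous_total)


def seednode_reachable(host, port, timeout=10.0):
    """Try to open a TCP connection to the seed node.

    Assumption: a successful connect is good enough as an availability
    signal; time-outs and refused connections both count as failures.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def feed_is_stale(last_published, now, threshold_hours):
    """True when the last published feed is older than the threshold."""
    return now - last_published > timedelta(hours=threshold_hours)


# Example with made-up values: a feed published 3 hours ago, threshold 2 hours.
now = datetime(2017, 2, 1, 12, 0, tzinfo=timezone.utc)
needs_alert = feed_is_stale(now - timedelta(hours=3), now, threshold_hours=2)
```

Each helper returns a plain value, so a polling loop can decide separately when to send a notification.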

FYI:
  • The script is written for and tested in python3, and is meant to run continuously in a 'screen' session.
  • It uses Telegram for notifications. Create your Telegram bot at @BotFather (https://telegram.me/botfather) and get your Telegram ID via @MyTelegramID_bot (https://telegram.me/mytelegramid_bot).
  • Thanks to Python Bitshares you can run this script independently of your BitShares nodes; it doesn't need cli_wallet or witness_node running.
  • In the first lines of the script you will find all configurable parameters, with explanatory comments.
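
For the Telegram part, a notification could be sent roughly like this. It is a sketch using only the standard library and the Bot API `sendMessage` method; the function names and the example token are placeholders, not taken from the script.

```python
import json
import urllib.parse
import urllib.request

TELEGRAM_API = "https://api.telegram.org"


def build_send_message_request(bot_token, chat_id, text):
    """Build the URL and form-encoded body for the sendMessage method."""
    url = f"{TELEGRAM_API}/bot{bot_token}/sendMessage"
    body = urllib.parse.urlencode({"chat_id": chat_id, "text": text})
    return url, body.encode("utf-8")


def notify(bot_token, chat_id, text, timeout=10):
    """POST the message; returns True when Telegram answers with ok=true."""
    url, body = build_send_message_request(bot_token, chat_id, text)
    with urllib.request.urlopen(url, data=body, timeout=timeout) as response:
        return bool(json.load(response).get("ok", False))
```

Usage would be something like `notify(token_from_botfather, your_telegram_id, "Missed a block!")`, with both values coming from the configuration at the top of the script.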

Check it out on GitHub!
Let me know your thoughts, remarks, or requests.

Offline xeroc

  • Board Moderator
  • Hero Member
  • *****
  • Posts: 12897
  • ChainSquad GmbH
    • View Profile
    • ChainSquad GmbH
  • BitShares: xeroc
  • GitHub: xeroc
Great to see people using pybitshares!
Give BitShares a try! Use the http://testnet.bitshares.eu provided by http://bitshares.eu powered by ChainSquad GmbH

Offline sudo

  • Hero Member
  • *****
  • Posts: 2255
    • View Profile
  • BitShares: ags

Offline Thom

Quote from: roelandp
1. Monitor missing blocks
Whenever a new block is missed you will get a notification. This part of the script can (and will) be extended towards automated switching to the backup witness signing key once a threshold is passed.

Several witnesses attempted to code an automatic failover algorithm but I don't believe any were successful without introducing new problems.

One important thing to consider is you absolutely do NOT want 2 nodes producing blocks for the same witness, as that is sure to cause havoc and fork the network.

Whenever I switch production using the "update_witness" API call, I manually make sure the old witness node and the new witness node are both listening and in sync before I execute the call. I usually submit the call on the old witness going out of production, not the new node going into production. I can then use the get_witness API call to verify that the signing key for the new node is in effect before I shut down the old witness node.
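
The manual hand-over above can be written down as a checklist in code. This is only a sketch: the four callables are hypothetical hooks that you would wire to your own nodes and wallet, and nothing here is a real pybitshares or cli_wallet call.

```python
def switch_production(old_node_in_sync, new_node_in_sync,
                      update_signing_key, get_signing_key, new_pubkey):
    """Walk through the manual hand-over; all callables are hypothetical.

    Returns True only when the chain reports the new signing key, which is
    the point at which the old node may be shut down.
    """
    # Step 1: both nodes must be listening and in sync before anything changes.
    if not (old_node_in_sync() and new_node_in_sync()):
        return False
    # Step 2: broadcast the key change (submitted via the outgoing node).
    update_signing_key(new_pubkey)
    # Step 3: confirm the new key is in effect before touching the old node.
    return get_signing_key() == new_pubkey
```

The function refuses to change anything unless both nodes are in sync, which mirrors the order of the manual steps.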

The difficulty is finding a reliable way to know for certain that the node you take out of production cannot generate blocks after you switch production to another node. Suppose the "aberrant" node has not crashed but is still running and merely cut off from the net (or the watchdog listener is cut off from that node), and the watchdog falsely concludes it is dead. The watchdog may then broadcast a new signing key, causing a new node to take over. When the network to the aberrant server is restored, it resumes network communications still thinking it is the block producer, and so generates a block along with the failover node. As far as the aberrant node is concerned, it never saw the new signing key, never thought it was offline, and continues to generate a block whenever its time to do so comes around.

When the block producer fails it may not be possible to determine for certain why or get confirmation it will not resume block production. You will need to determine if the OS for the failing node is responding but not the app, in which case failover may be possible if you build in some type of communication to restart the witness_node app or restart the entire OS. The issue is what if you can't communicate with the failing node? Is it dead or just temporarily cut off? Will it fork the network if it should come back online?
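
One cautious reading of the paragraph above, as a sketch: a watchdog only acts automatically when it can still talk to the host, and holds back when it cannot tell "dead" from "cut off". The names and the mapping are assumptions for illustration, not an agreed protocol.

```python
from enum import Enum


class Action(Enum):
    NONE = "node is healthy, nothing to do"
    RESTART_APP = "host answers but witness_node does not: restart app or OS"
    HOLD = "host unreachable: dead or just cut off? leave keys to a human"


def decide(host_reachable, app_responding):
    """Map the two observations onto a cautious next step."""
    if not host_reachable:
        # A dead node and a partitioned node look the same from outside,
        # so this sketch never switches keys automatically in that case.
        return Action.HOLD
    if not app_responding:
        return Action.RESTART_APP
    return Action.NONE
```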

I was hoping wackou & I could have implemented the backbone architecture and a failover protocol along with it, but there wasn't enough funding and wackou's time was very scarce (and still is actually). If this ecosystem is going to survive a frontal attack the witness nodes need to be protected from direct public access. Seed nodes and API servers should be the route available for public access, leaving witnesses alone to process and generate blocks quickly with minimum latency.
Injustice anywhere is a threat to justice everywhere - MLK |  Verbaltech2 Witness Reports: https://bitsharestalk.org/index.php/topic,23902.0.html

Offline abit

  • Committee member
  • Hero Member
  • *
  • Posts: 3928
    • View Profile
    • Steemit Blog
  • BitShares: abit
  • GitHub: abitmore
Not bad.

Hope someone will set up a website to show the info: statistics, charts, etc.

 +5%
BTS account: abit
BTS committee member: abit
BTS witness: in.abit

Offline roelandp

Quote from: Thom
The difficulty is in coming up with a reliable way to know for certain the node you want to take out of production will not be able to generate blocks after you switch production to another node. If the "aberrant" node has not crashed, is still running but cut off from the net (or the watchdog listener is cut off from that node) but the watchdog node falsely concludes it is dead, it may broadcast a new signing key, causing a new node to take over, but then the network to the aberrant server is restored and resumes network communications still thinking it is the block producer and so generates a block along with the failover node. As far as the aberrant node is concerned it never saw the new signing key, never thought it was offline and continues to generate a block whenever its time to do so comes around.

@Thom, thanks for your feedback. I was under the impression that signing keys work like this: if you have your witness producer name set up in config.ini but do not supply the privkey corresponding to the currently listed 'public signing key', the witness node does not produce blocks. Is that right?

A failsafe backup scenario (imho) would be: the main server runs under pubkey X with privkey XXX in the config.ini and, should it fail, the independent monitoring server calls the 'update_witness' command to start signing with pubkey Y. The backup server runs as a hot witness with privkey YYY in the config.ini and, until the switch, will log messages like: 'Not producing block 12394871234 because I don't have the private key for pubKey X', right?
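
The rule assumed in this scenario fits in a few lines. It is a toy model, not witness_node code: `local_keys` is a hypothetical stand-in for the key pairs listed in config.ini.

```python
def will_produce(onchain_signing_pubkey, local_keys):
    """A node signs only if it holds the private key for the listed pubkey.

    local_keys maps public key -> private key (stand-in for config.ini).
    """
    return onchain_signing_pubkey in local_keys


main_server = {"X": "XXX"}
backup_server = {"Y": "YYY"}

# While the chain lists pubkey X, only the main server qualifies.
before_switch = (will_produce("X", main_server), will_produce("X", backup_server))
# After update_witness lists pubkey Y, only the backup qualifies.
after_switch = (will_produce("Y", main_server), will_produce("Y", backup_server))
```

This holds as long as every node sees the same on-chain signing key.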

The only thing left is to set up an 'update_witness' call with pybitshares, @xeroc? Let's see if I can write it :P

Offline roelandp

Quote from: abit
Hope someone will setup a web site to show the info. Statistics, charts, etc.

I think @lafona has some stuff in the making re witness overview. This script is more for personal use: monitoring your own witness for availability. However, the scripts could easily be converted to log the data in a db and output it as a stats table like SteemDb.com/witnesses
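
Logging to a db could look roughly like this sketch with sqlite3. The table layout, function names, and witness names are made up for illustration.

```python
import sqlite3


def open_log(path=":memory:"):
    """Open (or create) the log database; in-memory by default."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS checks ("
        "witness TEXT, checked_at TEXT, total_missed INTEGER)"
    )
    return conn


def record(conn, witness, checked_at, total_missed):
    """Store one monitoring observation."""
    conn.execute(
        "INSERT INTO checks VALUES (?, ?, ?)",
        (witness, checked_at, total_missed),
    )
    conn.commit()


def stats(conn):
    """One row per witness: latest total and growth over the logged window."""
    return conn.execute(
        "SELECT witness, MAX(total_missed), "
        "MAX(total_missed) - MIN(total_missed) "
        "FROM checks GROUP BY witness ORDER BY witness"
    ).fetchall()
```

A web page or chart would then only need to render the rows returned by `stats`.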

Offline lafona

  • Sr. Member
  • ****
  • Posts: 231
    • View Profile
  • BitShares: lafona
Nice! I will definitely be using this to monitor my seed node and other witness related activities.
BTS Witnesses: delegate-1.lafona     Witness Thread: https://bitsharestalk.org/index.php/topic,21569.msg280911/topicseen.html#msg280911
MUSE Witness: lafona

Offline GChicken

  • Sr. Member
  • ****
  • Posts: 231
    • View Profile
Looking at the stats, I think @abit has a script that detects a failing witness and issues a transaction to the network to update his signing key; this would allow him to run two witnesses with different signing keys and auto-switch whenever there is an issue. This is only speculation; I have no idea really. But in all his time as a witness he has only missed 133 blocks, and you can see updates of the signing key on his account.

Offline GChicken

Great work, Roeland! Thanks for sharing :)

Offline Pheonike


Great work.

Offline Thom

That's a very good point, @GChicken. I have often wondered how he has been able to achieve such low missed block numbers.

@roelandp, you're correct in your understanding of how update_witness functions. However, the scenario I tried to describe is different: an active witness has a network infrastructure failure (not an app failure or a host failure such as running out of disk space or memory) and, because of that, doesn't see the transaction the monitor broadcast to switch signing keys. If the network is then restored and the witness reconnects, it will continue to sign blocks for that witness, but with a signing key that is no longer the current one, thus creating the real possibility of forking the network.
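
The scenario can be made concrete with a toy model. It only models what each node believes, using made-up class and function names; whether a real witness_node behaves this way after reconnecting is exactly the open question here.

```python
class Node:
    """Toy witness node that tracks its own view of the signing key."""

    def __init__(self, name, pubkey):
        self.name = name
        self.pubkey = pubkey          # the key pair this node holds
        self.seen_signing_key = None  # this node's view of the on-chain key
        self.connected = True

    def receive(self, signing_key):
        # Updates only arrive while the node is connected.
        if self.connected:
            self.seen_signing_key = signing_key

    def thinks_it_should_sign(self):
        return self.seen_signing_key == self.pubkey


def simulate_partition():
    """Return (main, backup) right after the main node reconnects."""
    main, backup = Node("main", "X"), Node("backup", "Y")
    for node in (main, backup):
        node.receive("X")             # initial state: X is the signing key
    main.connected = False            # network failure on the main node
    for node in (main, backup):
        node.receive("Y")             # the monitor switches the key to Y
    main.connected = True             # network restored, view still stale
    return main, backup


main, backup = simulate_partition()
both_would_sign = main.thinks_it_should_sign() and backup.thinks_it_should_sign()
```

In this model the overlap ends as soon as the main node receives the update, so the risk window is the time between reconnecting and catching up.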

I know that @puppies spent some time working on an automatic failover algo, but people found holes in it, and I don't think his approach caught on due to the shortcomings raised. I am all for improving the robustness of our network, and hope a solid algo can be developed to automatically switch in redundant nodes and disable failed nodes. The testnet is a perfect context to work out such an algorithm and observe the effects. The exact case of a witness missing an update_witness transaction can be tested there without risking a fork in production.

Offline abit

I'm not using a script for my BitShares witness; I switch keys manually.

I keep 3+ nodes online. With the help of @spartako's Telegram bot, I get notifications in time, then try to fix/switch asap.

Another reason for the low missed-block rate is a good server/VPS hosting provider (so far, AWS), and perhaps a bit of luck.

I AM using a script for my Steem witness though.

Offline Thom

Well, that's a fantastic record, abit, especially since you are only intervening manually. Have you been able to make any profit from Oct 2015 to Feb 2017 using AWS servers for 3+ nodes? What % of the total witness pay was needed to pay the server bills?

I think the hosting aspect is also extremely important. Until recently I ran all my nodes exclusively on VPSs. Regardless of how much RAM a server has (16GB on highest end VPS) I miss a block every week or so, sometimes every couple of weeks. A trickle. It could be due to many things. I just bought 2 dedicated servers. They are both with hosting companies I have not used before. When I put the first one located in Romania into operation as a seed, I ran into an odd problem I never saw before. It turned out to be an issue with the OS image (LOCALE was not set at all, no default) used by that hosting company. After resolving the LOCALE issue I ran it as a seed node for over a week and saw no issues, ran like a clock.

Last night I attempted to put that node into use as the block-producing witness the same way I always do, but it missed 2 blocks in under a minute. Just my luck to be picked to generate 2 blocks so close together. Judging by the errors, it looks like there is a missing library or some other code problem. The binaries were compiled on that platform. I'm not sure if the issue is due to an OS difference (for example a missing shared lib normally supplied with the OS), a failed package installation, or an issue in the executable binary. The problem didn't happen until the node was called on to produce a block. I double-checked the signing keys on all nodes, and they were correct. I'll get to the bottom of that today, or tomorrow if it's elusive.

I use the same setup script to ready a system to run, and I used it on another host after the one in Romania and had no issues. I will update my setup script to make sure the LOCALE is set up for English, as required by the code. I have probably been lucky, using VPSs all around the world, never to have run into the LOCALE issue before.
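
A setup script could guard against the missing-LOCALE case with a check along these lines. This is a sketch: it only looks at the `LC_ALL` and `LANG` environment variables, and treating any value starting with "en" as English is an assumption for illustration.

```python
import os


def effective_locale(env=None):
    """Return the first non-empty locale setting, or None if unset.

    LC_ALL is checked before LANG because it overrides it.
    """
    env = os.environ if env is None else env
    for name in ("LC_ALL", "LANG"):
        value = env.get(name, "").strip()
        if value:
            return value
    return None


def locale_is_english(env=None):
    """True when a locale is set and looks like an English one."""
    value = effective_locale(env)
    return value is not None and value.lower().startswith("en")
```

On a host like the one described, where no locale is set at all, `effective_locale` would return None and the setup script could stop before starting the node.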

Offline Yao

  • Hero Member
  • *****
  • Posts: 532
  • QQ/WeChat(微信):664349247
    • View Profile
  • BitShares: yao
  • GitHub: imYao
BTS witness: witness.yao
BTS Proxy: yao