Author Topic: Witness Monitoring Script based on websocket connection only (Python Bitshares) (Read 19301 times)

xeroc

Quote from: roelandp on December 11, 2017, 08:29:22 am

Sorry for the slow reply. I have updated my github with the code to execute the 'switch key' part. The logic you discuss here is NOT integrated as I had never given the edge case a thought.

It would be cool to test this witness-frequent-key-switching on the testnet for sure, by running 2 instances with different signing keys.

If we can come up with a rotation scheme to test for, I am all ears writing the script based on python-bitshares which some can then install on the testnets and monitor. https://github.com/roelandp/Bitshares-Witness-Monitor/commit/a8bc151a9f029bab7f4e3634271bbdb040d7b700

You may be interested to see how you can use uptick to build your own console tools:
http://uptick.readthedocs.io/en/latest/custom-scripts.html

roelandp

Sorry for the slow reply. I have updated my github with the code to execute the 'switch key' part. The logic you discuss here is NOT integrated as I had never given the edge case a thought.

It would be cool to test this witness-frequent-key-switching on the testnet for sure, by running 2 instances with different signing keys.

If we can come up with a rotation scheme to test for, I am all ears writing the script based on python-bitshares which some can then install on the testnets and monitor. https://github.com/roelandp/Bitshares-Witness-Monitor/commit/a8bc151a9f029bab7f4e3634271bbdb040d7b700

Thom

Quote from: abit on May 08, 2017, 10:00:58 pm

Why randomly? A script should only switch signing key when
1) network participation rate is above a threshold, for example 80%, and
2) head block age is not too old, for example within 10 seconds, and
3) the witness missed x blocks in a row

I know it's not perfect, I'm not arguing about this, but IMHO the risk is relatively low.

OK, so you acknowledge there is a window of vulnerability. You may believe it is insignificant but you have no evidence to back up such a claim. I happen to agree with you, but I also think we should be cautious and avoid introducing things which may have a negative impact. Due diligence says this risk should be evaluated and characterized before we deploy auto failover widely.

I say randomly to simulate a worse case scenario, to increase the failure rate so we can observe the effects. More switching, more Witnesses. It would be good to see just how robust the failover is. If missed blocks is a factor stressing the testnet far enough that Witnesses start to miss more blocks is simply being thorough in testing.

Thom

Hmmm... I was under the distinct impression that security would be reduced if it's possible to anticipate when a Witness is about to produce a block. It certainly makes it easier for an attacker to target a specific Witness if the attacker can anticipate when that Witness is about to generate a block. Such an attacker could use this info to trigger a DDoS barrage just slightly before the targeted Witness is to generate.

I thought a randomized production order was a central aspect of DPoS, and can even recall discussions about wagering and how the random number generation used for Witness scheduling was not robust enough (lack of sufficient entropy as I recall) for a wagering / betting app.

xeroc

Quote from: abit on May 08, 2017, 10:00:58 pm

Quote from: Thom on May 08, 2017, 04:41:53 pm

Quote from: abit on May 08, 2017, 12:16:31 pm
... the "update_witness" command should NOT be broadcast just before your scheduled block, or even 2~3 blocks before...

How can you know when you're going to be scheduled? You don't or there's a serious problem in the Witness randomization algorithm. So how can you know when it is "safe" to switch?

Of course we know. The Pseudo-random algorithm we're using produces determinate result, that said, most time we know exactly when a witness is scheduled to produce a block. We just need to expose an API to show that info (like Steem).

Take a look at the object 2.12.0

Code: [Select]


└─(%) uptick info 2.12.0                                                                                                                                                                                                        ─┘
+----------------------------+---------------+
| Key                        | Value         |
+----------------------------+---------------+
| current_shuffled_witnesses | [             |
|                            |     "1.6.37", |
|                            |     "1.6.59", |
|                            |     "1.6.17", |
|                            |     "1.6.63", |
|                            |     "1.6.15", |
|                            |     "1.6.71", |
|                            |     "1.6.26", |
|                            |     "1.6.20", |
|                            |     "1.6.74", |
|                            |     "1.6.23", |
|                            |     "1.6.35", |
|                            |     "1.6.76", |
|                            |     "1.6.69", |
|                            |     "1.6.22", |
|                            |     "1.6.73", |
|                            |     "1.6.34", |
|                            |     "1.6.45", |
|                            |     "1.6.18", |
|                            |     "1.6.24", |
|                            |     "1.6.72", |
|                            |     "1.6.64", |
|                            |     "1.6.65", |
|                            |     "1.6.16"  |
|                            | ]             |
| id                         | 2.12.0        |
+----------------------------+---------------+

abit

Quote from: Thom on May 08, 2017, 04:41:53 pm

Quote from: abit on May 08, 2017, 12:16:31 pm
... the "update_witness" command should NOT be broadcast just before your scheduled block, or even 2~3 blocks before...

How can you know when you're going to be scheduled? You don't or there's a serious problem in the Witness randomization algorithm. So how can you know when it is "safe" to switch?

Of course we know. The Pseudo-random algorithm we're using produces determinate result, that said, most time we know exactly when a witness is scheduled to produce a block. We just need to expose an API to show that info (like Steem).

Quote

This edge case increases with the number of witnesses and with the frequency witnesses switch their signing keys. The risk may be acceptable, but before we can be sure of that testing is required to quantify the risk. If you are willing to disclose your auto switching algo we could have a full slate of witnesses (30+) voted in on the testnet that use it, all switching their keys randomly and asynchronously, and we can gather more info about the risks.

Why randomly? A script should only switch signing key when
1) network participation rate is above a threshold, for example 80%, and
2) head block age is not too old, for example within 10 seconds, and
3) the witness missed x blocks in a row

I know it's not perfect, I'm not arguing about this, but IMHO the risk is relatively low.

Thom

Quote from: abit on May 08, 2017, 12:16:31 pm

@roelandp: @Thom is correct.

Thx @abit for acknowledging my concerns.

Quote from: abit on May 08, 2017, 12:16:31 pm

... the "update_witness" command should NOT be broadcast just before your scheduled block, or even 2~3 blocks before...

How can you know when you're going to be scheduled? You don't or there's a serious problem in the Witness randomization algorithm. So how can you know when it is "safe" to switch?

This edge case increases with the number of witnesses and with the frequency witnesses switch their signing keys. The risk may be acceptable, but before we can be sure of that testing is required to quantify the risk. If you are willing to disclose your auto switching algo we could have a full slate of witnesses (30+) voted in on the testnet that use it, all switching their keys randomly and asynchronously, and we can gather more info about the risks.

abit

I just got my auto-failover script up.

@roelandp: @Thom is correct. There is an edge case that both nodes will produce blocks (up to the next witness to decide). To solve this, the "update_witness" command should NOT be broadcast just before your scheduled block, or even 2~3 blocks before, so both nodes will see that transaction being included in a block by another witness, and then be confirmed by other witnesses. It's still not 100% safe though, but practically doable. I think it's also practicable even if not checking this way, because normally we won't have a lot of witnesses switching keys at the same time (when that's happening, network participation rate must be low, so we can check participation rate first before update).

roelandp

Code: [Select]

2013000ms th_a    witness.cpp:196    block_production_loo ] Not producing block because I don't have the private key for BTS7Q2wS9rhqrkY7nAGNMG5MqscSUTY7gupwTQAybcEjUB77vD9a4

This is the above message on my backup witness every time it is my turn to sign a block.

Here is the relevant code: https://github.com/cryptonomex/graphene/blob/d7de6f63e8e29de42af8d06e0029d89fcfddf4fa/libraries/plugins/witness/witness.cpp#L264

The malfunctioning node will not directly receive the

Code: [Select]

update_witness call, but it will receive the scheduled round in which the witness_id is appearing (if active witness). But the 'state of the chain' has changed and this round it requires a block from his witness account but with a different 'privatekey' signature.

if the malfunction node is in the producing loop and validates all the conditions to sign, it will fail at the private-key condition, because that one is not met and will refuse to produce a block.

Thom

Quote from: roelandp on May 07, 2017, 10:12:48 pm

If you then issue an
Code: [Select]
update_witness command to switch to the other public signing key it doesn't matter if the other node might be coming back online, because it then still would try to sign blocks with the (by then) outdated public key.

It is true that if the previously active witness resumes operation and didn't see the update_witness msg it will resume signing blocks using the old signing key, and that signing key won't be the correct active signing key.

The resuming node doesn't know it isn't the correct key. When that node sees its "turn" in the witness rotation it will produce a signed block for that witness which may fork the network, as now you have 2 nodes for the same witness signing blocks with different keys.

If the malfunction affected 2 or more witness (common datacenter or problematic network trunk) and neither of them received the update_witness msg the "other" cut off witness could think it was a valid block and add it to its chain causing a fork.

Such double production with different keys may not fork the net if receivers of the "bad/old" block reject it outright due to some type of cryptographic decrypt failure that prevents that bad block from ever being considered valid. I do not know enough details to say if such blocks are rejected as invalid. I do know there was quite a discussion about automatic switching and AFAIK no algorithm was conceived to eliminate forking risks.

This is a perfect case to testing on the testnet.

Perhaps someone familiar with the C++ code could evaluate how multiple blocks for the same witness signed with different keys are processed could lay this question to rest.

If there is a possibility that automatic switching might increase the chance of forking, even if it is a rare and fringe case, it seems the likelihood would only increase as the volume of transactions increase.

roelandp

Quote from: Thom on May 03, 2017, 03:22:15 pm

To avoid that you need a way to make sure the old witness is definitely is dead with no chance of coming back online while the new witness takes over. To do that you need some smarts in the cooperating failsafe nodes to determine each node's state. Some type of heartbeat so that if the node producing blocks does NOT hear heartbeats from at least 2 other nodes it will cease block production. The producing node needs to verify it can communicate with the other witnesses, particularly the failsafe nodes.

Hi @Thom we briefly discussed this in telegram (i think) but I still feel the setup with having multiple servers with each its own private / public key (use

Code: [Select]

suggest_brain_key) and the witness_node software running is the way to go. As soon as the blockchain starts logging missing blocks for your witness, you know it is malfunctioning. If you then issue an

Code: [Select]

update_witness command to switch to the other public signing key it doesn't matter if the other node might be coming back online, because it then still would try to sign blocks with the (by then) outdated public key.

I wrote a paragraph in this update for the witness docs (not yet committed): https://github.com/roelandp/docs.bitshares.eu/commit/75f56c50caeddf1e34c548c005443d726d6ab509#diff-c4ebae0b7f619df56e73bcea77eb3fe1R235

Thom · « *Last Edit: May 03, 2017, 03:26:44 pm by Thom* »

Quote from: abit on May 03, 2017, 08:03:40 am

Well, my block producing node got stuck due to insufficient disk space (filled by p2p log) a few hours ago, while I'm sleeping. Unfortunately my phone was set to vibration mode, although it was notifying me all the time, I didn't wake up. I missed 177 new blocks (133 -> 310). Quite ironic. I won't be always lucky. I think it's time to setup an fail-over script.

Sorry to hear that. So you & roelandp are working on automatic failover. I hope one of you can perfect it. I have discussed that idea elsewhere, but it seems not many believe the risks are significant. All it takes to mess up the chain is for 2 nodes to broadcast signed transactions for the same witness. Fork city. To avoid that you need a way to make sure the old witness is definitely is dead with no chance of coming back online while the new witness takes over. To do that you need some smarts in the cooperating failsafe nodes to determine each node's state. Some type of heartbeat so that if the node producing blocks does NOT hear heartbeats from at least 2 other nodes it will cease block production. The producing node needs to verify it can communicate with the other witnesses, particularly the failsafe nodes.

Quote from: abit on May 03, 2017, 08:03:40 am

In regards to bills, at first I was running nodes in China with less cost. We didn't have that many transactions in the early days, so network latency was not a big issue. After Steem blockchain was launched, I got some compensation there, then setup a few nodes in AWS (as my main BitShares block producing nodes) after latency became an issue, still, had been compensated by Steem witness pay for quite some months until recently. My AWS instances are mostly r3.large (15G RAM, 2 cores, 32G local SSD), the cost per month is around 150$ each (including additional cost for more disk spaces, data transmission and etc).

Thanks for this info. This confirms that until recently witness pay barely covered the cost of servers. Essentially it was altruism (fueled by the belief the platform was worth subsidizing) that kept the network operating while we all hoped that eventually we would reach much higher adoption.

Quote from: Thom on May 01, 2017, 05:40:43 pm

Last night I attempted to put that node into use as the block producing witness the same way I always do, but it missed 2 blocks in under a minute.

I restarted the node with different witness / cli binaries and it's working fine since yesterday. So it may be a compiler issue or missing dynamically link library (if any of those are used in the build process). I will rerun the build and carefully review the logs for errors.

abit · « *Last Edit: May 03, 2017, 08:10:51 am by abit* »

Quote from: Thom on May 01, 2017, 05:40:43 pm

Well, that's a fantastic record abit, especially since you are only manually intervening. Have you been able to make any profit from Oct 2015 to Feb 2017 using AWS servers for 3+ nodes? Of the total witness pay what % was necessary to pay server bills?

Well, my block producing node got stuck due to insufficient disk space (filled by p2p log) a few hours ago, while I'm sleeping. Unfortunately my phone was set to vibration mode, although it was notifying me all the time, I didn't wake up. I missed 177 new blocks (133 -> 310). Quite ironic. I won't be always lucky. I think it's time to setup an fail-over script.

In regards to bills, at first I was running nodes in China with less cost. We didn't have that many transactions in the early days, so network latency was not a big issue. After Steem blockchain was launched, I got some compensation there, then setup a few nodes in AWS (as my main BitShares block producing nodes) after latency became an issue, still, had been compensated by Steem witness pay for quite some months until recently. My AWS instances are mostly r3.large (15G RAM, 2 cores, 32G local SSD), the cost per month is around 150$ each (including additional cost for more disk spaces, data transmission and etc).

Quote

I think the hosting aspect is also extremely important. Until recently I ran all my nodes exclusively on VPSs. Regardless of how much RAM a server has (16GB on highest end VPS) I miss a block every week or so, sometimes every couple of weeks. A trickle. It could be due to many things. I just bought 2 dedicated servers. They are both with hosting companies I have not used before. When I put the first one located in Romania into operation as a seed, I ran into an odd problem I never saw before. It turned out to be an issue with the OS image (LOCALE was not set at all, no default) used by that hosting company. After resolving the LOCALE issue I ran it as a seed node for over a week and saw no issues, ran like a clock.

Last night I attempted to put that node into use as the block producing witness the same way I always do, but it missed 2 blocks in under a minute. My luck to be picked to generate 2 blocks so close together. It looks like there is a missing library or some other code problem looking at the errors. The binaries were compiled on that platform. Not sure if the issue is due to an OS difference (for example a missing shared lib normally supplied with the OS) or a failed package installation or an issue in the executable binary. The problem didn't happen until the node was called on to produce a block. Dbl chked the signing keys on all nodes which were correct. I'll get to the bottom of that today, or tomorrow if it's elusive to find.

I use the same setup script to ready a system to run, and I used it on another host after the one in Romania and had no issues. I will update my setup script to make sure the LOCALE is setup for English as required by the code. Probably been lucky using VPSs all around the world that I never ran into the LOCALE issue before.

Thanks for sharing the experience.

Yao

Thom

Well, that's a fantastic record abit, especially since you are only manually intervening. Have you been able to make any profit from Oct 2015 to Feb 2017 using AWS servers for 3+ nodes? Of the total witness pay what % was necessary to pay server bills?

I think the hosting aspect is also extremely important. Until recently I ran all my nodes exclusively on VPSs. Regardless of how much RAM a server has (16GB on highest end VPS) I miss a block every week or so, sometimes every couple of weeks. A trickle. It could be due to many things. I just bought 2 dedicated servers. They are both with hosting companies I have not used before. When I put the first one located in Romania into operation as a seed, I ran into an odd problem I never saw before. It turned out to be an issue with the OS image (LOCALE was not set at all, no default) used by that hosting company. After resolving the LOCALE issue I ran it as a seed node for over a week and saw no issues, ran like a clock.

Last night I attempted to put that node into use as the block producing witness the same way I always do, but it missed 2 blocks in under a minute. My luck to be picked to generate 2 blocks so close together. It looks like there is a missing library or some other code problem looking at the errors. The binaries were compiled on that platform. Not sure if the issue is due to an OS difference (for example a missing shared lib normally supplied with the OS) or a failed package installation or an issue in the executable binary. The problem didn't happen until the node was called on to produce a block. Dbl chked the signing keys on all nodes which were correct. I'll get to the bottom of that today, or tomorrow if it's elusive to find.

I use the same setup script to ready a system to run, and I used it on another host after the one in Romania and had no issues. I will update my setup script to make sure the LOCALE is setup for English as required by the code. Probably been lucky using VPSs all around the world that I never ran into the LOCALE issue before.

abit

Quote from: GChicken on April 25, 2017, 01:45:50 pm

Quote from: Thom on April 25, 2017, 03:23:56 am
Quote from: roelandp on April 24, 2017, 08:38:52 pm
1. Monitor missing blocks
Whenever a new block is missed you will get a notification. This part of the script can (and will) be extended towards automated switching to the backup witness signing key once a threshold is passed.

Several witnesses attempted to code an automatic failover algorithm but I don't believe any were successful without introducing new problems.

One important thing to consider is you absolutely do NOT want 2 nodes producing blocks for the same witness, as that is sure to cause havoc and fork the network.

Whenever I switch production using the "update_witness" API call I manually make sure both the old witness node and the new witness node are both listening and in sync before I execute the call. I usually submit the call on the old witness going out of production, not the new node going into production. I can then use the get_witness API call to verify the signing key for the new node in in effect before I shut down the old witness node.

The difficulty is in coming up with a reliable way to know for certain the node you want to take out of production will not be able to generate blocks after you switch production to another node. If the "aberrant" node has not crashed, is still running but cut off from the net (or the watchdog listener is cut off from that node) but the watchdog node falsely concludes it is dead, it may broadcast a new signing key, causing a new node to take over, but then the network to the aberrant server is restored and resumes network communications still thinking it is the block producer and so generates a block along with the failover node. As far as the aberrant node is concerned it never saw the new signing key, never thought it was offline and continues to generate a block whenever its time to do so comes around.

When the block producer fails it may not be possible to determine for certain why or get confirmation it will not resume block production. You will need to determine if the OS for the failing node is responding but not the app, in which case failover may be possible if you build in some type of communication to restart the witness_node app or restart the entire OS. The issue is what if you can't communicate with the failing node? Is it dead or just temporarily cut off? Will it fork the network if it should come back online?

I was hoping wackou & I could have implemented the backbone architecture and a failover protocol along with it, but there wasn't enough funding and wackou's time was very scarce (and still is actually). If this ecosystem is going to survive a frontal attack the witness nodes need to be protected from direct public access. Seed nodes and API servers should be the route available for public access, leaving witnesses alone to process and generate blocks quickly with minimum latency.

Looking at the stats i think @abit has a script that detects failing witness and issues a transaction to the network to update his signing key; this would allow him to run two witnesses both with different signing keys and auto switch based on any issues. - this is only speculation; i have no idea really. But i all his time of being a witness he has only missed 133 blocks, and you can see updates of signing key on his account.

I'm not using a script for BitShares witness, but switch keys manually.

I keep 3+ nodes online. With the help of @spartako's telegram bot, I got notifications in time, then try to fix/switch asap.

Another reason of low block missing rate is good server/VPS hosting provider (so far, AWS) and perhaps a bit lucky.

I AM using a script for Steem Witness though.

Thom

Quote from: GChicken on April 25, 2017, 01:45:50 pm

Quote from: Thom on April 25, 2017, 03:23:56 am
Quote from: roelandp on April 24, 2017, 08:38:52 pm
1. Monitor missing blocks
Whenever a new block is missed you will get a notification. This part of the script can (and will) be extended towards automated switching to the backup witness signing key once a threshold is passed.

Several witnesses attempted to code an automatic failover algorithm but I don't believe any were successful without introducing new problems.

One important thing to consider is you absolutely do NOT want 2 nodes producing blocks for the same witness, as that is sure to cause havoc and fork the network.

Whenever I switch production using the "update_witness" API call I manually make sure both the old witness node and the new witness node are both listening and in sync before I execute the call. I usually submit the call on the old witness going out of production, not the new node going into production. I can then use the get_witness API call to verify the signing key for the new node in in effect before I shut down the old witness node.

The difficulty is in coming up with a reliable way to know for certain the node you want to take out of production will not be able to generate blocks after you switch production to another node. If the "aberrant" node has not crashed, is still running but cut off from the net (or the watchdog listener is cut off from that node) but the watchdog node falsely concludes it is dead, it may broadcast a new signing key, causing a new node to take over, but then the network to the aberrant server is restored and resumes network communications still thinking it is the block producer and so generates a block along with the failover node. As far as the aberrant node is concerned it never saw the new signing key, never thought it was offline and continues to generate a block whenever its time to do so comes around.

When the block producer fails it may not be possible to determine for certain why or get confirmation it will not resume block production. You will need to determine if the OS for the failing node is responding but not the app, in which case failover may be possible if you build in some type of communication to restart the witness_node app or restart the entire OS. The issue is what if you can't communicate with the failing node? Is it dead or just temporarily cut off? Will it fork the network if it should come back online?

I was hoping wackou & I could have implemented the backbone architecture and a failover protocol along with it, but there wasn't enough funding and wackou's time was very scarce (and still is actually). If this ecosystem is going to survive a frontal attack the witness nodes need to be protected from direct public access. Seed nodes and API servers should be the route available for public access, leaving witnesses alone to process and generate blocks quickly with minimum latency.

Looking at the stats i think @abit has a script that detects failing witness and issues a transaction to the network to update his signing key; this would allow him to run two witnesses both with different signing keys and auto switch based on any issues. - this is only speculation; i have no idea really. But i all his time of being a witness he has only missed 133 blocks, and you can see updates of signing key on his account.

That's a very good point @GChicken, I have often wondered how he has been able to achieve such low missed block numbers.

@roelandp you're correct in your understanding of how update_witness functions. However in the scenario I tried to describe, wherein an active witness has a network infrastructure failure (not an app failure or host failure such as out of diskspace or memory) and due to that doesn't see the transaction transmitted by the monitor to switch signing keys, if the network is restored and the witness is reconnected to the network, it will continue to sign blocks for that witness but with an incorrect signing key, thus creating the real possibility of forking the network.

I know that @puppies spent some time working on an automatic failover algo and people found holes in it and I don't think his approach caught on due to the shortcomings raised. I am all for improving the robustness of our network, and hope a solid algo can be developed to automatically switch in redundant nodes and disable failed nodes. The testnet is a perfect context to work out such an algorithm and observe the affects. The exact case of a witness missing an update_witness transaction can be tested without risking a fork in production.

Pheonike

Great work.

GChicken

Great work Roeland! thanks for sharing

GChicken

Quote from: Thom on April 25, 2017, 03:23:56 am

Quote from: roelandp on April 24, 2017, 08:38:52 pm
1. Monitor missing blocks
Whenever a new block is missed you will get a notification. This part of the script can (and will) be extended towards automated switching to the backup witness signing key once a threshold is passed.

Several witnesses attempted to code an automatic failover algorithm but I don't believe any were successful without introducing new problems.

One important thing to consider is you absolutely do NOT want 2 nodes producing blocks for the same witness, as that is sure to cause havoc and fork the network.

Whenever I switch production using the "update_witness" API call I manually make sure both the old witness node and the new witness node are both listening and in sync before I execute the call. I usually submit the call on the old witness going out of production, not the new node going into production. I can then use the get_witness API call to verify the signing key for the new node in in effect before I shut down the old witness node.

The difficulty is in coming up with a reliable way to know for certain the node you want to take out of production will not be able to generate blocks after you switch production to another node. If the "aberrant" node has not crashed, is still running but cut off from the net (or the watchdog listener is cut off from that node) but the watchdog node falsely concludes it is dead, it may broadcast a new signing key, causing a new node to take over, but then the network to the aberrant server is restored and resumes network communications still thinking it is the block producer and so generates a block along with the failover node. As far as the aberrant node is concerned it never saw the new signing key, never thought it was offline and continues to generate a block whenever its time to do so comes around.

When the block producer fails it may not be possible to determine for certain why or get confirmation it will not resume block production. You will need to determine if the OS for the failing node is responding but not the app, in which case failover may be possible if you build in some type of communication to restart the witness_node app or restart the entire OS. The issue is what if you can't communicate with the failing node? Is it dead or just temporarily cut off? Will it fork the network if it should come back online?

I was hoping wackou & I could have implemented the backbone architecture and a failover protocol along with it, but there wasn't enough funding and wackou's time was very scarce (and still is actually). If this ecosystem is going to survive a frontal attack the witness nodes need to be protected from direct public access. Seed nodes and API servers should be the route available for public access, leaving witnesses alone to process and generate blocks quickly with minimum latency.

Looking at the stats i think @abit has a script that detects failing witness and issues a transaction to the network to update his signing key; this would allow him to run two witnesses both with different signing keys and auto switch based on any issues. - this is only speculation; i have no idea really. But i all his time of being a witness he has only missed 133 blocks, and you can see updates of signing key on his account.

lafona

Nice! I will definitely be using this to monitor my seed node and other witness related activities.

roelandp

Quote from: abit on April 25, 2017, 08:18:34 am

Hope someone will setup a web site to show the info. Statistics, charts, etc.

I think @lafona has some stuff in the making re witness overview. This is more for personal use / monitoring of your own witness for availability. However the scripts could easily be converted into logging the data in a db and outputting as stats table like SteemDb.com/witnesses

roelandp

Quote from: Thom on April 25, 2017, 03:23:56 am

The difficulty is in coming up with a reliable way to know for certain the node you want to take out of production will not be able to generate blocks after you switch production to another node. If the "aberrant" node has not crashed, is still running but cut off from the net (or the watchdog listener is cut off from that node) but the watchdog node falsely concludes it is dead, it may broadcast a new signing key, causing a new node to take over, but then the network to the aberrant server is restored and resumes network communications still thinking it is the block producer and so generates a block along with the failover node. As far as the aberrant node is concerned it never saw the new signing key, never thought it was offline and continues to generate a block whenever its time to do so comes around.

@Thom thanks for your feedback. I was under the impression that the way it works with signing keys is that if you have your witness producer name setup in config.ini but not supply the correct privkey corresponding to the current listed 'public signing key' that the witness node is not producing blocks?

A failsafe backup scenario (imho) would be: The main server runs under pubkey X with privkey XXX in the config.ini and should it fail the independent monitoring server calls the 'update_witness' command to start siging with pubKey Y. The backup server runs as a hot witness with privkey YYY in the config.ini and will receive messages like: 'Not producing block 12394871234 because I don't have the private key for pubKey X', right?

Only thing is to setup an 'update_witness' with pybitshares @xeroc? Let's see if I can write it

abit

Not bad.

Hope someone will setup a web site to show the info. Statistics, charts, etc.

Thom

Quote from: roelandp on April 24, 2017, 08:38:52 pm

1. Monitor missing blocks
Whenever a new block is missed you will get a notification. This part of the script can (and will) be extended towards automated switching to the backup witness signing key once a threshold is passed.

Several witnesses attempted to code an automatic failover algorithm but I don't believe any were successful without introducing new problems.

One important thing to consider is you absolutely do NOT want 2 nodes producing blocks for the same witness, as that is sure to cause havoc and fork the network.

Whenever I switch production using the "update_witness" API call I manually make sure both the old witness node and the new witness node are both listening and in sync before I execute the call. I usually submit the call on the old witness going out of production, not the new node going into production. I can then use the get_witness API call to verify the signing key for the new node in in effect before I shut down the old witness node.

The difficulty is in coming up with a reliable way to know for certain the node you want to take out of production will not be able to generate blocks after you switch production to another node. If the "aberrant" node has not crashed, is still running but cut off from the net (or the watchdog listener is cut off from that node) but the watchdog node falsely concludes it is dead, it may broadcast a new signing key, causing a new node to take over, but then the network to the aberrant server is restored and resumes network communications still thinking it is the block producer and so generates a block along with the failover node. As far as the aberrant node is concerned it never saw the new signing key, never thought it was offline and continues to generate a block whenever its time to do so comes around.

When the block producer fails it may not be possible to determine for certain why or get confirmation it will not resume block production. You will need to determine if the OS for the failing node is responding but not the app, in which case failover may be possible if you build in some type of communication to restart the witness_node app or restart the entire OS. The issue is what if you can't communicate with the failing node? Is it dead or just temporarily cut off? Will it fork the network if it should come back online?

I was hoping wackou & I could have implemented the backbone architecture and a failover protocol along with it, but there wasn't enough funding and wackou's time was very scarce (and still is actually). If this ecosystem is going to survive a frontal attack the witness nodes need to be protected from direct public access. Seed nodes and API servers should be the route available for public access, leaving witnesses alone to process and generate blocks quickly with minimum latency.

sudo

xeroc

Great to see people using pybitshares!

roelandp

Yes! I got voted in the active witness list! (here is my proposal) Thanks for your support!!! I immediately started my feed publishing more intense (twice per hour) and will continue to add more price feeds.

This morning I took the time to write a Witness Monitoring Script to monitor my witness main tasks on an independent server, powered by @xeroc 's Python Bitshares libraries (he just release 0.1.5!) for python3.

The script provides the monitoring of 3 core witness tasks and reports via a telegram bot API call the following:

1. Monitor missing blocks
Whenever a new block is missed you will get a notification. This part of the script can (and will) be extended towards automated switching to the backup witness signing key once a threshold is passed.

2. Monitor the availability of your public seednode
By utilising the telnet library the script tries to connect to the given seednode and will report on time-out or errors.

3. Monitor the publishing of a set of assets' pricefeed(s)
By requesting the asset's feeds and checking against your witness name (configurable) the script keeps monitoring how long since you posted the given asset's feed. Whenever the configurable threshold in hours has passed and you have not yet published a new feed for the asset, you will get a notification.

FYI:

The script is written for & tested in python3 and to be run continuously in a 'screen'-session.
It utilises Telegram for notifications. Create your Telegram bot at @BotFather (https://telegram.me/botfather), get your telegram id via @MyTelegramID_bot (https://telegram.me/mytelegramid_bot).
Thanks to Python Bitshares you can run this script independent of your bitshares nodes and the script doesn't need cli_wallet or witness_node'd running.
In the first lines of the script you will find all configurable parameters, with explaining comments

Check it out on Github!
Let me know your thoughts, remarks, or requests.