Topic: Witness Monitoring Script based on websocket connection only (Python Bitshares)

Offline xeroc

  • Board Moderator
  • Hero Member
  • *****
  • Posts: 12920
  • ChainSquad GmbH
    • View Profile
    • ChainSquad GmbH
  • BitShares: xeroc
  • GitHub: xeroc
You may be interested to see how you can use uptick to build your own console tools:
http://uptick.readthedocs.io/en/latest/custom-scripts.html

Offline roelandp

  • Full Member
  • ***
  • Posts: 114
  • Witness, dad, kitesurfer, event organiser
    • View Profile
    • RoelandP.nl
  • BitShares: roelandp
  • GitHub: roelandp
Sorry for the slow reply. I have updated my GitHub with the code to execute the 'switch key' part. The logic you discuss here is NOT integrated, as I had never given the edge case a thought.

It would be cool to test this frequent witness key switching on the testnet for sure, by running 2 instances with different signing keys.

If we can come up with a rotation scheme to test, I am all for writing the script based on python-bitshares, which people can then install on the testnet and monitor. https://github.com/roelandp/Bitshares-Witness-Monitor/commit/a8bc151a9f029bab7f4e3634271bbdb040d7b700


Offline Thom

Why randomly? A script should only switch signing key when
1) network participation rate is above a threshold, for example 80%, and
2) head block age is not too old, for example within 10 seconds, and
3) the witness missed x blocks in a row

I know it's not perfect, I'm not arguing about this, but IMHO the risk is relatively low.

OK, so you acknowledge there is a window of vulnerability. You may believe it is insignificant but you have no evidence to back up such a claim. I happen to agree with you, but I also think we should be cautious and avoid introducing things which may have a negative impact. Due diligence says this risk should be evaluated and characterized before we deploy auto failover widely.

I say randomly to simulate a worst-case scenario, to increase the failure rate so we can observe the effects. More switching, more witnesses. It would be good to see just how robust the failover is. If missed blocks are a factor, stressing the testnet far enough that witnesses start to miss more blocks is simply being thorough in testing.
Injustice anywhere is a threat to justice everywhere - MLK |  Verbaltech2 Witness Reports: https://bitsharestalk.org/index.php/topic,23902.0.html

Offline Thom

Hmmm... I was under the distinct impression that security would be reduced if it's possible to anticipate when a Witness is about to produce a block. It certainly makes it easier for an attacker to target a specific Witness if the attacker can anticipate when that Witness is about to generate a block. Such an attacker could use this info to trigger a DDoS barrage just slightly before the targeted Witness is to generate.

I thought a randomized production order was a central aspect of DPoS, and can even recall discussions about wagering and how the random number generation used for Witness scheduling was not robust enough (lack of sufficient entropy as I recall) for a wagering / betting app.

Offline xeroc

Take a look at the object 2.12.0

Code:

$ uptick info 2.12.0
+----------------------------+---------------+
| Key                        | Value         |
+----------------------------+---------------+
| current_shuffled_witnesses | [             |
|                            |     "1.6.37", |
|                            |     "1.6.59", |
|                            |     "1.6.17", |
|                            |     "1.6.63", |
|                            |     "1.6.15", |
|                            |     "1.6.71", |
|                            |     "1.6.26", |
|                            |     "1.6.20", |
|                            |     "1.6.74", |
|                            |     "1.6.23", |
|                            |     "1.6.35", |
|                            |     "1.6.76", |
|                            |     "1.6.69", |
|                            |     "1.6.22", |
|                            |     "1.6.73", |
|                            |     "1.6.34", |
|                            |     "1.6.45", |
|                            |     "1.6.18", |
|                            |     "1.6.24", |
|                            |     "1.6.72", |
|                            |     "1.6.64", |
|                            |     "1.6.65", |
|                            |     "1.6.16"  |
|                            | ]             |
| id                         | 2.12.0        |
+----------------------------+---------------+
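
The list above can be turned into a rough estimate of when a given witness is next due. This is a minimal sketch, assuming (nothing in this thread confirms it) that the producer for slot N ahead of the head block is current_shuffled_witnesses[(current_aslot + N) % length], with current_aslot read separately from the dynamic global properties. The function name and the sample current_aslot are made up for illustration. The list is reshuffled from time to time, so the result is only a prediction for slots that fall before the next reshuffle.

```python
from typing import List, Optional


def slots_until_turn(shuffled: List[str], current_aslot: int,
                     witness_id: str) -> Optional[int]:
    """Return the smallest slot number >= 1 at which witness_id is due.

    ASSUMPTION: slot N ahead of the head block belongs to
    shuffled[(current_aslot + N) % len(shuffled)].
    Returns None if the witness is not in the list (not active).
    """
    if witness_id not in shuffled:
        return None
    size = len(shuffled)
    for slot_num in range(1, size + 1):
        if shuffled[(current_aslot + slot_num) % size] == witness_id:
            return slot_num
    return None  # not reached when witness_id is in the list


# First five entries of the list shown above; current_aslot is a made-up value.
shuffled = ["1.6.37", "1.6.59", "1.6.17", "1.6.63", "1.6.15"]
print(slots_until_turn(shuffled, current_aslot=100, witness_id="1.6.63"))
```

Multiplying the returned slot count by the block interval gives an approximate number of seconds until the witness is due.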

Offline abit

  • Committee member
  • Hero Member
  • *
  • Posts: 4572
    • View Profile
    • Abit's Hive Blog
  • BitShares: abit
  • GitHub: abitmore

... the "update_witness" command should NOT be broadcast just before your scheduled block, or even 2~3 blocks before...

How can you know when you're going to be scheduled? You don't or there's a serious problem in the Witness randomization algorithm. So how can you know when it is "safe" to switch?

Of course we know. The pseudo-random algorithm we're using produces a deterministic result, so most of the time we know exactly when a witness is scheduled to produce a block. We just need to expose an API to show that info (like Steem).

Quote

This edge case increases with the number of witnesses and with the frequency witnesses switch their signing keys. The risk may be acceptable, but before we can be sure of that testing is required to quantify the risk. If you are willing to disclose your auto switching algo we could have a full slate of witnesses (30+) voted in on the testnet that use it, all switching their keys randomly and asynchronously, and we can gather more info about the risks.
Why randomly? A script should only switch signing key when
1) network participation rate is above a threshold, for example 80%, and
2) head block age is not too old, for example within 10 seconds, and
3) the witness missed x blocks in a row

I know it's not perfect, I'm not arguing about this, but IMHO the risk is relatively low.
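
The three conditions above can be sketched as a single check. This is only an illustration: the function and parameter names are hypothetical, the 80% and 10 second defaults are the examples given above, and the default of 3 for "x" is a placeholder. How the three inputs are read from a node is left out.

```python
def should_switch_key(participation_pct: float,
                      head_block_age_s: float,
                      consecutive_missed: int,
                      min_participation_pct: float = 80.0,
                      max_head_age_s: float = 10.0,
                      missed_threshold: int = 3) -> bool:
    """Return True only when all three conditions hold.

    missed_threshold is the "x" from condition 3; its default of 3
    is a placeholder, not a recommended value.
    """
    network_healthy = participation_pct > min_participation_pct  # condition 1
    head_is_fresh = head_block_age_s <= max_head_age_s           # condition 2
    witness_failing = consecutive_missed >= missed_threshold     # condition 3
    return network_healthy and head_is_fresh and witness_failing


# Healthy network, fresh head block, witness missed 3 blocks in a row.
print(should_switch_key(95.0, 3.0, 3))
# Low participation: the monitoring node itself may be cut off, so do nothing.
print(should_switch_key(60.0, 3.0, 5))
```

Conditions 1 and 2 guard against the monitor acting on its own stale or partitioned view of the chain; only condition 3 says anything about the witness itself.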
BitShares committee member: abit
BitShares witness: in.abit

Offline Thom

@roelandp: @Thom is correct.
Thx @abit for acknowledging my concerns.

... the "update_witness" command should NOT be broadcast just before your scheduled block, or even 2~3 blocks before...

How can you know when you're going to be scheduled? You don't or there's a serious problem in the Witness randomization algorithm. So how can you know when it is "safe" to switch?

This edge case increases with the number of witnesses and with the frequency witnesses switch their signing keys. The risk may be acceptable, but before we can be sure of that testing is required to quantify the risk. If you are willing to disclose your auto switching algo we could have a full slate of witnesses (30+) voted in on the testnet that use it, all switching their keys randomly and asynchronously, and we can gather more info about the risks. 

Offline abit

I just got my auto-failover script up.

@roelandp: @Thom is correct. There is an edge case in which both nodes produce blocks (it is then up to the next witness to decide which one to build on). To solve this, the "update_witness" command should NOT be broadcast just before your scheduled block, or even 2~3 blocks before, so that both nodes see the transaction included in a block by another witness and then confirmed by other witnesses. It's still not 100% safe, but it is practically doable. I think it's also practicable even without this check, because normally we won't have a lot of witnesses switching keys at the same time (when that happens, the network participation rate must be low, so we can check the participation rate before updating).
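
The timing rule described above could look roughly like this. It is a sketch only: the function name and the safety_margin parameter are hypothetical, the margin of 3 slots comes from the "2~3 blocks" mentioned above, and the caller is assumed to already know how many slots remain until its own turn.

```python
def broadcast_delay_slots(slots_until_turn: int, safety_margin: int = 3) -> int:
    """Return how many slots to wait before broadcasting update_witness.

    0 means "broadcast now". If our own slot is within safety_margin
    slots, wait until that slot has passed, then re-evaluate.
    """
    if slots_until_turn < 1:
        raise ValueError("slots_until_turn must be at least 1")
    if slots_until_turn > safety_margin:
        return 0
    return slots_until_turn


print(broadcast_delay_slots(10))  # turn is far away
print(broadcast_delay_slots(2))   # turn is close
```

Waiting means the failing witness misses one more block, which is the price paid for lowering the chance that both nodes sign in the same slot.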

Offline roelandp

Code:
2013000ms th_a    witness.cpp:196    block_production_loo ] Not producing block because I don't have the private key for BTS7Q2wS9rhqrkY7nAGNMG5MqscSUTY7gupwTQAybcEjUB77vD9a4
This is the message my backup witness logs every time it is my turn to sign a block.

Here is the relevant code: https://github.com/cryptonomex/graphene/blob/d7de6f63e8e29de42af8d06e0029d89fcfddf4fa/libraries/plugins/witness/witness.cpp#L264

The malfunctioning node will not directly receive the update_witness call, but it will receive the scheduled round in which the witness_id appears (if it is an active witness). However, the state of the chain has changed, and this round requires a block from its witness account signed with a different private key.

If the malfunctioning node is in the producing loop and validates all the conditions to sign, it will fail the private-key condition, because that one is not met, and will refuse to produce a block.
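
The private-key condition can be modelled in a few lines. This is a simplified illustration of that one check, not a translation of the linked witness.cpp: the function name, the result strings, and the keys are all made up, and every other production condition is ignored.

```python
from typing import Dict


def production_decision(scheduled_signing_key: str,
                        local_keys: Dict[str, str]) -> str:
    """Decide whether a node may sign the block it is scheduled for.

    scheduled_signing_key: public signing key currently registered on
        chain for the witness.
    local_keys: public key -> private key pairs this node was started with.
    """
    if scheduled_signing_key not in local_keys:
        return "no_private_key"  # the node refuses to produce
    return "produce"


# Fake placeholder keys. The backup node only holds key B.
backup_node_keys = {"BTS_FAKE_KEY_B": "fake-private-key-b"}

# Chain still lists key A: the backup stays silent.
print(production_decision("BTS_FAKE_KEY_A", backup_node_keys))
# After update_witness switches the witness to key B: the backup produces.
print(production_decision("BTS_FAKE_KEY_B", backup_node_keys))
```

The same check is what keeps the old node quiet after the switch, provided it has seen the update_witness transaction and so knows the on-chain key changed.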

Offline Thom

If you then issue an update_witness command to switch to the other public signing key it doesn't matter if the other node might be coming back online, because it then still would try to sign blocks with the (by then) outdated public key.

It is true that if the previously active witness resumes operation and didn't see the update_witness msg it will resume signing blocks using the old signing key, and that signing key won't be the correct active signing key.

The resuming node doesn't know it isn't the correct key. When that node sees its "turn" in the witness rotation it will produce a signed block for that witness which may fork the network, as now you have 2 nodes for the same witness signing blocks with different keys.

If the malfunction affected 2 or more witnesses (common datacenter or problematic network trunk) and neither of them received the update_witness msg, the "other" cut-off witness could think it was a valid block and add it to its chain, causing a fork.

Such double production with different keys may not fork the net if receivers of the "bad/old" block reject it outright due to some type of cryptographic verification failure that prevents that bad block from ever being considered valid. I do not know enough details to say if such blocks are rejected as invalid. I do know there was quite a discussion about automatic switching and AFAIK no algorithm was conceived to eliminate forking risks.

This is a perfect case for testing on the testnet.

Perhaps someone familiar with the C++ code could evaluate how multiple blocks for the same witness signed with different keys are processed and lay this question to rest.

If there is a possibility that automatic switching might increase the chance of forking, even if it is a rare and fringe case, it seems the likelihood would only increase as the volume of transactions increases.

Offline roelandp

To avoid that you need a way to make sure the old witness is definitely dead, with no chance of coming back online while the new witness takes over. To do that you need some smarts in the cooperating failsafe nodes to determine each node's state: some type of heartbeat, so that if the node producing blocks does NOT hear heartbeats from at least 2 other nodes it will cease block production. The producing node needs to verify it can communicate with the other witnesses, particularly the failsafe nodes.

Hi @Thom, we briefly discussed this in Telegram (I think), but I still feel that having multiple servers, each with its own private/public key pair (use suggest_brain_key) and running the witness_node software, is the way to go. As soon as the blockchain starts logging missed blocks for your witness, you know it is malfunctioning. If you then issue an update_witness command to switch to the other public signing key, it doesn't matter if the other node comes back online, because it would still try to sign blocks with the (by then) outdated signing key.
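
Detecting "the blockchain starts logging missed blocks" can be sketched by polling two counters of the witness object, total_missed and last_confirmed_block_num. The class below is hypothetical and only does the bookkeeping; fetching the witness object (for example with python-bitshares) is left out, and the sample numbers are invented.

```python
class MissedBlockTracker:
    """Count blocks missed in a row, based on periodic polls.

    Approximation: if the witness both missed and produced between two
    polls, the streak is reset, so poll more often than the witness is
    scheduled.
    """

    def __init__(self, total_missed: int, last_confirmed_block_num: int) -> None:
        self._missed = total_missed
        self._confirmed = last_confirmed_block_num
        self.consecutive_missed = 0

    def observe(self, total_missed: int, last_confirmed_block_num: int) -> int:
        if last_confirmed_block_num != self._confirmed:
            # The witness produced a block since the last poll: streak ends.
            self.consecutive_missed = 0
        elif total_missed > self._missed:
            # No new block from the witness, but the miss counter grew.
            self.consecutive_missed += total_missed - self._missed
        self._missed = total_missed
        self._confirmed = last_confirmed_block_num
        return self.consecutive_missed


tracker = MissedBlockTracker(total_missed=133, last_confirmed_block_num=1000)
print(tracker.observe(134, 1000))  # one miss, no block produced
print(tracker.observe(136, 1000))  # two more misses
print(tracker.observe(136, 1005))  # witness produced again
```

The returned streak is the kind of number a failover script could compare against its own "missed x blocks in a row" threshold before issuing update_witness.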

I wrote a paragraph in this update for the witness docs (not yet committed): https://github.com/roelandp/docs.bitshares.eu/commit/75f56c50caeddf1e34c548c005443d726d6ab509#diff-c4ebae0b7f619df56e73bcea77eb3fe1R235

Offline Thom

Well, my block producing node got stuck due to insufficient disk space (filled by the p2p log) a few hours ago, while I was sleeping. Unfortunately my phone was set to vibration mode; although it was notifying me all the time, I didn't wake up. I missed 177 new blocks (133 -> 310). Quite ironic. I won't always be lucky. I think it's time to set up a fail-over script.
Sorry to hear that. So you and roelandp are working on automatic failover. I hope one of you can perfect it. I have discussed that idea elsewhere, but it seems not many believe the risks are significant. All it takes to mess up the chain is for 2 nodes to broadcast signed blocks for the same witness. Fork city. To avoid that you need a way to make sure the old witness is definitely dead, with no chance of coming back online while the new witness takes over. To do that you need some smarts in the cooperating failsafe nodes to determine each node's state: some type of heartbeat, so that if the node producing blocks does NOT hear heartbeats from at least 2 other nodes it will cease block production. The producing node needs to verify it can communicate with the other witnesses, particularly the failsafe nodes.
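
The heartbeat idea can be sketched as a quorum check on the producing node. Nothing like this exists in the witness software as far as this thread shows; the function name, the peer names, and the 15 second timeout are all made up, and how heartbeats travel between nodes is not covered.

```python
from typing import Dict


def should_keep_producing(last_heartbeat: Dict[str, float], now: float,
                          timeout_s: float = 15.0, quorum: int = 2) -> bool:
    """Return True while enough peers have been heard from recently.

    last_heartbeat: peer name -> time (in seconds) of its latest heartbeat.
    A peer counts as alive if its heartbeat is at most timeout_s old.
    """
    alive = sum(1 for seen in last_heartbeat.values() if now - seen <= timeout_s)
    return alive >= quorum


# Hypothetical peers with explicit timestamps so the result is repeatable.
peers = {"failsafe-1": 100.0, "failsafe-2": 95.0, "seed-1": 40.0}
print(should_keep_producing(peers, now=105.0))  # two peers are recent
print(should_keep_producing(peers, now=112.0))  # only failsafe-1 is recent
```

A node that is cut off from its peers stops signing on its own, which is the case where it is most likely to have missed an update_witness transaction.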

In regards to bills, at first I was running nodes in China at lower cost. We didn't have that many transactions in the early days, so network latency was not a big issue. After the Steem blockchain was launched, I got some compensation there, then set up a few nodes in AWS (as my main BitShares block producing nodes) after latency became an issue. Still, I had been compensated by Steem witness pay for quite some months until recently. My AWS instances are mostly r3.large (15G RAM, 2 cores, 32G local SSD); the cost per month is around $150 each (including additional costs for more disk space, data transmission, etc.).
Thanks for this info. This confirms that until recently witness pay barely covered the cost of servers. Essentially it was altruism (fueled by the belief the platform was worth subsidizing) that kept the network operating while we all hoped that eventually we would reach much higher adoption.

Last night I attempted to put that node into use as the block producing witness the same way I always do, but it missed 2 blocks in under a minute.
I restarted the node with different witness / cli binaries and it has been working fine since yesterday. So it may be a compiler issue or a missing dynamically linked library (if any of those are used in the build process). I will rerun the build and carefully review the logs for errors.
« Last Edit: May 03, 2017, 03:26:44 pm by Thom »

Offline abit

Well, that's a fantastic record abit, especially since you are only manually intervening. Have you been able to make any profit from Oct 2015 to Feb 2017 using AWS servers for 3+ nodes? Of the total witness pay what % was necessary to pay server bills?
Well, my block producing node got stuck due to insufficient disk space (filled by the p2p log) a few hours ago, while I was sleeping. Unfortunately my phone was set to vibration mode; although it was notifying me all the time, I didn't wake up. I missed 177 new blocks (133 -> 310). Quite ironic. I won't always be lucky. I think it's time to set up a fail-over script.

In regards to bills, at first I was running nodes in China at lower cost. We didn't have that many transactions in the early days, so network latency was not a big issue. After the Steem blockchain was launched, I got some compensation there, then set up a few nodes in AWS (as my main BitShares block producing nodes) after latency became an issue. Still, I had been compensated by Steem witness pay for quite some months until recently. My AWS instances are mostly r3.large (15G RAM, 2 cores, 32G local SSD); the cost per month is around $150 each (including additional costs for more disk space, data transmission, etc.).

Quote
I think the hosting aspect is also extremely important. Until recently I ran all my nodes exclusively on VPSs. Regardless of how much RAM a server has (16GB on highest end VPS) I miss a block every week or so, sometimes every couple of weeks. A trickle. It could be due to many things. I just bought 2 dedicated servers. They are both with hosting companies I have not used before. When I put the first one located in Romania into operation as a seed, I ran into an odd problem I never saw before. It turned out to be an issue with the OS image (LOCALE was not set at all, no default) used by that hosting company. After resolving the LOCALE issue I ran it as a seed node for over a week and saw no issues, ran like a clock.

Last night I attempted to put that node into use as the block producing witness the same way I always do, but it missed 2 blocks in under a minute. My luck to be picked to generate 2 blocks so close together. It looks like there is a missing library or some other code problem looking at the errors. The binaries were compiled on that platform. Not sure if the issue is due to an OS difference (for example a missing shared lib normally supplied with the OS) or a failed package installation or an issue in the executable binary. The problem didn't happen until the node was called on to produce a block. Dbl chked the signing keys on all nodes which were correct. I'll get to the bottom of that today, or tomorrow if it's elusive to find.

I use the same setup script to ready a system to run, and I used it on another host after the one in Romania and had no issues. I will update my setup script to make sure the LOCALE is setup for English as required by the code. Probably been lucky using VPSs all around the world that I never ran into the LOCALE issue before.
Thanks for sharing the experience.
« Last Edit: May 03, 2017, 08:10:51 am by abit »


Offline Thom

Well, that's a fantastic record abit, especially since you are only manually intervening. Have you been able to make any profit from Oct 2015 to Feb 2017 using AWS servers for 3+ nodes? Of the total witness pay what % was necessary to pay server bills?

I think the hosting aspect is also extremely important. Until recently I ran all my nodes exclusively on VPSs. Regardless of how much RAM a server has (16GB on highest end VPS) I miss a block every week or so, sometimes every couple of weeks. A trickle. It could be due to many things. I just bought 2 dedicated servers. They are both with hosting companies I have not used before. When I put the first one located in Romania into operation as a seed, I ran into an odd problem I never saw before. It turned out to be an issue with the OS image (LOCALE was not set at all, no default) used by that hosting company. After resolving the LOCALE issue I ran it as a seed node for over a week and saw no issues, ran like a clock.

Last night I attempted to put that node into use as the block producing witness the same way I always do, but it missed 2 blocks in under a minute. My luck to be picked to generate 2 blocks so close together. It looks like there is a missing library or some other code problem looking at the errors. The binaries were compiled on that platform. Not sure if the issue is due to an OS difference (for example a missing shared lib normally supplied with the OS) or a failed package installation or an issue in the executable binary. The problem didn't happen until the node was called on to produce a block. Dbl chked the signing keys on all nodes which were correct. I'll get to the bottom of that today, or tomorrow if it's elusive to find.

I use the same setup script to ready a system to run, and I used it on another host after the one in Romania and had no issues. I will update my setup script to make sure the LOCALE is setup for English as required by the code. Probably been lucky using VPSs all around the world that I never ran into the LOCALE issue before.
