Author Topic: $5000 Bounty - Momentum Proof-of-Work Hacking [PAID]  (Read 54638 times)


Offline AnonyMint

  • Jr. Member
  • Posts: 30
The point of Scrypt was to make parallel operation difficult and its primary weakness is that validation time suffers as you scale it.

If you don't require that the first match in the sequence of generated pseudo-random numbers be the one used, then the parallelism attack I described upthread lets GPUs blow away CPUs: assuming the pseudo-random sequence can be generated much faster than the random-access memory latency that bounds the CPU, the GPU can try hundreds of values from the same pseudo-random sequence in parallel to hide that latency.

If you do require it, then you lose the fast validation time.

Do you get it now, or are you going to continue to ignore me?

Offline bytemaster

I decided I would graph the probability of NOT FINDING a hash after searching the NONCE space using three different amounts of memory... 100%, 50%, and 25%.



Both use cases search the same number of nonces and thus do the same number of SHA512... but you take a performance hit if you reduce your memory...

Obviously, slight reductions in memory don't hurt too much, but if you attempted to reduce your memory from 768 MB down to 1 MB so that you could operate 1000 runs in parallel then your probability of NOT finding a match would be: 
0.368196 WITH MEMORY,
0.997402 WITHOUT MEMORY

So your probability of finding a match is 243x smaller for the same number of computations. 

So I guess this means that you can trade 768 MB of RAM for 243 parallel processes each with 1 MB of RAM and have the SAME hash rate.    Now you only actually decreased your RAM requirement to about a third of what it was (243 MB instead of 768 MB)...  while increasing the number of computations required by 243x...
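
For concreteness, here is that arithmetic as a small Python check; a minimal sketch using only the two probabilities quoted above (the underlying model is the one graphed in this post):

Code:
p_miss_full = 0.368196   # P(no match) over the nonce space with the full 768 MB
p_miss_1mb  = 0.997402   # P(no match) with memory cut down to 1 MB
p_hit_full = 1 - p_miss_full          # ~0.632
p_hit_1mb  = 1 - p_miss_1mb           # ~0.0026
ratio = p_hit_full / p_hit_1mb
print(f"hit-rate ratio: {ratio:.0f}x")          # ~243x
# Matching the full-memory hash rate with 1 MB runs therefore takes ~243
# parallel runs: ~243 MB of RAM in total, but ~243x the SHA512 computations.
print(f"RAM for the parallel runs: ~{ratio:.0f} MB vs 768 MB")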
« Last Edit: November 07, 2013, 07:48:29 pm by bytemaster »
For the latest updates checkout my blog: http://bytemaster.bitshares.org
Anything said on these forums does not constitute an intent to create a legal obligation or contract between myself and anyone else.   These are merely my opinions and I reserve the right to change them at any time.

Offline bytemaster

Imagine a coin toss: with only one unit of memory your odds of finding a collision are 1/N.
With M units of memory your odds are M/N

Because these are independent events.... we can calculate the probability of NOT finding a collision after X birthdays...

(N-1)/N * (N-2)/N * (N-3)/N * ... * (N-x)/N

As x increases, so does your chance of finding a match, and it multiplies out significantly over time.

If you limit your memory to 2... then the math looks like:
(N-1)/N * (N-2)/N * (N-2)/N * ... * (N-2)/N

Clearly your odds of not finding a match drop much faster the higher x gets... but if you cap your memory then you lose that benefit, and with it performance.
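
A minimal sketch evaluating those two products side by side (N and x below are toy values chosen for illustration, not the coin's real parameters):

Code:
N = 10_000   # toy size of the birthday space
x = 200      # toy number of hashes performed
p_no_match_growing = 1.0   # memory grows with every hash: (N-1)/N * (N-2)/N * ...
p_no_match_capped  = 1.0   # memory capped at 2 stored hashes: (N-1)/N * (N-2)/N * (N-2)/N * ...
for i in range(1, x + 1):
    p_no_match_growing *= (N - i) / N
    p_no_match_capped  *= (N - min(i, 2)) / N
print(f"P(no match), growing memory: {p_no_match_growing:.3f}")   # ~0.13
print(f"P(no match), memory capped:  {p_no_match_capped:.3f}")    # ~0.96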

The code is out there; the bounty is for someone to prove it wrong with an implementation, not fancy handwaving and unbacked claims.



 
For the latest updates checkout my blog: http://bytemaster.bitshares.org
Anything said on these forums does not constitute an intent to create a legal obligation or contract between myself and anyone else.   These are merely my opinions and I reserve the right to change them at any time.

Offline bytemaster

To clarify, the 50% chance of at least 1 collision for 23 people in your example comes from doing n^2 comparisons, but when you're hashing you check each new hash against the ones you already have saved each time you add it to the hash table. You're not doing all n^2 operations each time, since you've already compared the previous hashes to each other. You're just doing n comparisons each time to see if the new hash matches. There's a disconnect between what you describe in your graphs and the actual algorithm. That 50% chance means that after 23 hashes you'd find AT LEAST 1 collision, but miners want to find MULTIPLE collisions, which changes the whole scaling.

I like the core idea of using the birthday hash to force high memory usage. I think your idea of saturating the memory bandwidth has promise. I think that with the right parameters you could design mining that would see a very low improvement (perhaps none without the scale of ATI behind it) going to ASICs over GPUs, due to GPU optimization of parallel operations and memory bandwidth. You could give CPUs a fighting chance as well, since they can leverage large amounts of memory to make up for less raw hash rate and memory bandwidth.

Your algorithm as it is, though, does not scale the way you think it does.

Nathaniel

Nathaniel... this is why in my description I reference the AREA under the graph as the sum of probability.    If I cut off the memory then I am losing part of that area with every hash I perform.
For the latest updates checkout my blog: http://bytemaster.bitshares.org
Anything said on these forums does not constitute an intent to create a legal obligation or contract between myself and anyone else.   These are merely my opinions and I reserve the right to change them at any time.

Offline Nathaniel B

  • Newbie
  • Posts: 6
To clarify, the 50% chance of at least 1 collision for 23 people in your example comes from doing n^2 comparisons, but when you're hashing you check each new hash against the ones you already have saved each time you add it to the hash table. You're not doing all n^2 operations each time, since you've already compared the previous hashes to each other. You're just doing n comparisons each time to see if the new hash matches. There's a disconnect between what you describe in your graphs and the actual algorithm. That 50% chance means that after 23 hashes you'd find AT LEAST 1 collision, but miners want to find MULTIPLE collisions, which changes the whole scaling.

I like the core idea of using the birthday hash to force high memory usage. I think your idea of saturating the memory bandwidth has promise. I think that with the right parameters you could design mining that would see a very low improvement (perhaps none without the scale of ATI behind it) going to ASICs over GPUs, due to GPU optimization of parallel operations and memory bandwidth. You could give CPUs a fighting chance as well, since they can leverage large amounts of memory to make up for less raw hash rate and memory bandwidth.

Your algorithm as it is, though, does not scale the way you think it does.
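
To make the distinction concrete, here is a toy sketch of that incremental check (birthday_hash is a hypothetical stand-in for BirthdayHash, and the 20-bit space and nonce range are illustrative, not the real parameters):

Code:
import hashlib

def birthday_hash(nonce: int, header: bytes, bits: int = 20) -> int:
    # Toy stand-in for BirthdayHash: SHA-512 truncated to a small birthday space.
    d = hashlib.sha512(header + nonce.to_bytes(8, "little")).digest()
    return int.from_bytes(d[:8], "little") % (1 << bits)

header = b"example block header"
seen = {}        # birthday value -> first nonce that produced it
collisions = []  # (earlier nonce, new nonce) pairs with matching birthdays
for nonce in range(100_000):
    b = birthday_hash(nonce, header)
    if b in seen:                            # one table lookup per new hash,
        collisions.append((seen[b], nonce))  # not a fresh n^2 pass over all pairs
    else:
        seen[b] = nonce
print(f"found {len(collisions)} collisions")  # miners keep collecting these, not just the first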

Nathaniel

Offline Nathaniel B

  • Newbie
  • Posts: 6
All your graphs are about finding 1 birthday collision, but miners want to find as many collisions as possible, which completely changes the problem. Run the empirical experiment I suggested. Just add the one check so that after half the nonce space is exhausted you stop adding new BirthdayHashed nonces to the hashmap, to simulate having half the memory. You will see the result I presented.
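
A minimal sketch of that experiment with toy parameters (birthday_hash here is a hypothetical stand-in for the real BirthdayHash, and the sizes are illustrative):

Code:
import hashlib

def birthday_hash(nonce: int, header: bytes, bits: int = 20) -> int:
    d = hashlib.sha512(header + nonce.to_bytes(8, "little")).digest()
    return int.from_bytes(d[:8], "little") % (1 << bits)

def count_matches(nonce_count: int, stop_storing_at: int, header: bytes) -> int:
    table = {}    # birthday -> nonce
    matches = 0
    for nonce in range(nonce_count):
        b = birthday_hash(nonce, header)
        if b in table:
            matches += 1
        elif nonce < stop_storing_at:   # simulate limited RAM: stop growing the table
            table[b] = nonce
    return matches

header = b"example header"
N = 100_000
full = count_matches(N, N, header)       # full memory: store every birthday seen
half = count_matches(N, N // 2, header)  # half memory: stop storing halfway through
print(f"full memory: {full} matches, half memory: {half} matches")
print(f"ratio: {half / full:.2f}")       # comes out near 0.75, not 0.5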

Offline AnonyMint

  • Jr. Member
  • Posts: 30
Anyway, the proof is in the implementation.   

I think we are sufficiently expert, with 3 decades of programming experience, to make a correct analysis without implementing.

I am still waiting for your replies. Hope you feel better. Try taking 50,000 IU of Vit D3. That kicks my colds to the curb within 24 hours.

Offline bytemaster

Nathaniel B,
   Let me say that your post has a lot of math that I am too sick to follow completely to identify the specific problem.  Instead, I would like to go back to some first principles of the algorithm and see if I can summarize what you say your math proves.

   
     
   Based upon this graph it would appear that it all depends upon how far up the curve you set your target.    If you set the target below 23 then you get non-linear increases in the probability of finding a match.  If you set it above 23 it scales approximately linearly until you hit 40, at which point you start getting less and less benefit from increasing your RAM.

   Now, with this hash algorithm the total hash power is proportional to the AREA under the curve.   The first person you add has 0 chance.  Everyone below 10 has less than a 10% chance.  Every hash past 23 has at least a 50% chance.    So performance is clearly tied to how quickly you can increase your probability of finding a result, which of course has diminishing returns.

   Clearly, once you get near 50 people your memory usage can become constant and you will have 95% of the hash power that you would have if you doubled your memory.   Fortunately, we limit the NONCE search space, which prevents you from ever going past ~23.    So I stay on the portion of the curve that does gain by increasing RAM.

 
 
   Taken to the extreme of only having memory for 1 birthday, you would end up on the Q(n) curve.

   There are two approaches to finding a collision, and you can make a trade-off of memory for parallelism.    So let me try some math here...

   Let's assume you only have memory for 10 items (50% of target); then your probability of finding a result is .1 * P, where P is the level of parallelism.  As a result, if P is 10 then the probability of finding a result after 1 step is 50/50.

   Let's assume P is 1 in the CPU case.   Now instead of doing 10 calculations in parallel you do them sequentially.   In this case, after you have performed 10 steps your next calculation is 50/50, which is equal to the parallel case, except that you get the benefit of the area under the curve, which means that on average it will take you less than 10 steps to find the result.  Also, in the GPU case the cost of trying again is another 10 checks in parallel, whereas in the CPU case with memory the cost is less than 1 additional calculation.

   Conclusion:  reducing memory does cause a linear reduction in the probability of finding one collision,  but you must take the area under the curve to see the true benefits.
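
   As a rough numeric illustration of that area argument, treating the area as the sum of per-hash match probabilities (the classic 365-day birthday space and the cap below are illustrative stand-ins, not the real parameters):

Code:
B = 365        # classic birthday space, standing in for the 50-bit hash space
steps = 50     # how many birthdays/hashes we add in total
cap = 23       # illustrative memory cap: stop storing after 23 birthdays
area_full = area_capped = 0.0
for n in range(1, steps + 1):
    stored_full   = n - 1              # with enough RAM, everything seen so far is stored
    stored_capped = min(n - 1, cap)    # with capped RAM, the table stops growing
    area_full   += stored_full / B     # per-step probability of matching something stored
    area_capped += stored_capped / B
print(f"area (expected matches), full memory:   {area_full:.2f}")    # ~3.36
print(f"area (expected matches), capped memory: {area_capped:.2f}")  # ~2.39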

   Now at this point you may point out that the GPU could do the first 10 steps in parallel, and thus the GPU could have a 50/50 chance of having solved it in 2 steps while the CPU would be 10 steps behind.    I do not think this would be the case, however, with a 50-bit birthday space.

Anyway, the proof is in the implementation.   
   
 

For the latest updates checkout my blog: http://bytemaster.bitshares.org
Anything said on these forums does not constitute an intent to create a legal obligation or contract between myself and anyone else.   These are merely my opinions and I reserve the right to change them at any time.

Offline Nathaniel B

  • Newbie
  • Posts: 6
Here's a proof that your algorithm scales sublinearly with RAM, i.e. adding more RAM up to TARGET_RAM_USE actually has slight diminishing returns, and even in the asymptotic case the speedup is at best linear.
Informally, my impression is that you misunderstand what the actual steps in your algorithm look like. Because of the difficulty function, a user wants to find as many BirthdayHash collisions as possible rather than just the first match, and each time a new nonce is BirthdayHashed and put into the hashtable it is compared to the others, rather than all hashes being compared in one n^2 computation at the end.
I'll reference my original post on the bitcoin forum and quote extensively from it.
https://bitcointalk.org/index.php?topic=313479.msg3365654#msg3365654

Let B = the number of ‘birthdays’ (hashspace of BirthdayHash).
Let S = the number of BirthdayHash(nonce+H) stored in RAM.
Let M = maximum number that S can be given a particular system’s RAM.

If we compute a new BirthdayHash(A + H), the expected number of matches against the stored hashes (BirthdayHash(A+H) == BirthdayHash(A'+H) for some stored nonce A') is S / B.

Now we calculate the total number of expected matches during a time frame (average block creation time, for instance). We take the number of BirthdayHash() calls we compute in that time (N). S will increase from 0 to N unless M < N, meaning we do not have enough RAM to store N results; from your comments I gather that you intend for M (TARGET_RAM_USE / (nonce size + hashtable overhead per nonce)) >= N.

N * Mean(S) / B = Total expected matches
If M>=N:
Mean(S) = (1 + 2 + 3 + … + N)/N = (N+1)/2

Now let's see what happens when a user only has M/2 RAM available and M/2 < N <= M. To win your challenge I must prove that with M/2 RAM a user would still get at least half as many matches, ((N+1)/2)*.5.

For the first M/2 hashes both users produce the same results. Then, the user stuck at M/2 RAM only has M/2 nonces to compare each new one to, like this:
(1+2+3 + … M/2 + M/2 + … + M/2) = ((M/2)*(M/2 + 1)/2 + (M/2)*(N-M/2))
The mean then is  ((M/2)*(M/2 + 1)/2 + (M/2)*(N-M/2))/N
Total expected matches for M/2 RAM is (Ns cancel out): ((M/2)*(M/2 + 1)/2 + (M/2)*(N-M/2))/B

(M/2 scenario) ((M/2)*(M/2 + 1)/2 + (M/2)*(N-M/2))/B vs (N*(N+1)/2)/B (N<=M scenario)
Plug in the numbers M = 2^26 nonces worth of RAM, N = 2^26 hashes during the time period, since it's always more efficient to continue hashing until the nonce space is exhausted to avoid the extra cost of rebuilding the hashtable of potential matches.
(M/2 scenario)
((2^26/2)*(2^26/2 + 1)/2 + (2^26/2)*(2^26-2^26/2)) /2^50= 1.5000000149
vs,
(M scenario)
(2^26*(2^26+1)/2)/2^50 = 2.0000000298

So in the same time frame with half the RAM, I will find ~1.5 matches vs your ~2 matches; your scaling is actually worse than linear when you double the RAM.
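
For reference, here is the same comparison as a direct evaluation of the formulas above (a sketch; it just plugs in M = N = 2^26 and B = 2^50):

Code:
B = 2**50       # birthday space (50-bit BirthdayHash)
N = 2**26       # hashes computed during the time frame
M = 2**26       # nonces the full-RAM miner can store
half = M // 2   # nonces the half-RAM miner can store
# M >= N: the table grows with every hash, so Mean(S) = (N+1)/2
full_ram = (N * (N + 1) / 2) / B
# M/2 RAM: the table grows to M/2 and then stays there
half_ram = (half * (half + 1) / 2 + half * (N - half)) / B
print(f"expected matches, full RAM: {full_ram:.10f}")   # 2.0000000298
print(f"expected matches, half RAM: {half_ram:.10f}")   # 1.5000000149
print(f"half / full: {half_ram / full_ram:.3f}")        # 0.750 -> sublinear in RAM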

Let me know if you have any questions. I suggest you run the empirical experiment yourself if you're not convinced. Just change the code to stop adding new results to the hash table once you're halfway through the nonce space to simulate halving your RAM.

Hope this helps your design. You can get asymptotically linear increases per RAM increase if you make TARGET_RAM_USE ridiculously huge, like 1 TB+.


Thanks,

Nathaniel
16EeLa9BrZc1A4sasztaRdnfp3aU8DmuVC

Offline digitalindustry

  • Jr. Member
  • Posts: 41
I will independently verify that AnonyMint engendered these ideas first (well, spoke about them first anyhow), and if an implementation comes from these ideas, surely (some) credit has to go his way.

Offline iruu

  • Jr. Member
  • Posts: 46
What about the bounty on bitcointalk? So if someone provides an implementation on or before 15 November, he gets max(30 BTC, $5000), but after that, only $5000?

Also, do you have access to Windows with a mining-capable GPU?


Offline AnonyMint

  • Jr. Member
  • Posts: 30
While the GPU can hide latency it cannot change bandwidth and your bandwidth usage is still N^2.

Memory bandwidth is several times higher on the GPU than on the CPU, and you won't get close to saturating memory bandwidth on the CPU because the full random-access memory latency is too high and you can run at most 8 threads on a Core i7, so you cannot hide the latency entirely. Software threads above that count of 8 don't matter.

Your constant-factor hardware gains in memory bandwidth and parallelism are combating an N^2 algorithmic disadvantage.    It may be possible to develop a GPU-based solution, but how much faster it will be is TBD... will it be 10x faster?  I don't think so.

Why do you assume the GPU can't use a hash table also?

Offline AnonyMint

  • Jr. Member
  • Posts: 30
This would probably be the FASTEST way to implement this proof of work on a GPU but still requires TARGET_MEMORY and thus does not qualify for the bounty.

The CPU requires the target memory also, and would thus be significantly slower since it cannot even saturate its slower memory bandwidth because it is memory-latency bound.
« Last Edit: November 06, 2013, 10:00:59 pm by AnonyMint »

Offline AnonyMint

  • Jr. Member
  • Posts: 30
Ok... a few corrections for you to consider.   The address space of the birthdays is 50 bits.   The TARGET_MEMORY is the number of birthdays that must be stored to have a 90+% probability of finding 1 collision once you have filled TARGET_MEMORY.    As a result, this isn't sparse memory access unless you are operating on some MULTIPLE of the TARGET_MEMORY.

Correct, by using a hash table it is no longer physically sparse, so the CPU and GPU only need the target physical memory. Thus the birthday aspect is useless except for causing collisions on the hash buckets.

Your GPU algorithm is still O(C * N^2) where C is some constant factor improvement.
CPU algorithm is O( C * N )

While the GPU can hide latency it cannot change bandwidth and your bandwidth usage is still N^2.

Only if you assume the GPU can't use a hash table too. I see no reason for that to be the case, as I wrote in my prior post. And I expect the collisions to be irrelevant. If that last sentence is the only one in contention, I can explain more.

Also re-read my prior post, I added to it.
« Last Edit: November 06, 2013, 10:03:10 pm by AnonyMint »

Offline bytemaster

While the GPU can hide latency it cannot change bandwidth and your bandwidth usage is still N^2.

Memory bandwidth is several times higher on the GPU than on the CPU, and you won't get close to saturating memory bandwidth on the CPU because the full random-access memory latency is too high and you can run at most 8 threads on a Core i7, so you cannot hide the latency entirely. Software threads above that count of 8 don't matter.

Your constant-factor hardware gains in memory bandwidth and parallelism are combating an N^2 algorithmic disadvantage.    It may be possible to develop a GPU-based solution, but how much faster it will be is TBD... will it be 10x faster?  I don't think so.
For the latest updates checkout my blog: http://bytemaster.bitshares.org
Anything said on these forums does not constitute an intent to create a legal obligation or contract between myself and anyone else.   These are merely my opinions and I reserve the right to change them at any time.