Author Topic: $5000 Bounty - Momentum Proof-of-Work Hacking [PAID]  (Read 54055 times)


Offline bytemaster

Quote
bytemaster, hope you saw my response. I think you underestimated the issue:

https://bitcointalk.org/index.php?topic=325261.msg3523201#msg3523201

Integrated graphics can saturate the memory bus (60 GB/s on the latest hardware) using the same techniques as hyper-threading to perform parallel execution.  GPUs can get 5x higher memory bandwidth, though in this particular use case moving more than 64 bytes (one cache line) per access doesn't help, given the random-access pattern.  In fact, it may hurt the GPU by forcing it to consume memory bandwidth in larger-than-necessary chunks on random accesses.

I think it is fair to say that performance of the memory-based approach is limited by memory bandwidth, and that in theory a GPU could reach 260 GB/s while a high-end desktop gets 60 GB/s and a low-end system 10 GB/s.  So there is roughly a 26x difference between low-end CPU systems and high-end graphics cards.
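Back-of-the-envelope, using those figures and assuming one 64-byte cache line per random probe:

   60 GB/s  / 64 B  =  ~0.9 x 10^9 probes/sec   (high-end desktop)
   260 GB/s / 64 B  =  ~4.1 x 10^9 probes/sec   (high-end GPU)
   10 GB/s  / 64 B  =  ~0.16 x 10^9 probes/sec  (low-end system)

So on specs alone the GPU's edge is about 4x over a high-end desktop and about 26x over a low-end system, which is where the range in the summary below comes from.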

By the way, I am very much considering paying the bounty based upon information provided thus far.  I just need to understand the degree to which Momentum has been harmed by the algorithms mentioned. 

1) What is the performance of scrypt on CPU vs GPU?
2) Does Momentum have a lower CPU-to-GPU performance ratio than scrypt?
3) When it comes to making an ASIC, which would be harder: scrypt or Momentum?

As far as how to divide the bounty among the various contributors, I am thinking that AnonyMint and gigawatt have both provided reasonable contributions.  I will wait to see gigawatt's proof-of-concept implementation, to be sure there is a cycle and that the computational time does not exceed the available parallelism, before making the final decision about relative payout.  Gigawatt's proof-of-concept shows more time and energy invested and is also more compelling than the theoretical performance estimates provided by AnonyMint.  Clearly, writing code is of greater value than writing forum posts, and also more definitive.

Summary of Algorithms Presented
• Bloom filter to reduce memory - comes at a performance cost; may be possible to populate in parallel with more R&D, which leads back to GPUs...
• Memory bandwidth of GPU vs CPU architectures - gives the GPU a potential 4x to 26x advantage on specs alone; must be discounted by the overhead of accessing memory in larger chunks than necessary, and/or the Bloom filter, and/or sort/search algorithms with different time complexity.
• Constant-memory cycle-search algorithms - trade CPU for memory in a manner that might free parallelism from memory-bus constraints; we must prove cycles exist and that calculation time does not exceed the available parallelism.

Good work everyone; it is a pleasure debating this with you all.  I hope you recognize that I am in this to find the best algorithm and NOT to defend my ego and put my head in the sand.  This bounty and ProtoShares have brought out great minds, and we sharpen one another.



 
For the latest updates check out my blog: http://bytemaster.bitshares.org
Anything said on these forums does not constitute an intent to create a legal obligation or contract between myself and anyone else.   These are merely my opinions and I reserve the right to change them at any time.

Offline bytemaster

Quote
The fact that the algorithm uses a constant amount of memory regardless of the hash output size means that no matter how large the MAX_MOMENTUM_NONCE or SEARCH_SPACE_BITS values are, the memory requirements are essentially zero.  This means that no matter how slow the algorithm is, it's infinitely more efficient (memory vs collision rate) than the current implementation.
Given that it requires a constant amount of memory, Momentum as a proof-of-work is essentially equivalent to scrypt/md5/sha512 at a higher difficulty.

Quote
Now, if H is a random function on an m-element set, then, by the birthday paradox, the expected number of steps [...]

Offline AnonyMint

bytemaster, hope you saw my response. I think you underestimated the issue:

https://bitcointalk.org/index.php?topic=325261.msg3523201#msg3523201

Offline bytemaster

Wow great work guys!  I am taking some time to digest it all and really wish the bounty had been taken more seriously prior to launch.
Quote
If the input is given as a subroutine for calculating ƒ, the cycle detection problem may be trivially solved using only λ+μ function applications, simply by computing the sequence of values xi and using a data structure such as a hash table to store these values and test whether each subsequent value has already been stored. However, the space complexity of this algorithm is λ+μ, unnecessarily large. Additionally, to implement this method as a pointer algorithm would require applying the equality test to each pair of values, resulting in quadratic time overall. Thus, research in this area has concentrated on two goals: using less space than this naive algorithm, and finding pointer algorithms that use fewer equality tests.

So, assuming an ASIC can implement f(x) very quickly... this algorithm can be reduced to 2 memory locations using something like the tortoise-and-hare algorithm.
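For concreteness, a minimal sketch of the tortoise-and-hare idea; the 20-bit splitmix64 mixer is just a stand-in for the real hash, not anything from the Momentum code:

Code:
#include <cstdint>
#include <cstdio>

// splitmix64 finalizer as a stand-in for the real hash, truncated to
// 20 bits so the walk re-enters its cycle quickly.
static uint64_t f(uint64_t x) {
    x += 0x9e3779b97f4a7c15ULL;
    x = (x ^ (x >> 30)) * 0xbf58476d1ce4e5b9ULL;
    x = (x ^ (x >> 27)) * 0x94d049bb133111ebULL;
    return (x ^ (x >> 31)) & 0xFFFFF;   // maps the 20-bit space into itself
}

int main() {
    uint64_t tortoise = f(1);           // advances one step per iteration
    uint64_t hare = f(f(1));            // advances two steps per iteration
    while (tortoise != hare) {          // O(1) memory: just these two values
        tortoise = f(tortoise);
        hare = f(f(hare));
    }
    printf("pointers met at %llx -- a cycle exists\n", (unsigned long long)hare);
    return 0;
}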

There is one thing it seems to require: a function mapping an input space to itself.  In the case of Momentum the input space does not fit this criterion.  Nonces are 26 bits, birthdays are 50 bits.  And a birthday is not just SHA512(nonce) but SHA512(nonce + header), and each hash call generates 8 birthdays, which would not have the property of mapping back into the nonce space to form a cycle.
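For reference, my reading of how the birthdays fall out of one hash call; the 80-byte header, the exact packing, and the use of OpenSSL's SHA512() are assumptions for illustration, not the reference implementation:

Code:
#include <cstdint>
#include <cstdio>
#include <cstring>
#include <openssl/sha.h>   // SHA512(); link with -lcrypto

const int      BIRTHDAYS_PER_HASH = 8;                 // 8 x 64-bit words in a SHA512 digest
const uint64_t BIRTHDAY_MASK      = (1ULL << 50) - 1;  // keep 50 bits per birthday

// Fills out[0..7] with the birthdays for nonces base..base+7.
void birthdays(const unsigned char header[80], uint32_t base, uint64_t out[8]) {
    unsigned char buf[84];
    unsigned char digest[64];
    memcpy(buf, &base, 4);                 // nonce prefixed to the block header
    memcpy(buf + 4, header, 80);
    SHA512(buf, sizeof(buf), digest);
    for (int i = 0; i < BIRTHDAYS_PER_HASH; ++i) {
        uint64_t word;
        memcpy(&word, digest + 8 * i, 8);  // each digest word yields one birthday
        out[i] = word & BIRTHDAY_MASK;
    }
}

int main() {
    unsigned char header[80] = {0};        // placeholder header
    uint64_t out[8];
    birthdays(header, 0, out);
    for (int i = 0; i < 8; ++i)
        printf("birthday[%d] = %014llx\n", i, (unsigned long long)out[i]);
    return 0;
}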

Let me look into some of the other techniques and their requirements.

Offline AnonyMint

Quote
Wow great work guys!  I am taking some time to digest it all and really wish the bounty had been taken more seriously prior to launch.

I didn't know about it. You could have PM'ed me.  I had told you in the past I was researching CPU-only.

Offline bytemaster

Wow great work guys!  I am taking some time to digest it all and really wish the bounty had been taken more seriously prior to launch.


Offline AnonyMint

Quote
It looks like there's a major flaw with birthday collisions as a proof of work:

For any hash algorithm of length L bits, a collision can be found in 2^(L/2) + 1 computations with a constant amount of memory, using a cycle detection algorithm.  Given that Momentum uses a 50-bit hash, a collision can be found in 2^25 + 1 steps, which 1) is, ironically, less than MAX_MOMENTUM_NONCE, and 2) because the method relies almost exclusively on SHA512 calls, can be made massively parallel.
Not only does it require fewer steps, it can also be made embarrassingly parallel.

In fact, a whole section of the book "Introduction to Modern Cryptography" is dedicated to the subject of collision attacks.  An erratum covering the section can be found here: http://www.cs.umd.edu/~jkatz/imc/hash-erratum.pdf (it contains mathematical proofs of the claims listed above).

Additional discussion on the matter can be found here: http://crypto.stackexchange.com/questions/3295/how-does-a-birthday-attack-on-a-hashing-algorithm-work

The fact that the algorithm uses a constant amount of memory regardless of the hash output size means that no matter how large the MAX_MOMENTUM_NONCE or SEARCH_SPACE_BITS values are, the memory requirements are essentially zero.  This means that no matter how slow the algorithm is, it's infinitely more efficient (memory vs collision rate) than the current implementation.
Given that it requires a constant amount of memory, Momentum as a proof-of-work is essentially equivalent to scrypt/md5/sha512 at a higher difficulty.


To reinforce these claims, I'm actively working on a proof of concept that uses constant memory and almost nothing but SHA512 calls.

This is essentially taking advantage of the same point I made, namely that there are numerous solution candidates in each sequence, but it employs a mathematical insight for detecting duplicate candidates instead of leveraging the GPU's memory and thread advantage with parallelism.
« Last Edit: November 08, 2013, 05:47:14 pm by AnonyMint »

Offline gigawatt

It looks like there's a major flaw with birthday collisions as a proof of work:

For any hash algorithm of length L bits, a collision can be found in 2^(L/2) + 1 computations with a constant amount of memory, using a cycle detection algorithm.  Given that Momentum uses a 50-bit hash, a collision can be found in 2^25 + 1 steps, which 1) is, ironically, less than MAX_MOMENTUM_NONCE, and 2) because the method relies almost exclusively on SHA512 calls, can be made massively parallel.
Not only does it require fewer steps, it can also be made embarrassingly parallel.

In fact, a whole section of the book "Introduction to Modern Cryptography" is dedicated to the subject of collision attacks.  An erratum covering the section can be found here: http://www.cs.umd.edu/~jkatz/imc/hash-erratum.pdf (it contains mathematical proofs of the claims listed above).

Additional discussion on the matter can be found here: http://crypto.stackexchange.com/questions/3295/how-does-a-birthday-attack-on-a-hashing-algorithm-work

The fact that the algorithm uses a constant amount of memory regardless of the hash output size means that no matter how large the MAX_MOMENTUM_NONCE or SEARCH_SPACE_BITS values are, the memory requirements are essentially zero.  This means that no matter how slow the algorithm is, it's infinitely more efficient (memory vs collision rate) than the current implementation.
Given that it requires a constant amount of memory, Momentum as a proof-of-work is essentially equivalent to scrypt/md5/sha512 at a higher difficulty.


To reinforce these claims, I'm actively working on a proof of concept that uses constant memory and almost nothing but SHA512 calls.
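A minimal sketch of the general technique (not the PoC itself): the van Oorschot-Wiener "distinguished points" method, which is the textbook way to run a memoryless collision search in parallel.  The toy 32-bit hash and every parameter below are placeholders:

Code:
#include <cstdint>
#include <cstdio>
#include <cstdlib>
#include <map>
#include <mutex>
#include <thread>
#include <vector>

static uint32_t H(uint32_t x) {      // toy 32-bit hash standing in for truncated SHA512
    x ^= x >> 16; x *= 0x7feb352dU;
    x ^= x >> 15; x *= 0x846ca68bU;
    return x ^ (x >> 16);
}

static bool distinguished(uint32_t x) { return (x & 0xFFF) == 0; }  // ~1 in 4096 outputs

int main() {
    std::map<uint32_t, uint32_t> tips;   // distinguished point -> trail start
    std::mutex mtx;
    std::vector<std::thread> pool;
    for (uint32_t t = 0; t < 4; ++t)
        pool.emplace_back([&, t] {
            for (uint32_t start = t + 1; ; start += 4) {   // each thread owns a stride
                uint32_t x = start;
                for (int step = 0; step < (1 << 16); ++step) {
                    x = H(x);
                    if (!distinguished(x)) continue;
                    std::lock_guard<std::mutex> lk(mtx);
                    auto it = tips.find(x);
                    if (it != tips.end() && it->second != start) {
                        // Two trails reached the same distinguished point, so they
                        // merged somewhere; re-walking both recovers the collision.
                        printf("trails %u and %u merged\n", it->second, start);
                        exit(0);
                    }
                    tips[x] = start;
                    break;                 // start a fresh trail
                }
            }
        });
    for (auto& th : pool) th.join();
}

Each walker keeps O(1) state, the shared table holds only trail tips, and the number of walkers scales with available hardware.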
BTC: 1E2egHUcLDAmcxcqZqpL18TPLx9Xj1akcV
PTS: Pstkk1gZCxc4GEwS1eBAykYwVmcubU1P8L

Offline AnonyMint

Quote
The point of scrypt was to make parallel operation difficult, and its primary weakness is that validation time suffers as you scale it.

If you don't require the first match in the sequence of generated pseudo-random numbers, then the parallelism attack I described upthread lets GPUs blow away CPUs: assuming the pseudo-random sequence can be generated much faster than the CPU's random-access memory latency bound, the GPU can try hundreds of values from the same pseudo-random sequence in parallel to hide the memory latency.

If you do require it, then you lose the fast validation time.

Do you get it now, or are you going to continue to ignore me?

I took the time to explain this in more detail:

https://bitcointalk.org/index.php?topic=325261.msg3520992#msg3520992

I think that is the last effort I will apply to this.

Even if you allow that a GPU is "leaps and bounds" faster than CPUs, you still run into major problems.

The time complexity of searching with a CPU using memory lookups is O(n), whereas brute-forcing combinations with a GPU is still O(n^2).
Regardless of how much faster a GPU is, it is faster only by a constant multiple, while the number of calculations is roughly squared.

bytemaster wrote that upthread, but I don't understand why you say so, and he never answered.

The GPU is using the same amount of memory and the same number sequence. It is searching the same thing.

Why do you claim that the GPU has a squared time complexity? I see no basis whatsoever for that claim.

I have only described how to speed up the search of the same sequence and memory as the CPU is using, by employing more simultaneous lookups.

I must assume both of you are not reading carefully.  I am not writing about replacing any memory storage with computation; I made that clear from my very first post.  You are apparently not taking the time to understand what I wrote.
« Last Edit: November 08, 2013, 05:02:48 pm by AnonyMint »

Offline gigawatt

Quote
The point of scrypt was to make parallel operation difficult, and its primary weakness is that validation time suffers as you scale it.

If you don't require the first match in the sequence of generated pseudo-random numbers, then the parallelism attack I described upthread lets GPUs blow away CPUs: assuming the pseudo-random sequence can be generated much faster than the CPU's random-access memory latency bound, the GPU can try hundreds of values from the same pseudo-random sequence in parallel to hide the memory latency.

If you do require it, then you lose the fast validation time.

Do you get it now, or are you going to continue to ignore me?

I took the time to explain this in more detail:

https://bitcointalk.org/index.php?topic=325261.msg3520992#msg3520992

I think that is the last effort I will apply to this.

Even if you allow that a GPU is "leaps and bounds" faster than CPUs, you still run into major problems.

The time complexity of searching with a CPU using memory lookups is O(n), whereas brute-forcing combinations with a GPU is still O(n^2).
Regardless of how much faster a GPU is, it is faster only by a constant multiple, while the number of calculations is roughly squared.
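To make the squaring concrete (my arithmetic, using the numbers already in the thread), with n = 2^26 candidate birthdays:

   table search:     n inserts + n lookups  =  ~2^27 operations
   pairwise search:  n(n-1)/2               =  ~2^51 comparisons

So even a GPU that is 1000x (roughly 2^10) faster per operation is still about 2^14 times slower overall on the pairwise approach.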

Offline AnonyMint

The point of scrypt was to make parallel operation difficult, and its primary weakness is that validation time suffers as you scale it.

If you don't require the first match in the sequence of generated pseudo-random numbers, then the parallelism attack I described upthread lets GPUs blow away CPUs: assuming the pseudo-random sequence can be generated much faster than the CPU's random-access memory latency bound, the GPU can try hundreds of values from the same pseudo-random sequence in parallel to hide the memory latency.

If you do require it, then you lose the fast validation time.
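A rough sketch of the overlap I keep describing, in CPU form with software prefetch standing in for the GPU's hundreds of hardware threads; every name and parameter here is illustrative:

Code:
#include <cstdint>
#include <cstdio>
#include <vector>

static uint64_t next(uint64_t x) {   // xorshift64 as a stand-in pseudo-random sequence
    x ^= x << 13; x ^= x >> 7; x ^= x << 17;
    return x;
}

// Issues K probes into the table before consuming any result, so the memory
// latencies overlap instead of serializing.
int count_hits(const std::vector<uint8_t>& table, uint64_t seed, int n) {
    const int K = 16;                      // probes kept in flight at once
    uint64_t idx[K];
    int hits = 0;
    for (int i = 0; i + K <= n; i += K) {
        for (int j = 0; j < K; ++j) {      // issue all K addresses first
            seed = next(seed);
            idx[j] = seed % table.size();
            __builtin_prefetch(&table[idx[j]]);
        }
        for (int j = 0; j < K; ++j)        // consume once the lines are in flight
            hits += table[idx[j]];
    }
    return hits;
}

int main() {
    std::vector<uint8_t> table(1 << 24, 0);   // 16 MB stand-in lookup table
    printf("hits: %d\n", count_hits(table, 42, 1 << 20));
    return 0;
}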

Do you get it now, or are you going to continue to ignore me?

I took the time to explain this in more detail:

https://bitcointalk.org/index.php?topic=325261.msg3520992#msg3520992

I think that is the last effort I will apply to this.

Offline bytemaster

Quote
The results of the experiment in Table 4 evoke some interesting discussion. If one excludes the preprocessing step, the speed-up is significant.

The preprocessing step, however, is an integral part of the algorithm to port the Bloom filter onto the GPU. Thus we need to come up with better ways to preprocess a given set of keys.

One more notable observation is that the actual filter construction time and communication latency between GPU and CPU is independent of the key size.

I haven't read the other papers... but so far it appears non-trivial.
For the latest updates checkout my blog: http://bytemaster.bitshares.org
Anything said on these forums does not constitute an intent to create a legal obligation or contract between myself and anyone else.   These are merely my opinions and I reserve the right to change them at any time.

Offline gigawatt

Quote
gigawatt,
     Thank you for providing the first innovative algorithm for reducing the memory requirements.  Let me attempt to post mitigating factors to your algorithm.

     From a CPU miner's perspective, your reduction in memory comes at the expense of performance, so it does break the algorithmic complexity of the algorithm.

     From a GPU perspective, you have to populate a Bloom filter with 2^26 results... based upon my understanding of how Bloom filters operate, this would require updating a common data structure from every thread, and the resulting memory race conditions could create false negatives.  If you have to do this step sequentially, then you might as well use a CPU with memory.

     So, do you have any solid algorithms that can populate a Bloom filter with 2^26 results in parallel?

There are actually a few academic papers on the topic, mainly because Bloom filters have lots of applications.

https://sites.google.com/site/venkateshnrao/projects/parallel-bloom-filter-implementation-on-gpu-using-cuda
https://sbs.cse.wustl.edu/pubs/mcbf11.pdf
http://www.gpucomputing.net/?q=node/13381
http://dl.acm.org/citation.cfm?id=1982363
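The key observation, at least as I understand the parallel approaches: a Bloom filter insert only ever ORs bits in, so an atomic OR makes concurrent population lossless and cannot introduce false negatives.  A minimal CPU sketch, with sizes, hash family, and thread count as placeholders (on a GPU the same pattern would use atomicOr):

Code:
#include <atomic>
#include <cstdint>
#include <thread>
#include <vector>

const uint64_t BITS = 1ULL << 27;            // filter size: an assumption for the sketch
std::vector<std::atomic<uint64_t>> filter(BITS / 64);

static uint64_t mix(uint64_t x, uint64_t salt) {   // stand-in for a real hash family
    x = (x ^ salt) * 0x9e3779b97f4a7c15ULL;
    return x ^ (x >> 32);
}

void insert(uint64_t birthday) {
    for (uint64_t k = 0; k < 3; ++k) {       // three independent hash functions
        uint64_t bit = mix(birthday, k) % BITS;
        // fetch_or never loses a concurrent write, so parallel population
        // cannot create false negatives -- only the usual false positives.
        filter[bit / 64].fetch_or(1ULL << (bit % 64), std::memory_order_relaxed);
    }
}

int main() {
    std::vector<std::thread> pool;
    for (uint64_t t = 0; t < 4; ++t)         // 2^26 inserts split across 4 threads
        pool.emplace_back([t] {
            for (uint64_t n = t; n < (1ULL << 26); n += 4)
                insert(mix(n, 99));          // stand-in birthdays
        });
    for (auto& th : pool) th.join();
    return 0;
}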

Offline bytemaster

My replies:
Quote
I suppose the performance of your algorithm would also suffer if SCRYPT(X) were used instead of SHA512(X), because the cost of doing this step twice would be much higher and less GPU-friendly.

Quote
I wonder what would happen if we used a NESTED Momentum proof of work.

Change the nonce-space of the outer proof of work to a pair of 16-bit numbers that result in an X-bit collision.

Now you have a more memory-intensive inner hash that is still quick to validate, but it would significantly complicate GPU miners.
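If I sketch that idea, a valid outer nonce would be any pair of 16-bit values whose inner hashes collide on X bits.  The inner hash, the value of X, and the header handling below are all placeholders, not a spec:

Code:
#include <cstdint>
#include <cstdio>
#include <unordered_map>

const uint32_t X = 24;                         // collision width: placeholder

static uint32_t inner(uint16_t n, uint64_t header_seed) {   // stand-in inner hash
    uint64_t x = (header_seed ^ n) * 0x9e3779b97f4a7c15ULL;
    return (uint32_t)(x >> 16) & ((1u << X) - 1);
}

int main() {
    uint64_t header_seed = 0x1234abcdULL;      // placeholder for the block header digest
    std::unordered_map<uint32_t, uint16_t> seen;   // only 2^16 entries: a tiny table
    for (uint32_t a = 0; a < (1u << 16); ++a) {
        uint32_t h = inner((uint16_t)a, header_seed);
        auto it = seen.find(h);
        if (it != seen.end())                  // (a, b) collide on X bits:
            printf("outer nonce pair (%u, %u)\n", it->second, a);  // candidate outer nonce
        else
            seen.emplace(h, (uint16_t)a);
    }
    return 0;
}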

Note: the reason we selected a fast hash like SHA512 rather than a slow hash like scrypt is to maximize memory bus bandwidth.

Offline bytemaster

Quote
gigawatt,
     Thank you for providing the first innovative algorithm for reducing the memory requirements.  Let me attempt to post mitigating factors to your algorithm.

     From a CPU miner's perspective, your reduction in memory comes at the expense of performance, so it does break the algorithmic complexity of the algorithm.

     From a GPU perspective, you have to populate a Bloom filter with 2^26 results... based upon my understanding of how Bloom filters operate, this would require updating a common data structure from every thread, and the resulting memory race conditions could create false negatives.  If you have to do this step sequentially, then you might as well use a CPU with memory.

     So, do you have any solid algorithms that can populate a Bloom filter with 2^26 results in parallel?