Author Topic: $5000 Bounty - Momentum Proof-of-Work Hacking [PAID]  (Read 54059 times)

0 Members and 1 Guest are viewing this topic.

Offline AnonyMint

  • Jr. Member
  • **
  • Posts: 30
    • View Profile
While the GPU can hide latency it cannot change bandwidth and your bandwidth usage is still N^2.

Memory bandwidth is several times faster on the GPU than CPU and you won't get close to saturating memory bandwidth on the CPU because the full random memory access latency is too high and you can only run at most 8 hardware hyperthreads on a core i7 thus unable to hide the latency entirely. Software threads above that 8 count don't matter.

You have leveraged a weakness of the CPU relative to the GPU. To make a CPU-only PoW, you must leverage a strength of the CPU relative to the GPU and ASIC.

On the CPU your algorithm is random main memory latency bound while on the GPU it is main memory bandwidth bound. Thus on the CPU approximately at best 64 bytes cache line size at several hundred clock cycles per random access, and on the GPU up to 260 GB per second if utilizing the full 128 byte cache line size.

Also if the GPU has 6 GB and your target memory is 0.75 GB, then the GPU can run 8 instances and each can reach maximum memory bandwidth.
« Last Edit: November 06, 2013, 10:21:56 pm by AnonyMint »

Offline bytemaster

Here is how I would implement this on a GPU:

Generate 2^26 BIRTHDAYS
GPU SORT
Perform 2^25 checks against neighbors for collisions in parallel.

This would probably be the FASTEST way to implement this proof of work on a GPU but still requires TARGET_MEMORY and thus does not qualify for the bounty.
For the latest updates checkout my blog: http://bytemaster.bitshares.org
Anything said on these forums does not constitute an intent to create a legal obligation or contract between myself and anyone else.   These are merely my opinions and I reserve the right to change them at any time.

Offline bytemaster

Okay if I understand correctly your proposed PoW algorithm, we are generating pseudo-random values that can range over the address space of the memory size we target until we generate a duplicate (or triplicate). The Birthday math tells us that we will find a solution with 50% probability before visiting less than roughly 10% of the addresses, or 99% before visiting less than roughly 20% of the addresses. Thus this is a way to require large memory without actually writing to every memory address, i.e. a sparse memory access algorithm. Btw, I also considered this sort of algorithm before and dumped it.

Assuming the generation of the pseudo-random values can be accelerated by parallelism, i.e. that the algorithm is main memory latency bound, then the GPU is going to be much faster, because the GPU masks memory latency by employing 1000s of threads, i.e. we can test 1000s of pseudo-random values in parallel. This was the key myopia that caused Litecoin to fail at stopping GPUs from outperforming CPUs, although it did improve the difference relative to Bitcoin's PoW.

I don't see how the use of a hash table really changes the outcome significantly, as it will only serialize perhaps every 10 buckets and we assume the pseudo-random values are uniformly distributed over the entire address space. Collisions with a 1000 threads are going to be rare.

I have revealed my key insight into PoW. If you weren't aware of this and I am correct, I hope you tip me consumerately.

Ok... a few corrections for you to consider.   The address space of the birthdays is 50 bits.   The TARGET_MEMORY is the number of birthdays that must be stored to have a 90+% probability of finding 1 collision once you have filled TARGET_MEMORY.    As a result, this isn't sparse memory access unless you are operating on some MULTIPLE of the TARGET_MEMORY and that multiple would have to be very high (scaling with the number of GPU threads) if you want to avoid threads stomping on each other with high probability. 

Your GPU algorithm is still O(C * N^2) where C is some constant factor improvement.
CPU algorithm is O( C * N )

While the GPU can hide latency it cannot change bandwidth and your bandwidth usage is still N^2.
For the latest updates checkout my blog: http://bytemaster.bitshares.org
Anything said on these forums does not constitute an intent to create a legal obligation or contract between myself and anyone else.   These are merely my opinions and I reserve the right to change them at any time.

Offline bytemaster

Okay if I understand correctly your proposed PoW algorithm, we are generating pseudo-random values that can range over the address space of the memory size we target until we generate a duplicate (or triplicate). The Birthday math tells us that we will find a solution with 50% probability before visiting less than roughly 10% of the addresses, or 99% before visiting less than roughly 20% of the addresses. Thus this is a way to require large memory without actually writing to every memory address, i.e. a sparse memory access algorithm. Btw, I also considered this sort of algorithm before and dumped it.

Assuming the generation of the pseudo-random values can be accelerated by parallelism, i.e. that the algorithm is main memory latency bound, then the GPU is going to be much faster, because the GPU masks memory latency by employing 1000s of threads, i.e. we can test 1000s of pseudo-random values in parallel. This was the key myopia that caused Litecoin to fail at stopping GPUs from outperforming CPUs, although it did improve the difference relative to Bitcoin's PoW.

I don't see how the use of a hash table really changes the outcome significantly, as it will only serialize perhaps every 10 buckets and we assume the pseudo-random values are uniformly distributed over the entire address space. Collisions with a 1000 threads are going to be rare.

I have revealed my key insight into PoW. If you weren't aware of this and I am correct, I hope you tip me consumerately.

Ok... a few corrections for you to consider.   The address space of the birthdays is 50 bits.   The TARGET_MEMORY is the number of birthdays that must be stored to have a 90+% probability of finding 1 collision once you have filled TARGET_MEMORY.    As a result, this isn't sparse memory access unless you are operating on some MULTIPLE of the TARGET_MEMORY. 

Your GPU algorithm is still O(C * N^2) where C is some constant factor improvement.
CPU algorithm is O( C * N )

While the GPU can hide latency it cannot change bandwidth and your bandwidth usage is still N^2.

For the latest updates checkout my blog: http://bytemaster.bitshares.org
Anything said on these forums does not constitute an intent to create a legal obligation or contract between myself and anyone else.   These are merely my opinions and I reserve the right to change them at any time.

Offline AnonyMint

  • Jr. Member
  • **
  • Posts: 30
    • View Profile
Okay if I understand correctly your proposed PoW algorithm, we are generating pseudo-random values that can range over the address space of the memory size we target until we generate a duplicate (or triplicate). The Birthday math tells us that we will find a solution with 50% probability before visiting less than roughly 10% of the addresses, or 99% before visiting less than roughly 20% of the addresses. Thus this is a way to require large memory without actually writing to every memory address, i.e. a sparse memory access algorithm. Btw, I also considered this sort of algorithm before and dumped it.

Assuming the generation of the pseudo-random values can be accelerated by parallelism, i.e. that the algorithm is main memory latency bound, then the GPU is going to be much faster, because the GPU masks memory latency by employing 1000s of threads, i.e. we can test 1000s of pseudo-random values in parallel. This was the key myopia that caused Litecoin to fail at stopping GPUs from outperforming CPUs, although it did improve the difference relative to Bitcoin's PoW.

I don't see how the use of a hash table really changes the outcome significantly, as it will only serialize perhaps every 10 buckets and we assume the pseudo-random values are uniformly distributed over the entire address space. Collisions with a 1000 threads are going to be rare.

I have revealed my key insight into PoW. If you weren't aware of this and I am correct, I hope you tip me consumerately regardless if it leads you to make changes or not (I should not be responsible for your success, only for proving what doesn't work).

Edited: to insert word "not" after "I should" that was missing.
« Last Edit: November 07, 2013, 09:20:40 pm by AnonyMint »

Offline bytemaster

I will consider a tip if it leads me to make improvements.
For the latest updates checkout my blog: http://bytemaster.bitshares.org
Anything said on these forums does not constitute an intent to create a legal obligation or contract between myself and anyone else.   These are merely my opinions and I reserve the right to change them at any time.

Offline AnonyMint

  • Jr. Member
  • **
  • Posts: 30
    • View Profile
bytemaster, I don't have spare time to implement. How about I share my thoughts and then you decide if I deserve even a small token payment for my contribution? You might gain some insight into how I think a CPU-only PoW could be implemented, but I won't be providing a complete algorithm today.

I don't even have a GPU handy, nor the development environment for them. I would be speaking on a theoretical basis after studying Scrypt and understanding why Litecoin failed (yet still improved the ratio for GPU advantage compared to Bitcoin).
« Last Edit: November 06, 2013, 05:35:07 pm by AnonyMint »

Offline bytemaster

For example:  Suppose momentum as specked requires about 1 GB of RAM for maximum performance and produces hashes at 1hz.   If you design an algorithm that only requires 1 MB of RAM and can find hashes at .001 hz then you win because you will have proven that computational complexity is linear with RAM use rather than non-linear like I believe the problem calls for.   

Best of luck to you all!

I believe there is a weakness where GPUs can run much faster, but it doesn't requiring lowering the memory. If I explain it to you and am correct, do I qualify for at least part of the bounty?

It is possible I am incorrect or have misunderstood some aspect of the algorithm, so let's not jump to conclusions yet.

There is a big difference between theory and practice when it comes to GPUs and considering the price on the bounty you can afford to provide an implementation.  Your implementation can use a simpler hash function than SHA512 because we are testing the algorithmic complexity not the cryptographic security.

Here are my thoughts on why the GPU cannot do it faster, test them against your theory:
1) I am sure a GPU can calculate the SHA512 birthdays faster.... but it will do no good because the process of comparing every birthday to every other birthday is either:

      a) brute force linear search which could be done in parallel.  The challenge is that such a linear search would be very taxing on the GPU's memory bus to have every thread read every value.

      b) slightly better... use the GPU to sort the data after generating it... such sorting also requires heavy memory bandwidth use and then ultimately ends up with a linear scan of the memory looking for matching neighbors

      d) implement a hash table to some kind on the GPU.  Unfortunately, for the hash table to be effective at finding duplicates items have to be inserted sequentially.

2) Assume we divide and conquer this work between the GPU and CPU... in this case you generate the birthday's in parallel on the GPU then search for collisions on the CPU.  In the middle you have to transfer all of the data across the PCI-E bus which is slower than your memory bus which is already close to the bottleneck. 

3) I count integrated graphics as the CPU


For the latest updates checkout my blog: http://bytemaster.bitshares.org
Anything said on these forums does not constitute an intent to create a legal obligation or contract between myself and anyone else.   These are merely my opinions and I reserve the right to change them at any time.

Offline bytemaster

Would you please clarify exactly what TARGET_RAM_USE is? Is it the total memory required to store the hashtable of all nonce results or some smaller value?

Should we use your example from http://bitsharestalk.org/index.php?topic=21.msg26#msg26 to empirically confirm? Also could you release FreeTrade's current implementation or standalone subset you mentioned there for testing?

Thanks,

Nathaniel

The algorithm should work at any scale, but for the sake of this bounty lets set TARGET_RAM_USE at 768MB.

https://github.com/InvictusInnovations/ProtoShares/blob/psforkinit/src/momentum.cpp
For the latest updates checkout my blog: http://bytemaster.bitshares.org
Anything said on these forums does not constitute an intent to create a legal obligation or contract between myself and anyone else.   These are merely my opinions and I reserve the right to change them at any time.

Offline Nathaniel B

  • Newbie
  • *
  • Posts: 6
    • View Profile
Would you please clarify exactly what TARGET_RAM_USE is? Is it the total memory required to store the hashtable of all nonce results or some smaller value?

Should we use your example from http://bitsharestalk.org/index.php?topic=21.msg26#msg26 to empirically confirm? Also could you release FreeTrade's current implementation or standalone subset you mentioned there for testing?

Thanks,

Nathaniel

Offline AnonyMint

  • Jr. Member
  • **
  • Posts: 30
    • View Profile
For example:  Suppose momentum as specked requires about 1 GB of RAM for maximum performance and produces hashes at 1hz.   If you design an algorithm that only requires 1 MB of RAM and can find hashes at .001 hz then you win because you will have proven that computational complexity is linear with RAM use rather than non-linear like I believe the problem calls for.   

Best of luck to you all!

I believe there is a weakness where GPUs can run much faster, but it doesn't requiring lowering the memory. If I explain it to you and am correct, do I qualify for at least part of the bounty?

It is possible I am incorrect or have misunderstood some aspect of the algorithm, so let's not jump to conclusions yet.
« Last Edit: November 06, 2013, 03:12:28 pm by AnonyMint »

Offline bytemaster

is there a deadline for the bounty?
I think I found a weakness but I'm not sure I can prove it before tomorrow.
Is there a working source code that I can download and modify in order to test my theory?

I thought that I could use rainbow tables in order to store only a small part of the birthday/nonce pairs and still be able to find collisions.
But since the space of nonces is 2^26 and the space of birthdays is 2^50 it wouldn't work because it wuuld find too many artificial collisions.
IF the nonce and birhday spaces were of the same order, then it would work perfectly.... but it looks like you are safe from rainbow tables: I can't think of a way to make them work.

Thanks for trying!  The more people who try the better :)   One ProtoShares is out there will be a far bigger bounty :)
For the latest updates checkout my blog: http://bytemaster.bitshares.org
Anything said on these forums does not constitute an intent to create a legal obligation or contract between myself and anyone else.   These are merely my opinions and I reserve the right to change them at any time.

Offline gatra

  • Newbie
  • *
  • Posts: 3
    • View Profile
is there a deadline for the bounty?
I think I found a weakness but I'm not sure I can prove it before tomorrow.
Is there a working source code that I can download and modify in order to test my theory?

I thought that I could use rainbow tables in order to store only a small part of the birthday/nonce pairs and still be able to find collisions.
But since the space of nonces is 2^26 and the space of birthdays is 2^50 it wouldn't work because it wuuld find too many artificial collisions.
IF the nonce and birhday spaces were of the same order, then it would work perfectly.... but it looks like you are safe from rainbow tables: I can't think of a way to make them work.

Offline bytemaster

is there a deadline for the bounty?
I think I found a weakness but I'm not sure I can prove it before tomorrow.
Is there a working source code that I can download and modify in order to test my theory?

Bounty has no deadline. 
https://github.com/InvictusInnovations/ProtoShares/blob/psforkinit/src/momentum.cpp

For the latest updates checkout my blog: http://bytemaster.bitshares.org
Anything said on these forums does not constitute an intent to create a legal obligation or contract between myself and anyone else.   These are merely my opinions and I reserve the right to change them at any time.

Offline gatra

  • Newbie
  • *
  • Posts: 3
    • View Profile
is there a deadline for the bounty?
I think I found a weakness but I'm not sure I can prove it before tomorrow.
Is there a working source code that I can download and modify in order to test my theory?