bytemaster, hope you saw my response. I think you underestimated the issue:
https://bitcointalk.org/index.php?topic=325261.msg3523201#msg3523201
Integrated graphics can saturate the memory bus (60 GB/s on the latest hardware) using the same techniques as hyper-threading to perform parallel execution. GPUs can get 5x faster memory performance, though I think in this particular use case moving more than 64 bytes (one cache line) per access doesn't help, due to the random-access nature of the workload. In fact, it may hurt the GPU by forcing it to consume its memory bandwidth in larger-than-necessary chunks when the access pattern is random.
I think it is fair to say that the performance of the memory-based approach is limited by memory bandwidth, and that 'in theory' a GPU could reach 260 GB/s while high-end desktops get 60 GB/s and low-end systems 10 GB/s. So there is up to a 26x difference between low-end CPU-based systems and high-end graphics cards, and roughly a 4x difference against a high-end desktop.
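To make the bandwidth argument concrete, here is a back-of-envelope sketch. The figures are the illustrative specs quoted above, not measurements, and it assumes each probe of the hash table touches exactly one random 64-byte cache line:

```python
# Back-of-envelope: if every hash-table probe moves one random 64-byte
# cache line, throughput is bounded by bandwidth / line_size.
# Bandwidth figures are the illustrative ones from this post, not benchmarks.
LINE = 64  # bytes moved per random access (one cache line)

def probes_per_sec(bandwidth_gb_per_s: float) -> float:
    return bandwidth_gb_per_s * 1e9 / LINE

gpu, desktop, low_end = 260, 60, 10  # GB/s

print(probes_per_sec(gpu) / probes_per_sec(low_end))   # 26x vs low-end
print(probes_per_sec(gpu) / probes_per_sec(desktop))   # ~4.3x vs high-end desktop
```

In other words, the headline 26x only holds against low-end systems; against a high-end desktop the spec-sheet gap is closer to 4x, before any of the discounts discussed below.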
By the way, I am very much considering paying the bounty based upon the information provided thus far. I just need to understand the degree to which Momentum has been harmed by the algorithms mentioned.
1) What is the performance of SCRYPT on CPU vs GPU?
2) Does momentum have a lower performance ratio than SCRYPT?
3) When it comes to making an ASIC, which would be harder, SCRYPT or Momentum?
As far as how to divide the bounty among the various contributors, I am thinking that AnonyMint and gigawatt have both provided some reasonable contributions. I will wait to see gigawatt's proof-of-concept implementation, to be sure there is a cycle and that the computational time is not larger than the available parallelism, before making the final decision about relative payout. Gigawatt's proof-of-concept shows more time and energy invested and is also more compelling than the theoretical estimations of performance gains provided by AnonyMint. Clearly, writing code is of greater value than writing forum posts, and also more definitive.
Summary of Algorithms Presented:
1) Bloom Filter to Reduce Memory - comes at a performance cost; may be possible to run in parallel with more R&D, which leads to GPUs...
2) Memory Bandwidth of GPU vs CPU Architecture - gives a potential 4x to 26x advantage to the GPU based upon specs alone; must be discounted by the overhead of accessing memory in larger-than-necessary chunks, and/or the bloom filter, and/or sort/search algorithms with different time complexity.
3) Constant-Memory Cycle Search Algorithms - trade CPU time for memory in a manner that might make parallelism free from memory-bus constraints. Must prove that cycles exist and that the calculation time does not exceed the available level of parallelism.
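For concreteness, here is a rough sketch of how a bloom filter can cut the memory of the birthday-collision search. All names and parameters here are my own illustration, not gigawatt's actual implementation, and the toy `birthday` function merely stands in for Momentum's real BirthdayHash (which uses 50-bit birthdays; the size is shrunk here so collisions show up quickly). Pass 1 check-then-inserts every hash into the filter and records suspected repeats; pass 2 confirms candidates with a small exact table, so the exact table only ever holds candidates rather than one entry per nonce:

```python
import hashlib

# Toy stand-in for Momentum's BirthdayHash: a small "birthday" value
# derived from SHA-512(header || nonce). Real Momentum uses 50-bit
# birthdays; 'bits' is shrunk here so collisions appear quickly.
def birthday(header: bytes, nonce: int, bits: int) -> int:
    h = hashlib.sha512(header + nonce.to_bytes(8, "little")).digest()
    return int.from_bytes(h[:8], "little") % (1 << bits)

class Bloom:
    """Minimal Bloom filter: k bit positions per element in an m-bit array."""
    def __init__(self, m_bits: int, k: int = 3):
        self.m, self.k = m_bits, k
        self.bits = bytearray(m_bits // 8)

    def _positions(self, x: int):
        for i in range(self.k):
            yield hash((i, x)) % self.m  # deterministic for ints

    def add(self, x: int):
        for p in self._positions(x):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, x: int) -> bool:
        return all(self.bits[p // 8] >> (p % 8) & 1 for p in self._positions(x))

def find_collisions(header: bytes, n_nonces: int, bits: int):
    bloom = Bloom(m_bits=1 << 22)
    candidates = set()
    # Pass 1: check-then-insert. A hit means this birthday was (probably)
    # seen before -- a candidate collision or a bloom false positive.
    for nonce in range(n_nonces):
        b = birthday(header, nonce, bits)
        if b in bloom:
            candidates.add(b)
        bloom.add(b)
    # Pass 2: exact confirmation. The hash table stores only candidate
    # birthdays, far fewer than all n_nonces entries.
    seen, pairs = {}, []
    for nonce in range(n_nonces):
        b = birthday(header, nonce, bits)
        if b in candidates:
            if b in seen:
                pairs.append((seen[b], nonce))
            else:
                seen[b] = nonce
    return pairs
```

The memory saving is exactly the performance cost mentioned in point 1: every nonce is hashed twice, in exchange for replacing the full hash table with a bit array plus a small candidate table.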
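And here is a minimal sketch of the constant-memory cycle idea, again my own illustration rather than any contributor's actual code: feed a truncated hash back in as the next input so the walk eventually enters a cycle, then use Floyd's tortoise-and-hare to recover two distinct inputs with the same output in O(1) memory. Whether this structure carries over to real Momentum is exactly the open question in point 3 about cycles and computation time:

```python
import hashlib

# One step of the walk: value -> truncated hash, fed back in as the next
# value. 'bits' is kept small so a cycle appears after ~2^(bits/2) steps.
def f(x: int, header: bytes = b"example-header", bits: int = 24) -> int:
    h = hashlib.sha512(header + x.to_bytes(8, "little")).digest()
    return int.from_bytes(h[:8], "little") % (1 << bits)

def rho_collision(step, x0: int):
    """Floyd's tortoise-and-hare: find a != b with step(a) == step(b)
    using O(1) memory. Returns None in the rare case x0 lies on the cycle."""
    tortoise, hare = step(x0), step(step(x0))
    while tortoise != hare:
        tortoise, hare = step(tortoise), step(step(hare))
    # Walk from the start and from the meeting point in lockstep; just
    # before they first coincide, the two walkers are distinct preimages
    # of the same value -- a hash collision found without a big table.
    a, b = x0, hare
    if a == b:
        return None  # no tail: x0 was already on the cycle
    while step(a) != step(b):
        a, b = step(a), step(b)
    return a, b
```

Each rho walk here is independent of every other, which is why this line of attack could in principle trade the memory-bus bottleneck for raw parallel computation; the caveat from point 3, that the calculation time must not exceed the available parallelism, still applies.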
Good work everyone, it is a pleasure debating this with you all. I hope you recognize that I am in this to find the best algorithm and NOT to defend my ego and put my head in the sand. This bounty and ProtoShares has brought out great minds and we sharpen one another.