FT,
There are several different algorithm variations.
av=1 is HT-friendly. On a 4-core CPU with HT you should run it with 8 threads and 1 GB of RAM per thread.
av=3 is for non-HT. If you plan to run 4 threads on an HT machine, make sure you select av=3. Best of all is to pin your threads to cores, which unlocks the full potential of av=3.
av=2 is mediocre. It sits in between 1 and 3, but you may give it a try.
av=4 is for older AMD CPUs; people have reported it was the best in some cases.
av=5 targets Atoms in 64-bit mode and has completely different memory usage. It is a very slow one, but still faster than the original one.
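To make the rules of thumb above easier to apply, here is a rough Python sketch. It is not part of the miner: `MinerPlan`, `suggest_plan` and `pin_to_cpu` are hypothetical names of my own. It also assumes one thread per physical core for av=3, av=4 and av=5, which the list above does not state, and it fills in a RAM figure only for av=1, the one case where 1 GB per thread is given. av=2 is left out because it is only described as something to try by hand.

```python
import os
from dataclasses import dataclass
from typing import Optional


@dataclass
class MinerPlan:
    av: int                 # algorithm variation number from the list above
    threads: int            # number of mining threads to start
    ram_gb: Optional[int]   # total RAM in GB, None where no figure is given


def suggest_plan(physical_cores: int, has_ht: bool, use_ht_threads: bool = True,
                 older_amd: bool = False, atom_64bit: bool = False) -> MinerPlan:
    """Hypothetical helper that maps the rules of thumb onto a plan."""
    if atom_64bit:
        # av=5 has different memory usage, so no RAM figure is assumed.
        return MinerPlan(av=5, threads=physical_cores, ram_gb=None)
    if older_amd:
        # Reported as best "in some cases" only, so treat it as a starting point.
        return MinerPlan(av=4, threads=physical_cores, ram_gb=None)
    if has_ht and use_ht_threads:
        # HT-friendly variant: two threads per core, 1 GB of RAM per thread.
        threads = physical_cores * 2
        return MinerPlan(av=1, threads=threads, ram_gb=threads * 1)
    # Non-HT CPU, or an HT CPU running only one thread per physical core.
    return MinerPlan(av=3, threads=physical_cores, ram_gb=None)


def pin_to_cpu(cpu: int) -> bool:
    """Restrict the caller to one logical CPU; returns False if unsupported.

    os.sched_setaffinity exists only on some platforms (e.g. Linux).
    Which logical CPU numbers belong to distinct physical cores depends on
    the machine, so check the topology before choosing them.
    """
    if not hasattr(os, "sched_setaffinity"):
        return False
    os.sched_setaffinity(0, {cpu})
    return True
```

For example, `suggest_plan(4, has_ht=True)` gives av=1 with 8 threads and 8 GB in total, while `suggest_plan(4, has_ht=True, use_ht_threads=False)` gives av=3 with 4 threads, matching the two 4-core HT cases above.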
As for GPUs, I will evaluate the possibilities, and as soon as GPUs are in the game I will definitely jump in to stay on par.
It is equally probable that GPUs will have some bottlenecks, but we shall see.
There are still some more optimizations possible for CPU. I have not even collected enough statistics to fully optimize memory access patterns; I just added some basic stuff. There is also an equal chance that doubling memory per thread will bring a performance gain, even though it looks completely strange.
yvg1900