@bytemaster seems like you have poorly parallelized code; on a 24-core machine it gives only ~50-70% load =\
You have lower latencies. There is some overhead associated with synchronizing threads, and when you divide the 2^26 search space across 24 threads, you amplify the share of time spent in single-threaded code. Try running 3 instances at 8 cores each.
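A rough way to see why the thread count matters is Amdahl's law. The sketch below is only an illustration, not taken from the miner: the 3% serial fraction is an assumed number, not a measurement, and `utilization` is a hypothetical helper name.

```python
def utilization(serial_fraction: float, threads: int) -> float:
    """Average per-core load predicted by Amdahl's law.

    speedup = 1 / (s + (1 - s) / n), and utilization = speedup / n.
    """
    s = serial_fraction
    speedup = 1.0 / (s + (1.0 - s) / threads)
    return speedup / threads


# Assumption for illustration only: 3% of each round is single-threaded
# (synchronization, merging results). This is not a measured value.
SERIAL_FRACTION = 0.03

one_big = utilization(SERIAL_FRACTION, 24)     # one process, 24 threads
three_small = utilization(SERIAL_FRACTION, 8)  # each of 3 processes, 8 threads

print(f"1 x 24 threads: {one_big:.0%} average load")      # 59%
print(f"3 x 8 threads:  {three_small:.0%} average load")  # 83%
```

Under that assumed serial fraction, one 24-thread process keeps the cores about 59% busy, while each 8-thread process keeps its cores about 83% busy. Independent processes also do not wait on each other's single-threaded phases.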
By the way, if you have studied my fc library, you will know that my code is entirely lock-free and very well parallelized. It looks like I will have to do higher-level division to simulate 3 processes in one. I just don't have a 24 or 32 core machine to try it on.
fc is a Boost-based lib; better parallelization might be achieved using OpenMP (omp), for example. I also have 48- and 64-core machines, so I can test on them.
PS: why don't you release the miner code?
I started a bounty for people to create faster algorithms; my closed-source code is proof it can be done. ypool seems to have one that is faster. When an alternative implementation comes out that is equal to mine, I will release the code. It is my way of encouraging more eyes to focus on optimizing the proof of work: taunting the savvy, smart developers to come claim the bounty with an objective measure... "better than my miner for one week".
As long as I keep my miner closed, open-source developers can freely publish their better algorithms, and they will have no incentive to keep them closed.
I don't want anyone else to have a closet, unpublished algorithm that is better than the open-source ones available. If ypool is really 5x the stock client, then I may have some work to do.
Their code is open source by the way.