Author Topic: Open source optimized PTS CPU miner (BETA)  (Read 17055 times)

Offline dga

  • Full Member
  • ***
  • Posts: 122
    • View Profile
Re: Open source optimized PTS CPU miner (BETA)
« Reply #105 on: February 01, 2014, 09:57:11 pm »
Well, I'll be.  I guess we're entering the CPU mess zone.  (Deleted old post)

Solved, thanks to some help from mikaelh_ on #beeeeer. 

There's now only one binary, but on AMD, run with sse4 explicitly:

./ptsminer...   <addr>  <threads>  sse4

You'll be much happier than with avx.  For Intel, auto-detect works, and avx is better.
« Last Edit: February 01, 2014, 10:30:30 pm by dga »

Offline bytemaster

Re: Open source optimized PTS CPU miner (BETA)
« Reply #106 on: February 02, 2014, 12:03:20 am »
Quote
The haswell/AVX2 release is very solid and beats low-end GPUs:  It's sitting just above 600 c/m.  A cheap GPU (GT 640 GDDR5 -- $85) can get about 250 cpm.  The fastest ($600-$1000) get around 2000-2200cpm.  The GPUs are still ahead in cpm/$, but not by a shocking margin.  Haswell is 610cpm for $300, or about 2cpm/$.  An R9 290x is 2200cpm/$610 = 3.6cpm/$.

Considering you can build a high end CPU miner for less than the cost of a high end GPU miner I would have to contend that momentum has served its intended goals quite well. 
For the latest updates checkout my blog: http://bytemaster.bitshares.org
Anything said on these forums does not constitute an intent to create a legal obligation or contract between myself and anyone else.   These are merely my opinions and I reserve the right to change them at any time.

Offline ptsrush

  • Full Member
  • ***
  • Posts: 84
    • View Profile
Re: Open source optimized PTS CPU miner (BETA)
« Reply #107 on: February 02, 2014, 02:40:51 am »
Upgraded to beta9 on a Haswell E3-1230 v3 with avx2, cpm : 595

[STATS] 2014-Feb-02 10:35:29 | 595.1 c/m | 8.8 sh/m | VL: 1004 (99.5%), RJ: 5 (0.5%), ST: 0 (0.0%)

Offline ptsrush

Re: Open source optimized PTS CPU miner (BETA)
« Reply #108 on: February 02, 2014, 04:11:08 am »
It ran at 599.6 c/m for a long time and I was really panicking.

Thank god, now cpm : 600.5.

[STATS] 2014-Feb-02 12:08:19 | 600.5 c/m | 9.3 sh/m | VL: 1885 (99.5%), RJ: 9 (0.5%), ST: 0 (0.0%)

Offline bytemaster

Re: Open source optimized PTS CPU miner (BETA)
« Reply #109 on: February 02, 2014, 05:15:42 pm »
Any chance I can get the code from this miner integrated with bitshares/src/momentum.cpp  API?

I would be willing to pay a reasonable number of PTS for the work. 

API:   
Code:
std::vector< std::pair<uint32_t,uint32_t> > momentum_search( pow_seed_type head )
Thoughts?
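
To make the shape of that interface concrete, here is a rough sketch of a birthday-collision search that returns nonce pairs in the same form. Everything in it is an assumption for illustration: toy_seed_type stands in for pow_seed_type (whose definition is not shown in this thread), toy_mix is a simple 64-bit mixer rather than the SHA512-based hash the real code uses, and the nonce count and bit width are arbitrary small values.

```cpp
#include <cstdint>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical stand-in for pow_seed_type; the real type is not shown in this thread.
using toy_seed_type = std::uint64_t;

// Toy 64-bit mixer (splitmix64-style finalizer). NOT the SHA512-based hash of the real miner.
inline std::uint64_t toy_mix(std::uint64_t x) {
    x += 0x9e3779b97f4a7c15ULL;
    x = (x ^ (x >> 30)) * 0xbf58476d1ce4e5b9ULL;
    x = (x ^ (x >> 27)) * 0x94d049bb133111ebULL;
    return x ^ (x >> 31);
}

// "Birthday" of a nonce: the top `bits` bits of the mixed (head, nonce). Requires 1 <= bits <= 63.
inline std::uint64_t toy_birthday(toy_seed_type head, std::uint32_t nonce, unsigned bits) {
    return toy_mix(head ^ toy_mix(nonce)) >> (64 - bits);
}

// Returns every (earlier_nonce, later_nonce) pair whose birthdays collide.
std::vector<std::pair<std::uint32_t, std::uint32_t>>
toy_momentum_search(toy_seed_type head, std::uint32_t nonce_count = 1u << 16, unsigned bits = 24) {
    std::vector<std::pair<std::uint32_t, std::uint32_t>> results;
    std::unordered_map<std::uint64_t, std::uint32_t> first_seen;  // birthday -> first nonce
    first_seen.reserve(nonce_count);
    for (std::uint32_t nonce = 0; nonce < nonce_count; ++nonce) {
        const std::uint64_t b = toy_birthday(head, nonce, bits);
        const auto r = first_seen.emplace(b, nonce);
        if (!r.second) {
            // Birthday already present: record the colliding pair.
            results.emplace_back(r.first->second, nonce);
        }
    }
    return results;
}
```

The hash map is the memory-hungry part here; a faster miner would presumably replace it with something more compact, but that is outside what this sketch tries to show.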

Offline dga

Re: Open source optimized PTS CPU miner (BETA)
« Reply #110 on: February 02, 2014, 09:52:57 pm »
Sure, I'm happy to figure out a value that works.

Let me lay out the catch a little bit:  The compilation chain is ugly because I generate a few CPU-specific chunks of code.  I can put all of that in a repository, and by outputting assembly from the first step, it could all be compilable by gcc -- or from the original source if someone installed some other compiler support tools.

There are really three major contributions that make it fast:
  - Some algorithmic changes that make the memory-hard parts faster;
  - A re-implementation of the sha512 code for AVX2;
  - An AVX/SSE implementation of other high-performance parts of the code.

The algorithmic changes are easy and will make any codebase faster and use less memory.  The nitty gritty implementation bits start to get architecture specific.  But I'm happy to include them.

The only drawback from my perspective is that the AVX2 SHA512 changes are also very pertinent to making Memorycoin faster, and I haven't yet started writing a miner for that one.  *grins*  But I'm willing to be scooped.

Same license as the original momentum is fine.

Offline bytemaster

Re: Open source optimized PTS CPU miner (BETA)
« Reply #111 on: February 02, 2014, 09:56:17 pm »
As long as the high-performance mode falls back to a low-performance mode when the instructions are not supported.

Are there any small tweaks that you can think of that would make it harder for a gpu?   



Offline dga

Re: Open source optimized PTS CPU miner (BETA)
« Reply #112 on: February 02, 2014, 10:03:48 pm »
That's do-able.  It's mostly just making sure that compilation isn't a mess.  Basically, I'd modularize it as:

  generate_sha512(buf, num_hashes, starting_nonce);

And as long as there was a version of generate_sha512 that worked reasonably well, it would be fine.
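
As a rough illustration of that modularization with a fallback, something like the following could work. The call shape generate_sha512(buf, num_hashes, starting_nonce) is from the post; the parameter types, the two backend names, and the placeholder bodies are assumptions (neither backend computes a real SHA512 here). The CPU check uses the GCC/Clang builtin __builtin_cpu_supports on x86 and simply reports "no AVX2" elsewhere.

```cpp
#include <cstddef>
#include <cstdint>

// Signature modeled on the call in the post; the parameter types are assumptions.
using hash_fill_fn = void (*)(std::uint64_t* buf, std::size_t num_hashes,
                              std::uint32_t starting_nonce);

// Portable fallback. Placeholder body: a real version would call a generic SHA512.
void generate_sha512_generic(std::uint64_t* buf, std::size_t num_hashes,
                             std::uint32_t starting_nonce) {
    for (std::size_t i = 0; i < num_hashes; ++i) {
        buf[i] = 0x9e3779b97f4a7c15ULL * (static_cast<std::uint64_t>(starting_nonce) + i + 1);
    }
}

// Hypothetical AVX2 backend. It forwards to the generic code so both paths agree;
// a real version would contain the architecture-specific implementation.
void generate_sha512_avx2(std::uint64_t* buf, std::size_t num_hashes,
                          std::uint32_t starting_nonce) {
    generate_sha512_generic(buf, num_hashes, starting_nonce);
}

// Runtime feature check; returns false on compilers or CPUs where it cannot be determined.
bool cpu_has_avx2() {
#if (defined(__GNUC__) || defined(__clang__)) && (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();
    return __builtin_cpu_supports("avx2") != 0;
#else
    return false;
#endif
}

// Pick the fastest backend the running CPU supports, else the portable one.
hash_fill_fn select_generate_sha512() {
    return cpu_has_avx2() ? generate_sha512_avx2 : generate_sha512_generic;
}

// Single entry point for callers; the backend is chosen once on first use.
void generate_sha512(std::uint64_t* buf, std::size_t num_hashes,
                     std::uint32_t starting_nonce) {
    static const hash_fill_fn impl = select_generate_sha512();
    impl(buf, num_hashes, starting_nonce);
}
```

The point of the function pointer is that callers never see the architecture-specific pieces, so a build without them still links and runs.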

The tweaks:  My first reaction is adding more branches/conditionals to cause warp divergence on a SIMD machine.  That would slow down the CPUs, too, of course, but it would be really painful for the GPUs.  I'm not sure exactly where I'd add them.  Possibly a changed sha512 core with a slightly variable or tweaked number of rounds depending on something in the input that caused divergence on a per-nonce basis.  It'd make it more evil, at least.  If you could force all 16 or 32 units in the vector to have to diverge early on and remain diverged through the end of the sha512, you'd slow the GPUs by a factor of 16 or 32 while only slowing the CPU down by 2-4x.
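
To put a number on the divergence argument, here is a deliberately simplified cost model, not a measurement. It assumes each input selects one of several hash-core variants (input % variants, a made-up rule) and that a lockstep group has to run one serialized pass per distinct variant its lanes selected, while a scalar CPU thread only ever runs the one variant its input needs.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical selection rule: which core variant a given input would use.
inline unsigned variant_of(std::uint64_t input, unsigned variants) {
    return static_cast<unsigned>(input % variants);
}

// Serialized passes needed by one lockstep group: one pass per distinct variant
// chosen by any lane. If all lanes agree the result is 1; if all differ it is lane count.
std::size_t lockstep_passes(const std::vector<std::uint64_t>& lane_inputs, unsigned variants) {
    if (variants == 0) {
        return 0;
    }
    std::vector<bool> used(variants, false);
    std::size_t passes = 0;
    for (std::uint64_t input : lane_inputs) {
        const unsigned v = variant_of(input, variants);
        if (!used[v]) {
            used[v] = true;
            ++passes;
        }
    }
    return passes;
}
```

With 32 lanes that all pick different variants the model gives 32 passes, which is where a "factor of 16 or 32" figure would come from under these assumptions.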

(PTS deposits:  Pr8cnhz5eDsUegBZD4VZmGDARcKaozWbBc   )
« Last Edit: February 03, 2014, 11:12:36 pm by dga »

Offline Coindgr

  • Newbie
  • *
  • Posts: 18
    • View Profile
Re: Open source optimized PTS CPU miner (BETA)
« Reply #113 on: February 02, 2014, 10:57:46 pm »
Will this be released for Windows?
I hope so.

Offline bytemaster

Re: Open source optimized PTS CPU miner (BETA)
« Reply #114 on: February 03, 2014, 12:46:45 am »
Well you know this stuff better than most, so come up with something solid and we will include it. 

Offline dga

Re: Open source optimized PTS CPU miner (BETA)
« Reply #115 on: February 03, 2014, 01:04:12 am »
Sounds good.

Here's what I propose.  Perhaps surprisingly, it's taken way more work to do the fast CPU implementation of PTS than the basic GPU implementation.  Instead of going by my consulting rates (grin), I'll admit that I did it for fun, too, and judge it to be about 450 PTS worth of work based upon the previous rates you were offering, and about 50 PTS more work to actually manage the integration into momentum.cpp, since it differs substantially from the codebase I've been developing on.

Instead of having it all in one chunk, though, I think it makes more sense to split it in half for two different deliverables to help reduce risk and get something in your hands faster:

(a)  Algorithmic improvements to mining that are completely platform-independent.  (250).
(b)  Platform-optimized implementation for sse4, avx, and avx2, delivered as GNU assembly code along with original source code files to generate that assembly.  (250)

Both documented, of course.

I think I can get (a) done reasonably straightforwardly.  For (b), I'll need to spend more time understanding the Makefile setup for it so that I can integrate it without breaking things.

As a nitpicky note based upon the copyright issues that arose in my previous release, just to be up front:  Like other high-performance miners, for everything but avx2, I use the Intel sha512 implementation.  Its license is compatible (redistributions must include the copyright notice).  The code I'd integrate into momentum.cpp is entirely my own at this point, and I'd simply integrate it under the existing license.

Offline bytemaster

Re: Open source optimized PTS CPU miner (BETA)
« Reply #116 on: February 03, 2014, 03:24:06 am »
Sounds fair enough as long as the result is a pull request that simply works with CMake.

Offline dga

Re: Open source optimized PTS CPU miner (BETA)
« Reply #117 on: February 03, 2014, 10:44:21 pm »
First batch of changes are now in a pull request to you:

https://github.com/InvictusInnovations/ProtoShares/pull/8

I spent a lot of time today thinking about how to provide the best balance of performance improvement while ensuring that the reference code is as easy as possible for people to use on any platform of their choice.  As a consequence, I've refactored some of the algorithmic changes a little to try to make the best use of the existing SHA512 from OpenSSL.  I'm going to do a set of benchmark runs tomorrow to determine how much of a benefit there is on non-AVX2/Haswell platforms to being more architecture specific.  If the results don't justify complicating the build, I'll put the AVX2 changes in a small module that people can integrate on their own if they wish, but that won't touch anything in the build.  If they're good, I'll do deeper modifications.

The current changes preserve the exact interface and code structure of the existing momentum_search, per your request.  They don't touch anything outside of the mining core code.  The results are about an 8x speedup on my test platform using about 50% of the memory.  4x of that speedup and all of the memory savings come from the algorithmic improvements;  2x comes from testing the nonces in both directions when evaluating the collision.

I put some performance evaluation numbers in the pull request, but to briefly summarize, before the changes, each thread was taking about 28 seconds to do one execution of momentum_search.  After the changes, they take 7-8.

Before:
 83.10%  bitcoind  bitcoind                   [.] bts::momentum_search(uint256)
 12.95%  bitcoind  libcrypto.so.1.0.0         [.] 0x000000000006d764

Only 13% of the time was being spent in computing SHA512 hashes.  After:

 70.36%  bitcoind  libcrypto.so.1.0.0         [.] 0x000000000006cece

(update, forgot to give my PTS address:   Pr8cnhz5eDsUegBZD4VZmGDARcKaozWbBc   )

   -Dave

Update 2:  As another way to view the stats, a quad-core i7-4770 is doing:

~/coin/ProtoShares/src$ ./bitcoind getmininginfo
{
    "blocks" : 47838,
    "currentblocksize" : 5063,
    "currentblocktx" : 18,
    "difficulty" : 0.01374487,
    "errors" : "",
    "generate" : true,
    "genproclimit" : -1,
    "collisionspermin" : 240.86192739,
    "pooledtx" : 31,
    "testnet" : false
}

with no AVX2 optimizations, so this speed is probably what one might expect on an sse or avx platform.  Quite a bit faster than the default code.
« Last Edit: February 03, 2014, 11:58:58 pm by dga »

Offline bytemaster

Re: Open source optimized PTS CPU miner (BETA)
« Reply #118 on: February 04, 2014, 12:03:48 am »
There seems to have been a misunderstanding :)   I was looking for an update to the BitShares repository for the same method.

Offline dga

Re: Open source optimized PTS CPU miner (BETA)
« Reply #119 on: February 04, 2014, 12:06:29 am »
Repository URL?
update:  Ahh, you mean this one?

https://github.com/InvictusInnovations/bitshares

Confirm and I'll get the patch done.  Should be straightforward.
« Last Edit: February 04, 2014, 12:08:58 am by dga »