Messages - dga

1
BitShares PTS / Slightly updated cudaPTS source, with xpt
« on: April 18, 2014, 01:36:02 pm »
Hey, all - not a big change, but I finally got around to porting the cudaPTS code into the xpt codebase also.  I've pushed out a different version of it.  This one includes some CPU-side and GPU-side speed boosts, though they're not huge.  Provides an open source alternative for people who like to mine on xpt-based pools such as ypool or 1gh.

https://github.com/dave-andersen/cudaptsx

I didn't do any major changes to it, so it's likely that some of the binary builds are faster, but it's free and you can inspect the source.  If you check it out and build it, by default, it has a 2% dev fee (which ypool pays, but I don't think other pools pay attention to), but that's all controllable from the command line if you don't like it.

It's an ugly hack for the xpt integration - I'm mostly releasing it to get it out there.  Patches accepted for cleanup. :)

If you run it, run with a -m512 to trigger the GPU code.

2
BitShares PTS / Re: Open source optimized PTS CPU miner (BETA)
« on: February 10, 2014, 04:50:46 pm »
I have released beta10 for avx2 with others to follow.  Note that because beeeeer has increased the difficulty target, share/min numbers are now lower than they used to be, so comparing CPS is probably the most useful metric.

http://www.cs.cmu.edu/~dga/ptsminer/

This one's getting in the 650-660 range when run on 8 threads on an i7-4770.  (updated:  not a k, sorry, just the normal 4770) Note that it's now fastest to run 8 threads, not 7, though it kinda destroys interactive use of your computer. :-)  @ptsrush, this one should keep you very happily over 600.  I haven't finished benchmarking it yet.  It's faster based upon internal metrics, but it'll take a bit to see how it shakes out in cpm.

[STATS] 2014-Feb-10 10:23:54 | 657.7 c/m | 2.8 sh/m | VL: 43 (100.0%), RJ: 0 (0.0%), ST: 0 (0.0%)

Not bad, little CPU, not bad.

3
BitShares PTS / Re: Open source optimized PTS CPU miner (BETA)
« on: February 10, 2014, 03:13:15 pm »
I have released beta10 for avx2 with others to follow.  Note that because beeeeer has increased the difficulty target, share/min numbers are now lower than they used to be, so comparing CPS is probably the most useful metric.

http://www.cs.cmu.edu/~dga/ptsminer/

This one's getting in the 650-660 range when run on 8 threads on an i7-4770k.  Note that it's now fastest to run 8 threads, not 7, though it kinda destroys interactive use of your computer. :-)  @ptsrush, this one should keep you very happily over 600.  I haven't finished benchmarking it yet.  It's faster based upon internal metrics, but it'll take a bit to see how it shakes out in cpm.

[STATS] 2014-Feb-10 10:23:54 | 657.7 c/m | 2.8 sh/m | VL: 43 (100.0%), RJ: 0 (0.0%), ST: 0 (0.0%)

Not bad, little CPU, not bad.

4
BitShares PTS / Re: [ANN] ptsweb.beeeeer.org - Protoshares mining sub-pool
« on: February 04, 2014, 03:06:16 pm »
i have around 200 machines mining primecoins right now, wanted to check if PTS are more profitable but since version 0.4 of w32 miner there's no 'ip' option.
those machines don't have net connection so i need to forward it through my comp with port forwarder.
is it possible with w32 miner higher than 4.0 ? if not where can i get 4.0 miner?

cheers

PTS would be unprofitable using CPU let alone 32 bit

Unprofitable is relative.  If you already own the CPU, it's power-profitable on a good CPU, but 32 bit is out of the question.  I mine on my home computer, and it earns about $2.40 of PTS per day on about $0.24 of power.  (Admittedly, that's with a haswell/avx2 CPU and my latest miner for it).

But ugh 32 bit.  Very few devs want to try to support it any more, and good riddance.

5
BitShares PTS / Re: Donations to open source a GPU Protoshares miner (PTS)
« on: February 04, 2014, 12:30:17 pm »
dga, where do I have to look in code to modify it so it can run on 512 MB cards?

You'd have to substantially change it - its basic step is that it computes all of the hashes (which require 512MB) and then does some work on them.  This isn't a tuning parameter, it's fundamental to the current design.

6
BitShares PTS / Re: Open source optimized PTS CPU miner (BETA)
« on: February 04, 2014, 07:06:26 am »
yes

Original source, looking at time spent in momentum_pow_test:
   User time (seconds): 5.24
   Maximum resident set size (kbytes): 1050652
       43.84%  momentum_pow_te  libcrypto.so.1.0.0  [.] 0x000000000006c729

With patch 1 for algorithmic changes:
   User time (seconds): 4.38
   Maximum resident set size (kbytes): 528452
       71.51%  momentum_pow_te  libcrypto.so.1.0.0   [.] 0x000000000006d29f

The speedup is smaller than it could be because the patched code spends a little more time in allocation in the sha512 routine, which is being used differently from the one in PTS.  I'll clean that up as part of the sha512 optimizations in chunk-of-work #2.  That part is straightforward engineering.

Probably the biggest benefit to the current version is that, as shown above, it uses half the memory and is about 20% faster with no changes to the crypto or any other libraries.

Sending pull request now.

When I pulled this change in it stopped finding matches...

Whoops - thanks, I'd misunderstood test_momentum_pow.

I've fixed it in a second pull request.  It was a missing enc.reset().

Interestingly, you'll find that my version now finds a few more collisions than the original code did, which should produce a further speed-up.  These collisions verify.

Old:

3522368ms th_a       momentum_test.cpp:29          main                 ] [[25908781,36251059],[36251059,25908781],[14409167,49012845],[49012845,14409167],[32190345,58604277],[58604277,32190345],[11166445,59732725],[59732725,11166445],[41830614,64427554],[64427554,41830614]]

   User time (seconds): 5.09

New:

/usr/bin/time --verbose ./tests/momentum_pow_test  5959592
98735ms th_a       momentum_test.cpp:29          main                 ] [[29995035,64113291],[64113291,29995035],[41830614,64427554],[64427554,41830614],[32190345,58604277],[58604277,32190345],[11166445,59732725],[59732725,11166445],[14409167,49012845],[49012845,14409167],[25908781,36251059],[36251059,25908781]]

   User time (seconds): 3.43

Sorry for the double-try on that one.

7
BitShares PTS / Re: Open source optimized PTS CPU miner (BETA)
« on: February 04, 2014, 01:54:57 am »
yes

Original source, looking at time spent in momentum_pow_test:
   User time (seconds): 5.24
   Maximum resident set size (kbytes): 1050652
       43.84%  momentum_pow_te  libcrypto.so.1.0.0  [.] 0x000000000006c729

With patch 1 for algorithmic changes:
   User time (seconds): 4.38
   Maximum resident set size (kbytes): 528452
       71.51%  momentum_pow_te  libcrypto.so.1.0.0   [.] 0x000000000006d29f

The speedup is smaller than it could be because the patched code spends a little more time in allocation in the sha512 routine, which is being used differently from the one in PTS.  I'll clean that up as part of the sha512 optimizations in chunk-of-work #2.  That part is straightforward engineering.

Probably the biggest benefit to the current version is that, as shown above, it uses half the memory and is about 20% faster with no changes to the crypto or any other libraries.

Sending pull request now.

8
BitShares PTS / Re: Open source optimized PTS CPU miner (BETA)
« on: February 04, 2014, 12:19:21 am »
yes

Cool.  I'll need to change a little bit of the code to match the style, but should have it done soon.

9
BitShares PTS / Re: Open source optimized PTS CPU miner (BETA)
« on: February 04, 2014, 12:06:29 am »
Sounds fair enough as long as the result is a pull request that simply works with CMake.

The first batch of changes is now in a pull request to you:

https://github.com/InvictusInnovations/ProtoShares/pull/8

I spent a lot of time today thinking about how to balance performance improvement against keeping the reference code as easy as possible for people to use on any platform of their choice.  As a consequence, I've refactored some of the algorithmic changes a little to make the best use of the existing SHA512 from OpenSSL.  Tomorrow I'm going to do a set of benchmark runs to determine how much benefit there is on non-AVX2/Haswell platforms to being more architecture specific.  If the results don't justify complicating the build, I'll put the AVX2 changes in a small module that people can integrate on their own if they wish, but that won't touch anything in the build.  If they're good, I'll do deeper modifications.

The current changes preserve the exact interface and code structure of the existing momentum_search, per your request.  They don't touch anything outside of the mining core code.  The results are about an 8x speedup on my test platform using about 50% of the memory.  4x of that speedup and all of the memory savings comes from the algorithmic improvements;  2x comes from testing the nonces in both directions when evaluating the collision.
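For readers who want a feel for the kind of search being discussed, here is a rough sketch.  The post does not describe the actual algorithm, so this is a generic sort-based birthday-collision search and not the code from the pull request.  The names toy_birthday and find_collisions are hypothetical, the mixing function is a stand-in that is NOT SHA-512, and reading "both directions" as "report each colliding pair in both orders" is only one possible interpretation.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

using NoncePair = std::pair<std::uint32_t, std::uint32_t>;

// Stand-in for the real birthday function. This is NOT SHA-512: it is a toy
// 64-bit mixer truncated to `bits` bits so that collisions are easy to find.
std::uint64_t toy_birthday(std::uint32_t nonce, unsigned bits) {
    assert(bits > 0 && bits < 64);
    std::uint64_t x = nonce + 0x9e3779b97f4a7c15ULL;
    x = (x ^ (x >> 30)) * 0xbf58476d1ce4e5b9ULL;
    x = (x ^ (x >> 27)) * 0x94d049bb133111ebULL;
    x ^= x >> 31;
    return x & ((1ULL << bits) - 1);
}

// Hypothetical helper: sort (birthday, nonce) records so that equal birthdays
// become adjacent, then emit every colliding nonce pair in both orders.
std::vector<NoncePair> find_collisions(std::uint32_t num_nonces, unsigned bits) {
    std::vector<std::pair<std::uint64_t, std::uint32_t>> table;
    table.reserve(num_nonces);
    for (std::uint32_t n = 0; n < num_nonces; ++n)
        table.emplace_back(toy_birthday(n, bits), n);
    std::sort(table.begin(), table.end());

    std::vector<NoncePair> out;
    for (std::size_t i = 0; i < table.size();) {
        std::size_t j = i + 1;
        while (j < table.size() && table[j].first == table[i].first) ++j;
        // Records i..j-1 share one birthday value: pair them all up.
        for (std::size_t a = i; a < j; ++a) {
            for (std::size_t b = a + 1; b < j; ++b) {
                out.emplace_back(table[a].second, table[b].second);
                out.emplace_back(table[b].second, table[a].second);
            }
        }
        i = j;
    }
    return out;
}
```

Calling find_collisions(1000, 8) is guaranteed to return something, since 1000 nonces cannot fit into 256 distinct values without repeats.  A real miner would use a far larger search space and a proper hash.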

I put some performance evaluation numbers in the pull request, but to briefly summarize: before the changes, each thread was taking about 28 seconds to do one execution of momentum_search.  After the changes, each takes 7-8.

Before:
 83.10%  bitcoind  bitcoind                   [.] bts::momentum_search(uint256)
 12.95%  bitcoind  libcrypto.so.1.0.0         [.] 0x000000000006d764

Only 13% of the time was being spent in computing SHA512 hashes.  After:

 70.36%  bitcoind  libcrypto.so.1.0.0         [.] 0x000000000006cece

(update, forgot to give my PTS address:   Pr8cnhz5eDsUegBZD4VZmGDARcKaozWbBc   )

   -Dave

[Update 2:  As another way to view the stats, quad-core i7-4770 is doing:

dga@homewell:~/coin/ProtoShares/src$ ./bitcoind getmininginfo
{
    "blocks" : 47838,
    "currentblocksize" : 5063,
    "currentblocktx" : 18,
    "difficulty" : 0.01374487,
    "errors" : "",
    "generate" : true,
    "genproclimit" : -1,
    "collisionspermin" : 240.86192739,
    "pooledtx" : 31,
    "testnet" : false
}

with no AVX2 optimizations, so this speed is probably what one might expect on an sse or avx platform.  Quite a bit faster than the default code.]


There seems to have been a misunderstanding :)   I was looking for an update to the BitShares repository for the same method.

Repository URL?
update:  Ahh, you mean this one?

https://github.com/InvictusInnovations/bitshares

Confirm and I'll get the patch done.  Should be straightforward.

10
BitShares PTS / Re: Open source optimized PTS CPU miner (BETA)
« on: February 03, 2014, 10:44:21 pm »
Sounds fair enough as long as the result is a pull request that simply works with CMake.

The first batch of changes is now in a pull request to you:

https://github.com/InvictusInnovations/ProtoShares/pull/8

I spent a lot of time today thinking about how to balance performance improvement against keeping the reference code as easy as possible for people to use on any platform of their choice.  As a consequence, I've refactored some of the algorithmic changes a little to make the best use of the existing SHA512 from OpenSSL.  Tomorrow I'm going to do a set of benchmark runs to determine how much benefit there is on non-AVX2/Haswell platforms to being more architecture specific.  If the results don't justify complicating the build, I'll put the AVX2 changes in a small module that people can integrate on their own if they wish, but that won't touch anything in the build.  If they're good, I'll do deeper modifications.

The current changes preserve the exact interface and code structure of the existing momentum_search, per your request.  They don't touch anything outside of the mining core code.  The results are about an 8x speedup on my test platform using about 50% of the memory.  4x of that speedup and all of the memory savings comes from the algorithmic improvements;  2x comes from testing the nonces in both directions when evaluating the collision.

I put some performance evaluation numbers in the pull request, but to briefly summarize: before the changes, each thread was taking about 28 seconds to do one execution of momentum_search.  After the changes, each takes 7-8.

Before:
 83.10%  bitcoind  bitcoind                   [.] bts::momentum_search(uint256)
 12.95%  bitcoind  libcrypto.so.1.0.0         [.] 0x000000000006d764

Only 13% of the time was being spent in computing SHA512 hashes.  After:

 70.36%  bitcoind  libcrypto.so.1.0.0         [.] 0x000000000006cece

(update, forgot to give my PTS address:   Pr8cnhz5eDsUegBZD4VZmGDARcKaozWbBc   )

   -Dave

[Update 2:  As another way to view the stats, quad-core i7-4770 is doing:

dga@homewell:~/coin/ProtoShares/src$ ./bitcoind getmininginfo
{
    "blocks" : 47838,
    "currentblocksize" : 5063,
    "currentblocktx" : 18,
    "difficulty" : 0.01374487,
    "errors" : "",
    "generate" : true,
    "genproclimit" : -1,
    "collisionspermin" : 240.86192739,
    "pooledtx" : 31,
    "testnet" : false
}

with no AVX2 optimizations, so this speed is probably what one might expect on an sse or avx platform.  Quite a bit faster than the default code.]

11
BitShares PTS / Re: Open source optimized PTS CPU miner (BETA)
« on: February 03, 2014, 01:04:12 am »
As long as the high-performance mode has a fallback option to low performance if the instructions are not supported.

Are there any small tweaks that you can think of that would make it harder for a gpu?

That's do-able.  It's mostly just making sure that compilation isn't a mess.  Basically, I'd modularize it as:

  generate_sha512(buf, num_hashes, starting_nonce);

And as long as there was a version of generate_sha512 that worked reasonably well, it would be fine.
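A minimal sketch of that modularization, assuming a function pointer chosen once at startup.  Only the argument order of generate_sha512 comes from the post.  The element type of buf, the names generate_portable, generate_unrolled and select_generate, and the toy mixer (which stands in for SHA-512 and is not a hash) are all hypothetical.  The feature check uses the GCC/Clang builtin __builtin_cpu_supports and simply takes the portable path on other compilers or architectures.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical batch signature modelled on
// generate_sha512(buf, num_hashes, starting_nonce).
using HashBatchFn = void (*)(std::uint64_t* buf, std::size_t num_hashes,
                             std::uint32_t starting_nonce);

// Toy 64-bit mixer. Stand-in only: this is NOT SHA-512.
inline std::uint64_t toy_mix(std::uint64_t x) {
    x += 0x9e3779b97f4a7c15ULL;
    x = (x ^ (x >> 30)) * 0xbf58476d1ce4e5b9ULL;
    x = (x ^ (x >> 27)) * 0x94d049bb133111ebULL;
    return x ^ (x >> 31);
}

// Portable fallback: one value per nonce, no architecture assumptions.
void generate_portable(std::uint64_t* buf, std::size_t num_hashes,
                       std::uint32_t starting_nonce) {
    assert(buf != nullptr || num_hashes == 0);
    const std::uint64_t base = starting_nonce;
    for (std::size_t i = 0; i < num_hashes; ++i) buf[i] = toy_mix(base + i);
}

// Placeholder for an architecture-specific version. It must produce the same
// results as the fallback; here it just handles four nonces per iteration.
void generate_unrolled(std::uint64_t* buf, std::size_t num_hashes,
                       std::uint32_t starting_nonce) {
    assert(buf != nullptr || num_hashes == 0);
    const std::uint64_t base = starting_nonce;
    std::size_t i = 0;
    for (; i + 4 <= num_hashes; i += 4) {
        buf[i]     = toy_mix(base + i);
        buf[i + 1] = toy_mix(base + i + 1);
        buf[i + 2] = toy_mix(base + i + 2);
        buf[i + 3] = toy_mix(base + i + 3);
    }
    for (; i < num_hashes; ++i) buf[i] = toy_mix(base + i);
}

// Runtime feature check; reports false wherever the builtin is unavailable.
inline bool cpu_has_avx2() {
#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
    return __builtin_cpu_supports("avx2") != 0;
#else
    return false;
#endif
}

// Pick the fast path only when the CPU supports it, else the fallback.
inline HashBatchFn select_generate() {
    return cpu_has_avx2() ? generate_unrolled : generate_portable;
}
```

The rest of the miner would call whatever select_generate() returns and never needs to know which version it got, which is what keeps the build from turning into a mess.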

The tweaks:  My first reaction is adding more branches/conditionals to cause warp divergence on a SIMD machine.  That would slow down the CPUs, too, of course, but it would be really painful for the GPUs.  I'm not sure exactly where I'd add them.  Possibly a changed sha512 core with a slightly variable or tweaked number of rounds depending on something in the input that caused divergence on a per-nonce basis.  It'd make it more evil, at least.  If you could force all 16 or 32 units in the vector to have to diverge early on and remain diverged through the end of the sha512, you'd slow the GPUs by a factor of 16 or 32 while only slowing the CPU down by 2-4x.
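To put a number on the divergence argument, here is a toy cost model.  It is an assumption for illustration, not a measurement of any real GPU: a lockstep group is taken to execute each distinct control-flow path once while lanes on other paths sit idle.  The names lockstep_passes and divergence_slowdown are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <vector>

// Toy model: the number of passes a lockstep group needs equals the number
// of distinct paths taken by its lanes. path_of_lane[i] is an arbitrary id
// for the control-flow path that lane i follows.
std::size_t lockstep_passes(const std::vector<int>& path_of_lane) {
    assert(!path_of_lane.empty());
    const std::set<int> distinct(path_of_lane.begin(), path_of_lane.end());
    return distinct.size();
}

// Slowdown relative to a fully converged group, which needs exactly one pass.
double divergence_slowdown(const std::vector<int>& path_of_lane) {
    return static_cast<double>(lockstep_passes(path_of_lane));
}
```

In this model, 32 lanes that all take different paths need 32 passes, which is where a factor like 16 or 32 comes from.  The 2-4x CPU figure is not modelled here.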

Well you know this stuff better than most, so come up with something solid and we will include it.

Sounds good.

Here's what I propose.  Perhaps surprisingly, it's taken way more work to do the fast CPU implementation of PTS than the basic GPU implementation.  Instead of going by my consulting rates (grin), I'll admit that I did it for fun, too, and judge it to be about 450 PTS worth of work based upon the previous rates you were offering, plus about 50 PTS more work to actually manage the integration into momentum.cpp, since it differs substantially from the codebase I've been developing on.

Instead of having it all in one chunk, though, I think it makes more sense to split it in half for two different deliverables to help reduce risk and get something in your hands faster:

(a)  Algorithmic improvements to mining that are completely platform-independent.  (250).
(b)  Platform-optimized implementation for sse4, avx, and avx2, delivered as GNU assembly code along with original source code files to generate that assembly.  (250)

Both documented, of course.

I think I can get (a) done reasonably straightforwardly.  For (b), I'll need to spend more time understanding the Makefile setup for it so that I can integrate it without breaking things.

As a nitpicky note based upon the copyright issues that arose in my previous release, just to be up front:  Like other high-performance miners, for everything but avx2, I use the Intel sha512 implementation.  Its license is compatible (redistributions must include the copyright notice).  The code I'd integrate into momentum.cpp is entirely my own at this point, and I'd simply integrate it under the existing license.

12
BitShares PTS / Re: Open source optimized PTS CPU miner (BETA)
« on: February 02, 2014, 10:03:48 pm »
As long as the high-performance mode has a fallback option to low performance if the instructions are not supported.

Are there any small tweaks that you can think of that would make it harder for a gpu?

That's do-able.  It's mostly just making sure that compilation isn't a mess.  Basically, I'd modularize it as:

  generate_sha512(buf, num_hashes, starting_nonce);

And as long as there was a version of generate_sha512 that worked reasonably well, it would be fine.

The tweaks:  My first reaction is adding more branches/conditionals to cause warp divergence on a SIMD machine.  That would slow down the CPUs, too, of course, but it would be really painful for the GPUs.  I'm not sure exactly where I'd add them.  Possibly a changed sha512 core with a slightly variable or tweaked number of rounds depending on something in the input that caused divergence on a per-nonce basis.  It'd make it more evil, at least.  If you could force all 16 or 32 units in the vector to have to diverge early on and remain diverged through the end of the sha512, you'd slow the GPUs by a factor of 16 or 32 while only slowing the CPU down by 2-4x.

(PTS deposits:  Pr8cnhz5eDsUegBZD4VZmGDARcKaozWbBc   )

13
BitShares PTS / Re: Open source optimized PTS CPU miner (BETA)
« on: February 02, 2014, 09:52:57 pm »
Any chance I can get the code from this miner integrated with bitshares/src/momentum.cpp  API?

I would be willing to pay a reasonable number of PTS for the work. 

API:   
std::vector< std::pair<uint32_t,uint32_t> > momentum_search( pow_seed_type head )
Thoughts?

Sure, I'm happy to figure out a value that works.

Let me lay out the catch a little bit:  The compilation chain is ugly because I generate a few CPU-specific chunks of code.  I can put all of that in a repository, and by outputting assembly from the first step, it could all be compilable by gcc -- or from the original source if someone installed some other compiler support tools.

There are really three major contributions that make it fast:
  - Some algorithmic changes that make the memory-hard parts faster;
  - A re-implementation of the sha512 code for AVX2;
  - An AVX/SSE implementation of other high-performance parts of the code.

The algorithmic changes are easy and will make any codebase faster and use less memory.  The nitty gritty implementation bits start to get architecture specific.  But I'm happy to include them.

The only drawback from my perspective is that the AVX2 SHA512 changes are also very pertinent to making Memorycoin faster, and I haven't yet started writing a miner for that one.  *grins*  But I'm willing to be scooped.

Same license as the original momentum is fine.

14
BitShares PTS / Re: Open source optimized PTS CPU miner (BETA)
« on: February 01, 2014, 09:57:11 pm »
Well, I'll be.  I guess we're entering the CPU mess zone.  (Deleted old post)

Solved, thanks to some help from mikaelh_ on #beeeeer. 

There's now only one binary, but on AMD, run with sse4 explicitly:

./ptsminer...   <addr>  <threads>  sse4

You'll be much happier than with avx.  For Intel, auto-detect works, and avx is better.

15
BitShares PTS / Re: Open source optimized PTS CPU miner (BETA)
« on: February 01, 2014, 09:40:24 pm »
beta9 for AVX2 is now online in the usual place:  http://www.cs.cmu.edu/~dga/ptsminer/

beta9 for AVX is also now online.  This one should be a good speed boost - I'm seeing my test machine go from about 780cpm to 1020cpm.

Note:  Unlike prior avxsse releases, this avx release really does require AVX.  It's compiled to target sandy bridge and higher.  I've changed the name of the binary to reflect this, and left the old avxsse one (which will run on sse4) online.

Direct link:  http://www.cs.cmu.edu/~dga/ptsminer/ptsminer-dga-beta9-avx-linux64-static.bin

Happy mining!

Nice... how does this compare to the latest GPU mining?

I think I broke something.  This one is a lot better on my AMD test CPU and absolutely horrible on my Intel CPUs.  Back to the drawing board.  Beta8 is the one to stick with for Intel. (update:  beta9 is now working properly for Intel)

The haswell/AVX2 release is very solid and beats low-end GPUs:  It's sitting just above 600 c/m.  A cheap GPU (GT 640 GDDR5 -- $85) can get about 250 cpm.  The fastest ($600-$1000) get around 2000-2200cpm.  The GPUs are still ahead in cpm/$, but not by a shocking margin.  Haswell is 610cpm for $300, or about 2cpm/$.  An R9 290x is 2200cpm/$610 = 3.6cpm/$.
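The cpm/$ comparison is just throughput divided by hardware price.  A small sketch using the figures quoted above; the helper name cpm_per_dollar and the Device table are hypothetical, and the prices are the rough ones from the post.

```cpp
#include <cassert>

// Collisions per minute obtained per dollar of hardware price.
double cpm_per_dollar(double cpm, double price_usd) {
    assert(price_usd > 0.0);
    return cpm / price_usd;
}

// Figures as quoted in the post (approximate).
struct Device {
    const char* name;
    double cpm;
    double price_usd;
};

const Device kDevices[] = {
    {"Haswell CPU (AVX2 build)", 610.0, 300.0},
    {"GT 640 GDDR5", 250.0, 85.0},
    {"R9 290x", 2200.0, 610.0},
};
```

By the same arithmetic the GT 640 works out to roughly 2.9 cpm/$, so the cheap card sits between the Haswell CPU and the R9 290x on this metric.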
