Author Topic: Open source optimized PTS CPU miner (BETA) (Read 50470 times)

dga · « **Reply #116 on:** February 03, 2014, 01:04:12 am »

Quote from: bytemaster on February 03, 2014, 12:46:45 am

Quote from: dga on February 02, 2014, 10:03:48 pm
Quote from: bytemaster on February 02, 2014, 09:56:17 pm
As long as the high-performance mode has a fallback option to low performance if the instructions are not supported.

Are there any small tweaks that you can think of that would make it harder for a gpu?

That's do-able. It's mostly just making sure that compilation isn't a mess. Basically, I'd modularize it as:

generate_sha512(buf, num_hashes, starting_nonce);

And as long as there was a version of generate_sha512 that worked reasonably well, it would be fine.

The tweaks: My first reaction is adding more branches/conditionals to cause warp divergence on a SIMD machine. That would slow down the CPUs, too, of course, but it would be really painful for the GPUs. I'm not sure exactly where I'd add them. Possibly a changed sha512 core with a slightly variable or tweaked number of rounds depending on something in the input that caused divergence on a per-nonce basis. It'd make it more evil, at least. If you could force all 16 or 32 units in the vector to have to diverge early on and remain diverged through the end of the sha512, you'd slow the GPUs by a factor of 16 or 32 while only slowing the CPU down by 2-4x.

Well you know this stuff better than most, so come up with something solid and we will include it.

Sounds good.

Here's what I propose. Perhaps surprisingly, it's taken way more work to do the fast CPU implementation of PTS than the basic GPU implementation. Instead of going by my consulting rates (grin), I'll admit that I did it for fun, too, and judge it to be about 450 PTS worth of work based upon the previous rates you were offering, and about 50 PTS more work to actually manage the integration into momentum.cpp, since it differs substantially from the codebase I've been developing on.

Instead of having it all in one chunk, though, I think it makes more sense to split it in half for two different deliverables to help reduce risk and get something in your hands faster:

(a) Algorithmic improvements to mining that are completely platform-independent. (250).
(b) Platform-optimized implementation for sse4, avx, and avx2, delivered as GNU assembly code along with original source code files to generate that assembly. (250)

Both documented, of course.

I think I can get (a) done reasonably straightforwardly. For (b), I'll need to spend more time understanding the Makefile setup for it so that I can integrate it without breaking things.

As a nitpicky note based upon the copyright issues that arose in my previous release, just to be up front: Like other high-performance miners, for everything but avx2, I use the Intel sha512 implementation. Its license is compatible (redistributions must include the copyright notice). The code I'd integrate into momentum.cpp is entirely my own at this point, and I'd simply integrate it under the existing license.

bytemaster · « **Reply #115 on:** February 03, 2014, 12:46:45 am »

Quote from: dga on February 02, 2014, 10:03:48 pm

Quote from: bytemaster on February 02, 2014, 09:56:17 pm
As long as the high-performance mode has a fallback option to low performance if the instructions are not supported.

Are there any small tweaks that you can think of that would make it harder for a gpu?

That's do-able. It's mostly just making sure that compilation isn't a mess. Basically, I'd modularize it as:

generate_sha512(buf, num_hashes, starting_nonce);

And as long as there was a version of generate_sha512 that worked reasonably well, it would be fine.

The tweaks: My first reaction is adding more branches/conditionals to cause warp divergence on a SIMD machine. That would slow down the CPUs, too, of course, but it would be really painful for the GPUs. I'm not sure exactly where I'd add them. Possibly a changed sha512 core with a slightly variable or tweaked number of rounds depending on something in the input that caused divergence on a per-nonce basis. It'd make it more evil, at least. If you could force all 16 or 32 units in the vector to have to diverge early on and remain diverged through the end of the sha512, you'd slow the GPUs by a factor of 16 or 32 while only slowing the CPU down by 2-4x.

Well you know this stuff better than most, so come up with something solid and we will include it.

Coindgr · « **Reply #114 on:** February 02, 2014, 10:57:46 pm »

Will this be released to windows?
I hope so

dga · « **Reply #113 on:** February 02, 2014, 10:03:48 pm »

Quote from: bytemaster on February 02, 2014, 09:56:17 pm

As long as the high-performance mode has a fallback option to low performance if the instructions are not supported.

Are there any small tweaks that you can think of that would make it harder for a gpu?

That's do-able. It's mostly just making sure that compilation isn't a mess. Basically, I'd modularize it as:

generate_sha512(buf, num_hashes, starting_nonce);

And as long as there was a version of generate_sha512 that worked reasonably well, it would be fine.
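To make the fallback idea concrete, here is a rough sketch of how a build could pick which generate_sha512 variant to call at startup. Everything named here (the Backend enum, detect_backend, choose_backend, and the "portable"/"avx2" override strings) is made up for illustration; only the "sse4" and avx options are mentioned in this thread. It relies on the GCC/Clang x86 builtins and falls back to the portable path everywhere else.

```cpp
#include <cassert>
#include <string>

// Hypothetical list of generate_sha512 variants, ordered slowest to fastest.
enum class Backend { Portable = 0, SSE4 = 1, AVX = 2, AVX2 = 3 };

// Ask the CPU what it supports. On non-x86 targets or other compilers this
// sketch simply reports the portable fallback.
Backend detect_backend() {
#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx2")) return Backend::AVX2;
    if (__builtin_cpu_supports("avx")) return Backend::AVX;
    if (__builtin_cpu_supports("sse4.1")) return Backend::SSE4;
#endif
    return Backend::Portable;
}

// Honor a user override (for example "sse4" on a CPU where avx is slower),
// but never select a backend the CPU cannot run. Unknown strings are ignored.
Backend choose_backend(const std::string& requested, Backend detected) {
    Backend wanted = detected;
    if (requested == "portable") wanted = Backend::Portable;
    else if (requested == "sse4") wanted = Backend::SSE4;
    else if (requested == "avx") wanted = Backend::AVX;
    else if (requested == "avx2") wanted = Backend::AVX2;
    return wanted < detected ? wanted : detected;
}

// Human-readable name, handy for a startup log line.
const char* backend_name(Backend b) {
    switch (b) {
        case Backend::Portable: return "portable";
        case Backend::SSE4: return "sse4";
        case Backend::AVX: return "avx";
        case Backend::AVX2: return "avx2";
    }
    assert(false && "unknown backend");
    return "unknown";
}
```

The caller would run choose_backend(user_flag, detect_backend()) once and then route every generate_sha512 call to the matching implementation, so the rest of the miner never needs to know which instruction set is in use.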

The tweaks: My first reaction is adding more branches/conditionals to cause warp divergence on a SIMD machine. That would slow down the CPUs, too, of course, but it would be really painful for the GPUs. I'm not sure exactly where I'd add them. Possibly a changed sha512 core with a slightly variable or tweaked number of rounds depending on something in the input that caused divergence on a per-nonce basis. It'd make it more evil, at least. If you could force all 16 or 32 units in the vector to have to diverge early on and remain diverged through the end of the sha512, you'd slow the GPUs by a factor of 16 or 32 while only slowing the CPU down by 2-4x.
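A back-of-the-envelope model of that divergence cost, as a sketch only: it assumes an idealized lockstep machine where a group of lanes must run every distinct code path taken by any lane, one after another. The function names and the nonce-based path selector are hypothetical, and the model ignores reconvergence, memory effects, and scheduling.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <set>
#include <vector>

// Idealized lockstep (SIMT) model: a group of lanes runs every distinct code
// path taken by any of its lanes, one path after another, masking the rest.
// Returns the number of serialized passes the group needs.
std::size_t lockstep_passes(const std::vector<int>& lane_paths) {
    assert(!lane_paths.empty());
    const std::set<int> distinct(lane_paths.begin(), lane_paths.end());
    return distinct.size();
}

// Hypothetical selector: pick one of num_paths round-function variants from
// the nonce. A real design would derive this from hashed input instead.
int path_for_nonce(std::uint32_t nonce, int num_paths) {
    assert(num_paths > 0);
    return static_cast<int>(nonce % static_cast<std::uint32_t>(num_paths));
}

// Passes needed by one group of `lanes` consecutive nonces. A scalar CPU core
// runs one path per nonce, so it pays no such serialization penalty.
std::size_t passes_for_group(std::uint32_t first_nonce, std::size_t lanes,
                             int num_paths) {
    std::vector<int> lane_paths;
    lane_paths.reserve(lanes);
    for (std::size_t i = 0; i < lanes; ++i) {
        lane_paths.push_back(
            path_for_nonce(first_nonce + static_cast<std::uint32_t>(i), num_paths));
    }
    return lockstep_passes(lane_paths);
}
```

In this model passes_for_group(0, 32, 32) is 32 and passes_for_group(0, 32, 1) is 1, which is the "factor of 16 or 32" worst case: the penalty only reaches the full group width if every lane lands on its own path and stays there.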

(PTS deposits: Pr8cnhz5eDsUegBZD4VZmGDARcKaozWbBc )

bytemaster · « **Reply #112 on:** February 02, 2014, 09:56:17 pm »

As long as the high-performance mode has a fallback option to low performance if the instructions are not supported.

Are there any small tweaks that you can think of that would make it harder for a gpu?

dga · « **Reply #111 on:** February 02, 2014, 09:52:57 pm »

Quote from: bytemaster on February 02, 2014, 05:15:42 pm

Any chance I can get the code from this miner integrated with bitshares/src/momentum.cpp API?

I would be willing to pay a reasonable number of PTS for the work.

API:

std::vector< std::pair<uint32_t,uint32_t> > momentum_search( pow_seed_type head )

Thoughts?

Sure, I'm happy to figure out a value that works.

Let me lay out the catch a little bit: The compilation chain is ugly because I generate a few CPU-specific chunks of code. I can put all of that in a repository, and by outputting assembly from the first step, it could all be compilable by gcc -- or from the original source if someone installed some other compiler support tools.

There are really three major contributions that make it fast:
- Some algorithmic changes that make the memory-hard parts faster;
- A re-implementation of the sha512 code for AVX2;
- An AVX/SSE implementation of other high-performance parts of the code.

The algorithmic changes are easy and will make any codebase faster and use less memory. The nitty gritty implementation bits start to get architecture specific. But I'm happy to include them.
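For readers who have not looked at the memory-hard part, here is a toy sketch of the kind of search the momentum_search API returns results from, assuming it is a birthday-collision search over nonces (which the pair-of-nonces return type suggests). This is not the code from the miner or from momentum.cpp: toy_birthday is a splitmix64-style stand-in rather than SHA-512, the seed is a plain integer rather than pow_seed_type, and the sizes are tiny.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>
#include <utility>
#include <vector>

// Stand-in for the real hash: a splitmix64-style mixer, truncated to `bits`
// bits. NOT SHA-512; it only gives the sketch something to collide on.
std::uint64_t toy_birthday(std::uint64_t seed, std::uint32_t nonce, unsigned bits) {
    assert(bits >= 1 && bits <= 64);
    std::uint64_t x = seed + 0x9e3779b97f4a7c15ULL * (nonce + 1ULL);
    x ^= x >> 30; x *= 0xbf58476d1ce4e5b9ULL;
    x ^= x >> 27; x *= 0x94d049bb133111ebULL;
    x ^= x >> 31;
    return x >> (64 - bits);
}

// Baseline search: remember the first nonce seen for each birthday and report
// every later nonce that lands on an already-seen birthday. The table is the
// memory-hungry part that algorithmic tuning would target.
std::vector<std::pair<std::uint32_t, std::uint32_t>>
toy_momentum_search(std::uint64_t seed, std::uint32_t max_nonce, unsigned bits) {
    std::vector<std::pair<std::uint32_t, std::uint32_t>> results;
    std::unordered_map<std::uint64_t, std::uint32_t> first_seen;
    first_seen.reserve(max_nonce);
    for (std::uint32_t n = 0; n < max_nonce; ++n) {
        const std::uint64_t b = toy_birthday(seed, n, bits);
        const auto res = first_seen.emplace(b, n);
        if (!res.second) {
            results.emplace_back(res.first->second, n);  // (earlier, later)
        }
    }
    return results;
}
```

With 300 nonces and 8-bit birthdays there are only 256 possible values, so at least 44 pairs must come back no matter how the mixer behaves. The sketch uses a general-purpose hash map for clarity, not speed.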

The only drawback from my perspective is that the AVX2 SHA512 changes are also very pertinent to making Memorycoin faster, and I haven't yet started writing a miner for that one. *grins* But I'm willing to be scooped.

Same license as the original momentum is fine.

bytemaster · « **Reply #110 on:** February 02, 2014, 05:15:42 pm »

Any chance I can get the code from this miner integrated with bitshares/src/momentum.cpp API?

I would be willing to pay a reasonable number of PTS for the work.

API:

std::vector< std::pair<uint32_t,uint32_t> > momentum_search( pow_seed_type head )

Thoughts?

ptsrush · « **Reply #109 on:** February 02, 2014, 04:11:08 am »

It ran at 599.6 c/m for a long time and I was really panicking.

Thank god, now cpm: 600.5.

[STATS] 2014-Feb-02 12:08:19 | 600.5 c/m | 9.3 sh/m | VL: 1885 (99.5%), RJ: 9 (0.5%), ST: 0 (0.0%)

ptsrush · « **Reply #108 on:** February 02, 2014, 02:40:51 am »

Upgraded to beta9 (avx2) on a haswell e1230-v3, cpm: 595

[STATS] 2014-Feb-02 10:35:29 | 595.1 c/m | 8.8 sh/m | VL: 1004 (99.5%), RJ: 5 (0.5%), ST: 0 (0.0%)

bytemaster · « **Reply #107 on:** February 02, 2014, 12:03:20 am »

Quote

The haswell/AVX2 release is very solid and beats low-end GPUs: It's sitting just above 600 c/m. A cheap GPU (GT 640 GDDR5 -- $85) can get about 250 cpm. The fastest ($600-$1000) get around 2000-2200cpm. The GPUs are still ahead in cpm/$, but not by a shocking margin. Haswell is 610cpm for $300, or about 2cpm/$. An R9 290x is 2200cpm/$610 = 3.6cpm/$.

Considering you can build a high end CPU miner for less than the cost of a high end GPU miner I would have to contend that momentum has served its intended goals quite well.

dga · « **Reply #106 on:** February 01, 2014, 09:57:11 pm »

Well, I'll be. I guess we're entering the CPU mess zone. (Deleted old post)

Solved, thanks to some help from mikaelh_ on #beeeeer.

There's now only one binary, but on AMD, run with sse4 explicitly:

./ptsminer... <addr> <threads> sse4

You'll be much happier than with avx. For Intel, auto-detect works, and avx is better.

dga · « **Reply #105 on:** February 01, 2014, 09:40:24 pm »

Quote from: bytemaster on February 01, 2014, 09:12:13 pm

Quote from: dga on February 01, 2014, 08:24:32 pm
Quote from: dga on February 01, 2014, 07:19:38 pm
beta9 for AVX2 is now online in the usual place: http://www.cs.cmu.edu/~dga/ptsminer/

beta9 for AVX is also now online. This one should be a good speed boost - I'm seeing my test machine go from about 780cpm to 1020cpm.

Note: Unlike prior avxsse releases, this avx release really does require AVX. It's compiled to target sandy bridge and higher. I've changed the name of the binary to reflect this, and left the old avxsse one (which will run on sse4) online.

Direct link: http://www.cs.cmu.edu/~dga/ptsminer/ptsminer-dga-beta9-avx-linux64-static.bin

Happy mining!

Nice... how does this compare to the latest GPU mining?

I think I broke something. This one is a lot better on my AMD test CPU and absolutely horrible on my Intel CPUs. Back to the drawing board. ~~Beta8 is the one to stick with for Intel.~~ (update: beta9 is now working properly for Intel)

The haswell/AVX2 release is very solid and beats low-end GPUs: It's sitting just above 600 c/m. A cheap GPU (GT 640 GDDR5 -- $85) can get about 250 cpm. The fastest ($600-$1000) get around 2000-2200cpm. The GPUs are still ahead in cpm/$, but not by a shocking margin. Haswell is 610cpm for $300, or about 2cpm/$. An R9 290x is 2200cpm/$610 = 3.6cpm/$.
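The cpm/$ arithmetic above can be reproduced with a few lines. This is only a sketch using the prices and rates quoted in this post; the function and constant names are made up for illustration.

```cpp
#include <cassert>

// Throughput per dollar: cpm divided by the hardware price in USD.
double cpm_per_dollar(double cpm, double price_usd) {
    assert(price_usd > 0.0);
    return cpm / price_usd;
}

// Figures quoted in the post (cpm, price in USD).
const double kHaswell = cpm_per_dollar(610.0, 300.0);   // about 2.03
const double kR9_290x = cpm_per_dollar(2200.0, 610.0);  // about 3.61
const double kGT640   = cpm_per_dollar(250.0, 85.0);    // about 2.94
```

By the same figures the cheap GT 640 lands at roughly 2.9 cpm/$, between the Haswell and the R9 290x.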

bytemaster · « **Reply #104 on:** February 01, 2014, 09:12:13 pm »

Quote from: dga on February 01, 2014, 08:24:32 pm

Quote from: dga on February 01, 2014, 07:19:38 pm
beta9 for AVX2 is now online in the usual place: http://www.cs.cmu.edu/~dga/ptsminer/

beta9 for AVX is also now online. This one should be a good speed boost - I'm seeing my test machine go from about 780cpm to 1020cpm.

Note: Unlike prior avxsse releases, this avx release really does require AVX. It's compiled to target sandy bridge and higher. I've changed the name of the binary to reflect this, and left the old avxsse one (which will run on sse4) online.

Direct link: http://www.cs.cmu.edu/~dga/ptsminer/ptsminer-dga-beta9-avx-linux64-static.bin

Happy mining!

Nice... how does this compare to the latest GPU mining?

dga · « **Reply #103 on:** February 01, 2014, 08:24:32 pm »

Quote from: dga on February 01, 2014, 07:19:38 pm

beta9 for AVX2 is now online in the usual place: http://www.cs.cmu.edu/~dga/ptsminer/

beta9 for AVX is also now online. This one should be a good speed boost - I'm seeing my test machine go from about 780cpm to 1020cpm.

Note: Unlike prior avxsse releases, this avx release really does require AVX. It's compiled to target sandy bridge and higher. I've changed the name of the binary to reflect this, and left the old avxsse one (which will run on sse4) online.

Direct link: http://www.cs.cmu.edu/~dga/ptsminer/ptsminer-dga-beta9-avx-linux64-static.bin

Happy mining!

Update: This one is producing very mixed results. Try beta8 and beta9 and use whichever is better for you. Beta9 is rocking on my AMD test CPU, but it seems slower on some others. Definitely needs improvement still.

dga · « **Reply #102 on:** February 01, 2014, 07:19:38 pm »

beta9 for AVX2 is now online in the usual place: http://www.cs.cmu.edu/~dga/ptsminer/

This is a speed-boost release. I'm still doing the benchmarking runs, but on my i7-4770, it's the first of my releases to crack 600 cpm. Looks like it's going to settle in between 610 and 620 cpm with 7 threads running on my test box.

beta9 is haswell-only right now; its optimizations are specific to avx2. I plan to address some of the portability/pool selection issues soon (because I'm running out of great ideas for how to make this thing faster without getting ugly).
