2014-01-01 update:
I've committed some new changes to the repository. Some are cosmetic, but three are important:
1) Memory use on the host side is reduced by about 500MB. This may or may not matter for you.
2) Speed is boosted by 10-20% on a lot of platforms. I have another speed boost patch coming next week once I've made it not horrible, but this one gets a decent chunk of the gains.
3) There's now a developer fee that goes to me. Kinda.
I'm running an experiment with the developer fee in this code release: it's easy to disable, and it's not hidden. It's also just a list of addresses that share the dev fee equally.
So here's my proposal: If you port this software to another platform or release a binary, don't remove my address. Instead, add yours to the list -- I've tried to make it super easy for you to get your own share. If this works out, I'll continue to release improvements and try to make it even easier for other developers who improve upon the code, because we'll all have a reason to make software that remains open source and which is user-friendly and high performance.
If you think this is horrible, let me know and let's try to find a way to make it work better.
If you're a user who hates the idea of a dev fee, the source is yours and you can delete the addresses listed there and/or add your own.
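A minimal sketch of how an equal-share address list like this could behave. It is not taken from the repository: the placeholder addresses, `DEV_FEE_ADDRESSES`, `pick_dev_address`, and `share_per_address` are hypothetical names, and rotating through the list by round number is only an assumed way to make the shares equal.

```python
# Hypothetical sketch: names and addresses are placeholders, not the repository's.
DEV_FEE_ADDRESSES = [
    "ORIGINAL_AUTHOR_ADDRESS",  # the original author's entry stays in place
    "PORTER_ADDRESS",           # a porter or binary packager appends their own
]


def pick_dev_address(round_index, addresses=DEV_FEE_ADDRESSES):
    """Rotate through the list so every address gets the same number of dev-fee rounds."""
    if not addresses:
        return None  # an empty list means no dev fee is collected
    return addresses[round_index % len(addresses)]


def share_per_address(addresses=DEV_FEE_ADDRESSES):
    """Fraction of the dev fee each listed address receives under an equal split."""
    return 1.0 / len(addresses) if addresses else 0.0
```

With two entries in the list, each address gets half of the dev fee; adding a third drops every share to one third, and deleting all entries disables the fee.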
-Dave
You mean 10-01-2013 update?

Superb work, dga. Since I have a few Nvidia cards lying around, here are some of my test results since yesterday.
cudarts version 08-01-2013 (v7, if I'm not mistaken)
GTX 780 - 1450 cpm
GTX 680 - 650 cpm
GTX 580 - 850 cpm (3GB memory)
GTX 580 - 920 cpm (1.5GB memory)
GTX 570 - 750 cpm
GTX 260 - 290 cpm
cudarts version 10-01-2013 (v8)
GTX 780 - 1800 cpm
GTX 680 - 950 cpm
GTX 580 - 820 to 930 cpm (3GB memory; the value varies depending on the card manufacturer)
GTX 580 - 960 cpm (1.5GB memory)
GTX 570 - 770 cpm
GTX 260 - 240 cpm
Most of the cards got a very nice bump, but I noticed some reductions too. The nicest thing about v8 is that my card runs at least 3 °C cooler. No change in memory consumption.
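The per-card change from v7 to v8 can be worked out from the figures above. This is just a small Python sketch: `percent_change` is an illustrative helper name, and the GTX 580 (3GB) is left out because its v8 result was reported as a range rather than a single value.

```python
# Reported cpm per card as (v7, v8), copied from the lists above.
# The GTX 580 (3GB) is omitted because its v8 figure is a range.
results = {
    "GTX 780": (1450, 1800),
    "GTX 680": (650, 950),
    "GTX 580 (1.5GB)": (920, 960),
    "GTX 570": (750, 770),
    "GTX 260": (290, 240),
}


def percent_change(old, new):
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100.0


for card, (v7, v8) in results.items():
    print(f"{card}: {v7} -> {v8} cpm ({percent_change(v7, v8):+.1f}%)")
```

By this calculation the GTX 680 gains the most (about 46%), the GTX 780 gains about 24%, and the GTX 260 loses about 17%.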
There are some points I still don't understand:
1. Why isn't the GTX 680 much faster than the GTX 580? With v7, the GTX 680 was even slower than the GTX 580.
2. I tried compiling with sm_35 for the GTX 780 cards, but it ran around 10-15% slower than with sm_30.
Anyway, I'm very happy with this. Thanks, dga.