Sure, you can get great results with 128-bit registers and single-cycle vector rotates and shifts. But a GPU has so many more ALUs that it kind of compensates. Still, its registers are 32-bit and there is no SHLD/SHRD to implement 64-bit ops nicely, so 64-bit ops are pretty much a showstopper.
I would put it differently: the CPU compensates (very well) for its lower parallelism with vector instructions. Do you agree?
32-bit registers don't automatically make it slower just because they're 32-bit; there just aren't enough free registers left for all possible waves. Do you have 100% kernel occupancy on all your kernels? I don't think so. SHLD/SHRD isn't going to change that.
Well, occupancy is about 99% for this kernel, with no register spilling; AMD has tons of registers, for that matter. Just take a look at what the compiler generates when you shift or rotate a ulong.
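To make the point concrete, here is a rough sketch in plain C of what a 64-bit rotate has to turn into when only 32-bit registers and 32-bit shifts are available. This is not actual compiler output, and the names u64_pair, rotl64_emulated and rotl64_reference are made up for illustration. Each output half needs two shifts and an OR, plus the half swap for counts of 32 or more. With a double-shift instruction like SHLD, each `(a << n) | (b >> (32 - n))` line would be a single instruction.

```c
#include <stdint.h>

/* Hypothetical helper type: one 64-bit value held in two 32-bit registers. */
typedef struct { uint32_t lo, hi; } u64_pair;

/* Rotate left by n (taken modulo 64) using only 32-bit operations. */
static u64_pair rotl64_emulated(u64_pair v, unsigned n)
{
    u64_pair r;

    n &= 63;
    if (n & 32) {
        /* A rotate by 32 or more starts with swapping the two halves. */
        uint32_t t = v.lo;
        v.lo = v.hi;
        v.hi = t;
    }

    n &= 31;
    if (n == 0)
        return v; /* avoid an undefined 32-bit shift by 32 below */

    /* Each half takes its own bits shifted up plus the bits that
       fall out of the top of the other half. */
    r.hi = (v.hi << n) | (v.lo >> (32 - n));
    r.lo = (v.lo << n) | (v.hi >> (32 - n));
    return r;
}

/* Native 64-bit rotate, used only to check the emulated version. */
static uint64_t rotl64_reference(uint64_t x, unsigned n)
{
    n &= 63;
    return n ? (x << n) | (x >> (64 - n)) : x;
}
```

If the rotate count is a compile-time constant, the swap and the zero check fold away, but the two shifts and the OR per half remain.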