You are right... I owe you that tip... sorry I have been very busy  can you send me a BTC address.  Thanks
Ditto.  
In other news, I think I've finally, finally, actually done it (again).
As we discussed previously, it looks like using a cycle-detection algorithm is no-go because the hash doesn't map to itself.
However, what does work is making a pseudo-rainbow table in memory, on the fly.
The result: the ratio of memory used vs search rate remains exactly the same, but we can scale the size arbitrarily.  For example, instead of taking 768 MB memory and 1 minute per search, we can use (768 MB / X) memory and X minutes per search.
This sounds useless at first, but consider the following: given that we can arbitrarily scale the memory requirements, the only limiting factor how fast we can find collisions is our brute computing power.  What's good at brute computing power?  GPUs and ASICs.
Here's my proof:
Overview:
https://en.wikipedia.org/wiki/Rainbow_table
Make a "rainbow table" map to hold the equivalent hash space of SEARCH_SPACE
  For chains of length CHAIN_LENGTH,
    rainbow table needs (SEARCH_SPACE / CHAIN_LENGTH) elements [name: RAINBOW_ENTRIES]
  Rainbow table will require (8 bytes (hash) + 4 bytes (nonce)) * RAINBOW_ENTRIES bytes total
To compress each SEARCH_SPACE_BITS hash to a new nonce in the range MAX_MOMENTUM_NONCE,
  simply use (hash % MAX_MOMENTUM_NONCE)
Pseudocode:
For the first RAINBOW_ENTRIES
  nonce = element number
  For i = 1 to CHAIN_LENGTH
    cur_hash = HASH(nonce)
    Check if any part of cur_hash is in rainbow table
      If yes and generated by different nonce: add to result list
    For each chunk in cur_hash
      rainbow_map[chunk] = element number
    nonce = compress(cur_hash)
For the next (SEARCH_SPACE - RAINBOW_ENTRIES) elements:
  nonce = element number
  For i = 1 to CHAIN_LENGTH
    cur_hash = Hash(nonce)
    Check if any part of cur_hash is in rainbow table
      If yes and generated by different nonce: add to result list
    nonce = compress(cur_hash)
return result list
// MATH SECTION /////
Given:
CALLS(CHAIN_LENGTH) = total calls to SHA512 using a rainbow table with chain length CHAIN_LENGTH 
CALLS(CHAIN_LENGTH) = (SEARCH_SPACE / CHAIN_LENGTH) + (SEARCH_SPACE * CHAIN_LENGTH) - (SEARCH_SPACE / CHAIN_LENGTH)
  + (SEARCH_SPACE / CHAIN_LENGTH) = Calls to generate rainbow table
  + (SEARCH_SPACE * CHAIN_LENGTH) = Calls to check SEARCH_SPACE nonce values
  - (SEARCH_SPACE * CHAIN_LENGTH) = Except we skip the nonces used to make the rainbow table
CALLS(CHAIN_LENGTH) = (SEARCH_SPACE * CHAIN_LENGTH)
Given:
MEMORY(CHAIN_LENGTH) = number of bytes required to store a rainbow table representing
                         SEARCH_SPACE elements with chain length CHAIN_LENGTH
MEMORY(CHAIN_LENGTH) = rainbow table elements * (size of hash chunk + size of nonce)
MEMORY(CHAIN_LENGTH) = (8 + 4) * SEARCH_SPACE / CHAIN_LENGTH
Examples:
Minimum number of SHA512 calls is CHAIN_LENGTH = 1 (equivalent to the "default" semiordered map setup)
  Requires (SEARCH_SPACE * 12) = 768 MB memory
  Requires (SEARCH_SPACE) calls to SHA512
For CHAIN_LENGTH = N
  Requires (SEARCH_SPACE * 12 / N) bytes of memory
  Requires (SEARCH_SPACE * N) calls to SHA512
For CHAIN_LENGTH = 1024
  Requires (2^26 * 12 / 1024) = 768 KB of memory
  Requires (2*26 * 1024) calls to SHA512
Bottom line:
Using a rainbow table map of chain length N,
  memory usage is multiplied by a factor of 1/N
  and SHA512 calls are multiplied by a factor of N
This means that regardless the CHAIN_LENGTH selected,
  the ratio of CALLS(CHAIN_LENGTH) to MEMORY(CHAIN_LENGTH)
  is inversely proportional.
EFFICIENCY IS CONSTANT AT ANY MEMORY TARGET.
All that matters is brute computational strength.
Example:
If a GPU/ASIC/FPGA has PROCESSOR number of parallel computing units,
  and MEM_MAX bytes of available memory, then each
  Momentum hash instance can use up to (MEM_MAX / PROCESSOR) bytes of memory.
To calculate the CHAIN_LEN to acheive this:
  CHAIN_LEN(MEM_MAX, PROCESSOR) = (8 + 4) * SEARCH_SPACE / (MEM_MAX / PROCESSOR)
  CHAIN_LEN(MEM_MAX, PROCESSOR) = (8 + 4) * SEARCH_SPACE * PROCESSOR / MEM_MAX
  Note: This calculation is just the inverse of MEMORY(CHAIN_LEN)
Assuming: GPU with 128 compute units, 1 GB memory, and SEARCH_SPACE = MAX_MOMENTUM_NONCE
  CHAIN_LEN(2^30, 2^7) = (8 + 4) * SEARCH_SPACE * PROCESSOR / MEM_MAX
  CHAIN_LEN(2^30, 2^7) = (8 + 4) * SEARCH_SPACE * 2^7 / 2^30
  CHAIN_LEN(2^30, 2^7) = 12 * SEARCH_SPACE / 2^23
  CHAIN_LEN(2^30, 2^7) = 12 * 2^26 / 2^23
  CHAIN_LEN(2^30, 2^7) = 12 * 2^3
  CHAIN_LEN(2^30, 2^7) = 96
  Verification:
  MEMORY(CHAIN_LENGTH) = (8 + 4) * SEARCH_SPACE / CHAIN_LENGTH
  MEMORY(96) = (8 + 4) * 2^26 / 96
  MEMORY(96) = 8388608 bytes
  MEMORY(96) = 8 MB
  CALLS(CHAIN_LENGTH) = (SEARCH_SPACE * CHAIN_LENGTH)
Result: 1/96th the normal memory usage of 768 MB, 96x the SHA512 calls.
  The problem is no longer about memory, it's about computation speed.
I've actually got a somewhat working proof of concept, although it's quite sub-optimal.  Right now the main issue is using the overkill boost::unordered_map for storage instead of a more bespoke memory structure.  The boost implementation likes to decide it needs to redo the bucket arrangement from time to time, resulting in slowdowns.
It also has an unholy number of debug statements.
If you'd like to check on my progress, 
the source is right here.  It's a drop-in replacement for "momentum.cpp".
Edit: There's other optimizations possible when using the time/memory tradeoff.  For example, if we normally took 1 minute per nonce space search, cranking up the time/memory ratio to 16 would mean we'd expect 16 minutes per search, which is considerably longer than the target time per block.  To compensate, we can have the algorithm return immediately on first collision discovery.  When a new block is detected, all the miner threads should be stopped and recreated.