Since I have 4 CPUs to fill at the moment, I rewrote it using Perl threads, and I maintain a list of active threads to decide when to release new tasks.
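For anyone following along, the active-thread-list idea looks roughly like the sketch below. This is a simplified illustration, not my actual code: `process_task` is a hypothetical stand-in for the real work, `run_tasks` is a name made up for this example, and the 10 ms polling interval is an arbitrary assumption. It assumes a Perl built with ithreads.

```perl
use strict;
use warnings;
use threads;

# Hypothetical stand-in for the real per-task work.
sub process_task {
    my ($task) = @_;
    return $task * 2;
}

# Run @tasks with at most $max threads alive at once.
# Returns a hashref mapping each task to its result.
sub run_tasks {
    my ($max, @tasks) = @_;
    my %result;
    my %active;    # tid => [thread object, task]

    while (@tasks or %active) {
        # Reap any threads that have finished.
        for my $tid (keys %active) {
            my ($thr, $task) = @{ $active{$tid} };
            next unless $thr->is_joinable;
            $result{$task} = $thr->join;
            delete $active{$tid};
        }

        # Release new tasks while there are free slots.
        while (@tasks and keys(%active) < $max) {
            my $task = shift @tasks;
            my $thr  = threads->create(\&process_task, $task);
            $active{ $thr->tid } = [$thr, $task];
        }

        # Short sleep so the dispatcher does not spin (assumed interval).
        select(undef, undef, undef, 0.01);
    }
    return \%result;
}
```

Calling `run_tasks(4, @tasks)` keeps four workers busy, one per CPU. Polling is the simplest thing that works; a `Thread::Queue` with a fixed pool of workers would avoid creating one thread per task.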
On this system I now hit ~380% CPU (max 400%), with an occasional IO bottleneck. My RPS went from 100K to 250K. I'm nowhere near where I want to be on performance yet, but it seems I have a direction.
My tasks partition nicely: the input file can easily be split and the pieces sent to individual threads and/or processes.
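To show what I mean by splitting, here is a minimal sketch that assumes the input is a line-oriented text file. It divides the file into byte ranges that always end on a line boundary, so each worker (thread, process, or eventually a cluster node) can seek straight to its own piece. `chunk_ranges` and `lines_in_range` are names made up for this example.

```perl
use strict;
use warnings;

# Split $path into up to $parts byte ranges [start, end),
# each ending on a line boundary.
sub chunk_ranges {
    my ($path, $parts) = @_;
    my $size = (-s $path) || 0;
    open(my $fh, '<:raw', $path) or die "open $path: $!";

    my @ranges;
    my $start = 0;
    for my $i (1 .. $parts) {
        last if $start >= $size;
        my $end = $i == $parts ? $size : int($size * $i / $parts);
        next if $end <= $start;    # an earlier chunk already covered this
        if ($end < $size) {
            # Move the cut forward to the end of the line it lands in.
            seek($fh, $end, 0) or die "seek: $!";
            my $rest = <$fh>;
            $end = tell($fh);
        }
        push @ranges, [$start, $end];
        $start = $end;
    }
    close $fh;
    return @ranges;
}

# What a worker would do: read only the lines inside its own range.
sub lines_in_range {
    my ($path, $start, $end) = @_;
    open(my $fh, '<:raw', $path) or die "open $path: $!";
    seek($fh, $start, 0) or die "seek: $!";
    my @lines;
    while (tell($fh) < $end) {
        my $line = <$fh>;
        last unless defined $line;
        push @lines, $line;
    }
    close $fh;
    return @lines;
}
```

Each `[start, end)` pair is just two integers, so it is cheap to hand to a thread, a forked process, or a remote machine that can see the same file.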
So I guess the next step is to set up the simplest/cheapest cluster available to me and program against that.
Any direction on that?