But my point remains. For many developers, naive parallelism is enough. For many more, it is fairly easy to get to the first stage by dividing your program up into a small number of dedicated processes that communicate over sockets.
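To make the "dedicated processes talking over sockets" idea concrete, here is a minimal sketch in Python. Everything in it is my own illustration rather than anything from a real system: the worker script, the `sum_of_squares_in_worker` name, and the one-line text protocol are all assumptions. The parent listens on a local port, starts one worker process, sends it a line of integers, and reads back a single number.

```python
import socket
import subprocess
import sys

# Hypothetical worker program: connects back to the parent on the port given
# as its first argument, reads one line of integers, replies with the sum of
# their squares.
WORKER_SOURCE = r"""
import socket, sys
with socket.create_connection(("127.0.0.1", int(sys.argv[1]))) as conn:
    with conn.makefile("rw") as stream:
        numbers = [int(tok) for tok in stream.readline().split()]
        stream.write(str(sum(n * n for n in numbers)) + "\n")
        stream.flush()
"""


def sum_of_squares_in_worker(numbers):
    """Run the computation in a separate process, talking over a local socket."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as server:
        server.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
        server.listen(1)
        server.settimeout(10)  # do not wait forever if the worker never connects
        port = server.getsockname()[1]
        worker = subprocess.Popen([sys.executable, "-c", WORKER_SOURCE, str(port)])
        try:
            conn, _ = server.accept()
            with conn, conn.makefile("rw") as stream:
                # One request line out, one reply line back.
                stream.write(" ".join(str(n) for n in numbers) + "\n")
                stream.flush()
                reply = stream.readline()
        finally:
            worker.wait()
    return int(reply)


if __name__ == "__main__":
    print(sum_of_squares_in_worker([1, 2, 3, 4]))
```

The two processes share no memory and need no locks. The only thing they agree on is the line format on the socket, which is a large part of why this style avoids so much pain.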

For instance, high-end video rendering software already divides the problem into smaller ones that can be farmed out easily, so that the work can be distributed over clusters. The fact remains that for some things concurrency comes cheaply, and for many more you can achieve high levels of effective concurrency in a very naive way, and avoid a lot of pain in doing so.
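As a rough sketch of what that naive style looks like on a single machine, here is the same shape using Python's standard `concurrent.futures` module. The `render_frame` function is a hypothetical stand-in for one independent unit of work (it just burns some CPU), and `render_all` is a name I made up. Real rendering software is of course far more involved.

```python
from concurrent.futures import ProcessPoolExecutor


def render_frame(frame_number):
    """Hypothetical stand-in for an expensive, independent unit of work."""
    return sum((frame_number * i) % 7 for i in range(1000))


def render_all(frame_numbers, max_workers=4):
    """Farm independent frames out to worker processes.

    pool.map returns results in the same order as the inputs.
    """
    with ProcessPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(render_frame, frame_numbers))


if __name__ == "__main__":
    frames = list(range(8))
    # The parallel result matches the plain sequential loop.
    assert render_all(frames) == [render_frame(n) for n in frames]
```

Because no frame depends on any other, nothing is shared between the workers, so there is nothing to lock.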

Unless you know that you can't get the job done with that simple-minded approach, I would not recommend going to a pervasively multi-threaded approach.

Cheers,
Ben