My understanding is that the reason you want NUMA is that the SMP strategy simply does not scale. The more CPUs you add, the more time each CPU spends waiting on the rest, and pretty soon you hit diminishing returns.
You can improve that by going to finer-grained locks: more locks, each held for a shorter time, so that each CPU hogs somewhat less of everyone else's time.
This adds overhead, but pushes back the point at which you hit diminishing returns. You still hit a wall though.
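A rough way to see the wall is Amdahl's law, which I am borrowing here as a stand-in for lock contention (nothing above names it, so treat it as my assumption). The sketch supposes that a fraction `serial` of the work happens under a shared lock, and that finer-grained locking shrinks that fraction while multiplying all work by an `overhead` factor. Both parameters and every number below are made up for illustration, not measurements.

```python
def speedup(cpus, serial, overhead=1.0):
    """Amdahl-style speedup relative to one CPU with no locking overhead.

    serial   -- fraction of the work done while holding a shared lock (assumed)
    overhead -- multiplier on total work from extra lock bookkeeping (assumed)
    """
    parallel = 1.0 - serial
    # Serialized work cannot be spread out; only the parallel part divides by cpus.
    return 1.0 / (overhead * (serial + parallel / cpus))


if __name__ == "__main__":
    for cpus in (1, 4, 16, 64, 256, 1024):
        coarse = speedup(cpus, serial=0.05)               # one big lock
        fine = speedup(cpus, serial=0.01, overhead=1.10)  # many small locks
        print(f"{cpus:5d} CPUs  coarse {coarse:6.2f}x  fine {fine:6.2f}x")
```

With these invented numbers the coarse-locked version can never reach 20x and the fine-grained version can never reach 91x, yet the fine-grained version is slower than the coarse one on a single CPU. That is the trade described above: you pay overhead up front, the wall moves further out, and it is still a wall.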
NUMA still scales well with a few thousand CPUs. You don't often hear of people using more than 64 CPUs with SMP, because beyond that point the extra CPUs are mostly wasted.
My further understanding is that SMP is more widely used because it is easier to program for, and (particularly with Moore's law improving the CPUs) very few people have CPU needs beyond what SMP can provide.
Cheers,
Ben
PS Seconding what Ross said: as your machine spreads out and chips speed up, relativistic latency becomes an ever-growing issue. Sure, throughput can be scaled as far as you are willing to pay. But Einstein ain't so cheap to buy off.
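To put a number on the Einstein problem, here is a back-of-the-envelope sketch. The function names are mine, and it uses the speed of light in vacuum, so it gives a lower bound only: signals in real wires and fibre travel slower, and switching adds more delay on top.

```python
C = 299_792_458.0  # speed of light in vacuum, metres per second


def metres_per_cycle(clock_hz):
    """Farthest a signal could possibly travel during one clock cycle."""
    return C / clock_hz


def min_round_trip_cycles(distance_m, clock_hz):
    """Lower bound on clock cycles spent waiting for a reply from distance_m away."""
    return 2.0 * distance_m / metres_per_cycle(clock_hz)


if __name__ == "__main__":
    for ghz in (1, 2, 3, 4):
        hz = ghz * 1e9
        print(f"{ghz} GHz: {metres_per_cycle(hz) * 100:.1f} cm per cycle, "
              f"{min_round_trip_cycles(3.0, hz):.0f}+ cycles round trip over 3 m")
```

At 3 GHz light covers about 10 cm per cycle, so asking something 3 metres away costs at least 60 cycles before any actual work is done. A faster clock or a bigger machine only makes that ratio worse, and no amount of money buys it back.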