In my kernel tuning class (the week I met you in NYC) that was the mantra.
Performance is mostly a perceived phenomenon, not a truly measured one, at least when you are tuning a system for OTHER people.
So the 1st lesson to be learned was to lower expectations.
After that, true throughput was typically ignored in an attempt to reduce latency.
So it worked its way out, from CPU cache hits, to 2nd level memory, to disk, to network, etc.
Edit: I should have said worked its way in, from slowest to quickest. And the 1st should have been application design. That's because the items with the largest perceived benefit should be tuned 1st, not the other way around. Who cares if you shave a few milliseconds with a cache hit, only to lose a few seconds on the application design side?
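To put rough numbers on that ordering, here is a minimal Python sketch. The per-interaction savings are made-up figures purely for illustration (not measurements from any real system), and `candidates` and `tuning_order` are hypothetical names. It just ranks the tuning targets by how much time each one might give back to the user.

```python
# Hypothetical per-interaction savings, in seconds, for each tuning target.
# These figures are invented for illustration; measure your own system.
candidates = {
    "CPU cache": 0.000002,
    "memory": 0.002,
    "disk": 0.05,
    "network": 0.2,
    "application design": 2.0,
}


def tuning_order(savings):
    """Return tuning targets sorted by estimated time saved, largest first."""
    return sorted(savings, key=savings.get, reverse=True)


order = tuning_order(candidates)
total = sum(candidates.values())

# Print each target with its estimated saving and its share of the total.
for name in order:
    share = candidates[name] / total
    print(f"{name:20s} {candidates[name]:>10.6f} s  {share:6.1%}")
```

With numbers anything like these, application design dominates the total and the cache-hit savings barely register, which is the whole point of tuning from the outside in.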
And as you stated, program startup was a HUGE issue for people, since interactive programs spend almost all their time waiting on the user, with the exception of the startup time.
Startup time, in turn, typically does not come into play on a multi-user system that has been running for a while. That is because the most common apps are already in memory (at least the primary working set), since a single app is really shared by all the users, or at the very least they are in the disk cache.