I don't often think at that level, but occasionally I wind up there, particularly when I'm trying to figure out how to convince a database to do things my way. (Or while I'm figuring out why doing things my way wasn't as good as I'd hoped.)
As for your example, you're right that shaving 20 MB off of the executable does not make it fit in cache. But that ignores the fact that while the overall application might be that large, most people don't use all of it at the same time. The active set, the parts of the application you're actually using, therefore has a chance of fitting in cache, and cutting its size by a third greatly improves that chance.
Furthermore, even if the active set does not fit into cache, you've improved how much of it does, and that is still a win. Sure, you'll still have lots of stalls while something is fetched into cache, but reducing the number of such stalls improves performance.
You aren't going to get an order of magnitude improvement that way, but you might get several percent. Of course, what you gave up to get there (loops left unrolled, and so on) might cost you more than you gain. You need to benchmark, benchmark, benchmark.
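To make the working-set point concrete, here is a minimal microbenchmark sketch (my own illustration, in Python, not anything from the discussion above): it times random reads over working sets of various sizes. Interpreter overhead will mute the effect compared to compiled code, but once the set outgrows cache, the time per access should still climb.

```python
# Hypothetical microbenchmark (illustrative only): average time per random
# read within working sets of different sizes. When the working set no
# longer fits in cache, each access gets slower -- which is why shrinking
# the active set helps even if the whole program never fit in cache.
import array
import random
import time

def time_per_access(working_set_bytes, accesses=200_000):
    """Return average seconds per random 8-byte read within the working set."""
    n = working_set_bytes // 8
    data = array.array("q", range(n))          # n contiguous 8-byte ints
    # Pre-generate the indices so we time memory access, not the RNG.
    idx = [random.randrange(n) for _ in range(accesses)]
    sink = 0                                   # keep reads from being optimized away
    start = time.perf_counter()
    for i in idx:
        sink += data[i]
    elapsed = time.perf_counter() - start
    return elapsed / accesses, sink

if __name__ == "__main__":
    # Sizes chosen to straddle typical L1/L2/L3 capacities and main memory.
    for size in (32 * 1024, 256 * 1024, 8 * 1024 * 1024, 64 * 1024 * 1024):
        per_access, _ = time_per_access(size)
        print(f"{size >> 10:>8} KiB working set: {per_access * 1e9:7.1f} ns/access")
```

The cache sizes are guesses about a typical machine; run it on your own hardware (and ideally rewrite it in a compiled language) before drawing conclusions, which is exactly the benchmarking point above.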
Another note. Apple is very aware that in interactive applications people notice startup time more than execution performance. Smaller executables might or might not run as fast, but they are likely to load faster. Thinking about it, this probably matters more to them than any possible speed gain or loss at runtime.
Cheers,
Ben