Post #214,548
7/12/05 7:39:59 PM
7/14/05 1:17:39 PM
|
But neither are you, at least not at that low a level
There are different ways of implementing certain algorithms. You can take a LOT of memory and run a lot faster than if you take just a little memory.
The key is the goal. Will you be thrashing memory, meaning the approach was only faster until you started paging?
And how many of the same programs have to run at the same time?
So optimizing for memory usually means you want a smaller footprint and will accept some performance loss.
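A toy-scale sketch of that trade-off, in Python. The function names are made up for illustration and this is not code from any system discussed here. One version counts bits using no extra memory, the other spends 256 bytes on a precomputed table so it can do less work per call. Both assume non-negative integers. Which one wins on a given box is something to measure, not assume.

```python
def popcount_loop(x: int) -> int:
    """Count set bits with no extra memory: one iteration per set bit.

    Assumes x is a non-negative integer.
    """
    count = 0
    while x:
        x &= x - 1  # clear the lowest set bit
        count += 1
    return count


# Spend memory up front: one precomputed answer per possible byte value.
BYTE_TABLE = bytes(popcount_loop(b) for b in range(256))


def popcount_table(x: int) -> int:
    """Count set bits one byte at a time using the precomputed table.

    Assumes x is a non-negative integer.
    """
    count = 0
    while x:
        count += BYTE_TABLE[x & 0xFF]  # table lookup replaces up to 8 loop steps
        x >>= 8
    return count
```

The same shape scales up: a bigger table means fewer steps per call, right up until the table no longer fits where you need it to.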
Cache usually does not come into play at this higher level of optimization. We are talking about decisions that might make a program 40MB rather than 60MB. That 40MB program will NOT fit in cache. Shaving that 20MB has no effect, other than letting it run on a memory-constrained box or allowing more copies to run at the same time.
Cache usually comes into play when dealing with very tight loops and small (under 32K) lookup tables. I used to code for cache hits when I wrote in C, and occasionally (it was VERY rare) dumped the assembly language to figure out if I was going to cause a cache miss, triggering a significant wait. At that point I might have unrolled a while loop and turned it into a case statement with a bunch of gotos to keep my working set as small as possible.
{{edit: moved paragraph.}} As a perl programmer, I never think at that level any more.
This was the time I wrote a curses-based, table-driven editorial system. I had 40 terminal users running against a 386 box running SCO Xenix. I doubt I had more than 32MB of RAM. Every byte AND every CPU cycle was sacred here.
Edited by broomberg
July 14, 2005, 01:17:39 PM EDT
|
Post #214,555
7/12/05 8:48:30 PM
|
I might surprise you
I don't think at that level often, but occasionally I wind up thinking about things at that level, particularly when I'm trying to figure out how to convince a database to do things my way. (Or while I'm figuring out why doing things my way wasn't as good as I hoped it would be.)
As for your example, you're right that shaving 20 MB off of the executable does not make it fit in cache. But that is ignoring the fact that while the overall application might be that large, most people don't use all of the application at the same time. Therefore the active set in the application that you're using has a chance of fitting in cache, and cutting its size by a third strongly improves that chance.
Furthermore even if the active set does not fit into cache, you've improved how much of it does. And this is still a win. Sure, you're going to have lots of waits while something is fetched into cache, but reducing the number of such waits increases performance.
You aren't going to get an order of magnitude improvement that way, but you might get several percent improvement. Of course, giving up things like loop unrolling might cost you more than you gain. You need to benchmark, benchmark, benchmark.
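For what "benchmark" can look like in practice, here is a minimal sketch using Python's standard `timeit` module. The two functions and the `benchmark` helper are hypothetical stand-ins, not anything from a real application. They compute the same sum, one with a plain loop and one unrolled by four, so the only open question is which is faster on your machine.

```python
import timeit


def sum_squares_loop(n):
    """Plain loop: smallest code, one iteration per element."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def sum_squares_unrolled(n):
    """Same result with the loop body unrolled by four."""
    total = 0
    i = 0
    limit = n - (n % 4)  # largest multiple of 4 not exceeding n
    while i < limit:
        total += i * i + (i + 1) * (i + 1) + (i + 2) * (i + 2) + (i + 3) * (i + 3)
        i += 4
    while i < n:  # handle the leftover 0 to 3 elements
        total += i * i
        i += 1
    return total


def benchmark(func, n=10_000, repeat=5, number=20):
    """Return the best observed time per call, in seconds."""
    times = timeit.repeat(lambda: func(n), repeat=repeat, number=number)
    return min(times) / number  # min is the least noisy of the repeats
```

Calling `benchmark(sum_squares_loop)` and `benchmark(sum_squares_unrolled)` on the target box gives two numbers to compare. The results will differ between machines and interpreter versions, which is exactly why you measure.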
Another note. Apple is very aware that people notice startup time more than execution performance in interactive applications. Smaller executables might or might not run as fast, but they are likely to load faster. Thinking about it, this probably matters more to them than any possible speed gain/loss at runtime.
Cheers, Ben
I have come to believe that idealism without discipline is a quick road to disaster, while discipline without idealism is pointless. -- Aaron Ward (my brother)
|
Post #214,557
7/12/05 9:03:33 PM
7/13/05 6:36:36 AM
|
Latency latency latency
In my kernel tuning class (the week I met you in NYC) that was the mantra.
Performance is mostly a perceived phenomenon, not a truly measured one, at least in the case of tuning a system for OTHER people.
So the 1st lesson to be learned was to lower expectations.
After that, true throughput was typically ignored, in an attempt to reduce latency. So it worked its way out, from CPU cache hits, to 2nd level memory, to disk, to network, etc.
Edit: I should have said worked its way in, from slowest to quickest. And the 1st should have been application design. That's because the items with the largest perceived benefit should be tuned 1st, not the other way around. Who cares if you shave a few milliseconds with a cache hit only to lose a few seconds on the application design side?
And as you stated, program startup was a HUGE issue with people, since interactive programs spend almost all their time waiting for people, with the exception of the startup time.
Which in turn really does not typically come into play on a multi-user system that has been running for a while. This is because the most common apps are already in memory (at least the primary working set) since a single app is really shared by all the users. Or at least is in the disk cache.
Edited by broomberg
July 13, 2005, 06:36:36 AM EDT
|
Post #214,581
7/13/05 3:30:29 AM
|
Note that most Macs are single-user systems
|
Post #214,582
7/13/05 6:32:54 AM
|
No argument
|
Post #214,636
7/13/05 12:30:24 PM
|
Our perl programmers think that way all the time
We have an app server built in C++ with a Perl thread to run Mason. Every operation is on a very tight time budget (we always trade size for speed - speed is king).
That's part of what makes this gig fun - it's always pushing the machine's capabilities.
"Whenever you find you are on the side of the majority, it is time to pause and reflect" --Mark Twain
"The significant problems we face cannot be solved at the same level of thinking we were at when we created them." --Albert Einstein
"This is still a dangerous world. It's a world of madmen and uncertainty and potential mental losses." --George W. Bush
|
Post #214,772
7/14/05 1:20:29 PM
|
Please reread, I changed it
I added the 386/sacred paragraph but should have put it AFTER the comment.
As a perl programmer, I CANNOT control the cache. Perl code is too large, the syntax tree too abstract (as opposed to mapping C code to assembly language constructs).
I still feel that way about memory and cpu cycles, but I can't control the cache so I can't think at that level.
|
Post #214,775
7/14/05 1:26:13 PM
|
We have a lot of levels of caching
and have to keep them all in mind when designing. (Yes I know you were talking about instruction caching - we do something similar at the service call level).
It makes for a challenging environment but I find it a lot of fun.
"Whenever you find you are on the side of the majority, it is time to pause and reflect" --Mark Twain
"The significant problems we face cannot be solved at the same level of thinking we were at when we created them." --Albert Einstein
"This is still a dangerous world. It's a world of madmen and uncertainty and potential mental losses." --George W. Bush
|
Post #214,781
7/14/05 1:38:37 PM
|
Agreed
I had an interesting meeting today.
We had moved from Oracle/Solaris 8/Veritas/Xyratex to Oracle/Linux 2.4/ext3/Clariion.
Went from quad CPU Sparc 450MHz to quad CPU Opteron 2.2GHz.
I considered the ext3 the iffiest part of the move, but this is RH AS3, with no decent file systems available, at least for large data.
So the programmers started complaining about incredibly poor performance a month or so after the move.
Note: I was involved in NONE of their code, other than the foundation of the project 5 years ago.
And all the coders who worked on it until 6 months ago are gone.
So anyway, the new guy, who is a brilliant Perl programmer, but knows NOTHING of large data, is bitching about the terrible performance.
None of my tests show anything wrong, so I just watch his processing.
When they moved from Sparc to Opteron, they figured they had a bunch of CPU, so they ALSO killed 6 dual Xeon compute servers and centralized their CPU intensive runs on the single box.
At the same time they were doing the Oracle work.
Driving their load average to 20.
Blaming the system
SMACK!
Today was the day the manager of that group said he had no worries about performance anymore, that the system I designed and gave to him to use works great, and that he was sure there would be no performance problems in the future now that his people realized they were being silly. They've been running just fine for a couple of weeks now.
|
Post #214,806
7/14/05 2:54:35 PM
|
Ahhh, cogs click into place.
I would have fallen over if you told me it was at 20.
How the FSCK would you ever get that high doing the compute stuff you are doing... unless they were throwing everything at it at the same time.
I have seen machines with load averages in the 200s and greater performing just fine. It was an application server that scheduled and ran multiple, multiple, multiple servlets (or computelets) per second. It was able to keep up, but you can only execute so many a second.
In any case, these things were the cause of the 200+ load average... The CPUs, Disk I/O, Memory I/O, Memory Cache, Buffering, etc... weren't even being pegged at 200+ Load Average. So many things to be executed and submitted at a time, some functions submitted 300 joblets (or whatever they called them) at a time.
The only time the app servers were that busy, was during Fall Registration opening and closing. Other than that, it barely rose above 2.
This was an IBM 4 Proc P9XX system doing the work with a Gaggle of Memory and nice disk perf.
-- [link|mailto:greg@gregfolkert.net|greg], [link|http://www.iwethey.org/ed_curry|REMEMBER ED CURRY!] @ iwethey [image|http://www.danasoft.com/vipersig.jpg||||]
|
Post #214,879
7/14/05 9:21:00 PM
|
It's helpful to remember just what "load average" means.
Someone told me it was how many scheduled items in the kernel missed their turn in the schedule. Sounds like one of the few times that "load average" is actually highly misleading.
Wade.
Save Fintlewoodlewix
|
Post #214,953
7/15/05 9:38:29 AM
|
Yeah, that is a good analogy...
But I have always described it as jobs waiting in the queue during the sample interval.
Or in other words, Jobs waiting for their turn in the scheduling.
Sort of like Left turn lanes in the USA (Right turn lanes in places that drive on the wrong side of the road).
At busy intersections, left turn lanes typically build up a queue of cars to turn left. Most of the time only 3-5 get through per light. Sometimes takes 5 or 6 light cycles to get through it.
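To put rough numbers on the queue picture: on Linux, as far as I know, the reported load average is an exponentially damped moving average of the number of tasks that are running or waiting to run (plus those in uninterruptible sleep), sampled about every five seconds. The sketch below is a simplified model of that calculation in Python, not the kernel's actual fixed-point code. The function names and the 5-second interval constant are assumptions for illustration.

```python
import math

SAMPLE_INTERVAL = 5.0  # assumed seconds between samples (Linux-like)


def update_load(load, queued_jobs, window_seconds=60.0, interval=SAMPLE_INTERVAL):
    """Fold one sample of the queue length into the running average."""
    decay = math.exp(-interval / window_seconds)  # weight kept from the old value
    return load * decay + queued_jobs * (1.0 - decay)


def simulate(samples, window_seconds=60.0):
    """Return the load average after each sample, starting from an idle box."""
    load = 0.0
    history = []
    for queued_jobs in samples:
        load = update_load(load, queued_jobs, window_seconds)
        history.append(load)
    return history
```

Feeding it a steady 20 queued jobs, as in `simulate([20] * 200)`, makes the figure climb toward 20. On a quad CPU box that is roughly 5 jobs per processor, so most of them are sitting in the turn lane. On a Unix-like system, Python's `os.getloadavg()` returns the real 1, 5 and 15 minute figures for comparison.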
|