We always say: get it done, get it right, optimize
later.
Usually later never comes.
Sometimes it is sooner than you think.
I wrote a census apply program. It takes census
data and appends it to user data, based on geocode or
zip code, whichever is available.
It was REALLY slow. It used indexed lookups,
pulling data from disk. The census data is huge, about
4K. It took a long time to pull the data and append
it as the user data flowed through.
But it was fast enough. It took the load off the
mainframe and was faster than if the mainframe did
it, so everyone was happy. Except me. I KNEW I
could do a lot better.
And then disaster struck. The array I developed it
on lost 9 disks at once. Failed channel. And the
reconstruction effort really fried them. One of the
LUNs depended on 2 of the disks, so the LUN was GONE.
The volume that was being used striped 4 LUNs to
create a terabyte file system. So it went bye-bye.
I'm not the sysadmin. I'm not in the systems group.
I'm not supposed to take any time to back up my own
stuff.
And of course, this particular TB was NOT in the
backup system. It was for development, so it did
not fall in the stringent backup schedule, more like:
Hey, back up my stuff this weekend, ok?
I have a nightly backup process that runs, copying all
my development directories into a file, bzipping it,
and copying it to another system.
But my census code was NOT in that either. I had
placed it outside my development tree in preparation
for handing it to another coder. Silly me.
ARRGG!!!
So I could consider this an opportunity. Make it better,
faster.
I threw away the original indexed, Perl-based design.
I stepped into the way-back machine and used pure flat file
extracts, sorts, and key joins using Syncsort.
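
For anyone who has not seen the sort-and-join approach, here is a
minimal sketch of the idea in Python. It is not the Syncsort job
itself. The pipe delimiter, the key-in-first-field layout, the
function names, and the sample records are all made up for
illustration. It also assumes both files are already sorted on the
key, that keys compare correctly as strings (fixed-width zip codes
do), and that each key appears at most once in the census file.
Passing unmatched user records through unchanged is likewise an
assumption.

```python
def parse(line, sep):
    """Split a flat-file line into (key, rest); the key is the first field."""
    key, _, rest = line.rstrip("\n").partition(sep)
    return key, rest


def merge_join(user_lines, census_lines, sep="|"):
    """Append census fields to user records in a single sequential pass.

    Both inputs must already be sorted ascending by key. Census keys
    are assumed unique. User records with no census match are passed
    through unchanged (an assumption, not taken from the original job).
    """
    census = (parse(line, sep) for line in census_lines)
    c_key, c_rest = next(census, (None, None))

    for line in user_lines:
        u_key, u_rest = parse(line, sep)

        # Move the census side forward until its key catches up.
        while c_key is not None and c_key < u_key:
            c_key, c_rest = next(census, (None, None))

        if c_key == u_key:
            yield sep.join([u_key, u_rest, c_rest])
        else:
            yield sep.join([u_key, u_rest])


if __name__ == "__main__":
    # Hypothetical sample data, sorted by zip code.
    users = ["10001|alice\n", "10001|bob\n", "30301|carol\n", "99999|dave\n"]
    census = ["10001|pop=21102\n", "20002|pop=5\n", "30301|pop=300\n"]
    for row in merge_join(users, census):
        print(row)
```

Each file is read once, front to back, so there are no random disk
seeks per record. That is the main reason this style can beat an
indexed lookup on large inputs.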
I am often amazed by the speed of Syncsort. I typically
do not use it for flat file processing. I don't like
depending on commercial utilities that expire. It will
cost me in a few months to renew the license. But I'd
renew anyway; I'm just becoming more dependent on it.
The new process is about 20 times faster than the old one.
Test cases that took 10 hours now run in 30 minutes.
It's worth it.