Given an input file of 10 fields, each field needs to
be checked against multiple lookup tables. Existence
in a lookup table is all I care about, not any value
returned.
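To make the check concrete, here is a minimal sketch of what I mean. It is in Python rather than my actual Perl, purely for illustration; the table names, the sample values, and the `check_record` helper are all made up, and plain in-memory sets stand in for the real lookup files.

```python
# Hypothetical stand-ins: the real tables live in files / Berkeley DB,
# not in Python sets. Names and values here are invented for illustration.

def check_record(fields, tables):
    """Return {(field_index, table_name): bool} for every field/table pair.

    Only existence is recorded; no value is fetched from the table.
    """
    return {
        (i, name): (value in table)
        for i, value in enumerate(fields)
        for name, table in tables.items()
    }

tables = {
    "colors": {"red", "green", "blue"},
    "sizes": {"S", "M", "L"},
}
record = ["red", "XL", "M"]  # a 3-field record to keep the example short

hits = check_record(record, tables)
```

Every field is tested against every table, so the number of comparisons per record is fields times tables.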
There can be up to 20 lookup tables, with the majority of
them having under 20,000 records.
On the other hand, I do have one with 300,000,000 records.
The big one is about 7.5 GB of raw data, 20 GB as a Perl Berkeley
Btree file, and 10 GB (so far) as a Berkeley Hash file. The btree builds
in 2 hours; the hash is still building (4 days, 170,000,000 records
so far, damn; I'll need to figure out how to optimize that if
I find hash is faster for lookups).
Most of my time is spent pinning a single CPU doing the lookups,
with occasional wait-for-io. It is a serial process, but can be
partitioned, since every field has to be checked against every table.
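One way the partitioning could look, as a rough sketch only: deal the tables out to worker processes and add up the per-worker hit counts. Again this is Python for illustration, not my Perl; `split_tables`, `count_hits`, and `run_parallel` are hypothetical names, the data is invented, and the sketch ships in-memory sets to the workers, which would not work as-is for the big disk-backed table.

```python
from concurrent.futures import ProcessPoolExecutor

def split_tables(table_names, n_workers):
    """Deal table names round-robin into at most n_workers non-empty groups."""
    groups = [[] for _ in range(n_workers)]
    for i, name in enumerate(table_names):
        groups[i % n_workers].append(name)
    return [g for g in groups if g]

def count_hits(task):
    """Count field/table existence hits for one (records, tables) task."""
    records, tables = task
    return sum(
        1
        for record in records
        for value in record
        for table in tables.values()
        if value in table
    )

def run_parallel(records, tables, n_workers=4):
    """Check every field against every table, one table group per process."""
    groups = split_tables(sorted(tables), n_workers)
    tasks = [(records, {name: tables[name] for name in g}) for g in groups]
    with ProcessPoolExecutor(max_workers=n_workers) as pool:
        return sum(pool.map(count_hits, tasks))

# Invented sample data, small enough to check by hand.
sample_tables = {
    "colors": {"red", "green", "blue"},
    "sizes": {"S", "M", "L"},
}
sample_records = [["red", "XL"], ["blue", "M"]]
```

Calling `run_parallel(sample_records, sample_tables)` from under an `if __name__ == "__main__":` guard should give the same total as a single `count_hits` call over all tables. Splitting by input records instead of by table would be the other obvious cut.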
Currently running on a VMware instance that has no throttling,
with 4 CPUs allocated. Not sure if I can have more.
I currently have 4GB of memory. Not sure if I can have more,
but it will never be enough to do entirely without disk backing for the
lookups, so I have programmed assuming that limitation.
It runs too slow. About 100,000 comparisons a second. I want to
speed it up. I assume I'll split it over CPUs and disk channels,
and possibly even give it its own box to run on so I can control
the disk better.
My ultimate speed goal is 50,000,000 comparisons a second, but
I'd be happy to get to 1,000,000 on current hardware, which means
making the best use of the idle CPUs up to the point of IO contention.
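For scale, here is the back-of-envelope arithmetic behind those targets, written as a small Python snippet. It assumes one "comparison" means one field checked against one table, that all 20 tables are in play, and that work scales perfectly across the 4 CPUs; none of that is measured.

```python
FIELDS_PER_RECORD = 10       # from the post
MAX_TABLES = 20              # upper bound from the post
CPUS = 4                     # currently allocated

CURRENT_RATE = 100_000       # comparisons per second today
NEAR_GOAL = 1_000_000        # target on current hardware
ULTIMATE_GOAL = 50_000_000   # ultimate target

# Assumption: one comparison = one field checked against one table.
comparisons_per_record = FIELDS_PER_RECORD * MAX_TABLES
records_per_second = CURRENT_RATE / comparisons_per_record

near_speedup = NEAR_GOAL / CURRENT_RATE
ultimate_speedup = ULTIMATE_GOAL / CURRENT_RATE

# Assumption: perfect scaling over CPUS, with no IO contention.
per_cpu_speedup_needed = near_speedup / CPUS

print(comparisons_per_record, records_per_second)
print(near_speedup, ultimate_speedup, per_cpu_speedup_needed)
```

So under those assumptions, spreading over 4 CPUs alone gets at most 4x; the 1,000,000 target needs 10x, which still leaves a 2.5x gain to find per CPU, and the 50,000,000 target is a 500x jump.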
Anyone want to propose a specific design? Programming 1st, hardware
second. I'll probably end up testing multiples as I experiment.