IWETHEY v. 0.3.0 | TODO

New Goal: Really Fast File Lookup
Given an input file of 10 fields, each field needs to
be checked against multiple lookup tables. Existence
in the lookup table is all I care about, not any
returned value.
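The core loop is just membership tests. A minimal sketch of that shape (the real program is in Perl; the table names and data here are illustrative stand-ins):

```python
# Existence-only checks: for each input field, record which lookup
# tables contain it. No values are returned, only membership.
def check_record(fields, tables):
    """Map each field to the list of table names that contain it."""
    return {
        field: [name for name, table in tables.items() if field in table]
        for field in fields
    }

# Stand-in tables; the real ones range from ~20,000 to 300M records.
tables = {
    "small_a": {"foo", "bar"},
    "small_b": {"bar", "baz"},
}

hits = check_record(["foo", "bar", "qux"], tables)
# hits["bar"] names both tables; hits["qux"] is empty
```

For the small tables (under 20,000 records) an in-memory set like this is trivially fast; the interesting problem is the tables that don't fit in RAM.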

There can be up to 20 lookup tables, with the majority of
them having under 20,000 records.

On the other hand, I do have one with 300,000,000 records.

The big one is about 7.5 GB of raw data, 20 GB as a Perl Berkeley DB
btree file, and 10 GB (so far) as a Berkeley DB hash file. The btree
builds in 2 hours; the hash is still running (4 days, 170,000,000
records so far; damn, I need to figure out how to optimize that if
the hash turns out faster for lookups).
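Since only existence matters, the on-disk table never needs payloads. A rough sketch of that idea using Python's stdlib `dbm` (an assumption for illustration; the post uses the Perl BerkeleyDB bindings, but the empty-value trick applies either way):

```python
import dbm
import os
import tempfile

# Build an on-disk hash for existence-only lookups: store each record
# as a key with an empty value, since no payload is ever read back.
path = os.path.join(tempfile.mkdtemp(), "big_table")

with dbm.open(path, "c") as db:
    for rec in ("alpha", "beta", "gamma"):  # stand-ins for 300M records
        db[rec] = b""                       # existence only, no value

# Lookup is a plain membership test against the on-disk file.
with dbm.open(path, "r") as db:
    print("beta" in db)    # True
    print("delta" in db)   # False
```

Keeping values empty keeps the file as small as the key data allows, which matters when the keys alone are 7.5 GB.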

Most of my time is spent pinning a single CPU doing the lookups,
with occasional wait-for-io. It is a serial process, but can be
partitioned, since every field has to be checked against every table.

Currently running on a VMware instance that has no throttling,
with 4 CPUs allocated. Not sure if I can get more.

I currently have 4 GB of memory. Not sure if I can have more,
but it will never be enough to do entirely without disk backing for
the lookups, so the program assumes that limitation.

It runs too slowly: about 100,000 comparisons a second. I want to
speed it up. I assume I'll split it over CPUs and disk channels,
and possibly even give it its own box to run on so I can control
the disk better.

My ultimate speed goal is 50,000,000 comparisons a second, but
I'd be happy to get to 1,000,000 on current hardware, which means
making the best use of the idle CPU up to the point of IO contention.

Anyone want to propose a specific design? Programming first, hardware
second. I'll probably end up testing multiple approaches as I experiment.
New Brute force
Hire half of Bangladesh and have them do it manually.
--

Drew
New No wetware in the budget.
New Use memcached
Even though you can't cache it all, it will help tremendously.
New Not yet
But when I hit the next step (baby cluster) it'll be helpful.

Thanks
New Map/Reduce?
Difficult to tell if that's applicable given your writeup.
Regards,
-scott
Welcome to Rivendell, Mr. Anderson.
New Pretty much, but in a simpler form
Since I have 4 CPUs to fill at the moment, I rewrote it using Perl threads, and I maintain an active thread list to decide when to release new tasks.

So on this system I hit ~380% CPU (max 400%) with occasional IO bottlenecks. My RPS went from 100K to 250K. I'm not nearly there yet on performance, but it seems I have a direction.

My tasks partition nicely: the input file can easily be split and the pieces sent to individual threads and/or processes.
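The split-and-fan-out pattern described above can be sketched like so (the post uses Perl threads; this Python version with a thread pool is illustrative only, and the chunking arithmetic is the interesting part):

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-in for one lookup table; existence is all that matters.
TABLE = {"foo", "bar", "baz"}

def count_hits(chunk):
    """Existence-only check of one chunk of input fields."""
    return sum(1 for field in chunk if field in TABLE)

def parallel_lookup(fields, workers=4):
    """Split the input into one chunk per worker and merge the counts."""
    size = max(1, (len(fields) + workers - 1) // workers)
    chunks = [fields[i:i + size] for i in range(0, len(fields), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(count_hits, chunks))

print(parallel_lookup(["foo", "qux", "bar", "baz", "nope"]))  # 3
```

One caveat if porting this literally: CPython threads won't parallelize a pure-Python CPU-bound loop because of the GIL, so worker processes would be the closer analogue for CPU-pinned lookups; the partitioning logic is identical either way.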

So I guess the next step is to implement the simplest/cheapest cluster available for me to program the next step on.

Any direction on that?
New Something like this maybe
http://bonsai.ims.u-...ster/software.htm

I've not used any Perl clustering software, but there's probably something performant out there for you. Ben may have (probably has) a better idea.
Regards,
-scott
Welcome to Rivendell, Mr. Anderson.
New Life just got easy
My main lookup table went from around 300 million records to a measly 10 million, or maybe even less. And the core specification dropped from a possible 500 comparisons per input record to a max of 10.

I'll get back to you on the final real requirements, but pretty soon I'll be able to code and run it on my phone.
New Can you combine methods?
For instance, say the hash finishes building and turns out to do the fastest lookups. Can you use map/reduce to build the hash faster?
--

Drew
New Build time is once every 3 months
If that.

Lookup speed is where I'll put my effort.
New If space is not an issue but speed is
use a trie (a prefix tree). However many records you store, you get to an entry in time k, where k is the length of the word you're using as the index.
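A minimal sketch of the suggestion above, to make the O(k) claim concrete (illustrative Python, not the Perl program; a nested-dict trie, where lookup cost depends only on key length, never on record count):

```python
class Trie:
    """Prefix tree for existence-only lookups."""

    def __init__(self):
        self.root = {}

    def add(self, word):
        node = self.root
        for ch in word:
            node = node.setdefault(ch, {})
        node["$"] = True              # end-of-word marker

    def __contains__(self, word):     # O(len(word)) steps
        node = self.root
        for ch in word:
            if ch not in node:
                return False
            node = node[ch]
        return "$" in node

t = Trie()
for w in ("foo", "foobar"):
    t.add(w)
print("foo" in t, "foob" in t)  # True False
```

The end-of-word marker is what distinguishes a stored record ("foo") from a mere prefix of one ("foob"). The space cost is the catch: a naive node-per-character trie can easily exceed the hash file's size, which is why compressed variants (radix trees) exist.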
     Goal: Really Fast File Lookup - (crazy) - (11)
         Brute force - (drook) - (1)
             No wetware in the budget. -NT - (crazy)
         Use memcached - (folkert) - (1)
             Not yet - (crazy)
         Map/Reduce? - (malraux) - (3)
             Pretty much, but in a simpler form - (crazy) - (2)
                 Something like this maybe - (malraux) - (1)
                     Life just got easy - (crazy)
         Can you combine methods? - (drook) - (2)
             Build time is once every 3 months - (crazy) - (1)
                 If space is not an issue but speed is - (jake123)
