IWETHEY v. 0.3.0 | TODO

New Goal: Really Fast File Lookup
Given an input file of 10 fields, each field needs to
be checked against multiple lookup tables. Existence
in the lookup table is all I care about, not any
returned value.
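The core loop is just membership tests. A minimal sketch of that shape (the real program is in Perl; the table names and data here are illustrative stand-ins):

```python
# Existence-only checks: for each input field, record which lookup
# tables contain it. No values are returned, only membership.
def check_record(fields, tables):
    """Map each field to the list of table names that contain it."""
    return {
        field: [name for name, table in tables.items() if field in table]
        for field in fields
    }

# Stand-in tables; the real ones range from ~20,000 to 300M records.
tables = {
    "small_a": {"foo", "bar"},
    "small_b": {"bar", "baz"},
}

hits = check_record(["foo", "bar", "qux"], tables)
# hits["bar"] names both tables; hits["qux"] is empty
```

For the small tables (under 20,000 records) an in-memory set like this is trivially fast; the interesting problem is the tables that don't fit in RAM.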

There can be up to 20 lookup tables, with the majority of
them having under 20,000 records.

On the other hand, I do have one with 300,000,000 records.

The big one is about 7.5 GB of raw data, 20 GB as a Perl Berkeley DB
btree file, and 10 GB (so far) as a Berkeley DB hash file. The btree
builds in 2 hours; the hash is still running (4 days, 170,000,000
records so far; damn, I need to figure out how to optimize that if
the hash turns out faster for lookups).
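Since only existence matters, the on-disk table never needs payloads. A rough sketch of that idea using Python's stdlib `dbm` (an assumption for illustration; the post uses the Perl BerkeleyDB bindings, but the empty-value trick applies either way):

```python
import dbm
import os
import tempfile

# Build an on-disk hash for existence-only lookups: store each record
# as a key with an empty value, since no payload is ever read back.
path = os.path.join(tempfile.mkdtemp(), "big_table")

with dbm.open(path, "c") as db:
    for rec in ("alpha", "beta", "gamma"):  # stand-ins for 300M records
        db[rec] = b""                       # existence only, no value

# Lookup is a plain membership test against the on-disk file.
with dbm.open(path, "r") as db:
    print("beta" in db)    # True
    print("delta" in db)   # False
```

Keeping values empty keeps the file as small as the key data allows, which matters when the keys alone are 7.5 GB.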

Most of my time is spent pinning a single CPU doing the lookups,
with occasional wait-for-io. It is a serial process, but can be
partitioned, since every field has to be checked against every table.

Currently running on a VMware instance that has no throttling,
with 4 CPUs allocated. Not sure if I can get more.

I currently have 4 GB of memory. Not sure if I can have more,
but it will never be enough to do entirely without disk backing for
the lookups, so the program assumes that limitation.

It runs too slowly: about 100,000 comparisons a second. I want to
speed it up. I assume I'll split it over CPUs and disk channels,
and possibly even give it its own box to run on so I can control
the disk better.

My ultimate speed goal is 50,000,000 comparisons a second, but
I'd be happy to get to 1,000,000 on current hardware, which means
making the best use of the idle CPU up to the point of IO contention.

Anyone want to propose a specific design? Programming first, hardware
second. I'll probably end up testing multiple approaches as I experiment.
New Brute force
Hire half of Bangladesh and have them do it manually.
--

Drew
New No wetware in the budget.
New Use memcached
Even though you can't cache it all, it will help tremendously.
New Not yet
But when I hit the next step (baby cluster) it'll be helpful.

Thanks
New Map/Reduce?
Difficult to tell if that's applicable given your writeup.
Regards,
-scott
Welcome to Rivendell, Mr. Anderson.
New Pretty much, but in a simpler form
Since I have 4 CPUs to fill at the moment, I rewrote it using Perl threads, and I maintain an active thread list to decide when to release new tasks.

So on this system I hit ~380% CPU (max 400%) with occasional IO bottlenecks. My RPS went from 100K to 250K. I'm not nearly there yet on performance, but it seems I have a direction.

My tasks partition nicely: the input file can easily be split and the pieces sent to individual threads and/or processes.
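The split-and-fan-out pattern described above can be sketched like so (the post uses Perl threads; this Python version with a thread pool is illustrative only, and the chunking arithmetic is the interesting part):

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-in for one lookup table; existence is all that matters.
TABLE = {"foo", "bar", "baz"}

def count_hits(chunk):
    """Existence-only check of one chunk of input fields."""
    return sum(1 for field in chunk if field in TABLE)

def parallel_lookup(fields, workers=4):
    """Split the input into one chunk per worker and merge the counts."""
    size = max(1, (len(fields) + workers - 1) // workers)
    chunks = [fields[i:i + size] for i in range(0, len(fields), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(count_hits, chunks))

print(parallel_lookup(["foo", "qux", "bar", "baz", "nope"]))  # 3
```

One caveat if porting this literally: CPython threads won't parallelize a pure-Python CPU-bound loop because of the GIL, so worker processes would be the closer analogue for CPU-pinned lookups; the partitioning logic is identical either way.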

So I guess the next step is to implement the simplest/cheapest cluster available for me to program the next step on.

Any direction on that?
New Something like this maybe
http://bonsai.ims.u-...ster/software.htm

I've not used any Perl clustering software, but there's probably something performant out there for you. Ben may have (probably has) a better idea.
Regards,
-scott
Welcome to Rivendell, Mr. Anderson.
New Life just got easy
My main lookup table went from around 300 million records to a measly 10 million, or maybe even less. And the core specification dropped from a possible 500 comparisons per input record to a max of 10.

I'll get back to you on the final real requirements, but pretty soon I'll be able to code and run it on my phone.
New Can you combine methods?
For instance, say the hash finishes building and turns out to do the fastest lookups. Can you use map/reduce to build the hash faster?
--

Drew
New Build time is once every 3 months
If that.

Lookup speed is where I'll put my effort.
New If space is not an issue but speed is
use a trie (a prefix tree). However many records you store, you get to an entry in time k, where k is the length of the word you're using as the index.
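A minimal sketch of the suggestion above, to make the O(k) claim concrete (illustrative Python, not the Perl program; a nested-dict trie, where lookup cost depends only on key length, never on record count):

```python
class Trie:
    """Prefix tree for existence-only lookups."""

    def __init__(self):
        self.root = {}

    def add(self, word):
        node = self.root
        for ch in word:
            node = node.setdefault(ch, {})
        node["$"] = True              # end-of-word marker

    def __contains__(self, word):     # O(len(word)) steps
        node = self.root
        for ch in word:
            if ch not in node:
                return False
            node = node[ch]
        return "$" in node

t = Trie()
for w in ("foo", "foobar"):
    t.add(w)
print("foo" in t, "foob" in t)  # True False
```

The end-of-word marker is what distinguishes a stored record ("foo") from a mere prefix of one ("foob"). The space cost is the catch: a naive node-per-character trie can easily exceed the hash file's size, which is why compressed variants (radix trees) exist.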
     Goal: Really Fast File Lookup - (crazy) - (11)
         Brute force - (drook) - (1)
             No wetware in the budget. -NT - (crazy)
         Use memcached - (folkert) - (1)
             Not yet - (crazy)
         Map/Reduce? - (malraux) - (3)
             Pretty much, but in a simpler form - (crazy) - (2)
                 Something like this maybe - (malraux) - (1)
                     Life just got easy - (crazy)
         Can you combine methods? - (drook) - (2)
             Build time is once every 3 months - (crazy) - (1)
                 If space is not an issue but speed is - (jake123)
