Post #209,840
6/4/05 4:25:24 PM
|
Suppression code
My niece has a project that involves loading suppressing 4MM phone numbers against 11,000 records. Seems like direct marketing is genetic, ehh?
Anyway, she's a web monkey wannabe, and this project fell into her lap. She could not possibly code anything like this, at least not having it run in our lifetime.
So she asked her brother to help, who is a perl wannabe.
The program has been running since yesterday on a 2.4Ghz Windows PC.
hehehehehe.
So anyway, I offered to help. My goal was to write something both quickly runnable and novice programmer friendly, at the same time. So while the code COULD be 10 lines (or 2 if you're Ben), mine is far larger with thge error checking, the stats counting, the reporting, etc.
Now that I HAVE the long version, I wanted to come up with a quick running short version for comparision.
So, here's the challenge.
Load up a suppression phone list file. Don't forget to get rid of all non-numerics. Read fixed length records, pulling the phone number, get rid of non-numerics, and output when not in the suppression list.
I seem to recall dealing with multiple files on the command line differently, actually with some Peter code. Time to go look it up.
|
Post #209,841
6/4/05 4:47:15 PM
|
Re: Suppression code
If you're willing to throw lots of memory at the problem (say, ~1.2 Gb, assuming 10 digit phone numbers), you can turn each phone number into a bit index. Allocate a bit vector large enough to hold 10 digits worth of bits, then set the corresponding bit in the vector for excluded numbers, and probe into that vector to see if a number is excluded.
Taking advantage of the area code space being sparce, you can save some memory. Break each number into a 3 digit area code, and a 7 digit number, then lazily allocate the bit vector for the 7 digit number when you see the prefix in the exclude list. A few extra steps to build up your vectors and probe them, but still damn fast if you can fit everything into physical memory.
Implementation is left as an exercise, or as a challenge to Ben to do in under 100 characters.
|
Post #209,844
6/4/05 5:36:39 PM
|
I don't golf real code
I have come to believe that idealism without discipline is a quick road to disaster, while discipline without idealism is pointless. -- Aaron Ward (my brother)
|
Post #209,846
6/4/05 6:31:33 PM
|
Golf - huh?
|
Post #209,848
6/4/05 6:55:38 PM
|
dws' "100 characters" comment is about golf
Golf is the art of writing programs with as few "strokes" as possible. For instance a program to print all of the permutations of 12345, each on a seperate line, might be: \nperl -le'map/(.).*\\1/||print,glob"{1,2,3,4,5}"x5'\n I've..been known to play in the past. :-) Cheers, Ben
I have come to believe that idealism without discipline is a quick road to disaster, while discipline without idealism is pointless. -- Aaron Ward (my brother)
|
Post #209,852
6/4/05 7:30:34 PM
|
OH
No thanks on that.
My niece had a huge blowup with her dad, about it being wrong for me to help.
This is NOT homework - this is a paid gig! And she is currently VERY subsidised by her dad, my blind brother, so she's not THAT proud.
Her brother, OK, since he knows less than her, but me, BAD!
I called her and told her I just spent the last hour writing comments, not code, and I'd really appreciate if she could do me the favor to review what I wrote. She does not have to use it, but it'll make me feel better if she at least reads and give me feedback on it. Make an old man happy. Humor me.
Of course, in the cover email I explained how it'll run in about 20 seconds based on my test data, and it is hopefully generic enough (all command line driven, handles fixed, csv, and tab, error checks and logs out the wazzoo). And, that I've never actually fully run it, so it is up to her to determine if it works, and possibly fix it.
Hopefully it was enough of a teaser that she makes use of it and learns a bit.
|
Post #209,856
6/4/05 8:24:40 PM
|
Further thinking while walking the dog
I'd sure like to see sample exclusion data. Using the first six digits (area code + prefix) to lookup a lazily-created bit vector for the last four digits might save the most memory. Area codes are sparce, as, in some area codes, are prefixes. The combination might minimize the number of bit vectors you'd have to create to probe for the final four digits.
|
Post #209,977
6/5/05 11:27:45 PM
8/21/07 6:06:26 AM
|
Yep, that's what I'd do
This is the same problem as the Social Security Number validator - assume you have a list of all valid SSN's. Design a program/data structure to rapidly tell you if a given SSN is valid.
Bit vector turns out to be the only really fast solution that fits in memory.
"Whenever you find you are on the side of the majority, it is time to pause and reflect" --Mark Twain
"The significant problems we face cannot be solved at the same level of thinking we were at when we created them." --Albert Einstein
"This is still a dangerous world. It's a world of madmen and uncertainty and potential mental losses." --George W. Bush
|