- Number, variable length, 5-11 digits (which turns out to be the primary key as well)
- Proper Name (always correct)
- Previous Acceptable Name or Previous Not-Acceptable Name
- Previous Acceptable Name or Previous Not-Acceptable Name
- Previous Acceptable Name or Previous Not-Acceptable Name
- Previous Acceptable Name or Previous Not-Acceptable Name
- Previous Acceptable Name or Previous Not-Acceptable Name
- Previous Acceptable Name or Previous Not-Acceptable Name
- Previous Acceptable Name or Previous Not-Acceptable Name
- Previous Acceptable Name or Previous Not-Acceptable Name
- Previous Acceptable Name or Previous Not-Acceptable Name
- Previous Acceptable Name or Previous Not-Acceptable Name
- Previous Acceptable Name or Previous Not-Acceptable Name
- Previous Acceptable Name or Previous Not-Acceptable Name
- Previous Acceptable Name or Previous Not-Acceptable Name
- Previous Acceptable Name or Previous Not-Acceptable Name
- Previous Acceptable Name or Previous Not-Acceptable Name
- Previous Acceptable Name or Previous Not-Acceptable Name
- Previous Acceptable Name or Previous Not-Acceptable Name
- Previous Acceptable Name or Previous Not-Acceptable Name
- Previous Acceptable Name or Previous Not-Acceptable Name
- Previous Acceptable Name or Previous Not-Acceptable Name
- Previous Acceptable Name or Previous Not-Acceptable Name
- Previous Not-Acceptable Name
- Previous Not-Acceptable Name
- Previous Not-Acceptable Name
- Previous Not-Acceptable Name
- Previous Not-Acceptable Name
- Previous Not-Acceptable Name
- Previous Not-Acceptable Name
- Previous Not-Acceptable Name
- Previous Not-Acceptable Name
- Previous Not-Acceptable Name
In this data, guess what distinguishes a "Previous Acceptable Name" from a "Previous Not-Acceptable Name"...
NOTHING. Sweet!
The company said the only way to get at that kind of data (yes, even the IT people don't know how the systems work) is through a web interface that requires a login and supposedly only works with IE.
Well, besides IE it also works with Firefox, lynx and curl. I can look up by number or by name. If I look up by number, it gives me all the info I need, except that the HTML source is horribly formatted.
So I have to dump the HTML response of each number lookup to a file. Then I run a script against the 580K files, which strips all the unneeded info and coagulates everything for one number onto one line, properly tab-delimited, with an "A-" separator for the Previous Acceptable Names and an "N-A" separator for the Previous Not-Acceptable Names. It also puts all of this into a single file for ease of use. Major pain: the website sometimes returns a null reply or drops the network connection.
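The null replies and dropped connections can be worked around by retrying each lookup until the saved file is non-empty. This is only a rough sketch of that idea, not the script I actually ran: the URL, the cookie file, and the names `fetch`, `fetch_with_retry`, `LOOKUP_URL`, `COOKIE_JAR` and `RETRY_DELAY` are all made up for illustration.

```shell
#!/bin/bash
# Hypothetical values: the real lookup URL and login cookie file are not shown here.
LOOKUP_URL="${LOOKUP_URL:-https://example.invalid/lookup?num=}"
COOKIE_JAR="${COOKIE_JAR:-cookies.txt}"

# Download one lookup page ($1 = number) into a file ($2 = output path).
fetch() {
    curl -sS --fail --max-time 60 -b "$COOKIE_JAR" -o "$2" "${LOOKUP_URL}$1"
}

# Retry until the response file is non-empty, up to $3 attempts (default 5).
fetch_with_retry() {
    local num="$1"
    local out="$2"
    local tries="${3:-5}"
    local i=1
    while [ "$i" -le "$tries" ]; do
        # A "null reply" leaves an empty file, so test the size as well as the exit code.
        if fetch "$num" "$out" && [ -s "$out" ]; then
            return 0
        fi
        sleep "${RETRY_DELAY:-2}"
        i=$((i + 1))
    done
    echo "giving up on $num after $tries attempts" >&2
    return 1
}
```

It would be driven by something like `while read -r num; do fetch_with_retry "$num" "html/$num.html"; done < numbers.txt`, where `numbers.txt` is a hypothetical list of the numbers to look up.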
Then I get to run another script that chops each record up into many fields, so in the end I have a file with 80 fields per record, something like:
- Num_Desig (PrimaryKey and Indexed)
- CurrentName char 40 (Indexed)
- PAN[1-30] char 40
- PNAN[1-48] char 40
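The 80 fields add up as 1 + 1 + 30 + 48. As an illustration of the chopping step (a sketch, not my original script), here is one way to pad every record out to that fixed width. It assumes each merged record is a single tab-delimited line of the form number, current name, "A-", acceptable names..., "N-A", not-acceptable names..., and it uses awk, which was not among the pieces I used; `to_fixed_fields` is a made-up name.

```shell
# Expand one merged record per line into a fixed 80-column row:
# Num_Desig, CurrentName, PAN1..PAN30, PNAN1..PNAN48.
# Assumed input layout (tab-delimited): Num, CurrentName, "A-", names..., "N-A", names...
to_fixed_fields() {
    awk -F'\t' -v OFS='\t' -v max_pan=30 -v max_pnan=48 '
    {
        n_pan = 0; n_pnan = 0; mode = ""
        for (i = 3; i <= NF; i++) {
            # The separator fields switch which name list is being filled.
            if ($i == "A-")  { mode = "pan";  continue }
            if ($i == "N-A") { mode = "pnan"; continue }
            if (mode == "pan"  && n_pan  < max_pan)  pan[++n_pan]   = $i
            if (mode == "pnan" && n_pnan < max_pnan) pnan[++n_pnan] = $i
        }
        # Emit the key and current name, then pad each list with empty fields.
        row = $1 OFS $2
        for (i = 1; i <= max_pan;  i++) row = row OFS (i <= n_pan  ? pan[i]  : "")
        for (i = 1; i <= max_pnan; i++) row = row OFS (i <= n_pnan ? pnan[i] : "")
        print row
    }'
}
```

Used as `to_fixed_fields < merged.txt > fixed.txt` (both file names hypothetical), every output line has the same 80 columns, so a given PAN or PNAN slot can be pulled out with a plain `cut -f`.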
The pieces I used were:
- bash
- cat
- sed
- grep (and egrep)
- cut
- sort
- tee
- uniq
- the pipe symbol (|)
- curl
- and stdout redirect (>)
How do you like that? Now let me see you do that on Windows without Cygwin or other *NIX tool packages added.
    cat "$INFILE" | grep -e 'N-A ' | grep -e 'A-' | \
        sed -e 's/ /|/' -e 's/, /\t/g' -e 's/A- /\t/' -e 's/ N-A /|/' | \
        cut -f1,3 -d'|' | sed -e 's/|/\t/' | cut -f1,"$FIELD" | \
        cut -c1-6,10- | grep -e '[AEIOUYRSTLN]' >> "$OUTFILE"
Total time involved (not counting waiting for the webserver): 2 hours, mostly spent getting the data into an easily machine-readable, tab-delimited format.