Post #241,777
1/18/06 4:11:09 PM
|

Postal address list cleansing
We do mass mailings of ~200k pieces once or twice a month. A long time ago, a few decisions were made that ended up leaving our database with a mess of duplicate names. I estimate somewhere in the neighborhood of 3-5%. At the time, the mailing/advertising agency we worked with deduped the mailing files by address so one piece would go to each address. We have a new vendor that doesn't do this - partly because they want to sell us on their database product (ASP.NET on SQL Server with IIS; the current vendor is MySQL/Unify, Apache, Perl, Linux), which does have better list hygiene tools than what we currently use, but falls well short in other areas. In fact it's the same vendor I posted about [link|http://z.iwethey.org/forums/render/content/show?contentid=203201|here].
Work is underway to lessen the duplicates getting into the file in the first place, so at least that part's being addressed. The immediate problem for me to solve is twofold, with a complication. The complication is that the db and software aren't under my control, so db schema changes are probably not in the cards and any custom software would have to be Perl. The current vendor is [link|http://www.donorware.com|DonorWare], and their infrastructure is described [link|http://www.donorware.com/about/infrastructure/|here]. The problems are 1) I need to clean our existing database, and 2) I'll still need to clean outgoing data. Once clean, the database will still have legitimate duplicate addresses - like a father and son at the same house - and the desire is usually, but not always, to send one piece to each address.
The software from DonorWare does include some duplicate checking and merging tools, but deciding whether what the computer matched is truly a duplicate is still left to the humans. We are working with them to improve this process and get it automated, obviously accepting some risk of false matches. I thought I'd throw this out here for the IWTGM to process, too.
Any thoughts or suggestions? Thanks in advance for any pointers. I'm off to start searching through CPAN now :)
-- Steve [link|http://www.ubuntulinux.org|Ubuntu]
|
Post #241,783
1/18/06 4:30:48 PM
|

I just did a deduping project kind of like that
Smaller scale, but the approach I took was to use our existing contract with MapQuest to have them geocode all of the addresses, then I matched them up by latitude/longitude. (We already have custom software to make those requests from MapQuest.)
It took, I think, 19 hours to process just over 100,000 addresses, and they successfully geocoded a bit over 98% of them.
Barry probably has a better approach to do the same thing, but this worked for a one-off. (Given the existing contract and mapquest code...)
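In case it helps picture it, the matching step after the geocoding is done can be a few lines of Perl. This is only a sketch with a made-up tab-delimited layout (id, name, lat, long), not the actual code:

#!/usr/bin/perl
use strict;
use warnings;

# Sketch: expects a tab-delimited dump of id, name, lat, long on STDIN.
# Rounding to 5 decimal places (roughly a meter) soaks up tiny precision
# differences; loosen or tighten to taste.
my %by_location;
while (my $line = <>) {
    chomp $line;
    my ($id, $name, $lat, $long) = split /\t/, $line;
    next unless defined $long;
    my $key = sprintf "%.5f,%.5f", $lat, $long;
    push @{ $by_location{$key} }, "$id\t$name";
}

for my $key (sort keys %by_location) {
    my @recs = @{ $by_location{$key} };
    next unless @recs > 1;
    print "Possible duplicates at $key:\n";
    print "  $_\n" for @recs;
}

Anything the geocoder can't resolve (the ~2% in my case) still has to be handled separately.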
Cheers, Ben
I have come to believe that idealism without discipline is a quick road to disaster, while discipline without idealism is pointless. -- Aaron Ward (my brother)
|
Post #241,792
1/18/06 6:13:48 PM
|

Thanks for the tip!
Did some searching for geocoding on CPAN and ended up [link|http://geocoder.us|here]. Looks useful so far, going to run some tests with it tomorrow. I'm thinking about geocoding our entire database, seems like a great idea.
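If the Geo::Coder::US module on CPAN (which appears to be what's behind that site) works the way I'm reading its docs - you build a local geocoder.db from TIGER/Line data first, then call it as class methods - the first test should be about this small. Treat the details as unverified until I've actually run it:

use strict;
use warnings;
use Geo::Coder::US;

# geocoder.db has to be built from TIGER/Line data per the module docs
Geo::Coder::US->set_db("geocoder.db");

my ($match) = Geo::Coder::US->geocode("1600 Pennsylvania Ave, Washington DC");
if ($match and defined $match->{lat}) {
    print "$match->{lat}, $match->{long}\n";
} else {
    print "no match\n";
}

If that pans out, geocoding the whole database would just be a loop over a dump of our address fields.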
-- Steve [link|http://www.ubuntulinux.org|Ubuntu]
|
Post #241,787
1/18/06 5:16:50 PM
|

do a dump then sort by address. the dupes are identified
bundle them into a file and write a pearl dbi to seek the records and delete them. thanx, bill
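ps: a rough sketch of that, table and column names made up, only catches records whose address strings match exactly, and it keeps the first record at each address rather than whacking them all:

#!/usr/bin/perl
# rough sketch: keep the lowest id per exact address string, delete the rest.
# exact matching misses "st" vs "street" type variations.
use strict;
use warnings;
use DBI;

my $dbh = DBI->connect("dbi:mysql:database=mail", "user", "pass",
                       { RaiseError => 1, AutoCommit => 0 });

my $sth = $dbh->prepare(
    "SELECT id, CONCAT_WS('|', address1, city, state, zip) AS addr
       FROM people ORDER BY addr, id");
$sth->execute;

my @doomed;
my $last_addr;
while (my ($id, $addr) = $sth->fetchrow_array) {
    if (defined $last_addr and $addr eq $last_addr) {
        push @doomed, $id;        # later record at the same address
    } else {
        $last_addr = $addr;       # first record at this address: keep it
    }
}

my $del = $dbh->prepare("DELETE FROM people WHERE id = ?");
$del->execute($_) for @doomed;
$dbh->commit;
$dbh->disconnect;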
Any opinions expressed by me are mine alone, posted from my home computer, on my own time as a free american and do not reflect the opinions of any person or company that I have had professional relations with in the past 50 years. meep
|
Post #241,788
1/18/06 5:32:18 PM
|

YM, delete HALF of them... Or NOBODY at that address is left
|
Post #241,790
1/18/06 5:40:57 PM
|

2 things
1. The language is Perl. Note capitalization and spelling.
2. The problem with deduping is that the same address can be entered in many different ways - for instance, Street vs St - so you have to handle variations. And trust me, you'll get a *lot* of variations. While looking for exact matches entered twice finds some duplicates, it leaves enough of a problem that people generally want to do something better. Sometimes you'll see St and other times Street. People move around where the apartment number goes in the address. (Not an issue for my project, but for this one it would be.)
If it is small enough to go through by hand, humans will handle lots of those issues correctly. But coding up a program to handle these issues is surprisingly tricky.
And even humans run into problems with this. For instance, a human who is not familiar with Denver may think that Colorado St and Colorado Rd are the same. A human who is not familiar with Boston may think that if two addresses have the same city name, street name and street number, then having the zip be off in one digit has to be a typo. Both times they'd be wrong.
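To give a feel for it: a hand-rolled normalizer like the sketch below (illustrative only, with a tiny made-up abbreviation table) catches some of the easy variations, but it does nothing for the Denver and Boston style traps above.

use strict;
use warnings;

sub crude_normalize {
    my ($addr) = @_;
    # Tiny, obviously incomplete abbreviation table - illustration only.
    my %abbrev = (
        street => 'ST',  st  => 'ST',  avenue => 'AVE', ave => 'AVE',
        road   => 'RD',  rd  => 'RD',  drive  => 'DR',  dr  => 'DR',
        apartment => 'APT', apt => 'APT', suite => 'STE', ste => 'STE',
        north  => 'N',   south => 'S', east   => 'E',   west => 'W',
    );
    $addr = uc $addr;
    $addr =~ s/[.,#]//g;     # drop punctuation
    $addr =~ s/\s+/ /g;      # collapse whitespace
    return join ' ',
        map { exists $abbrev{lc $_} ? $abbrev{lc $_} : $_ } split ' ', $addr;
}

print crude_normalize("123 North Main Street, Apt. 4"), "\n";   # 123 N MAIN ST APT 4

Every entry in that table is a judgment call, and the real list of suffixes and variants is far longer, which is part of why this is trickier than it looks.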
Ben
I have come to believe that idealism without discipline is a quick road to disaster, while discipline without idealism is pointless. -- Aaron Ward (my brother)
|
Post #241,791
1/18/06 6:08:25 PM
|

so going by lat and long makes more sense, thx
Any opinions expressed by me are mine alone, posted from my home computer, on my own time as a free american and do not reflect the opinions of any person or company that I have had professional relations with in the past 50 years. meep
|
Post #241,793
1/18/06 6:18:11 PM
|

There are apps that standardize addresses
Yes, like Ben said it's complex. But there are off-the-shelf apps to do it for you. What we did (at my last gig) was keep the original address and the standardized address for each record. Match on standardized for dupes, and show a person the matches.
We were checking one by one in real time to see if we had done work on a property before, but you could easily do this for a large set and query for dupes to get a sense of scale. The package we used also had an optional web service to check whether the standardized address was recognized by the USPS.
Checking for lat/lon is just an abstracted version of this process. The advantage of using addresses is that looking at what comes out the other end as "standardized" is human-verifiable.
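A rough sketch of the dupe query against that kind of table, with made-up table and column names (orig_address kept alongside std_address):

use strict;
use warnings;
use DBI;

my $dbh = DBI->connect("dbi:mysql:database=mail", "user", "pass",
                       { RaiseError => 1 });

# Groups of records sharing a standardized address...
my $groups = $dbh->selectall_arrayref(
    "SELECT std_address, COUNT(*) AS n
       FROM people
      GROUP BY std_address
     HAVING COUNT(*) > 1");

# ...then show the as-entered originals so a person can eyeball each match.
for my $group (@$groups) {
    my ($std, $n) = @$group;
    print "$n records standardized to: $std\n";
    my $rows = $dbh->selectall_arrayref(
        "SELECT id, name, orig_address FROM people WHERE std_address = ?",
        undef, $std);
    print "    [$_->[0]] $_->[1] | $_->[2]\n" for @$rows;
}

Run over the whole list, that gives the sense-of-scale view; per record in real time, it's the check-before-you-work version we were doing.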
===
Purveyor of Doc Hope's [link|http://DocHope.com|fresh-baked dog biscuits and pet treats]. [link|http://DocHope.com|http://DocHope.com]
|
Post #241,808
1/18/06 8:29:18 PM
|

Here ya go
#1 - Isolate out address elements into address1, address2, city, state, zip.
#2 - Run through a CASS certified address standardization program that does zip+4 appends. Do not bother trying to write this yourself. You MUST buy it or rent someone else's. They have the Postal database with all the street synonyms and aliases, route number->friendly name, list of legal address ranges per street, etc. There is NO WAY you can code a program to do this without the postal database.
I've used a variety of them depending on accuracy desired vs cost and speed. If you like, give me your list and I'll run it through.
#3 - Merge/Purge. You need to determine what to keep vs drop, right? Start thinking about the different ways people write their 1st name (Rob, Robert, Bob), etc. Then add in typos. Account for titles, suffixes, etc. Then start adding in confidence factors - do you want to err on the side of sending more or less mail? Oh, don't forget the married vs maiden name, or the multi-family / mother-in-law living together. Who exactly IS Mrs John Smith? Do you drop her if you have another female Smith in the household? Oh, and do you want individual, family (single last name), or household (multiple last names)? What about company records?
The better MP programs are amazing. Poorer matchcode-based ones will consider a lot of obvious (to the human eye) dupes as unique people. My home-brewed one is matchcode based, which means more false uniques, but I can't offer my corp one to you.
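For anyone who hasn't run into the term: a matchcode is just a fixed-length key boiled down from the standardized record, something in the spirit of this (purely illustrative - real packages weigh far more signals):

use strict;
use warnings;

# Purely illustrative matchcode: ZIP + first 4 of last name + house number
# + first initial. Records sharing a code become dedupe candidates.
sub matchcode {
    my ($first, $last, $address, $zip) = @_;
    my ($house) = $address =~ /^(\d+)/;     # leading house number, if any
    return uc sprintf "%5.5s%4.4s%5.5s%1.1s",
        $zip, $last, (defined $house ? $house : ''), $first;
}

print matchcode("Robert", "Smith", "123 Main St", "48103"), "\n";   # 48103SMIT  123R
print matchcode("Bob",    "Smith", "123 Main St", "48103"), "\n";   # 48103SMIT  123B

Note that Bob and Robert at the same address come out as different codes here, which is exactly the false-unique problem with the cruder matchcode approach.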
Dunno about Ben's geocoding answer. Rooftop geocoding is usually way too expensive for casual use, and centroid is way too wide for deduping purposes.
|
Post #241,815
1/18/06 9:02:40 PM
|

I think #2 does what I did
I only used geocoding because it was free for me and I had code for it. I was using it to turn an address into a location, and using that to match addresses. But if you can standardize addresses another way, go for it.
For my project I was able to ignore the name issue - I was just merging several lists into one, and needed to spit out "these aren't duplicated" and "these may be, look them over by eye". So it was only semi-automated.
Cheers, Ben
I have come to believe that idealism without discipline is a quick road to disaster, while discipline without idealism is pointless. -- Aaron Ward (my brother)
|
Post #241,819
1/18/06 9:10:51 PM
|

Your geocoding process standardized first
When we geocode stuff, 1st step is to standardize. They just did it for you under the covers.
|
Post #241,821
1/18/06 9:26:04 PM
1/18/06 9:28:16 PM
|

Exactly
Come to think of it, I really did need the geocoding.
There is a problem with apartment buildings where on one lot you have 2 buildings with different addresses, but they're really the same. So I flagged as possible dupes to review any two that were close. (I used "within adjacent 100' by 100' boxes", so that range was 100-300 feet.) That I couldn't have done without geocoding, but it was only a nice-to-have anyway. And it obviously doesn't apply in this case.
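If anyone wants the flavor of the box trick, something like this sketch (made-up record layout, rough feet-per-degree conversion, not the actual code):

use strict;
use warnings;

# Bucket every geocoded point into roughly 100' x 100' boxes, then compare
# each record only against records in its own box and the 8 neighbors.
# Conversion factors are rough; good enough for "close enough to eyeball".
my $lat_deg_per_100ft = 100 / 364_000;   # ~364,000 feet per degree of latitude

sub box_for {
    my ($lat, $long) = @_;
    my $long_deg_per_100ft = $lat_deg_per_100ft / cos($lat * 3.14159265 / 180);
    return (int($lat / $lat_deg_per_100ft), int($long / $long_deg_per_100ft));
}

my @records = ();   # would be loaded as [ id, lat, long ] from the geocoded dump
my %grid;
for my $rec (@records) {
    my ($bx, $by) = box_for($rec->[1], $rec->[2]);
    push @{ $grid{"$bx,$by"} }, $rec;
}

for my $rec (@records) {
    my ($bx, $by) = box_for($rec->[1], $rec->[2]);
    for my $dx (-1 .. 1) {
        for my $dy (-1 .. 1) {
            my $neighbors = $grid{ ($bx + $dx) . ',' . ($by + $dy) } or next;
            for my $other (@$neighbors) {
                next if $other->[0] <= $rec->[0];   # skip self and mirror pairs
                print "review: $rec->[0] vs $other->[0]\n";
            }
        }
    }
}

Anything landing in the same box or an adjacent one gets flagged for a human to look at.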
Cheers, Ben
I have come to believe that idealism without discipline is a quick road to disaster, while discipline without idealism is pointless. -- Aaron Ward (my brother)

Edited by ben_tilly
Jan. 18, 2006, 09:28:16 PM EST
|
Post #241,833
1/18/06 11:29:46 PM
|

Re: Here ya go
#1: have those as discrete fields already
#2: actually did CASS and NCOA to the entire database Tuesday (Donnelly Marketing was the vendor), and loaded the changes today.
#3: ahhh, the kicker. Is there a specific merge-purge program you've used or would recommend? More on what we do later; need to spend some time with my son.
-- Steve [link|http://www.ubuntulinux.org|Ubuntu]
|
Post #241,834
1/18/06 11:36:11 PM
|

ObLRPD: "Vote him off the island!"
The Dot is feeling cranky tonight.
;-)
Enjoy being with your boy.
Cheers, Scott.
|
Post #241,836
1/18/06 11:49:43 PM
|

That'd be harder than giving a bath to a bobcat.
Regards,
-scott anderson
"Welcome to Rivendell, Mr. Anderson..."
|
Post #241,842
1/19/06 1:43:28 AM
|

the trick is...
the anaesthetic dart beforehand. hth :-)
Have fun, Carl Forde
|
Post #241,853
1/19/06 7:45:39 AM
|

Talk your talk, wee man.
Regards,
-scott anderson
"Welcome to Rivendell, Mr. Anderson..."
|
Post #241,837
1/18/06 11:50:33 PM
|

Firstlogic match/consolidate is verra nice
Group1 MP is (shudder) OK, but the interface is horrendous.
But there is a large amount of tuning and expertise that goes into setting these up. You might spend more on consulting and setup than on the software itself.
How much is the cost to mail each duplicate each month? What is your expected payback time?
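For scale, using the numbers at the top of the thread: ~200k pieces per mailing at 3-5% duplicates is roughly 6,000-10,000 wasted pieces each time. Put your own per-piece cost on that (at an assumed 40-50 cents for print and postage, call it $2,400-$5,000 per mailing, once or twice a month) and weigh it against the software plus consulting.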
|
Post #241,844
1/19/06 2:09:27 AM
|

Thanks, having a look.
-- Steve [link|http://www.ubuntulinux.org|Ubuntu]
|
Post #241,850
1/19/06 6:36:40 AM
|

Re "How much is the cost to mail each duplicate each month?"
One cost that people might not think of, because it's not a direct monetary outlay at the time of mailing, is the extent to which the efficiency of the marketing dross (presumably?) you're sending is lessened because of the annoyance of receiving two identical copies of it.
Hard as Hell to quantify, of course... But I know that shit like that, being a sign of inefficiency and thus stupidity (not to mention being a waste of natural resources), pisses *me* off quite a bit; and since I'm hardly unique in this respect, there's *some* opportunity cost incurred there.
[link|mailto:MyUserId@MyISP.CountryCode|Christian R. Conrad] (I live in Finland, and my e-mail in-box is at the Saunalahti company.)
Yes Mr. Garrison, genetic engineering lets us correct God's horrible, horrible mistakes, like German people. - [link|http://maxpages.com/southpark2k/Episode_105|Mr. Hat]
|
Post #241,870
1/19/06 11:02:52 AM
|

Exactly
And we have some who gripe if everyone in the house doesn't get it. Though I think the risk is on the side of sending too many.
Can't please everyone :)
-- Steve [link|http://www.ubuntulinux.org|Ubuntu]
|
Post #241,872
1/19/06 11:17:35 AM
|

Send me a dump of the list in e-mail.
That is the kind of stuff we do.
Plus, we can CASS certify.
We can De-Dupe based on any kind of criteria.
-- [link|mailto:greg@gregfolkert.net|greg], [link|http://www.iwethey.org/ed_curry|REMEMBER ED CURRY!] @ iwethey
Freedom is not FREE. Yeah, but 10s of Trillions of US Dollars? SELECT * FROM scog WHERE ethics > 0;
0 rows returned.
|