
New Postal address list cleansing
We do mass mailings of ~200k pieces once or twice a month. A long time ago, a few decisions were made that left our database with a mess of duplicate names - I estimate somewhere in the neighborhood of 3-5%. At the time, the mailing/advertising agency we worked with deduped the mailing files by address so one piece would go to each address. We have a new vendor that doesn't do this - partly because they want to sell us on their database product (ASP.NET on SQL Server with IIS; our current vendor is MySQL/Unify, Apache, Perl, Linux), which does have better list hygiene tools than what we currently use, but falls well short in other areas. In fact it's the same vendor I posted about [link|http://z.iwethey.org/forums/render/content/show?contentid=203201|here].

Work is underway to lessen the number of duplicates getting into the file in the first place, so at least that part's being addressed. The immediate problem for me to solve is twofold, with a complication. The complication is that the db and software aren't under my control, so db schema changes are probably not in the cards and any software would have to be Perl. Our current vendor is [link|http://www.donorware.com|DonorWare], and their infrastructure is described [link|http://www.donorware.com/about/infrastructure/|here]. The problems are 1) I need to clean our existing database, and 2) I'll still need to clean outgoing data. Once clean, the database will still have duplicate addresses - like father and son at the same house - and the desire is usually, but not always, to send one piece to each address.
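For the "one piece to each address" part of the outgoing-file problem, the kind of quick Perl filter I have in mind as a starting point looks roughly like this (a sketch only - the tab-delimited layout and field order are placeholders, not our real export, and it only catches addresses that were entered identically):

    #!/usr/bin/perl
    # Sketch only: collapse a tab-delimited outgoing mail file to one
    # piece per address.  Field order is a placeholder, not the real layout.
    use strict;
    use warnings;

    my %seen;
    while (my $line = <>) {
        chomp $line;
        my ($id, $name, $addr1, $addr2, $city, $state, $zip) = split /\t/, $line, 7;
        my $zip5 = defined $zip ? substr($zip, 0, 5) : '';
        # crude key -- only catches addresses entered exactly the same way
        my $key = uc join '|', map { defined $_ ? $_ : '' } $addr1, $addr2, $city, $state, $zip5;
        $key =~ s/\s+/ /g;
        print $line, "\n" unless $seen{$key}++;
    }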

The software from DonorWare does include some duplicate checking and merging tools, but the decision as to whether a computer-flagged match is truly a duplicate is still left to humans. We are working with them to improve this process and get it automated, obviously accepting some risk of false matches. I thought I'd throw this out here for the IWTGM to process, too.

Any thoughts or suggestions? Thanks in advance for any pointers. I'm off to start searching through CPAN now :)
--
Steve
[link|http://www.ubuntulinux.org|Ubuntu]
New I just did a deduping project kind of like that
Smaller scale, but my approach was to use our existing contract with MapQuest to have them geocode all of the addresses, then match records up by latitude/longitude. (We already have custom software to make those requests from MapQuest.)

It took, I think, 19 hours to process just over 100,000 addresses, and they successfully geocoded a bit over 98% of them.

Barry probably has a better approach to do the same thing, but this worked for a one-off. (Given the existing contract and MapQuest code...)
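The matching step afterward was nothing fancy - something along these lines would do it once you have coordinates (a sketch, not my actual code; the field layout and the 4-decimal rounding are arbitrary choices here):

    # Sketch: group already-geocoded records by rounded lat/long.
    # Input is tab-delimited id, lat, long; 4 decimal places is roughly
    # a 10-meter bucket.
    use strict;
    use warnings;

    my %at;
    while (<>) {
        chomp;
        my ($id, $lat, $long) = split /\t/;
        next unless defined $long;
        push @{ $at{ sprintf('%.4f,%.4f', $lat, $long) } }, $id;
    }
    for my $spot (sort keys %at) {
        my @ids = @{ $at{$spot} };
        print "possible dupes at $spot: @ids\n" if @ids > 1;
    }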

Cheers,
Ben
I have come to believe that idealism without discipline is a quick road to disaster, while discipline without idealism is pointless. -- Aaron Ward (my brother)
New Thanks for the tip!
Did some searching for geocoding on CPAN and ended up [link|http://geocoder.us|here]. It looks useful so far; I'm going to run some tests with it tomorrow. Geocoding our entire database seems like a great idea.
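In case anyone wants to play along, the quick test I'm planning looks roughly like this (the CSV service URL and the response format are my reading of their docs, so treat both as assumptions until I've actually verified them):

    # Sketch: hit the geocoder.us CSV service for a single address.
    # URL and response format are assumptions -- check the site's docs.
    use strict;
    use warnings;
    use LWP::Simple qw(get);
    use URI::Escape qw(uri_escape);

    my $addr = '1600 Pennsylvania Ave, Washington DC';
    my $csv  = get('http://geocoder.us/service/csv?address=' . uri_escape($addr));
    die "no response\n" unless defined $csv;
    my ($lat, $long) = split /,/, $csv;
    print "$addr => lat $lat, long $long\n";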
--
Steve
[link|http://www.ubuntulinux.org|Ubuntu]
New do a dump then sort by address. the dupes are identified
bundle them into a file and write a pearl dbi to seek the records and delete them.
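something like this for the identify pass (just a sketch, assumes the dump is id then address, tab separated, and already sorted on the address):

    # sketch: walk a dump already sorted by address, print groups of 2+
    use strict;
    use warnings;

    my $prev_addr = '';
    my @group;
    while (<>) {
        chomp;
        my ($id, $addr) = split /\t/, $_, 2;
        next unless defined $addr;
        if ($addr ne $prev_addr) {
            print join("\n", @group), "\n\n" if @group > 1;
            @group     = ();
            $prev_addr = $addr;
        }
        push @group, $_;
    }
    print join("\n", @group), "\n" if @group > 1;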
thanx,
bill
Any opinions expressed by me are mine alone, posted from my home computer, on my own time as a free american and do not reflect the opinions of any person or company that I have had professional relations with in the past 50 years. meep
New YM, delete HALF of them... Or NOBODY at that address is left
New 2 things
1. The language is Perl. Note capitalization and spelling.

2. The problem with deduping is that the same address can be entered in many different ways - for instance Street vs St - so you have to handle variations, and trust me, you'll get a *lot* of variations. Looking for exact matches entered twice finds some duplicates, but it leaves enough of a problem that people generally want to do something better. Sometimes you'll see St and other times Street. People move the apartment number around within the address. (Not an issue for my project, but for this one it would be.)

If it is small enough to go through by hand, humans will handle lots of those issues correctly. But coding up a program to handle these issues is surprisingly tricky.

And even humans run into problems with this. For instance, a human who is not familiar with Denver may think that Colorado St and Colorado Rd are the same. A human who is not familiar with Boston may think that if two addresses have the same city name, street name, and street number, then a zip that's off by one digit has to be a typo. Both times you'd be wrong.
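To give a flavor of why it's tricky: even the "easy" normalization layer ends up as a pile of special cases like this (a sketch only, nowhere near complete - the real USPS suffix and unit-designator lists are much longer):

    # Sketch: the start of a street-suffix/unit normalizer.
    use strict;
    use warnings;

    my %abbrev = (
        STREET => 'ST',  AVENUE => 'AVE', AV   => 'AVE',
        ROAD   => 'RD',  DRIVE  => 'DR',  LANE => 'LN',
        APARTMENT => 'APT', UNIT => 'APT',
    );

    sub normalize {
        my $addr = uc shift;
        $addr =~ s/[.,]//g;                    # strip punctuation
        my @words = map { exists $abbrev{$_} ? $abbrev{$_} : $_ } split ' ', $addr;
        return join ' ', @words;
    }

    print normalize('123 Main Street, Apt. 4'), "\n";   # 123 MAIN ST APT 4
    print normalize('123 Main St, Unit 4'), "\n";       # 123 MAIN ST APT 4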

Ben
I have come to believe that idealism without discipline is a quick road to disaster, while discipline without idealism is pointless. -- Aaron Ward (my brother)
New so going by lat and long makes more sense, thx
Any opinions expressed by me are mine alone, posted from my home computer, on my own time as a free american and do not reflect the opinions of any person or company that I have had professional relations with in the past 50 years. meep
New There are apps that standardize addresses
Yes, like Ben said it's complex. But there are off-the-shelf apps to do it for you. What we did (at my last gig) was keep the original address and the standardized address for each record. Match on standardized for dupes, and show a person the matches.

We were checking one-by-one in real time to see if we had done work on a property before, but you could easily do this for a large set and query for dupes to get a sense of scale. The package we had also had an optional web service to check whether the standardized address was recognized by the USPS.

Checking for lat/lon is just an abstracted version of this process. The advantage of using addresses is that looking at what comes out the other end as "standardized" is human-verifiable.
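Once each record carries a standardized address column, getting that sense of scale is one query - something like this (table and column names are invented for the example):

    # Sketch: count records per standardized address, largest groups first.
    use strict;
    use warnings;
    use DBI;

    my $dbh = DBI->connect('dbi:mysql:database=donors', 'user', 'password',
                           { RaiseError => 1 });
    my $rows = $dbh->selectall_arrayref(
        'SELECT std_address, COUNT(*) AS n
           FROM donor
          GROUP BY std_address
         HAVING COUNT(*) > 1
          ORDER BY n DESC'
    );
    printf "%5d  %s\n", $_->[1], $_->[0] for @$rows;
    $dbh->disconnect;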
===

Purveyor of Doc Hope's [link|http://DocHope.com|fresh-baked dog biscuits and pet treats].
[link|http://DocHope.com|http://DocHope.com]
New Here ya go
#1 - Isolate out the address elements into discrete fields:
    address1,
    address2,
    city,
    state,
    zip.

#2 - Run the list through a CASS-certified address standardization program that does ZIP+4 appends. Do not bother trying to write this yourself. You MUST buy it or rent someone else's. They have the Postal database with all the street synonyms and aliases, route number -> friendly name mappings, the list of legal address ranges per street, etc. There is NO WAY you can code a program to do this without the postal database.

I've used a variety of them depending on accuracy desired vs. cost and speed. If you like, give me your list and I'll run it through.


#3 - Merge/Purge. You need to determine what to keep vs. drop, right? Start thinking about the different ways people write their first name (Rob, Robert, Bob), etc. Then add in typos. Account for titles, suffixes, and so on. Then start adding in confidence factors - do you want to err on the side of sending more mail or less? Oh, don't forget married vs. maiden names, or the multi-family / mother-in-law living-together situations. Who exactly IS Mrs John Smith? Do you drop her if you have another female Smith in the household? Oh, and do you want individual, family (single last name), and household (multiple last names) matching? What about companies?
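Just to show how fast the "simple" name side balloons, here's a toy scorer (a sketch only - a real MP engine has vastly bigger nickname/title/suffix tables and tuned weights behind it):

    # Toy sketch of name-match scoring.  The tables and weights are toys.
    use strict;
    use warnings;

    my %nickname = (
        BOB => 'ROBERT', ROB => 'ROBERT', BOBBY => 'ROBERT',
        BILL => 'WILLIAM', WILL => 'WILLIAM',
        PEGGY => 'MARGARET', MEG => 'MARGARET',
    );

    sub canon_first {
        my $f = uc shift;
        $f =~ s/[^A-Z]//g;
        return $nickname{$f} || $f;
    }

    sub name_score {
        my ($p, $q) = @_;               # hashrefs with first/last keys
        my $score = 0;
        $score += 50 if uc($p->{last}) eq uc($q->{last});
        my ($fa, $fb) = (canon_first($p->{first}), canon_first($q->{first}));
        if    ($fa eq $fb)                             { $score += 40 }
        elsif (substr($fa, 0, 1) eq substr($fb, 0, 1)) { $score += 15 }  # initial only
        return $score;   # the cutoff you pick decides more mail vs. less
    }

    print name_score({ first => 'Bob',    last => 'Smith' },
                     { first => 'Robert', last => 'Smith' }), "\n";   # 90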

The better MP programs are amazing. Poorer, matchcode-based ones will consider a lot of obvious (to the human eye) dupes to be unique people. My home-brewed one is matchcode-based, which means more false uniques, but I can't offer my corporate one to you.
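For anyone who hasn't bumped into the term: a matchcode is just a fixed-width key boiled down from pieces of the record, along these lines (layout invented for the example - every shop has its own):

    # Sketch: crude matchcode -- zip5 + first 4 of last name +
    # first initial + house number.
    use strict;
    use warnings;

    sub matchcode {
        my ($first, $last, $addr1, $zip) = @_;
        my ($house) = $addr1 =~ /^(\d+)/;
        $last = uc $last;
        $last =~ s/[^A-Z]//g;
        return sprintf '%-5.5s%-4.4s%1.1s%-6.6s',
               substr($zip, 0, 5), $last, uc($first),
               defined $house ? $house : '';
    }

    # Two spellings of the same person collapse to the same code...
    print matchcode('Robert', 'Mac Donald', '12 Elm Street', '02134-1234'), "\n";
    print matchcode('Robert', 'MacDonald',  '12 Elm St',     '02134'), "\n";
    # ...but Bob vs. Robert does not: a false unique, which is the weakness.
    print matchcode('Bob',    'MacDonald',  '12 Elm Street', '02134'), "\n";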

Dunno about Ben's geocoding answer. Rooftop geocoding is usually way too expensive for casual use, and centroid geocoding is way too wide for deduping purposes.
New I think #2 does what I did
I only used geocoding because it was free for me and I had code for it. I was using it to turn an address into a location, and using that to match addresses. But if you can standardize addresses another way, go for it.

For my project I was able to ignore the name issue - I was just merging several lists into one, and needed to spit out "these aren't duplicated" and "these may be, look them over by eye". So it was only semi-automated.

Cheers,
Ben
I have come to believe that idealism without discipline is a quick road to disaster, while discipline without idealism is pointless. -- Aaron Ward (my brother)
New Your geocoding process standardized first
When we geocode stuff, 1st step is to standardize.
They just did it for you under the covers.
New Exactly
Come to think of it, I really did need the geocoding.

There is a problem with apartment buildings where on one lot you have 2 buildings with different addresses, but they're really the same. So I counted as possible dupes to review any two that were close. (I used "within adjacent 100' by 100' boxes", so that range was 100-300 feet.) I couldn't have done that without geocoding, but it was only a nice-to-have anyway. And it obviously doesn't apply in this case.
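For the record, the box trick itself was just this (a sketch with invented data; 364,000 feet per degree is approximate, and longitude really wants a cos(latitude) correction that I'm ignoring here):

    # Sketch: bucket points into ~100' x 100' boxes, then compare each
    # point against its own box and the 8 neighboring boxes.
    use strict;
    use warnings;

    my $FT_PER_DEG = 364_000;   # rough feet per degree

    sub cell { map { int($_ * $FT_PER_DEG / 100) } @_ }   # (lat,long) -> box coords

    my @points = ( [ 'a', 42.3601, -71.0589 ],
                   [ 'b', 42.3602, -71.0588 ] );          # invented test data

    my %box;
    for my $p (@points) {
        my ($gx, $gy) = cell(@{$p}[1,2]);
        push @{ $box{"$gx,$gy"} }, $p->[0];
    }

    for my $p (@points) {
        my ($id, $lat, $long) = @$p;
        my ($gx, $gy) = cell($lat, $long);
        for my $dx (-1 .. 1) {
            for my $dy (-1 .. 1) {
                for my $other (@{ $box{ ($gx + $dx) . ',' . ($gy + $dy) } || [] }) {
                    print "review: $id vs $other\n" if $other gt $id;   # each pair once
                }
            }
        }
    }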

Cheers,
Ben
I have come to believe that idealism without discipline is a quick road to disaster, while discipline without idealism is pointless. -- Aaron Ward (my brother)
Edited by ben_tilly Jan. 18, 2006, 09:28:16 PM EST
New Re: Here ya go
#1: we have those as discrete fields already.

#2: we actually ran CASS and NCOA against the entire database on Tuesday (Donnelly Marketing was the vendor), and loaded the changes today.

#3: ahhh, the kicker. Is there a specific merge/purge program you've used or would recommend? More on what we do later; I need to spend some time with my son.
--
Steve
[link|http://www.ubuntulinux.org|Ubuntu]
New ObLRPD: "Vote him off the island!"
The Dot is feeling cranky tonight.

;-)

Enjoy being with your boy.

Cheers,
Scott.
New That'd be harder than giving a bath to a bobcat.
Regards,

-scott anderson

"Welcome to Rivendell, Mr. Anderson..."
New the trick is...
the anaesthetic dart beforehand.
hth :-)
Have fun,
Carl Forde
New Talk your talk, wee man.
Regards,

-scott anderson

"Welcome to Rivendell, Mr. Anderson..."
New Firstlogic match/consolidate is verra nice
Group1 MP is (shudder) OK, but the interface is horrendous.

But there is a large amount of tuning and expertise that goes into setting these up. You might spend more on consulting and setup than on the software itself.

How much is the cost to mail each duplicate each month? What is your expected payback time?
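Back-of-the-envelope: at ~200k pieces per mailing and the 3-5% duplicate estimate from the original post, call it roughly 8,000 wasted pieces a mailing. At an assumed (made-up) 40 cents per piece for print and postage, that's about $3,200 per mailing, or somewhere around $38k-77k a year at one to two mailings a month. Compare that to the software, setup, and consulting quote and you have your payback period.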
New Thanks, having a look.
--
Steve
[link|http://www.ubuntulinux.org|Ubuntu]
New Re "How much is the cost to mail each duplicate each month?"
One cost that people might not think of, because it's not a direct monetary outlay at the time of mailing, is the extent to which the efficiency of the marketing dross (presumably?) you're sending is lessened by the annoyance of receiving two identical copies of it.

Hard as Hell to quantify, of course... But I know that shit like that, being a sign of inefficiency and thus stupidity (not to mention being a waste of natural resources), pisses *me* off quite a bit; and since I'm hardly unique in this respect, there's *some* opportunity cost incurred there.


   [link|mailto:MyUserId@MyISP.CountryCode|Christian R. Conrad]
(I live in Finland, and my e-mail in-box is at the Saunalahti company.)
Yes Mr. Garrison, genetic engineering lets us correct God's horrible, horrible mistakes, like German people. - [link|http://maxpages.com/southpark2k/Episode_105|Mr. Hat]
New Exactly
And we have some who gripe if everyone in the house doesn't get it. Though I think the risk is on the side of sending too many.

Can't please everyone :)
--
Steve
[link|http://www.ubuntulinux.org|Ubuntu]
New Send me a dump of the list in e-mail.
That is the kind of stuff we do.

Plus, we can CASS certify.

We can De-Dupe based on any kind of criteria.
--
[link|mailto:greg@gregfolkert.net|greg],
[link|http://www.iwethey.org/ed_curry|REMEMBER ED CURRY!] @ iwethey
Freedom is not FREE.
Yeah, but 10s of Trillions of US Dollars?
SELECT * FROM scog WHERE ethics > 0;

0 rows returned.