
2 things
1. The language is Perl. Note capitalization and spelling.

2. The problem with deduping is that the same address can be entered in many different ways, so you have to handle variations. And trust me, you'll get a *lot* of variations. Looking for exact matches entered twice finds some duplicates, but it leaves enough of a problem that people generally want to do something better. For instance, sometimes you'll see St and other times Street. People also move around where the apartment number goes in the address. (Not an issue for my project, but for this one it would be.)

If it is small enough to go through by hand, humans will handle lots of those issues correctly. But coding up a program to handle these issues is surprisingly tricky.

And even humans run into problems with this. For instance, a human who is not familiar with Denver may think that Colorado St and Colorado Rd are the same. A human who is not familiar with Boston may think that if two addresses have the same city name, street name and street number, then a zip that is off by one digit has to be a typo. In both cases that human would be wrong.
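To make the variation problem concrete, here is a rough sketch in Python (not the Perl I actually used) of the kind of normalization involved. The abbreviation table, the unit keywords and the `normalize` function are made up for illustration and nowhere near complete. Note that it deliberately keeps St and Rd distinct, per the Denver example.

```python
import re

# Hypothetical, deliberately tiny abbreviation table (assumption for this sketch).
# A real list of suffix variants is far longer.
SUFFIXES = {
    "street": "st", "st": "st",
    "road": "rd", "rd": "rd",
    "avenue": "ave", "ave": "ave",
}

# Hypothetical set of words that introduce an apartment/unit number.
UNIT_WORDS = {"apt", "apartment", "unit"}


def normalize(address: str) -> str:
    """Return a crude canonical form of a street address line."""
    text = address.lower().replace("#", " apt ")
    tokens = re.findall(r"[a-z0-9]+", text)  # drops punctuation such as "." and ","

    unit = None
    rest = []
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if tok in UNIT_WORDS and i + 1 < len(tokens):
            # Pull the unit out so it can be placed consistently at the end.
            unit = tokens[i + 1]
            i += 2
            continue
        rest.append(SUFFIXES.get(tok, tok))
        i += 1

    if unit is not None:
        rest += ["apt", unit]
    return " ".join(rest)
```

With this, "123 Colorado Street" and "123 colorado st." collapse to the same string, and so do "Apt 4, 123 Colorado St" and "123 Colorado St #4", while "123 Colorado Rd" stays different. It does nothing about misspellings, directionals, or the zip code issue, which is where the real work is.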

Ben
I have come to believe that idealism without discipline is a quick road to disaster, while discipline without idealism is pointless. -- Aaron Ward (my brother)
so going by lat and long makes more sense, thx
Any opinions expressed by me are mine alone, posted from my home computer, on my own time as a free american and do not reflect the opinions of any person or company that I have had professional relations with in the past 50 years. meep
There are apps that standardize addresses
Yes, like Ben said it's complex. But there are off-the-shelf apps to do it for you. What we did (at my last gig) was keep the original address and the standardized address for each record. Match on standardized for dupes, and show a person the matches.

We were checking one-by-one in real time to see if we had done work on a property before, but you could easily do this for a large set and query for dupes to get a sense of scale. The package we had also had an optional web service to check whether the standardized address was recognized by the USPS.
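For the large-set case, the idea can be sketched roughly like this in Python. The `standardize` argument stands in for whatever the off-the-shelf package provides; `toy_standardize` below is a made-up placeholder so the example runs, not the package's API.

```python
from collections import defaultdict


def find_duplicate_groups(records, standardize):
    """Group (record_id, original_address) pairs by standardized address.

    Returns only the groups with more than one member, keeping the
    original text so a person can review each candidate match.
    """
    groups = defaultdict(list)
    for record_id, original in records:
        groups[standardize(original)].append((record_id, original))
    return {key: members for key, members in groups.items() if len(members) > 1}


def toy_standardize(address):
    # Placeholder (assumption): lowercase, drop periods, collapse whitespace.
    return " ".join(address.lower().replace(".", "").split())


records = [(1, "12 Main St."), (2, "12  main st"), (3, "40 Oak Ave")]
dupes = find_duplicate_groups(records, toy_standardize)
```

Here `dupes` has one group, keyed by "12 main st", holding records 1 and 2 with their original text intact. The size of that dictionary gives the sense of scale, and the originals are what you show to a person.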

Checking for lat/lon is just an abstracted version of this process. The advantage of using addresses is that looking at what comes out the other end as "standardized" is human-verifiable.
===

Purveyor of Doc Hope's [link|http://DocHope.com|fresh-baked dog biscuits and pet treats].
[link|http://DocHope.com|http://DocHope.com]
     Postal address list cleansing - (Steve Lowe) - (21)
         I just did a deduping project kind of like that - (ben_tilly) - (1)
             Thanks for the tip! - (Steve Lowe)
         do a dump then sort by address. the dupes are identified - (boxley) - (4)
             YM, delete HALF of them... Or NOBODY at that address is left -NT - (CRConrad)
             2 things - (ben_tilly) - (2)
                 so going by lat and long makes more sense, thx -NT - (boxley) - (1)
                     There are apps that standardize addresses - (drewk)
         Here ya go - (broomberg) - (12)
             I think #2 does what I did - (ben_tilly) - (2)
                 Your geocoding process standardized first - (broomberg) - (1)
                     Exactly - (ben_tilly)
             Re: Here ya go - (Steve Lowe) - (8)
                 ObLRPD: "Vote him off the island!" - (Another Scott) - (3)
                     That'd be harder than giving a bath to a bobcat. -NT - (admin) - (2)
                         the trick is... - (cforde) - (1)
                             Talk your talk, wee man. -NT - (admin)
                 Firstlogic match/consolidate is verra nice - (broomberg) - (3)
                     Thanks, having a look. -NT - (Steve Lowe)
                     Re "How much is the cost to mail each duplicate each month?" - (CRConrad) - (1)
                         Exactly - (Steve Lowe)
         Send me a dump of the list in e-mail. - (folkert)
