Format (normalization) a postal address to search for duplicates?

In the database there are 100500 records with the email address (Moscow, Mira Ave., 10-204). You want to find duplicates. How can I do that? Is there any ready solutions, or will need to reinvent the wheel?
October 3rd 19 at 04:36
6 answers
October 3rd 19 at 04:38
cheap labor will save you :)
but in General I can normalize address through and then try to ometzit
only houses, buildings and apartments will have to Tinker
October 3rd 19 at 04:40
Can be done in case of absence of any normalization very strange way — to find a API maps and to compare the coordinates returned by hash for example.
Now and done, but it covered only about 80% of the addresses - Lyda_Wuckert commented on October 3rd 19 at 04:43
October 3rd 19 at 04:42
Try to use API Yandex.Cards, comparing the results of the geocoder. Documentation.
Cm. the response above. Method is good, but not very full. Bad for small towns, for which no detailed maps - Lyda_Wuckert commented on October 3rd 19 at 04:45
October 3rd 19 at 04:44
Well, you will save cheap children labor.
Hire freelancers for a dollar per hour and let sorted.
Sorted — what? It is necessary to compare each of 100500 addresses with each 100499 - Lyda_Wuckert commented on October 3rd 19 at 04:47
Well firstly you've already done too much for API Yandex maps, and secondly, nothing prevents you to sort the cities and streets, it's not tricky.
And there to touch just.
Well, or just do the normalization, and bring all to one mind, then don't have anything to compare, but simply to iterate through all the addresses one time that is O(n) instead of O(n^2) - Lyda_Wuckert commented on October 3rd 19 at 04:50
October 3rd 19 at 04:46
I have once upon a time (12-13 years ago) there was a similar task. I was led to normalized view in several passes
1) Moved everything in one register (upper)
2) Replaced all "St.", "Avenue" and "prospect" for one thing. The same is done with flats, houses, buildings and other things.
3) based On all this, already made 3 norms. form.
4) have Dealt with the addresses that could not be normalized
ON the base with 25,000 to 30,000 addresses I spent 2 or 3 days.
It is clear that the decision in the forehead and can be quite effective, but the alternative was manual perezhevyvaya all this information that some did not suit me :)
October 3rd 19 at 04:48
System Papyrus ( is able to do. But your problem is — customized: import-parse-exports. For little money you can easily carry out.

Find more questions by tags Algorithms