Comparing Like strings in Java

theTinker · 04-03-2007 10:59pm #1

I have a really large number of strings, perhaps 3,000+, i know theres some overlap of them.
Its a list of customers and their details.
I'm looking to eliminate duplicate customers but the data is not exactly the same. I guess I'm looking for a way to find similar strings.

I'm not looking for code, i can do that myself.
I'm interested in finding opinions on a strategy for doing this, i cant find any on google, its too messy given the keywords in question.

Is the best solution basically to set a threshold (eg: 95%), write a algorithm to detect this, and delete/mark of them?

Is there algorithms or template solutions that anyone already know off for doing this? i hardly think its an original problem.

Any suggestions on the problem itself, perhaps a experience of the same problem?
Thanks

theTinker · 04-03-2007 11:08pm

well ill be damn....
http://www.merriampark.com/ld.htm

this gives a close solution explaination + the source code in major lanaguages.

Im not sure if it is the best approach though. If i replace all double spaces between words with single spaces first, that would certianly make it a little more accurate. More suggestions welcomed

Hobbes · 04-03-2007 11:11pm

First up define similar and what kind of data.

Lets say it is addresses.

10 Main street
10 main st
10 maiin streeet
10 the main street

Would be the same place but different ways to write it. In such instances Levenshtein Distance would not work well for this.

I know LanguageWare could do this easily (you would have to build the dictionary, or program to machine learn). Not sure of the price of it though, might be other alternatives.

markf909 · 04-03-2007 11:48pm

This sounds like Hirschbergs algorithm for the longest common subsequence might be of use also.

http://www.ics.uci.edu/~eppstein/161/960229.html

Like Levenshtein Distance it uses dynamic programming and it damn easy to implement, I have written versions in Java and Python to analyse large files with lists of method call chains from my own software.

Comparing Like strings in Java

Comments