Advertisement
If you have a new account but are having problems posting or verifying your account, please email us on hello@boards.ie for help. Thanks :)
Hello all! Please ensure that you are posting a new thread or question in the appropriate forum. The Feedback forum is overwhelmed with questions that are having to be moved elsewhere. If you need help to verify your account contact hello@boards.ie

Comparing Like strings in Java

Options
  • 04-03-2007 10:59pm
    #1
    Closed Accounts Posts: 638 ✭✭✭


    I have a really large number of strings, perhaps 3,000+, i know theres some overlap of them.
    Its a list of customers and their details.
    I'm looking to eliminate duplicate customers but the data is not exactly the same. I guess I'm looking for a way to find similar strings.

    I'm not looking for code, i can do that myself.
    I'm interested in finding opinions on a strategy for doing this, i cant find any on google, its too messy given the keywords in question.

    Is the best solution basically to set a threshold (eg: 95%), write a algorithm to detect this, and delete/mark of them?

    Is there algorithms or template solutions that anyone already know off for doing this? i hardly think its an original problem.

    Any suggestions on the problem itself, perhaps a experience of the same problem?
    Thanks


Comments

  • Closed Accounts Posts: 638 ✭✭✭theTinker


    well ill be damn....
    http://www.merriampark.com/ld.htm

    this gives a close solution explaination + the source code in major lanaguages.

    Im not sure if it is the best approach though. If i replace all double spaces between words with single spaces first, that would certianly make it a little more accurate. More suggestions welcomed


  • Registered Users Posts: 21,264 ✭✭✭✭Hobbes


    First up define similar and what kind of data.

    Lets say it is addresses.

    10 Main street
    10 main st
    10 maiin streeet
    10 the main street

    Would be the same place but different ways to write it. In such instances Levenshtein Distance would not work well for this.

    I know LanguageWare could do this easily (you would have to build the dictionary, or program to machine learn). Not sure of the price of it though, might be other alternatives.


  • Closed Accounts Posts: 461 ✭✭markf909


    This sounds like Hirschbergs algorithm for the longest common subsequence might be of use also.

    http://www.ics.uci.edu/~eppstein/161/960229.html

    Like Levenshtein Distance it uses dynamic programming and it damn easy to implement, I have written versions in Java and Python to analyse large files with lists of method call chains from my own software.


Advertisement