
Quickly search through MASSIVE file in Java

  • 07-03-2007 11:56am
    #1
    Registered Users Posts: 378 ✭✭sicruise


    I have a CSV file with roughly 300,000 entries and I need to be able to compare a string to a column in the file, and based on the second column's value I need to perform an action... Unfortunately this has to be seamless to the user and happen in under 1 sec max... I'm considering putting this data in a database...

    I would prefer to have this data loaded into memory, and I would like this to happen when JBoss starts up... anyone have any suggestions on how to go about this?


Comments

  • Closed Accounts Posts: 461 ✭✭markf909


    Are there any special properties that the keys exhibit?
    Are they numbers?

    Are the CSV values always of length two, with a key and an action, or are they varied in length?


  • Registered Users Posts: 378 ✭✭sicruise


    The first field is between 4 and 8 characters long. All characters being numeric.

    The second field is a 3 digit identifier. There are other fields but I am not going to be using them.

    Is JBossCache the correct approach to take?


  • Registered Users Posts: 37,485 ✭✭✭✭Khannie


    Is JNI an option?

    300,000 entries isn't massive IMO. You should easily be able to search it in under a second with a reasonable processor / memory.

    If memory isn't an issue, loading the entire file into your own structure is an option, this will definitely speed up search time.

    edit: I noticed that regexes have been added to the String class (since 1.4). These will probably be implemented with an efficient C/C++ algorithm in the background (I found it hard to outpace Perl's regex matching with C++ string searches, for example).

    If you're not familiar with regexes, post up a few example strings and I'll give you what you're looking for.


  • Registered Users Posts: 378 ✭✭sicruise


    First of all, cheers for the helpful post.

    It is basically a card BIN table

    e.g.

    440102, 978
    430204, 826

    So basically if the card starts with 440102, I will want my method to return 978.

    public String checkCurrency(String cardNumber)

    The file has been filtered down now to 50,000 entries. This data needs to be accessed multiple times per second. I don't think I'd be able to use regular expressions with this... am I thinking about this the wrong way?

    I think JNI could be overcomplicating the implementation, seeing as I've never used it; I do plan to look into it though... eventually. I'd like to have it all contained within JBoss... is an indexed table in MySQL an option? The other way I'm thinking is using a tool like JBossCache... basically I'm just not sure how exactly to load this data into memory and have it accessible by my application. Also, this data is due to change every month, so I don't want to have to restart JBoss once a month either.

    What dya think?
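
    To sketch what I'm after (assuming the table has been preloaded into a HashMap; the class and field names here are just placeholders):

    import java.util.HashMap;
    import java.util.Map;

    public class BinTable {
        // hypothetical map loaded at startup: BIN prefix -> currency code, e.g. "440102" -> "978"
        private final Map<String, String> binToCurrency = new HashMap<String, String>();

        // BINs are 4 to 8 digits, so try the longest prefix first and fall back to shorter ones
        public String checkCurrency(String cardNumber) {
            for (int len = 8; len >= 4; len--) {
                if (cardNumber.length() < len) {
                    continue;
                }
                String currency = binToCurrency.get(cardNumber.substring(0, len));
                if (currency != null) {
                    return currency;
                }
            }
            return null; // no BIN matched
        }
    }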


  • Closed Accounts Posts: 461 ✭✭markf909


    Do you know the largest number that is used as a key?
    If it's less than 1,000,000 then why not create a short array with the keys as indexes to the actions?

    That will create an array of 2MB in size, which is reasonable.
    Then you could index straight into that array and pull the action out, with no searching needed at all.

    I'd imagine that would be the quickest solution for multiple accesses per second.
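
    A rough sketch of that, assuming every key really is below 1,000,000 (the file handling and the use of 0 as a "no entry" marker are my own choices here):

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;

    public class ActionTable {
        // one slot per possible key: 1,000,000 shorts = ~2MB
        private final short[] actions = new short[1000000];

        public void load(String csvPath) throws IOException {
            BufferedReader in = new BufferedReader(new FileReader(csvPath));
            String line;
            while ((line = in.readLine()) != null) {
                String[] fields = line.split(",\\s*");
                actions[Integer.parseInt(fields[0].trim())] = Short.parseShort(fields[1].trim());
            }
            in.close();
        }

        // direct index, no searching at all
        public short lookup(int key) {
            return actions[key];
        }
    }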


  • Registered Users Posts: 37,485 ✭✭✭✭Khannie


    I'm not familiar with JBoss tbh.

    There are a number of ways that I would deal with this; the first one that springs to mind is a hash table (not sure if Java supports these). Alternatively, you can use the String method startsWith(String), for example. Using ints would definitely be substantially faster, since the comparisons will be implemented natively behind the scenes.

    You could sort the array numerically first too, which would greatly speed up search times. There are a variety of search algorithms that would work very quickly on a sorted array like that.

    I'm assuming (in recommending a hash) that each 6 digit number is unique.

    edit: A quick google shows that there is a Hashtable class in java.util. This is what I would recommend. It will be fast and will result in clean code. Essentially, you will read each line in the file, split on the ", ", then stick the key/value pairs in the hashtable. It uses objects, so I would recommend converting the Strings to Integers, as the equals() method that it almost certainly uses under the hood will result in a much faster comparison.
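
    Something like this, assuming the two-column format posted earlier (the class and method names are made up):

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;

    public class BinLoader {
        public static Map<Integer, Integer> load(String csvPath) throws IOException {
            Map<Integer, Integer> table = new HashMap<Integer, Integer>();
            BufferedReader in = new BufferedReader(new FileReader(csvPath));
            String line;
            while ((line = in.readLine()) != null) {
                String[] parts = line.split(",\\s*"); // "440102, 978" -> {"440102", "978"}
                table.put(Integer.valueOf(parts[0].trim()), Integer.valueOf(parts[1].trim()));
            }
            in.close();
            return table;
        }
    }

    Then a lookup is just table.get(440102), which returns 978.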


  • Closed Accounts Posts: 4,943 ✭✭✭Mutant_Fruit


    If you had your stuff in a 2d integer array you'd be able to binary search the thing in a few microseconds to find any entry. It'd take less than 20 comparisons to find the value you need based on the key you're searching for.
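
    For example, with two parallel arrays instead of a true 2D array, so java.util.Arrays.binarySearch can do the work (this assumes the keys are sorted ascending):

    import java.util.Arrays;

    public class SortedBinTable {
        private final int[] keys;   // sorted ascending
        private final int[] values; // values[i] pairs with keys[i]

        public SortedBinTable(int[] sortedKeys, int[] values) {
            this.keys = sortedKeys;
            this.values = values;
        }

        // log2(300,000) is about 18, so at most ~18 comparisons per lookup
        public int lookup(int key) {
            int i = Arrays.binarySearch(keys, key);
            return (i >= 0) ? values[i] : -1; // -1 = not found
        }
    }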


  • Registered Users Posts: 37,485 ✭✭✭✭Khannie


    Mutant_Fruit wrote:
    If you had your stuff in a 2d integer array you'd be able to binary search the thing in a few microseconds to find any entry. It'd take less than 20 comparisons to find the value you need based on the key you're searching for.

    Sorted hashtable almost certainly uses a binary search anyway and will produce cleaner / more readable code. I must say though, I do love my binary searches. :D


  • Closed Accounts Posts: 461 ✭✭markf909


    Khannie wrote:
    Sorted hashtable almost certainly uses a binary search anyway and will produce cleaner / more readable code.

    The problem is that in this case that will result in 300,000 pairs.
    As Java stores its hashtable keys and values as Objects, that's 600,000 unnecessary (for this problem) Objects created on the heap.

    If the values in the file were not numbers then the hashtable strategy would be the way to go, but here the numbers can be exploited as indexes into an array instead.


  • Registered Users Posts: 597 ✭✭✭bambam


    What about a text indexing solution?
    Try Lucene: http://lucene.apache.org/java/docs/


  • Registered Users Posts: 21,264 ✭✭✭✭Hobbes


    tbh stick it in a database and use JDBC. You're not going to get better response times dicking around with a CSV file for that many entries.

    Here's a list of what's available for free.
    http://java-source.net/open-source/database-engines
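
    For example, with HSQLDB (one of the embedded engines on that list); the in-memory URL and the table/column names here are just for illustration:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;

    public class BinDao {
        private final PreparedStatement lookup;

        public BinDao() throws Exception {
            Class.forName("org.hsqldb.jdbcDriver"); // embedded, in-process database
            Connection conn = DriverManager.getConnection("jdbc:hsqldb:mem:bins", "sa", "");
            conn.createStatement().execute(
                    "CREATE TABLE bin (prefix INTEGER PRIMARY KEY, currency INTEGER)");
            // ...bulk-load the CSV into the table here, then reuse one prepared statement
            lookup = conn.prepareStatement("SELECT currency FROM bin WHERE prefix = ?");
        }

        public Integer findCurrency(int prefix) throws SQLException {
            lookup.setInt(1, prefix);
            ResultSet rs = lookup.executeQuery();
            try {
                return rs.next() ? Integer.valueOf(rs.getInt(1)) : null;
            } finally {
                rs.close();
            }
        }
    }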


  • Registered Users Posts: 1,996 ✭✭✭lynchie


    Hobbes wrote:
    tbh stick it in a database and use JDBC. You're not going to get better response times dicking around with a CSV file for that many entries.

    Here's a list of what's available for free.
    http://java-source.net/open-source/database-engines

    Agreed... and if possible use Hibernate as the persistence engine... it has some good caching built in, which should speed up multiple lookups.


  • Registered Users Posts: 37,485 ✭✭✭✭Khannie


    markf909 wrote:
    that's 600,000 unnecessary (for this problem) Objects created on the heap.

    It's 50,000 pairs now and what you're advocating creates an array of 1,000,000 items, 95% of which will be empty. As I said, the hashtable will be fast and will generate clean code.


  • Closed Accounts Posts: 461 ✭✭markf909


    Khannie wrote:
    It's 50,000 pairs now and what you're advocating creates an array of 1,000,000 items, 95% of which will be empty. As I said, the hashtable will be fast and will generate clean code.


    Apologies, I thought it was 300,000 entries; it's not as MASSIVE as before :D
    For 50k entries the Hashtable is a better solution; the memory footprint is only ~200k and obviously the searching is quicker.


  • Registered Users Posts: 21,264 ✭✭✭✭Hobbes


    markf909 wrote:
    Apologies, I thought it was 300,000 entries; it's not as MASSIVE as before :D
    For 50k entries the Hashtable is a better solution; the memory footprint is only ~200k and obviously the searching is quicker.

    You're still going to have to parse 300k lines of text, convert the lookup numbers to a primitive and add them to the hashtable. That will take a while.


  • Registered Users Posts: 37,485 ✭✭✭✭Khannie


    Hobbes wrote:
    That will take a while.

    I'm on the verge of doing this as an exercise. :D

    300,000 lines, at < 20 characters per line = nothing (again though, it's down to 50K lines, which is even nothinger than nothing!)

    Also, it's a one off cost at startup and can't really be avoided. :(


  • Registered Users Posts: 21,264 ✭✭✭✭Hobbes


    If you use JDBC then you don't get the one-off cost at startup, and you will get better memory management and fast, prebuilt query handling. Is there a reason you need to reinvent the wheel? :)


  • Registered Users Posts: 1,591 ✭✭✭kyote00


    What about an array list?

    e.g.

    public CopyOnWriteArrayList<String> mylist = new CopyOnWriteArrayList<String>();

    you can then use mylist.contains(string) to find items in the list....

    I recently used it on a file with about 500k small entries and it had very fast performance...

    Note you may need to use CopyOnWriteArrayList rather than ArrayList to avoid concurrency issues...
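
    Something like this, for example (worth noting that contains() on any List is a linear scan through the entries):

    import java.util.Arrays;
    import java.util.concurrent.CopyOnWriteArrayList;

    public class ListLookup {
        // safe for many concurrent readers; each write copies the backing array
        private final CopyOnWriteArrayList<String> mylist = new CopyOnWriteArrayList<String>();

        public void load() {
            // e.g. store each CSV line whole
            mylist.addAll(Arrays.asList("440102, 978", "430204, 826"));
        }

        public boolean has(String entry) {
            return mylist.contains(entry); // linear scan, O(n)
        }
    }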


  • Registered Users Posts: 37,485 ✭✭✭✭Khannie


    Hobbes wrote:
    If you use JDBC then you don't get the one-off cost at startup, and you will get better memory management and fast, prebuilt query handling. Is there a reason you need to reinvent the wheel? :)

    I consider a database total overkill for this. He'd have to get the database set up, get the data into the database, the cost of indexing is still there, it's just done by the database at runtime rather than at startup, etc. Granted, the database will have better caching, etc. Still....Hashtable is far less hassle IMO.

    We're talking about milliseconds here anyway lads. Indexing 50,000 elements is nothing. Absolutely nothing.


  • Closed Accounts Posts: 4,943 ✭✭✭Mutant_Fruit


    Hashtable, sorted array, whatever. Either way he's going to be able to do thousands (or more) of searches a second... A database is overkill by a long shot (unless he's going to use SQLite). But even then you still have the once-off setup cost of loading everything into the database and whatnot.

