
Quickly search through MASSIVE file in Java

  • 07-03-2007 11:56am
    #1
    Registered Users Posts: 378 ✭✭sicruise


    I have a CSV file with roughly 300,000 entries and I need to be able to compare a string to a column in the file, and based on the second column's value I need to perform an action... Unfortunately this has to be seamless to the user and happen in under 1 sec max... I'm considering putting this data in a database...

    I would prefer to have this data loaded into memory, and I would like this to happen when JBoss starts up... anyone have any suggestions on how to go about this?


Comments

  • Closed Accounts Posts: 461 ✭✭markf909


    Are there any special properties that the keys exhibit?
    Are they numbers?

    Are the CSV values always of length two, with a key and an action, or are they varied in length?


  • Registered Users Posts: 378 ✭✭sicruise


    The first field is between 4 and 8 characters long. All characters being numeric.

    The second field is a 3 digit identifier. There are other fields but I am not going to be using them.

    Is JBossCache the correct approach to take?


  • Registered Users Posts: 37,485 ✭✭✭✭Khannie


    Is JNI an option?

    300,000 entries isn't massive IMO. You should easily be able to search it in under a second with a reasonable processor / memory.

    If memory isn't an issue, loading the entire file into your own structure is an option, this will definitely speed up search time.

    edit: I noticed that regexes have been added to the String class (since 1.4). These will probably be implemented with an efficient C/C++ algorithm in the background (I found it hard to outpace Perl's regex matching with C++ string searches, for example).

    If you're not familiar with regexes, post up a few example strings and I'll give you what you're looking for.


  • Registered Users Posts: 378 ✭✭sicruise


    First of all, cheers for the helpful post.

    It is basically a card BIN table

    e.g.

    440102, 978
    430204, 826

    So basically if the card starts with 440102, I will want my method to return 978.

    public String checkCurrency(String cardNumber)

    The file has been filtered down now to 50,000 entries. This data needs to be accessed multiple times per second. I don't think I'd be able to use regular expressions with this... am I thinking about this the wrong way?

    I think JNI could be overcomplicating the implementation, seeing as I've never used it; I do plan to look into it though... eventually. I'd like to have it all contained within JBoss... is an indexed table in MySQL an option? The other way I'm thinking is using a tool like JBossCache... basically I'm just not sure how exactly to load this data into memory and have it accessible by my application. Also, this data is due to change every month, so I don't want to have to restart JBoss once a month either.

    What dya think?
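
    To sketch what I'm after (assuming the table has been preloaded into a HashMap; the class and field names here are just placeholders):

    import java.util.HashMap;
    import java.util.Map;

    public class BinTable {
        // hypothetical map loaded at startup: BIN prefix -> currency code, e.g. "440102" -> "978"
        private final Map<String, String> binToCurrency = new HashMap<String, String>();

        // BINs are 4 to 8 digits, so try the longest prefix first and fall back to shorter ones
        public String checkCurrency(String cardNumber) {
            for (int len = 8; len >= 4; len--) {
                if (cardNumber.length() < len) {
                    continue;
                }
                String currency = binToCurrency.get(cardNumber.substring(0, len));
                if (currency != null) {
                    return currency;
                }
            }
            return null; // no BIN matched
        }
    }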


  • Closed Accounts Posts: 461 ✭✭markf909


    Do you know the largest number that is used as a key?
    If it's less than 1,000,000 then why not create a short array with the keys as indexes to the actions?

    That will create an array of 2MB in size, which is reasonable.
    Then you could index straight into that array and pull the action out, with no searching needed at all.

    I'd imagine that would be the quickest solution for multiple accesses per second.
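
    A rough sketch of that, assuming every key really is below 1,000,000 (the file handling and the use of 0 as a "no entry" marker are my own choices here):

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;

    public class ActionTable {
        // one slot per possible key: 1,000,000 shorts = ~2MB
        private final short[] actions = new short[1000000];

        public void load(String csvPath) throws IOException {
            BufferedReader in = new BufferedReader(new FileReader(csvPath));
            String line;
            while ((line = in.readLine()) != null) {
                String[] fields = line.split(",\\s*");
                actions[Integer.parseInt(fields[0].trim())] = Short.parseShort(fields[1].trim());
            }
            in.close();
        }

        // direct index, no searching at all
        public short lookup(int key) {
            return actions[key];
        }
    }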


  • Registered Users Posts: 37,485 ✭✭✭✭Khannie


    I'm not familiar with JBoss tbh.

    There are a number of ways that I would deal with this; the first one that springs to mind is a hash table (not sure if Java supports these). Alternatively, you can use the String method startsWith(String), for example. Using ints would definitely be substantially faster, since the comparisons will be implemented natively behind the scenes.

    You could sort the array numerically first too, which would greatly speed up search times. There are a variety of search algorithms that would work very quickly on a sorted array like that.

    I'm assuming (in recommending a hash) that each 6 digit number is unique.

    edit: A quick google shows that there is a Hashtable class in java.util. This is what I would recommend. It will be fast and will result in clean code. Essentially, you will read each line in the file, split on the ", ", then stick the key/value pairs in the hashtable. It uses objects, so I would recommend converting the Strings to Integers, as the equals() method that it almost certainly uses under the hood will result in a much faster comparison.
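
    Something like this, assuming the two-column format posted earlier (the class and method names are made up):

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;

    public class BinLoader {
        public static Map<Integer, Integer> load(String csvPath) throws IOException {
            Map<Integer, Integer> table = new HashMap<Integer, Integer>();
            BufferedReader in = new BufferedReader(new FileReader(csvPath));
            String line;
            while ((line = in.readLine()) != null) {
                String[] parts = line.split(",\\s*"); // "440102, 978" -> {"440102", "978"}
                table.put(Integer.valueOf(parts[0].trim()), Integer.valueOf(parts[1].trim()));
            }
            in.close();
            return table;
        }
    }

    Then a lookup is just table.get(440102), which returns 978.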


  • Closed Accounts Posts: 4,943 ✭✭✭Mutant_Fruit


    If you had your stuff in a 2d integer array you'd be able to binary search the thing in a few microseconds to find any entry. It'd take less than 20 comparisons to find the value you need based on the key you're searching for.
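
    For example, with two parallel arrays instead of a true 2D array, so java.util.Arrays.binarySearch can do the work (this assumes the keys are sorted ascending):

    import java.util.Arrays;

    public class SortedBinTable {
        private final int[] keys;   // sorted ascending
        private final int[] values; // values[i] pairs with keys[i]

        public SortedBinTable(int[] sortedKeys, int[] values) {
            this.keys = sortedKeys;
            this.values = values;
        }

        // log2(300,000) is about 18, so at most ~18 comparisons per lookup
        public int lookup(int key) {
            int i = Arrays.binarySearch(keys, key);
            return (i >= 0) ? values[i] : -1; // -1 = not found
        }
    }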


  • Registered Users Posts: 37,485 ✭✭✭✭Khannie


    Mutant_Fruit wrote:
    If you had your stuff in a 2d integer array you'd be able to binary search the thing in a few microseconds to find any entry. It'd take less than 20 comparisons to find the value you need based on the key you're searching for.

    Sorted hashtable almost certainly uses a binary search anyway and will produce cleaner / more readable code. I must say though, I do love my binary searches. :D


  • Closed Accounts Posts: 461 ✭✭markf909


    Khannie wrote:
    Sorted hashtable almost certainly uses a binary search anyway and will produce cleaner / more readable code.

    The problem is that in this case that will result in 300,000 pairs.
    As Java stores its hashtable keys and values as Objects, that's 600,000 unnecessary (for this problem) Objects created on the heap.

    If the values in the file were not numbers then the hashtable strategy would be the way to go, but here the numbers can be exploited as indexes into an array instead.


  • Registered Users Posts: 597 ✭✭✭bambam


    What about a text indexing solution?
    Try Lucene: http://lucene.apache.org/java/docs/


  • Registered Users Posts: 21,264 ✭✭✭✭Hobbes


    tbh stick it in a database and use JDBC. You're not going to get better response times dicking around with a CSV file for that many entries.

    Here's a list of what's available for free.
    http://java-source.net/open-source/database-engines
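
    For example, with HSQLDB (one of the embedded engines on that list); the in-memory URL and the table/column names here are just for illustration:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;

    public class BinDao {
        private final PreparedStatement lookup;

        public BinDao() throws Exception {
            Class.forName("org.hsqldb.jdbcDriver"); // embedded, in-process database
            Connection conn = DriverManager.getConnection("jdbc:hsqldb:mem:bins", "sa", "");
            conn.createStatement().execute(
                    "CREATE TABLE bin (prefix INTEGER PRIMARY KEY, currency INTEGER)");
            // ...bulk-load the CSV into the table here, then reuse one prepared statement
            lookup = conn.prepareStatement("SELECT currency FROM bin WHERE prefix = ?");
        }

        public Integer findCurrency(int prefix) throws SQLException {
            lookup.setInt(1, prefix);
            ResultSet rs = lookup.executeQuery();
            try {
                return rs.next() ? Integer.valueOf(rs.getInt(1)) : null;
            } finally {
                rs.close();
            }
        }
    }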


  • Registered Users Posts: 1,996 ✭✭✭lynchie


    Hobbes wrote:
    tbh stick it in a database and use JDBC. You're not going to get better response times dicking around with a CSV file for that many entries.

    Here's a list of what's available for free.
    http://java-source.net/open-source/database-engines

    Agreed... and if possible use Hibernate as the persistence engine... it has some good caching built in, which should speed up multiple lookups.


  • Registered Users Posts: 37,485 ✭✭✭✭Khannie


    markf909 wrote:
    that's 600,000 unnecessary (for this problem) Objects created on the heap.

    It's 50,000 pairs now and what you're advocating creates an array of 1,000,000 items, 95% of which will be empty. As I said, the hashtable will be fast and will generate clean code.


  • Closed Accounts Posts: 461 ✭✭markf909


    Khannie wrote:
    It's 50,000 pairs now and what you're advocating creates an array of 1,000,000 items, 95% of which will be empty. As I said, the hashtable will be fast and will generate clean code.


    Apologies, I thought it was 300,000 entries; it's not as MASSIVE as before :D
    For 50k entries the Hashtable is a better solution; the memory footprint is only ~200k and obviously the searching is quicker.


  • Registered Users Posts: 21,264 ✭✭✭✭Hobbes


    markf909 wrote:
    Apologies, I thought it was 300,000 entries; it's not as MASSIVE as before :D
    For 50k entries the Hashtable is a better solution; the memory footprint is only ~200k and obviously the searching is quicker.

    You're still going to have to parse 300k lines of text, convert the lookup numbers to a primitive and add them to the hashtable. That will take a while.


  • Registered Users Posts: 37,485 ✭✭✭✭Khannie


    Hobbes wrote:
    That will take a while.

    I'm on the verge of doing this as an exercise. :D

    300,000 lines, at < 20 characters per line = nothing (again though, it's down to 50K lines, which is even nothinger than nothing!)

    Also, it's a one off cost at startup and can't really be avoided. :(


  • Registered Users Posts: 21,264 ✭✭✭✭Hobbes


    If you use JDBC then you don't get the one-off cost at startup, and you will get better memory management and fast, prebuilt query handling. Is there a reason you need to reinvent the wheel? :)


  • Registered Users Posts: 1,591 ✭✭✭kyote00


    What about an array list?

    e.g.

    public CopyOnWriteArrayList<String> mylist = new CopyOnWriteArrayList<String>();

    you can then use mylist.contains(string) to find items in the list....

    I recently used it on a file with about 500k small entries and it had very fast performance...

    Note you may need to use CopyOnWriteArrayList rather than ArrayList to avoid concurrency issues...
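
    Something like this, for example (worth noting that contains() on any List is a linear scan through the entries):

    import java.util.Arrays;
    import java.util.concurrent.CopyOnWriteArrayList;

    public class ListLookup {
        // safe for many concurrent readers; each write copies the backing array
        private final CopyOnWriteArrayList<String> mylist = new CopyOnWriteArrayList<String>();

        public void load() {
            // e.g. store each CSV line whole
            mylist.addAll(Arrays.asList("440102, 978", "430204, 826"));
        }

        public boolean has(String entry) {
            return mylist.contains(entry); // linear scan, O(n)
        }
    }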


  • Registered Users Posts: 37,485 ✭✭✭✭Khannie


    Hobbes wrote:
    If you use JDBC then you don't get the one-off cost at startup, and you will get better memory management and fast, prebuilt query handling. Is there a reason you need to reinvent the wheel? :)

    I consider a database total overkill for this. He'd have to get the database set up, get the data into the database, the cost of indexing is still there, it's just done by the database at runtime rather than at startup, etc. Granted, the database will have better caching, etc. Still....Hashtable is far less hassle IMO.

    We're talking about milliseconds here anyway lads. Indexing 50,000 elements is nothing. Absolutely nothing.


  • Closed Accounts Posts: 4,943 ✭✭✭Mutant_Fruit


    Hashtable, sorted array, whatever. Either way he's going to be able to do thousands (or more) of searches a second... A database is overkill by a long shot (unless he's going to use SQLite). But even then you still have the once-off setup cost of loading everything into the database and whatnot.

