
Working with a massive file in perl

  • 03-08-2009 9:56am
    #1
    Registered Users Posts: 528 ✭✭✭


    Hi Guys,

    I have a slight problem with a script I am trying to get working for work. I have a list of ~600,000 numbers which I have put into an array. I then need to look for the corresponding numbers in a file that has ~19,000,000 entries and print out some of the info contained in those lines. The problem I am having is that the file is simply too big, and anything I try leads to my computer running out of RAM. Any suggestions? Thanks in advance.

    Eoin.


Comments

  • Registered Users Posts: 1,380 ✭✭✭remus808


    Are these all unique numbers? I'd imagine the first task is to go through both lists and eliminate all duplicates - depending on the data set this could result in a much easier task.


  • Registered Users Posts: 528 ✭✭✭ridonkulous


    Yep they are all unique values in both instances.


  • Registered Users Posts: 9,579 ✭✭✭Webmonkey


    Can you not read in a subset, process it, then seek the file pointer to the next subset instead of loading them all into RAM at once?


  • Registered Users Posts: 528 ✭✭✭ridonkulous


    Well yes, it's a good idea and I had thought of it myself, but I thought that method might be a bit time-consuming and inefficient.


  • Closed Accounts Posts: 7 bladez


    Can you sort the 600k file?


  • Registered Users Posts: 6,509 ✭✭✭daymobrew


    If execution time isn't a big deal maybe a shell script could do it.
    for n in `cat 600kfile.txt`
    do
      # Specify whatever delimiters are required.
      grep "^$n," 19millionfile.txt
    done
    


  • Closed Accounts Posts: 1,444 ✭✭✭Cantab.


    Why load all your data into an array at all? Just read the data line by line:
    open(IN,"<filename.txt") or die "Could not open file";
    while(<IN>)
    {
            my $line=$_;
            #etc.
    }
    

    And 600,000 lines is nothin'!
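
    Along those lines, a rough sketch of the usual hash-lookup approach for this kind of job: keep the ~600,000 numbers in a hash and stream the big file past it one line at a time. The file names are made up, and I'm assuming the number is the first comma-separated field on each line (as in daymobrew's grep pattern):

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Load the ~600,000 lookup numbers into a hash (small enough for RAM).
    open(my $nums, '<', 'numbers.txt') or die "Could not open numbers.txt: $!";
    my %wanted;
    while (my $n = <$nums>) {
        chomp $n;
        $wanted{$n} = 1;
    }
    close($nums);

    # Stream the ~19,000,000-line file; never hold more than one line in memory.
    open(my $big, '<', 'bigfile.txt') or die "Could not open bigfile.txt: $!";
    while (my $line = <$big>) {
        my ($key) = split /,/, $line, 2;    # first comma-separated field
        print $line if exists $wanted{$key};
    }
    close($big);

    The 600k hash should fit in RAM easily enough; the 19 million lines only ever pass through one at a time.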


  • Closed Accounts Posts: 7,794 ✭✭✭JC 2K3


    Is the 19,000,000 entry file sorted and is it in sequence?


  • Registered Users Posts: 528 ✭✭✭ridonkulous


    bladez wrote: »
    Can you sort the 600k file?

    yep.


  • Registered Users Posts: 528 ✭✭✭ridonkulous


    daymobrew wrote: »
    If execution time isn't a big deal maybe a shell script could do it.
    for n in `cat 600kfile.txt`
    do
      # Specify whatever delimiters are required.
      grep "^$n," 19millionfile.txt
    done
    


    Unfortunately it is, and I have used bash to do some stuff along similar lines before and it simply takes way too long. Thanks for the suggestion though.


  • Registered Users Posts: 528 ✭✭✭ridonkulous


    daymobrew wrote: »
    If execution time isn't a big deal maybe a shell script could do it.
    for n in `cat 600kfile.txt`
    do
      # Specify whatever delimiters are required.
      grep "^$n," 19millionfile.txt
    done
    


    Surely this would take an age, as it has to grep the entire 19,000,000-line file for each line in the 600,000-line file, so basically 19,000,000 x 600,000 operations (or I could be grossly misinformed).


  • Registered Users Posts: 528 ✭✭✭ridonkulous


    JC 2K3 wrote: »
    Is the 19,000,000 entry file sorted and is it in sequence?

    Yep.


  • Closed Accounts Posts: 7,794 ✭✭✭JC 2K3


    You can calculate the line number which corresponds to a certain value then.

    Is it possible to open a file and jump straight to a line without putting the whole thing in memory in Perl?

    If it is, then this should work well.

    EDIT: hmm... I don't think you can, but you don't have to do any pattern matching, just loop until you get to the desired line.
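
    For what it's worth, looping to a given line without slurping the file is straightforward in Perl using the built-in line counter $. - a tiny sketch, with the file name and target line number made up:

    use strict;
    use warnings;

    open(my $fh, '<', 'bigfile.txt') or die "Could not open bigfile.txt: $!";
    my $target = 12_345_678;           # hypothetical line number to jump to
    while (my $line = <$fh>) {
        if ($. == $target) {           # $. holds the current input line number
            print $line;
            last;
        }
    }
    close($fh);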


  • Registered Users Posts: 9,579 ✭✭✭Webmonkey


    Unless there are values missing in the sequence?

    You could do a binary search. Take line number 1 and line number max/2 - if the value you want is less than the value at max/2, then search that half by halving again; otherwise go up.

    Not sure if it will work, I'm still not sure exactly how the file is laid out.

    http://en.wikipedia.org/wiki/Binary_search_algorithm


  • Registered Users Posts: 9,579 ✭✭✭Webmonkey


    JC 2K3 wrote: »
    You can calculate the line number which corresponds to a certain value then.

    Is it possible to open a file and jump straight to a line without putting the whole thing in memory in Perl?

    If it is, then this should work well.

    EDIT: hmm... I don't think you can, but you don't have to do any pattern matching, just loop until you get to the desired line.
    I'm guessing if the line widths are fixed, then you can calculate how many bytes occupy a line and seek the file to (nbrBytesInLine * lineNumber). But if the lines are variable, then I'm not sure it is possible.
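
    That seek arithmetic would look something like this in Perl - just a sketch, assuming every record (newline included) is exactly the same number of bytes, which is an assumption about the file layout:

    use strict;
    use warnings;

    open(my $fh, '<', 'bigfile.txt') or die "Could not open bigfile.txt: $!";
    my $rec_len = 32;                     # assumed fixed record length, newline included
    my $line_no = 1_000_000;              # hypothetical 0-based line number to fetch
    seek($fh, $rec_len * $line_no, 0)     # jump straight to that record's byte offset
        or die "seek failed: $!";
    my $record = <$fh>;                   # read just that one line
    print $record;
    close($fh);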


  • Closed Accounts Posts: 7,794 ✭✭✭JC 2K3


    Webmonkey wrote: »
    Unless there are values missing in the sequence?

    You could do a binary search. Take line number 1 and line number max/2 - if the value you want is less than the value at max/2, then search that half by halving again; otherwise go up.

    Not sure if it will work, I'm still not sure exactly how the file is laid out.

    http://en.wikipedia.org/wiki/Binary_search_algorithm
    Well, by "in sequence", I meant purely, 100% in sequence.

    Can you do a binary search effectively without putting the whole thing in RAM?


  • Registered Users Posts: 9,579 ✭✭✭Webmonkey


    JC 2K3 wrote: »
    Well, by "in sequence", I meant purely, 100% in sequence.

    Can you do a binary search effectively without putting the whole thing in RAM?
    Yep, thought so.

    Well, only if you can do what you were saying about reading in a line number.

    If you can obtain the max line number, then you can divide it by 2, read in that line, and recursively continue from there.

    But if you can't seek to a line number, then I don't think it is possible, as you have to load in all the data to work out the line breaks. Messy :(
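
    Putting the two ideas together, a rough sketch of a binary search over a sorted file of fixed-length records - the record length, file name and key layout are all assumptions:

    use strict;
    use warnings;

    # Binary search a sorted file of fixed-length records for $target.
    # Assumes each record is $rec_len bytes (newline included) and begins
    # with the numeric key being searched for.
    sub find_record {
        my ($fh, $rec_len, $num_records, $target) = @_;
        my ($lo, $hi) = (0, $num_records - 1);
        while ($lo <= $hi) {
            my $mid = int(($lo + $hi) / 2);
            seek($fh, $mid * $rec_len, 0) or die "seek failed: $!";
            my $record = <$fh>;
            my ($key) = split /\s+/, $record;    # first whitespace-separated field
            if    ($key < $target) { $lo = $mid + 1 }
            elsif ($key > $target) { $hi = $mid - 1 }
            else                   { return $record }
        }
        return undef;    # not found
    }

    open(my $fh, '<', 'bigfile.txt') or die "Could not open bigfile.txt: $!";
    my $rec_len     = 32;                                    # assumed record length
    my $num_records = int((-s 'bigfile.txt') / $rec_len);    # file size / record size
    my $hit = find_record($fh, $rec_len, $num_records, 12345);
    print $hit if defined $hit;
    close($fh);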


  • Registered Users Posts: 195 ✭✭caffrey


    Does it have to be Perl? C++ would do this much faster. I converted a pattern-matching Perl script which sounds quite similar to what you are doing, and the difference in speed is vast: some runs of the Perl program would take two weeks, and the C++ version never took more than 10 mins. Also, if you are worried about having to learn a lot, don't worry too much; there is a lot you will find familiar in the Boost libraries, like regular expressions, hash maps etc.


  • Closed Accounts Posts: 7,794 ✭✭✭JC 2K3


    Judging by the OP's problem, it probably wouldn't be too impractical to convert the 19,000,000-line file to fixed-length records, if it isn't already.

    Then either seek directly to the required line, or binary search, depending on whether the file is 100% in sequence or not.

    If having fixed-length records is impractical, they could consider building an index.
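
    One way to build such an index is a single pass over the big file, remembering each key's byte offset with tell() so later lookups can seek() straight to it. A sketch, with the file name and field layout assumed:

    use strict;
    use warnings;

    # One pass over the big file: record the byte offset of each line,
    # keyed by its first field, so later lookups can seek() straight to it.
    open(my $big, '<', 'bigfile.txt') or die "Could not open bigfile.txt: $!";
    my %offset;
    my $pos = tell($big);                  # offset of the line about to be read
    while (my $line = <$big>) {
        my ($key) = split /,/, $line, 2;
        $offset{$key} = $pos;
        $pos = tell($big);
    }

    # Later: jump straight to the record for a given key.
    my $key = 12345;                       # hypothetical key
    if (exists $offset{$key}) {
        seek($big, $offset{$key}, 0) or die "seek failed: $!";
        print scalar <$big>;
    }
    close($big);

    For 19 million keys a plain in-memory hash runs into the same RAM problem, so in practice you would either tie the hash to an on-disk DBM (DB_File, for example) or only index the ~600,000 keys you actually need.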


  • Registered Users Posts: 1,916 ✭✭✭ronivek


    Half of the problem is already solved for you; http://perldoc.perl.org/Tie/File.html

    Assuming the records in your file are sorted by the field you're interested in then it's just a matter of implementing some sort of search over the array.

    This would all depend entirely on what context you want to perform this operation in; there are better mechanisms out there for storing and retrieving millions of records...
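
    Tie::File presents the file's lines as a Perl array without slurping the whole thing into memory, so a binary search can run over it directly. A small sketch, with the file name and key layout assumed:

    use strict;
    use warnings;
    use Fcntl 'O_RDONLY';
    use Tie::File;

    # Present the big file as a read-only array of lines; Tie::File keeps
    # only a small cache in memory, not the whole file.
    tie my @records, 'Tie::File', 'bigfile.txt', mode => O_RDONLY
        or die "Could not tie bigfile.txt: $!";

    # Binary search the tied array, assuming lines are sorted by their first field.
    my $target = 12345;                    # hypothetical key
    my ($lo, $hi) = (0, $#records);
    while ($lo <= $hi) {
        my $mid = int(($lo + $hi) / 2);
        my ($key) = split /\s+/, $records[$mid];
        if    ($key < $target) { $lo = $mid + 1 }
        elsif ($key > $target) { $hi = $mid - 1 }
        else                   { print "$records[$mid]\n"; last }
    }

    untie @records;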


  • Closed Accounts Posts: 20 boars.ie


    Out of curiosity, I wrote some test code generating such files and doing a search; on my eeePC it used up 20% of mem and took:


    $ /tmp/test.py 2> /tmp/out
    0secs - Generating numbers...
    4secs - Generating records...
    440secs - Reading numbers...
    1secs - Reading records...
    268secs - Results...
    18secs - End...


    The code:


    #!/usr/bin/python

    import sys
    import time
    import random

    num_file_name = '/tmp/nums'
    rec_file_name = '/tmp/recs'
    rand = random.getrandbits
    timestamp = time.time()

    print '0secs - Generating numbers...'
    num_file = open(num_file_name, 'w')
    [ num_file.write(str(rand(24)) + '\n') for i in range(500000) ]
    num_file.close()

    print str(int(time.time() - timestamp)) + 'secs - Generating records...'
    timestamp = time.time()
    record_file = open(rec_file_name, 'w')
    [ record_file.write(str(rand(24)) + ' ' + str(rand(24)) + ' ' + str(rand(24)) + '\n') for i in range(20000000) ]
    record_file.close()

    print str(int(time.time() - timestamp)) + 'secs - Reading numbers...'
    timestamp = time.time()
    num_file = open(num_file_name)
    num_set = set([ line.strip() for line in num_file ])
    num_file.close()

    print str(int(time.time() - timestamp)) + 'secs - Reading records...'
    timestamp = time.time()
    record_file = open(rec_file_name)
    result = [ line for line in record_file if [ record for record in line.split() if record.strip() in num_set ] ]
    record_file.close()

    print str(int(time.time() - timestamp)) + 'secs - Results...'
    timestamp = time.time()
    for line in result:
        sys.stderr.write(line)

    print str(int(time.time() - timestamp)) + 'secs - End...'


  • Closed Accounts Posts: 7,794 ✭✭✭JC 2K3


    ronivek wrote: »
    Half of the problem is already solved for you; http://perldoc.perl.org/Tie/File.html

    Assuming the records in your file are sorted by the field you're interested in then it's just a matter of implementing some sort of search over the array.

    This would all depend entirely on what context you want to perform this operation in; there are better mechanisms out there for storing and retrieving millions of records...
    Nice.

