
Working with a massive file in perl

  • 03-08-2009 9:56am
    #1
    Registered Users Posts: 528 ✭✭✭


    Hi Guys,

    I have a slight problem with a script I am trying to get working for work. I have a list of ~600,000 numbers which I have put into an array. I then need to look for the corresponding numbers in a file that has ~19,000,000 entries and print out some of the info contained in those lines. The problem I am having is that the file is simply too big, and anything I try leads to my computer running out of RAM. Any suggestions? Thanks in advance.

    Eoin.


Comments

  • Registered Users Posts: 1,380 ✭✭✭remus808


    Are these all unique numbers? I'd imagine the first task is to go through both lists and eliminate all duplicates - depending on the data set this could result in a much easier task.


  • Registered Users Posts: 528 ✭✭✭ridonkulous


    Yep they are all unique values in both instances.


  • Registered Users Posts: 9,579 ✭✭✭Webmonkey


    Can you not read in a subset, process it, then seek the file pointer to the next subset instead of loading them all into RAM at once?


  • Registered Users Posts: 528 ✭✭✭ridonkulous


    Well yes, it's a good idea and I had thought of it myself, but I thought that method might be a bit time-consuming and inefficient.


  • Closed Accounts Posts: 7 bladez


    Can you sort the 600k file?


  • Registered Users Posts: 6,509 ✭✭✭daymobrew


    If execution time isn't a big deal maybe a shell script could do it.
    for n in `cat 600kfile.txt`
    do
      # Specify whatever delimiters are required.
      grep "^$n," 19millionfile.txt
    done
    


  • Closed Accounts Posts: 1,444 ✭✭✭Cantab.


    Why load all your data into an array at all? Just read the data line by line:
    open(IN,"<filename.txt") or die "Could not open file";
    while(<IN>)
    {
            my $line=$_;
            #etc.
    }
    

    And 600,000 lines is nothin'!
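
    Along those lines, a rough sketch of the usual hash-lookup approach for this kind of job: keep the ~600,000 numbers in a hash and stream the big file past it one line at a time. The file names are made up, and I'm assuming the number is the first comma-separated field on each line (as in daymobrew's grep pattern):

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Load the ~600,000 lookup numbers into a hash (small enough for RAM).
    open(my $nums, '<', 'numbers.txt') or die "Could not open numbers.txt: $!";
    my %wanted;
    while (my $n = <$nums>) {
        chomp $n;
        $wanted{$n} = 1;
    }
    close($nums);

    # Stream the ~19,000,000-line file; never hold more than one line in memory.
    open(my $big, '<', 'bigfile.txt') or die "Could not open bigfile.txt: $!";
    while (my $line = <$big>) {
        my ($key) = split /,/, $line, 2;    # first comma-separated field
        print $line if exists $wanted{$key};
    }
    close($big);

    The 600k hash should fit in RAM easily enough; the 19 million lines only ever pass through one at a time.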


  • Closed Accounts Posts: 7,794 ✭✭✭JC 2K3


    Is the 19,000,000 entry file sorted and is it in sequence?


  • Registered Users Posts: 528 ✭✭✭ridonkulous


    bladez wrote: »
    Can you sort the 600k file?

    yep.


  • Registered Users Posts: 528 ✭✭✭ridonkulous


    daymobrew wrote: »
    If execution time isn't a big deal maybe a shell script could do it.
    for n in `cat 600kfile.txt`
    do
      # Specify whatever delimiters are required.
      grep "^$n," 19millionfile.txt
    done
    


    Unfortunately it is, and I have used bash to do some stuff along similar lines before and it simply takes way too long. Thanks for the suggestion though.


  • Registered Users Posts: 528 ✭✭✭ridonkulous


    daymobrew wrote: »
    If execution time isn't a big deal maybe a shell script could do it.
    for n in `cat 600kfile.txt`
    do
      # Specify whatever delimiters are required.
      grep "^$n," 19millionfile.txt
    done
    


    Surely this would take an age, as it has to grep the entire 19,000,000-line file for each line in the 600,000-line file, so basically 19,000,000 x 600,000 operations (or I could be grossly misinformed).


  • Registered Users Posts: 528 ✭✭✭ridonkulous


    JC 2K3 wrote: »
    Is the 19,000,000 entry file sorted and is it in sequence?

    Yep.


  • Closed Accounts Posts: 7,794 ✭✭✭JC 2K3


    You can calculate the line number which corresponds to a certain value then.

    Is it possible to open a file and jump straight to a line without putting the whole thing in memory in Perl?

    If it is, then this should work well.

    EDIT: hmm... I don't think you can, but you don't have to do any pattern matching, just loop until you get to the desired line.
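
    For what it's worth, looping to a given line without slurping the file is straightforward in Perl using the built-in line counter $. - a tiny sketch, with the file name and target line number made up:

    use strict;
    use warnings;

    open(my $fh, '<', 'bigfile.txt') or die "Could not open bigfile.txt: $!";
    my $target = 12_345_678;           # hypothetical line number to jump to
    while (my $line = <$fh>) {
        if ($. == $target) {           # $. holds the current input line number
            print $line;
            last;
        }
    }
    close($fh);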


  • Registered Users Posts: 9,579 ✭✭✭Webmonkey


    Unless there are values missing in the sequence?

    You could do a binary search. Take line number 1 and line number max/2 - if the value you want is less than the value at max/2, then search that half by halving again; otherwise go up.

    Not sure if it will work, I'm still not sure exactly how the file is laid out.

    http://en.wikipedia.org/wiki/Binary_search_algorithm


  • Registered Users Posts: 9,579 ✭✭✭Webmonkey


    JC 2K3 wrote: »
    You can calculate the line number which corresponds to a certain value then.

    Is it possible to open a file and jump straight to a line without putting the whole thing in memory in Perl?

    If it is, then this should work well.

    EDIT: hmm... I don't think you can, but you don't have to do any pattern matching, just loop until you get to the desired line.
    I'm guessing if the line widths are fixed, then you can calculate how many bytes occupy a line and seek the file to (nbrBytesInLine * lineNumber). But if the lines are variable, then I'm not sure it is possible.
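
    That seek arithmetic would look something like this in Perl - just a sketch, assuming every record (newline included) is exactly the same number of bytes, which is an assumption about the file layout:

    use strict;
    use warnings;

    open(my $fh, '<', 'bigfile.txt') or die "Could not open bigfile.txt: $!";
    my $rec_len = 32;                     # assumed fixed record length, newline included
    my $line_no = 1_000_000;              # hypothetical 0-based line number to fetch
    seek($fh, $rec_len * $line_no, 0)     # jump straight to that record's byte offset
        or die "seek failed: $!";
    my $record = <$fh>;                   # read just that one line
    print $record;
    close($fh);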


  • Closed Accounts Posts: 7,794 ✭✭✭JC 2K3


    Webmonkey wrote: »
    Unless there are values missing in the sequence?

    You could do a binary search. Take line number 1 and line number max/2 - if the value you want is less than the value at max/2, then search that half by halving again; otherwise go up.

    Not sure if it will work, I'm still not sure exactly how the file is laid out.

    http://en.wikipedia.org/wiki/Binary_search_algorithm
    Well, by "in sequence", I meant purely, 100% in sequence.

    Can you do a binary search effectively without putting the whole thing in RAM?


  • Registered Users Posts: 9,579 ✭✭✭Webmonkey


    JC 2K3 wrote: »
    Well, by "in sequence", I meant purely, 100% in sequence.

    Can you do a binary search effectively without putting the whole thing in RAM?
    Yep, thought so.

    Well, only if you can do what you were saying about reading in a line number.

    If you can obtain the max line number, then you can divide it by 2, read in that line, and recursively continue from there.

    But if you can't seek to a line number, then I don't think it is possible, as you have to load in all the data to work out the line breaks. Messy :(
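
    Putting the two ideas together, a rough sketch of a binary search over a sorted file of fixed-length records - the record length, file name and key layout are all assumptions:

    use strict;
    use warnings;

    # Binary search a sorted file of fixed-length records for $target.
    # Assumes each record is $rec_len bytes (newline included) and begins
    # with the numeric key being searched for.
    sub find_record {
        my ($fh, $rec_len, $num_records, $target) = @_;
        my ($lo, $hi) = (0, $num_records - 1);
        while ($lo <= $hi) {
            my $mid = int(($lo + $hi) / 2);
            seek($fh, $mid * $rec_len, 0) or die "seek failed: $!";
            my $record = <$fh>;
            my ($key) = split /\s+/, $record;    # first whitespace-separated field
            if    ($key < $target) { $lo = $mid + 1 }
            elsif ($key > $target) { $hi = $mid - 1 }
            else                   { return $record }
        }
        return undef;    # not found
    }

    open(my $fh, '<', 'bigfile.txt') or die "Could not open bigfile.txt: $!";
    my $rec_len     = 32;                                    # assumed record length
    my $num_records = int((-s 'bigfile.txt') / $rec_len);    # file size / record size
    my $hit = find_record($fh, $rec_len, $num_records, 12345);
    print $hit if defined $hit;
    close($fh);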


  • Registered Users Posts: 195 ✭✭caffrey


    Does it have to be Perl? C++ would do this much faster. I converted a pattern-matching Perl script which sounds quite similar to what you are doing, and the difference in speed is vast: some runs of the Perl program would take two weeks, and the C++ version never took more than 10 mins. Also, if you are worried about having to learn a lot, don't worry too much; there is a lot you will find familiar in the Boost libraries, like regular expressions, hash maps etc.


  • Closed Accounts Posts: 7,794 ✭✭✭JC 2K3


    Judging by the OP's problem, it probably wouldn't be too impractical to convert the 19,000,000-line file to fixed-length records, if it isn't already.

    Then either seek directly to the required line, or binary search, depending on whether the file is 100% in sequence or not.

    If having fixed-length records is impractical, they could consider building an index.
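
    One way to build such an index is a single pass over the big file, remembering each key's byte offset with tell() so later lookups can seek() straight to it. A sketch, with the file name and field layout assumed:

    use strict;
    use warnings;

    # One pass over the big file: record the byte offset of each line,
    # keyed by its first field, so later lookups can seek() straight to it.
    open(my $big, '<', 'bigfile.txt') or die "Could not open bigfile.txt: $!";
    my %offset;
    my $pos = tell($big);                  # offset of the line about to be read
    while (my $line = <$big>) {
        my ($key) = split /,/, $line, 2;
        $offset{$key} = $pos;
        $pos = tell($big);
    }

    # Later: jump straight to the record for a given key.
    my $key = 12345;                       # hypothetical key
    if (exists $offset{$key}) {
        seek($big, $offset{$key}, 0) or die "seek failed: $!";
        print scalar <$big>;
    }
    close($big);

    For 19 million keys a plain in-memory hash runs into the same RAM problem, so in practice you would either tie the hash to an on-disk DBM (DB_File, for example) or only index the ~600,000 keys you actually need.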


  • Registered Users Posts: 1,916 ✭✭✭ronivek


    Half of the problem is already solved for you; http://perldoc.perl.org/Tie/File.html

    Assuming the records in your file are sorted by the field you're interested in then it's just a matter of implementing some sort of search over the array.

    This would all depend entirely on what context you want to perform this operation in; there are better mechanisms out there for storing and retrieving millions of records...
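
    Tie::File presents the file's lines as a Perl array without slurping the whole thing into memory, so a binary search can run over it directly. A small sketch, with the file name and key layout assumed:

    use strict;
    use warnings;
    use Fcntl 'O_RDONLY';
    use Tie::File;

    # Present the big file as a read-only array of lines; Tie::File keeps
    # only a small cache in memory, not the whole file.
    tie my @records, 'Tie::File', 'bigfile.txt', mode => O_RDONLY
        or die "Could not tie bigfile.txt: $!";

    # Binary search the tied array, assuming lines are sorted by their first field.
    my $target = 12345;                    # hypothetical key
    my ($lo, $hi) = (0, $#records);
    while ($lo <= $hi) {
        my $mid = int(($lo + $hi) / 2);
        my ($key) = split /\s+/, $records[$mid];
        if    ($key < $target) { $lo = $mid + 1 }
        elsif ($key > $target) { $hi = $mid - 1 }
        else                   { print "$records[$mid]\n"; last }
    }

    untie @records;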


  • Closed Accounts Posts: 20 boars.ie


    Out of curiosity, I wrote some test code generating such files and doing a search; on my eeePC it used up 20% of mem and took:


    $ /tmp/test.py 2> /tmp/out
    0secs - Generating numbers...
    4secs - Generating records...
    440secs - Reading numbers...
    1secs - Reading records...
    268secs - Results...
    18secs - End...


    The code:


    #!/usr/bin/python

    import sys
    import time
    import random

    num_file_name = '/tmp/nums'
    rec_file_name = '/tmp/recs'
    rand = random.getrandbits
    timestamp = time.time()

    print '0secs - Generating numbers...'
    num_file = open(num_file_name, 'w')
    [ num_file.write(str(rand(24)) + '\n') for i in range(500000) ]
    num_file.close()

    print str(int(time.time() - timestamp)) + 'secs - Generating records...'
    timestamp = time.time()
    record_file = open(rec_file_name, 'w')
    [ record_file.write(str(rand(24)) + ' ' + str(rand(24)) + ' ' + str(rand(24)) + '\n') for i in range(20000000) ]
    record_file.close()

    print str(int(time.time() - timestamp)) + 'secs - Reading numbers...'
    timestamp = time.time()
    num_file = open(num_file_name)
    num_set = set([ line.strip() for line in num_file ])
    num_file.close()

    print str(int(time.time() - timestamp)) + 'secs - Reading records...'
    timestamp = time.time()
    record_file = open(rec_file_name)
    result = [ line for line in record_file if [ record for record in line.split() if record.strip() in num_set ] ]
    record_file.close()

    print str(int(time.time() - timestamp)) + 'secs - Results...'
    timestamp = time.time()
    for line in result:
        sys.stderr.write(line)

    print str(int(time.time() - timestamp)) + 'secs - End...'


  • Closed Accounts Posts: 7,794 ✭✭✭JC 2K3


    ronivek wrote: »
    Half of the problem is already solved for you; http://perldoc.perl.org/Tie/File.html

    Assuming the records in your file are sorted by the field you're interested in then it's just a matter of implementing some sort of search over the array.

    This would all depend entirely on what context you want to perform this operation in; there are better mechanisms out there for storing and retrieving millions of records...
    Nice.

