
Python iteration help please.

  • 19-08-2014 2:35pm
    #1
    Registered Users Posts: 3,131 ✭✭✭


    Hi all,

    I'm trying to divide a large text file which contains what is basically a dump of an email account. I want to extract individual emails and write them to individual files.
    Each mail starts with "From " and ends with --LFLFLF
    I've written the following script but it only extracts the first mail from the file and then stops.
    Can someone help identify where I'm going wrong?
    I'm using IDLE 2.7 on Win7
    import re
    
    target = raw_input("Filename please: ")
    
    def SplitEmails(infile):
        with open(infile, 'r') as f:
            count = 0
            for result in re.findall('(^From\s.*?--\n\n\n)', f.read(), re.S):
                try:
                    count += 1
                    result = result.rstrip()
                    fn = 'email' + str(count) + '.txt'
                    with open(fn, 'w') as f:
                        f.write(result)
                        f.close()
                except:
                    continue
    
    SplitEmails(target)
    


Comments

  • Registered Users Posts: 7,157 ✭✭✭srsly78


    Probably because there are some carriage returns in there as well, not just LFLFLF. It's platform dependent: Windows and Unix treat newlines differently.

    Breakpoint the code and see exactly what symbols it encounters.
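
    One way to check which line endings are really in the data, as a rough sketch rather than a definitive tool: count them in the raw bytes. The helper name count_line_endings and the sample string are made up for illustration.

```python
def count_line_endings(data):
    """Count CRLF pairs, bare LFs and bare CRs in a byte string."""
    crlf = data.count(b"\r\n")
    lf = data.count(b"\n") - crlf   # LFs that are not part of a CRLF pair
    cr = data.count(b"\r") - crlf   # CRs that are not part of a CRLF pair
    return {"CRLF": crlf, "LF": lf, "CR": cr}

# Hypothetical sample: one Windows-style line ending, the rest Unix-style.
sample = b"From a@example.com\r\nbody\n--\n\n\n"
counts = count_line_endings(sample)
print(counts)
```

    For a real file, the bytes would need to come from open(path, 'rb').read(), since text mode on Windows translates CRLF to LF while reading.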


  • Registered Users Posts: 3,131 ✭✭✭Dermot Illogical


    srsly78 wrote: »
    Probably because there are some carriage returns in there as well, not just LFLFLF. It's platform dependent: Windows and Unix treat newlines differently.

    Breakpoint the code and see exactly what symbols it encounters.


    Thanks.
    I have it open in Notepad++, set to display both LF and CR, so I'm reasonably certain there aren't any CRs screwing it up.
    It matches the 1st email and writes it out perfectly. If I remove that email from the original file it will match the next one only, and so on.
    The regex is getting exactly what I want, but only once.

    I'll try breakpoint, but will need to google it 1st as I'm basically winging it here.


  • Registered Users Posts: 3,131 ✭✭✭Dermot Illogical


    It's always something small, isn't it?
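    Adding re.M has fixed it. Without it, ^ only matches at the very start of the whole string, so only the first "From " could ever match. I'm sure there are a million better ways to do it, though.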
    import re
    
    target = raw_input("Filename please: ")
    
    def SplitEmails(infile):
        # Read the whole dump into memory before matching.
        with open(infile, 'r') as f:
            data = f.read()
        # re.S lets . span newlines; re.M lets ^ match at the start of every line.
        mails = re.findall(r'(^From\s.*?--\n\n\n)', data, re.S | re.M)
        for count, result in enumerate(mails, 1):
            fn = 'email' + str(count) + '.txt'
            # Separate name for the output handle; the with block closes it.
            with open(fn, 'w') as out:
                out.write(result.rstrip())
    
    SplitEmails(target)
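
    To see the effect of re.M in isolation, here is a minimal sketch that runs the same pattern over a small made-up string with and without the flag. The sample text and variable names are invented for illustration; the pattern is the one from the script.

```python
import re

# Two fake mails, each starting with "From " and ending with "--" plus three LFs.
text = ("From a@example.com\nhello\n--\n\n\n"
        "From b@example.com\nbye\n--\n\n\n")
pattern = r'(^From\s.*?--\n\n\n)'

# Without re.M, ^ anchors only at the start of the whole string.
without_m = re.findall(pattern, text, re.S)
# With re.M, ^ also anchors right after every newline.
with_m = re.findall(pattern, text, re.S | re.M)

print("without re.M: %d match(es)" % len(without_m))
print("with re.M: %d match(es)" % len(with_m))
```

    This matches the symptom described earlier in the thread: deleting the first mail moves the next "From " to the start of the string, which is the only place ^ can match without re.M.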
    

