Parsing Name Column - Firstname, Surname

  • 09-03-2009 4:55pm
    #1
    Registered Users Posts: 500 ✭✭✭


    Hi,

    I have a problem. I have a large Excel file that has a Name column.
    Names are in the format:
    Mary Hunt
    Martin S Ryan
    John O'Neill
    And god knows what other horrible formats.
    I need to write something that will extract the names and put them in two columns: Firstname and Surname.

    The language I am writing it in is Python, but I don't want exact code - just the theory. Or sample code in whatever language to point me in the right direction.

    Any help would be appreciated.


Comments

  • Registered Users Posts: 68,317 ✭✭✭✭seamus


    I'm assuming the input file is just plain text?

    From my experience (though I haven't actually checked to see if there's a theory on this), your best bet is not to try and parse it forwards, the way you'd normally think of it.

    That is, you'd normally say "Find the first whitespace character and split the string there". But that falls down at double-barrelled names and names with initials in them and so forth. So if you work backwards - find the last whitespace character and split the string there, you'll have a much higher success rate.
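
    A minimal sketch of the difference, in Python since that's the language mentioned in the question. The function names are made up for illustration, and it assumes each name is already a plain string:

```python
def split_first_space(name):
    """Forwards: split at the first whitespace, the rest becomes the surname."""
    parts = name.strip().split(None, 1)
    first = parts[0] if parts else ""
    surname = parts[1] if len(parts) > 1 else ""
    return first, surname


def split_last_space(name):
    """Backwards: split at the last whitespace, only the final word is the surname."""
    parts = name.strip().rsplit(None, 1)
    if len(parts) == 2:
        return parts[0], parts[1]
    # Zero or one word: nothing to split on.
    return (parts[0] if parts else ""), ""


# "Martin S Ryan" forwards  gives ("Martin", "S Ryan")
# "Martin S Ryan" backwards gives ("Martin S", "Ryan")
```

    For two-word names both functions give the same answer; they only disagree once there are initials or middle names.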

    Of course, this is far from flawless - you'll have names like "De Valera" and "Mc Duff", and you'll also have people whose names are written without the proper punctuation, such as "O Neill".

    This is where pattern matching comes in. You can find a pretty comprehensive list of these prefixes and test for them too, which should in theory give you a well-split list with most names correctly parsed. You may also be able to test for "odd" names and have the script spit those out for manual processing.
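
    Roughly what that could look like, as a sketch rather than a finished parser. The prefix set below only contains the three examples from this post and would need to be far longer in practice; the function name and the "needs review" rule (anything left over besides a single first name) are assumptions for illustration:

```python
# Illustrative only: just the prefixes mentioned in this post.
SURNAME_PREFIXES = {"de", "mc", "o"}


def split_name(name):
    """Work from the right. Returns (first, surname, needs_review)."""
    tokens = name.split()
    if len(tokens) < 2:
        # Empty or single-word name: nothing sensible to split, flag it.
        return name.strip(), "", True
    surname = [tokens.pop()]
    # Pull a detached prefix ("De", "Mc", "O") onto the surname,
    # but always leave at least one token for the first name.
    while len(tokens) > 1 and tokens[-1].lower().rstrip("'.") in SURNAME_PREFIXES:
        surname.insert(0, tokens.pop())
    first = " ".join(tokens)
    # Initials or middle names left over count as "odd" here.
    needs_review = len(tokens) > 1
    return first, " ".join(surname), needs_review
```

    With this, "John O Neill" comes back as ("John", "O Neill", False), while "Martin S Ryan" comes back as ("Martin S", "Ryan", True) and lands in the pile for manual checking.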


  • Registered Users Posts: 6,509 ✭✭✭daymobrew


    I'm into doing things the easy way:
    - Split name by spaces
    - If two parts then it's easy
    - If three parts then put first part as first name and join rest to be surname.
    - If four parts then do something similar.

    I would report or otherwise flag the 3+ part ones for human review later.

    This simple method might suffice if the number of the non-trivial formations is low enough.
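
    As a sketch in Python (function names are invented for illustration; flagging single-word and empty names as well as the 3+ part ones is an extra assumption on top of the steps above):

```python
def simple_split(name):
    """First word is the first name, everything after it is the surname."""
    parts = name.split()
    if not parts:
        return "", "", True
    first = parts[0]
    surname = " ".join(parts[1:])
    # Exactly two parts is the easy case; anything else gets flagged.
    needs_review = len(parts) != 2
    return first, surname, needs_review


def split_all(names):
    """Return (rows, flagged): all parsed rows, plus the names to review by hand."""
    rows, flagged = [], []
    for name in names:
        first, surname, needs_review = simple_split(name)
        rows.append((first, surname))
        if needs_review:
            flagged.append(name)
    return rows, flagged
```

    Running split_all over the three sample names from the first post flags only "Martin S Ryan", so the length of the flagged list is a quick way to judge whether the simple method is good enough for the file.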


  • Registered Users Posts: 5,618 ✭✭✭Civilian_Target


    We do this a lot where I work, quite similar to what Daymo suggests.

    First tokenize on space. If there's two tokens, done.
    If there's 3 tokens,
    - check the first one for Mr, Mrs, Ms, Miss, Dr, Fr, etc.
    - check the middle token against a list of common surname prefixes, Mc, Mac, O, De, Van. If it matches, attach to the surname
    - If the first name is some variant of Mohamed, attach the middle token to the last name; otherwise attach it to the first name
    If there's 4 tokens, check to reduce it to 3 or 2 like above. If you're still left with 4, split it down the middle
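
    One possible reading of those rules in Python, offered as a sketch and not as the code actually used at that workplace. The title and prefix sets come from the lists above (the "etc." is not filled in), the spelling variants of Mohamed are a guess, and all names here are hypothetical:

```python
TITLES = {"mr", "mrs", "ms", "miss", "dr", "fr"}
SURNAME_PREFIXES = {"mc", "mac", "o", "de", "van"}
# Assumed spellings, for illustration only.
MOHAMED_VARIANTS = {"mohamed", "mohammed", "mohammad", "muhammad"}


def _norm(token):
    """Lower-case and drop trailing/leading dots and apostrophes ("Dr." -> "dr")."""
    return token.lower().strip(".'")


def parse_name(name):
    """Return (first, surname) using the token rules described above."""
    tokens = name.split()
    # Drop a leading title, as long as a name is still left afterwards.
    if len(tokens) > 2 and _norm(tokens[0]) in TITLES:
        tokens = tokens[1:]
    # Glue a detached surname prefix onto the last token.
    if len(tokens) >= 3 and _norm(tokens[-2]) in SURNAME_PREFIXES:
        tokens[-2:] = [" ".join(tokens[-2:])]
    if len(tokens) == 3:
        if _norm(tokens[0]) in MOHAMED_VARIANTS:
            tokens = [tokens[0], tokens[1] + " " + tokens[2]]
        else:
            tokens = [tokens[0] + " " + tokens[1], tokens[2]]
    # Still more than two tokens: split down the middle.
    if len(tokens) > 2:
        mid = len(tokens) // 2
        tokens = [" ".join(tokens[:mid]), " ".join(tokens[mid:])]
    first = tokens[0] if tokens else ""
    surname = tokens[1] if len(tokens) > 1 else ""
    return first, surname
```

    For example, "Dr Martin S Ryan" loses the title and becomes ("Martin S", "Ryan"), and "John O Neill" becomes ("John", "O Neill").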


  • Registered Users Posts: 500 ✭✭✭warrenaldo


    Thanks guys.

    Having thought about it last night, it looked like the only logical way of doing it was tokenizing it and handling special cases as outlined above.

    Thanks for all the help. I should be sorted now.


  • Registered Users Posts: 68,317 ✭✭✭✭seamus


    Yep, the other ideas are more elegant than mine :)


  • Registered Users Posts: 25 Malached


    It all depends on what you want. If you want first name/surname, looking for the first space will probably work, at least in Western alphabets. If you have the freedom, use one column for first name and one column for surname, at least in Western societies.

