Parsing Name Column - Firstname, Surname

warrenaldo · 09-03-2009 04:55PM #1

Hi,

I have a problem. I have a large Excel file that has a Name column.
Names are in the format:
Mary Hunt
Martin S Ryan
John O'Neill
And god knows what other horrible formats.
I need to write something that will extract the names and put them in 2 columns: Firstname and Surname.

The language I am writing it in is Python, but I don't want exact code - just the theory. Or sample code in whatever language to point me in the right direction.

Any help would be appreciated.

seamus · 09-03-2009 05:06PM

I'm assuming the input file is just plain text?

From my experience (though I haven't actually checked to see if there's a theory on this), your best bet is not to try and parse it forwards, the way you'd normally think of it.

That is, you'd normally say "Find the first whitespace character and split the string there". But that falls down at double-barrelled names and names with initials in them and so forth. So if you work backwards - find the last whitespace character and split the string there, you'll have a much higher success rate.
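
A minimal sketch of the backwards split in Python, assuming each name is already a plain string. The helper name `split_on_last_space` is made up for illustration:

```python
def split_on_last_space(name):
    """Split a full name at its last run of whitespace into (firstname, surname)."""
    # Strip outer whitespace first, then split once, starting from the right.
    parts = name.strip().rsplit(None, 1)
    if len(parts) == 2:
        return parts[0], parts[1]
    # Single-word or empty input: no surname can be identified.
    return name.strip(), ""


for raw in ["Mary Hunt", "Martin S Ryan", "John O'Neill"]:
    print(split_on_last_space(raw))
```

With this, "Martin S Ryan" keeps the initial with the first name, while a forwards split would have lumped it in with the surname.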

Of course, this is far from flawless - you'll have names like "De Valera", "Mc Duff", you'll also have people whose names are written without the proper punctuation, such as "O Neill".

This is where pattern matching comes in. You can find a pretty comprehensive list of these prefixes and test for them too, which should in theory give you a well split list with most names correctly parsed and caught. You may also be able to test for "odd" names and tell the script to spit those out to you for manual processing.
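
Something along these lines might do for the prefix test, as a rough sketch only. The prefix set holds just the three examples mentioned above (a real list would be far longer), the function name is hypothetical, and treating every name that is not exactly two words as "odd" is an assumption:

```python
# Illustrative only: just the prefixes from the examples above.
SURNAME_PREFIXES = {"de", "mc", "o"}


def split_with_prefixes(name):
    """Return (firstname, surname, needs_review) for one full name."""
    tokens = name.split()
    if len(tokens) < 2:
        # Nothing to split; hand it to a human.
        return name.strip(), "", True
    # Work backwards: the surname starts at the last token...
    cut = len(tokens) - 1
    # ...and absorbs one preceding prefix token, as long as a first name remains.
    if cut > 1 and tokens[cut - 1].lower() in SURNAME_PREFIXES:
        cut -= 1
    firstname = " ".join(tokens[:cut])
    surname = " ".join(tokens[cut:])
    # Assumption: anything that is not a plain two-word name counts as "odd".
    return firstname, surname, len(tokens) != 2


print(split_with_prefixes("John O Neill"))
print(split_with_prefixes("Martin S Ryan"))
```

Rows that come back with the flag set can be written to a separate list and checked by hand.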

daymobrew · 09-03-2009 10:16PM

I'm into doing things the easy way:
- Split name by spaces
- If two parts then it's easy
- If three parts then put first part as first name and join rest to be surname.
- If four parts then do something similar.

I would report or otherwise flag the 3+ part ones for human review later.

This simple method might suffice if the number of the non-trivial formations is low enough.
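
As a rough Python sketch of that approach: `simple_split` is a made-up name, and handling four or more parts the same way as three is an assumption, since "something similar" leaves it open:

```python
def simple_split(name):
    """Return (firstname, surname, needs_review) using a plain split on spaces."""
    parts = name.split()
    if not parts:
        # Empty cell: nothing to extract, flag it.
        return "", "", True
    firstname = parts[0]
    # Everything after the first part is joined back together as the surname.
    surname = " ".join(parts[1:])
    # Only exact two-part names are trusted; the rest go to human review.
    return firstname, surname, len(parts) != 2


for raw in ["Mary Hunt", "Martin S Ryan"]:
    print(simple_split(raw))
```

Counting the flagged rows gives a quick idea of whether the non-trivial cases are rare enough for this to be sufficient.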

Civilian_Target · 10-03-2009 12:18AM

We do this a lot where I work, quite similar to what Daymo suggests.

First tokenize on space. If there's two tokens, done.
If there's 3 tokens,
- check the first one for Mr, Mrs, Ms, Miss, Dr, Fr, etc.
- check the middle token against a list of common surname prefixes, Mc, Mac, O, De, Van. If it matches, attach to the surname
- If the first name is some variant of Mohamed, attach the middle token to the last name; otherwise attach it to the first name
If there's 4 tokens, check to reduce it to 3 or 2 like above. If you're still left with 4, split it down the middle
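
A hedged Python sketch of these rules follows. Several details are assumptions rather than part of the description: that a leading title is simply dropped, the particular spellings in `MOHAMED_VARIANTS`, that a prefix is merged into the token after it, and that names longer than 4 tokens are also split at the midpoint. All names in the code are hypothetical.

```python
TITLES = {"mr", "mrs", "ms", "miss", "dr", "fr"}
PREFIXES = {"mc", "mac", "o", "de", "van"}
# Assumption: a few common spellings; extend as needed.
MOHAMED_VARIANTS = {"mohamed", "mohammed", "mohammad", "muhammad"}


def split_name(name):
    """Return (firstname, surname) following the token rules described above."""
    tokens = name.split()
    # Assumption: a leading title is dropped, which reduces the token count.
    if len(tokens) >= 3 and tokens[0].lower().rstrip(".") in TITLES:
        tokens = tokens[1:]
    # A surname prefix just before the last token is merged into the surname.
    if len(tokens) >= 3 and tokens[-2].lower() in PREFIXES:
        tokens[-2:] = [tokens[-2] + " " + tokens[-1]]
    if len(tokens) < 2:
        return " ".join(tokens), ""
    if len(tokens) == 2:
        return tokens[0], tokens[1]
    if len(tokens) == 3:
        first, middle, last = tokens
        if first.lower() in MOHAMED_VARIANTS:
            # Middle token goes with the last name.
            return first, middle + " " + last
        # Otherwise the middle token goes with the first name.
        return first + " " + middle, last
    # Still 4 (or more) tokens: split down the middle.
    mid = len(tokens) // 2
    return " ".join(tokens[:mid]), " ".join(tokens[mid:])


for raw in ["Mr John Smith", "John O Neill", "Martin S Ryan"]:
    print(split_name(raw))
```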

warrenaldo · 10-03-2009 09:53AM

Thanks guys.

Having thought about it last night, it looked like the only logical way of doing it was tokenizing it and handling special cases as outlined above.

Thanks for all the help. I should be sorted now.

seamus · 10-03-2009 10:16AM

Yep, the other ideas are more elegant than mine

Malached · 12-03-2009 01:44AM

It all depends on what you want. If you want first name/surname, looking for the first space will probably work, at least in western alphabets. If you have the freedom, use one column for firstname and one column for surname, at least in western societies.
