
Words of wisdom? Count words in a document

  • 04-03-2007 4:46am
    #1
    Registered Users Posts: 4,276 ✭✭✭


    Greetings,

    Maybe I'm not thinking straight, but I would greatly appreciate it if someone could point me in the direction of how to count how many times words appear in a document (just plain text).

    I was thinking...
    1. Open the document
    2. Run through it scanning for white space, copying the stuff between the white space
    3. Slap it into some form of collection (straightforward array?)

    Then I get stuck as scanning for matching words is going to be awkward with that approach.

    Would appreciate any advice, as I'm sure I am missing something. I'm using C#.

    I don't need source code, but pointers are a great help :)


Comments

  • Moderators, Recreation & Hobbies Moderators, Science, Health & Environment Moderators, Technology & Internet Moderators Posts: 91,696 Mod ✭✭✭✭Capt'n Midnight


    You could run through the characters one at a time and note whether each is alphabetic or non-alphabetic. You increase the count when an alphabetic character is followed by a non-alphabetic one.

    When defining alphabetic characters, how would you define a word? Are numbers words: "three" vs. 3, "23" vs. "twenty three"? What about hyphens?
    Don't forget that punctuation marks also count as white space, but watch out for contractions such as "don't".
    Also, do you want foreign letters?
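
For anyone landing here later, the character-by-character scan described above might be sketched in C# roughly as follows. This is only a sketch under stated assumptions: the names `WordScanner` and `CountWords` are made up for illustration, digits are treated as word characters, an apostrophe only counts when it appears inside a word (so "don't" stays one word), and hyphens split words. Adjust those rules to suit your own definition of a word.

```csharp
using System;

public static class WordScanner
{
    // Counts words in a single pass. A word is finished whenever a word
    // character is followed by a non-word character or by the end of the text.
    public static int CountWords(string text)
    {
        int count = 0;
        bool inWord = false;

        foreach (char c in text)
        {
            // Assumption: letters and digits (including non-English letters)
            // are word characters; an apostrophe counts only mid-word.
            bool isWordChar = char.IsLetterOrDigit(c) || (c == '\'' && inWord);

            if (inWord && !isWordChar)
            {
                count++;            // just stepped off the end of a word
            }
            inWord = isWordChar;
        }

        if (inWord)
        {
            count++;                // text ended while still inside a word
        }
        return count;
    }
}
```

Because `char.IsLetterOrDigit` is Unicode-aware, accented and other non-English letters are treated as letters without extra work.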


  • Closed Accounts Posts: 461 ✭✭markf909


    Surely a lexing tool in conjunction with some code to store the words in a hashtable would be enough.
    See here for C# Lex
    http://www.infosys.tuwien.ac.at/cuplex/lex.htm#1.%20About%20C#%20CUP

    I can give you loads of pointers for writing the regular expressions that describe each type of word you need, if you want.


    If you don't want to use an external tool, then what about reading a line at a time and using the C# equivalent of Java's String.split(" ") (String.Split in C#), which will return an array of every word on that line?
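
A minimal sketch of the split-plus-hashtable idea, assuming C# with the generic `Dictionary<string, int>` standing in for the hashtable. The class and method names (`WordFrequency`, `Count`) and the particular separator list are illustrative choices, not anything prescribed in the thread.

```csharp
using System;
using System.Collections.Generic;

public static class WordFrequency
{
    // Splits the text on white space and common punctuation, lower-cases
    // each piece, and tallies how often each word occurs.
    public static Dictionary<string, int> Count(string text)
    {
        // Assumption: these separators are "good enough" for plain text.
        char[] separators = { ' ', '\t', '\r', '\n', '.', ',', ';', ':', '!', '?', '"', '(', ')' };
        var counts = new Dictionary<string, int>();

        foreach (string raw in text.Split(separators, StringSplitOptions.RemoveEmptyEntries))
        {
            string word = raw.ToLowerInvariant();

            int current;
            counts.TryGetValue(word, out current);  // current is 0 when the word is new
            counts[word] = current + 1;
        }
        return counts;
    }
}
```

Looking up a particular word afterwards is a single `TryGetValue` call on the returned dictionary, so there is no need to scan for matches by hand.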


  • Registered Users Posts: 4,276 ✭✭✭damnyanks


    Sorry - just up now. I am trying to count how many times a certain word appears as opposed to how many words in total.

    Contractions / foreign letters shouldn't be a concern now that I think about it, so no need to handle that.


  • Moderators, Recreation & Hobbies Moderators, Science, Health & Environment Moderators, Technology & Internet Moderators Posts: 91,696 Mod ✭✭✭✭Capt'n Midnight


    ID the words
    put them in an array
    a$(n,0)=string
    a%(n,1)=count
    sort the array
    re-scan, and for each word you find do a (binary?) search on the array and increase the count
    you might want to convert all words to the same case first

    a more complex way would be some sort of linked list that you populate on the fly and re-sort at intervals when it becomes lop-sided

    if you want to make life really hard you could try to work out the context for different words that are spelt the same ( to lead / lead weight, to sow / a sow , to row / a row )
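
One way to get the "sorted structure populated on the fly" without managing the re-sorting yourself is to swap in .NET's `SortedDictionary<string, int>`, which keeps its keys in order as they are added. That is a substitution for the array and linked-list ideas above rather than the same technique, and the names `SortedWordCounts` and `Build` are hypothetical. A rough sketch, assuming the words have already been identified:

```csharp
using System;
using System.Collections.Generic;

public static class SortedWordCounts
{
    // Builds a word -> count table whose keys stay in sorted order,
    // so no separate sort or rebalancing step is needed.
    public static SortedDictionary<string, int> Build(IEnumerable<string> words)
    {
        var counts = new SortedDictionary<string, int>(StringComparer.Ordinal);

        foreach (string word in words)
        {
            // Convert everything to the same case first, as suggested above.
            string key = word.ToLowerInvariant();

            int current;
            counts.TryGetValue(key, out current);   // 0 if not seen before
            counts[key] = current + 1;
        }
        return counts;
    }
}
```

Enumerating the result yields the words in sorted order with their counts. Telling "to lead" from "lead weight" is out of scope here: identical spellings share one entry.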


  • Registered Users Posts: 4,276 ✭✭✭damnyanks


    Nope, as easy as possible. It's a fairly small part of my project :)


  • Registered Users Posts: 4,188 ✭✭✭pH


    Rather silly of them, but Microsoft didn't add a Set collection to C#, which would do exactly what you want
    - add each word to the Set and at the end get the count of objects in the Set (Sets do not allow duplicates).

    If you are allowed to reuse code from elsewhere, Google for a C# Set collection object, download it and use that.
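
A note for later readers: .NET 3.5, released after this thread, added `HashSet<T>` to `System.Collections.Generic`, so a third-party Set is no longer needed. Bear in mind that a set gives you the number of distinct words, not how many times each one appears. A small sketch, with `DistinctWords` and `CountDistinct` as made-up names and case-insensitive matching as an assumption:

```csharp
using System;
using System.Collections.Generic;

public static class DistinctWords
{
    // Returns how many different words occur, ignoring case.
    public static int CountDistinct(IEnumerable<string> words)
    {
        var seen = new HashSet<string>(StringComparer.OrdinalIgnoreCase);

        foreach (string word in words)
        {
            seen.Add(word);     // Add returns false and changes nothing for a duplicate
        }
        return seen.Count;
    }
}
```

For per-word occurrence counts, a dictionary keyed by word is the closer fit.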


  • Registered Users Posts: 4,276 ✭✭✭damnyanks


    Just in case anyone ever finds this on Google, I found a set of collections called Wintellect Power Collections. Really useful library.

    http://www.wintellect.com/PowerCollections.aspx


  • Closed Accounts Posts: 4,943 ✭✭✭Mutant_Fruit


    The C5 collection classes are also pretty nifty:

    http://www.itu.dk/research/c5/


  • Users Awaiting Email Confirmation Posts: 351 ✭✭ron_darrell


    Capt'n Midnight wrote:
    ID the words
    put them in an array
    a$(n,0)=string
    a%(n,1)=count
    sort the array
    re-scan, and for each word you find do a (binary?) search on the array and increase the count
    you might want to convert all words to the same case first

    I'd suggest this is the most efficient way to do it. If you turn the entire document into a text string and change it to lowercase, a split on the whitespace will give you an instant array of all the words in the document (or, if there is no spacing between sentences, do a second split on the period). Sort the array, do a binary search until you reach the word you are looking for, then increment a counter until the match no longer holds true.

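    Pseudocode should look like:

    function get_num_occurs(the_word, the_string)
        var the_array, counter, flag, pos
        // lowercase first so "The" and "the" match (the_word must be lowercase too)
        the_array = split(lowercase(the_string), " ")
        counter = 0
        pos = 0
        flag = false

        the_array.sort()
        // linear walk shown for simplicity; a binary search would jump
        // straight to a match instead of starting from position 0
        while flag == false && pos < the_array.length
            if (the_array[pos] == the_word)
                counter++
            // past the run of matches in the sorted array, so stop early
            if (counter > 0 && the_array[pos] != the_word)
                flag = true
            pos++
        wend

        return counter
    end function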

    Obviously you'll need to change my funcs, vars etc to the equivalent in the lang you are using but this should do you. (And I'm sure many others will point out ways this could be made better and more efficient :) )
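
Translated to C#, and with an actual binary search in place of the linear walk, the idea might look something like the sketch below. The names `SortedSearchCounter` and `CountOccurrences` are invented for the example, and splitting on white space plus the period is an assumption carried over from the description above.

```csharp
using System;

public static class SortedSearchCounter
{
    // Counts how often targetWord appears in text: lowercase, split,
    // sort, binary-search for one match, then measure the run of duplicates.
    public static int CountOccurrences(string targetWord, string text)
    {
        string[] words = text.ToLowerInvariant().Split(
            new[] { ' ', '\t', '\r', '\n', '.' },
            StringSplitOptions.RemoveEmptyEntries);
        Array.Sort(words, StringComparer.Ordinal);

        string target = targetWord.ToLowerInvariant();
        int hit = Array.BinarySearch(words, target, StringComparer.Ordinal);
        if (hit < 0)
        {
            return 0;           // a negative result means the word is absent
        }

        // BinarySearch may land anywhere inside a run of equal words,
        // so walk outwards in both directions to find the ends of the run.
        int first = hit;
        while (first > 0 && words[first - 1] == target) first--;

        int last = hit;
        while (last < words.Length - 1 && words[last + 1] == target) last++;

        return last - first + 1;
    }
}
```

Sorting costs more than a single pass over the text, so this pays off mainly when the sorted array is reused for many lookups. For a one-off count of a single word, a plain loop comparing each word is likely simpler and faster.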

    Regards,
    -RD

