
JavaScript & XML

  • 06-02-2007 10:44pm
    #1
    Moderators, Education Moderators Posts: 2,432 Mod ✭✭✭✭


    Hey,

    I'm looking for a way to only search the characters that are outside tags.

    e.g.

    [PHP]<h1 class="firstHeading">United States dollar</h1>
    <div id="bodyContent">
    <h3 id="siteSub">From Wikipedia, the free encyclopedia</h3>
    <div id="contentSub"></div>[/PHP]

    I want my JavaScript to ignore everything that's in the tags and only parse "United States dollar" and "From Wikipedia, the free encyclopedia", regardless of which tag it's in (because text that isn't enclosed in a tag is generally the raw text).

    The closest example I can find is something like this

    [PHP]var x=xmlDoc.getElementsByTagName("title")[0].childNodes[0][/PHP]

    Which returns the text node within the title tag (its nodeValue holds the actual text).

    I want something that will do this regardless of the tag name (i.e. if there is no text in a particular tag, go on to the next tag and try there, and when you get to a tag that has text, do something).

    edit: Apparently the 'wholeText' from http://www.w3schools.com/dom/dom_text.asp does something similar, but it's unsupported!
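
A minimal sketch of the "try every tag until you find text" idea, assuming the page is already available as a DOM tree. The names `collectText` and `TEXT_NODE` are made up for this example; only `nodeType`, `nodeValue` and `childNodes` are standard DOM properties. It walks the tree recursively and keeps the value of each text node, whatever tag it sits in.

```javascript
// 3 is the standard nodeType value for a text node (Node.TEXT_NODE).
var TEXT_NODE = 3;

// Collect the text of every text node under `node`, in document order.
// Only nodeType, nodeValue and childNodes are used, so this works on real
// DOM nodes in a browser and on plain objects with the same shape.
function collectText(node, out) {
  out = out || [];
  if (node.nodeType === TEXT_NODE) {
    // Skip whitespace-only nodes, e.g. the line breaks between tags.
    if (/\S/.test(node.nodeValue)) {
      out.push(node.nodeValue);
    }
    return out;
  }
  var children = node.childNodes || [];
  for (var i = 0; i < children.length; i++) {
    collectText(children[i], out);
  }
  return out;
}
```

In a browser this could be called as `collectText(document.body)`; the "do something" step would then go where the text is pushed onto the array.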


Comments

  • Registered Users Posts: 2,781 ✭✭✭amen


    So what you are trying to do is extract everything that isn't HTML formatting code from the DOM?

    I suppose you could set up a regular expression and extract the text between > and < on a per-line basis.

    I don't think your XML example is really going to help.
    You could take the HTML, try to turn it into XML and then go through it, but that would be very messy.

    Maybe if you explained why you want the text from the page you might get some more suggestions?
    Is this for screen scraping or some other reason?
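
The regular expression suggestion above could look roughly like this. It is a sketch only: `textBetweenTags` is a hypothetical name, and it scans the whole string rather than going line by line. It will be fooled by a `>` inside an attribute value, by comments, and by the contents of script or style blocks, so it only suits simple markup like the sample in the first post.

```javascript
// Return the non-empty runs of text found between a ">" and the next "<".
function textBetweenTags(html) {
  var pieces = [];
  var re = />([^<>]+)</g;
  var match;
  while ((match = re.exec(html)) !== null) {
    // Trim surrounding whitespace and drop runs that were only whitespace.
    var text = match[1].replace(/^\s+|\s+$/g, "");
    if (text) {
      pieces.push(text);
    }
  }
  return pieces;
}
```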


  • Moderators, Education Moderators Posts: 2,432 Mod ✭✭✭✭Peteee


    amen wrote:
    So what you are trying to do is extract everything that isn't HTML formatting code from the DOM?

    I suppose you could set up a regular expression and extract the text between > and < on a per-line basis.

    I don't think your XML example is really going to help.
    You could take the HTML, try to turn it into XML and then go through it, but that would be very messy.

    Maybe if you explained why you want the text from the page you might get some more suggestions?
    Is this for screen scraping or some other reason?

    I'll try the regex.

    Yeah, it's for screen scraping (I'm looking for a certain string, extracting it, then replacing it with a new value).
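
For the find-and-replace step, one possible sketch is to split the markup on the tags and only touch the pieces in between. `replaceOutsideTags` is a made-up name, `search` is assumed to be a plain non-empty string, and the same caveats apply as for any regex over HTML (a `>` inside an attribute value, comments and script blocks will confuse it).

```javascript
// Replace every occurrence of `search` with `replacement`, but only in the
// text outside tags; tag names and attribute values are left alone.
function replaceOutsideTags(html, search, replacement) {
  // Splitting on a capturing group keeps the tags in the result:
  // even indexes hold text, odd indexes hold the tags themselves.
  var parts = html.split(/(<[^>]*>)/);
  for (var i = 0; i < parts.length; i += 2) {
    parts[i] = parts[i].split(search).join(replacement);
  }
  return parts.join("");
}
```

For example, replacing "dollar" with "euro" in `<h1 class="dollar">United States dollar</h1>` changes the heading text but not the class attribute.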

