
Screen Scraper

  • 09-07-2010 9:20pm
    #1
    Registered Users Posts: 2,945 ✭✭✭


    Hi all

    I suppose a prefix of n00b would also suit this. Basically, I need to acquire data from a website across a large number of pages (500+). My background in programming is C and C++, but a lot of the tutorials I have found seem to use Ruby, PHP or Java for screen scraping.
    Would anyone have an idea of which language would be easiest to pick up for this task, or know of a very good tutorial (for beginners), or even a good screen scraping application?

    Thanks


Comments

  • Moderators, Technology & Internet Moderators Posts: 1,335 Mod ✭✭✭✭croo


    I did some C/C++ before Java and found the transition fairly easy, in syntax terms anyway. Just forget anything to do with memory. If you go the java route I would look at XQuery which is an XML Document query tool (and html is a subset of xml).
    http://www.ibm.com/developerworks/xml/library/j-jtp03225.html

    There might be application tools already based on this; I never checked (a little googling should answer that). If it were 500 sites routinely I might consider building something new in XQuery, but for just 500 pages an existing tool might be the best route.
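A rough sketch of the query-the-document idea, for anyone weighing up the Java route. XQuery itself needs a separate processor, so this swaps in XPath (the expression language XQuery builds on), which ships with the JDK. The class name, sample markup and expression are all made up for illustration, and it only works when the page is well-formed XML.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: not from the thread, just an illustration.
class XPathScrapeSketch {

    // Returns the text content of every node matched by the XPath expression.
    // Assumes the markup is well-formed XML (e.g. clean XHTML).
    static List<String> extract(String xhtml, String expression) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xhtml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xpath.evaluate(expression, doc, XPathConstants.NODESET);
            List<String> results = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                results.add(nodes.item(i).getTextContent());
            }
            return results;
        } catch (Exception e) {
            // Malformed markup or a bad expression ends up here.
            throw new IllegalStateException("Could not query document", e);
        }
    }

    public static void main(String[] args) {
        // Made-up sample page with two "price" divs.
        String page = "<html><body><div class=\"price\"><b>12.50</b></div>"
                    + "<div class=\"price\"><b>8.00</b></div></body></html>";
        System.out.println(extract(page, "//div[@class='price']/b"));
    }
}
```

For 500+ pages the same method would be called once per downloaded page; fetching the pages is left out here.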


  • Registered Users Posts: 527 ✭✭✭Sean^DCT4


    If you opt for the C# / ASP.NET route, there is a DLL called HtmlAgilityPack. This is by far the best HTML screen scraping library I have come across.

    I have written 3 web applications for the company I work for which regularly screen-scrape (legally).

    You can easily point it to HTML attributes and get the text.

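    The sample code below will pull out the first piece of bolded text within a div with a class called 'myClass' (_markup is an HtmlDocument that has already been loaded):
    // SelectSingleNode returns null if nothing matches, so check before using myNode
    HtmlNode myNode = _markup.DocumentNode.SelectSingleNode("//div[contains(@class, 'myClass')]/b");
    string text = myNode.InnerText;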
    


  • Closed Accounts Posts: 8,015 ✭✭✭CreepingDeath


    croo wrote: »
    If you go the java route I would look at XQuery which is an XML Document query tool (and html is a subset of xml).

    Actually, a lot of HTML pages are not true XML.
    They can be badly formed XML, but browsers can still interpret them.

    JTidy is much more tolerant of badly formed HTML and returns a DOM document which you can then query.

    JTidy Sourceforge Link
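To make the point about badly formed pages concrete, here is a minimal sketch using only the strict XML parser that ships with the JDK. The class name and both sample snippets are invented for illustration. It shows a tidy XHTML-style page parsing fine while typical tag soup (an unclosed br and p) is rejected outright. Cleaning such a page up with JTidy would be the next step, which is not shown since it is a separate library.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: not from the thread, just an illustration.
class StrictXmlCheck {

    // True if a strict XML parser accepts the markup, false if it rejects it.
    static boolean isWellFormedXml(String markup) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler keeps the parser from logging to stderr;
            // fatal errors are still thrown as SAXException.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(markup)));
            return true;
        } catch (SAXException e) {
            return false; // not well-formed XML
        } catch (IOException | ParserConfigurationException e) {
            throw new IllegalStateException("Parser setup or I/O failed", e);
        }
    }

    public static void main(String[] args) {
        // Made-up samples: the second is the kind of HTML browsers happily render.
        String tidyPage = "<html><body><p>Hello</p><br/></body></html>";
        String tagSoup  = "<html><body><p>Hello<br><p>World</body></html>";
        System.out.println(isWellFormedXml(tidyPage));
        System.out.println(isWellFormedXml(tagSoup));
    }
}
```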

