Screen Scraper

D-Generate · 09-07-2010 9:20pm #1

Hi all

I suppose a prefix of n00b would also suit this. Basically, I need to acquire data from a website across a large number of pages (500+). My background in programming is C and C++, but a lot of the tutorials I have found seem to use Ruby, PHP or Java for screen scraping.
Would anyone have an idea of which language would be easiest to pick up for this task, or know of a very good tutorial (for beginners), or even a good screen scraping application?

Thanks

croo · 10-07-2010 11:32am

I did some C/C++ before Java and found the transition fairly easy, in syntax terms anyway. Just forget anything to do with memory. If you go the java route I would look at XQuery which is an XML Document query tool (and html is a subset of xml).
http://www.ibm.com/developerworks/xml/library/j-jtp03225.html

There might be existing tools already based on this; I never checked (a little googling should answer that). If it were 500 sites being scraped routinely I might consider building something new in XQuery, but for just 500 pages an existing tool might be the best route.

Sean^DCT4 · 10-07-2010 9:36pm

If you opt for the C# / ASP.NET route, there is a DLL called HtmlAgilityPack. This is by far the best HTML screen scraping library I have come across.

I have written 3 web applications for the company I work for which regularly screen-scrape (legally).

You can easily point it to HTML attributes and get the text.

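The sample code below will pull out the first piece of bolded text within a div with a class called 'myClass' (SelectSingleNode returns only the first match; SelectNodes returns them all):

// _markup is an HtmlDocument that has already been loaded with the page
HtmlNode myNode = _markup.DocumentNode.SelectSingleNode("//div[contains(@class, 'myClass')]/b");
// SelectSingleNode returns null when nothing matches, so guard before reading InnerText
string text = myNode != null ? myNode.InnerText : null;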

CreepingDeath · 11-07-2010 11:39am

croo wrote: »

If you go the java route I would look at XQuery which is an XML Document query tool (and html is a subset of xml).

Actually, a lot of HTML pages are not true XML.
They can be badly formed XML that browsers still manage to interpret, but that a strict XML parser will reject.
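
To make that concrete, here is a minimal sketch using only the XML parser that ships with the JDK. The class name WellFormedCheck, the method isWellFormedXml and the sample markup are all made up for illustration. It just asks a strict parser whether a snippet is well-formed XML: an unclosed <br>, which every browser accepts, is enough to make it fail.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not from any library mentioned in this thread.
class WellFormedCheck {

    // Returns true only if the markup parses as well-formed XML.
    static boolean isWellFormedXml(String markup) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors instead of printing them to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(markup)));
            return true;
        } catch (SAXException e) {
            // The parser hit something that is not well-formed XML.
            return false;
        } catch (ParserConfigurationException | IOException e) {
            // Not expected for an in-memory string; surface it rather than hide it.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Typical browser-friendly HTML: <br> is never closed.
        System.out.println(isWellFormedXml("<div><p>Hello<br>world</p></div>"));
        // Same content written as XHTML: <br/> is self-closed.
        System.out.println(isWellFormedXml("<div><p>Hello<br/>world</p></div>"));
    }
}
```

The first call is rejected and the second is accepted, which is why an XML-only tool needs a clean-up step in front of it for real-world pages.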

JTidy is much more tolerant of badly formed HTML and returns a DOM document which you can then query.
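
Once you have a DOM document, querying it could look something like the sketch below. It uses only the JDK's own DOM and XPath support and parses an already well-formed sample string in place of the JTidy step, so nothing here shows JTidy's actual API. DomQuerySketch, parse and selectText are hypothetical names, the sample markup is invented, and the XPath is modelled on the one in the C# example earlier in the thread. It also assumes the document has no XML namespace; a namespaced XHTML document would need prefixes in the XPath.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper for illustration only.
class DomQuerySketch {

    // Stand-in for the clean-up step: parses markup that is ALREADY well-formed.
    static Document parse(String wellFormedMarkup) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(wellFormedMarkup)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse markup", e);
        }
    }

    // Evaluates an XPath expression and returns the trimmed text of every matching node.
    static List<String> selectText(Document doc, String xpath) {
        try {
            NodeList hits = (NodeList) XPathFactory.newInstance()
                    .newXPath()
                    .evaluate(xpath, doc, XPathConstants.NODESET);
            List<String> texts = new ArrayList<>();
            for (int i = 0; i < hits.getLength(); i++) {
                texts.add(hits.item(i).getTextContent().trim());
            }
            return texts;
        } catch (Exception e) {
            throw new IllegalStateException("Bad XPath: " + xpath, e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse(
                "<html><body>"
                + "<div class=\"myClass\"><b>one</b> and <b>two</b></div>"
                + "<div class=\"other\"><b>ignored</b></div>"
                + "</body></html>");
        // Every <b> directly inside a div whose class attribute contains 'myClass'.
        System.out.println(selectText(doc, "//div[contains(@class, 'myClass')]/b"));
    }
}
```

For 500+ pages you would run the same query in a loop, one document per page, and only the step that produces the Document would change.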

JTidy Sourceforge Link
