
Screen Scraping

Options
  • 06-12-2008 12:04pm
    #1
    Registered Users Posts: 1,512 ✭✭✭


    Has anyone ever done screen scraping? How easy/hard is it to do? Also, I presume it scrapes the html code and not the frontend of the page?


Comments

  • Registered Users Posts: 7,680 ✭✭✭Trampas


    I did it using .NET and regular expressions for a college project.

    You will be reading the HTML code returned.
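In outline, that approach (fetch the page, then pull fields out with a regular expression) looks roughly like this in Python - a sketch only, with the page content and the pattern invented for illustration:

```python
import re

# Sample HTML standing in for a fetched page. In a real run you would
# download it first, e.g. urllib.request.urlopen(url).read().decode().
html = ('<html><body><h1>Results</h1>'
        '<td class="price">12.50</td><td class="price">8.99</td>'
        '</body></html>')

# Pull out every price cell; the non-greedy .*? stops at the first </td>.
prices = re.findall(r'<td class="price">(.*?)</td>', html)
print(prices)
```

Bear in mind regexes are brittle: any change to the site's markup can silently break the pattern.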


  • Closed Accounts Posts: 8,015 ✭✭✭CreepingDeath


    If you are considering using Java, then there's a handy library called JTidy that's definitely worth checking out.

    It parses HTML and is more fault tolerant than basic HTML parsers.
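JTidy is Java-specific, but the same idea - a parser that tolerates sloppy markup - exists elsewhere. As a rough illustration, Python's standard html.parser happily walks HTML with unclosed tags; the class and the sample markup below are made up for the example:

```python
from html.parser import HTMLParser

class TableTextExtractor(HTMLParser):
    """Collects the text inside <td> cells, tolerating unclosed tags."""
    def __init__(self):
        super().__init__()
        self.in_cell = False
        self.cells = []

    def handle_starttag(self, tag, attrs):
        if tag == 'td':
            self.in_cell = True

    def handle_endtag(self, tag):
        if tag == 'td':
            self.in_cell = False

    def handle_data(self, data):
        if self.in_cell and data.strip():
            self.cells.append(data.strip())

# Sloppy HTML: the first <td> is never closed, but the parser copes.
extractor = TableTextExtractor()
extractor.feed('<table><tr><td>one<td>two</td></tr></table>')
print(extractor.cells)
```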


  • Closed Accounts Posts: 19,777 ✭✭✭✭The Corinthian


    stevire wrote: »
    Has anyone ever done screen scraping? How easy/hard is it to do?
    Depends upon what you're screen scraping. If all you're doing is scraping data from a public HTML page, then it's pretty straightforward. If you also have to emulate certain user behaviour, such as first logging into the site, then you need to do that first, get your session ID, and then pass it back when requesting the content you want to scrape - naturally this is far more complex. HTTP referrers are another thing to keep a look-out for.

    The best approach, IMHO, is to go through the process from the point of view of a human user. Detail the steps involved (log in, navigate to the correct page, enter search criteria, read results, etc.), then try to break these down into inputs (HTTP POST parameters, use of sessions/cookies) and outputs (error or expected-behaviour messages), and see how these can be emulated by the system. Finally you can build your application, emulating the various stages of user behaviour, until you get to the content you're seeking.
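That login-then-scrape flow can be sketched as below - in Python rather than any particular stack, with placeholder URLs and form field names; the real field names have to be read off the target site's login form:

```python
import urllib.request
import urllib.parse
import http.cookiejar

def scrape_after_login(login_url, content_url, username, password):
    """Log in first, keep the session cookie, then fetch the protected page.

    All URLs and form field names here are placeholders for illustration.
    """
    # A cookie jar shared by both requests preserves the session ID
    # that the server sets at login time.
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(jar))

    # Step 1: POST the login form, as a browser would.
    form = urllib.parse.urlencode(
        {'user': username, 'pass': password}).encode()
    opener.open(login_url, data=form)

    # Step 2: request the real content; a Referer header is supplied
    # in case the site checks where the request came from.
    req = urllib.request.Request(content_url,
                                 headers={'Referer': login_url})
    return opener.open(req).read().decode()
```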
    Also, I presume it scrapes the html code and not the frontend of the page?
    The HTML is the front end of the page - or at least the most important part of it as all the other content will be called from it.

    Again, it depends however on what it is you want to scrape. Sometimes the content to be scraped is not in the HTML itself, but is in an attached file (such as an external JavaScript document). Other times, the content you're looking for is not text, but an image or other binary file, in which case you will want to grab that.
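For the binary case, the only real difference is reading raw bytes rather than decoded text. A minimal Python sketch - the data: URL below is a stand-in for a real image URL, so the snippet runs without a network connection:

```python
import urllib.request

# When the target is an image or other binary file, read raw bytes and
# skip text decoding entirely.
url = "data:application/octet-stream;base64,aGVsbG8="  # base64 for b'hello'
raw = urllib.request.urlopen(url).read()

with open("scraped.bin", "wb") as f:   # write bytes, not text
    f.write(raw)
```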


  • Registered Users Posts: 1,512 ✭✭✭stevire


    Cheers for the replies!

    The main thing I was looking to scrape was an ASP page. The page contains up-to-date Google Maps markers, so I was looking to extract these and store them in a database, then schedule the process to run every hour or so - delete the old entries and store the new ones.
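That hourly refresh could be sketched as below (Python with SQLite purely for illustration; the GLatLng pattern is an assumption about how the markers might appear in the page source and would need adjusting to the actual ASP output). The scheduling itself would then just be a cron job or Windows Task Scheduler entry running the script:

```python
import re
import sqlite3

def refresh_markers(html, conn):
    """Replace the stored markers with those scraped from the page.

    Assumes the page embeds markers as e.g. GLatLng(53.34, -6.26);
    the pattern must be adjusted to the real page source.
    """
    markers = re.findall(r'GLatLng\(([-\d.]+),\s*([-\d.]+)\)', html)
    with conn:  # one transaction: old rows out, new rows in
        conn.execute('CREATE TABLE IF NOT EXISTS markers (lat REAL, lng REAL)')
        conn.execute('DELETE FROM markers')
        conn.executemany('INSERT INTO markers VALUES (?, ?)', markers)
    return len(markers)

# Example run against a hard-coded snippet of page source.
conn = sqlite3.connect(':memory:')
page = 'map.addOverlay(new GMarker(new GLatLng(53.34, -6.26)));'
print(refresh_markers(page, conn))
```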


  • Registered Users Posts: 916 ✭✭✭Páid


    You don't say which language you need to program it in. This is some sample code (VBScript/classic ASP) to scrape a table from a particular webpage. You would need to alter the regex pattern to suit your needs. With a bit of work you could save it as a .vbs file (swapping Response.Write for WScript.Echo or a file write) and use Task Scheduler to run it every so often.
    <%
    ' URL of the web page we want to retrieve
    thisURL = "http://www.i44speedway.com/Trackpoints.htm"

    ' Create the XMLHTTP object
    Set GetConnection = CreateObject("Microsoft.XMLHTTP")

    ' Request the URL (synchronously)
    GetConnection.Open "GET", thisURL, False
    GetConnection.Send

    ' ResponsePage now holds the response from the remote web server
    ResponsePage = GetConnection.responseText

    ' Write out the extracted table
    Response.Write getTable(ResponsePage)

    Set GetConnection = Nothing

    Function getTable(pageString)
        Dim myMatches

        Set RegularExpressionObject = New RegExp

        With RegularExpressionObject
            .Pattern = "<table.*>(.|\n)*?</table>"
            .IgnoreCase = True
            .Global = True
        End With

        Set myMatches = RegularExpressionObject.Execute(pageString)

        If myMatches.Count > 4 Then
            getTable = myMatches(4).Value ' Get the fifth table on the page
        End If

        Set RegularExpressionObject = Nothing
    End Function
    %>
    


  • Closed Accounts Posts: 12,382 ✭✭✭✭AARRRGH


    I do lots of screen scraping (see my signature).

    It's fairly straightforward. Read HTML, parse for certain information (using a regular expression) and then write to a database.

    The key is the regular expression bit. The O'Reilly book on regular expressions is good.


  • Registered Users Posts: 3,548 ✭✭✭Draupnir


    stevire wrote: »
    Cheers for the replies!

    The main thing I was looking to scrape was an ASP page. The page contains up-to-date Google Maps markers, so I was looking to extract these and store them in a database, then schedule the process to run every hour or so - delete the old entries and store the new ones.

    Screen scraping is relatively simple; it's really just reading HTML and parsing it. Be careful that you have permission to screen scrape first - there have been a number of court cases arising from people screen scraping without permission.
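One easy first check on that is the site's robots.txt (bearing in mind robots.txt is a crawling convention, not legal permission - the site's terms of use still apply). Python's standard library can evaluate it; here the rules are supplied inline rather than downloaded:

```python
from urllib.robotparser import RobotFileParser

# Normally you would call rp.set_url("http://example.com/robots.txt")
# followed by rp.read(); the rules are fed in directly here so the
# example runs offline.
rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])

print(rp.can_fetch("MyScraper", "http://example.com/data.html"))  # allowed
print(rp.can_fetch("MyScraper", "http://example.com/private/x"))  # disallowed
```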

