
Screen Scraping

Options
  • 06-12-2008 12:04pm
    #1
    Registered Users Posts: 1,512 ✭✭✭


    Has anyone ever done screen scraping? How easy/hard is it to do? Also, I presume it scrapes the html code and not the frontend of the page?


Comments

  • Registered Users Posts: 7,680 ✭✭✭Trampas


    I did it using .NET and regular expressions for a college project.

    You will be reading the HTML code returned.
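In outline, that approach (fetch the page, then pull fields out with a regular expression) looks roughly like this in Python - a sketch only, with the page content and the pattern invented for illustration:

```python
import re

# Sample HTML standing in for a fetched page. In a real run you would
# download it first, e.g. urllib.request.urlopen(url).read().decode().
html = ('<html><body><h1>Results</h1>'
        '<td class="price">12.50</td><td class="price">8.99</td>'
        '</body></html>')

# Pull out every price cell; the non-greedy .*? stops at the first </td>.
prices = re.findall(r'<td class="price">(.*?)</td>', html)
print(prices)
```

Bear in mind regexes are brittle: any change to the site's markup can silently break the pattern.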


  • Closed Accounts Posts: 8,015 ✭✭✭CreepingDeath


    If you are considering using Java, then there's a handy library called JTidy that's definitely worth checking out.

    It parses HTML and is more fault tolerant than basic HTML parsers.
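JTidy is Java-specific, but the same idea - a parser that tolerates sloppy markup - exists elsewhere. As a rough illustration, Python's standard html.parser happily walks HTML with unclosed tags; the class and the sample markup below are made up for the example:

```python
from html.parser import HTMLParser

class TableTextExtractor(HTMLParser):
    """Collects the text inside <td> cells, tolerating unclosed tags."""
    def __init__(self):
        super().__init__()
        self.in_cell = False
        self.cells = []

    def handle_starttag(self, tag, attrs):
        if tag == 'td':
            self.in_cell = True

    def handle_endtag(self, tag):
        if tag == 'td':
            self.in_cell = False

    def handle_data(self, data):
        if self.in_cell and data.strip():
            self.cells.append(data.strip())

# Sloppy HTML: the first <td> is never closed, but the parser copes.
extractor = TableTextExtractor()
extractor.feed('<table><tr><td>one<td>two</td></tr></table>')
print(extractor.cells)
```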


  • Closed Accounts Posts: 19,777 ✭✭✭✭The Corinthian


    stevire wrote: »
    Has anyone ever done screen scraping? How easy/hard is it to do?
    Depends upon what you're screen scraping. If all you're doing is scraping data from a public HTML page, then it's pretty straightforward. If you also have to emulate certain user behaviour, such as first logging into the site, then you need to do that first, get your session ID, and then pass it back when requesting the content you want to scrape - naturally this is far more complex. HTTP referrers are another thing to keep a look-out for.

    The best approach, IMHO, is to go through the process from the point of view of a human user. Detail the steps involved (log in, navigate to the correct page, enter search criteria, read results, etc.), then try to break these down into inputs (HTTP POST parameters, use of sessions/cookies) and outputs (error or expected-behaviour messages), and see how these can be emulated by the system. Finally you can build your application, emulating the various stages of user behaviour, until you get to the content you're seeking.
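That login-then-scrape flow can be sketched as below - in Python rather than any particular stack, with placeholder URLs and form field names; the real field names have to be read off the target site's login form:

```python
import urllib.request
import urllib.parse
import http.cookiejar

def scrape_after_login(login_url, content_url, username, password):
    """Log in first, keep the session cookie, then fetch the protected page.

    All URLs and form field names here are placeholders for illustration.
    """
    # A cookie jar shared by both requests preserves the session ID
    # that the server sets at login time.
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(jar))

    # Step 1: POST the login form, as a browser would.
    form = urllib.parse.urlencode(
        {'user': username, 'pass': password}).encode()
    opener.open(login_url, data=form)

    # Step 2: request the real content; a Referer header is supplied
    # in case the site checks where the request came from.
    req = urllib.request.Request(content_url,
                                 headers={'Referer': login_url})
    return opener.open(req).read().decode()
```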
    Also, I presume it scrapes the html code and not the frontend of the page?
    The HTML is the front end of the page - or at least the most important part of it as all the other content will be called from it.

    Again, it depends however on what it is you want to scrape. Sometimes the content to be scraped is not in the HTML itself, but is in an attached file (such as an external JavaScript document). Other times, the content you're looking for is not text, but an image or other binary file, in which case you will want to grab that.
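For the binary case, the only real difference is reading raw bytes rather than decoded text. A minimal Python sketch - the data: URL below is a stand-in for a real image URL, so the snippet runs without a network connection:

```python
import urllib.request

# When the target is an image or other binary file, read raw bytes and
# skip text decoding entirely.
url = "data:application/octet-stream;base64,aGVsbG8="  # base64 for b'hello'
raw = urllib.request.urlopen(url).read()

with open("scraped.bin", "wb") as f:   # write bytes, not text
    f.write(raw)
```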


  • Registered Users Posts: 1,512 ✭✭✭stevire


    Cheers for the replies!

    The main thing I was looking to scrape was an ASP page. The page contains up-to-date Google Maps markers, so I was looking to extract these and store them in a database, then schedule the process to run every hour or so - delete the old entries and store the new ones.
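That hourly refresh could be sketched as below (Python with SQLite purely for illustration; the GLatLng pattern is an assumption about how the markers might appear in the page source and would need adjusting to the actual ASP output). The scheduling itself would then just be a cron job or Windows Task Scheduler entry running the script:

```python
import re
import sqlite3

def refresh_markers(html, conn):
    """Replace the stored markers with those scraped from the page.

    Assumes the page embeds markers as e.g. GLatLng(53.34, -6.26);
    the pattern must be adjusted to the real page source.
    """
    markers = re.findall(r'GLatLng\(([-\d.]+),\s*([-\d.]+)\)', html)
    with conn:  # one transaction: old rows out, new rows in
        conn.execute('CREATE TABLE IF NOT EXISTS markers (lat REAL, lng REAL)')
        conn.execute('DELETE FROM markers')
        conn.executemany('INSERT INTO markers VALUES (?, ?)', markers)
    return len(markers)

# Example run against a hard-coded snippet of page source.
conn = sqlite3.connect(':memory:')
page = 'map.addOverlay(new GMarker(new GLatLng(53.34, -6.26)));'
print(refresh_markers(page, conn))
```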


  • Registered Users Posts: 916 ✭✭✭Páid


    You don't say which language you need to program it in. This is some sample code (VBScript/classic ASP) to scrape a table from a particular webpage. You would need to alter the regex pattern to suit your needs. With a bit of work you could save it as a .vbs file (swapping Response.Write for WScript.Echo or a file write) and use Task Scheduler to run it every so often.
    <%
    ' URL of the web page we want to retrieve
    thisURL = "http://www.i44speedway.com/Trackpoints.htm"

    ' Create the XMLHTTP object
    Set GetConnection = CreateObject("Microsoft.XMLHTTP")

    ' Request the URL (synchronously)
    GetConnection.Open "GET", thisURL, False
    GetConnection.Send

    ' ResponsePage now holds the response from the remote web server
    ResponsePage = GetConnection.responseText

    ' Write out the extracted table
    Response.Write getTable(ResponsePage)

    Set GetConnection = Nothing

    Function getTable(pageString)
        Dim myMatches

        Set RegularExpressionObject = New RegExp

        With RegularExpressionObject
            .Pattern = "<table.*>(.|\n)*?</table>"
            .IgnoreCase = True
            .Global = True
        End With

        Set myMatches = RegularExpressionObject.Execute(pageString)

        If myMatches.Count > 4 Then
            getTable = myMatches(4).Value ' Get the fifth table on the page
        End If

        Set RegularExpressionObject = Nothing
    End Function
    %>
    


  • Closed Accounts Posts: 12,382 ✭✭✭✭AARRRGH


    I do lots of screen scraping (see my signature).

    It's fairly straightforward. Read HTML, parse for certain information (using a regular expression) and then write to a database.

    The key is the regular expression bit. The O'Reilly book on regular expressions is good.


  • Registered Users Posts: 3,548 ✭✭✭Draupnir


    stevire wrote: »
    Cheers for the replies!

    The main thing I was looking to scrape was an ASP page. The page contains up-to-date Google Maps markers, so I was looking to extract these and store them in a database, then schedule the process to run every hour or so - delete the old entries and store the new ones.

    Screen scraping is relatively simple; it's really just reading HTML and parsing it. Be careful that you have permission to screen scrape first - there have been a number of court cases arising from people screen scraping without permission.
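One easy first check on that is the site's robots.txt (bearing in mind robots.txt is a crawling convention, not legal permission - the site's terms of use still apply). Python's standard library can evaluate it; here the rules are supplied inline rather than downloaded:

```python
from urllib.robotparser import RobotFileParser

# Normally you would call rp.set_url("http://example.com/robots.txt")
# followed by rp.read(); the rules are fed in directly here so the
# example runs offline.
rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])

print(rp.can_fetch("MyScraper", "http://example.com/data.html"))  # allowed
print(rp.can_fetch("MyScraper", "http://example.com/private/x"))  # disallowed
```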

