
Scraping data from another site on a timed basis

  • 16-04-2008 1:52pm
    #1
    Registered Users Posts: 6,465 ✭✭✭


    I need to scrape some data from another site and write it to a database. I have an ASP page that does this; my problem is that I need to run it every 15 to 20 minutes or so, because the data is real-time and gets refreshed, and I need to build up a history of it.
    So basically, I guess I'm looking at running an autonomous process on the web server? I know I'll need to get onto my host about this, but I'm wondering whether this is something that would generally be allowed. There's not very much traffic involved, and even allowing for 50 or so iterations a day it would only use up a fraction of my allowable traffic.
    And if so, from a technical standpoint, what would be involved in setting this up?


Comments

  • Registered Users Posts: 4,468 ✭✭✭matt-dublin


    where are you pulling the data from? there could be an issue with copyright infringement.


  • Registered Users Posts: 6,465 ✭✭✭MOH


    matt-dublin wrote: »
    where are you pulling the data from? there could be an issue with copyright infringement.

    Thanks, but I think I'm OK on that front. It's a public service information site, and I've been through their T&Cs fairly thoroughly. I'm also not planning on reproducing the data directly on my site, it's more for validation against data from other sources.


  • Moderators, Computer Games Moderators Posts: 10,462 Mod ✭✭✭✭Axwell


    You should look at setting up a cron job to run the process when you need to.


  • Registered Users Posts: 1,462 ✭✭✭Peanut


    MOH wrote: »
    ... I know I'll need to get onto my host about this, but I'm just wondering is this something that would generally be allowed?
    ...

    As long as you're not hitting the remote server hard with a lot of requests and data transfer, the robot might briefly catch the interest of their webmasters (depending on the site), but it is very unlikely to have any repercussions.

    I don't think your host will be bothered about it at all either.
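One way to keep the load on the remote server low is to enforce a minimum gap between requests and to identify the robot in its User-Agent header. The following is only a sketch under assumptions: the 15-minute interval comes from the original question, while the function names, the robot name and the contact address are hypothetical placeholders.

```python
import time
import urllib.request

# Assumption: 15 minutes, the lower bound mentioned in the original question.
MIN_INTERVAL_SECONDS = 15 * 60

# Hypothetical robot name and contact address; replace with real details.
USER_AGENT = "history-scraper/0.1 (contact@example.com)"


def should_fetch(last_fetch_time, now=None, min_interval=MIN_INTERVAL_SECONDS):
    """Return True if at least min_interval seconds have passed since the last fetch."""
    if last_fetch_time is None:
        # Nothing fetched yet, so the first request is always allowed.
        return True
    if now is None:
        now = time.time()
    return (now - last_fetch_time) >= min_interval


def build_request(url):
    """Build a request that tells the remote webmaster who is calling."""
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
```

The guard is separate from the scheduler, so even if the job is triggered more often than intended it will not send extra requests.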


  • Registered Users Posts: 6,511 ✭✭✭daymobrew


    Axwell wrote: »
    You should look at setting up a cron job to run the process when you need to.

    I use cron to wget a page from the nra.ie website once an hour. Once a day I push the data into a database (via a password-protected CGI script), and then my web page reads the db.

    Obviously, the system that the cron job runs on needs to be on 24/7. If your hosting provider allows cron, then I'd use it.
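As a rough illustration of the fetch-and-store approach described above, here is a minimal sketch of a script that cron could run. It is not the setup used in this thread: it swaps wget and the CGI script for Python's standard library and a local SQLite file, and the URL, database path, table name and script path are all hypothetical.

```python
import sqlite3
import sys
import urllib.request
from datetime import datetime, timezone

# Hypothetical values for illustration only.
SOURCE_URL = "https://example.org/data"
DB_PATH = "history.db"

# Example crontab entry (paths are assumptions), running every 15 minutes:
#   */15 * * * * /usr/bin/python3 /path/to/scrape.py run


def fetch_page(url, timeout=30):
    """Download the page and return its body as text."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def store_snapshot(conn, url, body, fetched_at=None):
    """Append one timestamped copy of the page to the history table."""
    if fetched_at is None:
        fetched_at = datetime.now(timezone.utc).isoformat()
    conn.execute(
        "CREATE TABLE IF NOT EXISTS snapshots ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, "
        "fetched_at TEXT NOT NULL, url TEXT NOT NULL, body TEXT NOT NULL)"
    )
    conn.execute(
        "INSERT INTO snapshots (fetched_at, url, body) VALUES (?, ?, ?)",
        (fetched_at, url, body),
    )
    conn.commit()
    return fetched_at


def run_once(db_path=DB_PATH, url=SOURCE_URL, fetch=fetch_page):
    """Fetch the page once and store it; cron supplies the repetition."""
    conn = sqlite3.connect(db_path)
    try:
        return store_snapshot(conn, url, fetch(url))
    finally:
        conn.close()


if __name__ == "__main__" and sys.argv[1:] == ["run"]:
    run_once()
```

Each run appends a row rather than overwriting, which is what builds up the history. The `fetch` parameter lets the storage logic be tried out without touching the remote site.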


  • Registered Users Posts: 6,465 ✭✭✭MOH


    Thanks folks, I'll check it out with the host and look into a cron job.

