
web spider / crawler

  • 12-12-2005 10:01pm
    #1
    Registered Users Posts: 2,031 ✭✭✭ colm_c


    I'm looking for some kind of automated agent that I can run against a site to find all pages containing a certain piece of HTML - e.g. search for '<table' and get back the URLs of all the pages with tables on them.

    Using a find and replace on a flat file or database isn't really an option due to the CMS being used.

    Anyone come across any of these?


Comments

  • Closed Accounts Posts: 39 Arch-Stanton


    http://kineticscripts.com/cgi-bin/search/scripts.cgi?script=57

    Something like the above could be modified for your needs.


  • Registered Users Posts: 2,031 ✭✭✭colm_c


    I don't think that would work to be honest - that seems to look for actual files, whereas what I need to search is a site built on a database... with tonnes of query strings etc.

    I also can't install anything on the server the site's on, but I can install whatever I need on my own server...


  • Registered Users Posts: 7,412 ✭✭✭jmcc


    colm_c wrote:
    I'm looking for some kind of automated agent that I can run against a site to find all pages containing a certain piece of HTML - e.g. search for '<table' and get back the URLs of all the pages with tables on them.

    Using a find and replace on a flat file or database isn't really an option due to the CMS being used.

    Anyone come across any of these?
    Is it your site or somebody else's? The admins of large sites tend to take a dim view of people trying to download their database. Some will ban your IP without blinking, and if they're in a bad mood they'll make serious trouble for you with your hosting provider/ISP. Many of these sites now have specific clauses about automated downloading and scraping in their terms and conditions of use.

    The CMS in use may actually generate pages from a template, so it might be easier to check the template for the text you're after rather than downloading the entire site.

    There are a few agents out there that can whack (mirror) a site, and you could then just run a simple grep over the downloaded files to check for the code fragment.
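    Something along these lines in Perl would do the grep stage (just a sketch - it assumes the site has already been mirrored into a local directory by one of those agents, and ./mirror and the '<table' fragment are only example values):

        #!/usr/bin/perl
        use strict;
        use warnings;
        use File::Find;

        my $fragment = '<table';     # the HTML fragment to hunt for
        my $mirror   = './mirror';   # wherever the mirroring tool put the files

        find(sub {
            return unless -f && /\.s?html?$/i;       # only plain HTML files
            open my $fh, '<', $_ or return;
            my $content = do { local $/; <$fh> };    # slurp the whole file
            close $fh;
            # case-insensitive match, since Word's HTML tends to shout <TABLE>
            print "$File::Find::name\n" if index(lc $content, lc $fragment) >= 0;
        }, $mirror);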

    Regards...jmcc


  • Registered Users Posts: 2,031 ✭✭✭colm_c


    The site belongs to one of our clients, and the CMS they use supports direct HTML input but also file attachments - it's these attached files that are the problem. They're old, clunky HTML output from Word, and we need to convert them ASAP.

    Basically I wanted to log all the pages like this so we could give them a rough estimate of how long the conversion would take...

    I guess using Teleport Pro or something, then doing a find and logging the file names, would work - but that's gonna be one hell of a log...


  • Closed Accounts Posts: 304 ✭✭Zaltais


    It would be relatively straightforward to write something to do this in Perl. I'm sure scripts exist out there to do what you're looking for, but I've always tended to roll my own whenever I've needed something like this, as I've found off-the-shelf solutions too restrictive.

    I can suggest a few Perl modules that'll be helpful if you do decide to roll your own (PM me if you're interested), but I don't know of any off-the-shelf solutions offhand...
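    If you do go the roll-your-own route, a bare-bones starting point might look something like this (only a sketch: it uses LWP::UserAgent, HTML::LinkExtor and URI from CPAN, the start URL and fragment are placeholders, it only follows same-host links, and it does nothing about robots.txt):

        #!/usr/bin/perl
        use strict;
        use warnings;
        use LWP::UserAgent;
        use HTML::LinkExtor;
        use URI;

        my $start    = 'http://www.example.com/';   # placeholder start page
        my $fragment = '<table';                    # HTML fragment to look for
        my $host     = URI->new($start)->host;

        my $ua    = LWP::UserAgent->new(agent => 'site-audit/0.1');
        my %seen  = ($start => 1);
        my @queue = ($start);

        while (my $url = shift @queue) {
            my $res = $ua->get($url);
            next unless $res->is_success && $res->content_type eq 'text/html';

            my $body = $res->decoded_content;
            print "$url\n" if index(lc $body, lc $fragment) >= 0;

            # extract links and queue any unseen pages on the same host
            my $extor = HTML::LinkExtor->new(undef, $url);
            $extor->parse($body);
            for my $link ($extor->links) {
                my ($tag, %attr) = @$link;
                next unless $tag eq 'a' && $attr{href};
                my $abs = URI->new_abs($attr{href}, $url);
                next unless ($abs->scheme || '') =~ /^https?$/;
                $abs->fragment(undef);              # drop #anchors
                next if $abs->host ne $host;
                push @queue, "$abs" unless $seen{"$abs"}++;
            }
            sleep 1;                                # be polite to the server
        }

    WWW::Mechanize would trim a fair bit of the link-handling code too, if it's installed.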

