
web spider / crawler

  • 12-12-2005 10:01pm
    #1
    Registered Users Posts: 2,031 ✭✭✭ colm_c


    I'm looking for some kind of automated agent that I can run against a site to find all pages containing a certain piece of HTML - e.g. search for '<table' and get back the URLs of all the pages with tables on them.

    Using a find and replace on a flat file or database isn't really an option due to the CMS being used.

    Anyone come across any of these?


Comments

  • Closed Accounts Posts: 39 Arch-Stanton


    http://kineticscripts.com/cgi-bin/search/scripts.cgi?script=57

    Something like the above could be modified for your needs.


  • Registered Users Posts: 2,031 ✭✭✭colm_c


    I don't think that would work to be honest - that seems to look for actual files, whereas what I need to search is a site built on a database... with tonnes of query strings etc.

    I also can't install anything on the server the site's on, but I can install whatever I need on my own server...


  • Registered Users Posts: 7,412 ✭✭✭jmcc


    colm_c wrote:
    I'm looking for some kind of automated agent that I can run against a site to find all pages containing a certain piece of HTML - e.g. search for '<table' and get back the URLs of all the pages with tables on them.

    Using a find and replace on a flat file or database isn't really an option due to the CMS being used.

    Anyone come across any of these?
    Is it your site or somebody else's? The admins of large sites tend to take a dim view of people trying to download their database. Some will ban your IP without blinking, and if they're in a bad mood they'll make serious trouble for you with your hosting provider/ISP. Many of these sites now have specific clauses about automated downloading and scraping in their terms and conditions of use.

    The CMS in use may actually generate pages from a template, so it might be easier to check the template for the text you're after rather than downloading the entire site.

    There are a few agents out there that can whack (mirror) a site, and you could then just run a simple grep over the downloaded files to check for the code fragment.
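    Something along these lines in Perl would do the grep stage (just a sketch - it assumes the site has already been mirrored into a local directory by one of those agents, and ./mirror and the '<table' fragment are only example values):

        #!/usr/bin/perl
        use strict;
        use warnings;
        use File::Find;

        my $fragment = '<table';     # the HTML fragment to hunt for
        my $mirror   = './mirror';   # wherever the mirroring tool put the files

        find(sub {
            return unless -f && /\.s?html?$/i;       # only plain HTML files
            open my $fh, '<', $_ or return;
            my $content = do { local $/; <$fh> };    # slurp the whole file
            close $fh;
            # case-insensitive match, since Word's HTML tends to shout <TABLE>
            print "$File::Find::name\n" if index(lc $content, lc $fragment) >= 0;
        }, $mirror);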

    Regards...jmcc


  • Registered Users Posts: 2,031 ✭✭✭colm_c


    The site belongs to one of our clients, and the CMS they use supports direct HTML input but also file attachments - it's these attached files that are the problem. They're old, clunky HTML output from Word, and we need to convert them ASAP.

    Basically I wanted to log all the pages like this so we could give them a rough estimate of how long the conversion would take...

    I guess using Teleport Pro or something, then doing a find and logging the file names, would work - but that's gonna be one hell of a log...


  • Closed Accounts Posts: 304 ✭✭Zaltais


    It would be relatively straightforward to write something to do this in Perl. I'm sure scripts exist out there to do what you're looking for, but I've always tended to roll my own whenever I've needed something like this, as I've found off-the-shelf solutions too restrictive.

    I can suggest a few Perl modules that'll be helpful if you do decide to roll your own (PM me if you're interested), but I don't know of any off-the-shelf solutions offhand...
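    If you do go the roll-your-own route, a bare-bones starting point might look something like this (only a sketch: it uses LWP::UserAgent, HTML::LinkExtor and URI from CPAN, the start URL and fragment are placeholders, it only follows same-host links, and it does nothing about robots.txt):

        #!/usr/bin/perl
        use strict;
        use warnings;
        use LWP::UserAgent;
        use HTML::LinkExtor;
        use URI;

        my $start    = 'http://www.example.com/';   # placeholder start page
        my $fragment = '<table';                    # HTML fragment to look for
        my $host     = URI->new($start)->host;

        my $ua    = LWP::UserAgent->new(agent => 'site-audit/0.1');
        my %seen  = ($start => 1);
        my @queue = ($start);

        while (my $url = shift @queue) {
            my $res = $ua->get($url);
            next unless $res->is_success && $res->content_type eq 'text/html';

            my $body = $res->decoded_content;
            print "$url\n" if index(lc $body, lc $fragment) >= 0;

            # extract links and queue any unseen pages on the same host
            my $extor = HTML::LinkExtor->new(undef, $url);
            $extor->parse($body);
            for my $link ($extor->links) {
                my ($tag, %attr) = @$link;
                next unless $tag eq 'a' && $attr{href};
                my $abs = URI->new_abs($attr{href}, $url);
                next unless ($abs->scheme || '') =~ /^https?$/;
                $abs->fragment(undef);              # drop #anchors
                next if $abs->host ne $host;
                push @queue, "$abs" unless $seen{"$abs"}++;
            }
            sleep 1;                                # be polite to the server
        }

    WWW::Mechanize would trim a fair bit of the link-handling code too, if it's installed.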

