
Web Crawler Design

  • 12-11-2012 10:38am
    #1
    Registered Users Posts: 50 ✭✭mstester


    Hey Guys,

    At work we have a massive legacy website (probably more than 100k pages), and over the years we've seen a large number of 404s across the site.

    I was recently asked to create a site crawler which would go through all the pages on the site and find dead links. Now this by itself is no biggie to me, but they also want to know the user's journey to each dead link.

    For example, say the user started their journey on the homepage and then clicked through another 5 pages. On the 5th page they hit a 404. What I would like to do is map all the pages they went through to get to that 404.

    How would you go about doing this?

    Ideally I'm looking to write this in Java. I don't want to use any open or closed source tools or services; this is something I'd like to do myself.

    Any ideas or comments would be most helpful.

    Thanks all


Comments

  • Registered Users Posts: 81,220 ✭✭✭✭biko


    For ideas on how to build it, check out Xenu's Link Sleuth.
    It does a similar job, and if it finds a 404 it also makes a note of the referring page.
    http://home.snafu.de/tilman/xenulink.html
    No source code available.


  • Registered Users Posts: 50 ✭✭mstester


    Hey Biko,

    I've actually already got that tool installed and it's pretty cool. I know I could use this tool (and I have in the past) but I'd actually like to create my own tool for this. I know it's probably just reinventing the wheel, but for me it's more of a little side project.

    Thanks for the reply.


  • Registered Users Posts: 851 ✭✭✭TonyStark


    For all of my crawling/scraping needs of late I've used nCrawler and then ScrapySharp:

    http://ncrawler.codeplex.com/
    https://bitbucket.org/rflechner/scrapysharp

    The platform is C#/.NET, but there might be one or two things in the approaches that could be of use, e.g. the pipeline architecture of nCrawler.
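
    Not how nCrawler actually works under the hood, but here's a rough sketch of the pipeline idea in Java for anyone curious: each fetched page gets passed through a chain of independent steps. All the class and method names below are made up for illustration.

    import java.util.List;

    // Hypothetical names for illustration only -- this is not nCrawler's actual API.
    interface PipelineStep {
        void process(CrawlContext page);               // inspect or enrich one fetched page
    }

    class CrawlContext {
        String url;                                    // page that was fetched
        String referrer;                               // page it was linked from
        int statusCode;                                // HTTP status returned
        String html;                                   // body, for link extraction
    }

    class CrawlPipeline {
        private final List<PipelineStep> steps;

        CrawlPipeline(List<PipelineStep> steps) {
            this.steps = steps;
        }

        void run(CrawlContext page) {
            // e.g. a FetchStep -> LinkExtractStep -> DeadLinkReportStep chain
            for (PipelineStep step : steps) {
                step.process(page);
            }
        }
    }

    The nice part of the approach is that fetching, parsing and reporting each stay in their own step, which carries over to Java just as well as to C#.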


  • Registered Users Posts: 50 ✭✭mstester


    Thanks for the links!

    I'll give them a look. Any other suggestions are more than welcome!!


  • Registered Users Posts: 163 ✭✭stephenlane80


    This project is a Java crawler and indexing application; it uses Lucene to build an index, but you could easily create your own. Any reason you don't want to use any 3rd-party components?

    http://www.codeproject.com/Articles/32920/Lucene-Website-Crawler-and-Indexer


  • Registered Users Posts: 50 ✭✭mstester


    stephenlane80 wrote: »
    This project is a Java crawler and indexing application; it uses Lucene to build an index, but you could easily create your own. Any reason you don't want to use any 3rd-party components?

    http://www.codeproject.com/Articles/32920/Lucene-Website-Crawler-and-Indexer

    Thanks for the link. There are two reasons why I want to build my own. Firstly for the "fun" of it and secondly I want to tie it into JMeter and into some other build tools we use here.

    Thanks!


  • Registered Users Posts: 2,021 ✭✭✭ChRoMe


    mstester wrote: »
    Thanks for the link. There are two reasons why I want to build my own. Firstly for the "fun" of it and

    I hear poking your eye with a sharp object is a barrel of laughs too! ;)


  • Registered Users Posts: 2,781 ✭✭✭amen


    All the web crawler has to do is work its way through a given site, building a list of links and recording child links and whether they are active.

    Think of it as recursing through a directory structure on a disk and recording attributes such as directories, sub-directories, etc.
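
    A minimal Java sketch of that idea, assuming a breadth-first crawl starting at the homepage: each newly discovered link remembers the page it was first found on, so when a URL comes back as a 404 you can walk the parent chain back to the start. Note this gives you the crawl path to the dead link, not a real user's journey, and the start URL, HTML parsing and robots.txt handling are left as placeholders.

    import java.io.IOException;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.util.*;

    public class DeadLinkCrawler {

        public static void main(String[] args) throws IOException {
            String start = "https://www.example.com/";        // assumption: replace with the real homepage
            Map<String, String> parent = new HashMap<>();     // page -> page it was first discovered from
            Deque<String> queue = new ArrayDeque<>();
            Set<String> seen = new HashSet<>();

            queue.add(start);
            seen.add(start);

            while (!queue.isEmpty()) {
                String url = queue.poll();

                if (headStatus(url) == 404) {
                    // Walk the parent chain back to the homepage to get the crawl path.
                    List<String> path = new ArrayList<>();
                    for (String p = url; p != null; p = parent.get(p)) {
                        path.add(p);
                    }
                    Collections.reverse(path);
                    System.out.println("404: " + String.join(" -> ", path));
                    continue;
                }

                // extractLinks() would parse the page for <a href> values (e.g. with jsoup);
                // left as a placeholder to keep the sketch short.
                for (String link : extractLinks(url)) {
                    if (seen.add(link)) {                     // only queue each URL once
                        parent.put(link, url);
                        queue.add(link);
                    }
                }
            }
        }

        static int headStatus(String url) throws IOException {
            HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
            conn.setRequestMethod("HEAD");                    // status code only, no need to download the body
            return conn.getResponseCode();
        }

        static List<String> extractLinks(String url) {
            return Collections.emptyList();                   // placeholder for real HTML parsing
        }
    }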


  • Registered Users Posts: 138 ✭✭MagicRon


    Install http://www.microsoft.com/web/seo/ on your machine. It will give you some very detailed info about your site ... including a list of 404s and every page on the site that links to each 404.


  • Registered Users Posts: 40 MidnightHawk


    Sounds like a site I am familiar with :)

    Why not just look at the web server logs? You should see the offending links and the referring pages. It is going to be extremely hard to map an absolute path to a page when a user can reach specific pages from multiple entry points.

    The permutations of paths that lead to a single (non-existing) page could be overwhelming, especially if you have a massive site. Once you have all the links you can store them in a database and then build your path/mapping from there with scripts.

    If you need any further advice I'd be happy to provide more help.
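
    A minimal sketch of the log approach, assuming an Apache/Nginx-style "combined" access log where the request, status code and Referer sit in fixed positions (the file name and regex below are assumptions; adjust them to whatever your server actually writes):

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class Log404Report {

        // Rough match for the "combined" log format: request in quotes, status code,
        // response size, then the Referer in quotes.
        private static final Pattern LINE = Pattern.compile(
                "\"(?:GET|POST|HEAD) (\\S+) [^\"]*\" (\\d{3}) \\S+ \"([^\"]*)\"");

        public static void main(String[] args) throws IOException {
            // assumption: path to your access log; a huge log would want streaming instead
            for (String line : Files.readAllLines(Paths.get("access.log"))) {
                Matcher m = LINE.matcher(line);
                if (m.find() && m.group(2).equals("404")) {
                    System.out.println("404 on " + m.group(1) + "  referred by " + m.group(3));
                }
            }
        }
    }

    From there each (404 URL, referrer) pair could go into a database table and the longer chains can be stitched together afterwards, as suggested above.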


  • Closed Accounts Posts: 9,700 ✭✭✭tricky D


    Server logs will only record 404s for pages visited by users, not all pages.

    There are and have been hundreds if not thousands of link checkers doing the rounds for years now. Writing one from scratch is just reinventing the wheel when the company's time could be much better utilised. You should still be able to use the output from one of the many link checkers to throw at JMeter.


  • Registered Users Posts: 27,161 ✭✭✭✭GreeBo


    tricky D wrote: »
    Server logs will only record 404s for pages visited by users, not all pages.

    There are and have been hundreds if not thousands of link checkers doing the rounds for years now. Writing one from scratch is just reinventing the wheel when the company's time could be much better utilised. You should still be able to use the output from one of the many link checkers to throw at JMeter.


    If people aren't hitting these pages then I would argue... who cares that they are broken?

    OP, you are going to have lots of fun if click handlers are being attached via JavaScript, for example...

