
Web Crawler Design

  • 12-11-2012 10:38am
    #1
    Registered Users Posts: 50 ✭✭mstester


    Hey Guys,

    At work we have a massive legacy website (probably more than 100k pages), and over the years we've seen a large number of 404s across the site.

    I was recently asked to create a site crawler which would go through all the pages on the site and find dead links. Now this by itself is no biggie to me, but they also want to know the user's journey to each dead link.

    For example, say the user started their journey on the homepage and then clicked through another 5 pages. On the 5th page they hit a 404. What I would like to do is map all the pages they went through to get to that 404.

    How would you go about doing this?

    Ideally I'm looking to write this in Java. I don't want to use any open or closed source tools or services; this is something I'd like to do myself.

    Any ideas or comments would be most helpful.

    Thanks all


Comments

  • Registered Users Posts: 81,220 ✭✭✭✭biko


    For ideas on how to build it, check out Xenu's Link Sleuth.
    It does a similar job, and if it finds a 404 it also makes a note of the referring page.
    http://home.snafu.de/tilman/xenulink.html
    No source code available.


  • Registered Users Posts: 50 ✭✭mstester


    Hey Biko,

    I've actually already got that tool installed and it's pretty cool. I know I could use this tool (and I have in the past) but I'd actually like to create my own tool for this. I know it's probably just reinventing the wheel, but for me it's more of a little side project.

    Thanks for the reply.


  • Registered Users Posts: 851 ✭✭✭TonyStark


    For all of my crawling/scraping needs of late I've used nCrawler and then ScrapySharp:

    http://ncrawler.codeplex.com/
    https://bitbucket.org/rflechner/scrapysharp

    The platform is C#/.NET, but there might be one or two things in the approaches that could be of use, e.g. the pipeline architecture of nCrawler.
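
    Not how nCrawler actually works under the hood, but here's a rough sketch of the pipeline idea in Java for anyone curious: each fetched page gets passed through a chain of independent steps. All the class and method names below are made up for illustration.

    import java.util.List;

    // Hypothetical names for illustration only -- this is not nCrawler's actual API.
    interface PipelineStep {
        void process(CrawlContext page);               // inspect or enrich one fetched page
    }

    class CrawlContext {
        String url;                                    // page that was fetched
        String referrer;                               // page it was linked from
        int statusCode;                                // HTTP status returned
        String html;                                   // body, for link extraction
    }

    class CrawlPipeline {
        private final List<PipelineStep> steps;

        CrawlPipeline(List<PipelineStep> steps) {
            this.steps = steps;
        }

        void run(CrawlContext page) {
            // e.g. a FetchStep -> LinkExtractStep -> DeadLinkReportStep chain
            for (PipelineStep step : steps) {
                step.process(page);
            }
        }
    }

    The nice part of the approach is that fetching, parsing and reporting each stay in their own step, which carries over to Java just as well as to C#.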


  • Registered Users Posts: 50 ✭✭mstester


    Thanks for the links!

    I'll give them a look. Any other suggestions are more than welcome!!


  • Registered Users Posts: 163 ✭✭stephenlane80


    This project is a Java crawler and indexing application; it uses Lucene to build an index, but you could easily create your own. Any reason you don't want to use any 3rd-party components?

    http://www.codeproject.com/Articles/32920/Lucene-Website-Crawler-and-Indexer


  • Registered Users Posts: 50 ✭✭mstester


    stephenlane80 wrote: »
    This project is a Java crawler and indexing application; it uses Lucene to build an index, but you could easily create your own. Any reason you don't want to use any 3rd-party components?

    http://www.codeproject.com/Articles/32920/Lucene-Website-Crawler-and-Indexer

    Thanks for the link. There are two reasons why I want to build my own. Firstly for the "fun" of it and secondly I want to tie it into JMeter and into some other build tools we use here.

    Thanks!


  • Registered Users Posts: 2,021 ✭✭✭ChRoMe


    mstester wrote: »
    Thanks for the link. There are two reasons why I want to build my own. Firstly for the "fun" of it and

    I hear poking your eye with a sharp object is a barrel of laughs too! ;)


  • Registered Users Posts: 2,781 ✭✭✭amen


    All the web crawler has to do is work its way through a given site, building a list of links and recording child links and whether they are active.

    Think of it as recursing through a directory structure on a disk and recording attributes such as directories, sub-directories, etc.
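
    A minimal Java sketch of that idea, assuming a breadth-first crawl starting at the homepage: each newly discovered link remembers the page it was first found on, so when a URL comes back as a 404 you can walk the parent chain back to the start. Note this gives you the crawl path to the dead link, not a real user's journey, and the start URL, HTML parsing and robots.txt handling are left as placeholders.

    import java.io.IOException;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.util.*;

    public class DeadLinkCrawler {

        public static void main(String[] args) throws IOException {
            String start = "https://www.example.com/";        // assumption: replace with the real homepage
            Map<String, String> parent = new HashMap<>();     // page -> page it was first discovered from
            Deque<String> queue = new ArrayDeque<>();
            Set<String> seen = new HashSet<>();

            queue.add(start);
            seen.add(start);

            while (!queue.isEmpty()) {
                String url = queue.poll();

                if (headStatus(url) == 404) {
                    // Walk the parent chain back to the homepage to get the crawl path.
                    List<String> path = new ArrayList<>();
                    for (String p = url; p != null; p = parent.get(p)) {
                        path.add(p);
                    }
                    Collections.reverse(path);
                    System.out.println("404: " + String.join(" -> ", path));
                    continue;
                }

                // extractLinks() would parse the page for <a href> values (e.g. with jsoup);
                // left as a placeholder to keep the sketch short.
                for (String link : extractLinks(url)) {
                    if (seen.add(link)) {                     // only queue each URL once
                        parent.put(link, url);
                        queue.add(link);
                    }
                }
            }
        }

        static int headStatus(String url) throws IOException {
            HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
            conn.setRequestMethod("HEAD");                    // status code only, no need to download the body
            return conn.getResponseCode();
        }

        static List<String> extractLinks(String url) {
            return Collections.emptyList();                   // placeholder for real HTML parsing
        }
    }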


  • Registered Users Posts: 138 ✭✭MagicRon


    Install http://www.microsoft.com/web/seo/ on your machine. It will give you some very detailed info about your site ... including a list of 404s and every page on the site that links to each 404.


  • Registered Users Posts: 40 MidnightHawk


    Sounds like a site I am familiar with :)

    Why not just look at the web server logs? You should see the offending links and the referring pages. It is going to be extremely hard to map an absolute path to a page when a user can reach specific pages from multiple entry points.

    The permutations of paths that lead to a single (non-existing) page could be overwhelming, especially if you have a massive site. Once you have all the links you can store them in a database and then build your path/mapping from there with scripts.

    If you need any further advice I'd be happy to provide more help.
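
    A minimal sketch of the log approach, assuming an Apache/Nginx-style "combined" access log where the request, status code and Referer sit in fixed positions (the file name and regex below are assumptions; adjust them to whatever your server actually writes):

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class Log404Report {

        // Rough match for the "combined" log format: request in quotes, status code,
        // response size, then the Referer in quotes.
        private static final Pattern LINE = Pattern.compile(
                "\"(?:GET|POST|HEAD) (\\S+) [^\"]*\" (\\d{3}) \\S+ \"([^\"]*)\"");

        public static void main(String[] args) throws IOException {
            // assumption: path to your access log; a huge log would want streaming instead
            for (String line : Files.readAllLines(Paths.get("access.log"))) {
                Matcher m = LINE.matcher(line);
                if (m.find() && m.group(2).equals("404")) {
                    System.out.println("404 on " + m.group(1) + "  referred by " + m.group(3));
                }
            }
        }
    }

    From there each (404 URL, referrer) pair could go into a database table and the longer chains can be stitched together afterwards, as suggested above.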


  • Closed Accounts Posts: 9,700 ✭✭✭tricky D


    Server logs will only record 404s for pages visited by users, not all pages.

    There are and have been hundreds if not thousands of link checkers doing the rounds for years now. Writing one from scratch is just reinventing the wheel when the company's time could be much better utilised. You should still be able to use the output from one of the many link checkers to throw at JMeter.


  • Registered Users Posts: 27,161 ✭✭✭✭GreeBo


    tricky D wrote: »
    Server logs will only record 404s for pages visited by users, not all pages.

    There are and have been hundreds if not thousands of link checkers doing the rounds for years now. Writing one from scratch is just reinventing the wheel when the company's time could be much better utilised. You should still be able to use the output from one of the many link checkers to throw at JMeter.


    If people aren't hitting these pages then I would argue... who cares that they are broken?

    OP, you are going to have lots of fun if click handlers are being attached via JavaScript, for example...

