
Robots sucking bandwidth?

  • 04-02-2004 1:28pm
    #1
    Registered Users Posts: 6,315 ✭✭✭


    I have recently put a gallery on my site and have noticed that my bandwidth usage is off the Richter scale.

    Do googlebot and other bots download every file they find?

    If so how do I stop them entering the /gallery/ directory?


Comments

  • Registered Users Posts: 258 ✭✭peterd


    Many bots will obey your robots.txt file (if you have one), Google included. Look up the http://www.robotstxt.org/ website for more info, but basically putting...
    User-agent: *
    Disallow: /gallery/
    

    in a robots.txt file in your web directory will keep them out of there.
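
    For what it's worth, you can sanity-check rules like these before deploying them, using Python's standard urllib.robotparser (a quick sketch of my own, not from the original post; the paths are made-up examples):

    ```python
    # Sanity-check robots.txt rules with Python's standard library.
    from urllib.robotparser import RobotFileParser

    rp = RobotFileParser()
    # parse() takes the file's lines directly, so no web server is needed.
    rp.parse([
        "User-agent: *",
        "Disallow: /gallery/",
    ])

    # Well-behaved bots (Googlebot included) should skip the gallery...
    print(rp.can_fetch("Googlebot", "/gallery/photo1.jpg"))  # False
    # ...but still crawl the rest of the site.
    print(rp.can_fetch("Googlebot", "/index.html"))  # True
    ```

    Note this only tells you what a *polite* bot will do; nothing in robots.txt physically stops a crawler that ignores it.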


  • Moderators, Politics Moderators Posts: 39,950 Mod ✭✭✭✭Seth Brundle


    The bots will download any files that are available to the public. If you are finding that you have all your bandwidth used then you have a problem with your hosting account - what happens if 1 million people want to view your gallery? Are you going to stop them because it is going to ruin your transfer allowance?


  • Registered Users Posts: 6,315 ✭✭✭ballooba


    The gallery section is of interest only to a few people.

    I don't want bots downloading 45 megs of stuff off the site every day.


  • Banned (with Prison Access) Posts: 16,659 ✭✭✭✭dahamsta


    Strange bots if they're downloading your images. Or strange gallery if it contains 45 megs of HTML.

    adam /confused


  • Registered Users Posts: 6,315 ✭✭✭ballooba


    Originally posted by ballooba

    Do googlebot and other bots download every file they find?

    By 'every file' I was including image files.


  • Banned (with Prison Access) Posts: 16,659 ✭✭✭✭dahamsta


    Most spiders download HTML only, usually up to a specified length. If your bandwidth usage is off the charts, either a spider is stuck in a loop (unlikely these days) or something else is happening. If you generate stats for the site, check them out; otherwise install something.

    adam


  • Registered Users Posts: 476 ✭✭Pablo


    the ones i don't allow :
    # Rover is a bad dog <http://www.roverbot.com>
    User-agent: Roverbot
    Disallow: /
    # Another annoying bot
    User-agent: ia_archiver
    Disallow: /
    # No point in having images stored like this
    User-agent: Googlebot-Image
    Disallow: /
    
    make sure it is in your root, and named robots.txt, not robot.txt
    HTH


  • Banned (with Prison Access) Posts: 16,659 ✭✭✭✭dahamsta


    # No point in having images stored like this
    User-agent: Googlebot-Image
    
    That could be the kiddy right there, I forgot about Google Images. I don't think I've ever seen images from galleries in Google Images, but there's plenty of other sites out there that go looking for 'em. That being said, only the stupidest spider would go in and download every image on a regular basis.

    adam


  • Registered Users Posts: 7,412 ✭✭✭jmcc


    Originally posted by dahamsta
    That being said, only the stupidest spider would go in and download every image on a regular basis.

    Unless it has completely purged its database, any returning robot should get 304 (Not Modified) responses indicating that the file/page has not changed.

    What you have to watch out for are muppets who hoover the complete site with rippers like Xenu's Link Sleuth and the like. These are best blocked using either .htaccess or httpd.conf.

    Regards...jmcc
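
    A sketch of the .htaccess approach jmcc mentions (assuming Apache with mod_rewrite enabled; the user-agent strings here are illustrative examples, not a definitive blocklist):

    ```apacheconf
    # Block known site-rippers by User-Agent (example strings only).
    # Requires mod_rewrite; goes in the site root's .htaccess.
    RewriteEngine On
    RewriteCond %{HTTP_USER_AGENT} (HTTrack|WebReaper|WebCopier) [NC]
    RewriteRule .* - [F,L]
    ```

    The [F] flag returns a 403 Forbidden; bear in mind rippers can fake their User-Agent, so this only catches the lazy ones.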


  • Registered Users Posts: 476 ✭✭Pablo


    .htaccess is a good way to prevent people hotlinking your images. Well worth the little effort to put in place.
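
    A minimal hotlink-protection sketch along those lines (my own example, assuming Apache with mod_rewrite; replace example.com with your own domain):

    ```apacheconf
    RewriteEngine On
    # Allow requests with no Referer (direct visits, some proxies)...
    RewriteCond %{HTTP_REFERER} !^$
    # ...and requests referred from your own site.
    RewriteCond %{HTTP_REFERER} !^https?://(www\.)?example\.com/ [NC]
    # Everything else asking for an image gets a 403.
    RewriteRule \.(gif|jpe?g|png)$ - [F,NC,L]
    ```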

