Prevent Robots from using bandwidth


  • Prevent Robots from using bandwidth

    Here is a nice, simple way to stop the majority of robots from spidering files that you don't want them to fetch, or that they have no need to fetch.

    Place the following code in robots.txt and upload it to your domain root, so that when you go to http://forums.site.com/robots.txt you get this file:

    Code:
    User-agent: *
    Disallow: /attachment.php
    Disallow: /avatar.php
    Disallow: /editpost.php
    Disallow: /member.php
    Disallow: /member2.php
    Disallow: /misc.php
    Disallow: /moderator.php
    Disallow: /newreply.php
    Disallow: /newthread.php
    Disallow: /online.php
    Disallow: /poll.php
    Disallow: /postings.php
    Disallow: /printthread.php
    Disallow: /private.php
    Disallow: /private2.php
    Disallow: /report.php
    Disallow: /search.php
    Disallow: /sendtofriend.php
    Disallow: /threadrate.php
    Disallow: /usercp.php
    Disallow: /admin/
    Disallow: /images/
    Disallow: /mod/
    This will stop them from trying to access files that won't have anything interesting to spider. It also stops them from fetching images, which should save bandwidth when you have lots of spiders on your forums.

    Note: The majority of spiders check for this file, but not all do.
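
If you want to sanity-check a rule set before uploading it, here is a minimal sketch using Python's standard urllib.robotparser module. The shortened rule list, the user agent name "ExampleBot" and the showthread.php URL are only illustrations. Note that this parser matches each Disallow value as a prefix of the URL path, which is why the entries are written with a leading slash.

```python
from urllib.robotparser import RobotFileParser

# A shortened copy of the rules (illustration only). Each path starts
# with "/" because rules are matched as prefixes of the URL path.
RULES = """\
User-agent: *
Disallow: /attachment.php
Disallow: /search.php
Disallow: /admin/
Disallow: /images/
"""


def load_rules(text):
    """Parse robots.txt text and return a ready-to-query parser."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


def is_allowed(parser, path, agent="ExampleBot", base="http://forums.site.com"):
    """Return True if `agent` may fetch `path` on the example host."""
    return parser.can_fetch(agent, base + path)


if __name__ == "__main__":
    rules = load_rules(RULES)
    for path in ("/attachment.php?attachmentid=1",
                 "/images/logo.gif",
                 "/showthread.php?t=1"):
        print(path, "allowed" if is_allowed(rules, path) else "blocked")
```

This only tells you how a well-behaved parser reads the file; a spider that ignores robots.txt will not be stopped by it.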
    Scott MacVicar

    My Blog | Twitter

  • #2
    Wow, this is cool. I'm gonna try this out.



    • #3
      If you want to check, look at your error logs; you'll probably see a few 404 errors for /robots.txt. Those are spiders like Google looking for the file to see whether you have any rules they have to follow, like the ones posted.
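
A rough sketch of that log check in Python, assuming your server writes Apache-style common or combined log lines (your log format may well differ, so treat the regular expression as a starting point). The sample lines and the function name robots_txt_misses are made up for illustration.

```python
import re
from collections import Counter

# Matches the request and status fields of an Apache-style log line,
# e.g. "GET /robots.txt HTTP/1.0" 404 209
LINE_RE = re.compile(r'"(?:GET|HEAD) (?P<path>\S+) [^"]*" (?P<status>\d{3}) ')


def robots_txt_misses(lines):
    """Count 404 responses for /robots.txt, grouped by user agent."""
    misses = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if not match:
            continue
        if match.group("path") != "/robots.txt" or match.group("status") != "404":
            continue
        # In the combined format the user agent is the last quoted field;
        # the plain common format has no user agent at all.
        quoted = re.findall(r'"([^"]*)"', line)
        agent = quoted[-1] if len(quoted) >= 3 else "unknown"
        misses[agent] += 1
    return misses


# Hypothetical sample lines, not taken from a real log.
SAMPLE = [
    '192.0.2.1 - - [11/May/2002:13:41:00 +0000] "GET /robots.txt HTTP/1.0" 404 209 "-" "Googlebot/2.1"',
    '192.0.2.1 - - [11/May/2002:13:41:02 +0000] "GET /index.php HTTP/1.0" 200 5120 "-" "Googlebot/2.1"',
    '192.0.2.7 - - [11/May/2002:14:02:10 +0000] "GET /robots.txt HTTP/1.0" 404 209',
]

if __name__ == "__main__":
    for agent, count in robots_txt_misses(SAMPLE).items():
        print(agent, count)
```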
      Scott MacVicar



      • #4
        cool, I just modified this to fit with my vbp directory



        • #5
          Sounds like a very good idea. Thank you!

          For folks whose board is in a subfolder: make sure you put that folder name before the PHP files in the list. The robots.txt file itself has to be in the "/" folder of the web site.

          For example, if the board lives in /forum/, the http://www.vbulletin.com/robots.txt file would look something like this:

          Code:
          User-agent: *
          Disallow: /forum/attachment.php
          Disallow: /forum/avatar.php
          Disallow: /forum/editpost.php
          Disallow: /forum/member.php
          Disallow: /forum/member2.php
          Disallow: /forum/misc.php
          Disallow: /forum/moderator.php
          Disallow: /forum/newreply.php
          Disallow: /forum/newthread.php
          Disallow: /forum/online.php
          Disallow: /forum/poll.php
          Disallow: /forum/postings.php
          Disallow: /forum/printthread.php
          Disallow: /forum/private.php
          Disallow: /forum/private2.php
          Disallow: /forum/report.php
          Disallow: /forum/search.php
          Disallow: /forum/sendtofriend.php
          Disallow: /forum/threadrate.php
          Disallow: /forum/usercp.php
          Disallow: /forum/admin/
          Disallow: /forum/images/
          Disallow: /forum/mod/
          The only thing that worries me is that a hacker could read the robots.txt file and know exactly which PHP files you have and where. But since the structure of vBulletin is not exactly secret anyway, it is probably not that big of a deal…

          For more info on robots.txt, see this:
          http://www.robotstxt.org/wc/exclusion-admin.html
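
If you would rather not retype the list for a subfolder install, a small Python sketch like the one below can add the folder name for you. The function name prefix_rules is hypothetical, and it assumes the folder name is not empty and that every Disallow value is a path relative to the board.

```python
def prefix_rules(robots_txt, folder):
    """Return robots.txt text with `folder` put before each Disallow path.

    Assumes `folder` is a non-empty name such as "forum".
    """
    folder = "/" + folder.strip("/")
    out = []
    for line in robots_txt.splitlines():
        field, sep, value = line.partition(":")
        value = value.strip()
        if sep and field.strip().lower() == "disallow" and value:
            # Drop any leading slash so the result has exactly one
            # slash between the folder and the file name.
            out.append(f"Disallow: {folder}/{value.lstrip('/')}")
        else:
            # User-agent lines, blank lines and empty Disallow lines
            # are passed through untouched.
            out.append(line)
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    root_rules = "User-agent: *\nDisallow: /search.php\nDisallow: /admin/\n"
    print(prefix_rules(root_rules, "forum"), end="")
```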



          • #6
            OK, I've done all that, just one question. Ummm, what's a robot?



            • #7
              My current robots.txt file for my forums has only:

              Code:
              User-Agent: Googlebot-Image
              Disallow: /
              :: Always Back Up Forum Database + Attachments BEFORE upgrading !
              :: Nginx SPDY SSL - World Flags Demo [video results]
              :: vBulletin hacked forums: Clean Up Guide for VPS/Dedicated hosting users [ vbulletin.com blog summary ]



              • #8
                Originally posted by Millward
                OK, I've done all that, just one question. Ummm, what's a robot?
                Have you ever seen Futurama? Robots like to steal things, like bandwidth and images...



                • #9
                  But George, that stops them from spidering your forums, and most people like to have their forum spidered. Well, I think it's useful, at least: you appear in search engines.
                  Scott MacVicar



                  • #10
                    Originally posted by PPN
                    But George, that stops them from spidering your forums, and most people like to have their forum spidered. Well, I think it's useful, at least: you appear in search engines.
                    Look again: it only prevents Google from grabbing my images for indexing. http://www.google.com/remove.html#images
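
To see why the user agent line matters, here is a small sketch with Python's standard urllib.robotparser. It feeds in the two-line file from above and asks about two different agents. The host and the showthread.php URL are placeholders, and real crawlers may match user agent names differently from this parser.

```python
from urllib.robotparser import RobotFileParser

# The two-line robots.txt from the post above.
IMAGE_ONLY_RULES = [
    "User-Agent: Googlebot-Image",
    "Disallow: /",
]


def is_blocked(agent, url, lines=IMAGE_ONLY_RULES):
    """Return True if `agent` is not allowed to fetch `url`."""
    parser = RobotFileParser()
    parser.parse(lines)
    return not parser.can_fetch(agent, url)


if __name__ == "__main__":
    url = "http://forums.site.com/showthread.php?t=1"  # placeholder URL
    for agent in ("Googlebot-Image", "Googlebot"):
        print(agent, "blocked" if is_blocked(agent, url) else "allowed")
```

Only the agent named in the file is shut out; every other agent has no matching rule and stays allowed.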
                    Last edited by George L; Sat 11th May '02, 1:41pm.


                    • #11
                      oh hehe

                      I never noticed what the useragent was.
                      Scott MacVicar



                      • #12
                        lol, ah, I see why it's called robots now, sorry. I still don't get where they come from... I'm thick, aren't I?



                        • #13
                          They usually come from all kinds of search engines: Altavista, Google, etc...



                          • #14
                            Is it sort of like hotlinking, where someone puts an <img> tag in their page that calls for an image on a different server?



                            • #15
                              Robots spider your web pages and gather the documents. They index these so that when you go to a search engine and type in a word, it matches your site if the robot found that word there.

                              Spiders follow links on web pages to other pages, but they also follow images, which is sometimes a bad thing because it uses your bandwidth, especially if you have a big board that is being spidered once a day by many search engines.
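
As a rough illustration of what a spider pulls out of a page, the sketch below uses Python's standard html.parser to collect both the links and the image URLs from a bit of HTML. The class name LinkCollector and the sample markup are invented; a real spider would also fetch each URL and check robots.txt first.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect page links (<a href>) and image URLs (<img src>)."""

    def __init__(self):
        super().__init__()
        self.pages = []
        self.images = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            self.pages.append(attrs["href"])
        elif tag == "img" and attrs.get("src"):
            # Every image found here is one more request, and one more
            # chunk of bandwidth, if the spider decides to fetch it.
            self.images.append(attrs["src"])


def collect(html):
    """Return (pages, images) found in an HTML string."""
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.pages, collector.images


if __name__ == "__main__":
    sample = ('<a href="showthread.php?t=1">A thread</a>'
              '<img src="images/logo.gif">')
    pages, images = collect(sample)
    print("pages:", pages)
    print("images:", images)
```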
                              Scott MacVicar
