Someone crawling my site - your knowledge would be helpful

  • Someone crawling my site - your knowledge would be helpful

    Hi everyone

    I'd like to pick your brains, as I suspect many of you know a lot about how these things work!

    Someone is using a crawler to spider my site, but it doesn't appear to be a search engine. My access logs show that they've definitely crawled the site: a number of threads were opened at the same time, and the whole site was gone through within 20 minutes or so.

    This person first arrived on 2 November and crawled my entire site ... they have returned tonight.

    Their IP address is an NTL cable one and on my Awstats they show up as a regular visitor. They're using a Firefox 1.0 browser and found my site via a Google ad I had, until tonight, running.

    How would someone do this? What for? How would they get access to us in this way? Would they need to have access to the server? Between their initial visit and the one tonight, I've upgraded my server and have different IP numbers and password. What is going on here?

    Any help would be very much appreciated in order to stop me freaking out!!

  • #2
    Ban the IP
    MCSE, MVP, CCIE
    Microsoft Beta Team



    • #3
      Thanks Joe. I did that the moment I saw them this evening. It freaked me when I realised what was going on.

      I'm really concerned that this person has been downloading our data - I know nobody can answer this, but as a rhetorical question, what would they be using it for?

      I'm assuming they're using some sort of programme - how would this work? Does anyone know? What format would they receive the data in? How readable would it be to them?



      • #4
        You're asking the wrong guy; a developer could answer that question a lot better than me, as they know what data could be retrieved in that kind of situation.
        Hopefully one will answer. My guess is they couldn't get any data without having the database name, username and password for the database.
        MCSE, MVP, CCIE
        Microsoft Beta Team



        • #5
          Thanks again Joe. It's kind of you to reply!

          I would like to think they can't read the data, but suspect otherwise. I've no doubt there are programmes out there that can harvest info from our sites which is what I think has happened.

          I would be very interested to hear from a developer if any of them have time. If anyone knows how these programmes work, I would like info. It's not a good feeling to think that someone has crawled our data for their own purposes.



          • #6
            Originally posted by zanack
            How would someone do this? What for? How would they get access to us in this way? Would they need to have access to the server? Between their initial visit and the one tonight, I've upgraded my server and have different IP numbers and password. What is going on here?
            Any help would be very much appreciated in order to stop me freaking out!!
            Hello. What you are describing is a fairly typical issue for those of us who run discussion boards, particularly popular boards that have a lot of content on them.

            If you are currently allowing unregistered users to view your board, then that would also give read access to spider applications. This access is typically granted via your "guestuser" usergroup. You can disallow access to spiders by coding your guestuser usergroup to not allow viewing by unregistered users; however, by doing so, you will also disallow legitimate (unregistered) users from viewing your board. If this were a legitimate spider, you could control it via the "robots.txt" protocol file, but it doesn't sound like it's legit, based on your description.
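
For what it's worth, here is a minimal Python sketch of the check a well-behaved spider makes against robots.txt before fetching a page. The bot name "ExampleBot" and the paths are made up for illustration; they are not from this thread.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content: "ExampleBot" and the paths are invented.
ROBOTS_TXT = """\
User-agent: ExampleBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that is already in memory (no network access)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# A polite crawler asks this question before every request.
banned_bot_allowed = parser.can_fetch("ExampleBot", "/forum/index.php")
other_bot_allowed = parser.can_fetch("SomeOtherBot", "/forum/index.php")
other_bot_private = parser.can_fetch("SomeOtherBot", "/private/members.php")

print(banned_bot_allowed, other_bot_allowed, other_bot_private)
```

The catch is that this check runs on the crawler's side. A spider that never asks the question is not slowed down by robots.txt at all, which is why the file only helps with legitimate bots.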

            As for why they are spidering your site: if it's not for legitimate purposes, it could be a spammer harvesting email addresses so they can spam them, or it could be someone trying to steal your content via the spider application.

            Another way to control these "bad" spiders is to ban them by IP address (like Joe suggested). Unfortunately, some of these spiders tend to change their IP address dynamically, which makes your banning efforts pointless. If you can find out what the user agent is, you can ban that user agent via your .htaccess file. I had a similar problem a while ago and resolved it by doing just that.
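
If it helps with finding the user agent, below is a rough Python sketch that scans access log lines for IPs making many requests in a short window and lists the user agents they sent. It assumes the Apache "combined" log format; the sample lines, IP addresses, dates and thresholds are all invented for illustration.

```python
import re
from collections import defaultdict
from datetime import datetime, timedelta

# Assumed Apache "combined" format:
# ip ident user [time] "request" status bytes "referer" "user-agent"
LOG_LINE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] "[^"]*" \d{3} \S+ '
    r'"[^"]*" "(?P<agent>[^"]*)"'
)


def find_bursty_clients(lines, max_requests=3, window=timedelta(seconds=10)):
    """Return {ip: set of user agents} for IPs that made more than
    max_requests requests inside any single time window."""
    hits = defaultdict(list)   # ip -> request timestamps
    agents = defaultdict(set)  # ip -> user agent strings seen
    for line in lines:
        match = LOG_LINE.match(line)
        if not match:
            continue  # skip lines that are not in the assumed format
        stamp = datetime.strptime(match.group("time"), "%d/%b/%Y:%H:%M:%S %z")
        hits[match.group("ip")].append(stamp)
        agents[match.group("ip")].add(match.group("agent"))

    flagged = {}
    for ip, times in hits.items():
        times.sort()
        start = 0
        for end in range(len(times)):
            # Slide the left edge forward until the span fits in the window.
            while times[end] - times[start] > window:
                start += 1
            if end - start + 1 > max_requests:
                flagged[ip] = agents[ip]
                break
    return flagged


# Invented sample data: one client hammering the site, one ordinary visitor.
FAST_AGENT = "Mozilla/5.0 (Windows; U) Firefox/1.0"
SAMPLE_LOG = [
    f'192.0.2.10 - - [02/Nov/2006:21:15:0{s} +0000] '
    f'"GET /forum/showthread.php?t={s} HTTP/1.1" 200 5120 "-" "{FAST_AGENT}"'
    for s in (1, 1, 2, 2, 3)
] + [
    '198.51.100.7 - - [02/Nov/2006:21:15:02 +0000] '
    '"GET /forum/index.php HTTP/1.1" 200 2048 "-" "Mozilla/4.0 (compatible; MSIE 6.0)"',
    '198.51.100.7 - - [02/Nov/2006:21:16:40 +0000] '
    '"GET /forum/index.php HTTP/1.1" 200 2048 "-" "Mozilla/4.0 (compatible; MSIE 6.0)"',
]

flagged = find_bursty_clients(SAMPLE_LOG)
print(flagged)
```

Keep in mind that the user agent is whatever the client chooses to send. A spider that reports itself as an ordinary browser cannot be banned by user agent without also locking out real visitors using that browser, so in that case the IP ban is the safer option.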

            You can find detailed documentation about robots.txt and .htaccess by doing a Google search for them.

            Good luck.





            • #7
              Thanks TGRS - your comments are really helpful. I hadn't considered spammers - I think you're probably correct here. I did another IP check via Domain Tools and noticed that the IP number is blacklisted for spam. Presumably then they're trying to harvest email addresses.

              I already have a robots.txt file on my server, but finding out what the user agent is here seems almost impossible. If spiders like this ignore robots.txt anyway, I can only hope that the IP ban is going to work - at least for now!

              Thanks for your help!
