Harvest HTML archive

  • Stryker
    replied
    The other side of the harvesting coin is of course blocking spiders and bots. You can do this by listing the user agents you want to exclude in a file named robots.txt, saved in the root of the publicly accessible part of your server. It's not a watertight solution, since compliance is voluntary, but it will flummox the vast majority of site stealers. Refer to www.robotstxt.org for more info.
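
    For anyone who wants to check a robots.txt before uploading it, here is a minimal sketch using Python's standard urllib.robotparser. The user-agent tokens (WebZip, SiteSucker) and the example.com URL are assumptions for illustration only; check your server logs for the strings the harvesters actually send.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: the two named bots are shut out of the
# whole site, everyone else is allowed everywhere.
ROBOTS_TXT = """\
User-agent: WebZip
Disallow: /

User-agent: SiteSucker
Disallow: /

User-agent: *
Disallow:
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "http://example.com/archive/index.php"
    for agent in ("WebZip", "SiteSucker", "Mozilla"):
        print(agent, is_allowed(ROBOTS_TXT, agent, page))
```

    This only tells you what a well-behaved bot would do with the file. A harvester that ignores robots.txt or sends a different user agent string will not be stopped by it.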

    Anyway, maybe we could open up the discussion a bit more. I've been tweaking my offline archive and have noticed that quite a lot of important formatting has been lost in the 'conversion'. The archive completely ignores quotes and so runs all levels of text together making it very difficult to follow. Also it doesn't seem to translate bold, italicised or underlined text. Maybe it's not the solution I was looking for after all. Does anyone know of any workarounds for this?


  • Stryker
    replied
    I'm not connected to Spidersoft or SiteSucker in any way whatsoever. Believe me, or don't. Do whatever research on me you like, or don't. Either way is fine by me. I posted the link to the WebZip app because I could remember it off the top of my head. The URL for SiteSucker evaded me at the time. Having had the chance to look it up now, I can see why. The domain suffix is one of the new .us varieties... www.sitesucker.us

    I've had the content taken from my site without permission numerous times before, so I know exactly where you're coming from as it happens. Nevertheless, I've never heard of anyone downloading a whole forum for the purpose of doing anything subversive with it. For a start, the harvested format isn't all that practical for doing much besides keeping an offline archive for your own reference. Aside from that, I can't imagine there's much profit to be made from wholesale forum theft. Lastly, doing anything practical with 'raw chat' involves time and effort, something site stealers aren't willing to expend in my experience.

    In any case, the information is already in the public domain, a thousand times over in fact. Anyone who knows how to use a search engine can find it. Go to www.download.com, hardly a hacker's playground or den of iniquity, and you'll find a dozen similar utilities to perform this task. What would you like to do, put a plaster over the mouth of every shareware and freeware software site?

    I don't buy the idea that if you don't talk about undesirable activities, they won't occur. Shelter your kids from all discussion of underage sex, and what happens? They don't find out about the importance of contraception and end up with unwanted kids. Oops!

    Sorry if I took your post the wrong way. However, if you don't want people to misconstrue your intentions, perhaps you should make them less opaque.
    Last edited by Stryker; Sat 15th Jan '05, 4:44am.


  • welo
    replied
    Originally posted by Stryker
    So you were actually planning to charge people for something they can do themselves, for free and with zero technical knowledge required? What happened to community spirit?
    Oh, so you're the hero now, huh? Genius move, posting a method that allows any whacko to come along and grab in ten minutes what takes the rest of us years to build up.

    FYI: People have contacted me about this and I told them how to do it once I confirmed they were VB owners. So far my total rake is ten bucks someone slipped me out of gratitude (I told them it wasn't necessary). Point your finger elsewhere once you've proven you aren't trying to sell spidersoft software.


  • Stryker
    replied
    I've just completed this procedure for my own site. If you're a PC user I'd advise you to get the web harvesting app (WebZip) from www.spidersoft.com. Mac users like me can use SiteSucker instead, which is perfect for simple, non-javascript pages like the archive produces.
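
    For the curious, what these harvesting apps do can be roughly sketched in a few lines of Python. This is not how WebZip or SiteSucker are actually implemented, just a bare-bones illustration of the idea: start at the archive index, follow every link that stays under the archive URL, and keep the HTML of each page. The function names and the fetch callback are made up for this example, and the URL layout (archive/index.php/t-10.html) is an assumption about how the archive links its pages.

```python
from html.parser import HTMLParser
from typing import Callable, Dict, List
from urllib.parse import urldefrag, urljoin


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag in a page."""

    def __init__(self) -> None:
        super().__init__()
        self.hrefs: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def archive_links(html: str, page_url: str, prefix: str) -> List[str]:
    """Return absolute links found in html that start with prefix."""
    collector = LinkCollector()
    collector.feed(html)
    links: List[str] = []
    for href in collector.hrefs:
        # Resolve relative links and drop any #fragment.
        absolute, _fragment = urldefrag(urljoin(page_url, href))
        if absolute.startswith(prefix) and absolute not in links:
            links.append(absolute)
    return links


def harvest(start_url: str, fetch: Callable[[str], str]) -> Dict[str, str]:
    """Breadth-first crawl from start_url; returns {url: html}.

    fetch is any function that takes a URL and returns the page HTML.
    Only links that begin with start_url are followed.
    """
    pages: Dict[str, str] = {}
    queue = [start_url]
    while queue:
        url = queue.pop(0)
        if url in pages:
            continue
        pages[url] = fetch(url)
        queue.extend(archive_links(pages[url], url, start_url))
    return pages
```

    To run it against a live site you would pass in a fetch function built on something like urllib.request.urlopen and then write each entry of the result to disk. Only do that on a board you own, and add a delay between requests so you don't hammer the server.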

    So you were actually planning to charge people for something they can do themselves, for free and with zero technical knowledge required? What happened to community spirit?


  • welo
    replied
    Thanks Steve. I actually found a way to do it. The thing is though, it makes it so insanely easy to harvest the archive I won't be telling anyone how because it could potentially turn into a full-blown legal issue (no board with archiving enabled is immune). If anyone needs this done then PM me and we can work out a deal based on the amount of content you have, but be prepared to offer written proof the site is yours, and to accept a legal agreement with a strict confidentiality clause.


  • Steve Machol
    replied
    There is no script or function to do this. You can try asking for a custom script over at vbulletin.org.


  • welo
    started a topic Harvest HTML archive

    Harvest HTML archive

    It is likely I will be collapsing one of my sites and exporting a good deal of its intelligence to a new project (to include the VB license). However, the site already has strong SE rankings, a lot of time left on its domain registration, and there's a lot of content there I would like to keep public.

    Since the VB3 archives are already creating the illusion of an HTML library, does anyone know of a way I can harvest that content directly as HTML pages? That way I could just publish the archive without needing VB to power it.