Need some suggestions on a MySQL/data issue.

  • Need some suggestions on a MySQL/data issue.

    I run a fansite for an online game (Ultima Online), and a fellow fansite (forums4games) lost all their data a year or so ago.

    One of their members scraped all the lost posts/stories via Google's cache.

    He put all the data he collected into a MySQL database; the data itself comes to a whopping 2.3 GB.

    I'm in the process of removing the excess columns. The only one I need is the raw data from the scraping itself, but unfortunately it's the raw HTML page, so all the HTML (including CSS style info etc.) is in there.

    Can anyone recommend a good way to extract the important bits from the 58,000 or so fields? Heh.

    Basically each field has a copy of a thread or post, which I want to save, but I obviously don't want the CSS info or HTML; I just want the raw (written) stories/text.

  • #2
    You'd have to iterate through each row and regex out the post information. It's perfectly possible, but this is the kind of thing you hire data mining experts for, as it's not straightforward. There are many factors you'd have to take into consideration, including multi-page threads and ensuring posts are re-added to the database in the correct order. I could take a look for you, but it wouldn't be cheap, as data mining is complex and time-consuming.
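
As a rough illustration of the row-by-row idea, here is a minimal Python sketch. It swaps the regex step for the standard library's `html.parser`, which tends to cope better with messy markup. The names `TextExtractor`, `extract_text` and `clean_rows` are made up for this example, and it assumes each row arrives as an `(id, raw_html)` pair from whatever MySQL driver you use. It only strips markup and CSS/JavaScript; it does not deal with multi-page threads or post ordering.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text, skipping the contents of <style> and <script>."""

    SKIP = {"style", "script"}
    # Tags treated as line breaks so paragraphs do not run together.
    BLOCK = {"p", "br", "div", "tr", "li", "h1", "h2", "h3", "blockquote"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1
        elif tag in self.BLOCK:
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if self.skip_depth == 0:
            self.parts.append(data)


def extract_text(raw_html):
    """Return the readable text of one scraped page, one block per line."""
    parser = TextExtractor()
    parser.feed(raw_html)
    parser.close()
    joined = "".join(parser.parts)
    # Collapse runs of whitespace and drop empty lines.
    lines = (" ".join(line.split()) for line in joined.splitlines())
    return "\n".join(line for line in lines if line)


def clean_rows(rows):
    """Yield (id, plain_text) for each (id, raw_html) pair (assumed row shape)."""
    for row_id, raw_html in rows:
        yield row_id, extract_text(raw_html)
```

You would feed `clean_rows` the rows in batches rather than loading all 2.3 GB at once, and write the results into a new column or table so the original scrape stays untouched.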
    Dean Clatworthy - Web Developer/Designer

    • #3
      Just edit it, but make sure you don't delete anything important. >.<
