Announcement

Collapse
No announcement yet.

Unicode characters stored in database as &#N; causes massive fulltext index bloat

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Unicode characters stored in database as &#N; causes massive fulltext index bloat

    Hello,

    Because our vB 3.8.7 board stores utf-8 characters as numeric character references in the database, every single unicode character is considered a "word" by MyISAM. This is becoming a performance issue; our fulltext indices are bursting at the seams with useless entries. Here are the top 50 indexed "words" in our vb_post table:

    Code:
    [email protected]# myisam_ftdump -v -c vb_post 4 | sort -n -r -b
        Count            Weight    Word
        61576            2.8448960 12398
        51118            3.0408575 22899
        49685            3.0706301 12540
        45861            3.1542822 12531
        40456            3.2846986 12375
        38975            3.3233632 12356
        38282            3.3419443 12523
        37882            3.3528175 32654
        37773            3.3557997 12452
        37514            3.3629192 12426
        37372            3.3668427 23376
        37023            3.3765472 12394
        36388            3.3944333 22823
        35983            3.4059992 20986
        35499            3.4199874 12473
        35307            3.4255876 20154
        33555            3.4780959 12522
        32600            3.5078476 38291
        31774            3.5342704 26178
        30760            3.5676343 12524
        30079            3.5906469 26085
        30069            3.5909885 12483
        29448            3.6124265 12427
        29090            3.6249861 12363
        28985            3.6286983 12383
        28292            3.6535323 12463
        28164            3.6581839 23567
        28141            3.6590220 12373
        27736            3.6738890 12414
        27431            3.6852255 12395
        27341            3.6885941 20013
        27274            3.6911089 12459
        26980            3.7022158 21517
        26926            3.7042687 12391
        26599            3.7167863 12435
        26080            3.7369655 12392
        26044            3.7383797 2010
        25894            3.7442929 12415
        25829            3.7468656 12488
        25164            3.7735563 29255
        24900            3.7843439 12450
        24855            3.7861938 24433
        24235            3.8120205 12390
        24185            3.8141314 12471
        24068            3.8190875 12513
        24004            3.8218086 12428
        23745            3.8328932 12425
        23634            3.8376800 12521
        23497            3.8436185 12472
    <snip>
    As you can see, only one real "word" in there: 2010. That's a whopping 98.27% wasted space.

    This is crazy. How can I make vBulletin do the right thing and store utf-8 into my utf-8 vb_post and vb_thread tables? Any advice would be greatly appreciated.
    Last edited by Robly; Wed 20th Apr '11, 8:59am.

  • #2
    To answer my own question: go to AdminCP > Language Manager > Edit Settings, make sure you set your HTML Character Set is "utf-8" BEFORE you make your website public or add any content.

    So now I've got this board with millions of posts all made under the wrong charset (ISO-8859-1). Setting the HTML Character Set to "utf-8" now is not workable, because I'll end up with a mix of searchable and non-searchable content. I suspect the only solution is to extract the data, convert the numeric character references to characters, and load them back into to the database. What a pain! Does anyone have any tips on accomplishing this?

    Comment

    widgetinstance 262 (Related Topics) skipped due to lack of content & hide_module_if_empty option.
    Working...
    X