
Need to know which Encoding/Collation to use (persistent foreign characters problem)


  • Need to know which Encoding/Collation to use (persistent foreign characters problem)

    Like many boards in languages other than English, after moving the site a few times I ended up with a big salad of strange characters showing in the forum. It seems each move adds a new "fresh" batch of character set problems.

    I have searched and read as much as I could here, but I am thoroughly confused. I need to know what the recommended settings and MySQL version combinations are. It seems there are many factors: the charset of the board, the charset of the language file, the database encoding, the collation, vBulletin defaulting to 8859, databases that become "hybrid" and produce "collation mix" errors, and so on.

    My character encoding problem is only growing, so can someone please translate the following post from Mike Sullivan into practical, step-by-step advice for users of "foreign" characters?
    I think there are some misunderstandings here of what using a latin1-based encoding for XML/databases does to UTF-8 data that's inserted.

    The first key is that you have to consider ISO-8859-1/Latin1 not as a set of characters but as a set of byte values that represent characters. You can also string together multiple values (i.e., multiple characters)--this part is important when you look at UTF-8 as byte values as well. If you just look at which byte combinations are valid, you will find that UTF-8 becomes a subset of Latin1; every string of bytes is valid Latin1 but this is not the case with UTF-8. (Clearly, in terms of actual characters, Latin1 is a subset of UTF-8.)
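
    A minimal Python sketch of the byte-level point above, for illustration only. The variable and function names are made up for this example and are not part of vBulletin or MySQL. It assumes strict ISO-8859-1 as implemented by Python's `latin-1` codec, where each of the 256 byte values maps to exactly one character:

```python
# Every possible byte value decodes under Latin1, so decoding never fails.
all_bytes = bytes(range(256))
as_latin1 = all_bytes.decode("latin-1")

# "á" stored as UTF-8 is two bytes (0xC3 0xA1)...
utf8_bytes = "á".encode("utf-8")
# ...which Latin1 reads, without complaint, as two separate characters.
seen_as_latin1 = utf8_bytes.decode("latin-1")

# "á" stored as Latin1 is the single byte 0xE1, which is not valid UTF-8.
latin1_bytes = "á".encode("latin-1")


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes can be decoded as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print(len(as_latin1))               # 256
print(ascii(seen_as_latin1))        # '\xc3\xa1', displayed as "Ã¡"
print(is_valid_utf8(utf8_bytes))    # True
print(is_valid_utf8(latin1_bytes))  # False
```

    This is why UTF-8 data can pass through a Latin1-labelled store unchanged as long as nothing converts it on the way in or out: the bytes are accepted either way, and only the label differs.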

    So, with that in mind, you need to be aware that vBulletin tries to be as unaware of the character set used by the database or the XML parser as possible. Violating this assumption by making only a few specific changes is what has broken UTF-8. You're right, there are a lot of places that are hard coded to ISO-8859-1--and for the record, we recommend DB char sets also be latin1--but since UTF-8 is valid ISO-8859-1, this is ok.
    I'm super confused after reading this (and it's the meatiest post I could find on the subject). So far each server move made all existing content display with garbled accented vowels, and only the new posts display right. And I have no clue how to stop this, don't even know why it happens.

  • #2
    There is no practical/general advice in that post, as it was specifically pointed at one person's claim about vB's UTF-8 support.

    If you have problems after a server move, that means one of two things:
    - the export part is broken
    - something on the new server is different from the old server (i.e., the import part is broken)

    I'm assuming that this is a mysqldump and restore, so that would mean that all the vB settings are the same. (If that is not true, then we have to look at different things!) Unfortunately, I can't comment on why some data would be broken if new posts work. This generally means that some data was lost in the backup/restore process, which generally means the old data isn't retrievable.


    • #3
      Thank you Mike for your reply.

      I am trying to figure out which encodings I need to use to prevent garbled characters from appearing on my Spanish-language forum.

      It would greatly help me if you could tell me which encoding I need to use, especially in the database backup file itself. Moving servers brings problems because the database dump file (I use MySQLDumper) is not itself encoded in UTF-8. It looks like the backup uses a "Western" encoding, so when it is restored to a phpMyAdmin-created database that is UTF-8 / Latin1_swedish_ci by default, all accents become strange symbols. It seems that the restored portion of the database is ASCII, since the garbled characters are ASCII's representation of UTF-8's accented vowels. Thus the first layer of garbled text is born. New forum activity is fine, though: it seems to be UTF-8, and the accented vowels and other special characters display correctly... until you move servers again, at which point they become garbled in turn. Meanwhile the previously garbled batch gets even stranger with each iteration; it's like double-garble.

      For instance, look at this progression:

      1 (fresh content, not restored to new DB): áéíóúñ
      2 (restored content after server move): áéÃ*óúñ
      3 (restored content after two server moves): áéÃÂ*óúñ
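
      A progression of this kind can be reproduced in a few lines of Python. This is only a sketch under one assumption: that each move takes the UTF-8 bytes, reads them as Latin1, and stores the result as UTF-8 again. The helper name is invented for the example, and the real dump/restore chain may differ (for instance, MySQL's `latin1` is closer to Windows-1252 than to strict ISO-8859-1).

```python
def misread_utf8_as_latin1(text: str) -> str:
    """Encode text as UTF-8, then decode those bytes as if they were Latin1."""
    return text.encode("utf-8").decode("latin-1")


fresh = "áéíóúñ"
after_one_move = misread_utf8_as_latin1(fresh)
after_two_moves = misread_utf8_as_latin1(after_one_move)

for label, value in [
    ("fresh", fresh),
    ("one move", after_one_move),
    ("two moves", after_two_moves),
]:
    # ascii() shows invisible characters such as U+00AD as visible escapes.
    print(f"{label:10} length={len(value):2} {ascii(value)}")
```

      The string doubles in length on each pass (6, 12, then 24 characters). After one pass, `í` becomes `Ã` followed by U+00AD (soft hyphen), which is probably what the `*` in the samples above stands for.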

      The new content coming from new forum activity after restoring the database displays correctly, but the forum titles and descriptions never do. So with each server move I have to manually re-type the forum names and descriptions so that they display right.

      The intuitive thing to do is to set everything to the same encoding, right? So everything should be UTF-8 and that's it, but vBulletin uses ISO-8859-1. So what should I do? Should I set the language encoding to 8859 or UTF-8? What about the language XML document itself: should it be UTF-8 with no BOM, or plain Western encoding? How about the heading of the XML document? Sometimes things worked better for me after editing the XML language file to UTF-8. How about creating a new database under phpMyAdmin: should it be UTF-8/Latin1_swedish_ci or some other encoding? And what about the database backup itself: should this file be UTF-8 before restoring it? Because if it's Western, it becomes garbled when restored to a UTF-8 database.


      • #4
        Ignore vB for now. You can set its character set by editing a language in the language manager. You don't need to touch anything else. (You definitely don't want to be editing XML files. That's what created the issue you linked to in the first post.)

        Your example content looks like UTF-8 data being displayed as Latin1/ISO-8859-1. The thing is, if this were just an issue in vB (wrong character set), then the server move from 2 to 3 shouldn't change the data. This makes me think there's something wrong with the dump/restore process. The most compatible method is detailed in our manual here, though it does require shell access.
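
        If the data really is UTF-8 that was read as Latin1, and no bytes were lost along the way, the misreading can in principle be undone by reversing it. The Python sketch below is an illustration only, not a vBulletin or MySQL tool: `undo_mojibake` is a made-up helper, it assumes strict Latin1 (a Windows-1252 step in the chain would need a different codec), and it should only ever be tried on a copy of the data.

```python
def undo_mojibake(text: str, max_rounds: int = 3) -> str:
    """Reverse 'UTF-8 bytes read as Latin1', up to max_rounds times.

    Stops as soon as the text cannot be reinterpreted, which is what
    happens once the text is already correct (or is not this kind of damage).
    """
    for _ in range(max_rounds):
        try:
            candidate = text.encode("latin-1").decode("utf-8")
        except (UnicodeEncodeError, UnicodeDecodeError):
            break  # not (or no longer) UTF-8 misread as Latin1
        if candidate == text:
            break  # pure ASCII: nothing left to change
        text = candidate
    return text


original = "áéíóúñ"
garbled_once = original.encode("utf-8").decode("latin-1")
garbled_twice = garbled_once.encode("utf-8").decode("latin-1")

print(undo_mojibake(garbled_once) == original)   # True
print(undo_mojibake(garbled_twice) == original)  # True
print(undo_mojibake(original) == original)       # True
```

        The loop is deliberately conservative: correctly stored accented text fails the UTF-8 decode and is returned untouched. A rare string that is legitimately something like `Ã©` would still be altered, so results need checking by eye.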

        I'm not familiar with MySQLDumper. If you look at the CREATE TABLE statements, do they include CHARSET references? I could see it not having anything or saying it's latin1. But to get the behavior seen here, I think the new DB would have to be set to UTF-8.
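
        One way to answer that question without reading the whole dump by hand is a small scan like the Python sketch below. It is an assumption-laden illustration: the patterns match the `DEFAULT CHARSET=...`, `CHARACTER SET ...` and `SET NAMES ...` forms that mysqldump-style files typically contain, MySQLDumper's output may be laid out differently, and `charset_report` is a name invented for this example.

```python
import re
from collections import Counter

# Table- or column-level declarations, e.g. "DEFAULT CHARSET=latin1"
# or "CHARACTER SET utf8".
DECLARED_CHARSET = re.compile(
    r"(?:CHARSET|CHARACTER\s+SET)\s*=?\s*(\w+)", re.IGNORECASE
)
# Connection-level statement, e.g. "/*!40101 SET NAMES utf8 */;"
SET_NAMES = re.compile(r"SET\s+NAMES\s+'?(\w+)'?", re.IGNORECASE)


def charset_report(dump_text: str) -> dict:
    """Count the character sets named anywhere in the text of an SQL dump."""
    return {
        "declared": Counter(
            name.lower() for name in DECLARED_CHARSET.findall(dump_text)
        ),
        "set_names": Counter(
            name.lower() for name in SET_NAMES.findall(dump_text)
        ),
    }


# Tiny made-up dump fragment, used only to show the shape of the result.
sample = (
    "/*!40101 SET NAMES utf8 */;\n"
    "CREATE TABLE post (\n"
    "  title VARCHAR(250)\n"
    ") ENGINE=MyISAM DEFAULT CHARSET=latin1;\n"
)
print(charset_report(sample))
```

        For a real file, the text can be loaded with `open(path, "rb").read().decode("latin-1")`, where `path` is wherever the dump lives. Decoding as Latin1 never fails, which is enough for finding these ASCII keywords. An empty report would mean the dump carries no charset information at all, so the restore falls back to the new server's defaults.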

        If this info isn't of any help, it'll probably be best to submit a support ticket and ask that it be assigned to me. I'd just need to know where the new forum is to look into this issue for now.

