Import Optimization

  • Import Optimization

    We've got just under a million posts on a phpBB2 forum that I'm playing around with importing.

    I'm doing several test imports now, before moving the forum software over completely in a week's time.

    I'm just wondering what else I can do to speed things up. We're running it on a dedicated server (quad core, heaps of RAM, etc.), and I'm running ImpEx on its own using the config file rather than as part of the AdminCP, yet it still takes over 2 hours for just the posts to import.

    Is there a way I can run it via command-line PHP on the server? I tried to use lynx in the shell, but unfortunately it won't automatically refresh.
    Last edited by Stubbed; Sun 13 Jan '08, 1:16pm.

  • #2
    I've done some CLI testing and there isn't enough of an improvement (to my surprise) to warrant supporting it.

    One of the main things you can do is turn the dupe checking off in ImpExConfig:

    PHP Code:
    define('dupe_checking', false);
    Give PHP loads of memory. Typically here I give it 256-512M and do 20,000 items per page.

    Also, the database is set up with a lot of caching, as ImpEx does a lot of lookups to get import IDs.

    With PHP and MySQL on the same server, turning the dupe_checking off should make a considerable difference.
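
    For reference, the two settings above could be put together roughly like this. It is only a sketch: `dupe_checking` is the constant named above, while raising the limit at runtime with `ini_set()` is an assumption (setting `memory_limit` in php.ini works just as well), and 512M is simply the top of the range mentioned above.

```php
<?php
// Sketch only: skip the per-item duplicate lookup during the import.
define('dupe_checking', false);

// Assumption: the limit is raised at runtime; memory_limit in php.ini is the alternative.
ini_set('memory_limit', '512M');
```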
    I wrote ImpEx.



    • #3
      Have given php 120M, will increase it when we're doing our final import.

      What is the disadvantage to turning dupe checking off?



      • #4
        Originally posted by Stubbed
        Have given php 120M, will increase it when we're doing our final import.

        What is the disadvantage to turning dupe checking off?
        Dupe checking checks whether the item being imported is already there, i.e. whether its import ID already exists. It was put in to stop duplicates for people who hit the back button, hit refresh, or had an unstable server that would die halfway through a page.

        If you have a solid connection to the server, and the server is responsive, it's fine to turn off.
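
        To make the trade-off concrete, here is a minimal sketch of the idea, not the actual ImpEx code. The function `import_item()` and the in-memory `$imported` array are hypothetical stand-ins for the import ID lookup against the database.

```php
<?php
// Hypothetical sketch: $imported stands in for the import ID lookup in the database.
function import_item(array &$imported, int $importId, string $body, bool $dupeChecking): bool
{
    // With dupe checking on, an item whose import ID is already present is skipped.
    if ($dupeChecking && isset($imported[$importId])) {
        return false;
    }
    // With it off, the item is always inserted, so a re-run page creates a duplicate.
    $imported[$importId][] = $body;
    return true;
}

$withCheck = [];
import_item($withCheck, 1, 'first post', true);
import_item($withCheck, 1, 'first post', true);      // page refreshed: skipped

$withoutCheck = [];
import_item($withoutCheck, 1, 'first post', false);
import_item($withoutCheck, 1, 'first post', false);  // page refreshed: duplicated

echo count($withCheck[1]), ' vs ', count($withoutCheck[1]), "\n";
```

        The saving comes from skipping the lookup on every item; the cost is that a refreshed or half-finished page is no longer caught.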
        I wrote ImpEx.



        • #5
          Ahh. Fantastic, will definitely turn that off when we do the final import.

          Cheers for your replies



          • #6
            Just in case anyone is interested, here's the time difference with dupe checking on and off.

            With dupe checking off, I did have to use the remove duplicates option in ImpEx, however. Not too sure how I got dupes, as the process seemed to go through painlessly. That only added about 2 minutes.
            Attached Files



            • #7
              2 mins, curious & interesting.

              I dare say that's a decent database caching the import IDs and giving a very fast response.

              A good time for that many posts, I'd think.

              I think I'll have to add in the overhead for the error checking and handling to the next major version, so we can get a better idea of the time costs and the trade-off of speed vs. proof of integrity.
              I wrote ImpEx.



              • #8
                Originally posted by Jerry
                2 mins, curious & interesting.
                I may not have been exact with that (I'll be doing another test import soon, so will get the exact time), but it definitely wasn't any more than 10 minutes, so still a better trade-off!



                • #9
                  Would be interesting to profile that here to see where the bottleneck is.
                  I wrote ImpEx.



                  • #10
                    Another import test for Beta 4:

                    My current forum:
                    Our users have posted a total of 880,000 articles
                    We have 4,400 registered users

                    Import thresholds:
                    Users: 2000 (Would use larger if we had more members, but we actively cull inactive ones)
                    Posts: 50000
                    Threads: 25000
                    Private Messages: 25000
                    Polls: 1000

                    Observations:
                    When importing the posts, the first 200,000 or so go through extremely quickly (I can hardly scroll the page fast enough to keep up). After that, they slow down dramatically: as the process gets to the newer posts, it gets slower and slower.

                    Rebuild Thread Information: 10mins:30secs
                    Rebuild Forum Information: 4 secs :P
                    Rebuild Search Index: Threshold 25000, 2mins:45secs

                    This time I had no duplicate posts; I'm assuming that I must have bumped refresh during the last import.

                    On a side note, about importing avatar pictures: I entered the full path as /var/www/website.com/phpBB2/images/avatars, but it doesn't appear to have taken them. Do I need a trailing slash or something else?

                    Also, this thread should probably be moved to the 3.7 installing/upgrading forum, which didn't exist when I started this topic.
                    Attached Files



                    • #11
                      Thanks for the feedback.

                      I have found the same with the initial speed of imports myself. Last time I was testing and profiling, it was the database setup that led me to improvements.

                      There are things such as LOAD DATA INFILE etc., though due to the current atomic nature of the way ImpEx works I can't use that: consecutive posts might need the previous one's insert ID as a parent post ID, etc.

                      Seeing as the majority of boards are on servers where the database config is set and managed outside of the users'/admins' control, I didn't spend too much time focused on that, as the answers I would have found wouldn't have been much use.
                      I wrote ImpEx.



                      • #12
                        Originally posted by Jerry
                        Seeing as the majority of boards are on servers where the database config is set and managed outside of the users'/admins' control, I didn't spend too much time focused on that, as the answers I would have found wouldn't have been much use.
                        I'm more than interested in that information if you've got it available, as that is not the case with our setup.



                        • #13
                          I raised the key_buffer and turned off the indexing for the duration of the import from what I recall, then rebuilt the indexes afterwards.
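
                          The exact statements aren't recorded here, so treat the following as a rough sketch under assumptions rather than what was actually run. It builds the MySQL statements one might issue before and after an import; the helper `import_index_statements()`, the table names `post` and `thread`, and the 512 MB buffer value are all hypothetical. `DISABLE KEYS` / `ENABLE KEYS` only affect non-unique indexes on MyISAM tables.

```php
<?php
// Hypothetical helper: builds the MySQL statements to run around an import.
// Table names and the buffer size passed in below are assumptions, not ImpEx defaults.
function import_index_statements(array $tables, string $keyBufferSize): array
{
    // key_buffer_size is a dynamic MySQL system variable (value in bytes).
    $before = ["SET GLOBAL key_buffer_size = {$keyBufferSize}"];
    $after  = [];
    foreach ($tables as $table) {
        // Stop maintaining non-unique indexes while rows are inserted (MyISAM only).
        $before[] = "ALTER TABLE `{$table}` DISABLE KEYS";
        // Rebuild the indexes in one pass once the import has finished.
        $after[]  = "ALTER TABLE `{$table}` ENABLE KEYS";
    }
    return ['before' => $before, 'after' => $after];
}

$plan = import_index_statements(['post', 'thread'], '536870912');
foreach (array_merge($plan['before'], $plan['after']) as $sql) {
    echo $sql, ";\n";
}
```

                          The 'before' statements would be run ahead of the import and the 'after' ones once it finishes; the index rebuild time is then paid once at the end instead of on every insert.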
                          I wrote ImpEx.



                          • #14
                            Just doing a few more test imports (yeah, so I've totally blown out the original 'import after a week' time frame and decided to wait for vB 3.7 final). A slight improvement can also be made with the following:

                            define('shortoutput', true);

                            Less information for the browser to download and display. Appears to speed things up ever so slightly, but on a big import every little thing helps!



                            • #15
                              As well as that, try dropping and rebuilding some of the indexes (full text):

                              http://www.vbulletin.com/docs/html/med_large_import

                              Some imports speed up by a factor of 10 doing that.
                              I wrote ImpEx.
