No announcement yet.

Important issue with ImpEx

  • Filter
  • Time
  • Show
Clear All
new posts

  • Important issue with ImpEx

    Hi People!
    I just noticed something unholy with the import system

    It's noticable only when importing very large tables. In my case I am importing a vb2 forum and the post table has 5.5 million posts.

    anyway, the mysql query being used uses a "LIMIT $start, $perpage"

    The reality is that when using just a LIMIT $start,offset, mysql actually selects ALL rows, then displays just the ones described in the LIMIT condition.
    IN my case it was selecting everyithing from the post table - all 5.5 million records and then just returning the 3000 that I set 'per page'

    If you don't beleive me, just try the explain select * from post LIMIT $start,$offset and see for yourself

    Actually the same problem is mentioned on mysql site in the comments section here:

    THe much better way (I actually changed it in 001.php file in systems/vb2/
    is to use the "where" clause like where postid>$start limit $perpage

    After I made that change the importer is about 50 faster for me.

    This is a very important issue because in most cases will go unnoticed especially if table has fewer than a million records.

    Also, if gets even more interesting: when you first start the import process, it will go pretty fast, then you may noticed that it is slowing down as it imports more and more posts. This is because of the way mysql does the SELECT with LIMIT clause. For example if you have 5000000 rows in post table, at first your query will look like
    SELECT * from post LIMIT 10000,3000

    In this case mysql will start selecting ALL records but will stop as soon as it gets the required rows (required by the LIMIT clause), so in this case it will stop the full row scan as soon as it gets the posts from 10000 to 13000 (after only 13000 rows)

    but as your importer progresses, and the number or rows Mysql has to scan to satisfy the select will grow. So a statement like SELECT * from post LIMIT 1000000,3000 will result in a scan of one million 3 thousand rows before it gets all the required rows.

    You can read carefully that info on mysql website for the description of this behaviour.
    This is a MySQL way of doing things. Pretty dumb, but that's the way it is - it does not use index when using only LIMIT clause. That's why you need the add the "where" clause.
    and you can't even fool the mysql by adding the 'use index (PRIMARY)' - it will still not use index!
    Last edited by Mitrofan; Sat 28 Oct '06, 5:11pm. Reason: added stuff
    'beta tester'

  • #2
    Indeed this isn't the best way to get a page's worth of records at all, though once you start thinking about the fact that lots of source boards don't have sequential id's for data, as with the life of a board they will be deleted, removed etc.

    In cases like that you will be getting out of order id's or sets that are greater than the start_at and per_page. Meaning that you have to re-calculate the start point for the next page or you will start skipping and missing posts.

    That isn't a hard thing to do at all I know, though would change the way ImpEx works, and it is being done in ImpEx2, all of the 000.php files are being change to use a generic get_source_data function as the SQL is similar if not exactly the same across 95% of the functions for database systems.

    I appreciate the feed back of your thoughts and testing, this will be put back into ImpEx2 for sure
    I wrote ImpEx.

    Blog | Me


    widgetinstance 262 (Related Topics) skipped due to lack of content & hide_module_if_empty option.