
I am trying to insert a lot of users into a MySQL database with two tables:

The first table contains the user data. An example INSERT looks like this (id is the primary key, mail is a unique key):

INSERT INTO users (id, mail, name) VALUES (NULL, "[email protected]", "John Smith") ON DUPLICATE KEY UPDATE name = VALUE(name) 

The second table contains the group the user belongs to. It only stores two foreign keys users_id and groups_id. An example query looks like this:

INSERT INTO users_groups (users_id, groups_id) VALUES (LAST_INSERT_ID(), 1) 

This setup works perfectly fine for small data sets. When I import large amounts of data (>1M rows) the INSERTs get slow. Obviously, it would be much better to do a batch insert:

INSERT INTO users (id, mail, name) VALUES (NULL, "[email protected]", "John Smith"), (NULL, "[email protected]", "Anna Smith") ON DUPLICATE KEY UPDATE name = VALUE(name) 

and:

INSERT INTO users_groups (users_id, groups_id) VALUES (LAST_INSERT_ID(), 1), (LAST_INSERT_ID(), 4) 

The problem, of course, is that LAST_INSERT_ID() only returns one id (the first) of a batch INSERT.
So what I would need is a "nested" batch INSERT, which as far as I can tell does not exist in MySQL.

What can I do to make my INSERTs faster?

  • If you have an index on the mail column, for example, you could just go with an INSERT ... SELECT statement to fill up the group table, using a CASE WHEN expression in the SELECT to assign the group depending on the email (a rough sketch of that idea follows these comments). But if harvey's solution is confirmed, then that is the better option. Commented Dec 4, 2014 at 23:47
  • Can you give some context? Programmatically inserting such a large number of rows is an unusual use case. If this is a one-time operation, a long execution time may be a non-issue. If this is a regular synchronisation, then perhaps you should look into replication, or even sharing the user table across your applications. Commented Dec 10, 2014 at 15:51
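A rough sketch of that first comment's INSERT ... SELECT / CASE WHEN idea, assuming there is an index on users.mail and that the group really can be derived from the mail address; the domain-to-group mapping below is invented purely for illustration:

-- Hypothetical mapping: one domain goes to group 1, everything else to group 4.
INSERT INTO users_groups (users_id, groups_id)
SELECT u.id,
       CASE
         WHEN u.mail LIKE '%@example.com' THEN 1
         ELSE 4
       END
FROM users AS u;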

4 Answers


Bulk inserts by default provide sequential auto-increment values; with this knowledge you can do your inserts like this:

INSERT INTO users (id, mail, name) VALUES (NULL, "[email protected]", "John Smith"), (NULL, "[email protected]", "Anna Smith"), (...) # repeat n-times ; SET @LASTID=LAST_INSERT_ID() ; INSERT INTO users_groups (users_id, groups_id) VALUES (@LASTID - n , 1), # Note n in descending sequence (@LASTID - n-1, 1), ... (@LASTID - 1 , 1), (@LASTID - 0 , 4) ; 

For more information on bulk inserts and auto increment have a look at http://dev.mysql.com/doc/refman/5.1/en/innodb-auto-increment-handling.html

Importantly, make sure that innodb_autoinc_lock_mode=1 (the "consecutive" lock mode), which you can check with:

SHOW GLOBAL VARIABLES LIKE 'innodb_autoinc_lock_mode';

Otherwise, consider wrapping your inserts in LOCK TABLES:

LOCK TABLES tbl_name WRITE;
... sqls ...
UNLOCK TABLES;
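As a concrete sketch of that wrapping for the two tables in the question (every table you touch while the lock is held has to be listed in the LOCK TABLES statement; adapt the body to your actual statements):

LOCK TABLES users WRITE, users_groups WRITE;

-- the batch INSERT into users, the SET @LASTID, and the
-- INSERT into users_groups from above go here

UNLOCK TABLES;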

12 Comments

Are you sure the first value line corresponds to the first generated id, though? I'd like to see a link about that.
@Sebas: There's no way to get the insert id of the first value, which is why you get the last id and subtract backwards.
My question is: are you sure @LASTID - n is the ID generated for [email protected] ?
The INSERT query from the question has an ON DUPLICATE KEY UPDATE clause. If id is the only UNIQUE index of the table, that clause is harmless and not needed. But if there is another UNIQUE index on the table (on the mail column, for example) and a duplicate key occurs during the INSERT, then this whole LAST_INSERT_ID() arithmetic breaks down, because in that case the id is incremented but not used (the existing row is updated instead).
Yes, that is exactly the case, so your approach will unfortunately not work for me.

If you're putting millions of known rows into a table all at once, consider using LOAD DATA INFILE since it's intended for speed in just that type of scenario, as evidenced by this quote from the docs:

The LOAD DATA INFILE statement reads rows from a text file into a table at a very high speed.

And at Speed of INSERT Statements:

When loading a table from a text file, use LOAD DATA INFILE. This is usually 20 times faster than using INSERT statements.

This is assuming that your source data is coming from, or could be provided as, a text file. If you have the group ID in the file as well, you might do something like this:

CREATE TEMPORARY TABLE load_users_groups (
  mail VARCHAR(60),
  name VARCHAR(60),
  groupid INT,
  PRIMARY KEY (mail, name)
);

LOAD DATA INFILE '/path/to/file.csv'
INTO TABLE load_users_groups
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n';  -- use whatever optional syntax is required to parse your file

INSERT INTO users (mail, name)
SELECT mail, name
FROM load_users_groups
ON DUPLICATE KEY UPDATE name = VALUES(name);

INSERT INTO users_groups (users_id, groups_id)
SELECT users.id, load_users_groups.groupid
FROM users
JOIN load_users_groups USING (mail, name);

DROP TEMPORARY TABLE load_users_groups;

Whether this approach ends up being faster than your current approach depends on whether you save more time using LOAD DATA INFILE than you spend performing two additional INSERT ... SELECT statements to move the data into your desired tables. You may want to tweak keys on the temporary table; I can't benchmark it for you based solely on the contents of your question. I'd be interested to know how it works out, though.

The documentation also has a decent number of tips for Bulk Data Loading for InnoDB Tables and Bulk Data Loading for MyISAM Tables. I won't go through them in detail, not least because you haven't given us any DDL or server info, but you may find it helpful to read through one or the other on your own time.
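As one example of those tips, the InnoDB bulk-loading page suggests temporarily disabling autocommit and the uniqueness and foreign-key checks for the duration of the load. A hedged sketch, to be verified against your own MySQL version and data before relying on it:

SET autocommit = 0;
SET unique_checks = 0;
SET foreign_key_checks = 0;

-- LOAD DATA INFILE and the INSERT ... SELECT statements go here

COMMIT;
SET unique_checks = 1;
SET foreign_key_checks = 1;
SET autocommit = 1;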



Had to deal with a similar issue.

MySQL doesn't really offer a reliable way to reserve large batches of table IDs for this purpose. I spent a good half day researching, to no avail. There are some hacks floating around, but nothing I'd bank my data on.

I just did the users table with one-by-one inserts (better slow than screwy) and returned the id of each new row to my ORM. With a row ID to work with, I could then put it and the data still to be imported into a JSON structure that keeps them together. That made it much easier to bulk insert the rest and keep the data matched up.
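In plain SQL terms (leaving the ORM and JSON bookkeeping aside), that per-row flow looks roughly like the sketch below; the id = LAST_INSERT_ID(id) clause is a documented MySQL trick, not something used in the question, that makes LAST_INSERT_ID() return the existing row's id when the duplicate-key path is taken:

-- Insert one user at a time and capture the generated (or existing) id.
INSERT INTO users (id, mail, name)
VALUES (NULL, "[email protected]", "John Smith")
ON DUPLICATE KEY UPDATE name = VALUES(name), id = LAST_INSERT_ID(id);

SET @this_users_id = LAST_INSERT_ID();  -- keep this next to the row's remaining data

-- Later, once all ids have been collected, the users_groups rows can be batch-inserted:
INSERT INTO users_groups (users_id, groups_id) VALUES (@this_users_id, 1), (...);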

Best.

