
I have some data in CSV files. The volume of the data is huge (around 65GB), and I want to insert it all into a database so that it can be queried later.

The CSV format itself is pretty simple: there are only 5 columns, so basically all the data will be inserted into a single table.

I have tried inserting this data into a MySQL database, but it is taking a very long time: almost 6 hours to insert just 1.3GB of it (my processor is a Core i5 at 2.9 GHz, with 4GB of DDR3 RAM).

The loading needs to finish quickly; all of the inserts should be done within 4-5 days.

Which database will show the best performance in this case, provided that a reasonable query speed on the data is acceptable?

Also, are there any other steps/practices I should follow?

4 Comments
  • How are you inserting the data: one INSERT statement per record, or a batch approach that inserts multiple records with one INSERT statement? Commented Mar 16, 2013 at 17:29
  • @jurgenreza I am inserting the data in a batch approach, with the LOAD DATA INFILE syntax. Commented Mar 16, 2013 at 17:35
  • The MySQL website claims LOAD DATA INFILE is 20 times faster than INSERT! This link has some tips, like avoiding index updates and consistency checking (see the sketch after these comments). Take a look in case you haven't already: dev.mysql.com/doc/refman/5.5/en/insert-speed.html Commented Mar 16, 2013 at 18:26
  • Is the DB on an SSD? If not, you might want to get one; it should speed this up dramatically. Commented Mar 16, 2013 at 19:35
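
Following up on the LOAD DATA INFILE comment, here is a minimal sketch of a load with index maintenance deferred. The table name mytable, the column layout, and the file path are hypothetical, and note that ALTER TABLE ... DISABLE KEYS only defers non-unique index builds, and only for MyISAM tables:

    mysql> ALTER TABLE mytable DISABLE KEYS;   -- defer non-unique index builds (MyISAM only)
    mysql> LOAD DATA INFILE '/data/part_aa.csv'
        ->   INTO TABLE mytable
        ->   FIELDS TERMINATED BY ','
        ->   LINES TERMINATED BY '\n';
    mysql> ALTER TABLE mytable ENABLE KEYS;    -- rebuild the indexes once, after the load

Rebuilding the indexes in one pass at the end is generally much cheaper than updating them row by row during the load.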

1 Answer


You probably don't even need to import it. You can create a table with ENGINE=CSV.

    mysql> create table mycsv(id int not null) engine=csv;
    Query OK, 0 rows affected (0.02 sec)

Then go into your data directory, remove mycsv.CSV, and move/copy/symlink your CSV file into place as mycsv.CSV. Go back to mysql, type flush tables;, and you're good to go. (Note: this may not work with \r\n line endings, so you may need to convert those to \n first.)
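
As a concrete sketch of that sequence (the database name mydb, the datadir path, and the source file path are examples; the CSV engine also requires every column to be NOT NULL and supports no indexes):

    mysql> create table mycsv(id int not null) engine=csv;
    $ cd /var/lib/mysql/mydb                 # your datadir may differ; see the comments below
    $ rm mycsv.CSV
    $ ln -s /path/to/yourdata.csv mycsv.CSV
    mysql> flush tables;
    mysql> select count(*) from mycsv;       -- the file's rows are queryable immediately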

If you are using InnoDB, the problem is that it has to keep an undo log entry for every row inserted, and this takes a lot of resources and a long time. It's better to load in smaller batches so it can do most of the undo log tracking in memory. The undo log is there in case you Ctrl-C the load in the middle and it needs to roll back; once a batch has been committed, InnoDB doesn't need to keep track of it anymore. If you do it all at once, it has to keep track of all those undo log entries, probably having to go to disk -- and that's a killer.
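
A minimal sketch of batched loading into InnoDB, assuming the big file has already been split into chunks (for example with the Unix split utility); mytable and the chunk paths are placeholders:

    mysql> SET autocommit = 0;
    mysql> SET unique_checks = 0;        -- skip uniqueness checks during the load
    mysql> SET foreign_key_checks = 0;   -- skip foreign key checks during the load
    mysql> LOAD DATA INFILE '/data/chunk_aa.csv' INTO TABLE mytable
        ->   FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n';
    mysql> COMMIT;                       -- commit each chunk so undo tracking stays in memory
    ... repeat LOAD DATA + COMMIT for each remaining chunk ...
    mysql> SET unique_checks = 1;
    mysql> SET foreign_key_checks = 1;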

I prefer to use MyISAM for data when I know I'm not going to need row-level locking -- for example, when I want to run one long program to analyze the data. The table is locked, but I only need one program running on it. Plus, you can always use MERGE tables: they take MyISAM tables and group them together into one table. I like doing this for log files where each table is a month of data; then I have a MERGE table for the year. The MERGE table doesn't copy the data, it just points to each of the MyISAM tables.
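
Here is a sketch of that monthly layout (the table names and one-column schema are invented for illustration; all underlying tables must be identically structured MyISAM tables):

    mysql> CREATE TABLE log_2013_01 (msg varchar(255)) ENGINE=MyISAM;
    mysql> CREATE TABLE log_2013_02 (msg varchar(255)) ENGINE=MyISAM;
    mysql> CREATE TABLE log_2013 (msg varchar(255))
        ->   ENGINE=MERGE UNION=(log_2013_01, log_2013_02) INSERT_METHOD=LAST;
    mysql> SELECT COUNT(*) FROM log_2013;   -- spans both monthly tables without copying data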


2 Comments

In my data directory I couldn't find the created table.
Are you sure you're in the right directory? Is it the same directory shown by show variables like 'datadir'? You might also want to try flush tables to make sure it's written out to disk.
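
For reference, a sketch of that check (the datadir value shown is just an example):

    mysql> show variables like 'datadir';
    +---------------+-----------------+
    | Variable_name | Value           |
    +---------------+-----------------+
    | datadir       | /var/lib/mysql/ |
    +---------------+-----------------+
    mysql> flush tables;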
