
I have to perform a large number of inserts (in this instance 27k) and I want to find an optimal way to do this. Right now this is the code that I have. As you can see I'm using prepared statements and batches, and I'm executing every 1000 rows (I have also tried with smaller numbers such as 10 and 100, but the time was still way too long). One thing omitted from the query is that there is an auto-generated ID, if that is of any relevance to the issue:

    private void parseIndividualReads(String file, DBAccessor db) {
        BufferedReader reader;
        try {
            Connection con = db.getCon();
            PreparedStatement statement = con.prepareStatement(
                    "INSERT INTO `vgsan01_process_log`.`contigs_and_large_singletons` "
                    + "(`seq_id` ,`length` ,`ws_id` ,`num_of_reads`) VALUES (?, ?, ?, ?)");
            long count = 0;
            reader = new BufferedReader(new FileReader(logDir + "/" + file));
            String line;
            while ((line = reader.readLine()) != null) {
                if (count != 0 && count % 1000 == 0)
                    statement.executeBatch();
                if (line.startsWith(">")) {
                    count++;
                    String res[] = parseHeader(line);
                    statement.setString(1, res[0]);
                    statement.setInt(2, Integer.parseInt(res[1]));
                    statement.setInt(3, id);
                    statement.setInt(4, -1);
                    statement.addBatch();
                }
            }
            statement.executeBatch();
        } catch (FileNotFoundException ex) {
            Logger.getLogger(VelvetStats.class.getName()).log(Level.SEVERE, "Error opening file: " + file, ex);
        } catch (IOException ex) {
            Logger.getLogger(VelvetStats.class.getName()).log(Level.SEVERE, "Error reading from file: " + file, ex);
        } catch (SQLException ex) {
            Logger.getLogger(VelvetStats.class.getName()).log(Level.SEVERE, "Error inserting individual statistics " + file, ex);
        }
    }

Any other tips regarding what might be changed in order to speed up the process? I mean, a single insert statement doesn't carry much information - I'd say no more than 50 characters across all 4 columns.

EDIT:

Okay, following the advice given, I have restructured the method as follows. The speed-up is immense. You could even try playing with the 1000 value, which might yield better results:

    private void parseIndividualReads(String file, DBAccessor db) {
        BufferedReader reader;
        PrintWriter writer;
        try {
            Connection con = db.getCon();
            con.setAutoCommit(false);
            Statement st = con.createStatement();
            StringBuilder sb = new StringBuilder(10000);
            reader = new BufferedReader(new FileReader(logDir + "/" + file));
            writer = new PrintWriter(new BufferedWriter(new FileWriter(logDir + "/velvet-temp-contigs", true)), true);
            String line;
            long count = 0;
            while ((line = reader.readLine()) != null) {
                if (count != 0 && count % 1000 == 0) {
                    sb.deleteCharAt(sb.length() - 1);
                    st.executeUpdate("INSERT INTO `vgsan01_process_log`.`contigs_and_large_singletons` "
                            + "(`seq_id` ,`length` ,`ws_id` ,`num_of_reads`) VALUES " + sb);
                    sb.delete(0, sb.capacity());
                    count = 0;
                }
                // we basically build a giant VALUES (),(),()... string that we use for the insert
                if (line.startsWith(">")) {
                    count++;
                    String res[] = parseHeader(line);
                    sb.append("('" + res[0] + "','" + res[1] + "','" + id + "','" + "-1'" + "),");
                }
            }
            // insert all the remaining rows
            sb.deleteCharAt(sb.length() - 1);
            st.executeUpdate("INSERT INTO `vgsan01_process_log`.`contigs_and_large_singletons` "
                    + "(`seq_id` ,`length` ,`ws_id` ,`num_of_reads`) VALUES " + sb);
            con.commit();
        } catch (FileNotFoundException ex) {
            Logger.getLogger(VelvetStats.class.getName()).log(Level.SEVERE, "Error opening file: " + file, ex);
        } catch (IOException ex) {
            Logger.getLogger(VelvetStats.class.getName()).log(Level.SEVERE, "Error reading from file: " + file, ex);
        } catch (SQLException ex) {
            Logger.getLogger(VelvetStats.class.getName()).log(Level.SEVERE, "Error working with mysql", ex);
        }
    }

3 Answers


You have other solutions.

  1. Use LOAD DATA INFILE from the MySQL documentation.
  2. If you want to do it in Java, use a single Statement that inserts many rows (say 1000) per query: "INSERT INTO mytable (col1,col2) VALUES (val1,val2),(val3,val4),..."

But I would recommend the 1st solution; a rough sketch of how it could look from JDBC is below.
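A minimal sketch of solution 1, assuming MySQL Connector/J as the driver with allowLoadLocalInfile enabled, and assuming the parsed rows have first been written to a tab-separated temporary file (the class and method names here are made up for illustration):

    import java.sql.Connection;
    import java.sql.SQLException;
    import java.sql.Statement;

    class LoadDataSketch {
        // tsvPath points to a tab-separated file with one row per line:
        // seq_id <TAB> length <TAB> ws_id <TAB> num_of_reads
        static void load(Connection con, String tsvPath) throws SQLException {
            try (Statement st = con.createStatement()) {
                st.executeUpdate(
                        "LOAD DATA LOCAL INFILE '" + tsvPath.replace("\\", "/") + "' "
                        + "INTO TABLE `vgsan01_process_log`.`contigs_and_large_singletons` "
                        + "FIELDS TERMINATED BY '\\t' LINES TERMINATED BY '\\n' "
                        + "(`seq_id`, `length`, `ws_id`, `num_of_reads`)");
            }
        }
    }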


6 Comments

Unless the DB and the web server are on the same physical machine, solution 1 is not possible. For solution 2 the speed is very good, but you also need to take care of SQL injection.
1 is not viable, since the application doing the inserts is running on a separate machine, and tinkering with uploading the file is just not feasible. Regarding your second option - isn't that what prepared statements and the batch family of methods are supposed to do?
I think it's different, and the speed is much better with solution 2. I'm going to check this. As for other things, you can set connection.setAutoCommit(false); and then connection.commit(); to make it run faster.
The thing is I want to use as little "hacking" as possible. Solution 2 looks good, but is there a way to do it without having to construct a large string and then just send it to the db?
As far as I know, apart from StringBuilder (or StringBuffer if you need synchronization), I don't know how you could do it another way.
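A hedged sketch of a possible middle ground: keep the PreparedStatement batch and manual commit, and (assuming the driver is MySQL Connector/J) add rewriteBatchedStatements=true to the JDBC URL so the driver itself rewrites the batch into multi-row INSERTs - no hand-built VALUES string needed. The helper class and parameter names below are made up for illustration:

    // Assumed JDBC URL, e.g.: jdbc:mysql://host/vgsan01_process_log?rewriteBatchedStatements=true
    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.SQLException;
    import java.util.List;

    class BatchInsertSketch {
        // rows holds the already-parsed header fields (seq_id, length); wsId stands in for the question's `id`
        static void insert(Connection con, List<String[]> rows, int wsId) throws SQLException {
            con.setAutoCommit(false);
            try (PreparedStatement ps = con.prepareStatement(
                    "INSERT INTO `vgsan01_process_log`.`contigs_and_large_singletons` "
                    + "(`seq_id`, `length`, `ws_id`, `num_of_reads`) VALUES (?, ?, ?, ?)")) {
                int pending = 0;
                for (String[] row : rows) {
                    ps.setString(1, row[0]);
                    ps.setInt(2, Integer.parseInt(row[1]));
                    ps.setInt(3, wsId);
                    ps.setInt(4, -1);
                    ps.addBatch();
                    if (++pending % 1000 == 0) ps.executeBatch();   // flush every 1000 rows
                }
                ps.executeBatch();                                  // flush the remainder
            }
            con.commit();
        }
    }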

The fastest way to do what you want to do is to load directly from the file (http://dev.mysql.com/doc/refman/5.5/en/load-data.html).

There are some problems with loading from a file - firstly, the file needs to be readable by the server, which often isn't the case. Error handling can be a pain, and you can end up with inconsistent or incomplete data if the data in your file does not meet the schema's expectations.

It also depends on what the actual bottleneck is - if the table you're inserting into has a lot going on, you may be better off using INSERT DELAYED (http://dev.mysql.com/doc/refman/5.5/en/insert-delayed.html).

The official line on speeding up inserts is here: http://dev.mysql.com/doc/refman/5.5/en/insert-speed.html



Depending on the structure of your data you have a potential bug in your "execute batch every 1000 iterations" logic.

If the frequency of lines beginning with ">" is low, then you could have an instance where the following happens (leading to lots of unnecessary executeBatch calls):

    line in data file       events in program
    -----------------------------------------------------------
    > some data             (count=999)
    > some more data        (count=1000)
    another line            (execute batch, count=1000)
    more unprocessed        (execute batch, count=1000)
    some more               (execute batch, count=1000)

So I would move the if (count != 0 && count % 1000 == 0) check inside the if (line.startsWith(">")) block, as in the sketch below.

Note: I'm not sure whether this can happen with your data, or how much of a speed-up it would give.
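For illustration, a rough sketch of the suggested restructuring, reusing the variables from the original method (reader, statement, count, id):

    while ((line = reader.readLine()) != null) {
        if (line.startsWith(">")) {
            count++;
            String res[] = parseHeader(line);
            statement.setString(1, res[0]);
            statement.setInt(2, Integer.parseInt(res[1]));
            statement.setInt(3, id);
            statement.setInt(4, -1);
            statement.addBatch();
            if (count % 1000 == 0)          // moved inside the ">" branch
                statement.executeBatch();
        }
    }
    statement.executeBatch();               // flush whatever is left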

1 Comment

Good spot. I accidentally ran into this bug and fixed it by just setting count = 0 every time I enter the if statement.
