4

I am trying to populate an MS SQL Server 2005 database using Python on Windows. I am inserting millions of rows, and by about 7 million rows I am using almost a gigabyte of memory. The test below eats up roughly 4 MB of RAM for every 100k rows inserted:

    import pyodbc

    connection = pyodbc.connect('DRIVER={SQL Server};SERVER=x;DATABASE=x;UID=x;PWD=x')
    cursor = connection.cursor()
    connection.autocommit = True

    while 1:
        cursor.execute("insert into x (a,b,c,d,e,f) VALUES (?,?,?,?,?,?)", 1, 2, 3, 4, 5, 6)

    connection.close()

Hack solution: I ended up spawning a new process with the multiprocessing module so that the memory gets handed back to the OS when the process exits. I am still confused about why inserting rows this way consumes so much memory. Any ideas?
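
For reference, roughly what the hack looks like (the connection string, table, and batch size are placeholders, not my real ones); the point is simply that the inserts happen in a child process, so whatever the driver holds on to is returned to the OS when that process exits:

    import pyodbc
    from multiprocessing import Process

    CONN_STR = 'DRIVER={SQL Server};SERVER=x;DATABASE=x;UID=x;PWD=x'   # placeholder

    def insert_batch(rows):
        # Everything pyodbc allocates lives inside this child process,
        # so it is handed back to the OS when the process exits.
        connection = pyodbc.connect(CONN_STR)
        connection.autocommit = True
        cursor = connection.cursor()
        for row in rows:
            cursor.execute("insert into x (a,b,c,d,e,f) VALUES (?,?,?,?,?,?)", *row)
        connection.close()

    if __name__ == '__main__':                      # required for multiprocessing on Windows
        batch = [(1, 2, 3, 4, 5, 6)] * 100000       # placeholder data
        worker = Process(target=insert_batch, args=(batch,))
        worker.start()
        worker.join()                               # the child's memory is released here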

3
  • Have you tried manually committing the transactions? It looks a bit like none of this is being committed to the db. Commented Nov 3, 2010 at 15:59
  • Thanks. Setting connection.autocommit=False and doing a manual commit with connection.commit() has no effect on memory usage. Commented Nov 3, 2010 at 16:04
  • Was this ever solved? I'm getting the same problem. Commented Apr 9, 2020 at 6:31

5 Answers

9

I had the same issue, and it looks like a pyodbc issue with parameterized inserts: http://code.google.com/p/pyodbc/issues/detail?id=145

As a temporary workaround, switching to a static insert with the VALUES clause populated inline eliminates the leak, until I can try a build from the current source.
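
Roughly what I mean by a static insert, as a sketch against the question's placeholder table (note that inlining values like this skips parameter escaping, so it is only reasonable for data you control):

    import pyodbc

    connection = pyodbc.connect('DRIVER={SQL Server};SERVER=x;DATABASE=x;UID=x;PWD=x')  # placeholder
    connection.autocommit = True
    cursor = connection.cursor()

    # Inline the values instead of using ? placeholders, which is what
    # appears to trigger the leak in the affected pyodbc versions.
    row = (1, 2, 3, 4, 5, 6)  # placeholder data
    cursor.execute("insert into x (a,b,c,d,e,f) VALUES (%d,%d,%d,%d,%d,%d)" % row)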


2 Comments

This solved it. Wish I had 15 points so I could vote this up. Thanks a lot!
It appears the latest code resolves the issue without a workaround.
1

I faced the same problem, too.

I had to read more than 50 XML files each about 300 MB and load them into SQL Server 2005.

I tried the following:

  • Using the same cursor and dereferencing it each time.
  • Closing and reopening the connection.
  • Setting the connection to None.

Finally, I ended up bootstrapping each XML file load in its own process using the Process module.

I have since replaced that process with IronPython and System.Data.SqlClient.

This gives better performance and a nicer interface.
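
Roughly what the IronPython side looks like, as a sketch rather than my actual code (the connection string and data are placeholders, using the standard ADO.NET SqlClient calls):

    # IronPython sketch: talk to SQL Server through ADO.NET instead of pyodbc.
    import clr
    clr.AddReference('System.Data')
    from System.Data.SqlClient import SqlConnection, SqlCommand

    conn = SqlConnection('Server=x;Database=x;User Id=x;Password=x;')  # placeholder connection string
    conn.Open()
    cmd = SqlCommand("insert into x (a,b,c,d,e,f) VALUES (@a,@b,@c,@d,@e,@f)", conn)
    for name, value in zip('abcdef', (1, 2, 3, 4, 5, 6)):              # placeholder data
        cmd.Parameters.AddWithValue('@' + name, value)
    cmd.ExecuteNonQuery()
    conn.Close()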


0

Maybe close & re-open the connection every million rows or so?

Sure it doesn't solve anything, but if you only have to do this once you could get on with life!
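
Something like this, just as a sketch (connection string, data, and the one-million interval are all placeholders):

    import pyodbc

    CONN_STR = 'DRIVER={SQL Server};SERVER=x;DATABASE=x;UID=x;PWD=x'  # placeholder
    rows = [(1, 2, 3, 4, 5, 6)] * 10000                               # placeholder data

    connection = pyodbc.connect(CONN_STR)
    cursor = connection.cursor()
    for i, row in enumerate(rows):
        if i and i % 1000000 == 0:
            # tear the connection down and rebuild it every million rows
            connection.close()
            connection = pyodbc.connect(CONN_STR)
            cursor = connection.cursor()
        cursor.execute("insert into x (a,b,c,d,e,f) VALUES (?,?,?,?,?,?)", *row)
    connection.close()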

1 Comment

Thanks. I have tried connection.close() and connection=pyodbc.connect() every 10,000 inserts. It looks like memory usage goes up if anything.
0

Try creating a separate cursor for each insert. Reuse the cursor variable each time through the loop to implicitly dereference the previous cursor, and add a connection.commit() after each insert.

You may only need something as simple as a time.sleep(0) at the bottom of each loop to allow the garbage collector to run.
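
As a sketch of what I mean, with the connection string and data as placeholders:

    import time
    import pyodbc

    connection = pyodbc.connect('DRIVER={SQL Server};SERVER=x;DATABASE=x;UID=x;PWD=x')  # placeholder
    connection.autocommit = False

    for row in [(1, 2, 3, 4, 5, 6)] * 10000:       # placeholder data
        cursor = connection.cursor()               # fresh cursor; the previous one is dereferenced
        cursor.execute("insert into x (a,b,c,d,e,f) VALUES (?,?,?,?,?,?)", *row)
        connection.commit()                        # commit after each insert
        time.sleep(0)                              # yield so the garbage collector gets a chance to run

    connection.close()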

2 Comments

Thanks, freegnu. Creating a separate cursor doesn't have any effect. I tried time.sleep(1) after each 1000 inserts, and that didn't have any effect either -- same for time.sleep(0) after each one.
I use pymssql for development and don't see half the problems and limitations I see when using mx.odbc.windows in production. I'm guessing pyodbc is problematic as well. You might want to give pymssql a try.
0

You could also try forcing a garbage collection every once in a while with gc.collect() after importing the gc module.

Another option might be to use cursor.executemany() and see if that clears up the problem. The nasty thing about executemany(), though, is that it takes a sequence rather than an iterator (so you can't pass it a generator). I'd try the garbage collector first.
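
For example, something along these lines (connection string, batch size, and data are placeholders):

    import gc
    import pyodbc

    connection = pyodbc.connect('DRIVER={SQL Server};SERVER=x;DATABASE=x;UID=x;PWD=x')  # placeholder
    connection.autocommit = True
    cursor = connection.cursor()

    # executemany() wants a real sequence of parameter tuples, not a generator
    params = [(1, 2, 3, 4, 5, 6)] * 10000          # placeholder data
    for _ in range(100):
        cursor.executemany("insert into x (a,b,c,d,e,f) VALUES (?,?,?,?,?,?)", params)
        gc.collect()                               # force a collection between batches

    connection.close()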

EDIT: I just tested the code you posted, and I am not seeing the same issue. Are you using an old version of pyodbc?

