
I am working with a very large data set, roughly 2 million records. I have the code below, but I get an out-of-memory exception after it has processed around three batches, about 600,000 records. I understand that as it loops through each batch, Entity Framework lazy loads, and it then ends up building the full 2 million records in memory. Is there any way to unload a batch once I've processed it?

ModelContext dbContext = new ModelContext();

IEnumerable<IEnumerable<Town>> towns = dbContext.Towns.OrderBy(t => t.TownID).Batch(200000);

foreach (var batch in towns)
{
    SearchClient.Instance.IndexMany(batch, SearchClient.Instance.Settings.DefaultIndex, "Town",
        new SimpleBulkParameters() { Refresh = false });
}

Note: The Batch method comes from this project: https://code.google.com/p/morelinq/

The search client is this: https://github.com/Mpdreamz/NEST

5 Comments
  • Huge amounts of data are a scenario in which I'm not sure an ORM is an appropriate tool... Commented Aug 11, 2013 at 8:11
  • @Vadim, an ORM is an appropriate tool to process business logic without worrying about the underlying data store; however, there are easier ways to write batch processing with an ORM. Commented Aug 11, 2013 at 8:18
  • @AkashKava, of course it is. The thing is, when you combine ORMs with large volumes of data, you'll always find yourself "cancelling" all kinds of ORM aspects which you would consider a feature in other scenarios. In other cases, you write code in a way which specifically addresses the issues of the ORM you are using. All I'm saying is that, given large volumes of data, ORMs become problematic. Commented Aug 11, 2013 at 8:23
  • @Vadim, it becomes problematic because it is done wrongly; that doesn't mean it should not be used. A large amount of data should never be processed all at once in memory; instead, the correct kind of batching should be done. See my answer, we are processing millions of records every day with an ORM. A large operation has to be broken down into a collection of smaller steps. Commented Aug 11, 2013 at 8:26
  • How is Batch any different from Skip(batchNo*batchSize).Take(batchSize)? Commented Jun 4, 2015 at 15:05
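
For readers wondering about the Skip/Take alternative raised in the last comment, here is a minimal sketch using the question's own types; the page size, loop shape, and fresh-context-per-page pattern are illustrative and not part of the original code:

    // Sketch: paging with Skip/Take plus a fresh context per page, so the
    // change tracker never accumulates more than one page of entities.
    const int pageSize = 200000;
    for (int page = 0; ; page++)
    {
        List<Town> batch;
        using (var dbContext = new ModelContext())
        {
            batch = dbContext.Towns
                .AsNoTracking()
                .OrderBy(t => t.TownID)
                .Skip(page * pageSize)
                .Take(pageSize)
                .ToList();
        }

        if (batch.Count == 0)
            break;

        SearchClient.Instance.IndexMany(batch, SearchClient.Instance.Settings.DefaultIndex, "Town",
            new SimpleBulkParameters() { Refresh = false });
    }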

2 Answers


The issue is that when you get data from EF there are actually two copies of the data created: one which is returned to the user, and a second which EF holds onto and uses for change detection (so that it can persist changes to the database). EF holds this second set for the lifetime of the context, and it's this set that's running you out of memory.

You have two options to deal with this:

  1. Renew your context each batch (see the sketch further down).
  2. Use .AsNoTracking() in your query, e.g.:

    IEnumerable<IEnumerable<Town>> towns = dbContext.Towns.AsNoTracking().OrderBy(t => t.TownID).Batch(200000); 

This tells EF not to keep a copy for change detection. You can read a little more about what AsNoTracking does and its performance impact on my blog: http://blog.staticvoid.co.nz/2012/4/2/entity_framework_and_asnotracking
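
For the first option, here is a minimal sketch of renewing the context each batch, reusing the question's types; the keyset paging on TownID (assumed to be a numeric key) and the batch size are illustrative, not the answer's exact code:

    // Sketch of option 1: dispose the context after each batch and create a new
    // one, so entities tracked for the previous batch can be garbage collected.
    int batchSize = 200000;
    int lastSeenTownId = 0;   // assumes TownID is an integer key
    while (true)
    {
        var dbContext = new ModelContext();
        try
        {
            var batch = dbContext.Towns
                .Where(t => t.TownID > lastSeenTownId)
                .OrderBy(t => t.TownID)
                .Take(batchSize)
                .ToList();

            if (batch.Count == 0)
                break;

            SearchClient.Instance.IndexMany(batch, SearchClient.Instance.Settings.DefaultIndex, "Town",
                new SimpleBulkParameters() { Refresh = false });

            lastSeenTownId = batch[batch.Count - 1].TownID;
        }
        finally
        {
            dbContext.Dispose();
        }
    }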


9 Comments

Does NoTracking preserve navigation properties? For example if I want to set/add related objects, will it save?
@AkashKava no it won't. If you want to save anything you will need to first attach it back to the context; if the entity has navigation properties, these will also need to be attached. If you actually want to modify the entities you load, I would suggest using the first method.
Thanks @LukeMcGregor, I found that renewing the context on every batch worked really well
How do you renew your context? Just dbContext = new DbContext()?
@mrmashal an extension method on IEnumerable<T> which returns an IEnumerable<IEnumerable<T>> where the inner sets have 20000 items. I wrote this myself, but it's super simple if you want something similar
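
A minimal sketch of such a Batch extension, in the spirit of the comment above; the class name and buffer strategy are illustrative (MoreLINQ's own implementation differs in detail):

    using System.Collections.Generic;

    // Hypothetical Batch extension: splits any sequence into inner lists
    // of at most `size` items, yielding each bucket as it fills.
    public static class EnumerableExtensions
    {
        public static IEnumerable<IEnumerable<T>> Batch<T>(this IEnumerable<T> source, int size)
        {
            var bucket = new List<T>(size);
            foreach (var item in source)
            {
                bucket.Add(item);
                if (bucket.Count == size)
                {
                    yield return bucket;
                    bucket = new List<T>(size);
                }
            }
            if (bucket.Count > 0)
                yield return bucket;
        }
    }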

I wrote a migration routine that reads from one DB and writes (with minor changes in layout) into another DB (of a different type), and in this case renewing the connection for each batch and using AsNoTracking() did not cut it for me.

Note that this problem occurs when using a '97 version of JET. It may work flawlessly with other DBs.

However, the following algorithm did solve the out-of-memory issue:

  • Use one connection for reading and one for writing/updating.
  • Read with AsNoTracking().
  • Every 50 rows or so written/updated, check the memory usage and recover memory / reset the output DB context (and connected tables) as needed:

    var before = System.Diagnostics.Process.GetCurrentProcess().VirtualMemorySize64;
    if (before > 800000000)
    {
        dbcontextOut.SaveChanges();
        dbcontextOut.Dispose();
        GC.Collect();
        GC.WaitForPendingFinalizers();
        dbcontextOut = dbcontextOutFunc();
        tableOut = Dynamic.InvokeGet(dbcontextOut, outputTableName);
    }
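
To make the cadence described above concrete, here is an illustrative outer loop around that reset snippet; dbcontextIn, SourceRows, and MapRow are hypothetical stand-ins for the actual input context, input table, and row-mapping logic, while the threshold and reset calls are taken from the answer's own code:

    // Illustrative outer loop: read with no tracking, write to the output
    // context, and apply the memory check / context reset every 50 rows.
    int written = 0;
    foreach (var row in dbcontextIn.SourceRows.AsNoTracking())
    {
        tableOut.Add(MapRow(row));   // MapRow is a hypothetical mapping helper
        written++;

        if (written % 50 == 0)
        {
            var before = System.Diagnostics.Process.GetCurrentProcess().VirtualMemorySize64;
            if (before > 800000000)
            {
                dbcontextOut.SaveChanges();
                dbcontextOut.Dispose();
                GC.Collect();
                GC.WaitForPendingFinalizers();
                dbcontextOut = dbcontextOutFunc();
                tableOut = Dynamic.InvokeGet(dbcontextOut, outputTableName);
            }
        }
    }
    dbcontextOut.SaveChanges();   // flush whatever remains in the final partial batch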

1 Comment

I think the GC.Collect call is crucial. I had a similar routine and kept getting OOM errors, but this code fixes the issue.

