Batch with Multiple GroupBy

Question

I have a CSV file with records that need to be sorted and then grouped into arbitrary-sized batches (e.g. 300 max records per batch). Each batch might have fewer than 300 records, because the contents of each batch must be homogeneous (based on the contents of a couple of different columns).

My LINQ statement, inspired by this answer on batching with LINQ, looks like this:

    var query = (from line in EbrRecords
                 let EbrData = line.Split('\t')
                 let Location = EbrData[7]
                 let RepName = EbrData[4]
                 let AccountID = EbrData[0]
                 orderby Location, RepName, AccountID).
                Select((data, index) => new
                {
                    Record = new EbrRecord(
                        AccountID = EbrData[0],
                        AccountName = EbrData[1],
                        MBSegment = EbrData[2],
                        RepName = EbrData[4],
                        Location = EbrData[7],
                        TsrLocation = EbrData[8]
                    ),
                    Index = index
                })
                .GroupBy(x => new { x.Record.Location, x.Record.RepName, batch = x.Index / 100 });

The "/ 100" gives me the arbitrary bucket size. The other elements of the GroupBy key are intended to achieve homogeneity within each batch. I suspect this is almost what I want, but it gives me the following compiler error: A query body must end with a select clause or a group clause. I understand why I'm receiving the error, but overall I'm not sure how to fix this query. How would it be done?

UPDATE I very nearly achieved what I'm after, with the following:

    List<EbrRecord> input = new List<EbrRecord>
    {
        new EbrRecord {Name = "Brent", Age = 20, ID = "A"},
        new EbrRecord {Name = "Amy", Age = 20, ID = "B"},
        new EbrRecord {Name = "Gabe", Age = 23, ID = "B"},
        new EbrRecord {Name = "Noah", Age = 27, ID = "B"},
        new EbrRecord {Name = "Alex", Age = 27, ID = "B"},
        new EbrRecord {Name = "Stormi", Age = 27, ID = "B"},
        new EbrRecord {Name = "Roger", Age = 27, ID = "B"},
        new EbrRecord {Name = "Jen", Age = 27, ID = "B"},
        new EbrRecord {Name = "Adrian", Age = 28, ID = "B"},
        new EbrRecord {Name = "Cory", Age = 29, ID = "C"},
        new EbrRecord {Name = "Bob", Age = 29, ID = "C"},
        new EbrRecord {Name = "George", Age = 29, ID = "C"},
    };

    //look how tiny this query is, and it is very nearly the result I want!!!
    int i = 0;
    var result = from q in input
                 orderby q.Age, q.ID
                 group q by new { q.ID, batch = i++ / 3 };

    foreach (var agroup in result)
    {
        Debug.WriteLine("ID:" + agroup.Key);
        foreach (var record in agroup)
        {
            Debug.WriteLine(" Name:" + record.Name);
        }
    }

The trick here is to bypass the Select "index position" overload by using a captured variable (int i in this case). The output is as follows:

    ID:{ ID = A, batch = 0 }
     Name:Brent
    ID:{ ID = B, batch = 0 }
     Name:Amy
     Name:Gabe
    ID:{ ID = B, batch = 1 }
     Name:Noah
     Name:Alex
     Name:Stormi
    ID:{ ID = B, batch = 2 }
     Name:Roger
     Name:Jen
     Name:Adrian
    ID:{ ID = C, batch = 3 }
     Name:Cory
     Name:Bob
     Name:George

While this answer is acceptable, it is just a fraction short of the ideal result. The first occurrence of "batch 'B'" should have 3 entries in it (Amy, Gabe, Noah), not two (Amy, Gabe). This happens because the index position is not being reset when each group is identified. Anyone know how to reset my custom index position for each group?

UPDATE 2 I think I may have found an answer. First, make an additional function like this:

    public static bool BatchGroup(string ID, ref string priorID)
    {
        if (priorID != ID)
        {
            priorID = ID;
            return true;
        }
        return false;
    }

Second, update the LINQ query like this:

    int i = 0;
    string priorID = null;
    var result = from q in input
                 orderby q.Age, q.ID
                 group q by new { q.ID, batch = (BatchGroup(q.ID, ref priorID) ? i=0 : ++i) / 3 };

Now it does what I want. I just wish I did not need that separate function!

StriplingWarrior · Accepted Answer · 2011-06-02 20:40:39Z


Does this work?

    var query = (from line in EbrRecords
                 let EbrData = line.Split('\t')
                 let Location = EbrData[7]
                 let RepName = EbrData[4]
                 let AccountID = EbrData[0]
                 orderby Location, RepName, AccountID
                 // An object initializer takes braces, not parentheses.
                 select new EbrRecord
                 {
                     AccountID = EbrData[0],
                     AccountName = EbrData[1],
                     MBSegment = EbrData[2],
                     RepName = EbrData[4],
                     Location = EbrData[7],
                     TsrLocation = EbrData[8]
                 })
                .Select((data, index) => new { Record = data, Index = index })
                // The second argument selects the element stored in each group.
                .GroupBy(x => new { x.Record.Location, x.Record.RepName, batch = x.Index / 100 },
                         x => x.Record);

edited Jun 2, 2011 at 20:40

answered Jun 2, 2011 at 19:00



2 Comments

Brent Arias Over a year ago

What I expect is a list of EbrRecord lists (a list of lists). But the above gives me a list for an anonymous type which contains only Location, RepName, and batch. I'm wondering if the post I linked to actually does what I thought or hoped.

StriplingWarrior Over a year ago

@Brent: GroupBy will create an IEnumerable of IGroupings, each of which has a Key with the Location, RepName, and batch, but is also an IEnumerable itself, containing the selected values. If you use the overload in my updated answer, you should effectively have an IEnumerable<IEnumerable<EbrRecord>>. However, I think it's likely that it doesn't do quite what you hoped. Be sure to read through David B's answer. He makes some excellent points.
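The shape described in the comment above can be sketched as follows. This is only an illustration under assumptions: `Rec` is a hypothetical stand-in for `EbrRecord` with just three string properties, and `ToBatches` is a made-up helper name. It uses the `GroupBy` overload that takes both a key selector and an element selector, then materializes each `IGrouping` into a list.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for the question's EbrRecord (assumption, not the real type).
public class Rec
{
    public string Location { get; set; }
    public string RepName { get; set; }
    public string AccountID { get; set; }
}

public static class GroupingShape
{
    // Groups by (Location, RepName, global index / batchSize) and keeps only
    // the records as group elements, so the result is a list of record lists.
    public static List<List<Rec>> ToBatches(IEnumerable<Rec> ordered, int batchSize)
    {
        return ordered
            .Select((data, index) => new { Record = data, Index = index })
            .GroupBy(
                x => new { x.Record.Location, x.Record.RepName, Batch = x.Index / batchSize },
                x => x.Record)               // element selector: drop the index wrapper
            .Select(g => g.ToList())         // each IGrouping is itself an IEnumerable<Rec>
            .ToList();
    }
}
```

Each group's `Key` still carries Location, RepName and the batch number; the sketch simply discards it once the lists are built. The batch number here is still derived from one index across all records, as in the answer.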

Amy B · Accepted Answer · 2011-06-02 19:45:39Z

orderby Location, RepName, AccountID

There needs to be a select clause after the above, as demonstrated in StriplingWarrior's answer. LINQ query expressions (comprehension syntax) must end with a select or group clause.

Unfortunately, there is a logical defect... Suppose I have 50 accounts in the first group and 100 accounts in the second group with a batch size of 100. The original code will produce 3 batches of size 50, not 2 batches of 50, 100.
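The defect can be reproduced with a small sketch. It assumes plain string keys in place of the (Location, RepName) pair, and `GlobalIndexDefect.BatchSizes` is a hypothetical helper written for this illustration. It derives the batch number from a single index running across all groups, as the original query does, and reports the size of each resulting batch.

```csharp
using System.Collections.Generic;
using System.Linq;

public static class GlobalIndexDefect
{
    // Returns the size of every batch when the batch number comes from one
    // index shared by all groups (index / batchSize), combined with the key.
    public static List<int> BatchSizes(IEnumerable<string> orderedKeys, int batchSize)
    {
        return orderedKeys
            .Select((key, index) => new { Key = key, Index = index })
            .GroupBy(x => new { x.Key, Batch = x.Index / batchSize })
            .Select(g => g.Count())
            .ToList();
    }
}
```

Called with 50 keys "A" followed by 100 keys "B" and a batch size of 100, this yields sizes 50, 50, 50: the "B" records straddle the boundary at index 100, so they are split even though all 100 would fit in one batch.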

Here's one way to fix it.

    IEnumerable<IGrouping<int, EbrRecord>> query = ...
        orderby Location, RepName, AccountID
        // An object initializer takes braces, not parentheses.
        select new EbrRecord
        {
            AccountID = EbrData[0],
            AccountName = EbrData[1],
            MBSegment = EbrData[2],
            RepName = EbrData[4],
            Location = EbrData[7],
            TsrLocation = EbrData[8]
        } into x
        // First group by the columns that must be homogeneous...
        group x by new { Location = x.Location, RepName = x.RepName } into g
        // ...then index within each group, so the batch count restarts per group.
        from g2 in g.Select((data, index) => new { Record = data, Index = index })
                    .GroupBy(y => y.Index / 100, y => y.Record)
        select g2;

    List<List<EbrRecord>> result = query.Select(g => g.ToList()).ToList();

Also note that using GroupBy to batch is very slow due to redundant iterations. You can write a loop that does the batching in one pass over the ordered set, and that loop will run much faster than the LINQ to Objects query.
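A minimal sketch of such a single-pass loop, under assumptions: `SinglePassBatcher.Batch` is a hypothetical helper, not code from the answer, and it expects the input to be sorted already so that equal keys are adjacent. It starts a new batch whenever the key changes or the current batch is full.

```csharp
using System;
using System.Collections.Generic;

public static class SinglePassBatcher
{
    // Splits an already-ordered sequence into batches of at most maxSize items,
    // never mixing items with different keys in the same batch.
    public static List<List<T>> Batch<T, TKey>(
        IEnumerable<T> ordered, Func<T, TKey> keySelector, int maxSize)
    {
        if (maxSize <= 0) throw new ArgumentOutOfRangeException(nameof(maxSize));

        var comparer = EqualityComparer<TKey>.Default;
        var batches = new List<List<T>>();
        List<T> current = null;
        TKey currentKey = default(TKey);

        foreach (var item in ordered)
        {
            var key = keySelector(item);
            // Open a new batch on the first item, a full batch, or a key change.
            if (current == null || current.Count >= maxSize || !comparer.Equals(key, currentKey))
            {
                current = new List<T>();
                batches.Add(current);
                currentKey = key;
            }
            current.Add(item);
        }
        return batches;
    }
}
```

For the 50/100 example above with a batch size of 100, this produces two batches of 50 and 100. To batch on two columns, the key selector could return a tuple or an anonymous type, for example `r => new { r.Location, r.RepName }`, since both compare by value.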

My IntelliSense and compiler refuse to allow me to place a "group by" after the "select new", unless I switch to dot notation.

Fixed many embarrassing typos. Now I'm done (whether it works or not).

Ok, upon further analysis I see that your answer achieves exactly the correct result. I've checked this as the answer. But, OH MAN is that a busy query you made!!! Please look at the update I've made to my question, where I've found a near perfect (and much simpler) solution. If you can make my update work without much effort, I'll give a thumb-up vote as well. :)

shrug you wanted a query with multiple group by's to do batching. I gave you one. Now you want a query using information about previous rows to do batching. That's so far afield from the original request it should be a separate question.
