I have a CSV file with records that need to be sorted and then grouped into batches with a configurable maximum size (e.g. at most 300 records per batch). A batch might have fewer than 300 records, because the contents of each batch must be homogeneous (based on the values of a couple of different columns).

My LINQ statement, inspired by this answer on batching with LINQ, looks like this:

    var query = (from line in EbrRecords
                 let EbrData = line.Split('\t')
                 let Location = EbrData[7]
                 let RepName = EbrData[4]
                 let AccountID = EbrData[0]
                 orderby Location, RepName, AccountID)
                .Select((data, index) => new
                {
                    Record = new EbrRecord(
                        AccountID = EbrData[0],
                        AccountName = EbrData[1],
                        MBSegment = EbrData[2],
                        RepName = EbrData[4],
                        Location = EbrData[7],
                        TsrLocation = EbrData[8]
                    ),
                    Index = index
                })
                .GroupBy(x => new { x.Record.Location, x.Record.RepName, batch = x.Index / 100 });

The "/ 100" gives me the arbitrary bucket size. The other elements of the GroupBy key are intended to keep each batch homogeneous. I suspect this is almost what I want, but it gives me the following compiler error: "A query body must end with a select clause or a group clause". I understand why I'm receiving the error, but I'm not sure how to fix this query. How would it be done?

UPDATE I very nearly achieved what I'm after, with the following:

    List<EbrRecord> input = new List<EbrRecord>
    {
        new EbrRecord {Name = "Brent", Age = 20, ID = "A"},
        new EbrRecord {Name = "Amy", Age = 20, ID = "B"},
        new EbrRecord {Name = "Gabe", Age = 23, ID = "B"},
        new EbrRecord {Name = "Noah", Age = 27, ID = "B"},
        new EbrRecord {Name = "Alex", Age = 27, ID = "B"},
        new EbrRecord {Name = "Stormi", Age = 27, ID = "B"},
        new EbrRecord {Name = "Roger", Age = 27, ID = "B"},
        new EbrRecord {Name = "Jen", Age = 27, ID = "B"},
        new EbrRecord {Name = "Adrian", Age = 28, ID = "B"},
        new EbrRecord {Name = "Cory", Age = 29, ID = "C"},
        new EbrRecord {Name = "Bob", Age = 29, ID = "C"},
        new EbrRecord {Name = "George", Age = 29, ID = "C"},
    };

    // look how tiny this query is, and it is very nearly the result I want!!!
    int i = 0;
    var result = from q in input
                 orderby q.Age, q.ID
                 group q by new { q.ID, batch = i++ / 3 };

    foreach (var agroup in result)
    {
        Debug.WriteLine("ID:" + agroup.Key);
        foreach (var record in agroup)
        {
            Debug.WriteLine(" Name:" + record.Name);
        }
    }

The trick here is to bypass the Select "index position" overload by using a closure variable (int i in this case). The output is as follows:

    ID:{ ID = A, batch = 0 }
     Name:Brent
    ID:{ ID = B, batch = 0 }
     Name:Amy
     Name:Gabe
    ID:{ ID = B, batch = 1 }
     Name:Noah
     Name:Alex
     Name:Stormi
    ID:{ ID = B, batch = 2 }
     Name:Roger
     Name:Jen
     Name:Adrian
    ID:{ ID = C, batch = 3 }
     Name:Cory
     Name:Bob
     Name:George

While this result is acceptable, it is just a fraction short of ideal. The first batch for ID "B" should have 3 entries in it (Amy, Gabe, Noah), not two (Amy, Gabe). This happens because the index position is not reset when a new group starts. Does anyone know how to reset my custom index position for each group?

UPDATE 2 I think I may have found an answer. First, make an additional function like this:

    public static bool BatchGroup(string ID, ref string priorID)
    {
        if (priorID != ID)
        {
            priorID = ID;
            return true;
        }
        return false;
    }

Second, update the LINQ query like this:

    int i = 0;
    string priorID = null;
    var result = from q in input
                 orderby q.Age, q.ID
                 group q by new
                 {
                     q.ID,
                     batch = (BatchGroup(q.ID, ref priorID) ? i = 0 : ++i) / 3
                 };

Now it does what I want. I just wish I did not need that separate function!

2 Answers

Does this work?

    var query = (from line in EbrRecords
                 let EbrData = line.Split('\t')
                 let Location = EbrData[7]
                 let RepName = EbrData[4]
                 let AccountID = EbrData[0]
                 orderby Location, RepName, AccountID
                 select new EbrRecord
                 {
                     AccountID = EbrData[0],
                     AccountName = EbrData[1],
                     MBSegment = EbrData[2],
                     RepName = EbrData[4],
                     Location = EbrData[7],
                     TsrLocation = EbrData[8]
                 })
                .Select((data, index) => new { Record = data, Index = index })
                .GroupBy(x => new { x.Record.Location, x.Record.RepName, batch = x.Index / 100 },
                         x => x.Record);

2 Comments

What I expect is a list of EbrRecord lists (a list of lists). But the above gives me a list for an anonymous type which contains only Location, RepName, and batch. I'm wondering if the post I linked to actually does what I thought or hoped.
@Brent: GroupBy will create an IEnumerable of IGroupings, each of which has a Key with the Location, RepName, and batch, but is also an IEnumerable itself, containing the selected values. If you use the overload in my updated answer, you should effectively have an IEnumerable<IEnumerable<EbrRecord>>. However, I think it's likely that it doesn't do quite what you hoped. Be sure to read through David B's answer. He makes some excellent points.
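To make the shape described in that comment concrete, here is a minimal sketch. The `Rec` class is a hypothetical, cut-down stand-in for `EbrRecord`, and the sample data is invented for illustration. It shows that each `IGrouping` carries a `Key` and can also be enumerated, so calling `ToList()` on every group yields a list of lists.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class GroupingShapeDemo
{
    // Hypothetical stand-in for EbrRecord with only the fields this demo needs.
    public class Rec
    {
        public string Location { get; set; }
        public string AccountID { get; set; }
    }

    // Each IGrouping<string, Rec> is itself an IEnumerable<Rec>,
    // so ToList() on a group returns the records that share its key.
    public static List<List<Rec>> ToListOfLists(IEnumerable<Rec> records)
    {
        return records
            .GroupBy(r => r.Location, r => r)
            .Select(g => g.ToList())
            .ToList();
    }

    public static void Main()
    {
        var records = new List<Rec>
        {
            new Rec { Location = "East", AccountID = "1" },
            new Rec { Location = "East", AccountID = "2" },
            new Rec { Location = "West", AccountID = "3" },
        };

        foreach (IGrouping<string, Rec> g in records.GroupBy(r => r.Location))
        {
            // The key and the grouped elements are both available on the group.
            Console.WriteLine(g.Key + ": " + string.Join(",", g.Select(r => r.AccountID)));
        }

        Console.WriteLine(ToListOfLists(records).Count); // number of groups
    }
}
```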
1
orderby Location, RepName, AccountID 

There needs to be a select clause after the above, as demonstrated in StriplingWarrior's answer. LINQ query expressions must end with a select clause or a group clause.


Unfortunately, there is a logical defect... Suppose I have 50 accounts in the first group and 100 accounts in the second group with a batch size of 100. The original code will produce 3 batches of size 50, not 2 batches of 50, 100.
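The arithmetic in that scenario can be checked with a small sketch. It uses plain strings as group keys instead of `EbrRecord`, and the method names `GlobalIndexSizes` and `PerGroupIndexSizes` are made up for this illustration. The first method numbers all records with one running index, as the original query does; the second restarts the index inside each group.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class BatchSizeDemo
{
    // One running index across all records, combined with the group key.
    public static List<int> GlobalIndexSizes(IEnumerable<string> groupKeys, int batchSize)
    {
        return groupKeys
            .Select((key, index) => new { Key = key, Index = index })
            .GroupBy(x => new { x.Key, Batch = x.Index / batchSize })
            .Select(g => g.Count())
            .ToList();
    }

    // Group by key first, then restart the index inside each group.
    public static List<int> PerGroupIndexSizes(IEnumerable<string> groupKeys, int batchSize)
    {
        return groupKeys
            .GroupBy(key => key)
            .SelectMany(g => g
                .Select((key, index) => new { Key = key, Index = index })
                .GroupBy(x => x.Index / batchSize))
            .Select(b => b.Count())
            .ToList();
    }

    public static void Main()
    {
        // 50 records in the first group, 100 in the second, batch size 100.
        var keys = Enumerable.Repeat("A", 50).Concat(Enumerable.Repeat("B", 100)).ToList();

        Console.WriteLine(string.Join(",", GlobalIndexSizes(keys, 100)));   // 50,50,50
        Console.WriteLine(string.Join(",", PerGroupIndexSizes(keys, 100))); // 50,100
    }
}
```

With the running index, the second group straddles the boundary at index 100 and is split in two, which is where the extra batch comes from.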

Here's one way to fix it.

    IEnumerable<IGrouping<int, EbrRecord>> query =
        ...
        orderby Location, RepName, AccountID
        select new EbrRecord
        {
            AccountID = EbrData[0],
            AccountName = EbrData[1],
            MBSegment = EbrData[2],
            RepName = EbrData[4],
            Location = EbrData[7],
            TsrLocation = EbrData[8]
        } into x
        group x by new { Location = x.Location, RepName = x.RepName } into g
        from g2 in g.Select((data, index) => new { Record = data, Index = index })
                    .GroupBy(y => y.Index / 100, y => y.Record)
        select g2;

    List<List<EbrRecord>> result = query.Select(g => g.ToList()).ToList();

Also note that using GroupBy to batch is very slow due to redundant iterations. You can write a for loop that does the batching in one pass over the ordered set, and that loop will run much faster than the LINQ to Objects query.
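A minimal sketch of such a single-pass loop follows. It is one possible shape, not the answerer's code: the `SinglePassBatcher` class and its `Batch` method are hypothetical names, and it assumes the input is already sorted so that records with the same key are adjacent. A new batch starts whenever the key changes or the current batch is full. No timing claims are made for this sketch.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class SinglePassBatcher
{
    // Splits an already-sorted sequence into batches in one pass.
    // A new batch starts when the key changes or the batch reaches maxSize.
    public static List<List<T>> Batch<T, TKey>(
        IEnumerable<T> sorted, Func<T, TKey> keySelector, int maxSize)
    {
        if (maxSize <= 0) throw new ArgumentOutOfRangeException(nameof(maxSize));

        var comparer = EqualityComparer<TKey>.Default;
        var result = new List<List<T>>();
        List<T> current = null;
        TKey currentKey = default(TKey);

        foreach (var item in sorted)
        {
            var key = keySelector(item);
            bool startNew = current == null
                || current.Count == maxSize
                || !comparer.Equals(key, currentKey);

            if (startNew)
            {
                current = new List<T>();
                result.Add(current);
                currentKey = key;
            }
            current.Add(item);
        }
        return result;
    }

    public static void Main()
    {
        // IDs in the same order as the question's sample data: one A, eight B, three C.
        var ids = new[] { "A", "B", "B", "B", "B", "B", "B", "B", "B", "C", "C", "C" };
        var sizes = Batch(ids, id => id, 3).Select(b => b.Count);
        Console.WriteLine(string.Join(",", sizes)); // 1,3,3,2,3
    }
}
```

For the real data, the key selector could return an anonymous object or tuple built from Location and RepName, since both compare by value.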

5 Comments

My IntelliSense and compiler refuse to let me place a "group by" after the "select new" unless I switch to dot notation.
Fixed many embarrassing typos. Now I'm done (whether it works or not).
Added List of List conversion.
Ok, upon further analysis I see that your answer achieves exactly the correct result. I've checked this as the answer. But, OH MAN is that a busy query you made!!! Please look at the update I've made to my question, where I've found a near perfect (and much simpler) solution. If you can make my update work without much effort, I'll give a thumb-up vote as well. :)
shrug you wanted a query with multiple group by's to do batching. I gave you one. Now you want a query using information about previous rows to do batching. That's so far afield from the original request it should be a separate question.
