Uploading a .csv to a NoSQL cluster - batch faster than consumer/producer

Question

I was tasked with making a program that uploads a .csv to a NoSQL cluster. The files are larger (typically 2-17GB). My program works in batch mode and can process a 17GB file in 6 hours.

I decided to make a consumer-producer multithreading structure. This caused it to be significantly slower. I want to know why the producer-consumer construct was slower than a batch produce, batch consume method.

The batch looks like this:

int count = 0; // Row r; while ((r = rm.getNextRow()) != null) { RowQueue.Enqueue(r); while (RowQueue.Count <= ROWMAX) { if ((r = rm.getNextRow()) != null) RowQueue.Enqueue(r); else break; } // int uniqueIdentifer = -1; if (count > 1000) { PrintAndSavePosition(count, rm, positionQueue, true); count = 0; } //give it some extra room to be safe while (RowQueue.Count != 0) { r = RowQueue.Dequeue(); while (uniqueIdentifer == -1) { uniqueIdentifer = nsqw.tryPut(r); if (uniqueIdentifer == -1) Thread.Sleep(1); } count++; } positionQueue.Add(new Tuple<int, long>(uniqueIdentifer, rm.Position)); }

As compared to

public void produceLoop(){ while (true) { while (RowQueue.Count <= ROWMAX && (r = rm.getNextRow()) != null){ RowQueue.Enqueue(r); } } } public void consumeLoop(){ while(true){ while (RowQueue.Count != 0) { RowQueue.TryDequeue(out r); while (uniqueIdentifer == -1) { uniqueIdentifer = nsqw.tryPut(r); if (uniqueIdentifer == -1) Thread.Sleep(1); } count++; } positionQueue.Add(new Tuple<int, long>(uniqueIdentifer, rm.Position)); } } }

The bottom half are infinite loops for a speed test.

Nick Udell · Accepted Answer · 2014-12-02 12:16:59Z

I'm not an expert in cluster operations, so I'll review the parts I do know.

Single-letter variable names

These are a big no-no unless the only thing you're worried about is not being fireable. Any maintenance programmer looking in the middle of a chunk of code is not going to be happy jumping back and forwards to definitions to work out what on earth r is when you could just as easily have written row.

var

You should use var when the right hand side of an assignment makes the type obvious. e.g.

int uniqueIdentifier = 1;

should be

var uniqueIdentifier = 1;

This is recommended because if you decide to change the type (e.g. to a GUID), you only have to change it in one place.

Style

In C# the general naming guideline for methods is to use PascalCase. This is, of course, optional (as with any point on style), but recommended to make code easier to read for another programmer.

Stack Exchange Network

Uploading a .csv to a NoSQL cluster - batch faster than consumer/producer

1 Answer 1

You must log in to answer this question.

Hot Network Questions

Uploading a .csv to a NoSQL cluster - batch faster than consumer/producer

1 Answer 1

You must log in to answer this question.

Related

Hot Network Questions