0

I do simple rownumber calculation in InputStream (calc number of NewLines #10)

for (int i = 0; i < readBytes ; i++) { if ( b[ i + off ] == 10 ) { // New Line (10) rowCount++; } } 

Can I do it faster? Without iteration by one byte? Probably I am looking for some class which able to use CPU specific instructions (simd/sse).

All code:

@Override public int read(byte[] b, int off, int len) throws IOException { int readBytes = in.read(b, off, len); for (int i = 0; i < readBytes ; i++) { hadBytes = true; // at least once we read something lastByteIsNewLine = false; if ( b[ i + off ] == 10 ) { // New Line (10) rowCount++; lastByteIsNewLine = (i == readBytes - 1); // last byte in buffer was the newline } } if ( hadBytes && readBytes == -1 && ! lastByteIsNewLine ) { // file is not empty + EOF + last byte was not NewLine rowCount++; } return readBytes; } 
9
  • what is readBytes? Please post its content Commented Oct 4, 2019 at 15:46
  • readBytes == size of byte[] Commented Oct 4, 2019 at 16:06
  • Honestly, no. There is no faster way to do it. It’s already pretty much the same as it would be in assembly language. Commented Oct 4, 2019 at 16:38
  • 1
    @VGR Are you saying that the JIT is able to produce a vectorized version of this that compares 16+ bytes at a time like you could in assembly? Commented Oct 4, 2019 at 17:41
  • @thatotherguy: there's a reasonable chance that a good JIT can vectorize this loop. Commented Oct 4, 2019 at 17:58

4 Answers 4

1

On my system, just moving the lastByteIsNewLine and hasBytes parts out of the loop results in a ~10% improvement*:

 public int read(byte[] b, int off, int len) throws IOException { int readBytes = in.read(b, off, len); for (int i = 0; i < readBytes ; i++) { if ( b[ i + off ] == 10 ) { rowCount++; } } hadBytes |= readBytes > 0; lastByteIsNewLine = (readBytes > 0 ? b[readBytes+off-1] == 10 : false); if ( hadBytes && readBytes == -1 && ! lastByteIsNewLine ) { rowCount++; } return readBytes; } 

* 6000ms vs 6700ms for 1,000 iterations on 10MB buffers read from a ByteArrayInputStream filled with arbitrary text.

Sign up to request clarification or add additional context in comments.

1 Comment

Once it is set to true, hadBytes should never be set to false (on a subsequent call). Use hadBytes |= readBytes > 0.
1

I started with that other guy's improvements, and hoisted the array index calculation and the field access out of the for loop.

According to my JMH benchmark, this saved another 25%, with "that other guy's" implementation clocking 3.6 ms/op, and this version at 2.7 ms/op. (Here, one operation is reading a ~10 MB ByteArrayInputStream with around 5000 lines of random length).

public int read(byte[] buffer, int off, int len) throws IOException { int n = in.read(buffer, off, len); notEmpty |= n > 0; int count = notEmpty && n < 0 && !trailingLineFeed ? 1 : 0; trailingLineFeed = (n > 0) && buffer[n + off - 1] == '\n'; for (int max = off + n, idx = off; idx < max;) { if (buffer[idx++] == '\n') ++count; } rowCount += count; return n; } 

Things that really hurt performance: indexing backward over the array.

Things that don't matter: comparing values with the more readable '\n' instead of 10.

Surprisingly (to me anyway), using only one of these tricks by itself did not seem to improve performance. They only made a difference used together.

2 Comments

it's actually slightly wrong int count = should the second line (before trailingLineFeed = ) because the last call (n=-1) corrupts trailingLineFeed and makes it false. Thank you.
@DenisZhuravlev I think I fixed it.
0

You can easily search in readBytes after converting it in String:

String stringBytes = new String(readBytes); 

To get the amount of occurrences:

int rowCount = StringUtils.countMatches(stringBytes, "\n"); 

To only know if the \n is contained in readBytes:

boolean newLineFound = stringBytes.contains("\n"); 

3 Comments

Just tried this. It's 3 times slower. ``` String stringBytes = new String(b); int i = -1; while ( i < readBytes ) { i = stringBytes.indexOf("\n", i+1); if ( i == -1 ) break; rowCount ++; } ```
still 3 times slower String stringBytes = new String(b); rowCount += StringUtils.countOccurrencesOf(stringBytes, "\n");
I looked inside StringUtils.countOccurrencesOf. It uses String.indexOf which does the same iteration byte by byte.
0

Well, rather than trying to speed up that one specific portion (which I don't think you can), you can try using a different method. Here's a class which you can use to keep track of the number of rows while reading from an InputStream.

public class RowCounter { private static final int LF = 10; private int rowCount = 0; private int lastByte = 0; public int getRowCount() { return rowCount; } public void addByte(int b) { if (lastByte == LF) { rowCount++; } lastByte = b; } public void addBytes(byte[] b, int offset, int length) { if (length <= 0) return; if (lastByte == LF) rowCount++; int lastIndex = offset + length - 1; for (int i = offset; i < lastIndex; i++) { if (b[i] == LF) rowCount++; } lastByte = b[lastIndex]; } } 

Then when reading an InputStream, you can use it like this.

InputStream is = ...; byte[] b = new byte[...]; int bytesRead; RowCounter counter = new RowCounter(); while ((bytesRead = is.read(b)) != -1) { counter.addBytes(b, 0, bytesRead); } int rowCount = counter.getRowCount(); 

or you can easily adapt it to whatever situation you need it for.

Comments

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.