How can I search for a byte in a byte[] faster?

Question

I count rows in an InputStream by counting newline bytes (LF, byte value 10):

    for (int i = 0; i < readBytes; i++) {
        if (b[i + off] == 10) { // New Line (10)
            rowCount++;
        }
    }

Can I do it faster, without iterating one byte at a time? I am probably looking for a class that can use CPU-specific instructions (SIMD/SSE).

All code:

    @Override
    public int read(byte[] b, int off, int len) throws IOException {
        int readBytes = in.read(b, off, len);
        for (int i = 0; i < readBytes; i++) {
            hadBytes = true; // at least once we read something
            lastByteIsNewLine = false;
            if (b[i + off] == 10) { // New Line (10)
                rowCount++;
                lastByteIsNewLine = (i == readBytes - 1); // last byte in buffer was the newline
            }
        }
        if (hadBytes && readBytes == -1 && !lastByteIsNewLine) {
            // file is not empty + EOF + last byte was not NewLine
            rowCount++;
        }
        return readBytes;
    }

Honestly, no. There is no faster way to do it. It’s already pretty much the same as it would be in assembly language.
– VGR, Commented Oct 4, 2019 at 16:38
@VGR Are you saying that the JIT is able to produce a vectorized version of this that compares 16+ bytes at a time like you could in assembly?
– that other guy, Commented Oct 4, 2019 at 17:41
@thatotherguy: there's a reasonable chance that a good JIT can vectorize this loop.
– Joachim Sauer, Commented Oct 4, 2019 at 17:58
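For comparison with the vectorization discussed in these comments, here is a hand-written "SIMD within a register" (SWAR) sketch that tests eight bytes per step by loading them into a long. It is not taken from the thread: the class name SwarLineCounter and the method countLineFeeds are hypothetical. Whether it beats the JIT-compiled byte loop on a given JVM is an open question that would need measuring.

```java
import java.nio.ByteBuffer;

/** Hypothetical helper, not from the thread: counts LF bytes eight at a time. */
final class SwarLineCounter {
    private static final long LF   = 0x0A0A0A0A0A0A0A0AL; // LF repeated in every byte
    private static final long LOW7 = 0x7F7F7F7F7F7F7F7FL; // low 7 bits of every byte

    static int countLineFeeds(byte[] b, int off, int len) {
        if (len <= 0) {
            return 0; // also covers the -1 that read() returns at EOF
        }
        ByteBuffer buf = ByteBuffer.wrap(b, off, len);
        int count = 0;
        while (buf.remaining() >= Long.BYTES) {
            long x = buf.getLong() ^ LF;   // bytes equal to LF become 0x00
            long t = (x & LOW7) + LOW7;    // bit 7 set where the low 7 bits are non-zero
            long zeros = ~(t | x | LOW7);  // 0x80 in each byte that was zero, 0x00 elsewhere
            count += Long.bitCount(zeros);
        }
        while (buf.hasRemaining()) {       // up to 7 leftover bytes, one at a time
            if (buf.get() == '\n') {
                count++;
            }
        }
        return count;
    }
}
```

Inside read(), this could be called as rowCount += SwarLineCounter.countLineFeeds(b, off, readBytes). Byte order does not matter here because only the number of matching bytes is used.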

that other guy · Accepted Answer · 2019-10-04 23:05:14Z

On my system, just moving the lastByteIsNewLine and hadBytes parts out of the loop results in a ~10% improvement*:

    public int read(byte[] b, int off, int len) throws IOException {
        int readBytes = in.read(b, off, len);
        for (int i = 0; i < readBytes; i++) {
            if (b[i + off] == 10) {
                rowCount++;
            }
        }
        hadBytes |= readBytes > 0;
        if (readBytes > 0) {
            // Update only when data arrived; the EOF call (readBytes == -1) must not reset the flag
            lastByteIsNewLine = b[readBytes + off - 1] == 10;
        }
        if (hadBytes && readBytes == -1 && !lastByteIsNewLine) {
            rowCount++;
        }
        return readBytes;
    }

* 6000ms vs 6700ms for 1,000 iterations on 10MB buffers read from a ByteArrayInputStream filled with arbitrary text.

Once it is set to true, hadBytes should never be set to false (on a subsequent call). Use hadBytes |= readBytes > 0.

erickson · Accepted Answer · 2019-10-04 23:46:55Z

I started with that other guy's improvements, and hoisted the array index calculation and the field access out of the for loop.

According to my JMH benchmark, this saved another 25%, with "that other guy's" implementation clocking 3.6 ms/op, and this version at 2.7 ms/op. (Here, one operation is reading a ~10 MB ByteArrayInputStream with around 5000 lines of random length).

    public int read(byte[] buffer, int off, int len) throws IOException {
        int n = in.read(buffer, off, len);
        notEmpty |= n > 0;
        int count = notEmpty && n < 0 && !trailingLineFeed ? 1 : 0;
        trailingLineFeed = (n > 0) && buffer[n + off - 1] == '\n';
        for (int max = off + n, idx = off; idx < max;) {
            if (buffer[idx++] == '\n') ++count;
        }
        rowCount += count;
        return n;
    }

Things that really hurt performance: indexing backward over the array.

Things that don't matter: comparing values with the more readable '\n' instead of 10.

Surprisingly (to me anyway), using only one of these tricks by itself did not seem to improve performance. They only made a difference used together.
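To make the two hoisting tricks easier to see in isolation, here is a small side-by-side sketch. The class HoistingDemo and its method names are hypothetical and exist only for illustration. Both methods count the same bytes, and any speed difference depends on the JVM and would have to be benchmarked.

```java
/** Hypothetical illustration of hoisting the index arithmetic and the field access. */
final class HoistingDemo {
    private int rowCount;

    // Before: "i + off" is computed and the field is updated inside the loop.
    void countInLoop(byte[] b, int off, int n) {
        for (int i = 0; i < n; i++) {
            if (b[i + off] == '\n') {
                rowCount++;
            }
        }
    }

    // After: the end index is computed once, matches go to a local,
    // and the field is written a single time after the loop.
    void countHoisted(byte[] b, int off, int n) {
        int count = 0;
        for (int idx = off, max = off + n; idx < max; idx++) {
            if (b[idx] == '\n') {
                count++;
            }
        }
        rowCount += count;
    }

    int getRowCount() {
        return rowCount;
    }
}
```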

It's actually slightly wrong: int count = should be the second line (before trailingLineFeed = ), because the last call (n = -1) corrupts trailingLineFeed and makes it false. Thank you.

user2342558 · Accepted Answer · 2019-10-04 16:06:44Z

You can easily search the bytes that were read after converting them to a String:

    String stringBytes = new String(b, off, readBytes, StandardCharsets.ISO_8859_1);

To get the number of occurrences:

    int rowCount = StringUtils.countMatches(stringBytes, "\n");

To check only whether a \n is present in the bytes read:

    boolean newLineFound = stringBytes.contains("\n");
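If no StringUtils class is on the classpath, the same idea can be sketched with the JDK alone. The class StringLineCount is hypothetical, and the parameters b, off and readBytes are assumed to have the same meaning as in the question. ISO-8859-1 is used because it maps each byte to exactly one char, so LF bytes are preserved. Note that this approach still copies the buffer into a new String on every call.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical JDK-only variant of the String-based count. */
final class StringLineCount {
    static long countLineFeeds(byte[] b, int off, int readBytes) {
        if (readBytes <= 0) {
            return 0; // nothing read, or EOF (-1)
        }
        // Decode only the bytes that were actually read, one char per byte.
        String s = new String(b, off, readBytes, StandardCharsets.ISO_8859_1);
        return s.chars().filter(c -> c == '\n').count();
    }
}
```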

edited Oct 4, 2019 at 16:06

answered Oct 4, 2019 at 15:52

3 Comments

Denny Crane Over a year ago

Just tried this. It's 3 times slower.
```
String stringBytes = new String(b);
int i = -1;
while (i < readBytes) {
    i = stringBytes.indexOf("\n", i + 1);
    if (i == -1) break;
    rowCount++;
}
```

Denny Crane Over a year ago

Still 3 times slower:

    String stringBytes = new String(b);
    rowCount += StringUtils.countOccurrencesOf(stringBytes, "\n");

Denny Crane Over a year ago

I looked inside StringUtils.countOccurrencesOf. It uses String.indexOf which does the same iteration byte by byte.

Leo Aso · Accepted Answer · 2019-10-04 18:47:12Z

Well, rather than trying to speed up that one specific portion (which I don't think you can), you can try using a different method. Here's a class which you can use to keep track of the number of rows while reading from an InputStream.

    public class RowCounter {
        private static final int LF = 10;
        private int rowCount = 0;
        private int lastByte = 0;

        public int getRowCount() {
            return rowCount;
        }

        public void addByte(int b) {
            if (lastByte == LF) {
                rowCount++;
            }
            lastByte = b;
        }

        public void addBytes(byte[] b, int offset, int length) {
            if (length <= 0) return;
            if (lastByte == LF) rowCount++;
            int lastIndex = offset + length - 1;
            for (int i = offset; i < lastIndex; i++) {
                if (b[i] == LF) rowCount++;
            }
            lastByte = b[lastIndex];
        }
    }

Then when reading an InputStream, you can use it like this.

    InputStream is = ...;
    byte[] b = new byte[...];
    int bytesRead;
    RowCounter counter = new RowCounter();
    while ((bytesRead = is.read(b)) != -1) {
        counter.addBytes(b, 0, bytesRead);
    }
    int rowCount = counter.getRowCount();

Or you can easily adapt it to whatever situation you need it for.
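As one possible adaptation to the question's setup, the counting can live in a FilterInputStream so that rows are tallied as bytes pass through read(). This is a sketch under assumptions, not code from the answer: the class name RowCountingInputStream is hypothetical, and it uses its own bookkeeping (LF bytes seen plus the last byte read) so that a final line without a trailing LF counts as a row, as in the question.

```java
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical adaptation: counts rows while bytes pass through read(). */
final class RowCountingInputStream extends FilterInputStream {
    private static final int LF = 10;
    private long lineFeeds;     // LF bytes seen so far
    private int lastByte = LF;  // starts as LF so an empty stream reports 0 rows

    RowCountingInputStream(InputStream in) {
        super(in);
    }

    @Override
    public int read() throws IOException {
        int b = in.read();
        if (b >= 0) {
            if (b == LF) {
                lineFeeds++;
            }
            lastByte = b;
        }
        return b;
    }

    @Override
    public int read(byte[] b, int off, int len) throws IOException {
        int n = in.read(b, off, len);
        if (n > 0) {
            int count = 0;
            for (int i = off, max = off + n; i < max; i++) {
                if (b[i] == LF) {
                    count++;
                }
            }
            lineFeeds += count;
            lastByte = b[off + n - 1];
        }
        return n;
    }

    /** Rows seen so far; an unterminated final line counts as one row. */
    long getRowCount() {
        return lastByte == LF ? lineFeeds : lineFeeds + 1;
    }
}
```

Because the row count is derived on demand from the two fields, repeated read() calls after EOF do not change the result. Bytes bypassed with skip() are not counted in this sketch.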
