45

Given a byte array, how can I find within it, the position of a (smaller) byte array?

This documentation looked promising, using ArrayUtils, but if I'm correct it would only let me find an individual byte within the array to be searched.

(I can't see it mattering, but just in case: sometimes the search byte array will be regular ASCII characters, other times it will be control characters or extended ASCII characters. So using String operations would not always be appropriate)

The large array could be between 10 and about 10000 bytes, and the smaller array around 10. In some cases I will have several smaller arrays that I want found within the larger array in a single search. And I will at times want to find the last index of an instance rather than the first.

2
  • How big is the big array, and how much smaller is the smaller array? Depending on this, different approaches may be applicable. Commented Jan 24, 2014 at 19:40
  • Thank you for your question - I've clarified my question Commented Jan 24, 2014 at 19:44

10 Answers 10

50

The simpelst way would be to compare each element:

public int indexOf(byte[] outerArray, byte[] smallerArray) { for(int i = 0; i < outerArray.length - smallerArray.length+1; ++i) { boolean found = true; for(int j = 0; j < smallerArray.length; ++j) { if (outerArray[i+j] != smallerArray[j]) { found = false; break; } } if (found) return i; } return -1; } 

Some tests:

@Test public void testIndexOf() { byte[] outer = {1, 2, 3, 4}; assertEquals(0, indexOf(outer, new byte[]{1, 2})); assertEquals(1, indexOf(outer, new byte[]{2, 3})); assertEquals(2, indexOf(outer, new byte[]{3, 4})); assertEquals(-1, indexOf(outer, new byte[]{4, 4})); assertEquals(-1, indexOf(outer, new byte[]{4, 5})); assertEquals(-1, indexOf(outer, new byte[]{4, 5, 6, 7, 8})); } 

As you updated your question: Java Strings are UTF-16 Strings, they do not care about the extended ASCII set, so you could use string.indexOf()

Sign up to request clarification or add additional context in comments.

2 Comments

Shouldn't the line ` if (outerArray[i] != innerArray[j]) {` be ` if (outerArray[i + j] != innerArray[j]) {`?
Oh, and I get an array out of bounds message still - I think the first for loop should be: for(int i = 0; i < outerArray.length - smallerArray.length; ++i) {. Last but not least, return false won't work if it also returns an integer. Instead, return -1 would work
34

Google's Guava provides a Bytes.indexOf(byte[] array, byte[] target).

7 Comments

... which is implemented as a double for-loop ... don't need a library for that, do you?
If you have it on the classpath and you know its there why wouldn't you use it?
..."implemented using a double for-loop" and they are using a goto. haven't seen this since my BASIC times :-o
Isn't a goto just a jump to instruction? Seems like this is faster than an additional logical statement after the end of the outer loop or an extra variable, which removes a few lines of code as well . Goto has few good uses: jumping out of nested loops and jumping to cleanup on error are the ones I've seen used effectively. Often times they make the code more readable, easier to debug, and faster.. Use your tools wisely!
While double loop (explicit or optimized otherwise) is inevitable in this search I'd rather see Google mimic implementation of String.indexOf as it is slightly more optimized. Instead they re-invented the algorithm and not in a better way. Still it's better to use this one (assuming it is already on classpath) than encoding to String first.
|
8

To save your time in testing:

http://helpdesk.objects.com.au/java/search-a-byte-array-for-a-byte-sequence

gives you code that works if you make computeFailure() static:

public class KPM { /** * Search the data byte array for the first occurrence * of the byte array pattern. */ public static int indexOf(byte[] data, byte[] pattern) { int[] failure = computeFailure(pattern); int j = 0; for (int i = 0; i < data.length; i++) { while (j > 0 && pattern[j] != data[i]) { j = failure[j - 1]; } if (pattern[j] == data[i]) { j++; } if (j == pattern.length) { return i - pattern.length + 1; } } return -1; } /** * Computes the failure function using a boot-strapping process, * where the pattern is matched against itself. */ private static int[] computeFailure(byte[] pattern) { int[] failure = new int[pattern.length]; int j = 0; for (int i = 1; i < pattern.length; i++) { while (j>0 && pattern[j] != pattern[i]) { j = failure[j - 1]; } if (pattern[j] == pattern[i]) { j++; } failure[i] = j; } return failure; } } 

Since it is always wise to test the code that you borrow, you may start with:

public class Test { public static void main(String[] args) { do_test1(); } static void do_test1() { String[] ss = { "", "\r\n\r\n", "\n\n", "\r\n\r\nthis is a test", "this is a test\r\n\r\n", "this is a test\r\n\r\nthis si a test", "this is a test\r\n\r\nthis si a test\r\n\r\n", "this is a test\n\r\nthis si a test", "this is a test\r\nthis si a test\r\n\r\n", "this is a test" }; for (String s: ss) { System.out.println(""+KPM.indexOf(s.getBytes(), "\r\n\r\n".getBytes())+"in ["+s+"]"); } } } 

Comments

8

Using the Knuth–Morris–Pratt algorithm is the most efficient way.

StreamSearcher.java is an implementation of it and is part of Twitter's elephant-bird project.

It is not recommended to include this library since it is rather sizable for using just a single class.

import java.io.IOException; import java.io.InputStream; import java.util.Arrays; /** * An efficient stream searching class based on the Knuth-Morris-Pratt algorithm. * For more on the algorithm works see: http://www.inf.fh-flensburg.de/lang/algorithmen/pattern/kmpen.htm. */ public class StreamSearcher { private byte[] pattern_; private int[] borders_; // An upper bound on pattern length for searching. Results are undefined for longer patterns. @SuppressWarnings("unused") public static final int MAX_PATTERN_LENGTH = 1024; StreamSearcher(byte[] pattern) { setPattern(pattern); } /** * Sets a new pattern for this StreamSearcher to use. * * @param pattern the pattern the StreamSearcher will look for in future calls to search(...) */ public void setPattern(byte[] pattern) { pattern_ = Arrays.copyOf(pattern, pattern.length); borders_ = new int[pattern_.length + 1]; preProcess(); } /** * Searches for the next occurrence of the pattern in the stream, starting from the current stream position. Note * that the position of the stream is changed. If a match is found, the stream points to the end of the match -- i.e. the * byte AFTER the pattern. Else, the stream is entirely consumed. The latter is because InputStream semantics make it difficult to have * another reasonable default, i.e. leave the stream unchanged. * * @return bytes consumed if found, -1 otherwise. */ long search(InputStream stream) throws IOException { long bytesRead = 0; int b; int j = 0; while ((b = stream.read()) != -1) { bytesRead++; while (j >= 0 && (byte) b != pattern_[j]) { j = borders_[j]; } // Move to the next character in the pattern. ++j; // If we've matched up to the full pattern length, we found it. Return, // which will automatically save our position in the InputStream at the point immediately // following the pattern match. if (j == pattern_.length) { return bytesRead; } } // No dice, Note that the stream is now completely consumed. return -1; } /** * Builds up a table of longest "borders" for each prefix of the pattern to find. This table is stored internally * and aids in implementation of the Knuth-Moore-Pratt string search. * <p> * For more information, see: http://www.inf.fh-flensburg.de/lang/algorithmen/pattern/kmpen.htm. */ private void preProcess() { int i = 0; int j = -1; borders_[i] = j; while (i < pattern_.length) { while (j >= 0 && pattern_[i] != pattern_[j]) { j = borders_[j]; } borders_[++i] = ++j; } } } 

Comments

7

Is this what you are looking for?

public class KPM { /** * Search the data byte array for the first occurrence of the byte array pattern within given boundaries. * @param data * @param start First index in data * @param stop Last index in data so that stop-start = length * @param pattern What is being searched. '*' can be used as wildcard for "ANY character" * @return */ public static int indexOf( byte[] data, int start, int stop, byte[] pattern) { if( data == null || pattern == null) return -1; int[] failure = computeFailure(pattern); int j = 0; for( int i = start; i < stop; i++) { while (j > 0 && ( pattern[j] != '*' && pattern[j] != data[i])) { j = failure[j - 1]; } if (pattern[j] == '*' || pattern[j] == data[i]) { j++; } if (j == pattern.length) { return i - pattern.length + 1; } } return -1; } /** * Computes the failure function using a boot-strapping process, * where the pattern is matched against itself. */ private static int[] computeFailure(byte[] pattern) { int[] failure = new int[pattern.length]; int j = 0; for (int i = 1; i < pattern.length; i++) { while (j>0 && pattern[j] != pattern[i]) { j = failure[j - 1]; } if (pattern[j] == pattern[i]) { j++; } failure[i] = j; } return failure; } } 

1 Comment

Beware that this implementation contains a strange "wildcard" feature.
6

Copied almost identical from java.lang.String.

indexOf(char[],int,int,char[]int,int,int)

static int indexOf(byte[] source, int sourceOffset, int sourceCount, byte[] target, int targetOffset, int targetCount, int fromIndex) { if (fromIndex >= sourceCount) { return (targetCount == 0 ? sourceCount : -1); } if (fromIndex < 0) { fromIndex = 0; } if (targetCount == 0) { return fromIndex; } byte first = target[targetOffset]; int max = sourceOffset + (sourceCount - targetCount); for (int i = sourceOffset + fromIndex; i <= max; i++) { /* Look for first character. */ if (source[i] != first) { while (++i <= max && source[i] != first) ; } /* Found first character, now look at the rest of v2 */ if (i <= max) { int j = i + 1; int end = j + targetCount - 1; for (int k = targetOffset + 1; j < end && source[j] == target[k]; j++, k++) ; if (j == end) { /* Found whole string. */ return i - sourceOffset; } } } return -1; } 

4 Comments

There are already good solutions for this question. Why do you provide a copy-paste code example?
I found it better for me and though I could share it in case anyone looking for the same thing wanted an alternative
Is this an algorithm? I found it a bit faster than KMP, Sunday or brute force two while loop in my case.
This is a so-called "brute-force" algorithm. KMP (en.wikipedia.org/wiki/…) is not expected to be faster than this in the case of JSON strings. Where KMP would be faster would be, for example, if source was AAAAAAAAAAAAAA, and search was AAAAAAAAAB. With brute force there would be a lot of unnecessary comparisons.
2

Several (or all?) of the examples posted here failed some Unit tests so I am posting my version along with the aforementioned tests over here. All of the Unit tests are BASED upon the requirement that Java's String.indexOf() always gives us the right answer!

// The Knuth, Morris, and Pratt string searching algorithm remembers information about // the past matched characters instead of matching a character with a different pattern // character over and over again. It can search for a pattern in O(n) time as it never // re-compares a text symbol that has matched a pattern symbol. But, it does use a partial // match table to analyze the pattern structure. Construction of a partial match table // takes O(m) time. Therefore, the overall time complexity of the KMP algorithm is O(m + n). public class KMPSearch { public static int indexOf(byte[] haystack, byte[] needle) { // needle is null or empty if (needle == null || needle.length == 0) return 0; // haystack is null, or haystack's length is less than that of needle if (haystack == null || needle.length > haystack.length) return -1; // pre construct failure array for needle pattern int[] failure = new int[needle.length]; int n = needle.length; failure[0] = -1; for (int j = 1; j < n; j++) { int i = failure[j - 1]; while ((needle[j] != needle[i + 1]) && i >= 0) i = failure[i]; if (needle[j] == needle[i + 1]) failure[j] = i + 1; else failure[j] = -1; } // find match int i = 0, j = 0; int haystackLen = haystack.length; int needleLen = needle.length; while (i < haystackLen && j < needleLen) { if (haystack[i] == needle[j]) { i++; j++; } else if (j == 0) i++; else j = failure[j - 1] + 1; } return ((j == needleLen) ? (i - needleLen) : -1); } } import java.util.Random; class KMPSearchTest { private static Random random = new Random(); private static String alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789"; @Test public void testEmpty() { test("", ""); test("", "ab"); } @Test public void testOneChar() { test("a", "a"); test("a", "b"); } @Test public void testRepeat() { test("aaa", "aaaaa"); test("aaa", "abaaba"); test("abab", "abacababc"); test("abab", "babacaba"); } @Test public void testPartialRepeat() { test("aaacaaaaac", "aaacacaacaaacaaaacaaaaac"); test("ababcababdabababcababdaba", "ababcababdabababcababdaba"); } @Test public void testRandomly() { for (int i = 0; i < 1000; i++) { String pattern = randomPattern(); for (int j = 0; j < 100; j++) test(pattern, randomText(pattern)); } } /* Helper functions */ private static String randomPattern() { StringBuilder sb = new StringBuilder(); int steps = random.nextInt(10) + 1; for (int i = 0; i < steps; i++) { if (sb.length() == 0 || random.nextBoolean()) { // Add literal int len = random.nextInt(5) + 1; for (int j = 0; j < len; j++) sb.append(alphabet.charAt(random.nextInt(alphabet.length()))); } else { // Repeat prefix int len = random.nextInt(sb.length()) + 1; int reps = random.nextInt(3) + 1; if (sb.length() + len * reps > 1000) break; for (int j = 0; j < reps; j++) sb.append(sb.substring(0, len)); } } return sb.toString(); } private static String randomText(String pattern) { StringBuilder sb = new StringBuilder(); int steps = random.nextInt(100); for (int i = 0; i < steps && sb.length() < 10000; i++) { if (random.nextDouble() < 0.7) { // Add prefix of pattern int len = random.nextInt(pattern.length()) + 1; sb.append(pattern.substring(0, len)); } else { // Add literal int len = random.nextInt(30) + 1; for (int j = 0; j < len; j++) sb.append(alphabet.charAt(random.nextInt(alphabet.length()))); } } return sb.toString(); } private static void test(String pattern, String text) { try { assertEquals(text.indexOf(pattern), KMPSearch.indexOf(text.getBytes(), pattern.getBytes())); } catch (AssertionError e) { System.out.println("FAILED -> Unable to find '" + pattern + "' in '" + text + "'"); } } } 

Comments

1
package org.example; import java.util.List; import org.riversun.finbin.BinarySearcher; public class Sample2 { public static void main(String[] args) throws Exception { BinarySearcher bs = new BinarySearcher(); // UTF-8 without BOM byte[] srcBytes = "Hello world.It's a small world.".getBytes("utf-8"); byte[] searchBytes = "world".getBytes("utf-8"); List<Integer> indexList = bs.searchBytes(srcBytes, searchBytes); System.out.println("indexList=" + indexList); } } 

so it results in

indexList=[6, 25] 

So,u can find the index of byte[] in byte[]

Example here on Github at: https://github.com/riversun/finbin

Comments

0

Java strings are composed of 16-bit chars, not of 8-bit bytes. A char can hold a byte, so you can always make your byte arrays into strings, and use indexOf: ASCII characters, control characters, and even zero characters will work fine.

Here is a demo:

byte[] big = new byte[] {1,2,3,0,4,5,6,7,0,8,9,0,0,1,2,3,4}; byte[] small = new byte[] {7,0,8,9,0,0,1}; String bigStr = new String(big, StandardCharsets.UTF_8); String smallStr = new String(small, StandardCharsets.UTF_8); System.out.println(bigStr.indexOf(smallStr)); 

This prints 7.

However, considering that your large array could be up to 10,000 bytes, and the small array is only ten bytes, this solution may not be the most efficient, for two reasons:

  • It requires copying your big array into an array that is twice as large (same capacity, but with char instead of byte). This triples your memory requirements.
  • String search algorithm of Java is not the fastest one available. You may get sufficiently faster if you implement one of the advanced algorithms, for example, the Knuth–Morris–Pratt one. This could potentially bring the execution speed down by a factor of up to ten (the length of the small string), and will require additional memory that is proportional to the length of the small string, not the big string.

6 Comments

It's worth it to note that the single byte[] constructor uses the platforms default charset. That's probably some kind of UTF-8 but you can't be too sure. It might be better to use the other constructor that lets you specify it IE new String(bytes, "UTF-8").
even better: new String(bytes, StandardCharsets.UTF_8)
Don't do this; this is not how encoding works. There are two problems in this answer:
First: not all byte sequences are valid UTF-8 streams, and this code will fail in those cases. As an example, try looking for {-61, 40} in the string {-16, -112, 40, -68}: your code returns 0, because both of these are invalid sequences that Java replaces by the default replacement character for UTF-8.
Second, even for valid UTF-8 streams, this won't return correct results: when you decode a sequence of ASCII characters, for example, Java will pad each byte to get chars; it won't pack two bytes per char, unless they're part of the encoding of a single Unicode character. This means that the code will be wrong when a pattern you search for gets split across multiple characters in different ways in smallStr and largeStr. (conceptually, you'd look for x, y, z, packed as x, yz in a sequence containing xy, z).
|
0

For a little HTTP server I am currently working on, I came up with the following code to find boundaries in a multipart/form-data request. Hoped to find a better solution here, but i guess I'll stick with it. I think it is as efficent as it can get (quite fast and uses not much ram). It uses the input bytes as ring buffer, reads the next byte as soon as it does not match the boundary and writes the data after the first full cycle into the output stream. Of course can it be changed for byte arrays instead of streams, as asked in the question.

 private boolean multipartUploadParseOutput(InputStream is, OutputStream os, String boundary) { try { String n = "--"+boundary; byte[] bc = n.getBytes("UTF-8"); int s = bc.length; byte[] b = new byte[s]; int p = 0; long l = 0; int c; boolean r; while ((c = is.read()) != -1) { b[p] = (byte) c; l += 1; p = (int) (l % s); if (l>p) { r = true; for (int i = 0; i < s; i++) { if (b[(p + i) % s] != bc[i]) { r = false; break; } } if (r) break; os.write(b[p]); } } os.flush(); return true; } catch(IOException e) {e.printStackTrace();} return false; } 

Comments

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.