Revisions to What is the fastest substring search algorithm?

Question Protected by Richard J. Ross III

occurred Jun 24, 2012 at 14:31

typo

edited Jul 7, 2010 at 13:48

My current implementation runs in roughly between 10% slower and 8 times faster (depending on the input) than glibc's implementation of Two-Way.

Update: My current optimal algorithm is as follows:

For needles of length 1, use strchr.

For needles of length 2-4, use machine words to compare 2-4 bytes at once as follows: Preload needle in a 16- or 32-bit integer with bitshifts and cycle old byte out/new bytes in from the haystack at each iteration. Every byte of the haystack is read exactly once and incurs a check against 0 (end of string) and one 16- or 32-bit comparison.
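A minimal sketch of the 2-byte case, assuming the needle is exactly two non-zero bytes. The function name `find2` is made up for illustration and this is not necessarily the code in the actual implementation; the 3- and 4-byte cases would follow the same pattern with a 32-bit word.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: search for a needle of exactly 2 non-zero bytes.
 * The needle is packed into a 16-bit word once; the haystack is fed
 * through a 16-bit window one byte at a time. */
static char *find2(const char *haystack, const char *needle)
{
    const unsigned char *h = (const unsigned char *)haystack;
    const unsigned char *n = (const unsigned char *)needle;
    uint16_t nw = (uint16_t)((n[0] << 8) | n[1]);
    uint16_t hw = 0;

    for (; *h; h++) {
        /* Oldest byte shifts out of the window, newest byte shifts in. */
        hw = (uint16_t)((hw << 8) | *h);
        /* After the first byte the high half of hw is 0, which cannot
         * equal n[0] (non-zero), so no false match is possible there. */
        if (hw == nw)
            return (char *)(h - 1);
    }
    return NULL;
}
```

Each haystack byte is loaded once, tested against 0 once, and takes part in one 16-bit comparison, which is the property described above.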

For needles of length >4, use Two-Way algorithm with a bad shift table (like Boyer-Moore) which is applied only to the last byte of the window. To avoid the overhead of initializing a 1kb table, which would be a net loss for many moderate-length needles, I keep a bit array (32 bytes) marking which entries in the shift table are initialized. Bits that are unset correspond to byte values which never appear in the needle, for which a full-needle-length shift is possible.
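To illustrate just the lazily-initialized shift table, here is a rough sketch. Note that the surrounding search is a plain Horspool-style scan, not Two-Way, so it is worst-case quadratic and only serves to show how the 32-byte bit array avoids clearing the whole table. The name `horspool_lazy` and the use of `strlen` on both strings are assumptions made for the example.

```c
#include <stddef.h>
#include <string.h>

#define BIT_SET(set, c)  ((set)[(c) / 8] |= (unsigned char)(1u << ((c) % 8)))
#define BIT_TEST(set, c) ((set)[(c) / 8] & (1u << ((c) % 8)))

/* Illustration only: Horspool-style scan, NOT Two-Way; worst case O(n*m). */
static char *horspool_lazy(const char *haystack, const char *needle)
{
    const unsigned char *h = (const unsigned char *)haystack;
    const unsigned char *n = (const unsigned char *)needle;
    size_t m = strlen(needle), hl = strlen(haystack);
    size_t shift[256];             /* deliberately left uninitialized */
    unsigned char seen[32] = {0};  /* one bit per byte value: entry valid? */
    size_t i, pos;

    if (m == 0) return (char *)haystack;
    if (hl < m) return NULL;

    /* Only entries for bytes in needle[0..m-2] are written; a later
     * (rightmost) occurrence overwrites an earlier one. */
    for (i = 0; i + 1 < m; i++) {
        shift[n[i]] = m - 1 - i;
        BIT_SET(seen, n[i]);
    }

    for (pos = 0; pos + m <= hl; ) {
        unsigned char last = h[pos + m - 1];  /* last byte of the window */
        if (memcmp(h + pos, n, m) == 0)
            return (char *)(h + pos);
        /* Unset bit: the byte never appears in needle[0..m-2], so a
         * full-needle-length shift is safe and shift[] is never read. */
        pos += BIT_TEST(seen, last) ? shift[last] : m;
    }
    return NULL;
}
```

Setup cost is proportional to the needle length plus clearing 32 bytes, rather than writing all 256 table entries.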

The big questions left in my mind are:

Is there a way to make better use of the bad shift table? Boyer-Moore makes best use of it by scanning backwards (right-to-left) but Two-Way requires a left-to-right scan.

The only two viable candidate algorithms I've found for the general case (no out-of-memory or quadratic performance conditions) are Two-Way and String Matching on Ordered Alphabets. But are there easily-detectable cases where different algorithms would be optimal? Certainly many of the algorithms that need O(m) space (where m is the needle length) could be used for m<100 or so. It would also be possible to use algorithms which are worst-case quadratic if there's an easy test for needles which provably require only linear time.

Can you improve performance by assuming the needle and haystack are both well-formed UTF-8? (With characters of varying byte lengths, well-formed-ness imposes some string alignment requirements between the needle and haystack and allows automatic 2-4 byte shifts when a mismatching head byte is encountered. But do these constraints buy you much/anything beyond what maximal suffix computations, good suffix shifts, etc. already give you with various algorithms?)
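To make the alignment idea concrete, here is a small sketch that only tries matches at character boundaries and skips a whole character after a mismatch. It assumes both strings really are well-formed UTF-8 (a truncated sequence at the end of the haystack would make the skip run past the terminator), it is a naive quadratic scan rather than anything competitive, and the names `utf8_seq_len` and `utf8_aligned_search` are invented for the example.

```c
#include <stddef.h>
#include <string.h>

/* Length in bytes of the UTF-8 sequence introduced by head byte b.
 * Returns 0 for a continuation byte (0x80..0xBF), which is not a head. */
static size_t utf8_seq_len(unsigned char b)
{
    if (b < 0x80) return 1;   /* ASCII */
    if (b < 0xC0) return 0;   /* continuation byte */
    if (b < 0xE0) return 2;
    if (b < 0xF0) return 3;
    return 4;
}

/* ASSUMPTION: haystack and needle are well-formed UTF-8, so every match
 * starts on a head byte and whole characters can be skipped. */
static char *utf8_aligned_search(const char *haystack, const char *needle)
{
    size_t m = strlen(needle);
    const char *p = haystack;

    if (m == 0) return (char *)haystack;
    while (*p) {
        size_t len;
        /* strncmp stops at the haystack terminator, so no overrun here. */
        if (strncmp(p, needle, m) == 0)
            return (char *)p;
        len = utf8_seq_len((unsigned char)*p);
        p += len ? len : 1;   /* guard against a stray continuation byte */
    }
    return NULL;
}
```

Whether this kind of skip gains anything beyond what the shift rules of Two-Way or Boyer-Moore already provide is exactly the open question; the sketch only shows what the "automatic 2-4 byte shift" looks like.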

Note: I'm well aware of most of the algorithms out there, just not how well they perform in practice. Here's a good reference so people don't keep giving me references on algorithms as comments/answers: http://www-igm.univ-mlv.fr/~lecroq/string/index.html

added 4 characters in body

edited Jul 6, 2010 at 5:23

R.. GitHub STOP HELPING ICE

OK, so that I don't sound like an idiot, I'm going to state the problem/requirements more explicitly:

Needle (pattern) and haystack (text to search) are both C-style null-terminated strings. No length information is provided; if needed, it must be computed.
Function should return a pointer to the first match, or NULL if no match is found.
Failure cases are not allowed. This means any algorithm with non-constant (or large constant) storage requirements will need to have a fallback case for allocation failure (and performance in the fallback case thereby contributes to worst-case performance).
Implementation is to be in C, although a good description of the algorithm (or link to such) without code is fine too.
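For reference, the interface described by these requirements is the same as standard `strstr`. A minimal brute-force version, useful only as the baseline any candidate has to beat, might look like this (the name `naive_strstr` is just for illustration):

```c
#include <stddef.h>

/* Brute-force baseline: O(n*m) worst case, constant storage, no failure
 * cases. Both arguments are null-terminated; no lengths are passed in. */
static char *naive_strstr(const char *haystack, const char *needle)
{
    const char *h, *n;

    if (!needle[0])
        return (char *)haystack;   /* empty needle matches at the start */

    for (; *haystack; haystack++) {
        /* Compare until the needle ends or a byte differs; hitting the
         * haystack terminator counts as a difference. */
        for (h = haystack, n = needle; *n && *h == *n; h++, n++)
            ;
        if (!*n)
            return (char *)haystack;   /* whole needle matched */
    }
    return NULL;
}
```

It returns a pointer to the first match or NULL, needs no allocation, and never computes a length up front.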

...as well as what I mean by "fastest":

Deterministic O(n) where n = haystack length. (But it may be possible to use ideas from algorithms which are normally O(nm) (for example rolling hash) if they're combined with a more robust algorithm to give deterministic O(n) results).
Never performs (measurably; a couple clocks for if (!needle[1]) etc. are okay) worse than the naive brute force algorithm, especially on very short needles which are likely the most common case. (Unconditional heavy preprocessing overhead is bad, as is trying to improve the linear coefficient for pathological needles at the expense of likely needles.)
Given an arbitrary needle and haystack, comparable or better performance (no worse than 50% longer search time) versus any other widely-implemented algorithm.
Aside from these conditions, I'm leaving the definition of "fastest" open-ended. A good answer should explain why you consider the approach you're suggesting "fastest".
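As an example of the rolling-hash idea mentioned above, here is a rough Rabin-Karp style sketch. On its own it is still worst-case O(nm), because every hash hit is verified with `memcmp`, so it does not meet the deterministic O(n) requirement without being combined with something more robust. The base 257 and the name `rk_strstr` are arbitrary choices for the example.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Rolling-hash search, arithmetic implicitly mod 2^32.
 * Worst case O(n*m) when many windows collide with the needle hash. */
static char *rk_strstr(const char *haystack, const char *needle)
{
    const unsigned char *h = (const unsigned char *)haystack;
    const unsigned char *n = (const unsigned char *)needle;
    const uint32_t B = 257;            /* arbitrary hash base */
    uint32_t nh = 0, hh = 0, top = 1;  /* top becomes B^(m-1) */
    size_t m, i;

    /* Hash the needle and the first window in one pass; give up if the
     * haystack ends before the needle does. */
    for (m = 0; n[m]; m++) {
        if (!h[m]) return NULL;
        nh = nh * B + n[m];
        hh = hh * B + h[m];
        if (m) top *= B;
    }
    if (m == 0) return (char *)haystack;

    for (i = 0; ; i++) {
        /* Verify on a hash hit, since different windows can collide. */
        if (hh == nh && memcmp(h + i, n, m) == 0)
            return (char *)(h + i);
        if (!h[i + m]) return NULL;    /* no further window fits */
        /* Drop the oldest byte's contribution, then append the new byte. */
        hh = (hh - (uint32_t)h[i] * top) * B + h[i + m];
    }
}
```

The haystack length is never computed; the terminator check on `h[i + m]` is what ends the scan.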

My current implementation runs in roughly between 10% slower and 8 times faster (depending on the input) than glibc's implementation of Two-Way. I'm going to hold off on saying what algorithms it's using though in hopes of getting some fresh, unbiased ideas.

Bonus points for:

Can you improve performance by assuming the needle and haystack are both well-formed UTF-8?

Edit: I'm well aware of most of the algorithms out there, just not how well they perform in practice. Here's a good reference so people don't keep giving me references on algorithms as comments/answers: http://www-igm.univ-mlv.fr/~lecroq/string/index.html

added 259 characters in body

edited Jul 6, 2010 at 5:16

R.. GitHub STOP HELPING ICE

OK, so that I don't sound like an idiot, I'm going to state the problem/requirements more explicitly:

Needle (pattern) and haystack (text to search) are both C-style null-terminated strings. No length information is provided; if needed, it must be computed.
Function should return a pointer to the first match, or NULL if no match is found.
Failure cases are not allowed. This means any algorithm with non-constant (or large constant) storage requirements will need to have a fallback case for allocation failure (and performance in the fallback case thereby contributes to worst-case performance).
Implementation is to be in C, although a good description of the algorithm (or link to such) without code is fine too.

...as well as what I mean by "fastest":

Deterministic O(n) where n = haystack length. (But it may be possible to use ideas from algorithms which are normally O(nm) (for example rolling hash) if they're combined with a more robust algorithm to give deterministic O(n) results).
Never performs (measurably; a couple clocks for if (!needle[1]) etc. are okay) worse than the naive brute force algorithm, especially on very short needles which are likely the most common case. (Unconditional heavy preprocessing overhead is bad, as is trying to improve the linear coefficient for pathological needles at the expense of likely needles.)
Given an arbitrary needle and haystack, comparable or better performance (no worse than 50% longer search time) versus any other widely-implemented algorithm.
Aside from these conditions, I'm leaving the definition of "fastest" open-ended. A good answer should explain why you consider the approach you're suggesting "fastest".

My current implementation runs in roughly between 10% slower and 8 times faster (depending on the input) than glibc's implementation of Two-Way. I'm going to hold off on saying what algorithms it's using though in hopes of getting some fresh, unbiased ideas.

Bonus points for:

Can you improve performance by assuming the needle and haystack are both well-formed UTF-8?

Edit: I'm well aware of most of the algorithms out there, just not how well they perform in practice. Here's a good reference so people don't keep giving me references on algorithms as comments/answers: http://www-igm.univ-mlv.fr/~lecroq/string/index.html

asked Jul 6, 2010 at 4:58

R.. GitHub STOP HELPING ICE
