OK, so I don't sound like an idiot I'm going to state the problem/requirements more explicitly:
- Needle (pattern) and haystack (text to search) are both C-style null-terminated strings. No length information is provided; if needed, it must be computed.
- Function should return a pointer to the first match, or
NULLif no match is found. - Failure cases are not allowed. This means any algorithm with non-constant (or large constant) storage requirements will need to have a fallback case for allocation failure (and performance in the fallback care thereby contributes to worst-case performance).
- Implementation is to be in C, although a good description of the algorithm (or link to such) without code is fine too.
...as well as what I mean by "fastest":
- Deterministic
O(n)wheren= haystack length. (But it may be possible to use ideas from algorithms which are normallyO(nm)(for example rolling hash) if they're combined with a more robust algorithm to give deterministicO(n)results). - Never performs (measurably; a couple clocks for
if (!needle[1])etc. are okay) worse than the naive brute force algorithm, especially on very short needles which are likely the most common case. (Unconditional heavy preprocessing overhead is bad, as is trying to improve the linear coefficient for pathological needles at the expense of likely needles.) - Given an arbitrary needle and haystack, comparable or better performance (no worse than 50% longer search time) versus any other widely-implemented algorithm.
- Aside from these conditions, I'm leaving the definition of "fastest" open-ended. A good answer should explain why you consider the approach you're suggesting "fastest".
My current implementation runs in roughly between 10% slower and 8 times faster (depending on the input) than glibc's implementation of Two-Way. I'm going to hold off on saying what algorithms it's using though in hopes of getting some fresh, unbiased ideas.
Bonus points for:
- Can you improve performance by assuming the needle and haystack are both well-formed UTF-8?
Edit: I'm well aware of most of the algorithms out there, just not how well they perform in practice. Here's a good reference so people don't keep giving me references on algorithms as comments/answers: http://www-igm.univ-mlv.fr/~lecroq/string/index.html