Skip to main content
added 283 characters in body; added 45 characters in body; deleted 13 characters in body
Source Link
NealB
  • 17k
  • 2
  • 43
  • 61

The http://www-igm.univ-mlv.fr/~lecroq/string/index.html link you point to is an excellent source and summary of some of the best known and researched string matching algorithms.

Solutions to most search problems involve trade offs with respect to pre-processing overhead, time and space requirements. No single algorithm will be optimal or practical in all cases.

If you objective is to design a specific algorithm for string searching, then ignore the rest of what I have to say, If you want to develop a generalized string searching service routine then try the following:

Spend some time reviewing the specific strengths and weaknesses of the algorithms you have already referenced. Conduct the review with the objective of finding a set of algorithms that cover the range and scope of string searches you are interested in. Then, build a front end search selector based on a classifier function to target the best algorithm for the given inputs. This way you may employ the most efficient algorithm to do the job. This is particularly effective when an algorithm is very good for certain searches but degrades poorly. For example, brute force is probably the best for needles of length 1 but quickly degrades as needle length increases, whereupon the sustik-moore algoritim may become more efficient (over small alphabets), then for longer needles and larger alphabets, the KMP or Boyer-Moore algorithms may be better. These are just examples to illustrate a possible strategy.

The multiple algorithm approach not a new idea. I believe it has been employed by a few commercial Sort/Search packages (e.g. SYNCSORT commonly used on mainframes implements several sort algorithms and uses heuristics to choose the "best" one for the given inputs)

Each search algorithm comes in several variations that can make significant differences to its performance, as, for example, this paper illustrates.

Benchmark your service to categorize the areas where additional search strategies are needed or to more effectively tune your selector function. This approach is not quick or easy but if done well can produce very good results.

The http://www-igm.univ-mlv.fr/~lecroq/string/index.html link you point to is an excellent source and summary of some of the best known and researched string matching algorithms.

Solutions to most search problems involve trade offs with respect to pre-processing overhead, time and space requirements. No single algorithm will be optimal or practical in all cases.

If you objective is to design a specific algorithm for string searching, then ignore the rest of what I have to say, If you want to develop a generalized string searching service routine then try the following:

Spend some time reviewing the specific strengths and weaknesses of the algorithms you have already referenced. Conduct the review with the objective of finding a set of algorithms that cover the range and scope of string searches you are interested in. Then, build a front end search selector based on a classifier function to target the best algorithm for the given inputs. This way you may employ the most efficient algorithm to do the job. This is particularly effective when an algorithm is very good for certain searches but degrades poorly. For example, brute force is probably the best for needles of length 1 but quickly degrades as needle length increases.

The multiple algorithm approach not a new idea. I believe it has been employed by a few commercial Sort/Search packages (e.g. SYNCSORT commonly used on mainframes implements several sort algorithms and uses heuristics to choose the "best" one for the given inputs)

Each search algorithm comes in several variations that can make significant differences to its performance, as, for example, this paper illustrates.

Benchmark your service to categorize the areas where additional search strategies are needed or to more effectively tune your selector function. This approach is not quick or easy but if done well can produce very good results.

The http://www-igm.univ-mlv.fr/~lecroq/string/index.html link you point to is an excellent source and summary of some of the best known and researched string matching algorithms.

Solutions to most search problems involve trade offs with respect to pre-processing overhead, time and space requirements. No single algorithm will be optimal or practical in all cases.

If you objective is to design a specific algorithm for string searching, then ignore the rest of what I have to say, If you want to develop a generalized string searching service routine then try the following:

Spend some time reviewing the specific strengths and weaknesses of the algorithms you have already referenced. Conduct the review with the objective of finding a set of algorithms that cover the range and scope of string searches you are interested in. Then, build a front end search selector based on a classifier function to target the best algorithm for the given inputs. This way you may employ the most efficient algorithm to do the job. This is particularly effective when an algorithm is very good for certain searches but degrades poorly. For example, brute force is probably the best for needles of length 1 but quickly degrades as needle length increases, whereupon the sustik-moore algoritim may become more efficient (over small alphabets), then for longer needles and larger alphabets, the KMP or Boyer-Moore algorithms may be better. These are just examples to illustrate a possible strategy.

The multiple algorithm approach not a new idea. I believe it has been employed by a few commercial Sort/Search packages (e.g. SYNCSORT commonly used on mainframes implements several sort algorithms and uses heuristics to choose the "best" one for the given inputs)

Each search algorithm comes in several variations that can make significant differences to its performance, as, for example, this paper illustrates.

Benchmark your service to categorize the areas where additional search strategies are needed or to more effectively tune your selector function. This approach is not quick or easy but if done well can produce very good results.

Source Link
NealB
  • 17k
  • 2
  • 43
  • 61

The http://www-igm.univ-mlv.fr/~lecroq/string/index.html link you point to is an excellent source and summary of some of the best known and researched string matching algorithms.

Solutions to most search problems involve trade offs with respect to pre-processing overhead, time and space requirements. No single algorithm will be optimal or practical in all cases.

If you objective is to design a specific algorithm for string searching, then ignore the rest of what I have to say, If you want to develop a generalized string searching service routine then try the following:

Spend some time reviewing the specific strengths and weaknesses of the algorithms you have already referenced. Conduct the review with the objective of finding a set of algorithms that cover the range and scope of string searches you are interested in. Then, build a front end search selector based on a classifier function to target the best algorithm for the given inputs. This way you may employ the most efficient algorithm to do the job. This is particularly effective when an algorithm is very good for certain searches but degrades poorly. For example, brute force is probably the best for needles of length 1 but quickly degrades as needle length increases.

The multiple algorithm approach not a new idea. I believe it has been employed by a few commercial Sort/Search packages (e.g. SYNCSORT commonly used on mainframes implements several sort algorithms and uses heuristics to choose the "best" one for the given inputs)

Each search algorithm comes in several variations that can make significant differences to its performance, as, for example, this paper illustrates.

Benchmark your service to categorize the areas where additional search strategies are needed or to more effectively tune your selector function. This approach is not quick or easy but if done well can produce very good results.