What is the main difference between the Knuth-Morris-Pratt and Boyer-Moore algorithms

Asked

Viewed 359 times

4

I know that the KMP(Knuth-Morris-Pratt) is used to find a Y text in X, tries to define a pattern in Y, and then saves this pattern in an array. And I also know that the BM(Boyer-Moore) works best for small words.

But what is the main difference in its functioning, which is better?

1 answer

2


Functional example in English

In a coarse explanation

The approach of Boyer-Moore is to try to match the last character of the pattern instead of the first with the assumption that if there is no match at the end, it is not necessary to try to match at the beginning. This allows "great jumps", therefore the BM works best when the pattern and the text you are looking for are similar to "natural text" (i.e., English)

Knuth-Morris-Pratt looks for occurrences of a "word" W within a main "text string", employing the remark that when a match fails, the word itself incorporates enough information to determine where the next match could start, thus ignoring the re-matchexamination of previously corresponding characters. (Source: Wiki)

This means that KMP is more suitable for small sets like DNA (ACTG)

According to the creator:

The classic algorithm of Boyer-Moore suffers from the phenomenon that tends not to function as efficiently in small alphabets as DNA. A jump distance tends to stop growing with the length of the pattern because substrings reappear frequently. While remembering more than has already been agreed, one can achieve larger jumps through the text. One can even get the "perfect memory" and, therefore, look at each character to the fullest, while the Boyer-Moore algorithm, While linear, you can inspect a character from the text several times. This idea of remembering more was explored in the literature by others. This suffers from the need for very large tables or state machines.

When to use:

In theory, both algorithms will have a "similar" performance; KMP will make about 2n comparisons in the research phase and Boyer-Moore will make about 3n comparisons in the research phase in the worst case. In either case, you need to repeat the pre-processing when you receive a new text.

But the real answer is that you should not use any of them practice.

The linear auxiliary storage required by both algorithms leads to considerably lower performance on modern architectures because of all the extra memory accesses.

However, the ideas behind Boyer-Moore and KMP support most string fast matching algorithms. Something like the "failure function" of KMP is used by all practically effective string matching algorithms I know; it turns out that you can calculate a "fault function" "suboptimal" for a pattern on-the-fly which still gives you a linear time match while just needs additional space constant. Boyer-Moore is faster than linear in the "average case" of combining a fixed pattern against random noise, and this occurs in many practical situations.

Sources: 1 2

Browser other questions tagged

You are not signed in. Login or sign up in order to post.