How does Bitap’s algorithm work?

Asked

Viewed 202 times

5

According to Wikipedia in English:

The bitap algorithm (also known as o-shift or, shift and or Baeza-Yates-Gonnet algorithm) is a matching sequence approximate algorithm. The algorithm indicates whether a given text contains a subsequence that is "roughly equal" to a given standard, where the approximate equality is defined in terms of Levenshtein distance - if the undercoat and pattern are within a given distance k from each other, then the algorithm considers them equal.

I would like to know how this algorithm works exactly, because the contents I found are very complex.

1 answer

3


I extracted the code contained in this answer of this article, and the analysis of the code is my own.

Below is the code in C to follow the reasoning of the analysis. Note: In this code the values of 1 and 0 are semantically opposite.

#include <stdlib.h>
#include <string.h>
#include <limits.h>

const char *bitap_fuzzy_bitwise_search(const char *text, const char *pattern, int k)
{
  const char *result = NULL;
  int m = strlen(pattern);
  unsigned long *R;
  unsigned long pattern_mask[CHAR_MAX+1];
  int i, d;

  if (pattern[0] == '\0') return text;
  if (m > 31) return "The pattern is too long!";

  /* Initialize the bit array R */
  R = malloc((k+1) * sizeof *R);
  for (i=0; i <= k; ++i)
      R[i] = ~1;

  /* Initialize the pattern bitmasks */
  for (i=0; i <= CHAR_MAX; ++i)
      pattern_mask[i] = ~0;
  for (i=0; i < m; ++i)
      pattern_mask[pattern[i]] &= ~(1UL << i);

  for (i=0; text[i] != '\0'; ++i) {
      /* Update the bit arrays */
      unsigned long old_Rd1 = R[0];

      R[0] |= pattern_mask[text[i]];
      R[0] <<= 1;

      for (d=1; d <= k; ++d) {
          unsigned long tmp = R[d];
          /* Substitution is all we care about */
          R[d] = (old_Rd1 & (R[d] | pattern_mask[text[i]])) << 1;
          old_Rd1 = tmp;
      }

      if (0 == (R[k] & (1UL << m))) {
          result = (text+i - m) + 1;
          break;
      }
  }

  free(R);
  return result;
}

Analyzing

Initialization occurs until the third loop for. The third loop causes all error analysis positions for each attempt to indicate "failure" for boot effect.

The last for, who owns a for nested, makes character-by-character comparisons to verify errors. The for internal limits the scope of checks to the maximum distance established in the call of the algorithm.

Once the vector R is fully filled with 1, the word result is returned.

Browser other questions tagged

You are not signed in. Login or sign up in order to post.