Return the "period" size of a bit string

Asked

Viewed 1,032 times

12

Recently I did a very interesting online test (from Codility). Unfortunately this test was timed (30 minutes) and I ended up not being able to reach a satisfactory solution to one of the questions.

Context

A decimal number can be seen as a bit sequence.

Examples:

955 = 1110111011
10 =  1010
627 = 1001110011
102 = 1100110
104 = 1101000

These bit strings can be periodic, i.e., composed of successive bit substring repetitions.

Update

From the official definition of the problem. Given a String S composed of Q bits. The period P is the smallest positive integer such that:

  • P Q / 2 and
  • S[K] = S[K+P] for all K, where 0 K < Q P.

Examples:

The binary 1010 is composed of the substring 10 repeated twice. The binary 1110111011 is composed of the substring 1110 repeated two and a half times (as the 11 left at the end after two repetitions). Already the decimal 1101000 is not periodic (there is no substring with Q / 2 forming a period).

The period of a binary string is nothing more than the size of the repeating bit sequence.

955-> 1110 (período 4)
10 ->  10  (período 2)
627 -> 10011 (período 5)
102 -> 1100 (4, mas esse não é um período válido pois `P > Q / 2`)
104 -> não tem

The problem

Integer n, write an algorithm that returns the period of the binary string from n or -1 in case the binary string is not periodic:

public int periodo(int n) 

Expected complexity

  • Worst temporal case O(log(N)²)
  • Worst space case O(log(N))

A possible sub-optimal solution of brute force

I spent a lot of time on that question and I don’t even know if I could come up with a solution naive correct (mine feeling was that this was a bit manipulation question, in the end I ended up using Integer.toBinaryString(n) and going by brute force).

I ended up doing the following:

  • Generate each candidate substring to be a period of n. Starting from 1 to [n] bit string size/ 2

     final String binaryString = Integer.toBinaryString(n);
     for (int i = 1; i < binaryString.length() / 2; i++) {
         final String period = binaryString.substring(0, i);
    
  • For each of these substrings, check whether the main string could be generated from substring repeats (using a loop and comparing character to character):

    private boolean checkSolution(String binaryString, String period) {
        boolean isPeriod = true;
        for (int i = 0; isPeriod && i < binaryString.length(); i++) {
            if (binaryString.charAt(i) != period.charAt(i % period.length())) {
                isPeriod = false;
            }
        }
    
        return isPeriod;
    }
    
  • Return the shortest period found (or -1 which is the default if no period has been found).

But I was not satisfied with this solution... Mainly because it is basically quadratic in relation to Q, far from the expected time complexity.

My question is, has anyone ever solved this problem and knows a better algorithm? It does not need to be in Java, a pseudo-code or implementation in any language is ok.

  • 2

    I call shenaningans! 110100 is the 110100 sequence repeated exactly once!

  • 1

    This problem is vaguely familiar... I’m sure there’s a solution involving finite automata, and it’s very similar to the problem of marrying a string with a regular expression. I’ll try to remember better, but I’ve got the hint.

  • 1

    Yeah, I also thought about Kleene’s closure, but I couldn’t think of any algorithm to infer the alphabet given to the word, especially since it allows cases of partial composition like the 905 = 1110111011. Later I also ran into the algorithms of cycle detection but I could not formulate the problem exactly according to the proposed algorithms.

  • 1

    Disregard what I said about automata, I thought of them because I’ve recently been rereading this. But the problem I remembered was another (string search), has nothing to do with repeated patterns no... As for your solution proposal, I believe it is not far from the optimal solution no, although maybe can to reduce a little the space of solutions (I have some ideas here, but as I am lousy in math I don’t know it’s in my head or not... if I have something concrete, then I come back here).

  • P.S. I agree that the repetitions have to be whole, otherwise it opens up a lot of room for "gambiarra"... Ex.: 101110010 is 1011100 repeated 1 time and 2/7.

  • @Anthonyaccioly But one thing has to be clear, in the first case there are 2.5 periods of 4? And so there would be a multitude of periods and it seems to me that it was very difficult to resolve. You could even have a decimal with several periods. .

  • 1

    @Anthonyaccioly Although I don’t really have an answer (I gave up!), here’s an additional question - to help you clear up the ideas: you want the greater repeating pattern, or the minor? Because if your jail is 0000000... then there is a pattern of size 1, one of size 2, one of size 3... and although I can’t prove it, I strongly suspect that if there are two patterns with sizes A and B then there is also one with size mdc(A,B) - even if they are cousins to each other (will have a size pattern 1). That is, when you find the smallest, stop - because all will be multiples of it.

  • @Jorge B. for 2.5 periods I don’t really mean a fraction, but there may be an incomplete "rest" of the binary string (See the example of 905 = 1110 1110 11), that is, the last repetition may be partial and not complete (this case was foreseen in the exercise).

  • 1

    @mgibsonbr I think I have found a solution http://stackoverflow.com/a/21171365/664577. (Page 340 of this PDF here http://www-igm.univ-mlv.fr/~mac/REC/text-Algorithms.pdf), let me try to remember a little bit of Pascal as soon as I can understand the solution put a response hehe.

  • Everyone, I received today the "results" of the exam, which showed the definition of the problem, as well as the worst cases expected. Unfortunately the solution I found above is also not great, so the problem remains open. I cleaned up the comments and includes official definitions. @utluiz, see the settings the exercise is looking for the lowest standard even (good feeling).

  • 1

    @Jorgeb., also see the official definitions, really partial patterns are acceptable, since they P < Q / 2, that is, that the pattern is repeated at least twice.

  • 2

    This expected complexity is expressed as a function of what? Remember that for an integer n, the size of your bit string is [the ceiling of] log2(n). That is, a "quadratic solution in relation to Q" corresponds exactly to log2(n)^2...

  • 2

    Hmm... it makes sense @mgibsonbr. In my results I exceeded the time limit of some unit tests (but my algorithm had a stupid problem... I only got 38 out of 100 points). Maybe you’re right and I want to optimize a solution already "good enough".

  • 1

    This paper here seems to be a good way...

Show 9 more comments

3 answers

4

I particularly believe that the best (or most reliable) solution is to apply autocorrelation, it is the same method used to find the period in signals, a very practical example of autocorrelation is when you need to find the frequency (which conversely is the same as the period) of a given audio signal the principle is very similar to that of the proposed problem, the autocorrelation can be defined as:

inserir a descrição da imagem aqui

The steps to achieve the expected result is:

  1. Apply autocorrelation to the bit vector.
  2. The autocorrelation will return the most similar points.
  3. Now with the vector returned by autocorrelation find all the peaks of this vector.
  4. now that you know all the locations(peaks), subtract(find the difference) between the peaks.
  5. Your period is the result of the first difference.

I did here quickly on Matlab to demonstrate the concept, follow some results of the presented by the problem:


vector = [1 1 1 0 1 1 1 0 1 1];

R = xcorr(vector);

[pks,locs]=findpeaks(R);

resultado=diff(locs);

resultado(1)

ans =

     4

vector = [1 0 1 0];

R = xcorr(vector);

[pks,locs]=findpeaks(R);

resultado=diff(locs);

resultado(1)

ans =

     2

vector =[1 0 0 1 1 1 0 0 1 1];

R = xcorr(vector);

[pks,locs]=findpeaks(R);

resultado=diff(locs);

resultado(1)

ans =

     5
  • 1

    Very nice Elder (and Mathematics really allows a very succinct programming for this type of problem). Would you have any idea of the asymptotic complexity of this solution?

  • 1

    @ederwander I believe that when it says "asymptotic complexity" the OP refers to the order of complexity of the algorithm.

  • 1

    Ow is true, I quickly looked :-) confused, I thought he was asking for aperiodicity, in really real applications this is used in real-time, ie guitar tuners, guitar, piano, etc., are used to give the frequency of a note in real time using this principle... of course we use fixed blocks ie 1024, 2048 in size :-)

  • 1

    Thank you Eder (you work with something really cool!). And yes, I was thinking in terms of notation Grande-O.

  • 5

    "Big-O" is what it’s called in English? Sounds Funny. Could be Ozão :-) @Anthonyaccioly

  • 3

    Until Wikipedia has two articles on the subject in Portuguese (Big-O and Big-O), I think in these cases it is better to call of superior temporal asymptotic complexity (I know that this is call bovine cow quadruped ruminant, but sometimes it is inevitable hehehe).

Show 1 more comment

3


At first I considered applying the algorithm of Detection of Floyd cycles, also known as the "turtle and rabbit" algorithm, but the problem seems to be simpler than that.

Logarithmic complexity features usually involve a comparison that decreases over time. For example, the binary search, which divides the vector always in two.

The fact that it is squared could also be associated with comparisons between the elements of the vector, as occurs with the bubble ordering algorithm (Bubble Sort).

All in all, plus the definition of the problem:

S[K] = S[K+P] for all K, where 0 K < Q P

I thought of a simple algorithm that works basically as follows:

  1. Calculates the longest period by dividing the bit string size by 2, making P = T / 2, being T the bit string size.
  2. Calculates the number of strings to compare by Q = LEN / P, always rounding up

  3. For each K = 1..P and J = 2..K, compares each element in position K with the correspondent in position K * J + P, only when K * J + P <= T

    If you find a different element

    • If P > 1, decreases the period P and returns to the step 2 again
    • Otherwise, just keep going 4
  4. If P <= 0 returns -1, else, return P

Practical example

Given the example number 955, whose binary value is 1110111011, we can calculate the initial values of Steps 1 and 2 as follows:

T = 10
P = 10 / 2 = 5
Q = 10 / 5 = 2

Then the comparison in Step 3 occurs as follows:

K = 1: 1 1 1 0 1 | 1 1 0 1 1 (equal)

K = 2: 1 1 1 0 1 | 1 1 0 1 1 (equal)

K = 3: 1 1 1 0 1 | 1 1 0 1 1 (different)

When comparing the third element of each period, a different element was found. Then, updated the variables like this:

P = P - 1 = 4
Q = 10 / 3 = 4 (arredondando para cima)

And we continue the comparison:

K = 1: 1 1 1 0 | 1 1 1 0 | 1 1 (equal)

K = 2: 1 1 1 0 | 1 1 1 0 | 1 1 (equals)

K = 3: 1 1 1 0 | 1 1 1 0 | 1 1 (equal)

K = 4: 1 1 1 0 | 1 1 1 0 | 1 1 (iguals)

As there is no difference, then we return the value of P, that is 4.

Java implementation

public static int periodo( int n ) {

  // validação inicial: precisa ser positivo e com pelo menos duas casas binárias
  if( n < 2 ) return -1;

  // inicialização de variáveis
  String s = Integer.toBinaryString( n );
  byte[ ] bytes = s.getBytes();
  int t = bytes.length;

  // período inicial
  int p = t / 2;

  // verifica se o período se repete pelo menos uma vez
  boolean diferente;
  do {
    diferente = false;
    int qp = t / p; // quantidade de períodos
    if (t % p > 0) qp++; //somar 1 para comparar os caracteres de resto no final

    // verifica se a repetição ocorre
    for2: for( int k = 0 ; k < p ; k++ ) {
      for (int j = 1; j < qp; j++) {
        if( k + p * j < t && bytes[ k ] != bytes[ k + p * j ] ) {
          diferente = true;
          break for2;
        }
      }
    }

    //se não encontrou repetição, tenta um período menor
    if( diferente ) {
      p--;
    }
  } while( diferente && p > 0 );

  return p <= 0 ? -1 : p;
}

Some Considerations

I’m not very good at judging the complexity of an algorithm, but I believe in the worst case the above algorithm will do N / 2 iterations and, in each, it makes iterations in shorter periods, which leads to a logarithmic temporal complexity.

As for the special (memory) complexity, the algorithm is O(1) because it does not add new data structures beyond the original, although the implementation is O(N) because it creates an auxiliary vector of the same bit string size.

Something that leaves me with a flea behind the ear is that the problem suggested a special complexity also logarithmic, so maybe there is some data structure that can optimize the temporal complexity of the solution.

One solution I thought of with logarithmic time complexity is to split the bit string into several strings and then compare to see if they have the same content. However, I suspect that this solution would be much less efficient, since it would execute several substrings seems to me something unnecessary, besides which the comparison of Strings would also occur character to character.

Another idea is to recover the binary section of each period and compare the equivalent in its numerical format. Perhaps still with binary operations the algorithm could be very efficient. But I haven’t analyzed the complexity of converting each period from binary to decimal format.

Yet another idea to optimize this further would be to make all this comparison using binary operators to check the equality of periods, for example, using bit chains like this: 1000010000, 1000100010, 1001001001, 1010101010. It’s an idea.

Finally, I do not know if the solution I proposed meets the requirement of the question, but I hope to have helped at least with some ideas.

And, before concluding, just one warning: beware of language performance when participating in an online challenge. I have already solved some of these problems using Java and Python. Unfortunately, I often got one timeout even with a good algorithm. Some sites add speed multipliers, for example Java 1,5 and Python 2,5 to try to match the efficiency of the algorithm with faster languages like C++. However, this is not always fair and just look at the ranking to see which languages are always at the top.


Updating

Based on gist from Anthony Accioly, I made an attempt to further optimize code using bit-only manipulation. Follow the code:

public class BitPeriod5 {

    public int solution(int n) {

        // validação inicial: precisa ser positivo e com pelo menos duas casas binárias
        if (n < 2) {
            return -1;
        }

        // inicialização de variáveis
        final int t = 32 - Integer.numberOfLeadingZeros(n);

        // descarta bits não significantes
        final int s = n >>> Integer.numberOfTrailingZeros(n);

        // período inicial
        int p = 1;
        int maxp = t / 2;
        int vp = 0;

        // verifica se o período se repete pelo menos uma vez
        boolean diferente;
        do {
            diferente = false;
            int qp = t / p + 1;
            if (t != p * qp) {
                qp++; //somar 1 para comparar os caracteres de resto no final
            }

            vp = vp << 1 | 1;
            int vi = n & vp;
            for (int i = 1; i < qp; i++) {
                if (((n >> (i*p)) & vp) != (vi & vp)) {
                    diferente = true;
                }
            }

             //se não encontrou repetição, tenta um período menor
            if (diferente) {
                p++;
            }
        } while (diferente && p <= maxp);

        return p > maxp ? -1 : p;

    }

    public static void main(String[] args) {
        final BitPeriod5 bitPeriod = new BitPeriod5();
        for (int i = 0; i < 100; i++) {
            long nano = System.nanoTime();
            for (int j = 0; j < 100000; j++) {
                bitPeriod.solution(10);
                bitPeriod.solution(187);
                bitPeriod.solution(955);
                bitPeriod.solution(1000000000);
                bitPeriod.solution(0b10101000001010100000);
            }
            System.out.printf("%.2f\n", ((double) (System.nanoTime()) - nano) * 1e-6);
        }
    }

}

However, I noticed that this solution is slower than the previous one, even though it has a for unless.

Lesson 1

Manipulating bits is more "expensive" than accessing vector elements and comparing their value.

I also noticed that removing the module operation, the performance of the code greatly improves.

Changing the section:

if (t % qp > 0) ...

That’s why:

if (t != p * qp) ...

The performance gain was in the 25% range. It is very interesting to see how each operation most drastically affects the performance.

Lesson 2

The division module operation is very "expensive" from the performance point of view.

  • 1

    Thanks utluiz, your solution runs much faster than mine (even if the asymptotic complexity seems to be the same). As for the complexity of space log(N), I go by the mgibsonbr theory, I believe that log(N) =~ Q, or that you can use any extra structure to contain the bits of the string. I don’t know why, but I still get the impression that with a BitSet or even manipulations / logical bit operations would be possible to solve this problem more efficiently. Maybe part of the steps can be rewritten with binary shits and operation XOR / AND NOT.

  • 1

    And thanks for the tip from timeout also, I wish I could run this test again :).

  • 1

    I created a gist with its slightly modified algorithm (starting with p=1 and going up to p == t / 2, in order to return always as little time as possible). According to my flawed microbenchmark your solution can run 500,000 tests around 3 times faster than mine. I’ll still pursue the bit question, but for all intents and purposes your answer is correct.

  • @Very good anthonyaccioly. I had not attacked myself to the requirement of being the "least positive integer". :)

  • 1

    utluiz, to close the subject at once, updated the above Gist with a brute force version and a version of its algorithm using :) bit manipulation only. I also think on the line if (t % q > 0) qp++; you probably meant if (t % **qp** > 0) qp++; right?

  • Well, summary of the opera. Bitperiod4.java = Variation of its implementation only with bit manipulation, Bitperiod3.java = Implementation naive with bits. In terms of performance: Its implementation improves much only with bits (consumes half the time of the original version), but for some reason my algorithm described below (version naive that concatenating periods) was the best. The final version was approximately 6 times faster than the original microbenchmark :D. And that being said, I don’t want to brush bits anymore for a while hehehe.

  • @Anthonyaccioly Era if (t % q > 0) qp++; same. The previous division will truncate if there is a rest. This would mean that if the last period has fewer digits than the period size, it will not be checked. So, if there’s the rest of that room, add one. I would still make a division always rounding up, but I ended up doing it this way to not use additional rounding classes. At least it’s my understanding.

  • 1

    @Anthonyaccioly I just found out how expensive module operation and bit manipulation in java can be. I tried in a frustrated way to make a bigger optimization with bit manipulation, but nothing done. I will update the answer with this information.

  • 1

    Is that in your code there is no variable q (I think the size = t, period = p). About the performance, microbenchmarks are the devil... Maybe the conclusions are hardware dependent / OS / VM. On my machine (I7 @2.4, 12 Gigas, SSD, openSUSE 13.1 64 Bits, Java 8). Its original version runs 500k operations on an average of 89ms (change the module operation by a multiplication increases the time to ~= 120ms). Bitperiod5 ~= 90ms (and failure for cases like 955); Bitperiod4 ~= 55 ms and Bitperiod3 ~= 37ms (the implementation described in my reply).

  • @Anthonyaccioly I’m not with my note now, but the result is quite different from that with an I7, 8GB, Windows 7 and Java 7. :S

  • @Anthonyaccioly It really wasn’t q there, but the idea was to be the p. I wanted the rest of the division executed earlier.

  • if (t % p > 0)then right? Then put your result there (or send me by other means hehehe). I opened a can of worms with the microbenchmarks, but deep down my conclusion is that your algorithm actually performs less instructions than mine, if the version naive is being faster is due to the execution time of the same instructions (as well as prediction of branches and things like that), if we run this *microbenchmark on 30 different machines we will have 30 different conclusions hehehe.

Show 7 more comments

2

A week after the "30 minutes exercise" I finally managed to reach a solution using bit manipulation.

This algorithm continues to be brute force; but it achieved the best times as my tests (on average 6 times faster than the original algorithm).

The algorithm

  1. For every candidate for a term P amid 1 and Q / 2
    1. Boot K: P top order bits in N.
    2. Find S: a chain of Q bits using successive repetitions of K ("remains" should be considered).
    3. If S == N, the period is P
  2. If no period was found, returns -1.

Implementation

public int periodo(int n) {
    if (n <= 0) {
        throw new IllegalArgumentException("n precisa ser positivo");
    }
    int result = - 1;
    // tamanho da cadeia de bits
    final int q = 32 - Integer.numberOfLeadingZeros(n);
    // cada periodo entre p 1 e tamanho da cadeia / 2
    for (int p = 1; p <= q / 2 && result == -1; p++) {
        // primeiros p bits
        final int k = n >> q - p;
        if (n == constroiPeriodo(k, p, q)) {
            result = p;
        }
    }
    return result;
}

Where the method constroiPeriodo can be implemented as follows:

private int constroiPeriodo(int k, int p, int q) {
    int result = 0;
    // periodos inteiros
    for (int i = 0; i < q; i += p) {
        result = result << p | k;
    }
    // ultimo periodo parcial
    final int mod = q % p;
    if (mod != 0) {
        result = result >> p - mod;
    }

    return result;
}

P.S. While my adaptation of utluiz algorithm failed to achieve the performance of that implementation naive, his algorithm is much smarter and should be preferred.

Browser other questions tagged

You are not signed in. Login or sign up in order to post.