Recursive binary string search in C

Asked

Viewed 658 times

-4

The exercise statement asks me to look for a string in an ordered array of strings, using binary search (compare the string with the element in the middle of the array and then compare the string with an element on the left or right). It also asks to make one code recursively and another iteratively. Then the code:

int procuraBinRec( char *p, char *ps[], int N, int offset ) 
{
    if( N < 1 ) return 0;
    int xN = N / 2;
    int x = strcmp( p,ps[ xN ] );
    if( x == 0 ) return offset + xN + 1;
    char* ps1[ xN ];

    for( int i = 0; i < xN; i++ ) 
    {
        ps1[ i ] = malloc( MAX_LINE + 1 );
        strcpy( ps1[ i ], ps[ i ] );
    }

    char* ps2[N - xN];

    for( int i = 0; i < N - xN - 1; i++ ) {
        ps2[ i ] = malloc( MAX_LINE + 1 );
        strcpy( ps2[ i ], ps[ i + xN + 1 ] );
    }
    if( x < 0 ) return procuraBinRec( p, ps1, xN, offset );
    else return procuraBinRec( p, ps2, N - xN - 1, offset + xN + 1 );
}

I found this resolution of the exercise but I can’t understand:

  • What is the offset.
  • What the cycles for do.
  • The MAX_LINE is already defined earlier.

1 answer

3

This recursive binary search function has several problems and some atypities.

The function unnecessarily creates two subvectors ps1 e ps2 on each run, using the loops for to copy the data from the original vector to these subvectors (which answers your first question). These subvectors were created to be passed on in recursive calls, but this is unnecessary since one can continue searching in the original vector. There is also a large memory leak, since this bunch of vectors created at each recursive call is never out of focus. Finally, we must eliminate loops for and use the original vector itself.

Usually in a binary search for p, one tries to discover j such that v[j-1] < p <= v[j]. That is, even if you do not find the index of the exact value, you find the index of the nearest greater value. In the case of its function, the intention is different, because it returns 0 when you are not p.

Vector indices in C range from 0..N-1, but in its function the returned value is not the position of the vector in the C allotment, but rather the position in the vector with indexes of 1..N.

At each recursive call the function will act on a subvector, the lower half or the upper half of the position N/2. When acting on the subvector in the upper half, it is necessary to know which position that subvector started on the original vector. This is the offset, i.e., the dislocation of the subvector to be processed (which answers your second question).

I hope I have clarified your doubts. I suggest then an optimized and quite commented version of the function:

/* --- Função de busca binária ---
Parâmetros:
    p      -> elemento buscado
    ps     -> subvetor ordenado de busca
    N      -> quantidade de elementos no subvetor
    offset -> deslocamento do subvetor no vetor original
Retorno:
    se elemento não encontrado -> 0
    se elemento encontrado     -> índice no vetor entre 1..N */

int procuraBinRec(char *p, char *ps[], int N, int offset)
{
    // retorna 0 se tamanho do vetor é zero elementos (elemento p não encontrado)
    if(N < 1) return 0;

    // índice para divisão binária do vetor
    int xN = N / 2;

    // 0 se p == ps[xN], negativo se p < ps[xN], positivo se p > ps[xN]
    int x = strcmp(p, ps[xN]);

    // se p é exatamente o elemento do meio (p == ps[xN]), então retorna:
    // = o deslocamento do subvetor dentro do vetor original (offset)
    // + intervalo até o meio do subvetor (xN) +
    // + 1 porque se quer que os índices na resposta sejam de 1 a N
    if(x == 0) return offset + xN + 1;

    // se p < ps[xN] processa subvetor da metade inferior
    else if( x < 0 ) return procuraBinRec(p, ps, xN, offset);

    // se p > ps[xN] processa subvetor da metade superior
    else return procuraBinRec(p, ps + xN + 1, N - xN - 1, offset + xN + 1);
}

The performance of the two functions, if compared, will present a big difference in the running time and memory consumption. We can do some simple test to verify this statement.

During all the tests I used the following parameters:

  • MAX_LINE = 80 characters
  • N = 50,000,000 elements
  • p = the first vector string (bad case)

I used a code similar to the one outlined below to test the execution time of the functions:

#include <time.h>
...
clock_t t;
t = clock();
int x = procuraBinRec(p, ps, N, 0);
t = clock() - t;
printf("tempo = %.5lf s\n", ((double) t)/CLOCKS_PER_SEC);

On my computer, no swap occurs, compiling with the gcc -O0, I obtained the following results:

Função da pergunta = 5.76911 s
Função da resposta = 0.00011 s

The memory consumption difference between the two functions was also large. Making measurements using the program time:

$ command time -f MaxRSS=%M ./bb

I got the results:

Função da pergunta: MaxRSS = 9.962 MB
Função da resposta: MaxRSS =   197 MB

Another way I used to measure memory consumption was with the program valgrind. In case, I used the massif-visualizer to generate the following charts:

valgrind --tool=massif ./bb
massif-visualizer massif.out.pid

Função da Pergunta

Função da Resposta

  • 1

    very good answer! very good code!

  • Thanks @zentrunix, I added at the end of the reply a part about comparing the performance of the functions.

Browser other questions tagged

You are not signed in. Login or sign up in order to post.