Identifying common snippets in two PHP strings

Viewed 2,028 times

7

I need to compare non-standard strings in PHP. I have two strings, as shown below:

$primeira = 'asdasdasdTESTEasdasdasdasd';

$segunda = 'lkijlikjTESTEilkjik';

How can I determine dynamically whether the first and second variables contain the same sequence of characters? In this example, the shared sequence is "TESTE".

4 answers

7


I created a function that compares segments of the two strings and returns the common substrings in an array:

function palavras_iguais($string1, $string2, $minlen = 5) {
    $strlen1 = strlen($string1);
    $palavras = array();
    // Stop once fewer than $minlen characters remain, so short tails never match.
    for ($i = 0; $i <= $strlen1 - $minlen; $i++) {
        $palavra = substr($string1, $i, $minlen);
        if (strpos($string2, $palavra) !== false) {
            // Grow the match one character at a time while it still occurs in $string2.
            $j = $minlen;
            while ($i + $j < $strlen1 && strpos($string2, substr($string1, $i, $j + 1)) !== false) {
                $j++;
            }
            $palavra = substr($string1, $i, $j);
            // Skip past the matched segment.
            $i += $j - 1;
            $palavras[] = $palavra;
        }
    }
    return $palavras;
}

Test 1:

$primeira = 'asdasdasdTESTEasdasdasdasd';
$segunda = 'lkijlikjTESTEilkjik';

print_r( palavras_iguais($primeira, $segunda) );

// Output:

Array
(
    [0] => TESTE
)

Test 2:

$primeira = 'asdFINALasdasdTESTEaTESTE2sdasdasdasdTESTENOFINAL';
$segunda = 'lkiTESTE2jlikjTESTEilkjTESikTESTENOFINALjhfdgkFINAL';

print_r( palavras_iguais($primeira, $segunda) );

// Output:

Array
(
    [0] => FINAL
    [1] => TESTE
    [2] => TESTE2
    [3] => TESTENOFINAL
)

Test 3:

$primeira = 'asdaTSCsdasdTESTEasdasdasdasd';
$segunda = 'lkijlikjTESTEilkjTSCik';

print_r( palavras_iguais($primeira, $segunda, 3) );

// Output:

Array
(
    [0] => TSC
    [1] => TESTE
)

4

I tried an approach that is a little different from the others. I wanted to avoid nested loops, but I have not tested whether this actually improves performance. It works like this:

  • Creates an array of character groups from the first string. For example, with $minlen=2, the string "abcde" is divided into ["ab", "bc", "cd", "de"].
  • Checks whether each group occurs in the second string. Consecutive groups that match are merged into a single word (for example, if the second string contains "abc", the first two pairs match in sequence and yield "abc").
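As a minimal sketch of the first step in isolation, the helper below builds the overlapping groups for a given $minlen. The name slidingGroups is hypothetical and used only for illustration; it is not part of the answer's function.

```php
<?php
// Hypothetical helper: split $str into overlapping groups of $minlen characters.
function slidingGroups(string $str, int $minlen = 2): array {
    $grupos = [];
    // The last full group starts at strlen($str) - $minlen.
    for ($i = 0; $i <= strlen($str) - $minlen; $i++) {
        $grupos[] = substr($str, $i, $minlen);
    }
    return $grupos;
}

print_r(slidingGroups('abcde', 2)); // groups: ab, bc, cd, de
```

A string shorter than $minlen produces no groups at all, so it can never contribute a match.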

I think it’s easier to understand in code form:

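function matchingSubstrings($str1, $str2, $minlen=2) {
    // Build every group of exactly $minlen characters from $str1.
    $grupos = [];
    for($i=0; $i<=strlen($str1)-$minlen; $i++) {
        array_push($grupos, substr($str1, $i, $minlen));
    }

    $palavras = [];
    $temp = '';

    foreach($grupos as $grupo) {
        // Extend the current word by the group's last character, or start a new word.
        $candidato = $temp === '' ? $grupo : $temp . substr($grupo, -1);
        if(strpos($str2, $candidato) !== false) {
            $temp = $candidato;
            continue;
        }
        // The extended word does not occur in $str2: keep what was collected so far.
        if($temp !== '') array_push($palavras, $temp);
        // The group on its own may still start a new word.
        $temp = strpos($str2, $grupo) !== false ? $grupo : '';
    }
    // Keep a word that runs to the end of $str1.
    if($temp !== '') array_push($palavras, $temp);

    return $palavras;
}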

A test with repetitions:

matchingSubstrings('nnnabcnnnabcnnn', 'kkkabckkkabc');

Return:

Array
(
    [0] => abc
    [1] => abc
)

If duplicates are not wanted in the result, just replace the last line of the function with return array_unique($palavras);.
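One detail worth keeping in mind: array_unique preserves the original keys, so the result can have gaps in its indexes. The sketch below uses a literal array as a stand-in for the function's result (an assumption made only for illustration) and reindexes it with array_values.

```php
<?php
// Stand-in for the array built by the function above.
$palavras = ['abc', 'abc', 'xy'];

// array_unique keeps the first occurrence of each value and preserves keys,
// so 'xy' stays at index 2.
$unicas = array_unique($palavras);

// array_values reindexes the result starting from 0.
$reindexadas = array_values($unicas);

print_r($unicas);
print_r($reindexadas);
```

Reindexing only matters if the caller relies on consecutive numeric indexes, for example when looping with a counter.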

This function also worked with the tests from jader’s answer (the output was identical).

Demo on ideone

3

I don’t think there is a native PHP function that does this.

I found a solution on Google that does what you need.

function longest_common_substring($words)
{
    $words = array_map('strtolower', array_map('trim', $words));
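    // create_function() was removed in PHP 8.0, so a closure is used instead.
    // Shorter strings sort first; ties are broken alphabetically.
    $sort_by_strlen = function ($a, $b) {
        if (strlen($a) == strlen($b)) { return strcmp($a, $b); }
        return (strlen($a) < strlen($b)) ? -1 : 1;
    };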
    usort($words, $sort_by_strlen);

    // We have to assume that each string has something in common with the first
    // string (post sort), we just need to figure out what the longest common
    // string is. If any string DOES NOT have something in common with the first
    // string, the result is an empty string.
    $longest_common_substring = array();
    $shortest_string = str_split(array_shift($words));
    while (sizeof($shortest_string)) {
        array_unshift($longest_common_substring, '');
        foreach ($shortest_string as $ci => $char) {
            foreach ($words as $wi => $word) {
                if (strpos($word, $longest_common_substring[0] . $char) === false) {
                    // No match
                    break 2;
                }
            }

            // we found the current char in each word, so add it to the first longest_common_substring element,
            // then start checking again using the next char as well
            $longest_common_substring[0].= $char;
        }

        // We've finished looping through the entire shortest_string.
        // Remove the first char and start all over. Do this until there are no more
        // chars to search on.
        array_shift($shortest_string);
    }

    // If we made it here then we've run through everything
    usort($longest_common_substring, $sort_by_strlen);
    return array_pop($longest_common_substring);
}

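This solution returns the longest substring common to all strings in an array. Note that it lowercases the input first, so the comparison is case-insensitive and the result comes back in lowercase.

Using it is very simple: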

$primeira = 'asdasdasdTEStEasdasdasdasd';
$segunda = 'lkijlikjTESTEilkjik';

echo longest_common_substring([$primeira, $segunda]);
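If only two strings are compared and case should be respected, a dynamic-programming version is one possible alternative. This is a sketch of a different technique, not the function quoted above; the name lcsDynamic is hypothetical.

```php
<?php
// Hypothetical alternative: longest common substring of two strings using
// dynamic programming. Case-sensitive, unlike the function quoted above.
function lcsDynamic(string $a, string $b): string {
    $lenA = strlen($a);
    $lenB = strlen($b);
    $prev = array_fill(0, $lenB + 1, 0); // match lengths for the previous row
    $best = 0; // length of the longest match found so far
    $endA = 0; // position in $a just past that match

    for ($i = 1; $i <= $lenA; $i++) {
        $curr = array_fill(0, $lenB + 1, 0);
        for ($j = 1; $j <= $lenB; $j++) {
            if ($a[$i - 1] === $b[$j - 1]) {
                // Extend the common run that ended at the previous characters.
                $curr[$j] = $prev[$j - 1] + 1;
                if ($curr[$j] > $best) {
                    $best = $curr[$j];
                    $endA = $i;
                }
            }
        }
        $prev = $curr;
    }
    return substr($a, $endA - $best, $best);
}

echo lcsDynamic('asdasdasdTESTEasdasdasdasd', 'lkijlikjTESTEilkjik'); // TESTE
```

When several common substrings share the maximum length, this sketch returns the one that ends first in the first argument.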

2

You can use the strpos function. For more details, click here.

if (strpos($primeira, $segunda) !== false)
    echo 'true';
  • 1

    Rereading the question, I don’t think this solves the problem.

  • 1

    This function does not meet my need, because I need to check whether the two variables share an identical part, represented here by "TESTE", regardless of the characters that come before and after "TESTE".

  • 1

    This is not a criticism of the answer, but rather of those who voted without reading the question: I am surprised that two people voted in favour, since it does not go anywhere close to answering what was asked.
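The point raised in these comments can be checked directly with the strings from the question. This is a small sketch: strpos only helps when the shared part is already known in advance.

```php
<?php
$primeira = 'asdasdasdTESTEasdasdasdasd';
$segunda = 'lkijlikjTESTEilkjik';

// strpos looks for the whole of $segunda inside $primeira, which is not there.
var_dump(strpos($primeira, $segunda)); // bool(false)

// It only succeeds when the shared part is already known.
var_dump(strpos($primeira, 'TESTE')); // int(9)
var_dump(strpos($segunda, 'TESTE'));  // int(8)
```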
