Regular expression to reorder a string

I have the result of a query with several joins, and one row contains the concatenation of 2 groups. The row gives me a string like: 1,4|a1,b4.

What I need is to regroup ID and value as follows: array( 1 => a1 , 4 => b4 ), or something similar, but without the clutter of using explode and then recombining the array.

I thought I’d use preg_replace, but I couldn’t get a working rule.

  • How many digits can the numbers have? Do they start at one or zero? Can a zero occur at the start? Are a1 and b4 random values, or will they always be a, b, c, etc. followed by the same numbers that occurred earlier?

  • @Pablo, there are no zeros, and the values are random.

  • I would suggest taking a look at the source code that implements regex, to get an idea of its complexity. "Cluttering with explode and recombining the array" is trivial ("fichinha") compared to that. Now, if you are going to run the same regex a very large number of times in the same script, maybe precompiling it would bring some advantage.

  • @Bacco, yeah. It can be done with explode and recombining the arrays, but it is "ugly". I don’t much like repeating the function like that.

  • @Papacharlie I understand, you’re thinking about code aesthetics, not processing. I think it is valid to use the style that gives you less maintenance and better readability (and this really is very personal). I often opt for readability as well, in parts of the code that do not significantly affect performance. I just thought it was worth commenting, because many people have no idea how complex regex is "under the hood".

  • Unfortunately I didn’t even get to try it with regex, but I do have a solution that, let’s say, works.

  • I split the input string and regrouped the members one by one. Given your requirement ("but without clutter using explode and recombine the array"), it’s up to you whether to accept the solution I propose.

  • @Bacco, I know regexes have to be used with 'care', but the doubt here was whether one regex pays off compared to 3 explodes plus reorganizing the array. With explode already done and working, I just wanted a more practical 'one-liner' :)

  • @Papacharlie yes, it took two lines to write the cluttered version, but I still suspect that it won’t be simple with regex alone either. Let’s see what alternatives come up; something ingenious might turn up.

  • @Bacco I think a regex for this will have a lot of diabolical conditionals. I will follow Occam’s razor for now...

  • @Papacharlie I will post the explode version and another alternative without regex, more as a reference for other users. Anyway, let’s wait for the regex solutions.

  • @Bacco I didn’t understand the "fichinha" (trivial) part. A compiled regular expression is O(n). That’s pretty fast.

  • @Pablo O(n) is a measure of complexity, not speed. sleep( 10000 ) is O(n) too. And as you rightly said, once compiled it is really fast, which is why I said that if you are going to use it multiple times in the same script, it might have some advantage. Just remember that in PHP it will be compiled again on each access.

  • @Pablo complementing what Bacco said, there are O(N) algorithms that take nanoseconds and O(N) algorithms that take years. It depends on the value of N and the time needed to process each item. In terms of performance, any solution here will be fast.

  • @Bacco The same goes for the code posted in the answer. All of it will be compiled every time. And measuring complexity serves precisely to analyze speed. If you know a more suitable measure for analyzing algorithms without running them (which is all the OP seems willing to do), please introduce me to it.

  • @Pablo don’t forget that calling a function in PHP means running all the code behind it. It’s no use applying Big-O only to that layer and thinking you did your homework. Another thing: about "measuring complexity serves to analyze speed", it would be worth reviewing the theory a little better, because that is not the concept, nor is it that simple. I hope you don’t take it as criticism, but as an incentive to rethink the whole thing. Perhaps even reread the comments to separate what was actually said from what you understood from it. As for performance, it doesn’t seem to be the focus of the question.

  • @Bacco I am always willing to learn. So I’m asking whether you know a better way to estimate performance analytically without resorting to asymptotic complexity. I agree with you on the issue of function calls, but I don’t see how you escape function calls using the explode method. You’re dismissing my intuitive analysis of the problem simply by saying it’s not quite so, but then I’d like to know how it actually is.

  • @Pablo I’m not dismissing it, but questioning the method applied. Compare the workflow of an explode with the compilation of a regex; you wouldn’t even need Big-O to see that the problems are different. Anyway, better than an opinion would be for you to look at the PHP sources and the libraries involved. You might discover something interesting. If you find that regex isn’t that complex, comment here. I’m open to learning as well. What I don’t feel like doing is debating here.

3 answers

11


Alternate version, without Regex

Not as an answer to the main problem, but as an alternative for other users who need to parse strings in this format, here is a solution without regex:

$in   = '1,4|a1,b4';

$pair = explode('|',$in);
$out  = array_combine(explode(',',$pair[0]),explode(',',$pair[1]));

See it working on IDEONE.
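A slightly more defensive variant of the same idea, as a sketch only: the function name parse_pairs is hypothetical (it is not part of the answer), and it assumes the input should contain exactly one | and the same number of items on both sides, returning null otherwise.

```php
<?php
// Hypothetical helper (not from the original answer): the same
// explode + array_combine approach, with basic shape checks added.
function parse_pairs(string $in): ?array
{
    $parts = explode('|', $in);
    if (count($parts) !== 2) {
        return null; // expected exactly one "|" separator
    }

    $keys   = explode(',', $parts[0]);
    $values = explode(',', $parts[1]);
    if (count($keys) !== count($values)) {
        return null; // each key needs exactly one value
    }

    return array_combine($keys, $values);
}

print_r(parse_pairs('1,4|a1,b4'));
```

The checks matter because array_combine complains when the two arrays differ in length, and $pair[1] does not exist at all when the | is missing.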


As bfavaretto posted in the comments, here is an alternative that rebuilds the data in a single line:

$s='1,4|a1,b4';
$o=call_user_func_array("array_combine", array_chunk(str_word_count($s,1,'1234567890'),2));
print_r( $o );

I included this variant because I found it interesting as a showcase for some lesser-known PHP functions. For normal use, explode gets straight to the point.

See it working on IDEONE.
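To see what the one-liner does, here is the same pipeline unrolled step by step. One detail is an adaptation of mine, not part of the original: the chunk size is computed as half the number of items, because the fixed size of 2 only fits inputs with exactly two keys and two values.

```php
<?php
$s = '1,4|a1,b4';

// Format 1 returns the "words" as an array; the third argument makes
// digits count as word characters, so "," and "|" act as separators.
$words = str_word_count($s, 1, '1234567890');
// $words is ['1', '4', 'a1', 'b4']

// Split the list in two halves: first the keys, then the values.
// (The original uses a fixed chunk size of 2; half the count is an
// assumption added here so that longer inputs still yield two chunks.)
$chunks = array_chunk($words, intdiv(count($words), 2));

// Equivalent to array_combine($chunks[0], $chunks[1]).
$o = array_combine(...$chunks);
print_r($o);
```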

2

"without the clutter of using explode and recombining the array"

I’m sorry to say, but with regex it doesn’t get simpler. In fact, I would say it gets even more "cluttered", and there’s no escape from recombining the array.

I recommend you continue using explode, as suggested in another answer. But just to show how simple it wouldn’t be:

// testing with more keys and values
$in = '1,4,6,9|a1,b4,c6,d9';
if (preg_match_all('/[^|,]+/', $in, $matches)) {
    $len = count($matches[0]) / 2;
    $out = array_combine(array_slice($matches[0], 0, $len), array_slice($matches[0], $len));
    print_r($out);
}

That is, I’m matching one or more occurrences (+) of anything that is neither a | nor a comma ([^|,]). The problem is that this regex does not check the format and simply accepts anything between the separators.

With this, the array $matches contains a single array that has all keys and values. If I did print_r($matches);, the result would be:

Array
(
    [0] => Array
        (
            [0] => 1
            [1] => 4
            [2] => 6
            [3] => 9
            [4] => a1
            [5] => b4
            [6] => c6
            [7] => d9
        )

)

So I need to recombine this array: the first half of the elements are the keys, and the second half are the respective values. That’s what the array_combine above is doing. The end result is:

Array
(
    [1] => a1
    [4] => b4
    [6] => c6
    [9] => d9
)

One problem is that this regex does not check the format: the string could have several | separating the fields (and no commas), or any other combination. The explode solution, on the other hand, at least "guarantees" that there must be a | separating the fields.

Of course, if the input is controlled and you are sure that the format is always correct, this is not much of a problem. But the caveat stands.
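To make the caveat concrete, a small sketch comparing a well-formed string with a malformed one (the second input is made up for illustration):

```php
<?php
// Both inputs yield the same tokens, because the pattern only splits on
// "|" and "," without checking which separator appears where.
preg_match_all('/[^|,]+/', '1,4|a1,b4', $good);
preg_match_all('/[^|,]+/', '1|4|a1|b4', $bad); // malformed: no commas, three "|"

var_dump($good[0] === $bad[0]); // bool(true)
```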


But if you want the regex to be "stricter" and only accept the format indicated, it would have to be something like:

if (preg_match_all('/(?<=^|,)\d+|(?<=[|,])[a-z]\d/', $in, $matches)) {
    // the rest of the code is the same
}

I’m assuming that keys are always digits (\d+: one or more digits) and the values are always a letter followed by a digit ([a-z]\d). If it’s not, just adjust it (for example, if it’s multiple letters followed by multiple digits, switch to [a-z]+\d+).

I also use lookbehinds, which check whether something occurs before a given position. Here, before the numbers there is the lookbehind (?<=^|,), which checks whether they are preceded by the start of the string (^) or a comma. That is, it picks up the numbers in the first part (before the |). And before the values (a1, b4, etc.) there is the lookbehind (?<=[|,]), which checks whether they are preceded by a | or a comma.

The $matches array will have the same format as in the previous code, with all the keys followed by all the values.

Of course, this is an option if you want to enforce a somewhat stricter check. But if you "know" that the format of the input string is always valid, just take everything that is between the separators (and then, in my opinion, it is still better to use explode).
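Note that even the stricter pattern only filters tokens: a piece that does not fit is skipped, and the string as a whole is not rejected. If rejecting bad input matters, one option is a separate anchored check. The sketch below assumes the same shape as above (digit keys, values made of one letter plus one digit); the $whole pattern is my own illustration, not part of the answer.

```php
<?php
// Assumed shape: digit keys, then "|", then values made of one
// lowercase letter followed by one digit.
$whole = '/\A\d+(?:,\d+)*\|[a-z]\d(?:,[a-z]\d)*\z/';

$in = '1,x|a1,b4'; // "x" is not a valid key

// The token-level pattern silently skips the bad piece...
preg_match_all('/(?<=^|,)\d+|(?<=[|,])[a-z]\d/', $in, $matches);
print_r($matches[0]); // 1, a1, b4

// ...while the anchored pattern rejects the string as a whole.
var_dump(preg_match($whole, $in));         // int(0)
var_dump(preg_match($whole, '1,4|a1,b4')); // int(1)
```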

Another alternative is to use capture groups for keys and values:

if (preg_match_all('/(?<=^|,)(\d+)|(?<=[|,])([a-z]\d)/', $in, $matches)) {
    $out = array_combine(array_filter($matches[1]), array_filter($matches[2]));
    print_r($out);
}

Now the keys and values are in parentheses, which creates capture groups. In this case, the keys will be in $matches[1] and the values in $matches[2]. But since only one of the groups is filled in each match, these arrays contain several empty strings (do a print_r($matches) to check), hence the need for array_filter to eliminate them.
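A short sketch of what the two groups look like for the original two-pair input. The array_values calls are only there to renumber the filtered arrays for display; array_combine does not need them.

```php
<?php
$in = '1,4|a1,b4';
preg_match_all('/(?<=^|,)(\d+)|(?<=[|,])([a-z]\d)/', $in, $matches);

// Group 1 is filled only on key matches, group 2 only on value matches;
// the other group is an empty string in each match.
print_r($matches[1]); // '1', '4', '', ''
print_r($matches[2]); // '', '', 'a1', 'b4'

// array_filter drops the empty strings. It would also drop a "0" key,
// which the question rules out.
$keys   = array_values(array_filter($matches[1]));
$values = array_values(array_filter($matches[2]));
print_r(array_combine($keys, $values));
```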


Anyway, although it is possible, did it get "better"? Compare with the solution in the other answer, which in my opinion is better in several respects: not only shorter (which should not be the main criterion) but also, in my opinion, clearer, simpler, and easier to understand and maintain, not to mention that it is probably more efficient (these should be the main criteria).

Of course, I haven’t tested the performance, but my suspicion is that explode is indeed more efficient, because regex hides enormous complexity (such as the fact that it needs to be compiled, not to mention the whole machinery behind the engine; see here how it is far from simple). Compare with the implementation of explode to see how simple it is.
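For anyone who wants to check that suspicion, a rough timing sketch follows. The function names with_explode and with_regex and the iteration count are arbitrary choices of mine, and the numbers will vary with the PHP version and the machine, so no result is claimed here. It assumes PHP 7.3 or later for hrtime.

```php
<?php
$in = '1,4,6,9|a1,b4,c6,d9';

// The explode version from the other answer.
function with_explode(string $in): array
{
    $pair = explode('|', $in);
    return array_combine(explode(',', $pair[0]), explode(',', $pair[1]));
}

// The first regex version from this answer.
function with_regex(string $in): array
{
    preg_match_all('/[^|,]+/', $in, $m);
    $len = intdiv(count($m[0]), 2);
    return array_combine(array_slice($m[0], 0, $len), array_slice($m[0], $len));
}

foreach (['with_explode', 'with_regex'] as $fn) {
    $start = hrtime(true); // nanoseconds from a monotonic clock
    for ($i = 0; $i < 100000; $i++) {
        $fn($in);
    }
    printf("%s: %.1f ms\n", $fn, (hrtime(true) - $start) / 1e6);
}
```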

Remember that shorter code is not necessarily better (and worse, in this case it did not even get shorter).

0

I believe this expression captures what you want. Just set up the capture groups to extract the data you want.

([1-9][0-9]*,)*[1-9][0-9]*\|[a-z]*(,[a-z]+)*

An important detail is that there is no way, using a regular expression, to ensure that the left side has the same number of items as the right side. That would take at least a context-free grammar.
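In practice the count comparison can simply be done outside the regex. A minimal sketch, with two assumptions of mine: the function name same_item_count is hypothetical, and the values are taken to be one letter followed by digits (as in the question's a1 and b4), which differs from the expression above.

```php
<?php
// Hypothetical helper: the regex validates the overall shape, and plain
// code compares the number of items on each side.
function same_item_count(string $in): bool
{
    // Assumed shape: integers without leading zeros, "|", then values
    // made of one lowercase letter followed by digits.
    $pattern = '/\A((?:[1-9][0-9]*,)*[1-9][0-9]*)\|([a-z][0-9]+(?:,[a-z][0-9]+)*)\z/';
    if (!preg_match($pattern, $in, $m)) {
        return false;
    }
    // n items are separated by n - 1 commas, so equal comma counts
    // mean equal item counts.
    return substr_count($m[1], ',') === substr_count($m[2], ',');
}

var_dump(same_item_count('1,4|a1,b4')); // bool(true)
var_dump(same_item_count('1,4|a1'));    // bool(false)
```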
