I think it would be easier to do a single split, separating the fields by |, and then concatenate what you need. But if you want to use regex, here goes.
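For comparison, the split approach could look something like the sketch below in Python. The sample line and its field values are made up for illustration, and it assumes the goal is to repeat the second field in place of the fourth.

```python
# Hypothetical sample line; the field values are invented for illustration.
line = "PRO|12345678|GASOLINE ADDITIVE|999|remaining data"

# Split on the separator at most 4 times, so everything after the
# fourth | stays together as a single last item.
fields = line.split("|", 4)

# Assumption: the second field should replace the fourth one.
fields[3] = fields[1]

result = "|".join(fields)
print(result)
```

Limiting the number of splits keeps the rest of the line untouched even if it contains more | characters.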
If your fields are always separated by | and always appear in this order, you can be more specific, stating exactly what you want and what you don’t want.
If you only want the lines that start with "PRO" and contain "GASOLINE ADDITIVE", you can put these texts in the regex explicitly. Otherwise, you can use [^|], which means "any character that is not |".
Using the dot (.) is too broad, as it means "any character" (by default, anything except a line break), including the separator itself. By explicitly using | for the field separator and [^|] for "anything other than the separator", the regex becomes more specific to your case.
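To see the difference in practice, here is a small Python sketch (the sample line is hypothetical). It captures what comes after "PRO|" once with the dot and once with the negated character class.

```python
import re

# Hypothetical sample line; the field values are invented for illustration.
line = "PRO|12345678|GASOLINE ADDITIVE|999|remaining data"

# Greedy dot: ".+" also consumes separators, so the capture runs up to the last |.
dot_match = re.match(r"PRO\|(.+)\|", line)

# Negated class: "[^|]+" cannot cross a separator, so it stops at the first |.
neg_match = re.match(r"PRO\|([^|]+)\|", line)

print(dot_match.group(1))  # spans several fields
print(neg_match.group(1))  # only the second field
```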
Another detail is deciding whether to use + instead of *. That’s because * means "zero or more occurrences", so an empty field is also valid, whereas + means "one or more occurrences", so the field cannot be empty.
The same goes for numbers, because \d* accepts an empty field. Use \d+, which requires at least one digit. Or, if you know the exact quantity, use for example \d{8} for exactly 8 digits, \d{8,} for "8 digits or more", or \d{8,20} for "between 8 and 20 digits". Choose the one that best fits your use case and adapt the quantities as needed.
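The quantifiers can be checked in isolation with a quick Python sketch. The helper name matches is just for illustration; it uses re.fullmatch so the whole string has to match the pattern.

```python
import re

def matches(pattern: str, text: str) -> bool:
    """Return True if the entire text matches the pattern."""
    return re.fullmatch(pattern, text) is not None

checks = {
    "[^|]* on empty field": matches(r"[^|]*", ""),           # * accepts nothing
    "[^|]+ on empty field": matches(r"[^|]+", ""),           # + needs one char
    "\\d* on empty field": matches(r"\d*", ""),
    "\\d+ on empty field": matches(r"\d+", ""),
    "\\d{8} on 8 digits": matches(r"\d{8}", "12345678"),
    "\\d{8} on 7 digits": matches(r"\d{8}", "1234567"),
    "\\d{8,} on 9 digits": matches(r"\d{8,}", "123456789"),
    "\\d{8,20} on 21 digits": matches(r"\d{8,20}", "1" * 21),
}

for description, ok in checks.items():
    print(f"{description}: {ok}")
```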
Anyway, a regex option would be:
^PRO\|\d+\|[^|]+\|\d+\|.*$
Note that the | must be escaped and written as \|, since a bare | means alternation (i.e., PRO|\d+ means "PRO" or digits). Inside brackets, as in [^|], no escape is needed. With this we have:
^PRO\|: begins with "PRO", followed by |
\d+\|: one or more digits, followed by |
[^|]+\|: one or more characters that are not |, followed by |
\d+\|: one or more digits, followed by |
.*$: zero or more characters, until the end of the string ($)
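As a rough check of the breakdown above, the sketch below runs the regex in Python against a few hypothetical lines (the field values are made up), and also shows what happens when the | is left unescaped.

```python
import re

pattern = re.compile(r"^PRO\|\d+\|[^|]+\|\d+\|.*$")

# Hypothetical sample lines; the field values are invented for illustration.
good = "PRO|12345678|GASOLINE ADDITIVE|999|remaining data"
empty_name = "PRO|12345678||999|remaining data"            # third field is empty
other_prefix = "ABC|12345678|GASOLINE ADDITIVE|999|remaining data"

results = {
    "good": pattern.match(good) is not None,
    "empty_name": pattern.match(empty_name) is not None,    # rejected by [^|]+
    "other_prefix": pattern.match(other_prefix) is not None,
}
print(results)

# Unescaped | is alternation: "PRO" OR one or more digits.
unescaped = re.fullmatch(r"PRO|\d+", "12345") is not None
# Escaped \| is a literal separator: "PRO|" followed by digits.
escaped = re.fullmatch(r"PRO\|\d+", "PRO|12345") is not None
print(unescaped, escaped)
```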
How to do the substitution depends on the language you are using, as each has its own functions for replacing strings with regex.
Anyway, this is usually done by using parentheses to group the parts you want to capture, so the regex would look like this:
^(PRO\|)(\d+\|)([^|]+\|)\d+\|(.*)$
The first pair of parentheses is (PRO\|), so this is the first group; the second pair is (\d+\|) (the digits plus the |), so this is the second group, and so on. The fourth field, \d+\|, is not wrapped in parentheses, so it is matched but not captured.
In the replacement, you use the syntax $1 to refer to the first group, $2 for the second, and so on. Depending on the language/engine, the syntax is \1, \2, etc. instead. The replacement string would therefore be $1$2$3$2$4 (group 2 is repeated in place of the fourth field). See an example here.
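Since the language wasn't specified, here is a minimal sketch of the substitution in Python, which is one of the engines that uses \1, \2 rather than $1, $2 in the replacement. The sample line is hypothetical.

```python
import re

# Groups 1-3 capture the first three fields (with their trailing |),
# the fourth field is matched but not captured, group 4 is the rest.
pattern = re.compile(r"^(PRO\|)(\d+\|)([^|]+\|)\d+\|(.*)$")

# Hypothetical sample line; the field values are invented for illustration.
line = "PRO|12345678|GASOLINE ADDITIVE|999|remaining data"

# Python's re.sub uses \1, \2, ... for group references.
result = pattern.sub(r"\1\2\3\2\4", line)
print(result)

# A line that does not match the pattern is returned unchanged.
unchanged = pattern.sub(r"\1\2\3\2\4", "ABC|1|X|2|Y")
print(unchanged)
```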
Will you always have "PRO" and "GASOLINE ADDITIVE", or can there be other texts? Which language are you using?
– hkotsubo
There will be other texts. PRO is the product name and will always be PRO. GASOLINE ADDITIVE is just an example; it could be LUB SELENIA API SN15W40 SEMI-SINT1L
– Marcos Correia