REGEX - find expressions that DO NOT contain specific words

8

We are using regex to normalize pharmaceutical data stored in a string field, and we need an exclusion rule to tell very similar strings apart.

For example, in a very simple way, we have the following records:

0,5 MG COM CT BL AL/AL X 30 ----> COM = Simple Tablet

0,4 MG COM REV CT BL AL X 90 ----> COM REV = Coated Tablet

0,7 MG COM LIB PROL CT BL AL X 30 ----> COM LIB PROL = Extended-Release Tablet

To identify a Coated Tablet, we use the syntax: COM\sREV\s

To identify an Extended-Release Tablet, we use the syntax: COM\sLIB\sPROL\s

In this simplified example we need to identify a Simple Tablet, and for that we need to look for an expression that contains COM but does not contain the whole words REV or LIB. Something like the syntax:

COM\s[^(REV|LIB)]

... but that syntax didn't work. Can someone help?

EDITED

REV will not always come immediately after COM. The string may look like this, for example:

0,4 MG COM CT REV BL AL AL X 90 ---> or with any other words in between.

The point is that REV must not appear anywhere in the string.

EDITED 27/07

The syntax \bCOM\b\s(?!.*REV|.*LIB) worked well for cases where REV and LIB come after COM. However, it does not handle the expressions below, because REV and LIB appear before COM:

0,4 MG REV COM CT BL AL X 90

0,7 MG LIB PROL COM CT BL AL X 30

So the syntax needs to be comprehensive enough to identify COM and discard any line with REV or LIB, wherever they appear.

Something like: (?!.*REV|.*LIB)\bCOM\b\s(?!.*REV|.*LIB)

Is that possible?

  • Can you give an example of the result you want to get? Do you want to organize it into an object for example? What language are you using?

  • @Sergio, we'll use Java to build the code. In this case, we need to scan an entire table and classify the records according to a description field, a string in which all the information is mixed together. I am responsible for writing the regex that identifies the records. So, for example, when reading the string field, if we find COM we know it's a tablet, if we find COM REV we know it's a coated tablet, and so on.

3 answers

2

If you need to search for an exact word, use the word-boundary anchor \b, and use a negative lookahead (?!...) to reject the group.

Regex for the example in the question:

\bCOM\b\s(?!REV|LIB)

The match is four characters: COM followed by a space (shown here as COM_).
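Since the asker mentions Java, here is a minimal sketch of how this expression might be tried there. The class and method names are hypothetical, and the last record comes from the question's first edit; this is an illustration under those assumptions, not a finished classifier.

```java
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the original answer.
class ComLookaheadDemo {
    // In a Java string literal every regex backslash has to be doubled.
    static final Pattern SIMPLE = Pattern.compile("\\bCOM\\b\\s(?!REV|LIB)");

    // Returns true when COM is found and is not immediately followed by REV or LIB.
    static boolean isSimpleTablet(String description) {
        // find() searches anywhere in the string; matches() would require
        // the pattern to cover the whole string.
        return SIMPLE.matcher(description).find();
    }

    public static void main(String[] args) {
        String[] records = {
            "0,5 MG COM CT BL AL/AL X 30",
            "0,4 MG COM REV CT BL AL X 90",
            "0,7 MG COM LIB PROL CT BL AL X 30",
            "0,4 MG COM CT REV BL AL AL X 90"
        };
        for (String record : records) {
            System.out.println(isSimpleTablet(record) + "  " + record);
        }
    }
}
```

The lookahead only inspects what comes right after the space, so the last record, where REV appears later in the string, is still reported as a match.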

Related:

Meaning of ?: ?= ?! ?<= ?

What good is a boundary in a regular expression?

  • Correct, @rray. The page had not refreshed with your reply, so I ended up posting almost the same answer.

  • @Denisedamaro, does \bCOM\b\s(?!.*REV|.*LIB) solve it?

  • Yes, thank you @rray. What would this expression look like if I also needed to exclude REV or LIB before COM? Something like (?!.*REV|.*LIB)\bCOM\b\s(?!.*REV|.*LIB)

2

You can do the following:

COM\s(?!REV|LIB)

Example of the expression in operation.

This expression will only select a COM that is not followed by REV or LIB.

Explanation (a simple one, because I have no advanced knowledge of regular expressions):

  • ? = on its own, indicates zero or one occurrence of the previous element

  • ! = negation ("not") sign

  • (?!) = negation of (?=). It matches when the given pattern is absent from the current position onward, and it does not include that pattern in the match. For example, the pattern blue(?! truck) will match in "A cheap blue sports car.", whereas blue(?! sports) will not match.

Source: Regular expression

Edit (per the new scenario)

If REV and LIB can appear anywhere in the string, adding wildcards (.*) before and after the negated expression may already solve it. Something like this:

COM\s(?!.*(REV|LIB).*)

Functional online example.
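As a rough check of this wildcard version in Java (the language the asker mentions), the sketch below runs it against records taken from the question. The class and method names are made up for illustration.

```java
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the original answer.
class ComWildcardLookaheadDemo {
    // The lookahead now scans the rest of the string after "COM ".
    static final Pattern SIMPLE = Pattern.compile("COM\\s(?!.*(REV|LIB).*)");

    // Returns true when COM is found with no REV or LIB anywhere after it.
    static boolean isSimpleTablet(String description) {
        return SIMPLE.matcher(description).find();
    }

    public static void main(String[] args) {
        String[] records = {
            "0,5 MG COM CT BL AL/AL X 30",       // no REV or LIB at all
            "0,4 MG COM CT REV BL AL AL X 90",   // REV somewhere after COM
            "0,4 MG REV COM CT BL AL X 90",      // REV before COM
            "0,7 MG LIB PROL COM CT BL AL X 30"  // LIB before COM
        };
        for (String record : records) {
            System.out.println(isSimpleTablet(record) + "  " + record);
        }
    }
}
```

A lookahead only looks forward from the position after COM, so the two records where REV or LIB come before COM still match. That is the case raised in the question's later edit.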

  • 2

    Your answer is almost correct, but you don't need the [ and ]. With them you are creating a character class that matches any one of the characters inside it, see. Change it to (?!REV|LIB)

  • Guys, that's almost it. It works if REV and LIB come immediately after COM. But in the case of COM CT FR REV LIB, it will still match.

  • Correct, you are right @Guilhermelautert, I will update the answer.

  • Thank you Guilherme Lautert and Fernando. It worked in the examples I gave but still failed in one case. I edited the post. Could you look again, please?

2


Taking into account that the records will be separated by \n, and that you want to capture only the ones that do not contain the words REV and LIB. Note that lines containing REVENDEDOR or LIBERADO would therefore still be captured.

The expression could be ^(?!.* (REV|LIB) .*).*$.

Apply it with the modifiers gm.

See working in REGEX101.

Explanation

  • ^ ... $ - the expression must span from the start to the end of the line.
  • (?!) - negative lookahead; if the pattern inside it matches, the line is rejected.
  • .* (REV|LIB) .* - any line that contains REV or LIB.
  • .* - anything.
  • Modifier g - global, finds every match it can.
  • Modifier m - multiline, so each \n is treated as the start of a new line.

Applying in PHP

$content = "
0,5 MG COM CT BL AL/AL X 30
0,4 MG COM REV CT BL AL AL X 90
0,7 MG COM LIB PROL CT BL AL AL X 30 
0,4 MG COM CT REV BL AL AL X 90
";

// preg_match_all collects every matching line; preg_match would stop at the first match.
preg_match_all('~^(?!.* (REV|LIB) .*).*$~m', $content, $matches);

Edit

As pointed out in the comments, I ended up forgetting the COM.

The new expression would look like this: ^(?!.* (REV|LIB) .*).* COM .*$

Explanation

  • (?!.* (REV|LIB) .*) - says what "must not match".
  • .* COM .* - says what "must match".

Note the spaces around COM and around (REV|LIB); they restrict the match to these whole words only.

Since there are two expressions, a "must not match" one and a "must match" one, it does not matter whether REV|LIB appear before or after COM: the line will not be captured.

See working on REGEX101
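For the Java code the asker mentions, the same expression could be applied roughly as in the sketch below. The class and method names are hypothetical. Pattern.MULTILINE stands in for the m modifier and the find() loop for the g modifier. The last two records come from the question's second edit.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the original answer.
class SimpleTabletFilter {
    // MULTILINE makes ^ and $ match at the start and end of every line.
    static final Pattern SIMPLE_LINE =
            Pattern.compile("^(?!.* (REV|LIB) .*).* COM .*$", Pattern.MULTILINE);

    // Returns the lines that contain COM but neither REV nor LIB.
    static List<String> simpleTabletLines(String text) {
        List<String> lines = new ArrayList<>();
        Matcher matcher = SIMPLE_LINE.matcher(text);
        while (matcher.find()) { // repeated find() plays the role of the g modifier
            lines.add(matcher.group());
        }
        return lines;
    }

    public static void main(String[] args) {
        String content = String.join("\n",
            "0,5 MG COM CT BL AL/AL X 30",
            "0,4 MG COM REV CT BL AL AL X 90",
            "0,7 MG COM LIB PROL CT BL AL AL X 30 ",
            "0,4 MG COM CT REV BL AL AL X 90",
            "0,4 MG REV COM CT BL AL X 90",
            "0,7 MG LIB PROL COM CT BL AL X 30");
        // Only the first record is printed.
        for (String line : simpleTabletLines(content)) {
            System.out.println(line);
        }
    }
}
```

Because the words are delimited by literal spaces, a REV or LIB at the very start or very end of a line would not be excluded. Swapping the spaces for \b is one possible adjustment, offered here only as an assumption about the data.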

  • Thank you William, however I need to positively identify COM. I get the idea, though. Thank you so much for your help.

  • Thank you @Guilhermelautert, it worked as I needed.
